# Script 1. Corpus Creation and Word2Vec

This script takes the original Diorisis XML files, builds the corpus for the VSM, trains a Word2Vec model on it, and reduces the resulting word vectors to two dimensions using t-SNE. It outputs a CSV file containing a list of Greek words from the corpus, with corresponding x and y coordinates.

This script only needs to be run once, unless you change which XML files are used for the corpus or the parameters of the Word2Vec model.

First, load the dependencies. If any of them are not installed on your computer, install them with pip.

In [None]:
# !pip install cltk  # uncomment to install; the imports below use the pre-1.0 CLTK module layout

from cltk.corpus.greek.beta_to_unicode import Replacer
from cltk.corpus.utils.formatter import tonos_oxia_converter
from cltk.stop.greek.stops import STOPS_LIST
from glob import glob
from xml.etree.ElementTree import parse

import re
import os
import os.path
import pandas as pd

## Step 1: Construct the Corpus
This code takes the Koine Greek texts sourced from the Diorisis Corpus, collects the lemma entry of each word, and appends the lemmas to a list, one list per sentence. The relevant XML files can be found at https://figshare.com/articles/dataset/The_Diorisis_Ancient_Greek_Corpus/6187256. The texts used in this project are specified in Appendix A of the written thesis.

### Stopwords

This csv file lists all the stopwords we wish to exclude from the corpus. The file can be downloaded from the Word2Vec_koine_greek GitHub repository.

In [None]:
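new_stops = os.path.join("Desktop/Word2Vec_koine_greek-master", "new_stops.csv")

# Read the stopword table; pandas opens and closes the file itself.
X = pd.read_csv(new_stops, delimiter=",")

# Keep only the 'Add Stops' column and turn it into a plain Python list.
df = pd.DataFrame(X, columns=['Add Stops'])
new_list = df['Add Stops'].values.tolist()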

In [None]:
## for testing purposes ##
print(new_list)
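
For readers without the repository at hand, the following is a minimal sketch of the same step using only the standard library. It assumes, as the cell above does, that the CSV has a column named `Add Stops`; the sample rows and the `load_stops` helper are made up for illustration. It returns a set rather than a list, since membership tests against a set are faster when filtering many lemmas.

```python
import csv
import io

def load_stops(handle, column="Add Stops"):
    """Collect the non-empty values of one CSV column into a set."""
    reader = csv.DictReader(handle)
    # Missing cells come back as None, so normalise them to "" before stripping.
    values = ((row.get(column) or "").strip() for row in reader)
    return {value for value in values if value}

# Hypothetical file contents, for illustration only.
sample_csv = io.StringIO("Add Stops\nκαί\nδέ\n\nὁ\n")
stops = load_stops(sample_csv)
print(sorted(stops))
```

With a real file, the handle would come from `open(path, encoding="utf-8", newline="")` instead of `io.StringIO`.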

### XML Parser

In [None]:
# Parse each XML file and collect one list of lemmas per sentence.

xml_files = glob('Desktop/greek_corpus/*.xml')
replacer = Replacer()
corpus = []
for xml_path in xml_files:
    # Passing the path lets the parser honour the encoding declared in the file.
    tree = parse(xml_path)
    root = tree.getroot()
    for sentence in root.iter('sentence'):
        lemmas = []
        for word in sentence.iter('word'):
            for lemma in word.iter('lemma'):
                entry = lemma.get('entry')
                if entry is None:
                    # No lemma recorded: fall back to the word form, converted from Beta Code.
                    entry = replacer.beta_code(word.get('form'))
                    lemmas.append(entry)
                elif tonos_oxia_converter(entry) not in new_list:
                    # Keep the lemma unless it is on the stopword list.
                    lemmas.append(entry)
        if len(lemmas) > 0:
            corpus.append(lemmas)


## print(corpus) ## Testing purposes ##
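
To see what the loop produces, here is a small self-contained sketch run on an in-memory document. The element and attribute names (`sentence`, `word`, `form`, `lemma`, `entry`) follow the cell above; the sample XML, the `extract_sentences` helper and the stopword set are invented for illustration, and the Beta Code conversion is replaced by a pass-through function so the example runs without CLTK.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical miniature document; real Diorisis files carry more markup.
SAMPLE = """
<text>
  <sentence>
    <word form="lo/gos"><lemma entry="λόγος"/></word>
    <word form="kai/"><lemma entry="καί"/></word>
    <word form="qeo/s"><lemma/></word>
  </sentence>
  <sentence>
    <word form="kai/"><lemma entry="καί"/></word>
  </sentence>
</text>
"""

def extract_sentences(root, stops, convert=lambda form: form):
    """Return one list of lemmas per non-empty sentence."""
    corpus = []
    for sentence in root.iter("sentence"):
        lemmas = []
        for word in sentence.iter("word"):
            for lemma in word.iter("lemma"):
                entry = lemma.get("entry")
                if entry is None:
                    # No lemma recorded: fall back to the (converted) word form.
                    lemmas.append(convert(word.get("form")))
                elif entry not in stops:
                    lemmas.append(entry)
        if lemmas:
            corpus.append(lemmas)
    return corpus

toy_corpus = extract_sentences(fromstring(SAMPLE), stops={"καί"})
print(toy_corpus)
```

The second sentence contains only a stopword, so it is dropped entirely. Note that, as in the notebook cell, the fallback branch does not check the stopword list.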

## Step 2: Run the Word2Vec Model
The following cells take the preprocessed corpus and train a Word2Vec model on it.

In [None]:
# dependencies
from __future__ import absolute_import, division, print_function
import codecs
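# Do not `import glob` here: it would rebind the name and shadow the glob() function imported in the first cell.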
import multiprocessing
import gensim.models.word2vec as w2v
import sklearn.manifold

### Configure the model's parameters

In [None]:
# This cell defines the hyperparameters.
# Dimensionality of the resulting word vectors.
# More dimensions are more computationally expensive to train, but can give more accurate vectors.
num_features = 500

# Minimum word count threshold.
min_word_count = 10

# Number of threads to run in parallel.
num_workers = multiprocessing.cpu_count()

# Context window length. Note that Munson (2017: 17) says context_size is optimized at 12 for Greek.
context_size = 2

# Downsampling threshold for frequent words.
# Higher-frequency words are randomly downsampled; gensim documents the useful range as (0, 1e-5).
downsampling = 1e-3

# Seed for the random number generator (RNG), to make the results reproducible.
# Note: gensim documents that fully deterministic runs also require a single worker thread.
seed = 1
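
To make `context_size` concrete, the sketch below lists the (target, context) pairs that a skip-gram model would draw from one sentence with a fixed window of 2. This is a simplification for illustration: the `skipgram_pairs` helper and the sample lemmas are not part of the notebook, and actual implementations may vary the effective window for each target word.

```python
def skipgram_pairs(sentence, window):
    """Yield (target, context) pairs within `window` positions of each target."""
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield target, sentence[j]

# Hypothetical lemmatised sentence, for illustration only.
sentence = ["ἀρχή", "εἰμί", "λόγος", "θεός", "πρός"]
pairs = list(skipgram_pairs(sentence, window=2))
print(len(pairs))
for target, context in pairs[:4]:
    print(target, context)
```

With a window of 2, the first and fourth lemmas never form a pair; a wider window, such as the 12 mentioned in the comment above, would pair every word in this short sentence with every other.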

In [None]:
greek2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,  # gensim 3.x name; gensim 4.0+ renames this argument to vector_size
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)
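
The `sample` argument controls how aggressively very frequent words are thinned out during training. As a rough guide, the original word2vec paper (Mikolov et al. 2013) discards each occurrence of a word with probability `1 - sqrt(t / f)`, where `f` is the word's relative frequency in the corpus and `t` is the threshold. The sketch below evaluates that formula; library implementations, gensim included, may use a slightly different variant, and the `discard_probability` helper and the example frequencies are illustrative only.

```python
import math

def discard_probability(freq, threshold=1e-3):
    """Probability of dropping one occurrence of a word with relative frequency `freq`."""
    if freq <= threshold:
        return 0.0  # words at or below the threshold are always kept
    return 1.0 - math.sqrt(threshold / freq)

# Hypothetical relative frequencies, from very common to rare.
for freq in (0.05, 0.01, 0.001, 0.0001):
    print(freq, round(discard_probability(freq), 3))
```

Under this formula, lowering the threshold towards 1e-5 removes far more occurrences of common words than the 1e-3 used here.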

In [None]:
greek2vec.build_vocab(corpus)
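
`build_vocab` scans the corpus once and keeps only the words that occur at least `min_count` times. The effect can be approximated with a plain `Counter`, as in the sketch below; the sample sentences and the `filter_vocab` helper are invented for illustration and are not gensim's implementation.

```python
from collections import Counter

def filter_vocab(corpus, min_count):
    """Return the words that occur at least `min_count` times across all sentences."""
    counts = Counter(word for sentence in corpus for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Hypothetical lemmatised sentences, for illustration only.
sample_sentences = [["λόγος", "θεός"], ["λόgος".replace("g", "γ"), "ἀρχή"], ["λόγος", "θεός"]]
vocab = filter_vocab(sample_sentences, min_count=2)
print(vocab)
```

Running the same count on the real `corpus` with `min_count=10` gives a quick estimate of how many lemmas will survive into the model.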

In [None]:
token_count = sum([len(sentence) for sentence in corpus])
print('The corpus contains {0:,} tokens'.format(token_count))

In [None]:
%%time
# Train the model on the sentences; this may take a while to process.
greek2vec.train(corpus, total_examples=len(corpus), epochs=100)

### Save and Load Model

In [None]:
# Create the output directory if it does not exist yet
if not os.path.exists("Desktop/Word2Vec_koine_greek-master"):
    os.makedirs("Desktop/Word2Vec_koine_greek-master")

In [None]:
greek2vec.save(os.path.join("Desktop/Word2Vec_koine_greek-master", "greek2vec.w2v"))

In [None]:
#load model
greek2vec = w2v.Word2Vec.load(os.path.join("Desktop/Word2Vec_koine_greek-master", "greek2vec.w2v"))
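
The cells above repeat the same directory string several times. One possible tidy-up, sketched here with the standard library only, is to build the path once and let `os.makedirs` handle the case where the folder already exists. The `model_path` helper is hypothetical, and a temporary directory stands in for the project folder so the example has no side effects.

```python
import os
import tempfile

def model_path(base_dir, filename="greek2vec.w2v"):
    """Create `base_dir` if needed and return the full path for the model file."""
    os.makedirs(base_dir, exist_ok=True)  # no error if the folder already exists
    return os.path.join(base_dir, filename)

with tempfile.TemporaryDirectory() as tmp:
    target = model_path(os.path.join(tmp, "Word2Vec_koine_greek-master"))
    print(os.path.basename(target))
    print(os.path.isdir(os.path.dirname(target)))
```

The returned path could then be passed to both `greek2vec.save(...)` and `w2v.Word2Vec.load(...)`.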

## Step 3: Perform Dimensionality Reduction with t-SNE

In [None]:
# Reduce the word vectors to 2 dimensions with t-SNE
# https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm
# To set the perplexity explicitly, add e.g. perplexity=20 to the call below.
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)

In [None]:
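# Collect all word vectors into a single matrix (one row per word).
# Note: syn0 is the older attribute name; gensim 4.0+ exposes the same matrix as wv.vectors.
all_word_vectors_matrix = greek2vec.wv.syn0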

In [None]:
%%time
# Fit t-SNE and project the word vectors to 2D
all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)