## Word2vec model for GOBBYKID

In order to display values representing word similarities, we first need to train some models on our corpora. These models use <i>word embeddings</i>, that is, a mapping of each word to a feature vector, to retrieve similarities between the words of each corpus.</br>
The aim of this section is to develop several <i><u>word2vec</u></i> models, each with different parameters.</br>
At the end of the training session, we will qualitatively decide which configuration performs best at capturing word similarities and save the corresponding models (each configuration is trained twice: once on the female authors corpus and once on the male authors corpus).

-------------

### Setup

#### Install required libraries

To have an overview of the libraries used in this project, you can check the [README file](README.md).

Once you have followed the instructions provided in that file, you can import the libraries, as well as the functions defined in the [normalization functions file](normalization_functions.py).

In [1]:
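from functions.normalization_functions import *
import csv
# Explicit imports for names used later in this notebook
# (the star import above may already expose them)
import gensim
from pandas import DataFrame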

---------

### Reading the corpora

Given the purpose of this project, as defined in the [README file](README.md), we train our models on the two corpora that we aim to analyze.

To create a w2v model, we first need to create the two corpora for which we want to build the model; then we store the paths of all the texts they contain in two separate lists.</br>
These lists are passed as input to the <span style="color:#89FC00"><i>list_builder</i></span> function, contained in the [normalization functions file](normalization_functions.py), which preprocesses the texts and stores all the tokens of each corpus in a list.
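
As a rough illustration of the input shape that Gensim's Word2Vec expects (an iterable of sentences, each one a list of string tokens), the sketch below tokenizes raw text with naive regular expressions. The function name `naive_sentences` and its splitting rules are assumptions made for illustration only; they do not reproduce the preprocessing actually performed by `list_builder`.

```python
import re

def naive_sentences(text):
    """Split raw text into sentences, then each sentence into word tokens.

    Hypothetical helper: a stand-in for real preprocessing, not list_builder.
    """
    # Split on sentence-final punctuation followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for raw in raw_sentences:
        # Keep alphabetic words, allowing one internal apostrophe (e.g. "chief's").
        tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", raw)
        if tokens:
            sentences.append(tokens)
    return sentences

sample = "The chief's daughter smiled. She was brave!"
print(naive_sentences(sample))
```

Keeping one token list per sentence means that context windows never cross sentence boundaries during training.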

In [2]:
f_directory = "assets/Raw corpora/F/"
m_directory = "assets/Raw corpora/M/"
f_corpus = create_corpus(f_directory)
m_corpus = create_corpus(m_directory)

In [3]:
# Female authors corpus
f_authors_texts = list()
for url in f_corpus.fileids():
    if url != '.DS_Store':
        f_authors_texts.append(f_directory+url)

# Male authors corpus
m_authors_texts = list()
for url in m_corpus.fileids():
    if url != '.DS_Store':
        m_authors_texts.append(m_directory+url)
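
The two loops above differ only in the directory they read from, so they could be folded into one helper. The sketch below is a minimal variant that lists the files directly with `os.listdir` instead of the corpus object's `fileids()`; the name `list_text_paths` and the `ignored` parameter are assumptions introduced for illustration.

```python
import os

def list_text_paths(directory, ignored=(".DS_Store",)):
    """Return the sorted paths of all files in `directory`, skipping `ignored` names."""
    return [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        # Skip unwanted system files and anything that is not a regular file.
        if name not in ignored and os.path.isfile(os.path.join(directory, name))
    ]
```

With this helper, the two lists would be built as `list_text_paths(f_directory)` and `list_text_paths(m_directory)`.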
        

Now, as described above, we create the two token lists on which the models will be trained.

In [4]:
f_tokens = list_builder(f_authors_texts)

100%|██████████| 157/157 [01:25<00:00,  1.83it/s]


In [5]:
m_tokens = list_builder(m_authors_texts)

100%|██████████| 109/109 [01:26<00:00,  1.26it/s]


---------

### Models

Once we have the two lists, we need to initialize the two different models thanks to the Gensim library.</br>
Each model will be used for the respective corpus on which it has been trained.

<b><u>Model testing</u></b>

To this end, we will create six groups of models and see which of them performs best.</br>
The parameters differ for each group; what follows is a brief overview of the setup.</br>
A word2vec model can produce word embeddings with two different methodologies:
- <span style="color:#EE6352">CBOW (Continuous Bag Of Words)</span> - this methodology trains the model on a sort of "fake problem": given a context, the model tries to predict the missing target word;
- <span style="color:#EE6352">Skip Grams</span> - this methodology works the opposite way: the input is the target word and the model tries to predict its context.

These are two different ways to train a word2vec model.</br></br>
Alongside these two training methods, we will vary the <i>context-window size</i>, that is, the number of words on each side of the target word that count as its context (e.g. with a size of 3, the 3 words to the left and the 3 words to the right of the target word are considered).</br>
The window sizes were chosen after reading the article [Dependency-Based Word Embeddings](https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf), which states that <i>larger windows tend to capture more topic/domain information, while smaller windows tend to capture more information about the word itself</i>.</br>
Accordingly, we will produce three groups of models for each methodology, with a small, a medium and a large window size.
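
To make the difference between the two methodologies and the role of the window size more concrete, here is a simplified sketch of how training examples could be derived from one tokenized sentence. The function names are hypothetical, and the sketch always uses the full window, whereas Gensim treats `window` as a maximum distance and, by default, may use a smaller effective window for individual target words.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions to the left and right of `index`."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """CBOW: the context words are the input, the target word is the output."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    """Skip-gram: the target word is the input, each context word is an output."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]

sentence = ["the", "old", "house", "stood", "alone"]
print(cbow_examples(sentence, window=1)[2])
print(skipgram_examples(sentence, window=1)[:3])
```

A larger `window` makes each target word share examples with more distant neighbours, which is why larger windows tend to pull in more topical information.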

Finally, we will use <i>vector_size=100</i> (each word is represented by a vector of 100 features); this choice is based on the size of our corpus.

<span style="color:#F0A202"><i>PAY ATTENTION: In order to use the "workers" parameter defined inside each w2v model object, you first need to install <b>Cython</b>. This parameter sets how many CPU cores are used for the model's computations. If Cython is not installed, the model will use a single core by default, so training will take much longer.</i></span>

-------

#### CBOW Models

<b><span style="color:#EE6352">Small window</span></b>

The first operation to perform is the construction of a model.</br>
Several parameters can be specified inside the model; we will work only with:
- <i>window</i>: defines the window size;
- <i>min_count</i>: defines the minimum total frequency a word must have to be kept in the vocabulary (rarer words are ignored);
- <i>workers</i>: defines the number of worker threads used to train the model (my CPU has 8 cores; if your CPU has fewer cores, please lower this value);
- <i>sg</i>: can be 1 or 0 (the default value): with 1 the model is trained with skip-gram, with 0 it is trained with CBOW;
- <i>vector_size</i>: defines the number of dimensions of each word vector.
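
In Gensim, `min_count` is a word-frequency threshold: words that occur fewer than `min_count` times in the whole corpus are left out of the vocabulary. The sketch below illustrates that rule on a toy corpus; `prune_vocabulary` is a hypothetical helper written for this example and is not how Gensim implements the filtering internally.

```python
from collections import Counter

def prune_vocabulary(sentences, min_count):
    """Keep only the words whose total corpus frequency is at least `min_count`."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}

sentences = [["the", "boat", "sailed"], ["the", "boat", "sank"]]
# "sailed" and "sank" occur only once, so min_count=2 drops them.
print(sorted(prune_vocabulary(sentences, min_count=2)))
```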

In [6]:
f_sw_cbow_model = gensim.models.Word2Vec(
    window=3,
    min_count=2,  # Ignore words whose total frequency in the corpus is lower than this value
    workers=8,  # Number of worker threads; lower this value if your CPU has fewer than 8 cores
    sg=0,
    vector_size=100
)

In [7]:
m_sw_cbow_model = gensim.models.Word2Vec(
    window=3,
    min_count=2,
    workers=8, 
    sg=0,
    vector_size=100
)

The next step consists in building the <i>vocabulary</i> of each model, that is, the collection of all the distinct tokens for which it will learn a vector.</br>
To build the vocabularies we will use the <span style="color:#89FC00">build_vocab()</span> method, provided by Gensim.

In [8]:
f_sw_cbow_model.build_vocab(f_tokens, progress_per=1000)

In [9]:
m_sw_cbow_model.build_vocab(m_tokens, progress_per=1000)

-------

<b><span style="color:#EE6352">Medium window</span></b>

In [10]:
f_mw_cbow_model = gensim.models.Word2Vec(
    window=6,
    min_count=2, 
    workers=8,
    sg=0,
    vector_size=100
)

In [11]:
m_mw_cbow_model = gensim.models.Word2Vec(
    window=6,
    min_count=2,
    workers=8, 
    sg=0,
    vector_size=100
)

In [12]:
f_mw_cbow_model.build_vocab(f_tokens, progress_per=1000)

In [13]:
m_mw_cbow_model.build_vocab(m_tokens, progress_per=1000)

--------

<b><span style="color:#EE6352">Large window</span></b>

In [14]:
f_lw_cbow_model = gensim.models.Word2Vec(
    window=9,
    min_count=2, 
    workers=8, 
    sg=0,
    vector_size=100
)

In [15]:
m_lw_cbow_model = gensim.models.Word2Vec(
    window=9,
    min_count=2,
    workers=8,
    sg=0,
    vector_size=100
)

In [16]:
f_lw_cbow_model.build_vocab(f_tokens, progress_per=1000)

In [17]:
m_lw_cbow_model.build_vocab(m_tokens, progress_per=1000)

----------

#### Skip Gram Models

While the CBOW model is generally considered better at capturing syntactic relationships, the Skip Gram model tends to perform better at extracting semantic relationships.

<b><span style="color:#EE6352">Small window</span></b>

In [18]:
f_sw_skip_model = gensim.models.Word2Vec(
    window=3,
    min_count=2,
    workers=8,
    sg=1,
    vector_size=100
)

In [19]:
m_sw_skip_model = gensim.models.Word2Vec(
    window=3,
    min_count=2,
    workers=8,
    sg=1,
    vector_size=100
)

In [20]:
f_sw_skip_model.build_vocab(f_tokens, progress_per=1000)

In [21]:
m_sw_skip_model.build_vocab(m_tokens, progress_per=1000)

-------

<b><span style="color:#EE6352">Medium window</span></b>

In [22]:
f_mw_skip_model = gensim.models.Word2Vec(
    window=6,
    min_count=2,
    workers=8,
    sg=1,
    vector_size=100
)

In [23]:
m_mw_skip_model = gensim.models.Word2Vec(
    window=6,
    min_count=2,
    workers=8, 
    sg=1,
    vector_size=100
)

In [24]:
f_mw_skip_model.build_vocab(f_tokens, progress_per=1000)

In [25]:
m_mw_skip_model.build_vocab(m_tokens, progress_per=1000)

--------

<b><span style="color:#EE6352">Large window</span></b>

In [26]:
f_lw_skip_model = gensim.models.Word2Vec(
    window=9,
    min_count=2, 
    workers=8,
    sg=1,
    vector_size=100
)

In [27]:
m_lw_skip_model = gensim.models.Word2Vec(
    window=9,
    min_count=2,
    workers=8,
    sg=1,
    vector_size=100
)

In [28]:
f_lw_skip_model.build_vocab(f_tokens, progress_per=1000)

In [29]:
m_lw_skip_model.build_vocab(m_tokens, progress_per=1000)
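
The twelve constructor calls above differ only in the corpus, the window size and the `sg` flag, so the whole grid can be described compactly. The sketch below generates the constructor arguments for every model name used in this notebook; the constant names and the `model_configurations` function are assumptions introduced for illustration.

```python
from itertools import product

WINDOW_SIZES = {"sw": 3, "mw": 6, "lw": 9}  # small, medium, large window
ARCHITECTURES = {"cbow": 0, "skip": 1}      # value passed as `sg`
CORPORA = ("f", "m")                        # female and male authors

def model_configurations(min_count=2, workers=8, vector_size=100):
    """Map each model name (e.g. 'f_sw_cbow_model') to its constructor arguments."""
    configurations = {}
    for corpus, (size, window), (arch, sg) in product(
        CORPORA, WINDOW_SIZES.items(), ARCHITECTURES.items()
    ):
        configurations[f"{corpus}_{size}_{arch}_model"] = {
            "window": window,
            "min_count": min_count,
            "workers": workers,
            "sg": sg,
            "vector_size": vector_size,
        }
    return configurations

configurations = model_configurations()
print(len(configurations))
print(configurations["m_lw_skip_model"])
```

Each dictionary could then be unpacked into the constructor, e.g. `gensim.models.Word2Vec(**configurations["m_lw_skip_model"])`, before calling `build_vocab()` on the matching token list.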

--------

### Train the models

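The next step consists of training the groups of models that we have built, using the Gensim <span style="color:#89FC00">.train()</span> method. Each call returns a tuple with the number of effective words trained on and the number of raw words processed; every cell below displays only the result of its last call, i.e. the one for the male authors model.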

<b><span style="color:#EE6352">Small window</span></b>

In [30]:
f_sw_cbow_model.train(f_tokens, total_examples=f_sw_cbow_model.corpus_count, epochs=f_sw_cbow_model.epochs)
m_sw_cbow_model.train(m_tokens, total_examples=m_sw_cbow_model.corpus_count, epochs=m_sw_cbow_model.epochs)

(24429604, 28045975)

In [31]:
f_sw_skip_model.train(f_tokens, total_examples=f_sw_skip_model.corpus_count, epochs=f_sw_skip_model.epochs)
m_sw_skip_model.train(m_tokens, total_examples=m_sw_skip_model.corpus_count, epochs=m_sw_skip_model.epochs)

(24431481, 28045975)

--------------

<b><span style="color:#EE6352">Medium window</span></b>

In [32]:
f_mw_cbow_model.train(f_tokens, total_examples=f_mw_cbow_model.corpus_count, epochs=f_mw_cbow_model.epochs)
m_mw_cbow_model.train(m_tokens, total_examples=m_mw_cbow_model.corpus_count, epochs=m_mw_cbow_model.epochs)

(24429861, 28045975)

In [33]:
f_mw_skip_model.train(f_tokens, total_examples=f_mw_skip_model.corpus_count, epochs=f_mw_skip_model.epochs)
m_mw_skip_model.train(m_tokens, total_examples=m_mw_skip_model.corpus_count, epochs=m_mw_skip_model.epochs)

(24427717, 28045975)

--------

<b><span style="color:#EE6352">Large window</span></b>

In [34]:
f_lw_cbow_model.train(f_tokens, total_examples=f_lw_cbow_model.corpus_count, epochs=f_lw_cbow_model.epochs)
m_lw_cbow_model.train(m_tokens, total_examples=m_lw_cbow_model.corpus_count, epochs=m_lw_cbow_model.epochs)

(24430289, 28045975)

In [35]:
f_lw_skip_model.train(f_tokens, total_examples=f_lw_skip_model.corpus_count, epochs=f_lw_skip_model.epochs)
m_lw_skip_model.train(m_tokens, total_examples=m_lw_skip_model.corpus_count, epochs=m_lw_skip_model.epochs)

(24431336, 28045975)
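
Since every training call above follows the same pattern, the repetition could be wrapped in a small loop. The sketch below assumes a dictionary of models that share one corpus (for instance, all the female authors models); `train_models` is a hypothetical helper that relies only on the `train()` method and on the `corpus_count` and `epochs` attributes already used in the cells above.

```python
def train_models(models, tokens):
    """Train every model in `models` on the same tokenized corpus.

    `models` maps a descriptive name to an object exposing `train()`,
    `corpus_count` and `epochs` (such as a Word2Vec model whose vocabulary
    has already been built). Returns the value of each `train()` call.
    """
    results = {}
    for name, model in models.items():
        results[name] = model.train(
            tokens,
            total_examples=model.corpus_count,
            epochs=model.epochs,
        )
    return results
```

A call such as `train_models({"sw_cbow": f_sw_cbow_model, "sw_skip": f_sw_skip_model}, f_tokens)` would also keep the result for every model instead of displaying only the last one.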

--------

### Comparison between models

<b><span style="color:#EE6352">Small window</span></b>

In [36]:
DataFrame(f_sw_cbow_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,Quaker,0.844315
1,classic,0.842884
2,tribes,0.832781
3,Danish,0.824343
4,BOAT,0.819641
5,clique,0.818624
6,Canadian,0.816332
7,cactus,0.815886
8,Brave,0.814691
9,Flag,0.814354


In [37]:
DataFrame(m_sw_cbow_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,Inca's,0.721102
1,maimed,0.712571
2,Sirayo,0.712122
3,Highland,0.70865
4,peasant,0.698647
5,stalwart,0.698023
6,ruthless,0.689938
7,chief's,0.689508
8,male,0.685276
9,Bushman,0.681512


In [38]:
DataFrame(f_sw_skip_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,unexceptionable,0.825607
1,memoirs,0.81912
2,Templar,0.816198
3,thrift,0.814672
4,Glamorganshire,0.81461
5,estimable,0.813471
6,irreproachable,0.811247
7,gentlefolk,0.808343
8,barrister,0.806344
9,tribes,0.805601


In [39]:
DataFrame(m_sw_skip_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,Circassian,0.779847
1,male,0.774135
2,peasant,0.736671
3,farmer's,0.733759
4,buxom,0.723407
5,gypsy,0.719501
6,disreputable,0.718396
7,youthful,0.718127
8,comely,0.715622
9,host's,0.711693


--------------

<b><span style="color:#EE6352">Medium window</span></b>

In [40]:
DataFrame(f_mw_cbow_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,apparel,0.829114
1,handsomest,0.820353
2,pernicious,0.820321
3,development,0.819603
4,irreproachable,0.815382
5,culture,0.809343
6,typical,0.808305
7,male,0.799931
8,memoirs,0.796302
9,duchy,0.794159


In [41]:
DataFrame(m_mw_cbow_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,stature,0.697793
1,warrior,0.696483
2,Samoan,0.687135
3,male,0.686008
4,youthful,0.676058
5,stalwart,0.674166
6,populace,0.672295
7,Circassian,0.66593
8,negro,0.663751
9,haughty,0.66341


In [42]:
DataFrame(f_mw_skip_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,irreproachable,0.81202
1,thrift,0.807616
2,earldom,0.805731
3,eminent,0.79655
4,thrifty,0.795568
5,exclusiveness,0.790358
6,male,0.78465
7,designing,0.784201
8,adhere,0.783271
9,bestows,0.78266


In [43]:
DataFrame(m_mw_skip_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,Circassian,0.761525
1,male,0.741537
2,tenanted,0.736093
3,peasant,0.731912
4,comely,0.723324
5,buxom,0.722297
6,females,0.720777
7,Zara,0.710981
8,Hanyfa,0.70871
9,tutors,0.706595


--------

<b><span style="color:#EE6352">Large window</span></b>

In [44]:
DataFrame(f_lw_cbow_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,typical,0.790219
1,male,0.78758
2,irreproachable,0.769901
3,showy,0.76674
4,pretensions,0.764564
5,average,0.751816
6,graces,0.74073
7,Spanish,0.740185
8,Waylands,0.738091
9,Perier,0.731368


In [45]:
DataFrame(m_lw_cbow_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,elegantly,0.68856
1,male,0.687236
2,refined,0.67752
3,singularly,0.666218
4,stature,0.663521
5,youthful,0.663403
6,exhibiting,0.662219
7,personage,0.660776
8,Oriental,0.660769
9,bridegroom,0.660482


In [46]:
DataFrame(f_lw_skip_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,irreproachable,0.77967
1,unexceptionable,0.771792
2,thrifty,0.767325
3,radical,0.760079
4,dependents,0.756742
5,hoydenish,0.754764
6,efficient,0.754558
7,gentility,0.753061
8,attributes,0.751853
9,bias,0.7513


In [47]:
DataFrame(m_lw_skip_model.wv.most_similar("female")).rename(columns={0:"Word", 1:"Vector similarity"})

Unnamed: 0,Word,Vector similarity
0,Circassian,0.753265
1,male,0.735146
2,peasant,0.727973
3,host's,0.722724
4,Villiers,0.718827
5,negress,0.716875
6,harem,0.71501
7,Zara,0.713954
8,damsel,0.71156
9,tutors,0.705417
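
The "Vector similarity" values in the tables above are cosine similarities between word vectors, which is the measure `most_similar` ranks by. The sketch below recomputes such a ranking on toy two-dimensional vectors; the vectors are invented for illustration, the helper names are hypothetical, and zero-length vectors are not handled.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(target, vectors, topn=10):
    """Rank every other word in `vectors` by cosine similarity to `target`."""
    scores = [
        (word, cosine_similarity(vectors[target], vector))
        for word, vector in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors, not taken from the trained models.
toy_vectors = {
    "female": [1.0, 0.0],
    "male": [0.8, 0.6],
    "boat": [0.0, 1.0],
}
print(most_similar("female", toy_vectors))
```

A value close to 1 means that two words point in nearly the same direction in the embedding space, while a value close to 0 means that they are nearly orthogonal.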


--------

### Conclusions

Finally, after looking at all these models, we concluded that for our kind of analysis it is better to use a <span style="color:#F0A202"><i>skip-gram based model</i></span>, in order to highlight the semantic similarities between words.

Moreover, as for the window size on which to train the model, we prefer a <span style="color:#F0A202"><i>large window size</i></span>, in order to avoid information too closely tied to the target word, which would probably not be very useful for our aims.

To keep these models, we save them in the [Models directory](assets/Models/):

In [48]:
f_lw_skip_model.save("assets/Models/F/f_embeddings_pre.model")
m_lw_skip_model.save("assets/Models/M/m_embeddings_pre.model")

If you want to re-use a model, you just need to load it with the <span style="color:#89FC00">load()</span> method.</br>
```
    model = gensim.models.Word2Vec.load("model_path.model")
```

### Final storage

In order to visualize the embedding space outside this notebook, we save the vectors produced by the two models (one per word in each vocabulary) as TSV files, and the related words as metadata files (again, in TSV format).

In [49]:
with open('assets/Visualizations datasets/Word embedding space/Models/f_embeddings.tsv', 'w') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    words = f_lw_skip_model.wv.key_to_index
    for word in words:
        vector = f_lw_skip_model.wv.get_vector(word).tolist()
        row = [word] + vector
        writer.writerow(row)
        
with open('assets/Visualizations datasets/Word embedding space/Metadata/f_metadata.tsv', 'w') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\n')
    words = f_lw_skip_model.wv.key_to_index
    for word in words:
        row = [word]
        writer.writerow(row)

In [50]:
with open('assets/Visualizations datasets/Word embedding space/Models/m_embeddings.tsv', 'w') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    words = m_lw_skip_model.wv.key_to_index
    for word in words:
        vector = m_lw_skip_model.wv.get_vector(word).tolist()
        row = [word] + vector
        writer.writerow(row)
        
with open('assets/Visualizations datasets/Word embedding space/Metadata/m_metadata.tsv', 'w') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\n')
    words = m_lw_skip_model.wv.key_to_index
    for word in words:
        row = [word]
        writer.writerow(row)
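
To check that the exported files can be read back, the sketch below loads an embeddings TSV into a dictionary. It relies on the row layout written above (the word first, then one value per vector dimension); the function name `read_embeddings` and the UTF-8 encoding are assumptions made for this example.

```python
import csv

def read_embeddings(path):
    """Load a TSV whose rows are: word, then one float per vector dimension."""
    embeddings = {}
    with open(path, newline="", encoding="utf-8") as tsvfile:
        for row in csv.reader(tsvfile, delimiter="\t"):
            if row:  # skip blank lines
                embeddings[row[0]] = [float(value) for value in row[1:]]
    return embeddings
```

Under these assumptions, the length of the dictionary returned for `f_embeddings.tsv` should match the vocabulary size of the female authors model, and every vector should contain 100 values.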

For more information about the visualization that we decided to use for word embeddings, take a look at the README file.

In [51]:
words = f_lw_skip_model.wv.key_to_index
w = []
for k in words:
    w.append(k)
print(len(w))
vector = f_lw_skip_model.wv.get_vector('the').tolist()
len(vector)

44148


100

In [52]:
words = m_lw_skip_model.wv.key_to_index
w = []
for k in words:
    w.append(k)
print(len(w))
vector = m_lw_skip_model.wv.get_vector('the').tolist()
len(vector)

50333


100

-----