## Word2vec model for GOBBYKID

To display values representing word similarities, we first need to train some models on our corpora. These models use <i>word embeddings</i> to retrieve similarities between the words of each corpus: every word is mapped to a feature vector, and words that occur in similar contexts end up with similar vectors.</br>
The aim of this section is to develop several <i><u>word2vec</u></i> models, each with different parameters.</br>
At the end of the training session, we will qualitatively decide which one performs best at capturing word similarities and save that model (each configuration yields two models: one for the female authors corpus and one for the male authors corpus).
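
As a minimal illustration of how similarity between feature vectors can be measured, the sketch below computes the cosine similarity of toy vectors. The three-dimensional vectors and the words attached to them are invented for the example and do not come from the trained models. Gensim's `most_similar` ranks words by this same measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings, invented for illustration only
toy_vectors = {
    "cat": [0.9, 0.1, 0.3],
    "dog": [0.8, 0.2, 0.4],
    "locket": [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["locket"]))
```

In this toy setup the first value is higher than the second, which is what we would hope to see for two animal words compared with an unrelated one.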

---------

### Setup

#### Install required libraries

To have an overview of the libraries used in this project, you can check the [README file](README.md).

Once you have followed the instructions in that file, you can import the libraries, as well as the functions defined in the [normalization functions file](normalization_functions.py).

In [1]:
import gensim  # explicit import: the cells below call gensim.models.Word2Vec
from normalization_functions import *

---------

### Reading the corpora

Given the purpose of this project, as defined in the [README file](README.md), we train our models on the two corpora that we aim to analyze.

To create a w2v model, we first create the two corpora for which we want to build the models, then store the paths (urls) of all the texts they contain in two separate lists.</br>
These lists are passed as input to the <span style="color:#89FC00"><i>list_builder</i></span> function, contained in the [normalization functions file](normalization_functions.py), which applies the required preprocessing to the texts and stores all the tokens of a corpus in a list.
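
Gensim's Word2Vec expects its training input as an iterable of sentences, each one a list of string tokens. The sketch below shows that shape with a deliberately simple tokenizer. `toy_list_builder` and its regular expressions are hypothetical stand-ins that take raw strings rather than file paths; the real `list_builder` in the normalization functions file may preprocess the texts differently.

```python
import re

def toy_list_builder(texts):
    """Hypothetical stand-in for list_builder: split raw texts into
    lowercase sentences, each represented as a list of word tokens."""
    sentences = []
    for text in texts:
        # Naive sentence split on ., ! and ?
        for sentence in re.split(r"[.!?]+", text):
            tokens = re.findall(r"[a-z']+", sentence.lower())
            if tokens:
                sentences.append(tokens)
    return sentences

sample = ["The cat sat. The dog barked!"]
print(toy_list_builder(sample))
# [['the', 'cat', 'sat'], ['the', 'dog', 'barked']]
```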

In [2]:
f_directory = "Raw/F/"
m_directory = "Raw/M/"
f_corpus = create_corpus(f_directory)
m_corpus = create_corpus(m_directory)

In [3]:
# Female authors corpus
f_authors_texts = list()
for url in f_corpus.fileids():
    if url != '.DS_Store':
        f_authors_texts.append(f_directory+url)

# Male authors corpus
m_authors_texts = list()
for url in m_corpus.fileids():
    if url != '.DS_Store':
        m_authors_texts.append(m_directory+url)

These are the urls extracted:

In [4]:
print("URLS of female authors texts:", f_authors_texts, '\n')
print("URLS of male authors texts:", m_authors_texts, '\n')

URLS of female authors texts: ['Raw/F/1839_sinclair-holiday-house-a-series-of-tales.txt', 'Raw/F/1841_martineau-the-settlers-at-home.txt', 'Raw/F/1857_browne-grannys-wonderful-chair.txt', 'Raw/F/1857_tucker-the-rambles-of-a-rat.txt', 'Raw/F/1862_ewing-melchiors-dream-and-other-tales.txt', 'Raw/F/1869_ewing-mrs-overtheways-remembrances.txt', 'Raw/F/1869_ewing-the-land-of-lost-toys.txt', 'Raw/F/1870_ewing-the-brownies-and-other-tales.txt', 'Raw/F/1872_craik-the-adventure-of-a-brownie.txt', 'Raw/F/1872_de-la-ramee-a-dog-of-flanders.txt', 'Raw/F/1873_ewing-a-flat-iron-for-a-farthing.txt', 'Raw/F/1875_craik-the-little-lame-prince-and-his-traveling-cloack.txt', 'Raw/F/1876_ewing-jan-of-the-windmill.txt', 'Raw/F/1876_ewing-six-to-sixteen-a-story-for-girls.txt', 'Raw/F/1877_ewing-a-great-emergency-and-other-tales.txt', 'Raw/F/1877_molesworth-the-cuckoo-clock.txt', 'Raw/F/1877_sewell-black-beauty.txt', 'Raw/F/1879_ewing-jackanapes-daddy-darwins-dovecot-and-other-stories.txt', 'Raw/F/1882_ewing-

Now, as described above, we build the two token lists:

In [53]:
f_tokens = list_builder(f_authors_texts)

In [7]:
m_tokens = list_builder(m_authors_texts)

---------

### Models

Once we have the two token lists, we initialize the models with the Gensim library.</br>
Each model is used only for the corpus on which it has been trained.

To this end, we will create six groups of models and see which of them performs best.</br>
The parameters differ for each group; here is a brief overview of what we want to do.</br>
Word2vec offers two different methodologies to produce word embeddings:
- <span style="color:#EE6352">CBOW (Continuous Bag Of Words)</span> - this methodology is trained on a sort of "fake problem": given a context, the model tries to predict the missing target word;
- <span style="color:#EE6352">Skip-gram</span> - this methodology operates in the opposite way: the input is the target word and the model tries to predict its context.

These are two different ways to train a word2vec model.</br></br>
Alongside these two training methods, we will vary the <i>context-window size</i>, that is, the number of words on each side of the target word that count as its context (e.g. with a size of 3, the 3 words to the left and the 3 words to the right of a target word are considered).</br>
The window sizes were chosen after reading the article [Dependency-Based Word Embeddings](https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf), which states that <i>larger windows tend to capture more topic/domain information, while smaller windows tend to capture more information about the word itself</i>.</br>
To this end, we will produce three groups of models for each methodology, with a small, a medium and a large window size.
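
To make the two training setups and the window size concrete, the following sketch lists the (context, target) examples a CBOW-style model would see and the (target, context word) pairs a skip-gram-style model would see for one toy sentence. The function names and the sentence are made up for illustration, and the sketch uses a fixed symmetric window, which simplifies what Gensim does internally.

```python
def context_of(tokens, index, window):
    """Words within `window` positions to the left and right of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """CBOW view: the whole context is the input, the target word is the output."""
    return [(context_of(tokens, i, window), target) for i, target in enumerate(tokens)]

def skipgram_pairs(tokens, window):
    """Skip-gram view: the target word is the input, each context word an output."""
    return [(target, ctx) for i, target in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]

sentence = ["the", "cat", "chased", "the", "mouse"]
print(cbow_examples(sentence, window=1)[2])    # (['cat', 'the'], 'chased')
print(skipgram_pairs(sentence, window=1)[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'chased')]
```

Raising `window` enlarges every context list, which is the knob that the small, medium and large groups turn.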

<span style="color:#F0A202"><i>PAY ATTENTION: to use the "workers" parameter of each w2v model object, you first need to install <b>Cython</b>. This parameter sets how many worker threads (and therefore CPU cores) are used to train the model. If Cython is not installed, the model uses a single core by default, so training takes much longer.</i></span>
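
If you are unsure how many cores your machine has, the standard library can report it. This is a small sketch, not part of the original notebook; `n_workers` is a hypothetical variable name whose value could be passed as the `workers` argument.

```python
import os

# os.cpu_count() may return None when the count cannot be determined,
# so fall back to a single worker in that case
n_workers = os.cpu_count() or 1
print(n_workers)
```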

-------

#### CBOW Models

<b><span style="color:#EE6352">Small window</span></b>

The first operation to perform is the construction of a model. </br>
Several parameters can be specified when constructing a model. We will work only with:
- <i>window</i>: defines the window size;
- <i>min_count</i>: defines the minimum frequency a word must have in the corpus to be kept in the vocabulary (less frequent words are ignored);
- <i>workers</i>: defines how many worker threads (CPU cores) are used to train the model (the value 8 used here matches an 8-core CPU; if your CPU has fewer cores, please lower it);
- <i>sg</i>: can be 1 or 0 (the default value): with 1 the model is trained with skip-gram, with 0 it is trained with CBOW.
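
Since the models in the cells that follow differ only in `window` (3, 6 or 9) and `sg` (0 or 1), the six configurations can be written down compactly. The following sketch only builds the keyword arguments; the names `windows`, `methods` and `configs` are assumptions for illustration, and each dictionary could then be unpacked into `gensim.models.Word2Vec(**kwargs)`.

```python
from itertools import product

windows = {"sw": 3, "mw": 6, "lw": 9}   # small, medium, large window
methods = {"cbow": 0, "skip": 1}        # value of the sg parameter

# One keyword-argument dictionary per (window, method) combination
configs = {
    f"{w_name}_{m_name}": {"window": window, "min_count": 2, "workers": 8, "sg": sg}
    for (w_name, window), (m_name, sg) in product(windows.items(), methods.items())
}

for name, kwargs in configs.items():
    print(name, kwargs)
```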

In [8]:
f_sw_cbow_model = gensim.models.Word2Vec(
    window=3,
    min_count=2, # Ignore words that occur fewer than "x" times in the corpus, where x is the integer value specified
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=0
)

In [9]:
m_sw_cbow_model = gensim.models.Word2Vec(
    window=3,
    min_count=2,
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=0
)

The next step is to build the <i>vocabulary</i> of each model, that is, the collection of all the distinct tokens the model will learn embeddings for.</br>
To build the vocabularies we use the <span style="color:#89FC00">build_vocab()</span> method, provided by Gensim.
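
Conceptually, building a vocabulary amounts to counting every token across the corpus and keeping those that reach the `min_count` threshold. The sketch below mimics that idea with the standard library; `toy_vocab` and the sample sentences are hypothetical, and Gensim's `build_vocab()` does more than this (for example, it also prepares the internal structures used for training).

```python
from collections import Counter

def toy_vocab(sentences, min_count):
    """Count tokens over all sentences and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "purred"]]
print(toy_vocab(sentences, min_count=2))  # {'the': 3, 'cat': 2}
```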

In [10]:
f_sw_cbow_model.build_vocab(f_tokens, progress_per=1000)

In [11]:
m_sw_cbow_model.build_vocab(m_tokens, progress_per=1000)

-------

<b><span style="color:#EE6352">Medium window</span></b>

In [12]:
f_mw_cbow_model = gensim.models.Word2Vec(
    window=6,
    min_count=2, # Ignore words that occur fewer than "x" times in the corpus, where x is the integer value specified
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=0
)

In [13]:
m_mw_cbow_model = gensim.models.Word2Vec(
    window=6,
    min_count=2,
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=0
)

In [14]:
f_mw_cbow_model.build_vocab(f_tokens, progress_per=1000)

In [15]:
m_mw_cbow_model.build_vocab(m_tokens, progress_per=1000)

--------

<b><span style="color:#EE6352">Large window</span></b>

In [16]:
f_lw_cbow_model = gensim.models.Word2Vec(
    window=9,
    min_count=2, # Ignore words that occur fewer than "x" times in the corpus, where x is the integer value specified
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=0
)

In [17]:
m_lw_cbow_model = gensim.models.Word2Vec(
    window=9,
    min_count=2,
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=0
)

In [18]:
f_lw_cbow_model.build_vocab(f_tokens, progress_per=1000)

In [19]:
m_lw_cbow_model.build_vocab(m_tokens, progress_per=1000)

----------

#### Skip Gram Models

<b><span style="color:#EE6352">Small window</span></b>

In [20]:
f_sw_skip_model = gensim.models.Word2Vec(
    window=3,
    min_count=2, # Ignore words that occur fewer than "x" times in the corpus, where x is the integer value specified
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=1
)

In [21]:
m_sw_skip_model = gensim.models.Word2Vec(
    window=3,
    min_count=2,
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=1
)

In [22]:
f_sw_skip_model.build_vocab(f_tokens, progress_per=1000)

In [23]:
m_sw_skip_model.build_vocab(m_tokens, progress_per=1000)

-------

<b><span style="color:#EE6352">Medium window</span></b>

In [24]:
f_mw_skip_model = gensim.models.Word2Vec(
    window=6,
    min_count=2, # Ignore words that occur fewer than "x" times in the corpus, where x is the integer value specified
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=1
)

In [25]:
m_mw_skip_model = gensim.models.Word2Vec(
    window=6,
    min_count=2,
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=1
)

In [26]:
f_mw_skip_model.build_vocab(f_tokens, progress_per=1000)

In [27]:
m_mw_skip_model.build_vocab(m_tokens, progress_per=1000)

--------

<b><span style="color:#EE6352">Large window</span></b>

In [28]:
f_lw_skip_model = gensim.models.Word2Vec(
    window=9,
    min_count=2, # Ignore words that occur fewer than "x" times in the corpus, where x is the integer value specified
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=1
)

In [29]:
m_lw_skip_model = gensim.models.Word2Vec(
    window=9,
    min_count=2,
    workers=8, # Number of worker threads (CPU cores); if your CPU has fewer than 8 cores, please lower this value
    sg=1
)

In [30]:
f_lw_skip_model.build_vocab(f_tokens, progress_per=1000)

In [31]:
m_lw_skip_model.build_vocab(m_tokens, progress_per=1000)

--------

### Train the models

The next step consists of training the different groups of models that we have built. We do so with the Gensim <span style="color:#89FC00">.train()</span> method.

<b><span style="color:#EE6352">Small window</span></b>

In [32]:
f_sw_cbow_model.train(f_tokens, total_examples=f_sw_cbow_model.corpus_count, epochs=f_sw_cbow_model.epochs)
m_sw_cbow_model.train(m_tokens, total_examples=m_sw_cbow_model.corpus_count, epochs=m_sw_cbow_model.epochs)

(6226776, 7220155)

In [33]:
f_sw_skip_model.train(f_tokens, total_examples=f_sw_skip_model.corpus_count, epochs=f_sw_skip_model.epochs)
m_sw_skip_model.train(m_tokens, total_examples=m_sw_skip_model.corpus_count, epochs=m_sw_skip_model.epochs)

(6225343, 7220155)

--------------

<b><span style="color:#EE6352">Medium window</span></b>

In [34]:
f_mw_cbow_model.train(f_tokens, total_examples=f_mw_cbow_model.corpus_count, epochs=f_mw_cbow_model.epochs)
m_mw_cbow_model.train(m_tokens, total_examples=m_mw_cbow_model.corpus_count, epochs=m_mw_cbow_model.epochs)

(6226185, 7220155)

In [35]:
f_mw_skip_model.train(f_tokens, total_examples=f_mw_skip_model.corpus_count, epochs=f_mw_skip_model.epochs)
m_mw_skip_model.train(m_tokens, total_examples=m_mw_skip_model.corpus_count, epochs=m_mw_skip_model.epochs)

(6226273, 7220155)

--------

<b><span style="color:#EE6352">Large window</span></b>

In [36]:
f_lw_cbow_model.train(f_tokens, total_examples=f_lw_cbow_model.corpus_count, epochs=f_lw_cbow_model.epochs)
m_lw_cbow_model.train(m_tokens, total_examples=m_lw_cbow_model.corpus_count, epochs=m_lw_cbow_model.epochs)

(6225367, 7220155)

In [37]:
f_lw_skip_model.train(f_tokens, total_examples=f_lw_skip_model.corpus_count, epochs=f_lw_skip_model.epochs)
m_lw_skip_model.train(m_tokens, total_examples=m_lw_skip_model.corpus_count, epochs=m_lw_skip_model.epochs)

(6227745, 7220155)

--------

### Comparison between models

<b><span style="color:#EE6352">Small window</span></b>

In [59]:
f_sw_cbow_model.wv.most_similar("cat")

[('pony', 0.8841539621353149),
 ('sailor', 0.8484433889389038),
 ('pawnshop', 0.8472279906272888),
 ('witch', 0.8382288217544556),
 ('ticket', 0.828952431678772),
 ('rat', 0.827035665512085),
 ('funny', 0.8247242569923401),
 ('stride', 0.8243754506111145),
 ('hump', 0.8238370418548584),
 ('locket', 0.8225745558738708)]

In [60]:
m_sw_cbow_model.wv.most_similar("cat")

[('rabbit', 0.7644575238227844),
 ('snake', 0.7404806613922119),
 ('mouse', 0.7385162711143494),
 ('duck', 0.7225423455238342),
 ('dog', 0.6943410634994507),
 ('tiger', 0.6769894957542419),
 ('weasel', 0.6688323616981506),
 ('frog', 0.6562343239784241),
 ('bird', 0.6529768705368042),
 ('madman', 0.6491336822509766)]

In [61]:
f_sw_skip_model.wv.most_similar("cat")

[('mare', 0.7182251811027527),
 ('charger', 0.6970164775848389),
 ("'who", 0.6866154670715332),
 ('mask', 0.6820776462554932),
 ('tugged', 0.6818537712097168),
 ('snout', 0.6755131483078003),
 ('gipsy', 0.6753441095352173),
 ('basil', 0.6703851819038391),
 ('billy', 0.6690394878387451),
 ('parenthesis', 0.6674005389213562)]

In [63]:
m_sw_skip_model.wv.most_similar("cat")

[('frog', 0.7005339860916138),
 ('rabbit', 0.6855319738388062),
 ('mouse', 0.6796906590461731),
 ('weasel', 0.67022705078125),
 ('panther', 0.6669593453407288),
 ('gardener', 0.666776716709137),
 ('cheshire', 0.6644233465194702),
 ('madman', 0.6584529876708984),
 ('chuchundra', 0.656449019908905),
 ('porpoise', 0.6506761908531189)]

--------------

<b><span style="color:#EE6352">Medium window</span></b>

In [64]:
f_mw_cbow_model.wv.most_similar("cat")

[('basil', 0.8505427837371826),
 ('pony', 0.8295732736587524),
 ('hump', 0.8141831755638123),
 ('carp', 0.8107138872146606),
 ("m'swyne", 0.8095114827156067),
 ('mustache', 0.8079072833061218),
 ('scotchwoman', 0.8060427308082581),
 ('cripple', 0.8038398027420044),
 ('tombstone', 0.8025147914886475),
 ('locket', 0.802374005317688)]

In [65]:
m_mw_cbow_model.wv.most_similar("cat")

[('mouse', 0.7959019541740417),
 ('rabbit', 0.706752598285675),
 ('bird', 0.7050092220306396),
 ('hoarse', 0.6649045944213867),
 ('duchess', 0.6546960473060608),
 ('gluck', 0.6543200612068176),
 ('sigh', 0.6348352432250977),
 ('una', 0.6316945552825928),
 ('dog', 0.6301178932189941),
 ('madman', 0.6276779174804688)]

In [66]:
f_mw_skip_model.wv.most_similar("cat")

[("'n'", 0.6549415588378906),
 ('snout', 0.6533805727958679),
 ('puppy', 0.6421769857406616),
 ('basil', 0.6416086554527283),
 ('magpie', 0.638731062412262),
 ('hostler', 0.6283052563667297),
 ('quicker', 0.6271837949752808),
 ("'who", 0.6242798566818237),
 ('bitten', 0.623835027217865),
 ('squeaking', 0.6227161288261414)]

In [67]:
m_mw_skip_model.wv.most_similar("cat")

[('mouse', 0.6640286445617676),
 ('cheshire', 0.6524803042411804),
 ('parrot', 0.6407116651535034),
 ('buzzy', 0.634723961353302),
 ('weasel', 0.6270034313201904),
 ('pummy', 0.6223182082176208),
 ('panther', 0.6167601346969604),
 ('barked', 0.6111574172973633),
 ('chuchundra', 0.6072313189506531),
 ('terrier', 0.6021856665611267)]

--------

<b><span style="color:#EE6352">Large window</span></b>

In [68]:
f_lw_cbow_model.wv.most_similar("cat")

[('gemman', 0.8393821120262146),
 ('suspiciously', 0.8381377458572388),
 ('sigh', 0.833740234375),
 ('mustache', 0.8319226503372192),
 ('pining', 0.829613208770752),
 ('witch', 0.8292569518089294),
 ('twinkle', 0.8275833129882812),
 ('encampment', 0.8221360445022583),
 ('pony', 0.8209250569343567),
 ('sharply', 0.8182221055030823)]

In [69]:
m_lw_cbow_model.wv.most_similar("cat")

[('mouse', 0.7600972056388855),
 ('rabbit', 0.7179290652275085),
 ('sigh', 0.6581215858459473),
 ('lobster', 0.6536357402801514),
 ('una', 0.6477910876274109),
 ('frog', 0.6341004371643066),
 ('remorsefully', 0.6334105730056763),
 ('madman', 0.6331332921981812),
 ('gluck', 0.6304264068603516),
 ('dog', 0.6173111200332642)]

In [70]:
f_lw_skip_model.wv.most_similar("cat")

[('puppy', 0.6028403639793396),
 ('purring', 0.5921505093574524),
 ('puss', 0.5890118479728699),
 ('snout', 0.5772577524185181),
 ('tiger', 0.5758661031723022),
 ('mouse', 0.5753436088562012),
 ('pheasant', 0.573943555355072),
 ('hurrah', 0.5694153904914856),
 ('squeaking', 0.5672130584716797),
 ('worries', 0.5607874989509583)]

In [71]:
m_lw_skip_model.wv.most_similar("cat")

[('cheshire', 0.7050946950912476),
 ('barked', 0.6400287747383118),
 ('parrot', 0.6369614005088806),
 ('buzzy', 0.6364743709564209),
 ('purr', 0.6242054104804993),
 ('mew', 0.6205807328224182),
 ('mouse', 0.6187389492988586),
 ('frog', 0.6104247570037842),
 ('snarl', 0.6101961135864258),
 ('raving', 0.6100438237190247)]