## Word2vec model for GOBBYKID

To display values that represent word similarities, we first need to train a model on each of our corpora. These models use <i>word embeddings</i> to retrieve similarities between the words of each corpus: every word is mapped to a feature vector, and words used in similar contexts end up with similar vectors.
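To make the idea of "similarity between feature vectors" concrete, here is a minimal sketch of cosine similarity, the measure Gensim uses when comparing word vectors. The three-dimensional vectors are made-up toy values for illustration only; real Word2Vec vectors are learned from the corpus and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, not taken from a trained model)
girl = [0.9, 0.1, 0.3]
child = [0.8, 0.2, 0.4]
goose = [0.1, 0.9, 0.2]

print("girl ~ child:", cosine_similarity(girl, child))
print("girl ~ goose:", cosine_similarity(girl, goose))
```

Vectors that point in a similar direction score close to 1, while unrelated directions score close to 0.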

### Setup

#### Install required libraries

To define a w2v (Word2Vec) model, we first need to install some libraries.<br>
In particular, we need [Gensim](https://radimrehurek.com/gensim/index.html) and [Pandas](https://pandas.pydata.org), both of which can be installed with pip.

Install the latest version of gensim:

```bash
    pip install --upgrade gensim
```

Or, if you have instead downloaded and unzipped the [source tar.gz] package:

```bash
    python setup.py install
```

Once that is done, you can import the libraries, as well as the functions defined in the [w2v functions file](w2v_functions.py).

In [1]:
import gensim
import pandas as pd
# Import only the helpers used below instead of a wildcard import
from w2v_functions import create_corpus, list_builder

### Reading and Exploring the Dataset
We train one model on each of our corpora.

The first step is to read all the texts and process them as sentences.<br>
To do that, we first create the two corpora for which we want to build the models. We then store the paths of the texts in each corpus in two separate lists, which are passed to a function that collects all the tokens of a corpus into a list.

In [2]:
f_directory = "Raw/F/"
m_directory = "Raw/M/"
f_corpus = create_corpus(f_directory)
m_corpus = create_corpus(m_directory)

In [3]:
f_authors_texts = list()
f_titles = list()
for url in f_corpus.fileids():
    if url != '.DS_Store':
        f_authors_texts.append(f_directory+url)

m_authors_texts = list()
m_titles = list()
for url in m_corpus.fileids():
    if url != '.DS_Store':
        m_authors_texts.append(m_directory+url)

print("URLS of female authors texts:", f_authors_texts, '\n')
print("URLS of male authors texts:", m_authors_texts, '\n')

URLS of female authors texts: ['Raw/F/1857_grannys_wonderful_chair.txt', 'Raw/F/1857_the_rambles_of_a_rat.txt', 'Raw/F/1869_little_women.txt', 'Raw/F/1869_mrs_overtheways_remembrances.txt', 'Raw/F/1872_a_dog_of_flanders.txt', 'Raw/F/1877_black_beauty.txt', 'Raw/F/1877_the_cuckoo_clock.txt', 'Raw/F/1886_little_lord_fauntleroy.txt', 'Raw/F/1899_the_story_of_the_treasure_seekers.txt', 'Raw/F/1902_the_tale_of_peter_rabbit.txt', 'Raw/F/1903_rebecca_of_sunnybrook_farm.txt', 'Raw/F/1908_anne_of_green_gables.txt', 'Raw/F/1911_the_secret_garden.txt'] 

URLS of male authors texts: ['Raw/M/1857_tom_browns_school_days.txt', 'Raw/M/1865_alices_adventures_in_wonderland.txt', 'Raw/M/1869_david_copperfield.txt', 'Raw/M/1871_at_the_back_of_the_north_wind.txt', 'Raw/M/1876_the_adventures_of_tom_sawyer.txt', 'Raw/M/1883_treasure_island.txt', 'Raw/M/1888_the_happy_prince_and_other_tales.txt', 'Raw/M/1894_the_jungle_book.txt'] 



In [4]:
f_tokens = list_builder(f_authors_texts)

In [5]:
m_tokens = list_builder(m_authors_texts)
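The `list_builder` helper lives in `w2v_functions.py` and is not shown here. As a rough sketch of what such a function could look like, the hypothetical `build_sentence_lists` below reads each file and returns one list of lowercase tokens per sentence, which is the input shape Word2Vec expects. The naive sentence splitting and the token pattern are assumptions, not the project's actual preprocessing.

```python
import re
import tempfile
from pathlib import Path

def build_sentence_lists(paths):
    """Hypothetical stand-in for list_builder: one token list per sentence."""
    sentences = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace
        for raw_sentence in re.split(r"(?<=[.!?])\s+", text):
            tokens = re.findall(r"[a-z']+", raw_sentence.lower())
            if tokens:
                sentences.append(tokens)
    return sentences

# Small demo on a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_text("The girl ran. She laughed!", encoding="utf-8")
    print(build_sentence_lists([sample]))
```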

Once we have the two lists, we initialize two separate models with the Gensim library.<br>
Each model is trained on, and then queried for, its own corpus.

<span style="color:red">To benefit from the "workers" parameter, you first need to install Cython; without it, training runs on a single core.</span>

In [17]:
f_model = gensim.models.Word2Vec(
    window=6,
    min_count=2,
    workers=8,  # number of worker threads; lower this if your CPU has fewer than 8 cores
    sg=1
)

In [18]:
m_model = gensim.models.Word2Vec(
    window=6,
    min_count=2,
    workers=8,  # number of worker threads; lower this if your CPU has fewer than 8 cores
    sg=1
)
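Rather than editing the `workers` value by hand on each machine, one option is to derive it from the number of available cores. The helper name `pick_workers` is hypothetical; this is only a sketch of the idea, using the standard library.

```python
import os

def pick_workers(requested=8):
    """Cap the requested number of worker threads at the available CPU count."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))

print("workers to use:", pick_workers(8))
```

The result could then be passed as `workers=pick_workers(8)` when creating the models.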

Next, we build the vocabulary of unique tokens for each of our models.

In [19]:
f_model.build_vocab(f_tokens, progress_per=1000)

In [20]:
m_model.build_vocab(m_tokens, progress_per=1000)

In [10]:
f_model.corpus_count

43108
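Two quantities are fixed at this stage: `corpus_count`, the number of sentences seen while building the vocabulary, and the vocabulary itself, from which tokens occurring fewer than `min_count` times are dropped. The plain-Python sketch below illustrates that effect on a toy corpus; it mimics the outcome and is not Gensim's implementation.

```python
from collections import Counter

# Toy corpus: one token list per sentence (made-up data)
sentences = [
    ["the", "girl", "ran"],
    ["the", "boy", "ran"],
    ["a", "girl", "laughed"],
]
min_count = 2  # same threshold as in the models above

# Count every token, then keep only those seen at least min_count times
counts = Counter(token for sentence in sentences for token in sentence)
vocab = sorted(token for token, n in counts.items() if n >= min_count)
corpus_count = len(sentences)

print("corpus_count:", corpus_count)
print("vocabulary:", vocab)
```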

The next step is to train the two models:

In [21]:
f_model.train(f_tokens, total_examples=f_model.corpus_count, epochs=f_model.epochs)

(2254977, 2632805)

In [22]:
m_model.train(m_tokens, total_examples=m_model.corpus_count, epochs=m_model.epochs)

(2218105, 2600725)
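In Gensim, the tuple returned by `train` holds the number of words actually trained on (after out-of-vocabulary words are dropped and frequent words are downsampled) followed by the raw word count, both summed over all epochs. As a quick sanity check, the share of raw tokens that were effectively used can be computed from the outputs above. The per-epoch figure assumes Gensim's default of 5 epochs, since the models were created without an explicit `epochs` argument.

```python
# Values copied from the train() outputs above: (effective words, raw words)
f_trained, f_raw = 2254977, 2632805
m_trained, m_raw = 2218105, 2600725

def effective_share(trained, raw):
    """Fraction of raw tokens that contributed to training."""
    return trained / raw

ASSUMED_EPOCHS = 5  # assumption: Gensim's default number of epochs

for name, trained, raw in [("F", f_trained, f_raw), ("M", m_trained, m_raw)]:
    share = effective_share(trained, raw)
    per_epoch = raw // ASSUMED_EPOCHS
    print(f"{name} corpus: {share:.1%} of tokens used, {per_epoch} raw tokens per epoch")
```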

Now we can query each model for the words most similar to a given word, for example "girl":

In [23]:
f_model.wv.most_similar("girl")

[('child', 0.7626883387565613),
 ('creature', 0.7604007720947266),
 ('accomplished', 0.7479668855667114),
 ('passenger', 0.7471694946289062),
 ('soul', 0.7455093860626221),
 ('alois', 0.7447825074195862),
 ("'my", 0.7436867356300354),
 ('chap', 0.741326093673706),
 ('stupid', 0.7399416565895081),
 ('goose', 0.739003598690033)]

In [None]:
f_model.wv.most_similar("girl")

[('passenger', 0.7342532277107239),
 ('delighted', 0.7327626943588257),
 ('accomplished', 0.7324975728988647),
 ('doll', 0.7275893092155457),
 ('dearest', 0.7269617915153503),
 ('cordelia', 0.7266648411750793),
 ('child', 0.7213857769966125),
 ("'my", 0.7212626338005066),
 ('lavander', 0.7201829552650452),
 ('alois', 0.7198037505149841)]

In [24]:
m_model.wv.most_similar("girl")

[('swallow', 0.8773272633552551),
 ('prince', 0.8648238778114319),
 ('emily', 0.8594668507575989),
 ('creature', 0.8564527034759521),
 ('minnie', 0.8562754988670349),
 ('angel', 0.8558725118637085),
 ('child', 0.8552056550979614),
 ('blossom', 0.8499689102172852),
 ('hans', 0.8358373045921326),
 ('tender', 0.8356804251670837)]

In [None]:
m_model.wv.most_similar("girl")

[('prince', 0.8320460915565491),
 ('minnie', 0.8278048634529114),
 ('angel', 0.8260985016822815),
 ('blossom', 0.819741427898407),
 ('murmured', 0.8171412944793701),
 ('emily', 0.8107228875160217),
 ('swallow', 0.8087517023086548),
 ('lover', 0.807963490486145),
 ('fisherman', 0.8067976236343384),
 ('child', 0.8024742603302002)]
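The `In [None]` cells repeat the same queries and return somewhat different lists, presumably because they come from a separate training run (an assumption; the notebook does not say). A simple way to compare such neighbour lists, across runs or across the two corpora, is to look at the words they share. The sketch below uses the words printed above; the variable and function names are hypothetical.

```python
# Neighbours of "girl", copied from the outputs above (scores omitted)
f_run_a = ["child", "creature", "accomplished", "passenger", "soul",
           "alois", "'my", "chap", "stupid", "goose"]
f_run_b = ["passenger", "delighted", "accomplished", "doll", "dearest",
           "cordelia", "child", "'my", "lavander", "alois"]
m_run_a = ["swallow", "prince", "emily", "creature", "minnie",
           "angel", "child", "blossom", "hans", "tender"]
m_run_b = ["prince", "minnie", "angel", "blossom", "murmured",
           "emily", "swallow", "lover", "fisherman", "child"]

def shared_neighbours(first, second):
    """Words that appear in both neighbour lists, in sorted order."""
    return sorted(set(first) & set(second))

print("F run A vs F run B:", shared_neighbours(f_run_a, f_run_b))
print("M run A vs M run B:", shared_neighbours(m_run_a, m_run_b))
print("F run A vs M run A:", shared_neighbours(f_run_a, m_run_a))
```

Lists from the same corpus share more neighbours with each other than lists from different corpora, which suggests the variation between runs is smaller than the difference between the two corpora.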