**Generate word embeddings from text**

In this notebook I use the text of the Game of Thrones books to create word embeddings with the Word2Vec algorithm.

Gensim provides the Word2Vec implementation and is also used to save the trained embeddings and load them in a Flask app. The end goal is to retrieve the most similar words for a given input word.

Word embeddings are dense vector representations of words that capture some of the context in which the words appear. They improve on bag-of-words representations, which produce large, sparse vectors.
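
To make the contrast concrete, here is a minimal sketch of a bag-of-words count vector built with the standard library only. The toy sentences and the names `vocab` and `bag_of_words` are illustrative assumptions, not part of this notebook's pipeline. The vector needs one slot per vocabulary word, so it grows with the vocabulary and is mostly zeros, whereas a Word2Vec embedding keeps a small, fixed number of dimensions (100 by default in Gensim).

```python
from collections import Counter

# Toy corpus (illustrative only; the notebook trains on the book text).
sentences = [
    ["winter", "is", "coming"],
    ["the", "north", "remembers"],
    ["a", "lannister", "always", "pays", "his", "debts"],
]

# One vector slot per distinct word in the corpus.
vocab = sorted({word for sent in sentences for word in sent})


def bag_of_words(tokens, vocab):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


vector = bag_of_words(sentences[0], vocab)
nonzero = sum(1 for value in vector if value != 0)
print(f"vector length: {len(vector)}, non-zero entries: {nonzero}")
```

With a real corpus the vocabulary runs to tens of thousands of words, so almost every entry of such a vector is zero.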

The next step in this project is to plot the embeddings and check that similar words lie close together in the vector space.

In [1]:
pip install gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
import os
import nltk
from nltk import sent_tokenize
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Generate words from text**

Here we loop over the five text files in the `Data` directory and split each one into tokenized sentences (lists of words) that can be passed to the Word2Vec model.

In [9]:
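all_words = []  # one list of tokens per sentence
# Loaded for optional filtering; not applied below, so stop words stay in the data.
stops = stopwords.words("english")
for filename in os.listdir('Data'):
  # Read each file; the context manager closes it afterwards.
  with open(os.path.join('Data', filename), encoding='ISO-8859-1') as f:
    text = f.read()
  raw_sent = sent_tokenize(text)

  # Lowercase and tokenize each sentence.
  for sent in raw_sent:
    all_words.append(simple_preprocess(sent))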
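
For intuition about what each entry of `all_words` looks like, the following is a rough standard-library approximation of `simple_preprocess` with its default settings: lowercase the text, keep alphabetic tokens, and drop tokens shorter than 2 or longer than 15 characters. The function name `approx_preprocess` and the sample sentence are hypothetical. Gensim's real implementation handles more cases (such as optional accent removal), so treat this as a sketch rather than a replacement.

```python
import re

# Runs of word characters that are not digits; an approximation of the
# alphabetic-token pattern that Gensim uses.
TOKEN_PATTERN = re.compile(r"[^\W\d]+")


def approx_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop very short or very long tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


sample = "Winter is coming, said Ned. A raven flew in 298 AC!"
print(approx_preprocess(sample))
```

Punctuation and numbers disappear and single-letter words are dropped, which is why the model's vocabulary contains only lowercase word tokens.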

In [11]:
from gensim.models import Word2Vec

**Parameters for word2vec model**

**size**: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word). Gensim 4.0 and later name this parameter `vector_size`.

**window**: (default 5) The maximum distance between a target word and words around the target word.

**min_count**: (default 5) The minimum number of occurrences a word needs to be included in training; words that occur fewer times than this are ignored.

**workers**: (default 3) The number of threads to use while training.

**sg**: (default 0, i.e. CBOW) The training algorithm: CBOW (0) or skip-gram (1).
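
As a rough illustration of how `min_count` and `window` shape the training data, the sketch below drops rare words and then lists (target, context) pairs within a fixed window. The helper names `filter_rare` and `context_pairs` and the toy corpus are assumptions made for this example. Gensim's internals differ (for instance, it can randomly shrink the effective window for each target word), so this only shows the idea.

```python
from collections import Counter


def filter_rare(sentences, min_count):
    """Drop words that occur fewer than min_count times in the corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[w for w in sent if counts[w] >= min_count] for sent in sentences]


def context_pairs(tokens, window):
    """Yield (target, context) pairs at most `window` positions apart."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, tokens[j]


corpus = [
    ["the", "queen", "rode", "north"],
    ["the", "queen", "rode", "south"],
]

# "north" and "south" each occur once, so min_count=2 removes them.
kept = filter_rare(corpus, min_count=2)
pairs = list(context_pairs(kept[0], window=1))
print(kept[0])
print(pairs)
```

A larger `window` produces more pairs per target word and tends to capture broader, more topical relationships between words.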

In [12]:
model_got = Word2Vec(all_words, min_count=5, window=10)

To get the embedding vector for a token:

In [14]:
# Look up vectors through the `wv` attribute; indexing the model directly is deprecated.
model_got.wv['daenerys']

array([ 0.00550811,  0.1979751 ,  0.45436585, -0.17187944, -1.5095897 ,
       -0.12588347,  0.5069664 , -0.472151  ,  0.5028591 , -0.52353543,
       -0.43614462, -1.0732169 , -0.1719547 ,  1.0185014 , -0.08529336,
        0.76177436,  0.65787834, -0.96632373,  0.2659897 ,  0.34751108,
       -0.3727839 , -0.44544104, -0.58361703, -0.4645536 ,  1.0248919 ,
       -0.18955937,  1.0321758 ,  0.31095877,  1.0237343 ,  0.6839877 ,
       -0.60695994, -0.5354355 , -1.2755035 ,  0.7971668 , -0.6465591 ,
        0.08155406,  0.8134394 ,  0.6021161 , -1.0907316 ,  0.81290597,
       -0.9022983 , -0.5896426 ,  0.6111805 ,  0.17473985,  0.06486788,
        0.2494802 ,  0.8075841 ,  0.23277532, -0.70357394,  0.00439382,
       -1.2981724 , -0.36704057, -0.25806957,  0.36551657,  0.08756131,
        0.9101219 ,  0.4230666 ,  0.49521732,  0.568394  , -0.24985427,
        1.3264594 , -0.4367455 , -0.09210678, -0.8864302 , -0.01741638,
       -0.11204147,  0.01746313, -0.00854641,  0.0116835 , -0.05

In [15]:
# Query similar words through `wv`; calling most_similar on the model itself is deprecated.
model_got.wv.most_similar('daenerys')

[('stormborn', 0.7624735832214355),
 ('targaryen', 0.7419289350509644),
 ('queen', 0.7079192996025085),
 ('princess', 0.6999870538711548),
 ('myrcella', 0.6607555747032166),
 ('elia', 0.6533308029174805),
 ('dorne', 0.6479859352111816),
 ('viserys', 0.6382139921188354),
 ('margaery', 0.6318345665931702),
 ('unburnt', 0.6315217018127441)]
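
The scores above are cosine similarities between word vectors. As a minimal sketch of that ranking, using the standard library only and made-up 3-dimensional vectors (the names `toy_vectors`, `cosine`, and `most_similar_toy` are hypothetical, and the numbers are not taken from the trained model):

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration; the trained embeddings are much longer.
toy_vectors = {
    "queen": [0.9, 0.8, 0.1],
    "princess": [0.8, 0.9, 0.2],
    "sword": [0.1, 0.2, 0.9],
}


def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar_toy("queen", toy_vectors))
```

Cosine similarity depends only on the direction of the vectors, not their length, so a score near 1 means two words point the same way in the embedding space.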

In [17]:
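# Alternative (not used here): export only the word vectors in word2vec binary format.
# model_got.wv.save_word2vec_format('model_got.bin', binary=True)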

Save the model to disk in Gensim's native format. This file is loaded in the Flask app in the later steps; it can be read back with `Word2Vec.load('model_embeddings.bin')`.

In [18]:
model_got.save('model_embeddings.bin')