# How to apply word embedding using Gensim

## Word embedding

Machine learning models take vectors or matrices as input, but in natural language processing we start with documents. We therefore need to convert each word, the basic unit of a document, into a vector, and each document into a matrix.

### One-hot vector - the naive method

The simplest method to try first is the one-hot vector. Given a vocabulary of size V, each word is represented as a V-dimensional vector in which the element at the word's index is 1 and all the others are 0. Let's look at an example:

In [7]:
import numpy as np
vocab = ['I', "love", "cats", "and", "dogs"]  # assume we only have a small vocabulary
vocab_size = len(vocab)

for text_idx, word in enumerate(vocab):
    one_hot = np.zeros(vocab_size)
    one_hot[text_idx] = 1
    print("Word: ", word)
    print("Vector: ", one_hot)

Word:  I
Vector:  [1. 0. 0. 0. 0.]
Word:  love
Vector:  [0. 1. 0. 0. 0.]
Word:  cats
Vector:  [0. 0. 1. 0. 0.]
Word:  and
Vector:  [0. 0. 0. 1. 0.]
Word:  dogs
Vector:  [0. 0. 0. 0. 1.]
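
The same idea extends from a word to a document: stacking one one-hot row per token gives a matrix. The snippet below is a minimal sketch of that step; the helper name `one_hot_matrix` and the three-word sample document are made up for illustration.

```python
import numpy as np

vocab = ["I", "love", "cats", "and", "dogs"]
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

def one_hot_matrix(tokens):
    """Return a (len(tokens), len(vocab)) matrix with one one-hot row per token."""
    indices = [word_to_idx[token] for token in tokens]
    # Row i of the identity matrix is the one-hot vector of vocab[i]
    return np.eye(len(vocab))[indices]

doc_matrix = one_hot_matrix(["I", "love", "dogs"])
print(doc_matrix.shape)
print(doc_matrix)
```

Each row has exactly one non-zero entry, so the matrix grows with both the document length and the vocabulary size.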


### Some issues with one-hot
- First, you cannot infer any relationship between two words from their one-hot representations. In the previous example, every pair of distinct words is equally dissimilar, so the vectors tell us nothing about how the words relate to each other.
- Second, we waste a lot of space on zero elements. In the previous example the vocabulary has only 5 words, yet each word needs a 5-dimensional vector with four zeros. With a vocabulary of V words, each vector holds V - 1 zeros.
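
The first issue can be checked numerically. The sketch below uses cosine similarity as the similarity measure (an assumption, since the text above does not name one) and shows that any two distinct one-hot vectors score exactly zero.

```python
import numpy as np

vocab = ["I", "love", "cats", "and", "dogs"]
one_hots = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cats = one_hots[vocab.index("cats")]
dogs = one_hots[vocab.index("dogs")]
love = one_hots[vocab.index("love")]

sim_cats_dogs = cosine_similarity(cats, dogs)
sim_cats_love = cosine_similarity(cats, love)
sim_cats_cats = cosine_similarity(cats, cats)

print("cats vs dogs:", sim_cats_dogs)
print("cats vs love:", sim_cats_love)
print("cats vs cats:", sim_cats_cats)
```

"cats" is no closer to "dogs" than it is to "love", which is exactly the information a learned embedding is meant to recover.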

## Word2Vec with Gensim

There are two types of Word2Vec: Skip-gram and Continuous Bag of Words (CBOW). Let me describe both methods, then train a model with Gensim.

### Skip-gram
For skip-gram, the input is the target word, while the outputs are the words surrounding the target word within a predefined window size. For example, with the sentence "I love cats and dogs" and a window size of 2, if the input is "love" then the outputs are "I", "cats", and "and".
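
As a rough illustration of how such training pairs could be enumerated, here is a minimal sketch. The function `skipgram_pairs` is hypothetical and only mirrors the description above; it is not how Gensim builds its training examples internally.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # the target is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "I love cats and dogs".split()
pairs = skipgram_pairs(tokens, window=2)

# Context words produced for the target "love"
love_contexts = [context for target, context in pairs if target == "love"]
print(love_contexts)
```

"love" has only one neighbour on its left, so it yields three context words rather than four.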
### Continuous Bag of Words (CBOW) 
It is very similar to skip-gram, except that it swaps the input and output. The idea is that, given a context, we want to know which word is most likely to appear in it. For example, given the words "I", "cats", and "and", the output is "love".
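
A matching sketch for CBOW groups the surrounding words into a single input and uses the centre word as the output. The helper `cbow_examples` and the window size of 2 are illustrative assumptions, not Gensim internals.

```python
def cbow_examples(tokens, window):
    """Return (context_words, target) examples, one per token."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        examples.append((left + right, target))
    return examples

tokens = "I love cats and dogs".split()
examples = cbow_examples(tokens, window=2)

for context, target in examples:
    print(context, "->", target)
```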

Now let's perform word embedding with Gensim.
Install gensim:
 - Open a terminal
 - Type: pip install gensim

In [12]:
import numpy as np
import os
from random import shuffle
import re
import urllib.request
import zipfile
import lxml.etree
# download the data
urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")
# extract the subtitle text from the XML file inside the archive
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))

In [17]:
# Size of input text
len(input_text)

24222849

In [18]:
# First 100 characters of input text
input_text[:100]

"Here are two reasons companies fail: they only do more of the same, or they only do what's new.\nTo m"

In [19]:
# Preprocess the data
# remove text in parentheses
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)
# store as list of sentences
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    # drop a short "label:" prefix (at most 20 characters before the colon), keep the rest
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    # split the remaining text into sentences on periods, skipping empty strings
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)
# store as list of lists of words
sentences_ted = []
for sent_str in sentences_strings_ted:
    # lowercase, replace every run of non-alphanumeric characters with a space, then split
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)
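
To see what each preprocessing step does, the sketch below wraps the same regular expressions in a hypothetical `preprocess` function and applies it to a short made-up transcript (the sample text is not from the TED data).

```python
import re

def preprocess(raw_text):
    """Apply the same cleaning steps as above and return a list of token lists."""
    no_parens = re.sub(r"\([^)]*\)", "", raw_text)
    sentences = []
    for line in no_parens.split("\n"):
        m = re.match(r"^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$", line)
        sentences.extend(s for s in m.group("postcolon").split(".") if s)
    return [re.sub(r"[^a-z0-9]+", " ", s.lower()).split() for s in sentences]

# Made-up sample with a speaker label, a parenthesized note, and an apostrophe
sample = "Speaker A: Thank you (Applause). It's great to be here.\nWe have 2 ideas."
sample_tokens = preprocess(sample)
for tokens in sample_tokens:
    print(tokens)
```

The speaker label and the parenthesized note disappear, every token is lowercased, and the apostrophe splits "It's" into "it" and "s".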

Let's inspect the sentences_ted list

In [21]:
# number of sentences
len(sentences_ted)

266694

In [25]:
# first 2 elements
sentences_ted[:2]

[['here',
  'are',
  'two',
  'reasons',
  'companies',
  'fail',
  'they',
  'only',
  'do',
  'more',
  'of',
  'the',
  'same',
  'or',
  'they',
  'only',
  'do',
  'what',
  's',
  'new'],
 ['to',
  'me',
  'the',
  'real',
  'real',
  'solution',
  'to',
  'quality',
  'growth',
  'is',
  'figuring',
  'out',
  'the',
  'balance',
  'between',
  'two',
  'activities',
  'exploration',
  'and',
  'exploitation']]

This is the form that is ready to be fed into the Word2Vec model defined in Gensim. A Word2Vec model can be trained with a single line, as in the code below.

In [26]:
from gensim.models import Word2Vec
model_ted = Word2Vec(sentences=sentences_ted, size=100, window=5, min_count=5, workers=4, sg=0)

Some important parameters:
- sentences: the tokenized sentences from all documents (a list of lists of words)
- size: the dimensionality of the embedding vectors (renamed to vector_size in Gensim 4.0 and later)
- window: the maximum distance between the target word and the context words considered
- min_count: words whose total count is below this number are ignored
- workers: the number of worker threads used for training
- sg: the training algorithm, 0 for CBOW and 1 for skip-gram
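
The call above uses the `size` keyword from Gensim 3.x, while Gensim 4.0 and later expect `vector_size`. One possible way to keep the notebook working on either version is to pick the keyword name from the installed version string. The helper `embedding_dim_kwarg` below is hypothetical and only a sketch.

```python
def embedding_dim_kwarg(gensim_version, dim):
    """Return the keyword argument that sets the embedding dimensionality.

    Gensim 4.0 renamed `size` to `vector_size`, so the name depends on the major version.
    """
    major = int(gensim_version.split(".")[0])
    name = "vector_size" if major >= 4 else "size"
    return {name: dim}

kwargs_v3 = embedding_dim_kwarg("3.8.3", 100)
kwargs_v4 = embedding_dim_kwarg("4.3.2", 100)
print(kwargs_v3)
print(kwargs_v4)
```

With this in place, the model could be built as `Word2Vec(sentences=sentences_ted, window=5, min_count=5, workers=4, sg=0, **embedding_dim_kwarg(gensim.__version__, 100))` after `import gensim`.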

Let's explore some features of the Gensim model

In [28]:
# find the words most similar to "man"
model_ted.wv.most_similar("man")

[('woman', 0.8459959030151367),
 ('guy', 0.8118977546691895),
 ('boy', 0.765205979347229),
 ('girl', 0.7609017491340637),
 ('lady', 0.7598025798797607),
 ('kid', 0.725189208984375),
 ('soldier', 0.7227253317832947),
 ('gentleman', 0.7090329527854919),
 ('poet', 0.6967312693595886),
 ('friend', 0.6757165193557739)]
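
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking by hand on three made-up 3-dimensional vectors; the values and the helper `rank_by_cosine` are illustrative only and are not taken from the trained model.

```python
import numpy as np

# Made-up vectors, for illustration only
toy_vectors = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.8, 0.6, 0.0]),
    "cat": np.array([0.0, 0.0, 1.0]),
}

def rank_by_cosine(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue
        cos = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scores.append((other, cos))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = rank_by_cosine("man", toy_vectors)
print(ranking)
```

Unlike one-hot vectors, dense vectors can point in similar directions, so related words receive a score close to 1.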

In [31]:
# the vector that represents the word "man"
man_vector = model_ted.wv['man']

In [34]:
man_vector.shape  # equals the "size" parameter of the model

(100,)

In [36]:
man_vector

array([ 7.8068745e-01, -4.5165992e-01,  1.1014861e+00, -9.3772238e-01,
        1.6919795e-01, -1.9153500e+00,  2.4239585e-01,  1.5283470e-02,
        1.9348862e+00, -1.2310663e+00, -1.8260899e+00,  2.2298059e-01,
       -7.5558901e-01, -1.9010946e+00, -1.8604414e+00, -4.2120236e-01,
       -6.2933797e-01, -8.3188128e-01, -2.2161929e-02,  4.7784784e-01,
       -2.0343399e+00, -1.5837117e+00,  1.8361408e-02, -9.8428136e-01,
       -1.0060577e+00,  4.3728873e-01,  9.1070540e-02, -1.5524529e+00,
        4.5749509e-01,  1.7482825e-01,  2.0538163e+00, -1.9110399e+00,
        2.0106401e+00, -1.6175275e-01, -2.1399779e+00,  1.4197608e+00,
        9.6490607e-03, -2.4535669e-01, -1.3951262e+00, -5.0295997e-01,
        7.5743324e-01, -7.4039704e-01, -1.1575367e+00, -7.0345587e-01,
       -9.2523569e-01, -1.9069070e-01, -3.4693712e-03,  2.6064306e-01,
        9.4525474e-01,  1.8784159e+00, -8.4275073e-01,  1.2200514e+00,
        1.1355350e+00,  1.4757791e+00, -2.6515985e+00,  1.3076806e-01,
      

The biggest issue with Word2Vec is that it cannot represent words that do not appear in its training vocabulary. For example, if we ask for the vector of the word "I", we get the error "word 'I' not in vocabulary": the preprocessing step lowercased every token, so the uppercase form "I" never reached the model. Words that occur fewer than min_count times are left out in the same way. Training on a larger corpus gives a larger vocabulary and reduces how often this happens.

In [38]:
model_ted.wv['I']

KeyError: "word 'I' not in vocabulary"
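
One way to avoid this error is to check membership before indexing and to retry with the lowercased form, since the training tokens were lowercased. The helper `lookup` below is a hypothetical sketch; a plain dict stands in for `model_ted.wv`, which supports the same `in` and `[]` operations.

```python
def lookup(word_vectors, word, default=None):
    """Return the vector for `word`, trying the lowercased form before giving up."""
    for candidate in (word, word.lower()):
        if candidate in word_vectors:
            return word_vectors[candidate]
    return default

# Stand-in for model_ted.wv with made-up 2-dimensional vectors
toy_wv = {"i": [0.1, 0.2], "man": [0.3, 0.4]}

vec_upper_i = lookup(toy_wv, "I")       # found through the lowercased form "i"
vec_missing = lookup(toy_wv, "zebra")   # not in the vocabulary at all
print(vec_upper_i)
print(vec_missing)
```

Returning a default does not create a representation for the missing word; it only lets the calling code decide how to handle it.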

Reference links
 - https://radimrehurek.com/gensim/
 - https://radimrehurek.com/gensim/install.html
 - https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c
 - https://radimrehurek.com/gensim/tut1.html
 - https://www.depends-on-the-definition.com/guide-to-word-vectors-with-gensim-and-keras/