## <center>Word Vector (a.k.a. Word Embedding)</center>

### 1.1 Word2Vec
 - Vector representations of words (i.e. word vectors) learned using a neural network
   - e.g. "apple": [0.35, -0.2, 0.4, ...], "mango": [0.32, -0.18, 0.5, ...]
   - Interesting properties of word vectors:
    * **Words with similar semantics have close word vectors**
    <img src="https://www.kdnuggets.com/images/cartoon-espresso-word2vec.jpg" width="50%">
    https://www.kdnuggets.com/2017/04/cartoon-word2vec-espresso-cappuccino.html
    * **Composition**: e.g. vector("woman")+vector("king")-vector('man') $\approx$ vector("queen")
 - Models:
   - **CBOW** (Continuous Bag of Words): Predict a target word based on context
     - e.g. the fox jumped over the lazy dog
     - Assuming a symmetric context of one word on each side of the target (a window of 3 words in total), this sentence yields training samples such as:
       - ([-, fox], the) 
       - ([the, jumped], fox) 
       - ([fox, over], jumped)
       - ([jumped, the], over) 
       - ...
       
       <img src="cbow.png" width="50%">
       source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
   - **Skip-gram**: Predict the context words based on a target word
   
        <img src="skip_gram.png" width="50%">
        source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
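
The training samples described above can be generated with a few lines of plain Python. This is a minimal sketch, not how gensim builds its batches: the helper names `cbow_samples` and `skipgram_samples`, the `half_window` parameter (words taken on each side of the target), and the `-` padding symbol are assumptions introduced for illustration.

```python
def cbow_samples(tokens, half_window=1, pad="-"):
    """Return (context, target) pairs; context has half_window words per side."""
    samples = []
    for i, target in enumerate(tokens):
        # positions before the start or after the end are filled with the pad symbol
        left = [tokens[j] if j >= 0 else pad
                for j in range(i - half_window, i)]
        right = [tokens[j] if j < len(tokens) else pad
                 for j in range(i + 1, i + 1 + half_window)]
        samples.append((left + right, target))
    return samples


def skipgram_samples(tokens, half_window=1):
    """Return (target, context_word) pairs, one pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


tokens = "the fox jumped over the lazy dog".split()

# CBOW: the context predicts the target
for context, target in cbow_samples(tokens)[:4]:
    print(context, "->", target)

# Skip-gram: the target predicts each context word separately
print(skipgram_samples(tokens)[:4])
```

The first four CBOW samples reproduce the list given for the example sentence. Skip-gram reverses the direction, so each (target, context word) pair becomes its own training example.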

In [1]:
# set up interactive shell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Exercise 1.1 Train your word vector

import pandas as pd
import nltk,string

# Load data
data=pd.read_csv('amazon_review_large.csv')
data.columns=['label','text']
data.head()

# tokenize each document into a list of unigrams
# strip punctuations and leading/trailing spaces from unigrams
# only unigrams with 2 or more characters are taken
sentences=[ [token.strip(string.punctuation).strip() \
             for token in nltk.word_tokenize(doc.lower()) \
                 if token not in string.punctuation and \
                 len(token.strip(string.punctuation).strip())>=2]\
             for doc in data["text"]]
print(sentences[0:2])

Unnamed: 0,label,text
0,2,This is a little longer and more detailed than...
1,1,Only Michelle Branch save this album!!!!All gu...
2,2,"A surprisingly good book, given its inherently..."
3,2,"This is a wonderful, quiet and relaxing CD tha..."
4,1,The lights that I received are absolutely not ...


[['this', 'is', 'little', 'longer', 'and', 'more', 'detailed', 'than', 'the', 'first', 'two', 'books', 'in', 'the', 'series', 'however', 'have', 'enjoyed', 'each', 'new', 'aspect', 'of', 'the', 'exciting', 'fantasy', 'universe'], ['only', 'michelle', 'branch', 'save', 'this', 'album', 'all', 'guys', 'play', 'along', 'with', 'unenthusiastic', 'beat', 'even', 'karl']]


In [3]:
# Train your own word vectors using gensim

# gensim.models is the package for word2vec
# check https://radimrehurek.com/gensim/models/word2vec.html
# for detailed description

from gensim.models import word2vec
import logging
import pandas as pd

# print out tracking information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', \
                    level=logging.INFO)

# min_count: words with total frequency lower than this are ignored
# vector_size: the dimension of the word vectors
#              (named `size` in gensim versions before 4.0)
# window: context window, i.e. the maximum distance 
#         between the current and predicted word 
#         within a sentence
# workers: number of parallel threads used in training
# for other parameters, check https://radimrehurek.com/gensim/models/word2vec.html
wv_model = word2vec.Word2Vec(sentences, \
            min_count=5, vector_size=200, \
            window=5, workers=4 )

2023-04-24 20:25:53,651 : INFO : collecting all words and their counts
2023-04-24 20:25:53,651 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-04-24 20:25:53,709 : INFO : PROGRESS: at sentence #10000, processed 712099 words, keeping 36835 word types
2023-04-24 20:25:53,767 : INFO : collected 55006 word types from a corpus of 1424497 raw words and 20000 sentences
2023-04-24 20:25:53,767 : INFO : Creating a fresh vocabulary
2023-04-24 20:25:53,789 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 12138 unique words (22.07% of original 55006, drops 42868)', 'datetime': '2023-04-24T20:25:53.789491', 'gensim': '4.3.0', 'python': '3.10.9 (main, Jan 11 2023, 09:18:18) [Clang 14.0.6 ]', 'platform': 'macOS-13.0-arm64-arm-64bit', 'event': 'prepare_vocab'}
2023-04-24 20:25:53,789 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1362451 word corpus (95.64% of original 1424497, drops 62046)', 'datetime': '2023-04-24T20:25
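
The `prepare_vocab` messages in the log report how many word types and how many tokens survive the `min_count` threshold. The sketch below shows the same kind of filtering on a toy corpus. It is an illustration of the idea only, not gensim's code; `filter_by_min_count` and `toy_sentences` are hypothetical names.

```python
from collections import Counter


def filter_by_min_count(sentences, min_count):
    """Keep word types whose total frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    vocab = {word for word, count in counts.items() if count >= min_count}
    kept_tokens = sum(count for word, count in counts.items() if word in vocab)
    total_tokens = sum(counts.values())
    return vocab, kept_tokens, total_tokens


toy_sentences = [
    ["great", "sound", "great", "band"],
    ["great", "book", "poor", "sound"],
]

vocab, kept, total = filter_by_min_count(toy_sentences, min_count=2)
print(sorted(vocab))
print(f"{len(vocab)} word types kept, {kept} of {total} tokens kept")
```

Rare words make up most of the word types but only a small share of the tokens, which is why the log shows a large drop in unique words and a small drop in corpus size.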

In [4]:
# test word2vec model

print("Top 5 words similar to word 'sound'")
wv_model.wv.most_similar('sound', topn=5)

print("Top 5 words similar to word 'sound' but not relevant to 'film'")
wv_model.wv.most_similar(positive=['sound','music'], \
                         negative=['film'], topn=5)

print("Similarity between 'movie' and 'film':")
wv_model.wv.similarity('movie','film') 

print("Similarity between 'movie' and 'city':")
wv_model.wv.similarity('movie','city') 

print("Word does not match with others in the list of \
['sound', 'music', 'graphics', 'actor', 'book']:")
wv_model.wv.doesnt_match(["sound", "music", \
                          "graphics", "actor", "book"])

print("Word vector for 'movie':")
wv_model.wv['movie']

Top 5 words similar to word 'sound'


[('metal', 0.7495978474617004),
 ('band', 0.7354978919029236),
 ('rock', 0.734889030456543),
 ('sounds', 0.728912353515625),
 ('lyrics', 0.7257303595542908)]

Top 5 words similar to word 'sound' but not relevant to 'film'


[('rock', 0.8026798963546753),
 ('pop', 0.7725368142127991),
 ('lyrics', 0.7640302181243896),
 ('songs', 0.723604679107666),
 ('dance', 0.7107803821563721)]

Similarity between 'movie' and 'film':


0.9240011

Similarity between 'movie' and 'city':


0.011460363

Word does not match with others in the list of ['sound', 'music', 'graphics', 'actor', 'book']:


'book'

Word vector for 'movie':


array([-0.6477697 , -0.08716024, -0.7663413 ,  0.39492884,  0.41591638,
       -0.34538504,  0.8292576 ,  0.78898036, -1.7558291 ,  0.36882758,
        0.25195068,  0.11592049, -0.25626892, -0.11291727, -2.0660846 ,
       -0.07905724,  0.6604024 , -2.2833545 , -1.0439785 , -1.9629272 ,
        0.22974417,  1.662248  ,  0.7451703 , -0.24151595, -0.8185099 ,
        1.1783727 , -0.7498035 , -1.4821293 ,  1.8805578 ,  0.5167119 ,
        0.12485462,  0.5936655 ,  0.24084656,  0.5730325 , -1.9867804 ,
        2.1124964 , -2.386052  , -0.5164091 , -1.4818395 , -0.36165473,
        0.7513265 , -0.7891387 ,  0.9208886 , -0.23379344, -0.11015029,
        0.3218885 , -1.2769501 , -0.2992271 ,  0.46782425,  0.88528115,
       -1.3294902 , -0.5242811 ,  0.44653422,  0.15804431, -0.37007388,
       -0.89891547, -0.01448711, -1.2937596 ,  0.03077303,  0.9326934 ,
        0.9574108 ,  0.35496035,  0.348438  , -0.26373976, -2.8090274 ,
        0.37032035, -0.11138176, -0.75362706, -0.99798024,  0.99
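
The composition property from section 1.1 and the `most_similar(positive=..., negative=...)` query can be illustrated with hand-made vectors. This is a sketch under stated assumptions: the 3-dimensional `toy_vectors` are invented for illustration (they are not learned), and `toy_most_similar` is a hypothetical helper that mimics the idea of the query, not gensim's implementation.

```python
import numpy as np

# Hand-made toy vectors (axes loosely: royalty, male, female); not learned from data
toy_vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),
}


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def toy_most_similar(vectors, positive, negative=(), topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = unit(sum(unit(vectors[w]) for w in positive)
                 - sum(unit(vectors[w]) for w in negative))
    exclude = set(positive) | set(negative)  # do not return the query words
    scores = [(w, float(unit(v) @ query))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# vector("woman") + vector("king") - vector("man") should land near "queen"
result = toy_most_similar(toy_vectors,
                          positive=["woman", "king"], negative=["man"])
print(result)
```

With these toy vectors the top-ranked word is "queen". On a real model the result depends on the training corpus, so the analogy may not hold for every word triple.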

### 1.2. Pretrained Word Vectors
- Google published pre-trained 300-dimensional vectors for 3 million words and phrases, trained on the Google News dataset (about 100 billion words): https://code.google.com/archive/p/word2vec/
- GloVe (Global Vectors for Word Representation): Pretrained word vectors from different data sources, provided by Stanford: https://nlp.stanford.edu/projects/glove/
- FastText by Facebook: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
- Contextualized BERT Embeddings

In [None]:
#!pip install embeddings

In [1]:
# Load GloVe embeddings (the 'wikipedia_gigaword' set)

from embeddings import GloveEmbedding, FastTextEmbedding
import numpy as np

g = GloveEmbedding('wikipedia_gigaword', d_emb=100, show_progress=True)


In [2]:
# Retrieve the vectors for all the words 

wv_dim = 100

words = ['sound', 'music', 'graphics', 'actor', 'book']
vectors = np.zeros((len(words), wv_dim))

for i, w in enumerate(words):
    wv = g.lookup(w)
    # words for which no vector is returned keep an all-zero row
    if wv is not None:
        vectors[i] = wv

# show vector of 'sound'
print(vectors[0])

[-0.31154001  0.028206    0.79364997 -0.45298001  0.45264    -0.098082
  0.015137   -0.42804    -0.31727999  0.23111001 -0.005192   -0.13428
 -0.57582003  0.26732999  0.46024001 -0.37432    -0.26615    -0.43922001
  0.48126999  0.38863999  0.29482001 -0.17739999  0.53140002 -1.2586
 -0.156       0.21075    -0.46983999  0.43877     0.32132     0.19392
 -0.037995    0.42976001 -0.44771999  0.81957     0.034205   -0.86826998
 -0.19284999 -0.30478999  0.68098998 -1.2529      0.27992001 -0.56071001
 -0.81357998  0.76863003 -0.26093999 -0.26960999  0.30458    -0.15528999
  0.22773001 -1.15760005 -0.16664    -0.16254     0.16587999  0.20457
  0.32515001 -2.54590011  0.7597      0.005298    1.30369997  0.14820001
  0.36361     1.20210004 -0.87075001 -0.016171    0.13203    -0.28979999
  0.42447999 -0.04421    -0.12926    -0.60518003  0.056072    0.33465999
  0.85614997 -0.66035998  0.77411002 -0.21391     0.14384    -0.12709001
  0.038184   -0.62795001  0.14861    -0.36488    -0.19199     0.51

In [3]:
# show similarity between words

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(vectors)

array([[1.        , 0.68927389, 0.41086313, 0.3166927 , 0.32444362],
       [0.68927389, 1.        , 0.31041093, 0.43786326, 0.54355895],
       [0.41086313, 0.31041093, 1.        , 0.15638935, 0.27588037],
       [0.3166927 , 0.43786326, 0.15638935, 1.        , 0.34439448],
       [0.32444362, 0.54355895, 0.27588037, 0.34439448, 1.        ]])
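
Each entry of the matrix above is the cosine of the angle between two word vectors, $\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$. The sketch below computes the same quantity by hand with NumPy on a small toy matrix. The function name `cosine_similarity_matrix` and the array `toy` are assumptions made for illustration; the zero-norm guard is included because the lookup loop above leaves an all-zero row for any missing word.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero instead of dividing by 0
    unit_rows = X / norms
    # dot products of unit-length rows are cosines
    return unit_rows @ unit_rows.T


toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
print(np.round(cosine_similarity_matrix(toy), 3))
```

Rows 0 and 2 are orthogonal, so their similarity is 0, while row 1 lies at 45 degrees to both and gets a similarity of about 0.707. Vector length does not matter: `[0, 2]` behaves the same as `[0, 1]`.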

### 1.3. How to use word vectors in classification?

`Convolutional Neural Network`
<img src="https://machinelearningmastery.com/wp-content/uploads/2017/10/Depiction-of-the-multiple-channel-convolutional-neural-network-for-text.png" width ="100%">

`Recurrent Neural Network`

<img src="https://raw.githubusercontent.com/graviraja/100-Days-of-NLP/master/assets/images/applications/sentiment/simple.gif" width = "90%">
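
Both architectures typically consume documents as fixed-length sequences of word indices, with an embedding layer that maps each index to its word vector. A common preparation step is sketched below, assuming a toy `pretrained` dictionary that stands in for real pretrained vectors. The names `word_index`, `embedding_matrix`, and `encode`, and the reserved indices 0 (padding) and 1 (unknown word), are illustrative choices, not part of any specific library.

```python
import numpy as np

# Hypothetical stand-in for pretrained vectors (e.g. GloVe); values are made up
pretrained = {
    "good":  np.array([0.9, 0.1]),
    "bad":   np.array([-0.8, 0.2]),
    "movie": np.array([0.1, 0.7]),
}
emb_dim = 2

docs = [["good", "movie"], ["bad", "movie", "plot"]]

# Build the vocabulary; index 0 is reserved for padding, 1 for unknown words
word_index = {"<pad>": 0, "<unk>": 1}
for doc in docs:
    for w in doc:
        word_index.setdefault(w, len(word_index))

# Row i holds the vector of the word with index i (zeros if no vector exists)
embedding_matrix = np.zeros((len(word_index), emb_dim))
for w, i in word_index.items():
    if w in pretrained:
        embedding_matrix[i] = pretrained[w]


def encode(doc, max_len):
    """Map words to indices, then truncate or pad to max_len."""
    ids = [word_index.get(w, 1) for w in doc][:max_len]
    return ids + [0] * (max_len - len(ids))


X = np.array([encode(d, max_len=4) for d in docs])
print(X)                          # index sequences fed to the network
print(embedding_matrix[X].shape)  # (documents, max_len, emb_dim)
```

Indexing `embedding_matrix` with `X` is the lookup an embedding layer performs. In a deep learning framework, the matrix would usually initialize that layer's weights, which can then be kept frozen or fine-tuned during training.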


### 1.4. More Advanced Transformer Model for Contextualized Embeddings


<img src="https://sp-ao.shortpixel.ai/client/q_glossy,ret_img,w_780/https://conscient.ai/wp-content/uploads/2019/09/2-4.jpg" width = "90%">
