## <center>Word Vectors (a.k.a. Word Embeddings) </center> 

### 1.1 Word2Vec
 - Vector representations of words (i.e. word vectors) learned using a neural network
   - e.g. "apple": [0.35, -0.2, 0.4, ...], "mango": [0.32, -0.18, 0.5, ...]
   - Interesting properties of word vectors:
    * **Words with similar semantics have close word vectors**
    <img src="https://www.kdnuggets.com/images/cartoon-espresso-word2vec.jpg" width="50%">
    https://www.kdnuggets.com/2017/04/cartoon-word2vec-espresso-cappuccino.html
    * **Composition**: e.g. vector("woman")+vector("king")-vector('man') $\approx$ vector("queen")
 - Models:
   - **CBOW** (Continuous Bag of Words): Predict a target word based on context
     - e.g. the fox jumped over the lazy dog
     - Assuming a symmetric context with window size 3 (one word on each side of the target), this sentence yields training samples such as: 
       - ([-, fox], the) 
       - ([the, jumped], fox) 
       - ([fox, over], jumped)
       - ([jumped, the], over) 
       - ...
       
       <img src="cbow.png" width="50%">
       source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
   - **Skip-gram**: Predict the context words based on a target word
   
        <img src="skip_gram.png" width="50%">
        source: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
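
The sample construction described above for CBOW and skip-gram can be sketched in plain Python. This is only an illustration: the function names, the `half_window` parameter, and the `-` padding token are assumptions made for this example, and real implementations such as gensim handle sentence boundaries and window sizes differently.

```python
def cbow_samples(tokens, half_window=1, pad="-"):
    """Return (context, target) pairs; positions outside the sentence are padded."""
    samples = []
    for i, target in enumerate(tokens):
        left = [tokens[j] if j >= 0 else pad
                for j in range(i - half_window, i)]
        right = [tokens[j] if j < len(tokens) else pad
                 for j in range(i + 1, i + half_window + 1)]
        samples.append((left + right, target))
    return samples


def skip_gram_samples(tokens, half_window=1):
    """Return (target, context_word) pairs, one per context word inside the sentence."""
    samples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        for j in range(lo, hi):
            if j != i:
                samples.append((target, tokens[j]))
    return samples


tokens = "the fox jumped over the lazy dog".split()
cbow = cbow_samples(tokens)
skip = skip_gram_samples(tokens)
print(cbow[:4])   # context -> target, as in the CBOW list above
print(skip[:4])   # target -> one context word at a time
```

With `half_window=1`, the first four CBOW samples match the ones listed above, while skip-gram turns each (target, context word) combination into its own training pair.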

In [1]:
# set up interactive shell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Exercise 1.1 Train your word vector

import pandas as pd
import nltk,string

# Load data
data=pd.read_csv('amazon_review_large.csv')
data.columns=['label','text']
data.head()

# tokenize each document into a list of unigrams
# strip punctuations and leading/trailing spaces from unigrams
# only unigrams with 2 or more characters are taken
sentences=[ [token.strip(string.punctuation).strip() \
             for token in nltk.word_tokenize(doc.lower()) \
                 if token not in string.punctuation and \
                 len(token.strip(string.punctuation).strip())>=2]\
             for doc in data["text"]]
print(sentences[0:2])

Unnamed: 0,label,text
0,2,This is a little longer and more detailed than...
1,1,Only Michelle Branch save this album!!!!All gu...
2,2,"A surprisingly good book, given its inherently..."
3,2,"This is a wonderful, quiet and relaxing CD tha..."
4,1,The lights that I received are absolutely not ...


[['this', 'is', 'little', 'longer', 'and', 'more', 'detailed', 'than', 'the', 'first', 'two', 'books', 'in', 'the', 'series', 'however', 'have', 'enjoyed', 'each', 'new', 'aspect', 'of', 'the', 'exciting', 'fantasy', 'universe'], ['only', 'michelle', 'branch', 'save', 'this', 'album', 'all', 'guys', 'play', 'along', 'with', 'unenthusiastic', 'beat', 'even', 'karl']]


In [3]:
# Train your own word vectors using gensim

# gensim.models is the package for word2vec
# check https://radimrehurek.com/gensim/models/word2vec.html
# for detailed description

from gensim.models import word2vec
import logging
import pandas as pd

# print out tracking information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', \
                    level=logging.INFO)

# min_count: words with total frequency lower than this are ignored
# vector_size: the dimension of the word vectors
# window: context window, i.e. the maximum distance 
#         between the current and predicted word 
#         within a sentence
# workers: # of parallel threads in training
# for other parameters, check https://radimrehurek.com/gensim/models/word2vec.html
wv_model = word2vec.Word2Vec(sentences,vector_size=200,min_count=5, window=5, workers=4 )

2021-12-15 14:56:26,648 : INFO : collecting all words and their counts
2021-12-15 14:56:26,649 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-12-15 14:56:26,753 : INFO : PROGRESS: at sentence #10000, processed 712099 words, keeping 36835 word types
2021-12-15 14:56:26,859 : INFO : collected 55006 word types from a corpus of 1424497 raw words and 20000 sentences
2021-12-15 14:56:26,860 : INFO : Creating a fresh vocabulary
2021-12-15 14:56:26,910 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 12138 unique words (22.0666836345126%% of original 55006, drops 42868)', 'datetime': '2021-12-15T14:56:26.910345', 'gensim': '4.1.2', 'python': '3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}
2021-12-15 14:56:26,912 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 1362451 word corpus (95.64435727137368%% of original 1424497, drops 

In [4]:
# test word2vec model

print("Top 5 words similar to word 'sound'")
wv_model.wv.most_similar('sound', topn=5)

print("Top 5 words similar to word 'sound' but not relevant to 'film'")
wv_model.wv.most_similar(positive=['sound','music'], \
                         negative=['film'], topn=5)

print("Similarity between 'movie' and 'film':")
wv_model.wv.similarity('movie','film') 

print("Similarity between 'movie' and 'city':")
wv_model.wv.similarity('movie','city') 

print("Word does not match with others in the list of \
['sound', 'music', 'graphics', 'actor', 'book']:")
wv_model.wv.doesnt_match(["sound", "music", \
                          "graphics", "actor", "book"])

print("Word vector for 'movie':")
wv_model.wv['movie']

Top 5 words similar to word 'sound'


[('metal', 0.7502920627593994),
 ('rock', 0.7445167303085327),
 ('beats', 0.7220752835273743),
 ('band', 0.7205601930618286),
 ('music', 0.718316912651062)]

Top 5 words similar to word 'sound' but not relevant to 'film'


[('rock', 0.8084051609039307),
 ('pop', 0.754722535610199),
 ('lyrics', 0.7281469106674194),
 ('dance', 0.7133382558822632),
 ('guitar', 0.7117939591407776)]

Similarity between 'movie' and 'film':


0.9291705

Similarity between 'movie' and 'city':


0.03990957

Word does not match with others in the list of ['sound', 'music', 'graphics', 'actor', 'book']:


'book'

Word vector for 'movie':


array([-1.9398369 , -0.26979092, -0.32208118,  0.4484339 , -0.48991072,
       -0.6396415 ,  1.1778913 ,  1.6906852 , -0.7643984 , -0.01410676,
       -0.29183614,  0.21045597,  0.4607867 , -0.5672841 , -1.5674282 ,
        0.34530285, -0.06660659, -1.359021  , -0.60741585, -1.2620932 ,
        0.5776676 ,  0.91510093,  0.5721214 ,  0.76056534, -0.80147374,
        1.1796416 , -0.60035974, -0.5805315 ,  1.4532441 , -0.44645214,
       -0.01378889,  0.58523166,  0.6169707 ,  0.42398548, -1.6191436 ,
        1.8783761 , -2.2982483 ,  0.22456075, -0.47139382,  0.5092724 ,
        0.47057706, -0.6111662 ,  0.92382914, -0.0413982 ,  0.36029813,
        0.7504785 , -0.06272805, -0.4524339 ,  0.5114244 ,  0.47090596,
       -1.4466591 , -0.68480176, -0.07489015, -0.4293492 ,  0.5721016 ,
       -1.260567  , -1.1860507 , -1.1670321 , -0.35003388,  1.5703514 ,
        1.0453962 ,  0.5311113 ,  0.77049315, -0.03823433, -2.3480654 ,
        0.11311954,  0.12703004, -0.2617225 , -1.4608952 ,  0.80
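
The similarity scores reported by gensim are cosine similarities between word vectors, and the `positive`/`negative` query adds and subtracts vectors before searching for the nearest word. A minimal sketch of both ideas follows. The 3-dimensional toy vectors and the `analogy` helper are invented for illustration, and gensim's own `most_similar` differs in details such as normalization.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for this example (not learned from data)
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.5, 0.5],
}


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Exclude the query words themselves from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


best = analogy(["woman", "king"], ["man"], toy)
print(best)
```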

### 1.2. Pretrained Word Vectors
- Google published pre-trained 300-dimensional vectors for 3 million words and phrases, trained on the Google News dataset (about 100 billion words): https://code.google.com/archive/p/word2vec/
- GloVe (Global Vectors for Word Representation): Pretrained word vectors from different data sources, provided by Stanford: https://nlp.stanford.edu/projects/glove/
- FastText by Facebook: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
- Contextualized BERT Embeddings
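
Some pretrained vectors, such as the GloVe downloads, are distributed as plain text with one token per line followed by its space-separated vector values. A minimal sketch of reading that layout is shown below. The `load_text_vectors` helper and the sample values are made up for this example, and a small in-memory string stands in for a real file.

```python
import io


def load_text_vectors(handle):
    """Parse lines of the form 'token v1 v2 ... vn' into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Invented sample content; a real file would be opened with open(path, encoding="utf-8")
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "movie -0.4 0.5 0.6\n"
)
vectors = load_text_vectors(sample)
print(vectors["movie"])
```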

In [19]:
# Exercise 1.2: Use pretrained word vectors

# download the bin file for pretrained word vectors
# from above links, e.g. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
# Warning: the bin file is very big (over 2G)
# You need a powerful machine to load it

import gensim

# load_word2vec_format returns a KeyedVectors object,
# so query it directly rather than through .wv
model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

model.most_similar(positive=['woman','king'],
                   negative=['man'])

2021-12-07 14:38:45,682 : INFO : loading projection weights from GoogleNews-vectors-negative300.bin


FileNotFoundError: [Errno 2] No such file or directory: 'GoogleNews-vectors-negative300.bin'

### 1.3. How to use word vectors in classification?

`Convolutional Neural Network`
<img src="https://machinelearningmastery.com/wp-content/uploads/2017/10/Depiction-of-the-multiple-channel-convolutional-neural-network-for-text.png" width ="100%">

`Recurrent Neural Network`

<img src="https://raw.githubusercontent.com/graviraja/100-Days-of-NLP/master/assets/images/applications/sentiment/simple.gif" width = "90%">
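
Both architectures take a document as a sequence of word vectors. One way to prepare that input is to map each token to its vector, use a zero vector for unknown words, and pad or truncate to a fixed length. The sketch below assumes this preparation step. The `doc_to_matrix` helper, `max_len`, and the toy vectors are hypothetical, and deep learning frameworks usually do the lookup inside an embedding layer indexed by word IDs instead.

```python
def doc_to_matrix(tokens, vectors, max_len, dim):
    """Turn a token list into a max_len x dim list of word vectors."""
    zero = [0.0] * dim
    # Look up each token; unknown words fall back to the zero vector
    rows = [list(vectors.get(tok, zero)) for tok in tokens[:max_len]]
    # Pad short documents with zero vectors up to max_len
    rows += [list(zero) for _ in range(max_len - len(rows))]
    return rows


# Toy 2-dimensional vectors invented for this example
toy_vectors = {"great": [0.9, 0.1], "movie": [0.2, 0.7]}
matrix = doc_to_matrix(["great", "movie", "indeed"], toy_vectors, max_len=4, dim=2)
print(matrix)
```

The resulting matrix has one row per token position, which is the shape a convolutional or recurrent layer would consume.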


### 1.4. More Advanced Transformer Model for Contextualized Embeddings


<img src="https://sp-ao.shortpixel.ai/client/q_glossy,ret_img,w_780/https://conscient.ai/wp-content/uploads/2019/09/2-4.jpg" width = "90%">


Thank you and have a wonderful winter break!


<img src="https://www.kdnuggets.com/images/cartoon-machine-learning-vacation.jpg" width='60%'>


Also, you are welcome to join the BIA-667 Deep Learning class in the Fall!

<img src="https://www.qubole.com/wp-content/uploads/2018/08/1-400x387.png" width="20%">