# Table of Contents

### I. Download Pre-trained Word Embeddings
> ##### 1. Google's pre-trained Word2Vec
> ##### 2. Stanford NLP's pretrained GloVe
> ##### 3. Facebook's fastText
### II. Comparing Word Embedding Models
> ##### 1. Loading embeddings into Gensim
> ##### 2. Word representations
> ##### 3. Top similar words
> ##### 4. Contextual Relationship Between Words
### III. Train Word2Vec model from scratch
> ##### 1. Load Dataset
> ##### 2. Create Embeddings

# I. Download Pre-trained Word Embeddings

In [None]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Google's pre-trained Word2Vec

Google has released a pre-trained Word2Vec model trained on part of the **Google News dataset (about 100 billion words)**. It contains 300-dimensional vectors for **3 million words and phrases**. You can __download__ the word2vec embeddings from this [link](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).

__Installation__

 - Download the archive into the same folder as this Jupyter notebook.
 
 - Once the download finishes, decompress the file and keep it in that same directory.

In [None]:
!wget "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!gunzip GoogleNews-vectors-negative300.bin.gz

--2020-08-30 10:12:33--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.185.189
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.185.189|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-08-30 10:14:10 (16.3 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



## 2. Stanford NLP's pretrained GloVe

Stanford NLP provides GloVe vectors trained on several different corpora. The smallest set is trained on Wikipedia 2014 and Gigaword 5, a corpus of **6 billion tokens with a vocabulary of around 400,000 words**.

__Installation__

 - Download the GloVe vectors from [glove.6B.zip](http://nlp.stanford.edu/data/glove.6B.zip).

 - Extract the zip file and store its contents in the same directory as the Jupyter notebook.
 - The archive contains multiple text files:
     1. **glove.6B.50d.txt**  - 50-dimensional vectors for each word in the vocabulary.
     2. **glove.6B.100d.txt** - 100-dimensional vectors for each word in the vocabulary.
     3. **glove.6B.200d.txt** - 200-dimensional vectors for each word in the vocabulary.
     4. **glove.6B.300d.txt** - 300-dimensional vectors for each word in the vocabulary.
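
Each of these files is plain text: one word per line, followed by its vector components separated by spaces, with no header line. A minimal sketch of parsing that layout is shown below. It uses made-up three-dimensional sample lines rather than real GloVe data, and `parse_glove_line` is a hypothetical helper name.

```python
def parse_glove_line(line):
    """Split one GloVe-style text line into (word, vector)."""
    parts = line.rstrip().split(" ")
    word, values = parts[0], parts[1:]
    return word, [float(v) for v in values]

# Made-up sample in the same layout as glove.6B.*.txt
# (the real files carry 50 to 300 values per line).
sample = "king 0.1 -0.2 0.3\nqueen 0.2 -0.1 0.4\n"

vectors = dict(parse_glove_line(line) for line in sample.splitlines())
dimension = len(vectors["king"])
print(dimension, vectors["queen"])
```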

In [None]:
!wget "http://nlp.stanford.edu/data/glove.6B.zip"
!unzip glove.6B.zip

--2020-08-30 10:14:58--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-08-30 10:14:59--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-08-30 10:14:59--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

## 3. Facebook's fastText

Facebook's fastText pre-trained model is trained on Wikipedia, the UMBC WebBase corpus and the statmt.org news dataset. The training data contains around **16 billion tokens**, and the model has a **vocabulary of around 1 million words**.

__Installation__

 - Download the embeddings from this [link](https://fasttext.cc/docs/en/english-vectors.html).
 
 - This notebook uses wiki-news-300d-1M.vec, so we recommend downloading the same file.
 
 - Once the download finishes, decompress the file and store it in the same directory as the Jupyter notebook.

In [None]:
!wget "https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip"
!unzip wiki-news-300d-1M.vec.zip

--2020-08-30 10:22:38--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip’


2020-08-30 10:23:34 (11.8 MB/s) - ‘wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]

Archive:  wiki-news-300d-1M.vec.zip
  inflating: wiki-news-300d-1M.vec   


# II. Comparing Word Embedding Models

In [None]:
# Importing libraries
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

## 1. Loading embeddings into Gensim

In [None]:
# Path to word2vec bin file
file_path = "GoogleNews-vectors-negative300.bin"

# Load into gensim
w2vec = KeyedVectors.load_word2vec_format(file_path, binary=True)


In [None]:
# Path to glove file
glove_input_file = 'glove.6B.300d.txt'

# Path to Word2Vec format output file
glove_word2vec_output_file = 'glove.6B.300d.word2vec.txt'

# Save in Word2vec format
glove2word2vec(glove_input_file, glove_word2vec_output_file)

# Load into gensim
glove = KeyedVectors.load_word2vec_format(glove_word2vec_output_file, binary=False)
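
For context, the word2vec text format differs from the GloVe text format mainly by a first line that gives the vocabulary size and the vector dimension, which is roughly what the conversion step adds. The sketch below illustrates that idea on a tiny made-up file. `add_word2vec_header` is a hypothetical helper, not gensim's implementation, and it reads the whole file into memory, which would be wasteful for the real GloVe files.

```python
import os
import tempfile

def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe-style text file, prepending a '<vocab_size> <dimension>' line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = src.readlines()
    vocab_size = len(lines)
    dimension = len(lines[0].split()) - 1  # the first field is the word itself
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dimension}\n")
        dst.writelines(lines)
    return vocab_size, dimension

# Tiny made-up input standing in for glove.6B.300d.txt
workdir = tempfile.mkdtemp()
toy_glove_path = os.path.join(workdir, "toy_glove.txt")
toy_w2v_path = os.path.join(workdir, "toy_glove.word2vec.txt")
with open(toy_glove_path, "w", encoding="utf-8") as f:
    f.write("king 0.1 -0.2 0.3\nqueen 0.2 -0.1 0.4\n")

shape = add_word2vec_header(toy_glove_path, toy_w2v_path)
with open(toy_w2v_path, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```

The fastText `.vec` file already starts with such a header line, which is why it can be passed to `load_word2vec_format` without a conversion step.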


In [None]:
# Path to fasttext vector file
file_path = 'wiki-news-300d-1M.vec'

# Load into gensim
ft = KeyedVectors.load_word2vec_format(file_path, binary=False)


## 2. Word representations

In [None]:
# Word2vec representation
w2vec['king']

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [None]:
# Glove embeddings
glove['king']

array([ 0.0033901, -0.34614  ,  0.28144  ,  0.48382  ,  0.59469  ,
        0.012965 ,  0.53982  ,  0.48233  ,  0.21463  , -1.0249   ,
       -0.34788  , -0.79001  , -0.15084  ,  0.61374  ,  0.042811 ,
        0.19323  ,  0.25462  ,  0.32528  ,  0.05698  ,  0.063253 ,
       -0.49439  ,  0.47337  , -0.16761  ,  0.045594 ,  0.30451  ,
       -0.35416  , -0.34583  , -0.20118  ,  0.25511  ,  0.091111 ,
        0.014651 , -0.017541 , -0.23854  ,  0.48215  , -0.9145   ,
       -0.36235  ,  0.34736  ,  0.028639 , -0.027065 , -0.036481 ,
       -0.067391 , -0.23452  , -0.13772  ,  0.33951  ,  0.13415  ,
       -0.1342   ,  0.47856  , -0.1842   ,  0.10705  , -0.45834  ,
       -0.36085  , -0.22595  ,  0.32881  , -0.13643  ,  0.23128  ,
        0.34269  ,  0.42344  ,  0.47057  ,  0.479    ,  0.074639 ,
        0.3344   ,  0.10714  , -0.13289  ,  0.58734  ,  0.38616  ,
       -0.52238  , -0.22028  , -0.072322 ,  0.32269  ,  0.44226  ,
       -0.037382 ,  0.18324  ,  0.058082 ,  0.26938  ,  0.3620

In [None]:
# fastText embeddings
ft['king']

array([ 1.082e-01,  4.450e-02, -3.840e-02,  1.100e-03, -8.880e-02,
        7.130e-02, -6.960e-02, -4.770e-02,  7.100e-03, -4.080e-02,
       -7.070e-02, -2.660e-02,  5.000e-02, -8.240e-02,  8.480e-02,
       -1.627e-01, -8.510e-02, -2.950e-02,  1.534e-01, -1.828e-01,
       -2.208e-01,  2.430e-02, -9.210e-02, -1.089e-01, -1.009e-01,
       -1.190e-02,  3.770e-02,  2.038e-01,  7.200e-02,  2.020e-02,
        2.798e-01,  1.150e-02, -1.510e-02,  1.037e-01,  4.000e-04,
       -1.040e-02,  1.960e-02,  1.265e-01,  8.280e-02, -1.369e-01,
        1.070e-01,  1.270e-01, -3.490e-02, -6.830e-02, -1.140e-02,
        3.370e-02,  1.260e-02,  7.920e-02,  4.400e-02, -2.530e-02,
        4.890e-02, -7.850e-02, -6.259e-01, -9.720e-02,  1.654e-01,
       -5.780e-02, -4.370e-02,  4.090e-02, -1.820e-02, -1.891e-01,
        2.770e-02, -1.460e-02, -5.310e-02,  4.260e-02,  4.900e-03,
        4.000e-03,  1.423e-01, -9.750e-02, -3.500e-03,  9.630e-02,
       -1.900e-03, -1.466e-01, -1.662e-01,  6.650e-02, -1.500e

## 3. Top similar words

In [None]:
# Top similar words
w2vec.most_similar(['king'], topn=5)


[('kings', 0.7138046026229858),
 ('queen', 0.6510956883430481),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204220056533813),
 ('prince', 0.6159993410110474)]

In [None]:
# Top similar words
glove.most_similar(['king'], topn=5)


[('queen', 0.6336469054222107),
 ('prince', 0.619662344455719),
 ('monarch', 0.5899620652198792),
 ('kingdom', 0.5791267156600952),
 ('throne', 0.5606487989425659)]

In [None]:
# Top similar words
ft.most_similar(['king'], topn=5)


[('kings', 0.7969564199447632),
 ('queen', 0.763853907585144),
 ('monarch', 0.7399972081184387),
 ('King', 0.7281952500343323),
 ('prince', 0.7132730484008789)]

## 4. Contextual Relationship Between Words

Example: airplane - fly + drive = car
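
The arithmetic in this example can be sketched with toy vectors in plain Python: add the "positive" vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. The three-dimensional vectors and the `analogy` helper below are made up for illustration. Gensim's `most_similar` works along similar lines but, as far as we know, on unit-normalized vectors, so this is a simplification rather than its exact procedure.

```python
import math

# Hypothetical toy vectors, not taken from any real model
toy = {
    "airplane": [1.0, 1.0, 0.0],
    "fly":      [0.0, 1.0, 0.0],
    "drive":    [0.0, 0.0, 1.0],
    "car":      [1.0, 0.0, 1.0],
    "boat":     [1.0, 0.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, positive, negative):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    used = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in used]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = analogy(toy, positive=["airplane", "drive"], negative=["fly"])
print(ranked[0][0])
```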

In [None]:
# airplane - fly + drive
w2vec.most_similar(positive=['airplane', 'drive'], negative=['fly'], topn=5)


[('car', 0.5112004280090332),
 ('drives', 0.47777247428894043),
 ('automobile', 0.45616620779037476),
 ('vehicle', 0.44856154918670654),
 ('SUV', 0.44360119104385376)]

In [None]:
# airplane - fly + drive
glove.most_similar(positive=['airplane', 'drive'], negative=['fly'], topn=5)


[('car', 0.5835879445075989),
 ('drives', 0.5498395562171936),
 ('vehicle', 0.5255967378616333),
 ('truck', 0.488486111164093),
 ('automobile', 0.47820842266082764)]

In [None]:
# airplane - fly + drive
ft.most_similar(positive=['airplane', 'drive'], negative=['fly'], topn=5)


[('automobile', 0.6070601344108582),
 ('car', 0.6051997542381287),
 ('drives', 0.5803264379501343),
 ('automobiles', 0.557213544845581),
 ('vehicle', 0.5500437021255493)]

# III. Train Word2Vec model from scratch

## 1. Load Dataset

In [None]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec

In [None]:
# Load the dataset
df = pd.read_csv('/content/drive/My Drive/AV/Classical NLP course (module 12 + 13)/Feature engineering module/tweets.csv')
df.head()

| | text | favorited | favoriteCount | replyToSN | created | truncated | replyToSID | id | replyToUID | statusSource | screenName | retweetCount | isRetweet | retweeted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RT @rssurjewala: Critical question: Was PayTM ... | False | 0.0 | | 2016-11-23 18:40:30 | False | | 8.014957e+17 | | `<a href="http://twitter.com/download/android" ...` | HASHTAGFARZIWAL | 331.0 | True | False |
| 1 | RT @Hemant_80: Did you vote on #Demonetization... | False | 0.0 | | 2016-11-23 18:40:29 | False | | 8.014957e+17 | | `<a href="http://twitter.com/download/android" ...` | PRAMODKAUSHIK9 | 66.0 | True | False |
| 2 | RT @roshankar: Former FinSec, RBI Dy Governor,... | False | 0.0 | | 2016-11-23 18:40:03 | False | | 8.014955e+17 | | `<a href="http://twitter.com/download/android" ...` | rahulja13034944 | 12.0 | True | False |
| 3 | RT @ANI_news: Gurugram (Haryana): Post office ... | False | 0.0 | | 2016-11-23 18:39:59 | False | | 8.014955e+17 | | `<a href="http://twitter.com/download/android" ...` | deeptiyvd | 338.0 | True | False |
| 4 | RT @satishacharya: Reddy Wedding! @mail_today ... | False | 0.0 | | 2016-11-23 18:39:39 | False | | 8.014954e+17 | | `<a href="http://cpimharyana.com" rel="nofollow...` | CPIMBadli | 120.0 | True | False |


In [None]:
# Keep only the tweet text by dropping every column after the first
df.drop(df.columns[1:], axis=1, inplace=True)
df.head()

| | text |
|---|---|
| 0 | RT @rssurjewala: Critical question: Was PayTM ... |
| 1 | RT @Hemant_80: Did you vote on #Demonetization... |
| 2 | RT @roshankar: Former FinSec, RBI Dy Governor,... |
| 3 | RT @ANI_news: Gurugram (Haryana): Post office ... |
| 4 | RT @satishacharya: Reddy Wedding! @mail_today ... |


## 2. Create Embeddings

In [None]:
# Import relevant libraries
import re
import spacy

# Load English language model
nlp = spacy.load('en_core_web_sm')

In [None]:
# Preprocessing function
def clean(text):

    # Lowercase
    text = text.lower()

    # Replace every run of non-alphanumeric characters with a single space
    text = ' '.join(re.split(r'[^a-zA-Z0-9]+', text))

    # Create spacy object
    doc = nlp(text)

    # Return the token lemmas joined into a single string
    return " ".join(token.lemma_ for token in doc)

In [None]:
# Apply the function
df['text_clean'] = df['text'].apply(clean)

In [None]:
# Print data
df.head()

| | text | text_clean |
|---|---|---|
| 0 | RT @rssurjewala: Critical question: Was PayTM ... | rt rssurjewala critical question be paytm info... |
| 1 | RT @Hemant_80: Did you vote on #Demonetization... | rt hemant 80 do -PRON- vote on demonetization ... |
| 2 | RT @roshankar: Former FinSec, RBI Dy Governor,... | rt roshankar former finsec rbi dy governor cbd... |
| 3 | RT @ANI_news: Gurugram (Haryana): Post office ... | rt ani news gurugram haryana post office emplo... |
| 4 | RT @satishacharya: Reddy Wedding! @mail_today ... | rt satishacharya reddy wedding mail today cart... |


In [None]:
# Break docs into separate sentences
def sents(doc):

    # Split into individual sentences on '.' or '?' followed by whitespace
    # (clean() has already removed punctuation, so cleaned tweets typically stay whole)
    text = re.split(r'[.?]\s+', doc)
    # List to save the sentences
    clean_sent = []

    # Iterate over the sentences
    for sent in text:
        if len(sent) != 0:
            # Remove leading and trailing whitespace, then tokenize on whitespace
            clean_sent.append(sent.strip().split())
    # Return list of tokenized sentences in a single document
    return clean_sent
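
One thing worth checking is how this step interacts with the cleaning function: because `clean` replaces every non-alphanumeric character with a space, the `.` and `?` delimiters are already gone by the time `sents` runs. The self-contained sketch below reproduces just the two regular expressions on a made-up string to show the effect. It omits the spaCy lemmatization step.

```python
import re

raw = "Is this fair? Banks are closed. Queues are long."

# Same splitting pattern as sents(), applied to the raw text
raw_sentences = [s.strip().split() for s in re.split(r"[.?]\s+", raw) if s]

# Same character filter as clean(), applied first (lemmatization omitted)
cleaned = " ".join(re.split(r"[^a-zA-Z0-9]+", raw.lower())).strip()
cleaned_sentences = [s.strip().split() for s in re.split(r"[.?]\s+", cleaned) if s]

print(len(raw_sentences), len(cleaned_sentences))
```

If sentence boundaries matter for the embeddings, one option is to split into sentences before cleaning each one.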

In [None]:
# Apply the function
df['text_sents'] = df['text_clean'].apply(sents)

In [None]:
# Print output
df.head()

| | text | text_clean | text_sents |
|---|---|---|---|
| 0 | RT @rssurjewala: Critical question: Was PayTM ... | rt rssurjewala critical question be paytm info... | [[rt, rssurjewala, critical, question, be, pay... |
| 1 | RT @Hemant_80: Did you vote on #Demonetization... | rt hemant 80 do -PRON- vote on demonetization ... | [[rt, hemant, 80, do, -PRON-, vote, on, demone... |
| 2 | RT @roshankar: Former FinSec, RBI Dy Governor,... | rt roshankar former finsec rbi dy governor cbd... | [[rt, roshankar, former, finsec, rbi, dy, gove... |
| 3 | RT @ANI_news: Gurugram (Haryana): Post office ... | rt ani news gurugram haryana post office emplo... | [[rt, ani, news, gurugram, haryana, post, offi... |
| 4 | RT @satishacharya: Reddy Wedding! @mail_today ... | rt satishacharya reddy wedding mail today cart... | [[rt, satishacharya, reddy, wedding, mail, tod... |


In [None]:
# Sample sentence
df.loc[0,'text_sents']

[['rt',
  'rssurjewala',
  'critical',
  'question',
  'be',
  'paytm',
  'inform',
  'about',
  'demonetization',
  'edict',
  'by',
  'pm',
  '-PRON-',
  's',
  'clearly',
  'fishy',
  'and',
  'require',
  'full',
  'disclosure',
  'amp']]

In [None]:
# Combine all sentences into a single list for word embedding training
combined_sent = [sent for doc in df['text_sents'] for sent in doc]

In [None]:
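# Create word2vec model (CBOW, since sg=0)
# Note: `size` is the gensim 3.x argument name; gensim 4.x renamed it to `vector_size`
model = Word2Vec(combined_sent, size=100, window=2, sg=0, min_count=5, workers=1)
# Save model
# model.save(r'/word_vec3.bin')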
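
To make the `window` and `min_count` arguments more concrete, here is a rough pure-Python sketch of the two ideas: dropping rare words, then pairing each word with its neighbours up to two positions away on either side. With `sg=0` (CBOW), such context words are used together to predict the center word. This is an illustration only, not how gensim is implemented (gensim, for instance, also randomly shrinks the window during training), and the helper names and the toy corpus are made up.

```python
from collections import Counter

def filter_rare(sentences, min_count):
    """Drop words that occur fewer than min_count times across the corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[w for w in sent if counts[w] >= min_count] for sent in sentences]

def context_pairs(sentence, window):
    """Yield (center_word, context_words) using up to `window` words on each side."""
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield center, left + right

# Made-up toy corpus; min_count=2 only because the corpus is tiny
toy_corpus = [
    ["bank", "queue", "be", "long"],
    ["bank", "be", "close"],
]
kept = filter_rare(toy_corpus, min_count=2)

# First five tokens of the sample tweet shown earlier in this notebook
tokens = ["rt", "rssurjewala", "critical", "question", "be"]
pairs = list(context_pairs(tokens, window=2))

print(kept)
print(pairs[2])
```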

In [None]:
# Word vector representation
model.wv['demonetization']

  


array([-4.57291931e-01,  3.55889708e-01,  4.65297729e-01, -6.41711131e-02,
        6.64111793e-01, -4.19944823e-02, -1.46308079e-01, -2.74135232e-01,
        5.51713863e-04, -1.48386821e-01,  5.46964586e-01, -9.41203088e-02,
        2.49321908e-01,  2.56963164e-01,  1.53655604e-01, -4.80003893e-01,
        6.34729862e-01,  2.31083289e-01, -2.33632997e-01,  2.26949006e-01,
        2.92454571e-01,  4.21377808e-01,  2.67149419e-01,  1.18984900e-01,
       -1.33870289e-01,  3.74160171e-01, -2.48860255e-01, -4.82132062e-02,
       -1.34608131e-02, -1.57339163e-02,  8.99322331e-02,  3.18409324e-01,
        2.51485676e-01,  3.01106066e-01, -9.55550447e-02,  2.44480237e-01,
        1.40510360e-03, -8.72282907e-02,  4.44643013e-03, -3.59752595e-01,
       -1.85287938e-01,  3.71086895e-01, -2.47750252e-01, -2.70402044e-01,
        1.75243542e-01, -5.25929444e-02,  2.38720551e-01, -1.51307276e-02,
        3.85412812e-01,  1.84335709e-01, -5.72646797e-01, -1.56317145e-01,
        3.00260544e-01, -

In [None]:
# Similar words
model.wv.similar_by_vector(model.wv['demonetization'], topn=10)

  
  

[('demonetization', 1.0),
 ('digital', 0.9976805448532104),
 ('don', 0.9955965280532837),
 ('vision', 0.9953009486198425),
 ('against', 0.9952237606048584),
 ('ar', 0.9944949150085449),
 ('after', 0.9941202402114868),
 ('homework', 0.9938411712646484),
 ('hoard', 0.9921005368232727),
 ('worldbank', 0.9901090860366821)]

In [None]:
# Similar words
model.wv.similar_by_vector(model.wv['india'], topn=10)

  
  

[('india', 1.0),
 ('impact', 0.9931312799453735),
 ('article', 0.987446665763855),
 ('demonetization', 0.9858487248420715),
 ('like', 0.9852670431137085),
 ('against', 0.9841402173042297),
 ('on', 0.9830706119537354),
 ('after', 0.9817497730255127),
 ('about', 0.9802956581115723),
 ('an', 0.9796943664550781)]

In [None]:
# Similar words
model.wv.similar_by_vector(model.wv['economy'], topn=10)

  
  

[('economy', 1.0),
 ('bank', 0.9993865489959717),
 ('effect', 0.9991459846496582),
 ('by', 0.9990564584732056),
 ('more', 0.9988462924957275),
 ('m', 0.9981010556221008),
 ('issue', 0.9978339672088623),
 ('d', 0.9977588653564453),
 ('app', 0.9977414011955261),
 ('gov', 0.9976497292518616)]

In [None]:
# Similar words
model.wv.similar_by_vector(model.wv['flood'], topn=10)

  


KeyError: ignored
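
The `KeyError` means that `flood` is not in the trained vocabulary, presumably because it appears fewer than `min_count` times in the cleaned tweets, or not at all. A membership test before the lookup avoids the exception. The sketch below uses a plain dictionary as a stand-in for the model's vectors. The assumption is that the same `in` check works on gensim's `KeyedVectors` (for example `model.wv`), and `get_vector_or_none` is a hypothetical helper.

```python
def get_vector_or_none(vectors, word):
    """Return the vector for `word`, or None when the word is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    return None

# Plain dict standing in for model.wv; the numbers are made up
toy_vectors = {"economy": [0.1, 0.2], "india": [0.3, 0.4]}

for query in ["economy", "flood"]:
    vec = get_vector_or_none(toy_vectors, query)
    status = "found" if vec is not None else "not in vocabulary"
    print(query, status)
```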