
In this notebook, let us see how we can represent text using pre-trained word embedding models. 

# 1. Using a pre-trained word2vec model

As an example, let us load a pre-trained word2vec model and use it to find the words most similar to a given word. We will use the Google News vectors:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

A few other pre-trained word embedding models, along with details on how to access them through gensim, are listed at:
https://github.com/RaRe-Technologies/gensim-data
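
As a rough sketch of that route, gensim provides a `gensim.downloader` module that fetches models listed in gensim-data by name. The helper name `load_pretrained` is hypothetical, and the default model name is assumed to be the gensim-data entry for the Google News vectors. The first call downloads the full model, so it is slow and needs network access.

```python
def load_pretrained(name="word2vec-google-news-300"):
    """Download (on first use) and load a model listed in gensim-data."""
    # Imported lazily so that defining this helper does not require gensim.
    import gensim.downloader as api
    return api.load(name)

# Example usage (commented out because it triggers a large download):
# vectors = load_pretrained()
# print(vectors.most_similar("beautiful", topn=3))
```

This notebook downloads the binary file directly instead, which gives explicit control over where the model is stored on disk.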

In [1]:
# To install only the requirements of this notebook, run this cell

# ===========================

!pip install scikit-learn==0.21.3
!pip install wget==3.2
!pip install gensim==3.6.0
!pip install psutil==5.4.8
!pip install spacy==2.2.4

# ===========================

Collecting scikit-learn==0.21.3
  Downloading scikit_learn-0.21.3-cp37-cp37m-manylinux1_x86_64.whl (6.7 MB)
     |████████████████████████████████| 6.7 MB 4.3 MB/s
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.4 requires scikit-learn>=1.0.0, but you have scikit-learn 0.21.3 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.21.3 which is incompatible.
Successfully installed scikit-learn-0.21.3
Collecting wget==3.2
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel 

In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch3/ch3-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch3-requirements.txt"

# ===========================

In [3]:
import os
import wget
import gzip
import shutil

gn_vec_path = "GoogleNews-vectors-negative300.bin"
if not os.path.exists("GoogleNews-vectors-negative300.bin"):
    if not os.path.exists("../Ch2/GoogleNews-vectors-negative300.bin"):
        #Downloading the required model
        if not os.path.exists("../Ch2/GoogleNews-vectors-negative300.bin.gz"):
            if not os.path.exists("GoogleNews-vectors-negative300.bin.gz"):
                wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz")
            gn_vec_zip_path = "GoogleNews-vectors-negative300.bin.gz"
        else:
            gn_vec_zip_path = "../Ch2/GoogleNews-vectors-negative300.bin.gz"
        #Extracting the required model
        with gzip.open(gn_vec_zip_path, 'rb') as f_in:
            with open(gn_vec_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
    else:
        gn_vec_path = "../Ch2/" + gn_vec_path

print(f"Model at {gn_vec_path}")

Model at GoogleNews-vectors-negative300.bin


In [4]:
import warnings #Used below to suppress warning messages
warnings.filterwarnings("ignore")

import psutil #This module helps in retrieving information on running processes and system resource utilization
process = psutil.Process(os.getpid())
from psutil import virtual_memory
mem = virtual_memory()

import time #Used to measure how long the model takes to load

In [5]:
from gensim.models import Word2Vec, KeyedVectors
pretrainedpath = gn_vec_path

#Load W2V model. This will take some time, but it is a one time effort! 
pre = process.memory_info().rss
print("Memory used in GB before Loading the Model: %0.2f"%float(pre/(10**9))) #Check memory usage before loading the model
print('-'*10)

start_time = time.time() #Start the timer
ttl = mem.total #Total memory available

w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True) #load the model
print("%0.2f seconds taken to load"%float(time.time() - start_time)) #Calculate the total time elapsed since starting the timer
print('-'*10)

print('Finished loading Word2Vec')
print('-'*10)

post = process.memory_info().rss
print("Memory used in GB after Loading the Model: {:.2f}".format(float(post/(10**9)))) #Calculate the memory used after loading the model
print('-'*10)

print("Percentage increase in memory usage: {:.2f}% ".format(float(((post - pre)/pre)*100))) #Percentage increase in memory after loading the model
print('-'*10)

print("Number of words in vocabulary: ",len(w2v_model.vocab)) #Number of words in the vocabulary. 

Memory used in GB before Loading the Model: 0.15
----------
44.95 seconds taken to load
----------
Finished loading Word2Vec
----------
Memory used in GB after Loading the Model: 5.08
----------
Percentage increase in memory usage: 3356.35% 
----------
Number of words in vocabulary:  3000000


In [6]:
#Let us examine the model by looking up the words most similar to a given word!
w2v_model.most_similar('beautiful')

[('gorgeous', 0.8353004455566406),
 ('lovely', 0.810693621635437),
 ('stunningly_beautiful', 0.7329413890838623),
 ('breathtakingly_beautiful', 0.7231341004371643),
 ('wonderful', 0.6854087114334106),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402292251587)]

In [7]:
#Let us try with another word! 
w2v_model.most_similar('toronto')

[('montreal', 0.698411226272583),
 ('vancouver', 0.6587257385253906),
 ('nyc', 0.6248831748962402),
 ('alberta', 0.6179691553115845),
 ('boston', 0.611499547958374),
 ('calgary', 0.61032634973526),
 ('edmonton', 0.6100261211395264),
 ('canadian', 0.5944076776504517),
 ('chicago', 0.5911980271339417),
 ('springfield', 0.5888351202011108)]
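
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure follows, using short made-up vectors in plain Python rather than the real 300-dimensional embeddings. The function name `cosine_similarity` and the toy values are illustrative assumptions, not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # close to 0: unrelated directions
```

With the loaded model, `w2v_model.similarity('beautiful', 'gorgeous')` computes this kind of score for a single pair of words.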

In [8]:
#What is the vector representation for a word? 
w2v_model['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

In [9]:
#What if I am looking for a word that is not in this vocabulary?
w2v_model['practicalnlp']

KeyError: ignored

#### Two things to note while using pre-trained models: 


1.   Lookups are exact string matches, so casing matters; the examples above use lowercased tokens. If a word is not in the vocabulary, the model throws an exception.
2.   So, it is always a good idea to encapsulate those lookups in try/except blocks.
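
A minimal sketch of the second point: wrap the lookup in a helper that catches `KeyError`. The helper name `get_vector_or_default` is hypothetical, and a plain dictionary with made-up values stands in for `w2v_model` so that the snippet runs without loading the full model.

```python
def get_vector_or_default(model, word, default=None):
    """Return model[word], or `default` if the word is out of vocabulary."""
    try:
        return model[word]
    except KeyError:
        return default

# Stand-in for w2v_model: any object whose [] lookup raises KeyError works
toy_model = {"computer": [0.11, -0.20, 0.12]}

print(get_vector_or_default(toy_model, "computer"))      # the stored vector
print(get_vector_or_default(toy_model, "practicalnlp"))  # None instead of an exception
```

Returning `None` leaves it to the caller to decide whether to skip the word or substitute a placeholder vector.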

 

# 2. Getting the embedding representation for full text

We have seen how to get embedding vectors for single words. How do we get such a representation for a full text? A simple way is to sum or average the embeddings of the individual words. We will see an example of this using Word2Vec in Chapter 4. For now, let us look at a small example using another NLP library, spaCy, which we also saw in Chapter 2.
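
Before the spaCy example, here is a library-free sketch of the averaging idea itself. The function name `average_vector` and the two-dimensional toy embeddings are assumptions made for illustration. Words missing from the model are simply skipped, which is one possible policy among several.

```python
def average_vector(tokens, model):
    """Average the vectors of the tokens found in `model`; None if none are found."""
    vectors = [model[tok] for tok in tokens if tok in model]
    if not vectors:
        return None
    # zip(*vectors) groups the values of each dimension together
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

# Made-up 2-dimensional embeddings standing in for a real model
toy_model = {
    "canada": [1.0, 3.0],
    "large": [3.0, 5.0],
    "country": [2.0, 1.0],
}

tokens = "canada is a large country".split()
print(average_vector(tokens, toy_model))  # "is" and "a" are not in toy_model and are skipped
```

With real embeddings stored as NumPy arrays, the last line of the function would usually be written as `np.mean(vectors, axis=0)`.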


In [14]:
!python -m spacy download en_core_web_md

SyntaxError: ignored

In [17]:
import spacy
import en_core_web_sm

%time 
nlp = en_core_web_sm.load()

#To use the medium model downloaded above instead:
#nlp = spacy.load('en_core_web_md')
#Process a sentence using the model
mydoc = nlp("Canada is a large country")
#Get a vector for individual words
#print(mydoc[0].vector) #vector for 'Canada', the first word in the text 
print(mydoc.vector) #Averaged vector for the entire sentence

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 7.63 µs
[-0.41918245  0.16200495 -0.79129374  1.4938685   1.3054081   2.0170395
  1.5175831   0.93228465  2.1891901   1.4416349   0.06106711  0.09743678
 -0.2257375   0.17713971 -1.5512953   0.2645687  -1.0485003  -1.080864
 -0.8598553  -1.0081775   0.6843368  -0.6533736  -0.32018867 -0.45141286
 -1.5463241   0.8096622   0.66305393 -1.3783945   0.9841442  -0.6920743
  0.22289622  0.15090446 -1.1180314  -1.9345913   0.38503057 -1.8199227
  1.3990417  -1.0615559  -1.9546788  -0.2100529   1.9202824  -0.3749935
 -0.25492477 -2.1416779  -0.6990263  -0.03200452 -0.9764668   1.7387159
 -0.14891338  2.310504    2.7912867  -1.0199782   0.10864041  0.5835435
 -3.1004105   1.5820146   2.0926924  -0.40339088 -0.5991646  -1.3402617
 -0.7594353   0.2780761   2.629024    0.21917906  1.6852343   0.01221395
  1.06303    -2.0580707   0.38551608 -1.1065528  -1.5958662  -0.6947671
 -1.4428566  -0.38099432  0.50667036  0.4246232   1.2565958  -0.098301

In [18]:
#What happens when I give a sentence with strange words (and stop words), and ask spaCy for a word vector?
temp = nlp('practicalnlp is a newword')
temp[0].vector

array([ 1.4743975 , -0.9622246 , -1.1067446 ,  0.71745956,  3.6869755 ,
       -1.4803706 ,  2.6486013 , -0.02807039,  2.0256255 ,  3.9974196 ,
        2.4013276 ,  2.8038695 ,  2.1188593 , -1.0725985 , -1.8718748 ,
       -1.7074881 , -0.47109914,  1.753031  , -2.5303397 , -0.6910662 ,
        1.4618394 ,  2.451487  , -2.729975  , -1.2108035 , -1.0596836 ,
       -0.86415946, -1.8492069 , -1.3960482 ,  0.9203042 , -1.0206674 ,
        2.9118538 , -1.1872265 ,  0.1711235 , -3.0739012 ,  1.3036946 ,
       -2.8744037 ,  4.8433757 ,  0.5957062 , -2.63268   ,  1.5330828 ,
        3.3766036 ,  2.9181588 , -1.454087  , -1.4249781 , -1.578454  ,
        1.8532394 , -1.0139503 , -0.20046425, -0.5760678 ,  1.9376096 ,
       -0.3732175 , -1.9566089 , -1.7265924 , -0.9403594 , -0.6440577 ,
        1.1401592 ,  2.6202524 ,  0.08162385,  1.40631   ,  1.4846356 ,
       -1.5503293 , -4.1603303 ,  0.78613114,  1.4033163 ,  2.3532844 ,
       -0.48720837,  2.444285  , -3.873241  ,  2.10482   ,  2.81

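Well, at least, this is better than throwing an exception! :) Note that `en_core_web_sm` does not ship static word vectors, so the values returned here are context-sensitive tensors computed by the pipeline, not looked-up embeddings.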

