In this notebook, let us see how we can represent text using pre-trained word embedding models. 

# 1. Using a pre-trained word2vec model

Let us take a pre-trained word2vec model as an example and see how we can use it to find the words most similar to a given word. We will use the Google News vectors:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

A few other pre-trained word embedding models, along with details on how to access them through gensim, can be found at:
https://github.com/RaRe-Technologies/gensim-data

In [1]:
# To install only the requirements of this notebook, uncomment the lines below and run this cell

# ===========================

# !pip install scikit-learn==0.21.3
# !pip install wget==3.2
# !pip install "gensim>=4.0"  # KeyedVectors.index_to_key, used below, was added in gensim 4.0
# !pip install psutil==5.4.8
# !pip install spacy==2.2.4

# ===========================

In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch3/ch3-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch3-requirements.txt"

# ===========================

In [3]:
import os
import wget
import gzip
import shutil

# gn_vec_path = "GoogleNews-vectors-negative300.bin"
# if not os.path.exists("GoogleNews-vectors-negative300.bin"):
#     if not os.path.exists("../Ch2/GoogleNews-vectors-negative300.bin"):
#         #Downloading the required model
#         if not os.path.exists("../Ch2/GoogleNews-vectors-negative300.bin.gz"):
#             if not os.path.exists("GoogleNews-vectors-negative300.bin.gz"):
#                 wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz")
#             gn_vec_zip_path = "GoogleNews-vectors-negative300.bin.gz"
#         else:
#             gn_vec_zip_path = "../Ch2/GoogleNews-vectors-negative300.bin.gz"
#         #Extracting the required model
#         with gzip.open(gn_vec_zip_path, 'rb') as f_in:
#             with open(gn_vec_path, 'wb') as f_out:
#                 shutil.copyfileobj(f_in, f_out)
#     else:
#         gn_vec_path = "../Ch2/" + gn_vec_path

gn_vec_zip_path = "data/bigdata/goog_vec/GoogleNews-vectors-negative300.bin.gz"
gn_vec_path = "data/bigdata/goog_vec/GoogleNews-vectors-negative300.bin"

print(f"Model at {gn_vec_path}")

Model at data/bigdata/goog_vec/GoogleNews-vectors-negative300.bin


In [4]:
import warnings #This module ignores the various types of warnings generated
warnings.filterwarnings("ignore") 

import psutil #This module helps in retrieving information on running processes and system resource utilization
process = psutil.Process(os.getpid())
from psutil import virtual_memory
mem = virtual_memory()

import time #This module is used to calculate the time  

In [6]:
from gensim.models import Word2Vec, KeyedVectors
pretrainedpath = gn_vec_path

#Load W2V model. This will take some time, but it is a one time effort! 
pre = process.memory_info().rss
print("Memory used in GB before Loading the Model: %0.2f"%float(pre/(10**9))) #Check memory usage before loading the model
print('-'*10)

start_time = time.time() #Start the timer
ttl = mem.total #Total memory available

w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True) #load the model
print("%0.2f seconds taken to load"%float(time.time() - start_time)) #Calculate the total time elapsed since starting the timer
print('-'*10)

print('Finished loading Word2Vec')
print('-'*10)

post = process.memory_info().rss
print("Memory used in GB after Loading the Model: {:.2f}".format(float(post/(10**9)))) #Calculate the memory used after loading the model
print('-'*10)

print("Percentage increase in memory usage: {:.2f}% ".format(float(((post - pre)/pre)*100))) #Percentage increase in memory after loading the model
print('-'*10)

print("Number of words in vocabulary: ",len(w2v_model.index_to_key)) #Number of words in the vocabulary. 

Memory used in GB before Loading the Model: 0.17
----------
20.02 seconds taken to load
----------
Finished loading Word2Vec
----------
Memory used in GB after Loading the Model: 4.27
----------
Percentage increase in memory usage: 2481.61% 
----------
Number of words in vocabulary:  3000000


In [7]:
#Let us examine the model by finding the words most similar to a given word!
w2v_model.most_similar('beautiful')

[('gorgeous', 0.8353005051612854),
 ('lovely', 0.8106936812400818),
 ('stunningly_beautiful', 0.7329413294792175),
 ('breathtakingly_beautiful', 0.7231340408325195),
 ('wonderful', 0.6854086518287659),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402888298035)]

In [8]:
#Let us try with another word! 
w2v_model.most_similar('toronto')

[('montreal', 0.6984112858772278),
 ('vancouver', 0.6587257385253906),
 ('nyc', 0.6248832941055298),
 ('alberta', 0.6179691553115845),
 ('boston', 0.611499547958374),
 ('calgary', 0.61032634973526),
 ('edmonton', 0.6100260615348816),
 ('canadian', 0.5944076776504517),
 ('chicago', 0.5911980271339417),
 ('springfield', 0.5888351798057556)]
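
The scores in the outputs above are cosine similarities between word vectors, which is what gensim's `most_similar` ranks by. As a rough sketch of the idea, the snippet below ranks words in a tiny hand-made embedding table; the 2-dimensional vectors, `toy_embeddings`, and both helper functions are assumptions made up for illustration and are not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = embeddings[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-dimensional vectors, made up for illustration only
toy_embeddings = {
    "toronto": [0.9, 0.1],
    "montreal": [0.8, 0.2],
    "beautiful": [0.1, 0.9],
}
ranked = most_similar("toronto", toy_embeddings)
print(ranked)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0, which is why `montreal` ranks above `beautiful` in this toy table.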

In [9]:
#What is the vector representation for a word? 
w2v_model['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

In [10]:
#What if I am looking for a word that is not in this vocabulary?
w2v_model['practicalnlp']

KeyError: "Key 'practicalnlp' not present"

#### Two things to note while using pre-trained models: 


1.   Lookups are exact, case-sensitive string matches; the examples above use lowercase tokens. If a word is not in the vocabulary, the model throws an exception.
2.   So, it is always a good idea to encapsulate those statements in try/except blocks.
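
The try/except advice can be sketched as a small helper. This is only a minimal sketch: `safe_vector` is a hypothetical name, and a plain dictionary stands in for `w2v_model` here, on the assumption that both raise `KeyError` for a missing key (as the model did in the cell above).

```python
def safe_vector(model, word, default=None):
    """Return model[word], or `default` if the word is out of vocabulary."""
    try:
        return model[word]
    except KeyError:
        return default

# A plain dict stands in for w2v_model; the vector values are made up
toy_model = {"computer": [0.1, -0.2, 0.3]}

found = safe_vector(toy_model, "computer")
missing = safe_vector(toy_model, "practicalnlp")
print(found)
print(missing)
```

Returning a default instead of raising lets the calling code decide how to treat unknown words, for example by skipping them.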

 

# 2. Getting the embedding representation for full text

We have seen how to get embedding vectors for single words. How do we use them to get such a representation for a full text? A simple way is to sum or average the embeddings of the individual words. We will see an example of this using Word2Vec in Chapter 4. Let us see a small example using another NLP library, spaCy, which we also saw in Chapter 2.
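
Before turning to spaCy, here is a rough sketch of the averaging idea itself. The `average_vector` helper and the 2-dimensional `toy_embeddings` table are assumptions made up for illustration; words missing from the table are skipped, which is one possible choice among several.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens; zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known words
    return [sum(values) / len(known) for values in zip(*known)]

# Hypothetical 2-dimensional vectors, made up for illustration only
toy_embeddings = {
    "canada": [1.0, 0.0],
    "large": [0.0, 1.0],
    "country": [1.0, 1.0],
}
tokens = "canada is a large country".split()
sentence_vector = average_vector(tokens, toy_embeddings, dim=2)
print(sentence_vector)
```

With a real model the same logic would usually be written with NumPy arrays, but the steps stay the same: collect the known word vectors, then average them per dimension.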


In [11]:
# !python -m spacy download en_core_web_md

In [13]:
import spacy

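# Note: %time on a line of its own times an empty statement (hence the 0 ns below);
# to time the load, put the statement on the same line as %time
%time 
nlp = spacy.load('en_core_web_md')
# process a sentence using the model
mydoc = nlp("Canada is a large country")
#Get a vector for individual words
#print(mydoc[0].vector) #vector for 'Canada', the first word in the text 
print(mydoc.vector) #Averaged vector for the entire sentence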

CPU times: total: 0 ns
Wall time: 0 ns
[-2.12132597e+00  3.35791826e+00 -1.37670004e+00  2.12385988e+00
  6.28810024e+00  3.22182178e-01  1.18766809e+00  4.87165976e+00
  2.24417591e+00  7.14037895e-01  1.03926411e+01  8.83959949e-01
 -1.73903596e+00  5.41560054e-01 -1.55289978e-01  5.18263149e+00
  1.30475593e+00  4.21266031e+00 -5.92720024e-02 -1.28370404e+00
  2.54464006e+00  1.31399959e-01 -4.84842014e+00  1.84918189e+00
 -6.28175914e-01 -1.20439982e+00 -1.89999998e+00 -4.88359404e+00
 -1.59767210e+00 -2.89982986e+00  2.57135957e-01  2.57717991e+00
 -2.17529225e+00 -2.77516985e+00 -2.83998394e+00  8.96261990e-01
  3.73915970e-01  4.36887592e-01  2.06502008e+00 -2.08246017e+00
 -7.68391967e-01  1.87826610e+00  1.21900201e+00  4.61789995e-01
 -2.57270002e+00  2.26117969e+00  2.93105793e+00 -1.84933782e+00
 -5.98986030e-01  1.39556003e+00 -1.71248794e+00  4.13538039e-01
  2.05463791e+00 -4.33485985e+00 -3.63799959e-01 -1.03273201e+00
  2.23117399e+00 -5.93478978e-01 -7.95660019e-01  3

In [14]:
#What happens when I give a sentence with strange words (and stop words), and try to get its word vector in spaCy?
temp = nlp('practicalnlp is a newword')
temp[0].vector

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

Well, at least, this is better than throwing an exception! :) 
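
Because no exception is raised, unknown words have to be detected some other way. The sketch below assumes, as in the output above, that an all-zero vector marks a word without a real embedding; `is_zero_vector` and the token/vector pairs are hypothetical stand-ins and not spaCy objects. spaCy tokens also expose `has_vector` and `is_oov` attributes that can serve a similar check.

```python
def is_zero_vector(vector):
    """True if every component is zero, i.e. the token has no real embedding."""
    return not any(vector)

# Hypothetical token/vector pairs standing in for spaCy tokens
token_vectors = [
    ("practicalnlp", [0.0, 0.0, 0.0]),
    ("is", [0.2, -0.1, 0.4]),
    ("newword", [0.0, 0.0, 0.0]),
]
unknown = [token for token, vector in token_vectors if is_zero_vector(vector)]
print(unknown)
```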

