In this notebook, let us see how we can represent text using pre-trained word embedding models, as well as train our own word and document embedding models.

# 1. Using a pre-trained word2vec model

Let us take a pre-trained word2vec model as an example and see how we can use it to find the words most similar to a given word. We will use the Google News vectors:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

A few other pre-trained word embedding models, along with details on how to access them through gensim, can be found at:
https://github.com/RaRe-Technologies/gensim-data

In [1]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2020-03-12 04:55:21--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.142.54
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.142.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘/root/input/GoogleNews-vectors-negative300.bin.gz’


2020-03-12 04:55:54 (55.9 MB/s) - ‘/root/input/GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [0]:
import warnings
warnings.filterwarnings("ignore")


import os
import psutil
process = psutil.Process(os.getpid())

from psutil import virtual_memory
mem = virtual_memory()

import time



In [3]:
from gensim.models import Word2Vec, KeyedVectors
pretrainedpath = '/root/input/GoogleNews-vectors-negative300.bin.gz'

#Load W2V model. This will take some time, but it is a one time effort! 
pre = process.memory_info().rss
print("Memory used in GB before Loading the Model: %0.2f"%float(pre/(10**9)))
print('-'*10)

start_time = time.time()

w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True)
print("%0.2f seconds taken to load"%float(time.time() - start_time))
print('-'*10)

print('Finished loading Word2Vec')
print('-'*10)

post = process.memory_info().rss
print("Memory used in GB after Loading the Model: %0.2f"%float(post/(10**9)))
print('-'*10)

ttl = mem.total
# Note: post/pre is a ratio, so this prints memory after loading as a percentage of memory before.
print("Memory used after loading, as a percentage of memory used before: %0.2f "%float((post/pre)*100))
print('-'*10)

print("Number of words in vocabulary: ",len(w2v_model.vocab)) #Number of words in the vocabulary. 

Memory used in GB before Loading the Model: 0.19
----------
107.37 seconds taken to load
----------
Finished loading Word2Vec
----------
Memory used in GB after Loading the Model: 5.04
----------
Memory used after loading, as a percentage of memory used before: 2617.63 
----------
Number of words in vocabulary:  3000000


In [0]:
#Let us examine the model by finding the words most similar to a given word!
w2v_model.most_similar('beautiful')

[('gorgeous', 0.8353004455566406),
 ('lovely', 0.810693621635437),
 ('stunningly_beautiful', 0.7329413890838623),
 ('breathtakingly_beautiful', 0.7231341004371643),
 ('wonderful', 0.6854087114334106),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402292251587)]

In [0]:
#Let us try with another word! 
w2v_model.most_similar('toronto')

[('montreal', 0.698411226272583),
 ('vancouver', 0.6587257385253906),
 ('nyc', 0.6248831748962402),
 ('alberta', 0.6179691553115845),
 ('boston', 0.611499547958374),
 ('calgary', 0.61032634973526),
 ('edmonton', 0.6100261211395264),
 ('canadian', 0.5944076776504517),
 ('chicago', 0.5911980271339417),
 ('springfield', 0.5888351202011108)]
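
The score next to each word is the cosine similarity between the query word's vector and that neighbour's vector. Here is a minimal sketch of that computation in plain Python. It uses made-up 3-dimensional vectors (the real ones have 300 dimensions) so that it runs without loading the model, and `cosine_similarity` is a name chosen for illustration, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # points in the same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # close to 0: unrelated directions
```

The score depends only on the direction of the vectors, not their length, which is why `vec_b` (twice as long as `vec_a`) still scores close to 1.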

In [0]:
#What is the vector representation for a word? 
w2v_model['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

In [0]:
#What if I am looking for a word that is not in this vocabulary?
w2v_model['practicalnlp']

KeyError: "word 'practicalnlp' not in vocabulary"

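#### Two things to note while using pre-trained models:


1.   Lookups are exact string matches against the model's vocabulary, so the casing of your tokens matters (the examples above use lowercased tokens). If a word is not in the vocabulary, the model throws an exception (the `KeyError` seen above).
2.   So, it is always a good idea to wrap such lookups in try/except blocks.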
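
A minimal sketch of such a guarded lookup follows. The helper name `get_vector_or_none` is hypothetical, and a plain dict stands in for `w2v_model` so that the example runs without the 1.5 GB download. The only behaviour it relies on is that an unknown word raises `KeyError`, as in the cell above.

```python
def get_vector_or_none(model, word):
    """Return the vector for `word`, or None if the model does not know it."""
    try:
        return model[word]
    except KeyError:
        return None

# Stand-in for w2v_model (an assumption for illustration): a dict also
# raises KeyError when the key is missing.
toy_model = {"computer": [0.1, -0.2, 0.3]}

print(get_vector_or_none(toy_model, "computer"))
print(get_vector_or_none(toy_model, "practicalnlp"))
```

Returning `None` is only one option. Depending on the task, you might instead skip the word or substitute a default vector.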

 

# 2. Getting the embedding representation for full text

We have seen how to get embedding vectors for single words. How do we use them to get such a representation for a full text? A simple way is to sum or average the embeddings of the individual words. We will see an example of this using Word2Vec in Chapter 4. For now, let us look at a small example using another NLP library, spaCy, which we also saw earlier in Chapter 2.


In [0]:
import spacy

# Load the spacy model that we already installed in Chapter 2. This takes a few seconds.
%time nlp = spacy.load('en_core_web_md')
# Process a sentence using the model
mydoc = nlp("Canada is a large country")
# Get a vector for individual words
# print(mydoc[0].vector) # vector for 'Canada', the first word in the text
print(mydoc.vector) # Averaged vector for the entire sentence

CPU times: user 13 s, sys: 661 ms, total: 13.7 s
Wall time: 15.2 s
[-1.12055197e-01  2.26087615e-01 -5.15111461e-02 -1.21812008e-01
  4.13958639e-01 -8.56475979e-02 -2.84600933e-03 -2.26096585e-01
  6.98113963e-02  2.27946019e+00 -4.49774921e-01 -6.39050007e-02
 -1.80326015e-01 -8.79765972e-02  9.93399299e-04 -1.57384202e-01
 -1.23817801e-01  1.54990411e+00  2.00794004e-02  1.38399601e-01
 -1.48897991e-01 -2.23025799e-01 -1.48171991e-01  4.68924567e-02
 -3.17026004e-02  1.19096041e-02 -6.10985979e-02  9.57068056e-02
  9.37099904e-02  1.70955807e-01 -9.29740071e-03  7.88536817e-02
  1.74508005e-01 -1.04450598e-01  1.04872189e-01 -1.16961405e-01
  6.23028055e-02 -2.23016590e-01 -1.44107476e-01 -2.03423887e-01
  2.61404991e-01  2.43404001e-01  1.51980996e-01 -1.12484001e-01
  1.18055798e-01 -9.51323956e-02  8.66319984e-02 -2.54322797e-01
  3.84932049e-02  1.18278004e-01 -3.21602583e-01  3.73764008e-01
  1.13018408e-01 -8.05834010e-02  1.84921592e-01  9.38879885e-03
  1.22166201e-01 -3.242

In [0]:
#What happens when I give a strange word, and try to get its word vector in Spacy?
temp = nlp('practicalnlp is a newword')
temp[0].vector

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

Well, at least, this is better than throwing an exception! :) 
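
To make the averaging idea from this section concrete, here is a minimal sketch in plain Python. The lookup table `toy_vectors` and the helper `average_vector` are made up for illustration. Words missing from the table are skipped rather than counted as zeros, which is one possible design choice and not necessarily what spaCy does internally.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of the known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]

# Toy 2-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "canada": [1.0, 0.0],
    "large": [0.0, 1.0],
    "country": [2.0, 2.0],
}

print(average_vector(["canada", "large", "country"], toy_vectors, dim=2))
print(average_vector(["practicalnlp"], toy_vectors, dim=2))
```

With real embeddings stored as NumPy arrays, `numpy.mean(known, axis=0)` computes the same per-dimension average.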

