<a href="https://colab.research.google.com/github/nishkalavallabhi/practicalnlp/blob/V_2_0/Ch2/Ch3-Latest/Notebook_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, let us see how we can represent text using pre-trained word embedding models, as well as train our own word and document embedding models.

# 1. Using a pre-trained word2vec model

As a first example, let us load a pre-trained word2vec model and use it to look up the words most similar to a given word. We will use the Google News vectors:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

A few other pre-trained word embedding models, along with instructions for accessing them through gensim, are listed at:
https://github.com/RaRe-Technologies/gensim-data
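As an alternative to mounting Google Drive, the models listed in the gensim-data repository can be fetched through gensim's downloader module. The following is a minimal sketch, assuming gensim is installed and the machine has network access; the helper name `load_pretrained` and the `CANDIDATE_MODELS` list are made up for this example, and the Google News model is a large download (well over a gigabyte).

```python
# Two model names taken from the gensim-data listing.
CANDIDATE_MODELS = ["word2vec-google-news-300", "glove-wiki-gigaword-50"]

def load_pretrained(name):
    """Download (on first use) and return the vectors published under `name`."""
    # Imported lazily so that defining this helper does not require gensim.
    import gensim.downloader as api
    return api.load(name)

# Example usage (commented out because it triggers a download):
# vectors = load_pretrained("glove-wiki-gigaword-50")
# print(vectors.most_similar("computer", topn=3))
```

The downloader caches each model locally, so later calls with the same name should not download it again.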

In [1]:
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
!mkdir -p drive
!google-drive-ocamlfuse drive

E: Package 'python-software-properties' has no installation candidate
Selecting previously unselected package google-drive-ocamlfuse.
(Reading database ... 131183 files and directories currently installed.)
Preparing to unpack .../google-drive-ocamlfuse_0.7.13-0ubuntu1~ubuntu18.04.1_amd64.deb ...
Unpacking google-drive-ocamlfuse (0.7.13-0ubuntu1~ubuntu18.04.1) ...
Setting up google-drive-ocamlfuse (0.7.13-0ubuntu1~ubuntu18.04.1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force
··········
Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope

In [0]:
from gensim.models import Word2Vec, KeyedVectors
pretrainedpath = "/content/drive/NLP_book/Datasets/practicalnlp-master/Ch2/GoogleNews-vectors-negative300.bin"
#Load W2V model. This will take some time, but it is a one time effort! 
%time w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True)
print('done loading Word2Vec')
print(len(w2v_model.vocab)) #Number of words in the vocabulary (gensim 3.x; in gensim 4.x use len(w2v_model.key_to_index)).


In [0]:
#Let us examine the model by knowing what the most similar words are, for a given word!
w2v_model.most_similar('beautiful')

In [0]:
#Let us try with another word! 
w2v_model.most_similar('toronto')
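`most_similar` ranks the other words in the vocabulary by cosine similarity to the query word. The sketch below reproduces that idea in plain Python on a toy vocabulary; the 3-dimensional vectors are invented for illustration and are not taken from the Google News model, and the real implementation is vectorized rather than a Python loop.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings, topn=2):
    """Rank every other word in `embeddings` by cosine similarity to `word`."""
    target = embeddings[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors, invented purely for this example.
toy_embeddings = {
    "beautiful": [0.9, 0.1, 0.0],
    "gorgeous": [0.8, 0.2, 0.1],
    "lovely": [0.7, 0.3, 0.0],
    "computer": [0.0, 0.1, 0.9],
}
ranked = most_similar("beautiful", toy_embeddings, topn=2)
print(ranked)  # "gorgeous" ranks first, then "lovely"
```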

In [0]:
#What is the vector representation for a word? 
w2v_model['computer']

In [0]:
#What if I am looking for a word that is not in this vocabulary?
w2v_model['practicalnlp']

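Two things to note while using pre-trained models. First, tokens must be looked up in exactly the form in which the model stores them: casing and other preprocessing are fixed at training time, so check how the vocabulary was built before querying it. Second, if a word is not in the vocabulary, the lookup raises a KeyError. So, it is always a good idea to wrap such lookups in try/except blocks.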
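The try/except advice can be sketched as a small helper. This is only an illustration: `safe_vector` is a made-up name, and a plain dict stands in for the loaded model so the snippet runs without the Google News file. Like the gensim lookup, indexing a dict with an unknown key raises KeyError.

```python
def safe_vector(model, word):
    """Return the vector for `word`, or None if the model does not know it."""
    try:
        return model[word]
    except KeyError:
        return None

# A dict stands in for the loaded model here (an assumption for the demo).
toy_model = {"computer": [0.1, 0.2, 0.3]}
print(safe_vector(toy_model, "computer"))      # the stored vector
print(safe_vector(toy_model, "practicalnlp"))  # None instead of an exception
```

With the real model, the same helper could be called as `safe_vector(w2v_model, 'practicalnlp')`.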

# 2. Let us now see how to train our own word2vec model

We are going to use a small dataset called common_texts that comes with gensim. It is a small list of 9 texts, where each text is tokenized and represented as a list of words. Let us see how to build our own Word2Vec model with this tiny corpus. 

In [0]:
from gensim.test.utils import common_texts

#Inspect common_texts
print(len(common_texts))
#print(common_texts)

#Build the model, by selecting the parameters.
#Note: `size` is the gensim 3.x name for the vector dimension; gensim 4.x calls it `vector_size`.
our_model = Word2Vec(common_texts, size=10, window=5, min_count=1, workers=4)
#Save the model
our_model.save("tempmodel.w2v")
#Inspect the model by looking for the most similar words for a test word. 
print(our_model.wv.most_similar('computer', topn=5))
#Let us see what the 10-dimensional vector for 'computer' looks like.
print(our_model.wv['computer']) #Word vectors live on the model's .wv attribute.
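To train on your own text instead of common_texts, the corpus needs the same shape: a list of texts, each one a list of tokens. Below is a minimal sketch of that preparation step, assuming a crude regular-expression tokenizer (a real project would likely use a proper tokenizer); the sentences and the `tokenize` helper are invented for this example.

```python
import re

def tokenize(text):
    """Lowercase `text` and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical raw sentences, used only to show the expected shape.
raw_sentences = [
    "Human machine interface for computer applications.",
    "A survey of user opinion of computer system response time.",
]
corpus = [tokenize(sentence) for sentence in raw_sentences]
print(corpus[0])
```

The resulting `corpus` can be passed to `Word2Vec` in place of `common_texts`.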

# 3. Getting the embedding representation for full text

We have seen how to get embedding vectors for single words. How do we use them to get a representation for a full text? A simple way is to sum or average the embeddings of the individual words. We will see an example of this using Word2Vec in Chapter 4. For now, let us look at a small example using another NLP library, spaCy, which we also saw in Chapter 2.


In [0]:
import spacy.cli
spacy.cli.download("en_core_web_md")

In [0]:
import spacy

# Load the spacy model that we already installed in Chapter 2. This takes a few seconds.
%time nlp = spacy.load('en_core_web_md')
# Process a sentence using the model
mydoc = nlp("Canada is a large country")
# Get a vector for individual words
#print(mydoc[0].vector) #vector for 'Canada', the first word in the text
print(mydoc.vector) #Averaged vector for the entire sentence
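The averaging idea behind a sentence vector can also be written out by hand. The sketch below uses invented 2-dimensional vectors rather than spaCy's, and it skips tokens that have no vector, which is a choice made for this sketch rather than a description of what spaCy does internally.

```python
def average_embedding(tokens, embeddings):
    """Element-wise mean of the vectors of the tokens found in `embeddings`."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # no token had a vector
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical vectors; "is" and "a" are deliberately left out.
toy = {"canada": [1.0, 0.0], "large": [0.0, 1.0], "country": [1.0, 1.0]}
sentence = "canada is a large country".split()
sentence_vector = average_embedding(sentence, toy)
print(sentence_vector)
```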

In [0]:
#What happens when I give spaCy a strange word and ask for its word vector?
temp = nlp('practicalnlp is a newword')
temp[0].vector

Well, at least, this is better than throwing an exception! :) 
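Because no exception is raised, unknown words can slip through unnoticed. spaCy tokens expose a `has_vector` attribute that can be used to flag them. The sketch below is duck-typed and uses stand-in token objects so that it runs without a spaCy model; the helper name `tokens_without_vectors` and the stand-in values are assumptions made for this example.

```python
from types import SimpleNamespace

def tokens_without_vectors(doc):
    """Return the text of every token that reports having no word vector."""
    return [token.text for token in doc if not token.has_vector]

# Stand-in tokens (an assumption so the sketch runs without spaCy);
# with spaCy loaded, pass nlp('practicalnlp is a newword') instead.
fake_doc = [
    SimpleNamespace(text="practicalnlp", has_vector=False),
    SimpleNamespace(text="is", has_vector=True),
    SimpleNamespace(text="a", has_vector=True),
]
print(tokens_without_vectors(fake_doc))
```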

