# Word2Vec Custom Model



#### Introduction

If you are working on a domain-specific NLP problem, pre-trained models
may not be sufficient. For example, legal documents contain words that are
used frequently in that domain but are rare or absent in general public text.
In such cases, the word vectors need to be trained on the domain's own documents.

In this exercise, we will recreate the word vectors based on a new set of news articles.
- The corpus is a text file in the folder data.
- The file is articles.txt; each line contains one document.

In [1]:
# NLP tools
import gensim
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import tokenize

# Data tools
import pandas as pd


### Part One: Train a custom Word2Vec


#### Prepare the text corpus
* Read the documents into a DataFrame.
* Treat each document (one line of the file) as one training sentence.
* Split each document into a list of words, since Gensim requires the corpus to be a list of sentences, each given as a list of words, to train Word2Vec.


In [2]:
import pandas as pd

# One document per line; the tab separator assumes the text itself contains no tabs
articles_data = pd.read_csv('./data/articles.txt', sep='\t', header=None, names=["text"])
# articles_data[0:3]   # display the first three documents

sentences = articles_data.text.str.split()   # one list of whitespace-separated tokens per document
# print(sentences[0])   # check the content of the first document
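
Note that `str.split()` cuts only on whitespace, so punctuation stays attached to the neighbouring word. This is why tokens such as 'years.' and 'bank,' appear in the vocabulary printed later in this notebook. The sketch below contrasts the two behaviours on one sentence. `simple_tokens` is a hypothetical helper built on the standard `re` module; it is not part of the notebook or of Gensim.

```python
import re

text = "So it is tempting to think the bank, when asked, should cough up."

# Whitespace splitting, as in the cell above: punctuation stays attached.
whitespace_tokens = text.split()


def simple_tokens(line):
    """Hypothetical helper: keep runs of letters, digits, hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9'-]*", line)


regex_tokens = simple_tokens(text)

print(whitespace_tokens)
print(regex_tokens)
```

Whether to strip punctuation is a modelling choice: keeping it yields separate vectors for 'bank' and 'bank,', which splits the training signal for the same word.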

#### Train the Word2Vec model
Here are some important hyperparameters to consider:
* size - number of dimensions of each word vector (Gensim 4.x renames this parameter to vector_size)
* window - context window size: the maximum distance between a target word and the words around it
* min_count - minimum number of times a word must occur in the corpus to be kept in the vocabulary
* workers - number of CPU cores (worker threads) to use
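
To make `window` and `min_count` concrete, here is a small sketch in plain Python. It is a simplification and does not reproduce what Gensim does internally; `context_pairs` and `filter_rare` are hypothetical helpers for illustration only.

```python
from collections import Counter


def context_pairs(tokens, window):
    """Yield (target, context) pairs for words at most `window` positions apart."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


def filter_rare(sentences, min_count):
    """Drop words that occur fewer than `min_count` times in the whole corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[w for w in sent if counts[w] >= min_count] for sent in sentences]


toy = [["banks", "pay", "large", "fines"], ["banks", "pay", "bills"]]

pairs = list(context_pairs(toy[0], window=1))
kept = filter_rare(toy, min_count=2)
print(pairs)
print(kept)
```

With `min_count = 1`, as used in this lab, no word is dropped, so every token in the corpus receives a vector.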



In [3]:
size = 100  
window = 5
min_count = 1
workers = 4

### Exercise A

* Instantiate the Word2Vec model with the right object.
* Build the Word2Vec model. Take a stretch; it will take a while.
* Tuning the hyperparameters is out of scope for this lab. We will keep it for a more advanced session.


In [4]:
from gensim.models import Word2Vec
custom_model = Word2Vec(sentences, size=size, window=window, min_count=min_count, workers=workers)


Save the model to a file so that it can be loaded again for future use.

In [5]:
custom_model.save("custom_word2vec.model")

In [6]:
print(custom_model)

Word2Vec(vocab=211201, size=100, alpha=0.025)


#### View the first 100 words in the trained vocabulary

In [7]:
words = list(custom_model.wv.vocab)
print(words[0:100])

["Barclays'", 'defiance', 'of', 'US', 'fines', 'has', 'merit', 'Barclays', 'disgraced', 'itself', 'in', 'many', 'ways', 'during', 'the', 'pre-financial', 'crisis', 'boom', 'years.', 'So', 'it', 'is', 'tempting', 'to', 'think', 'bank,', 'when', 'asked', 'by', 'Department', 'Justice', 'pay', 'a', 'large', 'bill', 'for', 'polluting', 'financial', 'system', 'with', 'mortgage', 'junk', 'between', '2005', 'and', '2007,', 'should', 'cough', 'up,', 'apologise', 'learn', 'some', 'humility.', 'That', 'not', 'view', 'chief', 'executive,', 'Jes', 'Staley.', 'thinks', 'DoJ’s', 'claims', 'are', '“disconnected', 'from', 'facts”', 'that', '“an', 'obligation', 'our', 'shareholders,', 'customers,', 'clients', 'employees', 'defend', 'ourselves', 'against', 'unreasonable', 'allegations', 'demands.”', 'The', 'stance', 'possibly', 'foolhardy,', 'since', 'going', 'into', 'open', 'legal', 'battle', 'most', 'powerful', 'prosecutor', 'risky,', 'especially', 'if', 'you', 'end', 'up']


#### Let's look at the embedding vector of a word

In [8]:
my_vector = custom_model.wv['customers']
print(my_vector)

[ 0.01378442 -1.1174667   0.4744351  -1.2910578   0.52911615  1.5561563
 -0.14484563 -0.47406304  0.645721   -1.2883312   1.3062688   0.08798663
 -0.5034557   1.0442336  -0.7617575  -0.33152142 -1.0958656   0.81533015
  0.15853634  0.5193816   2.5061345  -1.9462053   0.98096704 -0.80896395
 -0.55758625  0.86176926 -1.5865052  -0.70564646 -0.69579375  1.3621786
  0.3874463  -1.4278525  -1.4715601  -1.7420949   0.5578193  -0.32139802
 -0.44999853 -0.77638054 -0.39046544 -1.4123049   0.9775229   1.4130435
  0.1332317  -0.08742216  0.60607463 -0.24934907 -0.19561014  0.3671871
  1.5404122   3.2982652   0.295346   -0.5780479   0.9804446  -0.31606913
 -0.32888243 -1.6041582   0.50948507  4.1281695  -1.2953583   0.07255799
  0.84527147  0.04349367  0.23306443  0.007578    2.4469492  -0.00435588
  0.36843035  0.27085546 -0.7627477   1.049201    0.7159596  -0.331684
  0.12164953  0.5501899   1.3465499   1.4330037  -0.24350229  1.6792712
  0.32269552  0.50986665  0.75697035 -0.28999075 -0.997702


### Part Two: Add more words to the custom model
It is fair to assume that the word vectors will need to be improved as more documents become available. We can add documents to a model as it evolves. Each document has to be a string, and all the documents need to be collected in a list. Stop words are then removed from each document, and the result is tokenized.



In [9]:
docs = ["Overseas REIT That Grew Year-on-Year DPU",
        "Attention! Activist REIT Investors Needed in Singapore",
        "Forget Physical Property. Buy These REIT Businesses to Gain Safer Exposure to Real Estate",
        "Invest into REIT for stable income",
        "REIT for retirement and property income",
        "Property investment using REIT",
        "Low Risk REIT for your retirement portfolio",
        "Using REIT To get a steady Income",
        "Multiple Income stream using REIT",
        "Supplement Income through property securitied",
        # Note: there is no comma after the next string, so Python joins it with the
        # string that follows into a single document ("... 3 yearsInvest in ...").
        "Famous for been for fast person to make money in 3 years"
        "Invest in propertly and get Passive Income using REIT",
        "Income without effort through REIT",
        ]


### Exercise B
Following the example in the earlier part of the notebook, convert the docs list into a structure that Gensim can accept for training.
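
As a rough illustration of the target structure (a list of token lists, one inner list per document), the sketch below processes two made-up headlines without Gensim. `TINY_STOPWORDS` and `to_tokens` are hypothetical stand-ins. In the exercise itself you would more likely use `remove_stopwords`, which is imported at the top of the notebook.

```python
# Hypothetical, deliberately tiny stop-word set; Gensim ships a much larger one.
TINY_STOPWORDS = {"a", "for", "in", "the", "to"}


def to_tokens(doc):
    """Split on whitespace and drop stop words (case-insensitive match)."""
    return [w for w in doc.split() if w.lower() not in TINY_STOPWORDS]


sample_docs = ["Invest in bonds for the long term", "A guide to index funds"]
sample_tokens = [to_tokens(doc) for doc in sample_docs]
print(sample_tokens)
```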

In [38]:

new_docs = []
# your code
for doc in docs:
    pass  # replace with your preprocessing of doc

### Exercise C
Reload an existing Word2Vec model for additional training
* Reload the saved model
* Summarize the loaded model. Note the vocabulary size
* Get the vocabulary from the pre-trained model (hint: attribute model.wv.vocab; in Gensim 4.x, model.wv.key_to_index)

In [None]:
# your code
new_model = Word2Vec.load("custom_word2vec.model")
print(new_model)
pretrained_words = list(new_model.wv.vocab)

### Exercise D
We will proceed to train the model with the new set of document tokens
* Build up the new vocabulary (hint: use the build_vocab() method)
* Train the model with the new sentences (hint: use the train() method)
* Summarize the updated model. Note the new vocabulary size
* Get the list of words in the new model


In [None]:
new_model.????????(????????, update=True)  ## prepare the model vocabulary
new_model.?????????(????????, total_examples= len(new_docs), epochs=50)
print(new_model)
posttrained_words = list(new_model.wv.vocab)

#### Determine the new vocabulary
* We find the difference in the vocabulary before and after the new training


In [58]:
# words = list(custom_model.wv.vocab)

print(list(set(posttrained_words)-set(pretrained_words)))


['securitied', 'Passive', 'Year-on-Year', 'DPU', 'Attention!', 'Property.', 'yearsInvest', 'REIT', 'propertly', 'Exposure']


In [44]:
#new_model.word_vec('REIT')
v = new_model.wv.word_vec('REIT')
print (v)

[-1.96350794e-02 -2.02486129e-03 -4.10056971e-02  1.35938507e-02
 -2.50798557e-03 -4.30625584e-03 -1.67355291e-04 -2.45491881e-02
  1.88353062e-02 -2.46145786e-03  9.69257235e-05  1.21400459e-02
 -2.23264080e-02 -1.00820260e-02 -1.99193843e-02 -9.01276898e-03
 -4.29931376e-03  1.43385250e-02 -1.17647955e-02  3.10437270e-02
 -1.93367358e-02  1.66339092e-02  1.41026732e-02 -1.56057533e-02
 -6.55088667e-03 -3.67482565e-02 -1.48438793e-02 -2.23036278e-02
 -5.23079466e-03 -4.36255801e-03  3.28454785e-02  4.21109609e-03
 -1.47005273e-02  3.54526043e-02 -3.81266070e-03 -1.38563309e-02
 -4.18880861e-03 -1.98773704e-02  3.29029933e-02 -4.01221123e-03
  2.55043358e-02 -3.20761390e-02  2.69843079e-02 -8.36666208e-03
 -2.28520036e-02 -1.22945150e-02 -6.97563728e-03  7.71008292e-03
 -2.87611084e-03 -1.08578959e-02 -5.44355474e-02  9.74336435e-05
  3.62618989e-03 -1.40412757e-03  3.54965702e-02  1.26089267e-02
 -3.65837254e-02  2.55953986e-02 -4.76709791e-02 -1.44040212e-02
 -1.88120827e-02  2.75070

### Save the final model and the key vectors.
When the modelling is completed, you may wish to save the vectors in a KeyedVectors structure.

The reason for separating the trained vectors into KeyedVectors is that if you no longer need the full model state (that is, you will not continue training), the state can be discarded. The result is a much smaller and faster object that can be memory-mapped for fast loading and for sharing the vectors in RAM between processes:

In [59]:
new_model.wv.save('new_custom_word2vec.kv')   # This saves the word-vectors lookup
new_model.save("new_custom_word2vec.model")   # this saves the model for future training


The saved KeyedVectors can be loaded later for application use. Sample code for loading is shown below; the vector for a word is also retrieved.


In [63]:
from gensim.models import KeyedVectors
wv = KeyedVectors.load("new_custom_word2vec.kv", mmap='r')
wv


<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x1a7d0416d8>

In [64]:
vector = wv['computer']
print (vector)

[-0.46485904 -1.0854895  -0.422736    0.63330615 -0.01367772 -0.19758989
  0.17786923 -1.1056702   0.44522965 -0.34253973  0.2706924   0.30083555
  0.08683466 -0.13762787 -0.6585511  -0.02505613 -0.6359403  -0.23481385
 -0.45187244  0.8636467  -0.35240334 -0.45184505  0.24784958  0.28864238
  0.09729365 -0.5539774  -0.4438754   0.02398845  0.15363957 -0.431226
  0.18322936  0.06095865 -0.8781888   0.02536917 -0.1797969  -1.1510798
 -0.37910458 -0.32920897  0.9193784  -0.03638902 -0.04682702 -0.8927201
 -0.12078495 -0.50178176  0.27532578 -0.76397955 -0.14572047  0.1539461
  0.38892302  0.38636476 -0.42555583 -0.39981484  0.37844056 -0.44468844
 -0.05397031  0.1801081  -0.45894274  0.49194783 -0.61313987 -0.18911152
  0.35393918 -0.07444182  0.3684309   0.30126336 -0.04820593 -0.00197545
 -0.7725796  -0.05828528 -0.5999901  -0.18389821  0.3623138  -0.57996565
  0.5374159  -0.01805456 -0.57517934  0.44733483 -0.08229531 -0.10354503
 -0.44336882 -0.2573726  -0.52780646 -0.66958344 -0.5550

### Optional Exercise
* Use the file cnus.txt as a new input to the model
* Further train the Word2Vec model with the new file
* Check if the vocabulary size has increased
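
A minimal sketch of the loading step, assuming cnus.txt follows the same one-document-per-line layout as articles.txt (the format of this file is not stated here, so treat that as an assumption). `load_token_lists` is a hypothetical helper. The demo writes a temporary file so the block runs on its own.

```python
import os
import tempfile


def load_token_lists(path):
    """Read one document per line and split each non-empty line on whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.split() for line in handle if line.strip()]


# Demo on a throwaway file standing in for cnus.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first new document\n\nsecond new document here\n")
    tmp_path = tmp.name

demo_sentences = load_token_lists(tmp_path)
os.remove(tmp_path)
print(demo_sentences)
```

The resulting list can then be passed to build_vocab(..., update=True) and train(...) in the same way as new_docs in Exercise D.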



In [None]:
# your code

### Optional Exercise
It is worth examining whether a word from the initial training (e.g. w1) that also appears in the later documents keeps the same vector. Check whether the vector for w1 stays the same or changes after the subsequent training.
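
A sketch of the comparison, using plain lists in place of real embeddings. One assumption worth flagging: the array returned by a vector lookup may share memory with the model, so it is safer to take a copy before running further training. `vector_changed` is a hypothetical helper.

```python
import math


def vector_changed(before, after, tol=1e-6):
    """Return True if any component differs by more than `tol`."""
    if len(before) != len(after):
        raise ValueError("vectors must have the same length")
    return any(
        not math.isclose(a, b, abs_tol=tol, rel_tol=0.0)
        for a, b in zip(before, after)
    )


# Stand-ins for the vector of w1 before and after the extra training.
w1_before = list([0.10, -0.20, 0.30])  # list(...) takes a copy
w1_after = [0.10, -0.25, 0.30]

unchanged = vector_changed(w1_before, w1_before)
changed = vector_changed(w1_before, w1_after)
print(unchanged, changed)
```

With NumPy arrays, calling `.copy()` on the vector before training and `numpy.allclose` afterwards would play the same role.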


In [65]:
# your results