# Learning `word2vec` embeddings
#### By: Tu My DOAN & Sali Dauda MOHAMNMED

## Libraries

In [36]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import tensorflow as tf
import keras
import sklearn.preprocessing as pre
from sklearn.utils import shuffle
import pandas as pd
import seaborn as sn
import pickle, random
from xml.dom import minidom
import warnings
warnings.filterwarnings('ignore')
print("Keras version: {}".format(keras.__version__))
print("TensorFlow version: {}".format(tf.__version__))
%matplotlib inline

Keras version: 2.2.4
TensorFlow version: 1.10.0


## Data preparation

First, we read the training labels into a dataframe:

In [94]:
df_train = pd.read_csv("./dmdata/train.csv")
df_train.head()

Unnamed: 0,id,file,earnings: 0 no/ 1 yes
0,1,1.xml,0
1,2,2.xml,0
2,3,3.xml,0
3,4,4.xml,0
4,5,5.xml,0


Define a function to read all of our xml files and return a dataframe:

In [206]:
import os
import glob2

def read_xml(xml_path):
    """Parse every XML file matching xml_path; return (dataframe, files with an empty BODY)."""
    rows = []
    empty_content = []
    for file in glob2.glob(xml_path):
        mydoc = minidom.parse(file)
        file_name = os.path.basename(file)
        items = mydoc.getElementsByTagName('BODY')
        if items[0].firstChild is not None:
            content = items[0].firstChild.data
        else:
            # Empty <BODY>: keep the row, but mark the content as missing
            content = np.nan
            empty_content.append(file_name)
        rows.append({'content': content, 'file': file_name})
    # Build the dataframe once instead of appending row by row
    df = pd.DataFrame(rows, columns=['content', 'file'])
    return df, empty_content
df_xml, empty_content_lst = read_xml("./dmdata/*.xml")
df_xml.head()

Unnamed: 0,content,file
0,Suffield Financial Corp said the\nFederal Rese...,162.xml
1,<Mark Resources Inc> said it\nagreed to sell 5...,1390.xml
2,Southern New England\nTelecommunications Inc s...,604.xml
3,</BODY>The investor group owning about 42 pct\...,2699.xml
4,Rexon Inc said it filed a\nregistration statem...,2841.xml


As shown above, the dataframe holds the content of every XML file together with its file name. Later we will merge it with the training dataframe, which maps file names to labels.

`empty_content_lst` lists the articles that have no body text. There are `416` such items in total: `398` from the training set and `18` from the test set.
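
The empty-body case can be reproduced in isolation. The sketch below parses two small XML strings with `minidom`; the `<DOC>` wrapper and the helper name `body_text` are made up for illustration, and only the `BODY` tag is taken from the files above.

```python
from xml.dom import minidom


def body_text(xml_string):
    """Return the text of the first <BODY> element, or None if it is empty."""
    doc = minidom.parseString(xml_string)
    bodies = doc.getElementsByTagName("BODY")
    # An element without children has firstChild set to None
    if not bodies or bodies[0].firstChild is None:
        return None
    return bodies[0].firstChild.data


# Hypothetical documents: one with text, one with an empty body
with_text = body_text("<DOC><BODY>Shr 27 cts vs 39 cts</BODY></DOC>")
without_text = body_text("<DOC><BODY></BODY></DOC>")
```

Returning `None` (or `np.nan`, as the notebook does) keeps the file in the table, so empty articles can be counted before they are dropped.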

In [207]:
len(empty_content_lst),empty_content_lst

(416,
 ['1435.xml',
  '4559.xml',
  '77.xml',
  '189.xml',
  '4565.xml',
  '3975.xml',
  '2302.xml',
  '1780.xml',
  '4808.xml',
  '2909.xml',
  '994.xml',
  '2060.xml',
  '3419.xml',
  '3343.xml',
  '981.xml',
  '2263.xml',
  '4148.xml',
  '2465.xml',
  '3035.xml',
  '4799.xml',
  '76.xml',
  '1346.xml',
  '4558.xml',
  '611.xml',
  '3586.xml',
  '2698.xml',
  '607.xml',
  '3235.xml',
  '1436.xml',
  '2103.xml',
  '3976.xml',
  '4758.xml',
  '3037.xml',
  '4770.xml',
  '571.xml',
  '3341.xml',
  '4413.xml',
  '3354.xml',
  '216.xml',
  '4836.xml',
  '2466.xml',
  '2328.xml',
  '4759.xml',
  '1437.xml',
  '3591.xml',
  '831.xml',
  '1180.xml',
  '400.xml',
  '4761.xml',
  '1143.xml',
  '4832.xml',
  '1989.xml',
  '3146.xml',
  '4417.xml',
  '789.xml',
  '2067.xml',
  '2098.xml',
  '993.xml',
  '1988.xml',
  '3621.xml',
  '3190.xml',
  '3741.xml',
  '398.xml',
  '2649.xml',
  '1368.xml',
  '830.xml',
  '2675.xml',
  '2113.xml',
  '171.xml',
  '99.xml',
  '2105.xml',
  '2893.xml',
  '396

In [209]:
df_xml.describe

<bound method NDFrame.describe of                                                 content      file
0     Suffield Financial Corp said the\nFederal Rese...   162.xml
1     <Mark Resources Inc> said it\nagreed to sell 5...  1390.xml
2     Southern New England\nTelecommunications Inc s...   604.xml
3     </BODY>The investor group owning about 42 pct\...  2699.xml
4     Rexon Inc said it filed a\nregistration statem...  2841.xml
5     </BODY>Hughes Tool Co said its board voted at\...  3587.xml
6     Standard and Poor's Corp said it\nupgraded Geo...  2855.xml
7     National Bank of Hungary first\nvice-president...  3593.xml
8     Sorg Inc said a group composed of\none-third o...    88.xml
9     Shr 27 cts vs 39 cts\n    Net 481,189 vs 697,3...   610.xml
10    Shr 16 cts vs 35 cts\n    Net 476,000 vs 929,0...  1384.xml
11    Liquefied natural gas imports from\nAlgeria ar...   176.xml
12    Shr 20 cts vs 28 cts\n    Net 393,371 vs 555,9...   638.xml
13    General Instrument Corp has received

In [239]:
df_xml[df_xml['content'].isnull()]

Unnamed: 0,content,file
14,,1435.xml
24,,4559.xml
36,,77.xml
38,,189.xml
40,,4565.xml
59,,3975.xml
75,,2302.xml
85,,1780.xml
99,,4808.xml
131,,2909.xml


We merge the content dataframe with the training-label dataframe on the `file` column, which yields a new dataframe with 4 columns. The training data has only `4,800` items, so the merged dataframe has the same number of rows. The remaining `200` items belong to the test set, which we have not touched yet.

In [210]:
df = pd.merge(df_train,df_xml, on=['file'])
df = df.rename(columns={'earnings: 0 no/ 1 yes': 'label'})
df.head()

Unnamed: 0,id,file,label,content
0,1,1.xml,0,Showers continued throughout the week in\nthe ...
1,2,2.xml,0,Standard Oil Co and BP North America\nInc said...
2,3,3.xml,0,Texas Commerce Bancshares Inc's Texas\nCommerc...
3,4,4.xml,0,BankAmerica Corp is not under\npressure to act...
4,5,5.xml,0,The U.S. Agriculture Department\nreported the ...
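
`pd.merge` performs an inner join by default, which is why only files that appear in both dataframes survive. A toy illustration with invented rows (none of the values below come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical label table: two labelled training files
labels = pd.DataFrame({"id": [1, 2],
                       "file": ["1.xml", "2.xml"],
                       "label": [0, 1]})

# Hypothetical content table: includes an unlabelled file (9.xml) and an empty article
contents = pd.DataFrame({"content": ["text a", np.nan, "text c"],
                         "file": ["1.xml", "2.xml", "9.xml"]})

# Default how="inner": 9.xml has no label, so it is dropped
merged = pd.merge(labels, contents, on=["file"])
```

Rows with missing content are kept by the merge; they are only removed later by `dropna`.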


In [212]:
df_train.describe

<bound method NDFrame.describe of         id      file  earnings: 0 no/ 1 yes
0        1     1.xml                      0
1        2     2.xml                      0
2        3     3.xml                      0
3        4     4.xml                      0
4        5     5.xml                      0
5        6     6.xml                      0
6        7     7.xml                      0
7        8     8.xml                      0
8        9     9.xml                      1
9       10    10.xml                      0
10      11    11.xml                      1
11      12    12.xml                      1
12      13    13.xml                      1
13      14    14.xml                      1
14      15    15.xml                      0
15      16    16.xml                      0
16      17    17.xml                      0
17      18    18.xml                      1
18      19    19.xml                      0
19      21    21.xml                      0
20      22    22.xml                      

We save the merged dataframe in CSV format so that it can be used later to train the prediction model.

In [244]:
df.to_csv('news_datafile.csv',index=False)

As shown below, articles labelled `0` far outnumber those labelled `1`.

In [221]:
df['label'].value_counts(dropna=False)

0    3825
1     975
Name: label, dtype: int64
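
To put the imbalance in proportion, the counts printed above can be turned into class shares. This is a plain-Python sketch that only reuses the two numbers from the output:

```python
# Counts copied from the value_counts() output above
label_counts = {0: 3825, 1: 975}

total = sum(label_counts.values())
# Fraction of articles in each class
label_share = {label: count / total for label, count in label_counts.items()}
```

Roughly one article in five is an earnings article, which is worth keeping in mind when the classifier is evaluated later.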

We drop all rows with `NaN` content: word embeddings are learned from words, so empty articles contribute nothing.

In [242]:
df[df['content'].isnull()]

Unnamed: 0,id,file,label,content
28,30,30.xml,0,
29,31,31.xml,0,
71,76,76.xml,0,
72,77,77.xml,0,
73,78,78.xml,0,
86,92,92.xml,0,
88,94,94.xml,0,
89,95,95.xml,0,
92,99,99.xml,0,
94,101,101.xml,0,


In [245]:
df_new = df.dropna()

In [246]:
df_new.describe

<bound method NDFrame.describe of         id      file  label                                            content
0        1     1.xml      0  Showers continued throughout the week in\nthe ...
1        2     2.xml      0  Standard Oil Co and BP North America\nInc said...
2        3     3.xml      0  Texas Commerce Bancshares Inc's Texas\nCommerc...
3        4     4.xml      0  BankAmerica Corp is not under\npressure to act...
4        5     5.xml      0  The U.S. Agriculture Department\nreported the ...
5        6     6.xml      0  Argentine grain board figures show\ncrop regis...
6        7     7.xml      0  Red Lion Inns Limited Partnership\nsaid it fil...
7        8     8.xml      0  Moody's Investors Service Inc said it\nlowered...
8        9     9.xml      1  Champion Products Inc said its\nboard of direc...
9       10    10.xml      0  Computer Terminal Systems Inc said\nit has com...
10      11    11.xml      1  Shr 34 cts vs 1.19 dlrs\n    Net 807,000 vs 2,...
11      12    12.x

Finally, we have a new dataframe with `4,402` items and no `NaN` values.

In [247]:
df_new.shape

(4402, 4)
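
The row count is consistent with the earlier numbers: `4,800` training items minus the `398` empty ones leaves `4,402`. Below is a small sketch of the same filtering on invented rows. The `subset=["content"]` argument is an addition for illustration (the notebook calls `dropna()` with no arguments); it restricts the check to the text column.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second article has no content
toy = pd.DataFrame({"file": ["1.xml", "2.xml", "3.xml"],
                    "label": [0, 0, 1],
                    "content": ["some text", np.nan, "more text"]})

# Drop only the rows whose content is missing
cleaned = toy.dropna(subset=["content"])

# Consistency check on the counts reported in the notebook
expected_rows = 4800 - 398
```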

In [255]:
df_new.content[0:10].values

array(['Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standing at almost 6.2 mln there\nare a few hundred thousand bags still in the hands of farmers,\nmiddlemen, exporters and processors.\n    There are doubts as to how much of this cocoa

## Learning word embeddings with `Gensim`

### Tokenizing data
We define a function that splits an article into sentences and each sentence into a list of words:

In [373]:
# Ref: [1]
import re, nltk
special_characters = re.compile("[^A-Za-z0-9 ]")
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
def convert_to_sentences(data, tokenizer):
    # First, split the article into sentences
    # using the NLTK punkt tokenizer (english.pickle)
    data = data.lower().replace("<i>", "")
    data = data.replace("    ", " ")
    data = data.replace("\n", ". ")
    #data = data.replace(".", ". ")
    data = data.replace("reuter", "")
    #print(data)
    data = re.sub("  ", " ", data)
    all_sentences = tokenizer.tokenize(data.strip())
    # Second, converting each sentence into words
    sentences = []
    for words in all_sentences:
        s = re.sub(special_characters, "", words.lower())
        if (len(s)) > 0:
            sentences.append(s.split())
    # Finally, returning a list of sentences (containing words in each sentence)
    return sentences

In [336]:
df_new[df_new['file']=='5.xml']

Unnamed: 0,id,file,label,content
4,5,5.xml,0,The U.S. Agriculture Department\nreported the ...


In [319]:
df_new['content'][12]

'Oper shr loss two cts vs profit seven cts\n    Oper shr profit 442,000 vs profit 2,986,000\n    Revs 291.8 mln vs 151.1 mln\n    Avg shrs 51.7 mln vs 43.4 mln\n    Six mths\n    Oper shr profit nil vs profit 12 cts\n    Oper net profit 3,376,000 vs profit 5,086,000\n    Revs 569.3 mln vs 298.5 mln\n    Avg shrs 51.6 mln vs 41.1 mln\n    NOTE: Per shr calculated after payment of preferred\ndividends.\n    Results exclude credits of 2,227,000 or four cts and\n4,841,000 or nine cts for 1986 qtr and six mths vs 2,285,000 or\nsix cts and 4,104,000 or 11 cts for prior periods from\noperating loss carryforwards.\n Reuter\n    '

Sample result of item at index `12` in our dataframe:

In [374]:
sample = convert_to_sentences(df_new['content'][12],tokenizer)
for i in sample:
    print(i)

['oper', 'shr', 'loss', 'two', 'cts', 'vs', 'profit', 'seven', 'cts']
['oper', 'shr', 'profit', '442000', 'vs', 'profit', '2986000', 'revs', '2918', 'mln', 'vs', '1511', 'mln']
['avg', 'shrs', '517', 'mln', 'vs', '434', 'mln']
['six', 'mths']
['oper', 'shr', 'profit', 'nil', 'vs', 'profit', '12', 'cts']
['oper', 'net', 'profit', '3376000', 'vs', 'profit', '5086000', 'revs', '5693', 'mln', 'vs', '2985', 'mln']
['avg', 'shrs', '516', 'mln', 'vs', '411', 'mln']
['note', 'per', 'shr', 'calculated', 'after', 'payment', 'of', 'preferred']
['dividends', 'results', 'exclude', 'credits', 'of', '2227000', 'or', 'four', 'cts', 'and']
['4841000', 'or', 'nine', 'cts', 'for', '1986', 'qtr', 'and', 'six', 'mths', 'vs', '2285000', 'or']
['six', 'cts', 'and', '4104000', 'or', '11', 'cts', 'for', 'prior', 'periods', 'from']
['operating', 'loss', 'carryforwards']
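
The cleaning steps can be traced on a short string without NLTK. The sketch below is a simplified stand-in, not the function used in this notebook: a regular expression replaces the punkt tokenizer, and the name `simple_sentences` is made up. It shows why every line break ends a sentence (each `\n` becomes `". "`) and why `291.8` turns into `2918` (the decimal point is stripped with the other special characters).

```python
import re

SPECIAL = re.compile("[^A-Za-z0-9 ]")
# Assumption: split after ., ! or ? followed by whitespace (stand-in for punkt)
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def simple_sentences(text):
    """Lower-case, turn line breaks into sentence ends, then split into word lists."""
    text = text.lower().replace("    ", " ").replace("\n", ". ")
    text = text.replace("reuter", "")
    text = re.sub("  ", " ", text)
    result = []
    for sentence in SENTENCE_END.split(text.strip()):
        cleaned = SPECIAL.sub("", sentence)
        if cleaned:
            result.append(cleaned.split())
    return result


tokens = simple_sentences("Revs 291.8 mln vs 151.1 mln\n    Six mths")
```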


In [370]:
df_new['file'][4],df_new['label'][4],df_new['content'][4]

('5.xml',
 0,
 'The U.S. Agriculture Department\nreported the farmer-owned reserve national five-day average\nprice through February 25 as follows (Dlrs/Bu-Sorghum Cwt) -\n         Natl   Loan           Release   Call\n         Avge   Rate-X  Level    Price  Price\n Wheat   2.55   2.40       IV     4.65     --\n                            V     4.65     --\n                           VI     4.45     --\n Corn    1.35   1.92       IV     3.15   3.15\n                            V     3.25     --\n X - 1986 Rates.\n\n          Natl   Loan          Release   Call\n          Avge   Rate-X  Level   Price  Price\n Oats     1.24   0.99        V    1.65    -- \n Barley   n.a.   1.56       IV    2.55   2.55\n                             V    2.65    -- \n Sorghum  2.34   3.25-Y     IV    5.36   5.36\n                             V    5.54    -- \n    Reserves I, II and III have matured. Level IV reflects\ngrain entered after Oct 6, 1981 for feedgrain and after July\n23, 1981 for wheat. Level V 

In [371]:
sample = convert_to_sentences(df_new['content'][4],tokenizer)
for i in sample:
    print(i)

the u.s. agriculture department. reported the farmer-owned reserve national five-day average. price through february 25 as follows (dlrs/bu-sorghum cwt) -.    natl   loan     release   call.    avge   rate-x  level price  price.  wheat   2.55   2.40    iv  4.65  --.        v  4.65  --.          vi  4.45  --.  corn 1.35   1.92    iv  3.15   3.15.        v  3.25  --.  x - 1986 rates.. .     natl   loan    release   call.     avge   rate-x  level   price  price.  oats  1.24   0.99  v 1.65 -- .  barley   n.a.   1.56    iv 2.55   2.55.         v 2.65 -- .  sorghum  2.34   3.25-y  iv 5.36   5.36.         v 5.54 -- .  reserves i, ii and iii have matured. level iv reflects. grain entered after oct 6, 1981 for feedgrain and after july. 23, 1981 for wheat. level v wheat/barley after 5/14/82,. corn/sorghum after 7/1/82. level vi covers wheat entered after. january 19, 1984.  x-1986 rates. y-dlrs per cwt (100 lbs).. n.a.-not available..  .  
['the', 'us', 'agriculture', 'department']
['reported', 

In [375]:
sentences = []
for content in df_new.content:
    sentences += convert_to_sentences(content, tokenizer)
print("Done processing.")

Done processing.


In [379]:
for i in sentences[0:10]:
    print("{}\n".format(i))

['showers', 'continued', 'throughout', 'the', 'week', 'in']

['the', 'bahia', 'cocoa', 'zone', 'alleviating', 'the', 'drought', 'since', 'early']

['january', 'and', 'improving', 'prospects', 'for', 'the', 'coming', 'temporao']

['although', 'normal', 'humidity', 'levels', 'have', 'not', 'been', 'restored']

['comissaria', 'smith', 'said', 'in', 'its', 'weekly', 'review', 'the', 'dry', 'period', 'means', 'the', 'temporao', 'will', 'be', 'late', 'this', 'year', 'arrivals', 'for', 'the', 'week', 'ended', 'february', '22', 'were', '155221', 'bags']

['of', '60', 'kilos', 'making', 'a', 'cumulative', 'total', 'for', 'the', 'season', 'of', '593', 'mln', 'against', '581', 'at', 'the', 'same', 'stage', 'last', 'year']

['again', 'it', 'seems']

['that', 'cocoa', 'delivered', 'earlier', 'on', 'consignment', 'was', 'included', 'in', 'the']

['arrivals', 'figures', 'comissaria', 'smith', 'said', 'there', 'is', 'still', 'some', 'doubt', 'as', 'to', 'how']

['much', 'old', 'crop', 'cocoa', 'is', '

In [382]:
len(sentences)

52796

### Training `word2vec` embedding

In [383]:
from gensim.models import word2vec
import logging

In [384]:
num_feature = 100
min_word_count = 5
num_thread = 5
window_size = 10
down_sampling = 0.001
iteration = 30

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = word2vec.Word2Vec(sentences, 
                          iter = iteration,
                          size=num_feature, 
                          min_count = min_word_count, 
                          window = window_size, 
                          sample = down_sampling, 
                          workers=num_thread)

2018-12-01 20:52:30,081 : INFO : collecting all words and their counts
2018-12-01 20:52:30,084 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-12-01 20:52:30,153 : INFO : PROGRESS: at sentence #10000, processed 108681 words, keeping 11280 word types
2018-12-01 20:52:30,217 : INFO : PROGRESS: at sentence #20000, processed 215847 words, keeping 16546 word types
2018-12-01 20:52:30,289 : INFO : PROGRESS: at sentence #30000, processed 324456 words, keeping 20959 word types
2018-12-01 20:52:30,354 : INFO : PROGRESS: at sentence #40000, processed 434961 words, keeping 24652 word types
2018-12-01 20:52:30,425 : INFO : PROGRESS: at sentence #50000, processed 545272 words, keeping 27779 word types
2018-12-01 20:52:30,445 : INFO : collected 28595 word types from a corpus of 575961 raw words and 52796 sentences
2018-12-01 20:52:30,450 : INFO : Loading a fresh vocabulary
2018-12-01 20:52:30,506 : INFO : min_count=5 retains 7651 unique words (26% of original 28595, d

2018-12-01 20:53:29,611 : INFO : EPOCH 4 - PROGRESS: at 71.38% examples, 24113 words/s, in_qsize 10, out_qsize 1
2018-12-01 20:53:31,046 : INFO : EPOCH 4 - PROGRESS: at 80.08% examples, 24220 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:53:32,237 : INFO : EPOCH 4 - PROGRESS: at 88.73% examples, 24697 words/s, in_qsize 7, out_qsize 0
2018-12-01 20:53:32,326 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-12-01 20:53:32,357 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-12-01 20:53:32,871 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-12-01 20:53:33,088 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-12-01 20:53:33,120 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-12-01 20:53:33,122 : INFO : EPOCH - 4 : training on 575961 raw words (415063 effective words) took 15.7s, 26370 effective words/s
2018-12-01 20:53:34,423 : INFO : EPOCH 5 - PROGRESS: at 1.79% examples, 5

2018-12-01 20:54:37,175 : INFO : EPOCH 9 - PROGRESS: at 1.79% examples, 5290 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:54:38,583 : INFO : EPOCH 9 - PROGRESS: at 10.35% examples, 15633 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:54:39,777 : INFO : EPOCH 9 - PROGRESS: at 19.21% examples, 19969 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:54:40,947 : INFO : EPOCH 9 - PROGRESS: at 28.04% examples, 22420 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:54:42,157 : INFO : EPOCH 9 - PROGRESS: at 36.86% examples, 23863 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:54:43,328 : INFO : EPOCH 9 - PROGRESS: at 45.50% examples, 24917 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:54:44,530 : INFO : EPOCH 9 - PROGRESS: at 54.35% examples, 25617 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:54:46,061 : INFO : EPOCH 9 - PROGRESS: at 62.79% examples, 25297 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:54:47,561 : INFO : EPOCH 9 - PROGRESS: at 71.38% examples, 25118 words/s, in_qsize 10, out_q

2018-12-01 20:55:48,503 : INFO : EPOCH 13 - PROGRESS: at 71.38% examples, 26494 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:55:49,698 : INFO : EPOCH 13 - PROGRESS: at 80.04% examples, 26852 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:55:50,833 : INFO : EPOCH 13 - PROGRESS: at 88.73% examples, 27260 words/s, in_qsize 7, out_qsize 0
2018-12-01 20:55:51,026 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-12-01 20:55:51,048 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-12-01 20:55:51,523 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-12-01 20:55:51,746 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-12-01 20:55:51,800 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-12-01 20:55:51,801 : INFO : EPOCH - 13 : training on 575961 raw words (415097 effective words) took 14.4s, 28762 effective words/s
2018-12-01 20:55:52,915 : INFO : EPOCH 14 - PROGRESS: at 1.79% example

2018-12-01 20:56:48,641 : INFO : EPOCH 18 - PROGRESS: at 1.79% examples, 6669 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:56:49,814 : INFO : EPOCH 18 - PROGRESS: at 10.21% examples, 19265 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:56:50,992 : INFO : EPOCH 18 - PROGRESS: at 19.15% examples, 23089 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:56:52,121 : INFO : EPOCH 18 - PROGRESS: at 28.04% examples, 25205 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:56:53,359 : INFO : EPOCH 18 - PROGRESS: at 36.86% examples, 26041 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:56:54,678 : INFO : EPOCH 18 - PROGRESS: at 45.50% examples, 26272 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:56:55,950 : INFO : EPOCH 18 - PROGRESS: at 54.30% examples, 26574 words/s, in_qsize 10, out_qsize 0
2018-12-01 20:56:57,082 : INFO : EPOCH 18 - PROGRESS: at 62.87% examples, 27187 words/s, in_qsize 8, out_qsize 1
2018-12-01 20:56:58,233 : INFO : EPOCH 18 - PROGRESS: at 71.36% examples, 27608 words/s, in_qsize

2018-12-01 20:57:52,769 : INFO : EPOCH 22 - PROGRESS: at 62.87% examples, 26217 words/s, in_qsize 10, out_qsize 0
2018-12-01 20:57:53,930 : INFO : EPOCH 22 - PROGRESS: at 71.38% examples, 26704 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:57:55,068 : INFO : EPOCH 22 - PROGRESS: at 80.04% examples, 27154 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:57:56,176 : INFO : EPOCH 22 - PROGRESS: at 88.73% examples, 27597 words/s, in_qsize 7, out_qsize 0
2018-12-01 20:57:56,360 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-12-01 20:57:56,367 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-12-01 20:57:56,837 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-12-01 20:57:57,006 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-12-01 20:57:57,072 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-12-01 20:57:57,073 : INFO : EPOCH - 22 : training on 575961 raw words (415051 effective

2018-12-01 20:58:51,207 : INFO : EPOCH - 26 : training on 575961 raw words (414611 effective words) took 13.4s, 30950 effective words/s
2018-12-01 20:58:52,291 : INFO : EPOCH 27 - PROGRESS: at 1.79% examples, 6796 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:58:53,448 : INFO : EPOCH 27 - PROGRESS: at 10.21% examples, 19573 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:58:54,584 : INFO : EPOCH 27 - PROGRESS: at 19.21% examples, 23547 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:58:55,703 : INFO : EPOCH 27 - PROGRESS: at 27.98% examples, 25737 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:58:56,883 : INFO : EPOCH 27 - PROGRESS: at 36.86% examples, 26735 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:58:58,093 : INFO : EPOCH 27 - PROGRESS: at 45.50% examples, 27273 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:58:59,223 : INFO : EPOCH 27 - PROGRESS: at 54.30% examples, 27896 words/s, in_qsize 9, out_qsize 0
2018-12-01 20:59:00,340 : INFO : EPOCH 27 - PROGRESS: at 62.87% examples, 2

In [385]:
model.save("./gensim_word2vec_model_011218")

2018-12-01 20:59:45,349 : INFO : saving Word2Vec object under ./gensim_word2vec_model_011218, separately None
2018-12-01 20:59:45,352 : INFO : not storing attribute vectors_norm
2018-12-01 20:59:45,363 : INFO : not storing attribute cum_table
2018-12-01 20:59:45,568 : INFO : saved ./gensim_word2vec_model_011218


In [386]:
print("Total of words: {}".format(len(model.wv.vocab)))

Total of words: 7651
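
The vocabulary is much smaller than the `28,595` word types seen in the corpus because `min_word_count = 5` discards every word that occurs fewer than five times. The sketch below illustrates this kind of frequency filter on invented sentences; `retained_vocab` is a hypothetical helper, not a Gensim function.

```python
from collections import Counter


def retained_vocab(sentences, min_count):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Hypothetical tokenized sentences
toy_sentences = [["net", "profit", "vs", "loss"],
                 ["net", "loss"],
                 ["net", "revs"]]

toy_vocab = retained_vocab(toy_sentences, min_count=2)
```

Rare words get too few training examples to receive a reliable vector, which is the usual motivation for such a threshold.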


### Testing the quality of the model in terms of semantics and syntax

In the example below, the similarity between the two numbers `1` and `2` is quite high, at about `0.79`.

In [387]:
print(model.wv.similarity('1', '2'))

0.7935787295718684


Below is similarity for `loan` and `money`:

In [409]:
print(model.wv.similarity('loan', 'money'))

0.36741977267850645
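
The score returned by `similarity` is the cosine of the angle between the two word vectors. A minimal NumPy sketch on made-up 2-dimensional vectors:

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical vectors, not taken from the trained model
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The value depends only on direction, not on vector length: `1` means the vectors point the same way, `0` means they are orthogonal.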


We can also find the most similar words for a specific word like `bank`. The most similar word will be placed at the top of the result.

In [399]:
model.wv.most_similar('bank')

[('banks', 0.5285881161689758),
 ('bankers', 0.45777976512908936),
 ('authorities', 0.423633337020874),
 ('measure', 0.40743112564086914),
 ('money', 0.393973708152771),
 ('dealers', 0.39064037799835205),
 ('lending', 0.38939034938812256),
 ('deposits', 0.38620954751968384),
 ('requirement', 0.37992456555366516),
 ('bundesbank', 0.3742866814136505)]
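
Conceptually, `most_similar` ranks the other vocabulary words by their cosine similarity to the query word. The sketch below shows that ranking on an invented 2-dimensional vocabulary; the helper `rank_neighbours` and all vectors are hypothetical.

```python
import numpy as np


def rank_neighbours(query, vectors, topn=2):
    """Rank all words except the query by cosine similarity to it."""
    unit = {w: np.asarray(v, dtype=float) / np.linalg.norm(v) for w, v in vectors.items()}
    scores = [(w, float(np.dot(unit[query], u))) for w, u in unit.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors for illustration only
toy_vectors = {"bank": [1.0, 0.0],
               "banks": [0.9, 0.1],
               "money": [0.7, 0.7],
               "weather": [0.0, 1.0]}

ranked = rank_neighbours("bank", toy_vectors)
```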

Among the four words `money`, `bank`, `loan`, and `weather`, the model picks `weather` as the one that does not belong.

In [404]:
print(model.wv.doesnt_match("money bank loan weather".split()))

weather
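
As far as we understand it, `doesnt_match` normalizes the word vectors, averages them, and returns the word whose vector is least aligned with that average. The sketch below follows this idea on invented vectors; it is an approximation for illustration, not Gensim's code.

```python
import numpy as np


def odd_one_out(vectors):
    """Return the word whose unit vector has the smallest dot product with the mean."""
    words = list(vectors)
    unit = np.array([np.asarray(vectors[w], dtype=float) / np.linalg.norm(vectors[w])
                     for w in words])
    mean = unit.mean(axis=0)
    return words[int(np.argmin(unit @ mean))]


# Made-up vectors: three finance words point one way, 'weather' another
toy = {"money": [1.0, 0.2],
       "bank": [0.9, 0.3],
       "loan": [1.0, 0.1],
       "weather": [0.1, 1.0]}

outlier = odd_one_out(toy)
```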


To display the vector representation of the word `loan`:

In [436]:
print("Vector for word 'loan': \n{}".format(model.wv.get_vector('loan')))

Vector for word 'loan': 
[-3.2303596e-01 -2.4156787e+00  4.4412516e-02  4.5770600e-01
 -2.5768003e+00 -2.7601206e-01  1.2320732e+00 -9.4704521e-01
  7.5148098e-02 -7.6155037e-01 -9.3001324e-01  3.0199960e-01
 -1.2191356e+00 -2.3171395e-01 -6.5511161e-01  1.6387726e+00
  1.9019133e-01  1.2324833e+00 -6.8817651e-01  1.8227566e+00
 -2.2284260e+00 -3.9081306e+00 -1.4234127e+00 -2.7999072e+00
  1.3827580e+00  3.1156006e-01 -1.4501753e+00 -3.0824441e-01
  7.7271469e-02  3.0485579e-01 -1.8449939e+00  3.4685783e+00
  8.2687175e-01  6.0459113e-01 -1.9342793e+00 -3.5091162e-02
 -1.4164740e+00  3.4551995e+00  3.8812149e+00 -6.1105329e-01
 -1.7352334e-01  3.4135205e-01  7.5961894e-01  2.8761439e+00
 -1.3272648e+00 -4.3128464e-01  1.1381413e-01  1.7972025e-01
 -5.4642242e-01 -1.9100033e+00 -9.3116874e-01  2.4653428e+00
  1.2924670e+00  2.7447934e+00  2.6395755e+00  8.2297546e-01
 -3.0392332e+00  7.5616819e-01  2.4810507e+00 -1.0917925e+00
 -1.6387278e+00  5.5766636e-01  2.4589796e-03  1.4736474e-01

In [433]:
print("At index 43 is the word: {}".format(model.wv.index2word[43]))

At index 43 is the word: bank


In [432]:
print("Index of word 'bank' is: {}".format(model.wv.vocab['bank'].index))

Index of word 'bank' is: 43


### Create embedding vector list for words

After training the `word2vec` model, we separate the vocabulary words and their corresponding vectors into 2 different lists, which will later serve as inputs to our deep learning model. We add the character `-` at the beginning of the list as the padding value and `unk` at the end as the unknown word. The use of padding and unknown tokens is discussed in more detail in the deep learning model section.

In [415]:
word_lst = []
# adding padding value
word_lst.append('-')
length = len(model.wv.vocab)
# vector dimension will be the same with the word2vec model
vec_dim = 100
# adding '-' for padding value and 'unk' for word that can't be found in the model
word_vec = np.zeros((length + 2, vec_dim))
for i in range (length):
    word = model.wv.index2word[i]
    vector = model.wv[word]
    word_vec[i+1] = vector
    word_lst.append(word)
# adding unknown word
word_lst.append('unk')

In [416]:
print("Number of words in list: {}".format(len(word_lst)))
print("Shape of word vector: {}".format(word_vec.shape))

Number of words in list: 7653
Shape of word vector: (7653, 100)
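
Because the padding token is inserted at position `0`, every vocabulary word is shifted by one: `bank` has index `43` in the model and `44` in `word_lst`. The sketch below rebuilds the same layout on a tiny invented vocabulary and shows one possible way to turn tokens into a fixed-length index sequence. The `encode` helper and its `max_len` parameter are assumptions for illustration; the encoding actually used is described in the deep learning section.

```python
import numpy as np


def build_lookup(vocab_words, vectors):
    """Prepend '-' (padding) and append 'unk'; both get zero vectors."""
    words = ["-"] + list(vocab_words) + ["unk"]
    matrix = np.zeros((len(words), vectors.shape[1]))
    matrix[1:-1] = vectors
    return words, matrix


def encode(tokens, word_to_id, max_len):
    """Map tokens to indices, send unknown words to 'unk', pad with 0."""
    unk_id = word_to_id["unk"]
    ids = [word_to_id.get(token, unk_id) for token in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical two-word vocabulary with 2-dimensional vectors
words, matrix = build_lookup(["bank", "loan"], np.array([[1.0, 2.0], [3.0, 4.0]]))
word_to_id = {word: index for index, word in enumerate(words)}
encoded = encode(["loan", "weather", "bank"], word_to_id, max_len=5)
```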


Both the padding token and the unknown token are assigned zero vectors.

In [441]:
print("First word in vocab: {}\n".format(word_lst[0]))
print("Vector values: \n{}".format(word_vec[0]))

First word in vocab: -

Vector values: 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]


In [440]:
print("Last word in vocab: {}".format(word_lst[-1]))
print("Vector value: \n{}".format(word_vec[-1]))

Last word in vocab: unk
Vector value: 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]


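Let's save them so that we can load them later when training the text classification model. The `np.save` calls are commented out in the cell below; uncomment them to write the files.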

In [419]:
#np.save('word_lst_gensim_w2v', word_lst)
#np.save('word_vec_gensim_w2v', word_vec)

Below are a few checks on the saved lists:

In [437]:
load_word_lst = np.load('word_lst_gensim_w2v.npy').tolist()
load_word_vec = np.load('word_vec_gensim_w2v.npy')

In [430]:
model.wv.word_vec('bank')

array([-0.23191258,  3.7256334 , -4.3452015 ,  1.0801154 , -3.7996838 ,
       -1.1255977 , -1.7133667 ,  1.6471709 ,  2.073226  ,  1.1289417 ,
       -1.1795563 ,  0.14911233,  0.89349145,  2.4277902 , -4.1869807 ,
       -2.626644  , -0.6793582 ,  6.502421  , -3.83722   ,  2.8329139 ,
        2.7426562 , -3.8311145 ,  3.1579711 , -1.022657  ,  0.4647392 ,
        0.9899303 , -1.4741098 , -1.7771564 ,  0.82412255, -0.6090162 ,
       -2.5584276 ,  5.6834025 ,  1.3674535 , -0.15203974, -2.073013  ,
        3.1131816 ,  0.8168806 , -0.508884  , -0.9480829 , -1.2174823 ,
       -0.04121821,  3.1202126 , -1.4758865 ,  4.934309  , -0.6888991 ,
        1.0782565 , -1.8044974 ,  0.6940486 , -1.0348865 ,  0.14809468,
       -1.7579285 , -2.8125982 , -1.2737167 ,  1.6908863 ,  3.747259  ,
        0.95193034, -3.3165982 , -0.27914476, -0.7361819 ,  0.7466166 ,
       -4.218593  ,  1.4799529 , -2.7139335 , -1.0337161 , -1.6900539 ,
        0.7303991 ,  0.23056044,  4.2733393 , -4.370119  ,  3.02

In [438]:
print("Index of word 'bank' in word_list: {}".format(load_word_lst.index("bank")))
# Note: index 16 belongs to a different word; 'bank' sits at index 44 and is compared in the next cell
print("Vector at index 16:\n{}".format(load_word_vec[16]))

Index of word 'bank' in word_list: 44
Vector at index 16:
[ 0.77163297  0.72085327 -0.17937817 -0.20104121 -0.69574487  0.43511963
 -0.57506287 -0.01126903  0.77312046  0.47661039 -0.62783295 -0.50222337
 -1.01169503 -0.73183298  1.15113175  1.01346171  3.33355951 -0.3658703
 -1.68693757  0.52608061 -2.01098037  0.20490231  0.05878011 -0.70031399
 -2.17487311  1.16204977 -2.3982594   1.77197659 -0.88600147  1.37544596
 -0.74634874  0.66399968 -2.09872293 -0.28664479 -1.54224348  0.04331245
 -2.49897432  2.14435387 -0.64037412 -1.57764399  1.54033315  1.75365996
 -1.55189586  0.64460242 -0.19954933 -2.0546608  -0.24295682 -0.33705029
  0.33937937  1.14508641  0.21413738 -1.78668499 -2.04637361  0.23480865
 -1.13142443 -0.61997014  0.59658432 -0.00985729  1.17620337  0.06949378
  1.05523515  2.03563571  0.29630861  0.12656495  0.4541429  -2.73681521
  0.92469925 -1.2683326  -1.79471838  0.43595925  0.57826906 -0.37551475
 -1.47156572  2.06419945  1.63619328  1.51565218 -0.58776575  1.

In [439]:
print("Result: \n{}".format(model.wv.word_vec('bank') - load_word_vec[44]))

Result: 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]
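
A more compact alternative to printing the element-wise difference is `np.array_equal`. The sketch below runs the whole save/load round trip on a tiny invented word list in a temporary directory; the file names are placeholders, not the ones used above.

```python
import os
import tempfile

import numpy as np

# Hypothetical data standing in for word_lst and word_vec
words = ["-", "bank", "unk"]
vectors = np.array([[0.0, 0.0], [1.5, -2.0], [0.0, 0.0]])

with tempfile.TemporaryDirectory() as tmp:
    # np.save appends the .npy extension when it is missing
    np.save(os.path.join(tmp, "words"), words)
    np.save(os.path.join(tmp, "vectors"), vectors)
    loaded_words = np.load(os.path.join(tmp, "words.npy")).tolist()
    loaded_vectors = np.load(os.path.join(tmp, "vectors.npy"))

round_trip_ok = loaded_words == words and np.array_equal(loaded_vectors, vectors)
```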


# References

1. DOAN Tu My, Learning Word Embeddings - https://github.com/doantumy/Word-Embeddings
2. Gensim Package - https://radimrehurek.com/gensim/models/word2vec.html