# **How to train your <del>dragon</del> custom word embeddings**

In the [baseline-keras-lstm-is-all-you-need](https://www.kaggle.com/huikang/baseline-keras-lstm-is-all-you-need) notebook shared by Hui Kang (thanks again!), it was demonstrated that an LSTM model using generic pretrained GloVe (Global Vectors) embeddings achieves a pretty solid benchmark result.

After playing around with GloVe, you will quickly find that certain words in your training data are not present in its vocabulary. These are typically replaced with a same-shape zero vector, which essentially means you are 'sacrificing' the word as an input feature, even though it could be important for a correct prediction. Another way to deal with this is to train your own word embeddings on your training data, so that the semantic relationships in your own corpus are better represented.
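To make the 'sacrificed word' problem concrete, here is a minimal sketch of the zero-vector fallback. The `pretrained` dictionary, its values and the `lookup` helper are made up for illustration; a real GloVe lookup would hold hundreds of thousands of 300-d vectors.

```python
import numpy as np

# Toy stand-in for a pretrained embedding lookup (hypothetical values).
pretrained = {
    "phone": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "case": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}
embedding_dim = 3


def lookup(word):
    """Return the pretrained vector, or a same-shape zero vector if the word is missing."""
    return pretrained.get(word, np.zeros(embedding_dim, dtype=np.float32))


title = "phone case jetblack".split()
vectors = np.stack([lookup(w) for w in title])

# Words that fell back to the zero vector carry no signal for the model.
oov_words = [w for w in title if w not in pretrained]

print(vectors.shape)
print(oov_words)
```

Counting `oov_words` over your whole training set is a quick way to see how much a generic embedding is costing you.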

In this notebook, I will demonstrate how to train your custom word2vec using Gensim.

For those who are new to word embeddings and would like to find out more, you can check out the following articles:
1. [Introduction to Word Embedding and Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
2. [A Beginner's Guide to Word2Vec and Neural Word Embeddings](https://skymind.ai/wiki/word2vec)

In [1]:
import numpy as np
import pandas as pd
import os
import re
import time

from gensim.models import Word2Vec
from tqdm import tqdm

tqdm.pandas()

## **Something to take note**
Word2vec is a **self-supervised** method (it trains on unlabelled text, but it is not purely unsupervised because it generates its own labels from that text; check out this [Quora](https://www.quora.com/Is-Word2vec-a-supervised-unsupervised-learning-algorithm) thread for a more detailed explanation), so we can make full use of the entire dataset (including test data) to obtain a more complete word embedding representation.
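As a rough illustration of how the labels come from the text itself, the sketch below builds skip-gram style (center, context) pairs from one tokenised title. The `skipgram_pairs` helper is hypothetical and heavily simplified: Gensim's actual training also involves things like window-size sampling and negative sampling, which are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Made-up title, for illustration only.
pairs = skipgram_pairs("apple iphone 7 jetblack".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair acts as one training example: the model sees the center word and is asked to predict the context word, so no human labelling is needed.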

In [3]:
df_train = pd.read_csv('../data/new_train.csv')
df_test = pd.read_csv('../data/new_test.csv')

In [4]:
sentences = pd.concat([df_train['title'], df_test['title']], axis=0)
train_sentences = list(sentences.progress_apply(str.split).values)

100%|██████████| 839017/839017 [00:01<00:00, 568440.59it/s]


In [5]:
# Parameters reference : https://www.quora.com/How-do-I-determine-Word2Vec-parameters
# Feel free to customise your own embedding

start_time = time.time()

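model = Word2Vec(sentences=train_sentences,
                 sg=1,       # 1 = skip-gram, 0 = CBOW
                 size=300,   # embedding dimension (renamed to vector_size in Gensim 4.x)
                 workers=4)  # number of training threads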

print(f'Time taken : {(time.time() - start_time) / 60:.2f} mins')

Time taken : 0.81 mins


## **Pretty fast, isn't it?**

Let's check out some of the features of the custom word vectors.

In [6]:
# Total number of words in our custom embedding vocabulary
# (Gensim 3.x API; in Gensim 4.x use len(model.wv.key_to_index))

len(model.wv.vocab.keys())

15702

In [7]:
# Check out the dimension of each word (we set it to 300 in the above training step)

model.wv.vector_size

300

In [8]:
# Check out how 'iphone' is represented (an array of 300 numbers)

model.wv.get_vector('iphone')

array([ 0.5451705 ,  0.17535159, -0.46454248, -0.17189127,  0.5599154 ,
       -0.5107339 ,  0.16901785, -0.18851307,  0.14714868,  0.09982701,
       -0.08246141, -0.17406596,  0.2804124 , -0.0507641 , -0.48802358,
       -0.36542034, -0.2145641 , -0.04688673,  0.1424934 ,  0.1637054 ,
       -0.27545628,  0.17334777,  0.2657709 , -0.19411126, -0.04116624,
       -0.49339792, -0.46244827, -0.05299556, -0.42241699,  0.03331193,
        0.13610442,  0.29399866, -0.02095415,  0.0515291 ,  0.32896805,
       -0.11736588, -0.2722362 ,  0.43094817,  0.05486928, -0.31320786,
        0.02036065,  0.15312116, -0.36650494,  0.4689258 ,  0.7094724 ,
       -0.08336945, -0.19291283, -0.68246174, -0.02891858,  0.57406276,
        0.2672893 ,  0.1489239 ,  0.08525643,  0.05862563, -0.20578611,
        0.02702558, -0.23098549,  0.33465037, -0.13116418,  0.5667268 ,
        0.05880717,  0.15229182,  0.23845781,  0.61248153, -0.2200646 ,
       -0.256163  , -0.0071689 ,  0.12713589,  0.09662355, -0.27

## Now, why are word embeddings powerful? 

This is because they capture the semantic relationships between words. In other words, words with similar meanings should appear near each other in the vector space of our custom embeddings.

Let's check out an example:
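'Near each other' is usually measured with cosine similarity, which is also the score Gensim's `most_similar` reports. Here is a small sketch with made-up 3-d vectors (not taken from the trained model) to show what the number means:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors, for illustration only.
iphone = np.array([1.0, 2.0, 0.0])
apple = np.array([2.0, 4.0, 0.0])    # same direction as `iphone`, twice the length
charger = np.array([0.0, 0.0, 3.0])  # orthogonal to `iphone`

sim_same = cosine_similarity(iphone, apple)
sim_orth = cosine_similarity(iphone, charger)
print(sim_same, sim_orth)
```

Because the score depends only on direction, two words can be close neighbours even if their vectors differ a lot in length.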

In [9]:
# Find words with similar meaning to 'iphone'

model.wv.most_similar('iphone')

[('jetblack', 0.637566089630127),
 ('spacegrey', 0.6348098516464233),
 ('originaliphone', 0.6347168684005737),
 ('splus', 0.6323285102844238),
 ('selleriphone', 0.6306023001670837),
 ('apple', 0.629304051399231),
 ('iph', 0.6253551244735718),
 ('cpo', 0.6204924583435059),
 ('singapura', 0.6125050783157349),
 ('mateblack', 0.6120768189430237)]

Well, you will see the words most similar to 'iphone', sorted by cosine similarity (highest first).
Of course, there are also some less intuitive or less relevant ones (e.g. jetblack, cpo, singapura). If you would like to tackle this, you can do more thorough pre-processing or try other embedding dimensions.
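For the pre-processing route, one possible starting point is sketched below. The `clean_title` helper and its regex are assumptions rather than the steps used in this notebook; note that it drops every non-ASCII character, which may not be what you want for multilingual titles.

```python
import re


def clean_title(text):
    """Lowercase, replace anything that is not a-z, 0-9 or whitespace with a space, then split."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


# Made-up title, for illustration only.
tokens = clean_title("Apple iPhone-7 (Jet Black) 128GB!!")
print(tokens)
```

You could swap this in for `str.split` when building `train_sentences`, then retrain and compare the `most_similar` output.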


## **The most important part!**
Last but not least, save your word embeddings so that you can use them for modelling. You can load the text file next time with Gensim's `KeyedVectors.load_word2vec_format` function.

In [10]:
model.wv.save_word2vec_format('custom_glove_300d.txt')


# How to load:
# from gensim.models import KeyedVectors
# w2v = KeyedVectors.load_word2vec_format('custom_glove_300d.txt')

# How to get vector using loaded model
# w2v.get_vector('iphone')
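If you are curious what the saved file looks like, the word2vec text format is simple: a header line with the vocabulary size and vector dimension, then one line per word with the word followed by its space-separated values. The sketch below writes a tiny made-up file and reads it back without Gensim; `read_word2vec_text` is a hypothetical helper, and `KeyedVectors.load_word2vec_format` remains the more robust option.

```python
import os
import tempfile

import numpy as np


def read_word2vec_text(path):
    """Parse a word2vec text file into a {word: vector} dict plus the dimension."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())  # header line
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors, dim


# Write a toy file in the same layout (values are made up).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_vectors.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("2 3\n")
        f.write("iphone 0.1 0.2 0.3\n")
        f.write("apple 0.4 0.5 0.6\n")
    toy_vectors, toy_dim = read_word2vec_text(path)

print(toy_dim, sorted(toy_vectors))
```

A plain dict like this is handy when you only need to fill an embedding matrix for your model and do not need similarity queries.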
