This tutorial is part-2 of [Comprehensive tutorial on NLP](https://www.kaggle.com/kksienc/comprehensive-nlp-tutorial-1-ml-perspective). In this part we will learn about word embedding and see how Deep learning has simplified NLP processing. 

Pre-requsite: Basic Deep learning undestanding would be helpful though not mandatory. 

<a class="kk" id="0.1"></a>
## Contents

1. [Introduction to Word Embedding](#1)
    1. [Dimensionality](#1.1) 
    1. [Padding](#1.2)
    1. [Euclidean Distance](#1.3)
    1. [Cosine Similarity](#1.4)
1. [Word Embedding Techniques](#2)
    1. [Word2Vec](#2.1)
        1. [Skip-Gram](#2.1.1)
        1. [CBOW (Continuous Bag of Words)](#2.1.2)
    1. [GloVe](#2.2) 
    1. [FastText](#2.3)
1. [Text to Numeric Convertion Using Word Vectors](#3)
    1. [Vector Averaging](#3.1)
        1. [Vector Averaging With Word2Vec](#3.1.1)
        1. [Vector Averaging With GloVe](#3.1.2)
        1. [Vector Averaging With FastText](#3.1.3) 
    1. [Embedding Matrix & Keras Embedding layer](#3.2)
        1. [Word2Vec Embedding layers](#3.2.1)
        1. [GloVe Embedding layers](#3.2.2)
        1. [FastText Embedding layers](#3.2.3)      
1. [Deep Learning models](#4)
    1. [CNN](#4.1)
    1. [Simple RNN](#4.2)
    1. [Recurrent Neural Network -LSTM](#4.3)
    1. [Recurrent Neural Network – GRU](#4.4)
    1. [Bidirectional RNN](#4.5)
 

# 1. Introduction to Word Embedding  <a class="kk" id="1"></a>
[Back to Contents](#0.1)

Word Embedding is also known as Word Vectorization. It means converting word into vector. Vectors are numeric representation of a point in space. Mathematically vectors are 1D array or sequence of numbers.  


<B>Why we need Word Embedding? </B>

A problem with our previous text to numeric conversion techniques was that they ignore synonyms for example word 'measure' and ‘calculate’ were represented differently however in most sentences they can be used interchangeably. In Word Embedding similar words are spatially close to each other in vector space. Word Embedding is also capable of preserving semantic and syntactic similarity and relation with other words. The vector representation are such that geometric transformation adopts syntax and semantic. For instance, by adding a “female” vector to the vector “king”, we obtain the vector “queen” and by adding a “plural” vector to the vector “king”, we obtain “kings”. 

Another problem we observe in part 1 was production of high dimensionality sparse matrix. Word Embedding produces low dimensionality dense matrix.

<img src="https://storage.googleapis.com/kagglesdsdata/datasets/598303/1078299/textproc.jpg?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1587069821&Signature=Gu0D8VdPGbHqYR0hgSyEaPbZlM95%2FIUtleAJeZVSiF4cu8%2FwVFtOdIjebA0SkD%2BBTxwEdu6gFhtbvIszo8diz22CsNNQ4oNdR0qNY3IjKgtYiy1qkA3bt5ExfeA%2BpH1TAOnDmMFqUZ9JvO5f7x1c4SfjM5aN269zGUESZJQZwMJJl1CWlYYrEtJALpXUYrk8gZ%2BKwucdEipTInTQNJG1qR32bInzOk7nN88DUrc6kxfd0aZn7mN%2FP%2FZ87d4JMdJ3ul7hAJm42vPEmFe4pbLDR1k9xwGYzlN1AM0cVJs6M2Z7StFJ4uSMEDhc5Iil6xn%2BbHroEZyoBVGr7rjWMPf6Rg%3D%3D" width="250">

Before applying Word Embedding techniques lets look into into few common NLP vocabulary terms.
 

## 1.1 Dimensionality  <a class="kk" id="1.1"></a>

Dimensionality refers to the length of vectors.

## 1.2 Padding <a class="kk" id="1.2"></a>
Padding is task of appending string up to given specific length with whitespaces. Padding is used to represent all records as fixed length.

## 1.3 Euclidean Distance <a class="kk" id="1.3"></a>

Euclidean distance is the shortest distance between two points in (Euclidean) space.

<img src="https://storage.googleapis.com/kagglesdsdata/datasets/598303/1078299/eucldeandistance.png?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1587070547&Signature=ND5YK6rJrRHMwJ4erD9fFQCfBYlL2YIcWVNQJsi1o%2FK9RgGgj6egkkqdlcHsBShwx%2B9uovT1Fwwnv9xnGhKigjdtMHHWA%2F8%2Fi2E3qK3iScr2o9iDYjv5WXQ0WeXMV8AUq%2F1nWMqCLPsSfwB0hc%2FrgVrHi5xtOxBaGlt%2BP7X73d8cSK4WoPBI%2FhJswcqruQFDxrO2%2BvIupOzibAC6XBLQ%2BM%2BS0H0Fpuyxke45F%2B6qqsLpyFzNUFfzQzUXM6n6JvLMUZVLEjP%2BaikoujgvFZ9AmFtljGZ%2FOzKuYWNdFtuvZ55aQkLAPI9yU%2FlUDmI45y%2Bq%2BO43e7ql%2FhsDPyhcaG%2Fh8g%3D%3D" width="250">

## 1.4 Cosine Similarity  <a class="kk" id="1.4"></a>
Cosine similarity is a measure of similarity between two nonzero vectors of an inner product space. It measures the cosine of the angle between them.

<img src="https://storage.googleapis.com/kagglesdsdata/datasets/598303/1078299/cosineSimilarity.png?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1587070566&Signature=iebe54mjBjdPju%2Bl5KwvX5iJTg0jpt%2FH%2FvQUmGuCxpG1t0zoPlmugIHc%2ByBnoLf3iu8yjxg%2FdK7Qj6G6iSS3JsQjt%2FEwavxrB3IjOHH0%2Fu3%2F3vE8Q1cY4tZSEjTI4u98VuDbpYUUPHa%2FRSe6IE2%2BVhZxeaQ1jcSI8LshJncCCrmS9IOxr4U2vN%2F7q6hfXJgqIin6jO%2B0PX%2BeUw1PNa%2BObhmve4zeA2rJ4qzhEnoMY6mF6vRm%2Bf3%2FsMew1crd67GVzoMFlmwisfS6g%2Fb2GR620YjoPL5cZ6XrxAmFnZQEqh2ur%2FGvpRfcPRCm1rrL0N%2FQc%2FAm7fRXbiMibWNqi%2FBD6w%3D%3D" width="250">
 

# 2. Word Embedding Techniques <a class="kk" id="2"></a>
[Back to Contents](#0.1)

Now we will look into word Embedding techniques but before that let's fetch our [Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started) dataset and clean it as we did in part 1.

#####  Fetch & Clean Dataset

In [3]:
#Data  make it ready 
#!pip install pyspellchecker
import pandas as pd
from nltk.corpus import stopwords 
from nltk.corpus import wordnet
from spellchecker import SpellChecker
from nltk.stem import WordNetLemmatizer 
import nltk 
import re

train_df = pd.read_csv("nlp-getting-started/train.csv")
test_df = pd.read_csv("nlp-getting-started/test.csv")


def convert_to_antonym(sentence):
    words = nltk.word_tokenize(sentence)
    new_words = []
    temp_word = ''
    for word in words:
        antonyms = []
        if word == 'not':
            temp_word = 'not_'
        elif temp_word == 'not_':
            for syn in wordnet.synsets(word):
                for s in syn.lemmas():
                    for a in s.antonyms():
                        antonyms.append(a.name())
            if len(antonyms) >= 1:
                word = antonyms[0]
            else:
                word = temp_word + word # when antonym is not found, it will
                                    # remain not_happy
            
            temp_word = ''
        if word != 'not':
            new_words.append(word)
    return ' '.join(new_words)


def correct_spellings(text):
    spell = SpellChecker()
    corrected_words = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            corrected_words.append(spell.correction(word))
        else:
            corrected_words.append(word)
    return " ".join(corrected_words)
        

 
 
def clean_text(text):
    """
        text: a string
        
        return: modified initial string
  """
    text = text.lower() # lowercase text
    text= re.sub(r'[^\w\s#]',' ',text) #Removing every thing other than space, word and hash
    text  = re.sub(r"https?://\S+|www\.\S+", "", text )
    text= re.sub(r'[0-9]',' ',text)
    #text = correct_spellings(text)
    text = convert_to_antonym(text)
    text = re.sub(' +', ' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text    
    return text


train_df['text'] = train_df['text'].apply(clean_text)
test_df['text'] = test_df['text'].apply(clean_text)

sentences= pd.DataFrame(columns=['text'])
sentences['text']= pd.concat([train_df["text"], test_df["text"]])

from collections import defaultdict
tokens_list = [row.split() for row in sentences['text']]



## 2.1 Word2Vec    <a class="kk" id="2.1"></a>

 Word2Vec is group of related models that are used to produce Word Embeddings. It was created & patented by Tomas Mikolov and a group of a research team from Google in 2013. Each unique word in the corpus is assigned a corresponding vector in the space. Word2Vec relies only on local information of language hence the semantics learnt for a given word is only affected by the surrounding words. Underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning, this at times results into similar vector representation (cosine similarity) of multiple words.One more drawback of word2vec is its unablity to takecare of OOV word. 

 <img src="https://storage.googleapis.com/kagglesdsdata/datasets/598303/1078299/word2vec.png?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1587070589&Signature=Po7EpAiaeyI5eQz3PkX1JTE81LQNoCnvlfmMIEWqwkBEJBbN5ci6wGTrj0oIacrk8y95cz1x6LRqfeHhqXaw96D299LtueOXK%2FNRqsJRXUPJNuPl%2F%2BWT25kIC%2BChAnfWYrJ8h%2FHXkoUxylq8DRpAxFAxpvdUXivwY2vR6h3aqhstu26c%2FMGTFljD%2FXIl4I5aDv3m8agm8nx7UdVLJ0nGBjEKmuhzszbACwIkUi2SGyiVhZhWngP8yXg3KsLg3vtbuNF9LOkk%2BfBeLBCYAB7qwuQZhNcXRqDLHmepVPc0%2BlWHllfj1RYvxLjbjHTlry1n0Ovy9ELl9maLT8FLFIhVgA%3D%3D" width="250">
 
 we will see error when we try to buid  Word2Vec comes in two flavours,
 - Skip-Gram and 
 - Continuous Bag of Words (CBOW)

Underneath Word2Vec uses neural network algorithms that can be trained on any type of sequential data. Fortunately we have libraries available that have already implemented these algorithms and we have to just call the method with proper arguments. A popular one among such libraries is <B>gensim</B>. It provides the [Word2Vec Class](https://radimrehurek.com/gensim/models/word2vec.htm) for working with a Word2Vec model.


##### Gensim implementation of Word2vec



<B>Arguments:</B>

- min_count : Minimum number of occurrences of a word in the corpus to be included in the model. The higher the number, the less words we have in our corpus
- window: The maximum distance between the current and predicted word within a sentence
- size: The dimensionality of the feature vectors
- workers: no of cores
- sg = 1  for skipgram and 0 for cbow

- sample = (type float) - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influencial. - (0, 1e-5)

- alpha = float - The initial learning rate - (0.01, 0.05)

- min_alpha = float - Learning rate will linearly drop to min_alpha as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00

- negative = int - If > 0, negative sampling will be used, the int for negative specifies how many "noise words" should be drown. If set to 0, no negative sampling is used. - (5, 20)


<B>Methods :</B> 
- model.build_vocab: Prepare the model vocabulary
- model.train: Train word vectors
- model.init_sims(): When we do not plan to train the model any further, we use this line of code to make the model more memory-efficient




###  2.1.1 Skip-Gram  <a class="kk" id="2.1.1"></a>
 
Skip-Gram is designed to predict the context from base word. From a given word, Skip-gram model tries to predict its neighbouring words.  <img src="https://storage.googleapis.com/kagglesdsdata/datasets/598303/1078299/skipgram.png?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1587070609&Signature=FN0mWWM%2Fuvc34xG2SZQx7VCLn70Ma58jAnw%2BP%2BO8Gtf4opQVvuu2iRkUArE6rNokzb0xfC3JcA1ybNHFAvJBQU%2FNjeY1cQLFKedxJ8TZMYPj%2FYx1tz5QJ8OvKTok%2FeDDPyr442dKbkuQDNsAawxo%2FRJXoWxqHFMAFZyOiK2UfjyVhA%2Fth8k9OeBhEaUi0slkIgeEp5oYi0BLrJmjVpYmh6tXstvK%2Bs43QxyZw98PDDZpKOZ6QX8o%2B9LbZR6ZsjfILc79CHXQHWvLjfKC%2FDKTe1RB%2ByEAc8XooPm5BaUTHXMYRYJS03hNo5NmvAb1anKSYX25dj6Ww2SrB0at6VS0dQ%3D%3D" width="250">

 Skip-gram is a [(target, context), relevancy] generator. Skip-gram generator gives us pair of words and their relevance (a float value). Lets generate Word2Vec skip-gram embedding for our cleaned-up text dataset using gensim.  

##### Building Skipgram  WordVectors using gensim

In [64]:
from gensim.models import Word2Vec
from time import time
t = time()
# initialize skipgram model
sg_model = Word2Vec(min_count=2,window=2,size=300, sg = 1,sample=5e-5, alpha=0.05, min_alpha=0.0005,negative=20 )
# build model vocabulary
sg_model.build_vocab(tokens_list)

# train the model
sg_model.train(tokens_list, total_examples=sg_model.corpus_count, epochs=30, report_delay=1)

print('Time to build Skip gram model vocab: {} mins'.format(round((time() - t) / 60, 2)))


Time to build Skip gram model vocab: 0.12 mins


We have just build our first word-embedding model.and that also with only 3 lines of code. Lets play with the model

##### Convert a word  to vector 

In [None]:
sg_model['hope']

##### Validate dimension of our word vector

In [None]:
len(sg_model['hope'])

##### Measure similarity   b/w two word 

In [None]:
sg_model.similarity('people','saint' )


In [None]:
sg_model.similarity('people', 'terrorist')

##### Fetch most similar words  wrt any given word 

In [None]:
sg_model.most_similar('fire')[:5]

##### Fetch list of word vocabulary 

In [None]:
words = list(sg_model.wv.vocab)
print(words)

### 2.1.2 CBOW (Continuous Bag of Words)  <a class="kk" id="2.1.2"></a>

CBOW is designed to predict the base(target) word from context. CBOW is faster to train than the skip-gram and gives slightly better accuracy for the frequent words.

<img src="https://storage.googleapis.com/kagglesdsdata/datasets/598303/1078299/cbow.png?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1587070646&Signature=O1By12DGolOFtiR2%2FsFF%2FJs9ECXlq3HilWLoy0fR6ypVPGKsZJZ5EhViXBU%2FgkK%2FwDcZ9scFexc3XktXgxJl2BjnQb3GGQuwUeU0y2oYuZwuVRitfGe86McgPrstbBcqKsehbjunpoyPwgRoiq3Hk7Fu5vaoKo7y0fBJTv0OziTT%2Bgwc9Uxc6WjS5SfAIrk1uW9OzFtYdRekgIbAgEyCHpqh6UrfHz9ePi2Ggpfbns42VP7oMRKQXHKU8LDs5vYuXi2ew9pUYmChDz8C9FV%2FfprQSj7oDqMxrKijZEyYXDbo4VgBMJI%2FC7lDzEn%2B060m5%2BwAVXhEgYS8KmXba%2BgS2w%3D%3D" width="250">



In [4]:
#### Building CBOW wordvectors
from gensim.models import Word2Vec
from time import time
t = time()
# initialize
cbow_model = Word2Vec(min_count=2,window=2,size=300, sg = 0,sample=5e-5, alpha=0.05, min_alpha=0.0005, 
                     negative=20 )
# build model vocabulary
cbow_model.build_vocab(tokens_list)

# train the model
cbow_model.train(tokens_list, total_examples=cbow_model.corpus_count, epochs=30, report_delay=1)

print('Time to build CBOW model vocab: {} mins'.format(round((time() - t) / 60, 2)))


Time to build CBOW model vocab: 0.09 mins


Did you notice CBOW trained faster ?

####  Pretrain Word2Vec

Google has made available pretrained word embedding which includes word vectors for a vocabulary of 3 million words and phrases that they have trained on roughly 100 billion words from Google News dataset using Word2Vec.

In [94]:
#fetching  pretrain wordvector
from gensim.models.keyedvectors import KeyedVectors
t = time()
pretrained_w2vec_embedding = KeyedVectors.load_word2vec_format('/Users/kaustuv/DataScience/DS_tutorials/datasets/Google-Word2vec/GoogleNews-vectors-negative300.bin', binary=True)
print('Time to fetch  pretrain  Word2Vec model vocab: {} mins'.format(round((time() - t) / 60, 2)))

Time to fetch  pretrain  Word2Vec model vocab: 1.42 mins


In [None]:
pretrained_w2vec_embedding['people']

In [None]:
pretrained_w2vec_embedding.syn0.shape

## 2.2 GloVe  <a class="kk" id="2.2"></a>

[GloVe](https://nlp.stanford.edu/pubs/glove.pdf) stands for "Global Vectors". It is a Word Embedding [project](https://nlp.stanford.edu/projects/glove/)  written in C language and developed by Stanford university researchers in 2014. Glove embedding technique is based on (first) construction of a co-occurrence matrix from a training corpus and then (second) factorization of co-occurrence matrix in order to yield word vector.

Unlike word2vec which captures only local statistics of token Glove captures both global statistics and local statistics of a text tokens. Its embeddings relate to the probabilities that two words appear together. [glove_python](https://github.com/maciejkula/glove-python) library provides glove implementation.

#####  Implementation of Glove via  glove_python


Arguments description : 

1. For corpus.fit()  :
    - lines : this is the 2D array we created after the pre-processing
    - window : this is the distance between two words algorithm should consider to find some relationship between them
    
    
2. For glove() :
    - no_of_components : This is the dimension of the output vector generated by the GloVe
    - learning_rate : Algo uses gradient descent so learning rate defines the rate at which the algo reaches towards the minima (lower the rate more time it takes to learn but reaches the minimum value)


3. For glove.fit() :
    - cooccurence_matrix: the matrix of word-word co-occurrences
    - epochs: this defines the number of passes algo makes through the data set
    - no_of_threads: number of threads used by the algo to run

In [20]:
#!pip install glove_python

#importing the glove library
from glove import Corpus, Glove

# creating a corpus object
corpus = Corpus() 

#training the corpus to generate the co occurence matrix which is used in GloVe
corpus.fit(tokens_list, window=3)
#creating a Glove object which will use the matrix created in the above lines to create embeddings
#We can set the learning rate as it uses Gradient Descent and number of components
glove = Glove(no_components=300, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')



Performing 30 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19
Epoch 20
Epoch 21
Epoch 22
Epoch 23
Epoch 24
Epoch 25
Epoch 26
Epoch 27
Epoch 28
Epoch 29


#####  Displaying Glove WordVector of a word 

In [None]:
glove.word_vectors[glove.dictionary['people']]

#### Pretrain Glove

Glove developers have also made available pre-computed embeddings for millions of English tokens, obtained from training Wikipedia data and Common crawl data.

In [119]:
# Fetch pretrain glove word vectors 
import numpy as np 
pretrained_glove_embedding={}
with open('/Users/kaustuv/DataScience/DS_tutorials/datasets/Glove_wordembeeding/glove.6B/glove.6B.300d.txt','r') as f:
    for line in f:
        values=line.split()
        word=values[0]
        vectors=np.asarray(values[1:],'float32')
        pretrained_glove_embedding[word]=vectors
f.close()

#####  Displaying Glove pretrained WordVector of a word 

In [None]:
len(pretrained_glove_embedding)

In [None]:
pretrained_glove_embedding['hello']

## 2.3. Fast-Text <a class="kk" id="2.3"></a>

[FastText](https://fasttext.cc/) is a library for learning of word embeddings and text classification. The Facebook Research Team created fastText in Nov 2015. Fast-Text is an extension of word2vec library. It builds on Word2Vec by learning vector representations for each word and the n-grams found within each word. FastText assumes a word to be formed by a n-grams of character for example, sunny is composed of [sun, sunn,sunny],[sunny,unny,nny]... etc, where n could range from 1 to the length of the word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to training it enables word embeddings to encode sub-word information. Thus even for previously unseen words, typo errors, and OOV (Out Of Vocabulary) words the model can make an educated guess towards its meaning.Obvious trade off is processing time. Gensim provides the [FastText implementation](https://radimrehurek.com/gensim/models/fasttext.html).



##### FastText Implementatation using gensim

Refer : https://radimrehurek.com/gensim/models/fasttext.html for parameter deatails

In [30]:
dimension =300
from gensim.models import FastText
fasttext_model = FastText(tokens_list, size=dimension, window=5, min_count=5, workers=4, sg=1)


#####  Displaying FastText WordVector of given word 

In [None]:
fasttext_model['people']

In [None]:
fasttext_model.similarity('evacuation','shelter' )

fasttext_model.most_similar('earthquake')[:5]

##### PreTrained Fasttext

FastText developers have also made available pre-computed embeddings for millions of english tokens, obtained from training Wikipedia data and common crawl data. 

Disclaimer: Loading the fastText pretrain will consume some serious memory.

In [115]:
def get_coefs(word, *arr): 
    return word, np.asarray(arr, dtype='float32')

EMBEDDING_FILE = '/Users/kaustuv/DataScience/DS_tutorials/datasets/FastText/wiki-news-300d-1M-subword.vec'
pretrained_fasttext_embedding = dict(get_coefs(*o.rstrip().rsplit(' ')) for o in (open(EMBEDDING_FILE)))

#####  Displaying FastText pretrained WordVector of a word 

In [None]:
pretrained_fasttext_embedding['earthquake']

In [None]:
len(pretrained_fasttext_embedding['earthquake'])

# 3. Text to Numeric Convertion Using Word Vectors <a class="kk" id="3"></a>
[Back to Contents](#0.1)

So we have learned about word embedding techniques and created word vectors for our corpus.  Now we will convert our textual data into numerical using these word vectors.  I will explain about two popular texts to numerical conversion techniques using word vectors, 
1. Vector Averaging  
2. Embedding Matrix and Keras Embedding layer


## 3.1 Vector Averaging  <a class="kk" id="3.1"></a>
In this approach we directly averages all word embedding occurred in the text. Final length remains equal to word vector dimension. This is go to technique when we are planning to use standard machine learning models such a logistic regression, naïve-bayes, svm etc.  
 

### Vector Averaging With Word2Vec <a class="kk" id="3.1.1"></a>

In [13]:
# functions for Vector Averaging with word2Vec
import numpy as np
def w2v_embeddings(text,w2v_model,dimension):
    if len(text) < 1:
        return np.zeros(dimension)
    else:
        vectorized = [w2v_model[word] if word in w2v_model else np.random.rand(dimension) for word in text] 
    
    sum = np.sum(vectorized,axis=0)
    ## return the average
    return sum/len(vectorized)     

def get_w2v_embeddings(text,w2v_model,dimension):
        embeddings = text.apply(lambda x: w2v_embeddings(x, w2v_model,dimension))
        return list(embeddings)

In [14]:
# Text to numeric using Vector Averaging for sgmodel 
train_embeddings_sg_model  = get_w2v_embeddings(train_df['text'],sg_model,dimension=300)
 

  
  


In [16]:
len(train_embeddings_sg_model)

7613

In [17]:
len(train_embeddings_sg_model[0])

100

In [18]:
# Text to numeric using Vector Averaging for cbow model
train_embeddings_cbow_model_  = get_w2v_embeddings(train_df['text'],cbow_model,dimension=300)
 

  
  


### Vector Averaging With Glove <a class="kk" id="3.1.2"></a>

In [26]:
# functions  for Vector Averaging with GloVe
import numpy as np
def glove_embeddings(text, glove_model, dim ):
    dic=glove_model.dictionary
    if len(text) < 1:
        return np.zeros(dim)
    else:
        vectorized = [glove_model.word_vectors[dic[word]] if word in dic else np.random.rand(dim) for word in text]  
    sum = np.sum(vectorized,axis=0)
    ## return the average
    return sum/len(vectorized)     

def get_glove_embeddings(text,glove_model,dimension):
        embeddings = text.apply(lambda x: glove_embeddings(x,glove_model, dimension))
        return list(embeddings)




In [28]:
# Text to numeric using Averaging for glove
import numpy as np
train_embeddings_glove = get_glove_embeddings(train_df['text'],glove,dimension=300)
test_embeddings_glove = get_glove_embeddings(test_df['text'],glove,dimension=300)

### Vector Averaging With Fasttext  <a class="kk" id="3.1.3"></a>

As Fastext is an extension of word2vec hence the same averaging function of w2vec i.e.`get_ w2v_embeddings`  will work with fasttext too. 

In [32]:
###  Text to numeric using Averaging with Fasttext
import numpy as np
fasttext_train_embeddings = w2v_embeddings(train_df['text'], fasttext_model,dimension=300)
fasttext_test_embeddings = w2v_embeddings(test_df['text'],  fasttext_model,dimension=300)

  
  


## 3.2 Embedded Matrix & Keras Embedding layer <a class="kk" id="3.2"></a>

Averaging is preferred choice when we intend to use ML models such as lr, svm, gbm etc. but our purpose here is to utilise Deep-learning algorithms. Deep Learning is a layer bases learning where each layer passes its learning to the next layer.   Few libraries have implement deep learning algorithms. A popular one among them is Keras. We will use Keras for our deep learning modelling purpose.

For text processing Keras offers an embedding layer. This is the first layer of deep learning algorithm. Weights of the Embedding layer are of the shape (vocabulary_size, embedding_dimension) , this weight matrix is also called as Embedding matrix. We will first generate this embedding matrix from our word vectors and then initialize Keras embedding layer for each of our word embeddings. 

Moreover, Keras has built-in utilities for doing tokenization and encoding of text. We will use these utilities as they take care of a number of important features such as stripping special characters from strings, padding, fetching N most common words in dataset etc.

In [124]:
# tokenizing using keras  tokenizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm

tokenizer_obj=Tokenizer()
# to builds the word index
tokenizer_obj.fit_on_texts(tokens_list)
# to turns strings into lists of integer indices.
sequences=tokenizer_obj.texts_to_sequences(tokens_list)
# defining maximum length of sequence 
MAX_LEN= 50
# pad_sequences is used to ensure that all sequences in a list have the same length
tweet_pad= pad_sequences(sequences,maxlen=MAX_LEN,truncating='post',padding='post')

# segregating text & train from corpu
x_train = tweet_pad[:7613]
x_test = tweet_pad[7613:]

targets =  [target for target in train_df['target']]

# set of all word and their sequence no
word_index=tokenizer_obj.word_index
print('Number of unique words:',len(word_index))
vocab_size = word_index + 1


In [101]:
def generate_embeeding_matrix(word_vector_model, dimension, vocab_size= vocab_size, word_index =word_index):
    embedding_matrix=np.zeros((vocab_size,dimension))
    for word,i in tqdm(word_index.items()):
        if i > vocab_size:
            continue
        if word in word_vector_model:  
            emb_vec=word_vector_model[word]
            embedding_matrix[i]=emb_vec
    return embedding_matrix

Lets create keras embeeding layers for our word vectors.  we have created  4 trained word embedding model (skipgram, cbow, glove and fasttext) and 3 pretrain  model each for (word2vec ,glove and python) for all these seven we will crate keras word embedding model. 

### Word2Vec Embedding layers  <a class="kk" id="3.2.1"></a>

##### Trained skipgram

In [102]:
from keras.layers import Embedding

embedding_matrix_sg_trained = generate_embeeding_matrix(sg_model, dimension = 300)

embedding_layer_sg_trained = Embedding(vocab_size, output_dim= 300, weights=[embedding_matrix_sg_trained], 
                                     input_length=MAX_LEN, trainable=False)

  
  import sys
100%|██████████| 28874/28874 [00:00<00:00, 147284.00it/s]


##### Pre-Trained  Word2Vec

In [103]:
# pre trainde word2vec dimesion is 300
embedding_matrix_w2v_pretrained = generate_embeeding_matrix(pretrained_w2vec_embedding, dimension =300)    

embedding_layer_w2v_pretrained = Embedding(vocab_size, output_dim= 300, weights=[embedding_matrix_w2v_pretrained], 
                                     input_length=MAX_LEN, trainable=False)

100%|██████████| 28874/28874 [00:00<00:00, 111852.76it/s]


##### Trained  CBOW

In [104]:
embedding_matrix_cbow_trained = generate_embeeding_matrix(cbow_model, dimension = 300)

embedding_layer_cbow_trained = Embedding(vocab_size, output_dim= 300, weights=[embedding_matrix_cbow_trained], 
                                     input_length=MAX_LEN, trainable=False)

  
  import sys
100%|██████████| 28874/28874 [00:00<00:00, 164841.84it/s]


### GloVe Embedding Layers   <a class="kk" id="3.2.2"></a>

##### Trained Glove

In [106]:
import numpy as np
embedding_matrix_glove_trained=np.zeros((vocab_size,300))
for word,i in tqdm(word_index.items()):
    if i > vocab_size:
        continue
    
    emb_vec=glove.word_vectors[glove.dictionary[word]]
    if emb_vec is not None:
        embedding_matrix_glove_trained[i]=emb_vec

100%|██████████| 28874/28874 [00:00<00:00, 467832.26it/s]


In [107]:
embedding_layer_glove_trained = Embedding(vocab_size, dimension, weights=[embedding_matrix_glove_trained], 
                                     input_length=MAX_LEN, trainable=False)

##### PreTrained Glove

In [121]:
embedding_matrix_glove_pretrained = generate_embeeding_matrix(pretrained_glove_embedding, dimension =300)    

embedding_layer_glove_pretrained = Embedding(vocab_size, output_dim= 300, weights=[embedding_matrix_w2v_pretrained], 
                                     input_length=MAX_LEN, trainable=False)

100%|██████████| 28874/28874 [00:00<00:00, 373207.89it/s]


### Fasttext  Embedding layers  <a class="kk" id="3.2.3"></a>

##### Trained FastText

In [112]:
embedding_matrix_fasttext_trained = generate_embeeding_matrix(fasttext_model, dimension =300)    

embedding_layer_fasttext_trained = Embedding(vocab_size, output_dim= 300, weights=[embedding_matrix_fasttext_trained], 
                                     input_length=MAX_LEN, trainable=False)

  
  import sys
100%|██████████| 28874/28874 [00:02<00:00, 13341.40it/s]


##### Pre-Trained FastText

In [116]:
embedding_matrix_fasttext_pretrained = generate_embeeding_matrix(pretrained_fasttext_embedding, dimension =300)    

embedding_layer_fasttext_pretrained = Embedding(vocab_size, output_dim= 300, weights=[embedding_matrix_fasttext_pretrained], 
                                     input_length=MAX_LEN, trainable=False)

100%|██████████| 28874/28874 [00:00<00:00, 362862.16it/s]


## 4. Deep Learning Models <a class="kk" id="4"></a>
[Back to Contents](#0.1)

 We have initialized Keras embedding layer for our various word embedding models. Now it’s time to train using deep learning models. I will demonstrate how to train for glove pertained layer. You can test with other six embedding layer also (by just reassigning embedding_layer). One point you will notice that pretrained embedding layers performs much better than their trained counter parts. Again the purpose here is to depict basic Deep Learning model performance and not to obtain high score.

In [141]:
# Declare embeeding layer of your choics 
embedding_layer = embedding_layer_glove_pretrained

# can try with other embedding layes too
# embedding_layer_fasttext_pretrained
# embedding_layer_fasttext_trained
# embedding_layer_cbow_trained
# embedding_layer_sg_trained
# embedding_layer_w2vec_pretrained
# embedding_layer_glove_trained

### 4.1 Basic DNN <a class="kk" id="4.1"></a>


In [132]:
from keras.models import Sequential
from keras.layers import Flatten, Dense
dnn_model = Sequential()
dnn_model.add(embedding_layer)
dnn_model.add(Flatten())
dnn_model.add(Dense(1, activation='sigmoid'))

dnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

dnn_model.summary()

history = dnn_model.fit(x_train,  y = targets,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 50, 300)           8662500   
_________________________________________________________________
flatten_5 (Flatten)          (None, 15000)             0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 15001     
Total params: 8,677,501
Trainable params: 15,001
Non-trainable params: 8,662,500
_________________________________________________________________
None
Train on 6090 samples, validate on 1523 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


###  4.2 CNN <a class="kk" id="4.2"></a>

In [133]:
import keras
cnn_model = Sequential()
# note here we are adding embeeding layer
cnn_model.add(embedding_layer)
cnn_model.add(keras.layers.Dropout(0.2))
cnn_model.add(keras.layers.Conv1D(3,3, padding='valid',activation='relu', strides=1))
cnn_model.add(keras.layers.GlobalMaxPooling1D())
cnn_model.add(keras.layers.Dense(20))
cnn_model.add(keras.layers.Dropout(0.2))
cnn_model.add(keras.layers.Activation('relu'))
cnn_model.add(keras.layers.Dense(1))
cnn_model.add(keras.layers.Activation('sigmoid'))

# Get model summary
cnn_model.summary()

# compile the model
history = cnn_model.fit(x_train,  y = targets,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 50, 300)           8662500   
_________________________________________________________________
dropout_1 (Dropout)          (None, 50, 300)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 48, 3)             2703      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 3)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 20)                80        
_________________________________________________________________
dropout_2 (Dropout)          (None, 20)                0         
_________________________________________________________________
activation_1 (Activation)    (None, 20)               

### 4.3 Simple RNN <a class="kk" id="4.2"></a>

In [134]:
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
rnn_model = Sequential()
rnn_model.add(embedding_layer)
rnn_model.add(SimpleRNN(32))
rnn_model.add(Dense(1, activation='sigmoid'))
rnn_model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = rnn_model.fit(x_train, y = targets,epochs=10, batch_size=32,validation_split=0.2)

Train on 6090 samples, validate on 1523 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


### 4.3  Recurrent Neural Network -LSTM <a class="kk" id="4.3"></a>

In [136]:
from keras.layers import LSTM
lstm_model = Sequential()
lstm_model.add(embedding_layer)
lstm_model.add(LSTM(32))
lstm_model.add(Dense(1, activation='sigmoid'))

lstm_model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = lstm_model.fit(x_train, y = targets,epochs=10, batch_size=32,validation_split=0.2)

Train on 6090 samples, validate on 1523 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### 4.4 Recurrent Neural Network – GRU <a class="kk" id="4.4"></a>

In [142]:
from keras.layers import GRU
gru_model = Sequential()
gru_model.add(embedding_layer)
gru_model.add(GRU(32))
gru_model.add(Dense(1, activation='sigmoid'))

gru_model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = gru_model.fit(x_train, y = targets,epochs=22, batch_size=32,validation_split=0.2)

Train on 6090 samples, validate on 1523 samples
Epoch 1/22
Epoch 2/22
Epoch 3/22
Epoch 4/22
Epoch 5/22
Epoch 6/22
Epoch 7/22
Epoch 8/22
Epoch 9/22
Epoch 10/22
Epoch 11/22
Epoch 12/22
Epoch 13/22
Epoch 14/22
Epoch 15/22
Epoch 16/22
Epoch 17/22
Epoch 18/22
Epoch 19/22
Epoch 20/22
Epoch 21/22
Epoch 22/22


###  Target Prediction  

In [143]:
raw_preds = model.predict(x_test)
preds = raw_preds.round().astype(int)
preds

array([[1],
       [1],
       [1],
       ...,
       [1],
       [1],
       [1]])

Thanks for reading! Purpose of this notebook is to provide fair understanding of word embedding techniques and get beginners started in quick time. Do give your feed back. In the next part-3 we will read about state-of-art 'BERT Embedding'.

<img src="https://storage.googleapis.com/kagglesdsdata/datasets/598303/1081639/too%20much.png?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1587151146&Signature=revsFWYk5EZkV%2B26mNr6q8vzpAkvHsiA8TrtphPck5DU%2FxcT9iSbaDlUkFwTtXmoLPTkVLwylkc9cbswCWJTHBogYaNUF%2Bv3YXdwHGurX1lbJ6SH41ZR%2B9%2BwpMmIIBv%2FMYenQUdO5ERJbfbMW%2BFK6rxFbJizkUwAQhy4eDpbywlhLu1l2P79JUsHr5uf6L8fAOzaf4CjQTC1VP5TjG%2BERvLzwnBQO1oN9g9%2B2Y%2FeL0LXHNL17297xfX8pZ3j%2FGt2hl%2BXWYP3nW5ymkuM4rpa5TbkQDac0qELQrZy8ncHbDqtwVj9csjFK3J0ALwmRLUryuLqdvh3v6cnS1sd3kU%2FBQ%3D%3D" width="250">




## References
- Deep Learning with Python by FRANÇOIS CHOLLET  http://faculty.neu.edu.cn/yury/AAI/Textbook/Deep%20Learning%20with%20Python.pdf
- https://en.wikipedia.org/
- https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/eb9cd609-e44a-40a2-9c3a-f16fc4f5289a.xhtml
- https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-skip-gram.html
- https://www.kaggle.com/slatawa/simple-implementation-of-word2vec
- https://becominghuman.ai/how-does-word2vecs-skip-gram-work-f92e0525def4 (image)
- https://www.thinkinfi.com/2019/06/single-word-cbow.html(image)
- https://www.kaggle.com/rajmehra03/a-detailed-explanation-of-keras-embedding-layer
- https://medium.com/@japneet121/word-vectorization-using-glove-76919685ee0b
- https://www.kaggle.com/christofhenkel/fasttext-starter-description-only
