# Word Embeddings and Deep Learning for NLP

This notebook covers some of the basic steps involved in using deep learning for NLP:

- A brief overview of Word2Vec-based embeddings.
- A brief look at NLP tasks implemented with HuggingFace Transformers 🤗.

Note: This is only an overview, not exhaustive material on NLP with deep learning.

## Text Representation using Word2Vec

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups

pd.options.display.max_colwidth = 200

### Get Dataset

In [None]:
categories = ['alt.atheism', 'comp.graphics', 'sci.med']

In [None]:
twenty_corpus = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

In [None]:
[news.split('\n')[1] for news in twenty_corpus.data[:10]]

In [None]:
twenty_corpus.target[:10]

In [None]:
corpus = [news.split('\n')[1] for news in twenty_corpus.data[:10]]
labels = [categories[i] for i in twenty_corpus.target[:10]]

In [None]:
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

### Preprocess Dataset

In [None]:
import nltk
import re
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
def normalize_document(doc):
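    # remove special characters, then lower-case and strip surrounding whitespace
    # (flags must be passed by keyword; the 4th positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)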
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

In [None]:
stop_words = nltk.corpus.stopwords.words('english')
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(corpus)
norm_corpus

### Train Word2Vec Model

Word2Vec comes in two architectures. The CBOW (continuous bag of words) model predicts a target word (the center word) from its surrounding context words. The skip-gram model does the reverse: it predicts the surrounding context words given the target word.

Consider the simple sentence “the quick brown fox jumps over the lazy dog”. With the CBOW model and a context window of size 2 (one word on each side of the target), we get (context_window, target_word) pairs such as ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on.

Because the skip-gram model aims to predict the context from the target word, it inverts these pairs and tries to predict each context word from its target word. The task becomes predicting the context [quick, fox] given the target word ‘brown’, [the, brown] given the target word ‘quick’, and so on.

In short, the model tries to predict the context_window words from the target_word.
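
The pair construction described above can be sketched in plain Python. This is only an illustration, not how gensim builds its training examples internally; the helper names `cbow_pairs` and `skipgram_pairs` and the `half_window` parameter are made up for this example, and a "window of size 2" is read here as one word on each side of the target.

```python
def cbow_pairs(tokens, half_window=1):
    """Return (context_words, target_word) pairs, as used by CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `half_window` words on each side of the target
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        pairs.append((left + right, target))
    return pairs


def skipgram_pairs(tokens, half_window=1):
    """Return (target_word, context_word) pairs, as used by skip-gram."""
    return [(target, context_word)
            for context, target in cbow_pairs(tokens, half_window)
            for context_word in context]


sentence = "the quick brown fox jumps over the lazy dog".split()
cbow = cbow_pairs(sentence)
skipgram = skipgram_pairs(sentence)

print(cbow[2])        # (['quick', 'fox'], 'brown')
print(skipgram[3:5])  # [('brown', 'quick'), ('brown', 'fox')]
```

Each CBOW pair with two context words turns into two skip-gram pairs, which is why skip-gram produces more training examples from the same text.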


![skipgram_arch](../assets/skipgram_arch.png)

In [None]:
import nltk
from gensim.models import word2vec

In [None]:
tokenized_corpus = [nltk.word_tokenize(doc) for doc in norm_corpus]

### Gensim
The gensim framework, created by Radim Řehůřek, provides a robust, efficient and scalable implementation of the Word2Vec model. We will use it on our small sample corpus. In our workflow, we tokenize the normalized corpus and then focus on the following five parameters of the Word2Vec model:

- size: The word embedding dimensionality (renamed vector_size in gensim 4.0)
- window: The context window size
- min_count: The minimum word count; words that occur fewer times are ignored
- sample: The downsampling threshold for frequent words
- sg: The training algorithm, 1 for skip-gram, otherwise CBOW

We will build a simple Word2Vec model on the corpus and visualize the embeddings.
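
To give a feel for what the sample threshold does, here is a minimal sketch of the subsampling rule from the original Word2Vec paper, where a word with relative frequency f is discarded with probability 1 - sqrt(t / f). The function name `discard_probability` and the example frequencies are assumptions made for illustration; gensim's internal formula may differ in detail.

```python
import math


def discard_probability(word_freq, threshold=1e-3):
    """Paper-style subsampling: P(discard) = 1 - sqrt(t / f), clipped at 0."""
    return max(0.0, 1.0 - math.sqrt(threshold / word_freq))


# Hypothetical relative frequencies (share of all tokens in a corpus)
example_freqs = {"very_frequent": 0.05, "at_threshold": 0.001, "rare": 0.0001}

for name, freq in example_freqs.items():
    print(name, round(discard_probability(freq), 3))
```

Words at or below the threshold frequency are always kept, while very frequent words are dropped most of the time, which reduces the dominance of words like "the" during training.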

In [None]:
# Set values for various parameters
feature_size = 15    # Word vector dimensionality  
window_context = 20  # Context window size                                                                                    
min_word_count = 1   # Minimum word count                        
sample = 1e-3        # Downsample setting for frequent words
sg = 1               # skip-gram model

# Parameter names follow gensim >= 4.0 (older releases use size= and iter=)
w2v_model = word2vec.Word2Vec(tokenized_corpus, vector_size=feature_size,
                              window=window_context, min_count=min_word_count,
                              sg=sg, sample=sample, epochs=5000)


### Visualize Embeddings

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline
# visualize embeddings
from sklearn.manifold import TSNE

In [None]:
# index_to_key replaces index2word in gensim >= 4.0
words = w2v_model.wv.index_to_key
wvs = w2v_model.wv[words]

tsne = TSNE(n_components=2, random_state=42, n_iter=5000, perplexity=5)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(wvs)
labels = words

plt.figure(figsize=(12, 6))
plt.scatter(T[:, 0], T[:, 1], c='blue')
for label, x, y in zip(labels, T[:, 0], T[:, 1]):
    plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')
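
The plot gives a visual impression of which words lie close together. Closeness between embeddings is commonly quantified with cosine similarity, which can be sketched as follows. The three-dimensional vectors below are hypothetical toy values, not output of the model trained above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, for illustration only
toy_vectors = {
    "doctor": [0.9, 0.1, 0.0],
    "patient": [0.8, 0.2, 0.1],
    "pixel": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["doctor"], toy_vectors["patient"])
sim_unrelated = cosine_similarity(toy_vectors["doctor"], toy_vectors["pixel"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With the trained model, `w2v_model.wv.similarity(word_a, word_b)` and `w2v_model.wv.most_similar(word)` provide the same kind of comparison directly on the learned vectors.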

## Transformers 🤗

In [None]:
!pip install transformers

In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline('sentiment-analysis')

In [None]:
classifier.tokenizer('The hugging-face transformer package really simplifies NLP tasks')

In [None]:
classifier('The hugging-face transformer package really simplifies NLP tasks')
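
The classifier returns a list with one dict per input, each holding a `label` and a `score`. As a rough sketch of how such a score can be derived from raw model outputs, the snippet below applies a softmax to a pair of logits and picks the most probable class. The logit values and the label order are assumptions for illustration, not output of the pipeline above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["NEGATIVE", "POSITIVE"]  # assumed label order
logits = [-3.2, 3.5]                    # hypothetical model outputs

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": class_names[best], "score": probs[best]}
print(result)
```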