Motivation:

Cloned from a Kaggle kernel (details in the fork):
https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb



## Word embeddings with Gensim

How text is encoded matters a great deal for Deep Learning models. A representation that captures the similarity and proximity between words should, intuitively, work better for many tasks, and in practice it often does. It is not always the best choice, though: it's no silver bullet.

Two of the most important models for representing words in an n-dimensional space are [word2vec](https://arxiv.org/abs/1310.4546) and [GloVe](https://nlp.stanford.edu/projects/glove/).

In this tutorial, I will show how to use [Gensim](https://radimrehurek.com/gensim/index.html) to work with both word2vec and GloVe encodings for text data.

I assume you already know how to set up an environment for machine learning development with Python. If you don't, take a look at [this tutorial](https://medium.com/cocoaacademymag/basic-tools-for-machine-learning-85e887224ee4) on the basic tools for Machine Learning, which has everything you will need to follow this one.

### Summary

* ✅ Installing and importing Gensim
* ✅ Creating a word2vec model from text data
* Creating a GloVe model from text data
* Intrinsic evaluation for both models
* Extrinsic evaluation for both models

# word2vec

## Preparing the text to train the model

In this example, I open a csv file, take the text from it, and split it into lines. I then split each line into "words" using the space as a boundary, and strip the punctuation to clean the corpus a little. *This is absolutely naive and should not be done in production*, but it is good enough for an example. In _real life_ you should use a tokenizer to separate the tokens to be vectorized and to handle punctuation properly. Depending on the task, it might also be helpful to lemmatize the tokens.

My `sentences` variable will store a list of lists of strings, where each string ~roughly~ represents a word.
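
As a point of comparison for the naive whitespace split described above, a regular expression can pull word-like tokens out of each line directly. This is only a sketch: `tokenize`, `build_sentences` and the pattern itself are illustrative names and choices of mine, not part of Gensim or of the original kernel, and a real tokenizer handles many more cases (URLs, emoji, hyphenation and so on).

```python
import re

# Hypothetical pattern: runs of letters/digits, optionally followed by
# an apostrophe suffix such as "n't" or "'s".
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(line):
    """Lowercase a line and return its word-like tokens."""
    return TOKEN_PATTERN.findall(line.lower())


def build_sentences(text):
    """Split text into lines and tokenize each one, dropping empty lines."""
    tokenized = (tokenize(line) for line in text.split('\n'))
    return [tokens for tokens in tokenized if tokens]


print(build_sentences("Hello, world!\n\nDon't panic."))
```

The result has the same shape as `sentences` (a list of lists of strings), so it could be passed to a model in the same way.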

In [None]:
import pandas as pd
df = pd.read_csv('../input/train.csv')
corpus_text = '\n'.join(df[:50000]['comment_text'])
sentences = corpus_text.split('\n')
sentences = [line.lower().split(' ') for line in sentences]

In [None]:
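def clean(s):
    # strip surrounding punctuation and drop tokens that end up empty
    stripped = (w.strip(',."!?:;()\'') for w in s)
    return [w for w in stripped if w]
sentences = [clean(s) for s in sentences]
sentences = [s for s in sentences if len(s) > 0]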

## Training the model

Once we have the sentences, we can use `Gensim` to create a model for us.
Here's a simple way to do it:

In [None]:
from gensim.models import Word2Vec

# Gensim 3.x API; in Gensim 4+ the `size` argument is named `vector_size`
model = Word2Vec(sentences, size=100, window=5, min_count=3, workers=4)

Of course, you can change the hyperparameters, such as the window size or the dimensionality of the resulting vectors, to get better results.
If our model is too big and we're done training it, we can delete it and keep only the vectors.
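
To make one of these hyperparameters concrete: `min_count=3` tells Gensim to ignore words that occur fewer than three times in the whole corpus. The snippet below is a rough pure-Python illustration of that kind of filtering, not Gensim's actual implementation; `filter_rare_words` and the toy sentences are made up for this example.

```python
from collections import Counter


def filter_rare_words(sentences, min_count=3):
    """Drop every word whose total corpus frequency is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocabulary = {word for word, count in counts.items() if count >= min_count}
    filtered = [[word for word in sentence if word in vocabulary]
                for sentence in sentences]
    return filtered, vocabulary


# Toy corpus (hypothetical data, only for illustration)
toy_sentences = [['the', 'cat', 'sat'],
                 ['the', 'dog', 'sat'],
                 ['the', 'cat', 'ran']]

filtered, vocabulary = filter_rare_words(toy_sentences, min_count=2)
print(sorted(vocabulary))
print(filtered)
```

Raising `min_count` shrinks the vocabulary, which also means that looking up a rare word in the trained vectors will fail because no vector was learned for it.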

In [None]:
vectors = model.wv
# del model

## Using the vectors

Now, for each word (represented as a string), we can get its vector.

In [None]:
vectors['good']

We can also compare words to assess their similarity,
or check which words are the most similar to a given word, i.e. the
ones with the least distant vectors.
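
The similarity score used here is the cosine similarity between two vectors. A minimal pure-Python sketch of the idea follows; the function names and the three-dimensional toy vectors are invented for illustration, and real embeddings have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, topn=3):
    """Rank every other word by cosine similarity to the given word."""
    query = embeddings[word]
    scored = [(other, cosine_similarity(query, vector))
              for other, vector in embeddings.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy embeddings
toy = {
    'good':  [1.0, 0.9, 0.0],
    'great': [0.9, 1.0, 0.1],
    'bad':   [-1.0, -0.8, 0.1],
}

print(most_similar('good', toy, topn=2))
```

Vectors pointing in nearly the same direction score close to 1, unrelated ones close to 0, and opposite ones close to -1.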

In [None]:
print(vectors.similarity('you', 'your'))
print(vectors.similarity('you', 'internet'))

In [None]:
vectors.most_similar('kill')

In [None]:
len(model.wv.vocab)

In [None]:
# make sure the L2-normalized vectors exist before reading syn0norm
model.wv.init_sims()

# build a list of the terms, integer indices,
# and term counts from the word2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count) for term, voc in model.wv.vocab.items()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda k: -k[2])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)
# print(ordered_terms)
# create a DataFrame with the normalized word2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(model.wv.syn0norm[term_indices, :], index=ordered_terms)

word_vectors

In [None]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in model.wv.most_similar(positive=[token], topn=topn):
        print(word, round(similarity, 3))

In [None]:
get_related_terms(u'killed')

In [None]:
get_related_terms(u'japanese')

In [None]:
get_related_terms(u'asshole')

In [None]:
get_related_terms(u'discussion')

In [None]:
get_related_terms(u'wikipedia')

In [None]:
get_related_terms(u'please')

In [None]:
get_related_terms(u'vandalism')

In [None]:
get_related_terms(u'media')

In [None]:
get_related_terms(u'language')

In [None]:
get_related_terms(u'perhaps')

In [None]:
get_related_terms(u'sex')

In [None]:
get_related_terms(u'conflict')

In [None]:
get_related_terms(u'bastard')

In [None]:
get_related_terms(u'jewish')

In [None]:
get_related_terms(u'introduction')

In [None]:
def word_algebra(add=None, subtract=None, topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    # None defaults avoid sharing one mutable list between calls
    answers = model.wv.most_similar(positive=add or [], negative=subtract or [], topn=topn)

    for term, similarity in answers:
        print(term)
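
To show what "combining" the vectors means, here is a rough pure-Python sketch of the idea behind this kind of word algebra: normalize each input vector, add the `add` words and subtract the `subtract` words, then rank the remaining vocabulary by cosine similarity to the result. It approximates what Gensim does internally rather than reproducing it, and the helper names and toy vectors are hypothetical.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def combine(embeddings, add=(), subtract=()):
    """Average the unit vectors, weighting subtracted words by -1."""
    signed = ([(unit(embeddings[w]), 1.0) for w in add] +
              [(unit(embeddings[w]), -1.0) for w in subtract])
    dim = len(signed[0][0])
    mean = [sum(weight * vec[i] for vec, weight in signed) / len(signed)
            for i in range(dim)]
    return unit(mean)


def nearest(embeddings, add=(), subtract=(), topn=1):
    """Rank words not used in the query by cosine similarity to it."""
    query = combine(embeddings, add, subtract)
    used = set(add) | set(subtract)
    scored = [(w, sum(a * b for a, b in zip(query, unit(vec))))
              for w, vec in embeddings.items() if w not in used]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy embeddings, hand-built so the analogy works
toy = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man':   [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'apple': [0.5, 0.5, 0.5],
}

print(nearest(toy, add=['king', 'woman'], subtract=['man']))
```

Excluding the query words from the candidates matters: without that step the nearest neighbour of a combined vector is very often one of the input words themselves.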

In [None]:
word_algebra(add=[u'i', u'will'])

In [None]:
word_algebra(add=[u'you', u'will'])

In [None]:
word_algebra(add=[u'i', u'am'])

In [None]:
word_algebra(add=[u'mother', u'fuck'])

In [None]:
word_algebra(add=[ u'fuck', 'you'])

In [None]:
from sklearn.manifold import TSNE

In [None]:
tsne_input = word_vectors
tsne_input = tsne_input.head(5000)

In [None]:
tsne_input

In [None]:
tsne = TSNE()
tsne_vectors = tsne.fit_transform(tsne_input.values)

In [None]:
tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])

tsne_vectors.head()

In [None]:
tsne_vectors[u'word'] = tsne_vectors.index

In [None]:
tsne_vectors.head()

In [None]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [None]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, reset'),
                   active_scroll=u'wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = u'@word') )

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                 color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);