## Generating word embeddings with IMDB reviews

We'll use the IMDB dataset to create word embeddings. Its training split contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative). For convenience, the words are indexed by their frequency in the dataset, meaning the word with index 1 is the most frequent word. To speed up training, we will keep only 30 words from each review (maxlen = 30) and cap the vocabulary at the 10,000 most frequent words.
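
To make the frequency-based indexing concrete, here is a minimal sketch on a toy corpus. The `reviews` list and the tiny `vocab_size` are made up for illustration, and the sketch ignores the reserved special-token indices that the Keras loader adds by default.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the review texts.
reviews = [
    "the movie was great",
    "the plot was thin but the acting was great",
    "the ending of the film was weak",
]

vocab_size = 5  # keep only the 5 most frequent words (the notebook uses 10,000)

# Count every token, then rank words by descending frequency.
counts = Counter(word for review in reviews for word in review.split())
ranked = [word for word, _ in counts.most_common(vocab_size)]

# Rank 1 is the most frequent word; 0 is left free for padding.
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Encode a review, dropping words that fall outside the vocabulary.
encoded = [word_index[w] for w in reviews[1].split() if w in word_index]
print(word_index)
print(encoded)
```

Words outside the top `vocab_size` simply vanish from the encoded review here; this is a simplification of how out-of-vocabulary words are handled.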

In [70]:
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000
maxlen = 30
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = vocab_size)



In [75]:
x_train = pad_sequences(x_train, maxlen = maxlen)  #makes every review the same length: pads short ones with 0s, truncates long ones
x_test = pad_sequences(x_test, maxlen = maxlen)
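
As a rough illustration of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`), the sketch below handles a single list in plain Python. The helper name `pad_pre` is hypothetical and not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items.

    Mirrors the default behaviour of Keras pad_sequences
    (padding='pre', truncating='pre') for a single list.
    """
    trimmed = sequence[-maxlen:]                 # keep the LAST maxlen items
    padding = [value] * (maxlen - len(trimmed))  # zeros go in front
    return padding + trimmed

short_review = [14, 22, 16]
long_review = list(range(1, 11))  # 1..10

padded_short = pad_pre(short_review, maxlen=5)
padded_long = pad_pre(long_review, maxlen=5)
print(padded_short)  # [0, 0, 14, 22, 16]
print(padded_long)   # [6, 7, 8, 9, 10]
```

Note that with these defaults a long review keeps its final words, not its opening ones.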

In [76]:
x_train.shape

(25000, 30)

In [84]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()
model.add(Embedding(10000, 8, input_length=30))  #10000 vocab, 8 features
model.add(Flatten())
model.add(Dense(1, activation = 'sigmoid'))
model.compile(optimizer = 'rmsprop', loss ='binary_crossentropy', metrics=['acc'])
model.summary() 


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 30, 8)             80000     
_________________________________________________________________
flatten_5 (Flatten)          (None, 240)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 241       
Total params: 80,241
Trainable params: 80,241
Non-trainable params: 0
_________________________________________________________________
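
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the vocabulary size, embedding width, and sequence length used above.

```python
vocab_size = 10000   # rows in the embedding matrix
embedding_dim = 8    # features per word
maxlen = 30          # words per padded review

embedding_params = vocab_size * embedding_dim   # one 8-d vector per vocabulary index
flatten_units = maxlen * embedding_dim          # (30, 8) flattened into one vector
dense_params = flatten_units * 1 + 1            # one weight per input plus a single bias

total_params = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total_params)
```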


In [81]:
x_train[1]

array([ 371,   78,   22,  625,   64, 1382,    9,    8,  168,  145,   23,
          4, 1690,   15,   16,    4, 1355,    5,   28,    6,   52,  154,
        462,   33,   89,   78,  285,   16,  145,   95], dtype=int32)

In [85]:
epochs = 3
history = model.fit(x_train, y_train, batch_size = 32, epochs = epochs)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [86]:
model.layers

[<keras.layers.embeddings.Embedding at 0x12624b400>,
 <keras.layers.core.Flatten at 0x12624b390>,
 <keras.layers.core.Dense at 0x12624b668>]

In [105]:
embeddings = model.layers[0].get_weights()[0]

In [106]:
embeddings.shape

(10000, 8)
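
Each row of this matrix is the learned vector for one word index. The sketch below shows the lookup and flattening steps on a tiny random matrix; the sizes and the example `review` are hypothetical stand-ins, not values from the trained model.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 3   # tiny hypothetical sizes

# Stand-in for the trained weight matrix: one row per word index.
embeddings = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
              for _ in range(vocab_size)]

review = [0, 0, 4, 1, 5]           # a padded, integer-encoded review

# The Embedding layer acts as a table lookup: index i selects row i.
looked_up = [embeddings[i] for i in review]

# Flatten concatenates the rows into one long feature vector.
flattened = [x for row in looked_up for x in row]
print(len(looked_up), len(flattened))
```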

# Game of Thrones: Generating word embeddings with Gensim


There are many choices for NLP libraries in Python, including spaCy, NLTK and gensim. Today we will use gensim because it has an intuitive word2vec implementation. This implementation is highly optimized and should run very fast, even on a CPU alone.

In [7]:
from gensim.models import Word2Vec #prebuilt word2vec implementation
import glob #finds all pathnames matching a shell-style wildcard pattern
import codecs #unicode support when reading files
from multiprocessing import cpu_count #used to get the number of CPUs on the host machine
from gensim.utils import simple_preprocess,simple_tokenize #text processing
from string import punctuation #string containing all punctuation
import numpy as np

In [8]:
book_filenames = sorted(glob.glob("data/*.txt"))
print ("Found books:")
book_filenames

Found books:


['data/got1.txt',
 'data/got2.txt',
 'data/got3.txt',
 'data/got4.txt',
 'data/got5.txt']

In [9]:
corpus_raw = u""
#for each book, open it as UTF-8, read it,
#and append its text to the raw corpus
for book_filename in book_filenames:
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()

print("Corpus is {0} characters long".format(len(corpus_raw)))

Corpus is 9719485 characters long


In [10]:
#manual preprocessing; the next cell redoes this with gensim's simple_preprocess
table = str.maketrans("","",punctuation)     #create translation table
text = corpus_raw.translate(table) #remove punctuation
sentences = text.split('\n') #split at new lines (one "sentence" per line)
sentences = list(filter(None,sentences)) #remove empty strings
for i,sentence in enumerate(sentences):
    sentences[i] = sentence.lower().split() #lower case and split into words

In [11]:
sentences = corpus_raw.split('\n') #split at new lines
sentences =  filter(None, sentences) # remove empty strings
sentences =  list(map(simple_preprocess,sentences)) #clean text 
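
For intuition about what `simple_preprocess` does to each line, here is an approximation in plain Python: lowercase the text, keep runs of letters, and drop tokens shorter than 2 or longer than 15 characters. The function `rough_preprocess` is a hypothetical stand-in, and its regular expression is an assumption that only roughly matches gensim's tokenizer.

```python
import re

def rough_preprocess(line, min_len=2, max_len=15):
    """Approximate gensim's simple_preprocess for one line of text."""
    # Lowercase, then keep runs of letters only (digits and punctuation split tokens).
    tokens = re.findall(r"[^\W\d_]+", line.lower())
    # Drop very short and very long tokens.
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = rough_preprocess("Copyright 1999 by George R. R. Martin")
print(tokens)  # ['copyright', 'by', 'george', 'martin']
```

This explains why single-letter initials and numbers are missing from the tokenized output.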

In [12]:
sentences[0]

['this',
 'edition',
 'contains',
 'the',
 'complete',
 'text',
 'of',
 'the',
 'original',
 'hardcover',
 'edition']

In [9]:
np.shape(sentences)

(44758,)

In [36]:
sentences

[['this',
  'edition',
  'contains',
  'the',
  'complete',
  'text',
  'of',
  'the',
  'original',
  'hardcover',
  'edition'],
 ['not', 'one', 'word', 'has', 'been', 'omitted'],
 ['clash', 'of', 'kings'],
 ['bantam', 'spectra', 'book'],
 ['publishing', 'history'],
 ['bantam', 'spectra', 'hardcover', 'edition', 'published', 'february'],
 ['bantam', 'spectra', 'paperback', 'edition', 'september'],
 ['spectra',
  'and',
  'the',
  'portrayal',
  'of',
  'boxed',
  'are',
  'trademarks',
  'of',
  'bantam',
  'books',
  'division',
  'of',
  'random',
  'house',
  'inc'],
 ['all', 'rights', 'reserved'],
 ['copyright', 'by', 'george', 'martin'],
 ['maps', 'by', 'james', 'sinclair'],
 ['heraldic', 'crest', 'by', 'virginia', 'norey'],
 ['library', 'of', 'congress', 'catalog', 'card', 'number'],
 ['no',
  'part',
  'of',
  'this',
  'book',
  'may',
  'be',
  'reproduced',
  'or',
  'transmitted',
  'in',
  'any',
  'form',
  'or',
  'by',
  'any',
  'means',
  'electronic',
  'or',
  'mech

In [13]:
workers = cpu_count()

In [122]:
Word2Vec?

In [14]:
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=workers) #fit model (gensim 3.x API; gensim 4+ renames size to vector_size)
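
To illustrate what the `window` and `min_count` arguments control, the sketch below lists which words count as context for each center word. It is a simplified picture under stated assumptions: `context_pairs` and the toy sentences are hypothetical, and gensim additionally subsamples frequent words and randomly shrinks the window, which is not modeled here.

```python
from collections import Counter

def context_pairs(sentences, window=2, min_count=1):
    """Yield (center, context) word pairs from tokenised sentences."""
    counts = Counter(w for s in sentences for w in s)
    for sentence in sentences:
        # Words rarer than min_count are removed before windows are formed.
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo, hi = max(0, i - window), min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, kept[j]

toy = [["winter", "is", "coming"], ["winter", "is", "here"]]
pairs = list(context_pairs(toy, window=1, min_count=2))
print(pairs)
```

In the toy run, "coming" and "here" each appear only once, so they are dropped and only the "winter"/"is" pairs remain.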

In [15]:
model.wv.vocab

{'this': <gensim.models.keyedvectors.Vocab at 0x105f40ba8>,
 'edition': <gensim.models.keyedvectors.Vocab at 0x1a112e6ef0>,
 'the': <gensim.models.keyedvectors.Vocab at 0x1a112e6a90>,
 'complete': <gensim.models.keyedvectors.Vocab at 0x1a112e6cf8>,
 'of': <gensim.models.keyedvectors.Vocab at 0x1a112e69b0>,
 'not': <gensim.models.keyedvectors.Vocab at 0x1a112e6e10>,
 'one': <gensim.models.keyedvectors.Vocab at 0x1a112e6f98>,
 'word': <gensim.models.keyedvectors.Vocab at 0x1a112e6e80>,
 'has': <gensim.models.keyedvectors.Vocab at 0x1a112e6e48>,
 'been': <gensim.models.keyedvectors.Vocab at 0x1a112e6c88>,
 'clash': <gensim.models.keyedvectors.Vocab at 0x1a1130f080>,
 'kings': <gensim.models.keyedvectors.Vocab at 0x1a1130f208>,
 'bantam': <gensim.models.keyedvectors.Vocab at 0x1a1130f198>,
 'spectra': <gensim.models.keyedvectors.Vocab at 0x1a1130f240>,
 'book': <gensim.models.keyedvectors.Vocab at 0x1a1130f0b8>,
 'history': <gensim.models.keyedvectors.Vocab at 0x1a1130f128>,
 'published': 

In [16]:
len(model.wv.vocab)

11766

In [15]:
model.wv.vectors.shape  #embedding vectors

(11766, 100)

In [16]:
model.wv['man']
#model.wv.most_similar('man')

array([-0.93118995,  0.70357364, -1.77665591, -1.22589934,  1.44295657,
       -0.85105968, -0.74845225, -1.1224184 , -1.5676645 , -0.96208918,
       -0.50178736,  1.41906106, -1.45686841,  0.59505284, -0.50870335,
       -0.79569578, -0.69877213,  3.36645627,  1.30326569, -1.16167021,
        1.04591501, -1.96877921, -0.1211483 ,  0.30852014, -0.20455964,
       -0.49310881, -1.1012764 , -1.01252282,  0.78260958,  0.63978231,
       -2.41301322, -0.71702802, -0.46050122, -1.34534812, -1.24001336,
       -1.32104027, -0.74939841, -0.06482343,  1.05338466,  1.9677974 ,
        0.16377982,  2.31893206,  2.63010907, -0.56471789,  0.80483538,
        0.1071474 ,  0.87939471, -0.640441  , -0.48702291,  0.64615232,
        0.75853151,  2.27553225,  0.54368836,  1.78162348, -1.37585557,
       -0.63799083, -1.63012612,  0.5040136 , -0.23040193, -1.55083239,
        0.91517597, -0.66354257, -1.09628105, -2.5788064 , -0.39855826,
       -0.7770732 , -0.978293  , -3.53314686,  1.92458963,  1.28

In [18]:
model.wv.most_similar(positive=['she', 'king'], negative=['man']) 

[('cersei', 0.6207334995269775),
 ('queen', 0.6201149225234985),
 ('sansa', 0.5549461841583252),
 ('catelyn', 0.535677969455719),
 ('dany', 0.5277178287506104),
 ('margaery', 0.5257571339607239),
 ('joffrey', 0.4984680414199829),
 ('daenerys', 0.4920499324798584),
 ('viserys', 0.4814302921295166),
 ('arya', 0.4803134799003601)]
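
The analogy query above amounts to vector arithmetic followed by a cosine-similarity ranking. Here is a minimal sketch of that idea on hand-made 3-dimensional vectors; the `vectors` dictionary and the `analogy` helper are hypothetical and only imitate the general behaviour of `most_similar`.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

# Hypothetical toy vectors with dimensions [royalty, male, female].
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "she":    [0.1, 0.1, 0.9],
    "castle": [0.5, 0.5, 0.5],
    "wolf":   [0.1, 0.6, 0.2],
}

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)   # input words are not candidates
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = analogy(["she", "king"], ["man"], vectors)
print(ranked[0][0])
```

Subtracting "man" cancels most of the male component of "king", so the query lands near words that are both royal and female.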

## t-SNE Dimensionality Reduction

In [17]:
from sklearn.manifold import TSNE #for dimensionality reduction
import pandas as pd 
from sklearn.manifold import MDS

In [18]:
tsne = TSNE(n_components=2, perplexity=5, random_state=0)  #2 output dimensions, perplexity of 5
tsne_vectors = tsne.fit_transform(model.wv.vectors[:1000])  #use first 1000 vectors

In [19]:
words = model.wv.index2word[:1000]

In [20]:
mds = MDS(n_components=2)
mds_vectors = mds.fit_transform(model.wv.vectors[:1000])

In [21]:
tsne_vectors

array([[ 27.940817  ,  -1.1939619 ],
       [ -8.301733  , -23.934116  ],
       [-10.567223  , -10.575001  ],
       ...,
       [-34.2763    ,   0.49309605],
       [-63.40205   ,  -4.828011  ],
       [ 28.243414  , -14.405399  ]], dtype=float32)

## Plotting with Bokeh

In [22]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, LabelSet, value
output_notebook()

In [56]:
#create a dataframe to plot with
df = pd.DataFrame(tsne_vectors,index=words,columns=['x_coord','y_coord'])
df["words"] = df.index
df.index.name = 'word'
df.head()

| word | x_coord | y_coord | words |
|------|---------|---------|-------|
| the | 27.940817 | -1.193962 | the |
| and | -8.301733 | -23.934116 | and |
| to | -10.567223 | -10.575001 | to |
| of | 18.352484 | -26.73983 | of |
| he | 24.756479 | 28.214083 | he |


In [57]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(df)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, reset'),
                   active_scroll=u'wheel_zoom')


In [160]:

# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = u'@word') )

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                 color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')


# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# To show the label for each word (comment out if not needed)
labels = LabelSet(x="x_coord", y="y_coord", text= "words" , y_offset=8,
                  text_font_size= "8pt", text_color="#555555",
                  source=plot_data, text_align='center')

tsne_plot.add_layout(labels)

# plot!
show(tsne_plot);