In [62]:
import numpy as np

### Word embeddings

#### Problem:
#### There are too many words - converting them to one-hot encoding seems crazy.
#### Also, using tf-idf or bag of words disregards the relation between similar/dissimilar words (cat and dog are 
#### more related than cat and train)
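
To make the one-hot problem concrete, here is a minimal sketch with a made-up four-word vocabulary (the words and the `distance` helper are assumptions for illustration). It shows that every pair of distinct one-hot vectors is equally far apart, so "cat" is no closer to "dog" than to "train".

```python
import numpy as np

# Hypothetical four-word vocabulary, used only for illustration.
vocab = ["cat", "dog", "train", "paris"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

def distance(word_a, word_b):
    """Euclidean distance between the one-hot vectors of two words."""
    a = one_hot[vocab.index(word_a)]
    b = one_hot[vocab.index(word_b)]
    return np.linalg.norm(a - b)

# Every pair of distinct words is exactly sqrt(2) apart.
print(distance("cat", "dog"))
print(distance("cat", "train"))
```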


##### Idea:
##### Map each word to a randomly initialized vector of size 100/200/300. 
##### Then train a neural network to update these vectors until they are meaningful. 

##### What were the solutions before?

#### N-gram

- Which word has the highest probability of coming next, given the n-1 specific words seen just before it (for an n-gram model)?

Words: Thank, you, Hello, goodbye

If we have 4 words and we are looking at 2-grams:
    Example: P(you | Thank) = number of times "Thank you" occurs divided by the number of times "Thank" occurs

Given the word Thank, we need to calculate the probability of
 - Thank Hello
 - Thank you
 - Thank goodbye
 - Thank Thank

So we need 4 calculations for each preceding word, which makes 4 x 4 = 16 for the whole vocabulary.
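
The counting rule above can be sketched in a few lines. The toy corpus below is invented for illustration, and `bigram_probability` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

# Hypothetical toy corpus built from the four example words.
corpus = "thank you hello thank you goodbye thank hello".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_probability(previous, word):
    """Estimate P(word | previous) as count(previous word) / count(previous)."""
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / unigram_counts[previous]

# The four calculations needed for the word "thank".
for candidate in ["hello", "you", "goodbye", "thank"]:
    print(candidate, bigram_probability("thank", candidate))
```

With a real vocabulary the table of counts grows as the vocabulary size raised to the n-gram length, which is what the next cell computes.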

In [63]:
def how_many_calc_to_do(gram, voc_size):
    '''Count every possible combination of `gram` words
    drawn from a vocabulary of `voc_size` words.'''
    
    # Use plain Python integers: np.prod over int64 values silently
    # overflows for inputs as large as 10000 ** 7.
    return voc_size ** gram

In [64]:
how_many_calc_to_do(7, 10000)
#### Notice that this is only an approximation and it can be implemented in more efficient ways.

10000000000000000000000000000

![](./img/one_hot_encoding_distance_on_3d.png)

#### Insight I: 
    We can simply turn each word into a random vector of size 100/200/300,
    train a classic neural net to predict the next word, and update both the weights and the random vectors.
    You can think of it as just another layer of weights multiplying the one-hot encoded inputs.
<a href="http://hunterheidenreich.com/blog/intro-to-word-embeddings/">word_embed blog</a>

<a href="https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa">word_embed blog II</a>
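
The "just another layer of weights" view from Insight I can be checked numerically. The sketch below uses toy sizes and a random matrix (all names and values are assumptions for illustration): multiplying a one-hot vector by the weight matrix gives the same result as looking up one row.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 5, 3  # hypothetical toy sizes
# Randomly initialised embedding matrix: one row per word.
toy_embeddings = rng.normal(size=(vocab_size, embedding_dim))

word_index = 2  # hypothetical index of some word in the vocabulary
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying by the one-hot vector selects a single row of the matrix.
via_matmul = one_hot @ toy_embeddings
via_lookup = toy_embeddings[word_index]

print(np.allclose(via_matmul, via_lookup))
```

This is why an embedding layer can be stored as a lookup table: the full matrix product never has to be computed.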

![W2V](./img/w2v.png)

##### Objective: maximize the sum of log-probabilities of each word given its observed window
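
One common way to turn the window idea into training examples is the skip-gram flavour, which pairs each word with its neighbours. The function below is a simplified sketch (the name `skipgram_pairs` and the example sentence are assumptions); real implementations add refinements such as negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["galileo", "was", "under", "house", "arrest"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

Each pair becomes one training example: the network sees the center word and is asked to give a high probability to the context word.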

This idea is very strong in comparison to other options: 

    Bag of words - just counts occurrences
    TF-IDF - a word is either informative or not, but has no relation to other words
    One-hot encoding - for the computer, paris-france is the same distance as paris-blabla

- Distance and direction are meaningful! King - man = Queen - woman
- Now the words massive and huge are similar!
- Extends to sentences, paragraphs and documents.
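
The "King - man = Queen - woman" relation can be illustrated with hand-made vectors. The 2-D values below are invented so that the arithmetic works out exactly; learned embeddings only satisfy such relations approximately.

```python
import numpy as np

# Hand-made 2-D vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(target):
    """Word whose vector has the highest cosine similarity to `target`."""
    return max(vectors, key=lambda word: cosine(vectors[word], target))

# The same direction separates king from man and queen from woman.
print(vectors["king"] - vectors["man"], vectors["queen"] - vectors["woman"])

# king - man + woman lands on queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(target))
```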

![](./img/see_attached_word_embed.png)

##### We have reduced the dimension of the vocabulary by a big factor!
Example: from 80,000 to 300

![](https://lemay-images.nyc3.cdn.digitaloceanspaces.com/job_postings_embedding.gif)

##### Insight I.I  The same thing can be applied to any categorical variable. 
##### With enough training data we can learn its continuous position in space - state of the art

![categorical_embed](./img/categorical_embedding.png) # image I

<a href="https://arxiv.org/pdf/1604.06737.pdf/">source paper - entity embeddings</a>

![german_states](./img/german_states_mapped_2D.png) 

### Takeaways:

##### Word embeddings
- Word/categorical embeddings give meaning to words in relation to one another
- Word/categorical embeddings are computationally efficient
- Training is done through a classic NN with a small window around each word

### Hands on Word2Vec/word embedding

How can we train w2v?
1. Use gensim to do the training for you
2. Build your own w2v using keras
3. Use transfer learning 

#### 1. Using gensim

!pip install gensim

In [65]:
import gensim
import numpy as np
import json
import string

In [66]:
##### Reading in the data

In [67]:
with open('JEOPARDY_QUESTIONS1.json') as f:
    data = json.load(f)

In [68]:
# Let's look at the first element in our list
data[0]

{'category': 'HISTORY',
 'air_date': '2004-12-31',
 'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
 'value': '$200',
 'answer': 'Copernicus',
 'round': 'Jeopardy!',
 'show_number': '4680'}

In [69]:
# Word2Vec requires that our text have the form of a list
# of 'sentences', where each sentence is itself a list of
# words. How can we put our _Jeopardy!_ clues in that shape?

text = []

for clue in data:
    sentence = clue['question'].translate(str.maketrans('', '',
                                                        string.punctuation)).split(' ')
    
    new_sent = []
    for word in sentence:
        new_sent.append(word.lower())
    
    text.append(new_sent)

In [70]:
# Let's check the new structure of our first clue
text[0]

['for',
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life',
 'galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 'mans',
 'theory']

#####  Constructing the model

In [71]:
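# Constructing the model is simply a matter of
# instantiating a Word2Vec object.
# sg=1 selects skip-gram, one flavour of the word2vec algorithm (sg=0 is CBOW).
# Because a corpus is passed in, the constructor also builds the vocabulary
# and runs a first round of training.
model = gensim.models.Word2Vec(text, sg=1)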

##### training 

In [72]:
# The constructor has already trained the model once; calling 'train()'
# runs further epochs on the same corpus.
model.train(text, total_examples=model.corpus_count, epochs=model.epochs)

W0905 21:50:57.763874 4685043136 base_any2vec.py:1182] Effective 'alpha' higher than previous training cycles


(11336478, 15849970)

##### Let's explore our results 

In [73]:
model.wv.most_similar('happiness')

[('ignorance', 0.7723346948623657),
 ('shakespearebr', 0.7453909516334534),
 ('prosperity', 0.73906409740448),
 ('shame', 0.7375794649124146),
 ('wherefore', 0.7300489544868469),
 ('kindness', 0.7293806076049805),
 ('despair', 0.7275824546813965),
 ('vigor', 0.7242348194122314),
 ('hatred', 0.7181109189987183),
 ('autumns', 0.7149204015731812)]

In [74]:
model.wv.similarity('furniture', 'jewelry')

0.61292565
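
`model.wv.similarity` reports the cosine similarity of the two word vectors. A minimal NumPy sketch of that computation follows; the 3-D vectors are made-up stand-ins, not values taken from the trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-D vectors standing in for two learned word vectors.
furniture = np.array([0.2, 0.9, 0.1])
jewelry = np.array([0.3, 0.7, 0.4])

print(cosine_similarity(furniture, jewelry))
```

With the trained model, passing `model.wv['furniture']` and `model.wv['jewelry']` to this function should give approximately the number shown above.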

In [75]:
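# Pass `negative` as a list: the gensim 3.x API used here iterates over
# a bare string character by character.
model.wv.most_similar(positive=['president', 'germany'], negative=['usa'])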

[('emperors', 0.24791167676448822),
 ('dictator', 0.22046294808387756),
 ('destroyed', 0.1994936764240265),
 ('regime', 0.19767695665359497),
 ('exiled', 0.19326935708522797),
 ('stalin', 0.19325149059295654),
 ('emperor', 0.19045278429985046),
 ('headlines', 0.18985213339328766),
 ('mecca', 0.18638044595718384),
 ('coup', 0.1814723014831543)]

In [76]:
model.wv.doesnt_match(['breakfast', 'lunch', 'frog', 'food'])

'frog'

#### 2. using keras

In [77]:
from keras.layers import Embedding
import codecs
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential

##### For your help: 
#####    a nice  function for text cleaning 

In [78]:
########################################
## process texts in datasets
########################################
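import re

from nltk.corpus import stopwords        # needs nltk.download('stopwords') once
from nltk.stem import SnowballStemmer

print('Processing text dataset')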

# The function "text_to_wordlist" is from
# https://www.kaggle.com/currie32/quora-question-pairs/the-importance-of-cleaning-text
def text_to_wordlist(text, remove_stopwords=False, stem_words=False):
    # Clean the text, with the option to remove stopwords and to stem words.
    
    # Convert words to lower case and split them
    text = text.lower().split()

    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]
    
    text = " ".join(text)

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)
    
    # Optionally, shorten words to their stems
    if stem_words:
        text = text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in text]
        text = " ".join(stemmed_words)
    
    # Return the cleaned text as a single string
    return text

Processing text dataset


In [79]:
MAX_VOC_WORDS = 200000
MAX_SEQUENCE_LENGTH = 50

In [80]:
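# `texts_1` (a list of raw text strings) and `labels` are assumed to have
# been loaded beforehand; they are not created in this notebook section.
tokenizer = Tokenizer(num_words=MAX_VOC_WORDS)
tokenizer.fit_on_texts(texts_1[1:5])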

In [81]:
sequences_1 = tokenizer.texts_to_sequences(texts_1[1:5])
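
To see what the tokenizer step produces, here is a simplified pure-Python sketch of the same idea: words are ranked by frequency, indices start at 1, and 0 is left free for padding. The helper names and example sentences are assumptions, and the real Keras `Tokenizer` does more (for example, it filters punctuation).

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Hypothetical example sentences.
example_texts = ["the cat sat", "the dog sat down"]
example_index = build_word_index(example_texts)
print(example_index)
print(to_sequences(example_texts, example_index))
```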

In [82]:
### Notice that the sequences have varying lengths. How can we feed them to a model?
### Solution: padding
data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
labels = np.array(labels)
print('Shape of data tensor:', data_1.shape)
print('Shape of label tensor:', labels[1:5].shape)

Shape of data tensor: (0, 50)
Shape of label tensor: (0,)
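
As a rough illustration of what `padding='post'` does, the sketch below pads short sequences with zeros on the right and, like the Keras default `truncating='pre'`, keeps the last `maxlen` items of sequences that are too long. `pad_post` is a hypothetical helper written for this example, not a Keras function.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` items of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]           # drop items from the front if too long
        if kept:
            padded[row, :len(kept)] = kept   # real tokens first, padding after
    return padded

# Hypothetical index sequences of different lengths.
print(pad_post([[1, 3, 2], [1, 4, 2, 5, 6, 7]], maxlen=5))
```

Every row now has the same length, so the batch can be stored as a single 2-D integer array.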


In [83]:
### The input to the model will be a list of lists. Each inner list holds the indices of the words in one sentence.

In [84]:
model = Sequential()
model.add(Embedding(1000, 64, input_length=10))

#### 3. Transfer learning 

Download before lecture:

!brew install wget
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

In [85]:
from gensim.models import KeyedVectors

In [86]:
nb_words  = 3000000 
MAX_SEQUENCE_LENGTH = 15
EMBEDDING_DIM = 300

In [87]:
########################################
## set directories and parameters
########################################
BASE_DIR = './'
EMBEDDING_FILE = '/Users/omer/Documents/RNN_ex/GoogleNews-vectors-negative300.bin'

In [88]:
########################################
## read pre-trained word vectors
########################################
print('Indexing word vectors')
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, \
        binary=True)
print('Found %s word vectors of word2vec' % len(word2vec.vocab))

Indexing word vectors
Found 3000000 word vectors of word2vec


In [89]:
word2vec.vocab

{'</s>': <gensim.models.keyedvectors.Vocab at 0x1a35b1b518>,
 'in': <gensim.models.keyedvectors.Vocab at 0x1a35b1b4a8>,
 'for': <gensim.models.keyedvectors.Vocab at 0x1a35b1b668>,
 'that': <gensim.models.keyedvectors.Vocab at 0x1a570c3518>,
 'is': <gensim.models.keyedvectors.Vocab at 0x1a35b1b710>,
 'on': <gensim.models.keyedvectors.Vocab at 0x1a35b1b7b8>,
 '##': <gensim.models.keyedvectors.Vocab at 0x1a35b1b828>,
 'The': <gensim.models.keyedvectors.Vocab at 0x1a35b1b898>,
 'with': <gensim.models.keyedvectors.Vocab at 0x1a35b1b908>,
 'said': <gensim.models.keyedvectors.Vocab at 0x1a35b1b978>,
 'was': <gensim.models.keyedvectors.Vocab at 0x1a35b1b9e8>,
 'the': <gensim.models.keyedvectors.Vocab at 0x1a35b1ba58>,
 'at': <gensim.models.keyedvectors.Vocab at 0x1a35b1bac8>,
 'not': <gensim.models.keyedvectors.Vocab at 0x1a35b1bb38>,
 'as': <gensim.models.keyedvectors.Vocab at 0x1a35b1bba8>,
 'it': <gensim.models.keyedvectors.Vocab at 0x1a35b1bc18>,
 'be': <gensim.models.keyedvectors.Vocab at

In [90]:
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for i, word in enumerate(word2vec.vocab):
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

Null word embeddings: 4
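
In practice the embedding matrix is often restricted to the words the tokenizer actually knows, so that row `i` holds the vector of the word with index `i`. The sketch below shows that alignment with toy data; the dictionaries, sizes and names are assumptions standing in for `tokenizer.word_index` and the loaded `word2vec` vectors.

```python
import numpy as np

EMBEDDING_DIM_TOY = 4  # hypothetical; the GoogleNews vectors use 300

# Hypothetical stand-ins for a tokenizer's word index and pre-trained vectors.
toy_word_index = {"the": 1, "cat": 2, "blabla": 3}
pretrained = {
    "the": np.array([0.1, 0.2, 0.3, 0.4]),
    "cat": np.array([0.5, 0.6, 0.7, 0.8]),
}

# Row 0 is reserved for padding, so the matrix needs len(word_index) + 1 rows.
small_matrix = np.zeros((len(toy_word_index) + 1, EMBEDDING_DIM_TOY))
for word, i in toy_word_index.items():
    if word in pretrained:  # words without a pre-trained vector stay all-zero
        small_matrix[i] = pretrained[word]

null_rows = int(np.sum(~small_matrix.any(axis=1)))
print("Null word embeddings:", null_rows)
```

With the real vectors, the membership test and lookup would use `word2vec.vocab` and `word2vec.word_vec(word)` as in the cell above.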


In [97]:
list(word2vec.vocab.keys())[5]

'on'

In [98]:
embedding_matrix[5]

array([ 2.67333984e-02, -9.08203125e-02,  2.78320312e-02,  2.04101562e-01,
        6.22558594e-03, -9.03320312e-02,  2.25830078e-02, -1.61132812e-01,
        1.32812500e-01,  6.10351562e-02, -1.57470703e-02,  8.83789062e-02,
        1.37939453e-02,  4.63867188e-02, -5.59082031e-02, -6.68945312e-02,
        1.22680664e-02,  1.36718750e-01,  1.54296875e-01, -4.61425781e-02,
       -3.93066406e-02, -1.54296875e-01, -1.65039062e-01,  1.07910156e-01,
        3.32031250e-02, -5.10253906e-02,  3.71093750e-02,  1.01562500e-01,
        1.10351562e-01,  2.05078125e-02,  6.77490234e-03,  1.18255615e-03,
       -1.25122070e-02, -1.25000000e-01,  1.48315430e-02, -2.68554688e-02,
       -2.14843750e-02,  1.50756836e-02,  1.38671875e-01,  4.85839844e-02,
       -7.66601562e-02, -1.16699219e-01,  1.06933594e-01,  4.17480469e-02,
        1.28173828e-02, -9.46044922e-03, -2.89306641e-02, -3.85742188e-02,
        2.43164062e-01,  9.52148438e-03,  2.20947266e-02,  2.22656250e-01,
        9.15527344e-03, -

In [99]:
########################################
## define the model structure
########################################
embedding_layer = Embedding(nb_words,
        EMBEDDING_DIM,
        weights=[embedding_matrix],
        input_length=MAX_SEQUENCE_LENGTH,
        trainable=False)

##### Input to our network

In [None]:
from keras.layers import Input

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)