In [1]:
import numpy as np

### Word embeddings

#### Problem:
#### There are too many words - converting them to one-hot encoding seems crazy.
#### Also, using tf-idf or bag of words disregards the relation between similar/dissimilar words (cat and dog are 
#### more related than cat and train)


##### Idea :
##### Map each word to a randomly initialized vector of size 100/200/300. 
##### Then train a neural network to update these vectors till they are meaningful. 

##### What were the earlier solutions?

#### N-gram

- Which word is most likely to come next, given the n-1 specific words we have just seen?

Words: Thank, you, Hello, goodbye

If we have 4 words and we are looking at 2-grams:
    Example: P(you | Thank) = no. of times "Thank you" occurs divided by no. of times "Thank" occurs

We need to calculate the probability of 
 - Thank Hello
 - Thank you
 - Thank goodbye
 - Thank Thank

So we need 4 calculations for the word Thank alone, and 4 x 4 = 16 to cover every possible 2-gram
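
The counting rule above can be sketched in a few lines. The toy corpus is made up for illustration; only the estimate itself (count of the pair divided by count of the first word) comes from the notes.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and tokenized
corpus = "thank you hello thank you goodbye thank goodbye".split()

# Count single words and adjacent word pairs
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(first, second):
    """Estimate P(second | first) = count(first second) / count(first)."""
    return bigram_counts[(first, second)] / unigram_counts[first]

# The four calculations needed for the word "thank"
for candidate in ["hello", "you", "goodbye", "thank"]:
    print("thank", candidate, bigram_prob("thank", candidate))
```

Pairs that never occur in the corpus get probability 0, which is one reason plain n-gram counts struggle with large vocabularies.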

In [4]:
def how_many_calc_to_do(gram, voc_size):
    '''Count all possible n-grams (ordered combinations
    of `gram` words) for a vocabulary of `voc_size` words.'''

    # Plain Python ints avoid the int64 overflow that np.prod hits here
    return voc_size ** gram

In [5]:
how_many_calc_to_do(7, 10000)
#### Notice that this is only an approximation and it can be implemented in more efficient ways.

10000000000000000000000000000

![](./img/one_hot_encoding_distance_on_3d.png)

#### Insight I: 
    we can actually just turn each word into a random vector of size 100/200/300, 
    train a classic neural net to predict the next word, and update both the weights and the word vectors.
    You can think of it as just another layer of weights multiplying the one-hot encoded inputs.
<a href="http://hunterheidenreich.com/blog/intro-to-word-embeddings/">word_embed blog</a>

<a href="https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa">word_embed blog II</a>
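
The "just another layer of weights" view can be checked numerically. This is a minimal sketch with a toy vocabulary and a tiny embedding size (both assumptions for readability): multiplying a one-hot vector by the weight matrix returns exactly one row of that matrix.

```python
import numpy as np

vocab = ["thank", "you", "hello", "goodbye"]  # toy vocabulary (assumption)
embedding_dim = 3                              # tiny size for readability

rng = np.random.default_rng(0)
# One randomly initialized row per word; training would update these rows
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

word_id = vocab.index("hello")
one_hot = np.zeros(len(vocab))
one_hot[word_id] = 1.0

# Multiplying the one-hot vector by the matrix selects a single row ...
via_matmul = one_hot @ embedding_matrix
# ... which is the same as indexing the matrix directly
via_lookup = embedding_matrix[word_id]

print(np.allclose(via_matmul, via_lookup))
```

This is why embedding layers are usually implemented as a table lookup instead of an actual matrix multiplication.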

![W2V](./img/w2v.png)

##### Objective: maximize the sum of probabilities of each word given its observed window
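
One common way to write this objective is as an average log-probability (in practice logs of probabilities are summed, not the raw probabilities). The symbols are notation introduced here, not taken from the notebook: $T$ is the number of words in the corpus, $c$ the window radius and $w_t$ the word at position $t$.

```latex
\frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}\right)
```

This is the CBOW form (predict a word from its window). The skip-gram variant swaps the roles and predicts each window word from the center word.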

This idea is very strong in comparison to other options: 

    Bag of words - just counts occurrences 
    TF-IDF - a word is either informative or not, but has no relation to other words
    One-hot encoding - for the computer, paris-france is the same distance as paris-blabla

- Distance and direction are meaningful! King - man ≈ Queen - woman
- Now the words massive and huge are similar!
- Extends to sentences, paragraphs and documents.
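
The distance argument can be made concrete. In this sketch the one-hot part is exact, while the dense vectors are hand-picked, made-up values (not learned embeddings) that only illustrate what a trained model is meant to achieve.

```python
import numpy as np

words = ["paris", "france", "blabla"]
one_hot = np.eye(len(words))  # each row is the one-hot vector of one word

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# With one-hot vectors every pair of distinct words is sqrt(2) apart
d_paris_france = euclidean(one_hot[0], one_hot[1])
d_paris_blabla = euclidean(one_hot[0], one_hot[2])
print(d_paris_france, d_paris_blabla)

# Made-up dense vectors: related words are placed close together
dense = {"paris": [0.9, 0.8], "france": [0.8, 0.9], "blabla": [-0.7, 0.1]}
dense_paris_france = euclidean(dense["paris"], dense["france"])
dense_paris_blabla = euclidean(dense["paris"], dense["blabla"])
print(dense_paris_france, dense_paris_blabla)
```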

![](./img/see_attached_word_embed.png)

##### We have reduced the dimension of each word's representation by a big factor!
example: from an 80,000-dimensional one-hot vector to a 300-dimensional embedding

![](https://lemay-images.nyc3.cdn.digitaloceanspaces.com/job_postings_embedding.gif)

##### Insight I.I  The same thing can be applied to any categorical variable. 
##### With enough training data we can learn a continuous position in space for each category - a state-of-the-art approach

![categorical_embed](./img/categorical_embedding.png) # image I

![german_states](./img/german_states_mapped_2D.png) 

### Takeaways:

##### Word embeddings
- Word/categorical embeddings give meaning to words in relation to one another
- Word/categorical embeddings are computationally efficient
- Training is done with a classic NN over a small window around each word

### Hands on Word2Vec/word embedding

How can we train w2v?
1. Use gensim to do the training for you
2. Build your own w2v using keras
3. Use transfer learning 

#### 1. Using gensim

!pip install gensim

In [6]:
import gensim
import numpy as np
import json
import string

In [7]:
##### Reading in the data

In [8]:
with open('JEOPARDY_QUESTIONS1.json') as f:
    data = json.load(f)

In [9]:
# Let's look at the first element in our list
data[0]

{'category': 'HISTORY',
 'air_date': '2004-12-31',
 'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
 'value': '$200',
 'answer': 'Copernicus',
 'round': 'Jeopardy!',
 'show_number': '4680'}

In [10]:
# Word2Vec requires that our text have the form of a list
# of 'sentences', where each sentence is itself a list of
# words. How can we put our _Jeopardy!_ clues in that shape?

text = []

for clue in data:
    sentence = clue['question'].translate(str.maketrans('', '',
                                                        string.punctuation)).split(' ')
    
    new_sent = []
    for word in sentence:
        new_sent.append(word.lower())
    
    text.append(new_sent)

In [11]:
# Let's check the new structure of our first clue
text[0]

['for',
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life',
 'galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 'mans',
 'theory']

#####  Constructing the model

In [13]:
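# Simply a matter of instantiating a Word2Vec object.
# Note: passing the sentences here already builds the vocabulary and trains the model.
model = gensim.models.Word2Vec(text, sg=1)
## sg=1 selects skip-gram, one flavour of the word2vec algorithm (sg=0 is CBOW).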

##### training 

In [14]:
# Calling 'train()' runs further passes over the corpus (the constructor call already trained once)
model.train(text, total_examples=model.corpus_count, epochs=model.epochs)

(11336466, 15849970)

##### Let's explore our results 

In [None]:
model.wv.most_similar('happiness')

In [None]:
model.wv.similarity('furniture', 'jewelry')

In [None]:
model.wv.most_similar(positive=['president', 'germany'], negative=['usa'])

In [None]:
model.wv.doesnt_match(['breakfast', 'lunch', 'frog', 'food'])
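
These queries are built on cosine similarity between word vectors. The sketch below shows that computation on made-up 3-d vectors standing in for learned embeddings; the `vectors` dictionary and the `most_similar` helper are hypothetical and are not gensim's implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors standing in for learned embeddings
vectors = {
    "breakfast": [0.9, 0.1, 0.0],
    "lunch":     [0.8, 0.2, 0.1],
    "frog":      [0.0, 0.1, 0.9],
}

def most_similar(word, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("breakfast"))
```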

#### 2. using keras

In [11]:
from keras.layers import Embedding

Using TensorFlow backend.


##### To help you: 
#####    a handy function for text cleaning 

In [None]:
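########################################
## process texts in datasets
########################################
import re

# The stop word list requires a one-time nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

print('Processing text dataset')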

# The function "text_to_wordlist" is from
# https://www.kaggle.com/currie32/quora-question-pairs/the-importance-of-cleaning-text
def text_to_wordlist(text, remove_stopwords=False, stem_words=False):
    # Clean the text, with the option to remove stopwords and to stem words.
    
    # Convert words to lower case and split them
    text = text.lower().split()

    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = [w for w in text if w not in stops]
    
    text = " ".join(text)

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)
    
    # Optionally, shorten words to their stems
    if stem_words:
        text = text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in text]
        text = " ".join(stemmed_words)
    
    # Return the cleaned text as a single string
    return text

In [12]:
MAX_VOC_WORDS = 200000

In [None]:
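## Build a list of cleaned texts: each entry is one sentence as a cleaned string
## TRAIN_DATA_FILE and TEST_DATA_FILE must point to train.csv and test.csv
import codecs
import csv
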
count = 1 
texts_1 = [] 
labels = []
with codecs.open(TRAIN_DATA_FILE, encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)
    for values in reader:
        if count == 1:
            print(values)
        texts_1.append(text_to_wordlist(values[3]))
        labels.append(int(values[5]))
        count += 1
print('Found %s texts in train.csv' % len(texts_1))

test_texts_1 = []
test_ids = []
with codecs.open(TEST_DATA_FILE, encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)
    for values in reader:
        test_texts_1.append(text_to_wordlist(values[1]))
        test_ids.append(values[0])
print('Found %s texts in test.csv' % len(test_texts_1))

In [None]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=MAX_VOC_WORDS)
tokenizer.fit_on_texts(texts_1[1:5] + test_texts_1[1:5])

In [None]:
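sequences_1 = tokenizer.texts_to_sequences(texts_1[1:5])
# The padding cell below also needs the test sequences
test_sequences_1 = tokenizer.texts_to_sequences(test_texts_1[1:5])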

In [None]:
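### Notice that the sequences have varying lengths. How can we feed them to a model?
### Solution: padding (MAX_SEQUENCE_LENGTH is set in the transfer learning section below)
from keras.preprocessing.sequence import pad_sequences

data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH, padding='post')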
labels = np.array(labels)
print('Shape of data tensor:', data_1.shape)
print('Shape of label tensor:', labels[1:5].shape)

test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
test_ids = np.array(test_ids)
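
To see what post-padding does, here is a plain-Python sketch. `pad_post` is a hypothetical helper written for illustration, not the Keras function; it mimics `padding='post'` together with the Keras default of dropping items from the start of sequences that are too long, and it assumes `maxlen` is positive.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # drop leading items when too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy_sequences = [[4, 7], [1, 2, 3, 5, 8, 9], [6]]        # hypothetical word indices
print(pad_post(toy_sequences, maxlen=4))
# → [[4, 7, 0, 0], [3, 5, 8, 9], [6, 0, 0, 0]]
```

Index 0 is used as the padding value because the Keras tokenizer starts real word indices at 1.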

In [None]:
### Input to the model will be a list of lists. Each inner list holds the integer indices of the words in one sentence.

In [None]:
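from keras.models import Sequential

model = Sequential()
# 1000 = vocabulary size, 64 = embedding dimension, 10 = words per input sequence
model.add(Embedding(1000, 64, input_length=10))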
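
The effect of an embedding layer on tensor shapes can be reproduced with NumPy alone. This sketch reuses the sizes from the layer above; the batch size of 32 and the random initialization range are assumptions for illustration.

```python
import numpy as np

vocab_size, embedding_dim, seq_len = 1000, 64, 10    # same sizes as the layer above
batch_size = 32                                       # assumption for illustration

rng = np.random.default_rng(0)
# Stand-in for the layer's trainable weights: one row per word index
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))
# Stand-in for a batch of padded index sequences
batch = rng.integers(0, vocab_size, size=(batch_size, seq_len))

embedded = table[batch]   # look up one row per word index
print(embedded.shape)     # (32, 10, 64)
```

Each integer index becomes a 64-dimensional vector, so a batch of shape (batch, 10) comes out as (batch, 10, 64).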

#### 3. Transfer learning 

Download before lecture:

!brew install wget
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!gunzip GoogleNews-vectors-negative300.bin.gz

In [13]:
MAX_SEQUENCE_LENGTH = 15
EMBEDDING_DIM = 300

In [14]:
########################################
## set directories and parameters
########################################
BASE_DIR = './'
EMBEDDING_FILE = BASE_DIR + 'GoogleNews-vectors-negative300.bin'

In [15]:
########################################
## read pre-trained word vectors
########################################
from gensim.models import KeyedVectors

print('Indexing word vectors')
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
# gensim < 4.0 exposes the vocabulary as `.vocab`; gensim >= 4.0 uses `.key_to_index`
print('Found %s word vectors of word2vec' % len(word2vec.vocab))

Indexing word vectors

In [None]:
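# Keras word indices start at 1, so row 0 is reserved for padding
word_index = tokenizer.word_index
nb_words = min(MAX_VOC_WORDS, len(word_index)) + 1

embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    print(word, i)
    # Skip words beyond the vocabulary cap or missing from word2vec
    if i < nb_words and word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))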
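
The same matrix-building loop can be tried without downloading the large pre-trained file. In this sketch a small dictionary of made-up vectors stands in for `word2vec`, and `toy_word_index` stands in for the tokenizer's word index; both are hypothetical.

```python
import numpy as np

EMBEDDING_DIM_TOY = 3
# Stand-in for the pre-trained vectors (made-up values)
pretrained = {
    "cat": np.array([0.1, 0.2, 0.3]),
    "dog": np.array([0.2, 0.1, 0.4]),
}
# Keras-style word index: real words start at 1, row 0 is left for padding
toy_word_index = {"cat": 1, "dog": 2, "blabla": 3}

toy_matrix = np.zeros((len(toy_word_index) + 1, EMBEDDING_DIM_TOY))
for word, i in toy_word_index.items():
    if word in pretrained:            # words without a vector keep an all-zero row
        toy_matrix[i] = pretrained[word]

null_rows = int(np.sum(np.sum(toy_matrix, axis=1) == 0))
print(null_rows)  # 2: the padding row 0 and the unknown word "blabla"
```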

In [None]:
########################################
## define the model structure
########################################
embedding_layer = Embedding(nb_words,
        EMBEDDING_DIM,
        weights=[embedding_matrix],
        input_length=MAX_SEQUENCE_LENGTH,
        trainable=False)

##### Input to our network

In [None]:
from keras.layers import Input

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)