# Word Embeddings on Harry Potter

##### (Notebook by Itay Hazan)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

## Step 1: Use gensim implementation

In [3]:
!pip install gensim
import gensim


The following code reads the entire Harry Potter series into a list, split by periods:

In [30]:
def get_harry_potter_books():
    books = []
    for i in range(7):
        with open('HarryPotter/{}.txt'.format(i+1)) as f:
            books += f.read().split('.')
            
    return books

In [31]:
books = get_harry_potter_books()

Complete the following function, which performs some basic pre-processing on the text:

In [34]:
def pre_processing(books):
    # gensim.utils.tokenize covers all the pre-processing steps at once:
    # with lower=True it lowercases the text, and it yields only alphabetic
    # tokens, so end-of-line characters and punctuation are dropped and the
    # words come out already split.
    lst = [list(gensim.utils.tokenize(book, lower=True)) for book in books]

    # Return a list of lists: element i of the outer list holds the words of
    # the i'th period-delimited chunk of text (roughly one sentence).
    return lst

In [35]:
books = pre_processing(books)

Print the first 20 sentences (after pre-processing):

In [37]:
for i in range(20):
    print(books[i])

['the', 'boy', 'who', 'lived', 'mr']
['and', 'mrs']
['dursley', 'of', 'number', 'four', 'privet', 'drive', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', 'thank', 'you', 'very', 'much']
['they', 'were', 'the', 'last', 'people', 'you', 'd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', 'because', 'they', 'just', 'didn', 't', 'hold', 'with', 'such', 'nonsense']
['mr']
['dursley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'grunnings', 'which', 'made', 'drills']
['he', 'was', 'a', 'big', 'beefy', 'man', 'with', 'hardly', 'any', 'neck', 'although', 'he', 'did', 'have', 'a', 'very', 'large', 'mustache']
['mrs']
['dursley', 'was', 'thin', 'and', 'blonde', 'and', 'had', 'nearly', 'twice', 'the', 'usual', 'amount', 'of', 'neck', 'which', 'came', 'in', 'very', 'useful', 'as', 'she', 'spent', 'so', 'much', 'of', 'her', 'time', 'craning', 'over', 'garden', 'fences', 'spying', 'on', 'the', 'neighbors']
['the', 'dursleys

In [69]:
len(books)

86672

Next, we initialize a Word2Vec model from gensim:

In [38]:
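# The constructor builds the vocabulary and already trains on `books` with default settings; train() adds 10 more epochs.
# `size` is the gensim 3.x parameter name (gensim 4 calls it `vector_size`).
model = gensim.models.Word2Vec(books, size=150, window=10, min_count=2, workers=10)
model.train(books, total_examples=len(books), epochs=10)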

(8337046, 11208760)

In [46]:
wv = lambda x: model.wv.word_vec(x)

# w1 = "lord"
model.wv.most_similar(positive=[wv("harry")-wv("hermione")+wv("ron")])

#model.wv.most_similar(positive="harry")



[('ron', 0.6754653453826904),
 ('harry', 0.5371386408805847),
 ('krum', 0.33083343505859375),
 ('bravely', 0.315293550491333),
 ('he', 0.31319132447242737),
 ('cedric', 0.30808591842651367),
 ('crookshanks', 0.2866794168949127),
 ('greyback', 0.28020909428596497),
 ('gotcha', 0.2790670096874237),
 ('tentatively', 0.2658160924911499)]

## Step 2: Implementing CBOW with Negative Sampling

### Step 2.1: Setting things up 

First, get a list of all unique words in the dataset, and sort them alphabetically.

In [60]:
# Deduplicate first, then sort: a set is unordered, so sorting before set() has no lasting effect.
all_words = sorted({item for sublist in books for item in sublist})

Next, create an inverted index of the words:

In [64]:
vocabulary = {}
word_list = list(all_words)
for i, word in enumerate(word_list):
    vocabulary[word] = i

Now, compute the number of occurrences of each word in our dataset (a histogram):

In [106]:
hist = []
# TODO: complete hist
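
One possible way to build the histogram, shown on a toy corpus so it runs on its own. The names `toy_books`, `toy_word_list`, `toy_vocabulary` and `toy_hist` are hypothetical stand-ins for `books`, `word_list`, `vocabulary` and `hist`; the sketch assumes that `hist[i]` should hold the count of the word whose index in the vocabulary is `i`.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the pre-processed `books`.
toy_books = [["the", "boy", "who", "lived"], ["the", "boy", "the", "owl"]]

# Sorted unique words and their inverted index, mirroring the cells above.
toy_word_list = sorted({w for sentence in toy_books for w in sentence})
toy_vocabulary = {w: i for i, w in enumerate(toy_word_list)}

# Count every token once, then lay the counts out in vocabulary order,
# so that toy_hist[i] is the number of occurrences of toy_word_list[i].
counts = Counter(w for sentence in toy_books for w in sentence)
toy_hist = [counts[w] for w in toy_word_list]

print(toy_word_list)
print(toy_hist)
```

Keeping the histogram aligned with the vocabulary indices means the same index can later be used for sampling and for looking up an embedding row.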

Now, given the histogram `hist`, write a function that returns a probability distribution over the words in the histogram such that a more frequent word has a higher probability of being chosen:
$$ \Pr [\text{sampling $i$'th word}] = \frac{hist[i]}{\sum_j hist[j]} $$

Remark: it is customary to raise the elements on the right-hand side of the equality to some power smaller than 1, e.g.:
$$ \Pr [\text{sampling $i$'th word}] = \frac{hist[i] ^{3/4}}{\sum_j hist[j]^{3/4}} $$
You may use this in your code as well (empirically, it gives better performance).

In [None]:
distribution = None  # TODO: compute the sampling distribution from hist
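
A minimal sketch of the smoothed distribution, assuming the histogram is a plain sequence of counts. The function name `unigram_distribution`, the `power` parameter and the toy counts are illustrative choices, not part of the original notebook.

```python
import numpy as np

def unigram_distribution(hist, power=0.75):
    """Return sampling probabilities proportional to hist[i] ** power."""
    weights = np.asarray(hist, dtype=np.float64) ** power
    return weights / weights.sum()

# Hypothetical counts for a five-word vocabulary.
toy_hist = [2, 1, 1, 3, 1]
toy_distribution = unigram_distribution(toy_hist)

# Draw five word indices according to the distribution.
rng = np.random.default_rng(0)
samples = rng.choice(len(toy_hist), size=5, p=toy_distribution)

print(toy_distribution)
print(samples)
```

With `power=1.0` this reduces to the plain relative frequencies of the first formula; a power below 1 flattens the distribution, so rare words are sampled somewhat more often than their raw frequency suggests.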

### Step 2.2: Construct the train set
We define the window size and the number of negative samples drawn per positive example:

In [113]:
window_size = 10
neg_sample_size = 5

Our train set will consist of labeled pairs `(x=(context, center), y=0/1)`.

To create the train set:

 1. For every `window=(context, center)` of the input:
   1. Add the pair `(x=(context, center), y=1)` to the dataset.
   1. Sample `k = neg_sample_size` words, `w_1, ..., w_k`, from the distribution we computed in step 2.1, and add all the pairs `(x=(context, w_i), y=0)` to the dataset.

In [114]:
# TODO: generate the dataset
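
The procedure above can be sketched as follows. This is one possible reading, not the notebook's reference solution: it assumes the context is up to `window_size` words on each side of the center word, that windows do not cross sentence boundaries, and that examples are stored as three parallel lists. The function name `build_dataset` and the toy inputs are hypothetical.

```python
import numpy as np

def build_dataset(sentences, vocabulary, distribution,
                  window_size, neg_sample_size, seed=0):
    """Return (contexts, candidates, labels) for CBOW with negative sampling."""
    rng = np.random.default_rng(seed)
    contexts, candidates, labels = [], [], []
    for sentence in sentences:
        ids = [vocabulary[w] for w in sentence]
        for pos, center in enumerate(ids):
            # Assumption: up to `window_size` words on each side of the center.
            left = ids[max(0, pos - window_size):pos]
            right = ids[pos + 1:pos + 1 + window_size]
            context = left + right
            if not context:
                continue  # a one-word sentence has no context
            # Positive example: the context paired with its true center word.
            contexts.append(context)
            candidates.append(center)
            labels.append(1)
            # Negative examples: the same context paired with sampled words.
            negatives = rng.choice(len(distribution), size=neg_sample_size,
                                   p=distribution)
            for neg in negatives:
                contexts.append(context)
                candidates.append(int(neg))
                labels.append(0)
    return contexts, candidates, labels

# Hypothetical toy inputs.
toy_sentences = [["the", "boy", "who", "lived"]]
toy_vocabulary = {"boy": 0, "lived": 1, "the": 2, "who": 3}
toy_distribution = [0.25, 0.25, 0.25, 0.25]

contexts, candidates, labels = build_dataset(
    toy_sentences, toy_vocabulary, toy_distribution,
    window_size=1, neg_sample_size=2)
print(len(labels), sum(labels))
```

Note that a sampled negative can coincide with the true center word; this sketch does not filter such cases, which is a simplification you may want to revisit.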

### Step 2.3: Construct the neural net
We are going to create the following network architecture for negative sampling.

![Negative Sampling Architecture](neg_sampling.png "Negative Sampling")

In [None]:
size = 150 # dimension of the embedding
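
A rough NumPy sketch of one common CBOW-with-negative-sampling architecture. The figure is not reproduced here, so the structure is an assumption: the context embeddings are averaged, the result is scored against an output embedding of the candidate word by a dot product, and a sigmoid turns the score into the probability that the label is 1. The notebook imports TensorFlow; NumPy is used here only to keep the sketch short and self-contained. The class name `CBOWNegativeSampling`, the matrices `W_in` and `W_out`, and the learning rate are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class CBOWNegativeSampling:
    def __init__(self, vocab_size, size, seed=0):
        rng = np.random.default_rng(seed)
        # One embedding table for context words, one for candidate words.
        self.W_in = rng.normal(scale=0.1, size=(vocab_size, size))
        self.W_out = rng.normal(scale=0.1, size=(vocab_size, size))

    def predict(self, context, candidate):
        """Probability that `candidate` is the true center of `context`."""
        h = self.W_in[context].mean(axis=0)
        return sigmoid(h @ self.W_out[candidate])

    def train_step(self, context, candidate, label, lr=0.05):
        """One SGD step on the binary cross-entropy of a single example."""
        h = self.W_in[context].mean(axis=0)
        p = sigmoid(h @ self.W_out[candidate])
        loss = -(label * np.log(p) + (1 - label) * np.log(1 - p))
        grad = p - label  # derivative of the loss w.r.t. the score
        grad_h = grad * self.W_out[candidate]
        self.W_out[candidate] -= lr * grad * h
        # Each context row gets an equal share of the gradient; subtract.at
        # also handles a word that appears more than once in the context.
        np.subtract.at(self.W_in, context, lr * grad_h / len(context))
        return loss

# Toy run: repeatedly train on one positive example.
model = CBOWNegativeSampling(vocab_size=5, size=8)
toy_context, toy_center = [0, 1, 2], 3
before = model.predict(toy_context, toy_center)
losses = [model.train_step(toy_context, toy_center, label=1) for _ in range(50)]
after = model.predict(toy_context, toy_center)
first_loss, final_loss = losses[0], losses[-1]
print(before, after)
```

In a real run, `train_step` would be called over the shuffled dataset from step 2.2 with `size = 150`, and the rows of `W_in` would serve as the word embeddings.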

### Step 2.4: Train and evaluate

Write a function that, given a word, returns the 10 most similar words to it.

In [134]:
def most_similar(word):
    return None
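
One way to approach this is cosine similarity over the rows of the embedding matrix. The sketch below assumes the embeddings are available as a 2-D NumPy array whose row `i` belongs to `word_list[i]`; the function name `most_similar_sketch`, its extra parameters and the toy vectors are hypothetical.

```python
import numpy as np

def most_similar_sketch(word, embeddings, vocabulary, word_list, topn=10):
    """Return the `topn` words closest to `word` by cosine similarity."""
    # Normalise the rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    query = vocabulary[word]
    sims = unit @ unit[query]
    # Sort by decreasing similarity and skip the query word itself.
    order = [int(i) for i in np.argsort(-sims) if i != query]
    return [(word_list[i], float(sims[i])) for i in order[:topn]]

# Hypothetical two-dimensional embeddings for four words.
toy_word_list = ["harry", "hermione", "owl", "ron"]
toy_vocabulary = {w: i for i, w in enumerate(toy_word_list)}
toy_embeddings = np.array([[1.0, 0.0],
                           [0.9, 0.1],
                           [0.0, 1.0],
                           [0.8, 0.3]])

result = most_similar_sketch("harry", toy_embeddings, toy_vocabulary,
                             toy_word_list, topn=2)
print(result)
```

Returning `(word, similarity)` pairs mirrors the shape of the gensim output shown in Step 1, which makes the two implementations easy to compare side by side.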

Play with it :)