---
# Word2Vec with Python

---

Google's Word2Vec is a deep-learning inspired method that focuses on the meaning of words. Word2Vec attempts to understand meaning and semantic relationships among words. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient.

The Word2Vec tool takes a text corpus as input and produces word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representations of the words. The resulting word vector file can be used as features in many natural language processing and machine learning applications. Word2Vec does not need labels in order to create meaningful representations. This is useful, since most data in the real world is unlabeled. If the network is given enough training data (tens of billions of words), it produces word vectors with intriguing characteristics. Words with similar meanings appear in clusters, and clusters are spaced such that some word relationships, such as analogies, can be reproduced using vector math. The most famous examples from well-trained word vectors are `"king - man + woman = queen"` and `"Paris - France + Italy = Rome"`.
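To make the "vector math" idea concrete, here is a minimal sketch of the analogy arithmetic. The two-dimensional vectors are invented for illustration (they are not learned by Word2Vec), and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Toy vectors made up for this example: axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones,
    and return the remaining word closest to the result."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy(["king", "woman"], ["man"], vectors)
print(result)
```

With real word vectors the match is rarely this clean, which is why the result is usually reported as a ranked list of nearest words.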
 
In Python, we will use the excellent implementation of Word2Vec from the [`gensim` package](https://pypi.python.org/pypi/gensim). If you don't already have gensim installed, you'll need to install it using pip:

    sudo pip3 install gensim
    
Although Word2Vec does not require graphics processing units (GPUs) like many deep learning algorithms, it is compute intensive. Both Google's version and the Python version rely on multi-threading (running multiple processes in parallel on your computer to save time). In order to train your model in a reasonable amount of time, you will need to install `cython` (instructions [here](http://docs.cython.org/src/quickstart/install.html)). Word2Vec will run without `cython` installed, but it will take days to run instead of minutes.

## 1. Dataset reading

In this lesson, we will use a dataset for binary sentiment classification containing user reviews of movies from the IMDb web site. You need to download the dataset called <u><span style="color: red">movie.zip (81.1Mb)</span></u> from the [http://www.cs.cornell.edu](http://www.cs.cornell.edu/people/pabo/movie-review-data/) web site. It contains about 28,000 reviews in HTML format. 

After downloading the zip file, unzip it in the folder containing the current IPython notebook.

Let's look at the content of one of the HTML files containing reviews of a movie. To parse the HTML and extract the data between HTML tags, we will use the [`BeautifulSoup` Python library](http://www.crummy.com/software/BeautifulSoup/), which we have used earlier.

In [None]:
from bs4 import BeautifulSoup

# Read HTML file
with open('polarity_html/movie/0002.html') as f:
    html = f.read()
# Create a new BeautifulSoup instance
soup = BeautifulSoup(html, "lxml")
# Display HTML file content in a prettified format
print(soup.prettify())

As you can see, each user review is wrapped in a `<p>` HTML tag, but there are also some `<p>` tags with auxiliary information, which we should skip. We can also see that the `<p>` tags we need are positioned after the first `<pre>` tag and before the last one. All other HTML files with reviews have similar content (please check it). Thus, we will keep only the HTML text placed between these `<pre>` tags.

In [None]:
# HTML tags (including <pre></pre>) are written in uppercase in the dataset's files
# Pass the parser explicitly, as in the previous cell
soup = BeautifulSoup(' '.join(html.split('</PRE>')[1:-1]), "lxml")
print(soup.prettify())

Now we can get text of reviews using [`findAll`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) method.

In [None]:
for num, i in enumerate(soup.findAll('p')):
    print(num, '\n', i.text)

## 2. Bag-of-words model of text representation

Now we can get the text of the reviews from one HTML file and process them, but how do we convert them to a numeric representation for machine learning, in particular for classifying a review as good or bad, or for predicting which movie a comment is about? One common approach is called **Bag-of-words**. The Bag-of-words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:

    Sentence 1: "The cat sat on the hat"

    Sentence 2: "The dog ate the cat and the hat"

From these two sentences, our vocabulary is as follows:

    { the, cat, sat, on, hat, dog, ate, and }

To get our bags of words, we count the number of times each word occurs in each sentence. In `Sentence 1`, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the feature vector for `Sentence 1` is:

    { the, cat, sat, on, hat, dog, ate, and }

    Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, the features for `Sentence 2` are: 

    Sentence 2: { 3, 1, 0, 0, 1, 1, 1, 1}
    
<img src="images/bag-of-words.png" width=75%>
    
This vector representation does not preserve the order of the words in the original sentences. This kind of representation has several successful applications, for example email filtering.
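The counting procedure above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how `scikit-learn` implements it; the helper name `bag_of_words` is hypothetical, and the vocabulary is kept in order of first appearance so that it matches the example.

```python
from collections import Counter

sentences = [
    "The cat sat on the hat",
    "The dog ate the cat and the hat",
]

# Build the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

def bag_of_words(sentence, vocabulary):
    """Count how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(s, vocabulary) for s in sentences]
print(vocabulary)
for vec in vectors:
    print(vec)
```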

We'll be using the `feature_extraction` module from `scikit-learn` to create Bag-of-words features. If you do not have `scikit-learn` installed, follow [these](http://scikit-learn.org/stable/install.html) instructions.

First, we need to split a paragraph into sentences. There are all kinds of gotchas in natural language. English sentences can end with "?", "!", a closing quotation mark, or ".", among other things, and spacing and capitalization are not reliable guides either. For this reason, we'll use the Python [Natural Language Toolkit](http://www.nltk.org)'s punkt tokenizer for sentence splitting. You'll need to [install](http://www.nltk.org/install.html) the library if you don't already have it on your computer.
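To see why a simple rule is not enough, consider a naive splitter that breaks the text after every ".", "!" or "?" followed by whitespace. The function name `naive_split` and the sample text are made up for this sketch.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Dr. Smith loved the film. Did you? I did!"
pieces = naive_split(text)
for piece in pieces:
    print(repr(piece))
```

The text contains three sentences, but the naive rule returns four pieces because it also splits after the abbreviation "Dr.". A trained tokenizer such as punkt is designed to cope with cases like this.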

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

In [None]:
import re

text = ''
for i in soup.findAll('p'):
    text += ' ' + i.text

# Lowercase text, remove "\n" symbols
text = text.lower().replace('\n', ' ').strip()
    
sentences = sent_tokenize(text)
# Remove non-letters (build a list so the sentences can be indexed and reused later)
sentences = [re.sub("[^a-zA-Z]", " ", x).strip() for x in sentences]
        


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Creating the bag-of-words
vectorizer = CountVectorizer()

# fit_transform() does two functions: First, it fits the model and learns the vocabulary; 
# second, it transforms our training data into feature vectors. 
# The input to fit_transform should be a list of strings.
data_features = vectorizer.fit_transform(sentences)
l = vectorizer.get_feature_names()

# Look at the collection of all words in `sentences`
print("Unique words ({uniq}):".format(uniq=len(l)))
print(l)

In [None]:
# Numpy arrays are easy to work with, so convert the result to an array
#data_features = data_features.toarray()

#print("Data features size:", data_features.shape)
#print("Vector representation of sentences:\n")
#for num, vec in enumerate(data_features):
#    print(sentences[num], '\n', vec)

## 3. Data processing

To train Word2Vec, it makes sense to remove punctuation. It might be better to keep numbers, but we will remove them in our class to simplify the training process. We also need to decide how to deal with frequently occurring words that don't carry much meaning. Such words are called "stop words"; in English they include words such as "a", "and", "is", and "the". Conveniently, there are Python packages that come with stop word lists built in. Let's import a stop word list from the Python NLTK. You need to install the data packages that come with it, as follows:

In [None]:
# Download text data sets, including stop words
# You only need to download them once; uncomment the next line for the first run
# nltk.download() 

Now we can use `nltk` to get a list of stop words:

In [None]:
# Import the stop word list
from nltk.corpus import stopwords 

print(stopwords.words("english"))

Let's look at what the reviews look like after all the transformations mentioned above:

In [None]:
# Extract HTML text wrapped in <p> tags
text = ''   # Here we will collect all reviews in the current document
for p in soup.findAll('p'):
    text += ' ' + p.text
# Look at how many words are in all reviews of this HTML file
print("Total number of words:", len(text.split()))
# Remove non-letters  
text = re.sub("[^a-zA-Z]", " ", text)  
# Convert words to lowercase and split them  
words = text.lower().split()  
# Remove stop words from "words" (build the set once for fast lookups)
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]

print("Without stopwords:", len(words))
print(words)

`gensim` only requires that the input must provide sentences sequentially, when iterated over. No need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…

For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file or line by line:

In [None]:
import os
import time
import re
from bs4 import BeautifulSoup


# Import the stop word list
from nltk.corpus import stopwords 

#print(stopwords.words("english"))

class DataTransformer(object):  
    
    def __init__(self, dirname):  
        self.dirname = dirname 
        # Build the stop word set once; looking words up in a set is much
        # faster than calling stopwords.words() for every word
        self.stops = set(stopwords.words("english"))
        
    def __iter__(self):  
        for fname in os.listdir(self.dirname):
            # 1. Read the HTML file
            with open(os.path.join(self.dirname, fname), encoding='latin-1') as f:
                html = f.read()
            # 2. Create a new BeautifulSoup instance
            soup = BeautifulSoup(' '.join(html.split('</PRE>')[1:-1]), "lxml")
            # 3. Extract HTML text wrapped in <p> tags
            text = ''   # Here we will collect all reviews in the current document
            for p in soup.findAll('p'):
                text += ' ' + p.text
            # 4. Remove non-letters  
            text = re.sub("[^a-zA-Z]", " ", text)  
            # 5. Convert words to lowercase and split them  
            words = text.lower().split()  
            # 6. Remove stop words from "words"
            words = [w for w in words if w not in self.stops]
            yield words
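For the simpler layout mentioned earlier (several plain-text files with one sentence per line), a streaming reader could look like the sketch below. The class name `LineSentences` and the demo file are hypothetical. The important detail is that the object is a restartable iterable rather than a one-shot generator, because the corpus is read more than once during training (once to build the vocabulary, then again for the training passes).

```python
import os
import tempfile

class LineSentences(object):
    """Hypothetical helper: yields one tokenized sentence per line
    of every file in a directory, without loading files into memory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname), encoding="utf-8") as f:
                for line in f:
                    words = line.lower().split()
                    if words:  # skip empty lines
                        yield words

# Small demo with a temporary directory
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "a.txt"), "w", encoding="utf-8") as f:
    f.write("The cat sat on the hat\nThe dog ate the cat\n")

corpus = LineSentences(demo_dir)
first_pass = list(corpus)
second_pass = list(corpus)  # iterating again restarts from the first file
print(first_pass)
```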

## 4. Model building and saving

With the list of nicely parsed words, we're ready to train the model. `gensim`'s Word2Vec API takes several parameters at initialization. They all have default values, but you will usually want to set some of them yourself (the list below is not exhaustive):

* `size` – the number of dimensions of the word vectors. If you have a rough idea of how many 'topics' the corpus covers, you can use that number as a starting point. For sizeable corpora, people use 100-200. 

* `min_count` – terms that occur fewer than `min_count` times in total are ignored in the calculations. This reduces noise in the semantic space. 

* `window` – only terms that occur within a window-neighbourhood of a term, in a sentence, are associated with it during training. A value of 4 or 5 is typical (gensim's default is 5). Unless your text contains very long sentences, leave it at that.

* `sg` – defines the training algorithm. If equal to 1, the skip-gram technique is used; otherwise CBOW (continuous bag-of-words) is used, which is the default. 

* `sample` – threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5).

* `workers` – defines how many worker threads are used to train the model (= faster training on multicore machines).
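The effect of `min_count` and `window` can be illustrated without gensim. The sketch below is a simplification under stated assumptions: the function names `filter_rare` and `context_pairs` are hypothetical, and it ignores details of the real implementation such as random window shrinking and frequent-word subsampling.

```python
from collections import Counter

def filter_rare(sentences, min_count):
    """Drop every word whose total corpus frequency is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count]
            for sentence in sentences]

def context_pairs(sentence, window):
    """List (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]
kept = filter_rare(sentences, min_count=2)
pairs = context_pairs(["the", "cat", "sat"], window=1)
print(kept)
print(pairs)
```

A larger `window` produces more pairs per word and tends to capture broader, more topical associations, at the cost of longer training.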
       
Choosing parameters is not easy, but once we have chosen our parameters, creating a Word2Vec model is straightforward.
Next we want to initialize and train our model. Note that this will take some time (even a few hours depending on your computer's performance).

In [1]:
from gensim.models import word2vec



In [None]:
import gensim

gensim.__version__

In [None]:

# Set values for various parameters  
num_features = 100     # Word vector dimensionality                        
min_word_count = 20    # Minimum word count                          
num_workers = 64       # Number of threads to run in parallel  

# Iterate all data
sentences = DataTransformer('polarity_html/movie') 

# Let's measure the elapsed time
start = time.time()

print("Training model...")
model = word2vec.Word2Vec(sentences, 
                          workers = num_workers, 
                          size = num_features,
                          min_count = min_word_count
                         ) 

print("Elapsed time: {time}".format(time = time.time() - start))
model

You can **store/load** models using the standard `gensim` methods:

In [None]:
from gensim.models import Word2Vec

model.save('IMDb_reviews.w2v_model')
print("Model was saved")


In [None]:
from gensim.models import word2vec

model = word2vec.Word2Vec.load('IMDb_reviews.w2v_model')
print("Model is loaded")
model

Saving and loading use `pickle` internally, optionally mmap'ing the model's large internal NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.

If you don't plan to train the model any further, calling `init_sims()` will make the model much more memory-efficient.

    model.init_sims(replace=True)

In addition, you can load models created by the original C tool, both using its text and binary formats:

    model = Word2Vec.load_word2vec_format('IMDb_reviews', binary=False)
    # using gzipped/bz2 input works too, no need to unzip:
    model = Word2Vec.load_word2vec_format('IMDb_reviews.bin.gz', binary=True)

## 5. Exploring the model results

Now that you have the model initialized, you can access all the terms in its vocabulary, using something like `list(model.vocab.keys())`. 

To get the vectorial representation of a particular term, use `model[term]`. 

In [None]:
vocab = list(model.vocab.keys())
print("Number of words in the Word2Vec vocabulary:", len(vocab))
print("\nThe first 10 words:\n", vocab[:10])
print('learn' in model.vocab)
print("\nVector representation of 'learn':\n", model['learn'])
print("\nThe size of 'learn':", model['learn'].size)

Word2Vec supports several word similarity tasks out of the box. In particular, you can find the word that corresponds to an arithmetic combination of the meanings of other words:

In [None]:
print(model.most_similar(positive=['jolie'], negative=['pitt'], topn=10))

Depending on how well the model was trained, the result may not be what you expect, as happened above. At the beginning of this lesson we gave the classic Word2Vec example `"king - man + woman = queen"`; if you are lucky, you may get "queen" right away.

Let's check whether "queen" is in the Word2Vec vocabulary:

In [None]:
print('queen' in model.vocab)

Yes, it is. The example above showed only the closest matches, so let's display the top 50 matches for the classic analogy:

In [None]:
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=50))

We can **retrain** the model with new data to get better results. Let's take a few fairy tales containing many occurrences of the words "king" and "queen" and read them as we did above. You can find these documents in the folder "fairy_tails".

In [None]:
class FairyTails(object):  
    
    def __init__(self, dirname):  
        self.dirname = dirname 
        # Build the stop word set once for fast lookups
        self.stops = set(stopwords.words("english"))
        
    def __iter__(self):  
        for fname in os.listdir(self.dirname):
            # Read the TXT file line by line
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    # Remove non-letters, convert words to lowercase and split them
                    words = re.sub("[^a-zA-Z]", " ", line).lower().split()  
                    words = [w for w in words if w not in self.stops]
                    yield words

fairy_tails = FairyTails('fairy_tails')

# Let's measure the elapsed time
start = time.time()
print("Retraining model...")
model.train(fairy_tails)
print("Elapsed time:", time.time() - start)

Let's check whether anything has changed:

In [None]:
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10))

Yes, now my retrained model places "queen" in the top 5 results.

As we mentioned above, to get really good results we need to train Word2Vec on a large dataset (tens of billions of words, whereas the current dataset contains a few million words). Another option is to shuffle the documents and train the algorithm many times on the same dataset, i.e. we can read our 30,000 HTML documents in a different order tens or hundreds of times and retrain the model on each ordering. It is a good practice, but it is a time-consuming process.
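The shuffle-and-retrain idea can be sketched as follows. The generator name `shuffled_epochs` and the tiny document list are made up for this example, and the sketch assumes the documents fit in memory as a list; for the full corpus it would be more economical to shuffle the file names instead.

```python
import random

def shuffled_epochs(documents, epochs, seed=0):
    """Yield the same documents `epochs` times, in a new random order each time."""
    rng = random.Random(seed)
    docs = list(documents)
    for _ in range(epochs):
        rng.shuffle(docs)
        yield list(docs)

documents = [["good", "movie"], ["bad", "plot"], ["great", "cast"]]
seen = []
for epoch_docs in shuffled_epochs(documents, epochs=3):
    # In the notebook, this is where the model would be retrained
    # on epoch_docs (for example with model.train)
    seen.append(epoch_docs)
    print(epoch_docs)
```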

A few examples that work properly in my model:

In [None]:
print(model.most_similar(positive=['paris', 'england'], negative=['france'], topn=1))

print(model.most_similar(positive=['woman', 'boy'], negative=['man'], topn=1))

A few other Word2Vec methods that are useful for text analysis: 

In [None]:
# To get a list of most similar words
print(model.most_similar('good'))

print(model.most_similar("earth"))

In [None]:
# Find the odd word out in the sequence
print(model.doesnt_match(["breakfast", "cereal", "dinner", "lunch"]))

print(model.doesnt_match("good fine ugly wonderful".split()))

# And a more difficult variant
print(model.doesnt_match("theory study education science".split()))
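As far as I understand it, `doesnt_match` works roughly like this: it normalizes the vectors of the given words, averages them, and returns the word whose vector is least similar to that average. The sketch below reproduces the idea with invented two-dimensional vectors; the function names `unit` and `odd_one_out` are hypothetical, and this is an approximation of the approach rather than gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word least similar to the mean of all the word vectors."""
    units = [unit(vectors[w]) for w in words]
    dims = len(units[0])
    mean = unit([sum(u[d] for u in units) / len(units) for d in range(dims)])
    # Dot product of unit vectors equals their cosine similarity
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]

# Toy vectors made up for the example
toy = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.1, 0.9],
}
result = odd_one_out(["breakfast", "cereal", "dinner", "lunch"], toy)
print(result)
```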

In [None]:
# Get the measure of similarity of two words
print(model.similarity('woman', 'man'))

print(model.similarity('beautiful', 'ugly'))

print(model.similarity('leave', 'leaf'))
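The value returned by `similarity` is the cosine similarity of the two word vectors: 1 for vectors pointing in the same direction, 0 for orthogonal vectors, and -1 for opposite ones. A minimal sketch with made-up vectors (the function name `cosine_similarity` is my own):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because only the angle matters, a vector and a scaled copy of it count as identical, which is why the length of a word vector does not affect the score.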

Since each word is a vector in 100-dimensional space, we can use vector operations to combine the words in each review. One method we tried was to simply average the word vectors in a given review or in some sentence (for this purpose, we removed stop words, which would just add noise).

In [None]:
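import numpy as np

def transform_word_to_matrix(model, sentence):
    # Stack the vectors of all in-vocabulary words, one row per word
    l = []
    for word in sentence.lower().split():
        try:
            l.append(model[word])
            print(word)
        except KeyError:
            # Skip words that are missing from the vocabulary
            pass
    return np.array(l)

def get_agg_vector(model, sentence):
    word_array = transform_word_to_matrix(model, sentence)
    # Average over the words (rows) to get a single vector for the sentence
    return word_array.mean(axis=0)

print(get_agg_vector(model, "The dog ate the cat and the hat"))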

This representation of a sentence (the average of its word vectors) can be used as input to a machine learning algorithm. 

Finally, let's give a simple example of how to visualize the results by building a two-level tree of the most similar words: 

In [None]:
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

# Set plot size
plt.rcParams['figure.figsize'] = (17, 12)

# Recursive function allowing to draw a tree using NetworkX
def hierarchy_pos(G, root, width=1., vert_gap = 0.2, vert_loc = 0, xcenter = 0.5, 
                  pos = None, parent = None):
    '''If there is a cycle that is reachable from root, then this will see infinite recursion.
       G: the graph
       root: the root node of current branch
       width: horizontal space allocated for this branch - avoids overlap with other branches
       vert_gap: gap between levels of hierarchy
       vert_loc: vertical location of root
       xcenter: horizontal location of root
       pos: a dict saying where all nodes go if they have been assigned
       parent: parent of this branch.'''
    if pos is None:
        pos = {root:(xcenter,vert_loc)}
    else:
        pos[root] = (xcenter, vert_loc)
    # In NetworkX 2+, G.neighbors() returns an iterator, so materialize it as a list
    neighbors = list(G.neighbors(root))
    if parent is not None:
        neighbors.remove(parent)
    if len(neighbors)!=0:
        dx = width/len(neighbors) 
        nextx = xcenter - width/2 - dx/2
        for neighbor in neighbors:
            nextx += dx
            pos = hierarchy_pos(G,neighbor, width = dx, vert_gap = vert_gap, 
                                vert_loc = vert_loc-vert_gap, xcenter=nextx, pos=pos, 
                                parent = root)
    return pos

# Create a new graph instance
G=nx.Graph()

# Define the first 5 most similar words
main_word = 'earth'
top5 = model.most_similar(main_word, topn=5)
# Use a list (not map) so the words can be iterated over more than once
top5_words = [word for word, score in top5]
print(top5_words)

# To avoid repetitions, keep track of the words already added
unique = list(top5_words)
for word in top5_words:
    G.add_edge(main_word, word)
    for subword in model.most_similar(word, topn=3):
        if subword[0] not in unique and subword[0] != main_word:
            G.add_edge(word, subword[0])
            unique.append(subword[0])

        
pos = hierarchy_pos(G, main_word)    
nx.draw(G, pos=pos, with_labels=True, node_size=5000)

plt.show()