In [1]:
# using Jupyter notebooks
# pressing Shift+Enter will run the code in a cell
2 + 2

4

# Gentle Introduction to NLP through Document Embeddings

### Quick Review of Last Time
* Cosine Similarity

### Two Approaches to Embedding Documents
* Sparse, bag-of-words embeddings
  - Count embeddings
  - TFIDF embeddings
* Dense embeddings

![NLP](images/NLP.png)

## From Last Time

![distance](images/distance_measures.png)
http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/

![cos_sim](images/cos_sim.png)
http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/

### Calculating the Dot Product
$vector_a = [1,2,3]$ <br>
$vector_b = [4,5,6]$ <br>
$vector_a \cdot vector_b = (1*4) + (2*5) + (3*6) = 4 + 10 + 18 = 32$ 
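
The same arithmetic can be checked in `numpy`. This is a minimal sketch; the names `by_hand` and `built_in` are chosen here for illustration only.

```python
import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])

# multiply matching components, then sum the products
by_hand = sum(a * b for a, b in zip(vector_a, vector_b))

# numpy's built-in dot product gives the same result
built_in = vector_a.dot(vector_b)

print(by_hand, built_in)  # 32 32
```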

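### Normalizing a Vector
To normalize a vector, we divide every value by the vector's length (its Euclidean norm). The result points in the same direction but has length $1$. Each value then falls between $-1$ and $1$, or between $0$ and $1$ when no value is negative, as with word counts.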

![normalize](images/normalize.jpg)
http://www.wikihow.com/Normalize-a-Vector

In [2]:
import numpy as np
import utils
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [3]:
def normalize_vector(vector):
    """
    Normalizes a vector to unit length (values fall between 0 and 1 when the input is non-negative)
    :param vector: a `numpy` vector
    :return: a normalized `numpy` vector
    """
    # norm = np.sqrt(vector.dot(vector))
    # numpy has a built in function
    norm = np.linalg.norm(vector)
    if norm:
        return vector / norm
    else:
        # if norm == 0, then original vector was all 0s
        return vector

In [4]:
vector_3d = np.array([1,2,4])
print("original vector", vector_3d)
print("normalized vector", normalize_vector(vector_3d))
# 0.218 is 1/4 of 0.873, just as 1 is 1/4 of 4: normalizing keeps the ratios between values

original vector [1 2 4]
normalized vector [ 0.21821789  0.43643578  0.87287156]


In [5]:
def cos_sim(vector_one, vector_two):
    """
    Calculate the cosine similarity of two `numpy` vectors
    :param vector_one: a `numpy` vector
    :param vector_two: a `numpy` vector
    :return: A score between -1 and 1 (between 0 and 1 for non-negative vectors)
    """
    # normalize both vectors to unit length
    vector_one_norm = normalize_vector(vector_one)
    vector_two_norm = normalize_vector(vector_two)
    
    # calculate the dot product between the two normalized vectors
    return vector_one_norm.dot(vector_two_norm)

In [6]:
vector_one = np.array([1,1,1,1,1])
vector_two = np.array([1,1,1,1,2])
vector_three = np.array([1,2,3,4,5])
vector_four = np.array([10,20,30,40,50])

print("cosine similarity of vector_one and vector_two", cos_sim(vector_one, vector_two))
print("cosine similarity of vector_one and vector_three", cos_sim(vector_one, vector_three))
print("cosine similarity of vector_one and vector_four", cos_sim(vector_one, vector_four))

cosine similarity of vector_one and vector_two 0.948683298051
cosine similarity of vector_one and vector_three 0.904534033733
cosine similarity of vector_one and vector_four 0.904534033733


### Interpreting "Similarity"
![cos_sim_compare](images/cos_sim_compare.png)
https://medium.com/@camrongodbout/creating-a-search-engine-f2f429cab33c#.z7i9w8y5t
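
A cosine similarity can also be read as the angle between the two vectors: 1 means the same direction, 0 means a right angle, and -1 means opposite directions. The sketch below assumes a helper named `similarity_to_degrees`, which is hypothetical and not part of `utils`.

```python
import numpy as np

def similarity_to_degrees(similarity):
    """Convert a cosine similarity to the angle (in degrees) between the vectors."""
    # clip guards against floating-point values just outside [-1, 1]
    return np.rad2deg(np.arccos(np.clip(similarity, -1.0, 1.0)))

for similarity in [1.0, 0.5, 0.0, -1.0]:
    print(similarity, "->", similarity_to_degrees(similarity), "degrees")
```

Word-count vectors never contain negative values, so document similarities stay between 0 and 1, which corresponds to angles between 90 and 0 degrees.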

![vectorize](images/vectorize.png)

## Embedding a Document 
### Bag of Words
#### Count Vectorizing

![bag_of_words](images/bag_of_words_vis.png)

![bag_of_words_count](images/bag_of_words_count_matrix.png)
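
To make the count matrix concrete, here is a small sketch using `CountVectorizer` on two made-up sentences. The documents and the `toy_` names are assumptions for illustration, not part of the review data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two made-up documents, used only for illustration
toy_docs = ["the cat sat on the mat", "the dog sat"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index (columns are in alphabetical order)
print(sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get))
# each row is one document; each column counts one vocabulary word
print(toy_counts.toarray())
```

Each document becomes a vector with one slot per vocabulary word, and the word "the" gets a count of 2 in the first document.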

## Embedding a Document 
### Bag of Words
#### TFIDF Vectorizing
`TFIDF` = `term frequency, inverse document frequency`

![tfidf_rationale](images/tfidf_rationale.png)

![doc_freq_vis](images/document_frequency_vis.png)

![tfidf_matrix](images/tfidf_matrix.png)

![tfidf_matrix_decimal](images/tfidf_matrix_decimal.png)
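
The weighting can be reproduced by hand on a toy corpus. This sketch assumes `scikit-learn`'s default settings for `TfidfVectorizer` (smoothed inverse document frequency, followed by scaling each row to unit length). The slides above may use a slightly different variant of the formula, and the toy documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["the cat sat on the mat", "the dog sat", "the cat ran"]

counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

# document frequency: in how many documents does each word appear?
doc_freq = (counts > 0).sum(axis=0)

# smoothed inverse document frequency (scikit-learn's default formula)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# weight the counts, then scale each row to unit length
weighted = counts * idf
by_hand = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(by_hand, from_sklearn))
```

Words that appear in every document, such as "the", receive the smallest weight, while rarer words are boosted.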

![bop](images/bags_of_popcorn.png)

In [7]:
# load reviews
reviews_dict = utils.load_data("movie_reviews.tsv")
all_docs, lookup = utils.get_all_docs(reviews_dict)

In [8]:
# `all_docs` is a list of all documents
all_docs[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 2

In [9]:
# `lookup` is a lookup dict with {idx: text}
lookup[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 2

### Using `scikit-learn`

[CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) <br>
[TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [10]:
# call the vectorizer
cv = CountVectorizer()
tv = TfidfVectorizer()
# run fit_transform on the <list> of documents
X_cv = cv.fit_transform(all_docs)
X_tv = tv.fit_transform(all_docs)
X_cv

<999x18373 sparse matrix of type '<class 'numpy.int64'>'
	with 137082 stored elements in Compressed Sparse Row format>
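
The summary above reports 999 documents, 18,373 vocabulary words, and 137,082 stored (nonzero) entries. A quick sketch of how sparse that is, using the numbers copied from that output:

```python
# numbers copied from the sparse matrix summary above
n_rows, n_cols, n_stored = 999, 18373, 137082

# fraction of all cells that hold a nonzero count
density = n_stored / (n_rows * n_cols)
print("{:.2%} of entries are nonzero".format(density))
```

On the real matrix, the same value should be available as `X_cv.nnz / (X_cv.shape[0] * X_cv.shape[1])`.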

In [11]:
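# see the vocabulary (as a list, so that `.index()` works later)
# note: `get_feature_names()` was removed in scikit-learn 1.2; older versions use it instead
cv_vocab = list(cv.get_feature_names_out())
# see the nonzero features (e.g. words) for each row of data
cv_words_per_doc = cv.inverse_transform(X_cv)
tv_words_per_doc = tv.inverse_transform(X_tv)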
cv_words_per_doc[0][0:15]

array(['latter', 'hope', 'liars', 'sickest', 'stupid', 'extremely',
       'either', 'fact', 'doors', 'closed', 'behind', 'different', 'be',
       'can', 'don'], 
      dtype='<U44')

In [12]:
def get_one_row(X, idx):
    """
    Gets one row (representing a document) from sparse matrix, converts to dense, and reshapes
    :param X: the sparse matrix
    :param idx: the index of desired row
    :return: `numpy` dense vector
    """
    row_ = X[idx].toarray()
    size = row_.shape[1]
    return row_.reshape(size)

### How much does TFIDF affect things?

In [13]:
x_0_count = get_one_row(X_cv, 0)
x_0_tfidf = get_one_row(X_tv, 0)
sim = cos_sim(x_0_count, x_0_tfidf)
sim_in_degrees = np.rad2deg(np.arccos(sim))
print("cosine similarity = {:.3f},\nequal to a {:.1f} degree angle".format(sim, sim_in_degrees))

cosine similarity = 0.751,
equal to a 41.3 degree angle


![cos_sim_compare](images/cos_sim_compare.png)
https://medium.com/@camrongodbout/creating-a-search-engine-f2f429cab33c#.z7i9w8y5t

In [14]:
lookup[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 2

In [15]:
word = "and"
idx = cv_vocab.index(word)
c = x_0_count[idx]
t = x_0_tfidf[idx]
print("{}: count value = {} versus tfidf value = {:.4f}".format(word, c, t))

and: count value = 10 versus tfidf value = 0.0869


In [16]:
word = "pesci"
idx = cv_vocab.index(word)
c = x_0_count[idx]
t = x_0_tfidf[idx]
print("{}: count value = {} versus tfidf value = {:.4f}".format(word, c, t))

pesci: count value = 2 versus tfidf value = 0.1203


### Most/Least similar documents

In [17]:
cos_sim_matrix = utils.load_matrix_from_csv("tfidf_cos_matrix.csv")
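
The notebook only loads the finished matrix, so how `tfidf_cos_matrix.csv` was built is not shown. One plausible way to produce such a matrix, sketched here on made-up documents, relies on `TfidfVectorizer` returning unit-length rows by default. Setting the diagonal to `NaN` is an assumption, suggested by the use of `np.nanmax` and `np.nanmin` in the following cells.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up documents, used only for illustration
toy_docs = ["zombie film review", "another zombie film", "a detective story"]
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# rows are unit length, so row-by-row dot products are cosine similarities
toy_matrix = (toy_tfidf @ toy_tfidf.T).toarray()

# assumption: blank out the diagonal so a document never matches itself
np.fill_diagonal(toy_matrix, np.nan)

print(np.round(toy_matrix, 3))
```

Entry `[i, j]` holds the similarity between documents `i` and `j`, so the matrix is symmetric and every pair appears twice.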

In [18]:
# index of highest similarity
np.where(cos_sim_matrix == np.nanmax(cos_sim_matrix))

(array([125, 166]), array([166, 125]))

In [19]:
lookup[125]

'"Zombie Review #3**Spoilers**Few films are actually \\"so bad they\'re good\\", and Zombi 3 is not just bad, it\'s wretchedly, unforgivably bad in so many ways that a whole new language may be needed just to describe them allMore than that, it\'s a film credited to Lucio Fulci that even by his standards has absolutely no coherency, sense or reason. However we can\'t blame Fulci as it wasn\'t really directed by him but by Bruno Mattei, who doesn\'t even have Fulci\'s sense of style to help carry the film. Mattei seems to have brought little to the film but staggering ineptitude.So, I\'m ashamed to say how much I enjoyed every worthless minute of Zombi 3. It has no redeeming features - in a genre known for thin characters, weak story, and lack of film making skill, Zombi 3 pushes the boat out but in doing so it\'s even funnier than Nightmare City.The \\"action\\" starts when the \\"Death 1\\" gas is stolen from a military base, and damaged in the escape. Who is the thief, why did he ste

In [20]:
lookup[166]

'"Title: Zombie 3 (1988) Directors: Mostly Lucio Fulci, but also Claudio Fragasso and Bruno Mattei Cast: Ottaviano DellAcqua, Massimo Vani, Beatrice Ring, Deran Serafin Review: To review this flick and get some good background of it, I gotta start by the beginning. And the beginning of this is really George Romeros Dawn of the Dead. When Dawn came out in 79, Lucio Fulci decided to make an indirect sequel to it and call it Zombie 2. That film is the one we know as plain ole Zombie. You know the one in which the zombie fights with the shark! OK so, after that flick (named Zombie 2 in Italy) came out and made a huge chunk of cash, the Italians decided, heck. Lets make some more zombie flicks! These things are raking in the dough! So Zombie 3 was born. Confused yet? The story on this one is really just a rehash of stories we\'ve seen in a lot of American zombie flicks that we have seen before this one, the best comparison that comes to mind is Return of the Living Dead. Lets see...there\'s

In [21]:
# words present in both reviews
print(set(tv_words_per_doc[166]).intersection(set(tv_words_per_doc[125])))

{'if', 'did', 'and', 'human', 'is', 'you', 'actually', 'review', 'up', 'before', 'reason', 'way', 'even', 'some', 'don', 'every', 'will', 'off', 'about', 'people', 'gas', 'hotel', 'over', 'girls', 'gore', 'in', 'out', 'best', 'but', 'by', 'turn', 'on', 'film', 'so', 'steal', 'characters', 'make', 'seen', 'head', 'still', 'after', 'that', 'get', 'sequel', 'well', 'their', 'also', 'they', 'we', 'into', 'dead', 'story', 'infected', 'one', 'just', 'it', 'or', 'came', 'there', 'flesh', 'zombie', 'wasn', 'not', 'who', 'yet', 'the', 'my', 'down', 'poor', 'no', 'faces', 'bruno', 'lucio', 'for', 'making', 'of', 'why', 'now', 'lot', 'few', 'fights', 'here', 'same', 'be', 'them', 'ashamed', 'these', 'couldn', 'from', 'to', 'many', 'an', 'good', 'those', 'living', 'minutes', 'his', 'action', 'fact', 'he', 'flying', 'zombies', 'doesn', 'fulci', 'was', 'when', 'all', 'this', 'really', 'more', 'are', 'credited', 'can', 'end', 'then', 've', 'with', 'mattei', 'have', 'as'}


In [22]:
# index of lowest similarity
lowest_similarity = np.nanmin(cos_sim_matrix)
print("least similar documents have cosine similarity of {}".format(lowest_similarity))
np.where(cos_sim_matrix == lowest_similarity)

least similar documents have cosine similarity of 0.0


(array([ 10,  10, 242, 319, 401, 404, 456, 456, 512, 512, 512, 512, 512,
        512, 512, 621, 692, 713]),
 array([456, 512, 512, 456, 512, 512,  10, 319,  10, 242, 401, 404, 621,
        692, 713, 512, 512, 512]))

In [23]:
# words present in both reviews
print(set(tv_words_per_doc[10]).intersection(set(tv_words_per_doc[512])))

set()


### Problems with Bag-of-words

In [26]:
lookup[45]

'"I loved the episode but seems to me there should have been some quick reference to the secretary getting punished for effectively being an accomplice after the fact. While I like when a episode of Columbo has an unpredictable twist like this one, its resolution should be part of the conclusion of the episode, along with the uncovering of the murderer.The interplay between Peter Falk and Ruth Gordon is priceless. At one point, Gordon, playing a famous writer, makes some comment about being flattered by the famous Lt. Columbo, making a tongue-in-cheek allusion to the detective\'s real life fame as a crime-solver. This is one of the best of many great Columbo installments."'

In [27]:
lookup[433]

'"It is always satisfying when a detective wraps up a case and the criminal is brought to book. In this case the climax gives me even greater pleasure. To see the smug grin wiped off the face of Abigail Mitchell when she realises her victim has left \\"deathbed testimony\\" which leaves no doubt about her guilt is very satisfying.Please understand: while I admire Ruth Gordon\'s performance, her character really, *really* irritates me. She is selfish and demanding. She gets her own way by putting on a simpering \'little girl\' act which is embarrassing in a woman of her age. Worse, she has now set herself up as judge, jury and executioner against her dead niece\'s husband.When Columbo is getting too close she tries to unnerve him by manipulating him into making an off-the-cuff speech to an audience of high-class ladies. He turns the tables perfectly by delivering a very warm and humane speech about the realities of police work.Nothing can distract Columbo from the pursuit of justice. Ab

In [24]:
print(set(tv_words_per_doc[45]).intersection(set(tv_words_per_doc[433])))

{'to', 'an', 'one', 'in', 'and', 'while', 'is', 'by', 'getting', 'the', 'when', 'ruth', 'this', 'columbo', 'gordon', 'making', 'of', 'has', 'detective', 'me', 'about', 'as'}


In [28]:
cos_sim(get_one_row(X_tv, 45), get_one_row(X_tv, 433))

0.20648822212468901

In [25]:
X_cv

<999x18373 sparse matrix of type '<class 'numpy.int64'>'
	with 137082 stored elements in Compressed Sparse Row format>

![bag_of_words_problem](images/bag_of_words_problem.png)

#### Problems with Bag-of-words:
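 - two different sentences can get the same embedding, because word order is ignored
 - each document vector is sparse and as long as the entire *vocabulary*
 - documents that express the same concept with different words don't appear similar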
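
The first and last problems can be seen on a few made-up sentences. This is a minimal sketch; the documents are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "the dog bit the man",
    "the man bit the dog",       # same words, opposite meaning
    "a film about zombies",
    "a movie about the undead",  # similar meaning, different words
]
toy_counts = CountVectorizer().fit_transform(toy_docs).toarray()

# word order is discarded, so the first two rows are identical
print((toy_counts[0] == toy_counts[1]).all())

# the last two documents overlap in a single word ("about")
print((toy_counts[2] * toy_counts[3]).sum())
```

As far as a bag-of-words model can tell, "film" and "movie" are as unrelated as any other pair of distinct words.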
 
### So can we do better?

## Embedding a Document
### Neural Networks

![recurrent](images/recurrent.png)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

![cnn](images/cnn.png)
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

![dan](images/dan.png)
https://cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf

![bow](images/bag_of_words_performance.png)

## Resources
[Stanford IR book, online](http://nlp.stanford.edu/IR-book/html/htmledition/) <br>
[Bag of Words Meets Bags of Popcorn (Kaggle)](https://www.kaggle.com/c/word2vec-nlp-tutorial) <br>
[Neural Networks for NLP](https://arxiv.org/pdf/1510.00726.pdf) <br>
[Blog about LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) <br>
[Blog about CNNs](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) <br>
[Examples of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) 