# word2vec + SVM + Evaluation

# Part 1: Term embeddings + SVM

### Dataset


For this homework, we again use Yelp reviews from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge). Each line of the data files corresponds to a review of a particular business. Each review has a unique "ID", and the text content is in the "review" field. This time we also provide a "label": `label=1` means the review is `Food-relevant`, and `label=0` means it is `Food-irrelevant`. As before, we have already done some basic preprocessing on the reviews, so you can tokenize each review by splitting on whitespace.

There are about 40,000 reviews in total, of which about 20,000 are "Food-irrelevant". We split the review data into two sets: *review_train.json* is the training set and *review_test.json* is the testing set.
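
As a minimal sketch of the layout described above, the snippet below parses a few hypothetical JSON lines, tokenizes each review on whitespace, and counts the labels. The IDs and review texts are invented for illustration, and the lowercase `id` key is an assumption taken from the loading code in this notebook.

```python
import json
from collections import Counter

# Hypothetical sample lines in the JSON-lines layout described above;
# the IDs and review texts are invented for illustration.
sample_lines = [
    '{"id": "r1", "review": "great tacos and friendly staff", "label": 1}',
    '{"id": "r2", "review": "the mechanic fixed my car quickly", "label": 0}',
    '{"id": "r3", "review": "best ramen in town", "label": 1}',
]

records = [json.loads(line) for line in sample_lines]
tokens = [r["review"].split() for r in records]      # whitespace tokenization
label_counts = Counter(r["label"] for r in records)  # class balance

print(tokens[0])
print(label_counts)
```

Running the same count on the real training set is a quick way to confirm that the two classes are roughly balanced.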

In [50]:
import json
import numpy as np
from sklearn import svm
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
import nltk
import plotly.plotly as py
import plotly.graph_objs as go
import urllib.request
arr = np.array

In [19]:
# Load the dataset
with open("data/review_train.json") as f:
    train = [json.loads(l) for l in f.readlines()]
    train_reviews = arr([t['review'] for t in train])
    train_labels = arr([t['label'] for t in train])
with open("data/review_test.json") as f:
    test = [json.loads(l) for l in f.readlines()]
    test_reviews = arr([t['review'] for t in test])
    test_labels = arr([t['label'] for t in test])
    id_to_label = {t['id']: t['label'] for t in test}

###  Pre-trained term embeddings

To save time, you can use pre-trained term embeddings. In this homework, we use one of the pre-trained models from [GloVe](https://nlp.stanford.edu/projects/glove/). The *glove.6B* models were trained on about 6 billion tokens from Wikipedia 2014 and Gigaword 5 (the tweet-based GloVe models are distributed separately as *glove.twitter.27B*). GloVe is quite similar to word2vec. Unzip the *glove.6B.50d.txt.zip* file and run the code below to load the term embeddings model, in which each word is represented by a 50-dimensional vector.

In [29]:
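# Load the pre-trained term embeddings
link = "https://s3.eu-west-2.amazonaws.com/josiah-public-assets/glove.6B.50d.txt"
# urlopen returns a binary stream, so each line is bytes:
# the first field is the word, the remaining 50 fields are its vector.
with urllib.request.urlopen(link) as lines:
    w2vmodel = {}
    for line in lines:
        word, *values = line.split()
        w2vmodel[word.decode()] = arr(values).astype(float)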
    
model_words, model_vectors = zip(*w2vmodel.items())

Now you have a vector representation for each word. First, we represent each review by the simple (arithmetic) **mean** of the vectors of its words. *Note: ignore words that are not in the vocabulary of the pre-trained model.*

In [24]:
# Tokenize the review and remove stop words
stops = arr(stopwords.words('english'))
def clean(string):
    tokens = arr(nltk.word_tokenize(string.lower()))
    return np.extract(np.isin(tokens, stops, invert=True), tokens)

In [25]:
# Get the vector representation for each review in the training data and testing data.
def get_vector_mean(word_vecs):
    return np.mean(word_vecs, axis=0)

def get_review_vector(rev):
    return get_vector_mean(arr([w2vmodel[word] for word in clean(rev) if word in w2vmodel]))

In [26]:
train_vectors = arr([get_review_vector(r) for r in train_reviews])
test_vectors = arr([get_review_vector(r) for r in test_reviews])
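
One edge case worth keeping in mind: if none of a review's tokens are in the embedding vocabulary, the mean of an empty list is `nan` rather than a 50-dimensional vector. The sketch below illustrates one possible guard on a toy model. The names `toy_model`, `DIM`, and `mean_review_vector` are hypothetical, and falling back to a zero vector is an assumption, not something the homework prescribes.

```python
import numpy as np

# Toy 3-dimensional embeddings; the values are made up for illustration.
toy_model = {
    "good": np.array([1.0, 0.0, 2.0]),
    "pizza": np.array([3.0, 2.0, 0.0]),
}
DIM = 3

def mean_review_vector(tokens, model, dim):
    """Average the vectors of in-vocabulary tokens; zero vector if none match."""
    vecs = [model[t] for t in tokens if t in model]
    if not vecs:
        return np.zeros(dim)  # assumed fallback for fully out-of-vocabulary reviews
    return np.mean(vecs, axis=0)

v = mean_review_vector(["good", "pizza", "zzz"], toy_model, DIM)  # "zzz" is skipped
empty = mean_review_vector(["zzz"], toy_model, DIM)
print(v, empty)
```

Dropping such reviews instead of zero-filling them is an equally reasonable choice, as long as the labels are filtered in the same way.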

## Diving into the coolness of word2vec
- synonyms
- combining words
- analogies

In [30]:
def closest_points(point, points):
    points = np.asarray(points)
    dist_2 = np.sum((points - point)**2, axis=1)
    return sorted([(i, dist) for i, dist in enumerate(dist_2)], key = lambda x: x[1])
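
`closest_points` ranks words by squared Euclidean distance. Cosine similarity is a commonly used alternative for word vectors because it ignores vector length. The sketch below is a swapped-in variant for comparison, not the method used in this notebook, and `closest_by_cosine` is a hypothetical helper name.

```python
import numpy as np

def closest_by_cosine(point, points):
    """Return (index, similarity) pairs sorted from most to least similar."""
    points = np.asarray(points, dtype=float)
    point = np.asarray(point, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    sims = points @ point / (np.linalg.norm(points, axis=1) * np.linalg.norm(point))
    order = np.argsort(-sims)  # descending similarity
    return [(int(i), float(sims[i])) for i in order]

toy_points = [[1.0, 0.0], [10.0, 1.0], [0.0, 1.0]]
ranked = closest_by_cosine([2.0, 0.0], toy_points)
print(ranked)
```

On these toy points the cosine ranking is 0, 1, 2, whereas squared Euclidean distance would rank them 0, 2, 1, because the long vector `[10, 1]` points in nearly the same direction as the query but lies far from it.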

### Synonyms

In [37]:
# Try also experimenting with government, thursday, business, soldiers
distances = closest_points(w2vmodel['foods'], model_vectors)
nearby_words = [model_words[idx] for idx, d in distances]
nearby_words[1:10]

['beverages',
 'seafood',
 'beverage',
 'diet',
 'drinks',
 'dairy',
 'meats',
 'snacks',
 'products']

In [38]:
distances = closest_points(w2vmodel['korea'], model_vectors)
nearby_words = [model_words[idx] for idx, d in distances]
nearby_words[1:10]

['korean',
 'pyongyang',
 'dprk',
 'seoul',
 'japan',
 'china',
 'iran',
 'beijing',
 'koreans']

### Adding words

In [39]:
# Add vectors to get more complex queries!
distances = closest_points(w2vmodel['korea'] + w2vmodel['foods'], model_vectors)
nearby_words = [model_words[idx] for idx, d in distances]
nearby_words[:10]

['korea',
 'products',
 'beef',
 'china',
 'foods',
 'export',
 'states',
 'imports',
 'rice',
 'taiwan']

### Analogies

In [40]:
dist_vector = w2vmodel['king'] - w2vmodel['queen']
distances = closest_points(dist_vector + w2vmodel['she'], model_vectors)
nearby_words = [model_words[idx] for idx, d in distances]
nearby_words[0]

'he'

In [75]:
dist_vector = w2vmodel['spain'] - w2vmodel['madrid']
distances = closest_points(dist_vector + w2vmodel['moscow'], model_vectors)
nearby_words = [model_words[idx] for idx, d in distances]
nearby_words[0]

'russia'

In [78]:
dist_vector = w2vmodel['walking'] - w2vmodel['walked']
distances = closest_points(dist_vector + w2vmodel['talked'], model_vectors)
nearby_words = [model_words[idx] for idx, d in distances]
nearby_words[0]

'talking'
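
With analogy queries, the nearest vector to the result can be one of the query words themselves, so it is common to exclude them from the candidates. The sketch below shows that idea on made-up 2-D vectors. The `toy` dictionary and the `analogy` helper are hypothetical and only illustrate the arithmetic used in the cells above.

```python
import numpy as np

# Made-up 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "he":    np.array([0.0,  1.0]),
    "she":   np.array([0.0, -1.0]),
    "apple": np.array([-3.0, 0.0]),
}

def analogy(a, b, c, model):
    """Find the word closest to a - b + c, skipping the three query words."""
    target = model[a] - model[b] + model[c]
    candidates = [w for w in model if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.sum((model[w] - target) ** 2))

result = analogy("king", "queen", "she", toy)
print(result)
```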

## Visualizing word2vec

Our vectors live in a 50-dimensional vector space. First we use PCA to bring them down to 3 dimensions.

### PCA

In [55]:
# Let's pick two words that are moderately far apart...
# "torpedo" and "food" about 11,000 words "away" (out of 40,000)
group1_query = 'torpedo'
group2_query = 'food'

distances = closest_points(w2vmodel[group1_query], model_vectors)
group1_words = [model_words[idx] for idx, d in distances][:100]
group1_vectors = [w2vmodel[w] for w in group1_words]

distances = closest_points(w2vmodel[group2_query], model_vectors)
group2_words = [model_words[idx] for idx, d in distances][:100]
group2_vectors = [w2vmodel[w] for w in group2_words]

In [56]:
# find a mapping to a low dimension representation of data
pca = PCA(n_components=3)
pca.fit([*group1_vectors, *group2_vectors])
x1, y1, z1 = pca.transform(group1_vectors).T
x2, y2, z2 = pca.transform(group2_vectors).T
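
Projecting from 50 dimensions to 3 discards information, and `explained_variance_ratio_` gives a rough idea of how much variance survives. The sketch below uses synthetic random vectors as a stand-in for the word vectors, so the printed number says nothing about the GloVe data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for 200 word vectors in 50 dimensions.
fake_vectors = rng.normal(size=(200, 50))

pca_demo = PCA(n_components=3)
projected = pca_demo.fit_transform(fake_vectors)

# Fraction of total variance kept by the 3 retained components.
kept = pca_demo.explained_variance_ratio_.sum()
print(projected.shape, round(float(kept), 3))
```

Checking the same attribute on the `pca` object fitted above would show how faithful the 3-D plot is to the original 50-dimensional geometry.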

### Visualization of two word groups

In [57]:
trace1 = go.Scatter3d(
    name=f'Words like {group1_query.upper()}',
    x=x1,
    y=y1,
    z=z1,
    text=group1_words,
    mode='markers',
    hoverinfo='text',
    marker={'opacity': 0.8}
)

trace2 = go.Scatter3d(
    name=f'Words like {group2_query.upper()}',
    x=x2,
    y=y2,
    z=z2,
    text=group2_words,
    mode='markers',
    hoverinfo='text',
    marker={'opacity': 0.8}
)


data = [trace1, trace2]
layout = go.Layout(
    showlegend=True,
    margin=dict(l=0, r=0, b=0, t=0)
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='3d-scatter-colorscale')





### Visualize reviews
Let's compare documents that are food-related with those that are not.

In [109]:
cut = 1000 # just take the first 1000 reviews
short_reviews = [entry['review'][:100] + '...' for entry in train[:cut]]
pca = PCA(n_components=3)
x, y, z = pca.fit_transform(train_vectors[:cut]).T
review_vectors = [np.array(t) for t in zip(x, y, z)]

In [110]:
trace1 = go.Scatter3d(
    x=x, y=y, z=z,
    text=short_reviews,
    mode='markers',
    hoverinfo='text',
    marker=dict(
        size=4,
        opacity=0.5,
        color=train_labels[:cut]
    )
)

data = [trace1]
layout = go.Layout(
    showlegend=True,
    margin=dict(l=0, r=0, b=0, t=0)
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='3d-scatter-colorscale')





## SVM

### Train an SVM

In [113]:
clf = svm.SVC(kernel="linear")  # gamma only affects the rbf, poly and sigmoid kernels
predictions = clf.fit(train_vectors, train_labels).predict(test_vectors)
print(classification_report(test_labels, predictions))

              precision    recall  f1-score   support

           0       0.93      0.90      0.91      5975
           1       0.90      0.93      0.92      5945

   micro avg       0.91      0.91      0.91     11920
   macro avg       0.91      0.91      0.91     11920
weighted avg       0.91      0.91      0.91     11920
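
The columns of the report can be reproduced by hand. The sketch below uses a small set of hypothetical labels and predictions (not the homework data) and computes precision, recall, and F1 for class 1 from the counts of true positives, false positives, and false negatives.

```python
# Hypothetical ground truth and predictions, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tp, fp, fn)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The `support` column is simply the number of true examples of each class, which is why the two class supports in the report add up to the total of 11920.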



### Visualize the SVM decision plane