### This notebook investigates the similarity between texts.

In [16]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy

**Load in the text data.**

In [17]:
# Load the small English model (en_core_web_sm); its document vectors have 96 dimensions
import en_core_web_sm
nlp = en_core_web_sm.load()

review_data = pd.read_csv('./yelp_ratings.csv')
review_data.head()

Unnamed: 0,text,stars,sentiment
0,Total bill for this horrible service? Over $8G...,1.0,0
1,I *adore* Travis at the Hard Rock's new Kelly ...,5.0,1
2,I have to say that this office really has it t...,5.0,1
3,Went in for a lunch. Steak sandwich was delici...,5.0,1
4,Today was my second out of three sessions I ha...,1.0,0


In [18]:
reviews = review_data
# We just want the vectors so we can turn off other models in the pipeline
with nlp.disable_pipes():
    vectors = np.array([nlp(review.text).vector for idx, review in reviews.iterrows()])
    
vectors.shape

(44530, 96)

**Train an SVM model using the document vectors (each review is represented by a vector).**

In [19]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, reviews.sentiment, test_size=0.1, random_state=1)

# Create the LinearSVC model
model = LinearSVC(random_state=1, dual=False)
# Fit the model
model.fit(X_train, y_train)

# Report accuracy on the held-out test set
print(f'Model test accuracy: {model.score(X_test, y_test)*100:.3f}%')

Model test accuracy: 82.506%


**Calculate the similarity between a random piece of text and reviews from the dataset.**
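Center the vectors first. The reviews are all fairly similar to one another, so their raw vectors share a large common component; subtracting the mean vector removes it and makes the differences between texts within the dataset easier to compare.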
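To illustrate why centering matters, here is a minimal sketch on made-up 2-D data (the `toy_*` names and values are assumptions for illustration, not part of the Yelp dataset). Every toy vector is a large shared offset plus a small individual part, which mimics reviews that all look alike.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / np.sqrt(a.dot(a) * b.dot(b))

# Toy data (made up): a large shared offset plus a small individual part
offset = np.array([10.0, 10.0])
toy_docs = offset + np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
toy_query = offset + np.array([0.9, 0.1])

# Without centering, the shared offset dominates and every similarity is close to 1
raw_sims = np.array([cosine_similarity(toy_query, d) for d in toy_docs])

# With centering, only the individual parts are compared
toy_mean = toy_docs.mean(axis=0)
centered_sims = np.array(
    [cosine_similarity(toy_query - toy_mean, d - toy_mean) for d in toy_docs]
)

print(raw_sims.round(3))
print(centered_sims.round(3))
```

In this toy case the raw similarities all exceed 0.99, while the centered ones spread over a much wider range, so the closest document stands out clearly.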

In [20]:
review = """I absolutely love this place. The 360 degree glass windows with the 
Yerba buena garden view, tea pots all around and the smell of fresh tea everywhere 
transports you to what feels like a different zen zone within the city. I know 
the price is slightly more compared to the normal American size, however the food 
is very wholesome, the tea selection is incredible and I know service can be hit 
or miss often but it was on point during our most recent visit. Definitely recommend!

I would especially recommend the butternut squash gyoza."""

def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

review_vec = nlp(review).vector

## Center the document vectors
# Calculate the mean of the document vectors; it should have shape (96,)
vec_mean = vectors.mean(axis=0)
# Subtract the mean from the vectors
centered = vectors - vec_mean

# Calculate similarities for each document in the dataset
# Make sure to subtract the mean from the review vector
sims = np.array([cosine_similarity(review_vec-vec_mean, document) for document in centered])

# Get the index for the most similar document
most_similar = sims.argmax()
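The per-document loop above can also be written as a single matrix operation, which tends to be faster on tens of thousands of rows. The following is a sketch under assumptions: `centered_cosine_similarities` is a hypothetical helper, and the demo uses random stand-in data rather than the Yelp vectors.

```python
import numpy as np

def centered_cosine_similarities(query_vec, doc_vectors):
    """Cosine similarity between a query and each row of doc_vectors, after mean-centering."""
    mean = doc_vectors.mean(axis=0)
    centered_docs = doc_vectors - mean
    centered_query = query_vec - mean
    dots = centered_docs @ centered_query
    norms = np.linalg.norm(centered_docs, axis=1) * np.linalg.norm(centered_query)
    # Rows with zero norm get -inf so that argmax never selects them
    return np.divide(dots, norms, out=np.full(dots.shape, -np.inf), where=norms > 0)

# Stand-in data (random, not the Yelp vectors) with the same width as the document vectors
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(50, 96))
demo_query = demo_vectors[7] + 0.01 * rng.normal(size=96)

demo_sims = centered_cosine_similarities(demo_query, demo_vectors)
print(demo_sims.argmax())
```

With the notebook's own arrays, `centered_cosine_similarities(review_vec, vectors).argmax()` should select the same document as `most_similar`, barring floating-point ties.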

**Print the most similar review to the random text above.**

In [21]:
print(review_data.iloc[most_similar].text)

I'm not one to write a bad review, I usually just don't go back. But I've lived in vegas for 5 years, being from New Jersey I've eaten the best in Jersey and All the 5 boroughs in New York. And believe me between Vegas and Henderson there are some really excellent italian restaurants....I'm 100% italian and in the food business. So I know what food should taste, look, smell like....First off I've passed this joint 100 times...always said one day I'll give it a shot...so the other night I did....my foodie partner and I walk in...first I think wow this joint is gonna be great...just by the ambiance of kinda vintage ol' school Vegas...so I give it and 8 for appearance....not clean or organized at all but that doesn't mean it's gonna be a bad joint...now we sit down and we are given Red wine on the house with little tasting glasses which I loved...thought man this joint is going on my list...ok skip ahead to the menu on the wall...yeah u read that right....there's a huge ass menu on the wa