# Domain Specific Sentiment Analysis

In the first model, we saw that the sentiment analyzer we got from Google didn't work well on our data. We'll now build a second model to see whether a sentiment classifier trained on the full text of the reviews can better predict the final ratings.

In [1]:
# Imports and notebook settings

# General
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Dealing with text data
from sklearn.feature_extraction.text import CountVectorizer

# General ML Stuff
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Plotting options
%matplotlib inline
plt.rcParams['figure.figsize'] = [8, 8]

First, we get the data:

In [2]:
with open('./data/cleaned_reviews.json', 'r') as infile:
    data = pd.read_json(infile, orient = 'records')
data.head()

| Unnamed: 0 | id | price | rating | rec_dishes | review_text | review_url |
|---|---|---|---|---|---|---|
| 0 | 2 | 4 | 1 | 8 | My dinner date had eaten in this room before.... | https://www.nytimes.com/2014/04/02/dining/rest... |
| 1 | 3 | 4 | 2 | 11 | You can put up with mediocre food in a restau... | https://www.nytimes.com/2013/11/20/dining/revi... |
| 2 | 4 | 3 | 3 | 7 | A few minutes into my first dinner at Bâtard ... | https://www.nytimes.com/2014/08/27/dining/rest... |
| 3 | 5 | 4 | 1 | 6 | After a long and grueling winter, nothing lif... | https://www.nytimes.com/2015/04/08/dining/rest... |
| 4 | 6 | 4 | Satisfactory | 4 | By the early 1980s, New Yorkers knew that som... | https://www.nytimes.com/2015/11/11/dining/jams... |


In [3]:
reviews = data['review_text']
reviews.head()

0     My dinner date had eaten in this room before....
1     You can put up with mediocre food in a restau...
2     A few minutes into my first dinner at Bâtard ...
3     After a long and grueling winter, nothing lif...
4     By the early 1980s, New Yorkers knew that som...
Name: review_text, dtype: object

We'll use a bag-of-words classifier to analyze this data. This amounts to listing every word that appears in the reviews and then creating one feature per word that counts how many times that word appears in each review. These frequency counts will then be the main features in our model. The `CountVectorizer` class in sklearn performs this process.
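
As a minimal sketch of what those counts look like, the cell below runs `CountVectorizer` with its default settings on two made-up sentences. The sentences and the variable names (`docs`, `vocab`, `dense`) are assumptions for illustration only and are not taken from the review data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not part of the review data
docs = ["the food was good", "the service was not good"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Each row of `dense` is one document and each column is one word from `vocab`, so a cell holds the number of times that word occurs in that document.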

In [4]:
counter1 = CountVectorizer(ngram_range = (1,3), token_pattern = "[a-z][a-z]+")
counter2 = CountVectorizer(ngram_range = (1,3))
vectorizers = [counter1, counter2]
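
The two vectorizers above differ only in their `token_pattern`. As a rough illustration of what that changes, the sketch below applies each pattern to one made-up sentence using `build_analyzer()`, which returns the function a vectorizer uses to turn text into tokens. The sentence and the variable names are assumptions, and `ngram_range` is left at its default here so that only single words are shown.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample sentence, chosen to include an accent and a number
text = "Dinner at Bâtard in the 1980s"

# Same pattern as counter1: runs of two or more unaccented lowercase letters
letters_only = CountVectorizer(token_pattern="[a-z][a-z]+").build_analyzer()
# Same pattern as counter2: sklearn's default (two or more word characters)
default = CountVectorizer().build_analyzer()

# Both analyzers lowercase the text before applying their pattern
letters_only_tokens = letters_only(text)
default_tokens = default(text)

print(letters_only_tokens)
print(default_tokens)
```

With this input, the letters-only pattern drops `1980s` entirely and breaks `Bâtard` at the accented character, while the default pattern keeps both as whole tokens.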

In [5]:
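class DumbGuess:
    """Baseline that ignores its input and always predicts a rating of '2'."""
    def __init__(self, **kwargs):
        pass
    def predict(self, X):
        # One constant prediction per row of X
        return ['2'] * len(X)
    def fit(self, X, y):
        # Nothing to learn; return self to mirror the sklearn interface
        return self

# These are classes, not instances; each one is instantiated with default settings when used
models = [DumbGuess, LogisticRegression, MultinomialNB, SVC, RandomForestClassifier]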
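
A constant-prediction baseline like this one could presumably also be written with sklearn's `DummyClassifier`. The sketch below is not the approach used in this notebook, just a possible equivalent; the toy labels and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels; the real ratings live in data.rating
y_toy = np.array(['2', '2', '1', '3', '2'])
# The features are ignored by the baseline, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# strategy="constant" always predicts the given label, which must occur in y
baseline = DummyClassifier(strategy="constant", constant='2').fit(X_toy, y_toy)
preds = baseline.predict(X_toy)

print(preds)
```

Any real model should beat this baseline on the test set before its accuracy is treated as meaningful.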

In [6]:
for (i, vectorizer) in enumerate(vectorizers):
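    # Count the n-grams in every review
    words = vectorizer.fit_transform(reviews.tolist())
    # Append the two numeric columns to the dense count matrix
    X = np.hstack((words.toarray(), data[['price', 'rec_dishes']]))
    # Fixed random_state so every model sees the same split
    X_train, X_test, y_train, y_test = train_test_split(X, data.rating, random_state = 0)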
    for model in models:
        trained = model().fit(X_train,y_train)
        training_accuracy = accuracy_score(y_train, trained.predict(X_train))
        testing_accuracy = accuracy_score(y_test, trained.predict(X_test))
        config = "For model %s with wordlist %i" % (model.__name__, i + 1)
        print(config)
        print("-"*len(config))
        print("Training Accuracy: %f" % training_accuracy)
        print("Testing Accuracy: %f" % testing_accuracy)
        print("")

For model DumbGuess with wordlist 1
-----------------------------------
Training Accuracy: 0.519126
Testing Accuracy: 0.467742

For model LogisticRegression with wordlist 1
--------------------------------------------
Training Accuracy: 1.000000
Testing Accuracy: 0.467742

For model MultinomialNB with wordlist 1
---------------------------------------
Training Accuracy: 0.972678
Testing Accuracy: 0.483871

For model SVC with wordlist 1
-----------------------------
Training Accuracy: 0.519126
Testing Accuracy: 0.467742

For model RandomForestClassifier with wordlist 1
------------------------------------------------
Training Accuracy: 0.978142
Testing Accuracy: 0.451613

For model DumbGuess with wordlist 2
-----------------------------------
Training Accuracy: 0.519126
Testing Accuracy: 0.467742

For model LogisticRegression with wordlist 2
--------------------------------------------
Training Accuracy: 1.000000
Testing Accuracy: 0.467742

For model MultinomialNB with wordlist 2
------

In [7]:
counter1_50 = CountVectorizer(ngram_range = (1,3), token_pattern = "[a-z][a-z]+", max_features = 50)
counter2_50 = CountVectorizer(ngram_range = (1,3), max_features = 50)
vectorizers2 = [counter1_50, counter2_50]

In [8]:
for (i, vectorizer) in enumerate(vectorizers2):
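    # Count only the 50 most frequent n-grams in the reviews
    words = vectorizer.fit_transform(reviews.tolist())
    # Append the two numeric columns to the dense count matrix
    X = np.hstack((words.toarray(), data[['price', 'rec_dishes']]))
    # Fixed random_state so every model sees the same split
    X_train, X_test, y_train, y_test = train_test_split(X, data.rating, random_state = 0)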
    for model in models:
        trained = model().fit(X_train,y_train)
        training_accuracy = accuracy_score(y_train, trained.predict(X_train))
        testing_accuracy = accuracy_score(y_test, trained.predict(X_test))
        config = "For model %s with wordlist %i" % (model.__name__, i + 1)
        print(config)
        print("-"*len(config))
        print("Training Accuracy: %f" % training_accuracy)
        print("Testing Accuracy: %f" % testing_accuracy)
        print("")

For model DumbGuess with wordlist 1
-----------------------------------
Training Accuracy: 0.519126
Testing Accuracy: 0.467742

For model LogisticRegression with wordlist 1
--------------------------------------------
Training Accuracy: 0.830601
Testing Accuracy: 0.419355

For model MultinomialNB with wordlist 1
---------------------------------------
Training Accuracy: 0.655738
Testing Accuracy: 0.483871

For model SVC with wordlist 1
-----------------------------
Training Accuracy: 1.000000
Testing Accuracy: 0.483871

For model RandomForestClassifier with wordlist 1
------------------------------------------------
Training Accuracy: 0.983607
Testing Accuracy: 0.451613

For model DumbGuess with wordlist 2
-----------------------------------
Training Accuracy: 0.519126
Testing Accuracy: 0.467742

For model LogisticRegression with wordlist 2
--------------------------------------------
Training Accuracy: 0.836066
Testing Accuracy: 0.435484

For model MultinomialNB with wordlist 2
------