1. Scrape at least 400 full reviews and ratings from Yelp for a restaurant that has mixed reviews. 

Recode the ratings. 1-3 = Negative, 4-5 = Positive. 


In [1]:
import pandas as pd
gosmans = pd.read_csv('gosmans.csv')

In [2]:
# Recoding to 1 and 0 oppose to 'pos' and 'neg' so I can feed the roc_auc when reporting metrics

def recoder(values):
  if values >3:
    return 1
  else:
    return 0

In [3]:
gosmans['rating'] = gosmans['rating'].apply(recoder)

Clean and pre-process the data (remove punctuation, convert all words to lower case, remove stopwords, lemmatize the corpus).


In [15]:
import nltk
import re
nltk.download('punkt')
nltk.download("stopwords")
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [17]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [6]:
gosmans['review_text'] = gosmans['review_text'].str.lower()

In [7]:
gosmans['review_text'] = gosmans['review_text'].str.split()

In [8]:
def punctuation(words):
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

In [9]:
gosmans['review_text'] = gosmans['review_text'].apply(punctuation)

In [10]:
def remove_stop(words):
    new_words = []
    for word in words:
        if word not in stopwords.words('english'):
            new_words.append(word)
    return new_words

In [11]:
gosmans['review_text'] = gosmans['review_text'].apply(remove_stop)

In [12]:
def join_string(list_string):
    string = ','.join(list_string)
    return string

In [13]:
gosmans['review_text'] = gosmans['review_text'].apply(join_string)

In [18]:
wordnet_lemmatizer = WordNetLemmatizer()
gosmans['review_text'] = [wordnet_lemmatizer.lemmatize(word) for word in gosmans['review_text']]

2. Build count vectorized feature representation for the reviews (hint: sklearn.feature_extraction.text.countvectorizer). (2 pts.)


3. Develop the NB, LR, DT, BT, RF, SVM, and ANN classifiers to predict review sentiment. Evaluate average recall, precision, F1, ROC AUC, and PR AUC for each model. (12 pts.)

In [20]:
from sklearn.model_selection import train_test_split

X = gosmans['review_text']
y = gosmans['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

3. Develop the NB, LR, DT, BT, RF, SVM, and ANN classifiers to predict review sentiment. Evaluate average recall, precision, F1, ROC AUC, and PR AUC for each model. (12 pts.)

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

from sklearn.metrics import classification_report, recall_score, precision_score, f1_score, roc_curve, auc, precision_recall_curve

names = ["Logistic Regression", "SVM", "Decision Tree", "Random Forest", "AdaBoost", "Neural Net", 
         "Naive Bayes"]

classifiers = [LogisticRegression(),
               SVC(probability=True),
               DecisionTreeClassifier(max_depth=5),
               RandomForestClassifier(max_depth=5, n_estimators=10),
               AdaBoostClassifier(),
               MLPClassifier(alpha=1, max_iter=1000),
               MultinomialNB()
               ]

for name, clf in zip(names, classifiers):
  clf_pipe = Pipeline([
                    ('count', CountVectorizer()), 
                    (name, clf), 
                    ])
  
  clf_pipe.fit(X_train,y_train)

  pred = clf_pipe.predict(X_test)
  pred_prob = clf_pipe.predict_proba(X_test)[:, 1]

  fpr, tpr, thresholds = roc_curve(y_test, pred_prob)
  precision, recall, thresholds_pr = precision_recall_curve(y_test, pred)

  print('\n\n', name, '\n\n')
  print(classification_report(y_test, pred))
  print('ROC AUC: ', auc(fpr, tpr))
  print('Precision/Recall AUC: ', auc(precision, recall))
  print('\n\n')



 Logistic Regression 


              precision    recall  f1-score   support

           0       0.85      0.75      0.80        68
           1       0.74      0.84      0.79        57

    accuracy                           0.79       125
   macro avg       0.79      0.80      0.79       125
weighted avg       0.80      0.79      0.79       125

ROC AUC:  0.8913828689370484
Precision/Recall AUC:  0.37028340080971656





 SVM 


              precision    recall  f1-score   support

           0       0.88      0.72      0.79        68
           1       0.72      0.88      0.79        57

    accuracy                           0.79       125
   macro avg       0.80      0.80      0.79       125
weighted avg       0.81      0.79      0.79       125

ROC AUC:  0.9064757481940144
Precision/Recall AUC:  0.3729153318077803





 Decision Tree 


              precision    recall  f1-score   support

           0       0.65      0.81      0.72        68
           1       0.68      0.4

4. Build TFIDF vectorized feature representation for the reviews and evaluate NB, LR, DT, BT, RF, SVM, and ANN classifiers using recall, precision, F1, ROC AUC, and PR AUC (12 pts.)

In [22]:
names = ["Logistic Regression", "SVM", "Decision Tree", "Random Forest", "AdaBoost", "Neural Net", 
         "Naive Bayes"]

classifiers = [LogisticRegression(),
               SVC(probability=True),
               DecisionTreeClassifier(max_depth=5),
               RandomForestClassifier(max_depth=5, n_estimators=10),
               AdaBoostClassifier(),
               MLPClassifier(alpha=1, max_iter=1000),
               MultinomialNB()
               ]

for name, clf in zip(names, classifiers):
  clf_pipe = Pipeline([
                    ('tfidf', TfidfVectorizer()), 
                    (name, clf), 
                    ])
  
  clf_pipe.fit(X_train,y_train)

  pred = clf_pipe.predict(X_test)
  pred_prob = clf_pipe.predict_proba(X_test)[:, 1]

  fpr, tpr, thresholds = roc_curve(y_test, pred_prob)
  precision, recall, thresholds_pr = precision_recall_curve(y_test, pred)

  print('\n\n', name, '\n\n')
  print(classification_report(y_test, pred))
  print('ROC AUC: ', auc(fpr, tpr))
  print('Precision/Recall AUC: ', auc(precision, recall))
  print('\n\n')



 Logistic Regression 


              precision    recall  f1-score   support

           0       0.84      0.76      0.80        68
           1       0.75      0.82      0.78        57

    accuracy                           0.79       125
   macro avg       0.79      0.79      0.79       125
weighted avg       0.80      0.79      0.79       125

ROC AUC:  0.9107327141382869
Precision/Recall AUC:  0.36929657477025896





 SVM 


              precision    recall  f1-score   support

           0       0.84      0.75      0.79        68
           1       0.73      0.82      0.78        57

    accuracy                           0.78       125
   macro avg       0.79      0.79      0.78       125
weighted avg       0.79      0.78      0.78       125

ROC AUC:  0.9076367389060886
Precision/Recall AUC:  0.36346820175438593





 Decision Tree 


              precision    recall  f1-score   support

           0       0.67      0.71      0.69        68
           1       0.62      0.

5. Identify the best performing representation/model combination. Explain your choice of metric. (4 pts.)



The best representation/model combination was the Term Frequency Index representation and the Naive Bayes model. The count vectroized representation coupled with the Naive Bayes produced a .914 AUC whereas the term frequency index representation produced a slightly better .918 AUC. The metric of choice is of course AUC as this tests the TPR and FPR at various thresholds giving the most reliable score.