# Data Mining Project - Week 5 - Restaurant Hygiene Prediction

## Data Mining Specialization - Coursera / University of Illinois at Urbana-Champaign

* Author: Michael Onishi
* Date: November, 2019

### Description
The task is to predict whether a set of restaurants will pass their public health inspections, given their Yelp text reviews plus additional information such as each restaurant's location and the cuisines it offers.




### Dataset setup

In [1]:
! wget https://d396qusza40orc.cloudfront.net/dataminingcapstone/Task6/Hygiene.tar.gz
! tar xzf Hygiene.tar.gz

--2019-11-24 20:02:05--  https://d396qusza40orc.cloudfront.net/dataminingcapstone/Task6/Hygiene.tar.gz
Resolving d396qusza40orc.cloudfront.net (d396qusza40orc.cloudfront.net)... 13.224.63.146, 13.224.63.17, 13.224.63.183, ...
Connecting to d396qusza40orc.cloudfront.net (d396qusza40orc.cloudfront.net)|13.224.63.146|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39134299 (37M) [application/x-gzip]
Saving to: ‘Hygiene.tar.gz’


2019-11-24 20:02:07 (31.6 MB/s) - ‘Hygiene.tar.gz’ saved [39134299/39134299]



In [2]:
! pip install unidecode

Collecting unidecode

In [0]:
import pandas as pd
import numpy as np
from unidecode import unidecode
import re
import math
import html
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier, VotingClassifier


# Plotting tools
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

# Seaborn for plotting and styling
import seaborn as sns
sns.set(style="whitegrid")

In [0]:
df = pd.read_csv('Hygiene/hygiene.dat', header=None, delimiter='\n', names=['reviews'])

# Some reviews contain HTML entities such as &#160;; convert them to plain text.
df.reviews = df.reviews.apply(html.unescape)
df['label'] = np.loadtxt('Hygiene/hygiene.dat.labels', dtype = str, delimiter='\n')
df2 = pd.read_csv('Hygiene/hygiene.dat.additional', header=None, names=['categories', 'zip', 'review_count', 'average_rating'])
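
Passing `delimiter='\n'` to `pd.read_csv` worked when this notebook was run, but some newer pandas releases may reject a newline separator. A minimal sketch of an alternative, assuming both files are UTF-8, hold one record per line and have the same number of lines. The helpers `read_lines` and `load_reviews` are hypothetical names, not part of the original notebook, and the demo content is made up:

```python
import html
import os
import tempfile

import pandas as pd


def read_lines(path):
    """Return the file's lines as a list of strings, line endings stripped."""
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\r\n") for line in fh]


def load_reviews(reviews_path, labels_path):
    """Build a DataFrame with one review and one label per row."""
    reviews = [html.unescape(text) for text in read_lines(reviews_path)]
    labels = read_lines(labels_path)
    if len(reviews) != len(labels):
        raise ValueError("reviews and labels must have the same number of lines")
    return pd.DataFrame({"reviews": reviews, "label": labels})


# Tiny demonstration on temporary files (made-up content, not the real data).
with tempfile.TemporaryDirectory() as tmp:
    reviews_file = os.path.join(tmp, "hygiene.dat")
    labels_file = os.path.join(tmp, "hygiene.dat.labels")
    with open(reviews_file, "w", encoding="utf-8") as fh:
        fh.write("Great pho &amp; rolls\nSlow service\n")
    with open(labels_file, "w", encoding="utf-8") as fh:
        fh.write("1\n[None]\n")
    demo = load_reviews(reviews_file, labels_file)
```

Reading the file line by line also keeps commas and quotes inside a review from being interpreted as CSV syntax.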

In [0]:
df = pd.concat([df2, df], axis=1)

In [6]:
df

| | categories | zip | review_count | average_rating | reviews | label |
|---|---|---|---|---|---|---|
| 0 | ['Vietnamese', 'Sandwiches', 'Restaurants'] | 98118 | 4 | 4.000000 | The baguettes and rolls are excellent, and alt... | 1 |
| 1 | ['American (New)', 'Restaurants'] | 98109 | 21 | 4.047619 | I live up the street from Betty. When my sist... | 1 |
| 2 | ['Mexican', 'Restaurants'] | 98103 | 14 | 3.111111 | I'm worried about how I will review this place... | 1 |
| 3 | ['Mexican', 'Tex-Mex', 'Restaurants'] | 98112 | 42 | 4.088889 | Why can't you access them on Google street vie... | 0 |
| 4 | ['Mexican', 'Restaurants'] | 98102 | 12 | 3.071429 | Things to like about this place: homemade guac... | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 13294 | ['Dim Sum', 'Cantonese', 'Chinese', 'Restauran... | 98104 | 1 | 3.000000 | I keep my Dim Sum dining expectations very low... | [None] |
| 13295 | ['Breakfast & Brunch', 'Restaurants'] | 98116 | 29 | 4.258065 | Cheap eats and veggie alterna-meats... Perfect... | [None] |
| 13296 | ['Vietnamese', 'Restaurants'] | 98104 | 1 | 4.000000 | Everything here is awesome except for the wait... | [None] |
| 13297 | ['Italian', 'Pizza', 'Restaurants'] | 98109 | 2 | 4.000000 | A great place to go on Queen Anne when everywh... | [None] |


In [0]:
df_train = df[df.label != '[None]'].copy()
df_test = df[df.label == '[None]'].copy()

### Text only classifier

In [0]:
def preprocess(text):
    # Remove accents
    text = unidecode(text)
    # Remove line breaks and tab
    text = re.sub(r'[\t\n\r]', ' ', text)
    # Remove http links
    text = re.sub(r'http\S+', ' ', text)
    # Convert to lowercase
    text = text.lower().strip()    
    return text

In [0]:
df_train.reviews = df_train.reviews.apply(preprocess)

In [0]:
def vectorize(text_list, max_features=20000, ngram_range=(1,1)):
    print(f"Vectorizing {len(text_list)} documents using {max_features} max_features")
    vectorizer = TfidfVectorizer(max_df=0.8, max_features=max_features,
                             min_df=2, stop_words='english',
                             use_idf=False,
                             ngram_range=ngram_range,
                             token_pattern='[a-zA-Z0-9]{3,}')
    
    return vectorizer, vectorizer.fit_transform(text_list)
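
With `use_idf=False`, `TfidfVectorizer` applies no IDF weighting: each row holds term counts scaled to unit L2 norm (the default `norm='l2'`). The `min_df`/`max_df` filters and the token pattern then decide which terms survive. A small sketch on a made-up three-document corpus, using the same settings as `vectorize` apart from `max_features`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus, only for illustration.
docs = [
    "the food was great and the service was great",
    "great food but slow service",
    "dirty tables and slow service",
]

vectorizer = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english',
                             use_idf=False, token_pattern='[a-zA-Z0-9]{3,}')
matrix = vectorizer.fit_transform(docs)

# Terms that survived the stop-word, min_df and max_df filters
vocabulary = sorted(vectorizer.vocabulary_)
# L2 norm of every row of the resulting matrix
row_norms = np.linalg.norm(matrix.toarray(), axis=1)

print(vocabulary)
print(row_norms)
```

Here "service" appears in all three documents, which exceeds `max_df=0.8`, and "dirty" appears in only one, which is below `min_df=2`, so neither ends up in the vocabulary.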

In [114]:
%%time
vectorizer_unigram, matrix_unigram = vectorize(df_train.reviews, max_features = 100000, ngram_range=(1,1))
vectorizer_bigram, matrix_bigram = vectorize(df_train.reviews, max_features = 100000, ngram_range=(1,2))

Vectorizing 546 documents using 100000 max_features
Vectorizing 546 documents using 100000 max_features
CPU times: user 2.06 s, sys: 53.1 ms, total: 2.11 s
Wall time: 2.11 s


In [256]:
print(matrix_unigram.shape, matrix_bigram.shape)

(546, 10740) (546, 46729)


**The provided test set has no labels, so I hold out 10% of the original training set as my own test set in order to compute the F1 score.**

In [0]:
X_train_unigram, X_test_unigram, y_train, y_test = train_test_split(matrix_unigram, df_train.label, test_size=0.1, random_state=42)
X_train_bigram, X_test_bigram, y_train, y_test = train_test_split(matrix_bigram, df_train.label, test_size=0.1, random_state=42)
target_names = ['passed test', 'failed test']
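
The held-out set is small, so a purely random split can shift the class balance between the training and test portions. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the label proportions the same in both parts. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 60 rows labelled '0' and 40 rows labelled '1'.
labels = np.array(['0'] * 60 + ['1'] * 40)
features = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.1, random_state=42, stratify=labels)

# Count how many rows of each class landed in the test portion
n_test_passed = int((y_te == '0').sum())
n_test_failed = int((y_te == '1').sum())
print(n_test_passed, n_test_failed)
```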

In [0]:
def train_predict(X_train, X_test, y_train, y_test):
    """Fit four classifiers on the training split and print a report for each on the test split."""
    # Multinomial Naive Bayes
    clf_multinomialNB = MultinomialNB().fit(X_train, y_train)
    y_pred_multinomialNB = clf_multinomialNB.predict(X_test)
    print('Multinomial Naive Bayes')
    print(classification_report(y_test, y_pred_multinomialNB, target_names=target_names))

    # Linear model with hinge loss, trained by stochastic gradient descent
    clf_sgd = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                            random_state=42, max_iter=10, tol=None).fit(X_train, y_train)
    y_pred_sgd = clf_sgd.predict(X_test)
    print('\nSGD')
    print(classification_report(y_test, y_pred_sgd, target_names=target_names))

    # Random forest with 200 trees
    clf_rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
    y_pred_rf = clf_rf.predict(X_test)
    print('\nRandom Forest')
    print(classification_report(y_test, y_pred_rf, target_names=target_names))

    # Majority (hard) vote; VotingClassifier refits clones of the three estimators
    clf_voting = VotingClassifier([('Multinomial Naive Bayes', clf_multinomialNB), ('SGD', clf_sgd), ('Random Forest', clf_rf)]).fit(X_train, y_train)
    y_pred_voting = clf_voting.predict(X_test)
    print('\nVoting')
    print(classification_report(y_test, y_pred_voting, target_names=target_names))
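
In the cells above, the vectorizer is fitted on all 546 labelled reviews before the split, so the held-out rows influence the vocabulary and the document-frequency filters, and a single 55-row hold-out gives a noisy estimate. One possible alternative wraps the vectorizer and the classifier in a scikit-learn pipeline and scores it with stratified k-fold cross-validation, so that the vectorizer is refitted on each training fold only. The corpus, the fold count and the macro-F1 scoring below are illustrative assumptions, not the notebook's setup:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up corpus: 10 "passed" ('0') and 10 "failed" ('1') reviews.
texts = ([f"clean kitchen friendly staff visit{i}" for i in range(10)]
         + [f"dirty floor rude staff visit{i}" for i in range(10)])
labels = np.array(['0'] * 10 + ['1'] * 10)

# Vectorizer and classifier are fitted together, fold by fold
model = make_pipeline(
    TfidfVectorizer(stop_words='english', use_idf=False),
    MultinomialNB(),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring='f1_macro')

print(scores.mean())
```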


In [269]:
train_predict(X_train_unigram, X_test_unigram, y_train, y_test)

Multinomial Naive Bayes
              precision    recall  f1-score   support

 passed test       0.72      0.78      0.75        23
 failed test       0.83      0.78      0.81        32

    accuracy                           0.78        55
   macro avg       0.78      0.78      0.78        55
weighted avg       0.79      0.78      0.78        55


SGD
              precision    recall  f1-score   support

 passed test       0.53      0.87      0.66        23
 failed test       0.82      0.44      0.57        32

    accuracy                           0.62        55
   macro avg       0.67      0.65      0.61        55
weighted avg       0.70      0.62      0.61        55


Random Forest
              precision    recall  f1-score   support

 passed test       0.57      0.91      0.70        23
 failed test       0.89      0.50      0.64        32

    accuracy                           0.67        55
   macro avg       0.73      0.71      0.67        55
weighted avg       0.75      0

In [270]:
train_predict(X_train_bigram, X_test_bigram, y_train, y_test)

Multinomial Naive Bayes
              precision    recall  f1-score   support

 passed test       0.68      0.65      0.67        23
 failed test       0.76      0.78      0.77        32

    accuracy                           0.73        55
   macro avg       0.72      0.72      0.72        55
weighted avg       0.73      0.73      0.73        55


SGD
              precision    recall  f1-score   support

 passed test       0.51      0.91      0.66        23
 failed test       0.86      0.38      0.52        32

    accuracy                           0.60        55
   macro avg       0.68      0.64      0.59        55
weighted avg       0.71      0.60      0.58        55


Random Forest
              precision    recall  f1-score   support

 passed test       0.56      1.00      0.72        23
 failed test       1.00      0.44      0.61        32

    accuracy                           0.67        55
   macro avg       0.78      0.72      0.66        55
weighted avg       0.82      0

### Metadata only classifier

In [0]:
df_train['average_rating_scaled'] = preprocessing.minmax_scale(df_train.average_rating)
df_train['review_count_scaled'] = preprocessing.minmax_scale(df_train.review_count)
df_train.zip = df_train.zip.apply(str)
df_train['cat_list'] = df_train.categories.apply(lambda x : x[2:-2].split("', '"))
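
The slice-and-split on `categories` relies on every item being wrapped in single quotes. Since the column looks like a Python list literal, `ast.literal_eval` is a more forgiving way to parse it. A minimal sketch, where `parse_categories` is a hypothetical helper and the "Joe's Diner" entry is a made-up example (whether the real data contains such entries is not checked here):

```python
import ast


def parse_categories(raw):
    """Parse a string such as "['Mexican', 'Restaurants']" into a list of str."""
    parsed = ast.literal_eval(raw)
    return [str(item) for item in parsed]


simple = parse_categories("['Mexican', 'Tex-Mex', 'Restaurants']")

# An item containing an apostrophe is written with double quotes in a list repr
raw_tricky = "[\"Joe's Diner\", 'Restaurants']"
tricky = parse_categories(raw_tricky)

# The slice-and-split approach does not separate the two items in that case
sliced = raw_tricky[2:-2].split("', '")

print(simple)
print(tricky)
print(sliced)
```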

In [204]:
df_train[['average_rating_scaled', 'review_count_scaled', 'cat_list']]

| | average_rating_scaled | review_count_scaled | cat_list |
|---|---|---|---|
| 0 | 0.750000 | 0.021739 | [Vietnamese, Sandwiches, Restaurants] |
| 1 | 0.761905 | 0.144928 | [American (New), Restaurants] |
| 2 | 0.527778 | 0.094203 | [Mexican, Restaurants] |
| 3 | 0.772222 | 0.297101 | [Mexican, Tex-Mex, Restaurants] |
| 4 | 0.517857 | 0.079710 | [Mexican, Restaurants] |
| ... | ... | ... | ... |
| 541 | 0.916667 | 0.014493 | [Mexican, Restaurants] |
| 542 | 0.527778 | 0.043478 | [Chinese, Restaurants] |
| 543 | 0.612500 | 0.123188 | [Pizza, Restaurants] |
| 544 | 0.825000 | 0.065217 | [Vietnamese, Sandwiches, Restaurants] |


In [0]:
X_train_metadata, X_test_metadata, y_train, y_test = train_test_split(
    pd.concat([df_train[['average_rating_scaled', 'review_count_scaled', 'zip']], 
               pd.DataFrame(preprocessing.MultiLabelBinarizer().fit_transform(df_train.cat_list))],
              axis = 1)
    , df_train.label, test_size=0.1, random_state=42)
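
In this feature matrix the ZIP code enters as a single column of digit strings, which the classifiers treat as a magnitude, and the binarized category columns get integer names while the other columns have string names, a mix that newer scikit-learn releases may reject. A possible variant, sketched on a made-up frame, one-hot encodes the ZIP code and gives every column a string name. The `zip_` and `cat_` prefixes are arbitrary choices:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up rows standing in for df_train.
toy = pd.DataFrame({
    "zip": ["98118", "98109", "98118"],
    "cat_list": [["Vietnamese", "Restaurants"],
                 ["Mexican", "Restaurants"],
                 ["Mexican"]],
})

# One indicator column per ZIP code instead of one numeric column
zips = pd.get_dummies(toy["zip"], prefix="zip").astype(int)

# One indicator column per category, named after the category
mlb = MultiLabelBinarizer()
cats = pd.DataFrame(mlb.fit_transform(toy["cat_list"]),
                    columns=[f"cat_{name}" for name in mlb.classes_],
                    index=toy.index)

features = pd.concat([zips, cats], axis=1)
print(features)
```

Passing `index=toy.index` keeps the rows aligned in `pd.concat` even when the frame's index is not a plain 0..n-1 range.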

In [271]:
train_predict(X_train_metadata, X_test_metadata, y_train, y_test)

Multinomial Naive Bayes
              precision    recall  f1-score   support

 passed test       0.62      0.91      0.74        23
 failed test       0.90      0.59      0.72        32

    accuracy                           0.73        55
   macro avg       0.76      0.75      0.73        55
weighted avg       0.78      0.73      0.73        55


SGD
              precision    recall  f1-score   support

 passed test       0.42      1.00      0.59        23
 failed test       0.00      0.00      0.00        32

    accuracy                           0.42        55
   macro avg       0.21      0.50      0.29        55
weighted avg       0.17      0.42      0.25        55






Random Forest
              precision    recall  f1-score   support

 passed test       0.50      0.78      0.61        23
 failed test       0.74      0.44      0.55        32

    accuracy                           0.58        55
   macro avg       0.62      0.61      0.58        55
weighted avg       0.64      0.58      0.57        55


Voting
              precision    recall  f1-score   support

 passed test       0.51      0.96      0.67        23
 failed test       0.92      0.34      0.50        32

    accuracy                           0.60        55
   macro avg       0.71      0.65      0.58        55
weighted avg       0.75      0.60      0.57        55



### Combined classifier

In [0]:
clf_multinomialNB = MultinomialNB().fit(X_train_unigram, y_train)
df_train['text_prediction'] = clf_multinomialNB.predict(matrix_unigram)
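
Note that `text_prediction` for the training rows comes from a model that was fitted on those same rows, so the feature will tend to look more reliable during training than it is on unseen data. A common remedy in stacking, not used in this notebook, is to build the feature from out-of-fold predictions with `cross_val_predict`. A minimal sketch on random made-up features; in the notebook this would be applied to the training rows only, with the fitted text model supplying the predictions for the held-out rows:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB

# Made-up non-negative features and alternating labels.
rng = np.random.default_rng(0)
X_text = rng.random((40, 5))
y = np.array(['0', '1'] * 20)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Each row is predicted by a model that never saw that row during fitting
oof_prediction = cross_val_predict(MultinomialNB(), X_text, y, cv=cv)

print(oof_prediction[:10])
```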

In [0]:
X_train_all, X_test_all, y_train, y_test = train_test_split(
    pd.concat([df_train[['average_rating_scaled', 'review_count_scaled', 'zip', 'text_prediction']], 
               pd.DataFrame(preprocessing.MultiLabelBinarizer().fit_transform(df_train.cat_list))],
              axis = 1)
    , df_train.label, test_size=0.1, random_state=42)

In [285]:
train_predict(X_train_all, X_test_all, y_train, y_test)

Multinomial Naive Bayes
              precision    recall  f1-score   support

 passed test       0.70      0.91      0.79        23
 failed test       0.92      0.72      0.81        32

    accuracy                           0.80        55
   macro avg       0.81      0.82      0.80        55
weighted avg       0.83      0.80      0.80        55


SGD
              precision    recall  f1-score   support

 passed test       0.42      1.00      0.59        23
 failed test       0.00      0.00      0.00        32

    accuracy                           0.42        55
   macro avg       0.21      0.50      0.29        55
weighted avg       0.17      0.42      0.25        55






Random Forest
              precision    recall  f1-score   support

 passed test       0.67      0.78      0.72        23
 failed test       0.82      0.72      0.77        32

    accuracy                           0.75        55
   macro avg       0.74      0.75      0.74        55
weighted avg       0.76      0.75      0.75        55


Voting
              precision    recall  f1-score   support

 passed test       0.63      0.96      0.76        23
 failed test       0.95      0.59      0.73        32

    accuracy                           0.75        55
   macro avg       0.79      0.78      0.74        55
weighted avg       0.82      0.75      0.74        55

