# Overview
Sometimes we base decisions on more than a restaurant's rating. For example, if a restaurant has a high rating but often fails hygiene inspections, that information can dissuade many people from eating there. Using hygiene information could lead to a more informative system; however, we often do not have it for every restaurant and must make predictions from a small sample of data points.

In this task, you will predict whether a set of restaurants will pass the public health inspection, given the corresponding Yelp text reviews along with additional information such as location and the cuisines offered. Predicting an unobserved attribute is a common and important application of data mining, and working on this task gives you direct experience with it. Because you can use as many indicators for prediction as you like, this is also an opportunity to combine the algorithms you have learned in the Data Mining Specialization to solve a real-world problem, and to experiment with different methods to see which is most effective.

## About the Dataset
Start by downloading the [dataset](https://d396qusza40orc.cloudfront.net/dataminingcapstone/Task6/Hygiene.tar.gz). It consists of a training subset of 546 restaurants used to train your classifier and a testing subset of 12753 restaurants used to evaluate it. For each restaurant in the training subset you are given a binary label indicating whether it passed the latest public health inspection; no labels are available for the testing subset. The dataset is spread across three files; in each file, the first 546 lines correspond to the training subset and the remaining lines to the testing subset. Below is a description of each file:

* **hygiene.dat**: Each line contains the concatenated text reviews of one restaurant.  
* **hygiene.dat.labels**: For the first 546 lines, a binary label (0 or 1) is used where a 0 indicates that the restaurant has passed the latest public health inspection test, while a 1 means that the restaurant has failed the test. The rest of the lines have "[None]" in their label field implying that they are part of the testing subset.
* **hygiene.dat.additional**: It is a CSV (Comma-Separated Values) file where the first value is a list containing the cuisines offered, the second value is the zip code, which gives an idea about the location, the third is the number of reviews, and the fourth is the average rating, which can vary between 0 and 5 (5 being the best).
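
As a small illustration of the `hygiene.dat.additional` layout, the sketch below parses one line with the standard library. The sample line is reconstructed from the field description above; its exact quoting is an assumption, and `parse_additional_line` is a hypothetical helper, not part of the dataset tooling.

```python
import ast
import csv
import io

# Assumed shape of one line; the quoting of the cuisine list is a guess.
SAMPLE_LINE = '''"['Vietnamese', 'Sandwiches', 'Restaurants']",98118,4,4.0'''


def parse_additional_line(line):
    """Split one CSV line into typed fields (hypothetical helper)."""
    cuisines, zipcode, num_reviews, avg_rating = next(csv.reader(io.StringIO(line)))
    return {
        "cuisines_offered": ast.literal_eval(cuisines),  # string -> Python list
        "zipcode": zipcode,                              # kept as text: it is a category
        "num_reviews": int(num_reviews),
        "avg_rating": float(avg_rating),
    }


record = parse_additional_line(SAMPLE_LINE)
print(record)
```

Keeping the zip code as a string avoids treating it as a number with a meaningful order.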

For testing, we use the F1 measure, which is the harmonic mean of precision and recall, to rank the submissions in the leaderboard. The F1 measure will be based on the macro-averages of precision and recall (macro-averaging is used here to ensure that the two classes are given equal weight as we do not want class 0 to dominate the measure).
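
To make the metric concrete, here is a minimal sketch of F1 computed from macro-averaged precision and recall, as described above. The function names are illustrative. Note that scikit-learn's `scoring='f1_macro'` averages the per-class F1 scores instead, which can differ slightly, and that `scoring='f1'` (used in the cross-validation cells of this notebook) scores only the positive class.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Harmonic mean of the macro-averaged precision and recall."""
    per_class = [precision_recall(y_true, y_pred, c) for c in classes]
    macro_p = sum(p for p, _ in per_class) / len(per_class)
    macro_r = sum(r for _, r in per_class) / len(per_class)
    if macro_p + macro_r == 0:
        return 0.0
    return 2 * macro_p * macro_r / (macro_p + macro_r)


# Toy labels: 0 = passed, 1 = failed, following the label file description.
score = macro_f1([0, 0, 0, 1, 1], [0, 0, 1, 1, 0])
print(round(score, 4))
```

Because each class contributes equally to the averages, a classifier that always predicts the majority class is penalized for the class it never predicts.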

In [1]:
hygiene_text_path= "./Hygiene/hygiene.dat"
hygiene_labels_path= "./Hygiene/hygiene.dat.labels"
hygiene_additional_path= "./Hygiene/hygiene.dat.additional"

In [2]:
import pandas as pd
import numpy as np
import gensim
import nltk
import spacy

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin


from gensim.parsing.preprocessing import strip_tags, strip_punctuation, strip_numeric, strip_short
from gensim.parsing.preprocessing import strip_multiple_whitespaces, strip_non_alphanum, remove_stopwords, stem_text
from nltk.stem import WordNetLemmatizer, SnowballStemmer

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.feature_extraction import DictVectorizer

from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import svm
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, GridSearchCV



from nltk.corpus import stopwords 
STOP_WORDS = set(stopwords.words('english'))


pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
SEED=26

## 1. Read in Data

Preprocessing Steps:
* tokenization and cleaning
* stopword removal
* stemming and lemmatization

In [49]:
# tokenize and preprocess
# https://radimrehurek.com/gensim/parsing/preprocessing.html
FILTERS_LIST = [lambda x: x.lower(),  # lowercase
                strip_tags,  # remove tags
                strip_punctuation,  # replace punctuation characters with spaces
                strip_multiple_whitespaces,  # collapse repeated whitespace
                # strip_numeric,  # remove numbers (disabled)
                remove_stopwords,  # remove stopwords
                strip_short,  # remove words shorter than minsize=3 characters
                stem_text]  # Porter-stem each token

LEMMATIZER = WordNetLemmatizer()  # create once instead of once per token

def preprocess(text):
    """
    Lowercase, strip tags and punctuation, collapse whitespace, remove stopwords
    and short words, stem, then lemmatize each token. Returns a list of tokens.
    """
    return [LEMMATIZER.lemmatize(token)
            for token in gensim.parsing.preprocessing.preprocess_string(text, FILTERS_LIST)]
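
For readers without gensim or NLTK installed, the following standard-library sketch imitates the first few filters (lowercasing, tag and punctuation stripping, stopword and short-word removal). It is only an approximation: the stopword set is a tiny hypothetical one, and stemming and lemmatization are left out.

```python
import re
import string

TOY_STOPWORDS = {"the", "and", "are", "is", "a", "of"}  # hypothetical, far smaller than gensim's list


def toy_preprocess(text, min_len=3):
    """Approximate the cleaning filters without stemming or lemmatization."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                             # drop HTML-like tags
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)   # punctuation -> space
    tokens = text.split()                                            # also collapses whitespace
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]


print(toy_preprocess("The baguettes and rolls are <b>excellent</b>!"))
```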

In [71]:
%%time
from tqdm import tqdm

texts = []
preprocessed_texts = []

with open(hygiene_text_path) as f:
    texts = f.readlines()
    
for _text in tqdm(texts):
    result_stemmed = preprocess(_text)
    preprocessed_texts.append(result_stemmed)
    
all_preprocessed_texts = [" ".join(_text) for _text in preprocessed_texts]

100%|██████████| 13299/13299 [01:34<00:00, 140.05it/s]


CPU times: user 1min 34s, sys: 700 ms, total: 1min 35s
Wall time: 1min 35s


In [72]:
N = 546

# labels 
with open(hygiene_labels_path, 'r') as f:
    labels = [l.rstrip() for l in f]

# texts = []
# with open(hygiene_text_path, 'r') as f:
#     texts = f.read().splitlines(True)


df = pd.DataFrame({"label":labels, "text": texts, "preprocessed_texts": all_preprocessed_texts})
hygiene_additional = pd.read_csv(hygiene_additional_path,  
                                 names=["cuisines_offered", "zipcode", "num_reviews", "avg_rating"],
                                 dtype={"cuisines_offered": str, 
                                        "zipcode": str,
                                        "num_reviews": str})
df = df.join(hygiene_additional)
df['avg_rating'] = df['avg_rating'].apply(lambda x: str(int(round(x, 0))))

print(df.info())
display(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13299 entries, 0 to 13298
Data columns (total 7 columns):
label                 13299 non-null object
text                  13299 non-null object
preprocessed_texts    13299 non-null object
cuisines_offered      13299 non-null object
zipcode               13299 non-null object
num_reviews           13299 non-null object
avg_rating            13299 non-null object
dtypes: object(7)
memory usage: 727.4+ KB
None


|   | label | text | preprocessed_texts | cuisines_offered | zipcode | num_reviews | avg_rating |
|---|---|---|---|---|---|---|---|
| 0 | 1 | The baguettes and rolls are excellent, and alt... | baguett roll excel haven tri excit dozen plu t... | ['Vietnamese', 'Sandwiches', 'Restaurants'] | 98118 | 4 | 4 |
| 1 | 1 | I live up the street from Betty. &#160;When my... | live street betti 160 sister town spring break... | ['American (New)', 'Restaurants'] | 98109 | 21 | 4 |
| 2 | 1 | I'm worried about how I will review this place... | worri review place strongli think bad night pl... | ['Mexican', 'Restaurants'] | 98103 | 14 | 3 |
| 3 | 0 | Why can't you access them on Google street vie... | access googl street view like medina yarrow po... | ['Mexican', 'Tex-Mex', 'Restaurants'] | 98112 | 42 | 4 |
| 4 | 0 | Things to like about this place: homemade guac... | thing like place homemad guacamol varieti tast... | ['Mexican', 'Restaurants'] | 98102 | 12 | 3 |
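
One detail of the `avg_rating` bucketing is worth a quick check. Python's built-in `round` rounds exact halves to the nearest even integer, so 2.5 and 3.5 land in different-sized buckets. The sketch below mirrors the lambda used in the cell above; whether exact `.5` averages occur in this dataset has not been checked here.

```python
def rating_bucket(avg_rating):
    """Mirror the notebook's lambda: round to the nearest integer, then stringify."""
    return str(int(round(avg_rating, 0)))


for value in (2.5, 3.5, 3.6):
    print(value, "->", rating_bucket(value))
```

If conventional half-up rounding is preferred, `int(avg_rating + 0.5)` is one alternative for non-negative ratings.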


In [73]:
%%time
train_df = df[df["label"] != "[None]"]
test_df = df[df["label"] == "[None]"]

train = train_df.drop(["label", "preprocessed_texts"], axis=1)
train_preprocessed = train_df.drop(["label", "text"], axis=1)
train_labels = train_df["label"].astype(int) # needed by sklearn

test = test_df.drop(["label", "preprocessed_texts"], axis=1)
test_preprocessed = test_df.drop(["label", "text"], axis=1)
test_labels = test_df["label"]

print(train.shape, train_preprocessed.shape, train_labels.shape)
print(test.shape, test_preprocessed.shape, test_labels.shape)
print(train.dtypes, train_preprocessed.dtypes)

(546, 5) (546, 5) (546,)
(12753, 5) (12753, 5) (12753,)
text                object
cuisines_offered    object
zipcode             object
num_reviews         object
avg_rating          object
dtype: object preprocessed_texts    object
cuisines_offered      object
zipcode               object
num_reviews           object
avg_rating            object
dtype: object
CPU times: user 131 ms, sys: 58.9 ms, total: 190 ms
Wall time: 188 ms
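
The split above relies on the `[None]` sentinel in the label file. The same logic on plain Python lists looks roughly like this; `split_by_label` and the toy data are illustrative only.

```python
TEST_SENTINEL = "[None]"  # label value for unlabeled (test) rows, per the dataset description


def split_by_label(labels, rows):
    """Separate labeled rows (training) from rows carrying the sentinel (testing)."""
    train, test = [], []
    for label, row in zip(labels, rows):
        if label == TEST_SENTINEL:
            test.append(row)
        else:
            train.append((row, int(label)))  # cast to int, as scikit-learn expects numeric labels
    return train, test


toy_labels = ["1", "0", "[None]", "[None]"]
toy_rows = ["review a", "review b", "review c", "review d"]
train_rows, test_rows = split_by_label(toy_labels, toy_rows)
print(train_rows, test_rows)
```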


In [74]:
display(train.head())
display(train_preprocessed.head())

|   | text | cuisines_offered | zipcode | num_reviews | avg_rating |
|---|---|---|---|---|---|
| 0 | The baguettes and rolls are excellent, and alt... | ['Vietnamese', 'Sandwiches', 'Restaurants'] | 98118 | 4 | 4 |
| 1 | I live up the street from Betty. &#160;When my... | ['American (New)', 'Restaurants'] | 98109 | 21 | 4 |
| 2 | I'm worried about how I will review this place... | ['Mexican', 'Restaurants'] | 98103 | 14 | 3 |
| 3 | Why can't you access them on Google street vie... | ['Mexican', 'Tex-Mex', 'Restaurants'] | 98112 | 42 | 4 |
| 4 | Things to like about this place: homemade guac... | ['Mexican', 'Restaurants'] | 98102 | 12 | 3 |


|   | preprocessed_texts | cuisines_offered | zipcode | num_reviews | avg_rating |
|---|---|---|---|---|---|
| 0 | baguett roll excel haven tri excit dozen plu t... | ['Vietnamese', 'Sandwiches', 'Restaurants'] | 98118 | 4 | 4 |
| 1 | live street betti 160 sister town spring break... | ['American (New)', 'Restaurants'] | 98109 | 21 | 4 |
| 2 | worri review place strongli think bad night pl... | ['Mexican', 'Restaurants'] | 98103 | 14 | 3 |
| 3 | access googl street view like medina yarrow po... | ['Mexican', 'Tex-Mex', 'Restaurants'] | 98112 | 42 | 4 |
| 4 | thing like place homemad guacamol varieti tast... | ['Mexican', 'Restaurants'] | 98102 | 12 | 3 |


In [69]:
# # just use cross_val_score
# X_train, X_test, y_train, y_test = train_test_split(train_preprocessed, train_labels, test_size= 0.2, random_state=SEED)

## 2. Model Experiments

Models:  
* Naive Bayes
* SVM
* Logistic Regression
* Random Forest
* Gradient Boosting (scikit-learn's `GradientBoostingClassifier`)

Feature Engineering:  
* Count Vectorizer
* Tfidf Vectorizer
* word embedding: GloVe and fastText
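
To show what the TF-IDF representation computes, here is a small standard-library sketch. It follows the weighting that `TfidfVectorizer` uses with its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); tokenization is simplified to whitespace splitting, and the function name is illustrative.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


rows = tfidf(["clean kitchen", "dirty kitchen"])
print(rows[0])
```

In the toy corpus, `kitchen` appears in every document and therefore receives a lower weight than `clean` or `dirty`.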


In [75]:
%%time
pipeline = Pipeline([
    ('union', ColumnTransformer(
        [('cuisines_offered', CountVectorizer(), 'cuisines_offered'),
         ('zipcode', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['zipcode']),
         ('num_reviews', CountVectorizer(token_pattern=r'\d+'), 'num_reviews'),
         ('avg_rating', CountVectorizer(token_pattern=r'\d+'), 'avg_rating'),
         ('text', TfidfVectorizer(
                stop_words='english',
                strip_accents='unicode',
                min_df=15,
                max_df=0.5,
                ngram_range=(1, 3),
                max_features=500), 'preprocessed_texts')],
        remainder='passthrough',
    )),
    ('clf', svm.SVC())
], verbose=False)

# pipeline.fit(X_train, y_train)
# y_pred = pipeline.predict(X_test)
# scores = metrics.f1_score(y_test, y_pred)
scores = cross_val_score(pipeline, train_preprocessed, train_labels, cv=5, scoring= 'f1')
print(scores)
print("Average F1-Score: %0.5f" % np.average(scores))

[0.62857143 0.7184466  0.64912281 0.62711864 0.55319149]
Average F1-Score: 0.63529
CPU times: user 12.3 s, sys: 381 ms, total: 12.7 s
Wall time: 12.7 s
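
The `ngram_range`, `min_df`, `max_df`, and `max_features` settings together decide which terms survive into the feature matrix. The sketch below imitates that selection with the standard library: an integer `min_df` is an absolute document count, a float `max_df` is a fraction of the corpus, and `max_features` keeps the most frequent remaining terms. The function names and the alphabetical tie-breaking are choices of this sketch, not scikit-learn's.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """All word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def select_vocabulary(docs, min_df=2, max_df=0.5, max_features=500):
    """Keep n-grams whose document frequency lies in [min_df, max_df * n_docs]."""
    doc_terms = [ngrams(doc.split()) for doc in docs]
    doc_freq = Counter(term for terms in doc_terms for term in set(terms))
    total_freq = Counter(term for terms in doc_terms for term in terms)
    max_count = max_df * len(docs)
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_count]
    kept.sort(key=lambda t: (-total_freq[t], t))  # most frequent first
    return sorted(kept[:max_features])


docs = ["dirty floor", "dirty floor", "clean floor", "clean table"]
vocabulary = select_vocabulary(docs)
print(vocabulary)
```

Here `floor` is dropped because it appears in more than half of the documents, while the rare terms fall below `min_df`.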


In [76]:
%%time
pipeline = Pipeline([
    ('union', ColumnTransformer(
        [('cuisines_offered', CountVectorizer(), 'cuisines_offered'),
         ('zipcode', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['zipcode']),
         ('num_reviews', CountVectorizer(token_pattern=r'\d+'), 'num_reviews'),
         ('avg_rating', CountVectorizer(token_pattern=r'\d+'), 'avg_rating'),
         ('text', TfidfVectorizer(
                stop_words='english',
                strip_accents='unicode',
                min_df=25,
                max_df=0.5,
                ngram_range=(1, 3),
                max_features=500), 'text')],
        remainder='passthrough',
    )),
    ('clf', svm.SVC())
], verbose=False)

# pipeline.fit(X_train, y_train)
# y_pred = pipeline.predict(X_test)
# score = metrics.f1_score(y_test, y_pred)
scores = cross_val_score(pipeline, train, train_labels, cv=5, scoring= 'f1')
print(scores)
print("Average F1-Score: %0.5f" % np.average(scores))

[0.62857143 0.69811321 0.65486726 0.62608696 0.55319149]
Average F1-Score: 0.63217
CPU times: user 14.8 s, sys: 490 ms, total: 15.3 s
Wall time: 15.3 s


In [77]:
%%time
def test_classifier(clf, X_train, y_train, text_col='text'):
    pipeline = Pipeline([
        ('union', ColumnTransformer(
            [('cuisines_offered', CountVectorizer(), 'cuisines_offered'),
             ('zipcode', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['zipcode']),
             ('num_reviews', CountVectorizer(token_pattern=r'\d+'), 'num_reviews'),
             ('avg_rating', CountVectorizer(token_pattern=r'\d+'), 'avg_rating'),
             ('text', TfidfVectorizer(
                    stop_words='english',
                    strip_accents='unicode',
                    min_df=15,
                    max_df=0.3,
                    ngram_range=(1, 3),
                    max_features=500), text_col)],
            remainder='passthrough',
        )),
        ('clf', clf)
    ], verbose=False)
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring= 'f1')
    print(clf)
    print(scores)
    cv_score = np.average(scores)
    return cv_score

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 22.9 µs
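
With an integer `cv` and a classifier, `cross_val_score` uses stratified folds, so each fold keeps roughly the same class proportions as the full training set. The following is a simplified round-robin sketch of that idea, not scikit-learn's exact assignment; `stratified_folds` is a hypothetical helper.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return [sorted(fold) for fold in folds]


toy_labels = [0] * 10 + [1] * 10
folds = stratified_folds(toy_labels, k=5)
for fold in folds:
    print(fold, [toy_labels[i] for i in fold])
```

Each fold serves once as the held-out set while the remaining folds are used for fitting.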


In [78]:
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Support Vector Machine': svm.SVC(),
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier()
}

### 2.1 No Preprocessing

In [80]:
%%time
for clf_name, clf in classifiers.items():
    cv_score = test_classifier(clf, train, train_labels)
    print('{}: {}'.format(clf_name, cv_score))

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
[0.62264151 0.69090909 0.66071429 0.64220183 0.52272727]
Naive Bayes: 0.6278387987293994
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
[0.64150943 0.71153846 0.66071429 0.63247863 0.55319149]
Support Vector Machine: 0.6398864606110692
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
[0.61946903 0.7037037  0.65517241 0.63247863 0.48351648]
Logistic Regression: 0.6188680520081192
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
      

### 2.2 Preprocessing

In [81]:
%%time
for clf_name, clf in classifiers.items():
    cv_score = test_classifier(clf, train_preprocessed, train_labels, text_col='preprocessed_texts')
    print('{}: {}'.format(clf_name, cv_score))

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
[0.64150943 0.7027027  0.64912281 0.63636364 0.52272727]
Naive Bayes: 0.630485170554684
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
[0.61538462 0.72380952 0.64912281 0.62711864 0.55319149]
Support Vector Machine: 0.6337254159282363
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
[0.61261261 0.71028037 0.63793103 0.62184874 0.52083333]
Logistic Regression: 0.6207012187512557
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
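
When comparing the two settings, it helps to look at the spread across folds as well as the mean. The sketch below does this for the SVM, using the per-fold scores printed in sections 2.1 and 2.2; for these numbers the gap between the two means is small compared with the fold-to-fold standard deviation, so the difference should be read with caution.

```python
from statistics import mean, stdev

# Per-fold F1 scores for the SVM, copied from the outputs of sections 2.1 and 2.2.
svm_raw = [0.64150943, 0.71153846, 0.66071429, 0.63247863, 0.55319149]
svm_preprocessed = [0.61538462, 0.72380952, 0.64912281, 0.62711864, 0.55319149]

for name, scores in [("raw text", svm_raw), ("preprocessed", svm_preprocessed)]:
    print(f"{name}: mean={mean(scores):.4f}, stdev={stdev(scores):.4f}")

gap = abs(mean(svm_raw) - mean(svm_preprocessed))
print(f"gap between means: {gap:.4f}")
```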
       

## Additional Attempts
* Undersampling
* FastText Word Embedding w/ Gensim
* BERT
* Neural network (NN)
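
The notebook does not show how undersampling was attempted. As a rough idea of the technique, the sketch below randomly drops rows from the larger class until both classes have the same size. The helper name and the default seed are assumptions for illustration.

```python
import random


def undersample(rows, labels, seed=26):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    sampled = []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            sampled.append((row, label))
    rng.shuffle(sampled)  # avoid leaving the classes in contiguous blocks
    return [row for row, _ in sampled], [label for _, label in sampled]


toy_rows = list(range(10))
toy_labels = [0] * 7 + [1] * 3
balanced_rows, balanced_labels = undersample(toy_rows, toy_labels)
print(balanced_rows, balanced_labels)
```

Undersampling should be applied only to the training portion of each split, so that evaluation still reflects the original class distribution.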

## 3. Submission

In [None]:
def create_submission(y_pred, filepath):
    with open(filepath, 'w') as f:
        f.write('jc26\n')
        for label in y_pred:
            f.write(str(label) + '\n')

In [None]:
%%time
pipeline = Pipeline([
    ('union', ColumnTransformer(
        [('cuisines_offered', CountVectorizer(), 'cuisines_offered'),
         ('zipcode', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['zipcode']),
         ('num_reviews', CountVectorizer(token_pattern=r'\d+'), 'num_reviews'),
         ('avg_rating', CountVectorizer(token_pattern=r'\d+'), 'avg_rating'),
         ('text', TfidfVectorizer(
                stop_words='english',
                strip_accents='unicode',
                min_df=25,
                max_df=0.5,
                ngram_range=(1, 3),
                max_features=500), 'text')],
        remainder='passthrough',
    )),
    ('clf', svm.SVC())
], verbose=False)

pipeline.fit(train, train_labels)
y_pred = pipeline.predict(test)
# score = metrics.f1_score(y_test, y_pred)
# scores = cross_val_score(pipeline, train, train_labels, cv=5, scoring= 'f1')
# print(scores)
# print("Average F1-Score: %0.5f" % np.average(scores))

In [None]:
submit_path ='./submissions/submission2_SVC.txt'
create_submission(y_pred, submit_path)

### Double Check Before Submitting!!!
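
One way to double-check the file is to read it back and verify its shape. The sketch below assumes the format written by `create_submission` (nickname on the first line, then one 0/1 label per line); `validate_submission` is a hypothetical helper, and for the real file `expected_rows` would be 12753, the size of the testing subset.

```python
import os
import tempfile


def validate_submission(filepath, nickname, expected_rows, allowed=("0", "1")):
    """Check the header line, the number of predictions, and the label values."""
    with open(filepath) as f:
        lines = f.read().splitlines()
    if not lines or lines[0] != nickname:
        return False
    body = lines[1:]
    return len(body) == expected_rows and all(line in allowed for line in body)


# Toy demonstration on a temporary file with three predictions.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_submission.txt")
    with open(path, "w") as f:
        f.write("jc26\n0\n1\n1\n")
    ok = validate_submission(path, "jc26", expected_rows=3)
    wrong_length = validate_submission(path, "jc26", expected_rows=4)
print(ok, wrong_length)
```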

In [None]:
!python submit.py jc26 {submit_path}

## Discussion

### Method Comparison
Tried two or more text representation techniques and two or more learning algorithms, and used the additional features. The comparison gives insight into why some methods perform better than others.

* Text Representation: How text is represented as features (for example, unigrams are one technique, bigrams are another)
* Learning Algorithm: The classification algorithm used (for example, SVM, Naive Bayes, Logistic Regression, ...)


### Best Performing Method
* What toolkit was used?
* How was text preprocessed? (e.g., stopword removal, stemming, or any data cleaning technique)
* How was text represented as features?
* What was the learning algorithm used?

## References
* https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer.html#sphx-glr-auto-examples-compose-plot-column-transformer-py