# Bag of Words with Restaurant Reviews

### Introduction

In this lesson, we'll begin classifying text. We'll work with a set of restaurant reviews and see whether we can use them to learn what makes customers like or dislike a restaurant.

### Working with our dataset

We'll treat our text as a supervised learning problem. Let's take a look at our dataset, and then we'll see how it fits in.

In [88]:
import pandas as pd  

reviews = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t')

In [4]:
reviews[:10]

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| 5 | Now I am getting angry and I want my damn pho. | 0 |
| 6 | Honeslty it didn't taste THAT fresh.) | 0 |
| 7 | The potatoes were like rubber and you could te... | 0 |
| 8 | The fries were great too. | 1 |
| 9 | A great touch. | 1 |


In [6]:
reviews.shape

(1000, 2)

We have a thousand reviews, each labeled as liked (1) or not liked (0). Our task is to turn the text of each review into features that predict that label. Before we do that, let's define a few terms particular to natural language processing.
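
To make "turning text into features" concrete, here is a minimal pure-Python sketch of a bag-of-words encoding. The two toy reviews and the helper name `bag_of_words` are made up for illustration and are not part of the dataset or of any library; the lesson itself uses scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

# Two toy reviews, invented for illustration (not from the dataset)
toy_reviews = ["great food great service", "bad food"]

# The vocabulary is every distinct word, sorted so column order is stable
vocabulary = sorted({word for review in toy_reviews for word in review.split()})

def bag_of_words(review, vocabulary):
    """Return one count per vocabulary word for a single review."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(review, vocabulary) for review in toy_reviews]

print(vocabulary)  # ['bad', 'food', 'great', 'service']
print(vectors)     # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Each review becomes a row of word counts. Word order is discarded, which is where the name "bag of words" comes from.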

### Defining Terms

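Here are a few terms we should know before moving on. A *corpus* is the collection of documents we analyze, here the full set of reviews. *Stop words* are very common words, such as "this" or "the", that carry little meaning on their own and are usually removed. *Stemming* reduces a word to its root form, so that "loved" becomes "love".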

In [5]:
import re  
import nltk  
nltk.download('stopwords') 
from nltk.corpus import stopwords 


[nltk_data] Downloading package stopwords to /Users/jeff/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [66]:
review = reviews.loc[0]

In [67]:
review_text = review['Review']
review_text

'Wow... Loved this place.'

### Clean Reviews

In [49]:
def remove_punctuation_and_lower(review_text):
    return re.sub('[^a-zA-Z]', ' ', review_text.lower())

In [37]:
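# Build the stop-word lookup once instead of once per word
STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(words):
    # Accept either a raw string or a list of tokens
    if isinstance(words, str):
        words = words.split()
    return [word for word in words if word not in STOP_WORDS]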

In [38]:
remove_stop_words(review_text)

['Wow...', 'Loved', 'place.']
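
Notice that the output above still contains punctuation and capital letters, because the stop words were matched against the raw tokens. The sketch below chains the two cleaning steps in order. To stay self-contained it uses a tiny hand-picked stop-word set, `TOY_STOP_WORDS`, which is an assumption for illustration and much smaller than NLTK's English list. The helper names are also hypothetical.

```python
import re

# Tiny illustrative stop-word set (assumption; NLTK's list is far longer)
TOY_STOP_WORDS = {"this", "the", "a", "is", "was"}

def clean_text(text):
    """Lowercase the text and replace every non-letter with a space."""
    return re.sub('[^a-zA-Z]', ' ', text.lower())

def drop_stop_words(words, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [word for word in words if word not in stop_words]

tokens = drop_stop_words(clean_text("Wow... Loved this place.").split())
print(tokens)  # ['wow', 'loved', 'place']
```

Lowercasing first matters because stop-word lists are lowercase, so a capitalized "This" would otherwise slip through.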

In [44]:
from nltk.stem.porter import PorterStemmer 
def stem_words(words):
    if isinstance(words, str):
        words = words.split()
    ps = PorterStemmer()  
    stemmed_words = [ps.stem(word) for word in words]  
    return stemmed_words

In [45]:
stem_words(review_text)

['wow...', 'love', 'thi', 'place.']

In [62]:
corpus = []  
for review in reviews['Review']:  
    review = remove_punctuation_and_lower(review)
    review = remove_stop_words(review)
    review_words = stem_words(review)
    review = ' '.join(review_words)   
    corpus.append(review)  

### Convert to Numbers

In [68]:
from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(max_features = 1500)  
  
X = cv.fit_transform(corpus).toarray()  
y = reviews['Liked'].values  # target labels: 1 = liked, 0 = not liked

In [70]:
reviews[:1]

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |


In [69]:
X[:1]

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [73]:
from sklearn.model_selection import train_test_split 
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) 

In [74]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [75]:
from sklearn.ensemble import RandomForestClassifier 
  

model = RandomForestClassifier(n_estimators = 100, 
                            criterion = 'entropy') 
model.fit(X_train, y_train) 

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### Evaluate model

In [77]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 1])

In [78]:
from sklearn.metrics import confusion_matrix 
  
cm = confusion_matrix(y_test, y_pred) 
  
cm 

array([[106,  17],
       [ 45,  82]])
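
The confusion matrix can be summarized with a few standard metrics. The sketch below hard-codes the four counts from the output above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the entries read as true negatives, false positives, false negatives, and true positives.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 106, 17
fn, tp = 45, 82

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all reviews classified correctly
precision = tp / (tp + fp)     # of reviews predicted "liked", share truly liked
recall = tp / (tp + fn)        # of truly liked reviews, share the model found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.752 precision=0.828 recall=0.646
```

On this particular split, the model is right about three times out of four, and most of its errors are liked reviews that it predicted as not liked (45 of 127).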

### Interpret Model

In [83]:
import eli5
from eli5.sklearn import PermutationImportance

In [None]:
perm = PermutationImportance(model).fit(X_test, y_test)
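
The general idea behind permutation importance is to shuffle one feature column and measure how much the model's score drops. The pure-Python sketch below illustrates that idea only; it is not eli5's implementation, and the toy data, the stand-in model `toy_model`, and the helper names are all hypothetical.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, column, seed=0):
    """Drop in accuracy after shuffling the values of one column."""
    rng = random.Random(seed)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_permuted = [row[:column] + [value] + row[column + 1:]
                  for row, value in zip(X, shuffled)]
    return accuracy(model, X, y) - accuracy(model, X_permuted, y)

# Toy data: the label always equals column 0; column 1 is noise
X_toy = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_toy = [1, 0, 1, 0]

def toy_model(row):
    # Stand-in "model" that only ever looks at column 0
    return row[0]

print(permutation_importance(toy_model, X_toy, y_toy, column=0))
print(permutation_importance(toy_model, X_toy, y_toy, column=1))  # 0.0
```

Shuffling a column the model never reads cannot change its predictions, so that column's importance is exactly zero. A column the model relies on can only lose accuracy when shuffled.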


In [87]:
# Note: scikit-learn 1.2+ removed get_feature_names(); use cv.get_feature_names_out() there
eli5.show_weights(model, feature_names = cv.get_feature_names(), top = 40)

| Weight | Feature |
|---|---|
| 0.0447 ± 0.0327 | great |
| 0.0181 ± 0.0158 | love |
| 0.0180 ± 0.0158 | delici |
| 0.0172 ± 0.0176 | good |
| 0.0162 ± 0.0154 | amaz |
| 0.0108 ± 0.0131 | friendli |
| 0.0092 ± 0.0147 | nice |
| 0.0090 ± 0.0088 | fantast |
| 0.0090 ± 0.0121 | bad |
| 0.0089 ± 0.0111 | place |


### Resources

[Geeks for Geeks](https://www.geeksforgeeks.org/python-nlp-analysis-of-restaurant-reviews/)

[TF-IDF](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)

[used with naive bayes](https://medium.com/@baemaek/text-mining-preprocess-and-naive-bayes-classifier-da0000f633b2)