## Part 1: Predicting IMDB movie review polarity

Sentiment analysis is a popular topic in data science because of the immense amount of user-generated text created online every day. Businesses can look at what is being said about them on review sites to learn how well they are liked, how much they are disliked, and what they can do to improve. While most of this data is unlabeled, some sites also ask users to provide a numerical or star rating. This allows us to build a classifier for positive/negative reviews using the star rating as a label, which could then, hypothetically, be applied to unlabeled text.

IMDB collects information about movies and lets users write their own reviews, as well as provide a numerical rating from 1 to 10. The data for this assignment can be found in hw4_IMDB.csv. It consists of 12,500 positive and 12,500 negative reviews collected from IMDB. The ratings have been binarized by labeling anything with a score between 7 and 10 as “P” and anything between 1 and 4 as “N” (there are no “neutral” reviews in the data). We will build and evaluate a system that classifies these movie reviews as positive or negative.
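
The binarization rule described above can be written as a small function. This is only a sketch: the name `binarize_rating` is hypothetical, treating scores of 5 and 6 as the excluded "neutral" band is inferred from the description rather than stated in the data, and the provided CSV already contains the final labels, so nothing like this needs to be run for the assignment.

```python
def binarize_rating(rating):
    """Map a 1-10 rating to "P", "N", or None (assumed neutral, excluded)."""
    if 7 <= rating <= 10:
        return "P"
    if 1 <= rating <= 4:
        return "N"
    return None  # assumption: 5 and 6 are the neutral scores left out of the data


labels = [binarize_rating(r) for r in [1, 4, 5, 6, 7, 10]]
print(labels)
```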


In [26]:
import pandas as pd

url = "hw4_IMDB.csv"

df = pd.read_csv(url)

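# `df.head` without parentheses displays the bound method and the full frame repr,
# as in the output below; call df.head() to show only the first five rows
df.head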

<bound method NDFrame.head of                                                     text @class@
0      'If you hit your teens in the 70s as I did you...       N
1      'Excellent endearing film with Peter Falk and ...       P
2      'Oh dear what a horrid movie. The production w...       N
3      'This is a terrible production of Bartleby tho...       N
4      'I actually have a fondness for Christopher Le...       N
5      'Interesting fast-paced and amusing. Im not on...       P
6      'With Adam Sandler. This is without a doubt on...       N
7      'This movie is supposed to be taking place in ...       N
8      'I was very excited about this film when I fir...       N
9      'In my opinion this is the best stand-up show ...       P
10     'What a piece of stupid tripe. I wont even was...       N
11     'The Honey I Shrunk the Kids franchise was a h...       N
12     'Dont get me wrong the movie is beautiful the ...       N
13     'I am a huge fan of the comic book series but ...    

### 1.1 Preprocessing our data

Before we build a classifier, we normally do some text pre-processing. Text is an unstructured form of data, or at least more unstructured than the feature vectors we used in the previous exercises. Pre-processing lets us "clean" our textual data.

Do the following preprocessing on the textual data:
- Convert all upper case characters to lower case
- Remove all non-alphanumeric characters
- Remove stopwords

Why is each of these steps beneficial? Explain.
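
To see what each step does, here is a minimal sketch that applies the three steps to one made-up sentence using only the standard library. The `STOPWORDS` set and the `preprocess` helper are hypothetical and kept tiny for illustration; this sketch keeps digits, as the task statement asks.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# scikit-learn's ENGLISH_STOP_WORDS set is much longer.
STOPWORDS = {"this", "is", "the", "a", "of", "and", "it", "i"}


def preprocess(review):
    lowered = review.lower()  # "BEST" and "best" become the same token
    cleaned = re.sub(r"[^a-z0-9 ]", "", lowered)  # drop punctuation and symbols
    kept = [w for w in cleaned.split() if w not in STOPWORDS]  # drop stopwords
    return " ".join(kept)


sample = "This is the BEST film of 2005, and I loved it!"
print(preprocess(sample))
```

Each step shrinks the vocabulary: case variants collapse into one token, punctuation no longer creates near-duplicate tokens, and very frequent words that carry little sentiment are dropped.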

In [27]:
# lower case
df["text_clean"] = df["text"].str.lower()

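# remove everything except letters and spaces (note: this pattern also drops digits)
# regex=True is required in recent pandas, where str.replace treats the pattern literally by default
df["text_clean"] = df["text_clean"].str.replace('[^a-zA-Z ]', '', regex=True)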

# remove stopwords
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

# one pass is enough: the filter already checks every token against the whole stopword set
df["text_clean"] = df["text_clean"].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))


df.head

<bound method NDFrame.head of                                                     text @class@  \
0      'If you hit your teens in the 70s as I did you...       N   
1      'Excellent endearing film with Peter Falk and ...       P   
2      'Oh dear what a horrid movie. The production w...       N   
3      'This is a terrible production of Bartleby tho...       N   
4      'I actually have a fondness for Christopher Le...       N   
5      'Interesting fast-paced and amusing. Im not on...       P   
6      'With Adam Sandler. This is without a doubt on...       N   
7      'This movie is supposed to be taking place in ...       N   
8      'I was very excited about this film when I fir...       N   
9      'In my opinion this is the best stand-up show ...       P   
10     'What a piece of stupid tripe. I wont even was...       N   
11     'The Honey I Shrunk the Kids franchise was a h...       N   
12     'Dont get me wrong the movie is beautiful the ...       N   
13     'I am a hug

### 1.2 Building a Naive Bayes classifier

We will now build our predictive model
- First, turn the text into a feature vector by using 1-grams, also called bag-of-words. This can be done with the CountVectorizer class in scikit-learn. What is the shape of the resulting table? What do the number of rows and the number of columns mean?
- Measure the performance of a Naive Bayes classifier using 3-fold CV, and report the accuracy of the classifier on each fold, as well as the average accuracy. Is accuracy a good measure of predictive performance here? If yes, why? If no, what measure would you use instead?
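
As a hedged illustration of what a bag-of-words table looks like, the sketch below builds one by hand for three made-up documents. It is not how CountVectorizer is implemented (which also tokenizes differently and returns a sparse matrix); it only shows how the shape of the table relates to the corpus.

```python
from collections import Counter

# Toy corpus, made up for illustration.
docs = ["great film great cast", "terrible film", "great fun"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word; entries are word counts.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
print((len(matrix), len(vocab)))  # (number of documents, vocabulary size)
```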


In [29]:
# Bag-of-words based featurization
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X = count_vect.fit_transform(df["text_clean"])
y = df["@class@"]

print(X.shape)

# Training a Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

clf = MultinomialNB()
# note: the task asks for 3-fold CV, but this run uses cv=5, hence the five fold scores below
scores = cross_val_score(clf, X, y, cv=5)

print(scores)

print(sum(scores)/len(scores))

(25000, 108451)
[ 0.8524  0.8626  0.8548  0.8616  0.8496]
0.8562


### 1.3 Interpreting the results of the classifier

Get the cross-validation predictions. Pick some instances that were incorrectly classified and read them through. Are there any words in these misclassified reviews that may have misled the classifier?  Explain with at least three examples for each type of error you can find.
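
With two classes there are two types of error: negative reviews predicted as positive, and positive reviews predicted as negative. The sketch below separates them on a handful of made-up labels; the helper name `split_errors` is hypothetical, and the same idea carries over to the `@class@` and `predicted` columns of the data frame.

```python
def split_errors(true_labels, predicted_labels):
    """Return the indices of N-predicted-as-P and P-predicted-as-N errors."""
    false_pos, false_neg = [], []
    for i, (t, p) in enumerate(zip(true_labels, predicted_labels)):
        if t == "N" and p == "P":
            false_pos.append(i)  # negative review classified as positive
        elif t == "P" and p == "N":
            false_neg.append(i)  # positive review classified as negative
    return false_pos, false_neg


# Toy labels, made up for illustration.
y_true = ["N", "P", "P", "N", "P"]
y_pred = ["P", "P", "N", "N", "N"]
fp, fn = split_errors(y_true, y_pred)
print(fp, fn)
```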

In [36]:
from sklearn.model_selection import cross_val_predict

df["predicted"] = cross_val_predict(clf, X, y, cv=3)

# keep only the rows where the prediction disagrees with the true label
df_incorrect = df[df['@class@'] != df['predicted']]

print(df.shape)
print(df_incorrect.shape)

df_incorrect.head

(25000, 4)
(3602, 4)


<bound method NDFrame.head of                                                     text @class@  \
0      'If you hit your teens in the 70s as I did you...       N   
12     'Dont get me wrong the movie is beautiful the ...       N   
35     'Deranged and graphically gory Japanese film a...       P   
41     'A movie you start watching as a late night ca...       P   
45     'Surprisingly well made little movie. Short in...       P   
46     'Although this film is somewhat sanitized (bec...       P   
68     'BEWARE SPOILERS. This movie was okay. Goldie ...       P   
75     'I really liked this movie despite one scene t...       P   
84     'One thing is for sure...you should not watch ...       P   
85     'Having read the novel before seeing this film...       N   
101    'Good western filmed in the rocky Arizona wild...       P   
105    'I am a big 1930s movie fan and will watch mos...       N   
122    'Now let me tell you about this movie this mov...       P   
126    'Charleton 

### 1.4 Improving the performance of your classifier

This is an open-ended exercise. How far can you push the performance of your classifier? You can try some of the following:
- Use 2-grams (ordered pairs of 2 words), or higher degree n-grams
- Preprocess your data differently. For example, you may choose to keep some punctuation marks, skip lowercasing, use a stemmer, or keep the stopwords
- Use other predictive algorithms, as you see fit.
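
To make the n-gram idea concrete, here is a small sketch that lists the word n-grams of a sentence. The helper `word_ngrams` is hypothetical and for illustration only; in scikit-learn the equivalent featurization is requested through the `ngram_range` parameter, for example `CountVectorizer(ngram_range=(1, 2))` for 1-grams plus 2-grams.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words in text, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("not a good movie", 1))
print(word_ngrams("not a good movie", 2))
```

Note that a pair such as "not good" only becomes a 2-gram if the words in between are removed first, so the choice of n-grams interacts with stopword removal.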

## Part 2: Predicting Yelp star rating

Yelp is a very popular website that collects reviews of restaurants and other local businesses, along with star ratings from 1-5. Instead of classifying these reviews as positive or negative, we’ll now build a classifier to predict the star rating of a Yelp review from its text. Star rating prediction can be viewed either as a multiclass classification problem or a regression problem. For now we’ll treat it as multiclass classification. This is our first problem that is not simple binary classification, and will come with its own set of issues, which we will delve into below.

### 2.1 Interpreting the new accuracies

Read the data from "hw4_Yelp.csv" and then perform the preprocessing steps, Naive Bayes classifier fitting, and evaluation as in Part 1.

Why are the accuracies lower? 

In [1]:
import pandas as pd

url = "hw4_Yelp.csv"

df = pd.read_csv(url)

# lower case
df["text_clean"] = df["text"].str.lower()

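# remove everything except letters and spaces (note: this pattern also drops digits)
# regex=True is required in recent pandas, where str.replace treats the pattern literally by default
df["text_clean"] = df["text_clean"].str.replace('[^a-zA-Z ]', '', regex=True)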

# remove stopwords
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

# one pass is enough: the filter already checks every token against the whole stopword set
df["text_clean"] = df["text_clean"].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))


df.head

<bound method NDFrame.head of                                                    text  @rating@  \
0                'They have a nice selection of vinyl.'         4   
1     'Well not much to say about this Circle K....e...         3   
2     'My husband and I went out on a couples date l...         4   
3     'I love this place. Their chicken panang thai ...         4   
4     'I love this place! I always get a good prescr...         5   
5     'I love Noodles Ranch! I was skeptical at firs...         4   
6     'Ordered a couple sandwiches at 12:17 they did...         1   
7     'Have to give this coffee shop a big shout out...         5   
8     'Did someone just step on my sandwich?  OMG.  ...         3   
9     'This place was great. We went in after seeing...         5   
10    'This place is just  consistently great with g...         5   
11    'Went for a weekday lunch with the boyfriend b...         5   
12    'Im a fan of Zipps but the staff at this locat...         2   
13  

In [4]:
# Bag-of-words based featurization
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X = count_vect.fit_transform(df["text_clean"])
y = df["@rating@"]

print(X.shape)

# Training a Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

clf = MultinomialNB()
scores = cross_val_score(clf, X, y, cv=3)

print(scores)

print(sum(scores)/len(scores))

from sklearn.model_selection import cross_val_predict
df["predicted"] = cross_val_predict(clf, X, y, cv=3)


(10000, 32971)
[ 0.4730054   0.47584758  0.47644764]
0.475100209481


### 2.2 Confusion and cost matrices

Use the confusion_matrix function from sklearn.metrics to get the confusion matrix for your classifier.

We have provided two cost matrices below. Apply them to your confusion matrix and report the total cost of using this classifier under each cost scheme. Which one of these (Table 1 or Table 2) makes more sense in the context of this multiclass classification problem, and why?

**Table 1**

|  a  |  b  |  c  |  d  |  e  |     |
| --- | --- | --- | --- | --- | --- | 
|  0  |  2  |  2  |  2  |  2  | a=1 |
|  2  |  0  |  2  |  2  |  2  | b=2 |
|  2  |  2  |  0  |  2  |  2  | c=3 |
|  2  |  2  |  2  |  0  |  2  | d=4 |
|  2  |  2  |  2  |  2  |  0  | e=5 |

**Table 2**

|  a  |  b  |  c  |  d  |  e  |     |
| --- | --- | --- | --- | --- | --- | 
|  0  |  1  |  2  |  3  |  4  | a=1 |
|  1  |  0  |  1  |  2  |  3  | b=2 |
|  2  |  1  |  0  |  1  |  2  | c=3 |
|  3  |  2  |  1  |  0  |  1  | d=4 |
|  4  |  3  |  2  |  1  |  0  | e=5 |



In [6]:
from sklearn.metrics import confusion_matrix

# rows are the true ratings 1-5, columns are the predicted ratings 1-5
confusion_matrix(df["@rating@"], df["predicted"])

array([[ 296,   56,   52,  303,  139],
       [  76,   32,   91,  555,  122],
       [  30,   17,   90, 1038,  239],
       [  26,    4,   53, 2141, 1028],
       [  47,    4,   25, 1344, 2192]])
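
One way to apply a cost matrix is to multiply it element-wise with the confusion matrix and sum the result. The sketch below does this in plain Python, with the confusion matrix copied from the output above and the two cost matrices generated from the patterns in Table 1 and Table 2. The helper `total_cost` is a hypothetical name. Both cost matrices are symmetric, so the total does not depend on whether rows are read as true or predicted ratings.

```python
# Confusion matrix copied from the output above.
confusion = [
    [296, 56, 52, 303, 139],
    [76, 32, 91, 555, 122],
    [30, 17, 90, 1038, 239],
    [26, 4, 53, 2141, 1028],
    [47, 4, 25, 1344, 2192],
]
n = len(confusion)

# Table 1: every mistake costs 2, regardless of how far off it is.
cost_table_1 = [[0 if i == j else 2 for j in range(n)] for i in range(n)]
# Table 2: the cost is the distance in stars between the two ratings.
cost_table_2 = [[abs(i - j) for j in range(n)] for i in range(n)]


def total_cost(confusion, cost):
    """Sum of count * cost over every cell of the confusion matrix."""
    size = len(confusion)
    return sum(confusion[i][j] * cost[i][j] for i in range(size) for j in range(size))


print(total_cost(confusion, cost_table_1))
print(total_cost(confusion, cost_table_2))
```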

### 2.3 (bonus question) Using regression

Could we instead use a regression predictive model for this problem? Fit a simple linear regression to the training data, using the same pipeline as before. What are some advantages and disadvantages of using regression instead of multiclass classification?
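
A regression model outputs continuous values, which may fall between stars or outside the 1-5 range. One possible way to compare it with a classifier, sketched below under that assumption, is to round and clip the predictions and to report the mean absolute error in stars. The helper names and the example predictions are made up for illustration and are not results from the Yelp data.

```python
def to_star_rating(prediction, low=1, high=5):
    """Round a continuous prediction to the nearest star and clip it to [low, high]."""
    return max(low, min(high, int(round(prediction))))


def mean_absolute_error(true_values, predicted_values):
    """Average distance between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(true_values, predicted_values)) / len(true_values)


# Made-up regression outputs and true ratings, for illustration only.
raw_predictions = [0.3, 2.4, 3.6, 4.9, 5.8]
true_ratings = [1, 3, 4, 5, 5]

stars = [to_star_rating(p) for p in raw_predictions]
print(stars)
print(mean_absolute_error(true_ratings, stars))
```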