Lab: Movie Reviews Classification
======

![](https://imgs.xkcd.com/comics/star_ratings.png)

Overview
----

> I like my data large, my algorithms simple, and my labels weak.  
\- Andrej Karpathy


#### Data Science Workflow

1. Ask
2. Acquire
3. Process
4. Model
5. Deliver 

You are going to apply the Data Science workflow to classifying movie reviews as positive or negative. 

This is the actual Internet Movie Database ([IMDb](https://www.imdb.com)) review data used in the seminal [Pang *et al.* (2002)](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf) paper.

1) Ask
----

Which words contribute to a movie review being positive or negative?

2) Acquire
----

Write Python code to download the [data: positive and negative processed reviews](http://www.cs.cornell.edu/People/pabo/movie-review-data/). 

Use `polarity dataset v2.0`.

<br>

<details><summary>
Click here for a hint...
</summary>

```
import os
from urllib.request import urlretrieve
```

</details>
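
One possible way to fetch the archive is sketched below. The archive URL and the local file name `review_polarity.tar.gz` are assumptions based on the dataset page linked above, and `download` is a hypothetical helper name, so check the link on that page before relying on it.

```python
import os
from urllib.request import urlretrieve

# Assumed archive location for polarity dataset v2.0; confirm it on the dataset page.
URL = "http://www.cs.cornell.edu/People/pabo/movie-review-data/review_polarity.tar.gz"


def download(url: str, filename: str) -> str:
    """Download url to filename unless the file already exists; return filename."""
    if not os.path.exists(filename):
        urlretrieve(url, filename)
    return filename
```

Calling `filename = download(URL, "review_polarity.tar.gz")` should leave the archive in the working directory. Skipping the request when the file already exists keeps repeated notebook runs from downloading it again.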

Write Python code to extract the downloaded archive.

<br>

<details><summary>
Click here for a hint...
</summary>

`import tarfile`

</details>
<br>
<br>
<details><summary>
Click here for the solution...
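</summary>

```
import os
import tarfile

# Assumption: the archive was saved under this name in the download step.
filename = "review_polarity.tar.gz"
path = "./txt_sentoken/"
if not os.path.exists(path):
    with tarfile.open(filename, "r:gz") as tar:
        tar.extractall()
```
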
</details>

Open the data files in your favorite text editor. How would you describe the data?

3) Process
-----

The data has been preprocessed for you (thank me later 🙌). 

What preprocessing steps have already been performed on the data?

4) Model
----

We are going to use scikit-learn to model the data.

If you are not familiar with scikit-learn, check out [this tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

Load the data into scikit-learn:

In [1]:
from sklearn.datasets import load_files

In [1]:
sentiment = load_files(path, 
                       encoding='utf-8',
                       random_state=42)
print(*sentiment.target_names)


Create a train and test split:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_data, test_data, train_target, test_target = train_test_split(
    sentiment.data,
    sentiment.target,
    random_state=42)

> We decided to sample an equal number of positive and negative reviews—was that a good idea? — Bo Pang

Probably not: the real world is not uniform. That is the difference between academic and applied ML.

Applied ML models the world as it is, not as is convenient for publication.
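
When the classes are not balanced, it helps to check the class counts and to keep the class proportions the same in both splits. The sketch below uses made-up, deliberately imbalanced labels rather than the lab data; the `toy_` names are hypothetical, while `stratify` is a standard argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced labels (80 negative, 20 positive); not the lab data.
toy_data = [f"review {i}" for i in range(100)]
toy_target = np.array([0] * 80 + [1] * 20)

print("all counts:  ", np.bincount(toy_target))

# stratify=toy_target keeps the class proportions the same in both splits.
toy_train, toy_test, toy_train_target, toy_test_target = train_test_split(
    toy_data, toy_target, stratify=toy_target, random_state=42
)
print("train counts:", np.bincount(toy_train_target))
print("test counts: ", np.bincount(toy_test_target))
```

Without `stratify`, a random split of imbalanced data can leave the test set with a noticeably different class mix from the training set, which makes the accuracy harder to interpret.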

Pipelines FTW
-----

![](images/pipeline.png)

In [3]:
from sklearn.pipeline import Pipeline

The goal is to go end-to-end as quickly as possible and establish a baseline performance metric.

In [2]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

In [4]:
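def text_classifcation(clf: Pipeline, train_data, test_data, train_target, test_target):
    "Fit the Pipeline on the training split and print its accuracy on the test split."
    clf.fit(train_data, train_target)  # learn the vocabulary and the model from training data only
    predicted = clf.predict(test_data)  # predict labels for the held-out reviews
    accuracy = np.mean(predicted == test_target)  # fraction of correct predictions
    print(f"The accuracy on the test data is {accuracy:.2%}")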

Vectorizing & Model Fitting
-----
We need to convert the words into numbers.
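
To make that concrete, here is a minimal sketch of what a count vectorizer produces. The two reviews are made up for illustration and do not come from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, not taken from the dataset.
docs = ["a great great movie", "a boring movie"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each word to its column index.
# The default tokenizer drops one-letter tokens such as "a".
print(vectorizer.vocabulary_)
print(counts.toarray())
```

Each row is a document and each column is a word, so the first row records that "great" appears twice. Word order is discarded, which is why this is called a bag-of-words representation.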

Create an instance of [Count Vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

Create an instance of a [Multinomial Naïve Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) model.

In [2]:
clf = Pipeline([('vect', CountVectorizer()),
                ('clf', MultinomialNB())])

text_classifcation(clf, train_data, test_data, train_target, test_target)

Now try with `TfidfTransformer`.
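
One way to wire this up is to add a step between the vectorizer and the classifier. This is a sketch rather than the reference solution: the step name `'tfidf'`, the variable `tfidf_clf`, and the toy reviews are all made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TfidfTransformer rescales the raw counts before the classifier sees them.
tfidf_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB())])

# Made-up reviews so the sketch runs on its own; swap in the lab's train/test split.
toy_reviews = ["great fun film", "wonderful great acting",
               "boring dull film", "dull awful acting"]
toy_labels = [1, 1, 0, 0]

tfidf_clf.fit(toy_reviews, toy_labels)
print(tfidf_clf.predict(["great wonderful film", "awful boring film"]))
```

In the lab itself, the same pipeline can be passed to the helper: `text_classifcation(tfidf_clf, train_data, test_data, train_target, test_target)`.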

Why the difference? What could be the reason for the change?

Evaluate the Model
-----

Create a confusion matrix in scikit-learn.

Are there more False Positives or False Negatives?

Why do you think this is?
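
As a starting point, the sketch below shows how `sklearn.metrics.confusion_matrix` lays out its result. The labels are made up for illustration; in the lab you would pass `test_target` and the output of `clf.predict(test_data)` instead.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = negative review, 1 = positive review.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()
print(f"false positives: {fp}, false negatives: {fn}")
```

Here a false positive is a negative review predicted as positive, and a false negative is the reverse.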

Repeat with a "prettier" confusion matrix

In [1]:
from display_confusion_matrix import display_confusion_matrix

5) Deliver
-----

The goal of the lab is to get __>82.9%__ accuracy on the test set, better than the best score from the paper. Once you do, you can stop.

You can use any Machine Learning technique to accomplish that goal (e.g., feature engineering, choosing a different classifier, random search, grid search, Bayesian Optimization, …).
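
As one example of these options, a grid search over pipeline parameters might look like the sketch below. The parameter values are illustrative rather than tuned, and the toy reviews are made up so the block runs on its own; in the lab you would fit on `train_data` and `train_target`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vect', CountVectorizer()),
                     ('clf', MultinomialNB())])

# Keys are "<step name>__<parameter>"; these values are illustrative, not tuned.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [0.1, 1.0],
}

# Made-up reviews so the sketch runs on its own.
toy_reviews = ["great fun film", "wonderful great acting",
               "loved this great film", "fun wonderful story",
               "boring dull film", "dull awful acting",
               "hated this awful film", "boring awful story"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# cv=2 only because the toy data is tiny; a larger value suits the real data.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(toy_reviews, toy_labels)
print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.2%}")
```

Run the search on the training split only and score the chosen model on the test split once at the end, so that the test accuracy stays an honest estimate.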

------

Once you have a final model,

In 2-3 sentences, summarize what you did.

In a table, list your best parameter choices for each step so that someone could recreate your work.

In a single sentence, summarize what you found.

Bonus Cartoon
------

![](https://imgs.xkcd.com/comics/emoji_movie_reviews.png)

<br>
<br>
---