# DAT-NYC-37 | Lab 15 | Natural Language Processing and Text Classification

In [None]:
import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import feature_extraction, ensemble, cross_validation, metrics

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

%matplotlib inline
plt.style.use('ggplot')

The data consists of Amazon reviews labeled with a sentiment score. It is in a "raw" format in which each review and its score are separated by a tab (`\t` character), so we'll first need to parse it.

Here's a sample:

<small>
<pre>
\tIt clicks into place in a way that makes you wonder how long that mechanism would last.\t0
\tI went on Motorola's website and followed all directions, but could not get it to pair again.\t0
</pre>
</small>

In [51]:
import sklearn
print sklearn.__version__

reviews = []
sentiments = []

with open(os.path.join('..', 'datasets', 'amazon-reviews.txt')) as f:
    for line in f:
        # Split each line into the review text and its (possibly empty) label
        line = line.strip('\n')
        review, sentiment = line.split('\t')
        # Unlabeled reviews get NaN so that we can drop them later
        sentiment = np.nan if sentiment == '' else int(sentiment)

        reviews.append(review)
        sentiments.append(sentiment)

df = pd.DataFrame({'review': reviews, 'sentiment': sentiments})
                                                                        


0.17.1


In [52]:
df.head()

|    | review | sentiment |
|----|--------|-----------|
| 0 | I try not to adjust the volume setting to avoi... | NaN |
| 1 | So there is no way for me to plug it in here i... | 0.0 |
| 2 | Good case, Excellent value. | 1.0 |
| 3 | I thought Motorola made reliable products!. | NaN |
| 4 | Battery for Motorola Razr. | NaN |


In [56]:
df.dropna(inplace = True) # Let's drop NaNs

In [57]:
df.head()

|    | review | sentiment |
|----|--------|-----------|
| 1 | So there is no way for me to plug it in here i... | 0.0 |
| 2 | Good case, Excellent value. | 1.0 |
| 5 | Great for the jawbone. | 1.0 |
| 10 | Tied to charger for conversations lasting more... | 0.0 |
| 11 | The mic is great. | 1.0 |


In [55]:
X = df.review
y = df.sentiment



# Part I: Modeling text features using `CountVectorizer`

`CountVectorizer` converts a collection of text into a matrix of features.  Each row will be a sample (an article or piece of text) and each column will be a text feature (usually a count or binary feature per word).

**`CountVectorizer` takes a column of text and creates a new dataset.**  It generates a feature for every word in all of the pieces of text.

CAUTION: Using all of the words can be useful, but we may need to use regularization to avoid overfitting.  Otherwise, rare words may cause the model to overfit and not generalize.

(And check http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html as needed)

## Step 1: Defining and transforming the input using `CountVectorizer`

In [None]:
# TODO: Instantiate a new CountVectorizer using English stop words

# Reminder: "Stop words" are non-content words.  (e.g. 'to', 'the', and 'it')
# They aren’t helpful for prediction, so we remove them.
# We'll almost always want to specify `stop_words = 'english'` to exclude stop words

# YOUR CODE HERE

Vectorizers are like other models in `sklearn`:
- We create a vectorizer object with the parameters of our feature space
- We fit a vectorizer to learn the vocabulary
- We transform a set of text into that feature space

Note: there is a distinction between fit and transform:
- We fit from our training set.  This is part of the model building process, so we don't look at our test set
- We transform our test set using our model fit on the training set
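
The fit/transform distinction can be sketched on a toy corpus. This is a minimal illustration rather than the lab's solution: the two lists of sentences are made up, and stop words are left in to keep the vocabulary easy to read. Words that the vectorizer never saw during `fit` (here "screen" and "and") are simply ignored when the test text is transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not taken from the Amazon reviews file
train_docs = ["the battery is great", "the battery died fast"]
test_docs = ["great screen and great battery"]

vectorizer = CountVectorizer()

# fit: learn the vocabulary from the training text only
vectorizer.fit(train_docs)
vocab = sorted(vectorizer.vocabulary_)
# vocab is ['battery', 'died', 'fast', 'great', 'is', 'the']

# transform: map any text into the feature space learned above
train_counts = vectorizer.transform(train_docs)
test_counts = vectorizer.transform(test_docs)

# Columns follow the alphabetical vocabulary order, so the test row is
# [1, 0, 0, 2, 0, 0]: one "battery", two "great", nothing else is known
test_row = test_counts.toarray().tolist()
```

Because the vocabulary is fixed at `fit` time, the training and test matrices always have the same number of columns.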

In [None]:
# TODO: Call `fit` on your count vectorizer, passing in the appropriate input.

# YOUR CODE HERE

In [None]:
# TODO: Transform the appropriate input into a count-vectorized result.

# To check that you did this step correctly, preview the first 10 feature names.

# YOUR CODE HERE

While dense matrices store every entry in the matrix, sparse matrices store only the nonzero entries. Sparse matrices support fewer operations than dense ones, and some algorithms do not accept them, so use them when a matrix would be too big to hold in memory as a dense array but is mostly zeros and therefore compresses well. You can convert a sparse matrix to a dense one with `.todense()`.

In [None]:
# Tip:
# X_transformed.todense()
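
To make the storage difference concrete, here is a small sketch that builds a mostly-zero matrix by hand and compares its dense and sparse footprints. The matrix shape and the values are arbitrary assumptions chosen for illustration, and the CSR format is used because it is a common choice for row-oriented count data.

```python
import numpy as np
from scipy import sparse

# A made-up 100 x 500 count matrix with only two nonzero entries
dense = np.zeros((100, 500), dtype=np.int64)
dense[0, 10] = 3
dense[99, 499] = 1

# Convert to compressed sparse row (CSR) format
sparse_counts = sparse.csr_matrix(dense)

# Dense storage pays for every cell; CSR stores the nonzero values,
# their column indices, and one row pointer per row (plus one)
dense_bytes = dense.nbytes
sparse_bytes = (sparse_counts.data.nbytes
                + sparse_counts.indices.nbytes
                + sparse_counts.indptr.nbytes)

# Converting back recovers the original matrix
round_trip = sparse_counts.toarray()
```

Note that `.toarray()` returns a regular NumPy array, while `.todense()` returns a `numpy.matrix`.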

In [None]:
# Q: What are the 10 most commonly used words in our training set?

# TODO: YOUR CODE HERE

## Step 2: The Train/Test Split

In [None]:
# TODO: Split your data with a train/test split, using 30% of the dataset for training.
# Use `random_state=1` so we can compare our results as a class.

# TODO: YOUR CODE HERE
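
As a reminder of how the splitting helper behaves, here is a sketch on made-up data. It assumes scikit-learn 0.18 or later, where `train_test_split` lives in `sklearn.model_selection`; in 0.17 the same function is imported from `sklearn.cross_validation`. The `test_size` and `random_state` values below are arbitrary and are not the ones this lab asks for.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: eight short "reviews" with alternating labels
texts = ["review %d" % i for i in range(8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Hold out 25% of the samples; fixing random_state makes the split repeatable
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# Every sample lands in exactly one of the two sets
n_train, n_test = len(train_texts), len(test_texts)
```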

## Step 3: Training a Classifier with Random Forests

We can now build a random forest model to predict "sentiment".

*Use the Sklearn documentation as necessary for these steps!*

In [None]:
# Review Q: Why might we use a Random Forest model here instead of a decision tree?

# YOUR ANSWER HERE:

In [None]:
# TODO: Define a RandomForestClassifier that will train 20 decision trees.

# YOUR CODE HERE:

In [None]:
# TODO: Fit your RandomForestClassifier to your training data

# YOUR CODE HERE:

## Step 4: Evaluating your model using AUC

In [None]:
# TODO: The following code generates an ROC curve and calculates the corresponding AUC score.
# You'll need to replace the appropriate variables to get this to run.

test_y_hat = rf_model.predict_proba(test_X_transformed)

fpr, tpr, thresholds = metrics.roc_curve(test_y, test_y_hat[:, 1])

plt.figure()
plt.plot(fpr, tpr, label = 'ROC curve (area = %0.4f)' % metrics.auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([.0, 1.])
plt.ylim([.0, 1.1])
plt.xlabel('FPR/Fall-out')
plt.ylabel('TPR/Sensitivity')
plt.title('Test Sentiment ROC')
plt.legend(loc = 'lower right')
plt.show()

# import seaborn as sns

# plt.figure()
# sns.distplot(fpr, color="red", bins=10)
# sns.distplot(tpr, color="green", bins=10);
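
To see what `metrics.roc_curve` and `metrics.auc` compute without a trained model, here is a sketch on four hand-written scores. The labels and scores are made up for illustration. `roc_curve` expects the true labels and a score for the positive class, which is why the cell above passes only column 1 of the `predict_proba` output.

```python
from sklearn import metrics

# Hypothetical labels and positive-class scores
true_labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps a threshold over the scores and records FPR/TPR pairs
fpr, tpr, thresholds = metrics.roc_curve(true_labels, scores)

# Area under that curve, computed two equivalent ways
auc_from_curve = metrics.auc(fpr, tpr)
auc_direct = metrics.roc_auc_score(true_labels, scores)
# Both equal 0.75: of the four (positive, negative) pairs, three are
# ranked correctly (only the 0.35 positive scores below the 0.4 negative)
```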

In [None]:
# TODO: Now use the same process to generate an ROC curve and AUC score on the training dataset. Are you overfitting?
# Why or why not? If so, what can you do about it?

# YOUR CODE HERE

In [None]:
# Review Q: What is the AUC score? Why are we using it?

# YOUR ANSWER HERE:

# Part II: Using TF-IDF

Directions: Redo the analysis above with `TfidfVectorizer` instead of `CountVectorizer`. Use 10 estimators for your Random Forest Classifier.  What results do you get?

(Check http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html as needed)
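
Before redoing the analysis, it may help to see how TF-IDF weights differ from raw counts on a tiny corpus. This is a sketch with made-up documents and default `TfidfVectorizer` settings, not the lab's solution. Each word appears once per document, so the raw counts are identical, but the word shared by every document ends up with a lower weight than the word unique to one document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "great" appears in every document
docs = ["great phone", "great battery", "great case"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

# Look up column positions by word
col = tfidf.vocabulary_
first_doc = weights[0]
great_weight = first_doc[col["great"]]
phone_weight = first_doc[col["phone"]]

# With the default settings each row is scaled to unit L2 norm
first_doc_norm = (first_doc ** 2).sum() ** 0.5
```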

In [None]:
# TODO: YOUR CODE AND RESULTS HERE


In [None]:
# Q: What words have the highest Tf-Idf? What does this indicate?

## Bonus Questions/Exercises:

- Which features are most important? (hint: Read the docs: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- Use `cross_val_score` instead of a simple train/test split. How much does your model improve?
- *In your own words*, describe what `cross_val_score` is doing.
- Try including larger n-grams (e.g. 2 or 3) in your analysis. Does this improve your results?
- What other strategies can you use to improve your results?
- Why might using KNN be a bad idea on this dataset?
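
For the n-gram exercise, the following sketch shows how the `ngram_range` parameter of `CountVectorizer` changes the vocabulary. The two documents are made up, and stop words are deliberately left in so that the word pairs stay visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = ["not good", "very good"]

# Single words only (the default)
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
unigram_vocab = sorted(unigrams.vocabulary_)
# ['good', 'not', 'very']

# Single words plus adjacent word pairs
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
bigram_vocab = sorted(bigrams.vocabulary_)
# ['good', 'not', 'not good', 'very', 'very good']
```

With bigrams, "not good" and "very good" become separate features, at the cost of a larger and sparser feature matrix.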