# Practical 9

## Text Mining
***

Read in some packages.

In [18]:
# Import numpy and pandas to read in and work with data
import numpy as np
import pandas as pd

# Import models and evaluation functions
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics
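# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# provides the same cross_val_score, aliased here so the later cells run unchanged
from sklearn import model_selection as cross_validation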

# Import vectorizers to turn text into numeric
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Import plotting
import matplotlib.pyplot as plt
%matplotlib inline

## Text classification
We are going to look at some Amazon reviews and classify them into positive or negative.

### Data
The file `data/books.csv` contains 2,000 Amazon book reviews. The data set has two columns: the first (enclosed in quotes) is the review text, and the second is a binary label indicating whether the review is positive or negative.

Let's take a quick look at the file.

In [2]:
!head -3 data/books.csv

'head' is not recognized as an internal or external command,
operable program or batch file.
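
The `head` command is a Unix utility, so the cell above fails in a Windows shell. A portable alternative is to read the first few lines with plain Python. The following is a minimal sketch: `peek` is a hypothetical helper name, and the demo writes a small temporary file with made-up rows rather than assuming `data/books.csv` exists.

```python
import itertools
import os
import tempfile


def peek(path, n=3, encoding="utf-8"):
    """Return the first n lines of a text file without loading all of it."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


# Demo on a throwaway file (made-up rows, not the real reviews).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('review_text,positive\n"Loved it",1\n"Hated it",0\n"Fine",1\n')
    demo_path = tmp.name

first_lines = peek(demo_path, n=3)
os.remove(demo_path)

for line in first_lines:
    print(line)
```

In the notebook, `peek("data/books.csv")` would play the role of `!head -3 data/books.csv`.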


Let's read the data into a pandas data frame. You'll notice two arguments to `pd.read_csv()` that we've never seen before. The first, `quotechar`, tells pandas which character is used to "encapsulate" the text fields. Since our review text is surrounded by double quotes, we let pandas know. In the Python code we write this character as `"\""`: the `\` is needed because the double quote also delimits the string itself. This backslash is known as an escape character. The second argument, `escapechar`, tells pandas that the file uses a backslash as its escape character too.

In [3]:
data = pd.read_csv("data/books.csv", quotechar="\"", escapechar="\\")
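
To see what `quotechar` and `escapechar` do, here is a small sketch that parses an in-memory CSV instead of the real file. The two rows are made up for illustration. The first review contains a comma and escaped double quotes, both of which would break a naive split on commas.

```python
import io

import pandas as pd

# Made-up rows: the first field is quoted, and inner quotes are escaped with \
raw = (
    "review_text,positive\n"
    '"She said \\"wow\\" twice, then left.",1\n'
    '"Plain text, with a comma.",0\n'
)

toy = pd.read_csv(io.StringIO(raw), quotechar='"', escapechar="\\")

# The comma inside the quotes stays in the text, and \" becomes a plain "
first_review = toy.loc[0, "review_text"]
print(first_review)
print(toy.shape)
```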

In [11]:
data.tail()

| | review_text | positive |
|---|---|---|
| 1995 | Both of my boys love this book and request it ... | 1 |
| 1996 | &quot;The Shaman's Apprentice&quot; presents m... | 1 |
| 1997 | John Gunther's INSIDE U.S.A. comes as close to... | 1 |
| 1998 | This is the sixth and final addition to the au... | 1 |
| 1999 | Pretty much all of the Todd Parr books we've r... | 1 |


In [6]:
data.describe()

| | positive |
|---|---|
| count | 2000.0 |
| mean | 0.5 |
| std | 0.500125 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


### Text as a set of features
Going from text to numeric data is very easy. Let's take a look at how we can do this. We'll start by separating out our X and Y data.

In [30]:
X_text = data['review_text']
Y = data['positive']

Next, we will turn `X_text` into just `X` -- a numeric representation!

In [8]:
# Create a vectorizer that will track text as binary features
binary_vectorizer = CountVectorizer(binary=True)

# Let the vectorizer learn what tokens exist in the text data
binary_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = binary_vectorizer.transform(X_text)
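
The effect of `binary=True` is easiest to see on a tiny corpus. The sketch below uses two made-up documents and compares the binary representation with plain counts. Note that `transform()` returns a sparse matrix, which is converted to a dense array here only for printing.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good book", "bad book"]  # made-up toy corpus

binary_vec = CountVectorizer(binary=True).fit(docs)
count_vec = CountVectorizer().fit(docs)

# vocabulary_ maps each token to its column index; sort tokens by that index
tokens = sorted(binary_vec.vocabulary_, key=binary_vec.vocabulary_.get)

binary_matrix = binary_vec.transform(docs).toarray()
count_matrix = count_vec.transform(docs).toarray()

print(tokens)         # ['bad', 'book', 'good']
print(binary_matrix)  # rows: [0 1 1] and [1 1 0]
print(count_matrix)   # rows: [0 1 2] and [1 1 0]
```

The only difference is the repeated word "good" in the first document: it is recorded as 1 in the binary matrix and as 2 in the count matrix.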

### Modeling
We have a ton of features, so let's use them in some different models.

In [10]:
# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.848
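
As a reminder of what this number means: the AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A small sketch with made-up labels and scores shows this using `roc_auc_score` from `sklearn.metrics`.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and classifier scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Positive/negative pairs: (0.35, 0.1) ok, (0.35, 0.4) wrong,
# (0.8, 0.1) ok, (0.8, 0.4) ok -> 3 of 4 pairs are ranked correctly
auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.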


Let's try using full counts instead of a binary representation. I've just copied and pasted the code above and removed `binary=True` from the vectorizer.

In [15]:
# Create a vectorizer that will track text as token counts
count_vectorizer = CountVectorizer()

# Let the vectorizer learn what tokens exist in the text data
count_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = count_vectorizer.transform(X_text)

# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.849


Let's try using TF-IDF.
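
TF-IDF down-weights tokens that appear in many documents and up-weights rarer ones. The sketch below illustrates this on three made-up documents, assuming scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF and L2-normalized rows).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good book", "bad book", "great book"]  # made-up toy corpus

vec = TfidfVectorizer().fit(docs)

# idf_ holds one inverse document frequency per vocabulary column
idf = {token: vec.idf_[index] for token, index in vec.vocabulary_.items()}

# TF-IDF weights for a single document
row = vec.transform(["good book"]).toarray()[0]
weights = {token: row[index] for token, index in vec.vocabulary_.items()}

print(idf["book"], idf["good"])          # "book" is in every doc, so its IDF is lowest
print(weights["book"], weights["good"])  # "good" gets the larger weight
```

Because "book" appears in every document, it carries little information about any one of them and receives the smallest weight.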

In [16]:
# Create a vectorizer that will track text as TF-IDF weights
tfidf_vectorizer = TfidfVectorizer()

# Let the vectorizer learn what tokens exist in the text data
tfidf_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = tfidf_vectorizer.transform(X_text)

# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.87
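
One caveat about the cells above: the vectorizer is fit on all of the reviews before cross validation, so the vocabulary (and, for TF-IDF, the document frequencies) is learned partly from the held-out folds. A common way to avoid this is to wrap the vectorizer and the model in a pipeline, so the vectorizer is refit on the training part of each fold. The following is a sketch on made-up toy reviews, not a rerun of the experiment above, so its AUC is not comparable to the numbers in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy reviews, only so that the sketch runs on its own
positive = ["great book loved it", "wonderful story", "loved the characters",
            "great read", "wonderful and moving"] * 2
negative = ["terrible book hated it", "boring story", "hated the characters",
            "dull read", "boring and terrible"] * 2
texts = positive + negative
labels = np.array([1] * len(positive) + [0] * len(negative))

# The pipeline takes raw text; the vectorizer is fit inside each training fold
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
aucs = cross_val_score(pipeline, texts, labels, scoring="roc_auc", cv=5)

print("Mean AUC on the toy data: " + str(round(np.mean(aucs), 3)))
```

With the real data, `texts` and `labels` would be `X_text` and `Y`.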


### Group work
#### Features
TF-IDF is looking pretty good! How about adding n-grams? Stop words? Lowercase transforming?

`CountVectorizer()` and `TfidfVectorizer()` can be modified to handle all of these things. Work in groups and try a few different combinations of these settings with any representation you want: binary counts, numeric counts, or TF-IDF weights. Here is how you would use these settings:

- `ngram_range=(1,2)`: would include unigrams and bigrams
- `stop_words="english"`: would use a standard set of English stop words
- `lowercase=False`: would turn off lowercase transformation (it is on by default)

You can use some of these like this:

`tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), lowercase=False)`
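
To get a feel for these settings, the sketch below fits a `CountVectorizer` on a single made-up sentence and lists the tokens each setting produces. The helper name `vocab` is hypothetical and exists only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The book was not good"]  # made-up sentence


def vocab(**settings):
    """Fit a CountVectorizer with the given settings and list its tokens."""
    return sorted(CountVectorizer(**settings).fit(docs).vocabulary_)


default_tokens = vocab()                      # lowercased unigrams
bigram_tokens = vocab(ngram_range=(1, 2))     # unigrams plus bigrams
no_stop_tokens = vocab(stop_words="english")  # common English words removed
cased_tokens = vocab(lowercase=False)         # "The" keeps its capital letter

print(default_tokens)
print(bigram_tokens)
print(no_stop_tokens)
print(cased_tokens)
```

Note that bigrams such as "not good" keep information that the individual words lose, while the English stop word list removes words such as "the".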

#### Models
You can also swap out the line that creates a logistic regression for one that creates a naive Bayes model. This is also one line:

`naive_bayes = BernoulliNB()`

You can then go ahead and use `naive_bayes` in place of `logistic_regression`.
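
One thing to keep in mind when pairing `BernoulliNB` with different vectorizers: with its default `binarize=0.0` setting it thresholds the input, so every positive value is treated as "token present". Counts (and TF-IDF weights) are therefore reduced to presence/absence inside the model. The sketch below checks this on a made-up toy corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up documents and labels
docs = ["good good book", "bad book", "good read", "bad bad bad read"]
labels = [1, 0, 1, 0]

X_counts = CountVectorizer().fit_transform(docs)
X_binary = CountVectorizer(binary=True).fit_transform(docs)

# Train one model on counts and one on binary features
proba_counts = BernoulliNB().fit(X_counts, labels).predict_proba(X_counts)
proba_binary = BernoulliNB().fit(X_binary, labels).predict_proba(X_binary)

# Both models produce the same predicted probabilities
same = np.allclose(proba_counts, proba_binary)
print(same)  # True
```

For this model, the choice between binary and count features with otherwise identical settings does not change the result; the n-gram, stop word and lowercase settings still do.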

In [None]:
# Work with your teams here!
# Try different features, models, or both!
# What is the highest AUC you can get?

In [31]:
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
count_vectorizer.fit(X_text)
X = count_vectorizer.transform(X_text)
logistic_regression = LogisticRegression()
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.856


In [22]:
count_vectorizer = CountVectorizer(ngram_range=(2, 3))
count_vectorizer.fit(X_text)
X = count_vectorizer.transform(X_text)
logistic_regression = LogisticRegression()
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.835


In [24]:
count_vectorizer = CountVectorizer(stop_words="english")
count_vectorizer.fit(X_text)
X = count_vectorizer.transform(X_text)
logistic_regression = LogisticRegression()
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.84


In [32]:
count_vectorizer = CountVectorizer(ngram_range=(1, 2),stop_words="english",lowercase=False)
count_vectorizer.fit(X_text)
X = count_vectorizer.transform(X_text)
logistic_regression = LogisticRegression()
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.847


In [33]:
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
count_vectorizer.fit(X_text)
X = count_vectorizer.transform(X_text)
naive_bayes = BernoulliNB()
aucs = cross_validation.cross_val_score(naive_bayes, X, Y, scoring="roc_auc", cv=5)
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.87


In [42]:
count_vectorizer = CountVectorizer(ngram_range=(1, 3))
count_vectorizer.fit(X_text)
X = count_vectorizer.transform(X_text)
naive_bayes = BernoulliNB()
aucs = cross_validation.cross_val_score(naive_bayes, X, Y, scoring="roc_auc", cv=5)
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.876


In [48]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf_vectorizer.fit(X_text)
X = tfidf_vectorizer.transform(X_text)
logistic_regression = LogisticRegression()
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.875


In [51]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
tfidf_vectorizer.fit(X_text)
X = tfidf_vectorizer.transform(X_text)
naive_bayes = BernoulliNB()
aucs = cross_validation.cross_val_score(naive_bayes, X, Y, scoring="roc_auc", cv=5)
print ("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.838


**My Answer**

After multiple attempts, the highest AUC I could get was 0.88, achieved with a TF-IDF vectorizer and an n-gram range of 1 to 3.

Some observations:
1. Turning the lowercase setting on or off barely changes the result (a difference of less than 0.01).
2. Adding the list of stop words usually decreases the AUC.
3. The AUC reaches a peak as the upper end of the n-gram range is tuned.

## Feature Engineering
We have examined two ways of dealing with categorical data: binarizing/dummy variables and numerical scaling. We will practice these here.

In [52]:
data = pd.read_csv("data/categorical.csv")

In [53]:
data

| | Minutes | Gender | Marital | Satisfaction | Churn |
|---|---|---|---|---|---|
| 0 | 100 | Male | Single | Low | 0 |
| 1 | 220 | Female | Married | Very Low | 0 |
| 2 | 500 | Female | Divorced | High | 1 |
| 3 | 335 | Male | Single | Neutral | 0 |
| 4 | 450 | Male | Married | Very High | 1 |


### Binarizing
Get a list of the features you want to binarize, then go through each feature and create a new feature for each of its levels. The last level is skipped because it is implied when all the other levels are 0.

In [54]:
features_to_binarize = ["Gender", "Marital"]

# Go through each feature
for feature in features_to_binarize:
    # Go through each level in this feature (except the last one!)
    for level in data[feature].unique()[0:-1]:
        # Create new feature for this level
        data[feature + "_" + level] = pd.Series(data[feature] == level, dtype=int)
    # Drop original feature
    data = data.drop([feature], axis=1)
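
pandas also has a built-in function for this, `pd.get_dummies()`. The sketch below rebuilds the small table shown above by hand (assuming the values are exactly as displayed) and binarizes the same two features. With `drop_first=True`, pandas drops the first level in sorted order, whereas the loop drops the last level in order of appearance. For this data both approaches happen to drop `Female` and `Divorced`.

```python
import pandas as pd

# The five rows shown above, typed in by hand
df = pd.DataFrame({
    "Minutes": [100, 220, 500, 335, 450],
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "Marital": ["Single", "Married", "Divorced", "Single", "Married"],
    "Satisfaction": ["Low", "Very Low", "High", "Neutral", "Very High"],
    "Churn": [0, 0, 1, 0, 1],
})

# One 0/1 column per level, minus one level per feature
encoded = pd.get_dummies(
    df, columns=["Gender", "Marital"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
print(encoded)
```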

In [55]:
data

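| | Minutes | Satisfaction | Churn | Gender_Male | Marital_Single | Marital_Married |
|---|---|---|---|---|---|---|
| 0 | 100 | Low | 0 | 1 | 1 | 0 |
| 1 | 220 | Very Low | 0 | 0 | 0 | 1 |
| 2 | 500 | High | 1 | 0 | 0 | 0 |
| 3 | 335 | Neutral | 0 | 1 | 1 | 0 |
| 4 | 450 | Very High | 1 | 1 | 0 | 1 |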


### Numeric scaling
We can also replace text levels with a numeric mapping that we create.

In [56]:
data['Satisfaction'] = data['Satisfaction'].replace(['Very Low', 'Low', 'Neutral', 'High', 'Very High'], 
                                                    [-2, -1, 0, 1, 2])
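
An alternative sketch uses `Series.map()` with an explicit dictionary, which keeps each level next to its value. Unlike `replace()`, `map()` turns any level missing from the dictionary into `NaN`, so a misspelled level is easy to spot. The series below is typed in by hand from the table above.

```python
import pandas as pd

# The Satisfaction column shown above, typed in by hand
satisfaction = pd.Series(["Low", "Very Low", "High", "Neutral", "Very High"])

# Each level sits next to the number it maps to
scale = {"Very Low": -2, "Low": -1, "Neutral": 0, "High": 1, "Very High": 2}
scaled = satisfaction.map(scale)

# Levels that are missing from the dictionary would show up as NaN
unmapped = int(scaled.isna().sum())

print(scaled.tolist())  # [-1, -2, 1, 0, 2]
print(unmapped)         # 0
```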

In [57]:
data

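| | Minutes | Satisfaction | Churn | Gender_Male | Marital_Single | Marital_Married |
|---|---|---|---|---|---|---|
| 0 | 100 | -1 | 0 | 1 | 1 | 0 |
| 1 | 220 | -2 | 0 | 0 | 0 | 1 |
| 2 | 500 | 1 | 1 | 0 | 0 | 0 |
| 3 | 335 | 0 | 0 | 1 | 1 | 0 |
| 4 | 450 | 2 | 1 | 1 | 0 | 1 |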
