# Homework 3 - Text Classification
Author: Kandarp Khandwala - kkhandw1

Text Mining

* Data: books.csv contains 2,000 Amazon book reviews. Each row represents a review for one book. The data set contains two columns: the first column (contained in quotes) is the review text. The second column is a binary label indicating if the review is positive or negative.

* Tasks: Described below


In [6]:
import string

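# NumPy and pandas for reading and handling the data
import numpy as np
import pandas as pd

# Requires the NLTK stop word corpus; run nltk.download('stopwords') once if it is missing
from nltk.corpus import stopwords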
from sklearn.feature_extraction.text import CountVectorizer

## Text classification
We are going to look at some Amazon reviews and classify them as positive or negative.

### Data
The file `books.csv` contains 2,000 Amazon book reviews. The data set contains two features: the first column (contained in quotes) is the review text. The second column is a binary label indicating if the review is positive or negative.

Let's take a quick look at the file.

In [2]:
!head -3 books.csv

review_text,positive
"THis book was horrible.  If it was possible to rate it lower than one star i would have.  I am an avid reader and picked this book up after my mom had gotten it from a friend.  I read half of it, suffering from a headache the entire time, and then got to the part about the relationship the 13 year old boy had with a 33 year old man and i lit this book on fire.  One less copy in the world...don't waste your money.I wish i had the time spent reading this book back so i could use it for better purposes.  THis book wasted my life",0
"I like to use the Amazon reviews when purchasing books, especially alert for dissenting perceptions about higly rated items, which usually disuades me from a selection.  So I offer this review that seriously questions the popularity of this work - I found it smug, self-serving and self-indulgent, written by a person with little or no empathy, especially for the people he castigates. For example, his portrayal of the family therapist see

Let's read the data into a pandas data frame. You'll notice two new arguments to `pd.read_csv()` that we've never seen before. The first, `quotechar`, tells pandas which character is used to "encapsulate" the text fields. Since our review text is surrounded by double quotes, we let pandas know. We write the value as `"\""` because the double quote also delimits the Python string; this backslash is known as an escape character. The second, `escapechar`, lets pandas know that the file uses a backslash as its escape character too.

In [3]:
data = pd.read_csv("books.csv", quotechar="\"", escapechar="\\")
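To see what the two options do, here is a minimal sketch on a made-up two-row sample. The sample text is hypothetical and not taken from `books.csv`; it assumes embedded quotes are escaped with a backslash, as the call above implies.

```python
import io

import pandas as pd

# Hypothetical sample: the first review contains quotes escaped with a backslash.
sample = io.StringIO(
    'review_text,positive\n'
    '"She called it \\"a triumph\\", and I agree.",1\n'
    '"Dull, slow, and far too long.",0\n'
)

demo = pd.read_csv(sample, quotechar='"', escapechar="\\")

# The commas and the escaped quotes stay inside the first column.
print(demo.shape)
print(demo.loc[0, "review_text"])
```

Without `escapechar`, the backslashes would not be treated as escapes, so the embedded quotes could end the field early.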

In [4]:
data.head()

Unnamed: 0,review_text,positive
0,THis book was horrible. If it was possible to...,0
1,I like to use the Amazon reviews when purchasi...,0
2,THis book was horrible. If it was possible to...,0
3,"I'm not sure who's writing these reviews, but ...",0
4,I picked up the first book in this series (The...,0


### Task 1: Preprocessing the text (25 points)

Change text to lower case and remove stop words, then transform the raw text collection into a matrix of token counts.

Hint: sklearn's function CountVectorizer has built-in options for these operations. Refer to http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html for more information.
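As a rough illustration of the built-in options the hint mentions, the sketch below runs `CountVectorizer` on a hypothetical two-sentence corpus (not the review data). It uses scikit-learn's own English stop word list, which differs from the NLTK list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, used only for illustration.
docs = [
    "This book was horrible.",
    "I loved this book, a wonderful read!",
]

# lowercase=True is the default; stop_words="english" selects scikit-learn's built-in list.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
counts = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # tokens kept after lowercasing and stop word removal
print(counts.toarray())                # one row per document, one column per token
```

Punctuation is dropped by the default token pattern, so no separate cleaning step is needed in this variant.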

In [7]:
# Build the stop word set once; calling stopwords.words() for every token is slow
STOP_WORDS = set(stopwords.words('english'))

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Removes all stop words
    3. Returns the remaining words as a list of tokens
    """
    # Keep only the characters that are not punctuation, then rejoin them into a string
    nopunc = ''.join(char for char in mess if char not in string.punctuation)

    # Drop stop words (compared in lower case); the kept tokens retain their original case
    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]


In [11]:
data['review_text'].head(5).apply(text_process)

0    [book, horrible, possible, rate, lower, one, s...
1    [like, use, Amazon, reviews, purchasing, books...
2    [book, horrible, possible, rate, lower, one, s...
3    [Im, sure, whos, writing, reviews, read, repit...
4    [picked, first, book, series, Eyre, Affair, ba...
Name: review_text, dtype: object

In [22]:
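# Note: with a callable analyzer, CountVectorizer skips its own preprocessing,
# so lowercase=True has no effect here; text_process decides what the tokens look like
tokenizer = CountVectorizer(lowercase=True, analyzer=text_process)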
messages = tokenizer.fit_transform(data['review_text'])
messages

<2000x30903 sparse matrix of type '<class 'numpy.int64'>'
	with 147810 stored elements in Compressed Sparse Row format>

In [29]:
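# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print('Sample Feature names: ', tokenizer.get_feature_names_out()[-10:].tolist())
print('Shape of Sparse Matrix: ', messages.shape)
print('Amount of Non-Zero occurences: ', messages.nnz)
# This ratio is the share of non-zero cells (the density), printed under the label "Sparsity"
print('Sparsity: %.2f%%' % (100.0 * messages.nnz / (messages.shape[0] * messages.shape[1])))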

Sample Feature names:  ['zillion', 'zing', 'zinger', 'zip', 'zipped', 'zit', 'zombie', 'zombies', 'zone', 'zooming']
Shape of Sparse Matrix:  (2000, 30903)
Amount of Non-Zero occurences:  147810
Sparsity: 0.24%
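The percentage printed above is the share of cells that hold a non-zero count, which is usually called the density; the sparsity in the strict sense is its complement. A small worked check, using only the figures from the output above:

```python
# Figures copied from the output above.
n_rows, n_cols = 2000, 30903
nnz = 147810

density = nnz / (n_rows * n_cols)  # fraction of cells that are non-zero
sparsity = 1.0 - density           # fraction of cells that are zero

print(f"Density:  {100 * density:.2f}%")
print(f"Sparsity: {100 * sparsity:.2f}%")
```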


### Task 2: Build a logistic regression model using token counts (25 points)

Build a logistic regression model using the token counts from task 1. Perform a 5-fold cross-validation (train-test ratio 80-20), and compute the mean AUC (Area under Curve).

In [48]:
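# KFold lives in sklearn.model_selection; the old cross_validation module has been removed
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)  # Define the split into 5 folds (80% train, 20% test each)
# split() only needs the number of rows, so the sparse matrix can be passed directly
for train_index, test_index in kf.split(messages):
    print('TRAIN:', len(train_index), 'TEST:', len(test_index))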

TRAIN: 1600 TEST: 400
TRAIN: 1600 TEST: 400
TRAIN: 1600 TEST: 400
TRAIN: 1600 TEST: 400
TRAIN: 1600 TEST: 400
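The cell above only reports the fold sizes. One possible way to finish the task is sketched below; it is not necessarily the intended solution. The helper `mean_cv_auc` and the toy reviews are hypothetical, and the sketch swaps plain `KFold` for stratified, shuffled folds, on the assumption that unshuffled folds could end up with a single class if the file is ordered by label.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def mean_cv_auc(X, y, n_splits=5, seed=0):
    """Hypothetical helper: mean ROC AUC of logistic regression over stratified folds."""
    # Stratified, shuffled folds keep both classes in every test fold,
    # which ROC AUC needs in order to be defined.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    return scores.mean(), scores


# Toy stand-in for the review data (hypothetical: 10 positive, 10 negative).
good = ["great", "wonderful", "excellent", "brilliant", "lovely"]
bad = ["horrible", "boring", "awful", "dull", "tedious"]
toy_reviews = [f"a {w} book, really {w}" for w in good + good + bad + bad]
toy_labels = np.array([1] * 10 + [0] * 10)

toy_counts = CountVectorizer().fit_transform(toy_reviews)
mean_auc, fold_aucs = mean_cv_auc(toy_counts, toy_labels)
print("AUC per fold:", fold_aucs)
print("Mean AUC:", mean_auc)
```

In the notebook itself, the equivalent call would be `mean_cv_auc(messages, data['positive'])`.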


### Task 3: Build a logistic regression model using TFIDF (25 points)

Transform the training data into a TFIDF matrix, and use it to build a new logistic regression model. Again, perform a 5-fold cross-validation, and compute the mean AUC.

Hint: Similar to CountVectorizer, sklearn's TfidfVectorizer function can do all the transformation work for you. Don't forget to use the stop_words option.
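A minimal sketch of how this task could be approached, under stated assumptions: the toy reviews are hypothetical, the stop word list is scikit-learn's built-in one, and the folds are stratified and shuffled so every test fold contains both classes. Placing the vectorizer inside a pipeline means the TFIDF weights are refitted on each training fold only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for data['review_text'] and data['positive'].
good = ["great", "wonderful", "excellent", "brilliant", "lovely"]
bad = ["horrible", "boring", "awful", "dull", "tedious"]
toy_reviews = [f"this was a {w} book, really {w}" for w in good * 2 + bad * 2]
toy_labels = [1] * 10 + [0] * 10

# Vectorizer and classifier are fitted together on the training part of each fold.
tfidf_model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tfidf_aucs = cross_val_score(tfidf_model, toy_reviews, toy_labels, cv=cv, scoring="roc_auc")
print("AUC per fold:", tfidf_aucs)
print("Mean AUC (TFIDF):", tfidf_aucs.mean())
```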

### Task 4: Build a logistic regression model using TFIDF over n-grams (25 points)

We still want to use the TFIDF matrix, but instead of using TFIDF over single tokens, this time we want to go further and use TFIDF values of both 1-gram and 2-gram tokens. Then use this new TFIDF matrix to build another logistic regression model. Again, perform a 5-fold cross-validation, and compute the mean AUC.

Hint: You can configure the n-gram range using an option of the TfidfVectorizer function.
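The option in question is `ngram_range`. The sketch below shows its effect on a hypothetical two-sentence corpus, assuming scikit-learn's built-in English stop word list; the resulting matrix can be passed to a logistic regression in the same way as in the previous tasks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only for illustration.
docs = ["This book was horrible", "A wonderful book"]

# ngram_range=(1, 2) keeps single tokens and pairs of adjacent tokens.
bigram_tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
weights = bigram_tfidf.fit_transform(docs)

print(sorted(bigram_tfidf.vocabulary_))  # unigrams and bigrams together
print(weights.shape)                     # (number of documents, number of n-grams)
```

Stop words are removed before the pairs are formed, so "book horrible" shows up as a 2-gram even though "was" sat between the two words in the original sentence.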