# Text Classification with scikit-learn

# Fire up packages

In [1]:
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
import pandas
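# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was deprecated and later removed
from sklearn.model_selection import train_test_split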
import numpy

# Load Data

In [2]:
product = pandas.read_csv('amazon_baby.csv')

In [3]:
product.head()

| Unnamed: 0 | name | review | rating |
|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 |


# Data preprocessing: Natural Language Processing

**This section shows how to preprocess text in Python. The standard data cleaning and feature extraction steps are built up below.**

## Step 1: Remove punctuation, stopwords and numbers

In [4]:
import re

**Remove punctuation and create word list**

In [5]:
#product['review']= product['review'].apply(lambda x:re.sub("[^a-zA-Z]"," ",str(x)))

In [6]:
#product['review']= product['review'].apply(lambda x: str(x).lower().split())

**Remove stopwords**

**We then remove stop words by keeping only the tokens that are not in the stop-word list**

In [7]:
import nltk
# The stop word corpus needs a one-time download: nltk.download('stopwords')
from nltk.corpus import stopwords # Import the stop word list

**We can wrap all of the steps above in a single function**

In [8]:
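def review_to_words(raw_review):
    # Replace everything that is not a letter with a space (str() guards against missing values)
    letters_only = re.sub("[^a-zA-Z]", " ", str(raw_review))
    # Lowercase and split into individual words
    words = letters_only.lower().split()
    # A set gives fast membership tests
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    # Join the remaining words back into one space-separated string
    return " ".join(meaningful_words)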
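To see what this cleaning does without downloading the NLTK corpus, here is a minimal sketch on a single sentence. The tiny `toy_stops` set and the name `clean_text` are hypothetical stand-ins for `stopwords.words("english")` and `review_to_words`; they are not part of the notebook.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
toy_stops = {"the", "a", "is", "and", "it", "my", "for", "this"}

def clean_text(raw_text, stops=toy_stops):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", str(raw_text))
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stops)

sample = "This is GREAT for my 2 kids, and it's cheap!"
cleaned = clean_text(sample)
print(cleaned)  # great kids s cheap
```

Note the stray `s` left over from `it's`: the regex splits contractions at the apostrophe, so such fragments survive unless the stop-word list covers them.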

## Step 2: Remove uninformative reviews

In [9]:
def review_to_words_ote(raw_review):
    # Same cleaning as review_to_words, but returns the word list so the words can be counted
    letters_only = re.sub("[^a-zA-Z]", " ", raw_review)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return meaningful_words

In [10]:
product['length']=product['review'].apply(lambda x: len(review_to_words_ote(str(x))))

In [11]:
product=product[product['length']>=10]
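The length filter can be tried on a toy frame first. This is a simplified sketch: the hypothetical helper `n_words` counts raw whitespace-separated tokens, whereas the notebook counts words after stop-word removal.

```python
import pandas as pd

# Toy data; the third row mimics a missing review.
toy = pd.DataFrame({
    "review": [
        "good",
        "love it so much we bought two more for friends today",
        None,
    ]
})

def n_words(text):
    """Hypothetical helper: number of whitespace-separated tokens."""
    return len(str(text).lower().split())

toy["length"] = toy["review"].apply(n_words)
kept = toy[toy["length"] >= 10]
print(kept)
```

Because missing values are passed through `str()`, they turn into a single token and are dropped by the same threshold, so no separate missing-value step is needed here.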

## Step 3: Create Response Variable

In [12]:
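# Drop neutral 3-star reviews; .copy() avoids a SettingWithCopyWarning when a column is added later
product=product[product['rating']!=3].copy()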

In [13]:
# 4-5 stars -> positive (1), 1-2 stars -> negative (0)
product['sentiment']=product['rating'].apply(lambda x: 1 if x>=4 else 0)
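The labelling rule can be checked on a toy ratings column. The sketch below uses a vectorized comparison instead of `apply`, which should give the same labels; the frame `toy` is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"rating": [1, 2, 3, 4, 5]})

# Drop the neutral rating, then label 4-5 as positive (1) and 1-2 as negative (0).
toy = toy[toy["rating"] != 3].copy()
toy["sentiment"] = (toy["rating"] >= 4).astype(int)
print(toy)
```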

## Step 4: Split data

In [14]:
train_data,test_data = train_test_split(product,test_size=0.2)
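The split above is random and unseeded, so accuracies will vary from run to run. A possible refinement, shown here on made-up data, is to fix `random_state` and to stratify on the label so both splits keep the same class balance. Whether this matters for the Amazon data is an assumption I have not checked.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced label: 15 positive, 5 negative.
toy = pd.DataFrame({
    "review": ["text %d" % i for i in range(20)],
    "sentiment": [1] * 15 + [0] * 5,
})

train_toy, test_toy = train_test_split(
    toy,
    test_size=0.2,
    random_state=0,             # reproducible split
    stratify=toy["sentiment"],  # keep the 75/25 class ratio in both parts
)
print(len(train_toy), len(test_toy))
print(test_toy["sentiment"].value_counts())
```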

## Step 5: Create Features

**Put cleaned review together**

In [15]:
# Clean every review in each split
train_clean_reviews = [review_to_words(review) for review in train_data['review']]
test_clean_reviews = [review_to_words(review) for review in test_data['review']]

**Create wordcount vector**

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(analyzer = "word")
# Learn the vocabulary on the training reviews only, then reuse it for the test reviews
train_features= v.fit_transform(train_clean_reviews)
test_features = v.transform(test_clean_reviews)

**Although the built-in word count is not as convenient as the one in GraphLab, `CountVectorizer` still provides a fast, high-quality word count.**
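A small sketch of what `fit_transform` and `transform` produce, using three made-up documents. It illustrates that the vocabulary comes from the training text only, so a word seen only at test time (`but` here) is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["great stroller great price", "broke after one week"]
test_docs = ["great price but broke"]

vec = CountVectorizer(analyzer="word")
X_train = vec.fit_transform(train_docs)  # learns the vocabulary and counts words
X_test = vec.transform(test_docs)        # reuses the learned vocabulary

# Columns are ordered alphabetically by vocabulary term.
print(sorted(vec.vocabulary_))
print(X_train.toarray())
print(X_test.toarray())
```

Both results are SciPy sparse matrices; `toarray()` is used here only because the example is tiny.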

# Fit the model

**We will fit the training set with five different classifiers: logistic regression, a decision tree, gradient boosting, a random forest, and linear discriminant analysis.**

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

Classifiers = [
    LogisticRegression(C=1e-9, solver='liblinear', max_iter=200),
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=200),
    GradientBoostingClassifier(n_estimators=200),
    LinearDiscriminantAnalysis(),
]

In [19]:
Accuracy = []
Model = []
for classifier in Classifiers:
    name = classifier.__class__.__name__
    try:
        fit = classifier.fit(train_features, train_data['sentiment'])
        pred = fit.predict(test_features)
        # accuracy_score expects (y_true, y_pred)
        accuracy = accuracy_score(test_data['sentiment'], pred)
        Accuracy.append(accuracy)
        Model.append(name)
        print(name + ' successful with accuracy of ' + str(accuracy))
    except Exception:
        print(name + ' failed')

LogisticRegression successful with accuracy of 0.842853524159
DecisionTreeClassifier successful with accuracy of 0.858621999873
RandomForestClassifier successful with accuracy of 0.880881514787
GradientBoostingClassifier failed
LinearDiscriminantAnalysis failed


**The last two models failed to run. For the sake of time, I will not track down the cause in this project. Among the models that did run, the random forest gave the highest test accuracy, so I will focus on tuning its parameters in the next step.**
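One plausible cause of failures like these, which I have not verified on this dataset, is the input format: `CountVectorizer` returns a SciPy sparse matrix, and `LinearDiscriminantAnalysis` expects a dense array (older scikit-learn releases may have had a similar restriction for gradient boosting). The sketch below reproduces the effect on random toy data and prints the exception type instead of swallowing it.

```python
import numpy as np
from scipy import sparse
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Random toy count data standing in for the word-count features.
rng = np.random.RandomState(0)
X_dense = rng.poisson(1.0, size=(40, 5)).astype(float)
y = np.array([0, 1] * 20)
X_sparse = sparse.csr_matrix(X_dense)

try:
    LinearDiscriminantAnalysis().fit(X_sparse, y)
    sparse_ok = True
except Exception as exc:
    sparse_ok = False
    # Reporting the exception makes the failure diagnosable.
    print("sparse input failed:", type(exc).__name__)

# The same data converted to a dense array fits without error.
LinearDiscriminantAnalysis().fit(X_sparse.toarray(), y)
dense_ok = True
```

Converting the full review matrix with `toarray()` may not fit in memory, so a dense model would probably need a reduced vocabulary first.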

# Parameter tuning with sklearn

In [None]:
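# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was deprecated and later removed
from sklearn.model_selection import GridSearchCV
rf = RandomForestClassifier(n_jobs=-1)
param_grid={
    'n_estimators': [200,700],
    # 'auto' meant the same as 'sqrt' for classifiers and is no longer accepted by recent releases
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None,3,5]
}
CV_rfc = GridSearchCV(estimator=rf, param_grid=param_grid, cv= 5)
CV_rfc.fit(train_features,train_data['sentiment'])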

In [None]:
print(CV_rfc.best_params_)
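After the search finishes, the fitted object also exposes the cross-validated score and can score held-out data directly. The sketch below shows this on a small synthetic dataset with a deliberately tiny grid so that it runs quickly; the data and grid values are made up and are not the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the review features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [None, 3], "max_features": ["sqrt", "log2"]},
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)   # best parameter combination
print(search.best_score_)    # mean cross-validated accuracy of that combination
test_accuracy = search.score(X_te, y_te)  # accuracy of the refit best model on held-out data
print(test_accuracy)
```

With the default `refit=True`, the best combination is retrained on the whole training set, which is why `search.score` can be called without fitting a new model.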