#### Kaggle competition: Bag of Words Meets Bags of Popcorn

Kaggle competition tutorials: https://www.kaggle.com/c/word2vec-nlp-tutorial

I have adapted this tutorial: http://fastml.com/classifying-text-with-bag-of-words-a-tutorial/


<b>Goal:</b> To experiment with text classification. Try out different models and vectorization.

<b>Overview:</b>
- use labeled data from this kaggle competition
- split data into train and test sets
- use different models (random forest, logistic regression)
- use different feature vectorizers (word count, TF-IDF)
- evaluate using AUC

Note:
TF-IDF stands for “term frequency-inverse document frequency”. It is a weighting scheme that emphasizes words that occur frequently in a given document while de-emphasizing words that occur in many documents.
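
To make the note above concrete, here is a minimal pure-Python sketch of the textbook weighting, tf × ln(N / df), on three made-up reviews. The sample sentences and the helper name `tfidf` are illustrative assumptions. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three tiny hypothetical "reviews", already cleaned and lower-cased
docs = [
    "the movie was great great fun",
    "the movie was boring",
    "the plot was thin",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return {word: tf * idf} for one document, with idf = ln(N / df)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:6s} {weight:.3f}")
```

Words that appear in every document ("the", "was") get a weight of zero, while "great", which is frequent in the first review and absent elsewhere, gets the highest weight.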

<b>Results:</b> 
(see end of notebook):

- random forest model is not ideal for high-dimensional sparse data
- logistic regression works better and faster
- TF-IDF gives a big improvement when keeping stopwords and using n-grams (compared with simple word count vectorization)

- better to remove stopwords when using random forest and word count
- better to keep stopwords when using TF-IDF


In [1]:
# All imports

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import roc_auc_score as AUC

import re
from bs4 import BeautifulSoup 
from nltk.corpus import stopwords


In [2]:
# Function to read data

def read_data():
    return pd.read_csv('data/labeledTrainData.tsv', header = 0, delimiter = "\t", quoting = 3 )


In [3]:
# Functions for cleaning reviews

stop_words = set(stopwords.words("english"))                                                   

def clean_include_stopwords(review):
    text = BeautifulSoup(review, 'lxml').get_text() 
    text = re.sub("[^a-zA-Z]", " ", text)       
    words = text.lower().split()
    return " ".join(words)   

def clean_exclude_stopwords(review):
    text = BeautifulSoup(review, 'lxml').get_text() 
    text = re.sub("[^a-zA-Z]", " ", text)       
    words = text.lower().split()
    words = [word for word in words if word not in stop_words]
    return " ".join(words)   

# Function to clean data 
def clean_data(data, cleaner):
    data['review'] = data['review'].apply(cleaner)    
    return data
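
The difference between the two cleaners can be seen on a single made-up review. This is only a rough sketch under stated assumptions: it uses a crude regex in place of `BeautifulSoup(...).get_text()`, and a tiny hypothetical stop-word set (`DEMO_STOP_WORDS`) in place of NLTK's English list, so that it runs with the standard library alone.

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list
DEMO_STOP_WORDS = {"the", "was", "not", "a", "it", "i"}

def demo_clean(review, remove_stopwords):
    # Crude stand-in for BeautifulSoup(...).get_text(): drop anything tag-like
    text = re.sub(r"<[^>]+>", " ", review)
    # Same steps as the notebook: letters only, lower-case, split on whitespace
    text = re.sub("[^a-zA-Z]", " ", text)
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in DEMO_STOP_WORDS]
    return " ".join(words)

raw = "The plot was NOT good.<br /><br />I rated it 3/10!"
kept = demo_clean(raw, remove_stopwords=False)
removed = demo_clean(raw, remove_stopwords=True)
print(kept)     # the plot was not good i rated it
print(removed)  # plot good rated
```

Note that removing stop words also removes the negation "not", which changes what the remaining words suggest about sentiment.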
    

In [4]:
# Function to split data into train and test sets

def split_train_test(data):
    train_i, test_i = train_test_split(np.arange(len(data)), train_size = 0.8, random_state = 44)
    train = data.iloc[train_i]
    test = data.iloc[test_i]
    return train, test


In [5]:
# Prepare data

data_include_stopwords = clean_data(read_data(), clean_include_stopwords)
data_exclude_stopwords = clean_data(read_data(), clean_exclude_stopwords)


In [40]:
# Function to analyse a model with a vectorizer

def analyse(vectorizer, model, remove_stopwords):
    
    data = data_exclude_stopwords if remove_stopwords else data_include_stopwords    
    train, test = split_train_test(data)
        
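    # Keep the features as sparse matrices: both models accept them, and a
    # dense copy from .toarray() can need several GB at 40000 features
    train_features = vectorizer[0].fit_transform(train['review'])
    model[0].fit(train_features, train["sentiment"])
    
    test_features = vectorizer[0].transform(test['review'])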
    predictions = model[0].predict_proba(test_features)
    
    return AUC(test['sentiment'].values, predictions[:,1] )
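
As a reminder of what the AUC returned above measures, here is a small pure-Python sketch: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name `pairwise_auc` and the sample labels and scores are illustrative assumptions. The loop is quadratic in the number of examples and is meant only for intuition, not as a replacement for `roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Made-up sentiment labels and predicted probabilities of the positive class
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc_value = pairwise_auc(labels, scores)
print(auc_value)  # 7 of the 9 pairs are ranked correctly
```

Because only the ordering of the scores matters, the metric is computed from `predict_proba` output rather than from hard class predictions.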


In [41]:
# Function to make vectorizers

def make_count_vectorizer(max_features):
    return CountVectorizer(max_features=max_features, analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None)
    
def make_tfidf_vectorizer(max_features):
    return TfidfVectorizer(max_features=max_features, ngram_range = (1, 3), sublinear_tf = True)
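
The two TF-IDF options used above can be illustrated without scikit-learn. The sketch below lists the word n-grams that `ngram_range = (1, 3)` corresponds to for one short phrase, and shows the `1 + ln(tf)` scaling that `sublinear_tf = True` applies to raw term counts. The helper names `word_ngrams` and `sublinear` and the sample phrase are assumptions made for illustration.

```python
import math

def word_ngrams(tokens, min_n=1, max_n=3):
    """All contiguous word n-grams with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def sublinear(tf):
    """Dampen a raw term count: 1 + ln(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

tokens = "was not very good".split()
grams = word_ngrams(tokens)
print(grams)

# A word seen 10 times counts for about 3.3, not 10 times, a single mention
scaled = [round(sublinear(tf), 2) for tf in (1, 2, 10)]
print(scaled)
```

With stop words kept, n-grams such as "not very good" are available as features. If "was", "not" and "very" were removed first, as NLTK's English list would do, only "good" would remain, which is one plausible reason why keeping stop words helps the n-gram TF-IDF setup.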


In [48]:
# Run analysis of various models and vectorization
# NOTE: THIS TAKES A LONG TIME TO RUN!

def results(m, v, s, a):
    # s is the remove_stopwords flag: "stopwords: True" means stop words were removed
    print("model: ", m[1], "\tvectorizer: ", v[1], "\tstopwords: ", s, "\tauc: ", a)


model_linear_regres = LR()  # LR is LogisticRegression (a classifier), not linear regression
model_random_forest = RandomForestClassifier(n_estimators = 100)

models = [(model_random_forest, "Random Forest", ), 
          (model_linear_regres, "Linear Regres", )]

vectorizers = []
max_features = [5000, 10000, 20000, 30000, 40000]
for m in max_features:
    vectorizers.append((make_count_vectorizer(m), "Count, \tmax features: " + str(m)))
for m in max_features:
    vectorizers.append((make_tfidf_vectorizer(m), "TFIDF, \tmax features: " + str(m)))

for m in models:
    for v in vectorizers:
        for s in [True, False]:
            results(m, v, s, analyse(v, m, s))


model:  Random Forest 	vectorizer:  Count, 	max features: 5000 	stopwords:  True 	auc:  0.916471261383
model:  Random Forest 	vectorizer:  Count, 	max features: 5000 	stopwords:  False 	auc:  0.916682627098
model:  Random Forest 	vectorizer:  Count, 	max features: 10000 	stopwords:  True 	auc:  0.920125840203
model:  Random Forest 	vectorizer:  Count, 	max features: 10000 	stopwords:  False 	auc:  0.918946288308
model:  Random Forest 	vectorizer:  Count, 	max features: 20000 	stopwords:  True 	auc:  0.924231071208
model:  Random Forest 	vectorizer:  Count, 	max features: 20000 	stopwords:  False 	auc:  0.918999809755
model:  Random Forest 	vectorizer:  Count, 	max features: 30000 	stopwords:  True 	auc:  0.925930397158
model:  Random Forest 	vectorizer:  Count, 	max features: 30000 	stopwords:  False 	auc:  0.919006129926
model:  Random Forest 	vectorizer:  Count, 	max features: 40000 	stopwords:  True 	auc:  0.925807353831
model:  Random Forest 	vectorizer:  Count, 	max features: 4000