#  Kaggle Challenge
### Nic Mon and Michelle Carney
For this assignment, we used the scikit-learn tutorial from class, along with the CountVectorizer and SVM documentation. Nic tried to build an SVM predictor, but it took too long to train and kept killing his kernel. Nic also tried extracting synsets, but running a synset function on every review was too slow to be worthwhile.

We decided to go with the logistic regression model because it took the least time to run, which let us test more options, and because it gave us the best prediction (0.86 right away, compared to 0.30 on the first try for other methods). We experimented with CountVectorizer and TfidfVectorizer, testing different hypotheses (see below for how well each predicted). The vectorizers that predicted the test data best were trained on unigrams, used a token pattern that drops numbers and punctuation, removed stopwords, and used a minimum document frequency (min_df) of 2.
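
To make the token-pattern change concrete, here is a minimal sketch using only Python's re module. It applies the two patterns used in the cells below to a made-up review (the sample sentence is an assumption for illustration, not taken from the Yelp data) and lowercases the text first, as the vectorizers do by default.

```python
import re

# Hypothetical review text, invented for illustration
review = "Great pizza, 2 slices for $5 at 9pm!"

WORD_PATTERN = r'\b\w+\b'            # keeps digits and mixed tokens like "9pm"
NO_DIGIT_PATTERN = r'\b[^\d\W]+\b'   # keeps only tokens made of letters/underscores

def tokenize(text, pattern):
    """Lowercase the text and return every match of the token pattern."""
    return re.findall(pattern, text.lower())

with_digits = tokenize(review, WORD_PATTERN)
without_digits = tokenize(review, NO_DIGIT_PATTERN)
print(with_digits)
print(without_digits)
```

A token that contains any digit, such as 9pm, is dropped entirely rather than trimmed, because the word boundaries have to enclose the whole run of word characters.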

We found that SVM took too long - we couldn't get it to work without the kernel dying. We also found that reducing the df allowed for more features in the vectorizer, and that increased our accuracy score.
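
The link between the df threshold and the number of features can be sketched in plain Python. The example below is only an illustration under simple assumptions (whitespace tokenization, four invented reviews); it counts how many documents each term appears in and keeps the terms that reach the threshold, which is the idea behind the min_df parameter.

```python
from collections import Counter

# Hypothetical mini-corpus, invented for illustration
docs = [
    "great pizza and great service",
    "terrible service",
    "pizza was cold",
    "friendly staff and fast service",
]

def vocabulary(docs, min_df):
    """Return the sorted terms that appear in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(term for term, n in doc_freq.items() if n >= min_df)

# Vocabulary size for each threshold
sizes = {min_df: len(vocabulary(docs, min_df)) for min_df in (1, 2, 3)}
print(sizes)
```

Lowering the threshold never shrinks the vocabulary, so a smaller min_df always gives the model at least as many features to work with.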

Michelle worked on getting the sklearn tutorial from class to work with the data, getting the CountVectorizer and TfidfVectorizer to remove stopwords and certain strings (using regex), running naive Bayes and logistic regression on our training set, and printing the output for the class.

Nic worked on training the logistic regression using the whole training set, trying to build synset feature functions, trying to get SVM to work, looking at the text of reviews where our predictors were wrong, and reducing the df to add more features, which yielded a better prediction.


In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import TreebankWordTokenizer
import re
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


df = pd.read_csv("yelp_data_official_training.csv", low_memory=False, delimiter="|")
df_test = pd.read_csv("yelp_data_official_test_nocategories.csv", low_memory=False, delimiter="|")

In [2]:
#df_test

In [11]:
random_index = np.random.permutation(df.index)
random_index[:10]
# .ix was removed from pandas; .loc does the same label-based selection here
df.loc[random_index, ['ID', 'Category', 'Review Text']][:5]
df_shuffled = df.loc[random_index, ['ID', 'Category', 'Review Text']]
df_shuffled.reset_index(drop=True, inplace=True)

In [17]:
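rows, columns = df_shuffled.shape
print("Rows:", rows)
print("Columns:", columns)
#train_size = round(rows*.6)
train_size = round(rows*.9)
#dev_size   = round(rows*.2)
dev_size   = round(rows*.1)
# .iloc slices exclude the end position, so the splits do not share a boundary row
# (.loc slices include both endpoints, which put row train_size in both train and dev)
df_train = df_shuffled.iloc[:train_size]
df_train.shape
df_dev = df_shuffled.iloc[train_size:dev_size+train_size].reset_index(drop=True)
df_dev.shape
# note: this overwrites the df_test loaded from the test CSV above; with a 90/10 split it is empty
df_test = df_shuffled.iloc[dev_size+train_size:].reset_index(drop=True)
df_test.shape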

Rows: 48001
Columns: 3


(0, 3)
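
The shuffle-and-split step above can be sketched without pandas. This is a simplified stand-in, not the notebook's exact code: the function name, the seed, and the two-way split are assumptions for illustration. It shows why position-based slices give disjoint sets, whereas pandas .loc label slices include both endpoints.

```python
import random

def split_indices(n_rows, train_fraction, seed=0):
    """Shuffle row positions and split them into disjoint train and dev lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded so the split is repeatable
    train_size = round(n_rows * train_fraction)
    # Python slices exclude the end position, so no row lands in both lists
    return positions[:train_size], positions[train_size:]

train_idx, dev_idx = split_indices(10, 0.9)
print(len(train_idx), len(dev_idx))
```

Taking the dev set as "everything after the training rows" also guarantees that no row is silently dropped by rounding.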

# Here's a bunch of things we tested.....
### Let's make a vectorizer to get the array of features! (or let's try a bunch of different vectorizers!)

In [51]:
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', analyzer=u'word', min_df=5)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##running this got accuracy score of 0.88699093844391208
##On the kaggle leaderboard, we got 0.87028
###with 1 df on entire training set 0.87833

In [None]:
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', analyzer=u'word', stop_words = 'english', min_df=5)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
#testing w removed stop_words
##running this got accuracy score of 0.89

In [55]:
vec = CountVectorizer(ngram_range=(1, 1), token_pattern=r'\b\w+\b', analyzer=u'word', stop_words = 'english', min_df=5)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##testing only unigrams accuracy score of 0.89636496198312676

In [None]:
vec = CountVectorizer(ngram_range=(1, 1), token_pattern=r'\b\w+\b', analyzer=u'word', stop_words = 'english', min_df=5)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##testing only unigrams, removed stopwords accuracy score of 0.89709405270284348

In [63]:
vec = CountVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=5)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##testing only unigrams, removed stopwords, remove numbers from token pattern accuracy score of 0.89740652015415057

In [84]:
vec = CountVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=3)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##testing only unigrams, removed stopwords, remove numbers from token pattern, df=3 accuracy score of 0.8978231434225602

In [90]:
vec = CountVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=2)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##testing only unigrams, removed stopwords, remove numbers from token pattern, df=2 accuracy score of 0.89803145505676496

In [94]:
vec = CountVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=1)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##testing only unigrams, removed stopwords, remove numbers from token pattern, df=1 accuracy score of 0.89813561087386728

### tfidf vectorizer
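
Before the cells, a rough sketch of what TF-IDF weighting does may help. The code below is a hand-rolled illustration on an invented three-review corpus, not the library's implementation. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn documents as its default behaviour.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized reviews, invented for illustration
docs = [
    ["pizza", "great", "service"],
    ["pizza", "cold"],
    ["pizza", "great", "staff"],
]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

vectors = tfidf(docs)
print(vectors[0])
```

In the first review every term occurs once, so the weights differ only through idf: "service" (one document) outweighs "great" (two documents), which outweighs "pizza" (all three).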

In [89]:
vec = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', stop_words=None, analyzer=u'word', min_df=5)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##running this got accuracy score of 0.87751275908759507

In [47]:
vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r'\b\w+\b', stop_words=None, analyzer=u'word', min_df=5)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##try only unigrams accuracy score of 0.89303197583585048

In [72]:
vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=5)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##try only unigrams, remove stop words and numbers, accuracy score of 0.89542755962920528

In [76]:
vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=7)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##try only unigrams, remove stop words and numbers, df=7 accuracy score of 0.89511509217789809

In [6]:
vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=2)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##try only unigrams, remove stop words and numbers, df=3 accuracy score of 0.93917300281220706
#with full training set, kaggle score 0.87972

In [18]:
vec = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=2)
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##try unigrams bigrams, remove stop words and numbers, df=2 accuracy score of 0.94042287261743573
## try unigrams bigrams remove stop words and numbers df = 2, train size .6 0.88792834079783356

In [None]:
vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=1)
#Highest predictor of 0.88028
df_train = df_train.fillna("")
df_dev = df_dev.fillna("")
df_test = df_test.fillna("")
##try unigrams, remove stop words and numbers, df=1 Kaggle score of 0.88028


### Let's make an array of features and test it on the dev set

In [None]:
arr_train_feature_sparse = vec.fit_transform(df_train['Review Text'])
arr_train_feature_sparse
arr_train_feature = arr_train_feature_sparse.toarray()
feature_labels = vec.get_feature_names()

In [None]:
arr_dev_feature_sparse = vec.transform(df_dev["Review Text"])
arr_dev_feature = arr_dev_feature_sparse.toarray()

### Let's test this with logistic regression on the dev set and get the accuracy score!

In [16]:
logreg = LogisticRegression()
logreg_model = logreg.fit(arr_train_feature, df_train['Category']) #defining features (from reviews) and passing in Category label
logreg_predictions = logreg_model.predict(arr_dev_feature)
accuracy_score(df_dev['Category'], logreg_predictions)

0.88792834079783356
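
For reference, the accuracy score reported above is the fraction of dev reviews whose predicted category matches the true one. A minimal sketch of that calculation, with made-up labels and a hypothetical helper name:

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that exactly match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labels, invented for illustration
true_labels = [3, 3, 2, 4, 1]
predicted_labels = [3, 2, 2, 4, 4]
score = accuracy(true_labels, predicted_labels)
print(score)
```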

### now let's run this on the test data!

In [None]:
arr_test_feature_sparse = vec.transform(df_test["Review Text"]) #change to test
arr_test_feature = arr_test_feature_sparse.toarray()

In [139]:
# reuse the model fitted on the training split above
logreg_predictions_test = logreg_model.predict(arr_test_feature)

In [69]:
logreg_predictions_test

array([3, 3, 2, ..., 4, 4, 4])

# This is our winning predictor

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
import csv, sqlite3
from string import punctuation
from nltk import word_tokenize
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from collections import defaultdict
from nltk.corpus import stopwords

def read_file():
    # pass the path directly; "|" is the column delimiter
    data = pd.read_csv("yelp_data_official_training.csv", delimiter="|")
    return data
data = read_file()

def create_data_sets(data):
    random_index = np.random.permutation(data.index)
    data_shuffled = data.loc[random_index]  # .ix was removed from pandas
    data_shuffled.reset_index(drop=True, inplace=True)
    rows, columns = data_shuffled.shape
    train_size = round(rows*.75)
    # .iloc slices exclude the end position, so train and dev do not overlap
    train_data = data_shuffled.iloc[:train_size]
    # the remaining 25% of rows become the dev set
    dev_data = data_shuffled.iloc[train_size:].reset_index(drop=True)
    return train_data, dev_data

train_data, dev_data = create_data_sets(data)

### Let's test on our dev data and return where our model got it wrong

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import TreebankWordTokenizer
import re

vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=2)

tokenizer = vec.build_tokenizer()

train_data = train_data.fillna("")
dev_data = dev_data.fillna("")

arr_train_feature_sparse = vec.fit_transform(train_data['Review Text'])
arr_train_feature = arr_train_feature_sparse.toarray()

arr_dev_feature_sparse = vec.transform(dev_data['Review Text'])
arr_dev_feature = arr_dev_feature_sparse.toarray()

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression()
logreg_model = log_reg.fit(arr_train_feature, train_data['Category'])
logreg_predictions = logreg_model.predict(arr_dev_feature)

for real, pred, text in zip(dev_data['Category'], logreg_predictions, dev_data['Review Text']):
    if real != pred:
        print(real, pred)
        print(text)
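
Reading every misclassified review is slow, so it can help to tally which (real, predicted) pairs occur most often first. The sketch below uses invented labels rather than our model's output, and the variable names are placeholders.

```python
from collections import Counter

# Hypothetical true and predicted categories, invented for illustration
real = [1, 2, 3, 2, 4, 5, 2]
pred = [1, 3, 3, 3, 4, 4, 2]

# Count each (real, predicted) pair where the model was wrong
confusions = Counter((r, p) for r, p in zip(real, pred) if r != p)

for (r, p), count in confusions.most_common():
    print("real", r, "predicted", p, "count", count)
```

The most frequent pairs point to the categories worth inspecting by hand.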

### Now let's use our whole training data to train the log reg and predict the test data

In [None]:
def read_test_file():
    # pass the path directly; "|" is the column delimiter
    data = pd.read_csv("yelp_data_official_test_nocategories.csv", delimiter="|")
    return data

test_data = read_test_file()

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import TreebankWordTokenizer
import re

vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r'\b[^\d\W]+\b', analyzer=u'word', stop_words = 'english', min_df=1)
#Highest predictor of 0.88028

tokenizer = vec.build_tokenizer()

data = data.fillna("")
test_data = test_data.fillna("")

arr_train_feature_sparse = vec.fit_transform(data['Review Text'])
arr_train_feature = arr_train_feature_sparse.toarray()

arr_test_feature_sparse = vec.transform(test_data['Review Text'])
arr_test_feature = arr_test_feature_sparse.toarray()

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression()
logreg_model = log_reg.fit(arr_train_feature, data['Category'])
logreg_predictions_test = logreg_model.predict(arr_test_feature)

## Let's create a CSV to submit!

In [70]:
# use the same frame the predictions were generated from, so the rows line up
test_data['Category'] = logreg_predictions_test
cols = ["ID", "Category"]
finalsubmission = test_data[cols]
finalsubmission.to_csv('yelp_data_official_test_submission.csv', index = False)
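
The submission file is just a header row followed by one ID,Category pair per test review. As a rough sketch of that format using only the standard library (the function name and the length check are additions for illustration, and an in-memory buffer stands in for the real file):

```python
import csv
import io

def write_submission(ids, predictions, out_file):
    """Write one ID,Category row per prediction, with a header row."""
    if len(ids) != len(predictions):
        raise ValueError("need exactly one prediction per ID")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["ID", "Category"])
    writer.writerows(zip(ids, predictions))

buffer = io.StringIO()  # stands in for an open file in this sketch
write_submission([0, 1, 2], [3, 3, 2], buffer)
print(buffer.getvalue())
```

Checking the lengths before writing catches the case where the predictions came from a different frame than the IDs.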

In [71]:
finalsubmission.tail()

|       | ID    | Category |
|-------|-------|----------|
| 11995 | 11995 | 4        |
| 11996 | 11996 | 4        |
| 11997 | 11997 | 4        |
| 11998 | 11998 | 4        |
| 11999 | 11999 | 4        |
