# Assignment 3: Kaggle Assignment

## Info
* This assignment involves solving a data challenge on Kaggle.
* You will be graded based on the score you get on Kaggle.

## Setup

* Download [Anaconda Python 3.6](https://www.anaconda.com/download/) for a consistent environment.
* If you use a pip environment, make sure your code is compatible with the library versions provided within Anaconda's Python 3.6 distribution.

## Submission
* Make sure you submit all your code in a ZIP file on Learn. It must contain a text file (.txt) explaining your approach (just a paragraph or two, clearly explained in bullet points).


## Submission Notes
(Please write any notes here that you think I should know during marking)

In [53]:
import pandas as pd
import numpy as np
import nltk



In [54]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn import tree

from sklearn.model_selection import cross_val_score

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))



In [55]:
# Lowercases and tokenizes the text, drops stop words and non-alphabetic tokens,
# and stems the remaining words to their root form
def stem_str(text):
    stemmed = []
    for w in word_tokenize(text.lower()):
        if w not in stop_words and w.isalpha():
            stemmed.append(ps.stem(w))
    return " ".join(stemmed)

# Counts how often each word occurs in the 'stemmed_sms' column of a dataframe
def word_freq(df):
    word_frequency = {}
    for index,row in df.iterrows():
        for w in word_tokenize(row['stemmed_sms']):
            if w not in word_frequency:
                word_frequency[w] = 1
            else:
                word_frequency[w] += 1
    return word_frequency
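
The same counting can be sketched with `collections.Counter` from the standard library. This is only an illustration: the helper name `word_freq_counter` and the demo messages are made up here, and it splits on whitespace instead of calling `word_tokenize`, on the assumption that the stemmed strings are already space-separated alphabetic tokens.

```python
from collections import Counter


def word_freq_counter(stemmed_messages):
    """Count how often each token occurs across an iterable of stemmed strings.

    Hypothetical helper: splits on whitespace rather than using word_tokenize.
    """
    counts = Counter()
    for message in stemmed_messages:
        counts.update(message.split())
    return counts


# Made-up stemmed messages, not taken from spam.csv
demo_messages = ["free entri win", "win cash free", "ok lar"]
freq = word_freq_counter(demo_messages)

# The most frequent tokens can then be read off directly
print(freq.most_common(2))
```

With the notebook's dataframe this would be called as `word_freq_counter(data['stemmed_sms'])`.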

In [56]:
#reading in the data and renaming columns    
data = pd.read_csv('./spam.csv',encoding = "ISO-8859-1")
data.columns = ['category', 'text']

print(data)

     category                                               text
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
5        spam  FreeMsg Hey there darling it's been 3 week's n...
6         ham  Even my brother is not like to speak with me. ...
7         ham  As per your request 'Melle Melle (Oru Minnamin...
8        spam  WINNER!! As a valued network customer you have...
9        spam  Had your mobile 11 months or more? U R entitle...
10        ham  I'm gonna be home soon and i don't want to tal...
11       spam  SIX chances to win CASH! From 100 to 20,000 po...
12       spam  URGENT! You have won a 1 week FREE membership ...
13        ham  I've been searching for the right words to tha...
14        ham            

In [57]:
#replace ham and spam with 0 and 1
data['category'] = data['category'].replace(['ham','spam'],[0,1])

# .values is used instead of as_matrix(), which was removed in pandas 1.0
y = data['category'].values
X_text = data['text'].values
data['stemmed_sms'] = data['text'].apply(lambda x: stem_str(str(x)))
X_text_stem = data['stemmed_sms'].values



In [58]:
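# CountVectorizer alone gave better accuracy than TfidfVectorizer
sw = stopwords.words("english")
cv = CountVectorizer(stop_words=sw)

# NOTE: despite its name, X_stem holds the counts for the raw (unstemmed) text
X_stem = cv.fit_transform(X_text).toarray()

# X holds the counts for the stemmed text and is the matrix used for training;
# cv is refit here, so its vocabulary now reflects the stemmed text
X = cv.fit_transform(X_text_stem).toarray()

print(X_stem)
print(X)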



[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
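
The comparison between `CountVectorizer` and `TfidfVectorizer` mentioned in the comment above is not shown in the notebook. One way it could be checked is sketched below, under several assumptions: the toy corpus and labels are invented stand-ins for `spam.csv`, the helper `mean_cv_accuracy` is hypothetical, and a seeded decision tree with 2-fold cross-validation is used only to keep the example small. Wrapping the vectorizer and classifier in a pipeline means the vocabulary is learned from the training folds only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Invented toy corpus standing in for spam.csv (1 = spam, 0 = ham)
toy_texts = [
    "win a free prize now",
    "free cash claim your prize",
    "urgent winner call now",
    "free entry win cash",
    "are you coming home soon",
    "ok see you at lunch",
    "i will call you later",
    "thanks for the help today",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]


def mean_cv_accuracy(vectorizer):
    """Mean cross-validated accuracy of a decision tree on top of the vectorizer."""
    model = make_pipeline(vectorizer, DecisionTreeClassifier(random_state=0))
    scores = cross_val_score(model, toy_texts, toy_labels, cv=2, scoring="accuracy")
    return scores.mean()


results = {
    "count": mean_cv_accuracy(CountVectorizer()),
    "tfidf": mean_cv_accuracy(TfidfVectorizer()),
}
print(results)
```

On a corpus this small the two numbers say nothing about the real dataset; the point is only the shape of the comparison.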


In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


In [None]:
# Train two decision trees that differ only in their split criterion
clf_gini = tree.DecisionTreeClassifier(criterion="gini")
clf_entropy = tree.DecisionTreeClassifier(criterion="entropy")

# Gini criterion
print("Gini")
clf_gini.fit(X_train,y_train)
predtree = clf_gini.predict(X_test)

# First printed value: accuracy on the held-out test set
print(accuracy_score(y_test,predtree))

# Second printed value: precision, with spam (1) as the positive class
print(precision_score(y_test,predtree))

# Entropy criterion, same two metrics in the same order
print("entropy")
clf_entropy.fit(X_train,y_train)
predtree = clf_entropy.predict(X_test)
print(accuracy_score(y_test,predtree))
print(precision_score(y_test,predtree))


Gini
0.964114832536
0.905829596413
entropy
0.960526315789
0.918660287081
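
Each pair of numbers above is accuracy followed by precision. As a reminder of how the two differ, here is a small hand-rolled sketch in plain Python. The labels are made up, the helper name `accuracy_and_precision` is hypothetical, and returning 0.0 when nothing is predicted as spam is an assumption of this sketch.

```python
def accuracy_and_precision(y_true, y_pred):
    """Return (accuracy, precision) for binary labels where 1 = spam."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    correct = sum(1 for t, p in pairs if t == p)

    # Accuracy: share of all messages that were labelled correctly
    accuracy = correct / len(pairs)
    # Precision: share of predicted spam that really is spam
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


# Made-up labels: 6 ham and 4 spam messages
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

acc_demo, prec_demo = accuracy_and_precision(y_true_demo, y_pred_demo)
print(acc_demo, prec_demo)
```

Here 8 of 10 predictions are correct, while 3 of the 4 messages flagged as spam are truly spam, so the two metrics differ. For a spam filter, precision reflects how often a flagged message is a false alarm on ham.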


In [None]:
scores = []
cv_acc_scores = []
cv_prec_scores = []
print("Gini")
# perform 10-fold cross validation
acc_scores = cross_val_score(clf_gini, X_train, y_train, cv=10, scoring='accuracy')
prec_scores = cross_val_score(clf_gini, X_train, y_train, cv=10, scoring='precision')

scores.append([100, round(acc_scores.mean(), 6), round(prec_scores.mean(), 6)])
print(scores)

scores = []

print("entropy")
acc_scores = cross_val_score(clf_entropy, X_train, y_train, cv=10, scoring='accuracy')
prec_scores = cross_val_score(clf_entropy, X_train, y_train, cv=10, scoring='precision')

scores.append([100, round(acc_scores.mean(), 6), round(prec_scores.mean(), 6)])
print(scores)


Gini
[[100, 0.96306999999999998, 0.85955499999999996]]
entropy
