### CS183 Data Science Winter 2018 - Copyright @ Dr. Sukanya Manna ###
## Homework 3 ##
**Due - 03/07/18 midnight**
<br>
The main aim of this homework is to make you conversant with two classifiers, Naive Bayes and logistic regression, using scikit-learn.
<br> You are given a collection of spam and non-spam (known as ham) text messages. You have to use Naive Bayes and logistic regression to classify whether a given text message is spam or ham. 

**Submission:** Please complete this ``ipynb`` file and upload it to Camino under Assignments->hw3. Make sure your file has your ``name`` and ``email`` at the top. Comment every step you work on.

**Honor Code:** I encourage students to discuss the programming assignments, including the specific algorithms and data structures they require. However, students should not share any source code for solutions.

Code exists on the web for many problems including some that we may pose in problem sets or assignments. Students are expected to come up with the answers on their own, rather than extracting them from code on the web. This also means that we ask that you do not share your solutions to any of the homework, programming assignments, or problem sets with any other students. This includes any sort of sharing, whether face to face, by email, uploading onto public sites, etc. Doing so will drastically detract from the learning experience of your fellow students.

*Please note that you are not allowed to share this homework or its code anywhere (GitHub or any other repository!)*

## Part 1: Warming up
**Get conversant with different functionalities of scikit-learn for processing texts.**

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
# here is a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [20]:
# learn the 'vocabulary' of the training data
# for this, look for CountVectorizer(), fit, and get_feature_names()
# TODO
vect = CountVectorizer()
vect.fit(simple_train)
vect.get_feature_names()

[u'cab', u'call', u'me', u'please', u'tonight', u'you']

In [21]:
# transform training data into a 'document-term matrix'
# for this look for transform()
# TODO
dtm = vect.transform(simple_train)
dtm

<3x6 sparse matrix of type '<type 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [22]:
# print the sparse matrix
# TODO
print(dtm)

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


In [23]:
# convert sparse matrix to a dense matrix
# TODO
print(dtm.toarray())


[[0 1 0 0 1 1]
 [1 1 1 0 0 0]
 [0 1 1 2 0 0]]
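
The sparse and dense printouts above describe the same matrix. As a minimal sketch in plain Python (not how SciPy or scikit-learn store the data internally), the `(row, column)  count` entries from the sparse printout can be expanded into the dense rows by hand. The helper name `to_dense` is made up for this illustration.

```python
# (row, column, count) triplets copied from the sparse printout above
triplets = [
    (0, 1, 1), (0, 4, 1), (0, 5, 1),
    (1, 0, 1), (1, 1, 1), (1, 2, 1),
    (2, 1, 1), (2, 2, 1), (2, 3, 2),
]

def to_dense(triplets, n_rows, n_cols):
    """Expand (row, col, value) triplets into a list of dense rows."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in triplets:
        dense[row][col] = value
    return dense

dense = to_dense(triplets, n_rows=3, n_cols=6)
for row in dense:
    print(row)
```

Only 9 of the 18 cells are stored in the sparse form; every cell that is not listed is zero.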


In [24]:
# examine the vocabulary and document-term matrix together
# you need to import pandas first
# and then create a DataFrame with the dense matrix you created, 
# where columns will have feature names, rows will begin from 0 index
import pandas as pd
# TODO

pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

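|   | cab | call | me | please | tonight | you |
|---|-----|------|----|--------|---------|-----|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |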


*Read this:* [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
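
To make the "ignoring the relative position" point concrete, here is a small sketch that counts tokens with the standard library only. It assumes the same tokenization rule as `CountVectorizer`'s defaults (lowercase the text, keep tokens of two or more word characters), which is also why the one-letter word "a" never appeared in the vocabulary above. The function name `bag_of_words` is made up for this illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, keep tokens of 2+ word characters, and count them."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

a = bag_of_words("Call me a cab")
b = bag_of_words("cab me call, A")

print(sorted(a.items()))
print(a == b)   # True: same words, different order, same representation
```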

In [25]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me"] # A sample test data
# transform the test data to an array as you did previously for train data
# TODO
dtm2 = vect.transform(simple_test)
print(dtm2.toarray())

[[0 1 1 1 0 0]]


In [26]:
# examine the vocabulary and document-term matrix together
# hint: use a pandas DataFrame to represent the test data as a document-term matrix
# TODO
pd.DataFrame(dtm2.toarray(), columns=vect.get_feature_names())

|   | cab | call | me | please | tonight | you |
|---|-----|------|----|--------|---------|-----|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |


**Summary:**

- `vect.fit(train)` learns the vocabulary of the training data
- `vect.transform(train)` uses the fitted vocabulary to build a document-term matrix from the training data
- `vect.transform(test)` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
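
The three points in the summary can be sketched with a toy vectorizer written in plain Python. This is only an illustration of the fit/transform contract, not scikit-learn's implementation; the class name `TinyVectorizer` is made up, and the tokenization rule (lowercase, two or more word characters) is assumed to match the defaults used above.

```python
import re

class TinyVectorizer(object):
    """Toy stand-in for CountVectorizer (illustration only)."""

    def _tokenize(self, text):
        return re.findall(r"\b\w\w+\b", text.lower())

    def fit(self, docs):
        # learn a sorted vocabulary and remember each token's column index
        vocab = sorted({tok for doc in docs for tok in self._tokenize(doc)})
        self.vocabulary_ = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, docs):
        # count only tokens that are already in the fitted vocabulary
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in self._tokenize(doc):
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
tiny = TinyVectorizer().fit(simple_train)

print(sorted(tiny.vocabulary_))
print(tiny.transform(["please don't call me"]))   # "don" was never seen, so it is dropped
```

The second printout reproduces the test row shown earlier, `[[0, 1, 1, 1, 0, 0]]`: the unseen token leaves no trace in the matrix.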

## Part 2: Reading SMS data

In [27]:
# read tab-separated file 
data = 'text.tsv'
col_names = ['label', 'message']
sms = pd.read_table(data, sep='\t', header=None, names=col_names)
# this is done for you here. Feel free to experiment with this to check rows, cols etc
# TODO

In [28]:
# print first 20 rows of your dataset
# TODO
sms.head(20)

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [29]:
# print how many data points there are for each label
# TODO
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [30]:
# convert label to a numeric variable
# for example "ham" as 0 and "spam" as 1
# TODO

sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
sms.head(20)
# I was getting errors when trying to switch the labels to 0 and 1 in place:
# I would get NaNs instead of 0s or 1s, so I made a separate column instead.


|   | label | message | label_num |
|---|-------|---------|-----------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | ham | Even my brother is not like to speak with me. ... | 0 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | spam | WINNER!! As a valued network customer you have... | 1 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 1 |


In [31]:
# define X and y
# X should be the messages
# y should be the label
# TODO
X = sms.message
y = sms.label_num


In [32]:
# split into training and testing sets
# An example of how you can split your dataset into test and train
# (sklearn.cross_validation raises a "DeprecationWarning" and was removed in
# scikit-learn 0.20; sklearn.model_selection provides the same function)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)
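
The shapes above follow from `train_test_split` holding out 25% of the rows by default when neither `test_size` nor `train_size` is given. The arithmetic can be checked by hand. The sketch below assumes the test size is rounded up, which makes no difference here because 5572 is divisible by 4.

```python
import math

n_samples = 4825 + 747                       # ham + spam counts printed earlier
n_test = int(math.ceil(0.25 * n_samples))    # default hold-out fraction
n_train = n_samples - n_test

print("%d total, %d train, %d test" % (n_samples, n_train, n_test))
```

Since spam makes up only about 13% of the messages, passing `stratify=y` is one option for keeping that proportion the same in both parts; the split in this homework does not use it.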


## Part 3: Vectorizing SMS data

In [33]:
# instantiate the vectorizer
# TODO
vect = CountVectorizer()

In [34]:
# learn training data vocabulary, then create document-term matrix
# TODO
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
print(X_train_dtm)

  (0, 50)	1
  (0, 264)	1
  (0, 509)	1
  (0, 1552)	1
  (0, 1572)	1
  (0, 2022)	1
  (0, 2864)	2
  (0, 3170)	1
  (0, 3181)	1
  (0, 3880)	1
  (0, 3971)	1
  (0, 4375)	1
  (0, 4662)	1
  (0, 4743)	1
  (0, 4779)	1
  (0, 4781)	1
  (0, 4983)	1
  (0, 4987)	1
  (0, 5193)	1
  (0, 5479)	1
  (0, 6656)	1
  (0, 6892)	1
  (0, 7424)	1
  (1, 2222)	1
  (1, 3316)	1
  :	:
  (4177, 2744)	1
  (4177, 2786)	1
  (4177, 3629)	1
  (4177, 3700)	1
  (4177, 3738)	1
  (4177, 4255)	1
  (4177, 4446)	1
  (4177, 4508)	1
  (4177, 4778)	1
  (4177, 4934)	1
  (4177, 5403)	1
  (4177, 5490)	1
  (4177, 5656)	1
  (4177, 5796)	1
  (4177, 6034)	1
  (4177, 6514)	1
  (4177, 6577)	1
  (4177, 6656)	1
  (4177, 6662)	1
  (4177, 6887)	1
  (4177, 7257)	1
  (4178, 1691)	1
  (4178, 4238)	1
  (4178, 5999)	1
  (4178, 7257)	1


In [35]:
# transform testing data (using fitted vocabulary) into a document-term matrix
# TODO
X_test_dtm = vect.transform(X_test)
print(X_test_dtm)

  (0, 1538)	1
  (0, 5189)	1
  (0, 6542)	1
  (0, 7405)	1
  (1, 1016)	1
  (1, 3050)	1
  (1, 4163)	1
  (1, 4238)	1
  (1, 4370)	1
  (1, 5200)	1
  (1, 6656)	1
  (1, 7407)	1
  (1, 7420)	1
  (2, 986)	1
  (2, 3244)	1
  (2, 7162)	1
  (3, 3237)	1
  (4, 887)	2
  (4, 1060)	1
  (4, 1595)	1
  (4, 2066)	1
  (4, 2833)	1
  (4, 3388)	1
  (4, 3623)	1
  (4, 3921)	1
  :	:
  (1391, 4373)	1
  (1391, 4413)	1
  (1391, 4441)	1
  (1391, 4743)	1
  (1391, 4778)	1
  (1391, 6017)	1
  (1391, 6057)	1
  (1391, 6829)	1
  (1391, 6904)	1
  (1391, 7012)	1
  (1391, 7120)	1
  (1391, 7230)	2
  (1391, 7239)	1
  (1391, 7287)	1
  (1391, 7357)	1
  (1392, 848)	1
  (1392, 2400)	1
  (1392, 2873)	1
  (1392, 3158)	1
  (1392, 4238)	1
  (1392, 4255)	2
  (1392, 4487)	1
  (1392, 4802)	1
  (1392, 5565)	1
  (1392, 7075)	1


## Part 4: Examining the tokens and their counts

In [36]:
# store token names into a variable
# TODO
token = vect.get_feature_names()
len(token)

7456
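
With 7456 tokens and 4179 training messages, it is worth estimating how large the matrix becomes before the later step that converts `X_train_dtm` to a dense array with `toarray()`. The sketch below is back-of-the-envelope arithmetic only and assumes 8-byte `int64` counts, the type reported for the small example in Part 1.

```python
n_docs, n_tokens = 4179, 7456      # training messages, vocabulary size
bytes_per_cell = 8                 # assumption: int64 counts

n_cells = n_docs * n_tokens
dense_mib = n_cells * bytes_per_cell / (1024.0 ** 2)

print("%d cells, roughly %.0f MiB as a dense array" % (n_cells, dense_mib))
```

Almost all of those cells are zero, which is why the vectorizer returns a sparse matrix in the first place.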

In [45]:
# print first 50 tokens
# TODO
print(token[0:50])

[u'00', u'000', u'008704050406', u'0121', u'01223585236', u'01223585334', u'0125698789', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'021', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06', u'07', u'07008009200', u'07090201529', u'07090298926', u'07123456789', u'07732584351', u'07734396839', u'07742676969', u'0776xxxxxxx', u'07781482378', u'07786200117', u'078', u'07801543489', u'07808', u'07808247860', u'07808726822', u'07815296484', u'07821230901', u'07880867867', u'0789xxxxxxx', u'07946746291', u'0796xxxxxx', u'07973788240', u'07xxxxxxxxx', u'08', u'0800', u'08000407165', u'08000776320', u'08000839402', u'08000930705']


In [None]:
# print last 50 tokens
# TODO
print(token[-50:])

In [None]:
# view X_train_dtm (training document term matrix) as a dense matrix
# TODO
X_train_dtm_dense = X_train_dtm.toarray()
print(X_train_dtm_dense)

In [None]:
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
# TODO
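count = X_train_dtm_dense.sum(axis=0)
# count is a 1-D array with one total per token, so print the whole list
# (count.tolist()[0] would print only the first token's total)
print(count.tolist())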


In [None]:
# print the number of data points in training set
len(X_train)

In [None]:
# create a DataFrame of tokens with their counts
# such that you will have two columns -- count and token
# TODO
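# build from a dict so the counts stay numeric
# (np.transpose([count, token]) would turn every value into a string)
pd.DataFrame({'count': count, 'token': token}, columns=['count', 'token'])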

**Calculating the "spamminess" of each token**

In [None]:
# create separate DataFrames for ham and spam
# TODO
df_ham = sms[sms.label_num==0]
df_spam = sms[sms.label_num==1]

In [None]:
# learn the vocabulary of ALL messages and save it
# TODO
# use a fresh vectorizer so the one fitted on X_train is not overwritten
sms_vocab = CountVectorizer().fit(sms['message'])
# put the names of all features (tokens) into a variable
# TODO
sms_features = sms_vocab.get_feature_names()
sms_features

In [None]:
# create document-term matrices for ham and spam
# TODO
ham_dtm = sms_vocab.transform(df_ham['message'])
spam_dtm = sms_vocab.transform(df_spam['message'])
print(ham_dtm)

In [None]:
# count how many times EACH token appears across ALL ham messages
# TODO
train_ham_dense = ham_dtm.toarray()
ham_count = np.sum(train_ham_dense , axis = 0)
print(ham_count)

In [None]:
# count how many times EACH token appears across ALL spam messages
# TODO
train_spam_dense = spam_dtm.toarray()
spam_count = np.sum(train_spam_dense , axis = 0)
print(spam_count)

In [None]:
# create a DataFrame of tokens with their separate ham and spam counts
# TODO
#sms_df = pd.DataFrame(np.transpose([sms_features, ham_count, spam_count]), columns = ['token','ham_count', 'spam_count'])
sms_df = pd.DataFrame({'tokens':sms_features, 'ham_count': ham_count, 'spam_count':spam_count})
sms_df.head()

In [None]:
# add one to ham and spam counts to avoid dividing by zero (in the step that follows)
# TODO
sms_df['ham_count'] = sms_df.ham_count + 1
sms_df['spam_count'] = sms_df.spam_count +1

In [128]:
# calculate ratio of spam-to-ham for each token
# i.e. token counts for spam/ token counts for ham
# TODO
spam_ham_ratio = sms_df.spam_count / sms_df.ham_count
sms_df['ratio'] = spam_ham_ratio
sms_df.head()

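|   | ham_count | spam_count | tokens | ratio |
|---|-----------|------------|--------|-------|
| 0 | 1 | 11 | 00 | 11.0 |
| 1 | 1 | 30 | 000 | 30.0 |
| 2 | 2 | 1 | 000pes | 0.5 |
| 3 | 1 | 3 | 008704050406 | 3.0 |
| 4 | 1 | 2 | 0089 | 2.0 |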
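
The ratio above is computed from raw counts. Because the data has far more ham than spam (4825 versus 747 messages), one possible refinement, which the assignment does not ask for, is to divide each count by the number of messages in its class before taking the ratio. The sketch below applies that to the five rows shown, with the counts copied from the output (they already include the +1). The names `rows` and `normalized` are made up for this illustration.

```python
n_ham, n_spam = 4825, 747          # message counts per label (Part 2)

# (token, ham_count, spam_count) copied from the output above (+1 already applied)
rows = [
    ("00", 1, 11),
    ("000", 1, 30),
    ("000pes", 2, 1),
    ("008704050406", 1, 3),
    ("0089", 1, 2),
]

normalized = {}
for token, ham_count, spam_count in rows:
    ham_rate = ham_count / float(n_ham)      # occurrences per ham message
    spam_rate = spam_count / float(n_spam)   # occurrences per spam message
    normalized[token] = spam_rate / ham_rate
    print("%-14s raw=%5.2f  normalized=%7.2f"
          % (token, spam_count / float(ham_count), normalized[token]))
```

Every ratio is rescaled by the same factor (4825/747, about 6.46), so the ranking of tokens does not change; only the interpretation does, since a normalized value above 1 means the token appears more often per spam message than per ham message.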


## Part 5: Building a Naive Bayes model

We will use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
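
To show what "classification with word counts" means in practice, here is a minimal hand-rolled sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, corresponding to `alpha=1.0`. It is not scikit-learn's implementation, and the five training messages are invented for this illustration.

```python
import math
from collections import Counter

# toy training data, invented for this sketch
train = [
    ("win cash now", "spam"),
    ("free cash prize", "spam"),
    ("call me tonight", "ham"),
    ("see you tonight", "ham"),
    ("call you later", "ham"),
]
alpha = 1.0                                   # add-one smoothing

vocab = sorted({w for text, _ in train for w in text.split()})
word_counts = {"ham": Counter(), "spam": Counter()}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

def log_posterior(text, label):
    """log P(label) + sum of log P(word | label), up to a shared constant."""
    total = sum(word_counts[label].values()) + alpha * len(vocab)
    score = math.log(doc_counts[label] / float(len(train)))
    for w in text.split():
        if w in vocab:                        # words outside the vocabulary are ignored
            score += math.log((word_counts[label][w] + alpha) / total)
    return score

def predict(text):
    return max(("ham", "spam"), key=lambda label: log_posterior(text, label))

print(predict("free cash tonight"))   # spam
print(predict("call you tonight"))    # ham
```

Smoothing is what keeps a word that never occurred in one class (such as "tonight" in spam) from forcing that class's probability to zero.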

In [38]:
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
# TODO
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [41]:
# make class predictions for X_test_dtm
# TODO
y_pred_class = nb.predict(X_test_dtm)

In [42]:
# calculate accuracy of class predictions
# compute the accuracy scores
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.9885139985642498


In [44]:
# confusion matrix
# TODO
print(metrics.confusion_matrix(y_test, y_pred_class))

[[1203    5]
 [  11  174]]
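
The confusion matrix carries more information than the single accuracy number. In scikit-learn's layout the rows are the actual classes and the columns are the predicted classes, so with ham as 0 and spam as 1 the four cells are true negatives, false positives, false negatives and true positives. A short sketch of the arithmetic, using the numbers printed above:

```python
# confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 1203, 5
fn, tp = 11, 174

total = tn + fp + fn + tp
accuracy = (tp + tn) / float(total)
precision = tp / float(tp + fp)    # of the messages flagged as spam, how many are spam
recall = tp / float(tp + fn)       # of the actual spam, how much was caught

print("accuracy=%.4f precision=%.4f recall=%.4f" % (accuracy, precision, recall))
```

The accuracy reproduces the score printed earlier (1377 correct out of 1393). The 5 false positives and 11 false negatives are the messages listed in the next two cells.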


In [84]:
# predict (poorly calibrated) probabilities (optional)
# TODO

In [46]:
# print message text for the false positives
# TODO
X_test[y_test < y_pred_class]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [47]:
# print message text for the false negatives
# TODO
X_test[y_test > y_pred_class]

3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

## Part 6: Comparing Naive Bayes with logistic regression

In [49]:
# import/instantiate/fit
from sklearn.linear_model import LogisticRegression
# TODO
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [53]:
# class predictions and predicted probabilities
# TODO
y_pred_class = logreg.predict(X_test_dtm)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
print(y_pred_class)
print(y_pred_prob)


[0 0 0 ... 0 1 0]
[0.01269556 0.00347183 0.00616517 ... 0.03354907 0.99725053 0.00157706]
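
The two printouts are related: for a binary logistic regression, the predicted class is the result of thresholding the spam probability at 0.5. A minimal sketch using only the six probabilities visible above (the elided values are not reconstructed):

```python
# the six probabilities visible in the printout above (the rest are elided)
shown_probs = [0.01269556, 0.00347183, 0.00616517,
               0.03354907, 0.99725053, 0.00157706]

threshold = 0.5
shown_classes = [1 if p > threshold else 0 for p in shown_probs]

print(shown_classes)   # [0, 0, 0, 0, 1, 0]
```

Lowering the threshold would flag more messages as spam, catching more spam at the cost of more false positives.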


In [51]:
# calculate accuracy
# TODO
metrics.accuracy_score(y_test, y_pred_class)

0.9877961234745154
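
To put the two accuracy scores side by side, it helps to convert them back into numbers of misclassified messages and to compare both with a trivial baseline that always predicts ham. The sketch below only re-uses numbers already printed in this notebook.

```python
n_test = 1393                       # size of the test set
nb_accuracy = 0.9885139985642498    # Naive Bayes, from Part 5
lr_accuracy = 0.9877961234745154    # logistic regression, from above

nb_errors = int(round((1 - nb_accuracy) * n_test))
lr_errors = int(round((1 - lr_accuracy) * n_test))

# baseline: always predict ham (1203 + 5 = 1208 of the test messages are ham,
# the first row of the Naive Bayes confusion matrix)
baseline_accuracy = 1208 / float(n_test)

print("NB errors: %d, LR errors: %d" % (nb_errors, lr_errors))
print("always-ham baseline: %.4f" % baseline_accuracy)
```

On this split the two models differ by a single message (16 versus 17 errors), and both are far above the baseline of about 0.867.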