### Step 1: Understanding the dataset

1. Read `SMSSpamCollection.csv`

2. Print the shape of the whole dataset

3. Print the first few rows to study the data

The dataset has two columns:

1. `label` indicates whether a record is spam or ham (not spam)

2. `sms_message` is the text to be classified

In [34]:
import pandas as pd

df = pd.read_csv('SMSSpamCollection.csv', sep='\t', header=None, names=['label', 'sms_message'])
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
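Step 1 also calls for printing the shape of the dataset, which the cell above does not do. A minimal sketch follows, assuming pandas is installed and using a small in-memory stand-in for the file (the three rows and the name `df_sample` are illustrative, not part of the notebook):

```python
import io
import pandas as pd

# Illustrative stand-in for SMSSpamCollection.csv: tab-separated, no header row.
sample = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp\n"
    "ham\tSee you at lunch\n"
)
df_sample = pd.read_csv(sample, sep='\t', header=None,
                        names=['label', 'sms_message'])

print(df_sample.shape)                    # (rows, columns)
print(df_sample['label'].value_counts())  # number of ham vs. spam records
print(df_sample.head(10))                 # up to the first 10 rows
```

Run against the real file, the same three calls on `df` would show the full row count and the balance between the two classes.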


### Step 2: Data Pre-processing

In [35]:
# The classifier expects numerical data, so map the labels to 0 (ham) and 1 (spam)
df['label'] = df.label.map({ 'ham': 0, 'spam': 1 })

df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
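One thing worth checking after a dictionary-based `map` is whether any label fell outside the dictionary, since such values silently become `NaN`. A small sketch on a toy Series (the stray `'Spam'` value is made up for illustration):

```python
import pandas as pd

# 'Spam' (capitalised) is a deliberate stray value that is not in the mapping.
labels = pd.Series(['ham', 'spam', 'ham', 'Spam'])
mapped = labels.map({'ham': 0, 'spam': 1})

# Values missing from the dictionary become NaN instead of raising an error.
unmapped = labels[mapped.isna()]
print(unmapped.tolist())
```

An empty list here would mean every label was converted.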


### Step 3: Implementing Bag of Words

In [36]:
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()

print(count_vector)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [26]:
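# Fitting CountVectorizer performs the following:

# 1. Lowercases all of the words
# 2. Splits each document into tokens, discarding punctuation
#    (the default token pattern also ignores single-character tokens)
# 3. Collects the unique tokens as the vocabulary

# The vocabulary entries become the column headers of the frequency matrix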
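The tokenization described above can be sketched with the standard library alone, assuming the default `token_pattern` shown in the printed `CountVectorizer` parameters. The helper name `tokenize` is hypothetical, and this is an approximation of what the vectorizer does, not its actual implementation:

```python
import re

# Default token pattern from the printed parameters: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize('Hello, how are you!'))
print(tokenize('Win money, win from home.'))
# 'I' and 'a' are dropped because single characters do not match the pattern.
print(tokenize('I am a fan'))
```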

In [38]:
count_vector.fit(documents)
count_vector.get_feature_names()  # in scikit-learn >= 1.0, use get_feature_names_out()

[u'are',
 u'call',
 u'from',
 u'hello',
 u'home',
 u'how',
 u'me',
 u'money',
 u'now',
 u'tomorrow',
 u'win',
 u'you']

In [27]:
# transform() performs the following:

# 1. Counts, for each document, the occurrences of each vocabulary word

In [39]:
doc_array = count_vector.transform(documents).toarray()
print(doc_array)

[[1 0 0 1 0 1 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 1 0 0 2 0]
 [0 1 0 0 0 0 1 0 1 0 0 0]
 [0 1 0 2 0 0 0 0 0 1 0 1]]


In [None]:
# Represent these counts as a DataFrame
# This is known as a frequency matrix

In [40]:
# In scikit-learn >= 1.0, use count_vector.get_feature_names_out() for the column names
frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names())

frequency_matrix

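| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |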
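To make the bag-of-words idea concrete, the same frequency matrix can be rebuilt without scikit-learn. This is a rough sketch using only the standard library. The names `counts`, `vocabulary` and `matrix` are illustrative, and the regular expression mirrors the default token pattern printed earlier:

```python
import re
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

def tokenize(text):
    """Lowercase, then keep tokens of two or more word characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# One Counter of token frequencies per document.
counts = [Counter(tokenize(doc)) for doc in documents]

# The vocabulary is the sorted set of all tokens seen in any document.
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word (missing words count 0).
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

The rows printed here should agree with `doc_array` above, which is a quick way to confirm that the matrix holds per-document word counts and nothing else.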


### Step 4: Prepare training and testing datasets

In [55]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], df['label'], random_state=42)

print("Total number of rows in dataset: {}".format(df.shape[0]))
print("Total number of rows in training dataset: {}".format(X_train.shape[0]))
print("Total number of rows in testing dataset: {}".format(X_test.shape[0]))

Total number of rows in dataset: 5572
Total number of rows in training dataset: 4179
Total number of rows in testing dataset: 1393
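The 4179 / 1393 split follows from `train_test_split` holding out 25% of the rows when no `test_size` is given. The sketch below reproduces the arithmetic and shows a shuffled index split in plain Python. It illustrates the idea only and does not reproduce scikit-learn's exact shuffle, so the individual rows chosen will differ:

```python
import math
import random

n_samples = 5572      # total rows reported above
test_fraction = 0.25  # default hold-out fraction when test_size is not given

n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Shuffle row indices with a fixed seed (playing the role of random_state),
# then cut the shuffled list into a test part and a training part.
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Fixing the seed makes the split repeatable, which is why the notebook passes `random_state=42`.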


In [45]:
# Now convert our training and testing datasets into frequency matrices

In [56]:
count_vector = CountVectorizer()
training_data = count_vector.fit_transform(X_train)
testing_data = count_vector.transform(X_test)
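Note that the vectorizer is fitted on the training messages only, and the test messages are merely transformed. A consequence is that words appearing only in the test set are ignored. A standard-library sketch of that behaviour, with made-up documents and a hypothetical `transform` helper:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, then keep tokens of two or more word characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

train_docs = ['Win money now', 'Call me tomorrow']
test_docs = ['Win a prize tomorrow']

# "fit": the vocabulary is learned from the training documents only.
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def transform(docs):
    """Count only the words that are already in the fitted vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows

print(vocabulary)
# 'prize' never appeared in training, so it contributes nothing to the row.
print(transform(test_docs))
```

Fitting on the training data alone keeps the columns of both matrices identical and avoids leaking information from the test set into the vocabulary.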

### Step 5: Implement Naive Bayes

In [57]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [58]:
predictions = classifier.predict(testing_data)
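For intuition about what `fit` and `predict` compute, here is a compact multinomial Naive Bayes written from scratch with Laplace smoothing (`alpha=1.0`, matching the default shown above). It is a teaching sketch on toy data, not scikit-learn's implementation. The function names `train_nb` and `predict_nb` are hypothetical, and words are split on whitespace for brevity:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed word log-likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_docs = Counter(labels)            # number of documents per class
    word_counts = defaultdict(Counter)      # word frequencies per class
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_prior = math.log(class_docs[c] / len(docs))
        log_lik = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[c] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log-posterior; unseen words are skipped."""
    scores = {}
    for c, (log_prior, log_lik) in model.items():
        scores[c] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)

docs = ['win money now', 'win prize', 'call me now', 'see you tomorrow']
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham, as in the label mapping above
model = train_nb(docs, labels)
print(predict_nb(model, 'win money'))
print(predict_nb(model, 'call me tomorrow'))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.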

In [59]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

('Accuracy score: ', '0.988513998564')
('Precision score: ', '0.977528089888')
('Recall score: ', '0.935483870968')
('F1 score: ', '0.956043956044')
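The four scores are all derived from the confusion-matrix counts. The notebook does not print those counts. The values below are inferred as being consistent with the reported scores and the 1393 test rows, so treat them as an assumption:

```python
# Counts inferred from the reported scores (an assumption, not notebook output):
# true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 174, 4, 12, 1203

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all messages classified correctly
precision = tp / (tp + fp)                   # share of predicted spam that really is spam
recall = tp / (tp + fn)                      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

Read this way, the lower recall reflects a dozen spam messages that slipped through, while the high precision means very few legitimate messages were flagged.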


### Step 6: Conclusion

A major advantage of Naive Bayes over many other classification algorithms is its ability to handle a very large number of features. In our case, each word is treated as a feature, and there are thousands of different words.

It also performs well in the presence of irrelevant features and is relatively unaffected by them. Another major advantage is its relative simplicity.

Naive Bayes works well right out of the box, and tuning its parameters is rarely necessary, except in cases where the distribution of the data is known. It rarely overfits the data. Another important advantage is that training and prediction are very fast for the amount of data it can handle.

All in all, Naive Bayes really is a gem of an algorithm!

In [62]:
# Printing a sparse matrix lists (row, column) positions with their non-zero counts
print(training_data)
print(testing_data)

  (0, 4771)	1
  (0, 3376)	1
  (0, 268)	1
  (0, 6999)	1
  (0, 3804)	1
  (0, 1814)	1
  (0, 193)	1
  (0, 1549)	1
  (0, 1757)	2
  (0, 5586)	1
  (0, 5243)	1
  (0, 698)	1
  (0, 5446)	1
  (0, 6690)	2
  (0, 5803)	1
  (0, 1244)	1
  (0, 3230)	1
  (0, 7453)	1
  (0, 2051)	1
  (0, 4582)	1
  (0, 7003)	1
  (0, 1046)	1
  (0, 7277)	1
  (1, 2353)	1
  (1, 6578)	1
  :	:
  (4174, 4802)	1
  (4174, 4474)	1
  (4174, 7453)	1
  (4175, 6128)	1
  (4175, 3881)	1
  (4175, 4008)	1
  (4175, 1549)	1
  (4176, 6132)	1
  (4176, 6133)	1
  (4176, 5171)	1
  (4176, 5432)	1
  (4176, 3251)	1
  (4176, 2894)	1
  (4177, 3699)	1
  (4177, 6706)	1
  (4177, 5832)	1
  (4177, 4674)	1
  (4178, 4525)	1
  (4178, 6094)	1
  (4178, 6474)	1
  (4178, 5732)	1
  (4178, 2220)	1
  (4178, 3600)	1
  (4178, 3728)	1
  (4178, 3239)	1
  (0, 1149)	1
  (0, 1746)	1
  (0, 2055)	1
  (0, 2163)	1
  (0, 2999)	1
  (0, 3226)	1
  (0, 3397)	2
  (0, 3456)	2
  (0, 3600)	1
  (0, 3962)	1
  (0, 4103)	1
  (0, 4244)	1
  (0, 4505)	1
  (0, 4641)	1
  (0, 4965)	1
  (0, 6079)	