# Understanding the dataset

>**Instructions:**
* Import the dataset into a pandas dataframe using the read_table method. The dataset is tab-separated, so pass '\t' as the value of the 'sep' argument.
* Rename the columns by passing the list ['label', 'sms_message'] to the 'names' argument of read_table().
* Print the first five rows of the dataframe with the new column names.

In [6]:
# Read the tab-separated values file with pandas into a DataFrame.
# Note: the message column is named 'sms' here rather than 'sms_message'.
import pandas as pd
sms_df = pd.read_table('smsspamcollection/SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms'])
sms_df.head()

|   | label | sms |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
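
To see what `sep`, `header=None` and `names` each do without downloading the dataset, here is a minimal sketch that reads three made-up tab-separated lines from an in-memory string. The sample messages and the names `raw` and `toy_df` are assumptions for illustration only. The sketch uses `pd.read_csv` with `sep='\t'`, which should behave like `read_table` for this input.

```python
import io

import pandas as pd

# Three made-up lines in the same layout as the dataset: label<TAB>message.
raw = "ham\tSee you at five\nspam\tWin a free prize now\nham\tOk, thanks\n"

# header=None: the file has no header row, so the first line is data.
# names=[...]: supplies the column names to use instead.
toy_df = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                     names=["label", "sms"])

print(toy_df.head())
print(toy_df.shape)  # (3, 2)
```

Without `header=None`, pandas would treat the first message as a header row and silently drop it from the data.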


# Data Preprocessing

>**Instructions:**
* Convert the values in the 'label' column to numerical values using the map method with the dictionary {'ham':0, 'spam':1}. This maps 'ham' to 0 and 'spam' to 1.
* To get an idea of the size of the dataset, print the number of rows and columns using 'shape'.

In [7]:
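# Map the string labels to integers: 'ham' -> 0, 'spam' -> 1
binary_label_column = sms_df.label.map({'ham':0, 'spam':1})
sms_df['label'] = binary_label_column
# Print the dataframe, then display its first five rows
print(sms_df)
sms_df.head()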

      label                                                sms
0         0  Go until jurong point, crazy.. Available only ...
1         0                      Ok lar... Joking wif u oni...
2         1  Free entry in 2 a wkly comp to win FA Cup fina...
3         0  U dun say so early hor... U c already then say...
4         0  Nah I don't think he goes to usf, he lives aro...
5         1  FreeMsg Hey there darling it's been 3 week's n...
6         0  Even my brother is not like to speak with me. ...
7         0  As per your request 'Melle Melle (Oru Minnamin...
8         1  WINNER!! As a valued network customer you have...
9         1  Had your mobile 11 months or more? U R entitle...
10        0  I'm gonna be home soon and i don't want to tal...
11        1  SIX chances to win CASH! From 100 to 20,000 po...
12        1  URGENT! You have won a 1 week FREE membership ...
13        0  I've been searching for the right words to tha...
14        0                I HAVE A DATE ON SUNDAY WITH

|   | label | sms |
|---|-------|-----|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
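
The two instructions above (mapping labels and checking `shape`) can be sketched on a small made-up frame. The data and the names `toy_df`, `mapped` and `n_unmapped` are hypothetical. The misspelled label is there on purpose, to show that `map` leaves values without a dictionary key as NaN.

```python
import pandas as pd

# Hypothetical toy frame; the last label is deliberately misspelled.
toy_df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "sapm"],
    "sms": ["See you at five", "Win a free prize now", "Ok, thanks", "Call now"],
})

# Series.map looks each value up in the dict; values without a key become NaN.
mapped = toy_df["label"].map({"ham": 0, "spam": 1})

# Count unmapped labels before overwriting the column.
n_unmapped = int(mapped.isna().sum())
print("unmapped labels:", n_unmapped)

# shape is an attribute holding (rows, columns), not a method call.
print("rows, columns:", toy_df.shape)
```

For the same reason, running the mapping cell a second time on an already mapped column would turn every label into NaN, because 0 and 1 are not keys of the dictionary.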


# Splitting Dataset into Training and Testing

>**Instructions:**
Split the dataset into a training and testing set by using the train_test_split method in sklearn. Split the data
using the following variables:
* `X_train` is our training data for the 'sms_message' column.
* `y_train` is our training data for the 'label' column.
* `X_test` is our testing data for the 'sms_message' column.
* `y_test` is our testing data for the 'label' column.

Print out the number of rows we have in each of our training and testing sets.

In [10]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

# test_size is left at its default, which holds out 25% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(sms_df['sms'],
                                                    sms_df['label'],
                                                    random_state=1)
print('# rows in full dataset: ', str(sms_df.shape[0]))
print('# rows in training dataset: ', str(x_train.shape[0]))
print('# rows in test dataset: ', str(x_test.shape[0]))

# rows in full dataset:  5572
# rows in training dataset:  4179
# rows in test dataset:  1393
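
The row counts above follow from the default 25% test share. Here is a rough standard-library sketch of that arithmetic and of a seeded shuffle. It only illustrates the idea and is not scikit-learn's implementation, so the rows it picks will differ from those chosen by `train_test_split`. The names `test_fraction`, `indices`, `train_idx` and `test_idx` are assumptions.

```python
import math
import random

n_samples = 5572          # rows in the full dataset, from the output above
test_fraction = 0.25      # assumed default when test_size is not given

# Round the test share up and give the remainder to the training set.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 4179 1393

# Shuffle row indices with a fixed seed so the split is repeatable,
# which is the role random_state plays in the call above.
indices = list(range(n_samples))
random.Random(1).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Fixing the seed matters when comparing models: a different split would change the scores slightly even if nothing else changed.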


# Applying Bag of Words processing to our dataset

>**Instructions:**
* First, fit `CountVectorizer()` on our training data (`X_train`) and return the resulting matrix of word counts.
* Second, transform our testing data (`X_test`) with the already fitted vectorizer to return its matrix.

In [24]:
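from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
# Learn the vocabulary from the training messages and build their count matrix
training_data = count_vector.fit_transform(x_train)
# Reuse that vocabulary for the test messages; do not fit again
testing_data = count_vector.transform(x_test)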
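
To make the fit/transform distinction concrete, here is a small pure-Python sketch of the Bag of Words idea. It is not `CountVectorizer` itself: the tokenizer (lowercase, then words of two or more characters) is modeled on that class's documented defaults, the helper names `tokenize`, `fit_vocabulary` and `transform` are hypothetical, and it returns plain lists rather than a sparse matrix.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase, then keep runs of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: i for i, word in enumerate(words)}

def transform(docs, vocab):
    """Turn each document into a row of word counts; unseen words are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(word, 0) for word in vocab])
    return rows

# Made-up messages for illustration.
train_docs = ["Win a free prize", "free free entry", "see you at home"]
test_docs = ["free lunch at home"]

vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)

print(list(vocab))
print(test_matrix)
```

The word "lunch" never appears in the training messages, so it has no column and is ignored in the test row. Fitting on the test data as well would let information from the test set leak into the features.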

# Naive Bayes implementation

>**Instructions:**
* We have loaded the training data into the variable 'training_data' and the testing data into the variable 'testing_data'.
* Import the MultinomialNB classifier and fit it to the training data using fit(). Name your classifier 'naive_bayes'. You will train the classifier using 'training_data' and 'y_train' from our split earlier.
* Now that the algorithm has been trained on the training set, make predictions on the test data stored in 'testing_data' using predict(). Save your predictions into the 'predictions' variable.

In [25]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)
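
For intuition about what `fit()` and `predict()` do here, the following is a minimal pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on word counts. It is a teaching sketch under stated assumptions, not scikit-learn's implementation: the function names, the toy count rows and the four-word vocabulary are all made up.

```python
import math
from collections import defaultdict

def fit_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    n_features = len(rows[0])
    class_docs = defaultdict(int)
    word_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_docs[label] += 1
        for j, count in enumerate(row):
            word_counts[label][j] += count

    model = {}
    for label, n_docs in class_docs.items():
        # Smoothing adds alpha to every word count so no probability is zero.
        total = sum(word_counts[label]) + alpha * n_features
        log_prior = math.log(n_docs / len(rows))
        log_likelihood = [math.log((c + alpha) / total) for c in word_counts[label]]
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, rows):
    """Pick the class with the highest log posterior for each count row."""
    predictions = []
    for row in rows:
        scores = {
            label: log_prior + sum(c * ll for c, ll in zip(row, log_likelihood))
            for label, (log_prior, log_likelihood) in model.items()
        }
        predictions.append(max(scores, key=scores.get))
    return predictions

# Columns: counts of "free", "win", "home", "see" (made-up toy data).
train_rows = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1]]
train_labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

model = fit_multinomial_nb(train_rows, train_labels)
print(predict(model, [[1, 0, 0, 0], [0, 0, 1, 0]]))  # [1, 0]
```

Working in log space avoids multiplying many small probabilities together, which would underflow for long messages.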

# Evaluating our model

>**Instructions:**
* Compute the accuracy, precision, recall and F1 scores of your model using your test data 'y_test' and the predictions you made earlier stored in the 'predictions' variable.

In [28]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_test, predictions):
    # Precision, recall and F1 are computed for the positive class (spam = 1)
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    return accuracy, precision, recall, f1

def show_evaluation(y_test, predictions):
    accuracy, precision, recall, f1 = evaluate(y_test, predictions)
    print('Accuracy score: ', str(accuracy))
    print('Precision score: ', str(precision))
    print('Recall score: ', str(recall))
    print('F1 score: ', str(f1))

In [29]:
show_evaluation(y_test, predictions)

Accuracy score:  0.988513998564
Precision score:  0.972067039106
Recall score:  0.940540540541
F1 score:  0.956043956044
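
To relate these four scores to actual message counts, here is a short worked example. The confusion counts below are not printed by the notebook. They are inferred as the counts that reproduce the reported scores on 1393 test rows, so treat them as an assumption.

```python
# Confusion counts inferred from the reported scores and the 1393 test rows;
# the notebook itself does not print them.
tp, fp, fn = 174, 5, 11          # spam caught, ham flagged as spam, spam missed
tn = 1393 - tp - fp - fn         # ham correctly passed through

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)       # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)          # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Read this way, the model wrongly flags about 5 legitimate messages and misses about 11 spam messages. Accuracy looks highest partly because most messages are ham, which is why precision and recall are worth reporting alongside it.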


# Conclusion

One of the major advantages that Naive Bayes has over other classification algorithms is its ability to handle an extremely large number of features. In our case, each word is treated as a feature and there are thousands of different words. It also performs well in the presence of irrelevant features and is relatively unaffected by them. Its other major advantage is its relative simplicity. Naive Bayes works well right out of the box, and tuning its parameters is rarely necessary, except in cases where the distribution of the data is known. It rarely overfits the data. Another important advantage is that its training and prediction times are very fast for the amount of data it can handle. All in all, Naive Bayes really is a gem of an algorithm!