# Spam detection algorithm #

Spam detection is one of the major applications of machine learning. Here I use the Naive Bayes algorithm to build a model that classifies SMS messages from the [SMS Spam Collection dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) as spam or not spam ("ham"), based on the labelled examples the model is trained on.

## Overview ##

This project has been divided into the following steps:

- 1: Understanding our dataset
- 2: Data Preprocessing
- 3: Training and testing sets
- 4: Applying Bag of Words processing to our dataset
- 5: Naive Bayes implementation using scikit-learn
- 6: Evaluating our model
- 7: Conclusion

## 1 - Understanding our dataset ##

In [1]:
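import pandas as pd

# Load the tab-separated SMS collection. The file has no header row,
# so the two columns are named explicitly.
df = pd.read_table('sms_collection',
                   sep='\t',
                   header=None,
                   names=['label', 'sms_message'])
df.head()  # preview the first five rows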

|   | label | sms_message |
|---|-------|-------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## 2 - Data Preprocessing ##

In [None]:
# Encode the text labels as integers: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
print(df.shape)
df.head()

## 3 - Training and testing sets ##

In [None]:
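# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

# Hold out 25% of the rows (the default test_size) for testing;
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1)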

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))
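
To make the split concrete, here is a minimal pure-Python sketch of what a shuffled hold-out split does. It is not scikit-learn's implementation: the helper name `simple_split` and the toy messages are hypothetical, and the 25% test fraction is chosen to mirror the `train_test_split` default.

```python
import math
import random

def simple_split(items, test_size=0.25, seed=1):
    """Shuffle indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)         # reproducible shuffle
    n_test = math.ceil(len(items) * test_size)   # number of rows held out
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [items[i] for i in train_idx]
    test = [items[i] for i in test_idx]
    return train, test

messages = ['msg {}'.format(i) for i in range(8)]  # hypothetical toy data
train, test = simple_split(messages)
print(len(train), len(test))
```

With eight toy messages this keeps six for training and holds out two for testing, and no message appears in both sets.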

## 4 - Applying Bag of Words processing to our dataset ##

In [None]:
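from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
# Learn the vocabulary from the training messages only and count word occurrences
training_data = count_vector.fit_transform(X_train)
# Reuse that vocabulary for the test messages (no refitting, so no test-set leakage)
testing_data = count_vector.transform(X_test)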
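
As a rough illustration of the Bag of Words idea, the sketch below builds word-count vectors by hand. It is an approximation under stated assumptions, not `CountVectorizer` itself: the helper names (`tokenize`, `fit_vocabulary`, `to_counts`) and the toy sentences are hypothetical, and the tokenizer only imitates the default behaviour of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # similar to CountVectorizer's default settings.
    return re.findall(r'\b\w\w+\b', text.lower())

def fit_vocabulary(documents):
    # Learn a sorted vocabulary from the training documents only.
    return sorted({token for doc in documents for token in tokenize(doc)})

def to_counts(document, vocabulary):
    # Count each vocabulary word; words outside the vocabulary are ignored.
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]

train_docs = ['Hello, how are you', 'Win money, win from home']  # toy data
vocabulary = fit_vocabulary(train_docs)
train_matrix = [to_counts(doc, vocabulary) for doc in train_docs]
test_vector = to_counts('Win a free call now', vocabulary)

print(vocabulary)
print(train_matrix)
print(test_vector)
```

Each message becomes one row of counts with one column per vocabulary word. In the unseen test message only "win" is in the learned vocabulary, so every other word is dropped, which is why the vocabulary is fitted on the training set and merely applied to the test set.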

## 5 - Naive Bayes implementation using scikit-learn ##

In [None]:
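from sklearn.naive_bayes import MultinomialNB

# Multinomial Naive Bayes suits discrete features such as word counts
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
# Predict a label (0 = ham, 1 = spam) for every message in the test set
predictions = naive_bayes.predict(testing_data)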
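
For intuition about what the classifier computes, here is a small hand-rolled multinomial Naive Bayes with Laplace smoothing. It is a sketch under assumptions, not scikit-learn's code: the function names `train_nb` and `predict_nb` and the four toy messages are hypothetical, tokenization is a plain whitespace split, and `alpha=1.0` is chosen to match the `MultinomialNB` default.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    class_sizes = Counter(labels)
    word_counts = {c: Counter() for c in class_sizes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_lik = {}
    for c in class_sizes:
        # Laplace smoothing: add alpha to every word count in the class
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                      for w in vocab}
    return log_prior, log_lik

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    log_prior, log_lik = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped
        scores[c] = log_prior[c] + sum(log_lik[c][w]
                                       for w in doc.split() if w in log_lik[c])
    return max(scores, key=scores.get)

# Hypothetical toy training set: 1 = spam, 0 = ham
docs = ['win free prize now', 'free cash win',
        'are you coming home', 'see you at home now']
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, 'win free cash'))
print(predict_nb(model, 'are you home'))
```

The "naive" part is the sum over words: each word contributes to the score independently of the others, given the class. Working in log space avoids numerical underflow when many small probabilities are multiplied.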

## 6 - Evaluating our model ##

In [21]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
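
To show how these four scores relate to each other, the sketch below computes them by hand from true and predicted labels. The function name `binary_metrics` and the ten toy labels are hypothetical and are not taken from the model above; spam (1) is treated as the positive class.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    accuracy = (tp + tn) / len(pairs)
    # These divisions assume at least one predicted and one actual positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical toy labels: 4 spam and 6 ham messages
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Precision answers "of the messages flagged as spam, how many really were spam?", while recall answers "of the real spam messages, how many were flagged?". Because spam is the minority class in a dataset like this one, accuracy alone can look high even for a weak classifier, so precision, recall and F1 are worth reporting alongside it.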
