# Example of using Naive Bayes for spam detection

## Contents:
- Data description
- Load packages and data
- Data preprocessing 
- Split into train-test sets
- Use Naive Bayes for classification
- Evaluate the performance of the model

## Data Set Information:

The collection consists of a single text file in which each line holds the correct class followed by the raw message. Some examples are shown below:

- ham What you doing?how are you? 
- ham Ok lar... Joking wif u oni... 
- ham dun say so early hor... U c already then say... 
- ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H* 
- ham Siva is in hostel aha:-. 
- ham Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor. 
- spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop 
- spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B 
- spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU 

Note: the messages are not chronologically sorted.
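Since each line is a class label followed by the raw message, and the file is read further down with a tab separator, a file in this layout can be parsed as two tab-separated columns. The sketch below does this for three of the sample lines held in an in-memory string; the `sample` text and the name `sample_df` are illustrative assumptions, not part of the data set.

```python
import io

import pandas as pd

# Three of the sample lines, written as "label<TAB>message" (illustrative only).
sample = (
    "ham\tWhat you doing?how are you?\n"
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tSunshine Quiz! Win a super Sony DVD recorder if you canname "
    "the capital of Australia? Text MQUIZ to 82277. B\n"
)

# The file has no header row, so the column names are supplied explicitly.
sample_df = pd.read_csv(io.StringIO(sample), sep="\t", names=["label", "message"])

# How many messages of each class were parsed.
print(sample_df["label"].value_counts())
```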

## Load necessary packages

In [1]:
import pandas as pd

# to split the data into training and test sets
from sklearn.model_selection import train_test_split

# to count the frequency of the words
from sklearn.feature_extraction.text import CountVectorizer

# to classify the data
from sklearn.naive_bayes import MultinomialNB

# to evaluate the model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## Reading the data

In [2]:
df = pd.read_csv('smsspamcollection/SMSSpamCollection', sep='\t', names=['label', 'message'])
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
print('Example Ham:\n\t' + str(df.iloc[0]['message']))
print('Example Spam:\n\t' + str(df.iloc[2]['message']))

Example Ham:
	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Example Spam:
	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's


## Data preprocessing

In [4]:
df.dtypes

label      object
message    object
dtype: object

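The *label* column has the **object** data type, i.e. it holds the strings 'ham' and 'spam'. With only two classes this is a binary classification problem, so we map the labels to integers (ham to 0, spam to 1). scikit-learn's `precision_score`, `recall_score`, and `f1_score` treat the label 1 as the positive class by default, so spam becomes the class these metrics report on.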

In [5]:
df['label'] = df.label.map({'ham' : 0, 'spam' : 1})
df.head()

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
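`Series.map` with a dictionary returns `NaN` for any value that is missing from the dictionary, so a quick check after the mapping can catch unexpected labels. A minimal sketch on a toy frame follows; the `toy` data and the deliberately misspelled `'Spam'` label are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "label": ["ham", "spam", "Spam"],   # the last label is a deliberate typo
    "message": ["hi", "win cash", "free prize"],
})

mapped = toy["label"].map({"ham": 0, "spam": 1})

# Values not present in the dictionary become NaN, so count them explicitly.
n_unmapped = int(mapped.isna().sum())
print("Unmapped labels:", n_unmapped)
```

If the count is zero, the column can safely be used as an integer target.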


## Splitting the data into training and test sets

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'])

print("Number of rows in the total set: {}".format(df.shape[0]))
print("Number of rows in the training set: {}".format(X_train.shape[0]))
print("Number of rows in the test set: {}".format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
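With no extra arguments, `train_test_split` shuffles the rows and holds out 25% of them for testing, which matches the 4179/1393 counts above. Because the shuffle is random, each run produces a different partition. A hedged variant that fixes the seed and keeps the ham/spam ratio the same in both parts could look like the sketch below; the toy data and the values `random_state=42` and `test_size=0.25` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "message": ["msg {}".format(i) for i in range(20)],
    "label": [0] * 16 + [1] * 4,   # 80% ham, 20% spam
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["message"], toy["label"],
    test_size=0.25,          # same fraction as the default
    random_state=42,         # fixed seed so the split is repeatable
    stratify=toy["label"],   # preserve the class ratio in both parts
)

print("train/test sizes:", len(X_tr), len(X_te))
print("spam share in train/test:", y_tr.mean(), y_te.mean())
```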


### Illustration of count vectorization
- Count vectorization turns each document into a vector of word counts, which is the kind of discrete feature that multinomial Naive Bayes expects.
- Each vector records how many times each word of the vocabulary occurs in a given document.

In [7]:
count_vector = CountVectorizer()
documents = ['How are you!',
            'Win money, win from home.',
            'Call me now',
            'Hello, Can I call you tomorrow?']
count_vector.fit(documents)
doc_array = count_vector.transform(documents).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names_out())
frequency_matrix.head()

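| | are | call | can | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |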
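The rows of this table can be reproduced by hand. With its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters, which is why the single-letter word "I" has no column. The sketch below mimics that behaviour with the standard library; `count_tokens` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import re
from collections import Counter

# Default token pattern of CountVectorizer: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def count_tokens(text):
    """Lowercase the text, split it into tokens, and count each token."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return Counter(tokens)

print(count_tokens("Win money, win from home."))
print(count_tokens("Hello, Can I call you tomorrow?"))
```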


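We now apply the same technique to our corpus. The vectorizer learns its vocabulary from the training messages only and is then reused, without refitting, to transform the test messages.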

In [8]:
count_vectorizer = CountVectorizer()
training_data = count_vectorizer.fit_transform(X_train)
testing_data = count_vectorizer.transform(X_test)
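One consequence of fitting on the training set only is that words seen for the first time in the test set are silently ignored, since they have no column in the learned vocabulary. A small sketch with made-up documents (`train_docs` and `test_docs` are illustrative, not taken from the data set):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["call me now", "win money now"]
test_docs = ["call now to win a prize"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)   # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)         # reuses it, no refitting

# Vocabulary learned from the training documents, in column order.
print(sorted(vectorizer.vocabulary_))

# "to" and "prize" never appeared in training, so they contribute nothing here.
print(test_matrix.toarray())
```

Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.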

## Use Naive Bayes model for classification

Now that we have converted the texts to count vectorized form, we can train the multinomial Naive Bayes classifier on them.

(This classifier is suited to discrete features such as the word counts used in text classification, and it takes integer counts as input. Gaussian Naive Bayes, on the other hand, is better suited to continuous data because it assumes that each feature follows a Gaussian (normal) distribution within each class.)
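To make the role of the word counts concrete, the sketch below estimates the per-class word probabilities by hand, using additive (Laplace) smoothing with `alpha = 1.0`, the default in `MultinomialNB`, and compares them with what the fitted model stores in `feature_log_prob_`. The count matrix `X` and labels `y` are toy values invented for this illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: rows are documents, columns are three words (illustrative).
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([0, 0, 1, 1])

alpha = 1.0
model = MultinomialNB(alpha=alpha).fit(X, y)

# Smoothed estimate per class:
# (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
manual = []
for c in (0, 1):
    word_counts = X[y == c].sum(axis=0)
    manual.append((word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1]))
manual = np.array(manual)

print(manual)
print("matches scikit-learn:", np.allclose(manual, np.exp(model.feature_log_prob_)))
```

The smoothing term keeps a word that never occurred in one class from driving that class's probability to zero.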

In [9]:
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [10]:
predictions = naive_bayes.predict(testing_data)
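The same two steps, transforming with the fitted vectorizer and then calling `predict`, also apply to messages that were never part of the data set. The sketch below uses a tiny invented corpus so that it runs on its own; `toy_messages`, `toy_labels`, and `new_messages` are assumptions made for illustration, and `predict_proba` is shown as one way to inspect how confident the classifier is.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_messages = [
    "win a free prize now",
    "free cash win win",
    "are you coming home",
    "call me when you are home",
]
toy_labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(toy_messages), toy_labels)

new_messages = ["win free cash", "are you home"]
new_counts = vec.transform(new_messages)   # reuse the fitted vocabulary

new_predictions = clf.predict(new_counts)
new_probabilities = clf.predict_proba(new_counts)

print(new_predictions)
print(new_probabilities.round(3))   # columns follow clf.classes_, here [0, 1]
```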

## Evaluation of the model

In [12]:
print("Accuracy: {}".format(round(accuracy_score(y_test.values, predictions), 4)))
print("Precision: {}".format(round(precision_score(y_test.values, predictions), 4)))
print("Recall: {}".format(round(recall_score(y_test.values, predictions), 4)))
print("F1 score: {}".format(round(f1_score(y_test.values, predictions), 4)))

Accuracy: 0.9821
Precision: 0.9641
Recall: 0.8944
F1 score: 0.928
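Precision is the share of messages flagged as spam that really are spam, recall is the share of actual spam that was caught, and F1 is their harmonic mean. The reported values are consistent with this: 2 × 0.9641 × 0.8944 / (0.9641 + 0.8944) ≈ 0.928. The sketch below derives the three metrics from a confusion matrix and compares them with scikit-learn's functions; the labels `y_true` and `y_pred` are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels 0/1, ravel() yields the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # flagged as spam and truly spam
recall = tp / (tp + fn)      # actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```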


## References:
1. I downloaded the data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The accompanying paper is Almeida, Tiago A., José María G. Hidalgo, and Akebo Yamakami. "Contributions to the study of SMS spam filtering: new collection and results." Proceedings of the 11th ACM symposium on Document engineering. ACM, 2011.
2. A [Medium article](https://medium.com/coinmonks/spam-detector-using-naive-bayes-c22cc740e257) that I found helpful.