# Spam Detection

In recent years, unsolicited commercial or bulk e-mail, also known as spam, has become a
major problem on the internet. Spam wastes time, storage space and communication bandwidth,
and the problem has been growing for years.

In this mission we will use the Naive Bayes algorithm to build a model that classifies the SMS messages in a dataset as spam or not spam (ham), based on the training data we give it. It helps to have some intuition about what a spammy text message looks like.

### STEP 1 : UNDERSTANDING THE DATA

In [1]:
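import pandas as pd

# The file is tab-separated and has no header row, so the column names are supplied here.
df = pd.read_table('smsspamcollection/SMSSpamCollection', sep='\t', names=['category', 'mail-content'])
df.head()
# In a notebook only the last expression of a cell is displayed, so the output below is from describe().
df.describe()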

| | category | mail-content |
|---|---|---|
| count | 5572 | 5572 |
| unique | 2 | 5169 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


#### DATA PREPROCESSING

We should convert the "category" values to integers, since scikit-learn works with numerical data and the scoring functions used later treat 1 as the positive class by default. To avoid problems later on, we map every category value to either 0 (ham) or 1 (spam).

In [2]:
df['category'] = df.category.map({'ham':0, 'spam':1})
print(df.shape)
df.head()


(5572, 2)


| | category | mail-content |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


### STEP 2 : BAG OF WORDS

In this model, a text (such as a sentence or a document) is represented as a bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

Suppose we have two documents:

    Eg. : (1) Nishu likes music. Aadhya likes music too.
          (2) Nishu also likes to play guitar.

Collecting the distinct words from both documents gives the following list:

["Nishu", "likes", "music", "Aadhya", "too", "also", "to", "play", "guitar"]

More info can be found here : https://en.wikipedia.org/wiki/Bag-of-words_model
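
The two example documents can be turned into count vectors with a few lines of plain Python. This is only an illustrative sketch: the `tokenize` helper and its letters-only rule are assumptions made for this example, not part of any library.

```python
from collections import Counter
import re

documents = [
    "Nishu likes music. Aadhya likes music too.",
    "Nishu also likes to play guitar.",
]

def tokenize(text):
    # Hypothetical helper: keep runs of letters, drop punctuation, keep casing.
    return re.findall(r"[A-Za-z]+", text)

# Vocabulary in order of first appearance across both documents.
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# One count vector per document, aligned with the vocabulary.
vectors = []
for doc in documents:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

The vocabulary matches the list above, and each vector records how often every vocabulary word occurs in one document ("likes" and "music" each appear twice in the first one).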

#### COUNT VECTORIZER

- It tokenizes the string (separates it into individual words) and assigns an integer ID to each token.
- It counts the occurrences of each of those tokens.
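
Both behaviours can be seen on the two toy documents from the bag-of-words example. The sketch below assumes a reasonably recent scikit-learn and uses `CountVectorizer` with its default settings, which lowercase the text and keep only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Nishu likes music. Aadhya likes music too.",
    "Nishu also likes to play guitar.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)  # sparse document-term matrix

# vocabulary_ maps each token to its integer column ID.
token_ids = dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))
print(token_ids)

# Each row is a document, each column the count of one token.
print(counts.toarray())
```

With the defaults the IDs follow the alphabetical order of the lowercased tokens, so the column order differs from the hand-written list in the bag-of-words section.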


In [3]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

#### Splitting the data

In [4]:
from sklearn.model_selection import train_test_split

# random_state can be any fixed integer. It makes the shuffle reproducible,
# so every run produces the same training and test sets.
mail_train, mail_test, category_train, category_test = train_test_split(
    df['mail-content'], df['category'], random_state=9)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(mail_train.shape[0]))
print('Number of rows in the test set: {}'.format(mail_test.shape[0]))

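Note: an "ImportError: No module named model_selection" at this point means that the installed scikit-learn is older than version 0.18, where sklearn.model_selection was introduced. Upgrading scikit-learn resolves it.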
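
The effect of a fixed seed on the split can be illustrated without scikit-learn. The sketch below is not how train_test_split is implemented. `split_indices` is a hypothetical helper that shuffles row indices with a seeded generator and sets aside 25% of them, the share scikit-learn uses by default when no size is given.

```python
import random

def split_indices(n_rows, test_fraction=0.25, seed=9):
    """Hypothetical helper: shuffle row indices with a fixed seed, then cut off a test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

# Two independent calls with the same seed give identical splits.
train_a, test_a = split_indices(5572)
train_b, test_b = split_indices(5572)

print(len(train_a), len(test_a))
print(test_a == test_b)
```

For the 5572 messages in this dataset, a 25% test share corresponds to 4179 training rows and 1393 test rows.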

#### Applying bag of words to our data

In [None]:
# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(mail_train)

# Transform the testing data with the vocabulary learned from the training data.
# Note that the CountVectorizer() is not fitted again on the testing data.
testing_data = count_vector.transform(mail_test)
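
Why the test set is only transformed, never fitted, can be shown with a small pure-Python sketch. The helpers `fit_vocabulary` and `transform` and the toy messages are made up for illustration and are much simpler than CountVectorizer itself.

```python
def fit_vocabulary(texts):
    """Learn a word -> column index mapping from the training texts only."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def transform(texts, vocab):
    """Count known words per text; words outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        row = [0] * len(vocab)
        for word in text.lower().split():
            if word in vocab:
                row[vocab[word]] += 1
        rows.append(row)
    return rows

train_texts = ["free entry to win", "call me later"]
test_texts = ["win a free prize", "call later"]

vocab = fit_vocabulary(train_texts)          # learned from training data only
train_rows = transform(train_texts, vocab)
test_rows = transform(test_texts, vocab)     # same columns as the training matrix

print(vocab)
print(test_rows)
```

Because the columns are fixed by the training data, the test matrix has the same width as the training matrix, and unseen words such as "prize" are simply dropped. Fitting on the test set would instead leak information about it into the features.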

### STEP 3 : APPLYING BAYES THEOREM

In [None]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, category_train)
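
The classifier fitted above rests on Bayes' theorem combined with the "naive" assumption that words occur independently of each other given the class. As a sketch of the underlying reasoning, with notation introduced here for illustration: $c$ is a class (ham or spam), $w_1, \dots, w_n$ are the words of a message, $N_{w,c}$ is the count of word $w$ in training messages of class $c$, $N_c$ is the total word count in class $c$, $|V|$ is the vocabulary size, and $\alpha$ is a smoothing parameter (MultinomialNB uses $\alpha = 1$ by default).

```latex
\begin{align*}
% Bayes' theorem for a message and a class
P(c \mid w_1, \dots, w_n) &= \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)} \\
% Naive assumption: words are conditionally independent given the class
P(w_1, \dots, w_n \mid c) &\approx \prod_{i=1}^{n} P(w_i \mid c) \\
% Decision rule: the denominator is the same for every class and can be dropped
\hat{c} &= \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c) \\
% Smoothed estimate of the word likelihood
P(w \mid c) &= \frac{N_{w,c} + \alpha}{N_c + \alpha\, |V|}
\end{align*}
```

The smoothing term keeps a word that never appeared in one class during training from forcing the whole product to zero.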

Now, we are ready to make predictions.

In [None]:
predictions = naive_bayes.predict(testing_data)

### STEP 4 : EVALUATING OUR MODEL

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(category_test, predictions)))
print('Precision score: {}'.format(precision_score(category_test, predictions)))
print('Recall score: {}'.format(recall_score(category_test, predictions)))
print('F1 score: {}'.format(f1_score(category_test, predictions)))
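
What the four scores measure can be made concrete by computing them by hand. The sketch below uses a hypothetical helper, `binary_metrics`, and made-up labels (1 = spam, 0 = ham). It follows the standard definitions for the binary case and is not a substitute for the scikit-learn functions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham correctly passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Precision answers "of the messages flagged as spam, how many really were spam?", while recall answers "of the real spam messages, how many were caught?". Because ham makes up most of this dataset (4825 of 5572 messages), accuracy alone can look good even for a weak spam filter, which is why the other three scores are reported as well.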

### CONCLUSION

Advantages of Naive Bayes:
- Able to handle data with an extremely large number of features
- Relatively simple
- Model training and prediction are very fast