##### Identifying spam messages is a binary classification problem, because each message is classified as either 'Spam' or 'Not Spam' and nothing else. It is also a supervised learning problem, because we feed a labelled dataset into the model, which it learns from in order to make future predictions.

##### In the dataset, the first column takes two values: 'ham', which signifies that the message is not spam, and 'spam', which signifies that the message is spam. The second column is the text content of the SMS message that is being classified.

#### PREPARE DATA

#### Load dataset

In [45]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


In [46]:
df = pd.read_table('Data/SMSSpamCollection', 
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])

print('No. of rows:', df.shape[0])
print('No. of columns:', df.shape[1])
df.head()

No. of rows: 5572
No. of columns: 2


|   | label | sms_message |
|---|-------|-------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Data Preprocessing

#### Convert the string labels 'spam' and 'ham' into numerical values (spam: 1, ham: 0)

In [47]:
df['label'] = df.label.map({'ham':0, 'spam':1})
print(df.shape)
df.head()

(5572, 2)


|   | label | sms_message |
|---|-------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


### Bag of Words

Here we introduce the Bag of Words (BoW) concept, a term for approaches to problems that involve a 'bag of words', that is, a collection of text data that needs to be worked with.

The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. It is important to note that the BoW concept treats each word individually and the order in which the words occur does not matter. 
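
As a minimal sketch of that order-independence (the two sentences are made up for illustration, and `bag_of_words` is a hypothetical helper name), counting words with Python's `collections.Counter` gives the same bag for both orderings:

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a piece of text, ignoring word order."""
    return Counter(text.lower().split())

# Two illustrative sentences with the same words in a different order.
bag_a = bag_of_words("win money now")
bag_b = bag_of_words("now money win")

print(bag_a == bag_b)  # True
```

Because only the counts are kept, any information carried by word order is lost in this representation.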

Using a process which we will go through now, we can convert a collection of documents to a matrix, with each document being a row and each word (token) being a column, and the corresponding (row, column) values being the frequency of occurrence of each word or token in that document.

In such a matrix, the documents are numbered in the rows, and each word is a column name, with the corresponding value being the frequency of that word in the document.

Let's break this down and see how we can do this conversion using a small set of documents.

To handle this, we will be using scikit-learn's CountVectorizer class, which does the following:

* It tokenizes the string (separates the string into individual words) and gives an integer ID to each token.
* It counts the occurrence of each of those tokens.

**Please note:**

* CountVectorizer automatically converts all tokenized words to lower case so that it does not treat words like 'He' and 'he' differently. It does this using the lowercase parameter, which is set to True by default.

* It also ignores all punctuation, so that a word adjacent to a punctuation mark (for example, 'hello!') is not treated differently from the same word on its own (for example, 'hello'). It does this using the token_pattern parameter, whose default regular expression selects tokens of 2 or more alphanumeric characters.

* The third parameter to take note of is stop_words. Stop words are the most commonly used words in a language, such as 'am', 'an', 'and' and 'the'. By setting this parameter to 'english', CountVectorizer will automatically ignore all words in our input text that are found in scikit-learn's built-in list of English stop words. This can be helpful, as stop words can skew our calculations when we are trying to find key words that are indicative of spam.
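
The three behaviours above can be approximated with the standard library alone. The following is a rough sketch, not scikit-learn's implementation: the regular expression is the default token pattern, while `tokenize` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the set is not scikit-learn's built-in English list).

```python
import re

# Default CountVectorizer token pattern: runs of 2 or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Tiny illustrative stop-word set; NOT scikit-learn's built-in English list.
STOP_WORDS = {"am", "an", "and", "the", "are"}

def tokenize(text, stop_words=None):
    """Lowercase the text, extract tokens, and optionally drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    if stop_words:
        tokens = [t for t in tokens if t not in stop_words]
    return tokens

sentence = "Hello! He said hello, and I am the winner."
print(tokenize(sentence))
# ['hello', 'he', 'said', 'hello', 'and', 'am', 'the', 'winner']
print(tokenize(sentence, STOP_WORDS))
# ['hello', 'he', 'said', 'hello', 'winner']
```

Note that the single-character word 'I' disappears even without a stop-word list, because the pattern requires at least two characters.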

### Implementing Bag of Words from Scratch

##### **Step 1: Convert all strings to their lower case form.**

In [48]:
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

In [49]:
lower_case_documents = []
for i in documents:
    lower_case_documents.append(i.lower())
print(lower_case_documents)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


##### **Step 2: Remove all punctuation**

In [50]:
import string

sans_punctuation_documents = []
for i in lower_case_documents:
    sans_punctuation_documents.append(i.translate(str.maketrans('', '', string.punctuation)))
print(sans_punctuation_documents)

['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']


##### **Step 3: Tokenization**

* Tokenizing a sentence in a document set means splitting the sentence into individual words using a delimiter. The delimiter specifies which character marks the beginning and the end of a word (for example, we could use a single space as the delimiter for identifying words in our document set).

In [51]:
preprocessed_documents = []

for i in sans_punctuation_documents:
    preprocessed_documents.append(i.split(' '))
print(preprocessed_documents)

[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]


##### **Step 4: Count frequencies**

Now that we have our document set in the required format, we can proceed to counting the occurrence of each word in each document of the document set (that is, in each SMS message). We will use the Counter class from the Python collections module for this purpose.

* Counter counts the occurrence of each item in the list and returns a dictionary-like object with the key as the item being counted and the corresponding value being the count of that item in the list.

In [52]:
frequency_list = []
import pprint
from collections import Counter

for i in preprocessed_documents:
    frequency_counts = Counter(i)
    frequency_list.append(frequency_counts)
pprint.pprint(frequency_list)

[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]


We have now implemented the Bag of Words process from scratch! As we can see in the previous output, we have a frequency distribution dictionary for each document, which gives a clear view of the text that we are dealing with.
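
The per-document counts can also be arranged into the document-by-word matrix described earlier. The sketch below is one possible way to do that in plain Python; sorting the vocabulary alphabetically is an assumption made here so that the column order is stable, and `preprocess` is a hypothetical helper that combines steps 1 to 3.

```python
from collections import Counter
import string

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

def preprocess(text):
    """Lowercase, strip punctuation, and split into tokens."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

counts = [Counter(preprocess(doc)) for doc in documents]

# The vocabulary is every distinct token, sorted so that column order is stable.
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word (missing words count as 0).
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so documents of different lengths all map to vectors of the same length.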

 
#### We will now use the sklearn.feature_extraction.text.CountVectorizer class in the next step.

### Implementing Bag of Words in scikit-learn

Now that we have implemented the BoW concept from scratch, let's go ahead and use scikit-learn to do this process in a clean and succinct way. We will use the same document set as we used in the previous step. 

In [53]:
documents = ['Hello, how are you!', 'Win money, win from home.', 'Call me now.', 'Hello, Call hello you tomorrow?']

In [54]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()

Previously, we implemented a version of CountVectorizer from scratch, which entailed cleaning our data first. This cleaning involved converting all of our data to lower case and removing all punctuation marks. CountVectorizer() has parameters which take care of these steps for us. They are:

* lowercase = True

The lowercase parameter has a DEFAULT value of True which converts all of our text to its lower case form.

* token_pattern = (?u)\\b\\w\\w+\\b

The token_pattern parameter has a DEFAULT regular expression value of (?u)\\b\\w\\w+\\b which ignores all punctuation marks and treats them as delimiters, while accepting alphanumeric strings of length greater than or equal to 2, as individual tokens or words.

* stop_words

The stop_words parameter, if set to 'english', will remove all words from our document set that match a list of English stop words defined in scikit-learn. Considering the size of our dataset and the fact that we are dealing with SMS messages and not larger text sources like e-mail, we will not be setting this parameter value.

##### We can have a look at all the parameter values of our count_vector object by simply printing out the object as follows:

In [55]:
print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


Now we will fit our document dataset to the CountVectorizer object we have created using fit(), and get the list of words which have been categorized as features using the get_feature_names() method. (In scikit-learn 1.0 and later this method is named get_feature_names_out(); get_feature_names() was removed in version 1.2.)

The get_feature_names() method returns our feature names for this dataset, which is the set of words that make up our vocabulary for 'documents'.

In [56]:
count_vector.fit(documents)
count_vector.get_feature_names()

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

Now we will create a matrix with the rows being each of the 4 documents (or messages), and the columns being each word in the vocabulary. The corresponding (row, column) value is the frequency of occurrence of that word (in the column) in a particular document (in the row). We can do this using the transform() method and passing in the document dataset as the argument.

* The transform() method returns a SciPy sparse matrix of integer counts; you can convert this to a dense NumPy array using toarray().

In [57]:
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)
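
Most entries in the array above are zero, which is why transform() returns a sparse matrix rather than a dense one. As a rough illustration of the idea (a dictionary of non-zero cells, not SciPy's actual storage format), the counts can be stored like this:

```python
# Dense counts copied from the array shown above.
doc_array = [[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
             [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
             [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
             [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]]

# Sparse view: keep only (row, column) -> count for the non-zero cells.
sparse = {(r, c): value
          for r, row in enumerate(doc_array)
          for c, value in enumerate(row)
          if value != 0}

total_cells = len(doc_array) * len(doc_array[0])
print(len(sparse), "non-zero cells out of", total_cells)
# 15 non-zero cells out of 48
```

With a real vocabulary of thousands of words and short messages, the proportion of zero cells is far higher, so the saving from sparse storage grows accordingly.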

Now we have a clean representation of the documents in terms of the frequency distribution of the words in them. To make it easier to understand, our next step is to convert this array into a dataframe and name the columns appropriately.

In [58]:
frequency_matrix = pd.DataFrame(doc_array, columns = count_vector.get_feature_names())

frequency_matrix

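|   | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|-----|------|------|-------|------|-----|----|-------|-----|----------|-----|-----|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |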


##### We have successfully built a Bag of Words representation for a document dataset that we created.

### Create training and testing datasets

In [59]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1,
                                                    test_size=0.3)

print('Total no. of data points: {}'.format(df.shape[0]))
print('No. of data points in training dataset: {}'.format(X_train.shape[0]))
print('No. of data points in testing dataset: {}'.format(X_test.shape[0]))

Total no. of data points: 5572
No. of data points in training dataset: 3900
No. of data points in testing dataset: 1672
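
The sizes printed above follow from rounding the test fraction up: 30% of 5572 is 1671.6, which becomes 1672 test rows and leaves 3900 for training. The sketch below reproduces only that arithmetic with a hypothetical `simple_split` helper; its shuffle is not scikit-learn's, so the rows it selects will differ from those chosen by train_test_split(random_state=1).

```python
import math
import random

def simple_split(items, test_size=0.3, seed=1):
    """Shuffle a copy of items and split off a test fraction.

    Illustrative only: the shuffle differs from scikit-learn's, so the
    selected rows will not match train_test_split(random_state=1).
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))  # test size is rounded up
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(5572), test_size=0.3)
print(len(train), len(test))  # 3900 1672
```

Fixing the seed makes the split repeatable, which is the role random_state plays in the call above.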


#### Applying Bag of Words to our dataset

Now that we have split the data, our next objective is to follow the Bag of Words steps implemented earlier and convert our data into the desired matrix format. To do this we will be using CountVectorizer() as we did before. There are two steps to consider here:

* First, we fit CountVectorizer() to our training data (X_train) and return the matrix.

* Second, we transform our testing data (X_test) to return the matrix.

Note that X_train is our training data for the 'sms_message' column in our dataset and we will be using this to train our model.

X_test is our testing data for the 'sms_message' column and this is the data we will be using (after transformation to a matrix) to make predictions on. We will then compare those predictions with y_test in a later step.
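
To see why the vocabulary is learned from the training data only, here is a small pure-Python sketch of the fit and transform steps. The function names and the messages are made up for illustration, and the code mimics CountVectorizer's default tokenization rather than calling it.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens of 2 or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(texts):
    """Learn a sorted vocabulary from the training texts only."""
    return sorted({token for text in texts for token in tokenize(text)})

def transform(texts, vocabulary):
    """Count vocabulary words per text; words outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in vocabulary])
    return rows

# Made-up messages for illustration.
train_texts = ["Win money now", "Call me tomorrow"]
test_texts = ["Win a prize, call now"]

vocabulary = fit_vocabulary(train_texts)
print(vocabulary)
# ['call', 'me', 'money', 'now', 'tomorrow', 'win']
print(transform(test_texts, vocabulary))
# [[1, 0, 0, 1, 0, 1]]
```

The test message contains 'prize', which was never seen during fitting, so it is simply dropped. The training and testing matrices therefore share the same columns, which the model requires.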

##### Instantiate CountVectorizer


In [60]:
count_vector = CountVectorizer()

##### Fit the training data and then return the matrix (converting the text dataset (messages) to a matrix of word frequencies)


In [61]:
training_data = (count_vector.fit_transform(X_train)).toarray()
print(training_data)
print(training_data.shape)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(3900, 7155)


##### Transform testing data and return the matrix. 

* Note that we are not fitting CountVectorizer() to the testing data; we only transform it using the vocabulary learned from the training data.

In [62]:
testing_data = (count_vector.transform(X_test)).toarray()
print(testing_data)
print(testing_data.shape)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(1672, 7155)


##### Train Model

In [63]:
# Instantiate our model
naive_bayes = MultinomialNB()

# Fit our model to the training data
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
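
For intuition about what fit() estimates here, the following is a simplified pure-Python sketch of multinomial Naive Bayes with additive smoothing (alpha=1.0, as in the output above). It is not scikit-learn's implementation; the function names and the toy data are assumptions made for illustration.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word across all documents of class c.
        word_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(word_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in word_totals]
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, X):
    """Pick the class with the highest log posterior for each row."""
    classes, log_prior, log_likelihood = model
    predictions = []
    for row in X:
        scores = {c: log_prior[c] + sum(n * w for n, w in zip(row, log_likelihood[c]))
                  for c in classes}
        predictions.append(max(scores, key=scores.get))
    return predictions

# Toy counts over a made-up vocabulary ['call', 'money', 'win']; 1 = spam, 0 = ham.
X_toy = [[0, 2, 1], [0, 1, 2], [2, 0, 0], [1, 0, 0]]
y_toy = [1, 1, 0, 0]

model = fit_multinomial_nb(X_toy, y_toy)
print(predict_multinomial_nb(model, [[0, 1, 1], [1, 0, 0]]))  # [1, 0]
```

The smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.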

##### Predict

In [64]:
# Predict on the test data
predictions = naive_bayes.predict(testing_data)

##### Evaluate the Model

In [65]:
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9874401913875598
Precision score:  0.9728506787330317
Recall score:  0.9347826086956522
F1 score:  0.9534368070953437
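
The four scores are all derived from the counts of true and false positives and negatives for the spam class. The sketch below shows the formulas. The counts passed in are not printed by the notebook: they were inferred from the scores above and the test set size of 1672, so treat them as an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts inferred from the scores above (an assumption; the notebook does not print them).
accuracy, precision, recall, f1 = classification_metrics(tp=215, fp=6, fn=15, tn=1436)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9874 0.9729 0.9348 0.9534
```

Under these counts, only 6 ham messages would have been flagged as spam, while 15 spam messages slipped through.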


##### In general, there is a five-step process that can be used each time you want to use a supervised learning method (which you actually used above):

1. Import the model.
2. Instantiate the model with the hyperparameters of interest.
3. Fit the model to the training data.
4. Predict on the test data.
5. Evaluate the Score of the model by comparing the predictions to the actual values.
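
Because every model exposes the same fit and predict interface, steps 3 to 5 can be wrapped in one reusable function. The sketch below uses a hypothetical toy classifier (not part of scikit-learn) and made-up data, purely to show the shape of the workflow.

```python
class MajorityClassifier:
    """Toy model (hypothetical, not from scikit-learn) that always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Steps 3 to 5: fit, predict, and return accuracy."""
    model.fit(X_train, y_train)                      # Step 3: fit
    predictions = model.predict(X_test)              # Step 4: predict
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return correct / len(y_test)                     # Step 5: evaluate

# Made-up data: feature rows are placeholders, labels are 0 = ham, 1 = spam.
X_train, y_train = [[0], [1], [2], [3]], [0, 0, 0, 1]
X_test, y_test = [[4], [5]], [0, 1]

model = MajorityClassifier()                         # Steps 1 and 2: import and instantiate
print(fit_and_score(model, X_train, y_train, X_test, y_test))  # 0.5
```

Any object with compatible fit and predict methods could be passed to the same function.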

Follow this notebook to perform these steps using three ensemble methods and a support vector machine (which is not an ensemble method):
##### BaggingClassifier, RandomForestClassifier, AdaBoostClassifier and SVM

##### Step 1: Import all 4 models

In [66]:
# Import the Bagging, RandomForest, and AdaBoost classifiers and the SVC from sklearn

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

##### Step 2: Instantiate each model with hyperparameters

In [67]:
# Instantiate a BaggingClassifier with:
# 200 weak learners (n_estimators) and everything else as default values

Bag = BaggingClassifier(n_estimators=200)

# Instantiate a RandomForestClassifier with:
# 200 weak learners (n_estimators) and everything else as default values

RF = RandomForestClassifier(n_estimators=200)

# Instantiate an AdaBoostClassifier with:
# 300 weak learners (n_estimators) and a learning_rate of 0.2

ADBoost = AdaBoostClassifier(n_estimators=300, learning_rate=0.2)

# Instantiate an SVM with default parameter values:
SVM = SVC()

##### Step 3: Fit each model with training_data and y_train

In [68]:
# Fit your BaggingClassifier to the training data
Bag.fit(training_data, y_train)

BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
         verbose=0, warm_start=False)

In [69]:
# Fit your RandomForestClassifier to the training data
RF.fit(training_data, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [70]:
# Fit your AdaBoostClassifier to the training data
ADBoost.fit(training_data, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.2, n_estimators=300, random_state=None)

In [71]:
# Fit your SVM to the training data
SVM.fit(training_data, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

##### Step 4: Predict for each model using testing_data

In [72]:
# Predict using BaggingClassifier on the test data
y_Bag = Bag.predict(testing_data)

# Predict using RandomForestClassifier on the test data
y_RF = RF.predict(testing_data)

# Predict using AdaBoostClassifier on the test data
y_ADBoost = ADBoost.predict(testing_data)

# Predict using SVM on the test data
y_SVM = SVM.predict(testing_data)

##### Step 5: Evaluate score (how well each of your models is performing) for each model by comparing prediction to actual values

In [73]:
def print_metrics(y_true, preds, model_name=None):
    '''
    INPUT:
    y_true - the actual labels from the dataset, e.g. y_test (numpy array or pandas series)
    preds - the predictions for those labels from some model, e.g. y_Bag, y_RF, y_ADBoost (numpy array or pandas series)
    model_name - (optional) a name for the model, e.g. 'Bagging Classifier', to include in the print statements

    OUTPUT:
    prints the accuracy, precision, recall, and F1 score
    '''
    if model_name is None:
        print('Accuracy score: ', format(accuracy_score(y_true, preds)))
        print('Precision score: ', format(precision_score(y_true, preds)))
        print('Recall score: ', format(recall_score(y_true, preds)))
        print('F1 score: ', format(f1_score(y_true, preds)))
        print('\n\n')
    
    else:
        print('Accuracy score for ' + model_name + ' :' , format(accuracy_score(y_true, preds)))
        print('Precision score ' + model_name + ' :', format(precision_score(y_true, preds)))
        print('Recall score ' + model_name + ' :', format(recall_score(y_true, preds)))
        print('F1 score ' + model_name + ' :', format(f1_score(y_true, preds)))
        print('\n\n')

In [74]:
# Print Bagging scores
print_metrics(y_true=y_test, preds=y_Bag, model_name="Bagging Classifier")

# Print Random Forest scores
print_metrics(y_true=y_test, preds=y_RF, model_name="Randomforest Classifier")

# Print AdaBoost scores
print_metrics(y_true=y_test, preds=y_ADBoost, model_name="AdaBoost Classifier")

# Naive Bayes Classifier scores
print_metrics(y_true=y_test, preds=predictions, model_name="Naive Bayes Classifier")

# SVM Classifier scores
# Note: these use average='weighted' restricted to the labels present in y_SVM,
# so they are not directly comparable with the binary (spam-class) scores above.

print('Accuracy score for SVM :' , format(accuracy_score(y_test, y_SVM)))
print('Precision score for SVM :', format(precision_score(y_test, y_SVM, average='weighted', labels=np.unique(y_SVM))))
print('Recall score for SVM :', format(recall_score(y_test, y_SVM, average='weighted', labels=np.unique(y_SVM))))
print('F1 score for SVM :', format(f1_score(y_test, y_SVM, average='weighted', labels=np.unique(y_SVM))))

Accuracy score for Bagging Classifier : 0.97188995215311
Precision score Bagging Classifier : 0.9216589861751152
Recall score Bagging Classifier : 0.8695652173913043
F1 score Bagging Classifier : 0.8948545861297539



Accuracy score for Randomforest Classifier : 0.9784688995215312
Precision score Randomforest Classifier : 0.98989898989899
Recall score Randomforest Classifier : 0.8521739130434782
F1 score Randomforest Classifier : 0.9158878504672896



Accuracy score for AdaBoost Classifier : 0.9760765550239234
Precision score AdaBoost Classifier : 0.9611650485436893
Recall score AdaBoost Classifier : 0.8608695652173913
F1 score AdaBoost Classifier : 0.9082568807339451



Accuracy score for Naive Bayes Classifier : 0.9874401913875598
Precision score Naive Bayes Classifier : 0.9728506787330317
Recall score Naive Bayes Classifier : 0.9347826086956522
F1 score Naive Bayes Classifier : 0.9534368070953437



Accuracy score for SVM : 0.8624401913875598
Precision score for SVM : 0.862440191387