## Classification

Classification is the machine learning problem of assigning a label to each data point.

A machine learning classifier learns an approximate mapping function $(f)$ from input variables $(X)$ to discrete output variables $(y)$.

### Classifying an SMS either as spam or not spam

In this tutorial, we build a binary classifier that identifies whether an SMS received on your phone is spam or not spam.


### Workflow


*   Loading data
*   Building model
*   Evaluating model



---

## Step 1: Getting data ready

---


We are going to use the [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) dataset from the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [0]:
'''
Downloading the dataset
'''

! wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
! unzip smsspamcollection.zip
! ls

In [0]:
'''
Loading the data into a pandas dataframe
'''

import pandas as pd

data = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
data.head()

### Creating a train-test split

In [0]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data[1], data[0], test_size=.2)

In [0]:
print(data.shape)
print(x_train.shape)
print(x_test.shape)
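
The split above is random, so the exact rows in each part change from run to run, and the share of spam in the two parts can differ. As a minimal sketch on a made-up toy dataset (the `toy` DataFrame and the `toy_*` names are illustrative, not part of the SMS data), the `random_state` and `stratify` arguments of `train_test_split` can make the split repeatable and keep the class ratio the same in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham and 2 spam messages
# (labels in column 0, text in column 1, mirroring the layout used above)
toy = pd.DataFrame({
    0: ['ham'] * 8 + ['spam'] * 2,
    1: ['message %d' % i for i in range(10)],
})

# random_state fixes the shuffle; stratify keeps the ham/spam ratio in both parts
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy[1], toy[0], test_size=0.5, random_state=0, stratify=toy[0]
)

print(toy_y_train.value_counts().to_dict())
print(toy_y_test.value_counts().to_dict())
```

Stratifying is mostly useful when one class is rare, as spam usually is.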

### Vectorizing the data

In [0]:
print(x_train.head())

In [0]:
'''
Understanding CountVectorizer
'''

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus) # Learn the vocabulary dictionary and return term-document matrix.

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(X.toarray())  
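
One detail that matters later, when the test messages are vectorized: `transform` only counts words that were seen during `fit`. A small sketch with a made-up two-sentence corpus (the `toy_*` names are illustrative) shows the learned vocabulary and what happens to an unseen word:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration
toy_corpus = ['free prize now', 'see you now']

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_corpus)

# vocabulary_ maps each learned word to its column index (alphabetical order)
print(toy_vectorizer.vocabulary_)

# 'lottery' was never seen during fit, so it is silently dropped;
# 'free' appears twice and is counted twice
new_vec = toy_vectorizer.transform(['free free lottery']).toarray()
print(new_vec)
```

This is why the vectorizer is fitted on the training data only and then reused, unchanged, on new messages.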

In [0]:
'''
Vectorising the training data
'''

vectorizer = CountVectorizer()
x_train_vec = vectorizer.fit_transform(x_train)

print(x_train_vec.shape)

---

## Step 2: Building Model

---

We would like to use a **Naive Bayes model** to classify whether an SMS is spam or ham (not-spam).

### Brief Explanation

* SMS1: How are you
* SMS2: Congratulations you have won 100000

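Naive Bayes treats the words of a message as conditionally independent given the class, so each class is scored by its prior multiplied by the per-word likelihoods. For SMS1:

$p(\text{is_spam}=\text{true} \mid \text{How are you}) \propto p(\text{How} \mid \text{is_spam}=\text{true}) \times p(\text{are} \mid \text{is_spam}=\text{true}) \times p(\text{you} \mid \text{is_spam}=\text{true}) \times p(\text{is_spam}=\text{true})$

$p(\text{is_spam}=\text{false} \mid \text{How are you}) \propto p(\text{How} \mid \text{is_spam}=\text{false}) \times p(\text{are} \mid \text{is_spam}=\text{false}) \times p(\text{you} \mid \text{is_spam}=\text{false}) \times p(\text{is_spam}=\text{false})$

The SMS is assigned to whichever class gets the higher score.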
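
The comparison above can be sketched in plain Python. The probabilities below are made up for illustration and are not estimated from the SMS data; the function name `naive_bayes_score` is also just a placeholder. The sketch follows the formula as written, whereas scikit-learn's `BernoulliNB` additionally accounts for words that are absent from the message.

```python
# Hypothetical per-word likelihoods and class priors, made up for illustration
p_word_given_spam = {'how': 0.05, 'are': 0.10, 'you': 0.40}
p_word_given_ham = {'how': 0.30, 'are': 0.40, 'you': 0.50}
p_spam, p_ham = 0.2, 0.8


def naive_bayes_score(words, p_word_given_class, prior):
    """Unnormalised posterior: the prior times the product of per-word likelihoods."""
    score = prior
    for word in words:
        score *= p_word_given_class[word]
    return score


words = 'How are you'.lower().split()
spam_score = naive_bayes_score(words, p_word_given_spam, p_spam)
ham_score = naive_bayes_score(words, p_word_given_ham, p_ham)

# Pick the class with the higher score
label = 'spam' if spam_score > ham_score else 'ham'
print(spam_score, ham_score, label)
```

With these made-up numbers the ham score is far larger, so "How are you" would be classified as ham.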

In [0]:
'''
Learning the model
'''

from sklearn.naive_bayes import BernoulliNB

classifier = BernoulliNB()
classifier.fit(x_train_vec, y_train)

---

## Step 3: Evaluating Model

---

In [0]:
'''
Testing the model
Evaluation metric: Accuracy
'''
x_test_vec = vectorizer.transform(x_test)
score = classifier.score(x_test_vec, y_test)
print(score)

In [0]:
'''
Confusion Matrix
'''
from sklearn.metrics import confusion_matrix
prediction = classifier.predict(x_test_vec)
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix(y_test, prediction)
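
scikit-learn's `confusion_matrix` takes the true labels first and the predictions second, so rows correspond to true classes and columns to predicted classes. As a small sketch with made-up labels (the `toy_*` names are illustrative), the four cells can be unpacked to compute precision and recall for the spam class:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, made up for illustration
toy_true = ['ham', 'ham', 'ham', 'spam', 'spam', 'ham']
toy_pred = ['ham', 'spam', 'ham', 'spam', 'ham', 'ham']

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(toy_true, toy_pred, labels=['ham', 'spam'])
print(cm)

# ravel() flattens the 2x2 matrix row by row
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
print(precision, recall)
```

Accuracy alone can look good on imbalanced data, which is why these per-class numbers are worth checking for a spam filter.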

#### Try your own data

In [0]:
x_test_vec = vectorizer.transform(['Hello Grab this Offer!! With Club Mahindra Membership get Free 3N/4D Singapore Cruise Trip. Join Now !  http://p3v.in/PHNKLNKYMMP'])
classifier.predict(x_test_vec)

In [0]:
x_test_vec = vectorizer.transform(['Hey! How was this tutorial?'])
classifier.predict(x_test_vec)

---

## Further notes

---

### K-fold cross validation 

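In the steps above, 20% of the data was set aside as test data to evaluate the model. A single split like this can give a noisy estimate, because the score depends on which messages happen to land in the test set.

In K-fold cross-validation, the dataset is split into $k$ mutually exclusive subsets (folds). Each fold is used once as the validation set while the remaining $k - 1$ folds form the training set, and the average accuracy across folds is reported. Note that scikit-learn's `KFold` splits the data in its original order unless `shuffle=True` is passed.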

In [0]:
from sklearn.model_selection import KFold

# Creating four folds
kf = KFold(n_splits=4)

# Reinitializing the classifier
classifier = BernoulliNB()

scores = []
for train_index, test_index in kf.split(data):
  # data for each fold
  # kf.split yields positional indices, so select rows with .iloc
  x_train, x_test = data[1].iloc[train_index], data[1].iloc[test_index]
  y_train, y_test = data[0].iloc[train_index], data[0].iloc[test_index]
  
  # Vectorizing it
  vectorizer = CountVectorizer()
  x_train_vec = vectorizer.fit_transform(x_train)
  
  # Fitting the model
  classifier.fit(x_train_vec, y_train)
  
  # Evaluating
  x_test_vec = vectorizer.transform(x_test)
  score = classifier.score(x_test_vec, y_test)
  scores.append(score)

avg_score = sum(scores) / len(scores) 
print('Average accuracy:', avg_score)
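
The same procedure can be written more compactly, as a sketch, with `make_pipeline` and `cross_val_score`. The toy messages below are made up for illustration (the `toy_*`, `model` and `folds` names are placeholders); on the real data, the text and label columns would be passed in their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, made up for illustration
toy_texts = [
    'win a free prize now', 'free cash offer', 'claim your free prize', 'win cash now',
    'how are you', 'see you at lunch', 'are you coming home', 'call me at lunch',
]
toy_labels = ['spam'] * 4 + ['ham'] * 4

# The pipeline refits the vectorizer on the training folds only, like the loop above
model = make_pipeline(CountVectorizer(), BernoulliNB())

# shuffle=True mixes the classes across folds; random_state makes it repeatable
folds = KFold(n_splits=4, shuffle=True, random_state=0)
toy_scores = cross_val_score(model, toy_texts, toy_labels, cv=folds)

print('Average accuracy:', toy_scores.mean())
```

Wrapping the vectorizer and classifier in one pipeline makes it harder to fit the vocabulary on validation data by accident.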

---

## Exercise

---

Try using a support vector machine to solve the same problem, and report its accuracy.

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
