<a href="https://colab.research.google.com/github/Trantracy/Spam-Classification/blob/master/Spam_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://developers.google.com/machine-learning/guides/text-classification/images/TextClassificationExample.png)

### “To spam, or not to spam, that is the question”

Before you start, please keep in mind the flow of our project today:  


1.   Get data
2.   Clean data
3.   Transform text data to numbers so that the computer can understand it (**vectorization**)
4.   Build our model using sklearn
5.   Tune our model for higher accuracy



## 1. Get the data

In [1]:
# we import our libraries

import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
data_url = 'https://raw.githubusercontent.com/accidentallydoesnotwork/Spam_message/master/spam.csv'
data = pd.read_csv(data_url,  delimiter=',',encoding='latin-1')

HTTPError: HTTP Error 404: Not Found

In [None]:
data.head()

In [None]:
data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'],axis=1,inplace=True)

In [None]:
data.rename(columns={'v1':'label', 
                     'v2':'text'}, inplace=True)

In [None]:
data.head()

In [None]:
data.shape

## 2. Clean our Data

In [None]:
import nltk 
nltk.download('stopwords')
from nltk import stem 
from nltk.corpus import stopwords 
nltk.download('wordnet')

#### Stopwords

Stopwords are very common words that carry little information on their own. We remove them so the model can focus on the words that matter.

Quick question: can you name 10 stopwords?

In [None]:
# get stopwords
stopwords = set(stopwords.words('english'))

In [None]:
# remove all the stopwords function
def remove_stopwords(text):
    clean_text=[word for word in text if word not in stopwords]
    return clean_text 

In [None]:
def demon_stopwords(text):
    clean_text=' '.join(word for word in text if word not in stopwords)
    return clean_text

In [None]:
text = "What is love? Baby don't hurt me. don't hurt me No more. Baby don't hurt me, don't hurt me No more. What is love?"
demon_stopwords(text.split())

### Text normalization 

<img src="https://i.imgur.com/0zNf5Ln.png" width="600">

---



#### Stemmer

A stemmer cuts off the end of each word to remove its suffix (hopefully!).

In [None]:
# stemmer = stem.SnowballStemmer('english')
stemmer = stem.PorterStemmer()

In [None]:
def stemming(text): 
    stem = [stemmer.stem(word) for word in text]
    return stem

In [None]:
test = 'rock rocks corpora corpus foot feet beautiful beauty available availability loved loving love. loving. '
print("Clean with stem:", stemming(test.split()), sep='\n')

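**What we learn here:** punctuation gets in the way of cleaning. Tokens with trailing punctuation, such as `love.` and `loving.`, are not reduced to the same stem as `love` and `loving`, so we need to remove the punctuation first.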

In [None]:
import string
import re
# string.punctuation is a string containing all ASCII punctuation characters
string.punctuation


In [None]:
# remove all the special characters above 
def remove_punctuation(text):
    new_text=''.join([char for char in text if char not in string.punctuation])
    return new_text

In [None]:
# split the sentence into a list of words
def tokenize(text):
    tokens = re.split(r'\W+', text.lower())  # raw string avoids an invalid-escape warning
    return tokens 

In [None]:
# Visualize our cleaning process 
new_data = data.copy()
new_data['punctuation_free'] = new_data['text'].apply(remove_punctuation)
new_data['tokenize'] = new_data['punctuation_free'].apply(tokenize)  # build on the previous step
new_data['remove_stopwords'] = new_data['tokenize'].apply(remove_stopwords)
new_data['stemming'] = new_data['remove_stopwords'].apply(stemming)

In [None]:
new_data.head()

**Now that we understand the concept behind each cleaning method, let's apply them to our data**

In [None]:
# Clean our data using stopwords and stemming 

def clean_message_stem(msg):
  # all lower case 
  msg = msg.lower()
  # removing stopwords
  msg = [word for word in msg.split() if word not in stopwords]
  # using stemmer
  msg = " ".join([stemmer.stem(word) for word in msg])
  
  return msg 

In [None]:
# Clean our data using stopwords and lemmatization 

# the lemmatizer relies on the WordNet data downloaded above
lemmatizer = stem.WordNetLemmatizer()

def clean_message_lemma(msg):
  # all lower case 
  msg = msg.lower()
  # removing stopwords
  msg = [word for word in msg.split() if word not in stopwords]
  # using lemma
  msg = " ".join([lemmatizer.lemmatize(word) for word in msg])
  
  return msg 

In [None]:
# remove special characters and clean message using stemming.

data['text'] = data['text'].apply(remove_punctuation)
data['text'] = data['text'].apply(clean_message_stem)

## 3. Transform text data to numbers so that the computer can understand it

In [None]:
# prep our data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size = 0.1, random_state = 1)

![Visualize train test split](https://www.researchgate.net/profile/Brian_Mwandau/publication/325870973/figure/fig6/AS:639531594285060@1529487622235/Train-Test-Data-Split.png)

**Visualize train test split.**
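
To make the split concrete, here is a minimal pure-Python sketch of what a shuffled 90/10 split does. The helper `simple_train_test_split` and the toy messages are hypothetical and only for illustration; this is not how sklearn implements `train_test_split`, which offers more options (for example stratification).

```python
import random

def simple_train_test_split(items, labels, test_size=0.1, seed=1):
    """Hypothetical helper: shuffle the indices, then hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)            # reproducible shuffle
    n_test = max(1, round(len(items) * test_size))  # keep at least one test sample
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# toy data invented for this sketch
messages = [f"message {i}" for i in range(10)]
labels = ["spam" if i % 5 == 0 else "ham" for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(messages, labels)
print(len(X_tr), len(X_te))
```

The key point is that texts and labels are split with the same shuffled indices, so every message keeps its own label.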

In [None]:
X_test.head()

### Bag of words

Bag of words (BoW) counts how often each word occurs in each document. A text is then represented as the bag of its words (a vector of word counts), disregarding both grammar and word order.

![alt text](https://cdn-media-1.freecodecamp.org/images/1*j3HUg18QwjDJTJwW9ja5-Q.png)
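
As a rough illustration of the idea, the sketch below builds bag-of-words vectors by hand with only the standard library. The toy corpus and the helper `bow_vector` are made up for this example, and the tokenization (lower-case, split on whitespace) is a simplifying assumption.

```python
from collections import Counter

# toy corpus invented for this sketch
docs = ["I love this dog", "We love this dog so much", "dog dog dog"]

# naive tokenization: lower-case and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# the vocabulary is every distinct word in the corpus, in a fixed (sorted) order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_matrix = [bow_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Every document becomes a vector with one entry per vocabulary word, so all vectors have the same length. `CountVectorizer` follows the same idea but applies its own tokenization (by default it ignores single-character tokens such as "i"), so its columns will not match this sketch exactly.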

In [None]:
# we import CountVectorizer, which implements the bag of words method
from sklearn.feature_extraction.text import CountVectorizer

# assign it to a variable 
count_vectorizer = CountVectorizer()

# transform our sentences into vectors of numbers 
X_train_bag = count_vectorizer.fit_transform(X_train)

In [None]:
# visualize our data after transform 
BoW_data = pd.DataFrame(X_train_bag.toarray())
BoW_data.head()

### TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency):  
This weight is a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times the word appears in the document, but is offset by how frequently the word appears across the corpus.

**Example: Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now assume we have 10 million documents and the word cat appears in one thousand of them. Then the inverse document frequency (idf), using a base-10 logarithm, is log10(10,000,000 / 1,000) = 4. The tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.**
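
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the textbook formula with a base-10 logarithm; the variable names are chosen here for illustration.

```python
import math

words_in_document = 100
cat_count_in_document = 3
total_documents = 10_000_000
documents_with_cat = 1_000

# term frequency: share of the document's words that are "cat"
tf = cat_count_in_document / words_in_document

# inverse document frequency: rarer words across the corpus get a larger weight
idf = math.log10(total_documents / documents_with_cat)

tf_idf = tf * idf
print(tf, idf, round(tf_idf, 2))
```

Note that sklearn's `TfidfVectorizer` uses a different variant by default (a smoothed natural-log idf and L2-normalized rows), so its numbers will not match this hand calculation.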

In [None]:
# Import tfidf method 
from sklearn.feature_extraction.text import TfidfVectorizer

# assign the tfidf method to a variable 
vectorizer = TfidfVectorizer()

# transform our sentences into vectors of numbers using tfidf method 
X_train_tfidf = vectorizer.fit_transform(X_train)

In [None]:
# visualize our vectors
tfidf_data = pd.DataFrame(X_train_tfidf.toarray())
tfidf_data.head()

**Now we have finished both cleaning our data and transforming it into numbers that the machine can understand. The only thing left is to build the classifier to separate out the spam emails.**


![](https://i.imgur.com/A12EMUd.png)

## 4. Build our model using sklearn

### RandomForest



<img src="https://matthewdharriscom.files.wordpress.com/2016/07/dog_2layer.png?w=350&h=300" width="400">

**A simple decision tree**

**TF-IDF performance**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report


In [None]:
rfc = RandomForestClassifier(n_estimators=20, random_state=0)
rfc.fit(X_train_tfidf, y_train)
X_test_tfidf = vectorizer.transform(X_test)
y_pred = rfc.predict(X_test_tfidf)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

**Bag of words performance**

In [None]:
rfc = RandomForestClassifier(n_estimators=20, random_state=0)
rfc.fit(X_train_bag, y_train)
X_test_bow = count_vectorizer.transform(X_test)
y_pred = rfc.predict(X_test_bow)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

### K-nearest neighbors

![alt text](http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531424125/KNN_final_a1mrv9.png)

**TF-IDF performance**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2, weights='distance')
knn.fit(X_train_tfidf, y_train)
y_pred = knn.predict(X_test_tfidf)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

**Bag of words performance**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2, weights='distance')
knn.fit(X_train_bag, y_train)
y_pred = knn.predict(X_test_bow)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

What you can learn from the model-building session: 


1.   You can fine-tune your model by adjusting its hyperparameters, so it is crucial to understand how the model works. Sklearn is easy to use, but you still need to understand the model behind it.
2.   You can get better results just by improving your input. A simple model with better input can outperform a complex model. **"Garbage in, garbage out"**, please remember that.
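
One possible way to tune hyperparameters systematically is a grid search with cross-validation. The sketch below assumes sklearn's `Pipeline` and `GridSearchCV`; the toy corpus and the parameter grid are invented for illustration, and in the notebook you would pass `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# toy corpus invented for this sketch; use X_train / y_train in practice
toy_texts = [
    "win free prize now", "free cash call now", "claim your free prize",
    "urgent call to win cash", "are we still meeting for lunch",
    "see you at home tonight", "can you call me after lunch",
    "thanks for dinner see you soon",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

# vectorizer and classifier in one pipeline, so each CV fold refits both
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier()),
])

# candidate values to try (an arbitrary, small grid)
param_grid = {
    "knn__n_neighbors": [1, 3],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(toy_texts, toy_labels)
print(search.best_params_)
```

Putting the vectorizer inside the pipeline means it is fitted only on the training part of each fold, which keeps the validation score honest.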



![alt text](https://static.packt-cdn.com/products/9781788838535/graphics/e6a200ba-94b2-4c9f-a9af-5e5812cfa8a6.jpg)

**Congratulations, you have finished building your first email classification model. Now you can test it and have fun with it.**

In [None]:
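# a function to classify a single message with a trained model
def pred(msg):
    # clean the message the same way the training data was cleaned
    msg = clean_message_stem(remove_punctuation(msg))
    # knn and rfc were last fitted on bag-of-words features, so use count_vectorizer
    msg = count_vectorizer.transform([msg])
    prediction = knn.predict(msg) # you can change between knn and rfc
    return prediction[0]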

In [None]:
pred('hey baby, wanna netflix and chill tonight, please call me 0123123. Please click yes')

In [None]:
pred("Join our workshop now. Available only in limited time")

In [None]:
pred("You have 1 new voicemail. Please call 08719181503")

In [None]:
# pick a random spam message from the data before predicting on it
spam = data[data['label']=='spam'].sample(10)['text'].values[9]

In [None]:
pred(spam)

## 5. Improve your accuracy

Your model is finished, but you discover that despite its high accuracy it still fails to recognize some spam messages. Your next task is therefore to improve the model. Here are some problems your model may have and ways to fix them: 

1. Poor prediction due to lack of data.

  **Solution**: ***Get more data***, or try tuning the hyperparameters.
2. Data imbalance: one class has more data than the others. 

  **Solution**: ***Get more data***, or use undersampling or oversampling techniques.
3. Poor input: either due to vectorization or unrealistic input. 

  **Solution**: ***Get more data***, clean the data more carefully, or try a different technique to represent the input (Hint: Word2Vec).
4. Your current model is not strong enough. 

  **Solution**: ***Get more data***, tune the hyperparameters, or switch to a different model. 
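
As a rough illustration of the oversampling idea from point 2, the sketch below duplicates minority-class samples at random until every class is as large as the biggest one. The helper `random_oversample` and the toy data are hypothetical and use only the standard library; dedicated libraries offer more refined resampling methods.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Hypothetical helper: duplicate minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(samples) for samples in by_label.values())
    new_texts, new_labels = [], []
    for label, samples in by_label.items():
        # draw the missing samples with replacement from the same class
        extra = rng.choices(samples, k=target - len(samples))
        for text in samples + extra:
            new_texts.append(text)
            new_labels.append(label)
    return new_texts, new_labels

# toy data invented for this sketch: five ham messages, one spam message
toy_texts = ["hi", "lunch?", "see you", "ok", "call me", "WIN CASH NOW"]
toy_labels = ["ham", "ham", "ham", "ham", "ham", "spam"]

balanced_texts, balanced_labels = random_oversample(toy_texts, toy_labels)
print(Counter(toy_labels), Counter(balanced_labels))
```

If you try this, oversample only the training split, so that duplicated messages do not end up in the test set.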

## Bag of words example 

In [None]:
import nltk 
import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer
import collections, re

In [None]:
bow = CountVectorizer()
text = 'I love this dog'
text1 = 'What is love? Baby dont hurt me, dont hurt me, no more'
text2 = 'We loving this dog so much'
text3 = 'I loved that toy dog. Too bad it broke'

In [None]:
dictionary = [text, text1, text2, text3]
dictionary

In [None]:
lst = ' '.join(dictionary).split()

In [None]:
bagofwords = collections.Counter(' '.join(dictionary).split())
bagofwords
# bagofwords.get('dog')

In [None]:
def vectorize(text, bagofword):
  vector = []
  for word in text.split():
    if word in bagofword.keys():
      vector.append(bagofword.get(word))
    else:
      vector.append(0)
  print(text, vector)


In [None]:
test = 'Dog is man best friend. We should not eat dog'
vectorize(test, bagofwords)

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

In [None]:
stopword = set(stopwords.words('english'))


In [None]:
def clean_stopwords(dictionary):
  dictionary = [word for word in dictionary if word not in stopword]
  bow = collections.Counter(dictionary)
  return bow

In [None]:
bow = clean_stopwords(lst)
bow

In [None]:
vectorize(test, bow)

In [None]:
import string 
special_char = string.punctuation  # '?' is already part of string.punctuation
special_char

In [None]:
def remove_punctuation(text):
    new_text=''.join([char for char in text if char not in string.punctuation])
    return new_text


In [None]:
# build a new list instead of appending to lst while iterating over it (which never terminates)
lst = [remove_punctuation(word) for word in lst]