# Project Overview

We will attempt to predict whether a text message is "Ham" (legitimate) or "Spam".

# Feature Extraction From Text

1. Most classic machine learning algorithms cannot take in raw text


2. Instead we need to perform feature extraction from the raw text in order to pass numerical features to the ML algorithm


3. For example, we could count the occurrences of each word to map text to numbers


4. Count Vectorization: Count the occurrences of all of the unique words. Each unique word will be treated as a feature.


5. It then records how many times each word appears in each document in a document-term matrix. This matrix will be "sparse" because most of its entries are zero


6. TfidfVectorizer: Alternative to CountVectorizer. It also creates a sparse matrix.
    
    
7. However, instead of filling the matrix with token counts, it calculates the term frequency-inverse document frequency (TF-IDF) value for each word.


8. Term frequency is the raw count of a term in a document (the number of times that term occurs in the document)


9. Because the term "the" is so common, TF will tend to incorrectly emphasize documents that use "the" more frequently. We need to find a way to give more weight to meaningful terms.


10. An inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently and increases the weight of terms that occur rarely


11. IDF is the logarithmically scaled inverse fraction of the documents that contain the word


12. This is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. 


13. TF-IDF allows us to understand the importance of a word across an entire corpus of documents instead of just its relative importance in a single document
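The TF and IDF definitions in the list above can be checked by hand on a tiny corpus. The sketch below is a minimal pure-Python illustration. The three toy documents and the helper names `term_frequency`, `inverse_document_frequency`, and `tf_idf` are made up for this example, and the unsmoothed formula log(N / df) follows the list above rather than any particular library's implementation.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the prize is free",
    "the meeting is at noon",
    "claim the free prize now",
]
tokenized = [doc.split() for doc in docs]

def term_frequency(term, tokens):
    # Raw count of the term in one document
    return Counter(tokens)[term]

def inverse_document_frequency(term, corpus):
    # log(total documents / documents containing the term)
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# "the" appears in every document, so its IDF is log(3/3)
idf_the = inverse_document_frequency("the", tokenized)
# "noon" appears in one document, so its IDF is log(3/1)
idf_noon = inverse_document_frequency("noon", tokenized)
# "free" appears once in the first document and in 2 of 3 documents overall
score_free = tf_idf("free", tokenized[0], tokenized)
```

A word that appears in every document gets a weight of zero here, while a rarer word such as "noon" gets the largest weight.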

# Importing Basic Libraries

In [1]:
# These are the libraries I typically use in my analysis so I find it easier to import them all at once
# If I need more libraries I will import them as needed

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline

# Importing the Dataset

In [2]:
# Our dataset is smsspamcollection.tsv, where tsv stands for "tab-separated values"
# Hence in order to import the file correctly we need to add delimiter = "\t"
# We will name the dataframe "emails", even though the rows are SMS text messages

emails =  pd.read_csv('smsspamcollection.tsv', delimiter = '\t')

In [3]:
# Here is a brief look at the dataset
# We have the dependent variable "label" with values ham and spam
# We then have the actual text message, the length of the message, and the amount of punctuation in the message

emails.head()

| | label | message | length | punct |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


In [4]:
# There are 5572 total messages in our dataset

emails.shape

(5572, 4)

In [5]:
# There are no missing values in this dataset. Certainly makes things easier

emails.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [6]:
# Looks like we have 4825 "ham" labels and 747 "spam" labels
# This is definitely an unbalanced dataset and we will have to be careful about using accuracy as our metric

emails['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
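To see why accuracy alone can mislead on this dataset, it helps to compute what a trivial classifier would score. The sketch below uses only the class counts from the output above. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# A "classifier" that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Always predicting '{majority_label}' gives accuracy {baseline_accuracy:.3f}")
```

Any model we build should be judged against this baseline, and metrics such as precision and recall on the spam class are more informative than accuracy alone.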

# Creating Our Classification Model 

In [7]:
# In a similar exercise we used: X = emails[['length','punct']]
# Now we are just focused on the message column
X = emails['message']  

# y is our "label" data
y = emails['label']

In [8]:
# Here we will split our data into a training and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Training Data Shape:', X_train.shape)
print('Testing Data Shape: ', X_test.shape)

Training Data Shape: (4457,)
Testing Data Shape:  (1115,)


In [9]:
# Here we are going to perform feature extraction on our message column
# CountVectorizer has text pre-processing, tokenizing, and the ability to filter out stop words
# CV builds a dictionary of features and transforms documents to feature vectors

from sklearn.feature_extraction.text import CountVectorizer
# We want to pass X into the instance of the CountVectorizer class
count_vect = CountVectorizer()

# Here we are fitting the vectorizer to the training data
# This builds a vocabulary, counts the number of words, etc.
# We then transform the text message to a vector
X_train_counts = count_vect.fit_transform(X_train)
# X_train_counts is a HUGE sparse matrix. Jupyter won't allow us to see the whole thing
# There are 4457 messages in our training set, with 7702 unique words that are in those messages
X_train_counts.shape

(4457, 7702)
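To make the fit and transform steps less of a black box, here is a rough pure-Python sketch of the same idea on three made-up messages. The tokenizing regex is an assumption meant to approximate CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters). The real class does more and stores its result in a SciPy sparse matrix.

```python
import re
from collections import Counter

# Hypothetical toy messages standing in for X_train
messages = [
    "Free entry to WIN a prize",
    "Ok, see you at lunch",
    "Win a FREE prize now",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# "fit": map each unique token to a column index, in sorted order
unique_tokens = sorted({tok for m in messages for tok in tokenize(m)})
vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}

# "transform": one sparse row per message, storing only the non-zero counts
rows = []
for m in messages:
    token_counts = Counter(tokenize(m))
    rows.append({vocabulary[tok]: n for tok, n in token_counts.items()})

# (number of messages, number of unique tokens)
shape = (len(messages), len(vocabulary))
```

Each row stores only a handful of non-zero entries out of the full vocabulary, which is why a sparse representation is used.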

While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.

Both tf and tf–idf can be computed as follows using TfidfTransformer:

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

# So you see we are passing in our X_train_counts through here, NOT the original X_train
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# The shape is still the same, but the values are no longer just counts
# This is because each count has been reweighted by its IDF and each row has been normalized
X_train_tfidf.shape

(4457, 7702)
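The reweighting can be reproduced by hand on a small count matrix. The sketch below assumes what I understand to be TfidfTransformer's default settings: a smoothed IDF of ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The count matrix is hypothetical.

```python
import math

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 0, 3],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(row):
    # Scale the row so that its Euclidean length is 1
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

# Weight each count by its term's IDF, then normalize each row
tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]
```

The first term appears in every document, so it gets the smallest IDF, and every row ends up with unit length regardless of how long the original document was.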

In [11]:
# In the future, we can combine the CountVectorizer and TfidfTransformer steps into one using TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

# Remember to use the original X_train set
X_train_tfidf = vectorizer.fit_transform(X_train) 

# Once again, same shape. But everything was done in one step now. 
X_train_tfidf.shape

(4457, 7702)

In [12]:
# Here we will build a Support Vector Machine classifier.
# The Linear SVC handles sparse input better, and scales well to large numbers of samples

from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC()

In [13]:
# Remember that only our training set has been vectorized into a full vocabulary. 
# In order to perform an analysis on our test set we'll have to submit it to the same procedures. 
# Fortunately scikit-learn offers a Pipeline class that behaves like a compound classifier.

from sklearn.pipeline import Pipeline

# A Pipeline object takes a list of tuples
# Each tuple holds a step name of your choosing, followed by the transformer or estimator to run
# This lets you vectorize the text and fit the model on the training set in one step
# A Pipeline is especially useful once you add stop words, lemmatization, etc.
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
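The reason the test set must go through the same fitted steps is that the vocabulary is learned from the training data only. The pure-Python sketch below illustrates the idea with made-up messages and a hypothetical `transform` helper. It is a simplification of what a fitted vectorizer does, not scikit-learn's actual code.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Hypothetical training and test messages
train_messages = ["win a free prize", "see you at lunch"]
test_message = "free lunch voucher"

# Fit: the vocabulary comes from the training messages only
vocabulary = sorted({tok for m in train_messages for tok in tokenize(m)})

def transform(text):
    # Count only tokens seen during fitting; unseen tokens are dropped
    token_counts = Counter(tokenize(text))
    return [token_counts[tok] for tok in vocabulary]

test_vector = transform(test_message)
```

The test vector has exactly as many columns as the training vocabulary, and the unseen word "voucher" contributes nothing. Refitting on the test set would instead produce columns that the trained classifier does not understand.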

In [14]:
# Creating predictions

y_pred = text_clf.predict(X_test)

In [15]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,y_pred))

[[964   2]
 [  7 142]]


In [16]:
# Print a classification report
print(metrics.classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         ham       0.99      1.00      1.00       966
        spam       0.99      0.95      0.97       149

    accuracy                           0.99      1115
   macro avg       0.99      0.98      0.98      1115
weighted avg       0.99      0.99      0.99      1115
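The spam row of the report above can be reproduced directly from the confusion matrix, which helps in reading both outputs. In scikit-learn's convention, rows are true labels and columns are predicted labels, with labels in sorted order (ham, then spam). The variable names below are illustrative.

```python
# Confusion matrix from the output above
# Rows are true labels, columns are predicted labels, in the order (ham, spam)
cm = [[964, 2],
      [7, 142]]

tn, fp = cm[0]  # true ham: predicted ham, predicted spam
fn, tp = cm[1]  # true spam: predicted ham, predicted spam

# Of the messages flagged as spam, how many really were spam?
spam_precision = tp / (tp + fp)
# Of the real spam messages, how many did we catch?
spam_recall = tp / (tp + fn)
# Harmonic mean of precision and recall
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
# Share of all messages classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

In plain terms, 2 legitimate messages were flagged as spam and 7 spam messages slipped through as ham.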



In [17]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,y_pred))

0.9919282511210762


In [18]:
# Here we are going to use our classifier to predict whether a new message is spam or ham
# Our classifier predicts that it is "ham" or a legitimate message

text_clf.predict(["Hi how are you doing today?"])

array(['ham'], dtype=object)

In [19]:
# Here we are going to use our classifier to predict whether a new message is spam or ham
# Our classifier predicts that it is a spam message

text_clf.predict(["Congratulations! You've been selected as a winner. TEXT WON to 44255"])

array(['spam'], dtype=object)