# Analysis for Text Spam Detection using Supervised Learning

An NLP analysis project for text spam detection using Scikit-Learn and Pandas.


## Table of Contents

- [Home](#Analysis-for-Text-Spam-Detection-using-Supervised-Learning)
- [Description](#Description)
- [Installation and Usage](#Installation-and-Usage)
- [Tools and Algorithms](#Tools-and-Algorithms)
- [Credits](#Credits)
- First Model:
    - [Import Modules and Files](#Import-Modules-and-Files)
    - [Clean Dataset](#Clean-Dataset)
    - [Standardize Labels](#Standardize-Labels)
    - [Split Train and Test Datasets](#Split-Train-and-Test-Datasets)
    - [Train Model](#Train-Model)
    - [Test Results](#Test-Results)
- Second Model:
    - [Second Dataset Process](#Second-Dataset-Process)
- [Analysis](#Analysis)
    - [Overall Accuracy](#Overall-Accuracy)
    - [Models Against Each Other](#Models-Against-Each-Other)
    - [Email vs SMS Spam Detection](#Email-vs-SMS-Spam-Detection)
    - [Third Dataset Process](#Third-Dataset-Process)
    - [Two Models Against Each Other](#Two-Models-Against-Each-Other)
    
- [References](#References)


## Description

In this project we take a few labeled email/SMS datasets, clean them, split them into train and test sets, fit a text classification model, and finally analyze the prediction results.

## Installation and Usage

Set up your environment and install the dependencies by following [install/README.md](install/README.md).

## Tools and Algorithms

We used supervised machine learning to train and test our models. The classifier is the LinearSVC model from the sklearn module (read more [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)), which we selected as our main model after trying out several other sklearn models. We also used an sklearn Pipeline object to chain the steps, so that the output of each step feeds the next one and the whole chain is fitted and applied with a single call, which streamlines the training process. The first step is a TfidfVectorizer object (read more [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)), which extracts features from raw text and converts them into TF-IDF features; these are passed to the next step in the pipeline, the LinearSVC classifier. The Pandas module was used to load our datasets into DataFrame objects. The sklearn metrics module was used to generate reports and print the accuracy scores of the models on the test sets. IPython's Markdown display is used to prettify the printed results.
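To make the two pipeline steps concrete, here is a minimal sketch on a handful of toy messages. The messages and the names `texts`, `labels`, `toy_clf` and `toy_prediction` are invented for illustration and are not part of the project's datasets or notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy messages invented for illustration; they are not from the project's datasets.
texts = [
    "win a free prize now",
    "free cash prize claim now",
    "meeting moved to monday morning",
    "see you at lunch on monday",
]
labels = ["spam", "spam", "ham", "ham"]

# Step 1 on its own: raw text -> sparse TF-IDF matrix
# (one row per message, one column per term in the learned vocabulary).
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)
print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_)[:5])

# Steps 1 and 2 chained: the pipeline vectorizes the text and feeds the
# TF-IDF features to LinearSVC in a single fit/predict call.
toy_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
toy_clf.fit(texts, labels)
toy_prediction = toy_clf.predict(["claim your free prize"])
print(toy_prediction)
```

Words that were never seen during fitting (here "your") are simply ignored by the vectorizer at prediction time.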
    
## Credits

Developed by:

- [Mohammad Salek](mailto:salek.mohmd@gmail.com)

Published on: 2021/07/21.

## Import Modules and Files

In [1]:
# Import modules:
from IPython.display import Markdown, display
import pandas as pd
from pandas import DataFrame
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score


# Disable jedi for fast jupyter suggestions:
%config Completer.use_jedi = False


# prettifier function(s):
def printmd(s):
    display(Markdown(s))


In [2]:
# Load a CSV dataset, keeping only the text and label columns:
def load_df(file):
    df = pd.read_csv(file, usecols=['text','label'])
    return df


file = "dataset/first.csv"
df = load_df(file)
df.head(n=5)

| | text | label |
|---|---|---|
| 0 | date wed NUMBER aug NUMBER NUMBER NUMBER NUMB... | ham |
| 1 | martin a posted tassos papadopoulos the greek ... | ham |
| 2 | man threatens explosion in moscow thursday aug... | ham |
| 3 | klez the virus that won t die already the most... | ham |
| 4 | in adding cream to spaghetti carbonara which ... | ham |


## Clean Dataset

In [3]:
# Check for missing values and drop the rows that contain them:
def clean_dataset(df, label1, label2):
    missing_label1_count = df[label1].isnull().sum()
    missing_label2_count = df[label2].isnull().sum()
    if missing_label1_count > 0 or missing_label2_count > 0:
        print(f"Found {missing_label1_count} bad {label1} and {missing_label2_count} bad {label2} fields. Now deleting them...")
        # Drop every row with a missing value (this modifies the passed dataframe in place):
        df.dropna(inplace=True)
        # Re-check the dataset:
        if df[label1].isnull().any() or df[label2].isnull().any():
            raise RuntimeError("Could not delete the fields in dataframe.")
        print("Successfully cleaned dataframe!")
    return df


df = clean_dataset(df, label1="text", label2="label")

Found 1 bad text and 0 bad label fields. Now deleting them...
Successfully cleaned dataframe!
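The cleaning step above removes rows with missing (NaN) values. Whitespace-only strings are not NaN, so `dropna` leaves them in place. The following is a minimal sketch of one way to drop those rows too. The helper `drop_blank_rows` and the toy dataframe are hypothetical and are not part of the notebook, and the results reported in this document were produced without this step.

```python
import pandas as pd


def drop_blank_rows(df, columns):
    """Return a copy of df without rows whose listed columns are NaN or whitespace-only."""
    cleaned = df.dropna(subset=columns)
    for column in columns:
        # A value counts as blank when nothing is left after stripping whitespace.
        is_blank = cleaned[column].astype(str).str.strip().eq("")
        cleaned = cleaned[~is_blank]
    return cleaned.reset_index(drop=True)


# Toy data invented for illustration: one whitespace-only text and one missing text.
toy_df = pd.DataFrame({
    "text": ["hello there", "   ", None, "free prize"],
    "label": ["ham", "ham", "spam", "spam"],
})
cleaned_df = drop_blank_rows(toy_df, ["text", "label"])
print(cleaned_df)
```

Unlike `clean_dataset`, this sketch returns a new dataframe instead of modifying the original in place.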


In [4]:
# Check label type:
print("unique labels:", df["label"].unique())

# Check dataset balance:
print(df["label"].value_counts())

unique labels: ['ham' 'spam']
ham     2500
spam     499
Name: label, dtype: int64


## Standardize Labels

In [5]:
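# Convert the labels to a standard format:
# This function was used to rewrite 0/1 labels as ham/spam labels, in place in the CSV file.
# It splits each line on commas and expects the label in the second field,
# so it assumes the text field itself contains no commas.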
def convert_binary_label_to_ham_spam(file):
    csv_lines = None
    with open(file, "r") as f:
        csv_lines = f.readlines()
    for idx, line in enumerate(csv_lines):
        if idx == 0:
            continue
        splitted_line = line.replace("\n", "").split(',')
        if splitted_line[1] == "0":
            splitted_line[1] = "ham"
        elif splitted_line[1] == "1":
            splitted_line[1] = "spam"
        new_line = ",".join(splitted_line) + "\n"    
        if new_line != line:
            csv_lines[idx] = new_line

    with open(file, "w") as f:
        f.writelines(csv_lines)


# convert_binary_label_to_ham_spam("dataset/email.csv")
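If the text field can contain commas, splitting lines by hand is fragile. As an alternative, the same relabeling can be sketched with pandas, which handles quoted fields. The in-memory CSV and the column names `text` and `label` below are assumptions made for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file; note the commas inside the quoted text fields.
raw_csv = io.StringIO('text,label\n"hello, how are you",0\n"win cash, now",1\n')

labels_df = pd.read_csv(raw_csv)
# Map the numeric labels to the ham/spam strings used in the rest of the notebook.
labels_df["label"] = labels_df["label"].map({0: "ham", 1: "spam"})
print(labels_df)

# To persist the result, one could write it back with:
# labels_df.to_csv("some_output_path.csv", index=False)  # hypothetical path
```

Any label value that is missing from the mapping becomes NaN, so unexpected labels show up in the cleaning step rather than passing through silently.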

## Split Train and Test Datasets

In [6]:
# Split data into train and test sets: (Ratio of X_test to overall is set by test_size param)
def train_split(df, label1, label2, test_size=0.25, random_state=7):
    X = df[label1]
    y = df[label2]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return (X_train, X_test, y_train, y_test)


X_train, X_test, y_train, y_test = train_split(df, "text", "label")
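The first dataset is imbalanced (2500 ham vs 499 spam), and a plain random split does not guarantee the same ratio in the train and test sets. `train_test_split` accepts a `stratify` argument that preserves the class proportions. The sketch below uses a toy dataframe with a 5:1 ratio; the notebook itself does not use stratification, and the names `toy_df`, `X_tr`, `X_te`, `y_tr` and `y_te` are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 5:1 ham/spam ratio, invented for illustration.
toy_df = pd.DataFrame({
    "text": [f"message {i}" for i in range(120)],
    "label": ["ham"] * 100 + ["spam"] * 20,
})

# stratify keeps the ham/spam proportions the same in both resulting sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df["text"],
    toy_df["label"],
    test_size=0.25,
    random_state=7,
    stratify=toy_df["label"],
)
print(y_te.value_counts())
```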

## Train Model

In [7]:
# Build the classifier, fit it on the train set and return its predictions on the test set:
def create_clf(X_train, X_test, y_train, y_test):
    # Build pipeline:
    text_clf = Pipeline([("tfidf", TfidfVectorizer()),
                        ("clf", LinearSVC())])
    # Fit the data:
    text_clf.fit(X_train, y_train)
    # Form prediction set:
    predictions = text_clf.predict(X_test)
    return predictions


predictions = create_clf(X_train, X_test, y_train, y_test)

## Test Results

In [8]:
# Print results:
def print_results(y_test, predictions):
    printmd("**Confusion Matrix:**")
    print(confusion_matrix(y_test, predictions))
    printmd("\n**Classification Report:**")
    print(classification_report(y_test, predictions))
    printmd(f"<span>**Overall Accuracy:**</span><br><span style='color:red'>**{accuracy_score(y_test, predictions) * 100:.3f}%**</span>")

    
# Save result:
accuracy = {"first": accuracy_score(y_test, predictions) * 100}
print_results(y_test, predictions)

**Confusion Matrix:**

[[612   2]
 [  9 127]]



**Classification Report:**

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       614
        spam       0.98      0.93      0.96       136

    accuracy                           0.99       750
   macro avg       0.99      0.97      0.97       750
weighted avg       0.99      0.99      0.99       750



<span>**Overall Accuracy:**</span><br><span style='color:red'>**98.533%**</span>
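To see how the report follows from the confusion matrix, the numbers can be recomputed by hand. In the matrix printed by `confusion_matrix`, rows are the true labels and columns are the predicted labels, both in sorted order (ham, spam). The variable names in this sketch are chosen for illustration.

```python
# Values copied from the confusion matrix printed above:
# rows are true labels, columns are predictions, both in the order (ham, spam).
ham_as_ham, ham_as_spam = 612, 2
spam_as_ham, spam_as_spam = 9, 127

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Accuracy: share of all test messages that were classified correctly.
accuracy_check = (ham_as_ham + spam_as_spam) / total
# Spam precision: of everything predicted as spam, how much really was spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Spam recall: of all real spam, how much was caught.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

print(f"accuracy:       {accuracy_check * 100:.3f}%")
print(f"spam precision: {spam_precision:.2f}")
print(f"spam recall:    {spam_recall:.2f}")
```

The 9 spam messages classified as ham are what pull the spam recall down to 0.93, even though the overall accuracy stays above 98%.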

## Second Dataset Process

The process is the same as it was for the first dataset.

In [9]:
# Load dataframe:
df2 = load_df("dataset/second.csv")[['text','label']]
# Clean it:
df2 = clean_dataset(df2, "text", "label")
# Check the distribution:
df2["label"].value_counts()
# Train split:
X_train2, X_test2, y_train2, y_test2 = train_split(df2, "text", "label")
# Create classifier object:
predictions2 = create_clf(X_train2, X_test2, y_train2, y_test2)
# Save result:
accuracy["second"] = accuracy_score(y_test2, predictions2) * 100
# Check result
print_results(y_test2, predictions2)

**Confusion Matrix:**

[[908  15]
 [  5 365]]



**Classification Report:**

              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       923
        spam       0.96      0.99      0.97       370

    accuracy                           0.98      1293
   macro avg       0.98      0.99      0.98      1293
weighted avg       0.98      0.98      0.98      1293



<span>**Overall Accuracy:**</span><br><span style='color:red'>**98.453%**</span>

## Analysis

### Overall Accuracy

Because the same random_state is passed to every train/test split above, each dataset is split the same way on every run, so the results should be reproducible when the code is run with the same datasets and algorithms. With that in mind, we can discuss the printed numbers above.

The overall accuracy of each model is good, as each scored **above 98%**. But that alone does not mean we have well-trained models. We could actually have several over-trained models that are only good on their own datasets. So how could we test the accuracy of a model outside of its own dataset? The approach taken here is to compare the y_test labels of each model with the predictions of the other classifier object. First, the individual accuracies:

In [10]:
# Print each model's overall accuracy:
print("Individual Model Overall Accuracy:")
for model in accuracy:
    print(f"{model:7}: {accuracy[model]:.3f}")

Individual Model Overall Accuracy:
first  : 98.533
second : 98.453
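Since both datasets are imbalanced, it helps to put these numbers next to a trivial baseline: a model that always predicts the most frequent class. The sketch below computes that baseline from the test-set class counts shown in the classification reports above; the dictionary names are chosen for illustration.

```python
# Test-set class counts taken from the "support" column of the classification reports above.
test_support = {
    "first": {"ham": 614, "spam": 136},
    "second": {"ham": 923, "spam": 370},
}

baseline = {}
for name, counts in test_support.items():
    # Accuracy of a trivial model that always predicts the most frequent class.
    baseline[name] = max(counts.values()) / sum(counts.values()) * 100
    print(f"{name:7}: {baseline[name]:.3f}")
```

Both trained models score well above these always-ham baselines.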


### Models Against Each Other

In [11]:
# Find minimum dataset records (in order to be able to calculate predictions):
min_df_len = min(y_test.size, y_test2.size)
# Init models:
models_info = [
    ("first", y_test[:min_df_len], predictions[:min_df_len]),
    ("second", y_test2[:min_df_len], predictions2[:min_df_len]),
]
# Check predictions against each test_y of every model:
new_accuracies = []
for model in models_info:
    new_accuracy_sublist = []
    for model2 in models_info:
        score = accuracy_score(model[1], model2[2]) * 100
        new_accuracy_sublist.append((model, model2, score))
    new_accuracies.append(new_accuracy_sublist)
# Print the accuracies:
print(f"test_y{'':7}", f"predictions{'':5}", f"accuracy score{'':5}", end="\n\n")
for row in new_accuracies:
    for t in row:
        print(f"{t[0][0]:<15} {t[1][0]:15}", f"{t[2]:.3f}")

test_y        predictions      accuracy score     

first           first           98.533
first           second          63.200
second          first           64.267
second          second          98.267


As we can see, the accuracy scores drop to **about 63-64%** when the labels and the predictions come from different datasets. There could be several reasons for this:
* a bigger dataset is needed to train a stronger model
* the models are over-trained
* the datasets are not cleaned properly

or probably other factors that we are not aware of.
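Note that the comparison above pairs the first `min_df_len` labels of one test set with the first `min_df_len` predictions made on a different test set, so the paired rows are unrelated messages. A more direct way to test a model outside of its own dataset is to fit it on one dataset and let it predict the texts of another. The following is a minimal sketch of that idea on toy data; the helper `fit_text_clf` and all the messages are hypothetical and are not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def fit_text_clf(texts, labels):
    """Fit and return a TF-IDF + LinearSVC pipeline (hypothetical helper, not in the notebook)."""
    clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
    clf.fit(texts, labels)
    return clf


# Toy stand-ins for two differently sourced datasets, invented for illustration.
source_texts = [
    "win a free prize now",
    "free cash claim now",
    "meeting on monday",
    "lunch at noon today",
]
source_labels = ["spam", "spam", "ham", "ham"]
target_texts = ["claim your free cash prize", "monday meeting moved to noon"]
target_labels = ["spam", "ham"]

# Fit on the source dataset, then predict the texts of the target dataset,
# so that every prediction is scored against the label of the same message.
source_clf = fit_text_clf(source_texts, source_labels)
target_predictions = source_clf.predict(target_texts)
cross_accuracy = accuracy_score(target_labels, target_predictions) * 100
print(f"source -> target accuracy: {cross_accuracy:.3f}")
```

With the notebook's dataframes, this would amount to something like `fit_text_clf(df["text"], df["label"]).predict(df2["text"])` scored against `df2["label"]`. No such result is reported in this document.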

### Email vs SMS Spam Detection

The first dataset we used contains spam/ham labeled email texts, while the third one contains SMS texts. This gives us a good chance to check whether our email model can detect SMS ham/spam and vice versa. We just need to take a look at the accuracy scores of the first and third models to figure it out:

### Third Dataset Process

The process is the same as it was for the first dataset.

In [12]:
# Load dataframe:
df3 = load_df("dataset/third.csv")[['text','label']]
# Clean it:
df3 = clean_dataset(df3, "text", "label")
# Check the distribution:
df3["label"].value_counts()
# Train split:
X_train3, X_test3, y_train3, y_test3 = train_split(df3, "text", "label")
# Create classifier object:
predictions3 = create_clf(X_train3, X_test3, y_train3, y_test3)
# Save result:
accuracy["third"] = accuracy_score(y_test3, predictions3) * 100
# Check result
print_results(y_test3, predictions3)

**Confusion Matrix:**

[[1210    1]
 [  21  161]]



**Classification Report:**

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1211
        spam       0.99      0.88      0.94       182

    accuracy                           0.98      1393
   macro avg       0.99      0.94      0.96      1393
weighted avg       0.98      0.98      0.98      1393



<span>**Overall Accuracy:**</span><br><span style='color:red'>**98.421%**</span>

### Two Models Against Each Other

In [13]:
# Find minimum dataset records (in order to be able to calculate predictions):
min_df_len = min(y_test.size, y_test3.size)
# Init models:
models_info = [
    ("first", y_test[:min_df_len], predictions[:min_df_len]),
    ("third", y_test3[:min_df_len], predictions3[:min_df_len])
]
# Check predictions against each test_y of every model:
new_accuracies = []
for model in models_info:
    new_accuracy_sublist = []
    for model2 in models_info:
        score = accuracy_score(model[1], model2[2]) * 100
        new_accuracy_sublist.append((model, model2, score))
    new_accuracies.append(new_accuracy_sublist)
# Print accuracy scores:
print(f"test_y{'':7}", f"predictions{'':5}", f"accuracy score{'':5}", end="\n\n")
for row in new_accuracies:
    for t in row:
        # Only print the cross-model pairs (first vs third and third vs first):
        if t[0][0] != t[1][0]:
            print(f"{t[0][0]:<15} {t[1][0]:15}", f"{t[2]:.3f}")

test_y        predictions      accuracy score     

first           third           75.200
third           first           75.067


It's an interesting result! The scores suggest that **by training an email spam detection model, we could detect SMS spam, too, and vice versa.** There is a minor difference between the two accuracy scores, but they are really close. We could guess that spam emails and spam SMS texts have some similarities. Figuring out these similarities is another project to work on, and it falls under the subject of Topic Modeling.

## References

### Inspired by:

- A great tutorial from Udemy: [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/)

### Documentation:

- spaCy Documentation: [spaCy Website](https://spacy.io)
- Scikit-Learn Documentation: [Scikit-Learn Website](https://scikit-learn.org)

### Datasets:

- Spam or Not Spam Dataset (first): https://www.kaggle.com/ozlerhakan/spam-or-not-spam-dataset
- Spam Mails Dataset (second): https://www.kaggle.com/venky73/spam-mails-dataset
- Spam Text Message Classification (third): https://www.kaggle.com/team-ai/spam-text-message-classification
