#### **DATA 620 - Assignment 6**

Date: 4/13/2024  
Author: Kory L. Martin

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


## **1. Setup**

For this project, the goal is to compare several classification algorithms on how well they identify new emails as **spam** or **ham**.

We import general-purpose libraries along with the libraries used to train and evaluate our machine learning classifiers. We also pull in the training data for the project from the UCI repository.

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import wordpunct_tokenize

Import the machine learning libraries

In [None]:
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import CategoricalNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [3]:
from ucimlrepo import fetch_ucirepo 


In [None]:
 
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 


In [None]:
  
# metadata 
print(spambase.metadata) 

  
# variable information 
print(spambase.variables) 


{'uci_id': 94, 'name': 'Spambase', 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase', 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv', 'abstract': 'Classifying Email as Spam or Non-Spam', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 4601, 'num_features': 57, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1999, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C53G6X', 'creators': ['Mark Hopkins', 'Erik Reeber', 'George Forman', 'Jaap Suermondt'], 'intro_paper': None, 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spam or not.\n\t\nOur collecti

We see that our data consists of 4601 records and 57 features. Based on the feature names, the first 54 features are frequencies of words and special characters.

In [307]:
X.shape

(4601, 57)

In [308]:
X.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191


In [32]:
X.describe()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.031869,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.285735,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,10.0,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0


## **2. Building Classifiers**

Since the preprocessing function defined in Section 3 only reproduces the first 54 columns (the word and character frequencies), I restrict the training data to those features.

In [240]:
X_mod = X.iloc[:,:54]

In [None]:
y = np.array(y['Class'])

Next we split our data into training and test sets.

In [241]:
# Hold out 33% of the records as a test set
X_train, X_test, y_train, y_test = train_test_split(X_mod, y, test_size=0.33, random_state=1211)

In [318]:
len(y_test[y_test == 0])

921

In [319]:
len(y_test[y_test == 1])

598

In [391]:
baseline = np.mean(y_test == 1)  # share of spam in the test split: 598 / (598 + 921)

Our test data set is about 39% spam (598 of 1,519 messages). We use this proportion as a baseline for measuring the performance of our models.
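
The spam share above is the accuracy of a rule that labels every message as spam. For comparison, a rule that always predicts the majority class (ham) scores higher. A minimal sketch using only the class counts reported in the cells above; the function name `constant_rule_accuracy` is illustrative and not part of the notebook:

```python
def constant_rule_accuracy(n_ham: int, n_spam: int, predict_spam: bool) -> float:
    """Accuracy of a rule that gives every message the same label."""
    correct = n_spam if predict_spam else n_ham
    return correct / (n_ham + n_spam)

# Class counts of the held-out test split, taken from the cells above.
n_ham, n_spam = 921, 598

always_spam = constant_rule_accuracy(n_ham, n_spam, predict_spam=True)
always_ham = constant_rule_accuracy(n_ham, n_spam, predict_spam=False)

print(f"always spam: {always_spam:.4f}")
print(f"always ham:  {always_ham:.4f}")
```

The two values sum to 1, so the always-ham rule lands at roughly 61% on this split. Either number can serve as a reference point, as long as it is clear which rule it represents.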

We are going to train several different classifiers on our data. For this assignment we are using the following:
- Logistic Regression
- Decision Tree
- Neural Networks
- Nearest Neighbors
- Random Forest
- Ada Boost
- Bagging

### **A. Logistic Regression**

When building this model, there were issues fitting it on the raw data, so we added a scaler to standardize the features, which allowed the Logistic Regression model to train. The model performed well, with an accuracy score of about 93% on the test split.

In [309]:
# Standardize the features, then fit logistic regression on the scaled data
clf_a = Pipeline([('scaler', StandardScaler()), ('log_reg', LogisticRegression(random_state=1211))])
clf_a.fit(X_train, y_train)
clf_a.score(X_test, y_test)

0.9275839368005266
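
Standardizing rescales each feature to zero mean and unit variance, which tends to help iterative solvers such as the one behind `LogisticRegression`. The sketch below reproduces that arithmetic in plain Python on the five `word_freq_address` values shown in the `X.head()` output. It assumes the population standard deviation, which is what `StandardScaler` uses; the helper name `standardize` is illustrative.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# First five word_freq_address values from the X.head() output above.
word_freq_address = [0.64, 0.28, 0.0, 0.0, 0.0]
z_scores = standardize(word_freq_address)

print([round(z, 3) for z in z_scores])
```

In the pipeline, the scaler learns the mean and standard deviation from `X_train` only and reuses them when scoring `X_test`, so no information from the test split leaks into training.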

### **B. Decision Tree**

In [323]:
clf_b = DecisionTreeClassifier(random_state=1211)
clf_b.fit(X_train, y_train)
clf_b.score(X_test, y_test)

0.9203423304805793

### **C. Neural Networks**

In [322]:
clf_c = MLPClassifier(random_state=1211)
clf_c.fit(X_train, y_train)
clf_c.score(X_test, y_test)



0.9493087557603687

### **D. Nearest Neighbors**

In [325]:
clf_d = KNeighborsClassifier(n_neighbors=5)
clf_d.fit(X_train, y_train)
clf_d.score(X_test, y_test)

0.9078341013824884

### **E. Random Forest**

In [326]:
clf_e = RandomForestClassifier(random_state=1211)
clf_e.fit(X_train, y_train)
clf_e.score(X_test, y_test)

0.9493087557603687

### **F. AdaBoost Classifier**

In [327]:
clf_f = AdaBoostClassifier(random_state=1211)
clf_f.fit(X_train, y_train)
clf_f.score(X_test, y_test)

0.9400921658986175

### **G. Bagging Classifier**

In [328]:
clf_g = BaggingClassifier(random_state=1211)
clf_g.fit(X_train, y_train)
clf_g.score(X_test, y_test)

0.9289005924950625

Create a table to display the performance of each of our models

In [392]:
# Each label is paired with the matching clf_* variable from sections A-G
classifier_results_training = [
    {'Classifier':'Baseline','Score':baseline},
    {'Classifier':'Logistic Regression','Score':clf_a.score(X_test,y_test)},
    {'Classifier':'Decision Tree','Score':clf_b.score(X_test,y_test)},
    {'Classifier':'Neural Network', 'Score':clf_c.score(X_test,y_test)},
    {'Classifier':'KNN', 'Score':clf_d.score(X_test,y_test)},
    {'Classifier':'Random Forest', 'Score':clf_e.score(X_test,y_test)},
    {'Classifier':'Ada Boost', 'Score':clf_f.score(X_test,y_test)},
    {'Classifier':'Bagging', 'Score':clf_g.score(X_test,y_test)}
]

In [394]:
model_training_performance = pd.DataFrame(classifier_results_training).sort_values(by='Score').reset_index(drop=True)

In [395]:
model_training_performance

Unnamed: 0,Classifier,Score
0,Baseline,0.39368
1,KNN,0.907834
2,Decision Tree,0.920342
3,Logistic Regression,0.927584
4,Bagging,0.928901
5,Ada Boost,0.940092
6,Neural Network,0.949309
7,Random Forest,0.949309
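
Building the results table by hand makes it easy to pair a label with the wrong `clf_*` variable. One way to rule that out is to keep each model next to its name in a dict and score them in a loop. A minimal sketch on synthetic data from `make_classification`; the data, the variable names, and the reduced model list are stand-ins, not the notebook's Spambase split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1211)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1211)

# Each name is bound to its model exactly once.
models = {
    "Logistic Regression": Pipeline([("scaler", StandardScaler()),
                                     ("log_reg", LogisticRegression(random_state=1211))]),
    "Decision Tree": DecisionTreeClassifier(random_state=1211),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit and score every model; the label travels with the score.
scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in models.items()}

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:<20} {score:.3f}")
```

The same `models` dict can then be reused to score a second data set, which keeps the labels consistent across both tables.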


## **3. Model Predictions**

To evaluate our classifiers, we use a set of sample emails retrieved from [Kaggle](https://www.kaggle.com/datasets/karthickveerakumar/spam-filter?resource=download), preprocess them to generate the features, and then run those features through our trained classifiers.

In [249]:
sample_emails = pd.read_csv('emails.csv')

In [250]:
sample_emails_text = sample_emails.loc[:,'text'].tolist()
sample_emails_labels = sample_emails.loc[:,'spam'].copy()

Create the lists of keyword and character tokens from the feature names

In [251]:
# Recover the tokens from the feature names,
# e.g. 'word_freq_make' -> 'make' and 'char_freq_!' -> '!'.
# The capital_run_length_* columns match neither prefix and are skipped.
rules = X.columns.tolist()
keywords = []
characters = []
for rule in rules:
    if rule.startswith('word_freq_'):
        keywords.append(rule[len('word_freq_'):])
    elif rule.startswith('char_freq_'):
        characters.append(rule[len('char_freq_'):])

This function tries to mimic the methodology used to convert the text emails into a set of features for evaluation. It computes the frequency of the keywords and of the character tokens identified in the training data. However, the interpretation of the **capital_run_length** features was not straightforward, so no attempt was made to replicate those features.

In [252]:
def text_preprocessing(email_text):
    """Convert one email into a single-row DataFrame of word and character frequencies."""
    # Lowercase the tokens so they match the lowercase keywords from the feature names
    tokenized_words = [word.lower() for word in wordpunct_tokenize(email_text)]
    frequency = nltk.FreqDist(tokenized_words)

    # Denominator for every frequency: the number of unique tokens in the email
    num_tokens = len(set(tokenized_words))

    # Keyword frequencies, expressed as percentages
    keyword_count_dict = {}
    for keyword in keywords:
        if keyword in frequency:
            keyword_count_dict['word_freq_'+keyword] = round(frequency[keyword]/num_tokens,4)*100
        else:
            keyword_count_dict['word_freq_'+keyword] = 0

    # Character frequencies. wordpunct_tokenize keeps runs of punctuation together,
    # so '!!!' is a single token and does not count toward '!'
    character_count_dict = {}
    for character in characters:
        if character in frequency:
            character_count_dict['char_freq_'+character] = round(frequency[character]/num_tokens,4)*100
        else:
            character_count_dict['char_freq_'+character] = 0

    # Keywords first, then characters, matching the column order of X_mod
    merged_dict = keyword_count_dict | character_count_dict
    tokenized_df = pd.DataFrame([merged_dict])

    return tokenized_df
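
For comparison, here is a sketch of the features as I understand the Spambase attribute notes to define them: word frequencies as 100 × count / total words, character frequencies as 100 × count / total characters, and the three capital-run features computed from uninterrupted sequences of capital letters. The function name `spambase_style_features` and the sample text are illustrative, and this is an assumption about the original methodology rather than a confirmed reimplementation. The classifiers in this notebook were trained on the first 54 columns only, so the capital-run features would also require retraining on all 57.

```python
import re

def spambase_style_features(email_text, keywords, characters):
    """Compute Spambase-style features for one email (illustrative sketch)."""
    # Words: runs of alphanumeric characters, compared case-insensitively.
    words = re.findall(r"[A-Za-z0-9]+", email_text.lower())
    n_words = len(words) or 1        # avoid division by zero on empty text
    n_chars = len(email_text) or 1

    features = {}
    for keyword in keywords:
        features["word_freq_" + keyword] = 100 * words.count(keyword) / n_words
    for character in characters:
        features["char_freq_" + character] = 100 * email_text.count(character) / n_chars

    # Capital runs: uninterrupted sequences of capital letters in the raw text.
    runs = [len(run) for run in re.findall(r"[A-Z]+", email_text)]
    features["capital_run_length_average"] = sum(runs) / len(runs) if runs else 0.0
    features["capital_run_length_longest"] = max(runs, default=0)
    features["capital_run_length_total"] = sum(runs)
    return features

example = spambase_style_features(
    "FREE money!!! Claim your FREE prize", ["free", "money"], ["!", "$"]
)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

The capital-run features only carry signal if the evaluation corpus preserves the original capitalization, which would need checking for the Kaggle emails.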

Use the preprocessing function on the sample emails to generate a feature DataFrame that our classifiers can score

In [254]:
# Build one row of features per email, then stack the rows with a fresh 0..n-1 index
sample_email_dataframe = pd.concat(
    [text_preprocessing(text) for text in sample_emails_text],
    ignore_index=True
)

Create evaluation data set

In [266]:
# Draw a reproducible sample of 1000 emails and look up the matching labels
X_eval = sample_email_dataframe.sample(1000,random_state=1211)
eval_index = X_eval.index.tolist()
y_eval = sample_emails_labels.iloc[eval_index]
y_eval.value_counts()


spam
0    769
1    231
Name: count, dtype: int64

We calculate a baseline score, the share of spam in the evaluation sample, as a reference point for the performance of our classifiers.

In [396]:
baseline = (y_eval == 1).mean()  # 231 spam messages out of 1000

#### **Evaluation of Classifiers on Evaluation Data**

Below are the accuracy scores for each of the classifiers based on a set of 1000 emails that were pre-processed. The baseline score represents the percent of the evaluation data that is labeled as spam.

In [397]:
# Each label is paired with the matching clf_* variable from sections A-G
classifier_results_eval = [
    {'Classifier':'Baseline','Score':baseline},
    {'Classifier':'Logistic Regression','Score':clf_a.score(X_eval,y_eval)},
    {'Classifier':'Decision Tree','Score':clf_b.score(X_eval,y_eval)},
    {'Classifier':'Neural Network', 'Score':clf_c.score(X_eval,y_eval)},
    {'Classifier':'KNN', 'Score':clf_d.score(X_eval,y_eval)},
    {'Classifier':'Random Forest', 'Score':clf_e.score(X_eval,y_eval)},
    {'Classifier':'Ada Boost', 'Score':clf_f.score(X_eval,y_eval)},
    {'Classifier':'Bagging', 'Score':clf_g.score(X_eval,y_eval)}
]


In [398]:
model_evaluation_performance = pd.DataFrame(classifier_results_eval).sort_values(by='Score').reset_index(drop=True)

In [399]:
model_evaluation_performance

Unnamed: 0,Classifier,Score
0,Baseline,0.231
1,KNN,0.712
2,Decision Tree,0.724
3,Bagging,0.781
4,Neural Network,0.784
5,Logistic Regression,0.792
6,Ada Boost,0.826
7,Random Forest,0.83
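
Accuracy alone can hide how a classifier treats each class when ham outnumbers spam roughly three to one, as it does in this evaluation sample. A confusion matrix separates the two kinds of error. The sketch below counts the four cells in plain Python on made-up labels; `y_true_demo` and `y_pred_demo` are illustrative values, not results from this notebook. For 0/1 labels, `sklearn.metrics.confusion_matrix` (imported in the setup) returns the same counts arranged as `[[tn, fp], [fn, tp]]`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# Made-up labels: 3 spam (1) and 7 ham (0) messages.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / len(y_true_demo)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the spam messages, how many were flagged

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here the accuracy is higher than both precision and recall, because the large ham class dominates the count of correct predictions.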


In [None]:
model_training_performance = model_training_performance.assign(Rank=range(1,len(model_training_performance)+1))

model_evaluation_performance = model_evaluation_performance.assign(Rank=range(1,len(model_evaluation_performance)+1))
model_evaluation_performance = model_evaluation_performance.add_suffix(' (Evaluation)')
model_training_performance = model_training_performance.add_suffix(' (Training)')


In [401]:
pd.merge(model_training_performance,model_evaluation_performance, how='left', left_on='Rank (Training)', right_on='Rank (Evaluation)')

Unnamed: 0,Classifier (Training),Score (Training),Rank (Training),Classifier (Evaluation),Score (Evaluation),Rank (Evaluation)
0,Baseline,0.39368,1,Baseline,0.231,1
1,KNN,0.907834,2,KNN,0.712,2
2,Decision Tree,0.920342,3,Decision Tree,0.724,3
3,Logistic Regression,0.927584,4,Bagging,0.781,4
4,Bagging,0.928901,5,Neural Network,0.784,5
5,Ada Boost,0.940092,6,Logistic Regression,0.792,6
6,Neural Network,0.949309,7,Ada Boost,0.826,7
7,Random Forest,0.949309,8,Random Forest,0.83,8


### **Conclusion**

Based on the accuracy scores, all seven classifiers generated predictions that were better than the baseline of 23%, which is the percent of spam messages in our evaluation data set and therefore the accuracy of labeling every message as spam. A stricter reference point is labeling every message as ham, which would score about 77%; the two lowest-scoring classifiers (71.2% and 72.4%) fall short of that bar, while the other five clear it.

We see from the table above that the ranking of the models largely held up between the Spambase test split and the evaluation data, with the exception of the models in ranks 4 through 7.

It's worth noting that the performance of these classifiers on the evaluation data is significantly worse than the accuracy scores of roughly 91% to 95% that they achieved on the test data. Two possible reasons for this are:

1. Given the lack of documentation on the actual methodology used for pre-processing the emails, it's highly likely that the algorithm that I wrote to score the text was not exactly the same as the one used to create the features for the training and test data. Additionally, given the inability to interpret the last three features (capital_run_length_average, capital_run_length_longest, and capital_run_length_total), these features were omitted from the final dataset.

2. Also, as mentioned in the metadata provided by the authors, the email corpus and the features were tied to the specific type of email that they were receiving given their professional domain. As a result, some of the features in our training data may not be applicable to the corpus of emails used in the evaluation data.

Finally, there was limited tuning done to our classifiers, given that they performed relatively well on the Spambase data. Hyperparameter tuning may therefore represent an additional area for improving the performance of our classifiers.
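
As one possible starting point for that tuning, the sketch below runs a small cross-validated grid search over the regularization strength `C` of a scaled logistic regression. It uses synthetic data from `make_classification`, and the grid values, `max_iter` setting, and variable names are assumptions chosen for illustration, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1211)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(random_state=1211, max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"log_reg__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Tuning on the Spambase training split alone would not address the feature mismatch described in the first point, so any gains on the evaluation data may be modest.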