# Project 4: Model Development and Prediction Service

## CS 4981 ML Production Systems

## Kyle Robinson, Michael Salgado, Noah Stiemke

In this project, we are training different ham/spam classifier models using the labled emails stored in Minio. Throughout our training, we are testing different combinations of hyperparameters for both the models and the Count Vectorizer to find the model with the highest accuracy on our data. 

## Imports

In [1]:
import os
import json
import pickle
import warnings
from datetime import datetime

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    roc_curve,
    roc_auc_score,
    make_scorer,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    KFold,
    cross_val_score,
)
from sklearn.ensemble import RandomForestClassifier
from numpy import mean

## Loading in JSON file

In [2]:
pathname = "C:/Users/salgadom/Documents/CS4981/P1-repository/cs4981_lab1_email_classifier/part-00000-a994d54f-2d6b-4f3e-b9d1-21c14ae09d0a-c000.json"

email_data = []
with open(pathname, encoding='utf8') as json_file:
    for row in json_file:
        row_data = json.loads(row)

        row_dict = {'email_id': row_data['email_id'],
                    'received_timestamp': row_data['received_timestamp'],
                    'body': json.loads(row_data['email_object'])['body'],
                    'label': row_data['label']}

        email_data.append(row_dict)

In [3]:
print(email_data[0])

{'email_id': 1, 'received_timestamp': '2023-05-04T09:59:48.492-05:00', 'body': '\n\n\n\n\n\n\nDo you feel the pressure to perform and not rising to the occasion??\n\n\n\n\nTry Viagra.....\nyour anxiety will be a thing of the past and you will\nbe back to your old self.\n\n', 'label': 'spam'}


## Creating Pandas DataFrame

In [4]:
emails_df = pd.DataFrame.from_dict(email_data)

In [5]:
emails_df.head()

Unnamed: 0,email_id,received_timestamp,body,label
0,1,2023-05-04T09:59:48.492-05:00,\n\n\n\n\n\n\nDo you feel the pressure to perf...,spam
1,2,2023-05-04T09:59:48.795-05:00,"Hi, i've just updated from the gulus and I che...",ham
2,8,2023-05-04T09:59:50.782-05:00,\n\n\n\n\n\n HoodiaLife - Start Losing Weight ...,spam
3,3,2023-05-04T09:59:48.956-05:00,authentic viagra\n\nMega authenticV I A G R A...,spam
4,4,2023-05-04T09:59:49.240-05:00,"\nHey Billy, \n\nit was really fun going out t...",spam


In [6]:
emails_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10801 entries, 0 to 10800
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   email_id            10801 non-null  int64 
 1   received_timestamp  10801 non-null  object
 2   body                10801 non-null  object
 3   label               10801 non-null  object
dtypes: int64(1), object(3)
memory usage: 337.7+ KB


In [7]:
print(f"Number of rows in dataframe: {len(emails_df)}")

Number of rows in dataframe: 10801


## Cleaning data
- Convert to lowercase
- Stop words?
- Lemmitization
- Balancing classes

### Undersampling data

In [8]:
spam_emails = emails_df[emails_df['label'] == 'spam']
ham_emails = emails_df[emails_df['label'] == 'ham']

num_spam = len(spam_emails)
num_ham = len(ham_emails)

print(f"Original number of spam labels: {num_spam}")
print(f"Original number of ham labels: {num_ham}")

undersample = None
if num_spam > num_ham:
    undersample = spam_emails.sample(n=len(ham_emails))
    emails_df = pd.concat([ham_emails,undersample],axis=0)
else:
    undersample = ham_emails.sample(n=len(spam_emails))
    emails_df = pd.concat([spam_emails,undersample],axis=0)

print(f"\nAdjusted number of spam labels: {emails_df['label'].value_counts()['spam']}")
print(f"Adjusted number of ham labels: {emails_df['label'].value_counts()['ham']}")
emails_df.head()

Original number of spam labels: 7604
Original number of ham labels: 3197

Adjusted number of spam labels: 3197
Adjusted number of ham labels: 3197


Unnamed: 0,email_id,received_timestamp,body,label
1,2,2023-05-04T09:59:48.795-05:00,"Hi, i've just updated from the gulus and I che...",ham
11,1220,2023-05-04T10:04:36.681-05:00,----------------------------------------------...,ham
13,1292,2023-05-04T10:04:55.913-05:00,\nGoogle News Alert for: bush\n\n\nThe Bush ad...,ham
14,9,2023-05-04T09:59:51.142-05:00,\nHi...\n\nI have to use R to find out the 90%...,ham
23,17,2023-05-04T09:59:52.271-05:00,Hm... sounds like a homework problem to me...\...,ham


In [9]:
print(f"Number of rows in dataframe: {len(emails_df)}")

Number of rows in dataframe: 6394


### Sorting by time


#### Converting 'received_timestamp' to datetime objects

In [10]:
emails_df.head()

Unnamed: 0,email_id,received_timestamp,body,label
1,2,2023-05-04T09:59:48.795-05:00,"Hi, i've just updated from the gulus and I che...",ham
11,1220,2023-05-04T10:04:36.681-05:00,----------------------------------------------...,ham
13,1292,2023-05-04T10:04:55.913-05:00,\nGoogle News Alert for: bush\n\n\nThe Bush ad...,ham
14,9,2023-05-04T09:59:51.142-05:00,\nHi...\n\nI have to use R to find out the 90%...,ham
23,17,2023-05-04T09:59:52.271-05:00,Hm... sounds like a homework problem to me...\...,ham


In [11]:
emails_df['received_timestamp'] = pd.to_datetime(emails_df['received_timestamp'], format='%Y-%m-%dT%H:%M:%S')
emails_df.head()

Unnamed: 0,email_id,received_timestamp,body,label
1,2,2023-05-04 09:59:48.795000-05:00,"Hi, i've just updated from the gulus and I che...",ham
11,1220,2023-05-04 10:04:36.681000-05:00,----------------------------------------------...,ham
13,1292,2023-05-04 10:04:55.913000-05:00,\nGoogle News Alert for: bush\n\n\nThe Bush ad...,ham
14,9,2023-05-04 09:59:51.142000-05:00,\nHi...\n\nI have to use R to find out the 90%...,ham
23,17,2023-05-04 09:59:52.271000-05:00,Hm... sounds like a homework problem to me...\...,ham


#### Sorting dataframe by datetime

In [12]:
emails_df.sort_values(by='received_timestamp')

Unnamed: 0,email_id,received_timestamp,body,label
0,1,2023-05-04 09:59:48.492000-05:00,\n\n\n\n\n\n\nDo you feel the pressure to perf...,spam
1,2,2023-05-04 09:59:48.795000-05:00,"Hi, i've just updated from the gulus and I che...",ham
3,3,2023-05-04 09:59:48.956000-05:00,authentic viagra\n\nMega authenticV I A G R A...,spam
6,5,2023-05-04 09:59:49.511000-05:00,"\n\n\n\n\n\n\nsystem"" of the home. It will ha...",spam
14,9,2023-05-04 09:59:51.142000-05:00,\nHi...\n\nI have to use R to find out the 90%...,ham
...,...,...,...,...
10794,10792,2023-05-04 10:41:31.756000-05:00,\n\n\n\n\n\n\n\nModeling make specify fashion?...,spam
10798,10795,2023-05-04 10:41:33.034000-05:00,Author: metze\nDate: 2007-04-20 10:57:13 +0000...,ham
10784,10796,2023-05-04 10:41:33.193000-05:00,,ham
10795,10798,2023-05-04 10:41:33.629000-05:00,Author: metze\nDate: 2007-04-20 11:00:20 +0000...,ham


## Model Training

## Logistic Regression

The hyper-parameters that will be tested for this model are as follows:
  - Regularization (0.01, 0.1, 1.0)
  - Penalty Types (L1, L2)
  - Solver (lbfgs, liblinear, newton-cg)
  - min_df (0, 1, 10, 20)

### Spliting dataset by time
Split by time, split first 80% for older timestamps, 20% for the latest timestamps


In [13]:
x_train, x_test, y_train, y_test = train_test_split(emails_df['body'].values, emails_df['label'], test_size=0.2, shuffle=False)

### Creating pipeline

In [14]:
lr_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression())
])

### Setting hyperparameters

In [15]:
lr_parameters = {
    'vectorizer__min_df': [10, 20],
    'classifier__C': [0.01, 0.1, 1.0],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['newton-cg', 'lbfgs', 'liblinear']
}

### Training models

In [16]:
warnings.filterwarnings('ignore')

lr_clf = GridSearchCV(lr_pipeline, lr_parameters, cv=10)
lr_clf.fit(x_train, y_train)

### Retrieving best model

In [17]:
best_lr_output = lr_clf.best_estimator_

best_lr_vectorizer = best_lr_output.named_steps['vectorizer']
best_lr_model = best_lr_output.named_steps['classifier']
best_params = lr_clf.best_params_
print(best_params)
print(best_lr_model)
print(best_lr_vectorizer)

{'classifier__C': 0.1, 'classifier__penalty': 'l2', 'classifier__solver': 'newton-cg', 'vectorizer__min_df': 10}
LogisticRegression(C=0.1, solver='newton-cg')
CountVectorizer(min_df=10)


### Evaluating best model on testing set

In [18]:
x_test_counts = best_lr_vectorizer.transform(x_test)
y_test_pred = best_lr_model.predict(x_test_counts)

# Evaluate the performance on the testing set
lr_accuracy = accuracy_score(y_test, y_test_pred)
lr_precision = precision_score(y_test, y_test_pred, pos_label="spam")
lr_recall = recall_score(y_test, y_test_pred, pos_label="spam")
lr_f1 = f1_score(y_test, y_test_pred, pos_label="spam")

# Output the evaluation metrics
lr_evaluation_metrics = {
    'accuracy': lr_accuracy,
    'precision': lr_precision,
    'recall': lr_recall,
    'f1': lr_f1
}

print("Testing Metrics:", lr_evaluation_metrics)

Testing Metrics: {'accuracy': 0.9890539483971853, 'precision': 1.0, 'recall': 0.9890539483971853, 'f1': 0.9944968553459119}


## Random Forest

### Splitting data into training and testing sets by time

In [19]:
x_train, x_test, y_train, y_test = train_test_split(emails_df['body'].values, emails_df['label'], test_size=0.2, shuffle=False)

### Creating pipeline

In [20]:
rfc_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', RandomForestClassifier())
])

### Defining parameters for grid search

In [21]:
rfc_parameters = {
    'vectorizer__min_df': [0, 1, 10, 20],
    "classifier__n_estimators": [10, 50, 100],
    "classifier__max_depth": [5, 10, 20],
    "classifier__min_samples_leaf": [1, 3, 5]
}

### Initializing random forest classifier and performing grid search

In [22]:
warnings.filterwarnings('ignore')

rfc_clf = GridSearchCV(rfc_pipeline, rfc_parameters, cv=10)
rfc_clf.fit(x_train, y_train)

### Retrieving best model

In [23]:
best_rfc_output = rfc_clf.best_estimator_
best_rfc_vectorizer = best_rfc_output.named_steps['vectorizer']
best_rfc_model = best_rfc_output.named_steps['classifier']
best_rfc_params = rfc_clf.best_params_
print(best_rfc_params)
print(best_rfc_model)
print(best_rfc_vectorizer)

{'classifier__max_depth': 20, 'classifier__min_samples_leaf': 1, 'classifier__n_estimators': 50, 'vectorizer__min_df': 20}
RandomForestClassifier(max_depth=20, n_estimators=50)
CountVectorizer(min_df=20)


### Evaluating best model on testing set

In [24]:
# Predict labels for the testing data using the trained random forest model
x_test_counts = best_rfc_vectorizer.transform(x_test)
y_test_pred = best_rfc_model.predict(x_test_counts)

# Evaluate the performance on the testing set
rf_accuracy = accuracy_score(y_test, y_test_pred)
rf_precision = precision_score(y_test, y_test_pred, pos_label="spam")
rf_recall = recall_score(y_test, y_test_pred, pos_label="spam")
rf_f1 = f1_score(y_test, y_test_pred, pos_label="spam")

# Output the evaluation metrics
rf_evaluation_metrics = {
    'accuracy': rf_accuracy,
    'precision': rf_precision,
    'recall': rf_recall,
    'f1': rf_f1
}

print("Testing Metrics:", rf_evaluation_metrics)

Testing Metrics: {'accuracy': 0.9788897576231431, 'precision': 1.0, 'recall': 0.9788897576231431, 'f1': 0.9893322797313315}


# Model Performance Evaluation

The models with the best metrics for each approach are as follows: 
 - **Logistic Regression**: {'classifier__C': 0.1, 'classifier__penalty': 'l2', 'classifier__solver': 'newton-cg', 'vectorizer__min_df': 10}
   - Testing Metrics: {'accuracy': 0.9890539483971853, 'precision': 1.0, 'recall': 0.9890539483971853, 'f1': 0.9944968553459119}
 - **Random Forest**: {'classifier__max_depth': 20, 'classifier__min_samples_leaf': 1, 'classifier__n_estimators': 50, 'vectorizer__min_df': 20}
    - Testing Metrics: {'accuracy': 0.9788897576231431, 'precision': 1.0, 'recall': 0.9788897576231431, 'f1': 0.9893322797313315}
    
According to these results, the Logistic Regression model performed the best with an accuracy of 0.986 and an f1-score of 0.993

# Training Logistic Regression Model

## Splitting data into training and testing sets

In [25]:
x_train, x_test, y_train, y_test = train_test_split(emails_df['body'].values, emails_df['label'], test_size=0.2, shuffle=False)

## Fitting data using Count Vectorizer

In [26]:
x_train_counts = best_lr_vectorizer.fit_transform(x_train)

## Training LR model using best parameters

In [27]:
best_lr_model.fit(x_train_counts, y_train)

## Evaluating on the testing set

In [28]:
x_test_counts = best_lr_vectorizer.transform(x_test)
y_test_pred = best_lr_model.predict(x_test_counts)

# Evaluate the performance on the testing set
accuracy = accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred, pos_label="spam")
recall = recall_score(y_test, y_test_pred, pos_label="spam")
f1 = f1_score(y_test, y_test_pred, pos_label="spam")

# Store the parameters and metrics in a dictionary
results = {
    'best_params': best_params,
    'metrics': {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
}

print("Testing Metrics:", results)

Testing Metrics: {'best_params': {'classifier__C': 0.1, 'classifier__penalty': 'l2', 'classifier__solver': 'newton-cg', 'vectorizer__min_df': 10}, 'metrics': {'accuracy': 0.9890539483971853, 'precision': 1.0, 'recall': 0.9890539483971853, 'f1': 0.9944968553459119}}


## Saving results as a JSON file

In [29]:
# Save the results as a JSON file
output_filename = datetime.now().strftime("model_results_%Y-%m-%d_%H-%M-%S.json")
with open(output_filename, 'w') as f:
    json.dump(results, f)