# BERT for Email Spam Detection

Following the paper, we use the simpletransformers library to instantiate our BERT model. More information, including the other available models, can be found here: https://simpletransformers.ai/docs/classification-specifics/

We are running this notebook on Kaggle with a P100 GPU.

In [None]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.70.0-py3-none-any.whl.metadata (42 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting streamlit (from simpletransformers)
  Downloading streamlit-1.33.0-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit->simpletransformers)
  Downloading pydeck-0.8.1b0-py2.py3-none-any.whl.metadata (3.9 kB)
Collecting watchdog>=2.1.5 (from streamlit->simpletransformers)
  Downloading watchdog-4.0.0-py3-none-manylinux2014_x86_64.whl.metadata (37 kB)
Downloading simpletransformers-0.70.0-py3-none-any.whl (315 kB)

In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import os
import torch
import numpy as np

torch.cuda.is_available()

2024-04-13 18:35:10.802894: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-13 18:35:10.802996: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-13 18:35:10.934515: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


True

In [None]:
# Load your training data into a pandas DataFrame

train_df = pd.read_csv("/kaggle/input/email-spams/train.csv")
train_df.rename(columns={'spam': 'labels'}, inplace=True)
train_df = train_df[['text', 'labels']]
train_df.head()

| | text | labels |
|---|---|---|
| 0 | subject institute international finance annual... | 0 |
| 1 | subject mortgage even worst credit zwzm detail... | 1 |
| 2 | subject partnership mr edward moko independenc... | 1 |
| 3 | subject de la part de enfants ama rue de marty... | 1 |
| 4 | subject synfuel option valuation lenny believe... | 0 |


In [None]:
# Load your test data into a pandas DataFrame
test_df = pd.read_csv("/kaggle/input/email-spams/test.csv")
test_df.rename(columns={'spam': 'labels'}, inplace=True)
test_df = test_df[['text', 'labels']]
test_df.head()

| | text | labels |
|---|---|---|
| 0 | subject perfect logo charset koi r thinking br... | 1 |
| 1 | subject storage model security stinson added t... | 0 |
| 2 | subject wall street micro news report homeland... | 1 |
| 3 | subject logo stationer website design much lt ... | 1 |
| 4 | subject video conference ross mcintyre vince r... | 0 |


## Instantiate Model

We set our hyperparameters based on the paper's guidelines.

In [None]:
train_args = ClassificationArgs()

train_args.learning_rate = 4e-5
train_args.num_train_epochs = 3
train_args.train_batch_size = 32
train_args.max_seq_length = 300
train_args.optimizer = "AdamW"
train_args.eval_batch_size = 32

# Disable multiprocessing and tokenizer parallelism as a workaround; see
# https://github.com/ThilinaRajapakse/simpletransformers/issues/638#issuecomment-1060211019
train_args.use_multiprocessing = False
train_args.use_multiprocessing_for_evaluation = False
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# Instantiate the BERT model
model = ClassificationModel(
    "bert",
    "bert-base-cased",
    num_labels=2,  # Binary Classification
    args=train_args,
)

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [None]:
# Train the model
model.train_model(train_df)

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/157 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/157 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/157 [00:00<?, ?it/s]

(471, 0.09091328793144024)
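
The tuple above appears to hold the total number of training steps and the average training loss. That reading of the return value of `train_model` is an assumption based on the progress bars, not something this notebook states. The step count itself can be checked by hand from values that do appear here: 5,000 training rows, a batch size of 32, and 3 epochs. A minimal sketch, assuming one optimizer step per batch (no gradient accumulation); the variable names are illustrative only:

```python
import math

# Values taken from this notebook: 5,000 training rows (the "support" total
# in the train-set report), train_batch_size = 32, num_train_epochs = 3.
num_train_rows = 5000
train_batch_size = 32
num_train_epochs = 3

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_train_rows / train_batch_size)

# Assumption: one optimizer step per batch (no gradient accumulation).
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch)  # 157, matching the "0/157" progress bars
print(total_steps)      # 471, matching the first element of the tuple
```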

## Results

In [None]:
# Run evaluation on the training dataset
result, model_outputs, wrong_predictions = model.eval_model(train_df)

# Extract predicted labels and true labels
predictions = np.argmax(model_outputs, axis=1)
true_labels = train_df['labels']

# Build the confusion matrix and the per-class classification report
conf_matrix = confusion_matrix(true_labels, predictions)
class_report = classification_report(true_labels, predictions)

print("Train Set Results:")
print(f"Accuracy: {result['accuracy']}")
print(f"F1 Score: {result['f1_score']}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

print(f"Raw Result information: {result}")

Running Evaluation:   0%|          | 0/157 [00:00<?, ?it/s]

Train Set Results:
Accuracy: 0.9992
F1 Score: 0.999
Confusion Matrix:
[[2998    2]
 [   2 1998]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3000
           1       1.00      1.00      1.00      2000

    accuracy                           1.00      5000
   macro avg       1.00      1.00      1.00      5000
weighted avg       1.00      1.00      1.00      5000

Raw Result information: {'mcc': 0.9983333333333333, 'accuracy': 0.9992, 'f1_score': 0.999, 'tp': 1998, 'tn': 2998, 'fp': 2, 'fn': 2, 'auroc': 0.9999986666666666, 'auprc': 0.9999980007493756, 'eval_loss': 0.00378631795667539}


In [None]:
# Then run evaluation for the test dataset
result, model_outputs, wrong_predictions = model.eval_model(test_df)

# Extract predicted labels and true labels
predictions = np.argmax(model_outputs, axis=1)
true_labels = test_df['labels']

# Build the confusion matrix and the per-class classification report
conf_matrix = confusion_matrix(true_labels, predictions)
class_report = classification_report(true_labels, predictions)

print("Test Set Results:")
print(f"Accuracy: {result['accuracy']}")
print(f"F1 Score: {result['f1_score']}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

print(f"Raw Result information: {result}")

Running Evaluation:   0%|          | 0/8 [00:00<?, ?it/s]

Test Set Results:
Accuracy: 0.9823008849557522
F1 Score: 0.9823008849557522
Confusion Matrix:
[[111   2]
 [  2 111]]
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       113
           1       0.98      0.98      0.98       113

    accuracy                           0.98       226
   macro avg       0.98      0.98      0.98       226
weighted avg       0.98      0.98      0.98       226

Raw Result information: {'mcc': 0.9646017699115044, 'accuracy': 0.9823008849557522, 'f1_score': 0.9823008849557522, 'tp': 111, 'tn': 111, 'fp': 2, 'fn': 2, 'auroc': 0.9992168533166262, 'auprc': 0.9992408476866799, 'eval_loss': 0.41572882048785686}
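
The headline numbers in the raw result can be reproduced from the four confusion counts alone, which is a useful sanity check. The sketch below uses the standard textbook formulas for accuracy, precision, recall, F1 and the Matthews correlation coefficient (MCC); `binary_metrics` is a hypothetical helper written for this illustration, not part of simpletransformers or scikit-learn.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Compute common binary-classification metrics from confusion counts.

    Assumes every denominator is non-zero (true for the counts used here).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "mcc": mcc,
    }

# Confusion counts reported for the test set in this notebook.
test_metrics = binary_metrics(tp=111, tn=111, fp=2, fn=2)
for name, value in test_metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy and F1 both come out to 111/113 (about 0.9823) and MCC to about 0.9646, in line with the values printed above.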


In [None]:
predicted_df = test_df.copy()
predicted_df['predicted_spam'] = predictions
predicted_df['prediction'] = ['Spam' if x == 1 else 'Ham' for x in predicted_df['predicted_spam']]
predicted_df

| | text | labels | predicted_spam | prediction |
|---|---|---|---|---|
| 0 | subject perfect logo charset koi r thinking br... | 1 | 1 | Spam |
| 1 | subject storage model security stinson added t... | 0 | 0 | Ham |
| 2 | subject wall street micro news report homeland... | 1 | 1 | Spam |
| 3 | subject logo stationer website design much lt ... | 1 | 1 | Spam |
| 4 | subject video conference ross mcintyre vince r... | 0 | 0 | Ham |
| ... | ... | ... | ... | ... |
| 221 | subject sorry see hyatt lobby vince j kaminski... | 0 | 0 | Ham |
| 222 | subject yyyy know hgh difference hello jm netn... | 1 | 1 | Spam |
| 223 | subject try ouut hello welcome pharmon content... | 1 | 1 | Spam |
| 224 | subject department energy deploying corporate ... | 0 | 1 | Spam |


In [None]:
# View mispredicted emails in testing dataset
mispredictions_df = predicted_df[predicted_df['labels'] != predicted_df['predicted_spam']]
mispredictions_df

| | text | labels | predicted_spam | prediction |
|---|---|---|---|---|
| 30 | subject jif | 1 | 0 | Ham |
| 105 | subject get costco gold membership one best me... | 1 | 0 | Ham |
| 111 | subject subscribed frbnyrmagl list mon sep sub... | 0 | 1 | Spam |
| 224 | subject department energy deploying corporate ... | 0 | 1 | Spam |
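
For rows like these it can help to see how confident the model was, not just which class it picked. The notebook takes `argmax` over `model_outputs`, which suggests each row holds one raw score (logit) per class; a softmax turns such scores into probabilities. The sketch below is pure Python with made-up logits: `softmax` and `example_outputs` are hypothetical names and values for illustration, not output from this model.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three emails, ordered as [ham_score, spam_score].
example_outputs = [[3.2, -2.9], [-0.1, 0.3], [-4.0, 4.5]]

for logits in example_outputs:
    probs = softmax(logits)
    predicted = probs.index(max(probs))  # same decision as argmax on the logits
    label = "Spam" if predicted == 1 else "Ham"
    print(f"{label}: P(spam) = {probs[1]:.3f}")
```

A spam probability close to 0.5, as in the second made-up row, would mark a borderline decision that is worth reading by hand.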


In [None]:
# Write every other email from the first 50 test rows, with its prediction, to a text file
with open("BERT_formatted_example_email_spam_predictions.txt", "w") as predictions_file:
    for i in range(0, 50, 2):
        email = predicted_df.loc[i, 'text']
        label = predicted_df.loc[i, 'prediction']
        pred = f"Email: {email}.\nPrediction: This is a {label} email.\n"
        print(pred)
        predictions_file.write(pred + '\n')

Email: subject perfect logo charset koi r thinking breathing new life business start revamping front end logo visuai identity loqodentity offer creative custom design logo stationery web site careful hand powerfui marketinq toois wiii bring breath fresh air business make stand among competitor click away future success click see sample artwork check price hot offer.
Prediction: This is a Spam email.

Email: subject wall street micro news report homeland security investment terror attack united state september changed security landscape foreseeable future physical logical security become paramount industry segment especially banking national resource government sector according giga wholly owned subsidiary forrester research worldwide demand information security product service set eclipse b homeland security investment newsletter dedicated providing reader information pertaining investment opportunity lucrative sector know event related homeland security happen lightning speed investor