# Week Five Part 2

Ari and Lucas

Our assignment for Week 5 is to use a labeled dataset to classify new e-mails as either spam or ham (non-spam).

We will be using the dataset provided for us: UCI Machine Learning Repository: Spambase Data Set (https://archive.ics.uci.edu/dataset/94/spambase). Our first step is to load the data and print a few summaries to see what we are working with.

In [11]:
import pandas as pd

#prepare column names
column_names = [
    'word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
    'word_freq_our', 'word_freq_over', 'word_freq_remove', 'word_freq_internet',
    'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will',
    'word_freq_people', 'word_freq_report', 'word_freq_addresses', 'word_freq_free',
    'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit',
    'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money',
    'word_freq_hp', 'word_freq_hpl', 'word_freq_george', 'word_freq_650',
    'word_freq_lab', 'word_freq_labs', 'word_freq_telnet', 'word_freq_857',
    'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology',
    'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct',
    'word_freq_cs', 'word_freq_meeting', 'word_freq_original', 'word_freq_project',
    'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference',
    'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!', 'char_freq_$',
    'char_freq_#', 'capital_run_length_average', 'capital_run_length_longest',
    'capital_run_length_total', 'target'
]

#load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
spambase = pd.read_csv(url, header=None, names=column_names)

#prints
print(spambase.head())
print(f"Total emails/rows: {len(spambase)}")
print("Total spam emails (target=1):", spambase['target'].sum())
print("Total ham emails (target=0):", len(spambase) - spambase['target'].sum())

   word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
0            0.00               0.64           0.64           0.0   
1            0.21               0.28           0.50           0.0   
2            0.06               0.00           0.71           0.0   
3            0.00               0.00           0.00           0.0   
4            0.00               0.00           0.00           0.0   

   word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
0           0.32            0.00              0.00                0.00   
1           0.14            0.28              0.21                0.07   
2           1.23            0.19              0.19                0.12   
3           0.63            0.00              0.31                0.63   
4           0.63            0.00              0.31                0.63   

   word_freq_order  word_freq_mail  ...  char_freq_;  char_freq_(  \
0             0.00            0.00  ...         0.00        0.000   
1 

There are 58 columns in spambase, most of them holding the relative frequencies of specific words and characters. Three more features describe runs of capital letters, since spam emails often rely on them, e.g. URGENT NOTICE or LIMITED TIME OFFER. The last column, 'target', is the binary label: spam (1) or ham (0).
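As a quick sanity check on that column layout, the names can be grouped by prefix. The helper below is a hypothetical sketch (`count_feature_groups` and `sample_names` are not part of the notebook); it is shown on a short sample list, and passing the full `column_names` list from the loading cell should report 48 word-frequency, 6 character-frequency, and 3 capital-run columns plus the target.

```python
from collections import Counter

def count_feature_groups(names):
    """Count column names by prefix; 'target' is reported as its own group."""
    groups = Counter()
    for name in names:
        if name == "target":
            groups["target"] += 1
        elif name.startswith("word_freq_"):
            groups["word_freq"] += 1
        elif name.startswith("char_freq_"):
            groups["char_freq"] += 1
        elif name.startswith("capital_run_length_"):
            groups["capital_run_length"] += 1
        else:
            groups["other"] += 1
    return dict(groups)

# Short illustrative sample; swap in the notebook's column_names for the real check
sample_names = [
    "word_freq_make", "word_freq_free", "char_freq_!",
    "capital_run_length_total", "target",
]
sample_counts = count_feature_groups(sample_names)
print(sample_counts)
```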

Next, we need to split our dataset into training and testing sets. The spambase dataset has 4,601 rows of email data. Common splits for training and testing range from 70:30 to 80:20. In this case, we will use 70:30.

In [19]:
from sklearn.model_selection import train_test_split

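# Split: 70% train, 30% test
# Drop 'target' from the features so the label cannot leak into X
X = spambase.drop(columns='target')
y = spambase['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=77)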

print(f"Training set: {len(X_train)}")
print(f"Testing set: {len(X_test)}")

Training set: 3220
Testing set: 1381
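A plain random split does not guarantee the same share of spam in both halves. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below runs on synthetic stand-in data (`X_demo` and `y_demo` are made up for illustration), so it only demonstrates the mechanism, not a result on Spambase.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 rows, one feature, 40% positive labels
y_demo = np.array([1] * 400 + [0] * 600)
X_demo = rng.normal(size=(1000, 1))

# stratify=y_demo keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=77, stratify=y_demo
)

print(f"Train rows: {len(X_tr)}, positive share: {y_tr.mean():.3f}")
print(f"Test rows:  {len(X_te)}, positive share: {y_te.mean():.3f}")
```

In the notebook itself this would presumably amount to adding `stratify=spambase['target']` to the existing call, which would change which rows land in each split.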


Next, we need to train a classifier to predict whether an email is spam or not. We chose Logistic Regression because it is well suited to binary predictions and works directly with numeric features, which all of our columns are.

Below is a simple Logistic Regression classifier with capital_run_length_total as its only predictive feature. This column gives the total number of capital letters in the email. As discussed above, many spam emails use capital letters to look urgent or to pop out at the viewer.
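To make the single-feature model concrete: logistic regression passes a linear score through the sigmoid function to get a spam probability, and predicts spam when that probability exceeds 0.5. The sketch below uses made-up values for `coef` and `intercept` purely for illustration; they are not the coefficients the notebook's model learns.

```python
import math

def spam_probability(total_caps, coef, intercept):
    """Sigmoid of the linear score coef * x + intercept."""
    score = coef * total_caps + intercept
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical parameters, chosen only to show the shape of the curve
coef, intercept = 0.002, -1.0

for total_caps in (0, 500, 2000):
    p = spam_probability(total_caps, coef, intercept)
    label = "spam" if p > 0.5 else "ham"
    print(f"capital_run_length_total={total_caps:5d} -> P(spam)={p:.3f} ({label})")
```

With one feature and a positive coefficient, the model reduces to a single cutoff on the capital-letter count, which limits how well it can separate the two classes.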

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

#Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train[['capital_run_length_total']], y_train)

#use Logistic Regression model to make predictions on the test set
results = model.predict(X_test[['capital_run_length_total']])

#print results
print(classification_report(y_test, results))

accuracy = accuracy_score(y_test, results)
print(f"Accuracy: {accuracy * 100:.2f}%")


              precision    recall  f1-score   support

           0       0.63      0.92      0.75       803
           1       0.70      0.25      0.37       578

    accuracy                           0.64      1381
   macro avg       0.66      0.59      0.56      1381
weighted avg       0.66      0.64      0.59      1381

Accuracy: 64.08%
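The overall accuracy can be cross-checked against the per-class rows of the report, since accuracy is the support-weighted average of the per-class recalls. The sketch below uses the rounded values printed above, so it should land close to, but not exactly on, the reported 64.08%.

```python
# Values read off the classification report above (rounded to two decimals there)
support = {"ham": 803, "spam": 578}
recall = {"ham": 0.92, "spam": 0.25}

total = sum(support.values())

# Correct predictions per class are roughly recall * support
correct = sum(recall[c] * support[c] for c in support)
approx_accuracy = correct / total

# Spam emails the model lets through as ham
missed_spam = (1 - recall["spam"]) * support["spam"]

print(f"Approximate accuracy from recalls: {approx_accuracy:.3f}")
print(f"Spam emails missed: roughly {missed_spam:.0f} of {support['spam']}")
```

This makes the weakness visible: most of the accuracy comes from ham being classified correctly, while about three quarters of the spam is missed.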


The simple Logistic Regression classifier above reaches an accuracy of 64.08%, which is not very good for a binary prediction: uniform random guessing would average 50%, and a model that always predicts ham would already score about 58% on this test set (803 of 1,381 emails). To improve this, we can add more features and standardize them by scaling after picking which ones to use.
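One way to combine several features with scaling is a scikit-learn pipeline, which fits the scaler on the training data only and reuses it at prediction time. The sketch below runs on synthetic data with two features on very different scales (`small` and `large` are invented stand-ins for a word frequency and a capital-letter count), so the score it prints says nothing about Spambase.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 600

# Synthetic labels and two features whose means shift with the label
y_syn = rng.integers(0, 2, size=n)
small = rng.normal(loc=0.2 + 0.6 * y_syn, scale=0.3)    # small-scale feature
large = rng.normal(loc=200 + 300 * y_syn, scale=150)    # large-scale feature
X_syn = np.column_stack([small, large])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=77
)

# Scaler and classifier are fitted together; the scaler never sees the test set
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_tr, y_tr)

syn_accuracy = pipeline.score(X_te, y_te)
print(f"Accuracy on synthetic data: {syn_accuracy * 100:.2f}%")
```

In the notebook, the same pattern would presumably be applied to a chosen list of Spambase columns instead of `X_syn`.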

The features we added are:


In [32]:
from sklearn.preprocessing import StandardScaler

#select the feature columns (only capital_run_length_total is used in this cell)
X_train_feature = X_train[['capital_run_length_total']]
X_test_feature = X_test[['capital_run_length_total']]

#scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_feature)
X_test_scaled = scaler.transform(X_test_feature)

#Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)

#use Logistic Regression model to make predictions on the test set
results = model.predict(X_test_scaled)

#print results
print(classification_report(y_test, results))

accuracy = accuracy_score(y_test, results)
print(f"Accuracy: {accuracy * 100:.2f}%")

              precision    recall  f1-score   support

           0       0.63      0.93      0.75       803
           1       0.71      0.25      0.37       578

    accuracy                           0.64      1381
   macro avg       0.67      0.59      0.56      1381
weighted avg       0.66      0.64      0.59      1381

Accuracy: 64.23%


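With scaling applied, the accuracy went up only slightly, from 64.08% to 64.23%. The cell above still uses capital_run_length_total as its only feature, so no additional features contribute to this result yet.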
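A likely reason scaling alone changes so little with one feature: standardization is an affine transform, so any one-feature logistic model fitted on scaled input has an exactly equivalent model on the raw input. What remains is the effect of regularization and solver convergence, which depend on the feature's scale. The sketch below checks the equivalence with hypothetical numbers (`mean`, `std`, `coef_s`, and `intercept_s` are made up, not taken from the fitted model).

```python
def logit_scaled(x, mean, std, coef_s, intercept_s):
    """Linear score of a model trained on the standardized feature."""
    z = (x - mean) / std
    return coef_s * z + intercept_s

def to_raw_params(mean, std, coef_s, intercept_s):
    """Rewrite the scaled-space model as coefficients on the raw feature."""
    coef_raw = coef_s / std
    intercept_raw = intercept_s - coef_s * mean / std
    return coef_raw, intercept_raw

# Hypothetical scaler statistics and model parameters
mean, std = 280.0, 600.0
coef_s, intercept_s = 0.9, -0.4

coef_raw, intercept_raw = to_raw_params(mean, std, coef_s, intercept_s)

for x in (0.0, 100.0, 1500.0):
    scaled_score = logit_scaled(x, mean, std, coef_s, intercept_s)
    raw_score = coef_raw * x + intercept_raw
    print(f"x={x:7.1f}  scaled-space score={scaled_score:.4f}  raw-space score={raw_score:.4f}")
```

Scaling matters more once several features with different ranges share one regularization penalty.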

## Conclusion & Next steps