Step 0. Unzip enron1.zip into the current directory.

Step 1. Traverse the dataset and create a pandas DataFrame. This is already done for you and should run without any errors; a few spam files that cannot be read are reported as skipped. You should recognize pandas from task 1.

In [23]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError:
                # A few files contain bytes that cannot be decoded as text; skip them
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

df1 = pd.DataFrame.from_records(ham)
df2 = pd.DataFrame.from_records(spam)
df = pd.concat([df1, df2], ignore_index=True)

skipped 2248.2004-09-23.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
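The four skipped files above are dropped from the dataset. If you would rather keep them, one option is to tell `open` to substitute a placeholder for undecodable bytes. The sketch below assumes the skips are caused by decoding errors; `read_category_lenient` is a hypothetical helper, not part of the assignment, and the demo files are made up.

```python
import os
import tempfile

def read_category_lenient(category, directory, encoding='utf-8'):
    """Hypothetical variant of read_category that keeps files with bad bytes."""
    emails = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.txt'):
            continue
        path = os.path.join(directory, filename)
        # errors='replace' swaps each undecodable byte for U+FFFD instead of raising
        with open(path, 'r', encoding=encoding, errors='replace') as fp:
            emails.append({'name': filename, 'content': fp.read(), 'category': category})
    return emails

# Demo on a throwaway directory: one .txt file with invalid UTF-8, one non-.txt file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'bad.txt'), 'wb') as fp:
        fp.write(b'cheap pills \xff\xfe now')
    with open(os.path.join(tmp, 'notes.md'), 'wb') as fp:
        fp.write(b'ignored')
    demo = read_category_lenient('spam', tmp)

print(demo)
```

Since the preprocessor in Step 2 replaces every non-alphabetic character with a space, the placeholder characters would not end up in the vocabulary.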


Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. Write a function `preprocessor` that takes a string, replaces every non-alphabetic character with a space, and then lowercases the result.

In [24]:
import re

def preprocessor(e):
    # Replace every character that is not an ASCII letter with a space, then lowercase
    return re.sub(r'[^A-Za-z]', ' ', e).lower()
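As a quick sanity check, the function can be tried on a short string. The sample text below is made up for illustration; the function body repeats the one above so the snippet runs on its own.

```python
import re

def preprocessor(e):
    # Same logic as the cell above
    return re.sub(r'[^A-Za-z]', ' ', e).lower()

sample = 'Hello, World! 123'   # made-up input
cleaned = preprocessor(sample)

# Each non-letter becomes exactly one space, so the length is unchanged
print(repr(cleaned))
print(cleaned.split())
```

Note that digits and punctuation turn into spaces rather than being deleted, so words that were separated only by punctuation stay separate tokens.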

Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions. It will be handy to keep that tab open.

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

vectorizer = CountVectorizer(preprocessor=preprocessor)
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['category'], test_size=0.2, random_state=42)
X_train_df = vectorizer.fit_transform(X_train)

model = LogisticRegression()
model.fit(X_train_df, y_train)

X_test_df = vectorizer.transform(X_test)
y_pred = model.predict(X_test_df)

print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}\n')
print(f'Detailed Statistics: \n{classification_report(y_test, y_pred)}')


Confusion Matrix: 
[[717  12]
 [ 14 291]]

Detailed Statistics: 
              precision    recall  f1-score   support

         ham       0.98      0.98      0.98       729
        spam       0.96      0.95      0.96       305

    accuracy                           0.97      1034
   macro avg       0.97      0.97      0.97      1034
weighted avg       0.97      0.97      0.97      1034
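The numbers in the report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below copies the matrix from the output above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, both in sorted label order (`ham`, then `spam`).

```python
# Confusion matrix copied from the output above.
# Rows: true labels, columns: predicted labels, both ordered ['ham', 'spam'].
cm = [[717, 12],
      [14, 291]]
labels = ['ham', 'spam']

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

report = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actually_label = sum(cm[i])                     # row sum (the "support")
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, actually_label)

print(f'accuracy = {accuracy:.2f}')
for label, (p, r, f1, support) in report.items():
    print(f'{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}')
```

Read this way, 12 ham emails were wrongly flagged as spam and 14 spam emails slipped through as ham.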



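Step 4. Inspect the trained model: list the 10 features with the largest positive coefficients and the 10 features with the most negative coefficients.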

In [26]:
features = vectorizer.get_feature_names_out()
importance = model.coef_[0]

# Pair each coefficient with its column index and sort by absolute magnitude
indexed_importance = list(enumerate(importance))
sorted_importance = sorted(indexed_importance, key=lambda x: abs(x[1]), reverse=True)

# Positive coefficients push towards model.classes_[1] ('spam'), negative ones towards 'ham'
top_positive = [(index, coef) for index, coef in sorted_importance if coef > 0][:10]
top_negative = [(index, coef) for index, coef in sorted_importance if coef < 0][:10]

df_top_positive = pd.DataFrame(top_positive, columns=['Feature Index', 'Importance'])
df_top_negative = pd.DataFrame(top_negative, columns=['Feature Index', 'Importance'])

df_top_positive['Feature'] = [features[index] for index, _ in top_positive]
df_top_negative['Feature'] = [features[index] for index, _ in top_negative]

print('Top 10 Positive Features')
print(df_top_positive)

print('\nTop 10 Negative Features')
print(df_top_negative)

Top 10 Positive Features
   Feature Index  Importance   Feature
0          24309    0.920089        no
1          17180    0.853581      http
2          27576    0.851243    prices
3          29417    0.760439    remove
4          16446    0.736283     hello
5          25120    0.711263      only
6          29418    0.677409   removed
7          16506    0.663679      here
8          23302    0.625946      more
9          25757    0.622317  paliourg

Top 10 Negative Features
   Feature Index  Importance   Feature
0          12156   -1.493970     enron
1          34478   -1.459408    thanks
2           2474   -1.368536  attached
3          10653   -1.325974       doc
4           9141   -1.295490     daren
5          26651   -1.292209  pictures
6          38503   -1.227058       xls
7           9328   -1.154972      deal
8          24004   -1.152822      neon
9          17096   -1.043461       hpl
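One way to read these coefficients: in a logistic regression, each coefficient is a change in log-odds, so `exp(coefficient)` is the factor by which the odds of the positive class change for each additional occurrence of the word, with all other counts held fixed. The sketch below uses a few values copied from the tables above and assumes the positive class is `spam` (the second entry of `model.classes_` when labels are sorted alphabetically).

```python
import math

# Coefficients copied from the tables above (assumed to be log-odds for 'spam')
coefficients = {
    'no': 0.920089,
    'http': 0.853581,
    'enron': -1.493970,
    'thanks': -1.459408,
}

# exp(coef) is the multiplicative change in the odds of 'spam' per extra occurrence
odds_ratios = {word: math.exp(coef) for word, coef in coefficients.items()}

for word, ratio in odds_ratios.items():
    direction = 'spam' if ratio > 1 else 'ham'
    print(f'{word:>8}: odds multiplied by {ratio:.2f} per occurrence (leans {direction})')
```

Because the model is regularized and the features are raw counts, these factors are best treated as a rough ranking of which words the model relies on, not as precise effect sizes.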


All Done!