# Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam classifier! Most of the code is just loading our training data into a pandas DataFrame that we can play with:

In [19]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [20]:
import os
import io
import numpy
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

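def readFiles(path):
    """Yield (file path, message body) for every file under path."""
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            # Use a separate name so the path argument is not overwritten
            filePath = os.path.join(root, filename)

            # Headers end at the first blank line; everything after it is the body
            inBody = False
            lines = []
            with io.open(filePath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            # Each line keeps its trailing newline, so this join doubles the
            # line breaks; that is harmless for word counting
            message = '\n'.join(lines)
            yield filePath, message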


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

data = DataFrame({'message': [], 'class': []})

# Change these paths to match where the emails live on your system
data = pd.concat([data, dataFrameFromDirectory("/content/drive/MyDrive/Colab Notebooks/machine-learning/emails/spam", "spam")])
data = pd.concat([data, dataFrameFromDirectory("/content/drive/MyDrive/Colab Notebooks/machine-learning/emails/ham", "ham")])

# For older pandas such as 1.3 (DataFrame.append was removed in pandas 2.0):
#data = data.append(dataFrameFromDirectory('emails/spam', 'spam'))
#data = data.append(dataFrameFromDirectory('emails/ham', 'ham'))


Let's have a look at that DataFrame:

In [6]:
data.head()

| | message | class |
|---|---|---|
| /content/drive/MyDrive/Colab Notebooks/machine-learning/emails/spam/00010.445affef4c70feec58f9198cfbc22997 | `Dear ricardo1 ,\n\n\n\n<html>\n\n<body>\n\n<ce...` | spam |
| /content/drive/MyDrive/Colab Notebooks/machine-learning/emails/spam/00003.2ee33bc6eacdb11f38d052c44819ba6c | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` | spam |
| /content/drive/MyDrive/Colab Notebooks/machine-learning/emails/spam/00017.1a938ecddd047b93cbd7ed92c241e6d1 | `Help wanted. We are a 14 year old fortune 500...` | spam |
| /content/drive/MyDrive/Colab Notebooks/machine-learning/emails/spam/00005.57696a39d7d84318ce497886896bf90d | `I thought you might like these:\n\n1) Slim Dow...` | spam |
| /content/drive/MyDrive/Colab Notebooks/machine-learning/emails/spam/00007.d8521faf753ff9ee989122f6816f87d7 | `Help wanted. We are a 14 year old fortune 500...` | spam |


Now we will use a CountVectorizer to turn each message into a vector of word counts (one column per word in the vocabulary), and throw that into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [7]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

Let's try it out:

In [8]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], dtype='<U4')

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying a train/test split to this spam classifier and see how well it predicts a held-out subset of the ham and spam emails.

### Testing with changed examples

In [17]:
examples = ['Buy Viagra for Bob', "Hi Bob, how about free viagra?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['ham', 'ham'], dtype='<U4')
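
The hard 'spam'/'ham' label hides how close a call was. As a sketch, assuming a tiny hand-written corpus (the messages, labels, and `toy_` names below are made up for illustration), predict_proba shows the probability the model assigns to each class. It also shows that a word the vectorizer never saw during training contributes nothing, so the prediction falls back to the class priors:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: two spam and two ham messages
toy_messages = [
    "win free money now",
    "free prize claim now",
    "lunch meeting tomorrow",
    "golf game tomorrow",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# "viagra" never appears in the toy training data, so its count vector is all zeros
test_messages = ["free golf tomorrow", "claim free prize", "viagra"]
probabilities = toy_classifier.predict_proba(toy_vectorizer.transform(test_messages))

# Columns of predict_proba follow the order of classifier.classes_
for message, row in zip(test_messages, probabilities):
    scores = dict(zip(toy_classifier.classes_, row.round(3)))
    print(message, scores)
```

In this toy setup the unknown word "viagra" ends up at an even split between the two classes, because the training labels are balanced and there is no other evidence. Running the same check against the real classifier is one way to investigate a surprising prediction.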

### Split into train and test data

In [13]:
from sklearn.model_selection import train_test_split

# Split the dataset into features (X) and target (y)
X = data['message']
y = data['class']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# If you want to combine features and target back into DataFrames
train_df = pd.DataFrame({'message': X_train, 'class': y_train})
test_df = pd.DataFrame({'message': X_test, 'class': y_test})

# Display the shapes of the training and testing sets
print(f'Training set shape: {train_df.shape}')
print(f'Testing set shape: {test_df.shape}')


Training set shape: (2400, 2)
Testing set shape: (600, 2)


In [14]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train_df['message'].values)

classifier = MultinomialNB()
targets = train_df['class'].values
classifier.fit(counts, targets)
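
One way to avoid juggling the vectorizer and the classifier as separate objects is scikit-learn's Pipeline. This is an optional alternative, not what the cells in this notebook do. A minimal sketch on made-up messages (the `toy_` names and `spam_pipeline` are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data, purely for illustration
toy_train_messages = [
    "win free money now",
    "free prize claim now",
    "lunch meeting tomorrow",
    "golf game tomorrow",
]
toy_train_labels = ["spam", "spam", "ham", "ham"]

# fit() learns the vocabulary and trains the classifier in one call
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_train_messages, toy_train_labels)

# predict() takes raw text and reuses the vocabulary learned from the training data
toy_test_messages = ["claim free prize", "golf tomorrow"]
toy_predictions = spam_pipeline.predict(toy_test_messages)
print(toy_predictions)
```

Because the pipeline only ever fits the vectorizer inside fit(), there is no risk of accidentally calling fit_transform on the test messages.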

### Testing on test data

In [15]:
# Transform the test data using the same vectorizer
test_counts = vectorizer.transform(X_test)

# Predict the classes for the test data
predictions = classifier.predict(test_counts)

# Evaluate the performance
from sklearn.metrics import accuracy_score, classification_report

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')

# Print classification report for more detailed evaluation
report = classification_report(y_test, predictions)
print('Classification Report:')
print(report)


Accuracy: 0.96
Classification Report:
              precision    recall  f1-score   support

         ham       0.95      1.00      0.97       482
        spam       0.98      0.81      0.88       118

    accuracy                           0.96       600
   macro avg       0.97      0.90      0.93       600
weighted avg       0.96      0.96      0.96       600
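
The precision and recall columns in the report come straight from counts of correct and incorrect predictions. A small sketch on hand-made labels (these ten labels are invented for illustration, not taken from the test set above) shows the arithmetic with confusion_matrix:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels: 6 ham and 4 spam messages
toy_true = ["ham"] * 6 + ["spam"] * 4
# The hypothetical model gets every ham right but lets one spam through as ham
toy_pred = ["ham"] * 6 + ["spam", "spam", "spam", "ham"]

# Rows are the true class, columns the predicted class, in the order given by labels
matrix = confusion_matrix(toy_true, toy_pred, labels=["ham", "spam"])
print(matrix)
# [[6 0]
#  [1 3]]

# Precision: of everything flagged as spam, how much really was spam
spam_precision = precision_score(toy_true, toy_pred, pos_label="spam")
# Recall: of all real spam, how much was caught
spam_recall = recall_score(toy_true, toy_pred, pos_label="spam")
print(f"spam precision: {spam_precision:.2f}, spam recall: {spam_recall:.2f}")
# spam precision: 1.00, spam recall: 0.75
```

This has the same shape as the report above, where spam precision (0.98) is higher than spam recall (0.81): the classifier rarely flags ham as spam, but it lets some spam through.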

