# Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam classifier! Most of the code just assembles our training data into a pandas DataFrame that we can play with:

In [1]:
import numpy as np
from pandas import DataFrame, concat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Since the original email dataset isn't available, we'll create sample spam/ham data
# In a real scenario, you would load emails from a directory

# Sample spam messages (typical spam characteristics)
spam_messages = [
    "Free Viagra now!!! Click here for amazing deals!!!",
    "You have won $1,000,000!!! Claim your prize NOW!",
    "URGENT: Your account has been compromised. Send password immediately.",
    "Cheap medications available online. No prescription needed!",
    "Get rich quick! Work from home and earn thousands!",
    "Limited time offer! Buy now and save 90%!",
    "Congratulations! You've been selected for a free iPhone!",
    "Hot singles in your area waiting to meet you!",
    "Make money fast with this one weird trick!",
    "FREE trial! No credit card required! Act now!",
    "Your lottery ticket has won! Send bank details.",
    "Discount prescription drugs shipped overnight!",
    "Click here to claim your free gift card!",
    "Double your income working from home!",
    "Exclusive deal just for you! 80% off everything!",
] * 10  # Repeat to get more samples

# Sample ham messages (typical legitimate emails)
ham_messages = [
    "Hi Bob, how about a game of golf tomorrow?",
    "Please find attached the quarterly report for review.",
    "Can we schedule a meeting for next Tuesday at 2pm?",
    "Thanks for your help with the project yesterday.",
    "Reminder: Team lunch is scheduled for Friday at noon.",
    "I've updated the documentation as requested.",
    "The conference call has been moved to 3pm.",
    "Please review the attached proposal and provide feedback.",
    "Looking forward to seeing you at the company event.",
    "The deadline for the project has been extended.",
    "Could you send me the updated spreadsheet?",
    "Great presentation today! Very informative.",
    "Let me know when you're available to discuss.",
    "The meeting notes from today are attached.",
    "Thanks for getting back to me so quickly.",
] * 10  # Repeat to get more samples

# Create DataFrames for spam and ham
spam_df = DataFrame({'message': spam_messages, 'class': 'spam'})
ham_df = DataFrame({'message': ham_messages, 'class': 'ham'})

# Combine into a single DataFrame using concat (DataFrame.append was removed in pandas 2.0)
data = concat([spam_df, ham_df], ignore_index=True)
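
The comments above mention that real emails would be loaded from a directory. A minimal sketch of that step might look like the following, assuming one message per file and one directory per class. The helper names `load_messages` and `load_corpus`, and the `emails/spam` and `emails/ham` paths, are hypothetical and not part of the original notebook.

```python
from pathlib import Path

from pandas import DataFrame, concat


def load_messages(directory, label):
    """Read every file under `directory` into a DataFrame labelled `label`."""
    rows = []
    for path in sorted(Path(directory).rglob('*')):
        if path.is_file():
            # latin-1 maps every byte to a character, so decoding never fails
            text = path.read_text(encoding='latin-1')
            rows.append({'message': text, 'class': label})
    return DataFrame(rows, columns=['message', 'class'])


def load_corpus(spam_dir, ham_dir):
    """Combine a spam directory and a ham directory into one DataFrame."""
    return concat(
        [load_messages(spam_dir, 'spam'), load_messages(ham_dir, 'ham')],
        ignore_index=True,
    )

# Hypothetical usage; the paths are placeholders:
# data = load_corpus('emails/spam', 'emails/ham')
```

The result has the same message and class columns as the DataFrame built above, so the rest of the notebook would work unchanged.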


Let's have a look at that DataFrame:

In [2]:
data.head()

| | message | class |
|---|---|---|
| 0 | Free Viagra now!!! Click here for amazing deal... | spam |
| 1 | You have won $1,000,000!!! Claim your prize NOW! | spam |
| 2 | URGENT: Your account has been compromised. Sen... | spam |
| 3 | Cheap medications available online. No prescri... | spam |
| 4 | Get rich quick! Work from home and earn thousa... | spam |


Now we will use a CountVectorizer to turn each message into counts of the words it contains, and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [3]:
# Extract the words and count the occurrences of each word in each individual message,
# i.e. convert the collection of text documents to a matrix of token counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

# Multinomial naive bayes
classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB()


Let's try it out:

In [4]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], dtype='<U4')
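
predict() only returns the winning label. If you also want to see how confident the model is, MultinomialNB offers predict_proba(). The sketch below trains on a tiny stand-in data set so that it runs on its own; the `demo_` names and the two new messages are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in training set so the block is self-contained
demo_messages = [
    "Free Viagra now!!! Click here for amazing deals!!!",
    "Click here to claim your free gift card!",
    "Hi Bob, how about a game of golf tomorrow?",
    "Can we schedule a meeting for next Tuesday at 2pm?",
]
demo_labels = ['spam', 'spam', 'ham', 'ham']

demo_vectorizer = CountVectorizer()
demo_classifier = MultinomialNB()
demo_classifier.fit(demo_vectorizer.fit_transform(demo_messages), demo_labels)

new_messages = ["Claim your free gift now", "Golf on Tuesday?"]
new_counts = demo_vectorizer.transform(new_messages)
probabilities = demo_classifier.predict_proba(new_counts)

# classes_ gives the column order of predict_proba (sorted labels)
for message, row in zip(new_messages, probabilities):
    scores = {label: round(float(p), 3)
              for label, p in zip(demo_classifier.classes_, row)}
    print(message, scores)
```

Words the vectorizer never saw during fitting (such as "on" here) are simply ignored by transform(), so they have no effect on the probabilities.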

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying a train/test split to this spam classifier and see how well it predicts a held-out subset of the ham and spam emails.

In [5]:
classifier = MultinomialNB()
# Randomly assign roughly 80% of the rows to the training set and the rest to the test set
splitIndex = np.random.rand(len(data)) < 0.8
train = data[splitIndex]
tests = data[~splitIndex]
# Refit the vectorizer on the training messages only, so the vocabulary comes from the training set
train_counts = vectorizer.fit_transform(train['message'].values)

In [6]:
classifier.fit(train_counts, train['class'].values)

MultinomialNB()


In [7]:
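# Assess prediction on test set
# Pass the message column, not the whole DataFrame: iterating over a DataFrame
# yields its column names, so only 'message' and 'class' would get vectorized
test_counts = vectorizer.transform(tests['message'].values)

predictions = classifier.predict(test_counts)

In [8]:
predictions

The result is an array with one 'spam' or 'ham' label per message in tests; its exact contents depend on the random split.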
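
The same experiment can also be sketched with scikit-learn's own helpers: train_test_split for the split, a pipeline so the vectorizer is only ever fitted on training messages, and accuracy_score for a single summary number. One caveat: the data above repeats every message ten times, so a random split puts copies of the same text on both sides and the score will look better than it should. The sketch therefore uses a handful of unique messages taken from the lists above. The variable names, test_size and random_state are illustrative choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Unique messages only, so no test message is a copy of a training message
messages = [
    "You have won $1,000,000!!! Claim your prize NOW!",
    "Cheap medications available online. No prescription needed!",
    "Get rich quick! Work from home and earn thousands!",
    "Limited time offer! Buy now and save 90%!",
    "Click here to claim your free gift card!",
    "Double your income working from home!",
    "Please find attached the quarterly report for review.",
    "Can we schedule a meeting for next Tuesday at 2pm?",
    "Thanks for your help with the project yesterday.",
    "The conference call has been moved to 3pm.",
    "Could you send me the updated spreadsheet?",
    "The meeting notes from today are attached.",
]
labels = ['spam'] * 6 + ['ham'] * 6

# stratify keeps the spam/ham ratio similar in both parts;
# random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=0)

# The pipeline fits the vectorizer on the training messages only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

test_predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, test_predictions)
print(f"{len(X_test)} held-out messages, accuracy = {accuracy:.2f}")
```

With so few messages the accuracy is very noisy, so treat the number as a sanity check rather than a measure of how good the filter is.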