# Classification Project: Classifier Using Bayes' Theorem
## Machine Learning and Probability

#### Project

Classifying emails as spam or not spam based on the presence of three keywords: __WIN__, __FREE__, and __OFFER__.

We first analyze the dataset to see how the keywords (WIN, FREE, OFFER) relate to the spam and not-spam labels. For the following exercises, assume that the occurrences of the three keywords in an email are independent events.

In probability, two events A and B are independent when the probability of both occurring together is the product of their individual probabilities:

$$P(A \cap B) = P(A) \times P(B)$$

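This means that the presence of one keyword does not influence the appearance of the others. For example, within spam emails:

$$P(\mathrm{WIN} \cap \mathrm{FREE} \mid \mathrm{Spam}) = P(\mathrm{WIN} \mid \mathrm{Spam}) \times P(\mathrm{FREE} \mid \mathrm{Spam})$$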
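
As a quick numerical illustration of this factorization, the sketch below multiplies two conditional probabilities. The values 0.50 and 0.88 are the rounded P(WIN|Spam) and P(FREE|Spam) printed later in this notebook; the variable names are chosen here for illustration only.

```python
# Rounded values taken from the conditional-probability output later in this notebook.
p_win_given_spam = 0.50
p_free_given_spam = 0.88

# Under the independence assumption, the joint conditional probability factorizes.
p_win_and_free_given_spam = p_win_given_spam * p_free_given_spam

print(f"P(WIN and FREE | Spam) = {p_win_and_free_given_spam:.2f}")
```

If the keywords were not independent within spam emails, this product would only approximate the true joint probability.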

__Dataset:__

Email dataset with Spam and Not Spam labels

In [16]:
import pandas as pd

df = pd.read_excel('email_spam_nospam.xlsx')
df

|   | Email | Contains WIN? | Contains FREE? | Contains OFFER? | Spam or Not Spam? |
|---|---|---|---|---|---|
| 0 | WIN a FREE car | Yes | Yes | No | Spam |
| 1 | Special OFFER just for you | No | No | Yes | Not Spam |
| 2 | WIN a big FREE prize! | Yes | Yes | No | Spam |
| 3 | Limited time OFFER | No | No | Yes | Not Spam |
| 4 | Claim your FREE vacation! | No | Yes | No | Spam |
| 5 | Get an OFFER you can't refuse | No | No | Yes | Not Spam |
| 6 | WIN FREE gifts and prizes! | Yes | Yes | No | Spam |
| 7 | Exclusive OFFER ends soon | No | No | Yes | Not Spam |
| 8 | FREE upgrade on your purchase | No | Yes | No | Not Spam |
| 9 | WIN a FREE bonus now! | Yes | Yes | No | Spam |


Rename the columns and map the Yes/No and Spam/Not Spam values to 1/0.

In [17]:
import pandas as pd

df.columns = ["Email", "WIN", "FREE", "OFFER", "Spam"]

# Map categorical values to binary
df["WIN"] = df["WIN"].map({"Yes":1, "No":0})
df["FREE"] = df["FREE"].map({"Yes":1, "No":0})
df["OFFER"] = df["OFFER"].map({"Yes":1, "No":0})
df["Spam"] = df["Spam"].map({"Spam":1, "Not Spam":0})

df

|   | Email | WIN | FREE | OFFER | Spam |
|---|---|---|---|---|---|
| 0 | WIN a FREE car | 1 | 1 | 0 | 1 |
| 1 | Special OFFER just for you | 0 | 0 | 1 | 0 |
| 2 | WIN a big FREE prize! | 1 | 1 | 0 | 1 |
| 3 | Limited time OFFER | 0 | 0 | 1 | 0 |
| 4 | Claim your FREE vacation! | 0 | 1 | 0 | 1 |
| 5 | Get an OFFER you can't refuse | 0 | 0 | 1 | 0 |
| 6 | WIN FREE gifts and prizes! | 1 | 1 | 0 | 1 |
| 7 | Exclusive OFFER ends soon | 0 | 0 | 1 | 0 |
| 8 | FREE upgrade on your purchase | 0 | 1 | 0 | 0 |
| 9 | WIN a FREE bonus now! | 1 | 1 | 0 | 1 |
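
`Series.map` with a dictionary returns `NaN` for any value that is not one of its keys, so a stray spelling such as `"Yes "` would silently become a missing value. The sketch below uses a small made-up frame (the rows and the name `raw` are illustrative, not taken from the dataset) to show one way to check for that after mapping.

```python
import pandas as pd

# Hypothetical mini-frame; the second value has a trailing space on purpose.
raw = pd.DataFrame({"WIN": ["Yes", "Yes ", "No"]})

mapped = raw["WIN"].map({"Yes": 1, "No": 0})

# Values not found in the dictionary become NaN.
n_unmapped = int(mapped.isna().sum())
print(f"Unmapped values: {n_unmapped}")

# Stripping whitespace before mapping recovers the intended value.
cleaned = raw["WIN"].str.strip().map({"Yes": 1, "No": 0})
print(cleaned.tolist())
```

A check like `mapped.isna().sum()` right after the mapping step is cheap and catches labels that would otherwise distort the means computed later.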


Calculate Prior Probabilities

In [18]:
P_spam = df["Spam"].mean()
P_not_spam = 1 - P_spam

print(f"P(Spam) = {P_spam:.2f}")
print(f"P(Not Spam) = {P_not_spam:.2f}")

P(Spam) = 0.44
P(Not Spam) = 0.56
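
The prior is simply the fraction of emails carrying each label, which is why the mean of a 0/1 column works. The sketch below makes that explicit with a made-up label list. The counts (8 spam, 10 not spam) are an assumption chosen because they reproduce the printed 0.44 and 0.56; they are not figures stated in the dataset.

```python
import pandas as pd

# Hypothetical labels: 8 spam (1) and 10 not spam (0). Assumed counts, for illustration only.
labels = pd.Series([1] * 8 + [0] * 10)

# The mean of a 0/1 column equals the share of ones.
p_spam = labels.mean()

# value_counts(normalize=True) gives both class shares at once.
shares = labels.value_counts(normalize=True)

print(f"P(Spam) = {p_spam:.2f}")
print(f"P(Not Spam) = {shares.loc[0]:.2f}")
```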


Calculate Conditional Probabilities

In [20]:
def conditional_prob(word, spam=1):
    """P(word | spam or not spam)"""
    subset = df[df["Spam"] == spam]
    return subset[word].mean()

for word in ["WIN", "FREE", "OFFER"]:
    print(f"P({word}|Spam) = {conditional_prob(word,1):.2f}")
    print(f"P({word}|NotSpam) = {conditional_prob(word,0):.2f}")
    print("____")

P(WIN|Spam) = 0.50
P(WIN|NotSpam) = 0.20
____
P(FREE|Spam) = 0.88
P(FREE|NotSpam) = 0.20
____
P(OFFER|Spam) = 0.25
P(OFFER|NotSpam) = 0.70
____
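
The same conditional probabilities can be obtained in one step by grouping on the label and averaging the keyword columns. The sketch below rebuilds only the ten rows displayed earlier, so its numbers differ from the printed output above, which was presumably computed on the full file. The names `sample` and `cond` are illustrative.

```python
import pandas as pd

# The ten rows displayed earlier, already in 0/1 form.
sample = pd.DataFrame({
    "WIN":   [1, 0, 1, 0, 0, 0, 1, 0, 0, 1],
    "FREE":  [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
    "OFFER": [0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
    "Spam":  [1, 0, 1, 0, 1, 0, 1, 0, 0, 1],
})

# Rows are indexed by the label (0 = not spam, 1 = spam); each cell is P(word | label).
cond = sample.groupby("Spam")[["WIN", "FREE", "OFFER"]].mean()
print(cond)
```

In this ten-row sample P(OFFER|Spam) and P(WIN|Not Spam) are both 0, so an email containing both WIN and OFFER would get a likelihood of 0 under either label. That is the case a guard against a zero denominator has to handle.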


Bayes' Theorem Classifier

In [22]:
def bayes_classifier(win, free, offer):
    """Classify email using Bayes' theorem"""
    # Likelihoods for spam
    p_win_spam = conditional_prob("WIN",1) if win else (1-conditional_prob("WIN",1))
    p_free_spam = conditional_prob("FREE",1) if free else (1-conditional_prob("FREE",1))
    p_offer_spam = conditional_prob("OFFER",1) if offer else (1-conditional_prob("OFFER",1))
    P_features_given_spam = p_win_spam * p_free_spam * p_offer_spam

    # Likelihoods for not spam
    p_win_not = conditional_prob("WIN",0) if win else (1-conditional_prob("WIN",0))
    p_free_not = conditional_prob("FREE",0) if free else (1-conditional_prob("FREE",0))
    p_offer_not = conditional_prob("OFFER",0) if offer else (1-conditional_prob("OFFER",0))
    P_features_given_not = p_win_not * p_free_not * p_offer_not

    # Bayes' rule
    numerator = P_features_given_spam * P_spam
    denominator = numerator + P_features_given_not * P_not_spam
    P_spam_given_features = numerator / denominator if denominator > 0 else 0

    # Return the predicted label together with the posterior probability
    label = "Spam" if P_spam_given_features >= 0.5 else "Not Spam"
    return label, P_spam_given_features
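
Written out, the function computes the posterior below. The indicators $x_W, x_F, x_O \in \{0, 1\}$, which mark whether each keyword is present, and the class symbol $c$ are notation introduced here for illustration; they do not appear in the code.

```latex
P(\text{Spam} \mid x_W, x_F, x_O)
  = \frac{P(\text{Spam}) \prod_{k \in \{W, F, O\}} P(x_k \mid \text{Spam})}
         {P(\text{Spam}) \prod_{k \in \{W, F, O\}} P(x_k \mid \text{Spam})
          + P(\text{Not Spam}) \prod_{k \in \{W, F, O\}} P(x_k \mid \text{Not Spam})},
\qquad
P(x_k \mid c) =
  \begin{cases}
    P(k \mid c)     & \text{if } x_k = 1 \\
    1 - P(k \mid c) & \text{if } x_k = 0
  \end{cases}
```

The products come from the independence assumption, and the email is labeled Spam when this posterior is at least 0.5.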

Now we test whether new emails are classified as Spam or Not Spam based on the keywords they contain.

In [24]:
test_emails = {
    "WIN a FREE trip!": (1,1,0),
    "Exclusive OFFER for you!": (0,0,1),
    "WIN big with this special OFFER!": (1,0,1),
    "Get your FREE OFFER now!": (0,1,1),
    "WIN a FREE prize with our OFFER!": (1,1,1)
}

results = []
for email, (w,f,o) in test_emails.items():
    label, prob = bayes_classifier(w,f,o)
    results.append([email, w,f,o, label, round(prob,2)])

results_df = pd.DataFrame(results, columns=["Email","WIN","FREE","OFFER","Prediction","P(Spam)"])
display(results_df)

|   | Email | WIN | FREE | OFFER | Prediction | P(Spam) |
|---|---|---|---|---|---|---|
| 0 | WIN a FREE trip! | 1 | 1 | 0 | Spam | 0.96 |
| 1 | Exclusive OFFER for you! | 0 | 0 | 1 | Not Spam | 0.03 |
| 2 | WIN big with this special OFFER! | 1 | 0 | 1 | Not Spam | 0.1 |
| 3 | Get your FREE OFFER now! | 0 | 1 | 1 | Not Spam | 0.44 |
| 4 | WIN a FREE prize with our OFFER! | 1 | 1 | 1 | Spam | 0.76 |
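
As a cross-check, the P(Spam) column can be reproduced from the rounded probabilities printed earlier. The sketch below hard-codes those rounded values, so it does not need the Excel file, and uses hypothetical helpers `likelihood` and `posterior_spam`. Because the inputs are rounded, its results are only expected to agree with the table at the two decimals shown.

```python
# Rounded values copied from the printed output earlier in this notebook.
PRIOR = {"spam": 0.44, "not_spam": 0.56}
COND = {
    "spam":     {"WIN": 0.50, "FREE": 0.88, "OFFER": 0.25},
    "not_spam": {"WIN": 0.20, "FREE": 0.20, "OFFER": 0.70},
}

def likelihood(features, label):
    """P(features | label) under the independence assumption."""
    result = 1.0
    for word, present in features.items():
        p = COND[label][word]
        result *= p if present else (1 - p)
    return result

def posterior_spam(win, free, offer):
    """P(Spam | features) via Bayes' rule (hypothetical helper for illustration)."""
    features = {"WIN": win, "FREE": free, "OFFER": offer}
    num = likelihood(features, "spam") * PRIOR["spam"]
    den = num + likelihood(features, "not_spam") * PRIOR["not_spam"]
    return num / den if den > 0 else 0.0

for combo in [(1, 1, 0), (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]:
    print(combo, round(posterior_spam(*combo), 2))
```

For the combination (0, 1, 1) the posterior is about 0.44, just under the 0.5 threshold, which is why "Get your FREE OFFER now!" is labeled Not Spam.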
