# Naive Bayes

S: the event that a message is spam

B: the event that a message contains the word "Bitcoin"

The probability that a message is spam, given that it contains "Bitcoin":

```
P(S|B) = [P(B|S) * P(S)] / [P(B|S) * P(S) + P(B|~S) * P(~S)]
```

Numerator: Probability that a message is spam and it contains "Bitcoin"

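Denominator: Probability that a message contains "Bitcoin"

If we assume that any message is equally likely to be spam or not spam, i.e. P(S) = P(~S) = 0.5, the priors cancel and this simplifies to: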

```
P(S|B) = P(B|S) / [P(B|S) + P(B|~S)]
```

Example:

- Our data: 50% of spam messages contain the word "Bitcoin", but

     only 1% of non-spam messages contain it.

     ```
     P(B|S) = 0.5
     
     P(B|~S) = 0.01
     
     P(S|B) = 0.5 / (0.5 + 0.01) ≈ 0.98
     ```
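The same arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `posterior_spam` and the 10% prior in the second call are illustrative assumptions, not part of the original example. It also shows how the answer changes when spam and non-spam are not equally likely.

```python
def posterior_spam(p_b_given_s: float, p_b_given_not_s: float, p_s: float = 0.5) -> float:
    """P(S|B) from Bayes' theorem; p_s is the prior probability of spam."""
    numerator = p_b_given_s * p_s
    denominator = numerator + p_b_given_not_s * (1.0 - p_s)
    return numerator / denominator


# Equal priors, as in the example above
equal_prior = posterior_spam(0.5, 0.01)

# Hypothetical prior: only 10% of all messages are spam
low_prior = posterior_spam(0.5, 0.01, p_s=0.1)

print(round(equal_prior, 2), round(low_prior, 2))  # 0.98 0.85
```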

*A better spam filter*:

We can extend the reach of our spam filter by adding more words to the list of words that may indicate spam.

Suppose w = [w1, w2, w3, ..., wn] is the vocabulary of the spam filter.

Suppose Xi is the event "a message contains the word wi".

S = event that a message is spam

~S = event that a message is not spam

`P(Xi | S)` is the probability that a message contains the word wi, given that it is a spam message

`P(Xi | ~S)` is the probability that a message contains the word wi, given that it is not a spam message

**Assumption**

*The key idea of Naive Bayes is the "naive" assumption that the presence or absence of each vocabulary word is independent of every other word, conditional on the message being spam or not.*

Example:

It assumes that knowing a certain spam message contains the word 'Bitcoin' gives no information about whether the same spam message also contains the word 'Rolex'.

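Applying this assumption to the probability model:

Here xi records whether the word wi occurs in the message, so the left-hand side below is the joint probability of a particular combination of words (e.g. w3, w5, ..., wi) occurring together.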

`P(x1=X1, x2=X2, ..... xn=Xn | S) = P(x1=X1 | S) * P(x2=X2 | S) * .... * P(xn=Xn | S)`

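Combined with Bayes' theorem (and equal priors, as before), where x=X stands for the whole combination x1=X1, ..., xn=Xn: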

`P(S | x=X) = P(x=X | S) / [P(x=X | S) + P(x=X | ~S)]`

The Naive Bayes assumption allows us to compute each of the probabilities on the right by simply multiplying together the individual probabilities for each vocabulary word.

In practice, multiplying many probabilities that are close to zero can cause "*underflow*", where the product becomes too small for floating-point numbers to represent. To avoid this, we add the logs of the probabilities and exponentiate the sum,

since:

$\log(ab) = \log(a) + \log(b)$

$\exp(\log(x)) = x$

which means a product can be computed as `p1 * p2 * p3 * ... * pn = exp(log(p1) + log(p2) + ... + log(pn))`
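A small sketch of the underflow problem, using made-up probabilities (the values in `small` and `probs` are assumptions chosen only for illustration). For a handful of factors the two routes agree; for many tiny factors the direct product collapses to zero while the sum of logs remains an ordinary number.

```python
import math

# A few factors: product and exp(sum of logs) agree
small = [0.5, 0.2, 0.1]
direct = 0.5 * 0.2 * 0.1
via_logs = math.exp(sum(math.log(p) for p in small))

# Many tiny factors: hypothetical per-word probabilities
probs = [1e-5] * 100

naive_product = 1.0
for p in probs:
    naive_product *= p  # the true value 1e-500 is below the smallest positive float

log_sum = sum(math.log(p) for p in probs)  # 100 * log(1e-5), about -1151.29

print(direct, via_logs)
print(naive_product)  # 0.0
print(log_sum)
```

Note that `math.exp(log_sum)` would also underflow here, so the log-sum only helps while the comparison between classes is carried out in log space or the values stay within floating-point range.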



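Suppose "data" is a word that never appeared in a spam message in the training set.

Then the estimate is `P(Xi="data"|S) = 0`, and the classifier would assign zero spam probability to every message containing "data". To avoid this we use a "*pseudocount*" k, which modifies the probability calculation to:

`P(Xi | S) = (k + number of spam messages containing wi) / (2 * k + number of spam messages)`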

- Example:

    "authentic" appears in 0 out of 98 spam messages.
    
    `P(Xi="authentic"|S) = 0/98 = 0`
    
    So we add a pseudocount of k = 1 and get:
    
    `P(Xi="authentic"|S) = (1 + 0) / (2 * 1 + 98) = 1/100`, which avoids a zero probability
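The smoothed estimate is a one-line function. The name `smoothed_probability` is an assumption made for this sketch; the formula is the one given above.

```python
def smoothed_probability(count: int, total: int, k: float = 1.0) -> float:
    """Estimate P(word | class) as (k + count) / (2 * k + total), with pseudocount k."""
    return (k + count) / (2 * k + total)


# "authentic" appears in 0 of 98 spam messages
p_authentic = smoothed_probability(0, 98, k=1)
print(p_authentic)  # 0.01

# A smaller pseudocount pulls the estimate closer to the raw 0/98
p_authentic_half = smoothed_probability(0, 98, k=0.5)
print(p_authentic_half)
```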

## Implementation of Naive Bayes

**ALGORITHM**

1. classifier
2. tokenizer
3. clean


Example message:

"""
*This message contains the word "bitcoin" and- is a spam message of occurance of 0*
"""

In [None]:
from collections import defaultdict
from typing import Set, Iterable, NamedTuple, Dict, Tuple
import re
import math

In [None]:
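def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return the set of distinct words and numbers in it."""
    text = text.lower()
    # text = text.replace("'","")
    all_words = re.findall(r"[a-z0-9]+", text)  # runs of letters and digits
    return set(all_words)  # duplicates are dropped: only presence matters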

In [None]:
message = """This message contains the word "bitcoin" and- is a spam message of occurance of 0"""

all_words = tokenize(message)
print(all_words)

{'this', 'the', 'bitcoin', 'and', 'is', 'a', 'occurance', 'of', 'word', '0', 'message', 'spam', 'contains'}


In [None]:
class Message(NamedTuple):
    text: str
    is_spam: bool


# Next we count the training messages and their tokens: spam_counts = number of spam messages, ham_counts = number of non-spam messages
class NaiveBayesClassifier:
    def __init__(self, k: float = 0.5) -> None:
        self.k = k  # smoothing factor
        self.tokens: Set[str] = set()       # vocabulary
        self.token_spam_counts: Dict[str, int] = defaultdict(int)
        self.token_ham_counts: Dict[str, int] = defaultdict(int)
        self.spam_counts = self.ham_counts = 0


    # Train the model: tokenize each message and, for each token, increment either its spam count or its ham count
    def train(self, messages: Iterable[Message]) -> None:
        for message in messages:
            # increment message count
            if message.is_spam:
                self.spam_counts += 1
            else:
                self.ham_counts += 1

            # check and increment token counts
            for token in tokenize(message.text):
                self.tokens.add(token)
                if message.is_spam:
                    self.token_spam_counts[token] += 1
                else:
                    self.token_ham_counts[token] += 1


    # Ultimately we want P(spam | tokens). By Bayes' theorem we need P(token | spam) and P(token | ham) for every token, and we multiply these individual probabilities, so this helper computes them
    def _probabilities(self, token: str) -> Tuple[float, float]:
        """Returns P(token| spam) and P(token|ham)"""
        spam = self.token_spam_counts[token]
        ham = self.token_ham_counts[token]

        p_token_spam = (spam + self.k) / (self.spam_counts + 2 * self.k)        # k is the smoothing parameter (pseudocount)
        p_token_ham = (ham + self.k) / (self.ham_counts + 2 * self.k)

        return p_token_spam, p_token_ham
    

    def predict(self, text: str) -> float:
        text_tokens = tokenize(text)
        log_prob_if_spam = log_prob_if_ham = 0

        # iterate through each word in vocabulary
        for token in self.tokens:
            prob_if_spam, prob_if_ham = self._probabilities(token)

            # if *token* appears in message, add the log probability of seeing it
            if token in text_tokens:
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_ham += math.log(prob_if_ham)

            # otherwise, add log probability of not seeing it, which is log(1-probability of seeing it)
            else:
                log_prob_if_spam += math.log(1.0 - prob_if_spam)
                log_prob_if_ham += math.log(1.0 - prob_if_ham)

        prob_if_spam = math.exp(log_prob_if_spam)       # P(X1, ..., Xn | S)
        prob_if_ham = math.exp(log_prob_if_ham)       # P(X1, ..., Xn | ~S)

        total_probability = prob_if_spam / (prob_if_spam + prob_if_ham)       # P(S | X1, ..., Xn), assuming P(S) = P(~S)
        return total_probability
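One caveat: `predict` exponentiates each summed log-probability before taking the ratio, so with a large vocabulary both exponentials can underflow to 0.0 and the division fails. A possible variation, shown here only as a sketch and not part of the implementation above, computes the ratio directly from the difference of the two log values. The helper name `posterior_from_logs` and the cutoff of 700 are assumptions for illustration.

```python
import math


def posterior_from_logs(log_prob_if_spam: float, log_prob_if_ham: float) -> float:
    """P(spam | message) from the two summed log-likelihoods (equal priors assumed)."""
    # a / (a + b) == 1 / (1 + b / a) == 1 / (1 + exp(log(b) - log(a)))
    diff = log_prob_if_ham - log_prob_if_spam
    if diff > 700:  # math.exp would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Both likelihoods underflow if exponentiated separately...
print(math.exp(-2000.0), math.exp(-2005.0))  # 0.0 0.0

# ...but the posterior is still well defined (about 0.993)
print(posterior_from_logs(-2000.0, -2005.0))
```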

## Testing our Naive Bayes classifying model

Let's test our model on a tiny training set.

In [None]:
# messages = [
#     Message("spam rules", is_spam=True),
#     Message("spam rules", is_spam=False),
#     Message("ham message", is_spam=False),
# ]

messages = [
    Message("spam rules", is_spam=True),
    Message("not ham rules", is_spam=False),
    Message("not a spam message", is_spam=False),
]

model = NaiveBayesClassifier(k=0.5)
model.train(messages)

In [None]:
print(f"Words extracted (Tokens) from the messages are: {model.tokens}")
print(f"No. of spam messages are: {model.spam_counts}")
print(f"No. of ham messages are: {model.ham_counts}")
print(f"Word count in spam messages: {dict(model.token_spam_counts)}")
print(f"Word count in ham messages: {dict(model.token_ham_counts)}")

Words extracted (Tokens) from the messages are: {'a', 'ham', 'rules', 'message', 'not', 'spam'}
No. of spam messages are: 1
No. of ham messages are: 2
Word count in spam messages: {'spam': 1, 'rules': 1}
Word count in ham messages: {'ham': 1, 'not': 2, 'rules': 1, 'a': 1, 'message': 1, 'spam': 1}


Now let's see how the prediction works.

In [None]:
text = "hello spam"

Calculating the probability of spam and ham by hand using **Bayes' theorem**:

In [None]:
# TODO

Bayes' theorem applied to the 1st test case:

```
P(S|B) = [P(B|S) * P(S)] / [P(B|S) * P(S) + P(B|~S) * P(~S)]

assuming P(S) = P(~S) = 0.5 (spam and ham are equally likely a priori)

P(S|B) = P(B|S) / [P(B|S) + P(B|~S)] --> Objective


Raw estimate -> smoothed estimate with k = 1
(raw estimate = no. of messages in the class containing the word / no. of messages in the class)

P(Xi = 'hello'|S) = 0/1 -> (0 + 1)/(1 + 2 * 1) = 0.33

P(Xi = 'spam'|S) = 1/1 -> (1 + 1)/(1 + 2 * 1) = 0.66


P(Xi = 'hello'|~S) = 0/2 -> (0 + 1)/(2 + 2 * 1) = 0.25

P(Xi = 'spam'|~S) = 2/2 -> (2 + 1)/(2 + 2 * 1) = 0.75


P(Xi='hello' AND Xi = 'spam'|S) = P(Xi = 'hello'|S) * P(Xi = 'spam'|S)          (with the naive assumption that the presence of 'hello' and the presence of 'spam' are independent given the class)

P(Xi='hello' AND Xi = 'spam'|S) = 0.33 * 0.66 = 0.218

P(Xi='hello' AND Xi = 'spam'|~S) = P(Xi = 'hello'|~S) * P(Xi = 'spam'|~S)

P(Xi='hello' AND Xi = 'spam'|~S) = 0.25 * 0.75 = 0.187

```

Finally:

`P(S | message contains 'hello' and 'spam') = P(S | Xi='hello' AND Xi = 'spam') = 0.218/(0.218 + 0.187) = 0.538`

By hand, the message has a 53.8% chance of being spam (our model outputs 62.8%).

Bayes' theorem applied to the 2nd test case:

```
P(S|B) = [P(B|S) * P(S)] / [P(B|S) * P(S) + P(B|~S) * P(~S)]

assuming P(S) = P(~S) = 0.5

P(S|B) = P(B|S) / [P(B|S) + P(B|~S)] --> Objective


Raw estimate -> smoothed estimate with k = 1

P(Xi = 'hello'|S) = 0/1 -> (0 + 1)/(1 + 2 * 1) = 0.33

P(Xi = 'spam'|S) = 1/1 -> (1 + 1)/(1 + 2 * 1) = 0.66


P(Xi = 'hello'|~S) = 0/2 -> (0 + 1)/(2 + 2 * 1) = 0.25

P(Xi = 'spam'|~S) = 1/2 -> (1 + 1)/(2 + 2 * 1) = 0.5


P(Xi='hello' AND Xi = 'spam'|S) = P(Xi = 'hello'|S) * P(Xi = 'spam'|S)          (with the naive assumption that the presence of 'hello' and the presence of 'spam' are independent given the class)

P(Xi='hello' AND Xi = 'spam'|S) = 0.33 * 0.66 = 0.218

P(Xi='hello' AND Xi = 'spam'|~S) = P(Xi = 'hello'|~S) * P(Xi = 'spam'|~S)

P(Xi='hello' AND Xi = 'spam'|~S) = 0.25 * 0.5 = 0.125

```

Finally:

`P(S | message contains 'hello' and 'spam') = P(S | Xi='hello' AND Xi = 'spam') = 0.218/(0.218 + 0.125) = 0.635`

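By hand, the message has a 63.5% chance of being spam (our model outputs 91.9%). The two numbers differ because the hand calculation uses k = 1 and only the two words in the message, while the model uses k = 0.5, ignores 'hello' because it is not in the vocabulary, and also multiplies in the probability of *not* seeing each vocabulary word that is absent from the message.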

Calculating the probability of spam and ham using our **model**:

In [None]:
percentile_spam = round(100 * (model.predict(text)), 1)

print(f"The given message has a {percentile_spam}% chance of being a spam message, based on the training data above")

The given message has a 91.9% chance of being a spam message, based on the training data above
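The model's 91.9% can be reproduced without the class, which makes it easier to see what `predict` does. The sketch below hard-codes the counts printed earlier for the training data and follows the same steps: k = 0.5, one factor per vocabulary word, and `1 - p` for every word that is absent from the message. The variable names are chosen for this illustration only.

```python
k = 0.5
n_spam, n_ham = 1, 2

# Counts taken from the training output above
spam_counts = {"spam": 1, "rules": 1}
ham_counts = {"ham": 1, "not": 2, "rules": 1, "a": 1, "message": 1, "spam": 1}
vocabulary = set(spam_counts) | set(ham_counts)

text_tokens = {"hello", "spam"}  # "hello" is not in the vocabulary, so it is never used

p_spam = p_ham = 1.0
for token in vocabulary:
    p_token_spam = (spam_counts.get(token, 0) + k) / (n_spam + 2 * k)
    p_token_ham = (ham_counts.get(token, 0) + k) / (n_ham + 2 * k)
    if token in text_tokens:
        # word is present: multiply by the probability of seeing it
        p_spam *= p_token_spam
        p_ham *= p_token_ham
    else:
        # word is absent: multiply by the probability of not seeing it
        p_spam *= 1.0 - p_token_spam
        p_ham *= 1.0 - p_token_ham

posterior = p_spam / (p_spam + p_ham)
print(round(100 * posterior, 1))  # 91.9
```

With only six vocabulary words the plain product is safe; the log-sum used in `predict` matters once the vocabulary is large.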
