### Naive Bayes classification example from the book "Principles of Data Science"
```
P(spam|<sentence>) = P(<sentence>|spam) * P(spam) / P(<sentence>)
P(non-spam|<sentence>) = P(<sentence>|non-spam) * P(non-spam) / P(<sentence>)
```
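
Both posteriors share the denominator `P(<sentence>)`, so comparing them only requires the numerators. Under the "naive" assumption that the words of a sentence occur independently given the class, each likelihood factorizes into per-word terms. Written for a sentence made of words $w_1, \dots, w_n$ (notation introduced here for illustration), the quantities computed in the cells below are:

```latex
\begin{aligned}
P(\text{spam} \mid w_1, \dots, w_n) &\propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) \\
P(\text{non-spam} \mid w_1, \dots, w_n) &\propto P(\text{non-spam}) \prod_{i=1}^{n} P(w_i \mid \text{non-spam})
\end{aligned}
```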

In [1]:
import pandas as pd
import sklearn

In [3]:
df = pd.read_table('https://raw.githubusercontent.com/sinanuozdemir/sfdat22/master/data/sms.tsv', sep='\t', header=None, names=['label', 'msg'])
df

Unnamed: 0,label,msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [4]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [13]:
# Class priors P(spam) and P(ham): the share of each label in the dataset
value_probability = df.label.value_counts() / len(df)
spam_probability = value_probability.spam
ham_probability = value_probability.ham
print('spam probability: {}, ham probability: {}'.format(spam_probability, ham_probability))

spam probability: 0.13406317300789664, ham probability: 0.8659368269921034


In [17]:
spams = df[df.label == 'spam']
sentence = 'send cash now'
spam_words_probability = 1
for word in sentence.split():
    # P(word|spam): share of spam messages containing the word.
    # str.contains matches case-sensitive substrings, so 'now' also matches 'know'.
    word_probability = spams[spams.msg.str.contains(word)].shape[0]/float(spams.shape[0])
    print("word {} probability: {}".format(word, word_probability))
    spam_words_probability *= word_probability
# Multiply by the prior P(spam) to get the numerator of Bayes' rule
spam_words_probability *= spam_probability
print('spam words probability: {}'.format(spam_words_probability))

word send probability: 0.06693440428380187
word cash probability: 0.06827309236947791
word now probability: 0.1994645247657296
spam words probability: 0.00012220082487226017


In [19]:
hams = df[df.label == 'ham']
sentence = 'send cash now'
ham_words_probability = 1
for word in sentence.split():
    # P(word|ham): share of ham messages containing the word (substring match)
    word_probability = hams[hams.msg.str.contains(word)].shape[0]/float(hams.shape[0])
    print("word {} probability: {}".format(word, word_probability))
    ham_words_probability *= word_probability
# Multiply by the prior P(ham) to get the numerator of Bayes' rule
ham_words_probability *= ham_probability
print('ham words probability: {}'.format(ham_words_probability))

word send probability: 0.023626943005181346
word cash probability: 0.002694300518134715
word now probability: 0.10051813471502591
ham words probability: 5.540949590575691e-06
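
The spam and ham cells above run the same loop on different subsets, so the calculation can be folded into one helper. The sketch below is a variation rather than the book's method: it works on plain lists of strings, matches whole lowercase words instead of substrings, and applies add-one (Laplace) smoothing so that a word absent from one class does not force the whole product to zero. The names `class_score` and `alpha` and the toy messages are hypothetical and exist only for this illustration.

```python
import re


def class_score(messages, sentence, prior, alpha=1.0):
    """Return prior * product of smoothed per-word message frequencies.

    messages: list of message strings that all belong to one class
    prior:    P(class), e.g. the share of that class in the data
    alpha:    smoothing strength (an assumption of this sketch)
    """
    # Tokenize each message once into a set of lowercase words
    token_sets = [set(re.findall(r"[a-z0-9']+", m.lower())) for m in messages]
    score = prior
    for word in sentence.lower().split():
        # Number of messages in this class that contain the whole word
        hits = sum(word in tokens for tokens in token_sets)
        # Smoothed share of messages containing the word;
        # with alpha > 0 it is never exactly 0 or 1
        score *= (hits + alpha) / (len(token_sets) + 2 * alpha)
    return score


# Toy data for illustration only; not taken from the SMS dataset
toy_spam = ["Send cash now to claim your prize", "WIN cash today", "call now"]
toy_ham = ["I will send the notes later", "see you now", "ok", "lunch?", "on my way"]

spam_score = class_score(toy_spam, "send cash now", prior=3 / 8)
ham_score = class_score(toy_ham, "send cash now", prior=5 / 8)
print('spam score: {}, ham score: {}'.format(spam_score, ham_score))
```

With the real data, the two lists would be the `msg` columns of the spam and ham subsets and the priors would be the class shares computed earlier. Because the matching and smoothing differ from the cells above, the resulting numbers would not match the printed ones.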


In [20]:
# The shared denominator P(<sentence>) cancels, so comparing the numerators is enough
if spam_words_probability > ham_words_probability:
    print('{} is more likely spam'.format(sentence))
else:
    print('{} is more likely NOT spam'.format(sentence))

send cash now is more likely spam
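
The two scores compared above are numerators of Bayes' rule, not probabilities. If spam and ham are treated as the only two classes, `P(<sentence>)` equals the sum of the two numerators, and dividing by it gives posteriors that add up to 1. A minimal sketch using the values printed earlier (the variable names are chosen for this example):

```python
# Unnormalized scores printed by the cells above
spam_score = 0.00012220082487226017
ham_score = 5.540949590575691e-06

# With only two classes, P(<sentence>) is the sum of both numerators
evidence = spam_score + ham_score
spam_posterior = spam_score / evidence
ham_posterior = ham_score / evidence

print('P(spam|sentence): {:.3f}, P(ham|sentence): {:.3f}'.format(
    spam_posterior, ham_posterior))
```

Under the model's assumptions this puts the sentence at roughly 0.96 probability of being spam.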


### Use the built-in library methods

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
x_train, x_test, y_train, y_test = train_test_split(df.msg, df.label, random_state=1)
vect = CountVectorizer()
train_dtm = vect.fit_transform(x_train)
test_dtm = vect.transform(x_test)
nb = MultinomialNB()
nb.fit(train_dtm, y_train)
predicts = nb.predict(test_dtm)
predicts

array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'], 
      dtype='<U4')

In [36]:
from sklearn import metrics
print('accuracy: {}, confusion matrix: {}'
      .format(metrics.accuracy_score(y_test, predicts), metrics.confusion_matrix(y_test, predicts)))

accuracy: 0.9885139985642498, confusion matrix: [[1203    5]
 [  11  174]]
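
In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, in sorted label order (here `ham`, then `spam`). Accuracy alone can look flattering when most messages are ham, so it can help to derive precision and recall for the spam class from the same four counts. A small sketch using the matrix printed above (the variable names are chosen for this example):

```python
# Confusion matrix printed above: rows = true labels, columns = predictions,
# both in the order (ham, spam)
confusion = [[1203, 5],
             [11, 174]]

(true_ham, false_spam), (false_ham, true_spam) = confusion
total = true_ham + false_spam + false_ham + true_spam

accuracy = (true_ham + true_spam) / total
# Share of messages flagged as spam that really are spam
spam_precision = true_spam / (true_spam + false_spam)
# Share of actual spam messages that were flagged
spam_recall = true_spam / (true_spam + false_ham)

print('accuracy: {:.4f}, spam precision: {:.4f}, spam recall: {:.4f}'.format(
    accuracy, spam_precision, spam_recall))
```

The accuracy recomputed this way agrees with the value reported by `metrics.accuracy_score`. The recall shows that 11 of the 185 spam messages in the test set slipped through as ham.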


For details on how to read the confusion matrix (rows are true labels, columns are predicted labels), see: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html