#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Bayesian Models

Bayesian models are at the heart of many ML applications, and they can be used for both regression and classification. For example, the "Naive Bayes" algorithm has proven to be an effective spam detection method. Bayesian inference is often used to model stochastic or time-series data in fields such as finance, healthcare, sales, marketing, and economics. Bayesian methods also appear in reinforcement learning (RL) algorithms, which drive complex automation like autonomous vehicles. And Bayesian optimization has been used to tune game-playing AI systems like [AlphaGo](https://deepmind.com/research/case-studies/alphago-the-story-so-far). Bayesian models make effective use of information: they encode what is known before seeing data as a prior probability distribution and update it with observed data to obtain a posterior distribution.

There are many libraries that implement probabilistic programming, including [TensorFlow Probability](https://www.tensorflow.org/probability).

In this Colab we will implement a Bayesian model using a Naive Bayes classifier to predict the likelihood of spam in a sample of text data.


### Load Packages

In [None]:
from zipfile import ZipFile
import urllib.request
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

## Naive Bayes

What is Naive Bayes?  There are two aspects: the first is naive, and the second is Bayes'. Let's first review the second part: Bayes' theorem from probability.

$$ P(x)P(y|x) = P(y)P(x|y) $$

Using this theorem, we can solve for the conditional probability of event $y$, given condition $x$: $P(y|x) = P(y)P(x|y) / P(x)$. Furthermore, Bayes' rule can be extended to incorporate $n$ features $x_1, ..., x_n$ as follows:

$$ P(y|x_1, ..., x_n) = \frac{P(y)P(x_1, ..., x_n|y)}{P(x_1, ..., x_n)}$$

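If we assume the features are independent of one another given the class, the joint likelihood $P(x_1, ..., x_n|y)$ factors into the product of the individual conditional probabilities $P(x_i|y)$. The denominator $P(x_1, ..., x_n)$ is the same for every class, so it can be dropped when comparing classes. Naive Bayes returns the $y$ value, or category, that maximizes the following expression.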

$$ \hat{y} = \arg\max_y \left( P(y)\prod_{i=1}^n P(x_i|y) \right) $$

Don't worry too much if this is a bit too much algebra. The actual implementations don't require us to remember everything!
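
To make the argmax formula concrete, here is a minimal sketch that evaluates $P(y)\prod_i P(x_i|y)$ by hand for two classes and two binary keyword features. The priors and likelihoods are made-up numbers chosen only for illustration, and `naive_bayes_scores` is a hypothetical helper, not part of any library.

```python
# Hypothetical probabilities, chosen only for illustration.
prior = {'spam': 0.2, 'ham': 0.8}

# P(keyword is present | class) for two keywords.
likelihood = {
    'spam': {'free': 0.6, 'cash': 0.4},
    'ham': {'free': 0.05, 'cash': 0.02},
}


def naive_bayes_scores(present_words):
    """Returns P(y) * prod_i P(x_i | y) for each class y."""
    scores = {}
    for label, p_label in prior.items():
        score = p_label
        for word, p_word in likelihood[label].items():
            # Use P(x_i = 1 | y) if the word is present, else P(x_i = 0 | y).
            score *= p_word if word in present_words else (1 - p_word)
        scores[label] = score
    return scores


# A message that contains "free" but not "cash".
scores = naive_bayes_scores({'free'})
y_hat = max(scores, key=scores.get)

# Dividing by the sum of the scores recovers the posterior probability.
posterior_spam = scores['spam'] / sum(scores.values())

print(scores)
print(y_hat, round(posterior_spam, 3))
```

The unnormalized scores are enough to pick the winning class. Normalizing is only needed if you want the posterior probability itself.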

### But Wait, Why "Naive"?

In this context, "naive" refers to the assumption that the features are conditionally independent of one another given the class (or at least have low [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity)). This is typically not the case, and the violation is a source of error. In practice, Naive Bayes often picks the right class, but the probabilities it outputs tend to be poorly calibrated, so it works better for classification than for probability estimation. It also cannot model interactions between features, where the meaning of one variable depends on another. Such interactions come up quite frequently in natural language processing (NLP), so the usefulness of Naive Bayes is limited to simpler applications. Sometimes simple is better, as in spam filtering, where Naive Bayes can perform reasonably well with limited training data.
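
The effect of violating the independence assumption can be sketched with a small experiment. Below, the same binary feature is fed to `BernoulliNB` once and then three times as if it were three separate features. The toy data is invented for illustration. Because Naive Bayes multiplies the likelihood of each copy, the duplicated evidence is counted three times and the predicted probability becomes more extreme.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy data (an assumption for illustration): one binary feature and a label.
x = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])

# The same feature used once, and copied into three perfectly
# correlated columns.
X_single = x.reshape(-1, 1)
X_triple = np.column_stack([x, x, x])

# Probability of class 1 when the feature (or every copy of it) is present.
p_single = BernoulliNB().fit(X_single, y).predict_proba([[1]])[0, 1]
p_triple = BernoulliNB().fit(X_triple, y).predict_proba([[1, 1, 1]])[0, 1]

print(f'one copy:     P(y=1 | x) = {p_single:.3f}')
print(f'three copies: P(y=1 | x) = {p_triple:.3f}')
```

The predicted class is the same in both cases, which is why Naive Bayes can still classify well even when its probability estimates are overconfident.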

## Spam Filtering

In [None]:
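def LoadZip(url, file_name, cols=('type', 'message')):
    """Downloads a zip archive and reads one tab-separated file from it."""
    # Download the archive to local disk.
    urllib.request.urlretrieve(url, 'spam.zip')
    # Read the requested file directly from the archive without extracting it.
    with ZipFile('spam.zip') as myzip:
        with myzip.open(file_name) as myfile:
            df = pd.read_csv(myfile, sep='\t', header=None)

    df.columns = list(cols)
    # `display` is provided by the notebook environment (IPython).
    display(df.head())
    display(df.shape)
    return df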

url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/00228/'
       'smsspamcollection.zip')
df = LoadZip(url, 'SMSSpamCollection')

First let's analyze the number of spam vs. ham. For reference, "ham" is the opposite of "spam", so a non-spam message.

In [None]:
# Pass the column by keyword; recent seaborn versions require `x=`.
sns.countplot(x=df['type'])
plt.show()

Here we notice a class imbalance with under 1000 spam messages out of over 5000 total messages.

Now we create a list of keywords that might indicate spam and generate a feature column for each keyword.


In [None]:
features = pd.DataFrame()
keywords = ['selected', 'win','deal', 'free', 'trip', 'urgent', 'require',
            'need', 'cash', 'asap']

# Use regex search built into pandas.
for k in keywords:
    features[k]=df['message'].str.contains(k, case=False)
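
One detail of this approach is worth seeing on a small scale: `str.contains` matches substrings, so the keyword `win` also matches `window`. The sketch below uses three invented messages (not taken from the dataset) and shows one possible way to restrict a match to whole words with the regex word boundary `\b`.

```python
import pandas as pd

# Invented messages for illustration only.
toy_messages = pd.Series([
    'WIN a FREE trip now',
    'Close the window please',
    'See you at lunch',
])

# Same pattern as above: case-insensitive substring match per keyword.
toy_features = pd.DataFrame(
    {k: toy_messages.str.contains(k, case=False) for k in ['win', 'free']})

# Whole-word match: `\b` marks a word boundary in the regex.
toy_whole_word = toy_messages.str.contains(r'\bwin\b', case=False)

print(toy_features)
print(toy_whole_word)
```

Whether substring or whole-word matching works better depends on the keyword, so it can be treated as one more feature-engineering choice to experiment with.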

Let's look at the correlation of features.

In [None]:
features['allcaps'] = df['message'].str.isupper()
sns.heatmap(features.corr())

plt.show()

The heatmap shows only weak correlations between variables like 'cash', 'win', 'free', and 'urgent'. We will therefore treat the keywords as approximately independent. Strictly speaking, the correlations are not zero, so we are still violating the independence assumption to some degree.

## Train a Model to Predict Spam

In [None]:
np.random.seed(seed=0)
X = features
y = df['type']
X_train, X_test, y_train, y_test = train_test_split(X,y)
sns.countplot(x=y_test)
plt.show()
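
Because the classes are imbalanced, a purely random split can leave the test set with a somewhat different spam ratio than the full dataset. One option, sketched here on invented data rather than on `df`, is to pass `stratify` to `train_test_split` so that both splits keep the original class proportions. The `toy_X` and `toy_y` names are placeholders for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented, imbalanced labels: 20% spam and 80% ham.
toy_y = pd.Series(['spam'] * 20 + ['ham'] * 80)
toy_X = pd.DataFrame({'row_id': range(100)})

# `stratify` keeps the class ratio the same in the train and test splits.
# `random_state` makes the split reproducible.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0)

print(toy_y_train.value_counts())
print(toy_y_test.value_counts())
```
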

Using `features`, we will now make predictions on whether an individual message is spam or ham.

In [None]:
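def classifyNB(X_train, y_train, X_test, y_test, cols=('spam', 'ham')):
    nb = BernoulliNB()

    nb.fit(X_train, y_train)

    y_pred = nb.predict(X_test)
    class_names = list(cols)
    print('Classification Report')
    # Pass `labels` so each row of the report lines up with its name.
    # Without it, `target_names` is applied to the labels in sorted order.
    print(classification_report(y_test, y_pred, labels=class_names,
                                target_names=class_names))
    # Rows of `cm` are the actual classes; columns are the predicted classes.
    cm = confusion_matrix(y_test, y_pred, labels=class_names)
    df_cm = pd.DataFrame(cm, index=class_names, columns=class_names)

    sns.heatmap(df_cm, cmap='Blues', annot=True, fmt="d",
                xticklabels=True, yticklabels=True, cbar=False, square=True)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.suptitle("Confusion Matrix")
    plt.show()

classifyNB(X_train, y_train, X_test, y_test)

scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns. Treating spam as the positive class, the confusion matrix reads as follows:

* 1182 ham messages were correctly predicted (true negatives)
* 26 ham messages were predicted to be spam (false positives, Type I error)
* 71 spam messages were correctly predicted (true positives)
* 114 spam messages were erroneously predicted to be ham (false negatives, Type II error)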



### Precision and Recall

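Remember that precision and recall are both computed against the ground truth labels. Precision is the fraction of messages predicted to be spam that really are spam. Recall is the fraction of actual spam messages that the model caught. Review the diagram below for clarification.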

In [None]:
%%html

<a title="Walber [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons" 
   href="https://commons.wikimedia.org/wiki/File:Precisionrecall.svg">
    <img width="256" alt="Precisionrecall" 
         src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/256px-Precisionrecall.svg.png">
</a>
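
The definitions in the diagram can be checked on a tiny example. This sketch uses invented labels (not the model output above) with spam as the positive class. It computes precision and recall by hand from the confusion matrix and then with scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented ground truth and predictions for illustration.
toy_true = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham']
toy_pred = ['spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham']

# Rows are actual classes and columns are predicted classes,
# in the order given by `labels`.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=['spam', 'ham'])
tp, fn = toy_cm[0]  # actual spam: predicted spam, predicted ham
fp, tn = toy_cm[1]  # actual ham: predicted spam, predicted ham

manual_precision = tp / (tp + fp)  # of predicted spam, how much is spam?
manual_recall = tp / (tp + fn)     # of actual spam, how much was caught?

sk_precision = precision_score(toy_true, toy_pred, pos_label='spam')
sk_recall = recall_score(toy_true, toy_pred, pos_label='spam')

print(manual_precision, sk_precision)
print(manual_recall, sk_recall)
```
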

For email, what's more important: spam detection or ham protection?

In the case of your inbox, I don't think anyone wants to have legitimate email end up in the spam folder. On the other hand, your organization may be the target of phishing, and it may be important to filter out all spam aggressively. The answer to the question depends on the situation.
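
One way to act on this trade-off is to move the decision threshold applied to the predicted spam probability. The sketch below uses hand-written probabilities as a stand-in for the output of a fitted model (in practice they could come from something like `nb.predict_proba`). The numbers and the `precision_recall_at` helper are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented labels and spam probabilities, standing in for model output.
toy_labels = np.array(
    ['spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham'])
toy_p_spam = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.3, 0.1, 0.05])


def precision_recall_at(threshold):
    """Flags a message as spam when its probability reaches the threshold."""
    flagged = np.where(toy_p_spam >= threshold, 'spam', 'ham')
    precision = precision_score(toy_labels, flagged, pos_label='spam')
    recall = recall_score(toy_labels, flagged, pos_label='spam')
    return precision, recall


results = {t: precision_recall_at(t) for t in (0.8, 0.5, 0.25)}
for t, (precision, recall) in results.items():
    print(f'threshold={t:.2f}  precision={precision:.2f}  recall={recall:.2f}')
```

A high threshold protects ham (high precision, low recall), while a low threshold filters spam aggressively (higher recall, lower precision).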

# Resources

* [Naive Bayes docs](https://scikit-learn.org/stable/modules/naive_bayes.html)
* [Spam dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)
* [Sentiment reviews](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)
* [Paper on classifiers](http://mdenil.com/static/papers/2015-deep-multi-instance-learning.pdf)
* [Bayesian inference](https://cran.r-project.org/web/packages/LaplacesDemon/vignettes/BayesianInference.pdf)

# Exercises

## Exercise 1

Let's load some user reviews data and do a sentiment analysis. Download the text data from [this UCI ML archive](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences).

Create a classifier using Naive Bayes for one of the three datasets in the cell below. See how it performs on the other two sets of reviews. Comment on your approach to building features and why that may or may not work well for each dataset.

In [None]:
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
'00331/sentiment%20labelled%20sentences.zip')

cols = ['message', 'sentiment']
folder = 'sentiment labelled sentences'
print('\nYelp')
df_yelp = LoadZip(url, folder+'/yelp_labelled.txt', cols)
print('\nAmazon')
df_amazon = LoadZip(url, folder+'/amazon_cells_labelled.txt', cols)
print('\nImdb')
df_imdb = LoadZip(url, folder+'/imdb_labelled.txt', cols)

### Student Solution

In [None]:
# Your answer goes here

---