#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Bayesian Models

Bayesian models are at the heart of many ML applications and can be used for both regression and classification. Naive Bayes has proven to be an effective spam detection method. Bayesian inference is often used to model stochastic, temporal, or time-series data in fields like finance, healthcare, sales, marketing, and economics. Bayesian networks also underpin some reinforcement learning (RL) algorithms, which drive complex automation such as autonomous vehicles. Bayesian optimization has been used to tune AI game opponents like AlphaGo. Bayesian models make effective use of information: they are parameterized by a prior probability distribution, which is updated with observed data to produce a posterior distribution.

There are many libraries that implement probabilistic programming, including [TensorFlow Probability](https://www.tensorflow.org/probability).

In this module we will implement a Bayesian model using a Naive Bayes classifier to predict the likelihood of spam in a sample of text data.


## Overview

### Learning Objectives

* Review Bayes' Theorem
* Build a Classifier with sklearn
* Clean and analyze text
* Predict spam or ham
* Predict review sentiment (+ or -)

### Prerequisites

* Probability
* Classification with sklearn
* Visualization

### Estimated Duration

60 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |

There is 1 exercise in this Colab, so there are 3 points available. The grading scale will be 3 points.

### Load Packages

In [0]:
from zipfile import ZipFile
import urllib.request
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

## Naive Bayes

What is Naive Bayes? The name has two parts: "naive" and "Bayes". Let us first review Bayes' theorem from probability.

$$ P(x)P(y|x) = P(y)P(x|y) $$

Using this theorem, we can solve for the conditional probability of event $y$ given condition $x$: $P(y|x) = P(y)P(x|y)/P(x)$. Furthermore, Bayes' rule can be extended to condition on $n$ features $x_1, \dots, x_n$ as follows:

$$ P(y|x_1, \dots, x_n) = \frac{P(y)P(x_1, \dots, x_n|y)}{P(x_1, \dots, x_n)} $$

If we assume that the features are conditionally independent given $y$, the joint likelihood $P(x_1, \dots, x_n|y)$ factors into the product of the individual conditional probabilities $P(x_i|y)$. The denominator does not depend on $y$, so it can be dropped when comparing categories. Naive Bayes returns the $y$ value, or category, that maximizes the following expression.

$$ \hat{y} = \operatorname{argmax}_y \left( P(y)\prod_{i=1}^n P(x_i|y) \right) $$
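
As a concrete illustration of the argmax rule, the sketch below scores two classes by hand for binary keyword features. The priors, the keyword probabilities, and the helper name `naive_bayes_scores` are hypothetical values chosen only for this example; they are not estimated from the spam dataset.

```python
# Hypothetical toy numbers, chosen only for illustration.
priors = {"spam": 0.2, "ham": 0.8}  # P(y)

# P(x_i = 1 | y): probability that each keyword appears, given the class.
likelihoods = {
    "spam": {"free": 0.6, "urgent": 0.3},
    "ham": {"free": 0.05, "urgent": 0.02},
}


def naive_bayes_scores(present):
    """Return the unnormalized score P(y) * prod_i P(x_i | y) per class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word, p in likelihoods[label].items():
            # Bernoulli likelihood: p if the word is present, 1 - p if absent.
            score *= p if word in present else 1.0 - p
        scores[label] = score
    return scores


scores = naive_bayes_scores({"free"})
y_hat = max(scores, key=scores.get)  # the argmax over classes

# Dividing by the sum of the scores recovers the posterior P(y | x).
posterior_spam = scores["spam"] / sum(scores.values())
print(scores, y_hat, round(posterior_spam, 3))
```

Note that the argmax only needs the unnormalized scores, which is why the denominator of Bayes' rule never has to be computed for classification.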

### But wait, why Naive?

In this context, "naive" refers to the assumption that every pair of features is conditionally independent given the class; in other words, that the features of your model have low multicollinearity. This is typically not the case, and the violation is a source of error. In practice, Naive Bayes is a decent classifier but a poor estimator: the predicted class is often right, while the predicted probabilities are not reliable. Furthermore, it cannot capture interactions, where two variables create a different effect when they both take specific values. Interactions come up quite frequently in natural language processing, so the usefulness of Naive Bayes is limited to simpler applications. Sometimes simple is better, as in spam filtering, where Naive Bayes can perform reasonably well with limited training data.
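
To see how a violated independence assumption distorts the probabilities, the sketch below fits `BernoulliNB` on a single mildly predictive feature and then on three identical copies of it. The toy arrays are hypothetical. Because the model treats each copy as fresh evidence, the predicted probability becomes more extreme even though no information was added.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: one binary feature that is mildly predictive of y.
x = np.array([1, 1, 1, 0, 0, 0, 0, 1]).reshape(-1, 1)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

single = BernoulliNB().fit(x, y)

# Three perfectly correlated copies of the same column.
x_dup = np.hstack([x, x, x])
dup = BernoulliNB().fit(x_dup, y)

# Probability of class 1 when the feature is present (column 1 is class 1).
p_single = single.predict_proba([[1]])[0, 1]
p_dup = dup.predict_proba([[1, 1, 1]])[0, 1]

print(f"P(y=1 | x=1), one copy:     {p_single:.3f}")
print(f"P(y=1 | x=1), three copies: {p_dup:.3f}")
```

The predicted class is the same in both cases, which is why classification can stay accurate while the probability estimates degrade.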

## Spam Filtering

In [0]:
def LoadZip(url, file_name, cols=['type', 'message']):
    # Download the archive to a local file
    urllib.request.urlretrieve(url, 'spam.zip')
    # Read the tab-separated file directly from the zip archive
    with ZipFile('spam.zip') as myzip:
        with myzip.open(file_name) as myfile:
            df = pd.read_csv(myfile, sep='\t', header=None)

    df.columns=cols
    display(df.head())
    display(df.shape)
    return df

url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/00228/'
       'smsspamcollection.zip')
df = LoadZip(url, 'SMSSpamCollection')

## First analyze the number of spam vs. ham messages

In [0]:
# Pass the column as x explicitly; recent seaborn versions require keywords here
sns.countplot(x=df['type'])
plt.show()

Here we notice a class imbalance, with under 1000 spam messages out of over 5000 total.
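
The imbalance can also be quantified rather than read off the plot. The sketch below uses a small hypothetical Series named `labels` as a stand-in for `df['type']`, and computes the accuracy that a model would get by always predicting the majority class, which is a useful baseline for any classifier trained on imbalanced data.

```python
import pandas as pd

# Hypothetical stand-in for df['type']; the real column has over 5000 rows.
labels = pd.Series(['ham'] * 9 + ['spam'] * 1, name='type')

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fraction per class

# Accuracy of a model that always predicts the most common class.
majority_baseline = shares.max()

print(counts)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

On the real data, the same two `value_counts` calls applied to `df['type']` would give the exact class proportions.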

## Create a list of keywords that might indicate spam

Then generate a feature column for each keyword.


In [0]:
features = pd.DataFrame()
keywords = ['selected', 'win','deal', 'free', 'trip', 'urgent', 'require',
            'need', 'cash', 'asap']

# use regex search built into pandas
for k in keywords:
    features[k]=df['message'].str.contains(k, case=False)
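
One caveat of `str.contains` is that it matches substrings, so a keyword like "win" also fires on "window" or "winner". The sketch below contrasts substring matching with whole-word matching using regex word boundaries. The three messages are made up for the example.

```python
import re

import pandas as pd

# Hypothetical messages, chosen to show the difference.
messages = pd.Series(['Open the window', 'You WIN a prize', 'Winner announced'])
keyword = 'win'

# Substring match: True wherever the letters "win" appear at all.
substring = messages.str.contains(keyword, case=False)

# Whole-word match: \b marks a word boundary; re.escape protects keywords
# that contain characters with a special meaning in a regex.
pattern = r'\b' + re.escape(keyword) + r'\b'
whole_word = messages.str.contains(pattern, case=False, regex=True)

print(pd.DataFrame({'message': messages,
                    'substring': substring,
                    'whole_word': whole_word}))
```

Whether substring or whole-word matching is better depends on the keyword: matching "winner" may be desirable for spam detection, while matching "window" probably is not.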

## Come up with some more rules and implement them as feature columns

In [0]:
features['allcaps'] = df['message'].str.isupper()

## Look at correlation of features

In [0]:
sns.heatmap(features.corr())

plt.show()

The heatmap shows only weak correlations among variables like cash, win, free, and urgent. So we will treat the keywords as independent. Strictly speaking, the correlations are not zero, so we are violating the assumption, but only mildly.

## Train a model to predict spam

In [0]:
np.random.seed(seed=0)
X = features
y = df['type']
X_train, X_test, y_train, y_test = train_test_split(X,y)
sns.countplot(x=y_test)
plt.show()
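
With imbalanced classes, a plain random split can leave the test set with a different spam share than the training set. A possible alternative, sketched below on hypothetical toy data, is to pass `stratify` so that both splits keep the same class proportions, and `random_state` so that the split is reproducible without touching the global NumPy seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 20% spam.
X_toy = pd.DataFrame({'free': [True, False] * 10})
y_toy = pd.Series(['spam'] * 4 + ['ham'] * 16)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,    # 5 of the 20 rows go to the test set
    stratify=y_toy,    # keep the spam/ham ratio equal in both splits
    random_state=0,    # reproducible split
)

print('train spam share:', (y_tr == 'spam').mean())
print('test spam share: ', (y_te == 'spam').mean())
```

Switching the notebook's split to this form would change which rows land in the test set, so any counts quoted from the original run would no longer match.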

## Make predictions

Using `features` we will now make predictions on whether an individual message is spam or ham.

In [0]:
def classifyNB(X_train, y_train, X_test, y_test, cols=['spam', 'ham']):
    nb = BernoulliNB()

    nb.fit(X_train, y_train)

    y_pred = nb.predict(X_test)
    class_names = cols
    print('Classification Report')
    # Pass labels explicitly so each report row matches its name; without
    # them, target_names are applied to the labels in sorted order.
    print(classification_report(y_test, y_pred, labels=class_names,
                                target_names=class_names))
    # Rows of the confusion matrix are actual classes, columns are predictions.
    cm = confusion_matrix(y_test, y_pred, labels=class_names)
    df_cm = pd.DataFrame(cm, index=class_names, columns=class_names)

    sns.heatmap(df_cm, cmap='Blues', annot=True, fmt="d",
                xticklabels=True, yticklabels=True, cbar=False, square=True)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.suptitle("Confusion Matrix")
    plt.show()

classifyNB(X_train, y_train, X_test, y_test)

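scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns. Treating spam as the positive class, the confusion matrix reads as follows:

* 1182 ham messages were correctly predicted (true negatives)
* 26 ham messages were predicted to be spam (false positives, Type I error)
* 71 spam messages were correctly predicted (true positives)
* 114 spam messages were erroneously predicted to be ham (false negatives, Type II error)

As a check, this gives 185 actual spam messages out of 1393 in the test set, about 13%, which is consistent with under 1000 spam messages out of over 5000 in the full dataset.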



### Precision and Recall

Remember that precision asks how many of the messages flagged as spam really are spam, while recall asks how many of the actual spam messages were flagged. Review the diagram below for clarification.
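
Both metrics can be computed directly from confusion-matrix counts. The sketch below uses hypothetical counts, not the ones from this notebook's run, and a hypothetical helper named `precision_recall_f1`.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp)  # of everything flagged, how much was right
    recall = tp / (tp + fn)     # of everything positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts with spam as the positive class.
tp = 80    # spam predicted as spam
fp = 20    # ham predicted as spam
fn = 120   # spam predicted as ham

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example the filter is fairly trustworthy when it flags a message (high precision) but misses most spam (low recall), a pattern that keyword-based features often produce.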

In [0]:
%%html

<a title="Walber [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons" 
   href="https://commons.wikimedia.org/wiki/File:Precisionrecall.svg">
    <img width="256" alt="Precisionrecall" 
         src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/256px-Precisionrecall.svg.png">
</a>

## For Gmail, what's more important: spam detection or ham protection?

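In the case of your inbox, few people want legitimate email to end up in the spam folder, which argues for high precision on the spam class. On the other hand, your organization may be the target of phishing, in which case catching every malicious message (high recall) matters more. The right trade-off depends on the objective.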
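
One way to act on that objective is to move the decision threshold applied to the predicted spam probability instead of relying on the default prediction. The sketch below is a minimal illustration on hypothetical toy data; `flag_spam` is a hypothetical helper and the threshold values are arbitrary. Since Naive Bayes probabilities are poorly calibrated, a real threshold should be tuned on held-out data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy features (two binary keywords) and labels.
X_toy = np.array([[1, 1], [1, 0], [1, 0], [0, 1],
                  [0, 0], [0, 0], [0, 0], [1, 0]])
y_toy = np.array(['spam', 'spam', 'spam', 'ham',
                  'ham', 'ham', 'ham', 'ham'])

model = BernoulliNB().fit(X_toy, y_toy)

# Find the predict_proba column that corresponds to the spam class.
spam_col = list(model.classes_).index('spam')
p_spam = model.predict_proba(X_toy)[:, spam_col]


def flag_spam(p_spam, threshold):
    """Label a message spam only if its spam probability reaches threshold."""
    return np.where(p_spam >= threshold, 'spam', 'ham')


lenient = flag_spam(p_spam, 0.3)  # catches more spam, risks more ham
strict = flag_spam(p_spam, 0.9)   # protects ham, lets more spam through

print('flagged at 0.3:', (lenient == 'spam').sum())
print('flagged at 0.9:', (strict == 'spam').sum())
```

Raising the threshold can only reduce the number of flagged messages, so it trades recall for precision; lowering it does the opposite.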

# Resources

* [Naive Bayes Docs](https://scikit-learn.org/stable/modules/naive_bayes.html)
* [spam dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)
* [sentiment reviews](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)
* [Paper on Classifiers](http://mdenil.com/media/papers/2015-deep-multi-instance-learning.pdf)
* [Bayesian Inference](https://cran.r-project.org/web/packages/LaplacesDemon/vignettes/BayesianInference.pdf)

# Exercises

## Exercise 1

Load the reviews data and perform sentiment analysis.

Download the text data from the [UCI ML archive](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences).

Create a classifier using Naive Bayes on one of the 3 datasets, and see how it performs on the other two sets of reviews. Comment on your approach to building features and why it may or may not work well for each dataset.

In [0]:
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
'00331/sentiment%20labelled%20sentences.zip')

cols = ['message', 'sentiment']
folder = 'sentiment labelled sentences'
print('\nYelp')
df_yelp = LoadZip(url, folder+'/yelp_labelled.txt', cols)
print('\nAmazon')
df_amazon = LoadZip(url, folder+'/amazon_cells_labelled.txt', cols)
print('\nImdb')
df_imdb = LoadZip(url, folder+'/imdb_labelled.txt', cols)

### Student Solution

In [0]:
# Your answer goes here

### Answer Key

**Solution**

>***Train on Yelp Data***

In [0]:
df = df_yelp.copy()
df['sentiment'] = df['sentiment'].apply(lambda x: 'positive' if x==1 else 'negative')
sns.countplot(x=df['sentiment'])
plt.suptitle('Yelp Reviews')
plt.show()

In [0]:

import re

# remove punctuation (regex=True is required in recent pandas versions)
df['message'] = df['message'].str.replace(r'[^a-zA-Z\d\s:]', '', regex=True)
# make lower case
df['message'] = df['message'].str.lower()

# split the messages by sentiment and combine each group into one word list
negative_words = df.message[df['sentiment']=='negative'].str.cat(sep=' ').split()
positive_words = df.message[df['sentiment']=='positive'].str.cat(sep=' ').split()

# Unique words
print('negative:', len(np.unique(negative_words)), ' positive:', len(np.unique(positive_words)))

# Words that appear only in positive messages
diff_pos = np.setdiff1d(ar1=positive_words, ar2=negative_words)

# Words that appear only in negative messages
diff_neg = np.setdiff1d(ar1=negative_words, ar2=positive_words)

# Use the negative-only words as features.
# Try diff_pos, or np.append(diff_pos, diff_neg), to compare.
diff = diff_neg

# Note: the word lists are built from all rows, including those that end up
# in the test set, so the test score below is optimistic.

# Build one boolean column per keyword. The lookarounds require whitespace
# (or the start/end of the message) on both sides of the key, so that we
# match the whole word, not just a substring.
features = pd.DataFrame({
    key: df['message'].str.contains(
        r'(?<!\S)' + re.escape(key) + r'(?!\S)', regex=True)
    for key in diff
})

X = features
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X,y)

# Fit the model, then print the report and plot the confusion matrix.
classifyNB(X_train,y_train,X_test,y_test, cols=['negative', 'positive'])
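
The exercise also asks how the classifier performs on the other two sets of reviews. The sketch below shows one way that check could look. It swaps the hand-built keyword columns for scikit-learn's `CountVectorizer(binary=True)`, which is a different feature-building technique from the one used above, and it runs on hypothetical toy sentences rather than the real review files. The key point is that the vocabulary is learned from the training corpus only and then reused unchanged on the other corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for one training corpus and one other corpus.
train_text = ['great food', 'loved the service',
              'terrible food', 'awful slow service']
train_y = ['positive', 'positive', 'negative', 'negative']
other_text = ['great phone', 'terrible battery']
other_y = ['positive', 'negative']

# binary=True yields presence/absence features, which suit BernoulliNB.
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
model.fit(train_text, train_y)

# Words never seen in training (e.g. "phone") are ignored at predict time.
other_pred = model.predict(other_text)
acc = accuracy_score(other_y, other_pred)
print(other_pred, acc)
```

With the real data, the same pattern would fit on the Yelp messages and labels and then call `predict` on the Amazon and IMDb messages. Accuracy would be expected to drop wherever the review vocabulary differs between domains.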

**Validation**

In [0]:
# If the solution can be auto-graded, perform the autograding here.