# Week 7

Last week we discussed methods for "big" data modeling using H2O clusters. This week we'll introduce another big data modeling tool, Spark. Then we'll discuss Naive Bayes, and we'll finish with AWS Lambda in the context of model deployment.

**Table of Contents**

-   [Modeling with "Big" Data](#big-data)
-   [Naive Bayes](#naive)
    -   [Gaussian](#gaussian)
    -   [Multinomial](#multinomial)
    -   [Bernoulli](#bernoulli)
    -   [Demonstration](#demo)
-   [AWS Lambda](#lambda)   

<a id='big-data'></a>
## Modeling with "Big" Data

See SparkDemo.ipynb

<a id='naive'></a>
## Naive Bayes

Naive Bayes is a method of classification based on Bayes' theorem. Given a set of $n$ features, $\mathbf{x} = (x_1, \dots, x_n)$, Naive Bayes predicts a single class $C_k$ from among $K$ possible classes.

The Naive Bayes method can be derived from Bayes' theorem together with the assumption that the features, $\mathbf{x}$, are conditionally independent given the class.

Starting with Bayes' theorem,

$p(C_k \mid \mathbf{x}) = \frac{p(C_k) \ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}$

Since we are only seeking the $\underset{k \in \{1, \dots, K\}}{\operatorname{argmax}}{p(C_k \mid \mathbf{x})}$ and $p(\mathbf{x})$ is the same for every $k$, we don't bother computing $p(\mathbf{x})$.

$\begin{align}
p(C_k \mid \mathbf{x}) & \varpropto p(C_k) \ p(\mathbf{x} \mid C_k) \\
                       & = \frac{p(C_k) \ p(\mathbf{x} \cap C_k)}{p(C_k)} \\
                       & = p(C_k, x_1, \dots, x_n) \\
                       & = p(x_1 \mid x_2, \dots, x_n, C_k) \ p(x_2 \mid x_3, \dots, x_n, C_k) \dots   p(x_{n-1} \mid x_n, C_k) \ p(x_n \mid C_k) \ p(C_k) \\
                       & = p(x_1 \mid C_k) \ p(x_2 \mid C_k) \dots   p(x_{n-1} \mid C_k) \ p(x_n \mid C_k) \ p(C_k) \\
                       & = p(C_k) \prod_{i=1}^n p(x_i \mid C_k)
\end{align}$



Here, $p(C_k)$ can be estimated for each $k \in \{1, \dots, K\}$ from the training dataset as the proportion of all observations that belong to class $k$. You could also assume that each class is equally probable, such that $p(C_k) = \frac{1}{K}, \forall k$.

The $p(x_i \mid C_k)$ for $i \in \{1, \dots, n\}$ can be computed in a number of ways, depending on the distribution that we assume for $x_i$.
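To make the decision rule concrete, here is a minimal sketch in plain Python. The function names (`class_priors`, `predict`) and the likelihood values are made up for illustration. It estimates $p(C_k)$ from class proportions and picks the class that maximizes $p(C_k) \prod_i p(x_i \mid C_k)$, working in log space so that products of many small probabilities do not underflow.

```python
import math
from collections import Counter


def class_priors(labels):
    """Estimate p(C_k) as the fraction of training observations in class k."""
    counts = Counter(labels)
    total = len(labels)
    return {k: c / total for k, c in counts.items()}


def predict(priors, likelihoods):
    """Return the class maximizing log p(C_k) + sum_i log p(x_i | C_k).

    `likelihoods[k]` holds p(x_i | C_k) for each feature i, already
    evaluated at the observed feature values.
    """
    scores = {
        k: math.log(priors[k]) + sum(math.log(p) for p in likelihoods[k])
        for k in priors
    }
    return max(scores, key=scores.get), scores


priors = class_priors(["A", "A", "A", "B"])

# Hypothetical per-feature likelihoods for a single observation.
likelihoods = {"A": [0.5, 0.1], "B": [0.4, 0.6]}

label, scores = predict(priors, likelihoods)
print(priors, label)
```

In this toy case class B wins even though its prior is smaller, because its likelihood terms are large enough to outweigh the prior.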

<a id='gaussian'></a>
### Gaussian

If $x_i$ is continuous and assumed to be normally distributed within each class, then $p(x_i=v \mid C_k)$ can be computed from the training data with $\frac{1}{\sqrt{2\pi\sigma^2_{ki}}}\,e^{ -\frac{(v-\mu_{ki})^2}{2\sigma^2_{ki}} }$, where $\mu_{ki}$ is the mean value of $x_i$ in the training data for observations of class $k$, and $\sigma^2_{ki}$ is the variance of the values of $x_i$ in the training data for observations of class $k$.
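As a rough illustration, the sketch below estimates the class-specific mean and variance of a single feature and evaluates the Gaussian density at a new value. The data values are hypothetical, and the use of the population variance (`statistics.pvariance`) is an assumption; some implementations use the sample variance instead.

```python
import math
from statistics import mean, pvariance


def gaussian_likelihood(v, mu, var):
    """Normal density with mean `mu` and variance `var`, evaluated at `v`."""
    return math.exp(-((v - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Hypothetical training values of one feature x_i, grouped by class.
values_by_class = {
    "A": [4.9, 5.1, 5.0, 5.4],
    "B": [6.3, 6.5, 6.1, 6.7],
}

# One (mean, variance) pair per class for this feature.
params = {k: (mean(vals), pvariance(vals)) for k, vals in values_by_class.items()}

v = 5.2  # new observation of x_i
likelihoods = {k: gaussian_likelihood(v, mu, var) for k, (mu, var) in params.items()}
print(likelihoods)
```

The value 5.2 sits close to the class A mean and far from the class B mean, so the class A density is much larger.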

<a id='multinomial'></a>
### Multinomial

Sometimes, it might be appropriate to assume that some or all of our features are drawn from a multinomial distribution. 

$p(\mathbf{x} \mid C_k) = \frac{(\sum_i x_i)!}{\prod_i x_i !} \prod_i {p_{ki}}^{x_i}$

This is a typical assumption for document word count features. For example, let's assume that our training dataset consists of the following 4 documents, where each document belongs to one of two classes, A or B ([reference](https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html#laplace)). 

```
x1 = # "Chinese"
x2 = # "Beijing"
x3 = # "Shanghai"
x4 = # "Macao"
x5 = # "Tokyo"
x6 = # "Japan"

x1 | x2 | x3 | x4 | x5 | x6 | Class
-----------------------------------
 2 | 1  | 0  | 0  | 0  | 0  |  A
-----------------------------------
 2 | 0  | 1  | 0  | 0  | 0  |  A
-----------------------------------
 1 | 0  | 0  | 1  | 0  | 0  |  A
-----------------------------------
 1 | 0  | 0  | 0  | 1  | 1  |  B
-----------------------------------
```

And let's say we are asked to predict the class associated with some document, $\mathbf{x}$, `Chinese Chinese Chinese Tokyo Japan`, which can be represented as

```
x1 | x2 | x3 | x4 | x5 | x6 
----------------------------
 3 | 0  | 0  | 0  | 1  | 1  
----------------------------
```

The term $(\sum_i x_i)!$ evaluates to `(3 + 0 + 0 + 0 + 1 + 1)! = 5! = 120`

The term $\prod_i x_i !$ evaluates to `3! * 0! * 0! * 0! * 1! * 1! = 6`. 

In the term $\prod_i {p_{ki}}^{x_i}$, $p_{ki}$ is the probability of observing the word associated with $x_i$ given the document is of class $k$. We can compute these values at training time from the training data. For example, $p_{A,Chinese}$ equals 

```
# of instances of the word "Chinese" in all of the training documents of class A     5
-------------------------------------------------------------------------------- =  ---
                total # of words in the documents of class A                         8
```

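If we attempt to compute $p_{A,Tokyo}$, we'll get a value of 0, which forces the whole product $\prod_i {p_{ki}}^{x_i}$ to 0 for any document that contains "Tokyo". To avoid this situation, [*add-one* or *Laplace smoothing*](https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html#laplace) can be applied: a constant is added to every word count in the numerator, and that constant times the number of distinct words (6 here) is added to the denominator. For example, if we apply Laplace smoothing with a constant of 0.5, $p_{A,Chinese}$ becomes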


```
  5 + 0.5
----------- = 0.5
 8 + 0.5*6
```

and $p_{A,Tokyo}$ becomes

```
  0 + 0.5
----------- = 0.04545455
 8 + 0.5*6
```
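The whole calculation for this example can be sketched in a few lines of plain Python. The variable names are illustrative only. The multinomial coefficient $\frac{(\sum_i x_i)!}{\prod_i x_i !}$ (120 / 6 = 20 here) is the same for every class, so it cancels when the posterior is normalized and is left out below.

```python
# Word counts (x1..x6) and class labels from the training table above.
train = [
    ([2, 1, 0, 0, 0, 0], "A"),
    ([2, 0, 1, 0, 0, 0], "A"),
    ([1, 0, 0, 1, 0, 0], "A"),
    ([1, 0, 0, 0, 1, 1], "B"),
]
test = [3, 0, 0, 0, 1, 1]  # "Chinese Chinese Chinese Tokyo Japan"
alpha = 0.5                # smoothing constant

classes = sorted({label for _, label in train})
n_features = len(test)


def word_probs(label):
    """Smoothed p_ki: (count of word i in class k + alpha) / (words in class k + alpha * n_features)."""
    totals = [sum(x[i] for x, y in train if y == label) for i in range(n_features)]
    denom = sum(totals) + alpha * n_features
    return [(t + alpha) / denom for t in totals]


priors = {k: sum(1 for _, y in train if y == k) / len(train) for k in classes}
probs = {k: word_probs(k) for k in classes}

# Unnormalized posterior: p(C_k) * prod_i p_ki ** x_i
unnormalized = {}
for k in classes:
    score = priors[k]
    for p, count in zip(probs[k], test):
        score *= p ** count
    unnormalized[k] = score

total = sum(unnormalized.values())
posterior = {k: s / total for k, s in unnormalized.items()}
print(posterior)
```

With a smoothing constant of 0.5 the posterior comes out to roughly 0.44 for class A and 0.56 for class B, so the document is assigned to class B.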

<a id='bernoulli'></a>
### Bernoulli

Depending on your features, $\mathbf{x}$, it may also be appropriate to assume that they were drawn from a multivariate Bernoulli distribution.

$p(\mathbf{x} \mid C_k) = \prod_{i=1}^n p_{ki}^{x_i} (1 - p_{ki})^{(1-x_i)}$

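In the context of a document classification example, the features would correspond not to word counts but to word occurrence/absence. $p_{ki}$ would then be estimated from the training data as the proportion of documents of class $k$ in which the word associated with $x_i$ occurs, with smoothing applied in the same spirit as above.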
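Here is a minimal sketch of the Bernoulli variant on the same four training documents. It assumes that $p_{ki}$ is estimated as the smoothed fraction of class-$k$ documents that contain word $i$, with a smoothing constant of 1, following the Bernoulli model in the Stanford IR book linked above. The helper names are illustrative.

```python
# Word counts (x1..x6) and class labels from the multinomial example.
train = [
    ([2, 1, 0, 0, 0, 0], "A"),
    ([2, 0, 1, 0, 0, 0], "A"),
    ([1, 0, 0, 1, 0, 0], "A"),
    ([1, 0, 0, 0, 1, 1], "B"),
]
test_counts = [3, 0, 0, 0, 1, 1]
alpha = 1.0  # assumed smoothing constant


def presence(counts):
    """Convert word counts to 0/1 occurrence indicators."""
    return [1 if c > 0 else 0 for c in counts]


def bernoulli_params(label):
    """Smoothed fraction of class-`label` documents that contain each word."""
    docs = [presence(x) for x, y in train if y == label]
    n_docs = len(docs)
    return [
        (sum(d[i] for d in docs) + alpha) / (n_docs + 2 * alpha)
        for i in range(len(test_counts))
    ]


def bernoulli_likelihood(x, params):
    """prod_i p_i ** x_i * (1 - p_i) ** (1 - x_i) for binary x."""
    result = 1.0
    for x_i, p in zip(x, params):
        result *= p if x_i else (1 - p)
    return result


x = presence(test_counts)
classes = ["A", "B"]
priors = {k: sum(1 for _, y in train if y == k) / len(train) for k in classes}
scores = {k: priors[k] * bernoulli_likelihood(x, bernoulli_params(k)) for k in classes}
print(scores)
```

Unlike the multinomial model, absent words contribute a factor of $(1 - p_{ki})$ here, so the absence of "Beijing", "Shanghai", and "Macao" counts as evidence against class A.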

<a id='demo'></a>
### Demonstration

The R package `naivebayes` can be used to build Naive Bayes models. Let's build a model based on the multinomial example above.

First, let's create our training dataset.

```R
library(naivebayes)

class <- c("A", "A", "A", "B")
x1    <- c(2,2,1,1)
x2    <- c(1,0,0,0)
x3    <- c(0,1,0,0)
x4    <- c(0,0,1,0)
x5    <- c(0,0,0,1)
x6    <- c(0,0,0,1)

train <- data.frame(class = class,
                    x1 = x1,
                    x2 = x2,
                    x3 = x3,
                    x4 = x4,
                    x5 = x5,
                    x6 = x6)
```

```
> train
  class x1 x2 x3 x4 x5 x6
1     A  2  1  0  0  0  0
2     A  2  0  1  0  0  0
3     A  1  0  0  1  0  0
4     B  1  0  0  0  1  1
```

There's a `multinomial_naive_bayes` function available as part of the `naivebayes` package. If we take a look at the documentation, we can see that there is an argument called `laplace`, which has a default value of 0.5. What happens if we set it to 0?

```
> nb_multinomial <- multinomial_naive_bayes(as.matrix(train[,-1]), train[,1], laplace = 0)
Warning message:
multinomial_naive_bayes(): There are 5 empty cells leading to zero estimates. Consider Laplace smoothing. 

> nb_multinomial
================================================ Multinomial Naive Bayes ==============
 
 Call: multinomial_naive_bayes(x = as.matrix(train[, -1]), y = train[, 1], laplace = 0)

-------------------------------------------------------------------------------- 
 
Laplace smoothing: 0

-------------------------------------------------------------------------------- 
 
 A priori probabilities: 
   A    B 
0.75 0.25 

--------------------------------------------------------------------------------
 
        Classes
Features     A         B
      x1 0.625 0.3333333
      x2 0.125 0.0000000
      x3 0.125 0.0000000
      x4 0.125 0.0000000
      x5 0.000 0.3333333
      x6 0.000 0.3333333

--------------------------------------------------------------------------------

> test <- as.matrix(data.frame(x1=c(3), x2=c(0), x3=c(0), x4=c(0), x5=c(1), x6=c(1)))
> predict(nb_multinomial, test, type = "prob")
       A   B
[1,] NaN NaN
```

Let's go ahead and use the default Laplace smoothing of 0.5 instead.

```
> nb_multinomial <- multinomial_naive_bayes(as.matrix(train[,-1]), train[,1])
> nb_multinomial

================================================ Multinomial Naive Bayes =
 
 Call: multinomial_naive_bayes(x = as.matrix(train[, -1]), y = train[, 1])

--------------------------------------------------------------------------- 
 
Laplace smoothing: 0.5

---------------------------------------------------------------------------
 
 A priori probabilities: 
   A    B 
0.75 0.25 

---------------------------------------------------------------------------
 
        Classes
Features          A          B
      x1 0.50000000 0.25000000
      x2 0.13636364 0.08333333
      x3 0.13636364 0.08333333
      x4 0.13636364 0.08333333
      x5 0.04545455 0.25000000
      x6 0.04545455 0.25000000

---------------------------------------------------------------------------


> predict(nb_multinomial, test, type = "prob")
             A         B
[1,] 0.4423963 0.5576037
```

The `naivebayes` package is nice because it allows you to build models when the features, $\mathbf{x}$, come from different distributions. For example, we can easily construct a Naive Bayes model with the following dataset.

```R
n <- 100
set.seed(1)
class <- sample(c("A", "B"), n, TRUE)
b1    <- sample(c(TRUE, FALSE), n, TRUE)
b2    <- sample(c(TRUE, FALSE), n, TRUE)
b3    <- sample(c(TRUE, FALSE), n, TRUE)
norm  <- rnorm(n)

data <- data.frame(class = class,
                   b1 = b1,
                   b2 = b2,
                   b3 = b3,
                   norm = norm)
```

```
> head(data)
  class    b1    b2    b3       norm
1     A FALSE  TRUE FALSE  0.4094018
2     A  TRUE  TRUE  TRUE  1.6888733
3     B  TRUE FALSE  TRUE  1.5865884
4     B FALSE  TRUE  TRUE -0.3309078
5     A FALSE  TRUE  TRUE -2.2852355
6     B  TRUE FALSE FALSE  2.4976616
```

We can see that the `naive_bayes` function treats `b1`, `b2`, and `b3` as Bernoulli-distributed features and `norm` as Gaussian.

```
> train <- data[1:90, ]
> test <- data[91:100, -1]
> nb <- naive_bayes(class ~ ., train)
> nb

====================================================== Naive Bayes ===================== 
 
 Call: naive_bayes.formula(formula = class ~ ., data = train)

----------------------------------------------------------------------------------------
 
Laplace smoothing: 0

----------------------------------------------------------------------------------------
 
 A priori probabilities: 

        A         B 
0.5333333 0.4666667 

---------------------------------------------------------------------------------------- 
 
 Tables: 

---------------------------------------------------------------------------------------- 
 ::: b1 (Bernoulli) 
----------------------------------------------------------------------------------------
       
b1              A         B
  FALSE 0.5625000 0.5238095
  TRUE  0.4375000 0.4761905

----------------------------------------------------------------------------------------
 ::: b2 (Bernoulli) 
----------------------------------------------------------------------------------------
       
b2              A         B
  FALSE 0.3333333 0.4285714
  TRUE  0.6666667 0.5714286

----------------------------------------------------------------------------------------
 ::: b3 (Bernoulli) 
----------------------------------------------------------------------------------------
       
b3              A         B
  FALSE 0.5000000 0.3809524
  TRUE  0.5000000 0.6190476

---------------------------------------------------------------------------------------- 
 ::: norm (Gaussian) 
----------------------------------------------------------------------------------------
      
norm             A           B
  mean 0.006874219 0.013238102
  sd   1.003095185 1.121027017

----------------------------------------------------------------------------------------

```

How would you deal with features from mixed distributions using scikit-learn's implementation of [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes)?

```python
from sklearn import datasets
iris = datasets.load_iris()

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(iris.data, iris.target)
gnb
```

```
GaussianNB(priors=None, var_smoothing=1e-09)
```

```python
import pickle
with open('gnb.pkl', 'wb') as mf:
    pickle.dump(gnb, mf)
```

<a id='lambda'></a>
## AWS Lambda

AWS [Lambda](https://aws.amazon.com/lambda/) is a service that enables you to run code that would typically reside on a server, without actually setting up and maintaining the server yourself. With Lambda functions, you can delegate typical server administration tasks to AWS. In theory, you don't have to worry about availability and scaling of your service.

In terms of the machine learning pipeline we've been discussing throughout the course, Lambda could be used to solve part of the model deployment problem. For example, you could push your model code to the Lambda service. Stakeholders could interact with your model through the Lambda HTTP API, rather than incorporating your model into their own software stack, which may be incompatible with your model.

Let's walk through an example of how to set up a simple Lambda function and expose it as an HTTP API.

The steps are as follows.

1. Log into the AWS Console and select the `Lambda` service
2. Click on `Create function`
3. `Author from scratch`
4. Under `Basic information`, enter a `Function name`, for example `CS4315Summer2019-Lambda-Instructor`
5. Select `Python 3.7` for the `Runtime`
6. For `Permissions` -> `Execution role` -> `Use an existing role` -> `Execution role` -> `Lambda_Sagemaker`
7. Click `Create function`

8. In the `Function code` block, replace the existing code with:

```python
import json

def lambda_handler(event, context):
    return {
        'statusCode': 200,
        'body': json.dumps({ 'cs4315': 'eeckstrand' })
    }
```

9. Click on `Save`

10. In the `Designer` block, click on `+ Add trigger`
11. Select `API Gateway`
12. For `Security`, select `Open`
13. Click `Add`

Let's test our API out. In the `API Gateway` block, copy the `API Endpoint` URL. We'll use the Python `requests` package again to send a request to it.

```python
import requests
import json
response = requests.post('https://brfnsrikga.execute-api.us-east-2.amazonaws.com/default/CS4315Summer2019-Lambda-Instructor')
response.text
```

```
'{"cs4315": "eeckstrand"}'
```
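To move from this fixed response toward model deployment, the handler needs to read its input from the request. The sketch below is one way that might look. It assumes the trigger uses API Gateway's Lambda proxy integration, in which the request body arrives as a string under `event["body"]`. The `features` key and the `n_features` response field are hypothetical, and the model call is left as a placeholder.

```python
import json


def lambda_handler(event, context):
    # With the proxy integration, the body is a JSON string (or None if absent).
    raw_body = event.get("body") or "{}"
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "request body must be valid JSON"}),
        }

    # Hypothetical input contract: {"features": [ ... ]}
    features = payload.get("features") if isinstance(payload, dict) else None
    if not isinstance(features, list):
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "expected a 'features' list"}),
        }

    # Placeholder: a deployed function would pass `features` to a model here.
    return {
        "statusCode": 200,
        "body": json.dumps({"n_features": len(features)}),
    }


# Local smoke test with a hand-built event, no AWS account required.
fake_event = {"body": json.dumps({"features": [5.1, 3.5, 1.4, 0.2]})}
result = lambda_handler(fake_event, None)
print(result)
```

Calling the handler directly with a hand-built event like this is a quick way to check the parsing logic before pasting the code into the `Function code` block.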