# 04 - Naive-Bayes-Simulations

In this exercise, we will learn to simulate data to test hypotheses about algorithms. We will test two hypotheses:

- (1) It is commonly accepted (check on the Internet) that the performance of Naive Bayes decreases when features are correlated. We will compare its performance on correlated and uncorrelated features, using kNN as a baseline.
- (2) We will measure the time required to make predictions with kNN and Naive Bayes to see whether kNN is slower.

For this purpose, we will also see how to create data, which is a great way to experiment and see how algorithms behave. 


## 1. Create Random Data with a Normal Distribution

Let's first import some libraries:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In this first part, you will create a NumPy array filled with random values. These random values should be drawn from a *normal distribution*.

Create a random array named `x1` containing 500 values drawn from a normal distribution with a mean of 1 and a standard deviation of 0.7.

In [2]:
# Your code here

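One possible way to do this is sketched below. It assumes NumPy's `default_rng` generator with an arbitrary seed (the seed and the variable name `rng` are illustrative choices, not part of the exercise); `np.random.normal` would work just as well.

```python
import numpy as np

# Seeded generator so that the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# 500 samples from a normal distribution: loc is the mean, scale the standard deviation.
x1 = rng.normal(loc=1.0, scale=0.7, size=500)

print(x1.shape)             # (500,)
print(x1.mean(), x1.std())  # should be close to 1 and 0.7
```

Checking the sample mean and standard deviation is a quick numerical complement to the plot requested next.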

Now, try to plot the actual distribution of our array to be sure that it has the properties we asked for.

<details>
    <summary>Hint</summary>
sns.histplot (or sns.distplot, which is deprecated in recent Seaborn versions)
</details>

In [4]:
# Your code here


It should be a normal curve centered around 1.

We want to create a fake *labeled* dataset. This means that we want to have one column containing continuous values (like our variable `x1` above) and the true labels: 0 or 1 since we want to do a binary classification. We need to have data with label 0 following a different distribution than the data with label 1. To do that, we will create another variable with a different distribution. It will correspond to the label 1.

Your next task is to **create** another variable named `x2` drawn from a normal distribution with a **mean of 2** and a **standard deviation of 0.7**, with **500 values**.

You will then **concatenate** `x1` and `x2` into a variable named `X1` (note the uppercase `X`) to end up with an array of shape `(1000,)`.

<details>
    <summary>Hint</summary>
np.concatenate
</details>

In [6]:
# Your code here


We have our first variable. We will now create the dependent variable `y`: it is filled with 0 for the first distribution (the first 500 rows) and with 1 for the second distribution (the last 500 rows).


<details>
    <summary>Hint</summary>
np.repeat()
</details>

In [8]:
# Your code here

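As a sketch, the first feature and its labels could be built as follows. The seeded generator `rng` is an assumption made for reproducibility; the rest follows the hints `np.concatenate` and `np.repeat`.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

x1 = rng.normal(loc=1.0, scale=0.7, size=500)  # samples for label 0
x2 = rng.normal(loc=2.0, scale=0.7, size=500)  # samples for label 1

# Join the two samples end to end: the first 500 rows come from x1, the last 500 from x2.
X1 = np.concatenate([x1, x2])

# np.repeat repeats each element in turn: 500 zeros followed by 500 ones.
y = np.repeat([0, 1], 500)

print(X1.shape, y.shape)  # (1000,) (1000,)
```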

You should now have an array `y` of labels of 1000 rows: 500 `0` followed by 500 `1`.

You will now create another feature that is *dependent* on the first feature `X1`.<br>
To do that you will:
- create another array named `x3`, drawn from a normal distribution with a **mean of 0**, a **standard deviation of 0.2** and **1000 values**.
- create a variable `X2` that is correlated with `X1`: `X2` will be the element-wise sum of `x3` and `X1` (not the concatenation).

`x3` and `X2` must have the same shape as `X1`.

In [10]:
# Your code here

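A minimal sketch of the correlated feature is shown below. `X1` is regenerated so that the block runs on its own, and the use of `np.corrcoef` as a numerical check is an addition to the exercise, not a required step.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
X1 = np.concatenate([rng.normal(1.0, 0.7, 500), rng.normal(2.0, 0.7, 500)])

# Small Gaussian noise with the same length as X1.
x3 = rng.normal(loc=0.0, scale=0.2, size=1000)

# Element-wise sum: each value of X2 is the matching value of X1 plus noise.
X2 = X1 + x3

# Pearson correlation between the two features; it should be close to 1.
r = np.corrcoef(X1, X2)[0, 1]
print(X2.shape, r)
```

Because the noise is small compared with the spread of `X1`, `X2` carries almost the same information as `X1`.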

Now, make a scatter plot of `X2` as a function of `X1` to check that the features are correlated. You can color the points by label (the `y` value: 0 or 1).


In [12]:
# Your code here


You can also plot the distribution of `X1` or `X2` corresponding to a label of 0 along with the distribution of the samples corresponding to a label of 1.

In [14]:
# Your code here


You should see two normal distributions: one centered at 1 and one centered at 2.

The last step in building our fake data is to stack `X1` and `X2` into a 2D NumPy array named `X`. This will be our input data. `X` should have 1000 rows and 2 columns.

<details>
    <summary>Hint</summary>
np.vstack() and .T
</details>

In [16]:
# Your code here


You can now shuffle the arrays `X` and `y`.

<details>
    <summary>Hint</summary>
from sklearn.utils import shuffle
</details>

In [18]:
# Your code here

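The stacking and shuffling steps could look like the sketch below. The data is regenerated so that the block is self-contained, and the `random_state` value is an arbitrary choice.

```python
import numpy as np
from sklearn.utils import shuffle

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
X1 = np.concatenate([rng.normal(1.0, 0.7, 500), rng.normal(2.0, 0.7, 500)])
X2 = X1 + rng.normal(0.0, 0.2, 1000)
y = np.repeat([0, 1], 500)

# vstack gives shape (2, 1000); transposing gives one row per sample: (1000, 2).
X = np.vstack([X1, X2]).T

# Shuffle both arrays with the same permutation so rows and labels stay aligned.
X, y = shuffle(X, y, random_state=0)

print(X.shape, y.shape)  # (1000, 2) (1000,)
```

Passing both arrays to a single `shuffle` call matters: shuffling them separately would break the link between each row and its label.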

## 2. Modeling: Naive Bayes vs. kNN

Now, you will try two algorithms on these data:

- (1) kNN
- (2) Naive Bayes

For each model, use scikit-learn to create the model, fit it to the data and calculate the score.

In [20]:
# Your code here

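The following sketch shows one way to run both models. Several choices here are assumptions rather than requirements of the exercise: `GaussianNB` as the Naive Bayes variant (a natural fit for continuous features), the default `n_neighbors=5` for kNN, and scoring on the same data used for fitting. A held-out split made with `train_test_split` would give a stricter estimate.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle

# Rebuild the correlated dataset so the block runs on its own.
rng = np.random.default_rng(0)  # arbitrary seed
X1 = np.concatenate([rng.normal(1.0, 0.7, 500), rng.normal(2.0, 0.7, 500)])
X2 = X1 + rng.normal(0.0, 0.2, 1000)
X = np.vstack([X1, X2]).T
y = np.repeat([0, 1], 500)
X, y = shuffle(X, y, random_state=0)

# kNN with scikit-learn's default of 5 neighbors (assumption).
knn = KNeighborsClassifier()
knn.fit(X, y)
knn_score = knn.score(X, y)  # mean accuracy on the data passed in

# Gaussian Naive Bayes (assumption: the exercise does not name a variant).
nb = GaussianNB()
nb.fit(X, y)
nb_score = nb.score(X, y)

print(f"kNN accuracy:         {knn_score:.2f}")
print(f"Naive Bayes accuracy: {nb_score:.2f}")
```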

You should get an accuracy around 0.84 for the kNN and 0.75 for the Naive Bayes. You can see that Naive Bayes is not working as well as kNN.

To see if this is better without correlations, we will create the variables `X1_ind` (for independent) and `X2_ind` in the same way you created `X1` before. This time, `X2_ind` will not depend on `X1_ind`: it will contain other random values drawn from distributions with the same means (1 for label 0 and 2 for label 1), so `X1_ind` and `X2_ind` will not be correlated. The variable `X_ind` will be the variables `X1_ind` and `X2_ind` stacked as before.


In [23]:
# Your code here

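A sketch of the independent dataset follows. The standard deviation of 0.7 for `X2_ind` and the label array name `y_ind` are assumptions; the exercise only fixes the means. Note that both features still depend on the label, so they are uncorrelated *within* each label rather than over the whole dataset, which is the independence that Naive Bayes assumes.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Two features drawn separately: same class means, no link between them.
X1_ind = np.concatenate([rng.normal(1.0, 0.7, 500), rng.normal(2.0, 0.7, 500)])
X2_ind = np.concatenate([rng.normal(1.0, 0.7, 500), rng.normal(2.0, 0.7, 500)])
y_ind = np.repeat([0, 1], 500)  # hypothetical name for the labels

X_ind = np.vstack([X1_ind, X2_ind]).T

# Within each label, the correlation between the features should be near 0.
r_label0 = np.corrcoef(X1_ind[y_ind == 0], X2_ind[y_ind == 0])[0, 1]
r_label1 = np.corrcoef(X1_ind[y_ind == 1], X2_ind[y_ind == 1])[0, 1]
print(X_ind.shape, r_label0, r_label1)
```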

Now, run both classifiers (kNN and Naive Bayes) again.

In [26]:
# Your code here


In both cases, performance increased. This is because the second, uncorrelated variable adds information. This is why it is important to compare these results with a baseline. The difference between kNN and Naive Bayes seems to be smaller than with correlated features. However, the change is quite small.


## 3. Naive Bayes vs. kNN for Computing Time

There can be some advantages to using Naive Bayes instead of kNN. For instance, the prediction time should be lower with Naive Bayes, which can be a game changer when you need to make predictions in real time. Let's feed more data into our algorithms and measure the prediction time.

You can create the variables `X1_big` and `X2_big` in the same way as the independent features `X1_ind` and `X2_ind`, but with 100,000 values instead of 1000.

In [29]:
# Your code here


Now, fit a kNN and a Naive Bayes model again, and measure the duration with Python's `time` module.

<details>
    <summary>Hint</summary>
import time
</details>

In [37]:
# Your code here

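One way to time the predictions is sketched below. The helper `time_predict`, the names `X_big` and `y_big`, and the use of `time.perf_counter` (instead of `time.time`) are illustrative assumptions; `GaussianNB` and the default kNN settings are assumed as well.

```python
import time

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)  # arbitrary seed
n = 50_000  # samples per label, so 100,000 rows in total

X1_big = np.concatenate([rng.normal(1.0, 0.7, n), rng.normal(2.0, 0.7, n)])
X2_big = np.concatenate([rng.normal(1.0, 0.7, n), rng.normal(2.0, 0.7, n)])
X_big = np.vstack([X1_big, X2_big]).T
y_big = np.repeat([0, 1], n)


def time_predict(model, X, y):
    """Fit the model, then return the seconds spent in predict() alone."""
    model.fit(X, y)
    start = time.perf_counter()
    model.predict(X)
    return time.perf_counter() - start


knn_time = time_predict(KNeighborsClassifier(), X_big, y_big)
nb_time = time_predict(GaussianNB(), X_big, y_big)

print(f"kNN prediction time:         {knn_time:.3f} s")
print(f"Naive Bayes prediction time: {nb_time:.3f} s")
```

Timing only the `predict` call keeps fitting time out of the comparison. kNN does most of its work at prediction time because it must search for neighbors for every query point, whereas Naive Bayes only evaluates a few per-class statistics.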

You should see that the `score` and `predict` methods take a lot more time with kNN than with Naive Bayes.

## 4. Optional

To plot the decision boundary of a model, you make predictions on a grid of points and color each point according to its prediction. It is a great tool to see how your model behaves.

In this part, you will try to plot the data you generated along with the decision boundary of the models. You can plot the data and the decision boundary for:

- kNN with correlated data
- kNN with uncorrelated data
- Naive Bayes with correlated data
- Naive Bayes with uncorrelated data


<details>
    <summary>Hint</summary>
np.meshgrid()
    
ravel()

np.arange()

plt.pcolormesh()
</details>

In [None]:
# Your code here

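As a starting point, the sketch below computes grid predictions for one of the four cases (Naive Bayes with correlated data). The grid step, the margin of 0.5, the choice of `GaussianNB` and the helper name `plot_boundary` are all assumptions; swapping in `KNeighborsClassifier` or the uncorrelated data covers the other cases.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Rebuild the correlated dataset so the block runs on its own.
rng = np.random.default_rng(0)  # arbitrary seed
X1 = np.concatenate([rng.normal(1.0, 0.7, 500), rng.normal(2.0, 0.7, 500)])
X2 = X1 + rng.normal(0.0, 0.2, 1000)
X = np.vstack([X1, X2]).T
y = np.repeat([0, 1], 500)

model = GaussianNB().fit(X, y)

# Regular grid covering the data, with a small margin on each side.
step = 0.05
xx, yy = np.meshgrid(
    np.arange(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, step),
    np.arange(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, step),
)

# Flatten the grid into (n_points, 2), predict, then restore the grid shape.
grid = np.c_[xx.ravel(), yy.ravel()]
Z = model.predict(grid).reshape(xx.shape)


def plot_boundary(xx, yy, Z, X, y):
    """Draw the predicted label of each grid cell with the data on top."""
    import matplotlib.pyplot as plt

    plt.pcolormesh(xx, yy, Z, alpha=0.3, shading="auto")
    plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
    plt.xlabel("X1")
    plt.ylabel("X2")
    plt.show()
```

Calling `plot_boundary(xx, yy, Z, X, y)` then displays the colored regions with the samples overlaid; the line where the background color changes is the decision boundary.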