# Naive Bayes Classifier

In this practical, we will use Naive Bayes to predict whether a flight will have a significant delay. We will use several features, such as the day of the week and the flight distance. The dataset will require some preprocessing, so you will also get further experience with data cleaning and feature engineering!

Let's import the packages that we will use during the practical:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Data processing and exploration

Outline of what we will do as part of data processing:
- load the dataset 
- remove the column `Month`
- in the `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay` and `LateAircraftDelay` columns, replace NA with 0
- extract and remove the column `ArrDelay`
- define the vector `major_delay`, which flags delays of 26 minutes or more

### Loading the dataset

In [None]:
# load the dataset into a dataframe called data

data = pd.read_csv("data/flights08.csv")

### First look at the data
Have a look at the data:
    
- Do the features make sense?
- What's the shape of the dataset?
- How many missing values are present?
- How many unique values are present per feature? What does that tell you?
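
The questions above map onto a handful of pandas calls. The sketch below runs them on a tiny made-up table rather than on the real file. The column names `DayOfWeek` and `Distance` and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights table; names and values are illustrative only.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 1],
    "DayOfWeek": [1, 2, 3, 4],
    "Distance": [810.0, 515.0, np.nan, 688.0],
    "CarrierDelay": [np.nan, 12.0, np.nan, 0.0],
})

shape = toy.shape                      # (number of rows, number of columns)
missing_per_column = toy.isna().sum()  # count of missing values per feature
unique_per_column = toy.nunique()      # distinct non-missing values per feature

print(shape)
print(missing_per_column)
print(unique_per_column)
```

A feature whose unique count is 1 is constant across the table and cannot help a classifier separate the classes.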

### Dealing with missing values
The previous step should have shown you two things:
1. Some features have a lot of missing values, in particular those associated with delay at departure (e.g. ``CarrierDelay``). From here on, we will assume that a missing delay value means no delay.
2. Some features don't have enough unique values to be interesting (which ones?) and should probably be removed.

Based on this:
- replace the missing values in the delay columns with 0
- remove the feature(s) that don't have enough variability
- remove all instances that have missing values left
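
One possible way to carry out these three steps is sketched below on a toy table. The data are made up, only two of the delay columns are shown, and the "single unique value" rule for low variability is an assumption, not necessarily the criterion intended by the practical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights table; values are illustrative only.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 1],
    "Distance": [810.0, 515.0, np.nan, 688.0],
    "CarrierDelay": [np.nan, 12.0, np.nan, 0.0],
    "WeatherDelay": [np.nan, 0.0, 3.0, np.nan],
})

delay_cols = ["CarrierDelay", "WeatherDelay"]

# 1. Treat a missing delay as "no delay".
toy[delay_cols] = toy[delay_cols].fillna(0)

# 2. Drop features that take a single value (assumed variability rule).
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
toy = toy.drop(columns=constant_cols)

# 3. Drop the rows that still contain a missing value.
toy = toy.dropna()

print(constant_cols)
print(toy)
```

The order matters: filling the delay columns first prevents `dropna` from discarding most of the rows.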

### Extracting the response
Our aim is to predict whether there will be a significant delay. The variable that encodes the delay is `ArrDelay`.

1. Start by having a look at ``ArrDelay`` using ``distplot`` from ``seaborn`` (deprecated since seaborn 0.11; ``histplot`` or ``displot`` are the replacements).
2. Compute the delay threshold such that 70% of the positive delays are lower than that threshold. The function `np.percentile` might be useful here.
3. Form a response vector `major_delay` that is 1 when the delay is greater than or equal to the threshold and 0 otherwise.
4. Finally remove the `ArrDelay` column from the dataset.
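
Steps 2 to 4 can be sketched as follows on a handful of made-up delays. The numbers are illustrative, so the threshold obtained here is not the one you should get on the real data.

```python
import numpy as np
import pandas as pd

# Made-up arrival delays in minutes (negative means early arrival).
toy = pd.DataFrame({"ArrDelay": [-5.0, 0.0, 3.0, 10.0, 20.0, 40.0, 90.0]})

# Keep only the strictly positive delays when computing the percentile.
positive = toy.loc[toy["ArrDelay"] > 0, "ArrDelay"]
threshold = np.percentile(positive, 70)

# 1 for a major delay (at or above the threshold), 0 otherwise.
major_delay = (toy["ArrDelay"] >= threshold).astype(int)

# The response must not stay among the features.
toy = toy.drop(columns="ArrDelay")

print(threshold)
print(major_delay.tolist())
```

Note that the percentile is taken over the positive delays only, while the comparison with the threshold is applied to every flight.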

Have a look at the value of the delay threshold:

## Fit and evaluate a Naive Bayes model

### Train-test split
Split the data into training and testing sets, using a random state of ``5175`` for comparable results. Use 30% of the data for testing and stratify the split on the ``major_delay`` vector so that both sets contain a similar proportion of flights with a major delay.
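
As a rough illustration of a stratified split, the sketch below uses synthetic arrays in place of the flight features and `major_delay`. The names `X` and `y` and the 20% positive rate are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 instances, 3 features, 20% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 20 + [0] * 80)

# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5175, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance.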

In [None]:
from sklearn.model_selection import train_test_split



### Fit a basic Gaussian Naive Bayes model
Create and fit a Gaussian Naive Bayes model to the training data:
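
A minimal sketch of the fitting step, assuming two synthetic Gaussian blobs in place of the real training set (the arrays below are made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two synthetic classes with different feature means.
rng = np.random.default_rng(0)
X_class0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X_class1 = rng.normal(loc=3.0, scale=1.0, size=(50, 2))
X_train = np.vstack([X_class0, X_class1])
y_train = np.array([0] * 50 + [1] * 50)

model = GaussianNB()
model.fit(X_train, y_train)

print(model.classes_)      # class labels seen during fit
print(model.class_prior_)  # estimated prior probability of each class
print(model.theta_)        # per-class mean of each feature
```

The fitted model stores one Gaussian per class and per feature, which is where the "naive" independence assumption shows up.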

In [None]:
from sklearn.naive_bayes import GaussianNB



### Make predictions and display the classification report
Make predictions on the test data and have a look at the classification report:
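
To see how the report is read, here is a small sketch on hand-written labels. The two lists are made up; in the practical they would be the test labels and the model's predictions.

```python
from sklearn.metrics import classification_report

# Made-up true labels and predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Text version for reading, dict version for programmatic access.
print(classification_report(y_true, y_pred))
report = classification_report(y_true, y_pred, output_dict=True)

print(report["1"]["precision"], report["1"]["recall"])
```

For class 1, precision is the share of predicted major delays that were real, and recall is the share of real major delays that were caught.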

In [None]:
from sklearn.metrics import classification_report



### Look at the probabilities
Gaussian Naive Bayes gives probabilities indicating how confident the model is about each classification. Use `distplot` from `seaborn` (or `histplot` in recent seaborn versions) to display the modelled probabilities for class 1 (major delays). Use `predict_proba` (not `score`, which returns the mean accuracy); you may also want to try `predict_log_proba`. Comment on the resulting graph.
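
The sketch below shows how the class 1 probabilities can be extracted, again on synthetic data. It summarises them with `np.histogram` so that it runs without a plotting backend; this is a stand-in for the seaborn plot, not a replacement for it.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)

# Columns of predict_proba follow the order of model.classes_.
class1_col = list(model.classes_).index(1)
proba = model.predict_proba(X)
p_major = proba[:, class1_col]
log_p_major = model.predict_log_proba(X)[:, class1_col]

# Text summary of the distribution over ten equal-width bins on [0, 1].
counts, edges = np.histogram(p_major, bins=10, range=(0.0, 1.0))
print(counts)
```

Passing `p_major` to the seaborn plotting function gives the graphical version. Naive Bayes probabilities often pile up near 0 and 1, which is worth keeping in mind when commenting on the graph.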