<a href="https://colab.research.google.com/github/nickstone1911/data-analysis-practice/blob/main/Fraud_Prediction_with_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tech Lesson 10: Fraud Detection with Naive Bayes

## Learning Objectives
In this exercise, you will learn how to use the Gaussian Naive Bayes classifier to predict fraud.

---






## About the Data

We will use the `bigquery-public-data.ml_datasets.ulb_fraud_detection` dataset for this exercise.

The dataset contains credit card transactions made in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data.

>- Features V1, V2, ... V28 are the principal components obtained with PCA. The only features that have not been transformed with PCA are `Time` and `Amount`.
>- Feature `Time` contains the seconds elapsed between each transaction and the first transaction in the dataset.
>- Feature `Amount` is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
>- Feature `Class` is the response variable and it takes value 1 in case of fraud and 0 otherwise.

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project.

---

# Section 1: Notebook Setup and Data Load

---

## 1.1: Import Libraries
First, import the necessary initial libraries for this exercise:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
```

In the next cell, import the `bigquery` library and authenticate your GCP account in Colab.

In [None]:
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery

## 1.2: Load Data

In this section we load the `ulb_fraud_detection` table from the BigQuery public datasets. The full table ID is given below:

`bigquery-public-data.ml_datasets.ulb_fraud_detection`



### 1.2.1: Set Up Environment and Create BigQuery Client

In the next cell, set up your environment by setting your `project_id` variable to your GCP project ID, and then create a BigQuery client.

In [None]:
project_id = 'baim-412018'
myclient = bigquery.Client(project = project_id)

### 1.2.2: Accessing the Dataset

We'll be using the `bigquery-public-data.ml_datasets.ulb_fraud_detection` dataset publicly available on BigQuery. This dataset contains various features related to financial transactions and a target variable indicating if the transaction was fraudulent.

In the next code cell, define:

1. A `dataset_name` variable set to the full dataset name
2. A `table_name` variable set to the table name
3. A full table reference variable, `ulb_id`, that holds the full BigQuery reference name
4. A BigQuery client table, `ulb_table`, obtained with `myclient.get_table(ulb_id)`

In [None]:
dataset_name = 'bigquery-public-data.ml_datasets'
table_name = 'ulb_fraud_detection'
ulb_id = f'{dataset_name}.{table_name}'
ulb_table = myclient.get_table(ulb_id)

## 1.3: Data Exploration

In this section we briefly explore the `ulb_fraud_detection` dataset.

### 1.3.1: Schema

In the next code cell, view the schema of the `ulb_fraud_detection` dataset.

In [None]:
schema = ulb_table.schema
print(f'{ulb_table} Schema')

for field in schema:
  print(f'{field.name} : {field.field_type}')

bigquery-public-data.ml_datasets.ulb_fraud_detection Schema
Time : FLOAT
V1 : FLOAT
V2 : FLOAT
V3 : FLOAT
V4 : FLOAT
V5 : FLOAT
V6 : FLOAT
V7 : FLOAT
V8 : FLOAT
V9 : FLOAT
V10 : FLOAT
V11 : FLOAT
V12 : FLOAT
V13 : FLOAT
V14 : FLOAT
V15 : FLOAT
V16 : FLOAT
V17 : FLOAT
V18 : FLOAT
V19 : FLOAT
V20 : FLOAT
V21 : FLOAT
V22 : FLOAT
V23 : FLOAT
V24 : FLOAT
V25 : FLOAT
V26 : FLOAT
V27 : FLOAT
V28 : FLOAT
Amount : FLOAT
Class : INTEGER


### 1.3.2: Query Data and Load to DataFrame

In the next cell, query the dataset and load the result into a pandas DataFrame named `fraud_df`.

Option 1: Loading Using BigQuery Magics

In [None]:
%%bigquery fraud_df --project baim-412018
SELECT *
FROM
bigquery-public-data.ml_datasets.ulb_fraud_detection

In [None]:
fraud_df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,8748.0,-1.070416,0.304517,2.777064,2.154061,0.254450,-0.448529,-0.398691,0.144672,1.070900,...,-0.122032,-0.182351,0.019576,0.626023,-0.018518,-0.263291,-0.198600,0.098435,0.00,0
1,27074.0,1.165628,0.423671,0.887635,2.740163,-0.338578,-0.142846,-0.055628,-0.015325,-0.213621,...,-0.081184,-0.025694,-0.076609,0.414687,0.631032,0.077322,0.010182,0.019912,0.00,0
2,28292.0,1.050879,0.053408,1.364590,2.666158,-0.378636,1.382032,-0.766202,0.486126,0.152611,...,0.083467,0.624424,-0.157228,-0.240411,0.573061,0.244090,0.063834,0.010981,0.00,0
3,28488.0,1.070316,0.079499,1.471856,2.863786,-0.637887,0.858159,-0.687478,0.344146,0.459561,...,0.048067,0.534713,-0.098645,0.129272,0.543737,0.242724,0.065070,0.023500,0.00,0
4,31392.0,-3.680953,-4.183581,2.642743,4.263802,4.643286,-0.225053,-3.733637,1.273037,0.015661,...,0.649051,1.054124,0.795528,-0.901314,-0.425524,0.511675,0.125419,0.243671,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,154599.0,0.667714,3.041502,-5.845112,5.967587,0.213863,-1.462923,-2.688761,0.677764,-3.447596,...,0.329760,-0.941383,-0.006075,-0.958925,0.239298,-0.067356,0.821048,0.426175,6.74,1
284803,90676.0,-2.405580,3.738235,-2.317843,1.367442,0.394001,1.919938,-3.106942,-10.764403,3.353525,...,10.005998,-2.454964,1.684957,0.118263,-1.531380,-0.695308,-0.152502,-0.138866,6.99,1
284804,34634.0,0.333499,1.699873,-2.596561,3.643945,-0.585068,-0.654659,-2.275789,0.675229,-2.042416,...,0.469212,-0.144363,-0.317981,-0.769644,0.807855,0.228164,0.551002,0.305473,18.96,1
284805,96135.0,-1.952933,3.541385,-1.310561,5.955664,-1.003993,0.983049,-4.587235,-4.892184,-2.516752,...,-1.998091,1.133706,-0.041461,-0.215379,-0.865599,0.212545,0.532897,0.357892,18.96,1


Option 2: Loading a Table From a SQL Query Using BigQuery Client

In [None]:
query = '''
SELECT
  *
FROM
bigquery-public-data.ml_datasets.ulb_fraud_detection
  '''

# Run the query; a plain SELECT needs no custom job_config
query_job = myclient.query(query)

# Wait for the job to finish, then convert the rows to a pandas DataFrame
fraud_df = query_job.result().to_dataframe()

### 1.3.3: Check `shape`

In the next code cell, show how many records and columns are in the `fraud_df` DataFrame.

In [None]:
fraud_df.shape

(284807, 31)

### 1.3.4: Sample Records

In the next code cell, show the first 5 records of the `fraud_df` DataFrame.

In [None]:
fraud_df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,8748.0,-1.070416,0.304517,2.777064,2.154061,0.25445,-0.448529,-0.398691,0.144672,1.0709,...,-0.122032,-0.182351,0.019576,0.626023,-0.018518,-0.263291,-0.1986,0.098435,0.0,0
1,27074.0,1.165628,0.423671,0.887635,2.740163,-0.338578,-0.142846,-0.055628,-0.015325,-0.213621,...,-0.081184,-0.025694,-0.076609,0.414687,0.631032,0.077322,0.010182,0.019912,0.0,0
2,28292.0,1.050879,0.053408,1.36459,2.666158,-0.378636,1.382032,-0.766202,0.486126,0.152611,...,0.083467,0.624424,-0.157228,-0.240411,0.573061,0.24409,0.063834,0.010981,0.0,0
3,28488.0,1.070316,0.079499,1.471856,2.863786,-0.637887,0.858159,-0.687478,0.344146,0.459561,...,0.048067,0.534713,-0.098645,0.129272,0.543737,0.242724,0.06507,0.0235,0.0,0
4,31392.0,-3.680953,-4.183581,2.642743,4.263802,4.643286,-0.225053,-3.733637,1.273037,0.015661,...,0.649051,1.054124,0.795528,-0.901314,-0.425524,0.511675,0.125419,0.243671,0.0,0


### 1.3.5: Box Plots

Show boxplots for all the "V" features. These are the PCA features as indicated in the data description.

In [None]:
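# Box plots for the 28 PCA features (columns V1 through V28)
fraud_df.iloc[:, 1:29].boxplot(figsize=(16, 6), rot=90)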

### 1.3.6: Target Class Frequency

In the next cell, show the distribution of the class variable.

In [None]:
fraud_df['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: Int64
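
These counts also explain why plain accuracy is a weak yardstick for this problem. The short sketch below uses only the two counts printed above to compute the fraud share and the accuracy that a trivial "always legit" rule would reach. The variable names are illustrative and are not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
legit_count = 284315
fraud_count = 492
total = legit_count + fraud_count

# Share of all transactions that are fraudulent
fraud_share = fraud_count / total

# Accuracy of a rule that labels every transaction as legit (class 0)
baseline_accuracy = legit_count / total

print(f"Fraud share: {fraud_share:.4%}")
print(f"Always-legit accuracy: {baseline_accuracy:.4%}")
```

Any model evaluated later should therefore be judged on fraud precision and recall, not on accuracy alone.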

---
# Section 2: Define Data and Perform Train/Test Split Procedures

In this section we perform our standard train/test/split procedure.

Recall the train/test/split procedure:

## In this section we will:
1. Import Libraries
2. Split Data in Train/Test for both X and y
3. Fit/Train Scaler on Training X Data
4. Scale X Test Data

## In section 3 we will:
5. Create Model
6. Fit/Train Model on X Train Data
7. Evaluate Model on X Test Data (by creating predictions and comparing to Y_test)
8. Adjust Parameters as Necessary and repeat steps 5 through 7

---

### 2.1.1: Define Data

For this exercise, we will use the following:

>- X features: use all the "V" features
>- y target: this is the "Class" feature


In [None]:
X = fraud_df.iloc[:, 1:29]
X.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28'],
      dtype='object')

In [None]:
y = fraud_df['Class']
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: Int64

### 2.1.2: Import Libraries

Import the train_test_split function.

In [None]:
from sklearn.model_selection import train_test_split

### 2.1.3: Train-Test-Split

In the next code cell, split `X` and `y` into training and testing sets with the following arguments:

>- test_size = 20%
>- random_state = 42
>- stratify = y
>>- this will ensure that the class proportions are preserved in the training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .2, random_state = 42, stratify = y)

X_test.shape, y_test.shape

((56962, 28), (56962,))

Check the class distributions for `y_train`:

In [None]:
y_train.value_counts(normalize=True)

0    0.998271
1    0.001729
Name: Class, dtype: Float64

Check the class distributions for `y_test`:

>- Compare to `y_train` to make sure you have matching class distributions in the training and testing data sets.

In [None]:
y_test.value_counts(normalize=True)

0    0.99828
1    0.00172
Name: Class, dtype: Float64

### 2.1.4: Fit/Train Scaler on Training Data

Because the "V" features are already the output of a PCA transformation, and Gaussian Naive Bayes fits a separate mean and variance to each feature, we skip scaling for this exercise.

In [None]:
# skip for this exercise

### 2.1.5: Scale Test Data

For the same reasons as 2.1.4 we will skip this step.

In [None]:
# skip for this exercise

# Section 3: Create Model: Naive Bayes

In this section we continue the train/test split procedure, picking back up at the 5th step, `Create Model`.

Recall the train/test/split procedure:

## In section 2 we completed:
1. Import Libraries
2. Split Data in Train/Test for both X and y
3. Fit/Train Scaler on Training X Data
4. Scale X Test Data

## In this section we will:
5. Create Model
6. Fit/Train Model on X Train Data
7. Evaluate Model on X Test Data (by creating predictions and comparing to Y_test)
8. Adjust Parameters as Necessary and repeat steps 5 through 7


### 3.1.1: Create Model



In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

nb_classifier = GaussianNB()

### 3.1.2: Fit/Train Model on X Train Data

In [None]:
nb_classifier.fit(X_train, y_train)

#### 3.1.2b: Show Class Priors

A fitted `GaussianNB` model exposes the prior probabilities of the classes through its `class_prior_` attribute. In the next cell, show the class priors for your Naive Bayes model.

>- Round the priors to 3 decimals
>- Print a statement out that looks like the following:

```
Legit Prior: 0.XXX
Fraud Prior: 0.XXX
```

In [None]:
print('Legit Prior:', round(nb_classifier.class_prior_[0],3))

Legit Prior: 0.998


In [None]:
print('Fraud Prior:', round(nb_classifier.class_prior_[1],3))

Fraud Prior: 0.002
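
When no `priors` argument is passed, `GaussianNB` sets `class_prior_` to the class frequencies observed in the training labels, which is why the priors above agree with the `y_train` proportions from 2.1.3. The sketch below checks this on a small synthetic data set. The data and the names `X_demo`, `y_demo`, and `demo_model` are assumptions made for illustration and are not the fraud data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (NOT the fraud dataset): 1,000 rows, 3 features, roughly 5% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.05).astype(int)

demo_model = GaussianNB().fit(X_demo, y_demo)

# Empirical class frequencies in the training labels
class_freq = np.bincount(y_demo) / len(y_demo)

print("class_prior_:", demo_model.class_prior_.round(3))
print("frequencies: ", class_freq.round(3))
```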


### 3.1.3: Evaluate Model on X Test Data

In this section, create a confusion matrix DataFrame based on the Naive Bayes model.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = nb_classifier.predict(X_test)

#### 3.1.3a: Create a confusion matrix in the next code cell:

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[55674,  1190],
       [   15,    83]])
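
The raw array is easier to read once its rows and columns are labeled. A minimal sketch of wrapping it in a DataFrame follows; the values are copied from the output above, and the label strings are illustrative choices, not names from the original notebook. In the notebook itself you would pass `cm` directly instead of retyping the numbers.

```python
import numpy as np
import pandas as pd

# Confusion matrix values copied from the output above
cm = np.array([[55674, 1190],
               [15, 83]])

# scikit-learn's convention: rows are actual classes, columns are predicted classes
cm_df = pd.DataFrame(
    cm,
    index=["Actual Legit", "Actual Fraud"],
    columns=["Predicted Legit", "Predicted Fraud"],
)
print(cm_df)
```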

#### 3.1.3b: Create a classification report in the next cell:

In [None]:
print(classification_report(y_test, y_pred, target_names = ['0','1']))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56864
           1       0.07      0.85      0.12        98

    accuracy                           0.98     56962
   macro avg       0.53      0.91      0.56     56962
weighted avg       1.00      0.98      0.99     56962
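The fraud-class row of the report can be reproduced by hand from the confusion matrix in 3.1.3a. The sketch below does that arithmetic with the four reported cell counts; the variable names are illustrative.

```python
# Cells of the confusion matrix reported in 3.1.3a
tn, fp = 55674, 1190   # actual legit: predicted legit, predicted fraud
fn, tp = 15, 83        # actual fraud: predicted legit, predicted fraud

# Precision: of all transactions flagged as fraud, how many really were fraud
precision = tp / (tp + fp)

# Recall: of all real frauds, how many the model flagged
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Fraud precision: {precision:.2f}")
print(f"Fraud recall:    {recall:.2f}")
print(f"Fraud F1:        {f1:.2f}")
```

The model catches most frauds (high recall) but raises many false alarms (low precision), a trade-off that the 0.98 accuracy figure hides.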



### 3.1.4: Cross Validation

In this section perform a cross-validation for your Naive Bayes model.

>- Use 10 folds in your cross validation
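
One possible way to do this is with scikit-learn's `cross_val_score`. The sketch below runs on synthetic, imbalanced stand-in data so that it executes on its own; in the notebook you would pass `X_train` and `y_train` instead. The default accuracy scoring is an assumption, since the exercise does not name a metric.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (NOT the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# cv=10 requests 10 folds; for a classifier, scikit-learn stratifies them by class
cv_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=10)
print(cv_scores)
```

Given the class imbalance, passing `scoring='recall'` or `scoring='f1'` may be more informative than accuracy.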

### 3.1.5: Cross Validation Summary

In the next code cells, show the average and standard deviation of your cross-validation scores from 3.1.4.

>- Round both average and standard deviation to 3 decimals
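
A minimal sketch of the summary follows. It recomputes scores on synthetic stand-in data only so that the block runs by itself; in the notebook, `cv_scores` would be the array produced in 3.1.4. The names `mean_score` and `std_score` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (NOT the fraud dataset), used only to make this block runnable
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
cv_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=10)

# Average and spread of the 10 fold scores, rounded to 3 decimals
mean_score = round(cv_scores.mean(), 3)
std_score = round(cv_scores.std(), 3)

print(f"CV mean: {mean_score}")
print(f"CV std:  {std_score}")
```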

# Section 4: Visualize Model Performance

In this section, we visualize the Naive Bayes model performance.

### 4.1.1: Create the ROC Curve

In the next code cell, create the ROC curve for the Naive Bayes classifier.
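
A sketch of one way to build the curve with `roc_curve` and matplotlib is given below. It trains a separate `GaussianNB` on synthetic stand-in data so that it runs independently; in the notebook you would use `nb_classifier`, `X_test`, and `y_test`. All `*_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` names are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (NOT the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_model = GaussianNB().fit(X_tr, y_tr)

# Predicted probability of the positive class (column 1) for each test row
y_score = demo_model.predict_proba(X_te)[:, 1]

# False/true positive rates across all probability thresholds
fpr, tpr, _ = roc_curve(y_te, y_score)
roc_auc = roc_auc_score(y_te, y_score)

plt.plot(fpr, tpr, label=f"Gaussian NB (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
```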

### 4.1.2: Create Precision-Recall Curve

In the next code cell, create the precision-recall curve for the Naive Bayes model.
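
For a heavily imbalanced target, the precision-recall curve is often more revealing than the ROC curve. The sketch below uses `precision_recall_curve` on synthetic stand-in data; in the notebook you would use `nb_classifier`, `X_test`, and `y_test`. The `*_demo` and split variable names are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (NOT the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_model = GaussianNB().fit(X_tr, y_tr)

# Predicted probability of the positive class for each test row
y_score = demo_model.predict_proba(X_te)[:, 1]

# Precision and recall at every probability threshold
precision, recall, _ = precision_recall_curve(y_te, y_score)
avg_precision = average_precision_score(y_te, y_score)

plt.plot(recall, precision, label=f"Gaussian NB (AP = {avg_precision:.3f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()
```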

### 4.1.3: Cumulative Gains Plot

In the next cells, create a cumulative gain plot for the Naive Bayes Model.

Get the predicted probabilities from the Naive Bayes Model:

Now plot the cumulative gains curve:
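
The two steps above can be sketched as follows. This version computes the gains by hand with NumPy instead of relying on a dedicated plotting helper, and it runs on synthetic stand-in data; in the notebook the probabilities would come from `nb_classifier.predict_proba(X_test)`. The `*_demo`, `gains`, and `pct_sample` names are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (NOT the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
demo_model = GaussianNB().fit(X_tr, y_tr)

# Step 1: predicted probability of the positive class for each test row
y_score = demo_model.predict_proba(X_te)[:, 1]

# Step 2: rank rows from highest to lowest predicted probability
order = np.argsort(-y_score)
y_sorted = np.asarray(y_te)[order]

# Share of all positives captured within the top-k ranked rows, for k = 1..n
gains = np.cumsum(y_sorted) / y_sorted.sum()
pct_sample = np.arange(1, len(y_sorted) + 1) / len(y_sorted)

plt.plot(pct_sample, gains, label="Gaussian NB")
plt.plot([0, 1], [0, 1], linestyle="--", label="Baseline (random)")
plt.xlabel("Fraction of sample (ranked by predicted probability)")
plt.ylabel("Fraction of positives captured")
plt.title("Cumulative Gains Curve")
plt.legend()
plt.show()
```

The further the curve rises above the diagonal, the more frauds are found by reviewing only the highest-scored transactions.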