# Credit Card Fraud Detection<br>
### Project Goals
- Discover drivers of fraud from credit card data
- Use these drivers to develop a machine learning model that helps predict fraud
- This information could be used on future datasets to help detect fraud

In [1]:
# data science imports
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
# ML imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
# no warnings 
import warnings
warnings.filterwarnings("ignore")
# premade functions
import wrangle as w

# Acquire<br>
- Data acquired from Kaggle
- Data frame contained 1,000,000 rows and 8 columns before cleaning
- Each row represents a credit card transaction
- Each column represents a feature associated with the transaction

In [2]:
# Use wrangle function to import raw data
df = w.wrangle_cc()
# take a peek
df.head(1)

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0


In [None]:
# Use prepare function to prep data 
df = w.prep_cc(df)
# Quick peek into the cleaned data
df.head(1)

# Data Dictionary
Use this table as a reference for the columns in this dataset.

| Name | Definition |
| ---- | ---------- |
| distance_from_home | Distance from home to where the transaction happened |
| distance_from_last_transaction | Distance from where the last transaction happened |
| ratio_to_median_purchase_price | Ratio of the transaction's purchase price to the median purchase price |
| repeat_retailer | Binary; whether the transaction happened at the same retailer |
| used_chip | Binary; whether the transaction was made through the chip (credit card) |
| used_pin_number | Binary; whether the transaction used a PIN |
| online_order | Binary; whether the transaction was an online order |
| fraud | Binary; whether the transaction is fraudulent |
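
As a quick sanity check on the dictionary above, the columns described as binary can be tested for values other than 0 and 1. This is a minimal sketch: the helper name `non_binary_columns` is hypothetical, and the single row is copied from the sample output shown earlier in the notebook rather than loaded from the real data.

```python
import pandas as pd

# Columns the data dictionary describes as binary
BINARY_COLUMNS = ["repeat_retailer", "used_chip", "used_pin_number",
                  "online_order", "fraud"]


def non_binary_columns(df, columns=BINARY_COLUMNS):
    """Return the listed columns that contain values other than 0 and 1."""
    return [col for col in columns if not df[col].isin([0, 1]).all()]


# One row copied from the sample output earlier in the notebook
sample = pd.DataFrame([{
    "distance_from_home": 57.877857,
    "distance_from_last_transaction": 0.31114,
    "ratio_to_median_purchase_price": 1.94594,
    "repeat_retailer": 1.0,
    "used_chip": 1.0,
    "used_pin_number": 0.0,
    "online_order": 0.0,
    "fraud": 0.0,
}])

# An empty list means every binary column holds only 0s and 1s
print(non_binary_columns(sample))
```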


# Split data into train/validate/test sample dataframes

In [None]:
# function to split data and print shape of our splits
train, validate, test, X_train, y_train, X_validate, y_validate, X_test, y_test = w.split_data(df, 'fraud')
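
The `w.split_data` helper lives in the project's `wrangle` module and is not shown in this notebook. As a rough sketch of what a three-way split with the same return shape could look like, the example below assumes a stratified 60/20/20 split on the target. The proportions, the seed, and the name `split_data_sketch` are assumptions, not the project's actual settings, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target, seed=123):
    """Hypothetical stand-in for w.split_data: stratified 60/20/20 split."""
    # Hold out 20% of the rows as the test set, preserving the class ratio
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Split the remaining 80% into train (60%) and validate (20%):
    # 0.25 of the remaining 80% is 20% of the full data
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed,
        stratify=train_validate[target],
    )
    # Separate the features (X) from the target (y) in each split
    X_train, y_train = train.drop(columns=[target]), train[target]
    X_validate, y_validate = validate.drop(columns=[target]), validate[target]
    X_test, y_test = test.drop(columns=[target]), test[target]
    return (train, validate, test,
            X_train, y_train, X_validate, y_validate, X_test, y_test)


# Small synthetic frame standing in for the real transactions
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "distance_from_home": rng.exponential(25, 1000),
    "online_order": rng.integers(0, 2, 1000),
    "fraud": (rng.random(1000) < 0.1).astype(int),
})

splits = split_data_sketch(demo, "fraud")
train_demo, validate_demo, test_demo = splits[:3]
print(train_demo.shape, validate_demo.shape, test_demo.shape)
```

Stratifying on the target keeps the share of fraudulent rows roughly equal across the three splits, which matters when one class is much rarer than the other.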

# Exploration<br>
- Here we will be asking some questions of our data
- We will then support these questions with visuals and statistical tests
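
For a binary feature measured against a binary target such as `fraud`, one statistical test that fits is a chi-square test of independence. The sketch below is illustrative only: it runs on synthetic data with made-up fraud rates, the choice of `online_order` as the feature is an assumption, and `chi2_test_sketch` is a hypothetical helper, not the project's own test function.

```python
import numpy as np
import pandas as pd
from scipy import stats


def chi2_test_sketch(df, feature, target, alpha=0.05):
    """Chi-square test of independence between two categorical columns."""
    # Contingency table of observed counts for each feature/target pair
    observed = pd.crosstab(df[feature], df[target])
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    # Reject the null hypothesis of independence when p is below alpha
    return float(chi2), float(p), bool(p < alpha)


# Synthetic data: fraud is made far more common among online orders
rng = np.random.default_rng(1)
online = rng.integers(0, 2, 5000)
fraud_rate = np.where(online == 1, 0.20, 0.02)
demo = pd.DataFrame({
    "online_order": online,
    "fraud": (rng.random(5000) < fraud_rate).astype(int),
})

chi2, p, reject_null = chi2_test_sketch(demo, "online_order", "fraud")
print(f"chi2={chi2:.1f}, p={p:.3g}, reject null: {reject_null}")
```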

In [None]:
# function for visual 1
e.question_1_visual(train)

In [None]:
# function for stats test on question 1
e.question_hypothesis_test(1,train,'wine_color',
                            'Does color affect wine quality?',
                            'quality',alpha=.05)

In [None]:
# function for visual 2
e.question_2_visual(train)

In [None]:
# function for stats test on question 2
e.question_hypothesis_test(2,train,'alcohol',
                            'Does a higher quality mean higher alcohol content?',
                            'quality',alpha=.05)

In [None]:
# function for visual 3
e.question_3_visual(train)

In [None]:
# function for stats test on question 3
e.question_hypothesis_test(3,train,'citric_acid',
                            'Is there a relationship between Citric Acid and Quality?',
                            'quality',alpha=.05)

In [None]:
# function for visual 4
e.question_4_visual(train)

In [None]:
# function for stats test on question 4
e.question_hypothesis_test(4,train,'free_sulfur_dioxide',
                            'Is there a relationship between Free Sulfur Dioxide and Quality?',
                            'quality',alpha=.05)

# Exploration Summary<br>
- 
- 
- <br>

#### We are moving forward to include these drivers in our model:
- 
- 
- 

# Modeling<br>
We ran the algorithms below to see what fit the data best.

## Baseline

In [None]:
# function to generate baseline
m.get_baseline(df)
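
A common baseline for a classification target is to always predict the most frequent class, so the baseline accuracy equals that class's share of the rows. Whether `m.get_baseline` computes exactly this is an assumption; the helper name and the toy series below are hypothetical.

```python
import pandas as pd


def baseline_accuracy_sketch(y):
    """Accuracy of always predicting the most common class in y."""
    # value_counts(normalize=True) returns each class's share of the rows
    return y.value_counts(normalize=True).max()


# Toy target: 90 legitimate transactions and 10 fraudulent ones
y_demo = pd.Series([0] * 90 + [1] * 10, name="fraud")
print(f"Baseline accuracy: {baseline_accuracy_sketch(y_demo):.2%}")
```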

## Random Forest

In [None]:
# function to get random forest algorithm
m.get_rf(X_train, y_train, X_validate, y_validate)
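
The internals of `m.get_rf` are not shown here. As a minimal sketch of the general pattern, the example below fits a random forest on a training split and compares accuracy on train and validate. The data is synthetic (generated with `make_classification`), and the hyperparameters `n_estimators=100` and `max_depth=5` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the transaction data
X, y = make_classification(
    n_samples=2000, n_features=7, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Limiting tree depth is one way to reduce overfitting
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_tr, y_tr)

# A large gap between the two scores would suggest overfitting
train_acc = rf.score(X_tr, y_tr)
validate_acc = rf.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}, validate accuracy: {validate_acc:.3f}")
```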

## Logistic Regression

In [None]:
# function to get logistic regression algorithm
m.get_logit(X_train, y_train, X_validate, y_validate)

## K-Nearest Neighbors

In [None]:
# function to get knn algorithm
m.get_knn(X_train, y_train, X_validate, y_validate)
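
K-nearest neighbors classifies by distance, so a wide-range feature such as a distance in arbitrary units can swamp a 0/1 flag unless the features are scaled. The notebook imports `MinMaxScaler`, but whether `m.get_knn` applies it is not shown, so treat the following as a sketch under that assumption. The data is synthetic and `n_neighbors=5` is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: one wide-range feature and one binary flag
rng = np.random.default_rng(7)
n = 1000
distance = rng.exponential(25, n)
online = rng.integers(0, 2, n)
X = np.column_stack([distance, online])
y = ((online == 1) & (distance > 20)).astype(int)

X_tr, X_val = X[:750], X[750:]
y_tr, y_val = y[:750], y[750:]

# Fit the scaler on the training split only, then reuse it on validate
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_val_scaled = scaler.transform(X_val)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr_scaled, y_tr)
knn_validate_acc = knn.score(X_val_scaled, y_val)
print(f"validate accuracy: {knn_validate_acc:.3f}")
```

Fitting the scaler on the training split alone keeps information from the validate and test splits out of the model.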

## Decision Tree

In [None]:
# function to get decision tree algorithm
m.get_clf(X_train, y_train, X_validate, y_validate)

# Visualize All Models

In [None]:
# function for visual of top models
m.get_top_models(X_train, y_train, X_validate, y_validate)

# Test Model<br>
- We are choosing the Random Forest model as it has the highest accuracy.
- We will now run our model on the test data to gauge how it will perform on unseen data.

In [None]:
# function to get test 
m.get_test2(X_train, y_train, X_test, y_test)

# Test Versus Baseline Visual

In [None]:
# functions to get visual on test vs baseline
m.get_mvb(X_train, y_train, X_test, y_test, df)

## Modeling Wrap
The Random Forest model outperforms the baseline by almost 9%, so I recommend using it.
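
A comparison like the one above can be reproduced in outline by scoring a most-frequent-class baseline and the chosen model on the same test split. This is a sketch on synthetic data, not the project's `m.get_mvb` function: `DummyClassifier` stands in for the baseline, the hyperparameters are assumptions, and the numbers it prints say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the transaction data
X, y = make_classification(
    n_samples=3000, n_features=7, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = RandomForestClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Score both on the held-out test split and report the difference
baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
print(f"baseline={baseline_acc:.3f}  model={model_acc:.3f}  "
      f"difference={model_acc - baseline_acc:+.3f}")
```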

# Conclusion<br>
### Summary
- 
- 
- 

# Recommendations<br>
- 
- 
- 

# Next Steps<br>
- If provided more time to work on this project I would  