# Logistic Regression: Predicting Credit Card Fraud

## Introduction:

In this project, I will use the Logistic Regression algorithm to predict fraudulent credit card transactions. Logistic Regression is defined as: "a supervised machine learning algorithm that uses mathematics to analyze data relationships and predict the value of one factor based on another." Here, the factor being predicted is binary: each transaction is either fraudulent (1) or not (0).
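To make the definition concrete, here is a minimal sketch of how a logistic regression turns a weighted sum of features into a class label. The weights, bias, and feature values are made up for illustration and are not taken from the model trained in this notebook.

```python
import math


def sigmoid(score):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 when the modeled probability reaches the threshold, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= threshold else 0


# Hypothetical parameters, chosen only to show both outcomes.
weights = [0.8, -0.4]
bias = -1.0

label_a = predict_label([3.0, 0.5], weights, bias)  # score = 1.2
label_b = predict_label([0.5, 2.0], weights, bias)  # score = -1.4
print(label_a, label_b)
```

A positive score gives a probability above 0.5 and a negative score gives one below it, so with the default threshold the sign of the score decides the label.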

In this notebook you will see the following steps taken:
1. Prepare the data

2. Split the data into training and testing sets

3. Model and fit the data into a logistic regression

4. Predict the testing labels 

5. Calculate the performance metrics


#### Prepare the Data:
1. I will load the `transaction_fraud_data.csv` file into a Pandas DataFrame. I will then set the “id” column as the index.

2. Note that I will be predicting the `fraud` variable. I will then be able to use `value_counts` to determine how many fraudulent transactions exist in this dataset.


#### Split the Data into Training and Testing Sets
1. Using the `transaction_fraud_data` DataFrame, I will separate the data into training and testing data. I will start by defining the `target` (the “fraud” column) and the `features` of the data (all the columns except “fraud”).

2. I will then split the features and target data into `training_features`, `testing_features`, `training_targets`, and `testing_targets` datasets by using the `train_test_split` function.


#### Model and Fit the Data to a Logistic Regression
1. I will then declare a `LogisticRegression` model.

2. Next I will fit the training data to the model, and save the model.


#### Predict the Testing Labels
1. I will make predictions about fraud by using the testing dataset, and save those predictions.

#### Calculate the Performance Metrics
1. I will calculate the accuracy score by evaluating `testing_targets` vs. `testing_predictions`. 
2. I will then be able to determine how well the model predicted fraudulent transactions using this dataset. 

## Resources:
The following links point to the scikit-learn modules that will be used:

[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

[accuracy score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

[classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)


In [1]:
# Import the required modules
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression


# Prepare the Data

### Step 1: I will load the `transaction_fraud_data.csv` file into a Pandas DataFrame. I will then set the “id” column as the index.

In [2]:
# Read the transaction_fraud_data.csv file into a Pandas DataFrame.
transaction_fraud_data = pd.read_csv(
    Path("transaction_fraud_data.csv"), 
    index_col="id"
)

# Review the DataFrame
transaction_fraud_data.head()


| id | Z_0 | Z_1 | Z_2 | Z_3 | Z_4 | Z_5 | Z_6 | Z_7 | Z_8 | Z_9 | fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.346302 | -1.026583 | -10.363716 | -8.05426 | 7.519907 | 1.860217 | 9.056866 | 0.392113 | -12.937505 | -0.801264 | 0 |
| 1 | -3.296776 | -8.4877 | 9.175655 | 1.097409 | -1.766353 | -2.293392 | -2.247549 | -0.041269 | 8.216953 | 8.883102 | 0 |
| 2 | 12.83961 | 4.475612 | -5.213528 | -5.72266 | -4.07339 | -5.661766 | 5.967037 | -9.826743 | -17.443248 | 5.26647 | 0 |
| 3 | 13.237325 | 13.605183 | -5.958039 | 4.392244 | 4.763587 | 3.781628 | -2.722725 | -5.814775 | 11.236515 | 2.582494 | 0 |
| 4 | 4.161311 | 2.520646 | 7.17165 | 1.301273 | -5.40819 | 4.651314 | 9.639546 | 4.648132 | 3.928619 | 2.358164 | 0 |


### Step 2: Answer the following question:

Note that I will be predicting the `fraud` variable. I will then be able to use `value_counts` to determine how many fraudulent transactions exist in this dataset.

In [10]:
# The 'fraud' column is the value I want to predict.
# Class 0 indicates non-fraudulent transactions and class 1 indicates fraudulent transactions.
# Using value_counts, how many fraudulent transactions are in this dataset?
# ANSWER: There are 993 non-fraudulent transactions and 7 fraudulent transactions.
transaction_fraud_data["fraud"].value_counts()

fraud
0    993
1      7
Name: count, dtype: int64
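
The raw counts are easier to interpret as a share of the dataset. The sketch below rebuilds the label column from the counts reported above (993 and 7) using only the standard library. With pandas, `value_counts(normalize=True)` returns the same proportions directly.

```python
from collections import Counter

# Labels rebuilt from the reported counts. The row order is not meaningful here.
fraud_labels = [0] * 993 + [1] * 7

counts = Counter(fraud_labels)
fraud_share = counts[1] / len(fraud_labels)

print(f"fraud: {counts[1]} of {len(fraud_labels)} ({fraud_share:.1%})")
```

Fewer than 1 in 100 transactions is fraudulent, which matters later when interpreting the accuracy score.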

# Split the data into training and testing sets

### Step 1: Using the `transaction_fraud_data` DataFrame, I will separate the data into training and testing data. I will start by defining the `target` (the “fraud” column) and the `features` of the data (all the columns except “fraud”).

In [4]:
# The target column should be the binary `fraud` column.
target = transaction_fraud_data["fraud"]


# The features are all of the columns except `fraud`.
features = transaction_fraud_data.drop(columns="fraud")


### Step 2: Splitting the features and target data into `training_features`, `testing_features`, `training_targets`, and `testing_targets` datasets by using the `train_test_split` function.

In [5]:
# Split the dataset using the train_test_split function.
# No random_state is passed, so the rows in each split can differ between runs.
training_features, testing_features, training_targets, testing_targets = train_test_split(features, target)
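
With only 7 fraudulent rows, a purely random split can leave very few of them, or none, in the test set. `train_test_split` accepts `stratify` and `random_state` arguments to address this. The notebook does not use them, so the following is only a standard-library illustration of the idea behind a stratified split, not scikit-learn's implementation. The function name, the seed, and the 25% test fraction are assumptions made for the sketch.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=7):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_indices, test_indices = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one row of each class in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)


# Labels rebuilt from the class counts reported earlier (993 and 7).
labels = [0] * 993 + [1] * 7
train_idx, test_idx = stratified_split(labels)

fraud_in_test = sum(labels[i] for i in test_idx)
fraud_in_train = sum(labels[i] for i in train_idx)
print(len(train_idx), len(test_idx), fraud_in_train, fraud_in_test)
```

Because each class is split on its own, both sets keep roughly the same fraud rate as the full dataset, and the fixed seed makes the split repeatable.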

# Model and Fit the Data to a Logistic Regression

### Step 1: Declare a `LogisticRegression` model.

In [6]:
# Declare a logistic regression model.
# Apply a random_state of 7 to the model
logistic_regression_model = LogisticRegression(random_state=7)

### Step 2: Fit the training data to the model, and save the model.

In [7]:
# Fit and save the logistic regression model using the training data
lr_model = logistic_regression_model.fit(training_features, training_targets)
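
To give a sense of what fitting does, here is a minimal sketch that learns one weight and one bias with plain batch gradient descent on a toy, single-feature dataset. This is a swapped-in technique for illustration: scikit-learn's `LogisticRegression` uses a different solver and applies regularization by default, so this is not what `fit` runs internally. The data, learning rate, and epoch count are assumptions.

```python
import math


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight and bias by minimizing the average log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            # Derivative of the log loss with respect to the score.
            error = sigmoid(weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias


# Toy data: negative values belong to class 0, positive values to class 1.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]

weight, bias = fit_logistic(xs, ys)
predictions = [1 if sigmoid(weight * x + bias) >= 0.5 else 0 for x in xs]
print(predictions)
```

Each pass nudges the parameters in the direction that makes the predicted probabilities agree more closely with the training labels.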

# Predict the Testing Labels

### Step 1: Make predictions about fraud by using the testing dataset, and save those predictions.

In [8]:
# Make and save testing predictions with the saved logistic regression model using the test data
testing_predictions = lr_model.predict(testing_features)

# Review the predictions
testing_predictions

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0])

# Calculate the Performance Metrics

### Step 1: Calculate the accuracy score by evaluating `testing_targets` vs. `testing_predictions`.

In [9]:
# Display the accuracy score for the test dataset.
accuracy_score(testing_targets, testing_predictions)

1.0
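
A single accuracy number hides how the result breaks down by class. The sketch below tallies a confusion matrix by hand. The prediction list is rebuilt from the array printed above (250 labels, one of them a 1). Because the accuracy is 1.0, the true labels must be identical to the predictions, so the same list is reused for them.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            counts["tp" if predicted == positive else "fn"] += 1
        else:
            counts["fp" if predicted == positive else "tn"] += 1
    return counts


# Rebuilt from the printed predictions: 138 zeros, a single 1, then 111 zeros.
predicted_labels = [0] * 138 + [1] + [0] * 111
# An accuracy of 1.0 implies the true labels match the predictions exactly.
true_labels = list(predicted_labels)

counts = confusion_counts(true_labels, predicted_labels)
print(counts)
```

The fraud class contributes only one of the 250 test rows, so the perfect score for that class rests on a single transaction.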

### Step 2: Answer the following question

**Question:** For this dataset, how well did the model predict the actual fraudulent transactions?

**Answer:** On this test split, accuracy is 1.0: every transaction in the test data was categorized correctly, including the one transaction the model flagged as fraudulent. However, `value_counts` shows that only 7 of the 1,000 transactions in the dataset are fraudulent, so a model that simply predicted every transaction to be valid would still score about 99% accuracy. Accuracy alone therefore says little about how well the model detects fraud, and a result that rests on a single fraudulent test transaction should be read with caution.
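
The point about high accuracy from predicting everything as valid can be checked with a short calculation. The sketch below scores a hypothetical "always valid" baseline against the full dataset's class counts (993 and 7), computing accuracy and fraud recall by hand. The helper names are illustrative and not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    actual_positives = sum(t == positive for t in y_true)
    true_positives = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    return true_positives / actual_positives if actual_positives else 0.0


# Labels rebuilt from the class counts reported earlier.
y_true = [0] * 993 + [1] * 7
# Hypothetical baseline: predict "not fraud" for every transaction.
always_valid = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, always_valid)
baseline_recall = recall(y_true, always_valid)
print(baseline_accuracy, baseline_recall)
```

The baseline reaches 99.3% accuracy while catching none of the fraud, which is why per-class metrics such as recall, as reported by `classification_report`, are a more informative check on a dataset this imbalanced.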