# Predicting Fraudulent Transactions

You’ll try logistic regression on the real-world problem of fraud detection and find out how it fares.

## Instructions:

The instructions for this activity are divided into the following stages:

1. Prepare the data

2. Split the data into training and testing sets

3. Model and fit the data to a logistic regression

4. Predict the testing labels 

5. Calculate the performance metrics

#### Prepare the Data

1. Load the `transaction_fraud_data.csv` file from the `Resources` folder into a Pandas DataFrame. Set the “id” column as the index.

2. Note that you want to predict the `fraud` variable. Answer the following question: Using `value_counts`, how many fraudulent transactions exist in this dataset?

#### Split the Data into Training and Testing Sets

1. Using the `transaction_fraud_data` DataFrame, separate the data into training and testing data. Start by defining the `target` (the “fraud” column) and the `features` of the data (all the columns except “fraud”).

2. Split the features and target data into `training_features`, `testing_features`, `training_targets`, and `testing_targets` datasets by using the `train_test_split` function.

#### Model and Fit the Data to a Logistic Regression

1. Declare a `LogisticRegression` model.

2. Fit the training data to the model, and save the model.

#### Predict the Testing Labels

1. Make predictions about fraud by using the testing dataset, and save those predictions.

#### Calculate the Performance Metrics

1. Calculate the accuracy score by evaluating `testing_targets` vs. `testing_predictions`. 
2. Answer the following question: For this dataset, how well did the model predict the actually fraudulent transactions?


## Resources:

The following links point to the scikit-learn documentation for the modules used in this activity:

[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

[accuracy score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

[classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)


In [1]:
# Import the required modules
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression


# Prepare the Data

### Step 1: Load the `transaction_fraud_data.csv` file from the `Resources` folder into a Pandas DataFrame. Set the “id” column as the index.

In [2]:
# Read the transaction_fraud_data.csv file into a Pandas DataFrame.
# Adjust the path if the file is stored in the Resources folder.
transaction_fraud_data = pd.read_csv(Path("transaction_fraud_data.csv"))

# Review the DataFrame
transaction_fraud_data


| | id | Z_0 | Z_1 | Z_2 | Z_3 | Z_4 | Z_5 | Z_6 | Z_7 | Z_8 | Z_9 | fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -2.346302 | -1.026583 | -10.363716 | -8.054260 | 7.519907 | 1.860217 | 9.056866 | 0.392113 | -12.937505 | -0.801264 | 0 |
| 1 | 1 | -3.296776 | -8.487700 | 9.175655 | 1.097409 | -1.766353 | -2.293392 | -2.247549 | -0.041269 | 8.216953 | 8.883102 | 0 |
| 2 | 2 | 12.839610 | 4.475612 | -5.213528 | -5.722660 | -4.073390 | -5.661766 | 5.967037 | -9.826743 | -17.443248 | 5.266470 | 0 |
| 3 | 3 | 13.237325 | 13.605183 | -5.958039 | 4.392244 | 4.763587 | 3.781628 | -2.722725 | -5.814775 | 11.236515 | 2.582494 | 0 |
| 4 | 4 | 4.161311 | 2.520646 | 7.171650 | 1.301273 | -5.408190 | 4.651314 | 9.639546 | 4.648132 | 3.928619 | 2.358164 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 995 | 4.934022 | 4.443195 | 1.388734 | 2.278806 | -21.147607 | -0.368845 | 2.628414 | -1.351398 | 5.148549 | 4.619609 | 0 |
| 996 | 996 | -4.910172 | -10.815344 | -10.595855 | 5.884628 | 0.528660 | 0.830482 | -13.764331 | 13.836068 | 7.353338 | 10.938554 | 0 |
| 997 | 997 | 11.623010 | 6.322765 | -3.407261 | -7.983187 | -22.384596 | 5.291396 | -6.293332 | -1.309196 | -1.836557 | -3.473290 | 0 |
| 998 | 998 | -7.030572 | 7.131075 | -6.298506 | -2.606031 | 0.484820 | 0.139240 | 4.640984 | -11.869494 | 3.227287 | -5.899874 | 0 |


In [4]:
# Set the "id" column as the index
transaction_fraud_data = transaction_fraud_data.set_index("id")
transaction_fraud_data

| id | Z_0 | Z_1 | Z_2 | Z_3 | Z_4 | Z_5 | Z_6 | Z_7 | Z_8 | Z_9 | fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.346302 | -1.026583 | -10.363716 | -8.054260 | 7.519907 | 1.860217 | 9.056866 | 0.392113 | -12.937505 | -0.801264 | 0 |
| 1 | -3.296776 | -8.487700 | 9.175655 | 1.097409 | -1.766353 | -2.293392 | -2.247549 | -0.041269 | 8.216953 | 8.883102 | 0 |
| 2 | 12.839610 | 4.475612 | -5.213528 | -5.722660 | -4.073390 | -5.661766 | 5.967037 | -9.826743 | -17.443248 | 5.266470 | 0 |
| 3 | 13.237325 | 13.605183 | -5.958039 | 4.392244 | 4.763587 | 3.781628 | -2.722725 | -5.814775 | 11.236515 | 2.582494 | 0 |
| 4 | 4.161311 | 2.520646 | 7.171650 | 1.301273 | -5.408190 | 4.651314 | 9.639546 | 4.648132 | 3.928619 | 2.358164 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 4.934022 | 4.443195 | 1.388734 | 2.278806 | -21.147607 | -0.368845 | 2.628414 | -1.351398 | 5.148549 | 4.619609 | 0 |
| 996 | -4.910172 | -10.815344 | -10.595855 | 5.884628 | 0.528660 | 0.830482 | -13.764331 | 13.836068 | 7.353338 | 10.938554 | 0 |
| 997 | 11.623010 | 6.322765 | -3.407261 | -7.983187 | -22.384596 | 5.291396 | -6.293332 | -1.309196 | -1.836557 | -3.473290 | 0 |
| 998 | -7.030572 | 7.131075 | -6.298506 | -2.606031 | 0.484820 | 0.139240 | 4.640984 | -11.869494 | 3.227287 | -5.899874 | 0 |
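The load and the index assignment can also be combined into a single call. The following is a minimal sketch, assuming a small in-memory CSV (`csv_text`, a hypothetical stand-in built from the first rows of the preview and truncated to two features) in place of the real file:

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for transaction_fraud_data.csv
# (values copied from the preview, truncated to two feature columns).
csv_text = """id,Z_0,Z_1,fraud
0,-2.346302,-1.026583,0
1,-3.296776,-8.487700,0
2,12.839610,4.475612,0
"""

# index_col="id" makes "id" the index while the file is being read,
# so no separate set_index call is needed.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(sample_df.index.name)      # id
print(list(sample_df.columns))   # ['Z_0', 'Z_1', 'fraud']
```

With the real file, the same `index_col="id"` argument would be passed alongside the file path.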


### Step 2: Answer the following question:

Note that you want to predict the `fraud` variable. Answer the following question: Using `value_counts`, how many fraudulent transactions exist in this dataset?

In [7]:
# The 'fraud' column is the variable you want to predict.
# Class 0 indicates non-fraudulent transactions and class 1 indicates fraudulent transactions.
# Using value_counts, how many fraudulent transactions are in this dataset?
transaction_fraud_data["fraud"].value_counts()

0    993
1      7
Name: fraud, dtype: int64

In [None]:
# Class 1 indicates fraud, so the DataFrame contains 7 fraudulent transactions.
# The data is imbalanced: only 7 of 1,000 transactions are fraudulent.
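To express the imbalance as a share rather than a count, `value_counts` accepts `normalize=True`. This is a small sketch that assumes a stand-in Series (`fraud`, built here with the same 993/7 class counts as the output above) rather than the real DataFrame:

```python
import pandas as pd

# Stand-in Series with the same class counts as the notebook output:
# 993 non-fraudulent (0) and 7 fraudulent (1) transactions.
fraud = pd.Series([0] * 993 + [1] * 7, name="fraud")

counts = fraud.value_counts()                 # absolute counts per class
shares = fraud.value_counts(normalize=True)   # fraction of rows per class

print(counts[1])   # 7
print(shares[1])   # 0.007
```

A fraud share of 0.7% means a model can be correct on almost every row without ever flagging a fraudulent transaction.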

# Split the Data into Training and Testing Sets

### Step 1: Using the `transaction_fraud_data` DataFrame, separate the data into training and testing data. Start by defining the `target` (the “fraud” column) and the `features` of the data (all the columns except “fraud”).

In [9]:
# The target is the binary `fraud` column.
target = transaction_fraud_data["fraud"]

# The features are all of the remaining columns.
features = transaction_fraud_data.drop(columns="fraud")

display(target.head())
display(features.head())

id
0    0
1    0
2    0
3    0
4    0
Name: fraud, dtype: int64

| id | Z_0 | Z_1 | Z_2 | Z_3 | Z_4 | Z_5 | Z_6 | Z_7 | Z_8 | Z_9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.346302 | -1.026583 | -10.363716 | -8.05426 | 7.519907 | 1.860217 | 9.056866 | 0.392113 | -12.937505 | -0.801264 |
| 1 | -3.296776 | -8.4877 | 9.175655 | 1.097409 | -1.766353 | -2.293392 | -2.247549 | -0.041269 | 8.216953 | 8.883102 |
| 2 | 12.83961 | 4.475612 | -5.213528 | -5.72266 | -4.07339 | -5.661766 | 5.967037 | -9.826743 | -17.443248 | 5.26647 |
| 3 | 13.237325 | 13.605183 | -5.958039 | 4.392244 | 4.763587 | 3.781628 | -2.722725 | -5.814775 | 11.236515 | 2.582494 |
| 4 | 4.161311 | 2.520646 | 7.17165 | 1.301273 | -5.40819 | 4.651314 | 9.639546 | 4.648132 | 3.928619 | 2.358164 |


### Step 2: Split the features and target data into `training_features`, `testing_features`, `training_targets`, and `testing_targets` datasets by using the `train_test_split` function.

In [11]:
# Split the dataset using the train_test_split function.
# No random_state is set, so the rows in each subset change from run to run.
training_features, testing_features, training_targets, testing_targets = train_test_split(features, target)


In [22]:
#show the train/test split
display(len(training_features))
display(len(testing_features))

750

250
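With only 7 fraudulent rows in 1,000, a purely random split can leave very few of them, or none, in the testing set. One option is the `stratify` argument of `train_test_split`, shown here as a sketch on synthetic stand-in data (`demo_features` and `demo_target` are hypothetical names; the notebook itself does not stratify, and the `random_state` value is arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 7 of them labelled as fraud.
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(
    rng.normal(size=(1000, 3)), columns=["Z_0", "Z_1", "Z_2"]
)
demo_target = pd.Series([0] * 993 + [1] * 7, name="fraud")

# stratify keeps the class ratio similar in both subsets;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    demo_features, demo_target, stratify=demo_target, random_state=1
)

print(len(X_train), len(X_test))      # 750 250
print(y_train.sum() + y_test.sum())   # 7
```

The default 75/25 split sizes are unchanged; only the way rows are assigned to each subset differs.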

# Model and Fit the Data to a Logistic Regression

### Step 1: Declare a `LogisticRegression` model.

In [26]:
# 1. Instantiate the tool
# 2. Fit the tool
# 3. Use the tool (scale, predict, evaluate)
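The three-step pattern can be sketched end to end as follows. This is an illustration on synthetic data from `make_classification`, not the transaction dataset, and the `demo_` names are hypothetical. It also shows that using a model before it is fitted raises `NotFittedError`, which is why the order of the steps matters:

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the call order.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_model = LogisticRegression()               # 1. instantiate

try:
    demo_model.predict(X_demo)                  # using the model before fitting fails
    raised_before_fit = False
except NotFittedError:
    raised_before_fit = True

demo_model.fit(X_demo, y_demo)                  # 2. fit
demo_predictions = demo_model.predict(X_demo)   # 3. use: predict labels
demo_score = demo_model.score(X_demo, y_demo)   # 3. use: mean accuracy

print(raised_before_fit)        # True
print(demo_predictions.shape)   # (200,)
```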

In [12]:
# Declare a logistic regression model.
# Apply a random_state of 7 to the model
logistic_regression_model = LogisticRegression(solver="lbfgs", random_state=7)


In [17]:
# Score the model on the training set.
# Run this cell after the fit in Step 2; an unfitted model raises NotFittedError.
logistic_regression_model.score(training_features, training_targets)

0.9933333333333333

In [18]:
# Score the model on the test set.
# With larger, more typical datasets, the training score is usually the higher of the two.
logistic_regression_model.score(testing_features, testing_targets)

0.996

### Step 2: Fit the training data to the model, and save the model.

In [13]:
# Fit and save the logistic regression model using the training data.
# fit(X, y) takes the feature matrix first (training_features) and the labels second (training_targets).
lr_model = logistic_regression_model.fit(training_features, training_targets)


In [27]:
# Look at the model coefficients.
# There is one coefficient per feature (similar to m in y = mx + b, but applied to the log-odds of fraud).
lr_model.coef_

array([[-7.87133837e-02,  8.38381716e-03,  5.03047528e-02,
        -4.93164949e-02,  5.89272803e-02, -1.65188482e-04,
         3.03398915e-01,  2.93230054e-02, -8.96752448e-02,
         8.33784134e-02]])
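To make the role of the coefficients concrete, the sketch below recomputes the log-odds by hand and compares them with `decision_function`. It assumes a model fitted on synthetic data (`X_demo`, `y_demo`, and `demo_model` are hypothetical names), so the numbers differ from the array above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a fitted stand-in model.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Log-odds for each row: features times coefficients, plus the intercept.
manual_log_odds = (X_demo @ demo_model.coef_.T + demo_model.intercept_).ravel()

# decision_function returns the same quantity.
matches = np.allclose(manual_log_odds, demo_model.decision_function(X_demo))
print(matches)   # True

# exp(coefficient) is the multiplicative change in the odds
# for a one-unit increase in that feature.
odds_ratios = np.exp(demo_model.coef_[0])
print(odds_ratios.shape)   # (4,)
```

A positive coefficient raises the predicted odds of class 1 as the feature grows, and a negative one lowers them.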

# Predict the Testing Labels

### Step 1: Make predictions about fraud by using the testing dataset, and save those predictions.

In [15]:
# Make and save testing predictions with the saved logistic regression model using the test data.
# The model was trained on the training set, so it predicts on the held-out testing features (often named X_test).
testing_predictions = lr_model.predict(testing_features)

# Review the predictions
testing_predictions

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0])
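`predict` returns hard labels, which here are all 0. The estimated probabilities behind those labels are available from `predict_proba`, and a custom cutoff can be applied to them. The following is a sketch on synthetic imbalanced data; the `demo_` names are hypothetical and the 0.1 threshold is an arbitrary value chosen for illustration, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 95% of rows belong to class 0.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, weights=[0.95], random_state=0
)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 holds the estimated probability of class 1.
demo_probabilities = demo_model.predict_proba(X_demo)
positive_probability = demo_probabilities[:, 1]

# Count the rows flagged as class 1 at two different cutoffs.
default_flags = int((positive_probability >= 0.5).sum())
lower_flags = int((positive_probability >= 0.1).sum())

print(lower_flags >= default_flags)   # True
```

Lowering the cutoff can only flag the same rows or more, which trades additional false alarms for a better chance of catching rare positives.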

# Calculate the Performance Metrics

### Step 1: Calculate the accuracy score by evaluating `testing_targets` vs. `testing_predictions`.

In [16]:
# Display the accuracy score for the test dataset.
accuracy_score(testing_targets, testing_predictions)

0.996
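An accuracy of 0.996 is hard to interpret without a baseline. The sketch below assumes a testing set of 249 non-fraudulent rows and 1 fraudulent row (the class make-up reported by the classification report further down) and scores a hypothetical "model" that never predicts fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Assumed class make-up of the testing set: 249 non-fraud rows, 1 fraud row.
y_true = np.array([0] * 249 + [1])

# A baseline that never predicts fraud.
y_always_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_zero)
baseline_recall = recall_score(y_true, y_always_zero)
baseline_balanced = balanced_accuracy_score(y_true, y_always_zero)

print(baseline_accuracy)   # 0.996
print(baseline_recall)     # 0.0
print(baseline_balanced)   # 0.5
```

The baseline reaches the same accuracy without detecting any fraud, so metrics that look at each class separately, such as recall or balanced accuracy, are more informative for this kind of data.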

### Step 2: Answer the following question

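**Question:** For this dataset, how well did the model predict the actual fraudulent transactions?

**Answer:** The model did not identify any fraudulent transactions. The testing set contains one fraudulent transaction, and the model classified it as not fraudulent, so precision and recall for class 1 are both 0.00. The 0.996 accuracy reflects only how well the model handles the majority (non-fraud) class.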

In [23]:
from sklearn.metrics import confusion_matrix, classification_report

In [24]:
print(confusion_matrix(testing_targets, testing_predictions))

[[249   0]
 [  1   0]]
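In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes. For a binary problem the four cells can be unpacked by name with `ravel()`. This sketch rebuilds the matrix above from stand-in arrays (`y_true` and `y_pred` are hypothetical names with the same counts as the output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Stand-in arrays matching the matrix above: one actual fraud, no predicted fraud.
y_true = np.array([0] * 249 + [1])
y_pred = np.zeros_like(y_true)

# With labels sorted as [0, 1], ravel() yields the cells in the order
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(tn, fp, fn, tp)   # 249 0 1 0
```

The single false negative is the one fraudulent transaction in the testing set that the model missed.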


In [25]:
print(classification_report(testing_targets, testing_predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       249
           1       0.00      0.00      0.00         1

    accuracy                           1.00       250
   macro avg       0.50      0.50      0.50       250
weighted avg       0.99      1.00      0.99       250
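Precision for class 1 is undefined when a model predicts no positives at all (zero true positives divided by zero predicted positives). By default scikit-learn reports 0.00 and issues a warning in that case. The `zero_division` parameter sets the reported value explicitly, as in this sketch on stand-in arrays with the same counts as the report above:

```python
import numpy as np
from sklearn.metrics import classification_report

# Stand-in arrays: 249 non-fraud rows, 1 fraud row, and no fraud predicted.
y_true = np.array([0] * 249 + [1])
y_pred = np.zeros_like(y_true)

# zero_division=0 reports 0 for undefined metrics without a warning;
# output_dict=True returns the report as a nested dictionary.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

print(report["1"]["precision"])   # 0.0
print(report["1"]["recall"])      # 0.0
```

The dictionary form makes it easy to pull out the class 1 metrics, which are the ones that matter for fraud detection.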



