# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response and justification, we will also ask you to apply a corresponding data transformation.

If you decide not to apply a data transformation for a step, you must **justify** why your dataset or machine-learning model does not require that preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: Again, this dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.
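As a rough illustration of the `sample()` suggestion, the sketch below draws a reproducible 10% subset. The `toy` frame and its values are made up for the example; only the `sample()` call itself is the point.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full transactions table (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.uniform(1, 1000, size=1000),
    "isFraud": rng.integers(0, 2, size=1000),
})

# Keep a 10% subset; random_state makes the draw repeatable.
subset = toy.sample(frac=0.1, random_state=42)
print(len(subset))  # → 100
```

When one class is very rare, a small random subset may contain only a handful of rows from it, so it is worth checking `value_counts()` on the target column of the subset before relying on it.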

In [9]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

Answer here:
No, the data has no null values, but it does have non-predictive identifier columns such as `nameOrig` and `nameDest`.

In [3]:
transactions = pd.read_csv("../data/bank_transactions.csv")

In [4]:
transactions.head()

Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,PAYMENT,983.09,C1454812978,36730.24,35747.15,M1491308340,0.0,0.0,0,0
1,PAYMENT,55215.25,C1031766358,99414.0,44198.75,M2102868029,0.0,0.0,0,0
2,CASH_IN,220986.01,C1451868666,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0
3,TRANSFER,2357394.75,C458368123,0.0,0.0,C620979654,4202580.45,6559975.19,0,0
4,CASH_OUT,67990.14,C1098978063,0.0,0.0,C142246322,625317.04,693307.19,0,0


In [5]:
transactions.isnull().sum()

type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64
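One way to act on the answer above is to drop the identifier columns before modeling. This is a minimal sketch on a two-row toy frame copied from the `head()` output; the names `toy`, `id_cols`, and `cleaned` are illustrative and not part of the notebook.

```python
import pandas as pd

# Two rows copied from the head() output above.
toy = pd.DataFrame({
    "type": ["PAYMENT", "CASH_IN"],
    "amount": [983.09, 220986.01],
    "nameOrig": ["C1454812978", "C1451868666"],
    "nameDest": ["M1491308340", "C1339195526"],
    "isFraud": [0, 0],
})

# Account identifiers carry no general signal, so they are removed.
id_cols = ["nameOrig", "nameDest"]
cleaned = toy.drop(columns=id_cols)  # returns a new frame; toy is unchanged
print(cleaned.columns.tolist())  # → ['type', 'amount', 'isFraud']
```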

## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Answer here
`TRANSFER` and `CASH_OUT` were the only transaction types with a significant amount of fraud in the `isFraud` column. One-hot encoding the `type` column would make this pattern usable by a model.
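A common way to make a categorical column like `type` usable by a model is one-hot encoding. The sketch below uses `pd.get_dummies` on a toy frame; the first four rows mirror the `head()` output, while the `DEBIT` row and the variable names are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["PAYMENT", "CASH_IN", "TRANSFER", "CASH_OUT", "DEBIT"],
    # The DEBIT amount is a made-up value so every type appears once.
    "amount": [983.09, 220986.01, 2357394.75, 67990.14, 100.0],
})

# Replace "type" with one 0/1 indicator column per transaction type.
encoded = pd.get_dummies(toy, columns=["type"], prefix="type", dtype=int)
print(encoded.columns.tolist())
```

Each row then has exactly one indicator set to 1, so a model can learn a separate effect for `type_TRANSFER` and `type_CASH_OUT`.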

In [None]:
#list the number of transactions for each type
transactions["type"].value_counts()

type
CASH_OUT    351360
PAYMENT     338573
CASH_IN     219955
TRANSFER     83695
DEBIT         6417
Name: count, dtype: int64

In [None]:
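# Share of transactions labeled fraud vs. non-fraud
print(transactions["isFraud"].value_counts(normalize=True))

# Bar chart for each categorical column; the identifier columns are skipped
# because their many distinct values make a bar chart slow and unreadable
cat_cols = transactions.select_dtypes(include=['object']).columns
for col in cat_cols.drop(['nameOrig', 'nameDest'], errors='ignore'):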
    plt.figure(figsize=(6,3))
    transactions[col].value_counts().plot(kind='bar')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

In [12]:
# Summing a string column only concatenates its values, so count fraud per type instead
transactions.groupby("type")["isFraud"].agg(["sum", "mean"])

In [None]:
# Bar plot of the mean amount for each transaction type
sns.barplot(data=transactions, x="type", y="amount", hue="type")

In [None]:
transactions['isFraud'].value_counts()

isFraud
0    998703
1      1297
Name: count, dtype: int64

In [None]:
# Number of transactions labeled fraud (1) vs. non-fraud (0)
plt.figure(figsize=(6,3))
transactions["isFraud"].value_counts().plot(kind='bar')
plt.title('Distribution of isFraud')
plt.xlabel('isFraud')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Answer here

Because fraudulent transactions are so rare (1,297 of 1,000,000 rows, about 0.13%), a naive model can reach roughly 99.87% accuracy by always predicting "not fraud" while missing every actual fraud case. To address this class imbalance, apply SMOTE to rebalance the training set only, and leave the test set at its original class distribution.
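The sketch below illustrates both points: the accuracy of an "always not fraud" baseline, computed from the class counts in the `value_counts()` output in Q2, and a simple rebalancing step. Note that it uses plain random oversampling with pandas as a dependency-free stand-in, not SMOTE itself; SMOTE (available in the separate imbalanced-learn package) synthesizes new minority rows by interpolation rather than duplicating existing ones. The `train` frame is made up.

```python
import numpy as np
import pandas as pd

# Class counts taken from the isFraud value_counts() output in Q2.
n_legit, n_fraud = 998_703, 1_297
baseline_accuracy = n_legit / (n_legit + n_fraud)
print(round(baseline_accuracy, 4))  # → 0.9987

# Toy training split: 95 legitimate rows and 5 fraud rows (made-up values).
train = pd.DataFrame({
    "amount": np.arange(100, dtype=float),
    "isFraud": [0] * 95 + [1] * 5,
})
majority = train[train["isFraud"] == 0]
minority = train[train["isFraud"] == 1]

# Random oversampling: redraw minority rows with replacement until classes match.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["isFraud"].value_counts())
```

Whichever resampling method is used, it should be fit on the training split only, so that evaluation still reflects the real fraud rate.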

In [8]:
transactions["isFraud"].describe()

count    1000000.000000
mean           0.001297
std            0.035991
min            0.000000
25%            0.000000
50%            0.000000
75%            0.000000
max            1.000000
Name: isFraud, dtype: float64

In [13]:
#make balance difference columns for origin and destination 
transactions["origBalanceDiff"] = (transactions["oldbalanceOrg"] - transactions["newbalanceOrig"])

transactions["destBalanceDiff"] = (transactions["newbalanceDest"] - transactions["oldbalanceDest"])

transactions.head()

Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,origBalanceDiff,destBalanceDiff
0,PAYMENT,983.09,C1454812978,36730.24,35747.15,M1491308340,0.0,0.0,0,0,983.09,0.0
1,PAYMENT,55215.25,C1031766358,99414.0,44198.75,M2102868029,0.0,0.0,0,0,55215.25,0.0
2,CASH_IN,220986.01,C1451868666,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0,-220986.01,-220986.0
3,TRANSFER,2357394.75,C458368123,0.0,0.0,C620979654,4202580.45,6559975.19,0,0,0.0,2357394.74
4,CASH_OUT,67990.14,C1098978063,0.0,0.0,C142246322,625317.04,693307.19,0,0,0.0,67990.15


In [14]:
print(transactions["origBalanceDiff"].describe())
print(transactions["destBalanceDiff"].describe())

count    1.000000e+06
mean    -2.139204e+04
std      1.425342e+05
min     -1.609288e+06
25%      0.000000e+00
50%      0.000000e+00
75%      1.012796e+04
max      1.000000e+07
Name: origBalanceDiff, dtype: float64
count    1.000000e+06
mean     1.250032e+05
std      8.518922e+05
min     -3.407192e+06
25%      0.000000e+00
50%      0.000000e+00
75%      1.484347e+05
max      1.003252e+08
Name: destBalanceDiff, dtype: float64


## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer Here
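As one possible illustration of an interaction feature, the sketch below flags rows that are both a fraud-prone type and a high amount. The column name `riskyTypeHighAmount` and the `HIGH_AMOUNT` threshold are hypothetical choices for this example, and the toy rows are partly made up; in practice the threshold might come from a quantile of `amount`.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER"],
    # The last amount is a made-up value to show a low-amount transfer.
    "amount": [983.09, 2357394.75, 67990.14, 500.0],
})

# Hypothetical cutoff for what counts as a "high" amount.
HIGH_AMOUNT = 200_000

# The two types where fraud was observed in Q2.
risky_type = toy["type"].isin(["TRANSFER", "CASH_OUT"])
high_amount = toy["amount"] > HIGH_AMOUNT

# Interaction: 1 only when both conditions hold.
toy["riskyTypeHighAmount"] = (risky_type & high_amount).astype(int)
print(toy)
```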

In [2]:
# write out newly transformed dataset to your folder
...
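A minimal sketch of the write-out step, under stated assumptions: the file name `transactions_transformed.csv` is hypothetical, a temporary directory stands in for the project's `data/` folder so the example runs anywhere, and `toy` is a one-row placeholder for the transformed dataframe.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One-row placeholder for the transformed dataframe.
toy = pd.DataFrame({"type": ["PAYMENT"], "amount": [983.09], "isFraud": [0]})

# Temporary directory standing in for data/; the file name is hypothetical.
out_dir = Path(tempfile.mkdtemp()) / "data"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "transactions_transformed.csv"

# index=False keeps the row index from being written as an extra column.
toy.to_csv(out_path, index=False)

# Read the file back to confirm the columns survived the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())  # → ['type', 'amount', 'isFraud']
```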