# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response & justification, we will ask you to also apply a subsequent data transformation. 

If you state that you will not apply any data transformations for a step, you must **justify** why your dataset or machine-learning model does not require the mentioned preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: Again, this dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.
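As a minimal sketch of the subsetting idea in the note above, the snippet below draws a reproducible 10% sample with `DataFrame.sample()`. The toy frame, the `frac=0.1` fraction, and the `random_state` value are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transactions table (values are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.random(1000) * 1e4,
    "isFraud": rng.integers(0, 2, size=1000),
})

# Draw 10% of the rows; random_state makes the subset reproducible
subset = toy.sample(frac=0.1, random_state=42)
print(subset.shape)
```

Because fraud is rare in the real data, a small random sample may contain very few fraudulent rows, so it is worth checking the class counts of any subset before relying on it.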

In [47]:
import pandas as pd
import numpy as np

In [48]:
# import data 
transactions = pd.read_csv("../data/bank_transactions.csv")

transactions.sample(5)

Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
948597,PAYMENT,39943.85,C1048150973,6074.0,0.0,M1363663294,0.0,0.0,0,0
649719,PAYMENT,14129.03,C358477039,274251.0,260121.97,M834991707,0.0,0.0,0,0
372486,CASH_OUT,11509.97,C881215138,0.0,0.0,C1886350762,17696.78,29206.75,0,0
818529,PAYMENT,13328.84,C334845941,107144.72,93815.88,M922607434,0.0,0.0,0,0
248178,CASH_IN,55680.77,C2037405964,17936232.43,17991913.2,C286095182,9729243.92,9673563.15,0,0


In [49]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   type            1000000 non-null  object 
 1   amount          1000000 non-null  float64
 2   nameOrig        1000000 non-null  object 
 3   oldbalanceOrg   1000000 non-null  float64
 4   newbalanceOrig  1000000 non-null  float64
 5   nameDest        1000000 non-null  object 
 6   oldbalanceDest  1000000 non-null  float64
 7   newbalanceDest  1000000 non-null  float64
 8   isFraud         1000000 non-null  int64  
 9   isFlaggedFraud  1000000 non-null  int64  
dtypes: float64(5), int64(2), object(3)
memory usage: 76.3+ MB


## Q1

Does your dataset contain any missing values or "non-predictive" columns? If so, which adjustments should you make to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

There are no missing values (every column has 1,000,000 non-null entries), but there are non-predictive columns: `nameOrig`, `nameDest`, and `isFlaggedFraud`. I will drop these columns from the dataset.
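The checks behind an answer like this can be sketched as follows: count missing values per column, then look at how many distinct values each column holds. The toy frame below is made up for illustration; on the real data the same two calls would be run on `transactions`.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [100.0, 2500.0, 300.0, 40.0],
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "isFlaggedFraud": [0, 0, 0, 0],
    "isFraud": [0, 1, 0, 0],
})

# Number of missing values in each column
missing = toy.isna().sum()

# Number of distinct values in each column: an identifier column has
# (nearly) one value per row, a constant column has exactly one value
cardinality = toy.nunique()

print(missing)
print(cardinality)
```

A column that is unique per row (such as an account identifier) or almost constant gives a model little to generalize from, which is the usual argument for dropping it.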

In [50]:
transactions.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"], inplace=True)


In [51]:
transactions.sample(5)

Unnamed: 0,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
100338,PAYMENT,15479.9,12059.0,0.0,0.0,0.0,0
264776,PAYMENT,2795.65,0.0,0.0,0.0,0.0,0
830613,PAYMENT,5098.83,63136.0,58037.17,0.0,0.0,0
600457,PAYMENT,1473.53,10242.0,8768.47,0.0,0.0,0
943819,CASH_OUT,165500.7,0.0,0.0,293856.35,459357.06,0


## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Fraudulent transactions are usually of type `TRANSFER` or `CASH_OUT`. Because `type` is a nominal categorical column with no natural ordering, I will one-hot encode it so that a model can use this pattern.
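One way to check this kind of pattern before encoding is to group by `type` and compare the fraud rate and mean amount per group. This is a minimal sketch on a made-up toy frame; the numbers do not come from the real dataset.

```python
import pandas as pd

# Toy data with made-up values; column names mirror the dataset
toy = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "PAYMENT", "TRANSFER",
             "TRANSFER", "CASH_OUT", "CASH_OUT", "CASH_IN"],
    "amount": [10.0, 20.0, 30.0, 1000.0, 3000.0, 500.0, 700.0, 50.0],
    "isFraud": [0, 0, 0, 1, 0, 0, 1, 0],
})

# Fraud rate, mean amount, and row count for each transaction type
summary = toy.groupby("type").agg(
    fraud_rate=("isFraud", "mean"),
    mean_amount=("amount", "mean"),
    n=("isFraud", "size"),
)
print(summary)
```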

In [52]:
from sklearn.preprocessing import OneHotEncoder

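cat_features = ["type"]
# Note: isFraud is the target label, not a feature; it is listed here only
# so that it stays alongside the numeric columns in the final dataframe
num_features = ["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "isFraud"]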

X_cat = transactions[cat_features]
X_num = transactions[num_features]

X_cat.head()

Unnamed: 0,type
0,PAYMENT
1,PAYMENT
2,CASH_IN
3,TRANSFER
4,CASH_OUT


In [53]:
X_num.head()

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,983.09,36730.24,35747.15,0.0,0.0,0
1,55215.25,99414.0,44198.75,0.0,0.0,0
2,220986.01,7773074.97,7994060.98,924031.48,703045.48,0
3,2357394.75,0.0,0.0,4202580.45,6559975.19,0
4,67990.14,0.0,0.0,625317.04,693307.19,0


In [54]:
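# drop="first" removes the first category (CASH_IN), which becomes the
# all-zeros baseline and avoids a redundant column
ohe = OneHotEncoder(drop="first")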
X_cat_full = ohe.fit_transform(X_cat).toarray()

X_cat_full

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 0.],
       ...,
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]], shape=(1000000, 4))

In [55]:
cat_names = ohe.get_feature_names_out(['type'])

encoded_df = pd.DataFrame(X_cat_full, columns=cat_names, index=transactions.index)

encoded_df.head()

Unnamed: 0,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0


In [56]:
full_df = pd.concat([X_num, encoded_df], axis=1)

full_df

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,983.09,36730.24,35747.15,0.00,0.00,0,0.0,0.0,1.0,0.0
1,55215.25,99414.00,44198.75,0.00,0.00,0,0.0,0.0,1.0,0.0
2,220986.01,7773074.97,7994060.98,924031.48,703045.48,0,0.0,0.0,0.0,0.0
3,2357394.75,0.00,0.00,4202580.45,6559975.19,0,0.0,0.0,0.0,1.0
4,67990.14,0.00,0.00,625317.04,693307.19,0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
999995,13606.07,114122.11,100516.04,0.00,0.00,0,0.0,0.0,1.0,0.0
999996,9139.61,0.00,0.00,0.00,0.00,0,0.0,0.0,1.0,0.0
999997,153650.41,50677.00,0.00,0.00,380368.36,0,1.0,0.0,0.0,0.0
999998,163810.52,0.00,0.00,357850.15,521660.67,0,1.0,0.0,0.0,0.0


## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Yes, non-fraudulent transactions far outnumber fraudulent ones. This imbalance can bias the model toward predicting the majority (non-fraudulent) class, so it may detect fraud poorly even while reporting high accuracy. I can use SMOTE to generate synthetic samples of the minority (fraudulent) class. However, this should be applied only to the training set after the train/test split, not in this preprocessing step, so that no synthetic samples leak into the test data.
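The order of operations described above (split first, then oversample only the training set) can be sketched as follows. This is a hand-rolled illustration of the SMOTE idea, interpolating between a minority sample and one of its nearest minority neighbours; `smote_sketch` is a hypothetical helper and the data are random toy values. In practice the `SMOTE` class from the separate imbalanced-learn package would be the more usual choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random base points
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # skip column 0 (self)
    gap = rng.random((n_new, 1))                            # position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Toy imbalanced data: 5% positives (made-up values)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:50] = 1

# Split first, stratifying so both sets keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Oversample the minority class in the training set only
X_min = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - len(X_min))
X_syn = smote_sketch(X_min, n_new)

X_train_bal = np.vstack([X_train, X_syn])
y_train_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])
print(np.bincount(y_train_bal))
```

The test set is left untouched, so evaluation still reflects the real class ratio.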


## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.
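As one possible illustration of what such engineered features could look like, the sketch below builds two columns on a made-up toy frame: a balance discrepancy for the originating account and an amount-by-type interaction. The feature names `orig_balance_gap` and `amount_x_transfer` are hypothetical, and whether they help on the real data is untested here.

```python
import pandas as pd

# Toy rows with made-up values; column names mirror the encoded dataframe
toy = pd.DataFrame({
    "amount": [100.0, 500.0, 2000.0],
    "oldbalanceOrg": [1000.0, 500.0, 0.0],
    "newbalanceOrig": [900.0, 0.0, 0.0],
    "type_TRANSFER": [0.0, 1.0, 1.0],
})

# Zero when the origin balances are consistent with the amount sent;
# non-zero when the recorded balances do not add up
toy["orig_balance_gap"] = toy["oldbalanceOrg"] - toy["amount"] - toy["newbalanceOrig"]

# Interaction term: the amount, but only for TRANSFER rows
toy["amount_x_transfer"] = toy["amount"] * toy["type_TRANSFER"]

print(toy[["orig_balance_gap", "amount_x_transfer"]])
```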



In [57]:
# write out newly transformed dataset to your folder
full_df.to_csv('../data/bank_transactions_cleaned.csv', index=False)