# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response & justification, we will ask you to also apply a subsequent data transformation. 

If you state that you will not apply any data transformations for this step, you must **justify** why your dataset or machine-learning model does not require the mentioned data preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: This dataset is quite large. If you find that some data operations take too long to complete on your machine, simply use the `sample()` method to transform a subset of your data.
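
As a rough illustration of the `sample()` suggestion, the sketch below draws a reproducible random subset from a small made-up table. The `toy` frame and its values are placeholders for illustration only, not the real transactions data.

```python
import pandas as pd

# Toy stand-in for the full transactions table (values are made up).
toy = pd.DataFrame({
    "amount": [10.0, 250.0, 75.5, 1200.0, 33.0, 980.0, 15.0, 640.0, 5.0, 410.0],
    "isFraud": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Keep a random 50% of the rows; random_state makes the draw repeatable.
subset = toy.sample(frac=0.5, random_state=42)
print(len(subset), "of", len(toy), "rows kept")
```

Because fraud rows are rare, a small random subset of the real data may contain very few of them, so it is worth checking `value_counts()` on the subset before relying on it.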

In [32]:
import pandas as pd
import numpy as np
from sklearn.utils import resample

In [20]:
transactions2 = pd.read_csv("../data/bank_transactions.csv")

In [21]:
missing = transactions2.isnull().sum()
print("Missing values per column:\n", missing)

Missing values per column:
 type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64


In [35]:
print(transactions2['isFraud'].value_counts())

isFraud
0    998703
1      1297
Name: count, dtype: int64


## Q1

Does your model contain any missing values or "non-predictive" columns? If so, which adjustments should you take to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

My dataset doesn't contain any missing values, but it does contain two "non-predictive" columns: `nameOrig` and `nameDest`. These are account identifiers rather than properties of a transaction.

To ensure that my model has good predictive capabilities, I will remove the `nameOrig` and `nameDest` columns from the dataset.
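
One possible way to back up the "non-predictive" label is to check how many distinct values each text column has relative to the number of rows. The sketch below does this on a made-up frame; the `toy` data, the `text_cols` list, and the 0.9 cutoff are all assumptions for illustration, not values taken from the real dataset.

```python
import pandas as pd

# Made-up rows that mimic the shape of the transactions table.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M1", "C9", "C8", "M2"],
    "isFraud": [0, 1, 1, 0],
})

# Text columns to inspect (listed by hand for this sketch).
text_cols = ["type", "nameOrig", "nameDest"]

# Share of distinct values per column; a ratio near 1.0 means the column
# behaves like an ID and gives the model little to generalize from.
unique_ratio = toy[text_cols].nunique() / len(toy)

# 0.9 is an arbitrary illustrative cutoff.
id_like = unique_ratio[unique_ratio > 0.9].index.tolist()
print(id_like)
```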

In [27]:
transactions = transactions2.drop(columns=['nameOrig', 'nameDest'])

In [29]:
transactions.columns

Index(['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud', 'isFlaggedFraud'],
      dtype='object')

## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

TRANSFER and CASH_OUT transactions are commonly associated with fraud and involve higher amounts.

Other types, such as PAYMENT, DEBIT, and CASH_IN, are not associated with this kind of fraudulent activity.

I will transform the `type` column with one-hot encoding, which creates a separate column for each transaction type. Machine learning models can't use text directly, so this converts the column into a numeric format while keeping the pattern usable by the model.
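
The claim about transaction types could be checked with a per-type summary before encoding. The following is a minimal sketch on made-up rows; the `toy` values are invented for illustration, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Made-up transactions; only the column names follow the real dataset.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT",
             "PAYMENT", "PAYMENT", "CASH_IN", "DEBIT"],
    "amount": [9000.0, 7000.0, 8000.0, 6000.0, 50.0, 70.0, 300.0, 40.0],
    "isFraud": [1, 0, 1, 0, 0, 0, 0, 0],
})

# Fraud rate and median amount for each transaction type.
summary = toy.groupby("type").agg(
    fraud_rate=("isFraud", "mean"),
    median_amount=("amount", "median"),
)
print(summary)

# One-hot encode: one indicator column per type, original column removed.
encoded = pd.get_dummies(toy, columns=["type"])
print(encoded.columns.tolist())
```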

In [30]:
transactions = pd.get_dummies(transactions, columns=['type'])

In [31]:
transactions.columns

Index(['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud', 'isFlaggedFraud', 'type_CASH_IN',
       'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'type_TRANSFER'],
      dtype='object')

## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

The challenge is that the model will mostly learn from the non-fraud data rather than the fraud data, since there are far more non-fraud rows than fraud rows. Metrics such as accuracy will also be misleading, because a model can score very high while missing most of the fraud.
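
To make the point about misleading metrics concrete, the sketch below scores a hypothetical "model" that always predicts non-fraud, using the class counts printed earlier in this notebook. The always-non-fraud baseline is an illustrative assumption, not a model trained here.

```python
# Class counts taken from the isFraud value_counts() output above.
n_non_fraud = 998_703
n_fraud = 1_297
total = n_non_fraud + n_fraud

# A baseline that labels every row as non-fraud is correct on all
# non-fraud rows and wrong on all fraud rows.
baseline_accuracy = n_non_fraud / total

# Recall on the fraud class: fraud rows caught / all fraud rows.
baseline_recall = 0 / n_fraud

print(f"accuracy={baseline_accuracy:.4%}, fraud recall={baseline_recall:.0%}")
```

The baseline reaches an accuracy above 99.8% while detecting no fraud at all, which is why recall or precision on the fraud class is a more informative measure here.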

My strategy is to balance the dataset by undersampling the non-fraud class so that the fraud and non-fraud counts are equal. There are 1297 fraud transactions and 998703 non-fraud transactions, so the difference is very large. I will undersample the non-fraud transactions down to 1297 to match the fraud count.

In [36]:
fraud = transactions[transactions['isFraud'] == 1]
non_fraud = transactions[transactions['isFraud'] == 0]

# Downsample non-fraud to match fraud count
non_fraud_downsampled = resample(non_fraud, replace=False, n_samples=len(fraud), 
                                 random_state=42)

# Combine to create balanced dataset
balanced_data = pd.concat([fraud, non_fraud_downsampled])

# Shuffle the dataset
balanced_data = balanced_data.sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced_data['isFraud'].value_counts())


isFraud
1    1297
0    1297
Name: count, dtype: int64


## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer Here
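
As one possible direction for this step, the sketch below builds two engineered features on made-up rows. The feature names `orig_balance_error` and `transfer_amount` are hypothetical, and whether they help on the real data is untested here.

```python
import pandas as pd

# Made-up rows using column names from the transformed dataset.
toy = pd.DataFrame({
    "amount": [100.0, 5000.0, 250.0],
    "oldbalanceOrg": [500.0, 5000.0, 0.0],
    "newbalanceOrig": [400.0, 0.0, 0.0],
    "type_TRANSFER": [False, True, False],
})

# Hypothetical feature: gap between the expected and the recorded
# origin balance after the transaction (0 when the books add up).
toy["orig_balance_error"] = (
    toy["oldbalanceOrg"] - toy["amount"] - toy["newbalanceOrig"]
)

# Hypothetical interaction: the amount, kept only for TRANSFER rows.
toy["transfer_amount"] = toy["amount"] * toy["type_TRANSFER"].astype(int)

print(toy[["orig_balance_error", "transfer_amount"]])
```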

In [38]:
# write out newly transformed dataset to your folder
balanced_data.to_csv("../data/transformed_transactions.csv", index=False)