**Task: Clean data to be used for our modeling**

Examples: dropping null values, removing unnecessary columns, removing outliers, and
fixing incorrectly formatted data where needed.

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

In [2]:
fraud = pd.read_csv('../data/fraud_new.csv')

In [3]:
fraud.head()

,Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,780985,39,CASH_OUT,134220.68,C131297852,148689.91,14469.23,C1409230127,1460741.73,1515197.87,0,0
1,258402,14,CASH_IN,147989.15,C135660289,770489.94,918479.1,C426052941,165786.61,0.0,0,0
2,2863601,227,CASH_OUT,22497.02,C1835502862,140.0,0.0,C857314417,0.0,22497.02,0,0
3,2660525,210,CASH_OUT,200096.3,C914107697,147679.0,0.0,C1832844105,0.0,200096.3,0,0
4,1792523,162,PAYMENT,13904.94,C1489506871,0.0,0.0,M1861487502,0.0,0.0,0,0


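We are only interested in columns that could give us insight into fraudulent transactions. Based on our EDA, the columns to drop for now are the account identifiers `nameOrig` and `nameDest`. We also drop `Unnamed: 0`, which carries no transaction information and looks like an index column left over from an earlier CSV export.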

In [12]:
fraud_clean = fraud.drop(['nameOrig','nameDest','Unnamed: 0'], axis = 1)

In [13]:
fraud_clean.head()

,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,39,CASH_OUT,134220.68,148689.91,14469.23,1460741.73,1515197.87,0,0
1,14,CASH_IN,147989.15,770489.94,918479.1,165786.61,0.0,0,0
2,227,CASH_OUT,22497.02,140.0,0.0,0.0,22497.02,0,0
3,210,CASH_OUT,200096.3,147679.0,0.0,0.0,200096.3,0,0
4,162,PAYMENT,13904.94,0.0,0.0,0.0,0.0,0,0


In [14]:
fraud_clean.isnull().sum()

step              0
type              0
amount            0
oldbalanceOrg     0
newbalanceOrig    0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

There are no null values within this dataset to drop. 

In [15]:
fraud_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            20000 non-null  int64  
 1   type            20000 non-null  object 
 2   amount          20000 non-null  float64
 3   oldbalanceOrg   20000 non-null  float64
 4   newbalanceOrig  20000 non-null  float64
 5   oldbalanceDest  20000 non-null  float64
 6   newbalanceDest  20000 non-null  float64
 7   isFraud         20000 non-null  int64  
 8   isFlaggedFraud  20000 non-null  int64  
dtypes: float64(5), int64(3), object(1)
memory usage: 1.4+ MB
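
The task description mentions outliers, which the cells in this notebook do not examine. Below is a minimal sketch of one common approach, the 1.5 × IQR rule, applied to an `amount` column. The helper name `iqr_bounds`, the `amount_outlier` column, and the toy values are assumptions for illustration and are not taken from the fraud dataset. The sketch flags rows instead of dropping them, since in fraud data unusually large amounts may be exactly the signal a model needs.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data standing in for the real `amount` column (hypothetical values).
toy = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 13.0, 12.5, 1000.0]})

low, high = iqr_bounds(toy["amount"])

# Flag rows outside the bounds; nothing is removed at this point.
toy["amount_outlier"] = ~toy["amount"].between(low, high)
print(toy)
```

On the real data the same call would be `iqr_bounds(fraud_clean["amount"])`. Whether flagged rows should be removed, capped, or kept is a modeling decision that should be checked against `isFraud` first.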


In [16]:
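# Save the cleaned dataframe as a new CSV file to be used in the next step.
# index=False keeps the row index out of the file, so reloading it does not
# create another 'Unnamed: 0' column.
fraud_clean.to_csv('../data/fraud_cleaned.csv', index=False)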
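
Whether the row index is written to the CSV affects what the next notebook sees on reload. The sketch below shows the difference using a toy dataframe and in-memory buffers in place of real files. The variable names and values are illustrative assumptions, not part of the fraud dataset.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned data (hypothetical values).
toy = pd.DataFrame({"step": [39, 14], "amount": [134220.68, 147989.15]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'step', 'amount']
print(list(reloaded_clean.columns))       # ['step', 'amount']
```

If a file has already been saved with its index, passing `index_col=0` to `pd.read_csv` is another way to avoid the extra column when loading it.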