<a href="https://colab.research.google.com/github/liuzheqi0723/capstone-fraud-detection/blob/main/models/3_Join_datasets_with_further_data_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Application for real-time fraudulent transaction detection**

#### Load Datasets and Import Libraries

In [25]:
### import libraries ###

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [26]:
# Run this cell the first time you run this notebook.

# Mount your Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:
# Before running the code below, create a shortcut to the shared
# 'Capstone' folder in your own Google Drive.


clean_id = pd.read_csv('/content/drive/MyDrive/Capstone/Data/clean_train_id.csv')
clean_id.name = 'clean_id'
# clean_id.head()

clean_trans = pd.read_csv('/content/drive/MyDrive/Capstone/Data/clean_train_trans.csv')
clean_trans.name = 'clean_trans'
# clean_trans.head()

# Both datasets are now stored in pandas DataFrames

#### Fill NaN values (Part A)

##### A. Fill with string and constant values

**'clean_id' dataset:**
1. All NaNs in columns of data type **'object'** are filled with the string **'NA_'**.
2. All NaNs in columns of data type **'float'** will be filled with the column mean using [sklearn.impute](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer).<br>To prevent data leakage, this is done after separating the training and testing datasets.
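
A minimal sketch of the leak-free mean fill described in item 2, on a toy frame rather than the real `clean_id` data. The column names and values are placeholders, and the split ratio is an assumption; the point is only that the imputer learns its means from the training rows and reuses them on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame standing in for the float columns of clean_id (values are made up).
toy = pd.DataFrame({
    "id_01": [0.0, -5.0, np.nan, -10.0, -5.0, np.nan],
    "id_02": [100.0, np.nan, 300.0, 200.0, np.nan, 400.0],
})

# Split first, so the test rows never influence the fill values.
train, test = train_test_split(toy, test_size=0.33, random_state=0)

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column means from the training rows only.
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
# transform reuses those training means on the test rows.
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)
```

Fitting on the full dataset before splitting would let test-set values shift the means, which is the leakage the note above refers to.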

In [28]:
clean_id.drop(columns=['Unnamed: 0'], inplace=True) # drop index col

In [29]:
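# fill object cols with the string 'NA_'
for col in clean_id.columns:
  if clean_id[col].dtype == 'O':
    # assign back instead of calling fillna(inplace=True) on a column selection,
    # which may act on a temporary copy in recent pandas versions
    clean_id[col] = clean_id[col].fillna('NA_')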

**'clean_trans' dataset:**



In [30]:
clean_trans.drop(columns=['Unnamed: 0'], inplace=True) # drop index col

1. All NaNs in columns of data type **'object'** are filled with the string **'NA_'**.

2. NaNs in columns of data type **'float'** are treated differently depending on the column group.<br><br>
`    2.a Fill with a unique value that never appears in the column.`

  >Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
For example, how many times the payment card associated with an IP and email or address appeared in a 24-hour time range, etc.
>*Because the 'Vxxx' columns are engineered features, a NaN value indicates that the row does not belong to any category of the column, which is itself informative.*

In [31]:
# fill object cols with the string 'NA_' and 'V' cols with -1
for col in clean_trans.columns:
  if clean_trans[col].dtype == 'O': # condition 1
    clean_trans[col] = clean_trans[col].fillna('NA_')
  elif str(col).startswith('V'): # condition 2.a
    clean_trans[col] = clean_trans[col].fillna(-1)

In [32]:
# test whether the fill worked
#clean_trans['card4'].unique()
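
Rule 2.a relies on the fill value never occurring in the column. One way to verify this is a check run before the fill, sketched here on a toy frame. The data are made up, and `SENTINEL`, `v_cols` and `collisions` are illustrative names that do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = -1  # same fill value as used for the 'V' columns

# Toy stand-in for clean_trans; column names mimic the V-prefixed features.
toy = pd.DataFrame({
    "V1": [1.0, np.nan, 0.0],
    "V2": [np.nan, 2.0, 3.0],
    "card1": [1000.0, np.nan, 1500.0],
})

v_cols = [c for c in toy.columns if c.startswith("V")]

# Run BEFORE filling: list any V column where the sentinel already occurs.
collisions = [c for c in v_cols if (toy[c] == SENTINEL).any()]

# Fill only when the sentinel is unused, so -1 unambiguously means "was NaN".
if not collisions:
    toy[v_cols] = toy[v_cols].fillna(SENTINEL)
```

If `collisions` is not empty, a different sentinel would be needed for those columns.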

`    2.b Fill with the mean value of the column using sklearn.impute.`

>card1 - card6: payment card information, such as card type, card category, issuing bank, country, etc.

>addr: both addresses are for the purchaser.
addr1 is the billing region.
addr2 is the billing country.

>dist: distances between (not limited to) billing address, mailing address, zip code, IP address, phone area, etc.

*To prevent data leakage, this is done after separating the training and testing datasets.*<br>

`    2.c Fill with the most frequent value in the column.`
  >C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
  
  >D1-D15: timedelta, such as days between previous transactions, etc.

  *To prevent data leakage, this is done after separating the training and testing datasets.*<br>
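
The fills in 2.b and 2.c could be applied together once the data are split. The sketch below is one possible arrangement and not the notebook's actual implementation: it routes columns by name prefix to a mean or most-frequent imputer with scikit-learn's `ColumnTransformer`. The toy values and the prefix-based selection are assumptions, and only float columns are expected to be passed in.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy training and test splits; column names follow the groups listed above.
X_train = pd.DataFrame({
    "card2": [100.0, np.nan, 300.0, 200.0],
    "dist2": [5.0, 15.0, np.nan, 10.0],
    "C1": [1.0, 1.0, 2.0, np.nan],
    "D1": [0.0, np.nan, 0.0, 30.0],
})
X_test = pd.DataFrame({
    "card2": [np.nan], "dist2": [7.0], "C1": [np.nan], "D1": [np.nan],
})

# Assumed selection rule: card*/addr*/dist* get the mean, C<n>/D<n> get the mode.
mean_cols = [c for c in X_train.columns if c.startswith(("card", "addr", "dist"))]
mode_cols = [c for c in X_train.columns if c[0] in "CD" and c[1:].isdigit()]

# Columns not listed are dropped by default; the toy frame has none.
imputer = ColumnTransformer([
    ("mean", SimpleImputer(strategy="mean"), mean_cols),
    ("mode", SimpleImputer(strategy="most_frequent"), mode_cols),
])

# Output columns follow transformer order: mean columns first, then mode columns.
out_cols = mean_cols + mode_cols
X_train_filled = pd.DataFrame(
    imputer.fit_transform(X_train), columns=out_cols, index=X_train.index
)
X_test_filled = pd.DataFrame(
    imputer.transform(X_test), columns=out_cols, index=X_test.index
)
```

The test row is filled with statistics learned from `X_train` only, so nothing from the test split feeds back into the fill values.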

### Join datasets, Save X_raw and y_raw
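
The cells below use `df_join`, but the cell that builds it does not appear in this notebook. The following is a minimal sketch of how such a join might look, assuming a left join of the transaction table onto the identity table by `TransactionID`. The toy frames `trans` and `ident`, the columns `TransactionAmt` and `DeviceType`, and the join type are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for clean_trans and clean_id (values are made up).
trans = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "isFraud": [0, 1, 0],
    "TransactionAmt": [50.0, 20.0, 75.5],
})
ident = pd.DataFrame({
    "TransactionID": [1, 3],
    "DeviceType": ["mobile", "desktop"],
})

# Assumption: a left join keeps every transaction, including those
# without an identity record. validate= raises if a key is duplicated.
df_join = trans.merge(ident, on="TransactionID", how="left", validate="one_to_one")

# Identity columns are NaN for unmatched transactions and still need a fill.
df_join["DeviceType"] = df_join["DeviceType"].fillna("NA_")
```

A left join introduces new NaNs in the identity columns even when both inputs were filled beforehand, so a second fill pass after the join may be needed.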



In [33]:
# # Columns dist1 and D11 contain only NaN in df_join.
# df_join = df_join.drop(columns=['dist1', 'D11'])

In [34]:
# Define X and y
from sklearn.model_selection import train_test_split

X = df_join.drop(columns=['TransactionID', 'isFraud'], inplace=False) # drop id and label
y = df_join['isFraud']

# X.dtypes.unique()

#### Save 'X_raw.csv' and 'y_raw.csv' before getting dummies


In [35]:
# X.to_csv('X_raw.csv')
# !cp X_raw.csv "drive/MyDrive/Capstone/Data/"

# y.to_csv('y_raw.csv')
# !cp y_raw.csv "drive/MyDrive/Capstone/Data/"
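
The `'Unnamed: 0'` columns dropped earlier typically come from CSV files written with the index included. A sketch of saving with `index=False` so the round trip adds no extra column follows. The temporary directory stands in for the Drive folder, and the toy `X` and `y` are placeholders.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for X and y.
X = pd.DataFrame({"card1": [1000, 1500], "V1": [1.0, -1.0]})
y = pd.Series([0, 1], name="isFraud")

out_dir = tempfile.mkdtemp()  # stand-in for the Drive data folder

# index=False keeps the row index out of the file.
X.to_csv(os.path.join(out_dir, "X_raw.csv"), index=False)
y.to_csv(os.path.join(out_dir, "y_raw.csv"), index=False)

# Reading back yields the original columns, with no 'Unnamed: 0'.
X_back = pd.read_csv(os.path.join(out_dir, "X_raw.csv"))
y_back = pd.read_csv(os.path.join(out_dir, "y_raw.csv"))["isFraud"]
```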

### Get Dummies

Creating dummy variables for the categorical columns did not improve the performance of the ML models, so the following cells are all commented out.
The code is kept for potential future use.

In [36]:
# # define a function to get dummies for the categorical cols.
# def get_dummies(df: pd.DataFrame, cols: list):
#     '''
#     Get the dummy values for the categorical columns.
#     Append them to the input df and drop the original cols.
#
#     df: data.
#     cols: names of the columns that need to be converted.
#     '''
#
#     for col in cols:
#         if col in df.columns:
#             col_dummies = pd.get_dummies(data=df[col])
#             df = pd.concat([df, col_dummies], axis=1)
#             df = df.drop(col, axis=1)
#
#     return df

In [37]:
# # many of the dummy cols are named 'NA_'
# # rename the cols that have duplicated names.

# cols = pd.Series(X.columns)
# dup_count = cols.value_counts()
# for dup in cols[cols.duplicated()].unique():
#     cols[cols[cols == dup].index.values.tolist()] = [dup + str(i) for i in range(1, dup_count[dup]+1)]

# # run it twice, because names created in the first pass can collide with unchanged original col names.
# X.columns = cols
# cols = pd.Series(X.columns)
# dup_count = cols.value_counts()
# for dup in cols[cols.duplicated()].unique():
#     cols[cols[cols == dup].index.values.tolist()] = [dup + str(i) for i in range(1, dup_count[dup]+1)]

# X.columns = cols

In [38]:
# # test if there are still duplicate names in the df
# uni_set = set()
# for col in cols:
#   if col not in uni_set:
#     uni_set.add(col)
#   else:
#     print(col)

# len(cols) - len(uni_set)
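
The duplicate `'NA_'` names arise because each column's dummies are created without a prefix. An alternative sketch, not the approach used in the cells above, passes the categorical columns through the `columns=` argument of `pd.get_dummies`, which prefixes every dummy with its source column name. The toy frame and the `cat_cols` list are assumptions for illustration.

```python
import pandas as pd

# Toy frame: both categorical columns contain the 'NA_' placeholder.
X = pd.DataFrame({
    "card4": ["visa", "NA_", "mastercard"],
    "DeviceType": ["NA_", "mobile", "mobile"],
    "TransactionAmt": [50.0, 20.0, 75.5],
})

cat_cols = ["card4", "DeviceType"]  # hypothetical list of categorical columns

# With columns= given, each dummy is named '<source column>_<value>',
# so 'NA_' from different columns no longer collides.
X_dummies = pd.get_dummies(X, columns=cat_cols)
```

With prefixed names, the two renaming passes and the duplicate check would likely become unnecessary.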