<a href="https://colab.research.google.com/github/liuzheqi0723/capstone-fraud-detection/blob/YaoW/models/3_Join_datasets_with_further_data_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Application for real-time fraudulent transaction detection**

#### Load Datasets and Import Libraries

In [1]:
### import libraries ###

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Run this cell the first time you run this notebook.

# Mount your Google Drive in Colab.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Before running the code below, create a shortcut to the shared 'Capstone'
# folder in your own Google Drive.


clean_id = pd.read_csv('/content/drive/MyDrive/Capstone/Data/clean_train_id.csv')
clean_id.name = 'clean_id'
# clean_id.head()

clean_trans = pd.read_csv('/content/drive/MyDrive/Capstone/Data/clean_train_trans.csv')
clean_trans.name = 'clean_trans'
# clean_trans.head()

# Both datasets are now stored in pandas DataFrames

#### Fill NaN values (Part A)

##### A. Fill with string and constant values

**'clean_id' dataset:**
1. All NaNs in columns of data type **'object'** are filled with the string **'NA_'**.
2. All NaNs in columns of data type **'float'** are filled with the column mean using [sklearn.impute](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer).<br>To prevent data leakage, this is done after separating the training and testing datasets.
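
As a minimal sketch of why the mean imputation waits until after the split: the imputer below learns its mean from the training rows only and then reuses it on the test rows. The toy column values, the 75/25 split and the `random_state` are assumptions made for illustration, not the notebook's actual data or settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a float column with missing values.
toy = pd.DataFrame({"id_02": [1.0, np.nan, 3.0, 5.0, np.nan, 7.0, 9.0, 11.0]})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)                       # the mean is computed from training rows only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)    # test rows reuse the training mean
```

Fitting on the full dataset instead would let information from the test rows influence the fill value, which is the leakage the note above refers to.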

In [6]:
clean_id.drop(columns=['Unnamed: 0'], inplace=True) # drop index col

In [7]:
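# fill object cols with the string 'NA_'
cols = clean_id.columns.to_list()
for col in cols:
  if clean_id[col].dtype == 'O':
    # assign back instead of calling fillna(inplace=True) on a column selection,
    # which pandas may apply to a temporary copy
    clean_id[col] = clean_id[col].fillna('NA_')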

**'clean_trans' dataset:**



In [8]:
clean_trans.drop(columns=['Unnamed: 0'], inplace=True) # drop index col

1. All NaNs in columns of data type **'object'** are filled with the string **'NA_'**.

2. NaNs in columns of data type **'float'** are treated differently depending on the column.<br><br>
`    2.a fill with a unique value that never appears in the column.`

  >Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
For example, how many times the payment card associated with an IP and email or address appeared in a 24-hour time range, etc.
>*Because the 'Vxxx' columns are engineered features, a NaN indicates that the row does not belong to any category of the column, which is itself informative.*
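
A sentinel fill only carries information if the sentinel does not already occur as a real value. The sketch below checks that before filling; the toy frame and the names `SENTINEL` and `v_cols` are hypothetical, and the check is an optional safeguard rather than something the notebook itself performs.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the transaction table.
toy = pd.DataFrame({
    "V1": [0.0, 1.0, np.nan, 2.0],
    "V2": [np.nan, 0.0, 0.0, 3.0],
    "card1": [1000.0, np.nan, 1200.0, 1300.0],
})

SENTINEL = -1
v_cols = [c for c in toy.columns if c.startswith("V")]

# The sentinel is only informative if no real value equals it.
sentinel_already_used = bool((toy[v_cols] == SENTINEL).any().any())

if not sentinel_already_used:
    toy[v_cols] = toy[v_cols].fillna(SENTINEL)
```

Columns outside the `V` group, such as `card1` here, are left untouched so that they can be imputed later.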

In [9]:
# fill object cols with the string 'NA_' and 'V' cols with -1
cols = clean_trans.columns.to_list()
for col in cols:
  if clean_trans[col].dtype == 'O': # condition 1
    clean_trans[col] = clean_trans[col].fillna('NA_')
  elif str(col).startswith('V'): # condition 2.a
    clean_trans[col] = clean_trans[col].fillna(-1)

In [10]:
##test and see if the code works
#clean_trans['card4'].unique()

    2.b fill with **mean** value of the column using sklearn.impute.

>card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.

>addr: both addresses are for purchaser.
addr1 as billing region.
addr2 as billing country.

>dist: distances between (not limited) billing address, mailing address, zip code, IP address, phone area, etc.

*To prevent data leakage, this is done after separating the training and testing datasets.*<br>

    2.c fill with **most frequent value** in the column.
  >C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
  
  >D1-D15: timedelta, such as days between previous transaction, etc.

  *To prevent data leakage, this is done after separating the training and testing datasets.*<br>

### Join datasets, Save X_raw and y_raw



In [11]:
# Join the two DataFrames to get one that is ready for use in scikit-learn.
df_join = clean_id.merge(clean_trans, on='TransactionID')
df_join


| | TransactionID | id_01 | id_02 | id_05 | id_06 | id_11 | id_12 | id_13 | id_15 | id_16 | ... | V330 | V331 | V332 | V333 | V334 | V335 | V336 | V337 | V338 | V339 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2987004 | 0.0 | 70787.0 | NaN | NaN | 100.0 | NotFound | NaN | New | NotFound | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2987008 | -5.0 | 98945.0 | 0.0 | -5.0 | 100.0 | NotFound | 49.0 | New | NotFound | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2987010 | -5.0 | 191631.0 | 0.0 | 0.0 | 100.0 | NotFound | 52.0 | Found | Found | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 3 | 2987011 | -5.0 | 221832.0 | 0.0 | -6.0 | 100.0 | NotFound | 52.0 | New | NotFound | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 4 | 2987016 | 0.0 | 7460.0 | 1.0 | 0.0 | 100.0 | NotFound | NaN | Found | Found | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 134828 | 3577521 | -15.0 | 145955.0 | 0.0 | 0.0 | 100.0 | NotFound | 27.0 | Found | Found | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 134829 | 3577526 | -5.0 | 172059.0 | 1.0 | -5.0 | 100.0 | NotFound | 27.0 | New | NotFound | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 134830 | 3577529 | -20.0 | 632381.0 | -1.0 | -36.0 | 100.0 | NotFound | 27.0 | New | NotFound | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 134831 | 3577531 | -5.0 | 55528.0 | 0.0 | -7.0 | 100.0 | NotFound | 27.0 | Found | Found | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
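
`DataFrame.merge` defaults to an inner join, so transactions that lack a matching record in the other table are dropped silently. The sketch below, on two hypothetical toy frames, shows how `validate` can guard against duplicated keys and how an outer join with `indicator=True` reveals the rows an inner join discards. Whether these checks are needed for the real data is an assumption to verify.

```python
import pandas as pd

# Hypothetical toy versions of the identity and transaction tables.
id_toy = pd.DataFrame({"TransactionID": [1, 2, 4], "id_01": [0.0, -5.0, -10.0]})
trans_toy = pd.DataFrame({"TransactionID": [1, 2, 3], "isFraud": [0, 1, 0]})

joined = id_toy.merge(
    trans_toy,
    on="TransactionID",
    how="inner",
    validate="one_to_one",  # raises MergeError if a TransactionID repeats on either side
)

# An outer join with indicator=True shows which rows the inner join drops.
audit = id_toy.merge(trans_toy, on="TransactionID", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
```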


In [12]:
# # Columns dist1 and D11 contain only NaN in df_join.
# df_join=df_join.drop(columns=['dist1', 'D11'], inplace=False)

In [13]:
# Define X and y
from sklearn.model_selection import train_test_split

X = df_join.drop(columns=['TransactionID', 'isFraud'], inplace=False) # drop id and label
y = df_join['isFraud']

# X.dtypes.unique()
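
Later cells refer to `X_train`, `X_val` and `X_test`, but the split itself does not appear in this notebook. A minimal sketch of one way to produce such a split is given below; the 60/20/20 proportions, the stratification on the label, the `random_state` and the `_toy` names are all assumptions, not the settings actually used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and an imbalanced label.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y_toy = pd.Series([1] * 10 + [0] * 90, name="isFraud")

# First carve out the test set, then split the remainder into train and validation.
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)
```

Stratifying keeps the share of fraud cases roughly equal across the three parts, which matters when the positive class is rare.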

#### Save 'X_raw.csv' and 'y_raw.csv' before getting dummies


In [14]:
# X.to_csv('X_raw.csv')
# !cp X_raw.csv "drive/MyDrive/Capstone/Data/"

# y.to_csv('y_raw.csv')
# !cp y_raw.csv "drive/MyDrive/Capstone/Data/"
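
The `'Unnamed: 0'` columns dropped earlier are what `read_csv` produces when a file was written with its index. A small sketch of the difference, using an in-memory buffer and a hypothetical toy frame instead of the Drive paths:

```python
import io
import pandas as pd

toy = pd.DataFrame({"TransactionID": [2987004, 2987008], "isFraud": [0, 1]})

# By default, to_csv writes the index as an unnamed first column ...
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# ... while index=False leaves it out.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)
```

Passing `index=False` when saving `X_raw.csv` and `y_raw.csv` would avoid having to drop that column after reloading, assuming the row index carries no information.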

### Get Dummies

Getting dummies of the categorical variables did not improve the performance of the ML models, so the following cells are all commented out.
The code is kept for potential future use.

In [18]:
# # define a function to get dummies for the categorical cols.
# def get_dummies(df: pd.DataFrame, cols: list) -> pd.DataFrame:
#     '''
#     Get the dummy values for the categorical columns.
#     Append them to the input df and drop the original cols.

#     df: data.
#     cols: the names of the columns that need to be converted.
#     '''

#     for col in cols:
#         if col in df.columns:
#             col_dummies = pd.get_dummies(data=df[col])
#             df = pd.concat([df, col_dummies], axis=1)
#             df = df.drop(col, axis=1)


#     return df

In [19]:
# # Many of the dummy cols end up named 'NA_'.
# # Rename the cols whose names are duplicated.

# cols = pd.Series(X.columns)
# dup_count = cols.value_counts()
# for dup in cols[cols.duplicated()].unique():
#     cols[cols[cols == dup].index.values.tolist()] = [dup + str(i) for i in range(1, dup_count[dup]+1)]

# # Run it a second time, because names created in the first pass can collide with original col names that were not changed.
# X.columns = cols
# cols = pd.Series(X.columns)
# dup_count = cols.value_counts()
# for dup in cols[cols.duplicated()].unique():
#     cols[cols[cols == dup].index.values.tolist()] = [dup + str(i) for i in range(1, dup_count[dup]+1)]

# X.columns = cols

In [20]:
# # test whether there are still duplicate names in the df
# uni_set = set()
# for col in cols:
#   if col not in uni_set:
#     uni_set.add(col)
#   else:
#     print(col)

# len(cols) - len(uni_set)

0
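
The duplicate `'NA_'` names arise because dummies built from a single column carry only the category value as their name. As an alternative sketch (not the approach taken in the commented-out cells above), calling `pd.get_dummies` on the whole frame with `columns=` prefixes each dummy with its source column, so no renaming pass is needed. The toy frame below is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "id_12": ["Found", "NA_", "NotFound"],
    "id_15": ["New", "NA_", "Found"],
    "card1": [1000, 1200, 1300],
})

# With columns=..., each dummy is named '<source column>_<category>',
# so 'NA_' from id_12 and 'NA_' from id_15 get distinct names.
encoded = pd.get_dummies(toy, columns=["id_12", "id_15"])
```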

#### Fill NaN values (Part B)
Fill with mean and most frequent values. Fit and transform using a pipeline.

    1.b fill with **mean** value of the column
>numerical 'id_XX'

    2.b fill with **mean** value of the column using sklearn.impute.

>card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.

>addr: both addresses are for purchaser.
addr1 as billing region.
addr2 as billing country.

>dist: distances between (not limited) billing address, mailing address, zip code, IP address, phone area, etc.
<br>

    2.c fill with **most frequent value** in the column.
  >C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
  
  >D1-D15: timedelta, such as days between previous transaction, etc.
<br>

In [28]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Step 1:
# find the cols that contain NaNs.
X_null = X.isnull().sum(axis=0).to_frame() # count NaNs in every col.
X_null.rename(columns={0: '#_Nans'}, inplace=True) # rename the count col.
X_NanCols = X_null[X_null['#_Nans']>0].index # Index with the names of all cols that contain NaN.

X_fullCols = X_null[X_null['#_Nans']==0].index # cols without any NaN.

# make lists indicating which strategy will be used to impute each col.
cols_fill_mean = []
cols_fill_freq = []

for col in X_NanCols:
  if str(col).startswith('C'): # cols C1-C14
    cols_fill_freq.append(col)
  elif str(col).startswith('D'): # cols D1-D15; the 'Device...' cols were filled earlier, so they have no NaN here.
    cols_fill_freq.append(col)
  else:
    cols_fill_mean.append(col) # remaining cols with NaN, e.g. numerical id_XX; cols filled earlier no longer contain NaN.

# keep the complete cols in the following processing as well
cols_fill_freq.extend(X_fullCols.to_list())

In [29]:
# cols_fill_freq

In [30]:
# Step 2:
# instantiate the imputers, each within a pipeline
# imputer that fills with the mean
imp_mean = Pipeline(steps=[('imputer', SimpleImputer(missing_values=np.nan, strategy='mean'))])

# imputer that fills with the most frequent value
imp_freq = Pipeline(steps=[('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent'))])


# Step 3:
# combine the column lists and the imputers in a ColumnTransformer.
imp_preprocessor = ColumnTransformer(transformers=[('imp_mean', imp_mean, cols_fill_mean),
                                                   ('imp_freq', imp_freq, cols_fill_freq)])
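
A toy run can make the behaviour of such a preprocessor easier to see. The sketch below uses a hypothetical three-column frame and, as a variation on the cell above, `remainder="passthrough"` to carry complete columns through instead of adding them to the most-frequent list. It also shows that the result is a NumPy array whose columns follow the transformer order rather than the original column order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: one count col, one numeric col, one complete object col.
toy = pd.DataFrame({
    "C1": [1.0, 1.0, np.nan, 2.0],
    "addr1": [100.0, np.nan, 300.0, 200.0],
    "card4": ["visa", "visa", "NA_", "mastercard"],
})

mean_cols = ["addr1"]
freq_cols = ["C1"]

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("imp_mean", SimpleImputer(strategy="mean"), mean_cols),
        ("imp_freq", SimpleImputer(strategy="most_frequent"), freq_cols),
    ],
    remainder="passthrough",  # keep columns that need no imputation
)

filled = toy_preprocessor.fit_transform(toy)

# Output columns follow transformer order, then the passthrough remainder.
out_cols = mean_cols + freq_cols + ["card4"]
filled_df = pd.DataFrame(filled, columns=out_cols, index=toy.index)
```

Rebuilding a DataFrame with explicit column names, as in the last line, keeps later steps from relying on positional indices.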

In [31]:
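# Step 4:
# fit 'imp_preprocessor' on the training set only, then transform the datasets.
# X_train and X_test are assumed to come from a train/test split of X that is not shown in this notebook.
imp_preprocessor.fit(X_train)

X_train = imp_preprocessor.transform(X_train)
X_test = imp_preprocessor.transform(X_test)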

ValueError: ignored
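
Colab shows only `ValueError: ignored`, so the actual cause is not recorded here. One situation that produces a `ValueError` with this kind of setup, offered as a guess rather than a diagnosis, is re-running the cell after `X_train` has already been replaced by the NumPy array returned from `transform`: a `ColumnTransformer` configured with column names cannot be fitted on an array. The toy example below reproduces that behaviour with hypothetical data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"addr1": [100.0, np.nan, 300.0]})
ct = ColumnTransformer([("imp_mean", SimpleImputer(strategy="mean"), ["addr1"])])

as_array = ct.fit_transform(toy)   # works: columns are looked up by name on a DataFrame

try:
    ct.fit(as_array)               # string column names cannot be resolved on an ndarray
    refit_error = None
except ValueError as err:
    refit_error = err
```

Assigning the transformed data to new variable names, instead of overwriting `X_train`, would make the cell safe to re-run if this is indeed what happened.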

In [None]:
X_val.shape

In [None]:
X_train

In [None]:
X_train[1]


In [None]:
# # output
# X_train.tofile('X_train')
# !cp X_train "drive/MyDrive/Capstone/Data/"

# X_val.tofile('X_val')
# !cp X_val "drive/MyDrive/Capstone/Data/"

# X_test.tofile('X_test')
# !cp X_test "drive/MyDrive/Capstone/Data/"




In [None]:
# y_train.to_csv('y_train.csv')
# !cp y_train.csv "drive/MyDrive/Capstone/Data/"

# y_val.to_csv('y_val.csv')
# !cp y_val.csv "drive/MyDrive/Capstone/Data/"

# y_test.to_csv('y_test.csv')
# !cp y_test.csv "drive/MyDrive/Capstone/Data/"