# Question 4: Model

Fraud is a problem for any bank. It can take many forms, from a single stolen credit card, to large batches of stolen credit card numbers being used on the web, to a mass compromise of credit card numbers stolen from a merchant via tools such as credit card skimming devices.

Each of the transactions in the dataset has a field called isFraud. Please build a predictive model to determine whether a given transaction will be fraudulent or not. Use as much of the data as you like (or all of it).

Provide an estimate of performance using an appropriate sample, and show your work.

Please explain your methodology (modeling algorithm/method used and why, what features/data you found useful, what questions you have, and what you would do next with more time).

In [1]:
import os, pickle
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler, OrdinalEncoder, PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.under_sampling import RandomUnderSampler

## Data selection and Preprocessing

### 1. New features
I added two features, `accountAge` and `falseCVVDigits`:
1. `accountAge`: the number of days between `accountOpenDate` and `transactionDateTime`.
2. `falseCVVDigits`: the number of digits that differ between `enteredCVV` and `cardCVV`.
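
The two derived features can be sketched as follows. This is a minimal illustration on made-up rows, not the code that produced `transactions_eda.csv`; the helper `count_mismatched_digits` is a hypothetical name, and treating the CVVs as strings compared position by position is an assumption about how "mismatched digits" was counted.

```python
import pandas as pd

# Toy rows: the column names follow the dataset, the values are made up.
toy = pd.DataFrame({
    "accountOpenDate": ["2015-03-14", "2016-01-01"],
    "transactionDateTime": ["2016-08-13T14:27:32", "2016-01-11T09:00:00"],
    "cardCVV": ["414", "486"],
    "enteredCVV": ["414", "468"],
})

# accountAge: whole days between account opening and the transaction date.
opened = pd.to_datetime(toy["accountOpenDate"])
transacted = pd.to_datetime(toy["transactionDateTime"])
toy["accountAge"] = (transacted.dt.normalize() - opened).dt.days


def count_mismatched_digits(entered, actual) -> int:
    """Count positions where the digits differ, plus any length difference."""
    entered, actual = str(entered), str(actual)
    mismatches = sum(a != b for a, b in zip(entered, actual))
    return mismatches + abs(len(entered) - len(actual))


# falseCVVDigits: 0 means the entered CVV matches the card's CVV exactly.
toy["falseCVVDigits"] = [
    count_mismatched_digits(entered, actual)
    for entered, actual in zip(toy["enteredCVV"], toy["cardCVV"])
]

print(toy[["accountAge", "falseCVVDigits"]])
```

If the CVV columns are read as integers, leading zeros are lost, so reading them as strings (for example via the `dtype` argument of `pd.read_csv`) may be needed for the digit comparison to be meaningful.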


In [5]:
df = pd.read_csv('transactions_eda.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 786363 entries, 0 to 786362
Data columns (total 30 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   accountNumber             786363 non-null  int64  
 1   customerId                786363 non-null  int64  
 2   creditLimit               786363 non-null  int64  
 3   availableMoney            786363 non-null  float64
 4   transactionDateTime       786363 non-null  object 
 5   transactionAmount         786363 non-null  float64
 6   merchantName              786363 non-null  object 
 7   acqCountry                786363 non-null  object 
 8   merchantCountryCode       786363 non-null  object 
 9   posEntryMode              786363 non-null  int64  
 10  posConditionCode          786363 non-null  int64  
 11  merchantCategoryCode      786363 non-null  object 
 12  currentExpDate            786363 non-null  object 
 13  accountOpenDate           786363 non-null  o

### 2. Balance the dataset 
1. Drop columns that are not useful for prediction 
2. Random Under Sampling for balancing classes
    
    Because the classes are heavily imbalanced, I used random undersampling (`RandomUnderSampler`) to shrink the non-fraud class to the same size as the fraud class. The resulting dataset has shape `24834 x 15`, that is, 12,417 rows per class.
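
To make the idea concrete, random undersampling can be approximated in plain pandas. The sketch below is an illustration on toy data, not the `imblearn` implementation; `undersample_majority` is a hypothetical helper name.

```python
import pandas as pd


def undersample_majority(frame: pd.DataFrame, target: str, seed: int = 0) -> pd.DataFrame:
    """Randomly drop rows so every class has as many rows as the rarest class."""
    n_min = frame[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in frame.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 5 fraud rows and 95 non-fraud rows (values are made up).
toy = pd.DataFrame({"transactionAmount": range(100), "isFraud": [1] * 5 + [0] * 95})
balanced = undersample_majority(toy, "isFraud")
print(balanced["isFraud"].value_counts().to_dict())
```

The trade-off is that most of the majority class is discarded, which keeps training fast but throws away information about legitimate transactions.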

In [6]:
account = df['accountNumber']

# Drop identifiers, raw date/time strings, CVV fields, and columns that are
# already summarized by derived features. Each column is listed once.
drop_cols = ['customerId', 'accountNumber', 'merchantName',
             'transactionDateTime', 'accountOpenDate',
             'currentExpDate', 'dateOfLastAddressChange',
             'datetime', 'timestamp', 'logTransactionAmount',
             'cardCVV', 'enteredCVV', 'cardLast4Digits',
             'correctCVV']

df.drop(drop_cols, inplace=True, axis=1)


# Convert the boolean flags (and the target) to 0/1 integers.
bool_cols = ['cardPresent', 'expirationDateKeyInMatch', 'isFraud', 'isMultiSwipe']
df[bool_cols] = df[bool_cols].astype('int64')

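# rus = RandomUnderSampler(sampling_strategy=0.5)
# Default strategy: shrink the majority class to the minority-class size.
# random_state fixes the sampled rows so reruns are reproducible.
rus = RandomUnderSampler(random_state=42)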

y = df['isFraud']
df.drop('isFraud', inplace=True, axis=1)
new_x, new_y = rus.fit_resample(df, y)
print(f'Before Random Under Sampling: {df.shape}')
print(f'After Random Under Sampling: {new_x.shape}')

Before Random Under Sampling: (786363, 15)
After Random Under Sampling: (24834, 15)
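
Because the undersampling here happens before the train-test split in the next section, the held-out test set ends up balanced as well, so metrics measured on it may not reflect the fraud rate a deployed model would see. One possible alternative, sketched on made-up data rather than the real dataset, is to split first and undersample only the training portion. The column names mirror the dataset; everything else is an assumption for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 2% fraud rate (values are made up).
toy = pd.DataFrame({"transactionAmount": range(1000), "isFraud": [1] * 20 + [0] * 980})

# Split first, stratified so both parts keep the original fraud rate.
train, test = train_test_split(
    toy, test_size=0.3, stratify=toy["isFraud"], random_state=0
)

# Undersample only the training part; the test part stays untouched.
n_fraud = int(train["isFraud"].sum())
train_balanced = pd.concat([
    train[train["isFraud"] == 1],
    train[train["isFraud"] == 0].sample(n=n_fraud, random_state=0),
])

print(f"train fraud rate: {train_balanced['isFraud'].mean():.2f}")
print(f"test fraud rate:  {test['isFraud'].mean():.2f}")
```

With this ordering, precision and recall on the test part are measured at the original class ratio, which is usually the more honest performance estimate for a rare-event problem.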


### 3. Train-test split and data normalization

1. Divide the dataset into train and test sets, with 70% for training and 30% for testing. 
2. Divide the columns into categorical, ordinal, and numerical sets, and normalize each set with its own pipeline.

    2.1 Categorical set: columns that have categorical values. Missing values are imputed with the most frequent value, the categories are converted to integer codes, and `MinMaxScaler` rescales the codes to the range [0, 1].

    2.2 Ordinal set: columns that have ordered numerical values, for example `falseCVVDigits`. These are ordinal-encoded and rescaled to [0, 1].
    
    2.3 Numerical set: columns that have continuous numerical features. A Yeo-Johnson power transform is applied first, and `MinMaxScaler` then rescales the result to [0, 1].
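
To illustrate what the categorical pipeline does to a single column, here is a small sketch on made-up country codes. It assumes scikit-learn's `OrdinalEncoder` default behavior of sorting string categories before assigning codes.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy column: the name follows the dataset, the values are made up.
toy = pd.DataFrame({"merchantCountryCode": ["US", "CAN", "MEX", "US", "PR"]})

cat_pipe = Pipeline([
    ("encode", OrdinalEncoder()),  # sorted categories: CAN=0, MEX=1, PR=2, US=3
    ("scale", MinMaxScaler()),     # rescale the integer codes to [0, 1]
])

encoded = cat_pipe.fit_transform(toy)
print(encoded.ravel())
```

Note that the resulting order is alphabetical and carries no real meaning. Tree-based models tend to tolerate this, while linear or distance-based models may be better served by `OneHotEncoder`, which is already imported at the top of the notebook.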

In [7]:
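# Stratify so both splits keep the 50/50 class balance; fix the seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    new_x, new_y, test_size=0.3, stratify=new_y, random_state=42)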

df.info()

cat_cols = ['merchantCountryCode',
            'merchantCategoryCode',
            'posConditionCode',
            'posEntryMode',
            'transactionType',
            'acqCountry',
            'cardPresent',
            'expirationDateKeyInMatch',
            'isMultiSwipe']
ord_cols = ['falseCVVDigits']
num_cols = ['creditLimit', 
            'availableMoney', 
            'transactionAmount',
            'currentBalance',
            'accountAge']
pipeline = ColumnTransformer([
    ('cat_pipe', 
    Pipeline([('cat_imputer', SimpleImputer(strategy='most_frequent')),
              ('label_enc', OrdinalEncoder()),
              ('scaler', MinMaxScaler())
              ]),
    cat_cols,
    ),
    ('ord_pipe', 
    Pipeline([('ord_enc', OrdinalEncoder()),
              ('scaler', MinMaxScaler())]),
    ord_cols,
    ),
    ('num_pipe',
    Pipeline([('power_trans', PowerTransformer(method='yeo-johnson', standardize=False)),
              ('scaler', MinMaxScaler())]),
    num_cols)
], remainder='passthrough')


x_train = pipeline.fit_transform(x_train)
x_test = pipeline.transform(x_test)

print(x_train.shape)

data = {'x_train': x_train, 'x_test': x_test, 'y_train': y_train, 'y_test': y_test}

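# Create the output directory if needed; open() does not create it.
os.makedirs('results', exist_ok=True)
with open('results/preprocessed_data', 'wb') as file:
    pickle.dump(data, file, protocol=4)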


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 786363 entries, 0 to 786362
Data columns (total 15 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   creditLimit               786363 non-null  int64  
 1   availableMoney            786363 non-null  float64
 2   transactionAmount         786363 non-null  float64
 3   acqCountry                786363 non-null  object 
 4   merchantCountryCode       786363 non-null  object 
 5   posEntryMode              786363 non-null  int64  
 6   posConditionCode          786363 non-null  int64  
 7   merchantCategoryCode      786363 non-null  object 
 8   transactionType           786363 non-null  object 
 9   currentBalance            786363 non-null  float64
 10  cardPresent               786363 non-null  int64  
 11  expirationDateKeyInMatch  786363 non-null  int64  
 12  isMultiSwipe              786363 non-null  int64  
 13  accountAge                786363 non-null  i