# Preprocessing

This notebook walks through generating test and training data sets.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
sys.path.append("../src")
import utils

rng = np.random.default_rng(10)

## Preparing User and Card Data

Before preparing the transaction data, it is necessary to assemble the information about cards and users that will be joined to each transaction record. So that the later join with the transaction data can be done in one step, the following cells create a combined data frame of cards and their owners.

### Users

In [3]:
user_data = pd.read_csv(
    utils.prepend_dir('users_all.csv'),
    index_col=0
)
user_data['birthdate'] = pd.to_datetime({
    'year': user_data.birth_year, 
    'month': user_data.birth_month, 
    'day':1
})

N_users = user_data.shape[0]

print(user_data.info())
user_cols = ['birthdate', 'retirement_age', 'gender' ,'city', 'state', 'zipcode', 'latitude', 'longitude', 'per_capita_income_zipcode', 'yearly_income_person', 'total_debt', 'fico_score', 'num_credit_cards']

<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, 0 to 1999
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   person                     2000 non-null   object        
 1   current_age                2000 non-null   int64         
 2   retirement_age             2000 non-null   int64         
 3   birth_year                 2000 non-null   int64         
 4   birth_month                2000 non-null   int64         
 5   gender                     2000 non-null   object        
 6   address                    2000 non-null   object        
 7   apartment                  528 non-null    float64       
 8   city                       2000 non-null   object        
 9   state                      2000 non-null   object        
 10  zipcode                    2000 non-null   int64         
 11  latitude                   2000 non-null   float64       
 12  longitude  

### Cards

In [4]:
card_data = pd.read_csv(
    utils.prepend_dir('cards_all.csv'), 
    index_col=0, 
    parse_dates=['expires', 'acct_open_date']
)
print(card_data.info())
card_cols = ['user', 'card_index', 'card_brand', 'card_type', 'expires', 'has_chip', 'cards_issued', 'credit_limit', 'acct_open_date']

<class 'pandas.core.frame.DataFrame'>
Index: 6146 entries, 0 to 6145
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   user                   6146 non-null   int64         
 1   card_index             6146 non-null   int64         
 2   card_brand             6146 non-null   object        
 3   card_type              6146 non-null   object        
 4   card_number            6146 non-null   int64         
 5   expires                6146 non-null   datetime64[ns]
 6   cvv                    6146 non-null   int64         
 7   has_chip               6146 non-null   bool          
 8   cards_issued           6146 non-null   int64         
 9   credit_limit           6146 non-null   float64       
 10  acct_open_date         6146 non-null   datetime64[ns]
 11  year_pin_last_changed  6146 non-null   int64         
dtypes: bool(1), datetime64[ns](2), float64(1), int64(6), object(2)
memo

### Merging users and cards data

In [5]:
cards_users = (card_data[card_cols]
    .merge(user_data[user_cols],
        left_on='user',
        right_index=True)
    .rename({'card_index':'card'}, axis=1)
)

print(cards_users.info())

# there's enough data that memory is a constraint
del user_data
del card_data

<class 'pandas.core.frame.DataFrame'>
Index: 6146 entries, 0 to 6145
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   user                       6146 non-null   int64         
 1   card                       6146 non-null   int64         
 2   card_brand                 6146 non-null   object        
 3   card_type                  6146 non-null   object        
 4   expires                    6146 non-null   datetime64[ns]
 5   has_chip                   6146 non-null   bool          
 6   cards_issued               6146 non-null   int64         
 7   credit_limit               6146 non-null   float64       
 8   acct_open_date             6146 non-null   datetime64[ns]
 9   birthdate                  6146 non-null   datetime64[ns]
 10  retirement_age             6146 non-null   int64         
 11  gender                     6146 non-null   object        
 12  city       
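The join above can be illustrated on toy data. This is a minimal sketch: the frames `cards` and `users` and their values are made up, and the `validate` argument is an addition that the notebook itself does not use. It asks pandas to raise an error if a user ID appears more than once in the user index, which would otherwise silently duplicate card rows.

```python
import pandas as pd

# Toy stand-ins for the card and user tables (values are made up).
cards = pd.DataFrame({
    "user": [0, 0, 1],
    "card_index": [0, 1, 0],
    "credit_limit": [5000.0, 1200.0, 9900.0],
})
users = pd.DataFrame(
    {"fico_score": [710, 650]},
    index=pd.Index([0, 1], name="user_id"),
)

# Many cards map to one user; validate raises MergeError if that is violated.
merged = (
    cards.merge(users, left_on="user", right_index=True, validate="many_to_one")
    .rename(columns={"card_index": "card"})
)

# Every card row should survive the join exactly once.
assert len(merged) == len(cards)
print(merged)
```

Checking that the row count is unchanged is a cheap way to confirm that no card was dropped for lack of a matching user.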

## Testing and training data

Because the data is large and the target variable is highly unbalanced, I will create three distinct data sets. Since there are user-level features, I split on users rather than on transactions to avoid contamination between the sets.
1. A test set consisting of 10% of the users (i.e. 200). The rest of the data represents the training pool, though it is too much data to work with on my personal laptop.
1. An unbalanced training set consisting of 400 users in the training pool.
1. A balanced training set consisting of all positive cases of fraud in the training pool and a randomly selected matching number of non-fraud transactions from the training pool.

### Randomly selecting users for testing and unbalanced training data

The code below randomly selects 400 user IDs for the unbalanced training data and 200 for the holdback testing data. To ensure that the two sets are disjoint, this is done with a single random selection of 600 user IDs without replacement.

In [6]:
seed = 111
# sample 600 of the 2,000 users (400 for training, 200 for testing)
rng = np.random.default_rng(seed)

testing_size = 200
unbalanced_training_size = 400
subset_size = testing_size + unbalanced_training_size
# randomly select the unbalanced training and testing users
subset = rng.choice(N_users, size=subset_size, replace=False)

# Testing data for 200 users and a training subset of 400 users. No need to randomize again because order in the subset array is already random.
training_subset = subset[:unbalanced_training_size]
testing_subset = subset[-testing_size:]
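The disjointness claim can be checked directly. The following is a small sketch that repeats the draw with the same seed and sizes; the user count of 2,000 is taken from the `user_data.info()` output, and the name `overlap` is introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(111)
n_users = 2000  # number of users reported by user_data.info()

# One draw without replacement, then slice from opposite ends of the array.
subset = rng.choice(n_users, size=600, replace=False)
training_subset = subset[:400]
testing_subset = subset[-200:]

# No user ID may appear in both sets.
overlap = np.intersect1d(training_subset, testing_subset)
assert overlap.size == 0

# Sampling without replacement means all 600 IDs are distinct.
assert np.unique(subset).size == 600
```

Because the 600 IDs are drawn without replacement and the two slices do not share positions, the sets cannot overlap.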

Because I need to know the overall fraud rate in the entire training pool to make the balanced data set, the construction of the data sets proceeds in two steps.

The first step is to make separate CSVs for:
1. The testing data
1. All of the positive cases in the training pool
1. All of the negative cases in the training pool

So that it can be reused elsewhere, the code for this step lives in a function in `src/utils.py`.

In [7]:
pos_filename, neg_filename, rate = utils.clean_split_tx(cards_users, testing_subset)

Data already present. Skipping download.


The second step is to construct and save the two training data sets:
1. Balanced training data (based on the fraud **rate** rather than matching numbers, so there is a small difference in the quantities)
1. Complete training data for the 400 users selected above

At the same time, the function records the fraud rate for each `mcc` code in the training pool and saves the results to a separate CSV file. These rates will be used to transform the `mcc` codes into a usable feature.

Again, the code is in `utils.py`.

In [8]:
subset_filename, balanced_filename, mcc_rates_filename = utils.make_training_sets(training_subset, pos_filename, neg_filename, rate)
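The internals of `utils.make_training_sets` are not shown in this notebook. One way to balance by rate, offered only as an assumption about how such a step could work, is to keep each non-fraud row independently with probability `rate / (1 - rate)`, so that the expected number of retained negatives equals the number of positives. The helper `downsample_negatives` and the toy data below are hypothetical.

```python
import numpy as np
import pandas as pd


def downsample_negatives(neg_df, rate, rng):
    """Keep each negative row independently with probability rate / (1 - rate).

    Hypothetical helper: not the implementation in utils.py.
    """
    keep_prob = rate / (1.0 - rate)
    mask = rng.random(len(neg_df)) < keep_prob
    return neg_df[mask]


rng = np.random.default_rng(0)

# Toy pool with a 1% fraud rate.
n_pos, n_neg = 1_000, 99_000
pos = pd.DataFrame({"is_fraud": np.ones(n_pos, dtype=bool)})
neg = pd.DataFrame({"is_fraud": np.zeros(n_neg, dtype=bool)})
rate = n_pos / (n_pos + n_neg)

kept_neg = downsample_negatives(neg, rate, rng)
balanced = pd.concat([pos, kept_neg])

# The two classes end up close in size, but not exactly equal.
print(balanced["is_fraud"].value_counts())
```

Since each row is decided independently, a scheme like this can run chunk by chunk without holding all negatives in memory, at the cost of a small mismatch between the class counts.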

At this point, the data is saved and ready for analysis. To confirm the results and demonstrate the final preparatory steps for modeling, the balanced data is read into a data frame below:

In [9]:
balanced_df = pd.read_csv(balanced_filename, index_col=0)
print(balanced_df.is_fraud.value_counts())

is_fraud
True     13850
False    13749
Name: count, dtype: int64


## The transformation pipeline

Since different methods employed in the modeling stage will require different transformations, these will be handled in pipelines. To prepare for this, I wrote two scikit-learn transformer classes:
1. `MCCRates`: converts the `mcc` codes to the fraud rates observed in the entire training pool, as recorded by the function above.
1. `MakeDummies`: combines all of the needed operations for dealing with the other categories into a single step, namely converting the multicategory `errors` column into separate indicators and making dummies for the transaction type, card brand, and card type columns.

First, we can look at the two steps in isolation.

### Converting MCC codes to rates

The following cell shows the results of replacing MCC codes with fraud rates:

In [10]:
mcc_rates = pd.read_csv(mcc_rates_filename, index_col=0)
print(f'Original data: {len(balanced_df.mcc.unique())} distinct MCC codes')
print('After merge, the mcc_fraud_rate variable is:')
balanced_df.merge(mcc_rates, how='left', left_on='mcc', right_index=True).mcc_fraud_rate.describe()

Original data: 109 distinct MCC codes
After merge, the mcc_fraud_rate variable is:


count    27599.000000
mean         0.009009
std          0.037777
min          0.000000
25%          0.000334
50%          0.001395
75%          0.005829
max          0.468439
Name: mcc_fraud_rate, dtype: float64
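As a minimal sketch on toy data, per-code fraud rates could be computed with a group-by and attached with the same left merge. The fallback for codes that never appear in the training data (here, the overall training fraud rate) is an assumption; the notebook does not say how `MCCRates` handles unseen codes.

```python
import pandas as pd

# Toy training data (values are made up).
train = pd.DataFrame({
    "mcc": [5411, 5411, 5411, 5812, 5812, 4829],
    "is_fraud": [False, False, True, False, False, True],
})

# The mean of a boolean column is the fraction of True values per code.
mcc_rates = (
    train.groupby("mcc")["is_fraud"]
    .mean()
    .rename("mcc_fraud_rate")
    .to_frame()
)
overall_rate = train["is_fraud"].mean()

# Code 7995 is absent from the training data, so the merge yields NaN for it.
test = pd.DataFrame({"mcc": [5411, 5812, 7995]})
test_rates = test.merge(mcc_rates, how="left", left_on="mcc", right_index=True)

# Assumed fallback: use the overall training rate for unseen codes.
test_rates["mcc_fraud_rate"] = test_rates["mcc_fraud_rate"].fillna(overall_rate)
print(test_rates)
```

Computing the rates only from training rows keeps information about the testing labels out of the feature.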

### Errors

The `errors` column is NA when no error occurred; otherwise it lists the error or errors, with multiple errors separated by commas.

In [11]:
print(balanced_df.errors.value_counts())

errors
Insufficient Balance                    327
Bad PIN                                 170
Bad CVV                                 129
Bad Expiration                           56
Technical Glitch                         51
Bad Card Number                          48
Bad PIN,Insufficient Balance              6
Bad Zipcode                               2
Bad Expiration,Technical Glitch           1
Bad Card Number,Insufficient Balance      1
Bad Card Number,Technical Glitch          1
Bad CVV,Technical Glitch                  1
Bad Expiration,Bad CVV                    1
Name: count, dtype: int64


These can be converted into indicator features for each error. So, for example, "Bad Card Number,Insufficient Balance" is converted into `True` for the two columns "Bad Card Number" and "Insufficient Balance."

In [12]:
dummy_df, cats = utils.convert_multicat(balanced_df, 'errors')
print(dummy_df[cats].sum())

Insufficient Balance    334
Bad Zipcode               2
Bad Expiration           58
Technical Glitch         54
Bad Card Number          50
Bad PIN                 176
Bad CVV                 131
dtype: int64
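The body of `utils.convert_multicat` is not shown here. A comparable result can be sketched with the pandas method `Series.str.get_dummies`, which splits each string on a separator and builds one indicator column per distinct value. The toy series below is made up, and this is not necessarily how the helper is implemented.

```python
import pandas as pd

errors = pd.Series(
    [
        "Bad PIN",
        None,  # no error on this transaction
        "Bad Card Number,Insufficient Balance",
        "Bad PIN,Insufficient Balance",
    ],
    name="errors",
)

# One indicator column per distinct error; rows with no error are all False.
indicators = errors.str.get_dummies(sep=",").astype(bool)

# Column sums count how often each error occurs, alone or in combination.
print(indicators.sum())
```

As in the output above, the per-error totals count combined entries under each of their component errors.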


### The full pipeline

Now we can look at the full transform pipeline in action. Note that I don't save this data to CSV, because I think it is easier to recreate it, or customize it in different ways, using pipelines in the modeling process.

In [13]:
transform_pipeline = make_pipeline(
    utils.MCCRates(),
    utils.MakeDummies('errors')
)

transformed_df = transform_pipeline.fit_transform(balanced_df)
transformed_df.head(3).T

Unnamed: 0,7977,7979,7987
user,5,5,5
card,0,0,0
amount,11.45,471.0,398.93
is_fraud,True,True,True
has_chip,True,True,True
cards_issued,2,2,2
credit_limit,9900.0,9900.0,9900.0
latitude,41.55,41.55,41.55
longitude,-90.6,-90.6,-90.6
per_capita_income_zipcode,20599.0,20599.0,20599.0


The same two transformations (`mcc` codes and dummies) can also be performed on the testing data. Because the `mcc` rates and the categories are learned from the training data only, this prevents contamination and ensures that the testing data has the same features as the training data.

In [14]:
testing_data = pd.read_csv(utils.prepend_dir('tx_test.csv'), index_col=0)
testing_data_transformed = transform_pipeline.transform(testing_data)
print(testing_data_transformed['mcc_fraud_rate'].describe())
print(testing_data_transformed[cats].sum())


count    1.176707e+06
mean     1.239080e-03
std      4.408158e-03
min      0.000000e+00
25%      1.435712e-04
50%      3.340256e-04
75%      8.865766e-04
max      4.684385e-01
Name: mcc_fraud_rate, dtype: float64
Insufficient Balance    12162
Bad Zipcode               110
Bad Expiration            526
Technical Glitch         2341
Bad Card Number           632
Bad PIN                  2661
Bad CVV                   486
dtype: int64
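The point about matching features can be illustrated with a small sketch. It assumes one possible approach, recording the dummy columns seen in training and reindexing new data to that layout; this is not necessarily what `MakeDummies` does, and the toy card brands are made up.

```python
import pandas as pd

train = pd.DataFrame({"card_brand": ["Visa", "Mastercard", "Amex"]})
test = pd.DataFrame({"card_brand": ["Visa", "Discover"]})

# "Fit": record the dummy columns seen in the training data.
train_dummies = pd.get_dummies(train["card_brand"], prefix="card_brand", dtype=bool)
fitted_columns = train_dummies.columns

# "Transform": build dummies for new data, then force the training layout.
# Categories unseen in training are dropped; missing ones are filled with False.
test_dummies = (
    pd.get_dummies(test["card_brand"], prefix="card_brand", dtype=bool)
    .reindex(columns=fitted_columns, fill_value=False)
)
print(test_dummies)
```

With this layout, a model trained on the training columns can score the testing data without a shape mismatch, even when a category appears in only one of the two sets.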


## Wrap-up

This notebook documents preparing training and testing data sets. It outputs CSV data files for:
1. A testing set of 200 users
1. An unbalanced training set of 400 users
1. A balanced training set drawing on all 1,800 users in the wider pool of training data

This saved data is cleaned and ready to be fed into alternative transformation pipelines for different modeling strategies. Two important transformations are demonstrated in the final section of the notebook:
1. Replacing MCC codes with rates of fraud observed in the training data
1. Converting categorical features to dummies, including the `errors` column, which can list multiple errors for one observation