# Preprocessing

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import pandas as pd
import numpy as np
sys.path.append("../src")
import utils

rng = np.random.default_rng(10)

## Todo

1. Transformations
    1. Join income and credit limit features from users data
    1. Calculate "home city", "home state", "home country", user age, and retired features
    1. Calculate card vintage and time to expiration features from card data
    1. **What to do with MCCs?** There are 109 of them: too many to just make dummies? Perhaps they could be grouped according to fraud rates, though this would need to be done using only training data.
1. Training and testing data
    1. Make a hold-back testing data set with $\frac{1}{20}$ of the users (i.e. 100). Some of the features include user-level values, so the final testing data should avoid any contamination from that information. Need to double-check that sampling error on the fraud rates isn't too extreme.
    1. Make a balanced training data set from all of the rest of the data.
    1. Make an alternative training data set without balancing. This should include $\frac{1}{5}$ of the users (i.e. 400).
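
One possible way to handle the MCC question is sketched below: compute the fraud rate per MCC on training rows only, bin the MCCs into a few rate-based groups, and apply that fixed mapping to any other data. This is only a sketch under assumptions. The column names `mcc` and `is_fraud`, the number of groups, and both helper functions are hypothetical and are not part of `utils`.

```python
import pandas as pd

def fit_mcc_groups(train_tx, n_groups=4, mcc_col="mcc", target_col="is_fraud"):
    """Map each MCC to a fraud-rate bin, using training rows only."""
    rates = train_tx.groupby(mcc_col)[target_col].mean()
    # Rank first so tied fraud rates don't produce duplicate bin edges.
    bins = pd.qcut(rates.rank(method="first"), q=n_groups, labels=False)
    return bins.astype(int).to_dict()

def apply_mcc_groups(tx, mapping, mcc_col="mcc", unseen=-1):
    """Look up the bin for each row; MCCs absent from training get `unseen`."""
    return tx[mcc_col].map(mapping).fillna(unseen).astype(int)

# Toy data, purely illustrative (not taken from the real transactions).
train = pd.DataFrame({
    "mcc": [5411, 5411, 5812, 5812, 4829, 4829, 7995, 7995],
    "is_fraud": [0, 0, 0, 1, 1, 1, 0, 0],
})
mcc_to_group = fit_mcc_groups(train, n_groups=2)

# The mapping is fitted once on training data and reused unchanged elsewhere.
test = pd.DataFrame({"mcc": [5411, 4829, 9999]})
test_groups = apply_mcc_groups(test, mcc_to_group)
```

Fitting the mapping on training data and only looking it up for the test set keeps test-set fraud labels out of the feature.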

## Preparing User and Card Data

Users

In [3]:
user_data = pd.read_csv(
    utils.prepend_dir('users_all.csv'),
    index_col=0
)
user_data['birthdate'] = pd.to_datetime({
    'year': user_data.birth_year, 
    'month': user_data.birth_month, 
    'day':1
})

N_users = user_data.shape[0]

print(user_data.info())
user_cols = ['birthdate', 'retirement_age', 'gender', 'city', 'state', 'zipcode', 'latitude', 'longitude', 'per_capita_income_zipcode', 'yearly_income_person', 'total_debt', 'fico_score', 'num_credit_cards']

<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, 0 to 1999
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   person                     2000 non-null   object        
 1   current_age                2000 non-null   int64         
 2   retirement_age             2000 non-null   int64         
 3   birth_year                 2000 non-null   int64         
 4   birth_month                2000 non-null   int64         
 5   gender                     2000 non-null   object        
 6   address                    2000 non-null   object        
 7   apartment                  528 non-null    float64       
 8   city                       2000 non-null   object        
 9   state                      2000 non-null   object        
 10  zipcode                    2000 non-null   int64         
 11  latitude                   2000 non-null   float64       
 12  longitude  

Cards

In [4]:
card_data = pd.read_csv(
    utils.prepend_dir('cards_all.csv'), 
    index_col=0, 
    parse_dates=['expires', 'acct_open_date']
)
print(card_data.info())
card_cols = ['user', 'card_index', 'card_brand', 'card_type', 'expires', 'has_chip', 'cards_issued', 'credit_limit', 'acct_open_date']

<class 'pandas.core.frame.DataFrame'>
Index: 6146 entries, 0 to 6145
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   user                   6146 non-null   int64         
 1   card_index             6146 non-null   int64         
 2   card_brand             6146 non-null   object        
 3   card_type              6146 non-null   object        
 4   card_number            6146 non-null   int64         
 5   expires                6146 non-null   datetime64[ns]
 6   cvv                    6146 non-null   int64         
 7   has_chip               6146 non-null   bool          
 8   cards_issued           6146 non-null   int64         
 9   credit_limit           6146 non-null   float64       
 10  acct_open_date         6146 non-null   datetime64[ns]
 11  year_pin_last_changed  6146 non-null   int64         
dtypes: bool(1), datetime64[ns](2), float64(1), int64(6), object(2)
memo

Merging users and cards data

In [5]:
cards_users = (card_data[card_cols]
    .merge(user_data[user_cols],
        left_on='user',
        right_index=True)
    .rename({'card_index':'card'}, axis=1)
)

print(cards_users.info())

# there's enough data that memory is a constraint, so free the source frames
del user_data
del card_data

<class 'pandas.core.frame.DataFrame'>
Index: 6146 entries, 0 to 6145
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   user                       6146 non-null   int64         
 1   card                       6146 non-null   int64         
 2   card_brand                 6146 non-null   object        
 3   card_type                  6146 non-null   object        
 4   expires                    6146 non-null   datetime64[ns]
 5   has_chip                   6146 non-null   bool          
 6   cards_issued               6146 non-null   int64         
 7   credit_limit               6146 non-null   float64       
 8   acct_open_date             6146 non-null   datetime64[ns]
 9   birthdate                  6146 non-null   datetime64[ns]
 10  retirement_age             6146 non-null   int64         
 11  gender                     6146 non-null   object        
 12  city       
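
The merged frame carries the date columns needed for the age, retired, card vintage, and time-to-expiration features from the Todo list. The sketch below shows one way these could be derived relative to a transaction date. The `tx_date` column and the `add_time_features` helper are assumptions for illustration; the toy rows are made up.

```python
import pandas as pd

def add_time_features(df, date_col="tx_date"):
    """Derive age, retired flag, card vintage and time to expiration."""
    out = df.copy()
    ref = out[date_col]
    born = out["birthdate"]
    # birthdate is always the first of the month, so comparing months is
    # enough to tell whether this year's birthday has already happened.
    not_yet = (ref.dt.month < born.dt.month).astype(int)
    out["user_age"] = ref.dt.year - born.dt.year - not_yet
    out["retired"] = out["user_age"] >= out["retirement_age"]
    out["card_vintage_days"] = (ref - out["acct_open_date"]).dt.days
    out["days_to_expiration"] = (out["expires"] - ref).dt.days
    return out

# Toy rows with the same column names as cards_users, plus a hypothetical tx_date.
toy = pd.DataFrame({
    "birthdate": pd.to_datetime(["1950-06-01", "1990-11-01"]),
    "retirement_age": [65, 67],
    "acct_open_date": pd.to_datetime(["2010-01-01", "2015-03-15"]),
    "expires": pd.to_datetime(["2020-12-01", "2016-04-01"]),
    "tx_date": pd.to_datetime(["2016-03-15", "2016-03-15"]),
})
feats = add_time_features(toy)
```

A negative `days_to_expiration` would indicate a transaction on an expired card, which may itself be worth keeping as a signal.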

## Testing and training data

Splitting the data: 100 users to the test set and 400 users for the unbalanced training set.

Note that since there are user-level features, to avoid contamination I'm splitting on users rather than randomly on transactions.

In [6]:
seed = 111
# sample 1/4 of users (500 of 2000): 400 for training, 100 for testing
rng = np.random.default_rng(seed)

subset_size = N_users // 4
subset = rng.choice(N_users, size=subset_size, replace=False)

# Testing data for 100 users and a training subset of 400 users
training_subset = subset[:400]
testing_subset = subset[400:]
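
A quick sanity check on the split can be sketched as follows: confirm the two user sets have the expected sizes and do not overlap, then compare the fraud rate among test users with the rate among everyone else. The `fraud_rate_by_split` helper, the `is_fraud` column name, and the toy transactions are assumptions for illustration.

```python
import numpy as np
import pandas as pd

N_users = 2000  # number of rows in the users table
rng = np.random.default_rng(111)
subset = rng.choice(N_users, size=N_users // 4, replace=False)
training_subset, testing_subset = subset[:400], subset[400:]

# Sampling without replacement should give two disjoint user sets.
assert len(training_subset) == 400 and len(testing_subset) == 100
assert np.intersect1d(training_subset, testing_subset).size == 0

def fraud_rate_by_split(tx, test_users, user_col="user", target_col="is_fraud"):
    """Fraud rate among test users versus all remaining users."""
    labels = np.where(tx[user_col].isin(test_users), "test", "rest")
    return tx.groupby(labels)[target_col].mean()

# Toy transactions, purely illustrative.
toy_tx = pd.DataFrame({
    "user": [0, 0, 1, 1, 2, 2],
    "is_fraud": [0, 1, 0, 0, 1, 1],
})
rates = fraud_rate_by_split(toy_tx, test_users=[2])
```

If the two rates differ a lot on the real data, a different seed or a larger test set might be worth considering.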

Making CSVs for:
1. Testing data
1. All of the positive cases in the remaining training data
1. All of the negative cases in the remaining training data

In [8]:
metadata = utils.clean_split_tx(cards_users, testing_subset)
metadata

Data already present. Skipping download.


{'pos_filename': '../data/processed/tx_train_pos.csv',
 'neg_filename': '../data/processed/tx_train_neg.csv',
 'rate': 0.0012562183149985669}
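
With positives and negatives stored separately, a balanced training set could be built by keeping every positive row and downsampling the negatives to the same count. The sketch below illustrates the idea on toy frames. The `balance_by_downsampling` helper and the `is_fraud` column are hypothetical, and it assumes negatives outnumber positives, as the rate above suggests.

```python
import numpy as np
import pandas as pd

def balance_by_downsampling(pos, neg, rng):
    """Keep all positives and an equally sized random sample of negatives."""
    # Raises ValueError if there are fewer negatives than positives.
    keep = rng.choice(len(neg), size=len(pos), replace=False)
    balanced = pd.concat([pos, neg.iloc[keep]], ignore_index=True)
    # Shuffle so the classes are not stored in two contiguous runs.
    order = rng.permutation(len(balanced))
    return balanced.iloc[order].reset_index(drop=True)

# Toy frames standing in for the positive and negative CSVs.
pos = pd.DataFrame({"amount": [10.0, 250.0, 31.5], "is_fraud": 1})
neg = pd.DataFrame({"amount": np.arange(50, dtype=float), "is_fraud": 0})

rng = np.random.default_rng(0)
balanced = balance_by_downsampling(pos, neg, rng)
```

On the real negative file, which is likely large, reading in chunks and sampling within each chunk may be needed instead of loading everything at once.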