# Identifying Credit Card Fraud: 1. Data Wrangling
Mark Cohen

2023-3-7

## 1 Setup

The structure of the project directory is as follows:
- data
    - raw
    - processed
- src
- notebooks
- models
- reports

The `src` directory includes Python scripts that define utility functions needed at various steps of the project. This data wrangling notebook uses `utils.py`, which provides functions to download and load the project's data.

**NOTE:** Because of its size, the raw data is not stored in the GitHub repository. Instead, the zipped data is mirrored on my Google Drive account. See the file `data source and license.txt` for the original source and license.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import pandas as pd
import numpy as np
from functools import reduce
sys.path.append("../src")
import utils

In [3]:
# Confirming that data is present locally, or downloading and unzipping if not.
utils.raw_data_on_disk()

Downloading data.
Downloaded ../tmp/data_archive.zip
Unzipping data files.


## 2 Inspecting the data files

### 2.1 A sample transaction record

The data set includes the transaction records for a single customer, separated out into their own CSV file. It will be useful to look at this file to get a sense of the format of the data.

In [4]:
df_user0 = utils.read_sample_transactions()

As seen below, the dataset includes almost 20,000 transactions for this one user, and each transaction record consists of 15 features.

The column names are inconvenient for data frame indexing. They should be reformatted in snake case with no special characters.

We can also see already that some of the data types will need to be adjusted:
1. Amount is a string, including currency marks.
1. Zip is a float rather than int or string.
1. The target feature is a text `yes`/`no` rather than an integer or boolean.

`MCC` is "Merchant Category Code." For example: `5411` represents "Grocery Stores and Supermarkets." See the following document from Visa: https://usa.visa.com/content/dam/VCOM/download/merchants/visa-merchant-data-standards-manual.pdf. 

Finally, there are a substantial number of missing values in the `Merchant State` and `Zip` columns. In the `Errors?` column most rows have no value, which probably means there is no error.

In [5]:
print(df_user0.info())
print(df_user0.head(3).T)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19963 entries, 0 to 19962
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   User            19963 non-null  int64  
 1   Card            19963 non-null  int64  
 2   Year            19963 non-null  int64  
 3   Month           19963 non-null  int64  
 4   Day             19963 non-null  int64  
 5   Time            19963 non-null  object 
 6   Amount          19963 non-null  object 
 7   Use Chip        19963 non-null  object 
 8   Merchant Name   19963 non-null  int64  
 9   Merchant City   19963 non-null  object 
 10  Merchant State  18646 non-null  object 
 11  Zip             18316 non-null  float64
 12  MCC             19963 non-null  int64  
 13  Errors?         574 non-null    object 
 14  Is Fraud?       19963 non-null  object 
dtypes: float64(1), int64(7), object(7)
memory usage: 2.3+ MB
None

In [6]:
# Updating the column names

df_user0.columns = utils.update_colnames(df_user0.columns)
print(f"New column names: {df_user0.columns}")

New column names: Index(['user', 'card', 'year', 'month', 'day', 'time', 'amount', 'use_chip',
       'merchant_name', 'merchant_city', 'merchant_state', 'zip', 'mcc',
       'errors', 'is_fraud'],
      dtype='object')
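
The implementation of `utils.update_colnames` is not shown in this notebook. A minimal sketch of one way to get the same result, using a hypothetical helper named `to_snake_case` (an assumption, not the project's actual code), might look like this:

```python
import re


def to_snake_case(names):
    """Hypothetical helper: lower-case names, drop special characters, join words with '_'."""
    cleaned = []
    for name in names:
        # Replace every run of non-alphanumeric characters with a space, then split into words
        words = re.sub(r"[^0-9a-zA-Z]+", " ", name).split()
        cleaned.append("_".join(word.lower() for word in words))
    return cleaned


new_names = to_snake_case(["Use Chip", "Merchant State", "Errors?", "Is Fraud?"])
print(new_names)
```

Splitting on non-alphanumeric runs handles both the spaces and the trailing `?` in the original headers in a single pass.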


Looking next at descriptive statistics for the numeric features, a few points stand out.
1. The data for this user includes transactions from 5 distinct cards.
1. The data covers a long period of time: from 2002 to 2020. One possibility might be to subset the data on time rather than by customers, e.g. restrict it to a period of a few years.
1. In addition to missing values, the minimum value of the zip code column is not a valid zip code.

In [7]:
df_user0.describe()

| | user | card | year | month | day | merchant_name | zip | mcc |
|---|---|---|---|---|---|---|---|---|
| count | 19963.0 | 19963.0 | 19963.0 | 19963.0 | 19963.0 | 19963.0 | 18316.0 | 19963.0 |
| mean | 0.0 | 1.910735 | 2011.011922 | 6.568101 | 15.743876 | 7.825653e+17 | 88812.744922 | 5617.940239 |
| std | 0.0 | 1.237763 | 5.048146 | 3.477497 | 8.801378 | 4.040602e+18 | 13711.491085 | 707.982901 |
| min | 0.0 | 0.0 | 2002.0 | 1.0 | 1.0 | -9.179793e+18 | 1012.0 | 1711.0 |
| 25% | 0.0 | 0.0 | 2007.0 | 4.0 | 8.0 | -1.288082e+18 | 91750.0 | 5311.0 |
| 50% | 0.0 | 2.0 | 2011.0 | 7.0 | 16.0 | 8.38425e+17 | 91750.0 | 5499.0 |
| 75% | 0.0 | 3.0 | 2015.0 | 10.0 | 23.0 | 4.060647e+18 | 91752.0 | 5912.0 |
| max | 0.0 | 4.0 | 2020.0 | 12.0 | 31.0 | 9.137769e+18 | 99504.0 | 9402.0 |


Looking first at the `zip` column, most of the rows with missing data represent online transactions. Transactions outside of the United States are recorded such that there is no zip code and the country name is stored in the `merchant_state` column.

In [8]:
missing_zip = df_user0.loc[df_user0.zip.isna(),["merchant_city", "merchant_state", "zip"]]
print(missing_zip.merchant_city.value_counts())
print()
print(missing_zip.loc[missing_zip.merchant_state != 'ONLINE', 'merchant_state'].value_counts())

merchant_city
ONLINE            1317
Cancun             112
Manila              46
Kingston            46
Cabo San Lucas      34
Rome                32
Tallinn             13
Tokyo               12
Beijing             11
Shanghai             7
Lisbon               6
Zurich               5
Santo Domingo        4
Toronto              2
Name: count, dtype: int64

merchant_state
Mexico                146
Philippines            46
Jamaica                46
Italy                  32
China                  18
Estonia                13
Japan                  12
Portugal                6
Switzerland             5
Dominican Republic      4
Canada                  2
Name: count, dtype: int64


As for the zip codes that seem to be too short, this is apparently because the column is stored as a float, so leading zeros (which occur in zip codes from northeastern states) have been dropped. This does not impact the validity or usability of the data, so it will be left as is for now.

In [9]:
low_zip = df_user0[df_user0.zip < 10000]
print(low_zip.merchant_state.unique())

['NJ' 'CT' 'MA']
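
If the leading zeros are needed later, the float column could be rendered as 5-character strings. The following is only a sketch under that assumption; `zips_to_strings` is a hypothetical helper and is not part of `utils.py`:

```python
import numpy as np
import pandas as pd


def zips_to_strings(zips):
    """Hypothetical helper: render float zip codes as 5-character strings, keeping missing values."""
    # Int64 is pandas' nullable integer type, so NaN becomes <NA> instead of raising an error
    return zips.astype("Int64").astype("string").str.zfill(5)


example = pd.Series([1012.0, 91750.0, np.nan])
zip_strings = zips_to_strings(example)
print(zip_strings)
```

Going through the nullable `Int64` type avoids the `.0` suffix that a direct float-to-string conversion would produce.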


Moving on to the textual columns, it turns out that the `amount` column just prefixes every value with `$`. Accordingly, this can be removed and the values converted to float.

In [10]:
print(f"The first character of the amount column:\n {df_user0.amount.str.get(0).value_counts()}")

df_user0.amount = utils.convert_dollar_amounts(df_user0.amount)
df_user0.amount.describe()

The first character of the amount column:
 amount
$    19963
Name: count, dtype: int64


count    19963.000000
mean        81.299989
std         94.159093
min       -499.000000
25%         36.630000
50%         69.450000
75%        125.680000
max       1409.400000
Name: amount, dtype: float64
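
The body of `utils.convert_dollar_amounts` is not reproduced here. A minimal sketch of the conversion, assuming the raw strings look like `$134.09` and that negative amounts are written as `$-99.00` (both assumptions about the file format), could be:

```python
import pandas as pd


def dollars_to_float(amounts):
    """Hypothetical stand-in for the project's converter: strip the '$' and cast to float."""
    # regex=False makes '$' a literal character rather than a regex end-of-string anchor
    return amounts.str.replace("$", "", regex=False).astype(float)


raw = pd.Series(["$134.09", "$38.48", "$-99.00"])
amounts = dollars_to_float(raw)
print(amounts)
```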

This reveals an additional issue: some of the transaction values are negative. These appear to be refunds: note how the 2nd example immediately follows a transaction for the same (but positive) amount from the same merchant.

In [11]:
neg_val = df_user0[df_user0.amount <= 0]
print(neg_val.head(3).T)
print(df_user0.iloc[71:73,:].T)

                                32                    72   \
user                              0                     0   
card                              0                     0   
year                           2002                  2002   
month                             9                     9   
day                              11                    25   
time                          13:17                 13:14   
amount                        -99.0                -100.0   
use_chip          Swipe Transaction     Swipe Transaction   
merchant_name   2027553650310142703  -1288082279022882052   
merchant_city             Mira Loma              La Verne   
merchant_state                   CA                    CA   
zip                         91752.0               91750.0   
mcc                            5541                  5499   
errors                          NaN                   NaN   
is_fraud                         No                    No   

                       

The separate columns for year, month, day, and time can be combined into a single Pandas Timestamp column.

In [12]:


df_user0['timestamp'] = utils.make_timestamps(df_user0)
# Confirm the years match
print("Mismatched years:", (df_user0.year != df_user0.timestamp.dt.year).sum())

Mismatched years: 0
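
One way such a timestamp column can be assembled is sketched below. This is an illustration only: `build_timestamps` is a hypothetical name, and the actual `utils.make_timestamps` may work differently. It assumes the `time` column holds `HH:MM` strings, as in the sample above.

```python
import pandas as pd


def build_timestamps(df):
    """Hypothetical helper: combine year, month, day and HH:MM time columns into timestamps."""
    # pd.to_datetime accepts a frame whose columns are named year, month and day
    dates = pd.to_datetime(df[["year", "month", "day"]])
    # Append seconds so the strings parse as HH:MM:SS offsets from midnight
    times = pd.to_timedelta(df["time"] + ":00")
    return dates + times


sample = pd.DataFrame(
    {"year": [2002, 2002], "month": [9, 9], "day": [1, 1], "time": ["06:21", "06:42"]}
)
stamps = build_timestamps(sample)
print(stamps)
```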


Next, let's look at `use_chip`. It turns out there are only three values, corresponding to swipe, chip, and online. We can clean this up by renaming the column more intuitively, and converting into categories without redundant names.

In [13]:
print(df_user0.use_chip.value_counts(), df_user0.use_chip.isna().sum())

use_chip
Swipe Transaction     15840
Chip Transaction       2808
Online Transaction     1315
Name: count, dtype: int64 0


In [14]:
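# str.strip() removes a *set of characters* from both ends, not a suffix, so replace the literal text instead
tx_type = df_user0.use_chip.str.replace(" Transaction", "", regex=False).str.lower().astype("category")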
print(tx_type.cat.categories)
df_user0['tx_type'] = tx_type

Index(['chip', 'online', 'swipe'], dtype='object')
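
A caveat when removing a fixed suffix such as ` Transaction`: Python's `str.strip` (and pandas' `Series.str.strip`) treats its argument as a set of characters to remove from both ends, not as a literal substring. It happens to give the expected result for the three values in this column, but it can silently eat other values. The toy strings below are made up for illustration:

```python
import pandas as pd

values = pd.Series(["Swipe Transaction", "Chip Transaction", "Online Transaction"])

# Literal replacement removes exactly the text " Transaction"
by_replace = values.str.replace(" Transaction", "", regex=False).str.lower()

# Character-set stripping removes any of the characters in " Transaction" from both ends
by_strip = values.str.strip(" Transaction").str.lower()

# A made-up value built only from those characters is stripped down to nothing
pitfall = "station Transaction".strip(" Transaction")

print(by_replace.tolist(), by_strip.tolist(), repr(pitfall))
```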


In [15]:
df_user0.drop(columns = ["use_chip"], inplace=True)

Finally, the target feature indicating fraud is a binary 'Yes' or 'No'. It will be easier to work with as a boolean.

In [16]:
print(df_user0.is_fraud.value_counts())

df_user0.is_fraud = df_user0.is_fraud == 'Yes'

print(df_user0.is_fraud.value_counts())

is_fraud
No     19936
Yes       27
Name: count, dtype: int64
is_fraud
False    19936
True        27
Name: count, dtype: int64


This is what the head of the data frame looks like after processing:

In [17]:
df_user0.head(2).T

| | 0 | 1 |
|---|---|---|
| user | 0 | 0 |
| card | 0 | 0 |
| year | 2002 | 2002 |
| month | 9 | 9 |
| day | 1 | 1 |
| time | 06:21 | 06:42 |
| amount | 134.09 | 38.48 |
| merchant_name | 3527213246127876953 | -727612092139916043 |
| merchant_city | La Verne | Monterey Park |
| merchant_state | CA | CA |


To sum up the data cleaning steps that will need to be repeated for the full training and test samples:
1. Renaming the columns.
1. Stripping the `$` and converting the transaction amounts to floating point values.
1. Creating a combined timestamp column.
1. Renaming the `use_chip` column to `tx_type` and converting it to categories.
1. Converting `is_fraud` into a boolean value.

### 2.2 The user records file

Information about each user is stored in a separate file, `sd254_users.csv`. It has 2,000 records and 18 columns.

In [18]:
users = utils.read_users()
print(f"Shape of the users data frame: {users.shape}")
users.head(2).T

Shape of the users data frame: (2000, 18)


| | 0 | 1 |
|---|---|---|
| Person | Hazel Robinson | Sasha Sadr |
| Current Age | 53 | 53 |
| Retirement Age | 66 | 68 |
| Birth Year | 1966 | 1966 |
| Birth Month | 11 | 12 |
| Gender | Female | Female |
| Address | 462 Rose Lane | 3606 Federal Boulevard |
| Apartment | | |
| City | La Verne | Little Neck |
| State | CA | NY |


As with the transactions, the column headings can be updated for easier interaction.

In [19]:
users.columns = utils.update_colnames(users.columns)
users.head(2).T

| | 0 | 1 |
|---|---|---|
| person | Hazel Robinson | Sasha Sadr |
| current_age | 53 | 53 |
| retirement_age | 66 | 68 |
| birth_year | 1966 | 1966 |
| birth_month | 11 | 12 |
| gender | Female | Female |
| address | 462 Rose Lane | 3606 Federal Boulevard |
| apartment | | |
| city | La Verne | Little Neck |
| state | CA | NY |


Also, as with the transaction records, the money amounts are entered as strings rather than floating point values.

In [20]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   person                     2000 non-null   object 
 1   current_age                2000 non-null   int64  
 2   retirement_age             2000 non-null   int64  
 3   birth_year                 2000 non-null   int64  
 4   birth_month                2000 non-null   int64  
 5   gender                     2000 non-null   object 
 6   address                    2000 non-null   object 
 7   apartment                  528 non-null    float64
 8   city                       2000 non-null   object 
 9   state                      2000 non-null   object 
 10  zipcode                    2000 non-null   int64  
 11  latitude                   2000 non-null   float64
 12  longitude                  2000 non-null   float64
 13  per_capita_income_zipcode  2000 non-null   objec

In [21]:
val_cols = ['per_capita_income_zipcode', 'yearly_income_person', 'total_debt']
for col in val_cols:
    users[col] = utils.convert_dollar_amounts(users[col])
users[val_cols].describe()

| | per_capita_income_zipcode | yearly_income_person | total_debt |
|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 |
| mean | 23141.928 | 45715.882 | 63709.694 |
| std | 11324.137358 | 22992.615456 | 52254.453421 |
| min | 0.0 | 1.0 | 0.0 |
| 25% | 16824.5 | 32818.5 | 23986.75 |
| 50% | 20581.0 | 40744.5 | 58251.0 |
| 75% | 26286.0 | 52698.5 | 89070.5 |
| max | 163145.0 | 307018.0 | 516263.0 |


Note that some zip codes apparently have a per capita income of 0. This appears to be missing data: `60657`, for example, is part of Chicago, IL. It will be worth returning to this during EDA to decide whether and how to fill it in.

In [22]:
zero_zips = users.loc[users.per_capita_income_zipcode==0, 'zipcode']
print("Zip codes with 0 income p.c.:", zero_zips.value_counts(), sep='\n')
print("These zip codes consistently have 0 income p.c.:", users.loc[users.zipcode.isin(zero_zips), 'per_capita_income_zipcode'], sep='\n')

Zip codes with 0 income p.c.:
zipcode
60657    2
60614    2
94583    2
76248    2
92130    1
8540     1
10003    1
30022    1
94010    1
11215    1
77450    1
Name: count, dtype: int64
These zip codes consistently have 0 income p.c.:
246     0.0
662     0.0
741     0.0
751     0.0
764     0.0
993     0.0
1068    0.0
1100    0.0
1166    0.0
1213    0.0
1342    0.0
1426    0.0
1543    0.0
1686    0.0
1731    0.0
Name: per_capita_income_zipcode, dtype: float64
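
If these zeros are later treated as missing data, one option is to mark them explicitly as `NaN` so that they do not distort summary statistics. The snippet below is a sketch on made-up values, not a decision taken in this notebook:

```python
import pandas as pd

# Made-up income values; 0.0 stands in for a zip code with no reported income
income = pd.Series([23141.0, 0.0, 40744.0], name="per_capita_income_zipcode")

# mask() replaces entries where the condition is True with NaN
income_clean = income.mask(income == 0)

print(income_clean)
print("Mean ignoring missing values:", income_clean.mean())
```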


### 2.3 The card records file

Each user can have more than one card. Information about the cards is stored in `sd254_cards.csv`. It has 6,146 records (i.e. just over 3 per user) and 13 columns.

In [23]:
cards = utils.read_cards()
print(f"Shape of data frame: {cards.shape}")
print(cards.head(2).T)

Shape of data frame: (6146, 13)
                                      0                 1
User                                  0                 0
CARD INDEX                            0                 1
Card Brand                         Visa              Visa
Card Type                         Debit             Debit
Card Number            4344676511950444  4956965974959986
Expires                         12/2022           12/2020
CVV                                 623               393
Has Chip                            YES               YES
Cards Issued                          2                 2
Credit Limit                     $24295            $21968
Acct Open Date                  09/2002           04/2014
Year PIN last Changed              2008              2014
Card on Dark Web                     No                No


In [24]:
cards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6146 entries, 0 to 6145
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   User                   6146 non-null   int64 
 1   CARD INDEX             6146 non-null   int64 
 2   Card Brand             6146 non-null   object
 3   Card Type              6146 non-null   object
 4   Card Number            6146 non-null   int64 
 5   Expires                6146 non-null   object
 6   CVV                    6146 non-null   int64 
 7   Has Chip               6146 non-null   object
 8   Cards Issued           6146 non-null   int64 
 9   Credit Limit           6146 non-null   object
 10  Acct Open Date         6146 non-null   object
 11  Year PIN last Changed  6146 non-null   int64 
 12  Card on Dark Web       6146 non-null   object
dtypes: int64(6), object(7)
memory usage: 624.3+ KB


Once again, the column names and dollar values can be reformatted.

In [25]:
cards.columns = utils.update_colnames(cards.columns)
cards.credit_limit = utils.convert_dollar_amounts(cards.credit_limit)
cards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6146 entries, 0 to 6145
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   user                   6146 non-null   int64  
 1   card_index             6146 non-null   int64  
 2   card_brand             6146 non-null   object 
 3   card_type              6146 non-null   object 
 4   card_number            6146 non-null   int64  
 5   expires                6146 non-null   object 
 6   cvv                    6146 non-null   int64  
 7   has_chip               6146 non-null   object 
 8   cards_issued           6146 non-null   int64  
 9   credit_limit           6146 non-null   float64
 10  acct_open_date         6146 non-null   object 
 11  year_pin_last_changed  6146 non-null   int64  
 12  card_on_dark_web       6146 non-null   object 
dtypes: float64(1), int64(6), object(6)
memory usage: 624.3+ KB


The columns for the date the account was opened and the date the card expires are strings in the format `MM/YYYY`. For easier comparability, they can be converted to datetime values corresponding to the first of the identified month.

In [26]:
date_cols = ['expires', 'acct_open_date']
for col in date_cols:
    cards[col] = utils.convert_monthyear_dates(cards[col])
print(cards[date_cols].describe())

                             expires                 acct_open_date
count                           6146                           6146
mean   2020-10-08 06:30:06.443215360  2011-01-15 12:55:31.727953152
min              1997-07-01 00:00:00            1991-01-01 00:00:00
25%              2020-02-01 00:00:00            2006-10-01 00:00:00
50%              2021-09-01 00:00:00            2010-02-15 00:00:00
75%              2023-05-01 00:00:00            2016-05-01 00:00:00
max              2024-12-01 00:00:00            2020-02-01 00:00:00
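
A minimal sketch of this kind of conversion is shown below. `parse_month_year` is a hypothetical stand-in; the actual `utils.convert_monthyear_dates` is not shown in this notebook and may differ.

```python
import pandas as pd


def parse_month_year(values):
    """Hypothetical helper: parse 'MM/YYYY' strings; the day defaults to the first of the month."""
    return pd.to_datetime(values, format="%m/%Y")


dates = parse_month_year(pd.Series(["12/2022", "09/2002"]))
print(dates)
```

Passing an explicit `format` makes the parse strict, so a value that does not match `MM/YYYY` raises an error instead of being guessed.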


Finally, the `has_chip` and `card_on_dark_web` columns are logically boolean.

In [27]:
print(cards.has_chip.value_counts())
cards.has_chip = cards.has_chip == 'YES'
print(cards.has_chip.value_counts())

has_chip
YES    5500
NO      646
Name: count, dtype: int64
has_chip
True     5500
False     646
Name: count, dtype: int64


However, the `card_on_dark_web` feature only has the value 'No'. This provides no useful information, so it will be dropped.

In [28]:
print(cards.card_on_dark_web.value_counts())
cards.drop(columns='card_on_dark_web', inplace=True)

card_on_dark_web
No    6146
Name: count, dtype: int64


### 2.4 The transactions file

The file `credit-card-transactions-ibm_v2.csv` includes more than 24 million records of simulated transactions. Importing training and test samples will be covered below. For now, though, let's confirm that the records have the same structure as the sample above.

In [29]:
small_chunk_reader = utils.make_txdata_reader(chunksize = 100)
sample_df = next(small_chunk_reader)
print(sample_df.head(2).T)
del small_chunk_reader

                                  0                    1
User                              0                    0
Card                              0                    0
Year                           2002                 2002
Month                             9                    9
Day                               1                    1
Time                          06:21                06:42
Amount                      $134.09               $38.48
Use Chip          Swipe Transaction    Swipe Transaction
Merchant Name   3527213246127876953  -727612092139916043
Merchant City              La Verne        Monterey Park
Merchant State                   CA                   CA
Zip                         91750.0              91754.0
MCC                            5300                 5411
Errors?                         NaN                  NaN
Is Fraud?                        No                   No


In [30]:
sample_df = utils.clean_tx_df(sample_df)
sample_df.head(2).T

| | 0 | 1 |
|---|---|---|
| user | 0 | 0 |
| card | 0 | 0 |
| amount | 134.09 | 38.48 |
| merchant_city | La Verne | Monterey Park |
| merchant_state | CA | CA |
| zip | 91750.0 | 91754.0 |
| mcc | 5300 | 5411 |
| errors | | |
| is_fraud | False | False |
| timestamp | 2002-09-01 06:21:00 | 2002-09-01 06:42:00 |


## 3 Sampling the transactions data for EDA

### 3.1 Sampling strategies

The main transactions data file includes more than 24 million records. This is too large a scale for the limited computing resources available for this project. Accordingly, it will be necessary to sample from the entire data set for both EDA and model selection and fitting. I will consider two approaches:
1. **Sample on users**: randomly select some fraction (say, $\frac{1}{5}$) of the users and conduct the analysis on all of their cards and transactions. In this case, the test set would be another subsample of users from the original data set.
1. **Sample on time**: choose some time period (say, 5 years, not including the last 2 years) and conduct the analysis on all transactions by all users in that period. Then, the test data would be *subsequent* transactions from the same users. The main advantage of this approach is that "future testing" more closely approximates the key business problem this project seeks to address: identifying potentially fraudulent new transactions based only on previous data.

For EDA, it makes sense to begin with a sample of users, if only to clarify the temporal characteristics of the data. This sample is built below. I will return in future steps to the possibility of sampling, or at least splitting training and test data, on time.
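
To make the second option concrete, a time-based split can be sketched as follows. The function name, the cutoff dates, and the toy data are all assumptions for illustration; the notebook itself does not implement this split.

```python
import pandas as pd


def split_on_time(df, train_start, train_end):
    """Hypothetical helper: train on [train_start, train_end), test on everything afterwards."""
    start, end = pd.Timestamp(train_start), pd.Timestamp(train_end)
    train = df[(df["timestamp"] >= start) & (df["timestamp"] < end)]
    test = df[df["timestamp"] >= end]
    return train, test


# Made-up transactions, for illustration only
tx = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2011-06-01", "2012-03-15", "2016-12-31", "2017-01-01", "2018-07-04"]
        ),
        "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)
train, test = split_on_time(tx, "2012-01-01", "2017-01-01")
print(len(train), len(test))
```

Because every test row is strictly later than every training row, the split mimics scoring new transactions using only past data.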

### 3.2 Making the user-sampled data frame

In [31]:
# Set seed for reproducibility
seed = 111
# Sample 1/5 of the users; user ids are the row positions 0..N-1 of the users table
N = users.shape[0]
sample_size = N // 5
sampled_users = np.random.default_rng(seed).choice(N, size=sample_size, replace=False)

def filter_df(df):
    """Keep only the rows of a raw transactions chunk that belong to sampled users."""
    return df[df["User"].isin(sampled_users)]

# Needed because reduce() requires a function of the form fn(T, T) -> T
concatenate = lambda df1, df2: pd.concat([df1, df2])
# Chunk reader for the transactions data
tx_df_reader = utils.make_txdata_reader()
# Filter and clean each chunk as it is read, then concatenate the results pairwise,
# so that the unfiltered file never has to be held in memory all at once
sampled_tx_df = reduce(concatenate, map(utils.clean_tx_df, map(filter_df, tx_df_reader)))



Because the number of transactions per user is not uniformly distributed, the resulting data frame is not exactly 1/5 of the original data: at over 5.2 million rows, it is somewhat more than 1/5 of the full file.
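
The chunk-by-chunk filtering used to build this sample can be illustrated on a tiny in-memory CSV. This is a sketch only: the CSV text and the sampled ids are made up, and `utils.make_txdata_reader` is assumed (not confirmed) to wrap `pd.read_csv` with a `chunksize`.

```python
import io

import pandas as pd

# Made-up stand-in for the large transactions file
csv_text = "User,Amount\n0,$1.00\n1,$2.00\n2,$3.00\n0,$4.00\n1,$5.00\n"
sampled_ids = [0, 2]

# With chunksize set, read_csv returns an iterator of data frames instead of one big frame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Filter each chunk as it arrives, then concatenate the much smaller pieces once
kept_chunks = [chunk[chunk["User"].isin(sampled_ids)] for chunk in reader]
sample_df = pd.concat(kept_chunks)
print(sample_df)
```

Collecting the filtered chunks in a list and calling `pd.concat` once avoids re-copying the accumulated rows at every step, which is what pairwise concatenation does.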

In [32]:
sampled_tx_df.user.unique().shape

(400,)

In [33]:
sampled_tx_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5214807 entries, 28882 to 24270533
Data columns (total 11 columns):
 #   Column          Dtype         
---  ------          -----         
 0   user            int64         
 1   card            int64         
 2   amount          float64       
 3   merchant_city   object        
 4   merchant_state  object        
 5   zip             float64       
 6   mcc             int64         
 7   errors          object        
 8   is_fraud        bool          
 9   timestamp       datetime64[ns]
 10  tx_type         object        
dtypes: bool(1), datetime64[ns](1), float64(2), int64(3), object(4)
memory usage: 442.6+ MB


### 3.3 Aggregating data

For EDA, it will be helpful to add summary information from the cards and transactions data to the users and cards tables. Specifically, we can add:
- To the users table:
    - Total credit limit
    - Fraud rate
- To the cards table:
    - Total transactions count and value
    - TX type shares
    - Fraud rate

In [34]:
# Aggregating by card

tx_group_cards = pd.get_dummies(
    sampled_tx_df,
    columns=['tx_type']
    ).groupby(["user", "card"])
card_grouped_cols = {
    'amount': ['count', 'sum'],
    'is_fraud': 'mean',
    'tx_type_chip': 'mean',
    'tx_type_online': 'mean',
    'tx_type_swipe': 'mean'
}
card_agg_colnames = ["num_transactions", "total_tx_amount", 'fraud_rate', 'chip_rate', 'online_rate', 'swipe_rate']

cards_sample = cards.merge(
    tx_group_cards.agg(card_grouped_cols)
        .set_axis(card_agg_colnames, axis=1), 
    left_on=["user", "card_index"], 
    right_index=True
)
cards_sample.head(2).T


| | 10 | 11 |
|---|---|---|
| user | 2 | 2 |
| card_index | 0 | 1 |
| card_brand | Mastercard | Mastercard |
| card_type | Debit | Debit |
| card_number | 5495199163052054 | 5804499644308599 |
| expires | 2022-03-01 00:00:00 | 2023-07-01 00:00:00 |
| cvv | 677 | 258 |
| has_chip | True | False |
| cards_issued | 2 | 2 |
| credit_limit | 31599.0 | 27480.0 |
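
The idea behind the per-card aggregation is that the mean of a 0/1 dummy column within a group is the share of rows in that category. A toy sketch with made-up transactions, using pandas named aggregation rather than the dictionary form in the cell above:

```python
import pandas as pd

# Made-up transactions for two users
tx = pd.DataFrame(
    {
        "user": [0, 0, 0, 1],
        "card": [0, 0, 1, 0],
        "amount": [10.0, 20.0, 5.0, 7.0],
        "is_fraud": [False, True, False, False],
        "tx_type": ["chip", "swipe", "chip", "online"],
    }
)

# One indicator column per transaction type: tx_type_chip, tx_type_online, tx_type_swipe
dummies = pd.get_dummies(tx, columns=["tx_type"])

summary = dummies.groupby(["user", "card"]).agg(
    num_transactions=("amount", "count"),
    total_tx_amount=("amount", "sum"),
    fraud_rate=("is_fraud", "mean"),
    chip_rate=("tx_type_chip", "mean"),
)
print(summary)
```

Named aggregation sets the output column names directly, so no separate renaming step is needed.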


In [35]:
# Aggregating by user

tx_group_users = sampled_tx_df.groupby("user")
cards_group_users = cards.groupby("user")
user_agg_data = (
    tx_group_users
    .agg({'is_fraud': 'mean'})
    .merge(
        cards_group_users.agg({'credit_limit': 'sum'}),
        on='user'
    )
    .set_axis(['fraud_rate', 'total_credit_limit'], axis=1)
)
users_sample = users.join(user_agg_data, how='inner')
users_sample.head(2).T

| | 2 | 15 |
|---|---|---|
| person | Saanvi Lee | Riya Cruz |
| current_age | 81 | 41 |
| retirement_age | 67 | 68 |
| birth_year | 1938 | 1978 |
| birth_month | 11 | 4 |
| gender | Female | Female |
| address | 766 Third Drive | 40 Washington Drive |
| apartment | | |
| city | West Covina | Boise City |
| state | CA | OK |


### 3.4 Putting user and card data in the transactions data frame



## 4 Wrap-up

### 4.1 Saving cleaned data frames

In [36]:
dfdict = {
    "users_all.csv": users,
    "cards_all.csv": cards,
    "users_sample.csv": users_sample,
    "cards_sample.csv": cards_sample,
    "tx_sample.csv": sampled_tx_df
}
utils.save_data(dfdict)

### 4.2 Summary

This notebook produces 5 cleaned data files:
1. User information (all 2,000 records)
1. Card information (all 6,146 records)
1. User information (400 randomly sampled users)
1. Card information (for the user sample)
1. Transaction information (11 columns on 5,214,807 records for the user sample)

The transactions data frame includes the target feature: the boolean `is_fraud`.

Issues with the format and data types were fixed:
1. Clean column names
1. Amounts converted to floating point
1. Dates converted to more easily usable formats
1. Logically categorical and boolean values converted into appropriate types

In addition, aggregated data was added to users and cards data based on the transactions, including the rate of fraud for each card and each user.

Two issues to be considered in EDA and model pre-processing were identified:
1. Sampling strategies: rely solely on a sample of users, or also sample (or stratify training/test data) by time?
1. Zip codes in the users data that report a per capita income of $0.


