# Primary data analysis: transactions

The transaction history has 14 columns, described in the table below. This notebook performs an exploratory data analysis of every column.

| Column name          | Description                                      |
|----------------------|--------------------------------------------------|
| card_id              | Card identifier                                  |
| month_lag            | Month lag to reference date                      |
| purchase_date        | Purchase date                                    |
| authorized_flag      | 'Y' if approved, 'N' if denied                   |
| category_3           | Anonymised category                              |
| installments         | Number of installments of purchase               |
| category_1           | Anonymised category                              |
| merchant_category_id | Merchant category identifier (anonymised)        |
| subsector_id         | Merchant category group identifier (anonymised)  |
| merchant_id          | Merchant identifier (anonymised)                 |
| purchase_amount      | Normalized purchase amount                       |
| city_id              | City identifier (anonymised)                     |
| state_id             | State identifier (anonymised)                    |
| category_2           | Anonymised category                              |

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
from pprint import pprint

# Add the notebooks folder to sys.path so that helpers defined there can be imported.
sys.path.append('./notebooks')

# Import Dask, which provides a pandas-like API for parallel data processing.
import dask.dataframe as dd

from column_explore import dataframe_explore

In [3]:
transactions = dd.read_csv('../data/raw/historical_transactions.csv')

In [4]:
transactions.head()

Unnamed: 0,authorized_flag,card_id,city_id,category_1,installments,category_3,merchant_category_id,merchant_id,month_lag,purchase_amount,purchase_date,category_2,state_id,subsector_id
0,Y,C_ID_4e6213e9bc,88,N,0,A,80,M_ID_e020e9b302,-8,-0.703331,2017-06-25 15:33:07,1.0,16,37
1,Y,C_ID_4e6213e9bc,88,N,0,A,367,M_ID_86ec983688,-7,-0.733128,2017-07-15 12:10:45,1.0,16,16
2,Y,C_ID_4e6213e9bc,88,N,0,A,80,M_ID_979ed661fc,-6,-0.720386,2017-08-09 22:04:29,1.0,16,37
3,Y,C_ID_4e6213e9bc,88,N,0,A,560,M_ID_e6d5ae8ea6,-5,-0.735352,2017-09-02 10:06:26,1.0,16,34
4,Y,C_ID_4e6213e9bc,88,N,0,A,80,M_ID_e020e9b302,-11,-0.722865,2017-03-10 01:14:19,1.0,16,37
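
The rows above suggest which columns are flags, integers, floats, and timestamps. As a minimal sketch, the dtypes could be declared up front instead of inferred. The mapping in `column_dtypes` is an assumption based on the column descriptions and the sample rows, not something the notebook itself specifies. The example uses pandas on an in-memory sample so that it runs without the data file; `dask.dataframe.read_csv` forwards keyword arguments such as `dtype` and `parse_dates` to pandas, so the same arguments should carry over.

```python
import io

import pandas as pd

# Two rows copied from the head() output above, without the index column.
sample_csv = io.StringIO(
    "authorized_flag,card_id,city_id,category_1,installments,category_3,"
    "merchant_category_id,merchant_id,month_lag,purchase_amount,"
    "purchase_date,category_2,state_id,subsector_id\n"
    "Y,C_ID_4e6213e9bc,88,N,0,A,80,M_ID_e020e9b302,-8,-0.703331,"
    "2017-06-25 15:33:07,1.0,16,37\n"
    "Y,C_ID_4e6213e9bc,88,N,0,A,367,M_ID_86ec983688,-7,-0.733128,"
    "2017-07-15 12:10:45,1.0,16,16\n"
)

# Assumed dtypes, chosen from the column descriptions and the sample rows.
column_dtypes = {
    "authorized_flag": "category",
    "category_1": "category",
    "category_3": "category",
    "category_2": "float64",  # kept as float so missing values can be stored as NaN
    "installments": "int64",
    "month_lag": "int64",
    "purchase_amount": "float64",
}

sample = pd.read_csv(sample_csv, dtype=column_dtypes, parse_dates=["purchase_date"])
print(sample.dtypes)
```

Declaring dtypes explicitly avoids relying on inference from the first block of rows, which matters for columns that only show missing values later in the file.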


## Primary column exploration

In [None]:
fundamental_details = dataframe_explore(transactions)
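
The `dataframe_explore` helper is imported from `column_explore`, whose source is not shown in this notebook. As a rough illustration of the kind of per-column summary it returns, the sketch below builds a similar dictionary. The function `explore_columns` and its `max_listed` parameter are hypothetical names introduced for this example. It uses plain pandas rather than Dask for brevity and should not be read as the helper's actual implementation.

```python
import pandas as pd


def explore_columns(df, max_listed=10):
    """Hypothetical stand-in for dataframe_explore, not the notebook's helper."""
    # Non-null counts per column, stored under a 'dataframe' key.
    details = {"dataframe": {"count": df.count()}}
    for name in df.columns:
        column = df[name]
        info = {}
        # Summary statistics only make sense for numeric columns.
        if pd.api.types.is_numeric_dtype(column):
            info["description"] = column.describe()
        # List the distinct values when there are few, otherwise only count them.
        n_unique = column.nunique(dropna=False)
        if n_unique <= max_listed:
            info["uniques"] = column.unique()
        else:
            info["uniques"] = n_unique
        details[name] = info
    return details


# Tiny made-up frame that reuses three column names from the transactions table.
sample = pd.DataFrame({
    "authorized_flag": ["Y", "Y", "N"],
    "category_2": [1.0, None, 3.0],
    "city_id": [88, 88, -1],
})
summary = explore_columns(sample, max_listed=2)
```

Counting distinct values with `dropna=False` treats a missing value as its own category, so a column with gaps reports one more unique value than it has non-null levels.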

In [6]:
pprint(fundamental_details)

{'authorized_flag': {'top_values': array(['Y', 'N'], dtype=object)},
 'card_id': {'uniques': 325540},
 'category_1': {'uniques': array(['N', 'Y'], dtype=object)},
 'category_2': {'description': count    2.645950e+07
mean     2.194578e+00
std      1.531896e+00
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      4.000000e+00
max      5.000000e+00
dtype: float64,
                'uniques': array([ 1., nan,  3.,  5.,  2.,  4.])},
 'category_3': {'uniques': array(['A', 'B', 'C', nan], dtype=object)},
 'city_id': {'description': count    2.911236e+07
mean     1.293256e+02
std      1.042563e+02
min     -1.000000e+00
25%      6.900000e+01
50%      1.090000e+02
75%      2.130000e+02
max      3.470000e+02
dtype: float64,
             'uniques': 308},
 'dataframe': {'count': authorized_flag         29112361
card_id                 29112361
city_id                 29112361
category_1              29112361
installments            29112361
category_3              28934202
merc