# Description
This notebook shows simple ways to reduce the memory footprint of the large *historical_transactions* dataframe.

Tools used: changing data types, simple encoding of categorical variables, conversion of IDs.

In [None]:
#imports
import numpy as np 
import pandas as pd 

In [None]:
%time historical = pd.read_csv('../input/historical_transactions.csv') # (takes 1-2 minutes)

In [None]:
#Let's see what is stored in the dataframe
historical.info()

In [None]:
#Let's see how much memory it uses:
mem_use = historical.memory_usage(deep=True)
original_mem_use = mem_use.sum()
print ('total memory used: {:,} bytes'.format(mem_use.sum()))
mem_use

This DataFrame uses a lot of memory (close to 14 GB) because of the sheer number of records (more than 29 million) and the inefficient data types pandas chooses by default when importing a CSV.
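
To illustrate how much the default data types cost, here is a minimal sketch on made-up data. The rows and the `csv_text`, `default` and `compact` names are assumptions for illustration only; they are not taken from the real file.

```python
import io
import pandas as pd

# Made-up rows standing in for a few columns of the real CSV.
rows = "Y,88,0,-0.703\nN,-1,1,-0.733\nY,69,12,0.125\n" * 1000
csv_text = "authorized_flag,city_id,installments,purchase_amount\n" + rows

# Default import: text columns become string/object, numbers become int64/float64.
default = pd.read_csv(io.StringIO(csv_text))
default_bytes = default.memory_usage(index=False, deep=True).sum()

# The same values stored in the smallest types that still hold them.
compact = pd.DataFrame({
    "authorized_flag": default["authorized_flag"] == "Y",          # bool, 1 byte
    "city_id": default["city_id"].astype("int16"),                 # 2 bytes
    "installments": default["installments"].astype("int8"),        # 1 byte
    "purchase_amount": default["purchase_amount"].astype("float32"),  # 4 bytes
})
compact_bytes = compact.memory_usage(index=False, deep=True).sum()

print(default.dtypes)
print("default: {:,} bytes, compact: {:,} bytes".format(default_bytes, compact_bytes))
```

The compact frame needs 8 bytes per row for these four columns; the default import needs several times more, mostly because of the text column.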

Let's try to fix the data types:

In [None]:
# This function was written after quick data analysis exploring contents of each column and choosing datatypes with smaller memory footprint
# it may be applied both to historical dataframe and new transactions dataframe

def transactions_reduce (df_trans):
    df_trans.authorized_flag = (df_trans.authorized_flag == 'Y') # was Y/N 
    df_trans.city_id = df_trans.city_id.astype('int16') 
    df_trans.category_1 = (df_trans.category_1 == 'Y') # was Y/N 
    # historical.installments.unique() => [0,1,5,3,4,2,-1,10,6,12,8,7,9,11,999]
    df_trans.loc[df_trans.installments == 999,'installments'] = 99  # 999 likely used as code for "many", 99 allows using 'int8'
    df_trans.installments = df_trans.installments.astype('int8')
    
    # historical.category_3.unique() => ['A', 'B', 'C', nan]
    df_trans.category_3 = df_trans.category_3.fillna('?')  # '?' maps to -1 below, marking NaN
    df_trans.category_3 = df_trans.category_3.apply(ord)-64  # replacing A,B,C with 1,2,3
    df_trans.category_3 = df_trans.category_3.astype('int8')
    
    df_trans.merchant_category_id = df_trans.merchant_category_id.astype('int16')
    df_trans.month_lag = df_trans.month_lag.astype('int8')
    df_trans.purchase_amount = df_trans.purchase_amount.astype('float32')
    df_trans.purchase_date = pd.to_datetime(df_trans.purchase_date)  # infer_datetime_format is deprecated since pandas 2.0
    
    # historical.category_2.unique() => [  1.,  nan,   3.,   5.,   2.,   4.]
    df_trans.category_2 = df_trans.category_2.fillna(0)  # NaN becomes 0
    df_trans.category_2 = df_trans.category_2.astype('int8')
    
    df_trans.state_id = df_trans.state_id.astype('int8') # from -1 to 24
    df_trans.subsector_id = df_trans.subsector_id.astype('int8') # from -1 to 41

In [None]:
# applying reduction function
%time transactions_reduce(historical)

In [None]:
#Let's see how much memory it uses now:
mem_use = historical.memory_usage(deep=True)
print ('total memory used: {:,} bytes'.format(mem_use.sum()))
print ("Effective memory usage reduction to {0:0.2f}% of original size".format(mem_use.sum() / original_mem_use * 100))
mem_use

The memory used is now down to about a third of the original size.
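
The types in the reduction function were picked by looking at each column's range. One possible way to check such a choice is `pd.to_numeric` with `downcast='integer'`, which returns the smallest integer type that fits. The sketch below uses the distinct *installments* values listed in the function's comment; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# The distinct installments values listed in the reduction function above.
installments = pd.Series([0, 1, 5, 3, 4, 2, -1, 10, 6, 12, 8, 7, 9, 11, 999])

# int8 holds -128..127, so the 999 code does not fit.
print(np.iinfo("int8").min, np.iinfo("int8").max)
as_is = pd.to_numeric(installments, downcast="integer")

# After remapping 999 to 99, every value fits into int8.
remapped = installments.replace(999, 99)
smaller = pd.to_numeric(remapped, downcast="integer")

print(as_is.dtype, smaller.dtype)
```

This is why the function replaces 999 with 99 before casting: a plain `astype('int8')` on an out-of-range value would not keep the original number.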

The big memory hogs are now *merchant_id* and *card_id*, each using 72 bytes per record.

All *card_id* values look like "C_ID_0ab67a22ab", where the last 10 characters are a unique hexadecimal (base-16) number. We can convert it to 'int64', which uses 9 times less memory (8 bytes instead of 72), so "**C_ID_0ab67a22ab**" becomes **46011130539**.

A similar conversion can be done with *merchant_id*.
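
As a rough check of the per-record cost, the sketch below compares a column of string IDs with the same IDs stored as 'int64'. It repeats the single example ID from the text, so the data is synthetic, and the exact size of a Python string depends on the Python version.

```python
import pandas as pd

# 100,000 copies of the example ID from the text (real IDs differ per card).
ids = pd.Series(["C_ID_0ab67a22ab"] * 100_000, dtype=object)

# Parse the last 10 hex characters into an integer, as described above.
as_int = ids.apply(lambda s: int(s[-10:], 16)).astype("int64")

# Bytes per record: pointer plus string object vs. a plain 8-byte integer.
obj_per_record = ids.memory_usage(index=False, deep=True) / len(ids)
int_per_record = as_int.memory_usage(index=False, deep=True) / len(as_int)

print(obj_per_record, int_per_record)
```

Each string record costs an 8-byte pointer plus the Python string object itself, while the integer column costs exactly 8 bytes per record.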

**Important note:** To ensure consistency you must:
1. apply the same conversion to *card_id* in the *train* and *test* datasets
2. apply the same conversion to *merchant_id* in the *merchants* dataset
3. apply the reverse conversion before submitting results (or just preserve the row order in the *test* dataframe and copy the *card_id* column from the *sample* dataframe)
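
The reverse conversion could look roughly like the sketch below. The helper names `id_to_int` and `int_to_id` are hypothetical, and the code assumes that every ID is a fixed prefix followed by exactly 10 lowercase hex digits, as in the example above.

```python
def id_to_int(old_id):
    """Parse the last 10 hex characters of an ID into an integer."""
    return int(old_id[-10:], 16)

def int_to_id(value, prefix="C_ID_"):
    """Rebuild the string ID (assumes the prefix and 10 lowercase hex digits)."""
    return "{}{:010x}".format(prefix, value)

original = "C_ID_0ab67a22ab"
encoded = id_to_int(original)    # 46011130539
restored = int_to_id(encoded)    # back to the original string

print(encoded, restored)
```

The zero padding in `{:010x}` matters: without it the leading `0` of `0ab67a22ab` would be lost and the restored ID would not match the original.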


In [None]:
%%time
def id_gen (old_id):
    a= old_id[-10:]
    return int(a,16)

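# plain column assignment lets the dtype change from object to int64
# (.loc[:, col] = ... may keep the object dtype in recent pandas versions)
historical['card_id'] = historical.card_id.apply(id_gen)

historical['merchant_id'] = historical.merchant_id.fillna('-000000001')  # NA becomes -1 after conversion
historical['merchant_id'] = historical.merchant_id.apply(id_gen)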

In [None]:
#Let's see how much memory it uses now:
mem_use = historical.memory_usage(deep=True)
print ("Effective memory usage reduction to {0:0.2f}% of original size".format(mem_use.sum() / original_mem_use * 100))
print ('total memory used: {:,} bytes'.format(mem_use.sum()))
mem_use

The result may be saved to feather or your favorite format to reduce disk footprint and accelerate loading:

> import feather

> feather.write_dataframe(historical, 'historical_transactions.feather') # uses about 1.1 GB

> pd.read_feather('historical_transactions.feather') # this is used for loading

In [None]:
# cleanup of memory:
import gc
gc.collect()

In [None]:
"Effective memory usage reduction to {0:0.2f}% of original size or {1:0.1f}x !".format(mem_use.sum() / original_mem_use * 100,original_mem_use / mem_use.sum() )

In this exercise we achieved a **12x reduction** of the memory used.

With a slimmer dataset it should be easier and faster to work, especially on computers with 8 GB of RAM or less. Most of the code in Kaggle forums/kernels should be applicable without change or with minimal modifications.

Remember to apply the card_id and merchant_id transformations to all involved datasets and **use the original card_id format** in the submission.

**Good luck in the competition!**