# Dataframe Processing in Chunks
The objective of this project is to practice processing data and optimizing its memory footprint *while* limiting memory usage.
First, let's find out how much memory reading the data takes without any optimization.

In [26]:
import pandas as pd
full_data = pd.read_csv('loans_2007.csv')
full_data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 52 columns):
id                            42538 non-null object
member_id                     42535 non-null float64
loan_amnt                     42535 non-null float64
funded_amnt                   42535 non-null float64
funded_amnt_inv               42535 non-null float64
term                          42535 non-null object
int_rate                      42535 non-null object
installment                   42535 non-null float64
grade                         42535 non-null object
sub_grade                     42535 non-null object
emp_title                     39909 non-null object
emp_length                    41423 non-null object
home_ownership                42535 non-null object
annual_inc                    42531 non-null float64
verification_status           42535 non-null object
issue_d                       42535 non-null object
loan_status                   42535 non-null object
p

Read all at once, the data consumes 67MB. We will aim to use around 5MB at a time, which requires processing the data in chunks.

In [1]:
small_data = pd.read_csv('loans_2007.csv', nrows=5)
small_data

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


## Chunk Exploring
### Optimizing chunk size

In [9]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=1000)
chunk_memory = []
for chunk in chunk_iter:
    chunk_memory.append(chunk.memory_usage(deep=True).sum()/1024**2)
sum(chunk_memory)/len(chunk_memory)

1.526184924813204

We are well below 5MB, so we can afford to increase the chunk size.

In [11]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
chunk_memory = []
for chunk in chunk_iter:
    chunk_memory.append(chunk.memory_usage(deep=True).sum()/1024**2)
sum(chunk_memory)/len(chunk_memory)

4.381906572977702

A chunk size of 3000 keeps the average chunk at about 4.4MB, so it seems like a good choice.
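
The average can hide the largest chunk, which is what actually has to fit in memory. A minimal sketch that reports both the mean and the peak per-chunk footprint follows; `chunk_memory_mb` is a hypothetical helper, and the tiny in-memory CSV is a stand-in for `loans_2007.csv`.

```python
import io
import pandas as pd

def chunk_memory_mb(source, chunksize):
    """Return (mean, max) deep memory usage per chunk, in MB."""
    sizes = [
        chunk.memory_usage(deep=True).sum() / 1024**2
        for chunk in pd.read_csv(source, chunksize=chunksize)
    ]
    return sum(sizes) / len(sizes), max(sizes)

# Stand-in data; with the real file, pass 'loans_2007.csv' instead.
csv_text = "grade,loan_amnt\n" + "\n".join(f"B,{i}.5" for i in range(10))
mean_mb, max_mb = chunk_memory_mb(io.StringIO(csv_text), chunksize=4)
print(f"mean={mean_mb:.6f} MB, max={max_mb:.6f} MB")
```

If the peak is noticeably above the mean, a slightly smaller chunk size may be the safer choice.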
### Data types across chunks
Let's identify the column types as we batch process, in order to find possible optimizations.

In [30]:
import numpy as np
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
numeric_cols = []
object_cols = []
for chunk in chunk_iter:
    numeric_cols.append(chunk.select_dtypes(include=np.number).shape[1])
    object_cols.append(chunk.select_dtypes(include='object').shape[1])
print(numeric_cols)
print(object_cols)

[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


Apparently, one column switches type across chunks! Let's find out which one.

In [34]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
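old_num_cols = None
for chunk in chunk_iter:
    # Names of the columns parsed as numeric in this chunk
    new_num_cols = chunk.select_dtypes(include=np.number).columns.to_list()
    if old_num_cols is not None and new_num_cols != old_num_cols:
        print('something changed')
        # Columns that were numeric in the previous chunk but no longer are
        print(set(old_num_cols) - set(new_num_cols))
    old_num_cols = new_num_cols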


something changed
{'id'}


The id column is parsed as numeric in the first 13 chunks and as object in the last two. It is not relevant for analysis, so let's just ignore it.
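
For illustration, here is a small sketch of how such a switch can arise and two ways to sidestep it. The in-memory CSV is a made-up stand-in (it assumes a stray text value appears in `id` near the end of the file, which is not confirmed for `loans_2007.csv`); `usecols` and `dtype` are regular `read_csv` parameters.

```python
import io
import pandas as pd

# Stand-in data: 'id' is numeric in the first rows and text in the last one.
csv_text = "id,loan_amnt\n1,5000.0\n2,2500.0\n3,2400.0\nsummary row,\n"

# Without a dtype, each chunk is inferred independently.
inferred = [str(c['id'].dtype)
            for c in pd.read_csv(io.StringIO(csv_text), chunksize=2)]

# Option 1: skip the column entirely with a callable usecols.
skipped = [list(c.columns)
           for c in pd.read_csv(io.StringIO(csv_text), chunksize=2,
                                usecols=lambda name: name != 'id')]

# Option 2: force one dtype so every chunk agrees.
forced = [str(c['id'].dtype)
          for c in pd.read_csv(io.StringIO(csv_text), chunksize=2,
                               dtype={'id': str})]

print(inferred)
print(skipped)
print(forced)
```

Skipping the column matches the decision above; forcing a dtype is the alternative if the column is needed later.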

### Unique values and ratio in object columns
We want to find the unique values in object columns and identify potential candidates for categorical dtype, which will save a lot of memory.

In [5]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

col_values= {}

for chunk in chunk_iter:
    objects_only = chunk.select_dtypes(include='object')
    for col in objects_only.columns:
        if col in col_values:
            col_values[col].append(objects_only[col].value_counts())
        else:
            col_values[col] = [objects_only[col].value_counts()]
            
for col in col_values:
    conc = pd.concat(col_values[col])
    uniques = conc.groupby(conc.index).sum()
    percentage = len(uniques) / conc.sum() *100
    if percentage < 50:
        print(col, len(uniques), conc.sum(), percentage,'Less than 50% Unique')
    else:
        print(col, len(uniques), conc.sum(), percentage, 'More than 50% Uniques')
    

term 2 42535 0.004702010109321735 Less than 50% Unique
int_rate 394 42535 0.9262959915363819 Less than 50% Unique
grade 7 42535 0.01645703538262607 Less than 50% Unique
sub_grade 35 42535 0.08228517691313036 Less than 50% Unique
emp_title 30658 39909 76.81976496529605 More than 50% Uniques
emp_length 11 41423 0.026555295367308017 Less than 50% Unique
home_ownership 5 42535 0.011755025273304338 Less than 50% Unique
verification_status 3 42535 0.007053015163982603 Less than 50% Unique
issue_d 55 42535 0.12930527800634772 Less than 50% Unique
loan_status 9 42535 0.02115904549194781 Less than 50% Unique
pymnt_plan 2 42535 0.004702010109321735 Less than 50% Unique
purpose 14 42535 0.03291407076525214 Less than 50% Unique
title 21264 42522 50.00705517144066 More than 50% Uniques
zip_code 837 42535 1.967791230751146 Less than 50% Unique
addr_state 50 42535 0.11755025273304338 Less than 50% Unique
earliest_cr_line 530 42506 1.2468827930174564 Less than 50% Unique
revol_util 1119 42445 2.636352

Many columns have few unique values relative to their row count, but that doesn't mean every one of them should become a category: future data may increase the number of unique values in some of these columns.
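
To make the trade-off concrete, here is a small sketch on synthetic data (not the loan file) comparing object and category storage for a low-cardinality and a high-cardinality column. The helper `mb` and the generated values are illustrative assumptions.

```python
import pandas as pd

n = 10_000
# Low cardinality: 7 distinct grades repeated many times.
low = pd.Series(list("ABCDEFG") * (n // 7), dtype=object)
# High cardinality: every value distinct, like a free-text title.
high = pd.Series([f"title {i}" for i in range(n)], dtype=object)

def mb(series):
    """Deep memory usage of the values only (index excluded), in MB."""
    return series.memory_usage(deep=True, index=False) / 1024**2

for name, s in [("low", low), ("high", high)]:
    print(name, round(mb(s), 3), round(mb(s.astype("category")), 3))
```

On the low-cardinality series the category version should be far smaller. On the all-unique series there is typically nothing to gain, since every distinct string still has to be stored once.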

### Float columns with no missing values
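Let's find out whether any float columns could be converted to smaller integer types. That is only possible for columns without missing values, since NaN is itself a float.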

In [51]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

nulls = []
for chunk in chunk_iter:
    floats_only = chunk.select_dtypes(include='float')
    nulls.append(floats_only.isna())

pd.concat(nulls).sum()

member_id                        3
loan_amnt                        3
funded_amnt                      3
funded_amnt_inv                  3
installment                      3
annual_inc                       7
dti                              3
delinq_2yrs                     32
inq_last_6mths                  32
open_acc                        32
pub_rec                         32
revol_bal                        3
total_acc                       32
out_prncp                        3
out_prncp_inv                    3
total_pymnt                      3
total_pymnt_inv                  3
total_rec_prncp                  3
total_rec_int                    3
total_rec_late_fee               3
recoveries                       3
collection_recovery_fee          3
last_pymnt_amnt                  3
collections_12_mths_ex_med     148
policy_code                      3
acc_now_delinq                  32
chargeoff_within_12_mths       148
delinq_amnt                     32
pub_rec_bankruptcies

Unfortunately, every float column has at least one missing value.
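
As a side note, the null count can also be accumulated as a small running total rather than by keeping every boolean frame in memory. The sketch below uses a made-up two-column CSV as a stand-in for the loan file, and also shows why a NumPy integer dtype is ruled out once a column contains NaN.

```python
import io
import pandas as pd

# Stand-in data with a few missing values in each column.
csv_text = "a,b\n1.5,2.0\n,3.5\n4.5,\n5.5,6.5\n7.5,\n"

null_counts = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    counts = chunk.select_dtypes(include="float").isna().sum()
    # Keep only a running total per column, not the boolean frames.
    if null_counts is None:
        null_counts = counts
    else:
        null_counts = null_counts.add(counts, fill_value=0)

print(null_counts)

# A NumPy integer dtype cannot represent NaN, so the cast fails.
try:
    pd.Series([1.0, float("nan")]).astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```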
### Total initial memory usage

In [54]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

memory = 0
for chunk in chunk_iter:
    memory+= chunk.memory_usage(deep=True).sum() /1024**2
memory

65.72859859466553

The goal is to reduce this total by applying various optimizations.
## Optimizing Data Types
### Optimizing object columns
Let's take a look at our data again, this time at just the object columns.

In [14]:
small_data_objects_only = small_data.select_dtypes(include='object')
pd.set_option('display.max_columns', 999)
small_data_objects_only

Unnamed: 0,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,last_credit_pull_d,application_type
0,36 months,10.65%,B,B2,,10+ years,RENT,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,Jan-1985,83.7%,f,Jan-2015,Jun-2016,INDIVIDUAL
1,60 months,15.27%,C,C4,Ryder,< 1 year,RENT,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,Apr-1999,9.4%,f,Apr-2013,Sep-2013,INDIVIDUAL
2,36 months,15.96%,C,C5,,10+ years,RENT,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,Nov-2001,98.5%,f,Jun-2014,Jun-2016,INDIVIDUAL
3,36 months,13.49%,C,C1,AIR RESOURCES BOARD,10+ years,RENT,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,Feb-1996,21%,f,Jan-2015,Apr-2016,INDIVIDUAL
4,60 months,12.69%,B,B5,University Medical Group,1 year,RENT,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,Jan-1996,53.9%,f,Jun-2016,Jun-2016,INDIVIDUAL


Based on this preview and the exploration conducted earlier, we can:
* Optimize some columns by cleaning the data:
    * term: remove " months" and convert to the smallest suitable float type (it has missing values, so it cannot be an int)
    * int_rate and revol_util: remove the percent symbol and convert to the smallest suitable float type
* Change many columns to category, depending on the number of unique values and on whether it makes sense to do so. Let's set the threshold at 50 unique values: that includes the states without turning everything into a category.
* Read certain columns as datetimes to further reduce total memory: issue_d, earliest_cr_line, last_pymnt_d, last_credit_pull_d
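
The cleaning steps listed above can be sketched on a few stand-in values copied from the preview, plus a missing entry. This is only an illustration under those assumptions; the explicit `format="%b-%Y"` is my reading of the `Dec-2011` style and assumes English month abbreviations.

```python
import pandas as pd

# Stand-in values taken from the preview, plus a missing entry.
int_rate = pd.Series(["10.65%", "15.27%", None], dtype=object)
term = pd.Series(["36 months", "60 months", None], dtype=object)
issue_d = pd.Series(["Dec-2011", "Dec-2011"], dtype=object)

# Strip the unit text, then let pandas pick the smallest float that fits.
int_rate_clean = pd.to_numeric(int_rate.str.rstrip("%"), downcast="float")
term_clean = pd.to_numeric(
    term.str.replace(" months", "", regex=False).str.strip(),
    downcast="float",
)

# An explicit format avoids guessing the layout of 'Mon-YYYY' strings.
issue_d_clean = pd.to_datetime(issue_d, format="%b-%Y")

print(int_rate_clean.dtype, term_clean.dtype, issue_d_clean.dtype)
```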

### Creating the list of columns that can be turned into category and the dtype dictionary

In [15]:
to_category = []
for col in col_values:
    conc = pd.concat(col_values[col])
    uniques = conc.groupby(conc.index).sum()
    percentage = len(uniques) / conc.sum() *100
    if len(uniques) <= 50:
        print(col, len(uniques), conc.sum(), round(percentage,2))
        to_category.append(col)

col_dtypes = {}
for col in to_category:
    col_dtypes[col] = 'category'
    
print('\n', col_dtypes)

term 2 42535 0.0
grade 7 42535 0.02
sub_grade 35 42535 0.08
emp_length 11 41423 0.03
home_ownership 5 42535 0.01
verification_status 3 42535 0.01
loan_status 9 42535 0.02
pymnt_plan 2 42535 0.0
purpose 14 42535 0.03
addr_state 50 42535 0.12
initial_list_status 1 42535 0.0
application_type 1 42535 0.0

 {'term': 'category', 'grade': 'category', 'sub_grade': 'category', 'emp_length': 'category', 'home_ownership': 'category', 'verification_status': 'category', 'loan_status': 'category', 'pymnt_plan': 'category', 'purpose': 'category', 'addr_state': 'category', 'initial_list_status': 'category', 'application_type': 'category'}
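
One thing to keep in mind when reading categories chunk by chunk: each chunk builds its categories from the values it happens to contain. The sketch below, on two made-up chunks of a grade-like column, shows what that means if chunks are ever combined; it is a side note, not something the notebook itself does.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two chunks whose column saw different values.
first = pd.Series(["A", "B", "A"], dtype="category")
second = pd.Series(["B", "C"], dtype="category")

# Plain concatenation falls back to a non-categorical dtype.
combined = pd.concat([first, second], ignore_index=True)

# union_categoricals merges the category sets and keeps the compact dtype.
merged = pd.Series(union_categoricals([first, second]))

print(combined.dtype, merged.dtype)
```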


### Memory footprint of chunk processing with optimized object columns

In [21]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype = col_dtypes, parse_dates=["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"])
memory = 0
for chunk in chunk_iter:
    chunk['term'] = chunk['term'].str.replace(' months', '').astype('float')
    chunk['term'] = pd.to_numeric(chunk['term'], downcast='float')
    chunk['int_rate'] = chunk['int_rate'].str.replace('%', '').astype('float')
    chunk['int_rate'] = pd.to_numeric(chunk['int_rate'], downcast='float')
    chunk['revol_util'] = chunk['revol_util'].str.replace('%', '').astype('float')
    chunk['revol_util'] = pd.to_numeric(chunk['revol_util'], downcast='float')
    memory+= chunk.memory_usage(deep=True).sum() /1024**2
    
print(memory)
print('\n', chunk.dtypes)

21.125776290893555

 id                                    object
member_id                            float64
loan_amnt                            float64
funded_amnt                          float64
funded_amnt_inv                      float64
term                                 float32
int_rate                             float32
installment                          float64
grade                               category
sub_grade                           category
emp_title                             object
emp_length                          category
home_ownership                      category
annual_inc                           float64
verification_status                 category
issue_d                       datetime64[ns]
loan_status                         category
pymnt_plan                          category
purpose                             category
title                                 object
zip_code                              object
addr_state                        

We reduced our memory footprint by more than half, from about 65.7MB to about 21.1MB!
### Optimizing float columns
We know from the previous exploration that every float column contains missing values, so they all have to stay float. Let's see whether downcasting each float column to a smaller float type further improves our memory footprint.

In [22]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype = col_dtypes, parse_dates=["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"])
memory = 0
for chunk in chunk_iter:
    chunk['term'] = chunk['term'].str.replace(' months', '').astype('float')
    chunk['term'] = pd.to_numeric(chunk['term'], downcast='float')
    chunk['int_rate'] = chunk['int_rate'].str.replace('%', '').astype('float')
    chunk['int_rate'] = pd.to_numeric(chunk['int_rate'], downcast='float')
    chunk['revol_util'] = chunk['revol_util'].str.replace('%', '').astype('float')
    chunk['revol_util'] = pd.to_numeric(chunk['revol_util'], downcast='float')
    
    float_cols = chunk.select_dtypes(include=['float'])
    for col in float_cols.columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast='float')
    memory+= chunk.memory_usage(deep=True).sum() /1024**2
    
print(memory)
print('\n', chunk.dtypes)

16.257688522338867

 id                                    object
member_id                            float32
loan_amnt                            float32
funded_amnt                          float32
funded_amnt_inv                      float32
term                                 float32
int_rate                             float32
installment                          float32
grade                               category
sub_grade                           category
emp_title                             object
emp_length                          category
home_ownership                      category
annual_inc                           float32
verification_status                 category
issue_d                       datetime64[ns]
loan_status                         category
pymnt_plan                          category
purpose                             category
title                                 object
zip_code                              object
addr_state                        
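
A caveat worth checking before downcasting identifier-like columns such as member_id: float32 halves the storage, but it can only represent integers exactly up to 2**24. The sketch below uses made-up amounts to show both effects. The preview values of member_id (around 1.3 million) are within that range, but whether that holds for the whole file is an assumption to verify.

```python
import numpy as np
import pandas as pd

amounts = pd.Series([5000.0, 2500.0, np.nan, 10000.0])  # float64 by default
small = pd.to_numeric(amounts, downcast="float")

print(amounts.dtype, amounts.to_numpy().nbytes)
print(small.dtype, small.to_numpy().nbytes)

# float32 has a 24-bit significand: integers above 2**24 can collide.
collides = np.float32(16_777_217) == np.float32(16_777_216)
print(collides)
```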

## Conclusion
While keeping each chunk at around 5MB, we reduced the total memory footprint of a file that initially took 67MB to about 16.26MB.
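
As a quick check of the savings at each stage, the chunked totals printed earlier can be plugged into a few lines of arithmetic. Note that this uses the chunked baseline (about 65.7MB) rather than the 67MB reported when reading the file in one go.

```python
initial_mb = 65.72859859466553    # chunked total before optimization
objects_mb = 21.125776290893555   # after optimizing object columns
final_mb = 16.257688522338867     # after also downcasting floats

saved_objects = 1 - objects_mb / initial_mb
saved_final = 1 - final_mb / initial_mb

print(f"object columns: {objects_mb:.2f} MB, {saved_objects:.1%} saved")
print(f"plus floats:    {final_mb:.2f} MB, {saved_final:.1%} saved")
```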