# Optimizing Dataframe Footprint

In this project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/), a marketplace for personal loans that matches borrowers with investors. You can read more about the marketplace on [its website](https://www.lendingclub.com/public/how-peer-lending-works.action).

The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the [origination fee](https://help.lendingclub.com/hc/en-us/articles/214501207-What-is-the-origination-fee-) that Lending Club charges.

We'll be working with a dataset of loans approved from `2007-2011`, which you can download from [Lending Club's website](https://www.lendingclub.com/info/download-data.action). In the version of the dataset we'll be working with here, the `desc` column was removed.

If we read in the entire data set, it will consume about 65 megabytes of memory. Let's imagine that we only have 10 megabytes of memory available throughout this project, so we can practice the concept of dataframe optimization and working with chunks.

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 99

In [2]:
#reading the first 5 rows from the dataset
loans_first_5 = pd.read_csv('dataset/loans_2007.csv', nrows=5)
loans_first_5

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


Next, we read in the first 1000 rows of the data set and calculate their total memory usage in megabytes.

In [3]:
loans_first_thousand = pd.read_csv('dataset/loans_2007.csv', nrows=1000)
loans_first_thousand.memory_usage(deep=True).sum()/(1024 * 1024)

1.5388107299804688

We increase the chunk size to 3000 rows, which keeps each chunk's memory usage under **5 megabytes**, half of our 10-megabyte budget, to stay on the conservative side.

In [4]:
chunk_iter = pd.read_csv('dataset/loans_2007.csv', chunksize=3000)

print("Each chunk size (in MB):")
for chunk in chunk_iter:
    print(chunk.memory_usage(deep=True).sum()/(1024 * 1024))

Each chunk size (in MB):
4.614727020263672
4.6104736328125
4.612231254577637
4.613583564758301
4.609776496887207
4.611659049987793
4.610250473022461
4.612619400024414
4.610745429992676
4.610795974731445
4.623508453369141
4.62237548828125
4.629182815551758
4.862635612487793
0.874720573425293
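The chunk size of 3000 rows can also be cross-checked with simple arithmetic: divide the memory budget by the average memory per row of a sample. The sketch below is only an illustration under stated assumptions. It uses a small synthetic dataframe instead of the loans file, and the helper name `rows_for_budget` is hypothetical, not part of pandas.

```python
import pandas as pd

def rows_for_budget(sample: pd.DataFrame, budget_mb: float) -> int:
    """Estimate how many rows shaped like `sample` fit in `budget_mb` megabytes."""
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    return int(budget_mb * 1024 * 1024 // bytes_per_row)

# Synthetic stand-in for a sample read with pd.read_csv(..., nrows=1000)
sample = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0] * 250,
    "term": ["36 months", "60 months", "36 months", "36 months"] * 250,
})

estimated_rows = rows_for_budget(sample, budget_mb=5)
print(estimated_rows)
```

Applied to the 1000-row sample above (about 1.54 MB), the same arithmetic gives roughly 3,200 rows for a 5 MB budget, in line with the 3000 used here. Rows at the top of a file may not be representative, so some headroom is sensible: the fourteenth chunk is noticeably larger than the others.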


In [5]:
#Total number of rows in the dataset
chunk_iter = pd.read_csv('dataset/loans_2007.csv', chunksize=3000)
total = 0
for chunk in chunk_iter:
    total += len(chunk)
    
print(total)

42538


In [6]:
#Finding out the total number of numeric columns and object columns in each chunk

chunk_iter = pd.read_csv('dataset/loans_2007.csv', chunksize=3000)
string_col_count=[]
numeric_col_count=[]
for chunk in chunk_iter:
    string_col_count.append(chunk.select_dtypes(include=['object']).shape[1])
    numeric_col_count.append(chunk.select_dtypes(include=[np.number]).shape[1])

print("Total string columns in each chunk:")
print(string_col_count,"\n")

print("Total numeric columns in each chunk:")
print(numeric_col_count)

Total string columns in each chunk:
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22] 

Total numeric columns in each chunk:
[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]


Most chunks contain **21 object columns** and **31 numeric columns**. In the last 2 chunks, however, the number of string columns increases by 1 and the number of numeric columns decreases by 1. Let's investigate which column is responsible.

In [7]:
chunk_iter = pd.read_csv('dataset/loans_2007.csv', chunksize=3000)
overall_obj_col = []
i = 1
for chunk in chunk_iter:
    chunk_obj_col = chunk.select_dtypes(include=['object']).columns.tolist()
    
    if len(overall_obj_col) > 0 :
        if not(len(overall_obj_col) == len(chunk_obj_col)):
            print("chunk {}".format(i))
            print("---------", "\n")
            print("Overall string Cols:\n--------------------\n", overall_obj_col, "\n")
            print("string Cols for current chunk:\n------------------------------\n", chunk_obj_col, "\n")
            print("_____________________________________________________________________________________________________________________________")
    else:
        overall_obj_col = chunk_obj_col
    
    i+=1

chunk 14
--------- 

Overall string Cols:
--------------------
 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 

string Cols for current chunk:
------------------------------
 ['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 

_____________________________________________________________________________________________________________________________
chunk 15
--------- 

Overall string Cols:
--------------------
 ['term', 'int_rate', 'grade', 'sub_grade', 'emp_t

From the above investigation, one column in particular (the `id` column) is read as a string (`object`) column in the last 2 chunks but as a numeric column in the earlier chunks. Since the `id` column won't be useful for analysis, visualization, or predictive modelling, let's ignore this column for optimization at the moment.
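Because `read_csv` infers dtypes separately for each chunk, the same column can come back with different dtypes in different chunks. The following minimal sketch reproduces the effect on a synthetic CSV (not the loans file; the non-numeric `id` value is made up for illustration) and shows one possible way to keep the dtype consistent by passing `dtype` explicitly.

```python
import io
import pandas as pd

# Synthetic CSV: the last row carries a non-numeric id (hypothetical example).
csv_text = "id,loan_amnt\n1,5000\n2,2500\n3,2400\nsummary row,\n"

# Without an explicit dtype, each chunk's dtypes are inferred on their own.
inferred = [chunk["id"].dtype
            for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]
print(inferred)

# Passing a dtype for the column keeps it consistent across all chunks.
fixed = [chunk["id"].dtype
         for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2,
                                  dtype={"id": str})]
print(fixed)
```

The first chunk contains only whole numbers and is inferred as an integer column, while the second chunk contains text and is not.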

Below, we count the unique values in each object column. We also identify which object columns have fewer than 50 unique values, so that we can convert those columns to the `category` type.

In [8]:
chunk_iter = pd.read_csv('dataset/loans_2007.csv', chunksize=3000)

unique_cols = {}

for chunk in chunk_iter:
    df_obj_cols = chunk.select_dtypes(include=['object'])
    obj_cols = df_obj_cols.columns
    
    for cols in obj_cols:
        val_counts = df_obj_cols[cols].value_counts()
        if cols in unique_cols:
            unique_cols[cols].append(val_counts)
        else:
            unique_cols[cols] = [val_counts]

unique_combined = {}

print("unique values in each string column:")
print("----------------------------------------------")
for u_c in unique_cols:
    combined_u_c = pd.concat(unique_cols[u_c])
    final_u_c = combined_u_c.groupby(combined_u_c.index).sum()
    unique_combined[u_c] = final_u_c
    print(u_c, ": ", final_u_c.shape[0])
        
print("\nstring columns containing less than 50 unique values:")
print("----------------------------------------------------------------")
for u_c in unique_combined:
    no_of_u_vals = unique_combined[u_c].shape[0]
    if no_of_u_vals < 50:
        print(u_c, ": ", no_of_u_vals)

unique values in each string column:
----------------------------------------------
term :  2
int_rate :  394
grade :  7
sub_grade :  35
emp_title :  30658
emp_length :  11
home_ownership :  5
verification_status :  3
issue_d :  55
loan_status :  9
pymnt_plan :  2
purpose :  14
title :  21264
zip_code :  837
addr_state :  50
earliest_cr_line :  530
revol_util :  1119
initial_list_status :  1
last_pymnt_d :  103
last_credit_pull_d :  108
application_type :  1
id :  3538

string columns containing less than 50 unique values:
----------------------------------------------------------------
term :  2
grade :  7
sub_grade :  35
emp_length :  11
home_ownership :  5
verification_status :  3
loan_status :  9
pymnt_plan :  2
purpose :  14
initial_list_status :  1
application_type :  1
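Columns with few distinct values are good candidates for the `category` type because each distinct string is stored once and every row holds only a small integer code. As a rough illustration, the sketch below compares the two representations on a synthetic column standing in for `term`. The variable names are made up for this example.

```python
import pandas as pd

# Synthetic column with two distinct values, standing in for `term`
term = pd.Series(["36 months", "60 months"] * 1500)
term_as_category = term.astype("category")

as_string_bytes = term.memory_usage(deep=True)
as_category_bytes = term_as_category.memory_usage(deep=True)

print(as_string_bytes, as_category_bytes)
```

The saving shrinks as the number of distinct values approaches the number of rows, which is why a cutoff such as 50 unique values is a reasonable rule of thumb rather than a hard limit.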


Below, the `cat_types` variable maps each column that we have decided to convert from the `object` type to the `category` type, because each of them has fewer than 50 unique values.

In [9]:
cat_types = {"term": "category", "grade":"category", "sub_grade": "category", "emp_length":"category",
           "home_ownership": "category", "verification_status":"category", "loan_status": "category", 
            "pymnt_plan":"category", "purpose":"category", "initial_list_status":"category",
           "application_type":"category"}

Next, among the remaining `object` columns, we will see which ones can be converted to a numeric type and which to the datetime type. Let us inspect those columns below.

In [10]:
keep_cols = ["int_rate", "emp_title", "issue_d", "title", "zip_code", "addr_state", 
             "earliest_cr_line", "revol_util", "last_pymnt_d", "last_credit_pull_d"]

first_13_rows = pd.read_csv('dataset/loans_2007.csv', nrows=13, usecols=keep_cols)
first_13_rows

Unnamed: 0,int_rate,emp_title,issue_d,title,zip_code,addr_state,earliest_cr_line,revol_util,last_pymnt_d,last_credit_pull_d
0,10.65%,,Dec-2011,Computer,860xx,AZ,Jan-1985,83.7%,Jan-2015,Jun-2016
1,15.27%,Ryder,Dec-2011,bike,309xx,GA,Apr-1999,9.4%,Apr-2013,Sep-2013
2,15.96%,,Dec-2011,real estate business,606xx,IL,Nov-2001,98.5%,Jun-2014,Jun-2016
3,13.49%,AIR RESOURCES BOARD,Dec-2011,personel,917xx,CA,Feb-1996,21%,Jan-2015,Apr-2016
4,12.69%,University Medical Group,Dec-2011,Personal,972xx,OR,Jan-1996,53.9%,Jun-2016,Jun-2016
5,7.90%,Veolia Transportaton,Dec-2011,My wedding loan I promise to pay back,852xx,AZ,Nov-2004,28.3%,Jan-2015,Jan-2016
6,15.96%,Southern Star Photography,Dec-2011,Loan,280xx,NC,Jul-2005,85.6%,May-2016,May-2016
7,18.64%,MKC Accounting,Dec-2011,Car Downpayment,900xx,CA,Jan-2007,87.5%,Jan-2015,Dec-2014
8,21.28%,,Dec-2011,Expand Business & Buy Debt Portfolio,958xx,CA,Apr-2004,32.6%,Apr-2012,Aug-2012
9,12.69%,Starbucks,Dec-2011,Building my credit history.,774xx,TX,Sep-2004,36.5%,Nov-2012,Mar-2013


From the above excerpt of the dataframe, it looks like the `int_rate`, `revol_util`, and `zip_code` columns can be converted to a numeric type after removing the `%` and `xx` substrings where needed.

However, we need to be sure that all the values end with the expected substring: `%` for the `int_rate` and `revol_util` columns and `xx` for the `zip_code` column. For that reason, the `find_to_numeric_suitabiity()` function defined below takes a column name and the substring its values are expected to end with. If the function finds any value that does not end with the expected substring, it prints that value and reports that the column is not suitable for numeric conversion with the current cleaning procedure.

In [11]:
def find_to_numeric_suitabiity(col, ends_with):
    """Check that every non-null value in `col` ends with `ends_with`."""
    chunk_iter = pd.read_csv('dataset/loans_2007.csv', chunksize=3000)
    vc_list = []
    for chunk in chunk_iter:
        vc_list.append(chunk[col].value_counts())

    to_numeric_flag = True

    # Merge the per-chunk counts so that each distinct value is checked once
    combined_vc = pd.concat(vc_list)
    final_vc = combined_vc.groupby(combined_vc.index).sum()
    for f_ind in final_vc.index:
        if not f_ind.endswith(ends_with):
            # Print every value that lacks the expected suffix
            to_numeric_flag = False
            print(f_ind)

    if to_numeric_flag:
        print("'" + col +"' column is a candidate for conversion to numeric value")
    else:
        print("'" + col +"' column is not a candidate for conversion to numeric value")

Calling `find_to_numeric_suitabiity()` on the `int_rate`, `revol_util`, and `zip_code` columns reports that all three are suitable candidates for conversion to a numeric type.

In [12]:
find_to_numeric_suitabiity('int_rate', "%")
find_to_numeric_suitabiity('revol_util', "%")
find_to_numeric_suitabiity('zip_code', "xx")

'int_rate' column is a candidate for conversion to numeric value
'revol_util' column is a candidate for conversion to numeric value
'zip_code' column is a candidate for conversion to numeric value
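Once every value is known to end with the expected suffix, the cleaning step is to strip the suffix and pass the result to `pd.to_numeric`. Here is a minimal sketch on synthetic values shaped like `int_rate` and `zip_code` (not read from the loans file):

```python
import pandas as pd

# Synthetic values in the same shape as `int_rate` and `zip_code`
int_rate = pd.Series(["10.65%", "15.27%", None])
zip_code = pd.Series(["860xx", "309xx", None])

# Strip the trailing characters, then convert the remaining text to numbers.
int_rate_numeric = pd.to_numeric(int_rate.str.rstrip("%"))
zip_code_numeric = pd.to_numeric(zip_code.str.rstrip("x"))

print(int_rate_numeric.tolist())
print(zip_code_numeric.tolist())
```

Missing values become `NaN`, so the result is a float column. Note also that a numeric zip code prefix drops any leading zeros (a prefix such as `012xx` would become `12`), which is acceptable for memory optimization but matters if the prefix is later displayed or matched as text.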


Below, the `num_col_candidates` variable stores the names of the columns that are suitable for conversion to a numeric type, while `date_col_candidates` stores the names of the columns that are suitable for conversion to the datetime type.

In [13]:
num_col_candidates = ["int_rate", "revol_util", 'zip_code']

date_col_candidates = ["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]
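The date columns hold month-year strings such as `Dec-2011`. As an illustration of what the datetime conversion does, the sketch below parses synthetic values of that shape with `pd.to_datetime` and an explicit format. This is an alternative way to show the conversion in isolation, not the call used in this notebook, and it assumes an English locale for the month abbreviations.

```python
import pandas as pd

# Synthetic values in the same "Mon-YYYY" shape as `issue_d`
issue_d = pd.Series(["Dec-2011", "Jan-1985", None])

# %b is the abbreviated month name, %Y the four-digit year.
parsed = pd.to_datetime(issue_d, format="%b-%Y")

print(parsed.dtype)
print(parsed.tolist())
```

Each parsed value takes a fixed 8 bytes, and a missing value becomes `NaT`. Because the strings carry no day, every date is set to the first day of its month.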

Let's briefly take one last look at the size of the data set before any optimization. It is about **65 MB** at the moment. Later, after the various optimizations, we will see how much it shrinks.

In [14]:
#Total size of the dataset
chunk_iter = pd.read_csv('dataset/loans_2007.csv', chunksize=3000)
total = 0
for chunk in chunk_iter:
    total += chunk.memory_usage(deep=True).sum()/(1024 ** 2)
    
print("Total size of the data set: " + str(total) + " MB")

Total size of the data set: 65.72928524017334 MB


After converting the necessary columns to category, numeric, and datetime types, the dataframe footprint drops from about **65 MB** to about **21.5 MB**, with each chunk taking around **1.5 MB** (previously around **4.5 MB**).

In [15]:
chunk_iter = pd.read_csv('dataset/loans_2007.csv', dtype=cat_types, chunksize=3000, parse_dates=date_col_candidates)
total = 0
print("Each chunk size (in MB):")
for chunk in chunk_iter:
    int_rate_cleaned = chunk['int_rate'].str.rstrip("%")
    revol_util_cleaned = chunk['revol_util'].str.rstrip("%")
    zip_code_cleaned = chunk['zip_code'].str.rstrip("x")
    chunk['int_rate'] = pd.to_numeric(int_rate_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_util_cleaned)
    chunk['zip_code'] = pd.to_numeric(zip_code_cleaned)
    
    total += chunk.memory_usage(deep=True).sum()/(1024 * 1024)
    print(chunk.memory_usage(deep=True).sum()/(1024 * 1024))

print("\n")
print("Total size of the optimized data set: " + str(total) + " MB")

Each chunk size (in MB):
1.4969415664672852
1.493788719177246
1.4957637786865234
1.495987892150879
1.492879867553711
1.4934778213500977
1.4929475784301758
1.4946022033691406
1.4932737350463867
1.494241714477539
1.5075302124023438
1.5058317184448242
1.5135526657104492
1.6624011993408203
0.30411434173583984


Total size of the optimized data set: 21.43733501434326 MB


Let's check the dtype of each column after optimization. It seems that all the type conversions were successful. However, we can still optimize the data set further by converting the `float64` columns to a smaller numeric type, such as the smallest possible `int` type.

But first, we have to check whether any of these numeric columns have **zero** missing values. Only a column with no missing values is a candidate for conversion to an integer type.

In [16]:
chunk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 538 entries, 42000 to 42537
Data columns (total 52 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   id                          538 non-null    object        
 1   member_id                   536 non-null    float64       
 2   loan_amnt                   536 non-null    float64       
 3   funded_amnt                 536 non-null    float64       
 4   funded_amnt_inv             536 non-null    float64       
 5   term                        536 non-null    category      
 6   int_rate                    536 non-null    float64       
 7   installment                 536 non-null    float64       
 8   grade                       536 non-null    category      
 9   sub_grade                   536 non-null    category      
 10  emp_title                   499 non-null    object        
 11  emp_length                  536 non-null    category

After further inspection, no numeric column has **zero** missing values. Since a standard `int` column cannot represent missing values, none of these columns can be converted to an `int` type. However, we can still convert the `float64` columns to a smaller `float` type where possible.

In [17]:
chunk_iter = pd.read_csv('dataset/loans_2007.csv', chunksize=3000)
missing = []
for chunk in chunk_iter:
    int_rate_cleaned = chunk['int_rate'].str.rstrip("%")
    revol_util_cleaned = chunk['revol_util'].str.rstrip("%")
    zip_code_cleaned = chunk['zip_code'].str.rstrip("x")
    chunk['int_rate'] = pd.to_numeric(int_rate_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_util_cleaned)
    chunk['zip_code'] = pd.to_numeric(zip_code_cleaned)
    
    float_cols = chunk.select_dtypes(include=['float'])
    missing.append(float_cols.isnull().sum())

combined_missing = pd.concat(missing)
combined_missing.groupby(combined_missing.index).sum().sort_values()

zip_code                         3
policy_code                      3
out_prncp_inv                    3
out_prncp                        3
total_rec_prncp                  3
member_id                        3
loan_amnt                        3
last_pymnt_amnt                  3
int_rate                         3
installment                      3
recoveries                       3
funded_amnt_inv                  3
funded_amnt                      3
dti                              3
total_pymnt                      3
total_pymnt_inv                  3
total_rec_int                    3
collection_recovery_fee          3
total_rec_late_fee               3
revol_bal                        3
annual_inc                       7
total_acc                       32
acc_now_delinq                  32
pub_rec                         32
inq_last_6mths                  32
delinq_amnt                     32
delinq_2yrs                     32
open_acc                        32
revol_util          
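The reason missing values block the integer conversion is that NumPy-backed integer dtypes have no way to store `NaN`. The sketch below demonstrates this on a synthetic column shaped like `delinq_2yrs`. It also shows pandas' nullable integer dtype (`Int8`), an alternative that this notebook does not use and that is mentioned here only as an assumption about what could be tried.

```python
import numpy as np
import pandas as pd

# Synthetic whole-number column with one missing value, like `delinq_2yrs`
delinq = pd.Series([0.0, 1.0, np.nan, 2.0])

# NumPy-backed integer dtypes cannot hold NaN, so this conversion fails.
try:
    delinq.astype("int64")
    plain_int_ok = True
except ValueError:
    plain_int_ok = False

# The nullable extension dtype keeps a separate mask for missing values.
nullable = delinq.astype("Int8")

print(plain_int_ok)
print(nullable.dtype, nullable.isna().sum())
```

Whether the nullable dtype is worth adopting depends on the pandas version in use and on how downstream code handles `pd.NA`, so the float downcast remains the simpler route.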

Here, we additionally downcast all the `float64` columns to a smaller `float` type, which reduces the total size of the data set further, from about **21.5 MB** to about **16 MB**.

In [18]:
float_cols = ['zip_code', 'policy_code', 'out_prncp_inv', 'out_prncp', 'total_rec_prncp', 'member_id', 'loan_amnt',
             'last_pymnt_amnt', 'int_rate', 'installment', 'recoveries', 'funded_amnt_inv', 'funded_amnt', 'dti',
             'total_pymnt', 'total_pymnt_inv', 'total_rec_int', 'collection_recovery_fee', 'total_rec_late_fee',
             'revol_bal', 'annual_inc', 'total_acc', 'acc_now_delinq', 'pub_rec', 'inq_last_6mths', 'delinq_amnt',
             'delinq_2yrs', 'open_acc', 'revol_util', 'tax_liens', 'collections_12_mths_ex_med', 'chargeoff_within_12_mths',
             'pub_rec_bankruptcies']

chunk_iter = pd.read_csv('dataset/loans_2007.csv', dtype=cat_types, chunksize=3000, parse_dates=date_col_candidates)
total = 0
for chunk in chunk_iter:
    int_rate_cleaned = chunk['int_rate'].str.rstrip("%")
    revol_util_cleaned = chunk['revol_util'].str.rstrip("%")
    zip_code_cleaned = chunk['zip_code'].str.rstrip("x")
    chunk['int_rate'] = pd.to_numeric(int_rate_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_util_cleaned)
    chunk['zip_code'] = pd.to_numeric(zip_code_cleaned)
    
    for f_col in float_cols:
        chunk[f_col] = pd.to_numeric(chunk[f_col], downcast='float')
        
    total += chunk.memory_usage(deep=True).sum()/(1024 * 1024)
    print(chunk.memory_usage(deep=True).sum()/(1024 * 1024))
    
print(total)

1.1192865371704102
1.116133689880371
1.1181087493896484
1.118332862854004
1.115224838256836
1.1158227920532227
1.1152925491333008
1.1169471740722656
1.1156187057495117
1.116586685180664
1.1298751831054688
1.1281766891479492
1.1358976364135742
1.2847461700439453
0.2363882064819336
16.082438468933105
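The saving comes from halving the bytes per value: `float64` uses 8 bytes and `float32` uses 4. The following sketch shows the effect of `pd.to_numeric(..., downcast='float')` on a synthetic column standing in for `loan_amnt`, and adds a small precision check as a caveat.

```python
import numpy as np
import pandas as pd

# Synthetic float64 column standing in for `loan_amnt`
loan_amnt = pd.Series([5000.0, 2500.0, 2400.0, np.nan] * 250)

downcast = pd.to_numeric(loan_amnt, downcast="float")

before = loan_amnt.memory_usage(index=False)
after = downcast.memory_usage(index=False)
print(loan_amnt.dtype, "->", downcast.dtype)
print(before, "->", after)

# float32 has a 24-bit significand, so these two whole numbers collide.
print(np.float32(16777217.0) == np.float32(16777216.0))
```

Whole numbers above 16,777,216 are not all representable in `float32`. The `member_id` values shown earlier are below that limit, but larger identifiers could be rounded, so identifier-like columns deserve a check before downcasting.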


Upon further inspection, we see that all the columns of `float64` type have been converted to the `float32` type.

In [19]:
chunk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 538 entries, 42000 to 42537
Data columns (total 52 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   id                          538 non-null    object        
 1   member_id                   536 non-null    float32       
 2   loan_amnt                   536 non-null    float32       
 3   funded_amnt                 536 non-null    float32       
 4   funded_amnt_inv             536 non-null    float32       
 5   term                        536 non-null    category      
 6   int_rate                    536 non-null    float32       
 7   installment                 536 non-null    float32       
 8   grade                       536 non-null    category      
 9   sub_grade                   536 non-null    category      
 10  emp_title                   499 non-null    object        
 11  emp_length                  536 non-null    category