# Optimizing DataFrames and Processing in Chunks Project

In this project, we practice optimizing pandas DataFrame memory usage and working with DataFrames in chunks. We use data from [Lending Club](https://www.lendingclub.com/) on approved loan applications. The data uses about 67 megabytes of memory, and for the purposes of this project we assume we only have 10 megabytes of memory available. 

## Viewing the First Five Rows

In [26]:
import pandas as pd
pd.options.display.max_columns = 99

In [27]:
first_five = pd.read_csv('loans_2007.csv', nrows=5)

In [28]:
first_five

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


## Selecting an Appropriate Chunk Size

In [29]:
thousand_rows = pd.read_csv('loans_2007.csv', nrows=1000)

In [30]:
thousand_rows.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 52 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          1000 non-null   int64  
 1   member_id                   1000 non-null   float64
 2   loan_amnt                   1000 non-null   float64
 3   funded_amnt                 1000 non-null   float64
 4   funded_amnt_inv             1000 non-null   float64
 5   term                        1000 non-null   object 
 6   int_rate                    1000 non-null   object 
 7   installment                 1000 non-null   float64
 8   grade                       1000 non-null   object 
 9   sub_grade                   1000 non-null   object 
 10  emp_title                   949 non-null    object 
 11  emp_length                  983 non-null    object 
 12  home_ownership              1000 non-null   object 
 13  annual_inc                  1000 n

Above, we see that 1,000 rows use about 1.5 MB of memory. We have 10 MB of memory to work with, so we can safely increase our chunk size to around 5 MB. At roughly 1.5 MB per 1,000 rows, 3,000 rows should come to about 4.5 MB, so we use a chunk size of 3,000 rows.
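The chunk-size estimate is simple proportional arithmetic. The sketch below is a minimal illustration; the helper name `rows_per_chunk` is hypothetical, and the inputs are the approximate figures quoted above rather than exact measurements.

```python
def rows_per_chunk(sample_rows, sample_mb, target_mb):
    """Estimate how many rows fit in target_mb, given a measured sample.

    Hypothetical helper: assumes memory grows roughly linearly with row count.
    """
    return int(sample_rows * target_mb / sample_mb)


# Approximate figures from the text: 1,000 rows took about 1.5 MB,
# and we are aiming for chunks of about 5 MB.
estimate = rows_per_chunk(sample_rows=1000, sample_mb=1.5, target_mb=5)
print(estimate)
```

Rounding the estimate down to 3,000 rows keeps each chunk a little under the 5 MB target.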

In [39]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in chunk_iter:
    print(chunk.memory_usage(deep=True).sum()/ 1048576)

4.580394744873047
4.576141357421875
4.577898979187012
4.579251289367676
4.575444221496582
4.577326774597168
4.575918197631836
4.578287124633789
4.576413154602051
4.57646369934082
4.589176177978516
4.588043212890625
4.594850540161133
4.828314781188965
0.868586540222168


Here, we can see that each chunk comes in below 5 MB, which is well within our memory restrictions. 

In [52]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
mem_usage = 0 
for chunk in chunk_iter:
    mem_usage += chunk.memory_usage(deep=True).sum()/ 1048576
print(mem_usage) #Total memory usage of dataset

65.24251079559326


## Exploring Column Datatypes

In [55]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
num_cols = []
str_cols = []
for chunk in chunk_iter:
    num_cols.append(chunk.select_dtypes(include=['float', 'integer']).columns.size)
    str_cols.append(chunk.select_dtypes(include=['object']).columns.size)

### Column Datatypes by Chunk

In [54]:
num_cols

[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]

In [56]:
str_cols

[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]

Above, we see that one column switches from a numeric type to a string type in the last two chunks. To investigate this, we will compare the column datatypes for the first set of rows with the last chunk. 
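The same question (which column changes type, and in which chunk) can also be answered in a single pass. The sketch below is only an illustration under stated assumptions: it reads a tiny made-up CSV from memory instead of `loans_2007.csv`, and it compares dtype names as strings against the first chunk.

```python
import io

import pandas as pd

# Made-up data (an assumption): the third row holds text in the id column.
csv_text = "id,x\n1,1.0\n2,2.0\nabc,3.0\n4,4.0\n"

baseline = None  # dtype names of the first chunk
changed = {}     # column name -> list of chunk numbers whose dtype differs

for chunk_number, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=2)):
    current = chunk.dtypes.astype(str)
    if baseline is None:
        baseline = current
        continue
    # Keep only the columns whose dtype name differs from the first chunk
    for col in current[current != baseline].index:
        changed.setdefault(col, []).append(chunk_number)

print(changed)
```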

In [59]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
num_rows_total = 0
for chunk in chunk_iter:
    num_rows_total += chunk.shape[0]
num_rows_total

42538

In [64]:
last_rows = pd.read_csv('loans_2007.csv', skiprows=range(1,40000))

In [69]:
last_rows.dtypes[last_rows.dtypes != thousand_rows.dtypes]

id    object
dtype: object

In [70]:
thousand_rows.dtypes[last_rows.dtypes != thousand_rows.dtypes]

id    int64
dtype: object

Here, we see that the id column changes from an integer to an object datatype in the last two chunks. The other columns appear to keep the same datatypes across chunks.
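A column that is otherwise numeric usually switches to a string type because a few rows contain text. The notebook does not inspect those rows, so the sketch below is an assumption about one plausible cause, shown on made-up data: `pd.to_numeric` with `errors='coerce'` turns unparseable values into NaN, which makes the offending rows easy to pull out.

```python
import io

import pandas as pd

# Made-up tail of a file (an assumption): the last row is a text note, not a loan.
csv_text = "id,loan_amnt\n101,5000\n102,2500\nTotal amount funded: 7500,\n"
tail = pd.read_csv(io.StringIO(csv_text))

# The text row forces the whole id column to a non-numeric dtype.
print(tail['id'].dtype)

# Coerce to numbers: values that cannot be parsed become NaN.
as_numbers = pd.to_numeric(tail['id'], errors='coerce')
bad_rows = tail[as_numbers.isna()]
print(bad_rows)
```

Running the same check on `last_rows['id']` would show which values prevent the column from being read as integers.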

### Percent of Each String Column that is Unique

In [79]:
str_cols = thousand_rows.select_dtypes(include=['object']).columns
str_cols

Index(['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state',
       'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d',
       'last_credit_pull_d', 'application_type'],
      dtype='object')

In [124]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
combined_dics = dict.fromkeys(str_cols)
for key in combined_dics:
    combined_dics[key] = []

for chunk in chunk_iter:
    for col in str_cols:
        values = chunk[col].unique()
        for val in values:
            if val not in combined_dics[col]:
                combined_dics[col].append(val)

7.28 s ± 670 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [125]:
uniq_vals_col = {}
for col in combined_dics:
    uniq_vals_col[col] = len(combined_dics[col])
pd.Series(uniq_vals_col) / num_rows_total

term                   0.000071
int_rate               0.009286
grade                  0.000188
sub_grade              0.000846
emp_title              0.720744
emp_length             0.000282
home_ownership         0.000141
verification_status    0.000094
issue_d                0.001316
loan_status            0.000235
pymnt_plan             0.000071
purpose                0.000353
title                  0.499906
zip_code               0.019700
addr_state             0.001199
earliest_cr_line       0.012483
revol_util             0.026329
initial_list_status    0.000047
last_pymnt_d           0.002445
last_credit_pull_d     0.002562
application_type       0.000047
dtype: float64

Here, we see that every object column except emp_title and title has far fewer than 50% unique values. Consequently, memory usage could be reduced by converting most of these columns to the category datatype.
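The unique-value tally above checks membership in a list, which gets slow for columns with many distinct values. A possible alternative, sketched here on a made-up miniature dataset rather than `loans_2007.csv`, keeps one set per column. Unlike the loop above, this version drops missing values before counting, which is a deliberate simplification.

```python
import io

import pandas as pd

# Made-up miniature dataset (an assumption) standing in for the loan file.
csv_text = "grade,term\nA,36 months\nB,60 months\nA,36 months\nC,36 months\n"

unique_vals = {}  # column name -> set of distinct values seen so far
total_rows = 0

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    for col in chunk.columns:
        # Set insertion is constant time on average, unlike a list lookup
        unique_vals.setdefault(col, set()).update(chunk[col].dropna().unique())

share_unique = pd.Series({col: len(vals) for col, vals in unique_vals.items()}) / total_rows
print(share_unique)
```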

### Missing Values for Each Float Column

In [132]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

null_floats = []
for chunk in chunk_iter:
    float_cols = chunk.select_dtypes(include='float')
    null_floats.append(float_cols.isnull().sum())

In [133]:
null_floats = pd.concat(null_floats)

In [136]:
null_floats.groupby(null_floats.index).sum().sort_values()

member_id                        3
total_rec_int                    3
total_pymnt_inv                  3
total_pymnt                      3
revol_bal                        3
recoveries                       3
policy_code                      3
out_prncp_inv                    3
out_prncp                        3
total_rec_late_fee               3
loan_amnt                        3
last_pymnt_amnt                  3
total_rec_prncp                  3
funded_amnt_inv                  3
funded_amnt                      3
dti                              3
collection_recovery_fee          3
installment                      3
annual_inc                       7
inq_last_6mths                  32
total_acc                       32
delinq_2yrs                     32
pub_rec                         32
delinq_amnt                     32
open_acc                        32
acc_now_delinq                  32
tax_liens                      108
collections_12_mths_ex_med     148
chargeoff_within_12_

## Changing Column Datatypes to Reduce Memory Usage

In [137]:
thousand_rows.select_dtypes(include=['object'])

Unnamed: 0,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,last_credit_pull_d,application_type
0,36 months,10.65%,B,B2,,10+ years,RENT,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,Jan-1985,83.7%,f,Jan-2015,Jun-2016,INDIVIDUAL
1,60 months,15.27%,C,C4,Ryder,< 1 year,RENT,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,Apr-1999,9.4%,f,Apr-2013,Sep-2013,INDIVIDUAL
2,36 months,15.96%,C,C5,,10+ years,RENT,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,Nov-2001,98.5%,f,Jun-2014,Jun-2016,INDIVIDUAL
3,36 months,13.49%,C,C1,AIR RESOURCES BOARD,10+ years,RENT,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,Feb-1996,21%,f,Jan-2015,Apr-2016,INDIVIDUAL
4,60 months,12.69%,B,B5,University Medical Group,1 year,RENT,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,Jan-1996,53.9%,f,Jun-2016,Jun-2016,INDIVIDUAL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,36 months,15.27%,C,C4,Lime Energy,3 years,OWN,Not Verified,Dec-2011,Fully Paid,n,home_improvement,top up,088xx,NJ,Jul-2000,76.8%,f,Jan-2015,Jul-2015,INDIVIDUAL
996,36 months,9.91%,B,B1,Real Mex Foods,2 years,RENT,Not Verified,Dec-2011,Fully Paid,n,credit_card,Credit Card Loan,916xx,CA,Jul-2004,83%,f,Sep-2012,Apr-2014,INDIVIDUAL
997,36 months,9.91%,B,B1,JP Morgan Chase,10+ years,OWN,Verified,Dec-2011,Fully Paid,n,credit_card,Relief Refi,322xx,FL,Mar-1995,79.4%,f,Dec-2012,Jun-2016,INDIVIDUAL
998,60 months,20.30%,E,E5,Regional Transportation District,10+ years,MORTGAGE,Verified,Dec-2011,Fully Paid,n,debt_consolidation,CC Consolidation,802xx,CO,Jan-2001,66%,f,Jan-2016,Feb-2016,INDIVIDUAL


Here, we see that the int_rate, zip_code, and revol_util columns can be easily converted to floats once the percent sign (int_rate, revol_util) or the trailing xx (zip_code) is removed.
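The percent conversion can be tried on a handful of values first. The sketch below uses rate strings copied from the sample rows above, plus one missing value added here as an assumption, to show that stripping the sign and calling `pd.to_numeric` leaves missing entries as NaN.

```python
import pandas as pd

# Rate strings from the sample output; the None entry is an added assumption.
rates = pd.Series(['10.65%', '15.27%', None, '21%'])

# Remove the percent sign, then parse what is left as numbers.
as_float = pd.to_numeric(rates.str.replace('%', '', regex=False))
print(as_float)

# Optionally downcast to a smaller float type to save memory.
small = pd.to_numeric(as_float, downcast='float')
print(small.dtype)
```

One caveat for zip_code: a prefix such as 088xx becomes 88.0 as a float, so the leading zero is lost.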

In [None]:
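# String columns of the last chunk read above (id is an object column there)
str_cols = chunk.select_dtypes(include=['object']).columns
# Exclude the mostly unique text columns and the three columns converted to floats below
str_cols = str_cols.drop(['emp_title', 'title', 'int_rate', 'revol_util', 'zip_code'], errors='ignore')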

In [182]:
str_cols

Index(['id', 'term', 'grade', 'sub_grade', 'emp_length', 'home_ownership',
       'verification_status', 'issue_d', 'loan_status', 'pymnt_plan',
       'purpose', 'addr_state', 'earliest_cr_line', 'initial_list_status',
       'last_pymnt_d', 'last_credit_pull_d', 'application_type'],
      dtype='object')

In [213]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
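mem_usage_all = 0  # running total in MB
for chunk in chunk_iter:
    # Strip the percent sign so the rate columns parse as numbers
    chunk['int_rate'] = pd.to_numeric(chunk['int_rate'].str.replace('%', ''))
    chunk['revol_util'] = pd.to_numeric(chunk['revol_util'].str.replace('%', ''))
    # Keep the three-digit ZIP prefix as a number
    chunk['zip_code'] = chunk['zip_code'].str.replace('xx', '').astype('float')
    # Rename so the column names still describe the converted values
    chunk.rename(columns={'int_rate': 'int_rate_per', 'revol_util': 'revol_util_per', 'zip_code': 'zip_code_###xx'}, inplace=True)
    # Convert the low-cardinality string columns to categoricals
    for col in str_cols:
        chunk[col] = chunk[col].astype('category')
    # Downcast float columns to the smallest float type that holds their values
    for col in chunk.select_dtypes(include='float').columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast='float')
    mem_usage_all += chunk.memory_usage(deep=True).sum() / 1048576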


In [214]:
mem_usage_all

14.341007232666016

## Conclusion

After changing the column datatypes, the dataset's total memory usage is 14.34 MB, much less than the 65 MB it used initially. Furthermore, depending on the goals of our project, we could likely drop some columns entirely, which would reduce the memory footprint further. However, there could be issues with maintaining the categorical datatypes when combining the chunks for analysis, since each chunk may end up with a different set of categories.
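The categorical concern can be seen on two tiny made-up Series standing in for the same column from two chunks. This is only a sketch of one possible remedy, not something the notebook itself does: `union_categoricals` from `pandas.api.types` merges the category sets so the combined column stays categorical.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Made-up stand-ins for one column taken from two different chunks.
chunk_a = pd.Series(['A', 'B'], dtype='category')
chunk_b = pd.Series(['B', 'C'], dtype='category')

# Plain concatenation: the category sets differ, so the result is not categorical.
naive = pd.concat([chunk_a, chunk_b], ignore_index=True)
print(naive.dtype)

# union_categoricals merges the category sets and keeps the categorical dtype.
combined = pd.Series(union_categoricals([chunk_a, chunk_b]))
print(combined.dtype)
```

Another option would be to fix the full list of categories up front and pass it to every chunk, at the cost of an extra pass over the data to collect them.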