# Optimizing Dataframes and Processing in Chunks

In this project, I will process a dataframe in chunks and optimize its memory usage. The data is financial lending data from [Lending Club](https://www.lendingclub.com/), a marketplace for personal loans that matches borrowers and investors. The website lists approved loans so that investors can view information about each one, including the borrower's credit score, the purpose of the loan, and other details from the loan application.

The specific dataset I will be using includes loans approved from 2007 through 2011, which can be downloaded [here](https://www.lendingclub.com/auth/login?login_url=%2Fstatistics%2Fadditional-statistics%3F). 

Reading in the entire dataset would consume about 66 megabytes of memory (measured below), and I will assume that only 10 megabytes of memory are available throughout the project.

## Exploring the data

First, I will read in the first five lines from `loans_2007.csv` to look for any data quality issues.

In [1]:
import pandas as pd
pd.options.display.max_columns = 99

loans_5 = pd.read_csv('loans_2007.csv', nrows=5)
loans_5

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


## Calculating memory usage

Next, I will read the first 1000 rows and calculate their total memory usage. `memory_usage()` reports bytes, so I will convert the result to megabytes.

In [2]:
loans_1000 = pd.read_csv('loans_2007.csv', nrows=1000)
mem_bytes = loans_1000.memory_usage(deep=True).sum()
mem_megabytes = mem_bytes/(2**20)
print("Memory usage of 1000 rows (in megabytes):\n")
print(mem_megabytes)

Memory usage of 1000 rows (in megabytes):

1.5502090454101562
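
To see which columns account for most of that memory, the usage can also be broken down per column. The sketch below is a minimal illustration on a small made-up frame; the helper name `memory_by_column_mb` and the sample data are assumptions, not part of the original notebook. The same helper could be called on `loans_1000`.

```python
import pandas as pd

def memory_by_column_mb(df):
    """Return deep memory usage per column in megabytes, largest first."""
    usage = df.memory_usage(deep=True, index=False)
    return (usage / 2**20).sort_values(ascending=False)

# Made-up sample: one float column and one text column stored as Python objects
sample = pd.DataFrame({
    'amount': [1000.0] * 100,
    'title': pd.Series(['debt consolidation'] * 100, dtype=object),
})
by_column = memory_by_column_mb(sample)
print(by_column)
```

Text columns stored as Python objects usually dominate, which is why the string columns get most of the attention later on.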


Now that I have determined the memory usage for 1000 rows to be about 1.55 megabytes, I will try different numbers of rows until I find a chunk size whose memory usage stays under five megabytes (to stay on the conservative side). I will start with three times the original number of rows: 3000 rows.

In [3]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
print("Megabytes of total memory usage in each chunk:\n")
list_mem = []
for chunk in chunk_iter:
    mem_bytes = chunk.memory_usage(deep=True).sum()
    mem_megabytes = mem_bytes/(2**20)
    print(mem_megabytes)
    list_mem.append(mem_megabytes)

print("\nTotal memory usage across all chunks:")
print(sum(list_mem))

Megabytes of total memory usage in each chunk:

4.649013519287109
4.6447601318359375
4.646517753601074
4.647870063781738
4.6440629959106445
4.6459455490112305
4.644536972045898
4.646905899047852
4.645031929016113
4.645082473754883
4.657794952392578
4.6566619873046875
4.663469314575195
4.896910667419434
0.8808088302612305

Total memory usage across all chunks:
66.2153730392456


In [4]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
num_rows = 0
for chunk in chunk_iter:
    num_rows = num_rows + len(chunk)
print("\nTotal number of rows:")
print(num_rows)


Total number of rows:
42538


Every chunk uses less than 5 megabytes, so a chunk size of 3000 rows will work for batch processing this dataset.
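
The chunk size above was found by trial. As a rough cross-check, it can also be estimated from the 1000-row measurement by dividing the memory budget by the average bytes per row. The helper name `rows_for_budget` is an assumption made for this sketch.

```python
def rows_for_budget(sample_bytes, sample_rows, budget_bytes):
    """Estimate how many rows fit in budget_bytes, given a measured sample."""
    bytes_per_row = sample_bytes / sample_rows
    return int(budget_bytes // bytes_per_row)

# Figures from the 1000-row measurement above, with a 5 megabyte budget
sample_bytes = 1.5502090454101562 * 2**20
estimate = rows_for_budget(sample_bytes, 1000, 5 * 2**20)
print(estimate)
```

The estimate comes out at roughly 3200 rows, which is consistent with 3000 rows fitting under the limit. Rows are not all the same size (one chunk above reached about 4.90 megabytes), so leaving some margin below the estimate is a sensible precaution.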

## Distinguishing column types

I will now determine the number of numeric-type columns and the number of string-type columns in each chunk. 

In [5]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
num_numeric_cols = []
for chunk in chunk_iter:
    num = len(chunk.select_dtypes(include=['number']).columns)
    num_numeric_cols.append(num)
print("Number of numeric columns in each chunk:")
print(num_numeric_cols)

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
num_string_cols = []
for chunk in chunk_iter:
    num = len(chunk.select_dtypes(include=['object']).columns)
    num_string_cols.append(num)
print("\nNumber of string columns in each chunk:")
print(num_string_cols)

Number of numeric columns in each chunk:
[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]

Number of string columns in each chunk:
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


We can see that the final two chunks have different numbers of numeric and string columns than the rest of the chunks. I will determine why this is the case.

In [6]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
string_cols = []
chunk_num = 0
for chunk in chunk_iter:
    chunk_num = chunk_num + 1
    chunk_string_cols = chunk.select_dtypes(include=['object']).columns
    if len(string_cols) > 0:
        if len(string_cols) != len(chunk_string_cols):
            print('Chunk #' + str(chunk_num))
            print('String columns:', chunk_string_cols, '\n')
    else:
        string_cols = chunk_string_cols
        print('Overall string columns:', string_cols, '\n')

Overall string columns: Index(['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state',
       'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d',
       'last_credit_pull_d', 'application_type'],
      dtype='object') 

Chunk #14
String columns: Index(['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code',
       'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status',
       'last_pymnt_d', 'last_credit_pull_d', 'application_type'],
      dtype='object') 

Chunk #15
String columns: Index(['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan',

We can see that in the last two chunks the `id` column is read as a string, whereas in the first chunk (and the 12 chunks that follow it) the `id` column is numeric. This suggests that at least one `id` value in each of the last two chunks is not a valid number.
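
One way to find out what makes a column non-numeric is to list the values that fail to parse as numbers. The following is a minimal sketch; the helper name `non_numeric_values` and the sample values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def non_numeric_values(values):
    """Return the distinct non-null values that cannot be parsed as numbers."""
    bad = []
    for value in values:
        if pd.isnull(value):
            continue
        try:
            float(value)
        except (TypeError, ValueError):
            if value not in bad:
                bad.append(value)
    return bad

# Made-up stand-in for an `id` column that mixes numbers and stray text
sample_ids = ['1077501', '1077430', 'Total amount funded', None]
print(non_numeric_values(sample_ids))
```

Applied to `chunk['id']` in the chunks where the column is a string, this would show the offending entries directly.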

## Analyzing string columns

Next, I will find out how many unique values there are in each string column. To do this, I will loop through all string columns, create a dictionary with the column names as keys, and collect each column's unique values as the dictionary values; the length of each collection is then the number of unique values. I will skip values that are null, or `NaN`. To view the dictionary clearly, I will import the `json` module and use the `json.dumps()` function.

In [7]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
uniques = {}
for chunk in chunk_iter:
    string_cols = chunk.select_dtypes(include=['object'])
    cols = string_cols.columns
    for c in cols:
        uniques[c] = []

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in chunk_iter:
    string_cols = chunk.select_dtypes(include=['object'])
    cols = string_cols.columns
    for c in cols:
        for row in string_cols[c]:
            # Skip missing values so that NaN is not counted as a unique value
            if pd.notnull(row) and row not in uniques[c]:
                uniques[c].append(row)

In [8]:
import json
lengths = {}
for column in uniques:
    lengths[column] = len(uniques[column])
print("Number of unique values in each string column:\n")
print(json.dumps(lengths, indent=2))

Number of unique values in each string column:

{
  "earliest_cr_line": 530,
  "last_pymnt_d": 103,
  "addr_state": 50,
  "zip_code": 837,
  "term": 2,
  "title": 21264,
  "pymnt_plan": 2,
  "sub_grade": 35,
  "verification_status": 3,
  "emp_title": 30658,
  "home_ownership": 5,
  "int_rate": 394,
  "grade": 7,
  "revol_util": 1119,
  "application_type": 1,
  "issue_d": 55,
  "last_credit_pull_d": 108,
  "id": 3538,
  "emp_length": 11,
  "purpose": 14,
  "initial_list_status": 1,
  "loan_status": 9
}
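
Checking membership in a list gets slower as the list grows, which matters for columns such as `emp_title` with tens of thousands of distinct values. A possible alternative is to keep one set per column and let pandas find the unique values of each chunk. This is a sketch under the assumption that the column names are known up front; `count_uniques` is a hypothetical helper, and the two small frames stand in for chunks returned by `read_csv(..., chunksize=...)`.

```python
import pandas as pd

def count_uniques(chunks, columns):
    """Count distinct non-null values per column across an iterable of chunks."""
    seen = {col: set() for col in columns}
    for chunk in chunks:
        for col in columns:
            # dropna() excludes missing values; unique() avoids a per-row Python loop
            seen[col].update(chunk[col].dropna().unique())
    return {col: len(values) for col, values in seen.items()}

# Two made-up chunks standing in for pieces of the real file
chunk_a = pd.DataFrame({'grade': ['A', 'B', None], 'term': ['36', '36', '60']})
chunk_b = pd.DataFrame({'grade': ['B', 'C', 'C'], 'term': ['60', None, '36']})
unique_counts = count_uniques([chunk_a, chunk_b], ['grade', 'term'])
print(unique_counts)
```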


Now that I have determined how many unique values there are in each string column, I can determine how many of those string columns contain less than 50% unique values. 

In [9]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
total_counts = {}
for chunk in chunk_iter:
    string_cols = chunk.select_dtypes(include=['object'])
    cols = string_cols.columns
    for c in cols:
        total_counts[c] = 0

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in chunk_iter:
    string_cols = chunk.select_dtypes(include=['object'])
    cols = string_cols.columns
    for c in cols:
        # count() returns the number of non-null values in the column
        total_counts[c] = total_counts[c] + int(string_cols[c].count())

print(json.dumps(total_counts, indent=2))

{
  "earliest_cr_line": 42506,
  "last_pymnt_d": 42452,
  "addr_state": 42535,
  "zip_code": 42535,
  "term": 42535,
  "title": 42522,
  "pymnt_plan": 42535,
  "sub_grade": 42535,
  "verification_status": 42535,
  "emp_title": 39909,
  "home_ownership": 42535,
  "int_rate": 42535,
  "grade": 42535,
  "revol_util": 42445,
  "application_type": 42535,
  "issue_d": 42535,
  "last_credit_pull_d": 42531,
  "id": 3538,
  "emp_length": 41423,
  "loan_status": 42535,
  "initial_list_status": 42535,
  "purpose": 42535
}


Above we have the total number of non-null values in each column. Since we have the total counts and the unique counts, we can now create a list of only the columns where the number of unique values divided by the total count is less than 0.5.

In [10]:
first_chunk = pd.read_csv('loans_2007.csv', nrows=3000)
string_columns = first_chunk.select_dtypes(include=['object'])
less_than_half_unique = []
for c in string_columns:
    percent_unique = lengths[c]/total_counts[c]
    if percent_unique < 0.5:
        less_than_half_unique.append(c)
            
print(less_than_half_unique)

['term', 'int_rate', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
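
The ratio test can also be written as a small function that works directly on the two dictionaries, without reading the file again. The name `low_cardinality` is an assumption for this sketch; the three sample entries are copied from the counts printed above.

```python
def low_cardinality(unique_counts, total_counts, threshold=0.5):
    """Return the columns whose unique/total ratio is below threshold."""
    selected = []
    for col, n_unique in unique_counts.items():
        total = total_counts.get(col, 0)
        # Guard against columns with no non-null values
        if total > 0 and n_unique / total < threshold:
            selected.append(col)
    return selected

# A few entries taken from the two dictionaries printed above
unique_counts = {'grade': 7, 'emp_title': 30658, 'id': 3538}
total_counts = {'grade': 42535, 'emp_title': 39909, 'id': 3538}
selected = low_cardinality(unique_counts, total_counts)
print(selected)
```

Of these three, only `grade` passes: `emp_title` is about 77% unique, and every counted `id` value is distinct.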


## Analyzing float columns

Next, I can move on to the float columns. I will find out which float columns have no missing values, since those are candidates for conversion to an integer type. We know from above that the whole dataset has 42538 rows, which is saved in the variable `num_rows`. Reusing the code that counted the non-null values in the string columns, I can compare each count to the expected number of rows to get the number of missing values per column.

In [11]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
total_counts = {}
for chunk in chunk_iter:
    float_cols = chunk.select_dtypes(include=['float'])
    cols = float_cols.columns
    for c in cols:
        total_counts[c] = 0

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in chunk_iter:
    float_cols = chunk.select_dtypes(include=['float'])
    cols = float_cols.columns
    for c in cols:
        # count() returns the number of non-null values in the column
        total_counts[c] = total_counts[c] + int(float_cols[c].count())

num_missing = {}
for key in total_counts:
    num_missing[key] = num_rows - total_counts[key]
    
print(json.dumps(num_missing, indent=2))

{
  "inq_last_6mths": 32,
  "out_prncp_inv": 3,
  "policy_code": 3,
  "total_pymnt_inv": 3,
  "loan_amnt": 3,
  "collection_recovery_fee": 3,
  "total_acc": 32,
  "total_rec_late_fee": 3,
  "pub_rec_bankruptcies": 1368,
  "installment": 3,
  "recoveries": 3,
  "funded_amnt_inv": 3,
  "delinq_amnt": 32,
  "funded_amnt": 3,
  "dti": 3,
  "chargeoff_within_12_mths": 148,
  "annual_inc": 7,
  "total_pymnt": 3,
  "total_rec_prncp": 3,
  "last_pymnt_amnt": 3,
  "collections_12_mths_ex_med": 148,
  "member_id": 3,
  "tax_liens": 108,
  "out_prncp": 3,
  "delinq_2yrs": 32,
  "open_acc": 32,
  "pub_rec": 32,
  "acc_now_delinq": 32,
  "revol_bal": 3,
  "total_rec_int": 3
}
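
The missing values can also be counted directly with `isnull().sum()`, accumulated across chunks, instead of subtracting the non-null count from `num_rows`. This is a minimal sketch; `count_missing` is a hypothetical helper, and the two small frames are made-up stand-ins for chunks of the file.

```python
import pandas as pd

def count_missing(chunks):
    """Sum the number of missing values per column across an iterable of chunks."""
    missing = {}
    for chunk in chunks:
        # isnull().sum() counts missing values per column without looping over rows
        for col, n in chunk.isnull().sum().items():
            missing[col] = missing.get(col, 0) + int(n)
    return missing

chunk_a = pd.DataFrame({'dti': [27.65, None, 8.72], 'open_acc': [3.0, 3.0, 2.0]})
chunk_b = pd.DataFrame({'dti': [20.0, 17.94], 'open_acc': [None, 15.0]})
missing = count_missing([chunk_a, chunk_b])
print(missing)
```

Converting each count with `int()` keeps the dictionary serializable by `json.dumps()`, which does not accept NumPy integer types.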


## Converting string columns

Having analyzed the column types as they currently exist, I can now convert some string columns to numeric types. I can also convert columns where fewer than 50% of the values are unique to the category type, and columns that hold numeric values stored as text to the float type.

I will start with converting the string columns. It looks like some of the string columns are not useful, and others can be converted to different types. I will go through each column now to determine which ones are necessary, and what types I should change each necessary column to. 

Columns to remove:

* `zip_code` - the last two digits of every zip code are missing
* `application_type` - only one option
* `initial_list_status` - only one option
* `title` - not consistent
* `id` - not consistent

Useful columns: 

* `grade` (category) 
* `home_ownership` (category)
* `emp_length` (category) 
* `verification_status` (category)
* `loan_status` (category)
* `addr_state` (category)
* `sub_grade` (category)
* `purpose` (category)
* `earliest_cr_line` (datetime)
* `issue_d` (datetime)
* `last_pymnt_d` (datetime)
* `last_credit_pull_d` (datetime)
* `int_rate` (float)
* `revol_util` (float)
* `pymnt_plan` (boolean)
* `term` (numeric)
* `emp_title` (string)

In [12]:
useful_cols = ['earliest_cr_line', 'issue_d', 'grade', 'home_ownership', 
               'last_pymnt_d', 'int_rate', 'emp_length', 'pymnt_plan', 
               'last_credit_pull_d', 'verification_status','loan_status', 
               'term', 'addr_state', 'emp_title','revol_util', 'sub_grade',
               'purpose']

In [13]:
str_chunks = pd.read_csv('loans_2007.csv', chunksize=3000, usecols=useful_cols,
                          parse_dates= ['earliest_cr_line', 'issue_d', 
                         'last_pymnt_d', 'last_credit_pull_d'])

print("Megabytes of total memory usage in each chunk:\n")
list_mem = []
for chunk in str_chunks:
    chunk['grade'] = chunk['grade'].astype('category')
    chunk['home_ownership'] = chunk['home_ownership'].astype('category')
    chunk['emp_length'] = chunk['emp_length'].astype('category')
    chunk['loan_status'] = chunk['loan_status'].astype('category')
    chunk['addr_state'] = chunk['addr_state'].astype('category')
    chunk['sub_grade'] = chunk['sub_grade'].astype('category')
    chunk['purpose'] = chunk['purpose'].astype('category')
    chunk['verification_status'] = chunk['verification_status'].astype('category')
    cleaned_int_rate = chunk['int_rate'].str.rstrip("%")
    chunk['int_rate'] = pd.to_numeric(cleaned_int_rate)
    cleaned_revol_util = chunk['revol_util'].str.rstrip("%")
    chunk['revol_util'] = pd.to_numeric(cleaned_revol_util)
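    # Remove surrounding spaces and the " months" suffix; rstrip(" months") would
    # strip a set of characters rather than the suffix itself
    cleaned_term = chunk['term'].str.strip().str.replace(" months", "", regex=False)
    chunk['term'] = pd.to_numeric(cleaned_term)
    # Compare against 'y': astype('bool') maps every non-empty string, including 'n', to True
    chunk['pymnt_plan'] = chunk['pymnt_plan'] == 'y'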
    mem_bytes = chunk.memory_usage(deep=True).sum()
    mem_megabytes = mem_bytes/(2**20)
    print(mem_megabytes)
    list_mem.append(mem_megabytes)

print("\nTotal memory usage across all chunks:")
print(sum(list_mem))

Megabytes of total memory usage in each chunk:

0.40640735626220703
0.4031820297241211
0.40488529205322266
0.405303955078125
0.40362548828125
0.4049654006958008
0.4032754898071289
0.4050636291503906
0.4051513671875
0.40613269805908203
0.4052286148071289
0.40256404876708984
0.4092569351196289
0.4066162109375
0.0806112289428711

Total memory usage across all chunks:
5.752269744873047
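
Two details are easy to get wrong in this kind of cleanup: `str.rstrip()` removes a set of characters rather than a suffix, and casting text to `bool` treats every non-empty string as `True`. The sketch below shows one possible way to handle percentages, loan terms, and y/n flags on a few made-up rows in the same text format as the loan data.

```python
import pandas as pd

# Made-up rows in the same text format as the loan data
sample = pd.DataFrame({
    'int_rate': ['10.65%', '15.27%'],
    'term': [' 36 months', ' 60 months'],
    'pymnt_plan': pd.Series(['n', 'y'], dtype=object),
})

# Percentages: drop the trailing % sign, then convert to float
int_rate = sample['int_rate'].str.rstrip('%').astype('float64')

# Term: pull out the digits instead of stripping a set of characters
term = sample['term'].str.extract(r'(\d+)', expand=False).astype('int64')

# Flags: astype('bool') turns 'n' into True as well, because any non-empty
# string is truthy; comparing against 'y' gives the intended result
cast_flags = sample['pymnt_plan'].astype('bool')
compared_flags = sample['pymnt_plan'] == 'y'

print(int_rate.tolist(), term.tolist())
print(cast_flags.tolist(), compared_flags.tolist())
```

Both flag conversions produce a boolean column of the same size, so the choice affects correctness rather than memory usage.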


## Results of converting string columns

These new memory figures show that after cleaning the data, removing columns that are not useful, and converting columns to more appropriate types, the 17 retained string columns take about 5.75 megabytes in total. That is less than 10% of the initial 66.2 megabytes, although the initial figure also includes the numeric columns, which are handled next.


## Converting float columns

Now I will move on to the float columns, all of which contain some missing values, to see which can be changed to a more space-efficient type. First, I will look at the data types of the numeric columns in the first few rows.

In [14]:
first_few = pd.read_csv('loans_2007.csv', nrows=10)
print(first_few.select_dtypes(include=['number']).dtypes)

id                              int64
member_id                     float64
loan_amnt                     float64
funded_amnt                   float64
funded_amnt_inv               float64
installment                   float64
annual_inc                    float64
dti                           float64
delinq_2yrs                   float64
inq_last_6mths                float64
open_acc                      float64
pub_rec                       float64
revol_bal                     float64
total_acc                     float64
out_prncp                     float64
out_prncp_inv                 float64
total_pymnt                   float64
total_pymnt_inv               float64
total_rec_prncp               float64
total_rec_int                 float64
total_rec_late_fee            float64
recoveries                    float64
collection_recovery_fee       float64
last_pymnt_amnt               float64
collections_12_mths_ex_med    float64
policy_code                   float64
acc_now_deli

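As stated earlier, we can leave out the `id` column. The following columns appear to hold whole-number values, so they seem like candidates for integer types. Because each of them has at least a few missing values (see the counts above), the missing values would need to be handled before a plain integer type can be used: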

* `member_id`
* `delinq_2yrs`
* `inq_last_6mths`
* `open_acc`
* `pub_rec`
* `revol_bal`
* `total_acc`
* `collections_12_mths_ex_med`
* `policy_code`  
* `acc_now_delinq`
* `chargeoff_within_12_mths`  
* `delinq_amnt`  
* `pub_rec_bankruptcies`  
* `tax_liens` 

The following should stay as floats, but can be downcast to a smaller float type:

* `loan_amnt`
* `funded_amnt`
* `funded_amnt_inv`
* `installment`
* `annual_inc`
* `dti`
* `out_prncp`
* `out_prncp_inv`
* `total_pymnt`
* `total_pymnt_inv`
* `total_rec_prncp`
* `total_rec_int`
* `total_rec_late_fee`
* `recoveries`
* `collection_recovery_fee`
* `last_pymnt_amnt`

In [15]:
int_cols = ['member_id', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 
            'pub_rec', 'revol_bal', 'total_acc', 'collections_12_mths_ex_med',
            'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 
            'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, usecols=int_cols)
for chunk in chunk_iter:
    for col in int_cols:
        # Columns that contain NaN cannot hold plain integers, so they stay float64
        chunk[col] = pd.to_numeric(chunk[col], downcast='integer')

In [16]:
float_cols = ['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment',
              'annual_inc', 'dti', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
              'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 
              'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 
              'last_pymnt_amnt']

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, usecols=float_cols)
for chunk in chunk_iter:
    for col in float_cols:
        # Downcast from float64 to float32 where possible
        chunk[col] = pd.to_numeric(chunk[col], downcast='float')
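
How `pd.to_numeric(..., downcast=...)` behaves depends on whether a column has missing values. The following sketch uses two made-up series to show the difference, along with pandas' nullable integer types as one possible way to store whole numbers alongside missing values.

```python
import pandas as pd

whole = pd.Series([3.0, 10.0, 15.0])     # whole numbers, nothing missing
with_gap = pd.Series([3.0, None, 15.0])  # same idea, one missing value

# Without missing values the float column can shrink to a small integer type
whole_dtype = pd.to_numeric(whole, downcast='integer').dtype
print(whole_dtype)

# With a missing value the integer downcast is skipped and the column stays float64
gap_dtype = pd.to_numeric(with_gap, downcast='integer').dtype
print(gap_dtype)

# A nullable integer dtype stores integers and keeps the gap as <NA>
nullable = with_gap.astype('Int8')
print(nullable.dtype, int(nullable.isna().sum()))

# Floats can still be halved in size, at the cost of float32 precision
float_dtype = pd.to_numeric(with_gap, downcast='float').dtype
print(float_dtype)
```

Nullable integer types store a separate mask next to the values, so a small width such as `Int8` or `Int16` is needed to actually save memory compared with `float64`. Note also that `float32` keeps only about seven significant digits, which can round payment amounts that have many decimals.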

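With the conversions worked out for each group of columns, it is time to put it all together. I will create a list of all necessary columns, then print the memory usage of each chunk when only those columns are read. Note that this cell reads the columns with their default types; the conversions from the previous cells are not reapplied here.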

In [17]:
all_useful_cols = useful_cols + int_cols + float_cols

chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, usecols=all_useful_cols)
list_mem = []
print("Megabytes of total memory usage in each chunk:\n")
for chunk in chunk_iter:
    mem_bytes = chunk.memory_usage(deep=True).sum()
    mem_megabytes = mem_bytes/(2**20)
    print(mem_megabytes)
    list_mem.append(mem_megabytes)

print("\nTotal memory usage across all chunks:")
print(sum(list_mem))

Megabytes of total memory usage in each chunk:

3.8584365844726562
3.854074478149414
3.8555965423583984
3.857219696044922
3.8547658920288086
3.857466697692871
3.8548221588134766
3.8573246002197266
3.8569259643554688
3.8569107055664062
3.8556394577026367
3.853343963623047
3.858841896057129
3.9395370483398438
0.7084341049194336

Total memory usage across all chunks:
54.77933979034424


## Results of converting float columns

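Reading only the useful columns brings the total memory usage from about 66.2 megabytes down to about 54.8 megabytes, a decrease of over 11 megabytes. Because the last cell does not apply the type conversions, this saving comes from dropping columns alone. Applying the category, datetime, and downcast conversions while reading each chunk should reduce the footprint further, as the string-column results above suggest.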
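
The per-column conversions explored above could be combined into a single pass over the chunks. The following is a sketch only, run on a tiny made-up sample with a handful of the loan columns; `optimize_chunk` is a hypothetical helper and the sample rows are invented, so the memory figure it prints says nothing about the real dataset.

```python
import io
import pandas as pd

# Tiny made-up sample in the same layout as a few of the loan columns
SAMPLE_CSV = """grade,int_rate,term,issue_d,open_acc,loan_amnt
B,10.65%, 36 months,Dec-2011,3.0,5000.0
C,15.27%, 60 months,Dec-2011,3.0,2500.0
C,15.96%, 36 months,Nov-2011,2.0,2400.0
B,12.69%, 60 months,Nov-2011,15.0,3000.0
"""

def optimize_chunk(chunk):
    """Apply the conversions discussed above to a copy of one chunk."""
    out = chunk.copy()
    out['grade'] = out['grade'].astype('category')
    out['int_rate'] = out['int_rate'].str.rstrip('%').astype('float64')
    out['term'] = out['term'].str.extract(r'(\d+)', expand=False).astype('int64')
    out['issue_d'] = pd.to_datetime(out['issue_d'], format='%b-%Y')
    out['open_acc'] = pd.to_numeric(out['open_acc'], downcast='integer')
    out['loan_amnt'] = pd.to_numeric(out['loan_amnt'], downcast='float')
    return out

total_bytes = 0
dtypes_seen = None
for chunk in pd.read_csv(io.StringIO(SAMPLE_CSV), chunksize=2):
    optimized = optimize_chunk(chunk)
    total_bytes += optimized.memory_usage(deep=True).sum()
    dtypes_seen = optimized.dtypes

print(dtypes_seen)
print(total_bytes)
```

Swapping `io.StringIO(SAMPLE_CSV)` for `'loans_2007.csv'` with `usecols` and `chunksize=3000` would apply the same idea to the real file, after extending `optimize_chunk` to the remaining columns and handling missing values. One caveat: categories are inferred per chunk, so chunks that are later concatenated may need a shared category list to keep the category type.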