# Part 1: The data
## Objectives
1. Read in the data from `loans_2007.csv`
2. Identify any data quality issues

# Part 2: Explore the data in chunks
## Objectives
1. For each chunk:   
    1. How many columns have a numeric type? 
    2. How many columns have a string type?
    3. How many unique values are there in each string column? 
    4. How many of the string columns contain values that are less than 50% unique?
    5. Which float columns have no missing values and could be candidates for conversion to the integer type?
2. Calculate the total memory usage across all of the chunks

# Part 3: Optimize string columns
## Objectives
1. While working with dataframe chunks:
    1. Determine which string columns you can convert to a numeric type if you clean them. For example, the `int_rate` column is only a string because of the `%` sign at the end.
    2. Determine which columns have a few unique values and convert them to the category type. For example, you may want to convert the `grade` and `sub_grade` columns.
    3. Based on your conclusions, perform the necessary type changes across all chunks. Calculate the total memory footprint, and compare it with the previous one.
    
# Part 4: Optimize numeric columns
## Objectives
1. While working with dataframe chunks:
    1. Identify float columns that contain missing values, and that we can convert to a more space efficient subtype.
    2. Identify float columns that don't contain any missing values, and that we can convert to the integer type because they represent whole numbers.
    3. Based on your conclusions, perform the necessary type changes across all chunks. Calculate the total memory footprint and compare it with the previous one.

In [100]:
import pandas as pd
pd.options.display.max_columns = 99

# Explore the Data

In [2]:
df = pd.read_csv('loans_2007.csv', low_memory=False)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df.head(2), df.tail(2)])

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
42536,Total amount funded in policy code 1: 471701350,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
42537,Total amount funded in policy code 2: 0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [3]:
df.shape

(42538, 52)

# Memory Usage

Figure out how many rows we can process at a time in order to stay below 5 MB of memory usage. This will be the size of each chunk when we process the data in chunks.

In [4]:
num_rows = 3000
df = pd.read_csv('loans_2007.csv', nrows=num_rows)
mem_usage = df.memory_usage(deep=True).sum()/(1024*1024)

print(num_rows, 'rows require about', mem_usage.round(2), 'MB of memory')

3000 rows require about 4.65 MB of memory
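
Rather than trying row counts by hand, the chunk size can be estimated from the per-row memory of a small sample. The sketch below is only an approximation: the helper name `rows_within_budget` and the stand-in sample are assumptions, and because string lengths vary from row to row, the real footprint of a chunk may differ from the estimate.

```python
import pandas as pd

def rows_within_budget(sample: pd.DataFrame, budget_mb: float) -> int:
    """Estimate how many rows fit in budget_mb, using the sample's average bytes per row."""
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    return int(budget_mb * 1024 * 1024 // bytes_per_row)

# Stand-in sample (hypothetical values); with the real file this could be
# pd.read_csv('loans_2007.csv', nrows=1000).
sample = pd.DataFrame({
    'loan_amnt': [5000.0, 2500.0, 2400.0, 10000.0],
    'grade': ['B', 'C', 'C', 'C'],
})

print(rows_within_budget(sample, budget_mb=5), 'rows fit in roughly 5 MB')
```

It is worth leaving some headroom below the estimate, since later chunks may contain longer strings than the sample.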


# Process the Data in Chunks

## How many columns have a numeric type? 

In [13]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    df_numeric_types = chunk.select_dtypes(include='number')
    cols = list(df_numeric_types.columns)
    print(len(cols), 'numeric type columns:\n')
    print(cols)
    
    print('='*50)

31 numeric type columns:

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']
31 numeric type columns:

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeo

In the results above, the last two chunks have only 30 numeric columns, while all the others have 31. The difference is the `id` column: in the last two chunks pandas reads it as a string rather than a number, which suggests those chunks contain ids that are not numeric. Once we filter out those rows, we can cast the `id` column in the last two chunks to the same numeric type as in the other chunks (`int64`).

In [14]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    # Filter out non-numeric ids
    good_id_mask = pd.to_numeric(chunk['id'], errors='coerce').notna()
    chunk = chunk[good_id_mask].copy()  # copy so the cast below does not act on a view
    
    # Cast id column
    if chunk['id'].dtype != 'int64':
        chunk['id'] = chunk['id'].astype('int64') 
    
    # Number of numeric type columns
    df_numeric_types = chunk.select_dtypes(include='number')
    cols = list(df_numeric_types.columns)
    print(len(cols), 'numeric type columns:\n')
    print(cols)
    
    print('='*50)        

31 numeric type columns:

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']
31 numeric type columns:

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeo

After filtering out the non-numeric ids and casting the `id` column in all chunks to an `int64`, we can see that each chunk now lists 31 numeric type columns. Great!
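
Since the same filter-and-cast step is needed every time the file is read in chunks, it could be factored into a small helper. This is a minimal sketch: the function name `keep_numeric_ids` and the demo frame are assumptions, with the bad row modeled on the summary rows at the end of the file.

```python
import pandas as pd

def keep_numeric_ids(chunk: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose id is not numeric, then cast id to int64."""
    good_id_mask = pd.to_numeric(chunk['id'], errors='coerce').notna()
    cleaned = chunk[good_id_mask].copy()  # copy so the assignment below is safe
    cleaned['id'] = pd.to_numeric(cleaned['id']).astype('int64')
    return cleaned

# Demo chunk in which one id is a summary string rather than a number.
demo = pd.DataFrame({
    'id': ['1077501', '1077430', 'Total amount funded in policy code 1: 471701350'],
    'loan_amnt': [5000.0, 2500.0, None],
})

cleaned = keep_numeric_ids(demo)
print(cleaned.dtypes)
```

Each loop body could then start with `chunk = keep_numeric_ids(chunk)` instead of repeating the three steps.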

## How many columns have a string type?

In [15]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    # Filter out non-numeric ids
    good_id_mask = pd.to_numeric(chunk['id'], errors='coerce').notna()
    chunk = chunk[good_id_mask].copy()  # copy so the cast below does not act on a view
    
    # Cast id column
    if chunk['id'].dtype != 'int64':
        chunk['id'] = chunk['id'].astype('int64') 
    
    # Number of string type columns
    df_string_types = chunk.select_dtypes(include='object')
    cols = list(df_string_types.columns)
    print(len(cols), 'string type columns:\n')
    print(cols)
    
    print('='*50)   

21 string type columns:

['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
21 string type columns:

['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type']
21 string type columns:

['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'applicat

We can see that each chunk has 21 string type columns.

## How many unique values in each string column?

In [17]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    # Filter out non-numeric ids
    good_id_mask = pd.to_numeric(chunk['id'], errors='coerce').notna()
    chunk = chunk[good_id_mask].copy()  # copy so the cast below does not act on a view
    
    # Cast id column
    if chunk['id'].dtype != 'int64':
        chunk['id'] = chunk['id'].astype('int64') 
    
    # Number of unique values in each string column
    df_string_types = chunk.select_dtypes(include='object')  
    print(df_string_types.nunique())
    
    print('='*50)

term                      2
int_rate                 36
grade                     7
sub_grade                35
emp_title              2653
emp_length               11
home_ownership            3
verification_status       3
issue_d                   2
loan_status               6
pymnt_plan                1
purpose                  13
title                  1406
zip_code                568
addr_state               43
earliest_cr_line        366
revol_util              884
initial_list_status       1
last_pymnt_d             54
last_credit_pull_d       55
application_type          1
dtype: int64
term                      2
int_rate                 36
grade                     7
sub_grade                35
emp_title              2588
emp_length               11
home_ownership            3
verification_status       3
issue_d                   3
loan_status               6
pymnt_plan                1
purpose                  13
title                  1454
zip_code                570
addr_st

## How many of the string columns contain values that are less than 50% unique?
These are columns we should consider converting to a category type.
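
As a rough illustration of why columns with few distinct values benefit from the category type, the sketch below compares the memory footprint of a synthetic column stored as strings and as categories. The data is made up (3,000 rows with 7 distinct values, similar in spirit to `grade`), so the exact byte counts will not match the loans file.

```python
import pandas as pd

# Synthetic column: 3,000 rows but only 7 distinct values.
grades = pd.Series(list('ABCDEFG') * 429, name='grade').head(3000)

as_object = grades.astype('object').memory_usage(deep=True)
as_category = grades.astype('category').memory_usage(deep=True)

print('object bytes:  ', as_object)
print('category bytes:', as_category)
```

A category column stores each distinct string once plus a small integer code per row, which is why the saving shrinks as the share of unique values grows.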

In [103]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

category_type_candidates = []
for chunk in chunk_iter:
    # Filter out non-numeric ids
    good_id_mask = pd.to_numeric(chunk['id'], errors='coerce').notna()
    chunk = chunk[good_id_mask].copy()  # copy so the cast below does not act on a view
    
    # Cast id column
    if chunk['id'].dtype != 'int64':
        chunk['id'] = chunk['id'].astype('int64') 
    
    # String columns that contain values that are less than 50% unique
    df_string_types = chunk.select_dtypes(include='object')  
    unq = df_string_types.nunique()
    cnt = df_string_types.count()
    unq_percent = (unq/cnt)*100
    unq_percent_less_than_50 = unq_percent[unq_percent < 50]
    category_type_candidates.append(set(unq_percent_less_than_50.index))
    
    print(unq_percent_less_than_50)
    print(unq_percent_less_than_50.size)
    print('='*50)

convert_to_category = set.intersection(*category_type_candidates)
print(sorted(convert_to_category))

term                    0.066667
int_rate                1.200000
grade                   0.233333
sub_grade               1.166667
emp_length              0.377100
home_ownership          0.100000
verification_status     0.100000
issue_d                 0.066667
loan_status             0.200000
pymnt_plan              0.033333
purpose                 0.433333
title                  46.866667
zip_code               18.933333
addr_state              1.433333
earliest_cr_line       12.200000
revol_util             29.466667
initial_list_status     0.033333
last_pymnt_d            1.801201
last_credit_pull_d      1.833333
application_type        0.033333
dtype: float64
20
term                    0.066667
int_rate                1.200000
grade                   0.233333
sub_grade               1.166667
emp_length              0.383008
home_ownership          0.100000
verification_status     0.100000
issue_d                 0.100000
loan_status             0.200000
pymnt_plan              0

## Which float columns have no missing values and could be candidates for conversion to the integer type?

In [62]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:    
    chunk = chunk.select_dtypes(include='floating')
    has_missing_values = chunk.isna().sum() > 0
    print(has_missing_values)
    
    print('='*50)

member_id                     False
loan_amnt                     False
funded_amnt                   False
funded_amnt_inv               False
installment                   False
annual_inc                    False
dti                           False
delinq_2yrs                   False
inq_last_6mths                False
open_acc                      False
pub_rec                       False
revol_bal                     False
total_acc                     False
out_prncp                     False
out_prncp_inv                 False
total_pymnt                   False
total_pymnt_inv               False
total_rec_prncp               False
total_rec_int                 False
total_rec_late_fee            False
recoveries                    False
collection_recovery_fee       False
last_pymnt_amnt               False
collections_12_mths_ex_med    False
policy_code                   False
acc_now_delinq                False
chargeoff_within_12_mths      False
delinq_amnt                 

From the results above, we can see that only the last four chunks have float columns with missing values. In the last two chunks, all of the float columns contain missing values.
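
Having no missing values is only half of the test for integer conversion: the values must also be whole numbers. The sketch below checks both conditions for a single chunk. The helper name `integer_candidates` and the demo values are assumptions, and on the real file a column should only be converted if it passes in every chunk.

```python
import pandas as pd

def integer_candidates(chunk: pd.DataFrame) -> list:
    """Float columns with no missing values whose values are all whole numbers."""
    floats = chunk.select_dtypes(include='float')
    candidates = []
    for col in floats.columns:
        values = floats[col]
        if values.notna().all() and (values % 1 == 0).all():
            candidates.append(col)
    return candidates

demo = pd.DataFrame({
    'open_acc': [3.0, 3.0, 2.0],            # whole numbers, nothing missing
    'installment': [162.87, 59.83, 84.33],  # fractional values
    'pub_rec': [0.0, None, 0.0],            # contains a missing value
})

print(integer_candidates(demo))  # ['open_acc']
```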

## Total memory usage across all chunks

In [74]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

total_memory = 0
for chunk in chunk_iter:
    mem = (chunk.memory_usage(deep=True)/(1024*1024)).sum()
    total_memory += mem

print(total_memory, 'MB')

66.2153730392456 MB


## Determine which string columns can be converted to a numeric type if the data is cleaned

In [83]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    # Filter out non-numeric ids
    good_id_mask = pd.to_numeric(chunk['id'], errors='coerce').notna()
    chunk = chunk[good_id_mask].copy()  # copy so the cast below does not act on a view
    
    # Cast id column
    if chunk['id'].dtype != 'int64':
        chunk['id'] = chunk['id'].astype('int64') 
    
    print(chunk.select_dtypes(include='object').tail(3))
    
    print('='*50)

            term int_rate grade sub_grade                 emp_title  \
2997   36 months   16.77%     D        D2          Northrop Grumman   
2998   60 months   12.42%     B        B4  Commonwealth of Virginia   
2999   36 months    6.03%     A        A1                BNY Mellon   

     emp_length home_ownership verification_status   issue_d loan_status  \
2997    2 years           RENT            Verified  Nov-2011  Fully Paid   
2998    5 years       MORTGAGE            Verified  Nov-2011  Fully Paid   
2999    5 years       MORTGAGE        Not Verified  Nov-2011  Fully Paid   

     pymnt_plan             purpose                    title zip_code  \
2997          n  debt_consolidation                2011 Loan    913xx   
2998          n  debt_consolidation  Debt Consolidation Loan    225xx   
2999          n  debt_consolidation                 Personal    027xx   

     addr_state earliest_cr_line revol_util initial_list_status last_pymnt_d  \
2997         CA         Mar-1995     

From the sample rows above, we can see that the following string columns could be converted to a numeric type once they are cleaned:

| Column      | How to Clean              |
|-------------|---------------------------|
| `term`      | remove the word 'months'  |
| `int_rate`  | remove the percent sign   |
| `revol_util`| remove the percent sign   |
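
The cleaning steps in the table could look roughly like this. It is a sketch on a made-up three-column frame: the values mirror the format seen in the file, and the missing `revol_util` entry is an assumption included to show that gaps survive the conversion as `NaN`.

```python
import pandas as pd

demo = pd.DataFrame({
    'term': [' 36 months', ' 60 months'],
    'int_rate': ['10.65%', '15.27%'],
    'revol_util': ['83.7%', None],
})

# Remove the word 'months' and surrounding spaces, then convert to a number.
demo['term'] = pd.to_numeric(
    demo['term'].str.replace('months', '', regex=False).str.strip()
)

# Remove the trailing percent sign, then convert to a number.
for col in ['int_rate', 'revol_util']:
    demo[col] = pd.to_numeric(demo[col].str.rstrip('%'))

print(demo.dtypes)
print(demo)
```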

We also know from our previous exploration of the data that the following string columns are good candidates for conversion to a category type, because fewer than 50% of their values are unique in every chunk:

- `addr_state`
- `application_type` 
- `earliest_cr_line` 
- `emp_length` 
- `grade` 
- `home_ownership` 
- `initial_list_status` 
- `int_rate` 
- `issue_d` 
- `last_credit_pull_d` 
- `last_pymnt_d` 
- `loan_status` 
- `purpose` 
- `pymnt_plan` 
- `sub_grade` 
- `term` 
- `verification_status` 
- `zip_code`

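So, let's perform these changes when we process the chunks. Because `int_rate` and `term` appear in both lists, we need to pick one conversion for each of them.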
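
One way the per-chunk conversion and memory comparison might be put together is sketched below. To keep it runnable on its own, it reads a tiny in-memory CSV instead of `loans_2007.csv`. The helper name `optimize`, the sample rows, and the choice to treat `term` and `int_rate` as numeric while converting only `grade` to a category are all assumptions made for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for loans_2007.csv (hypothetical rows in the same format).
csv_text = """id,term,int_rate,grade,revol_util
1, 36 months,10.65%,B,83.7%
2, 60 months,15.27%,C,9.4%
3, 36 months,13.49%,C,21%
4, 36 months,12.69%,B,53.9%
"""

percent_cols = ['int_rate', 'revol_util']  # cleaned and converted to float
category_cols = ['grade']                  # illustrative subset of the candidates

def optimize(chunk: pd.DataFrame) -> pd.DataFrame:
    """Apply the string-to-numeric and string-to-category conversions to one chunk."""
    chunk = chunk.copy()
    chunk['term'] = pd.to_numeric(
        chunk['term'].str.replace('months', '', regex=False).str.strip()
    )
    for col in percent_cols:
        chunk[col] = pd.to_numeric(chunk[col].str.rstrip('%'))
    for col in category_cols:
        chunk[col] = chunk[col].astype('category')
    return chunk

before = 0
after = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    before += chunk.memory_usage(deep=True).sum()
    after += optimize(chunk).memory_usage(deep=True).sum()

print(before, 'bytes before,', after, 'bytes after')
```

Note that categories are inferred separately for each chunk here, so the set of categories can differ between chunks. If the chunks are later combined, the categories may need to be unified first.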