# Optimizing CSV Reading
Instructions:

1. Read in the first five lines from loans_2007.csv and look for any data quality issues.
2. Read in the first 1000 rows from the data set, and calculate the total memory usage for these rows. Increase or decrease the number of rows to converge on a memory usage under five megabytes (to stay on the conservative side).

Study of the columns for memory optimization:

1. id seems to be a unique numeric identifier (integer)
2. member_id seems to be a potential integer
3. loan_amnt - float
4. funded_amnt - float
5. funded_amnt_inv - float
6. term - seems to be a duration in months; currently a string, but could be converted to integer (to be checked)
7. int_rate - float if we remove the % sign
8. installment - float
9. grade - Letters categorizing the credit worthiness of the borrower, could be converted to categorical
10. sub_grade - same as 9
11. emp_title - string
12. emp_length - integer/float
13. home_ownership - category
14. annual_inc - float
15. verification_status - category
16. issue_d - date
17. loan_status - category
18. pymnt_plan - CHECK
19. purpose - CHECK
20. title - CHECK
21. zip_code - CHECK
22. addr_state - CATEGORY
23. dti - the rest....

Actions:
term - remove the months string
int_rate / revol_util - remove the % sign
grade / sub_grade / home_ownership / verification_status / loan_status / pymnt_plan / purpose / addr_state / initial_list_status / application_type - category?

emp_length - remove the non-numeric text
issue_d / earliest_cr_line / last_pymnt_d / last_credit_pull_d -> convert to date
zip_code -> check if possible to convert to integer



In [4]:
import pandas as pd

df = pd.read_csv(r"/Users/matteo/Desktop/Python Projects & Notes/loans_2007.csv", nrows=5)

for n, row in df.iterrows():
    print(row)


id                                1077501
member_id                       1296599.0
loan_amnt                          5000.0
funded_amnt                        5000.0
funded_amnt_inv                    4975.0
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                        24000.0
verification_status              Verified
issue_d                          Dec-2011
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                       

In [45]:
df = pd.read_csv(r"/Users/matteo/Desktop/Python Projects & Notes/loans_2007.csv", nrows=3250)

df.memory_usage(deep=True).sum() / (1024 * 1024)

# Reading 3,250 rows stays just under 5 MB of memory

4.962096214294434
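
The manual trial and error over nrows can also be automated. The following is a minimal sketch, not the method used above: `usage_mb` and `rows_under_budget` are hypothetical helpers, the CSV is a small synthetic stand-in for loans_2007.csv, and the bisection assumes the footprint never shrinks when more rows are read.

```python
import os
import tempfile

import pandas as pd


def usage_mb(path, nrows):
    """Memory footprint in MB of the first nrows rows of the CSV at path."""
    frame = pd.read_csv(path, nrows=nrows)
    return frame.memory_usage(deep=True).sum() / (1024 * 1024)


def rows_under_budget(path, budget_mb, max_rows):
    """Largest nrows in [0, max_rows] whose footprint stays under budget_mb.

    Hypothetical helper: bisects on nrows, assuming the footprint never
    shrinks when more rows are read.
    """
    low, high = 0, max_rows
    while low < high:
        mid = (low + high + 1) // 2
        if usage_mb(path, mid) < budget_mb:
            low = mid       # mid rows fit, try more
        else:
            high = mid - 1  # mid rows are too many
    return low


# Synthetic stand-in for loans_2007.csv (made-up values, 2,000 rows).
demo = pd.DataFrame({
    "id": range(2000),
    "term": [" 36 months"] * 2000,
    "int_rate": ["10.65%"] * 2000,
})
demo_path = os.path.join(tempfile.mkdtemp(), "demo_loans.csv")
demo.to_csv(demo_path, index=False)

best = rows_under_budget(demo_path, budget_mb=0.05, max_rows=2000)
print(best, round(usage_mb(demo_path, best), 4))
```

For the real file, the same call with budget_mb=5 and a max_rows at or below the file's row count would replace the manual adjustment of nrows.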


In [74]:
import re

cols_digits = ["term", "int_rate", "revol_util", "emp_length"]
cols_date = ["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]


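def extract_digits(col):
    # Keep the first run of digits in each string, e.g. "36 months" -> "36".
    # Note: anything after the first run is dropped, so "10.65%" becomes "10".
    return col.str.extract(r"(\d+)", expand=False)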

renaming = {"term":"term months", "int_rate":"int_rate %", "revol_util": "revol_util%", "emp_length":"emp_length years"}

df[cols_date] = df[cols_date].apply(pd.to_datetime)
df[cols_digits] = df[cols_digits].apply(extract_digits).apply(pd.to_numeric)

df = df.rename(columns=renaming)
df.head()

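KeyError: "None of [Index(['term', 'int_rate', 'revol_util', 'emp_length'], dtype='object')] are in the [columns]"

Note: this KeyError appears to come from re-running the cell after df had already been renamed by an earlier run (the cell counter, 74, is higher than those of the cells that follow). On a freshly loaded df the original column names are still present.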
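
The pattern `(\d+)` captures only the first run of digits, so the decimals of int_rate and revol_util are lost. As a hedged alternative, the sketch below keeps the fractional part and parses the dates with an explicit format. `extract_number` is a hypothetical helper, the sample rows are made up, and the format string assumes values shaped like Dec-2011.

```python
import pandas as pd


def extract_number(col):
    """Pull the first integer or decimal number out of each string (hypothetical helper)."""
    return pd.to_numeric(col.str.extract(r"(\d+(?:\.\d+)?)", expand=False))


# Made-up rows in the same shape as the loans data.
sample = pd.DataFrame({
    "term": [" 36 months", " 60 months"],
    "int_rate": ["10.65%", "15.27%"],
    "emp_length": ["10+ years", "< 1 year"],
    "issue_d": ["Dec-2011", "Nov-2011"],
})

cleaned = sample.copy()
for name in ["term", "int_rate", "emp_length"]:
    cleaned[name] = extract_number(sample[name])
# An explicit format avoids inferring the date layout from the values.
cleaned["issue_d"] = pd.to_datetime(sample["issue_d"], format="%b-%Y")

print(cleaned)
```

With this variant int_rate would be stored as a float column, so the memory figures reported in this notebook would not carry over unchanged.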

In [47]:
df.memory_usage(deep=True).sum() / (1024 * 1024)

3.560114860534668

Converting to datetime and numeric types alone reduced memory usage by about 1.4 MB (from 4.96 MB to 3.56 MB). Next, columns with fewer than 20 unique values are candidates for conversion to the category dtype.

In [48]:
check_cat = ["grade", "sub_grade", "home_ownership", "verification_status", "loan_status",
             "pymnt_plan", "purpose", "addr_state", "initial_list_status", "application_type"]

# Count unique values per candidate column in the 3,250-row sample held in df
num_unique = {}

for cat in check_cat:
    num_unique[cat] = df[cat].nunique()

test = pd.DataFrame(num_unique.items(), columns=["Col_name", "Number Unique"])

test

| | Col_name | Number Unique |
|---|---|---|
| 0 | grade | 7 |
| 1 | sub_grade | 35 |
| 2 | home_ownership | 3 |
| 3 | verification_status | 3 |
| 4 | loan_status | 6 |
| 5 | pymnt_plan | 1 |
| 6 | purpose | 13 |
| 7 | addr_state | 43 |
| 8 | initial_list_status | 1 |
| 9 | application_type | 1 |


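Except for addr_state (43 unique values), these columns will be converted to categories for memory optimization. Note that sub_grade, with 35 unique values, also exceeds the threshold of 20 but is converted anyway.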
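
The counts above are computed on df, the 3,250-row sample, so the full file may contain more distinct values per column. A minimal sketch of counting over the whole file without loading it at once follows; `count_unique` is a hypothetical helper and the CSV text is made-up data standing in for loans_2007.csv.

```python
import io

import pandas as pd


def count_unique(csv_source, columns, chunksize=1000):
    """Count distinct non-null values per column, reading the CSV chunk by chunk."""
    seen = {name: set() for name in columns}
    for chunk in pd.read_csv(csv_source, usecols=columns, chunksize=chunksize):
        for name in columns:
            seen[name].update(chunk[name].dropna().unique())
    return {name: len(values) for name, values in seen.items()}


# Made-up data standing in for loans_2007.csv.
csv_text = "grade,addr_state,dti\nA,AZ,1.0\nB,CA,2.0\nA,NY,3.0\nC,AZ,4.0\n"
counts = count_unique(io.StringIO(csv_text), ["grade", "addr_state"], chunksize=2)
print(counts)
```

Passing the real file path instead of the StringIO buffer would give the full-file counts for the candidate columns.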

In [54]:
check_cat = ["grade", "sub_grade", "home_ownership", "verification_status", "loan_status",
             "pymnt_plan", "purpose", "initial_list_status", "application_type"]


for cat_col in check_cat:
    df[cat_col] = df[cat_col].astype("category")
    
def memory_usage(df):
    return str(round(df.memory_usage(deep=True).sum() / (1024 * 1024), 2)) + " MB"

memory_usage(df)

'1.83 MB'
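
To illustrate where the saving comes from, here is a small sketch on made-up data: a category column stores each distinct string once plus a compact integer code per row, instead of one string per row.

```python
import pandas as pd

# Made-up column with few distinct values repeated many times.
grades = pd.Series(["A", "B", "C", "D"] * 2500, name="grade")
as_category = grades.astype("category")

before = grades.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print(f"strings: {before} bytes, category: {after} bytes")
print(as_category.cat.codes.dtype)  # integer type used for the per-row codes
```

The gain shrinks as the number of distinct values approaches the number of rows, which is why the unique-value count was checked first.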

In [87]:


df.select_dtypes(include=["integer", "float"]).columns

to_integer = ['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term months', 'installment', 'emp_length years',
       'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc',
       'pub_rec', 'revol_bal', 'total_acc', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 
        'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq',
       'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies',
       'tax_liens']

df[to_integer] = df[to_integer].apply(pd.to_numeric, downcast="integer")
df.select_dtypes(include="integer")

types = df.dtypes.to_frame()

types = types[types[0] != "datetime64[ns]"]

types

| | 0 |
|---|---|
| id | int32 |
| member_id | int32 |
| loan_amnt | int32 |
| funded_amnt | int32 |
| funded_amnt_inv | float64 |
| term months | int8 |
| int_rate % | int64 |
| installment | float64 |
| grade | category |
| sub_grade | category |


We reduced the memory usage of the 3,250-row sample from about 5 MB to 1.83 MB.
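
In the dtype listing, funded_amnt_inv and installment stay float64 because downcast="integer" only converts columns whose values are all whole numbers. A possible further step, sketched here on made-up values and not applied in this notebook, is downcast="float", which trades precision for a smaller float type.

```python
import pandas as pd

# Made-up values; the column names mirror the data set.
frame = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "installment": [162.87, 59.83, 84.33],
})

# Whole-number floats fit an integer type; fractional values are left as float64.
as_int = frame.apply(pd.to_numeric, downcast="integer")
# downcast="float" shrinks float64 to float32 (about 7 significant digits).
as_float = frame.apply(pd.to_numeric, downcast="float")

print(as_int.dtypes)
print(as_float.dtypes)
```

Whether float32 precision is acceptable for monetary columns depends on how the values are used afterwards.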

Now read the whole file in chunks, cleaning each chunk and applying the optimized data types.

In [98]:
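def extract_digits(col):
    # Keep the first run of digits in each string, e.g. "36 months" -> "36".
    # Note: anything after the first run is dropped, so "10.65%" becomes "10".
    return col.str.extract(r"(\d+)", expand=False)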


def chunk_processing(path, chunksize=5000):
    # Columns whose digits are extracted from strings
    cols_digits = ["term", "int_rate", "revol_util", "emp_length"]
    # Columns parsed as dates
    cols_date = ["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]
    # Columns coerced to numeric (unparseable values become NaN)
    numerical = ['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term months', 'installment', 'emp_length years',
       'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc',
       'pub_rec', 'revol_bal', 'total_acc', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq',
       'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies',
       'tax_liens']
    # Columns read as categorical
    check_cat = ["grade", "sub_grade", "home_ownership", "verification_status", "loan_status",
             "pymnt_plan", "purpose", "initial_list_status", "application_type"]
    # New names for the cleaned columns
    renaming = {"term":"term months", "int_rate":"int_rate %", "revol_util": "revol_util%", "emp_length":"emp_length years"}

    # Map each categorical column to the "category" dtype for read_csv
    categorical = {cat: "category" for cat in check_cat}

    chunk_iter = pd.read_csv(path, chunksize=chunksize, parse_dates=cols_date, dtype=categorical)

    # Collect the cleaned chunks and concatenate once at the end,
    # instead of growing a DataFrame inside the loop
    chunks = []

    for chunk in chunk_iter:
        chunk[cols_digits] = chunk[cols_digits].apply(extract_digits).apply(pd.to_numeric)
        chunk = chunk.rename(columns=renaming)
        chunk[numerical] = chunk[numerical].apply(pd.to_numeric, errors="coerce")
        chunks.append(chunk)

    # An empty file yields no chunks; return an empty DataFrame in that case
    return pd.concat(chunks) if chunks else pd.DataFrame()

In [103]:
optimized = memory_usage(chunk_processing(r"/Users/matteo/Desktop/Python Projects & Notes/loans_2007.csv", 10000))

In [102]:
non_optimized = memory_usage(pd.read_csv(r"/Users/matteo/Desktop/Python Projects & Notes/loans_2007.csv"))

In [106]:
print("Optimized:", optimized, "//", "Non-optimized", non_optimized)

Optimized: 34.33 MB // Non-optimized 66.44 MB
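
One caveat with reading each chunk with dtype="category": every chunk infers its own categories, and pd.concat falls back to a plain, non-categorical dtype when the category sets differ between chunks. The sketch below reproduces this on two made-up chunks and re-casts after concatenation; whether it affects the 34.33 MB figure above has not been checked here.

```python
import pandas as pd

# Two made-up chunks whose category sets differ.
chunk_a = pd.DataFrame({"grade": pd.Categorical(["A", "B"])})
chunk_b = pd.DataFrame({"grade": pd.Categorical(["B", "C"])})

combined = pd.concat([chunk_a, chunk_b], ignore_index=True)
dtype_after_concat = combined["grade"].dtype
print(dtype_after_concat)  # no longer categorical

# Re-cast once on the combined column to restore the category dtype.
combined["grade"] = combined["grade"].astype("category")
print(combined["grade"].dtype)
```

Applied to the function above, this would mean re-casting the check_cat columns on the concatenated result before measuring memory.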
