# Part 1: The data
## Objectives
1. Read in the data from `loans_2007.csv`
2. Identify any data quality issues

# Part 2: Explore the data in chunks
## Objectives
1. For each chunk:   
    1. How many columns have a numeric type? 
    2. How many columns have a string type?
    3. How many unique values are there in each string column? 
    4. How many of the string columns contain values that are less than 50% unique?
    5. Which float columns have no missing values and could be candidates for conversion to the integer type?
2. Calculate the total memory usage across all of the chunks

# Part 3: Optimize string columns
## Objectives
1. While working with dataframe chunks:
    1. Determine which string columns you can convert to a numeric type if you clean them. For example, the `int_rate` column is only a string because of the `%` sign at the end.
    2. Determine which columns have a few unique values and convert them to the category type. For example, you may want to convert the `grade` and `sub_grade` columns.
    3. Based on your conclusions, perform the necessary type changes across all chunks. Calculate the total memory footprint, and compare it with the previous one.
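
As a rough illustration of the category step, the sketch below compares the memory footprint of a low-cardinality string column before and after conversion. The `grades` series is made-up data standing in for a column like `grade`; it is not read from `loans_2007.csv`.

```python
import pandas as pd

# Hypothetical low-cardinality column: three distinct values repeated many times
grades = pd.Series(['B', 'C', 'A', 'B', 'C'] * 1000)

# A category column stores each distinct value once plus small integer codes
as_category = grades.astype('category')

before = grades.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print('string:', before, 'bytes; category:', after, 'bytes')
```

The saving shrinks as the share of unique values grows, which is why a threshold such as "less than 50% unique" is a reasonable screen before converting.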
    
# Part 4: Optimize numeric columns
## Objectives
1. While working with dataframe chunks:
    1. Identify float columns that contain missing values and that we can convert to a more space-efficient subtype.
    2. Identify float columns that don't contain any missing values, and that we can convert to the integer type because they represent whole numbers.
    3. Based on your conclusions, perform the necessary type changes across all chunks. Calculate the total memory footprint and compare it with the previous one.
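
One possible way to approach the two cases above is sketched here with `pd.to_numeric` and its `downcast` parameter. The frame is made-up data, and the column roles (one complete whole-number column, one with a missing value) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical chunk: column names borrowed from the dataset, values invented
chunk = pd.DataFrame({
    'loan_amnt': [5000.0, 2500.0, 2400.0],    # whole numbers, nothing missing
    'annual_inc': [24000.0, None, 12252.0],   # contains a missing value
})

for col in chunk.select_dtypes(include='floating').columns:
    values = chunk[col]
    if values.notna().all() and (values % 1 == 0).all():
        # Complete whole-number column: use the smallest integer subtype that fits
        chunk[col] = pd.to_numeric(values.astype('int64'), downcast='integer')
    else:
        # NaN needs a float type, so only try a smaller float subtype
        chunk[col] = pd.to_numeric(values, downcast='float')

print(chunk.dtypes)
```

When working in chunks, the subtype chosen for one chunk may be too small for values in another, so it is safer to decide on one dtype per column after inspecting every chunk.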

In [1]:
import pandas as pd
pd.options.display.max_columns = 99

# Explore the Data

In [2]:
df = pd.read_csv('loans_2007.csv', low_memory=False)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df.head(2), df.tail(2)])

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
42536,Total amount funded in policy code 1: 471701350,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
42537,Total amount funded in policy code 2: 0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [3]:
df.shape

(42538, 52)

# Memory Usage

Find how many rows we can load at once while staying below 5 MB of memory usage. That row count becomes the `chunksize` when we process the data in chunks.

In [4]:
num_rows = 3000
df = pd.read_csv('loans_2007.csv', nrows=num_rows)
mem_usage = df.memory_usage(deep=True).sum()/(1024*1024)

print(num_rows, 'rows require about', mem_usage.round(2), 'MB of memory')

3000 rows require about 4.65 MB of memory
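
Instead of trying row counts by hand, the limit can be estimated from the average cost of one row. The sketch below uses a small made-up frame in place of the real sample; `budget_bytes`, `bytes_per_row`, and `max_rows` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for a sample read with pd.read_csv(..., nrows=...)
sample = pd.DataFrame({
    'loan_amnt': [5000.0, 2500.0, 2400.0, 10000.0],
    'term': ['36 months', '60 months', '36 months', '36 months'],
    'grade': ['B', 'C', 'C', 'C'],
})

budget_bytes = 5 * 1024 * 1024  # the 5 MB target

# Average deep memory cost of one row in the sample (index included)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Largest whole number of rows that stays within the budget
max_rows = int(budget_bytes // bytes_per_row)
print(max_rows, 'rows fit in 5 MB at about', round(bytes_per_row, 1), 'bytes per row')
```

Applied to the measurement above (about 4.65 MB for 3000 rows), the same arithmetic gives roughly 3,200 rows. Later chunks may hold longer strings than the first 3000 rows, so 3000 leaves some headroom.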


# Process the Data in Chunks

How many columns have a numeric type? 

In [5]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    cols = list(chunk.select_dtypes(include='number').columns)
    print(len(cols), 'numeric type columns:\n')
    print(cols)
    print('='*50)

31 numeric type columns:

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies', 'tax_liens']
31 numeric type columns:

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq', 'chargeo

In the results above, the last two chunks have only 30 numeric type columns, while the rest have 31. In those two chunks pandas reads the `id` column as a string rather than a number, so they must contain some non-numeric ids. Once we filter those rows out, we can cast the `id` column to an integer.
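
To see which ids are the problem, one option is to coerce the column with `pd.to_numeric` and look at the rows that fail to parse. The chunk below is made-up; its last row imitates the summary-style rows visible in the `df.tail(2)` output.

```python
import pandas as pd

# Hypothetical chunk: two numeric ids plus a summary-style row
chunk = pd.DataFrame({
    'id': ['1077501', '1077430', 'Total amount funded in policy code 1: 471701350'],
    'loan_amnt': [5000.0, 2500.0, None],
})

# Values that cannot be parsed as numbers become NaN
parsed = pd.to_numeric(chunk['id'], errors='coerce')

# Rows whose id failed to parse are the ones to filter out
bad_rows = chunk[parsed.isna()]
print(bad_rows['id'].tolist())
```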

In [6]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    # Filter out non-numeric ids; copy so the assignment below doesn't target a view
    good_id_mask = pd.to_numeric(chunk['id'], errors='coerce').notna()
    chunk = chunk[good_id_mask].copy()

    # Cast id column if not already int64
    if chunk['id'].dtype != 'int64':
        chunk['id'] = chunk['id'].astype('int64')
    
    # Number of numeric type columns
    cols = list(chunk.select_dtypes(include='number').columns)
    print(len(cols), 'numeric type columns')
    print('='*50)        

31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns
31 numeric type columns


With the bad ids removed and the `id` column cast to int64 in every chunk, each chunk reports 31 numeric type columns.

# Scratch work below

In [88]:
# For each chunk, how many columns have a string type?
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    string_types = chunk.select_dtypes(include='object')
    print(string_types.columns)

Index(['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state',
       'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d',
       'last_credit_pull_d', 'application_type'],
      dtype='object')
Index(['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state',
       'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d',
       'last_credit_pull_d', 'application_type'],
      dtype='object')
Index(['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state',
       'earliest_cr_line', 'revol_util', 'ini

In [28]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, skiprows=[39786, 42536, 42537])

for chunk in chunk_iter:    
    # String type columns
    string_types = chunk.select_dtypes(include='object')
    
    print('String type columns ( n =', string_types.columns.size, '):')
    print(string_types.columns, '\n')
    
    # String type columns that can be converted to numeric type
    #     term       - after removing 'months'
    #     int_rate   - after removing '%'
    #     revol_util - after removing '%'
    print('String type columns that can be converted to numeric type:')
    for col in string_types.columns:
        try:
            if col == 'term':
                # Remove 'months'
                chunk[col] = chunk[col].apply(lambda term: term.split()[0])
            elif col == 'int_rate':
                # Remove the percent sign
                chunk[col] = chunk[col].apply(lambda int_rate: int_rate.split('%')[0])
            elif col == 'revol_util':
                # Remove the percent sign
                chunk[col] = chunk[col].apply(lambda revol_util: revol_util.split('%')[0])
                
            pd.to_numeric(chunk[col])
            print('Can convert', col, 'to a numeric type')
        except ValueError:
            print('Cannot convert', col, 'to a numeric type')

    # Unique values in string type columns with percentage
    string_types_nunique = chunk.select_dtypes(include='object').nunique()
    string_types_counts = chunk.select_dtypes(include='object').count()  
    
    print('\nUnique values in each string type column:')
    
    max_colname_len = max(map(len, string_types.columns))
    for column, unique, count in zip(string_types.columns, string_types_nunique, string_types_counts):
        convert_to_category = 'yes' if unique < count/2 else 'no'        
        format_string = 'Column: {0:' + str(max_colname_len) + \
            '} Unique: {1:<4} Count: {2:<4} ({3:6.2%} unique)' + \
            ' Convert to category type? {4:3}' 
        print(format_string.format(column, unique, count, (unique/count), convert_to_category))
    
    # Float columns with no missing values
    print('\nFloating type columns with no missing values:')
    
    float_types = chunk.select_dtypes(include='floating')
    na_counts = float_types.isna().sum()
    print(na_counts.loc[lambda x : x == 0].index)
    
    # Chunk separator
    print('{0}{1:^14}{2}'.format('='*30, 'End of chunk', '='*30))



String type columns ( n = 21 ):
Index(['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state',
       'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d',
       'last_credit_pull_d', 'application_type'],
      dtype='object') 

String type columns that can be converted to numeric type:
Can convert term to a numeric type
Can convert int_rate to a numeric type
Cannot convert grade to a numeric type
Cannot convert sub_grade to a numeric type
Cannot convert emp_title to a numeric type
Cannot convert emp_length to a numeric type
Cannot convert home_ownership to a numeric type
Cannot convert verification_status to a numeric type
Cannot convert issue_d to a numeric type
Cannot convert loan_status to a numeric type
Cannot convert pymnt_plan to a numeric type
Cannot convert purpose to a numeric type
Cannot convert title to

AttributeError: 'float' object has no attribute 'split'
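
The `AttributeError` is most likely caused by missing values: pandas stores them as float `NaN`, so the lambda ends up calling `.split` on a float, and the `except ValueError` clause does not catch that exception type. A minimal sketch of a NaN-tolerant alternative, using the vectorized `.str` accessor on made-up data, might look like this:

```python
import pandas as pd

# Hypothetical chunk with a missing value in each string column
chunk = pd.DataFrame({
    'term': ['36 months', '60 months', None],
    'int_rate': ['10.65%', None, '15.27%'],
    'revol_util': ['83.7%', '9.4%', None],
})

# The .str accessor passes missing values through instead of raising
chunk['term'] = pd.to_numeric(chunk['term'].str.split().str[0])
for col in ['int_rate', 'revol_util']:
    chunk[col] = pd.to_numeric(chunk[col].str.rstrip('%'))

print(chunk.dtypes)
```

The cleaned columns come out as floats because the missing values are kept as `NaN`.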