# Data Wrangling Lending Club Data

### Summary
This notebook walks through the data wrangling steps used on Lending Club Data. These include:

**Removing Extraneous Data**
1. Removing columns with more than 50% missing values 
2. Removing columns based on description that: 
    * Leaked information from the future.
    * Contained redundant information.
3. Removing columns with only one unique value.  

**Preparing features for data exploration and machine learning**
1. Preparing Categorical columns by:
    * Mapping ordinal values to integers.
    * Encoding nominal values as dummy variables.
2. Removing percentage signs from continous data. 
3. Investigating the target column.
4. Handling missing values by:
    * Dropping rows with missing values under certain criteria.
    * Imputing the remaining missing values using observations from data.

## Importing the data

In [1]:
# importing relevant packages
import pandas as pd
import math
import matplotlib.pyplot as plt

# using jupyter magic to display plots in line
%matplotlib inline

# importing the dataset
loan_data = pd.read_csv('Loan_data.csv', low_memory=False, skiprows=1)

In [2]:
# viewing the size of the dataset
print('The size of the dataset: ' + str(loan_data.shape))

# viewing the first few columns of the dataset
loan_data.head()

The size of the dataset: (42538, 151)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,1077501,,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,,,Cash,N,,,,,,
1,1077430,,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,,,Cash,N,,,,,,
2,1077175,,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,,,Cash,N,,,,,,
3,1076863,,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,,,Cash,N,,,,,,
4,1075358,,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,,,Cash,N,,,,,,


From the above output we can see the dataset has 42538 rows and 151 columns. This is a large volume of data and not all of it is needed for the intended loan classification analysis.

## Removing Extraneous Data
**Columns with more than 50% missing values**

These columns will be hard to work with and will therefore be removed from the dataset. The code used to do this is shown below.

In [3]:
# Removing columns with more than 50% 
loan_data = loan_data.dropna(thresh = 0.5*len(loan_data), axis = 1)

# print the size of the dataset
print('The size of the dataset: ' + str(loan_data.shape))

The size of the dataset: (42538, 60)


**Removing columns based on descriptions**

The data now has 60 columns, each of which will be reviewed based on descriptions found in the [Lending Club Data Dictionary](https://resources.lendingclub.com/LCDataDictionary.xlsx). A list of the remaining 60 columns is shown below: 

In [4]:
# displaying columns remaining 
loan_data.columns

Index(['id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       'home_ownership', 'annual_inc', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title',
       'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'open_acc',
       'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
       'last_fico_range_high', 'last_fico_range_low',
       'collections_12_mths_ex_med', 'policy_code', 'application_type',
       'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt',
       'pub_rec_bankruptcies', 't

From the above list, the columns not needed are listed below with a breif description of why.
1. id - a random unique identifier created by Lending Club
2. funded_amnt - leaks information from the future (amount funded)
3. funded_amnt_inv - leaks information from the future (amount investors funded)
4. issue_d - leaks information from the future (month which loan was funded)
5. url - does not provide useful information
6. zip_code - only first 3 letters of zipcode given, provides the same information as addr_state 
7. out_prncp - leaks data from the future (outstanding principal)
8. out_prncp_inv - leaks data from the future (outstanding principal investors portion of fund)
9. total_pymnt - leaks data from the future (payments received to date on loan funded)
10. total_pymnt_inv - leaks data from the future (payments received to date on loan funded)
11. total_rec_prncp - leaks information from the future
12. total_rec_int - leaks information from the future
13. total_rec_late_feev- leaks information from te future
14. recoveries - leaks information from the future
15. collection_recovery_fee - leaks information from the future
16. last_pymnt_d - leaks information from the future
17. last_pymnt_amnt - leaks information from the future
18. last_credit_pull_d - leaks information from the future
19. last_fico_range_high - leaks information from the future 
20. last_fico_range_low - leaks infromation from the future
21. sub_grade - contains redundant information.

It should be noted that while issue_d is on this list, it will not be dropped immediately. We start by dropping the other 20 columns.

In [5]:
# creating a list of the columns listed above 
cols_to_drop = ['id', 'funded_amnt', 'funded_amnt_inv', 'url', 'sub_grade',
                'zip_code', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 
                'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
                'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
                'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
                'last_fico_range_high', 'last_fico_range_low']

# dropping the columns listed above  
loan_data = loan_data.drop(cols_to_drop, axis=1)

The descriptions alone were not enough to decide whether to drop certain columns. The columns listed below require further investigation to decide how to deal with them:
* fico_range_high and fico_range_low
* purpose and title 
* earliest_cr_line and issue_d

**FICO score columns:** The fico_range_high and fico_range_low columns display represent the region within which a borrower's FICO score is in. There are 44 unique ranges. Having the range values in two columns is inefficient as the average of the range can be used to form one categorical column. 

In [6]:
# creating the fico_average column
loan_data['fico_average'] = (loan_data['fico_range_high'] + loan_data['fico_range_low'])/2

# dropping the fico range columns 
loan_data = loan_data.drop(['fico_range_low','fico_range_high'], axis=1)

**Purpose and title columns:** The purpose and tilte columns are both provided by the borrower. The purpose column contains categorical information on the purpose of the loan while the title column contains the name the borrower assigns the loan. These two columns contain the very similar information however, the purpose column is better categorized (as shown below). For this reason, the title column is dropped.

In [7]:
# printing the number of unique values in each column
print('Number of unique values in the purpose column: ' + str(loan_data['purpose'].nunique()))
print('Number of unique values in the title column: ' + str(loan_data['title'].nunique()))

# dropping the title column
loan_data = loan_data.drop(['title'], axis=1)

Number of unique values in the purpose column: 14
Number of unique values in the title column: 21264


**Earliest Credit line:** An important feature when determining credit scores is the age of the oldest account. For this reason, the earliest_cr_line column will be engineered to estimate the age of each borrowers oldest account. This will be achieved by calculating the difference between the month which a loan was funded (issue_d) and the borrower's earliest credit line (earliest_cr_line). It is a reasonable approximation as it provides a good estimate of the age of credit line an investor will see when deciding whether to invest in the loan.

In [8]:
# converting the earliest credit line column to datetime
loan_data['earliest_cr_line']= pd.to_datetime(loan_data['earliest_cr_line'])

# converting the loan issue date column to datetime
loan_data['issue_d'] = pd.to_datetime(loan_data['issue_d'])

# estimating the age of the oldest credit line
loan_data['age_cr_line'] = loan_data['issue_d']- loan_data['earliest_cr_line'] 

# dropping the earliest credit line and loan issue date columns
loan_data = loan_data.drop(['earliest_cr_line', 'issue_d'], axis =1)

In [9]:
# print the size of the dataset
print('The size of the dataset: ' + str(loan_data.shape))

The size of the dataset: (42538, 37)


There are now 37 columns left in the dataset.

**Removing columns with one unique value**

Columns that have only one unique value are not useful for loan classification. These colums are removed below. 

In [10]:
# removing columns with only one unique value 
loan_data = loan_data.loc[:,loan_data.apply(func=pd.Series.nunique, args=(False)) > 1]

# printing the size of the dataset
print('The size of the dataset: ' + str(loan_data.shape))

The size of the dataset: (42538, 29)


There are now 29 columns remaining. The name of these columns are shown below. 

In [11]:
# displaying name of remaining columns
loan_data.columns

Index(['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'loan_status', 'desc', 'purpose', 'addr_state', 'dti', 'delinq_2yrs',
       'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util',
       'total_acc', 'acc_now_delinq', 'delinq_amnt', 'pub_rec_bankruptcies',
       'tax_liens', 'debt_settlement_flag', 'fico_average', 'age_cr_line'],
      dtype='object')

## Preparing features for data exploration and machine learning

### Wrangling Categorical columns

**Categorizing the employer title column**

As fully categorizing the employer title column will be significant work, only some of the most common employer titles are categorized into the following groups:
1. No response
2. Unemployed
3. Self employed
4. Government (and government contractors)
5. Financial, consulting and insurance services
6. Postal, transportation and utilities services
7. Technology and telecommunications
8. Retail, food & beverage, and hospitality services
9. Educational and health services
10. Construction, Design and Manufaturing industries
11. All uncategorized employer titles 

In [12]:
# creating a function to categorize employrt title
def employer_categorizer(emp_title):
    
    # creating a placeholder for the category
    category = int()
    
    try:
        # creating a list of unemployed/employed words
        unemploy = ["unemploy", 'retire', 'un-employ', 'un employ']
        
        # check if unemployed
        if any(word in emp_title.lower() for word in unemploy):
            category = 'unemployed'
        
        # check if self-employed
        elif "self" in emp_title.lower():
            category = 'self_employed'
        
        else:
            
            # list of words for government/government contractor related employer title
            government = ["usaf", 'army', "air force", "marine corps", "revenue", 'northrop',
                          "military", "navy", "usmc", "coast guard", "state of", "lockheed",
                          "defense", 'department', 'government','nypd', "patrol", "federal",
                          "irs", "fbi", "usmc", 'county', 'city of', 'social security', 
                          'honeywell', 'police', 'nasa','raytheon', 'johnson', 'bae system', 
                          'general dynamics']
            
            # list of words for financial/insurance/consulting services related work
            Financial_services = ["bank", "jp", "chase", "wells", "morgan", 'deloitte',
                                  "fidelity", "investment", "american express", "lynch", 'aig', 
                                  "edward jones", "hsbc", "barclays", "capital one", 'kpmg',
                                  "schwab", "insurance", 'pricewater', 'geico', 'goldman',
                                  "accenture",  "ubs", "citi", 'ernst', 'credit', 'booz', 
                                  'advis', 'mutual', 'consult', 'adp', 'financ', 'manag']
            
            # list of words for retail, restaurants and hotel related work
            Retail_services = ['walmart', 'walgreens', 'target', 'cvs', 'best buy', 'bee',
                               'depot', 'nordstrom', 'costco', 'wal-mart', 'market', 'food',
                               'macy\'s', 'trader', 'staples', 'safeway', 'rite aid', 'hotel', 
                               'marriot', 'deli', 'pizza', 'resort', 'bucks', 'macy', 'cafe', 
                               'disney', 'sears', 'gap', 'autozone', 'steak']
            
            # list of words for postal, transport and utilities
            post_services = ["usps", "postal", "ups", "parcel", 'fedex', 'federal express', 
                             "motor", 'electric', 'water', 'car', 'airline', 'car', 
                             'chevrolet', 'truck', 'rail', 'aero' ]
            
            
            # list of words for telecommunications/tech related 
            Tele_tech = ['verizon', 'at&t', 'comcast', 'sprint', "mobile",'apple', 'amazon',
                         'time warner', 'thomson reuters', 'microsoft', 'technology', 'hp',
                         'ibm',  'att', 'dell', 'hewlet', 'communication', 'computer', 'intel', 
                         'cisco', 'oracle', 'cable', 'software', 'internet','at and t', 'mtv',
                         'siemens', 'tv']
            
            # list of words for educational/health related services
            Educational = ['georgia', 'college', 'university', 'school', 'education', 
                           'california', 'ucsf', 'institute', "health", 'kaiser', 'pfizer', 
                           'hospital', 'aetna', 'life', 'clinic', 'ucla', 'medic', 'cancer', 
                           'research', 'saic', 'diagnostic']
           
            # list of words for construction/manufacturing services
            manufacture = ['construction', 'development', 'excavation', 'boeing', 'conoco', 
                           'bechtel', 'manufactur', 'schlum', 'chevron', 'mobil', 'urs']
            
            # check if borrower works in the US military
            if any(employer in emp_title.lower() for employer in government):
                category = 'government/contractors' 
            
            # check if borrower works in financial/consulting services
            elif any(employer in emp_title.lower() for employer in Financial_services):
                category = 'finance/consulting'
           
            # check if borrower works in transportation and postal services
            elif any(employer in emp_title.lower() for employer in post_services):
                category = 'post/transportation/utilities'
            
            # check if borrower works in telecommunications
            elif any(employer in emp_title.lower() for employer in Tele_tech):
                category = 'tech/telecom'
           
            # check if borrower works in telecommunications/technology
            elif any(employer in emp_title.lower() for employer in Retail_services):
                category = 'retail/food/hospitality'
                
            # check if borrower works in education
            elif any(employer in emp_title.lower() for employer in Educational):
                category = 'education/health'
                
            # check if borrower works in health
            elif any(employer in emp_title.lower() for employer in manufacture):
                category = 'construction/manufacture'
            
            else:
                category = 'uncategorized'
               
    except Exception:
        # check if no response
        if math.isnan(emp_title):
            category = 'no_response'
    return category

# creating categorical employer title columns
loan_data['emp_title_cat'] = loan_data.emp_title.apply(employer_categorizer)

# dropping the employer_title column
loan_data = loan_data.drop(['emp_title'], axis =1)

# viewing the results 
loan_data['emp_title_cat'].value_counts()

uncategorized                    21942
education/health                  4391
government/contractors            4032
finance/consulting                3169
no_response                       2629
post/transportation/utilities     2177
tech/telecom                      1751
retail/food/hospitality           1626
construction/manufacture           499
self_employed                      238
unemployed                          84
Name: emp_title_cat, dtype: int64

**Categorizing the loan description column**

A proper categorization of the loan description column will require natural language processing. However, for this project a simple classification will be done based on the borrowers that provided a description and the borrowers that did not. The categories used are shown below. 
* 0: No response
* 1: Description provided

In [13]:
def desc_categorizer(description):
    
    # creating a placeholder for categort
    category = int()
    
    # check if response was not provided
    try:
        message = description.lower()
        category = 1
    
    except Exception:
        category = 0
    
    return category

# creating categorical description columns
loan_data['desc_cat'] = loan_data['desc'].apply(desc_categorizer)

# dropping the employer_title column
loan_data = loan_data.drop(['desc'], axis =1)

**Using ordinal values to categorize the employment length and grade columns**

The employment length and grade columns are converted to numeric type for data exploration and machine learning. For the employment length column, 10 or more years of employment is categorized 10 years of employment, while missing values and responses indicating less than 1 year of experience are categorized as 0 years of employment. The map for the employment column is shown below:

In [14]:
# Map for the employment length column
mapping_dict = {"emp_length": {"10+ years": 10,
                                     "9 years": 9,
                                     "8 years": 8,
                                     "7 years": 7,
                                     "6 years": 6,
                                     "5 years": 5,
                                     "4 years": 4,
                                     "3 years": 3,
                                     "2 years": 2,
                                     "1 year": 1,
                                     "< 1 year": 0,
                                     "n/a": 0
                                    },
                      "grade":{"A": 1,
                               "B": 2,
                               "C": 3,
                               "D": 4,
                               "E": 5,
                               "F": 6,
                               "G": 7
                              }
                     }


# converting the columns
loan_data = loan_data.replace(mapping_dict)

**Using dummy columns to categorize the nominal variables**

Since nominal variables cannot be ranked, dummy columns will be made to categorize them. This code for this is shown below for the columns: "home_ownership", "verification_status", "purpose", "term", "debt_settlement_flag" and 
"emp_title_cat".

In [15]:
# creating a list of nominal columns
nominal_columns = ["home_ownership", "verification_status", "purpose", "term", 
                   "debt_settlement_flag", "emp_title_cat"]

# creating dummy columns 
dummy_df = pd.get_dummies(loan_data[nominal_columns], drop_first = True)

# concatenating the columns to loan_data dataframe
loan_data = pd.concat([loan_data, dummy_df], axis=1)

# dropping the nominal columns
loan_data = loan_data.drop(nominal_columns, axis=1)

The addr_state column contains too many nominal variables. For this reason, the states are categorixed based on the region of the country it is in (West, Midwest, Northeast and South). Subsequently, dummy columns are made for them. 

In [16]:
# creating a function to categorize states by region
def state_categorizer(state):
    
    # making lists of states by region
    West = ["CA", "OR", "NV", "WA", "ID", "UT", "AZ", "NM", "CO", "WY", "MT"]
    Midwest = ["MD", "MN", "WY", "SD", "NE", "KS", "MO", "IA", "WI", "IL", "MI",
               "IN", "OH"]
    Northeast = ["ME", "NH", "VT", 'PA', "CT", "NY", "MA", "CT", "NJ", "RI"]
    South = ["TX", "OK", "AR", "LA", "MS", "AL", "TN", "KY", "GA", "FL", "SC"
             "NC", "VA", "WV", "DC", "MD", "DE"]
    
        
    # check which category state belongs to
    if any(state in state.upper() for state in West):
            category = 'West'
    elif any(state in state.upper() for state in Midwest):
            category = 'Midwest'
    elif any(state in state.upper() for state in South):
            category = 'South'
    elif any(state in state.upper() for state in Northeast):
            category = 'Northeast'
    else: 
        category = 'uncategorized'
    
    return category

# creating categorical employer title columns
loan_data['region'] = loan_data.addr_state.apply(state_categorizer)

# making dummy region columns
dummy_region = pd.get_dummies(loan_data['region'], drop_first = True)

# concatenating the columns to loan_data dataframe
loan_data = pd.concat([loan_data, dummy_region], axis=1)

# dropping the region and addr_state columns
loan_data = loan_data.drop(['addr_state', 'region'], axis=1)

### Cleaning the reovolving utililization and interest rate columns

The revolving utilization and interest rate columns have percentage signs in front of them that need to be removed for analysis. This is done in the code below.

In [17]:
# converting the interest rate and revolving utilization columns to float
loan_data["int_rate"] = loan_data["int_rate"].str.rstrip("%").astype("float")
loan_data["revol_util"] = loan_data["revol_util"].str.rstrip("%").astype("float")

### Preparing the target column
The target column for loan classification is the loan_status column. A quick look at the variables and their respective counts in the loan_status column is shown below.

In [18]:
# showing variables and count
loan_data['loan_status'].value_counts()

Fully Paid                                             34116
Charged Off                                             5670
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Name: loan_status, dtype: int64

According to Lendinc Club, loans of the type which does not meet their credit policy will no more be offered to investors. Consequently, these columns will be discarded and the remaining rows will be categorized such that:
* Fully Paid: 1
* Charged Off: 0

In [19]:
# removing rows that do not meet Lending Club's credit policy
loan_data = loan_data[(loan_data['loan_status'] == 'Fully Paid')|
                       (loan_data['loan_status'] == 'Charged Off')]

# converting loan_status to numerical values where 1 represents paid and 0 represents charged off 
loan_data['loan_status'] = loan_data[['loan_status']].replace({'Fully Paid':1, 'Charged Off':0})

# printing the size of the dataset
print('The size of the dataset: ' + str(loan_data.shape))

The size of the dataset: (39786, 53)


There are currently 53 columns. Reducing the number of rows may have affected the number of unique values in some columns. Once again the columns with only one unique value are removed.

In [20]:
# removing columns with only one unique value 
loan_data = loan_data.loc[:,loan_data.apply(func=pd.Series.nunique, args=(False)) > 1]

# printing the size of the dataset
print('The size of the dataset: ' + str(loan_data.shape))

The size of the dataset: (39786, 50)


### Handling Missing Values
With the categorical columns prepared, missing values will now be handled. Below we take a look at the number of missing values in columns with missing values. 



In [21]:
# counting the number of missing values
null_counts = loan_data.isnull().sum()

# displaying results 
null_counts[null_counts != 0]

emp_length              1078
revol_util                50
pub_rec_bankruptcies     697
dtype: int64

**Strategy for handling missing revolving utilization missing values:**
* There are 50 rows with missing data in the revolving utilization column. This represents less than 1% (398 rows) of the rows in the data. These rows will be dropped.

In [22]:
# Dropping rows with missing values in revol_util and tax_liens columns
loan_data = loan_data[pd.notnull(loan_data['revol_util'])]

# displaying results for the missing values
null_counts = loan_data.isnull().sum()
null_counts[null_counts != 0]

emp_length              1075
pub_rec_bankruptcies     697
dtype: int64

Verifying that dropping these rows did not affect the number of unique values in a column.

In [23]:
# removing columns with only one unique value 
loan_data = loan_data.loc[:,loan_data.apply(func=pd.Series.nunique, args=(False)) > 1]

# printing the size of the dataset
print('The size of the dataset: ' + str(loan_data.shape))

The size of the dataset: (39736, 50)


There are still 50 columns. Next, the missing values of the employment length column and public record bankruptcies column are dealt with. 

**Strategy for handling employment length missing values:**
* Borrowers that did not provide their employment length and employment title will be assumed to be unemployed. 
* Borrowers that did not provide their employment length are unemployed/retired will be assigned an employment length of 0.

In [24]:
# borrowers with no employment length or title data
no_el_no_title = loan_data[(pd.isnull(loan_data['emp_length'])) & 
                        (loan_data['emp_title_cat_no_response'] == 1)]

print('Count of Borrowers with no employment length or title: '
          + str(len(no_el_no_title)))

# borrowers with no employment length but unemployed title 
no_el_unemp_title = loan_data[(pd.isnull(loan_data['emp_length'])) & 
                        (loan_data['emp_title_cat_unemployed'] == 1)]

print('Count of borrowers with no employment length and unemployed: ' + 
      str(len(no_el_unemp_title)))

'''creating a function that converts missing values in the employment
length column to 0 under certain constraints'''
def emp_length_converter(row):
    
    #borrowers with no employment length and title data
    if (math.isnan(row['emp_length'])) & (row['emp_title_cat_no_response'] ==1):
        value = 0
        
    #borrowers with no employment length and unemployed
    elif (math.isnan(row['emp_length'])) & (row['emp_title_cat_unemployed'] ==1):
          value = 0
    
    else:
          value = row['emp_length']
          
    return value
          
loan_data['emp_length'] = loan_data.apply(emp_length_converter, axis=1)
        
# displaying results for the missing values
print("\n The new frequency of missing values:")
null_counts = loan_data.isnull().sum()
null_counts[null_counts != 0]

Count of Borrowers with no employment length or title: 1019
Count of borrowers with no employment length and unemployed: 2

 The new frequency of missing values:


emp_length               54
pub_rec_bankruptcies    697
dtype: int64

The number of missing values in the employment length column has been reduced from 1075 to 54. As this values are few and we are aware these borrowers have jobs, median imputation will be used to assign the remaining employment lengths. The code for this is shown below. 

In [25]:
# finding the median employment length 
median_length = loan_data['emp_length'].median()

# filling in the missing values with the median
loan_data['emp_length'] = loan_data['emp_length'].fillna(median_length)

# displaying results of missing values
null_counts = loan_data.isnull().sum()
null_counts[null_counts != 0]

pub_rec_bankruptcies    697
dtype: int64

**Strategy for handling public record bankruptcies**
* A correlation matrix will be made and the variables that strongly correlate with public recorded bankruptcies will be identified.
* These variables will be used to predict what the missing entries are.

In [26]:
# creating a correlation matrix  using the loan dataset
corr_matrix = loan_data.corr()

# selecting the column with count of public record bankruptcies
PBR_corr = corr_matrix['pub_rec_bankruptcies']

# sorting the values  
PBR_sorted = PBR_corr.abs().sort_values(ascending = False)

PBR_sorted.head(10)

pub_rec_bankruptcies    1.000000
pub_rec                 0.845979
fico_average            0.130303
int_rate                0.082816
grade                   0.078831
revol_util              0.060660
revol_bal               0.049611
loan_status             0.048200
emp_length              0.045872
loan_amnt               0.037270
Name: pub_rec_bankruptcies, dtype: float64

Aside public records, no other variables correlate strongly with public record bankruptcies. Taking a somewhat conservative approach in the eye of an investor, it will be assumed that:
1. Anyone with a public derogatory record that did not provide a response to public recorded bankruptcy, also has a public recorded bankruptcy.
2. Every borrower that does not have a public record also doesn't have a public recorded bankruptcy based on observations from the dataset (see code below). 

In [27]:
# checking if any borrower that doesn't have a public record has a bankruptcy record
loan_data[(loan_data['pub_rec_bankruptcies']>0)&(loan_data['pub_rec'] == 0)]

Unnamed: 0,loan_amnt,int_rate,installment,grade,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,...,emp_title_cat_education/health,emp_title_cat_finance/consulting,emp_title_cat_government/contractors,emp_title_cat_no_response,emp_title_cat_post/transportation/utilities,emp_title_cat_retail/food/hospitality,emp_title_cat_self_employed,emp_title_cat_tech/telecom,emp_title_cat_uncategorized,emp_title_cat_unemployed


Now the second assumption has been verified, the public record bankruptcy column wil be cleaned using the above assumptions.

In [28]:
# creating a function to predict missing public recorded bankruptcy

def bankruptcy_maker(row):
    
    value = float()
    
    if (row['pub_rec'] > 0) & (math.isnan(row['pub_rec_bankruptcies'])):
        value = 1
    elif (row['pub_rec'] == 0) & (math.isnan(row['pub_rec_bankruptcies'])):
        value = 0
    else:
        value = row['pub_rec_bankruptcies']
    return value 

loan_data['pub_rec_bankruptcies'] = loan_data.apply(bankruptcy_maker, axis = 1)

A quick check to verify there are no missing values in the dataset.

In [29]:
# displaying results of missing values
null_counts = loan_data.isnull().sum()
null_counts[null_counts != 0]

Series([], dtype: int64)

The data is now ready for exploration and machine learning. 

In [30]:
# exporting data
loan_data.to_csv('Wrangled_Loan_data.csv', index = False)