# Deriving and Communicating Insights with Data Visualization
## Part 1: Exploratory Data Analysis of Prosper Loan Data
### Jong Min Lee

In [1]:
# import all packages
import pandas as pd
import numpy as np
import copy

## Data Wrangling
`prosperLoanData.csv`, which contains the Prosper loan data, was made available for download by [Udacity](https://www.udacity.com/). The file was downloaded manually to the local `data` directory and loaded into this project as shown below.

In [2]:
# load .csv file containing the prosper loan data
df = pd.read_csv('data/prosperLoanData.csv')

### 1) Structure of the Dataset
The loan dataset contains 113,937 rows and 81 variables, ranging from `ListingKey`, which uniquely identifies each listing posted on [Prosper](https://www.prosper.com) for a loan requested by a borrower, to `Investors`, which gives the number of investors who funded the loan associated with the listing. A review of the dataframe summary and the first five rows revealed several issues, such as columns with inappropriate data types and/or missing values, which make data cleaning necessary. A systematic approach to cleaning the dataset, which includes defining the necessary cleaning operations, coding and performing them, and verifying the results, is documented in the next two sections.

In [3]:
# summary of the dataframe object
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
ListingKey                             113937 non-null object
ListingNumber                          113937 non-null int64
ListingCreationDate                    113937 non-null object
CreditGrade                            28953 non-null object
Term                                   113937 non-null int64
LoanStatus                             113937 non-null object
ClosedDate                             55089 non-null object
BorrowerAPR                            113912 non-null float64
BorrowerRate                           113937 non-null float64
LenderYield                            113937 non-null float64
EstimatedEffectiveYield                84853 non-null float64
EstimatedLoss                          84853 non-null float64
EstimatedReturn                        84853 non-null float64
ProsperRating (numeric)                84853 non-null float64
ProsperRating (Alpha) 

In [4]:
# first five rows of the data
df.head()

Unnamed: 0,ListingKey,ListingNumber,ListingCreationDate,CreditGrade,Term,LoanStatus,ClosedDate,BorrowerAPR,BorrowerRate,LenderYield,...,LP_ServiceFees,LP_CollectionFees,LP_GrossPrincipalLoss,LP_NetPrincipalLoss,LP_NonPrincipalRecoverypayments,PercentFunded,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors
0,1021339766868145413AB3B,193129,2007-08-26 19:09:29.263000000,C,36,Completed,2009-08-14 00:00:00,0.16516,0.158,0.138,...,-133.18,0.0,0.0,0.0,0.0,1.0,0,0,0.0,258
1,10273602499503308B223C1,1209647,2014-02-27 08:28:07.900000000,,36,Current,,0.12016,0.092,0.082,...,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0.0,1
2,0EE9337825851032864889A,81716,2007-01-05 15:00:47.090000000,HR,36,Completed,2009-12-17 00:00:00,0.28269,0.275,0.24,...,-24.2,0.0,0.0,0.0,0.0,1.0,0,0,0.0,41
3,0EF5356002482715299901A,658116,2012-10-22 11:02:35.010000000,,36,Current,,0.12528,0.0974,0.0874,...,-108.01,0.0,0.0,0.0,0.0,1.0,0,0,0.0,158
4,0F023589499656230C5E3E2,909464,2013-09-14 18:38:39.097000000,,36,Current,,0.24614,0.2085,0.1985,...,-60.27,0.0,0.0,0.0,0.0,1.0,0,0,0.0,20
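
The review of data types and missing values described above can be condensed into a single table. The following is a minimal sketch, not part of the original notebook: the helper `summarize_columns` and the toy dataframe are hypothetical, and the toy values only imitate the shape of the loan data.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, non-null count, and missing share for each column."""
    summary = pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'non_null': frame.notna().sum(),
        'missing_pct': frame.isna().mean() * 100,
    })
    # columns with the largest share of missing values come first
    return summary.sort_values('missing_pct', ascending=False)


# toy stand-in for the loan data (hypothetical, illustrative values only)
toy = pd.DataFrame({
    'ListingKey': ['a', 'b', 'c', 'd'],
    'CreditGrade': ['C', None, 'HR', None],
    'BorrowerAPR': [0.16516, 0.12016, 0.28269, np.nan],
})
column_summary = summarize_columns(toy)
print(column_summary)
```

Applied to the full dataset, the same helper would be called as `summarize_columns(df)`, which puts sparsely populated columns such as `CreditGrade` at the top of the table.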


### 2) Key Features related to a Borrower's APR
This project explores the loan data to identify how different variables associated with each loan listing correlate with a borrower's APR. Insights from this analysis apply not only to questions such as whether a longer repayment term is associated with a lower APR, or whether specific listing categories under which borrowers request loans on Prosper's online platform are associated with a high or low APR, but also to comparing how strongly each of these variables correlates with the APR.
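
As a rough illustration of what comparing correlations with the APR could look like for numeric variables, the sketch below computes Pearson correlations against `BorrowerAPR`. It is an assumption about one possible approach, not the method used in this project, and the toy values are invented for illustration rather than taken from the dataset.

```python
import pandas as pd

# toy numeric data (hypothetical values, not results from the loan data)
toy = pd.DataFrame({
    'BorrowerAPR': [0.10, 0.15, 0.20, 0.25, 0.30],
    'CreditScoreRangeLower': [800, 740, 700, 640, 600],
    'Term': [36, 36, 60, 36, 60],
})

# Pearson correlation of each remaining column with BorrowerAPR,
# sorted from most negative to most positive
apr_corr = (
    toy.drop(columns='BorrowerAPR')
    .corrwith(toy['BorrowerAPR'])
    .sort_values()
)
print(apr_corr)
```

Categorical variables such as `EmploymentStatus` or `BorrowerState` have no Pearson correlation and would need a different treatment, for example comparing APR distributions across groups.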

To streamline this analysis, the 16 features listed below were selected from the dataset. While a few of these columns, such as `ListingKey`, `BorrowerState`, or `ListingCategory (numeric)`, serve to facilitate the cleaning operations or to define groups within which further trends in correlations with the APR can be identified, the other columns are expected to correlate directly with the APR to differing degrees. For instance, with other conditions being equal, the loan of a borrower who is employed full-time but has shown frequent delinquencies may carry a higher APR than that of an unemployed borrower with no past delinquency.
* `ListingKey`
* `Term`
* `BorrowerAPR`
* `ListingCategory (numeric)`
* `BorrowerState`
* `EmploymentStatus`
* `EmploymentStatusDuration`
* `CreditScoreRangeLower`
* `CreditScoreRangeUpper`
* `TotalCreditLinespast7years`
* `TotalInquiries`
* `DelinquenciesLast7Years`
* `BankcardUtilization`
* `DebtToIncomeRatio`
* `StatedMonthlyIncome`
* `LoanOriginalAmount`

A new dataframe object `df_clean` containing only these _main_ features was created from the original dataframe object `df`.

In [5]:
# create a sub-dataset of the Prosper loan data containing only the 16 main features
main_cols = [0, 4, 7, 16, 17, 19, 20, 25, 26, 30, 34, 37, 41, 46, 49, 63]  # column positions in df
df_clean = df.iloc[:, main_cols].copy()
df_clean.head()

Unnamed: 0,ListingKey,Term,BorrowerAPR,ListingCategory (numeric),BorrowerState,EmploymentStatus,EmploymentStatusDuration,CreditScoreRangeLower,CreditScoreRangeUpper,TotalCreditLinespast7years,TotalInquiries,DelinquenciesLast7Years,BankcardUtilization,DebtToIncomeRatio,StatedMonthlyIncome,LoanOriginalAmount
0,1021339766868145413AB3B,36,0.16516,0,CO,Self-employed,2.0,640.0,659.0,12.0,3.0,4.0,0.0,0.17,3083.333333,9425
1,10273602499503308B223C1,36,0.12016,2,CO,Employed,44.0,680.0,699.0,29.0,5.0,0.0,0.21,0.18,6125.0,10000
2,0EE9337825851032864889A,36,0.28269,0,GA,Not available,,480.0,499.0,3.0,1.0,0.0,,0.06,2083.333333,3001
3,0EF5356002482715299901A,36,0.12528,16,GA,Employed,113.0,800.0,819.0,29.0,1.0,14.0,0.04,0.15,2875.0,10000
4,0F023589499656230C5E3E2,36,0.24614,2,MN,Employed,44.0,680.0,699.0,49.0,9.0,0.0,0.81,0.26,9583.333333,15000
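
Selecting columns by position depends on the column order of the CSV file. A name-based selection is a possible alternative, sketched here under the assumption that the column names match those in the output above. `MAIN_FEATURES`, `select_main_features`, and the toy dataframe are hypothetical names introduced for illustration.

```python
import pandas as pd

# the 16 main features, named as in the output above
MAIN_FEATURES = [
    'ListingKey', 'Term', 'BorrowerAPR', 'ListingCategory (numeric)',
    'BorrowerState', 'EmploymentStatus', 'EmploymentStatusDuration',
    'CreditScoreRangeLower', 'CreditScoreRangeUpper',
    'TotalCreditLinespast7years', 'TotalInquiries',
    'DelinquenciesLast7Years', 'BankcardUtilization',
    'DebtToIncomeRatio', 'StatedMonthlyIncome', 'LoanOriginalAmount',
]


def select_main_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return an independent copy of frame restricted to MAIN_FEATURES."""
    missing = [name for name in MAIN_FEATURES if name not in frame.columns]
    if missing:
        # fail early with the offending names instead of a generic error
        raise KeyError(f'columns not found: {missing}')
    return frame.loc[:, MAIN_FEATURES].copy()


# toy frame holding the main features plus one extra column (hypothetical)
toy = pd.DataFrame({name: [0] for name in MAIN_FEATURES + ['Investors']})
subset = select_main_features(toy)
print(subset.columns.tolist())
```

With the real data this would read `df_clean = select_main_features(df)`. The explicit check reports any renamed or missing column by name, which a positional selection cannot do.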


### 3) Data Cleaning