# Part II - Factors that Influence the ProsperLoans' LoanOriginalAmount
## by John Ametepe Agboku




## Investigation Overview


> Investigating the ProsperLoan dataset, I wanted to look at the factors that influence or determine the LoanOriginalAmount of a Prosper loan. The main focus was initially on the BorrowerRate and the MonthlyLoanPayment.
>
> It turns out that the MonthlyLoanPayment explains approximately 92 percent of the variability in the LoanOriginalAmount, i.e. it has a strong influence in determining the LoanOriginalAmount. The corresponding estimate for the BorrowerRate is only about 41 percent, which was unexpected because the BorrowerRate was expected to have as strong a relationship with the LoanOriginalAmount as the MonthlyLoanPayment.
>
> This led to including the Term of the loan, ProsperRating (Alpha), ProsperScore, IsBorrowerHomeowner, and IncomeRange in the main focus.
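
The notebook does not show how the 92 and 41 percent figures were obtained. One common way to get a "share of variability explained" between two numeric columns is the squared Pearson correlation, so the sketch below assumes that reading. The `demo` frame is synthetic and the helper `r_squared` is a hypothetical name; neither comes from the Prosper data or the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the loans table (NOT the Prosper data).
rng = np.random.default_rng(0)
n = 500
amount = rng.uniform(1_000, 35_000, n)
demo = pd.DataFrame({
    "LoanOriginalAmount": amount,
    # payment tied closely to the amount, plus a little noise
    "MonthlyLoanPayment": amount * 0.03 + rng.normal(0, 50, n),
    # rate only loosely (and negatively) tied to the amount
    "BorrowerRate": 0.30 - amount * 4e-6 + rng.normal(0, 0.05, n),
})

def r_squared(df, x, y):
    """Squared Pearson correlation: variance in y explained by a linear fit on x."""
    r = df[x].corr(df[y])  # Pearson correlation is the pandas default
    return r ** 2

for feature in ["MonthlyLoanPayment", "BorrowerRate"]:
    print(feature, round(r_squared(demo, feature, "LoanOriginalAmount"), 3))
```

On the real data, the same helper could be called on `loans` with the column names used in this notebook.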



## Dataset Overview

> The data contains 113,937 loans with 81 features.
> 14 features were selected and investigated, but only 8 of them were investigated thoroughly.
> Loans originated before July 2009 were removed because some of the selected features have no data for that period; removing them keeps the analysis unbiased.
> The 14 features selected and investigated are:
1. LoanOriginationDate
2. Term
3. BorrowerRate
4. LoanOriginalAmount
5. IncomeRange
6. CurrentDelinquencies
7. DelinquenciesLast7Years
8. ProsperRating (Alpha)
9. Occupation
10. EmploymentStatus
11. IsBorrowerHomeowner
12. MonthlyLoanPayment
13. LoanStatus
14. ProsperScore

These were later trimmed down to the following 8 features:
1. Term
2. BorrowerRate
3. LoanOriginalAmount
4. IncomeRange
5. ProsperRating (Alpha)
6. IsBorrowerHomeowner
7. MonthlyLoanPayment
8. ProsperScore
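
As a rough illustration of the trimming step, the 8 feature names can be kept in a list and used to subset the dataframe. The `demo` frame and its values below are made up, and `final_features` is a hypothetical variable name; the original notebook does not show this step.

```python
import pandas as pd

# Tiny made-up frame holding a few of the 14 columns (illustrative values only).
demo = pd.DataFrame({
    "Term": [36, 60],
    "BorrowerRate": [0.15, 0.22],
    "LoanOriginalAmount": [5000, 12000],
    "Occupation": ["Teacher", "Other"],
    "LoanStatus": ["Current", "Completed"],
})

# The 8 features listed above.
final_features = [
    "Term", "BorrowerRate", "LoanOriginalAmount", "IncomeRange",
    "ProsperRating (Alpha)", "IsBorrowerHomeowner",
    "MonthlyLoanPayment", "ProsperScore",
]

# Keep only the listed names that exist, so the sketch also runs on this partial frame.
present = [col for col in final_features if col in demo.columns]
trimmed = demo[present]
print(trimmed.columns.tolist())
```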

In [2]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [3]:
# load in the dataset into a pandas dataframe
loans = pd.read_csv('prosperLoanData.csv')

In [6]:
#stripping the data set down to only the features of interest
columns_to_keep = ["IsBorrowerHomeowner","Term","BorrowerRate","LoanOriginationDate", "IncomeRange","CurrentDelinquencies","DelinquenciesLast7Years","ProsperRating (Alpha)","Occupation","EmploymentStatus","MonthlyLoanPayment","LoanStatus","ProsperScore","LoanOriginalAmount"]

loans = loans[columns_to_keep]

## Data Wrangling

In [7]:
#convert Term, IncomeRange, ProsperRating (Alpha) and ProsperScore into categorical data types

#convert ProsperRating (Alpha) into an ordered categorical type, from highest risk (HR) to lowest risk (AA)
rating = ["N/A","HR","E","D","C","B","A","AA"]
ordered_rating = pd.api.types.CategoricalDtype(ordered=True,
                                               categories=rating)
loans["ProsperRating (Alpha)"] = loans["ProsperRating (Alpha)"].astype(ordered_rating)

#Converting term into categorical data type
loans['Term'] = loans['Term'].astype(pd.CategoricalDtype(ordered=True))

#converting the IncomeRange into ordered categorical dtypes
range_list = ['Not displayed','Not employed','$0','$1-24,999','$25,000-49,999','$50,000-74,999','$75,000-99,999','$100,000+']
ordered_range = pd.api.types.CategoricalDtype(ordered=True, categories=range_list)
loans['IncomeRange'] = loans['IncomeRange'].astype(ordered_range)

#converting ProsperScore into categorical datatype
loans['ProsperScore'] = loans['ProsperScore'].astype(pd.CategoricalDtype(ordered=True))

#convert LoanOriginationDate into a datetime variable type
#(np.Datetime64 does not exist; pd.to_datetime parses the date strings)
loans["LoanOriginationDate"] = pd.to_datetime(loans["LoanOriginationDate"])
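
The point of the ordered categorical conversion is that pandas can then sort and compare ratings by rank rather than alphabetically. Here is a minimal sketch on a made-up Series, reusing the rating order from the cell above without the "N/A" placeholder:

```python
import pandas as pd

# Rating order from highest risk (HR) to lowest risk (AA).
rating = ["HR", "E", "D", "C", "B", "A", "AA"]
ordered_rating = pd.api.types.CategoricalDtype(categories=rating, ordered=True)

# Made-up values; None stands in for a loan without a rating.
demo = pd.Series(["A", "HR", "C", "AA", None],
                 name="ProsperRating (Alpha)").astype(ordered_rating)

# Sorting follows the declared category order, not the alphabet; missing values go last.
print(demo.sort_values().tolist())

# Comparisons against a category label work because the dtype is ordered.
print((demo >= "C").tolist())

# min/max skip missing values and respect the order.
print(demo.min(), demo.max())
```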

In [8]:
#keep only loans that have both a ProsperScore and a ProsperRating (Alpha);
#these are the loans that originated after July 2009
loans = loans[loans['ProsperScore'].notna() & loans['ProsperRating (Alpha)'].notna()]
loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84853 entries, 1 to 113936
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   IsBorrowerHomeowner      84853 non-null  bool          
 1   Term                     84853 non-null  category      
 2   BorrowerRate             84853 non-null  float64       
 3   LoanOriginationDate      84853 non-null  datetime64[ns]
 4   IncomeRange              84853 non-null  category      
 5   CurrentDelinquencies     84853 non-null  float64       
 6   DelinquenciesLast7Years  84853 non-null  float64       
 7   ProsperRating (Alpha)    84853 non-null  category      
 8   Occupation               83520 non-null  object        
 9   EmploymentStatus         84853 non-null  object        
 10  MonthlyLoanPayment       84853 non-null  float64       
 11  LoanStatus               84853 non-null  object        
 12  ProsperScore             84853 
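
The filter above relies on missing ratings rather than on the origination date itself. One way to sanity-check that the two agree is to compare the missing-value mask with an explicit date cutoff. The sketch below does this on made-up rows. The exact cutoff of 2009-07-01 is an assumption based on the "before July 2009" wording, and `has_rating` and `after_cutoff` are hypothetical names.

```python
import pandas as pd

# Made-up rows: two loans before the assumed cutoff, two after.
demo = pd.DataFrame({
    "LoanOriginationDate": pd.to_datetime(
        ["2008-03-14", "2009-06-30", "2009-07-20", "2013-11-02"]),
    "ProsperScore": [None, None, 7.0, 4.0],
    "ProsperRating (Alpha)": [None, None, "B", "D"],
})

# Mask used in the notebook: both rating columns are present.
has_rating = (demo["ProsperScore"].notna()
              & demo["ProsperRating (Alpha)"].notna()).rename("has_rating")

# Explicit date-based mask (cutoff date is an assumption).
after_cutoff = (demo["LoanOriginationDate"]
                >= pd.Timestamp("2009-07-01")).rename("after_cutoff")

# If the two masks agree, every count falls on the diagonal of this table.
print(pd.crosstab(has_rating, after_cutoff))

kept = demo[has_rating]
print(kept["LoanOriginationDate"].min())
```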

In [9]:
loans.describe()

| | BorrowerRate | CurrentDelinquencies | DelinquenciesLast7Years | MonthlyLoanPayment | LoanOriginalAmount |
|---|---|---|---|---|---|
| count | 84853.0 | 84853.0 | 84853.0 | 84853.0 | 84853.0 |
| mean | 0.196022 | 0.322452 | 3.659435 | 291.93072 | 9083.440515 |
| std | 0.074631 | 1.111996 | 9.347957 | 186.678314 | 6287.860058 |
| min | 0.04 | 0.0 | 0.0 | 0.0 | 1000.0 |
| 25% | 0.1359 | 0.0 | 0.0 | 157.33 | 4000.0 |
| 50% | 0.1875 | 0.0 | 0.0 | 251.94 | 7500.0 |
| 75% | 0.2574 | 0.0 | 2.0 | 388.35 | 13500.0 |
| max | 0.36 | 51.0 | 99.0 | 2251.51 | 35000.0 |


## Distribution of LoanOriginalAmount

> The LoanOriginalAmount values range from 1,000 to 35,000. When a histogram is plotted on a log-transformed axis, most of the loans are concentrated around 5,000.
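
A log-scaled histogram of this kind could be drawn roughly as follows. This is only a sketch: the amounts are randomly generated stand-ins for `loans['LoanOriginalAmount']`, and the bin width of 0.1 log10 units and the tick positions are arbitrary choices, not the ones used in the original analysis.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up amounts clipped to the 1,000 to 35,000 range (NOT the Prosper data).
rng = np.random.default_rng(1)
amounts = np.clip(rng.lognormal(mean=np.log(6_000), sigma=0.7, size=2_000),
                  1_000, 35_000)

# Bin edges evenly spaced in log10 units, from 10^3 up to just past 35,000.
bins = 10 ** np.arange(3, np.log10(35_000) + 0.1, 0.1)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(amounts, bins=bins)
ax.set_xscale("log")

# Label the log axis with readable dollar amounts.
ticks = [1_000, 2_000, 5_000, 10_000, 20_000, 35_000]
ax.set_xticks(ticks)
ax.set_xticklabels([f"{t:,}" for t in ticks])

ax.set_xlabel("LoanOriginalAmount ($)")
ax.set_ylabel("Number of loans")
ax.set_title("Distribution of LoanOriginalAmount (log scale)")
```

With the real data, `amounts` would be replaced by the `LoanOriginalAmount` column of `loans`.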





>**Generate Slideshow**: Once you're ready to generate your slideshow, use the `jupyter nbconvert` command to generate the HTML slideshow. From the terminal or command line, use the following expression.

In [None]:
!jupyter nbconvert <Part_II_Filename>.ipynb --to slides --post serve --no-input --no-prompt

> This should open a tab in your web browser where you can scroll through your presentation. Sub-slides can be accessed by pressing 'down' when viewing its parent slide. Make sure you remove all of the quote-formatted guide notes like this one before you finish your presentation! At last, you can stop the Kernel. 