# Findings of Exploration from Prosper Loan Data
## by Kuo, Yen-Chen

## Investigation Overview

> Our goal is to find out which variables affect the borrower APR the most.

## Dataset Overview

> The original dataset has 113,937 rows and 81 columns. We drop the rows with missing 'ProsperScore' values (the filter applied in the loading cell) and simplify the data by keeping only the 'LoanOriginalAmount', 'BorrowerAPR', 'StatedMonthlyIncome', 'Term', and 'ProsperScore' variables.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [None]:
# load the dataset into a pandas DataFrame
df = pd.read_csv('prosperLoanData.csv')
# keep only the rows whose ProsperScore is not null
# (dropna with inplace=True returns None, so assign the result instead)
df = df.dropna(axis = 0, subset = ['ProsperScore'])
# extract the columns needed for later use
columns = ['LoanOriginalAmount', 'BorrowerAPR', 'StatedMonthlyIncome', 'Term', 'ProsperScore', 'Occupation']
df_copy = df[columns].copy()
df_copy.head()

> Note that the above cells have been set as "Skip"-type slides. That means
that when the notebook is rendered as HTML slides, those cells won't show up.
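
The row filter used in the loading cell can be checked on a small stand-in frame. This is only a sketch: the `toy` frame and its values are hypothetical and are not taken from `prosperLoanData.csv`. It shows that `dropna(subset=...)` returns a new frame containing only the rows with a `ProsperScore`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real loan data
toy = pd.DataFrame({
    'BorrowerAPR': [0.12, 0.35, 0.20, 0.28],
    'ProsperScore': [7.0, np.nan, 4.0, np.nan],
    'LoanOriginalAmount': [10000, 4000, 15000, 2500],
})

# Rows missing a ProsperScore are dropped; the result is a new frame
filtered = toy.dropna(subset=['ProsperScore'])

rows_before = len(toy)
rows_after = len(filtered)
missing_after = int(filtered['ProsperScore'].isna().sum())
print(rows_before, rows_after, missing_after)
```

Comparing the row counts before and after the filter is a quick way to confirm how much data the null check removes.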

## The distribution of borrower APR

> The distribution of borrower APR is roughly normal, with one very tall bin at an APR of around 0.35.

In [None]:
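# bin edges of width 0.005, extended one step past the maximum so the largest APR is included
bin_width = 0.005
bins = np.arange(0, df_copy['BorrowerAPR'].max() + bin_width, bin_width)
plt.hist(data = df_copy, x = 'BorrowerAPR', bins = bins)
plt.title('Distribution of Borrower Annual Percentage Rates')
# APR is stored as a fraction (0.35 means 35%), so the axis is labelled accordingly
plt.xlabel('Borrower Annual Percentage Rate (fraction)')
plt.ylabel('Count')
plt.xticks(np.arange(0, df_copy['BorrowerAPR'].max()+0.05, 0.05))
plt.show()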
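
The tall bin can also be located numerically instead of by eye. The sketch below assumes the same 0.005 bin width as the histogram; `tallest_bin` is a hypothetical helper, and the sample values are synthetic rather than the real APRs.

```python
import numpy as np

def tallest_bin(values, width=0.005):
    """Return (left_edge, right_edge, count) of the most populated bin."""
    values = np.asarray(values, dtype=float)
    edges = np.arange(0, values.max() + width, width)
    counts, edges = np.histogram(values, bins=edges)
    i = int(np.argmax(counts))
    return edges[i], edges[i + 1], int(counts[i])

# Synthetic stand-in data: a flat background plus a spike at 0.357
rng = np.random.default_rng(0)
sample = np.concatenate([rng.uniform(0.05, 0.40, 1000), np.full(300, 0.357)])

lo, hi, count = tallest_bin(sample)
print(lo, hi, count)
```

On the real data, the call would be `tallest_bin(df_copy['BorrowerAPR'])`.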

## The Correlation Plot

> The correlation between 'BorrowerAPR' and 'ProsperScore' is the strongest, and the correlation between 'BorrowerAPR' and 'LoanOriginalAmount' is also worth a closer look.

In [None]:
var_columns = ['LoanOriginalAmount', 'BorrowerAPR', 'StatedMonthlyIncome', 'Term', 'ProsperScore']

plt.figure(figsize = [10, 8])
sb.heatmap(df_copy[var_columns].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)
plt.title('Correlation Plot') 
plt.show()
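
To read the same ranking off without the heatmap, the correlations with the target column can be sorted by absolute value. This is a minimal sketch: `rank_correlations` is a hypothetical helper, and the demo frame is synthetic, built so that the score has the strongest (negative) link to APR. It does not reproduce the real coefficients.

```python
import numpy as np
import pandas as pd

def rank_correlations(frame, target='BorrowerAPR'):
    """Pearson correlations with `target`, strongest (by magnitude) first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

# Synthetic stand-in data, not the Prosper dataset
rng = np.random.default_rng(0)
n = 500
score = rng.integers(1, 11, n)
amount = rng.uniform(1000, 35000, n)
income = rng.uniform(1000, 12000, n)
apr = 0.40 - 0.025 * score - 2e-6 * amount + rng.normal(0, 0.02, n)

demo = pd.DataFrame({
    'LoanOriginalAmount': amount,
    'BorrowerAPR': apr,
    'StatedMonthlyIncome': income,
    'ProsperScore': score,
})

ranked = rank_correlations(demo)
print(ranked)
```

On the real data, the call would be `rank_correlations(df_copy[var_columns])`. Sorting by magnitude matters because a strong negative correlation would otherwise land at the bottom of the list.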

## Effect of Prosper score on the relationship between borrower APR and original loan amount

> As the Prosper score rises, the negative relationship between borrower APR and original loan amount weakens, and it turns positive once the score reaches 10. A likely explanation is that borrowers with higher Prosper scores are more likely to obtain larger loans at lower APRs.

In [None]:
# one regression panel per Prosper score; no hue is mapped, so no legend is needed
g = sb.FacetGrid(data=df_copy, aspect=1.2, height=5, col='ProsperScore', col_wrap=4)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1})
plt.show()
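
The slope of each panel's regression line can be put into numbers by fitting a straight line per score. The sketch below assumes an ordinary least-squares fit with `np.polyfit`; `slope_by_score` is a hypothetical helper, and the demo data is synthetic, constructed with a negative slope at score 2 and a positive slope at score 10.

```python
import numpy as np
import pandas as pd

def slope_by_score(frame):
    """Slope of BorrowerAPR per $1,000 of loan amount, for each ProsperScore."""
    slopes = {}
    for score, group in frame.groupby('ProsperScore'):
        x = group['LoanOriginalAmount'].to_numpy() / 1000.0
        y = group['BorrowerAPR'].to_numpy()
        slope, _intercept = np.polyfit(x, y, 1)
        slopes[score] = slope
    return pd.Series(slopes, name='slope')

# Synthetic stand-in data, not the Prosper dataset
rng = np.random.default_rng(0)
frames = []
for score, true_slope, base in [(2, -0.004, 0.35), (10, 0.001, 0.10)]:
    amount = rng.uniform(1000, 35000, 200)
    apr = base + true_slope * amount / 1000.0 + rng.normal(0, 0.01, 200)
    frames.append(pd.DataFrame({
        'ProsperScore': score,
        'LoanOriginalAmount': amount,
        'BorrowerAPR': apr,
    }))
demo = pd.concat(frames, ignore_index=True)

slopes = slope_by_score(demo)
print(slopes)
```

On the real data, `slope_by_score(df_copy)` would give one slope per score, so the sign change described above would show up as a switch from negative to positive values.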
