# Prosper Loan Data Exploration
## by Mohanad Salem

## Investigation Overview


The univariate plots cover loan status, employment status, and stated monthly income.
The bivariate plots compare Loan Status vs Prosper Rating and Loan Status vs Loan Amount.
Finally, the multivariate plots combine (Employment Status vs Loan Amount vs Loan Status), (Listing Categories vs Loan Amount vs Loan Status), and (Listing Categories vs Prosper Rating vs Loan Status).


## Dataset Overview

Loan Data from Prosper: This data set contains 113,937 loans with 81 variables on each loan, 
including loan amount, employment status, current loan status, borrower income, 
and many others.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [None]:
# load in the dataset into a pandas dataframe
df= pd.read_csv('D:/Udacity Projects/Communicate your findings/Project Template/prosperLoanData.csv')

In [None]:
project_columns =  ['BorrowerRate', 'ProsperRating (Alpha)', 'ListingCategory (numeric)', 'EmploymentStatus',
                   'StatedMonthlyIncome', 'TotalProsperLoans', 'LoanOriginalAmount',
                   'LoanOriginationDate','Term', 'LoanStatus']

# copy the selected columns so later edits do not act on a view of df
project_df = df[project_columns].copy()
project_df = project_df.dropna()

# change LoanOriginationDate column to datetime data type 
project_df['LoanOriginationDate'] = pd.to_datetime(project_df['LoanOriginationDate'])

In [None]:
def cat_bar_plot(df, column, gen_order):
    """Plot a single-color bar chart of counts for one categorical column."""
    base_color = sb.color_palette()[0]  # use one color for every bar
    sb.countplot(data=df, x=column, color=base_color, order=gen_order)  # plot the bars
    plt.xticks(rotation=90)  # rotate the tick labels after plotting so they stay readable
    plt.title(column)

## Loan Status

1st Observation:

* Most of the loans are current.
* A large number of loans are also completed.
* Charged-off loans are few, but still more numerous than defaulted loans.
* Lastly, the past-due loans are split into groups by the number of days past due.
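
The shares behind these observations can be checked numerically. The sketch below is only an illustration: the `status` series is a small made-up stand-in for `project_df['LoanStatus']`, and merging the past-due groups into one label is an assumption made here for readability, not a step from the analysis.

```python
import pandas as pd

# Toy stand-in for project_df['LoanStatus']; the values are illustrative only.
status = pd.Series(
    ["Current"] * 5
    + ["Completed"] * 3
    + ["Chargedoff", "Defaulted"]
    + ["Past Due (1-15 days)", "Past Due (31-60 days)"]
)

# Collapse every "Past Due (...)" group into a single label.
collapsed = status.where(~status.str.startswith("Past Due"), "Past Due")

# Share of each status as a fraction of all loans, largest first.
shares = collapsed.value_counts(normalize=True)
print(shares.round(3))
```

Running the same two steps on the real column would give the actual proportions that the bar chart shows as counts.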

In [None]:
_order= ['Completed','FinalPaymentInProgress', 'Current', 'Past Due (1-15 days)',
            'Past Due (16-30 days)', 'Past Due (31-60 days)',
            'Past Due (61-90 days)', 'Past Due (91-120 days)', 
            'Past Due (>120 days)','Defaulted', 'Chargedoff']

In [None]:
cat_bar_plot(project_df, 'LoanStatus', _order);

## Employment Status

2nd Observation:

* As expected, Employed borrowers are the majority and Not employed borrowers are the smallest group.
* Among the remaining, smaller groups, Full-time is the largest.

In [None]:
cat_bar_plot(project_df, 'EmploymentStatus', project_df['EmploymentStatus'].value_counts().index)

## Stated Monthly Income

3rd Observation:

* The monthly income distribution is right-skewed, with its peak at around 4000.
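
One way to back up a right-skew reading without the plot is to compare the mean with the median and to look at the sample skewness. This is a minimal sketch on invented numbers; the `income` series is hypothetical and only mimics a long right tail like the one in `StatedMonthlyIncome`.

```python
import pandas as pd

# Hypothetical monthly incomes with a long right tail (not real Prosper data).
income = pd.Series([2500, 3000, 3500, 4000, 4000, 4500, 5000, 6000, 9000, 25000])

mean_income = income.mean()      # pulled upward by the few large values
median_income = income.median()  # robust to the tail
skewness = income.skew()         # positive for a right-skewed sample

print(f"mean={mean_income:.0f}, median={median_income:.0f}, skew={skewness:.2f}")
```

A mean well above the median together with a positive skewness is consistent with a right-skewed distribution.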

In [None]:
plt.hist(data=project_df, x='StatedMonthlyIncome', bins=100);
plt.xlim(0,15000);
plt.xlabel('Stated Monthly Income')
plt.ylabel('Count');
plt.title('Stated Monthly Income');

In [None]:
# Keep only loans with a final outcome: completed, defaulted, or charged off
final_statuses = ['Completed', 'Defaulted', 'Chargedoff']
project_df = project_df.loc[project_df['LoanStatus'].isin(final_statuses)].copy()

# Treat charged-off loans as defaulted
project_df['LoanStatus'] = project_df['LoanStatus'].replace('Chargedoff', 'Defaulted')

In [None]:
# reshaping the loan categories: keep the largest categories and group the rest under 'Other'
categories_dict = {1: 'Debt Consolidation', 2: 'Home Improvement', 3: 'Business', 6: 'Auto', 7: 'Other'}

def categorize(df):
    loan_category = df['ListingCategory (numeric)']
    if  loan_category in categories_dict:
        return categories_dict[loan_category]
    else:
        return categories_dict[7]
    
project_df['ListingCategory (numeric)'] = project_df.apply(categorize, axis=1)

## Loan Status and Prosper Rating

1st Observation:

* The most frequent rating among completed loans is AA.
* The most frequent rating among defaulted loans is E.
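
The most frequent rating per status can also be read off a normalized cross-tabulation. The sketch below assumes a tiny made-up frame named `toy` that reuses the notebook's column names; its values are illustrative and chosen only to show the mechanics.

```python
import pandas as pd

# Made-up rows using the notebook's column names (not real Prosper data).
toy = pd.DataFrame({
    "LoanStatus": ["Completed", "Completed", "Completed",
                   "Defaulted", "Defaulted", "Defaulted"],
    "ProsperRating (Alpha)": ["AA", "AA", "B", "E", "E", "HR"],
})

# Row-normalized table: each row gives the rating shares within one status.
rating_share = pd.crosstab(
    toy["LoanStatus"], toy["ProsperRating (Alpha)"], normalize="index"
)

# Rating with the largest share for each status.
top_rating = rating_share.idxmax(axis=1)
print(top_rating)
```

Normalizing by row matters here because the completed and defaulted groups differ in size, so raw counts alone can be misleading.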

In [None]:
base_color = sb.color_palette()[0]
sb.countplot(data=project_df, x='LoanStatus', hue='ProsperRating (Alpha)',
             hue_order=['AA', 'A', 'B', 'C', 'D', 'E', 'HR'],  # best to worst rating
             color=base_color);
plt.title('Loan Status VS Prosper Rating');

## Loan Status and Loan Amount

2nd Observation:

* Defaulted loans appear to be larger than completed loans.
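
A box plot comparison like this one can be complemented by a small summary table. The following is a sketch under the assumption of a hypothetical `toy` frame with the notebook's column names; the amounts are invented for illustration.

```python
import pandas as pd

# Invented loan amounts per status (not real Prosper data).
toy = pd.DataFrame({
    "LoanStatus": ["Completed"] * 4 + ["Defaulted"] * 4,
    "LoanOriginalAmount": [2000, 4000, 5000, 9000, 3000, 4000, 6000, 15000],
})

# Median, mean, and number of loans for each status.
summary = toy.groupby("LoanStatus")["LoanOriginalAmount"].agg(
    ["median", "mean", "count"]
)
print(summary)
```

The median is the value drawn as the middle line of each box, so it is the most direct number to compare against the plot.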

In [None]:
sb.boxplot(data=project_df, x='LoanStatus', y='LoanOriginalAmount', 
           color= base_color);
plt.title('Loan Status VS Loan Amount');

## Employment Status, Loan Amount and Loan Status

1st Observation:

* Defaulted loans are larger than completed loans.
* The biggest loans tend to go to employed borrowers, followed by self-employed borrowers.
* Some big loans also went to the 'Other' employment status, where the percentage of defaulted loans is high.

In [None]:
plt.figure(figsize = [12, 8])
sb.barplot(data=project_df, x='EmploymentStatus', y='LoanOriginalAmount', hue='LoanStatus', palette = 'Blues');
plt.title('Employment Status vs Loan Amount vs Loan Status');

## Listing Categories, Loan Amount and Loan Status

2nd Observation:

* Defaulted loans tend to be larger than completed loans in all categories.
* The Business category seems to have the highest loan amounts.
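
The per-category comparison can be tabulated with a pivot table of median amounts. This is a minimal sketch; `toy` is a hypothetical frame that borrows the notebook's column names, and its figures are made up.

```python
import pandas as pd

# Made-up rows using the notebook's column names (not real Prosper data).
toy = pd.DataFrame({
    "ListingCategory (numeric)": ["Business", "Business", "Auto",
                                  "Auto", "Business", "Auto"],
    "LoanStatus": ["Completed", "Defaulted", "Completed",
                   "Defaulted", "Completed", "Defaulted"],
    "LoanOriginalAmount": [8000, 10000, 3000, 4000, 6000, 5000],
})

# Median loan amount for every (category, status) pair.
medians = toy.pivot_table(
    index="ListingCategory (numeric)",
    columns="LoanStatus",
    values="LoanOriginalAmount",
    aggfunc="median",
)
print(medians)
```

Each row of the result corresponds to one pair of boxes in the grouped box plot.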

In [None]:
sb.boxplot(data=project_df, x='ListingCategory (numeric)', y='LoanOriginalAmount', 
              hue='LoanStatus', palette = 'Blues');
plt.title('Listing Categories vs Loan Amount vs Loan Status');

## Listing Categories, Prosper Rating and Loan Status

3rd Observation:

* The ratio between defaulted and completed loans looks the same in all categories.
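
Whether the ratio really is similar across categories can be checked by computing a default rate per category. The sketch below uses a hypothetical `toy` frame with invented rows, so the rates it prints say nothing about the real data.

```python
import pandas as pd

# Invented outcomes per category (not real Prosper data).
toy = pd.DataFrame({
    "ListingCategory (numeric)": ["Auto"] * 4 + ["Business"] * 4,
    "LoanStatus": ["Completed", "Completed", "Completed", "Defaulted",
                   "Completed", "Completed", "Defaulted", "Defaulted"],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of defaulted loans in each category.
default_rate = (
    toy["LoanStatus"].eq("Defaulted")
    .groupby(toy["ListingCategory (numeric)"])
    .mean()
)
print(default_rate)
```

Similar rates across categories would support the observation, while a clear spread would suggest the faceted count plot hides a difference.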



In [None]:
_order2= ['AA', 'A', 'B', 'C', 'D', 'E', 'HR'] # ordering the data

In [None]:
sb.catplot(x = 'ProsperRating (Alpha)', hue = 'LoanStatus', col = 'ListingCategory (numeric)',data = project_df,
           kind = 'count', col_wrap = 3, palette = 'Blues', order= _order2);

>**Generate Slideshow**: Once you're ready to generate your slideshow, use the `jupyter nbconvert` command to generate the HTML slide show. From the terminal or command line, use the following expression.

In [None]:
!jupyter nbconvert MS_slide_deck_explanatory.ipynb --to slides --post serve --no-input --no-prompt

> This should open a tab in your web browser where you can scroll through your presentation. Sub-slides can be accessed by pressing 'down' when viewing their parent slide. Finally, you can stop the kernel.