# Part I - Loan Data from Prosper
## by Jasmine W.

## Introduction

> **Rubric Tip**: Your code should not generate any errors, and should use functions and loops where possible to reduce repetitive code. Prefer functions for reusing code statements.

> **Rubric Tip**: Document your approach and findings in markdown cells. Use comments and docstrings in code cells to document the code functionality.

> **Rubric Tip**: Markdown cells should have headers and text that organize your thoughts, findings, and what you plan on investigating next.

The dataset holds data on 113,937 loans from a company called _Prosper_. Each loan record includes the following fields: **"Listing Key"**, **"Listing Number"**, **"Listing Creation Date"**, **"Credit Grade"**, **"Term"**, **"Loan Status"**, **"Closed Date"**, **"Borrower APR"**, **"Borrower Rate"**, **"Lender Yield"**, **"Estimated Effective Yield"**, **"Estimated Loss"**, **"Estimated Return"**, **"Prosper Rating (numeric)"**, **"Prosper Rating (Alpha)"**, **"Prosper Score"**, **"Listing Category (numeric)"**, **"Borrower State"**, **"Occupation"**, **"Employment Status"**, **"Employment Status Duration"**, **"Is Borrower Homeowner"**, **"Currently In Group"**, **"Group Key"**, **"Date Credit Pulled"**, **"Credit Score Range Lower"**, **"Credit Score Range Upper"**, **"First Recorded Credit Line"**, **"Current Credit Lines"**, **"Open Credit Lines"**, **"Total Credit Lines past 7 years"**, **"Open Revolving Accounts"**, **"Open Revolving Monthly Payment"**, **"Inquiries Last 6 Months"**, **"Total Inquiries"**, **"Current Delinquencies"**, **"Amount Delinquent"**, **"Delinquencies Last 7 Years"**, **"Public Records Last 10 Years"**, **"Public Records Last 12 Months"**, **"Revolving Credit Balance"**, **"Bankcard Utilization"**, **"Available Bankcard Credit"**, **"Total Trades"**, **"Trades Never Delinquent (percentage)"**, **"Trades Opened Last 6 Months"**, **"Debt To Income Ratio"**, **"Income Range"**, **"Income Verifiable"**, **"Stated Monthly Income"**, **"Loan Key"**, **"Total Prosper Loans"**, **"Total Prosper Payments Billed"**, **"On Time Prosper Payments"**, **"Prosper Payments Less Than One Month Late"**, **"Prosper Payments One Month Plus Late"**, **"Prosper Principal Borrowed"**, **"Prosper Principal Outstanding"**, **"Score x Change At Time Of Listing"**, **"Loan Current Days Delinquent"**, **"Loan First Defaulted Cycle Number"**, **"Loan Months Since Origination"**, **"Loan Number"**, **"Loan Original Amount"**, **"Loan Origination Date"**, **"Loan Origination Quarter"**, **"Member Key"**, **"Monthly Loan Payment"**, **"LP_Customer Payments"**, **"LP_Customer Principal Payments"**, **"LP_Interest and Fees"**, **"LP_Service Fees"**, **"LP_Collection Fees"**, **"LP_Gross Principal Loss"**, **"LP_Net Principal Loss"**, **"LP_NonPrincipal Recovery Payments"**, **"Percent Funded"**, **"Recommendations"**, **"Investment From Friends Count"**, **"Investment From Friends Amount"**, and **"Investors"**.

## Preliminary Wrangling

In [2]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


> Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

> Note that the collective size of all your files in the current workspace **must not exceed 1 GB** in total. 


In [22]:
import zipfile

with zipfile.ZipFile("prosperLoanData.zip", "r") as zip_ref: # opens zip file in read-mode
    zip_ref.extractall("prosperLoanData/") # extracts all data in zip file

df_prosper_loan_data = pd.read_csv("prosperLoanData/prosperLoanData.csv") # loads CSV file of dataset
#df_prosper_loan_data.head() # displays the first 5 rows of raw data
#df_prosper_loan_data.info() # displays each column's non-null count and dtype
#df_prosper_loan_data.describe() # displays count, mean, standard deviation, minimum, 25th/50th/75th percentiles, and maximum for each numeric column
#df_prosper_loan_data.count() # displays the number of non-null values in each column
#df_prosper_loan_data.shape # displays the dataset's dimensions as (rows, columns)
df_prosper_loan_data

| | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | | 36 | Current | | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | | 36 | Current | | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | | 36 | Current | | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | E6D9357655724827169606C | 753087 | 2013-04-14 05:55:02.663000000 | | 36 | Current | | 0.22354 | 0.1864 | 0.1764 | ... | -75.58 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 113933 | E6DB353036033497292EE43 | 537216 | 2011-11-03 20:42:55.333000000 | | 36 | FinalPaymentInProgress | | 0.13220 | 0.1110 | 0.1010 | ... | -30.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 |
| 113934 | E6E13596170052029692BB1 | 1069178 | 2013-12-13 05:49:12.703000000 | | 60 | Current | | 0.23984 | 0.2150 | 0.2050 | ... | -16.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 119 |
| 113935 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | ... | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |


### What is the structure of your dataset?

The dataset was loaded into a DataFrame called **df_prosper_loan_data**, which holds 113,937 rows. The original dataset is a CSV file. It is stored in the GitHub repository as a compressed zip file because the uncompressed CSV was too large to upload.
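As a possible alternative to extracting the archive first, pandas can usually read a zip file directly when the archive contains exactly one CSV. The sketch below shows this on a small toy archive rather than the real `prosperLoanData.zip`. The helper names `load_csv_from_zip` and `summarize_structure` are hypothetical and not part of the original notebook.

```python
import os
import tempfile
import zipfile

import pandas as pd


def load_csv_from_zip(zip_path):
    """Read a zip archive that holds exactly one CSV file into a DataFrame."""
    # pandas infers zip compression from the ".zip" file extension
    return pd.read_csv(zip_path)


def summarize_structure(df):
    """Return the row count, column count, and number of columns per dtype."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
    }


# Toy stand-in for prosperLoanData.zip so the sketch runs anywhere
toy = pd.DataFrame(
    {
        "Term": [36, 60, 36],
        "LoanStatus": ["Completed", "Current", "Current"],
        "BorrowerRate": [0.158, 0.092, 0.275],
    }
)
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_zip = os.path.join(tmp_dir, "toy_loans.zip")
    with zipfile.ZipFile(toy_zip, "w") as zf:
        zf.writestr("toy_loans.csv", toy.to_csv(index=False))
    df_toy = load_csv_from_zip(toy_zip)

summary = summarize_structure(df_toy)
print(summary)
```

Applied to the real data, `summarize_structure(df_prosper_loan_data)` would give the row and column counts in one place, which may help when describing the dataset's structure.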

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

> In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.

> **Rubric Tip**: Use the "Question-Visualization-Observations" framework throughout the exploration. This framework involves **asking a question from the data, creating a visualization to find answers, and then recording observations after each visualization.**

> **Rubric Tip**: This part (Univariate Exploration) should include at least one histogram, and either a bar chart or count plot.

>**Rubric Tip**: Visualizations should depict the data appropriately so that the plots are easily interpretable. You should choose an appropriate plot type, data encodings, and formatting as needed. The formatting may include setting/adding the title, labels, legend, and comments. Also, do not overplot or incorrectly plot ordinal data.
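One way to avoid plotting ordinal data incorrectly is to convert the column to an ordered categorical before counting or plotting, so the levels keep their ranking instead of sorting alphabetically. The sketch below uses toy values in place of the `CreditGrade` column. The level order in `GRADE_ORDER` is an assumption that should be checked against the dataset's data dictionary, and `to_ordered_category` is a hypothetical helper.

```python
import pandas as pd

# Assumed ordering from riskiest to safest; verify against the data dictionary
GRADE_ORDER = ["HR", "E", "D", "C", "B", "A", "AA"]


def to_ordered_category(series, ordered_levels):
    """Convert a Series to an ordered categorical with the given level order."""
    dtype = pd.CategoricalDtype(categories=ordered_levels, ordered=True)
    return series.astype(dtype)


# Toy values standing in for df_prosper_loan_data["CreditGrade"]
grades = pd.Series(["C", "HR", None, "AA", "C", None])
ordered_grades = to_ordered_category(grades, GRADE_ORDER)

# Sorting by index follows the category order, so a bar chart built
# from these counts keeps the ranking of the grades
grade_counts = ordered_grades.value_counts().sort_index()
print(grade_counts)
```

Values that are not listed in `GRADE_ORDER` become missing after the conversion, so any extra levels in the real column would need to be added to the list first.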

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

> **Rubric Tip**: This part (Bivariate Exploration) should include at least one scatter plot, one box plot, and at least one clustered bar chart or heat map.

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

> **Rubric Tip**: This part (Multivariate Exploration) should include at least one Facet Plot, and one Plot Matrix or Scatterplot with multiple encodings.

>**Rubric Tip**: Think carefully about how you encode variables. Choose appropriate color schemes, markers, or even how Facets are chosen. Also, do not overplot or incorrectly plot ordinal data.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

## Conclusions
> You can write a summary of the main findings and reflect on the steps taken during the data exploration.

> **Rubric Tip**: Create a list of summary findings to make it easy to review.

> Remove all Tips mentioned above, before you convert this notebook to PDF/HTML.


> At the end of your report, make sure that you export the notebook as an html file from the `File > Download as... > HTML or PDF` menu. Make sure you keep track of where the exported file goes, so you can put it in the same folder as this notebook for project submission. Also, make sure you remove all of the quote-formatted guide notes like this one before you finish your report!

