# ACCY 570 Group Project  

## Overview 
-----

### Goal   

Complete a data analytics project that demonstrates your mastery of the course content.
  
  1.  Demonstrate the ability to use Markdown and basic Python to summarize data and to produce insightful visualizations.
  2.  Show that you can use the proper statistical methods to summarize your data.  
  3.  Show that you can create useful plots to highlight the important points of your data.  
  
### Prompt  

After several years of operation, the Lending Club&trade; wants to understand how the individual components (i.e., features) of a loan, such as the amount loaned, the term length, or the interest rate, affect the profitability of making a specific loan. In particular, it is interested in understanding how likely a loan is to be repaid and what type of return it can expect. In this project, you will use the provided loans dataset to help the Lending Club&trade; figure out which loans are most profitable.

-----

  
## Criteria
-----

You will work in groups of __4-5__ students to analyze these data and make recommendations based on the variables in the `loan.csv` dataset. Specifically, you should address the following questions:

1. What is the probability of a full repayment?
2. What is the return on the loan?

To simplify the analysis, your group must focus on at least five features (or variables) that you think might be important. You may include more if you wish.

You will complete three tasks for this group project:
1. A group report in the form of a Jupyter notebook,
2. An in-class presentation where your group will present your results, and 
3. Peer evaluation of the contributions of each member of your group.

Your final group report will be a single Jupyter notebook that integrates Markdown, Python code, and the results from your code, such as data visualizations. Use Markdown cells to explain any decisions you make regarding the data, to discuss any plots or visualizations generated in your notebook, and to present the results of your analysis. As a general guideline, write the content so that a fellow classmate (or an arbitrary data scientist/analyst) can read your report and understand the results, their implications, and the process you followed to reach them. If printed (not that you should do this), your report should be at least fifteen pages.

Your group will present the material in class in a format that is left up to each group. For example, you can use presentation software such as Microsoft PowerPoint or Prezi, a PDF, or your notebook. Alternatively, you can choose some other presentation style (feel free to discuss your ideas with the course staff). The presentation should cover all steps in your analytics process and highlight your results. It should take between eight and twelve minutes, and it will be graded by your discussion teaching assistant.

### Rubric
  - Notebook Report (60%)
  - Class presentation (30%)
  - Peer assessment from your group-mates (10%)

### General

Your report should:
  1. use proper Markdown.
  2. include all of the code used for your analysis.
  3. include properly labeled plots (e.g., use axis labels and titles).
  4. use a consistent style between graphs.
  5. be entirely the work of your own group. **Do not plagiarize code; this includes anything you might find online.** All code should be written by you and your group.

-----

### Exploratory Data Analysis (EDA)

1. For each of the features you select, briefly explain why your group decided to include it in the analysis. Specifically, think about why __you think__ it might be helpful in predicting which loans are likely to be repaid and which loans are likely to have the highest return. Other points to keep in mind:
   - At least one variable must be categorical and at least one must be numerical.
   - You do not need to include summaries of any feature not included in your analysis.
   - **Hint**: features with no (or very few) missing values are likely to be the most useful in making predictions.
2. Create histograms to visualize each of the features included in your analysis.
   - Briefly describe any issues or irregularities that you see in the data, for example:
     - Are there any major outliers?
     - Do the data look _well_ dispersed, or do they _clump_ around some point?
     - Are there a lot of missing values?
3. Compute appropriate descriptive statistics for each feature.
   - Numerical features should include at least a measure of centrality and a measure of dispersion.
   - Categorical features should, at a minimum, indicate which category is most popular.
4. Create histograms for both the _Repaid_ and _Return_ features. Do they look normally distributed?
   - Fit a normal distribution to the histogram.
   - Include and discuss a QQ plot.
5. Finally, include plots that demonstrate the relationship between each explanatory feature and each target feature.
   - You might find a box plot, violin plot, scatter plot, or heat map to be helpful.
   - Analyze these plots and note anything interesting or unusual that you see.
   - Do any features look to be strongly related to either target?
   - Comment on why you think features might be correlated.
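
As a rough illustration of steps 3 and 4, the sketch below computes descriptive statistics, overlays a fitted normal density on a histogram, and draws a QQ plot. It runs on synthetic stand-in data, not on `loan.csv`; the `demo` DataFrame, its `grade` column, and the chosen distributions are assumptions made only so the example is self-contained. It is one possible approach, not a required one.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (an assumption); replace with columns from `loan`.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "return": rng.normal(loc=0.05, scale=0.2, size=500),
    "grade": rng.choice(list("ABCD"), size=500, p=[0.4, 0.3, 0.2, 0.1]),
})

# Step 3: centrality and dispersion for a numerical feature,
# and the most common category for a categorical feature.
summary = demo["return"].agg(["mean", "median", "std"])
top_grade = demo["grade"].mode().iloc[0]
print(summary)
print("most common grade:", top_grade)

# Step 4: histogram with a fitted normal density overlaid.
mu, sigma = stats.norm.fit(demo["return"])  # maximum-likelihood estimates
fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(demo["return"], bins=30, density=True, alpha=0.6)
grid = np.linspace(demo["return"].min(), demo["return"].max(), 200)
ax_hist.plot(grid, stats.norm.pdf(grid, mu, sigma))
ax_hist.set(title="Return with fitted normal", xlabel="Return", ylabel="Density")

# QQ plot against the normal distribution; points close to the
# reference line suggest the data are roughly normal.
stats.probplot(demo["return"], dist="norm", plot=ax_qq)
ax_qq.set_title("QQ plot of return")
fig.tight_layout()
```

With real data, systematic curvature in the QQ plot or heavy tails in the histogram would be worth discussing in a Markdown cell next to the figure.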

-----

### Machine Learning 
1. Preprocess all data appropriately.
   - Normalize any continuous feature.
   - Encode any categorical feature.
   - Randomly split the data into _Training_ and _Testing_ datasets.

2. Build a classifier on the _Training_ dataset to classify loans as either repaid or not.
   - Try at least two different classification algorithms and compare their performance on the _Testing_ dataset using some measure of predictive power.
   - Include a confusion matrix for each algorithm.
   - Create a ROC curve comparing the performance of each algorithm on the _Testing_ dataset, and compute and display the AUC for each algorithm.
   - Create at least one other visualization that compares their relative performance on the _Testing_ dataset.
   - Explain which model you prefer, and include justification beyond the value of a performance metric.

3. Build a regression model on the _Training_ dataset to predict the loan return.
   - Try at least two different regression algorithms and compare their performance on the _Testing_ dataset using some measure of predictive power.
   - Create a lift chart comparing the performance of each model.
   - Explain which model you prefer, and include justification beyond the value of a performance metric.
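
Steps 1 and 2 could be sketched roughly as follows with scikit-learn. The data here are synthetic, the feature names `int_rate` and `term` are hypothetical stand-ins, and logistic regression and a random forest are used only as two example algorithms; none of these choices is prescribed by the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (an assumption); replace with features from `loan`.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "int_rate": rng.uniform(5, 25, size=n),    # hypothetical numerical feature
    "term": rng.choice(["36", "60"], size=n),  # hypothetical categorical feature
})
# In this toy data, higher rates make repayment less likely.
p_repaid = 1 / (1 + np.exp(0.3 * (X["int_rate"] - 15)))
y = rng.random(n) < p_repaid

# Random split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale continuous columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["int_rate"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["term"]),
])

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Each pipeline gets its own copy of the preprocessing step.
    pipe = make_pipeline(clone(preprocess), model)
    pipe.fit(X_train, y_train)
    scores = pipe.predict_proba(X_test)[:, 1]  # probability of the True class
    results[name] = {
        "confusion": confusion_matrix(y_test, pipe.predict(X_test)),
        "auc": roc_auc_score(y_test, scores),
    }
    print(name, "AUC:", round(results[name]["auc"], 3))
    print(results[name]["confusion"])
```

Fitting the scaler and encoder inside a pipeline keeps test data out of the preprocessing fit. To draw the ROC curves, the same `scores` can be passed to `sklearn.metrics.roc_curve`, which returns the false and true positive rates to plot.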

-----

### Conclusion
Summarize the results of your analysis. This summary should include anything interesting you found when performing EDA. Also, discuss the results of each classification and regression model. Be sure to address whether your classifier was much better than random on the test data, and comment on how accurate the predictions were from your regression model. Next, discuss the importance of each feature in both machine learning tasks. Finally, comment on your results and how they might be used to improve the performance of future loans made by the Lending Club&trade;.
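
One way to support a statement about regression accuracy is to compare the model against a naive baseline that always predicts the training mean. The sketch below does this on synthetic data and also builds a decile table of the kind that can serve as the basis for a lift chart. The `int_rate` feature, the toy return formula, and the choice of linear regression are all assumptions for illustration. The same baseline idea applies to the classifier, where an AUC near 0.5 corresponds to random guessing.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); replace with features from `loan`.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({"int_rate": rng.uniform(5, 25, size=n)})  # hypothetical feature
y = 0.02 * X["int_rate"] + rng.normal(scale=0.1, size=n)    # toy "return"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "baseline (mean)": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
}

# Score every model on the held-out test set.
report = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    report[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
print(pd.DataFrame(report).T.round(3))

# Decile table: rank test loans by predicted return, then compare
# the mean actual return within each decile of the prediction.
pred = models["linear"].predict(X_test)
deciles = pd.qcut(pred, 10, labels=False, duplicates="drop")
lift = (
    pd.DataFrame({"actual": y_test.to_numpy(), "decile": deciles})
    .groupby("decile")["actual"]
    .mean()
)
print(lift.round(3))
```

If a model ranks loans well, the mean actual return should rise from the lowest decile to the highest; a flat table suggests the predictions carry little information.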

-----

## Getting Started

To ensure everyone starts at the same point, we provide the following code cell, which creates the two **target** columns that you will use for the analyses required to complete this group project. The `return` feature encodes the return for each loan, and the `repaid` feature encodes whether the loan was repaid in full or not (for simplicity, we assume that the loan is repaid if the borrower pays more than the loan amount).

You should include this code cell in your own group project notebook.

-----

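```python
import numpy as np
import pandas as pd

# Load the full dataset; low_memory=False reads the file in one pass
# so that column types are inferred consistently.
loan = pd.read_csv('loan.csv', low_memory=False)

# Return: net amount received relative to the amount funded.
loan['return'] = (loan['total_pymnt_inv'] - loan['funded_amnt_inv']) / loan['funded_amnt_inv']

# Repaid: True when total payments exceed the amount funded.
loan['repaid'] = loan['total_pymnt_inv'] > loan['funded_amnt_inv']
```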
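
To see what the two target columns contain, the same formulas can be applied to a tiny hand-made table. The three loans below are made up for illustration and are not taken from `loan.csv`; the third row shows what happens if a funded amount is zero, which is worth checking for in the real data before modelling.

```python
import numpy as np
import pandas as pd

# Three made-up loans (an assumption, not rows from loan.csv).
toy = pd.DataFrame({
    "funded_amnt_inv": [1000.0, 2000.0, 0.0],
    "total_pymnt_inv": [1100.0, 1500.0, 0.0],
})
toy["return"] = (toy["total_pymnt_inv"] - toy["funded_amnt_inv"]) / toy["funded_amnt_inv"]
toy["repaid"] = toy["total_pymnt_inv"] > toy["funded_amnt_inv"]

# A zero funded amount gives NaN (0/0) or an infinite value,
# so count the rows where the return is not a finite number.
n_bad = int((~np.isfinite(toy["return"])).sum())
print(toy)
print("rows with undefined return:", n_bad)
```

The first loan earns a return of 0.10 and counts as repaid, the second loses 25% and does not, and the third has an undefined return.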