# <b>Extreme Data Challenge</b>

##  Today's Mission
- Your objective is to devise the best possible model to predict successful/default loans using Lending Club loan data.

- The class is divided into 4 groups. Groups were decided by an extremely high-tech clustering algorithm.

        Team Seaborn: Zahra, Jeremy, Sierra, Aseem
        Team Pandas: Alvin, Kalyn, TJ, Julia
        Team Numpy: Armando, Erik, Joyce, Cherry
        Team Sklearn: Jamie, Monica, Patrick, Yudi, Lucas

- The training data is 100000 loans, each labeled either 1 (successful) or 0 (default). It comes with 33 categorical and numerical features. The testing data is 50000 loans.

- A data dictionary file is included as well. It is a table explaining what each feature means.

- Groups will be judged on how much money their model makes. You will use your model to make predictions on the testing dataset, and those predictions will be scored. Predicting 1 for a loan means you issue it. Assume that each loan is 1000 dollars and the interest rate is 10 percent. That means for every loan you issue that is successfully repaid, you will earn 100 dollars, and for every loan you issue that defaults, you will lose 1000 dollars.
    
        Profit = 100*(Number of True Positives) - 1000*(Number of False Positives) 
        
- Mario, Zack, and George will be on hand for guidance. However, we want you to primarily use your teammates for help.

- Use all the tools at your disposal and try all the models we've learned in class. Refer to past class notebooks for help. Be sure to use model evaluation techniques such as ROC curves, confusion matrices, recall/precision, etc.

- To optimize your model, find the right combination of features and the right model with the right parameters. Get creative!

- Remember to use your time wisely; it will go by fast. Communicate amongst yourselves often.
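
The scoring rule above can be sketched as a small helper. This is only an illustration: the function name `profit` and the toy labels are made up here, not part of the challenge materials. One consequence of the rule is worth noticing: a single default cancels ten repaid loans, so profit is positive only when more than 10 out of every 11 issued loans are repaid (precision above roughly 0.909).

```python
import numpy as np

def profit(y_true, y_pred, gain=100, loss=1000):
    """Profit under the challenge rule: +gain per true positive, -loss per false positive.

    Loans predicted as 0 are not issued, so they neither earn nor lose money.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_pos = int(np.sum((y_pred == 1) & (y_true == 1)))   # issued and repaid
    false_pos = int(np.sum((y_pred == 1) & (y_true == 0)))  # issued and defaulted
    return gain * true_pos - loss * false_pos

# Toy labels, invented for illustration (1 = successful, 0 = default).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]
toy_profit = profit(y_true, y_pred)  # 3 true positives, 1 false positive
print(toy_profit)

# Break-even case: ten repaid loans exactly pay for one default.
break_even = profit([1] * 10 + [0], [1] * 11)
print(break_even)
```

Scoring every candidate model with the same helper keeps comparisons between teammates consistent.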
   

### Online resources on Lending Club loan data
Kaggle Page: https://www.kaggle.com/wendykan/lending-club-loan-data. Make sure to check out the kernels section.

Y Hat tutorial (It's in R, but it's still useful): http://blog.yhat.com/posts/machine-learning-for-predicting-bad-loans.html

Blog tutorial on the data from Kevin Davenport: http://kldavenport.com/lending-club-data-analysis-revisted-with-python/


### Class Time
There are no class breaks, but individual breaks are allowed, of course.

- 6:30 - 7:10
    - Feature engineering/selection: making dummy variables, dropping features, log transformation, scaling, and other methods of transforming data.
    - Exploratory data analysis, aka get-to-know-your-features time.
    
    
- 7:10 - 8:50
    - Modeling time!!
    
    
- 8:50 - 9:25
    - Model testing.
    
    
- 9:25 - 9:30
    - Exit tickets
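
For the model testing slot, one possible way to connect a classifier's predicted probabilities to the profit rule is to sweep the decision threshold and keep the most profitable one. The sketch below assumes you already have predicted probabilities of success for a held-out slice of the training data; the helper names and the toy numbers are hypothetical. Because a default costs ten times what a repaid loan earns, the best threshold will often sit well above the usual 0.5.

```python
import numpy as np

def profit_at_threshold(y_true, proba, threshold, gain=100, loss=1000):
    """Issue every loan whose predicted success probability is >= threshold."""
    y_true = np.asarray(y_true)
    issued = np.asarray(proba) >= threshold
    true_pos = int(np.sum(issued & (y_true == 1)))
    false_pos = int(np.sum(issued & (y_true == 0)))
    return gain * true_pos - loss * false_pos

def best_threshold(y_true, proba, thresholds):
    """Return (threshold, profit) for the most profitable candidate threshold."""
    profits = [profit_at_threshold(y_true, proba, t) for t in thresholds]
    best = int(np.argmax(profits))
    return float(thresholds[best]), profits[best]

# Toy validation labels and probabilities, invented for illustration.
y_valid = [1, 1, 1, 1, 0, 1, 0, 0]
p_valid = [0.99, 0.97, 0.95, 0.93, 0.92, 0.90, 0.60, 0.30]

chosen, chosen_profit = best_threshold(y_valid, p_valid, [0.5, 0.91, 0.94])
print(chosen, chosen_profit)
```

Pick the threshold on held-out training data rather than on the testing dataset, so the final profit figure is not flattered by tuning.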

In [5]:
#Imports and set pandas options
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
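# Use the full option names; short patterns such as "max.columns" can match
# more than one option in newer pandas versions and raise an OptionError.
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)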

In [6]:
# Load in training data.
# The loan_status column is the target variable. Remember to drop it from df before fitting.
df = pd.read_csv("loan_training_data.csv")
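
A minimal sketch of that drop step, together with a hold-out split for honest evaluation. It runs on a small stand-in frame built from the rows that `df.head()` displays, so it works without the CSV; the 20% hold-out fraction, the helper name `split_features_target`, and the choice to discard `Unnamed: 0` (assumed here to be a saved row index) are illustrative assumptions.

```python
import pandas as pd

# Stand-in for df: the five rows shown by df.head(), trimmed to a few columns.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "loan_amnt": [15000.0, 12250.0, 17000.0, 8250.0, 7125.0],
    "grade": ["B", "C", "B", "B", "D"],
    "loan_status": [1, 0, 1, 1, 0],
})

# Hold out a slice of the labeled data (20% here, an arbitrary choice).
valid = toy.sample(frac=0.2, random_state=0)
train = toy.drop(valid.index)

def split_features_target(frame):
    """Separate the target from the features.

    "Unnamed: 0" is dropped as well, on the assumption that it is only a saved index.
    """
    y = frame["loan_status"]
    X = frame.drop(columns=["loan_status", "Unnamed: 0"])
    return X, y

X_train, y_train = split_features_target(train)
X_valid, y_valid = split_features_target(valid)
print(X_train.columns.tolist())
```

On the real data, the same two steps would be applied to `df` instead of `toy`.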

In [7]:
# Load in the data dictionary.
data_dict = pd.read_csv("the_data_dictionary.csv")
data_dict

Unnamed: 0,dtypes,name,description
0,float64,loan_amnt,"The listed amount of the loan applied for by the borrower. If at some point in time, the credit ..."
1,object,term,The number of payments on the loan. Values are in months and can be either 36 or 60.
2,float64,installment,The monthly payment owed by the borrower if the loan originates.
3,object,grade,LC assigned loan grade
4,object,emp_length,Employment length in years. Possible values are between 0 and 10 where 0 means less than one yea...
5,object,home_ownership,The home ownership status provided by the borrower during registration or obtained from the cred...
6,float64,annual_inc,The self-reported annual income provided by the borrower during registration.
7,object,verification_status,"Indicates if income was verified by LC, not verified, or if the income source was verified"
8,object,loan_status,Current status of the loan
9,object,purpose,A category provided by the borrower for the loan request.


In [8]:
df.head()

Unnamed: 0,loan_amnt,term,installment,grade,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,dti,delinq_2yrs,open_acc,revol_bal,total_acc,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,avg_cur_bal,bc_util,mort_acc,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_rev_tl_bal_gt_0,num_tl_90g_dpd_24m,num_tl_op_past_12m,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,fico_average
0,15000.0,36 months,485.14,B,9 years,MORTGAGE,125000.0,Not Verified,1,debt_consolidation,17.22,0.0,11.0,37651.0,24.0,0.0,87483.0,58300.0,7953.0,53.6,2.0,0.0,2.0,7.0,2.0,8.0,7.0,0.0,1.0,87483.0,8500.0,45764.0,687.0
1,12250.0,60 months,295.37,C,7 years,RENT,35000.0,Source Verified,0,credit_card,19.51,0.0,9.0,12681.0,19.0,0.0,13938.0,19345.0,1742.0,69.1,0.0,0.0,5.0,5.0,7.0,16.0,5.0,0.0,1.0,13938.0,18345.0,14793.0,722.0
2,17000.0,36 months,556.48,B,10+ years,MORTGAGE,67000.0,Source Verified,1,debt_consolidation,21.26,0.0,14.0,27320.0,33.0,0.0,43035.0,43500.0,3074.0,66.2,1.0,0.0,4.0,6.0,6.0,12.0,6.0,0.0,0.0,43035.0,31800.0,27657.0,747.0
3,8250.0,36 months,263.01,B,9 years,OWN,29000.0,Not Verified,1,debt_consolidation,24.34,0.0,14.0,8253.0,22.0,0.0,46863.0,13600.0,3347.0,71.8,0.0,0.0,4.0,5.0,4.0,4.0,5.0,0.0,0.0,46863.0,11400.0,40279.0,702.0
4,7125.0,36 months,256.06,D,9 years,RENT,87000.0,Source Verified,0,house,13.92,2.0,12.0,2426.0,19.0,1215.0,22318.0,5400.0,1860.0,47.2,0.0,0.0,5.0,7.0,7.0,8.0,8.0,0.0,2.0,22318.0,4700.0,26334.0,677.0
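
Several columns in the preview are strings that most models cannot use directly, such as `term` ("36 months") and `emp_length` ("10+ years"). The sketch below shows one possible cleanup on a stand-in frame copied from the first three preview rows. The new column names are hypothetical, and values not visible in the preview (for example missing or "less than one year" employment lengths) would need their own handling.

```python
import numpy as np
import pandas as pd

# Stand-in frame built from the first three rows of the preview above.
toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "emp_length": ["9 years", "7 years", "10+ years"],
    "home_ownership": ["MORTGAGE", "RENT", "MORTGAGE"],
    "annual_inc": [125000.0, 35000.0, 67000.0],
})

# "36 months" -> 36
toy["term_months"] = toy["term"].str.extract(r"(\d+)", expand=False).astype(int)

# "10+ years" -> 10.0 (float, so missing values in the real data can stay as NaN)
toy["emp_years"] = toy["emp_length"].str.extract(r"(\d+)", expand=False).astype(float)

# Log transform to compress the long right tail that income data typically has.
toy["log_annual_inc"] = np.log1p(toy["annual_inc"])

# One dummy column per home ownership category; drop the raw columns already encoded.
features = pd.get_dummies(
    toy.drop(columns=["term", "emp_length", "annual_inc"]),
    columns=["home_ownership"],
)
print(features.columns.tolist())
```

Apply the same transformations to the testing data so that both frames end up with identical columns.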


### Ready, Set, Go!!