# <b>Extreme Data Challenge</b>

##  Today's Mission
- Your objective is to devise the best possible model to predict successful/default loans using Lending Club loan data.

- The class is divided into 4 groups. Groups were decided by an extremely high-tech clustering algorithm.

        Team Seaborn: Zahra, Jeremy, Sierra, Aseem
        Team Pandas: Alvin, Kalyn, TJ, Julia
        Team Numpy: Armando, Erik, Joyce, Cherry
        Team Sklearn: Jamie, Monica, Patrick, Yudi, Lucas

- The training data is 100000 loans, each labeled either 1 (successful) or 0 (default). It comes with 33 categorical and numerical features. The testing data is 50000 loans.

- A data dictionary file is included as well. It is a table explaining what each feature means.

- Groups will be judged on how much money their model makes. You will apply your model to the testing dataset by making predictions on it and scoring them. Assume that each loan is 1000 dollars and the interest rate is 10 percent. That means for every loan you issue that is successfully repaid, you will earn 100 dollars, and for every loan you issue that defaults, you will lose 1000 dollars.
    
        Profit = 100*(Number of True Positives) - 1000*(Number of False Positives) 
        
- Mario, Zack, and George will be on hand for guidance. However, we want you to rely primarily on your teammates for help.

- Use all the tools at your disposal and try all the models we've learned in class. Refer to past class notebooks for help. Be sure to use model evaluation techniques such as ROC curves, confusion matrices, recall/precision, etc.

- To optimize your model, find the right combination of features and the right model with the right parameters. Get creative!

- Remember to use your time wisely; it will go by fast. Communicate amongst yourselves often.
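
The profit rule from the mission notes can be written as a small function. This is a minimal sketch on made-up labels; the function name `profit` and the toy lists are illustrative assumptions, not part of the challenge materials.

```python
def profit(y_true, y_pred, gain=100, loss=1000):
    """Profit under the challenge rule: +gain per true positive, -loss per false positive.

    A prediction of 1 means "issue the loan"; a prediction of 0 means "decline",
    which neither earns nor loses anything.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return gain * tp - loss * fp


# Toy labels (hypothetical): 1 = repaid, 0 = default.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]  # loans 1, 2, 4 and 6 are issued

print(profit(y_true, y_pred))  # 3 true positives, 1 false positive -> -700
```

One default cancels the interest earned on ten repaid loans, so a profitable model has to be very selective about which loans it approves.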
   

### Online resources on Lending Club loan data
Kaggle Page: https://www.kaggle.com/wendykan/lending-club-loan-data. Make sure to check out the kernels section.

Y Hat tutorial (It's in R, but it's still useful): http://blog.yhat.com/posts/machine-learning-for-predicting-bad-loans.html

Blog tutorial on the data from Kevin Davenport: http://kldavenport.com/lending-club-data-analysis-revisted-with-python/


### Class Time
There are no class breaks, but individual breaks are allowed, of course.

- 6:30 - 7:10
    - Feature engineering/selection: making dummy variables, dropping features, log transformation, scaling, and other methods of transforming data.
    - Exploratory data analysis, a.k.a. get-to-know-your-features time.
    
    
- 7:10 - 8:50
    - Modeling time!!
    
    
- 8:50 - 9:25
    - Model testing.
    
    
- 9:25 - 9:30
    - Exit tickets
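
The feature engineering steps named in the first time slot (dummy variables, dropping features, log transformation, scaling) might look like the sketch below. The toy frame and its values are assumptions for illustration; only the column names echo the loan data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one categorical and two numeric columns.
toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months", "60 months"],
    "annual_inc": [40000.0, 127500.0, 54000.0, 32000.0],
    "dti": [2.25, 11.18, 1.71, 19.24],
})

# Dummy variables: one 0/1 column per category.
features = pd.get_dummies(toy, columns=["term"], dtype=int)

# Log transformation: compress the long right tail of income.
features["log_annual_inc"] = np.log1p(features["annual_inc"])

# Dropping features: remove the raw column once its transform exists.
features = features.drop("annual_inc", axis=1)

# Scaling: shift and rescale each numeric column to mean 0 and standard deviation 1.
numeric_cols = ["dti", "log_annual_inc"]
features[numeric_cols] = (
    features[numeric_cols] - features[numeric_cols].mean()
) / features[numeric_cols].std(ddof=0)

print(features.columns.tolist())
```

Whatever transformations are chosen, the statistics they rely on (means, standard deviations, category lists) should come from the training data and then be applied unchanged to the testing data.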

In [68]:
#Imports and set pandas options
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
pd.set_option("max.columns", 100)
pd.set_option("max.colwidth", 100)

In [320]:
# Load in training data.
# Loan_status column is the target variable. Remember to drop it from df.
df = pd.read_csv("loan_training_data.csv").sample(frac=.25, random_state=1234)
df.reset_index(drop=True, inplace=True)

In [321]:
# Load in the data dictionary.
data_dict = pd.read_csv("the_data_dictionary.csv")
data_dict

| | dtypes | name | description |
|---|---|---|---|
| 0 | float64 | loan_amnt | The listed amount of the loan applied for by the borrower. If at some point in time, the credit ... |
| 1 | object | term | The number of payments on the loan. Values are in months and can be either 36 or 60. |
| 2 | float64 | installment | The monthly payment owed by the borrower if the loan originates. |
| 3 | object | grade | LC assigned loan grade |
| 4 | object | emp_length | Employment length in years. Possible values are between 0 and 10 where 0 means less than one yea... |
| 5 | object | home_ownership | The home ownership status provided by the borrower during registration or obtained from the cred... |
| 6 | float64 | annual_inc | The self-reported annual income provided by the borrower during registration. |
| 7 | object | verification_status | Indicates if income was verified by LC, not verified, or if the income source was verified |
| 8 | object | loan_status | Current status of the loan |
| 9 | object | purpose | A category provided by the borrower for the loan request. |


In [322]:
# Map letter grades A-G to the integers 1-7.
grade_map = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}
df['grade'] = df['grade'].replace(grade_map)
# Convert employment length to a number of years; 'n/a' is filled with 6.12.
emp_length_map = {'< 1 year': 0.5, '1 year': 1, '2 years': 2, '3 years': 3, '4 years': 4, '5 years': 5, '6 years': 6, '7 years': 7, '8 years': 8, '9 years': 9, '10+ years': 10, 'n/a': 6.12}
df['emp_length'] = df['emp_length'].replace(emp_length_map)

In [323]:
df.head()

Unnamed: 0,loan_amnt,term,installment,grade,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,dti,delinq_2yrs,open_acc,revol_bal,total_acc,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,avg_cur_bal,bc_util,mort_acc,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_rev_tl_bal_gt_0,num_tl_90g_dpd_24m,num_tl_op_past_12m,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,fico_average
0,23800.0,60 months,539.7,2,10.0,MORTGAGE,54000.0,Verified,1,home_improvement,1.71,0.0,2.0,1575.0,9.0,0.0,62345.0,20000.0,31173.0,7.9,1.0,0.0,1.0,1.0,1.0,6.0,1.0,0.0,0.0,1575.0,20000.0,0.0,807.0
1,5000.0,36 months,171.11,2,1.0,RENT,127500.0,Not Verified,0,debt_consolidation,11.18,1.0,8.0,8054.0,22.0,171.0,85771.0,9200.0,10721.0,93.2,0.0,0.0,3.0,5.0,3.0,5.0,5.0,0.0,1.0,85771.0,3000.0,83606.0,672.0
2,5000.0,60 months,119.22,3,5.0,RENT,40000.0,Verified,1,credit_card,2.25,0.0,2.0,3787.0,5.0,0.0,3787.0,9300.0,1894.0,40.7,0.0,0.0,1.0,1.0,2.0,4.0,1.0,0.0,0.0,3787.0,9300.0,0.0,722.0
3,14975.0,60 months,422.07,6,8.0,MORTGAGE,50000.0,Source Verified,1,debt_consolidation,18.48,0.0,8.0,5723.0,15.0,0.0,31813.0,19100.0,3977.0,96.5,1.0,0.0,2.0,5.0,2.0,3.0,5.0,0.0,3.0,31813.0,5000.0,27122.0,672.0
4,7500.0,36 months,279.5,5,7.0,RENT,32000.0,Verified,1,debt_consolidation,19.24,0.0,14.0,609.0,29.0,0.0,9826.0,16200.0,702.0,6.1,1.0,1.0,2.0,6.0,2.0,9.0,6.0,0.0,6.0,9826.0,3400.0,11551.0,682.0


### Ready, Set, Go!!

In [324]:
X_all = df.drop('loan_status', axis=1)
y_all = df['loan_status']

In [325]:
def preprocess_features(X):
    ''' Preprocesses the loan data: numeric columns are kept as they are and
        categorical (object) columns are converted into dummy variables. '''

    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.items():

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'term' => 'term_ 36 months' and 'term_ 60 months'
            col_data = pd.get_dummies(col_data, prefix = col)

        # Collect the revised columns
        output = output.join(col_data)

    return output

X_all = preprocess_features(X_all)
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (51 total features):
['loan_amnt', 'term_ 36 months', 'term_ 60 months', 'installment', 'grade', 'emp_length', 'home_ownership_MORTGAGE', 'home_ownership_NONE', 'home_ownership_OTHER', 'home_ownership_OWN', 'home_ownership_RENT', 'annual_inc', 'verification_status_Not Verified', 'verification_status_Source Verified', 'verification_status_Verified', 'purpose_car', 'purpose_credit_card', 'purpose_debt_consolidation', 'purpose_home_improvement', 'purpose_house', 'purpose_major_purchase', 'purpose_medical', 'purpose_moving', 'purpose_other', 'purpose_renewable_energy', 'purpose_small_business', 'purpose_vacation', 'purpose_wedding', 'dti', 'delinq_2yrs', 'open_acc', 'revol_bal', 'total_acc', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim', 'avg_cur_bal', 'bc_util', 'mort_acc', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_rev_tl_bal_gt_0', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'total_bal_ex_mort', 'total_bc_

In [326]:
X_all.shape

(25000, 51)

In [327]:
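from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the features, then project them onto all principal components
df_n = StandardScaler().fit_transform(X_all)
hep_pca = PCA()
hep_pca.fit(df_n)
stats_pcs = hep_pca.transform(df_n)

print(stats_pcs.shape)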

stats_pcs

(25000, 51)


array([[ -1.68568679e+00,   4.64463377e+00,  -2.00390192e-01, ...,
         -1.30588868e-15,   4.31351839e-16,  -9.33621893e-16],
       [ -2.35804537e+00,  -2.81342696e-01,   5.30888835e-01, ...,
          4.18758703e-16,  -7.63485968e-17,  -3.80257677e-16],
       [ -4.22752906e+00,   1.32702719e+00,  -1.18256114e+00, ...,
         -2.71070691e-16,  -1.77430050e-16,  -1.62820153e-16],
       ..., 
       [  2.01125713e+00,   9.50647647e-01,   5.13032073e-01, ...,
          6.20870371e-16,  -4.95538449e-16,   2.43588877e-16],
       [ -1.02062960e+00,  -2.11291456e+00,  -1.20123979e+00, ...,
         -5.27967704e-16,   1.62130121e-15,   9.01259105e-16],
       [  2.48105115e-01,   6.87287735e-01,   2.38869968e+00, ...,
          2.21793494e-16,  -1.78210701e-17,  -8.22789202e-17]])

In [328]:
stats_pcs = pd.DataFrame(stats_pcs, columns=['PC'+str(i) for i in range(1,52)])

stats_pcs.head(5)




Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20,PC21,PC22,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30,PC31,PC32,PC33,PC34,PC35,PC36,PC37,PC38,PC39,PC40,PC41,PC42,PC43,PC44,PC45,PC46,PC47,PC48,PC49,PC50,PC51
0,-1.685687,4.644634,-0.20039,2.384549,-4.219232,-1.906168,0.669088,-1.521054,-0.522403,1.217671,-0.745609,-0.536076,-0.592538,0.418783,1.749187,1.570789,0.698227,0.290933,0.571645,-0.280846,0.319034,-0.081619,-0.045348,-0.179773,-0.280532,-2.20866,-0.759726,1.534072,-0.227355,0.679759,0.625817,0.26644,-0.673626,-0.804716,0.690629,-0.556244,0.6131,0.066666,0.844593,-0.640668,-0.011058,0.181104,-0.109271,-0.547782,0.094152,0.030741,0.122045,-3.163291e-16,-1.305889e-15,4.313518e-16,-9.336219e-16
1,-2.358045,-0.281343,0.530889,-0.432648,2.66213,0.222627,-0.49439,1.049854,0.202318,-0.206212,-1.128365,1.116979,-0.770012,0.799999,0.121705,-0.140493,-0.014857,-0.297649,0.073634,0.027089,-0.091686,-0.062848,-0.388846,0.256316,-0.060586,-0.240317,-0.37297,-0.781799,-0.865429,0.699826,-0.559442,0.667945,0.089848,-0.237566,-0.237887,-0.025214,-0.172829,0.679308,0.466956,-0.482187,0.121137,0.053563,0.292667,0.031094,0.01774,-0.013612,-0.006639,6.304293e-16,4.187587e-16,-7.63486e-17,-3.802577e-16
2,-4.227529,1.327027,-1.182561,2.450379,-1.314099,-0.557746,1.709582,-1.935468,-0.749533,0.873668,-1.540424,0.002996,1.421906,-0.211984,0.240647,-0.351088,0.002343,0.237453,0.196306,-0.318211,0.207118,-0.020902,0.110962,-0.042088,-0.165195,-0.228207,0.281515,-0.823269,-0.895735,1.540271,0.063938,0.666149,-0.079631,-0.71618,-0.418579,-0.040641,0.106118,-0.272728,0.291113,-0.327402,-0.225385,-0.120556,-0.057313,0.321176,-0.027847,-0.011572,-0.343972,-4.346524e-17,-2.710707e-16,-1.774301e-16,-1.628202e-16
3,-1.174292,0.735106,-4.169884,0.140238,-0.38503,0.734413,-1.129897,-0.715775,1.921333,-0.225509,-0.138612,-0.274624,0.032787,-0.32765,0.258284,-0.002554,0.018623,0.076159,-0.219203,0.18953,-0.159563,-0.000323,0.178878,-0.038562,-0.150394,-0.012517,0.603095,0.298635,1.067524,0.054136,-0.004701,0.490025,0.738991,0.192537,-0.300284,-0.348192,-0.10974,-0.586686,0.191864,-0.321314,0.199978,-0.172332,0.08093,0.021446,0.092777,-0.002354,-0.090446,-4.290494e-16,1.569554e-16,-1.143714e-18,-8.158466e-16
4,-1.550897,-1.654098,-0.964123,-0.711741,0.070058,-1.839189,-0.877975,0.069774,-1.232675,1.673164,1.413122,0.685984,0.683668,-0.718723,0.353887,0.036795,-0.028499,0.491389,-0.117611,-0.128704,0.093367,0.219329,0.616347,-0.085055,0.10964,-0.426501,2.191945,-0.185066,0.966874,0.625494,0.124475,-0.579318,-0.726072,-0.022949,0.141488,0.993464,0.152597,-0.945358,-0.375458,-0.496762,-0.078095,-0.005412,-0.518679,-0.242191,-0.051275,0.027387,0.091838,2.741395e-16,1.219553e-15,-1.263e-15,8.651306e-16


In [329]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(stats_pcs, y_all, random_state=1)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 18750 samples.
Testing set has 6250 samples.


In [330]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
# Import 'GridSearchCV' and 'make_scorer'
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.ensemble import GradientBoostingClassifier

# Create the grid of parameters to tune (a single dict, so the values are searched in combination)
parameters = {'max_depth': [2], 'n_estimators': [100], 'learning_rate': [.1]}
# learning_rate scales the contribution of each new tree
# max_depth is the maximum depth of each individual decision tree
# max_features is the number of features considered when looking for the best split
# n_estimators is the number of boosting stages to perform, i.e. the number of trees fitted in sequence

# Initialize the classifier
clf = GradientBoostingClassifier(random_state=1)

# Perform grid search on the classifier using ROC AUC as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring='roc_auc', cv=5)

# Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)

# Get the estimator
clf = grid_obj.best_estimator_

In [331]:
print(clf)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=2,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=1,
              subsample=1.0, verbose=0, warm_start=False)
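
A fitted classifier like the one printed above can be checked with the evaluation techniques named in the mission notes (ROC AUC, confusion matrix, precision and recall). The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the loan data, so the variable names and the numbers it prints are assumptions, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (illustration only); about 80% of rows are class 1.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.2, 0.8], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = GradientBoostingClassifier(max_depth=2, n_estimators=100,
                                   learning_rate=0.1, random_state=1)
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]  # predicted probability of class 1
labels = (proba > 0.5).astype(int)       # hard predictions at a 0.5 cutoff

print("ROC AUC:  ", roc_auc_score(y_te, proba))
print("Precision:", precision_score(y_te, labels))
print("Recall:   ", recall_score(y_te, labels))
print(confusion_matrix(y_te, labels))    # rows: true class, columns: predicted class
```

Under the challenge's profit rule, precision is the metric to watch most closely: profit is positive only when true positives outnumber false positives by more than ten to one, that is, when precision exceeds 10/11.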


In [332]:
y_pred = clf.predict_proba(X_test)

In [333]:
from sklearn.metrics import confusion_matrix
#Profit calculator
def profit_calculator(y_true, y_preds):
    cm = confusion_matrix(y_true, y_preds)
    tp = cm[1,1]
    fp = cm[0,1]
    return 100*tp - 1000*fp

In [334]:
# Issue a loan only when the predicted probability of repayment exceeds 0.9
y_pred = (clf.predict_proba(X_test)[:, 1] > 0.9).astype(int)
profit_calculator(y_test, y_pred)

41800
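
The 0.9 cutoff used above sits close to the break-even point implied by the profit rule. If a loan is repaid with probability p, issuing it has an expected profit of 100p - 1000(1 - p), which is positive only when p > 10/11, roughly 0.909. The sketch below compares profit across candidate cutoffs on simulated probabilities; the simulated data and the helper name `profit_at` are assumptions for illustration, not the notebook's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated, perfectly calibrated probabilities (an assumption for illustration):
# each loan is repaid with exactly the probability the "model" reports.
proba = rng.uniform(0.5, 1.0, size=20000)
y_true = (rng.uniform(size=proba.size) < proba).astype(int)


def profit_at(threshold, y_true, proba, gain=100, loss=1000):
    """Profit when every loan with predicted probability above `threshold` is issued."""
    issued = proba > threshold
    tp = np.sum(issued & (y_true == 1))
    fp = np.sum(issued & (y_true == 0))
    return int(gain * tp - loss * fp)


break_even = 1000 / (1000 + 100)  # p at which 100*p - 1000*(1 - p) equals 0
print("Break-even probability: {:.3f}".format(break_even))

for threshold in [0.5, 0.8, 0.9, 0.95]:
    print(threshold, profit_at(threshold, y_true, proba))
```

Real model probabilities are rarely perfectly calibrated, so the most profitable cutoff is best chosen by sweeping thresholds on a held-out validation split rather than read off the formula alone.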


In [409]:
# Load in testing data.
# Loan_status column is the target variable. Remember to drop it from df.
dft = pd.read_csv("loan_testing_data.csv").sample(frac=.2, random_state=1234)
dft.reset_index(drop=True, inplace=True)

In [410]:
# Map letter grades A-G to the integers 1-7.
grade_map = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}
dft['grade'] = dft['grade'].replace(grade_map)
# Convert employment length to a number of years; 'n/a' is filled with 6.12.
emp_length_map = {'< 1 year': 0.5, '1 year': 1, '2 years': 2, '3 years': 3, '4 years': 4, '5 years': 5, '6 years': 6, '7 years': 7, '8 years': 8, '9 years': 9, '10+ years': 10, 'n/a': 6.12}
dft['emp_length'] = dft['emp_length'].replace(emp_length_map)

In [411]:
Xt_all = dft.drop('loan_status', axis=1)
yt_all = dft['loan_status']

In [412]:
print(dft.shape)
print(Xt_all.shape)

(10000, 33)
(10000, 32)


In [413]:
def preprocess_features(Xt):
    ''' Preprocesses the loan data: numeric columns are kept as they are and
        categorical (object) columns are converted into dummy variables. '''

    # Initialize new output DataFrame
    output = pd.DataFrame(index = Xt.index)

    # Investigate each feature column for the data
    for col, col_data in Xt.items():

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'term' => 'term_ 36 months' and 'term_ 60 months'
            col_data = pd.get_dummies(col_data, prefix = col)

        # Collect the revised columns
        output = output.join(col_data)

    return output

Xt_all = preprocess_features(Xt_all)
# Note: this prints the training columns (X_all), not the test columns (Xt_all).
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (51 total features):
['loan_amnt', 'term_ 36 months', 'term_ 60 months', 'installment', 'grade', 'emp_length', 'home_ownership_MORTGAGE', 'home_ownership_NONE', 'home_ownership_OTHER', 'home_ownership_OWN', 'home_ownership_RENT', 'annual_inc', 'verification_status_Not Verified', 'verification_status_Source Verified', 'verification_status_Verified', 'purpose_car', 'purpose_credit_card', 'purpose_debt_consolidation', 'purpose_home_improvement', 'purpose_house', 'purpose_major_purchase', 'purpose_medical', 'purpose_moving', 'purpose_other', 'purpose_renewable_energy', 'purpose_small_business', 'purpose_vacation', 'purpose_wedding', 'dti', 'delinq_2yrs', 'open_acc', 'revol_bal', 'total_acc', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim', 'avg_cur_bal', 'bc_util', 'mort_acc', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_rev_tl_bal_gt_0', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'total_bal_ex_mort', 'total_bc_

In [414]:
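from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# A new scaler and a new PCA are fitted on the test data here
dft_n = StandardScaler().fit_transform(Xt_all)
hep_pcat = PCA()
hep_pcat.fit(dft_n)
stats_pcst = hep_pcat.transform(dft_n)

print(stats_pcst.shape)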

stats_pcst

(10000, 50)


array([[ -1.94443139e+00,  -7.62416743e-01,   4.18053204e-01, ...,
          1.76482708e-16,   6.47627531e-16,   1.44160483e-16],
       [ -2.10514851e-01,   7.31017583e-02,  -4.24616552e+00, ...,
          1.04779442e-16,   6.65031399e-16,  -6.07944120e-16],
       [  4.33596968e+00,   7.61231354e+00,   2.52914106e+00, ...,
          4.35572864e-16,   2.28112142e-15,   6.50213807e-16],
       ..., 
       [ -1.58891538e+00,  -8.46802687e-01,   8.50083117e-01, ...,
          1.11067650e-15,  -4.18466216e-16,   1.26958161e-15],
       [ -4.40309082e-01,  -1.75400406e+00,  -1.72792645e+00, ...,
          1.99353000e-16,   3.90308425e-16,   3.56902191e-16],
       [ -4.06145588e-01,  -9.15074964e-01,   2.57838952e+00, ...,
         -3.67989040e-16,  -2.63107762e-16,  -5.33016067e-16]])

In [403]:
stats_pcst = pd.DataFrame(stats_pcst, columns=['PC'+str(i) for i in range(1, stats_pcst.shape[1] + 1)])

stats_pcst.shape


stats_pcst

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20,PC21,PC22,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30,PC31,PC32,PC33,PC34,PC35,PC36,PC37,PC38,PC39,PC40,PC41,PC42,PC43,PC44,PC45,PC46,PC47,PC48,PC49,PC50
0,-1.944431,-0.762417,0.418053,0.607060,1.159433,-1.401600,0.610780,0.710893,-0.306903,-0.809162,2.066007,1.056046,-0.867222,0.523999,-2.994582,0.222464,-0.825969,-0.075795,0.200714,0.109091,0.023208,0.211201,0.325812,-0.812251,-0.779198,-1.839513,1.787497,-0.274632,0.352189,0.139437,0.751202,-0.145095,-0.393488,0.420374,0.081516,-0.900477,0.777086,0.297459,-0.092634,-0.320257,0.187944,-0.005789,0.205136,0.163412,-0.004900,-0.010630,-1.236119e-16,1.764827e-16,6.476275e-16,1.441605e-16
1,-0.210515,0.073102,-4.246166,0.031356,2.516075,1.155031,-1.384124,0.839872,-2.009548,-0.130495,-0.463606,-0.446675,-0.419936,0.490171,0.195837,0.106261,-0.381257,0.009069,-0.195454,-0.074362,-0.089599,0.016294,0.046596,0.048591,0.208312,-0.222334,-0.008682,2.261851,0.732539,0.516984,0.694095,0.605824,-0.137726,-0.348312,0.579426,-0.265864,0.052553,0.794693,0.395980,0.498491,0.076318,0.000339,0.162496,-0.103819,-0.029560,-0.091088,7.930630e-16,1.047794e-16,6.650314e-16,-6.079441e-16
2,4.335970,7.612314,2.529141,-0.317992,1.661092,-1.561539,1.099818,0.236725,-1.215070,0.858564,-0.146738,-0.218305,-0.212966,0.328772,0.067519,-0.514846,0.175164,-0.004348,0.327074,-0.251260,-0.166361,0.299504,0.169973,0.253988,-0.693133,-0.307759,-0.076837,-0.594677,2.363670,0.657461,-0.052688,2.711600,-0.799528,-1.919781,0.772797,-0.944605,-1.433958,0.858587,0.000352,0.234830,-0.053147,0.472738,-0.737080,0.222693,0.052838,0.000680,2.039197e-15,4.355729e-16,2.281121e-15,6.502138e-16
3,-3.511269,-1.021016,1.253802,-0.998643,0.606031,0.338642,-0.597242,-0.952887,0.123861,-1.280195,0.748796,-0.600005,-0.222509,0.082751,0.138217,0.213162,-0.026229,0.010602,-0.129789,0.059907,0.172873,0.010964,-0.094531,-0.543948,0.587927,0.008382,-0.242298,-0.217654,0.559810,0.523044,-0.301682,0.683587,1.138355,-0.617621,0.263060,-0.118374,-0.772084,0.206781,-0.226112,0.200529,-0.277770,-0.098621,0.070324,0.001094,0.014181,0.024616,-8.247823e-16,9.891887e-17,-3.006230e-16,2.498946e-16
4,1.057767,1.292991,-1.605424,-1.207382,-1.550297,0.223137,-0.769339,1.995206,-1.147707,0.162909,1.102024,-0.118425,0.825765,-0.285225,-0.938154,-6.031747,1.948376,0.407418,-2.468908,-0.262124,0.427258,0.047314,-0.195068,1.227363,2.741624,-0.346197,0.441821,0.941071,-0.354284,-0.232932,0.358174,-0.980280,-0.467215,0.477273,-0.402725,-0.458711,1.045924,0.276280,-0.149574,-0.410673,0.037130,-0.040459,0.342367,0.235524,-0.000822,0.017713,2.149967e-16,3.110077e-15,-7.825732e-16,-1.711982e-15
5,-0.861906,-3.170034,1.046204,1.247799,-0.671391,1.806026,-1.149876,3.176955,0.157881,-0.755299,-0.398204,2.691261,3.128173,0.420727,0.221969,-0.663097,2.190735,-2.537325,9.452007,-1.619352,-2.250026,-0.271772,1.714761,2.759099,1.653363,0.516646,-1.333200,0.357709,-1.679466,0.914337,-1.087216,-0.046427,-0.173256,-0.205356,-0.695083,-0.054793,0.950913,0.160129,0.397317,0.170253,-0.546792,0.359333,0.059415,-0.042156,0.019806,0.092427,1.491421e-15,1.297027e-15,-2.398504e-16,-9.815415e-16
6,-0.122060,-1.606029,-2.649926,-2.184945,0.438506,0.716591,-0.143889,-0.044212,-1.110425,-0.654312,0.498342,-1.055332,-0.174775,-0.086694,0.017273,0.199411,0.307019,0.145496,-0.006689,0.021095,-0.068862,0.110222,0.063908,-0.368198,0.479728,1.068782,-0.634541,-0.872889,0.128423,0.260901,0.575626,0.057751,0.080109,-0.162369,-0.026202,-0.119368,-0.045263,0.080292,0.140220,0.232217,0.052406,-0.032030,0.025505,0.023475,-0.010265,-0.027394,4.464305e-16,-8.744049e-16,-9.236364e-17,-7.937119e-16
7,3.455280,-0.293751,-2.203846,-0.746288,1.705251,0.411270,-0.214395,0.054803,-0.952307,-0.334903,1.219492,-0.718315,0.054254,0.312745,0.097077,-0.198204,0.283854,0.121389,0.259894,-0.011730,-0.080058,0.127577,0.137439,-0.319228,-0.129542,0.284517,-0.931497,-0.436941,1.300542,-0.200457,1.517318,-0.257818,0.816552,-0.005510,-0.475905,-0.474018,0.583562,0.346647,-0.217785,0.023895,0.106523,-0.116919,-0.040634,-0.028922,-0.000023,0.027939,1.776745e-15,-1.039126e-15,-3.911902e-17,-7.751675e-16
8,1.648612,-2.433404,0.017132,1.458480,-1.101860,0.697672,-1.310877,-0.257546,1.497392,1.512984,-0.259853,0.422080,-0.295261,-0.253190,-0.134641,-0.291076,-0.048462,-0.033905,0.168910,0.119954,-0.027597,-0.208901,0.137177,0.310514,-0.629823,-0.932540,-0.064042,-0.972885,-0.355925,-0.657082,-1.792368,-0.963205,-0.397383,-0.265440,0.303167,-0.660991,-1.182501,0.089049,-0.249588,0.696095,-0.025141,0.002227,0.091183,-0.036812,-0.020916,0.034817,-3.041263e-16,3.205377e-16,6.999708e-16,-2.466423e-16
9,-2.758582,-0.966940,0.627586,-0.644864,0.138566,-0.820352,-0.482875,-1.268695,0.202128,-1.758864,-1.355004,0.738693,2.629615,0.418377,-0.244685,0.152284,0.080893,0.138230,-0.156671,-0.022209,0.261240,0.390088,-0.525168,0.013447,-0.684604,0.553252,0.418973,-0.860636,-0.579001,0.248483,-0.114122,0.263779,0.092132,0.494681,0.252086,-0.419875,0.179355,-0.055135,0.123563,-0.046844,-0.036668,-0.183512,0.108449,-0.009535,-0.013900,-0.039463,-9.746800e-16,-1.505257e-16,-6.408409e-16,3.120486e-18


In [404]:
clf

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=2,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=1,
              subsample=1.0, verbose=0, warm_start=False)

In [405]:
yt_pred = clf.predict_proba(stats_pcst)

ValueError: Number of features of the model must match the input. Model n_features is 51 and input n_features is 50 
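
The error above is a train/test feature mismatch: the test sample produced 50 dummy-encoded columns while the model was trained on 51, most likely because one category present in the training sample never appears in the test sample. The test data was also scaled and projected with a scaler and PCA fitted separately from the training ones, so its components would not line up with the training components even if the counts matched. One common remedy, sketched below on toy data, is to reindex the test dummies to the training columns and to reuse the transformers fitted on the training set. The frames and variable names are illustrative assumptions rather than the notebook's actual objects.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): the category "OTHER" appears only in the training frame.
train = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "OTHER", "RENT", "OWN", "RENT"],
    "dti": [2.2, 11.2, 1.7, 19.2, 18.5, 7.4],
})
test = pd.DataFrame({
    "home_ownership": ["OWN", "RENT", "RENT"],
    "dti": [5.0, 12.5, 9.1],
})

X_train = pd.get_dummies(train, dtype=float)
X_test = pd.get_dummies(test, dtype=float)

# Give the test frame exactly the training columns, in the same order.
# Dummy columns missing from the test data are filled with 0.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Fit the scaler and PCA on the training data only ...
scaler = StandardScaler().fit(X_train)
pca = PCA().fit(scaler.transform(X_train))

# ... and apply the already-fitted transformers to the test data.
train_pcs = pca.transform(scaler.transform(X_train))
test_pcs = pca.transform(scaler.transform(X_test))

print(train_pcs.shape, test_pcs.shape)
```

With the columns aligned and the same fitted transformers applied to both sets, the test matrix has the same number of features as the one the classifier was trained on, and each component means the same thing in both.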

In [377]:
from sklearn.metrics import confusion_matrix
#Profit calculator
def profit_calculator(y_true, y_preds):
    cm = confusion_matrix(y_true, y_preds)
    tp = cm[1,1]
    fp = cm[0,1]
    return 100*tp - 1000*fp

In [381]:
# Issue a loan only when the predicted probability of repayment exceeds 0.9
yt_pred = (clf.predict_proba(stats_pcst)[:, 1] > 0.9).astype(int)
profit_calculator(yt_all, yt_pred)

ValueError: Number of features of the model must match the input. Model n_features is 51 and input n_features is 50 