<a id='all'></a>
## Assignment 2  - Ensemble Learning
-- QF624 Machine Learning and Financial Applications  He Man 01387756

[0. Data Pre-processing](#1)<br>
[1. [Question 1] Ensemble Learning](#2)<br>
[2. [Question 2] Ensemble Learning with External Data](#3)<br>


### **<code style="color:red">Jump directly to the results</code>**
>[[Question 1] Ensemble Learning](#4)<br>
>[[Question 2] Ensemble Learning with External Data](#5)<br>

<a id='1'></a>
# Data Pre-processing

[back to main list](#all)

In [1]:
import pandas as pd

In [2]:
loanStats = pd.read_csv('LoanStats_2016Q4.csv', skiprows = 1)
rejectStats = pd.read_csv('RejectStats_2016Q4.csv', skiprows = 1)



In [3]:
# Drop last two rows
loanStats = loanStats.drop(loanStats.index[-2:])

# Drop first two columns since they are empty
loanStats = loanStats.drop(['id', 'member_id'], axis = 1)

print(loanStats.shape)
print(loanStats.columns)
print(rejectStats.shape)
print(rejectStats.columns)

print(loanStats['loan_amnt'].describe())
print(rejectStats['Amount Requested'].describe())

print(loanStats['title'].value_counts())
print(rejectStats['Loan Title'].value_counts())

"""
The reject data has only 9 columns, so let's try to match the loan data to it:

'loan_amnt' in loanStats should match 'Amount Requested' in rejectStats.
'issue_d' in loanStats is the issue date, which should be later than 'Application Date' in rejectStats; we ignore this pair for now.
'title' in loanStats should match 'Loan Title' in rejectStats, but we need to handle upper/lower case and space/underscore differences.
No column in loanStats matches 'Risk_Score' in rejectStats; it would be very useful if there were one.
'dti' in loanStats matches 'Debt-To-Income Ratio' in rejectStats.
'zip_code' and 'addr_state' in loanStats match 'Zip Code' and 'State' in rejectStats respectively. Only the first 3 digits of each zip code are available, and we might need external data for the demographics of each zip code.
'emp_length' in loanStats matches 'Employment Length' in rejectStats.
'policy_code' in loanStats matches 'Policy Code' in rejectStats.
"""

print(loanStats['emp_length'].value_counts())
print(rejectStats['Employment Length'].value_counts())

print(loanStats['policy_code'].value_counts())
print(rejectStats['Policy Code'].value_counts())

#However, policy code is useless since it carries different meaning in the two datasets, 
#so we should ignore this pair of columns.
#Let's now build two new dataframes, and keep their columns consistent.

approved = pd.DataFrame()
rejected = pd.DataFrame()

#Copy values from 'loan_amnt' and 'Amount Requested' directly to the new dataframes
approved['amount'] = loanStats['loan_amnt']
rejected['amount'] = rejectStats['Amount Requested']

#Use lambda functions to normalise the title strings: lower-case them, split on whitespace,
#and rejoin with '_'. While joining, drop words such as 'and', 'expenses', 'financing', 'loan', 'refinancing',
#which carry no information about the reason for the loan. To consolidate the reasons further,
#we map 'house' to 'home_buying', 'moving' to 'moving_relocation', 'renewable_energy' to 'green',
#and 'small_business' to 'business' (this remapping is applied to the rejected data only).
stop_words = ['and', 'expenses', 'financing', 'loan', 'refinancing']
approved['reason'] = loanStats['title'].apply(lambda x: 'other' if type(x) != str else '_'.join([i for i in x.lower().split() if i not in stop_words]))
rejected['reason'] = rejectStats['Loan Title'].apply(lambda x: 'other' if type(x) != str else '_'.join([i for i in x.lower().split() if i not in stop_words]))
convert = {'house': 'home_buying', 'moving': 'moving_relocation', 'renewable_energy': 'green', 'small_business': 'business'}
rejected['reason'] = rejected['reason'].apply(lambda x: convert[x] if x in convert else x)

#Copy values from 'dti' and 'Debt-To-Income Ratio', converted from percent to a fraction clipped to [0, 1].
#NaN is detected with x == x (False only for NaN) and replaced by the upper bound, 1.0.
approved['debt_to_income'] = loanStats['dti'].apply(lambda x: max(0.0, min(x / 100, 1.0)) if x == x else 1.0)
rejected['debt_to_income'] = rejectStats['Debt-To-Income Ratio'].apply(lambda x: max(0.0, min(float(x[:-1]) / 100, 1.0)))

#Keep the first 3 digits of the zip codes since the last 2 digits are masked, however, it is unnecessary to convert them to numbers.
approved['zip3'] = loanStats['zip_code'].apply(lambda x: x[:3] if type(x) == str else 'N/A')
rejected['zip3'] = rejectStats['Zip Code'].apply(lambda x: x[:3] if type(x) == str else 'N/A')
approved['state'] = loanStats['addr_state']
rejected['state'] = rejectStats['State']

#Convert the employment length to numerical values, if it is not specified, or less than a year, we take them as 0.
approved['employ_length'] = loanStats['emp_length'].apply(lambda x: 0 if type(x) != str or x[:3] == '< 1' else int(x[:2]))
rejected['employ_length'] = rejectStats['Employment Length'].apply(lambda x: 0 if type(x) != str or x[:3] == '< 1' else int(x[:2]))

#'reason' is now the only nominal column, so we one-hot encode it with get_dummies.
#Dropping 'other' is recommended because 'reason_=_other' is a catch-all category;
#it is then represented by 0 in every remaining reason column.
approved = pd.concat([approved, pd.get_dummies(approved['reason'], prefix = 'reason', prefix_sep = '_=_').drop('reason_=_other', axis = 1)], axis = 1).drop('reason', axis = 1)
rejected = pd.concat([rejected, pd.get_dummies(rejected['reason'], prefix = 'reason', prefix_sep = '_=_').drop('reason_=_other', axis = 1)], axis = 1).drop('reason', axis = 1)

#Add the response column to the two dataframes.
approved['approved'] = 1
rejected['approved'] = 0

#Convert the two dataframes to numpy arrays.
data_pos = approved.drop(['zip3', 'state'], axis = 1).values
data_neg = rejected.drop(['zip3', 'state'], axis = 1).values

print(approved.columns)
print(rejected.columns)

(103546, 142)
Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length',
       ...
       'orig_projected_additional_accrued_interest',
       'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
       'debt_settlement_flag', 'debt_settlement_flag_date',
       'settlement_status', 'settlement_date', 'settlement_amount',
       'settlement_percentage', 'settlement_term'],
      dtype='object', length=142)
(1404490, 9)
Index(['Amount Requested', 'Application Date', 'Loan Title', 'Risk_Score',
       'Debt-To-Income Ratio', 'Zip Code', 'State', 'Employment Length',
       'Policy Code'],
      dtype='object')
count    103546.000000
mean      14151.435835
std        9215.032376
min        1000.000000
25%        7000.000000
50%       12000.000000
75%       20000.000000
max       40000.000000
Name: loan_amnt, dtype: float64
count    1.404490e+06
mean     1.293313e+04
std      1.567272e+04
mi
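
The string and numeric clean-up rules used in the pre-processing cell can be sketched as plain functions. This is a minimal illustration, not the notebook's code: the helper names `normalize_reason`, `parse_employment_length` and `clamp_dti` are hypothetical, and the merge table is applied to every input here, whereas the cell above applies it to the rejected data only.

```python
STOP_WORDS = {'and', 'expenses', 'financing', 'loan', 'refinancing'}
MERGE = {'house': 'home_buying', 'moving': 'moving_relocation',
         'renewable_energy': 'green', 'small_business': 'business'}

def normalize_reason(title):
    """Lower-case, drop stop words, join with '_'; non-strings become 'other'."""
    if not isinstance(title, str):
        return 'other'
    key = '_'.join(w for w in title.lower().split() if w not in STOP_WORDS)
    return MERGE.get(key, key)

def parse_employment_length(value):
    """Map '< 1 year' or a missing value to 0, otherwise read the leading number."""
    if not isinstance(value, str) or value.startswith('< 1'):
        return 0
    return int(value[:2])  # int() tolerates the trailing space in '3 '

def clamp_dti(percent):
    """Convert a percentage to a fraction in [0, 1]; NaN maps to 1.0."""
    if percent != percent:  # NaN is the only float that is not equal to itself
        return 1.0
    return max(0.0, min(percent / 100, 1.0))

print(normalize_reason('Credit card refinancing'))
print(parse_employment_length('10+ years'))
print(clamp_dti(float('nan')))
```

Keeping the rules in named functions makes it easy to apply exactly the same transformation to both the approved and the rejected data.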

<a id='2'></a>
<a id='list'></a>
# <code style="color:green">[Question 1] Ensemble Learning</code>
In this assignment, you are given the Lending Club dataset for approve/reject classification of personal loans, together with the .ipynb demonstrated in class, which shows how to build classifiers with grid search and a custom $F_{0.5}$ scoring function. The task is to build ensemble models for the same classification problem.

[1.1 Data preparation](#a)<br>
[1.2 Simple Model (logistic, Decision Tree, Random Forest, KNN)](#b)<br>
[1.3 Bagging](#c)<br>
[1.4 Boosting (Adaboost, Gradient boosting)](#d)<br>
[1.5 Stacking](#e)<br>
[1.6 Performance Comparison](#f)<br>
<br>
<br>
[back to main list](#all)

<a id='a'></a>
## 1.1 Data preparation
- Test size: 0.25
- Random state: 2020

[back to contents list](#list)

In [4]:
import numpy as np
data = np.vstack((data_pos, data_neg))

In [5]:
seed = 2020
np.random.seed(seed)

In [6]:
from sklearn import model_selection
x_train, x_test, y_train, y_test = model_selection.train_test_split(data[:,:-1], data[:,-1].astype(int), 
                                                                    test_size = 0.25, random_state = 2020)

In [7]:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
x_train = min_max_scaler.fit_transform(x_train)
x_test = min_max_scaler.transform(x_test)
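
The scaler is fitted on the training split only and then reused on the test split, so no information from the test data leaks into the transformation. A minimal pure-Python sketch of that idea follows; the helper names and the toy numbers are illustrative assumptions, not the notebook's data.

```python
def fit_min_max(rows):
    """Learn per-column minimum and maximum from the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def transform_min_max(rows, mins, maxs):
    """Apply (x - min) / (max - min) using the statistics learned on the training rows."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]

# Toy columns: loan amount and debt-to-income fraction (made-up values)
train = [[1000.0, 0.10], [20000.0, 0.35], [40000.0, 0.60]]
test = [[50000.0, 0.20]]

mins, maxs = fit_min_max(train)                      # statistics come from train only
train_scaled = transform_min_max(train, mins, maxs)
test_scaled = transform_min_max(test, mins, maxs)    # may fall outside [0, 1]

print(train_scaled)
print(test_scaled)
```

A test value larger than the training maximum maps above 1, which is also what `MinMaxScaler` does with its default `clip=False`.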

In [8]:
from sklearn import ensemble
from sklearn import metrics
from sklearn.metrics import fbeta_score, make_scorer
from sklearn import linear_model
from sklearn import tree
from sklearn.metrics import roc_curve

In [9]:
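def gen_table(df, model_name, y_test, y_pred, y_pred_s):
    # y_pred: hard class labels; y_pred_s: scores for the positive class (used for the ROC curve)
    fpr, tpr, _ = roc_curve(y_test, y_pred_s)
    # NOTE: the 'F_beta' column is filled by metrics.f1_score, i.e. beta = 1;
    # metrics.fbeta_score(y_test, y_pred, beta = 0.5) would report F0.5 instead.
    res = np.array([model_name, 
                    round(metrics.precision_score(y_test, y_pred),6), 
                    round(metrics.recall_score(y_test, y_pred), 6),
                    round(metrics.f1_score(y_test, y_pred), 6),
                    round(metrics.auc(fpr, tpr), 6)])
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the new row instead
    df = pd.concat([df, pd.DataFrame([dict(zip(df.columns, res))])], ignore_index = True)
    return df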

In [10]:
table = pd.DataFrame(columns=['model','precision','recall','F_beta','AUC'])
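
The `F_beta` column depends on which beta is used: the grid searches score with beta = 0.5, while `gen_table` calls `metrics.f1_score` (beta = 1). The sketch below shows how the two differ for one precision/recall pair, using the logistic-regression figures reported in section 1.2.1; the function name `f_beta` is a hypothetical helper, not part of scikit-learn.

```python
def f_beta(precision, recall, beta):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall of the logistic model as printed in section 1.2.1
precision, recall = 0.709591, 0.336463

f1 = f_beta(precision, recall, beta=1.0)      # weights precision and recall equally
f_half = f_beta(precision, recall, beta=0.5)  # weights precision more heavily

print(round(f1, 6), round(f_half, 6))
```

Because beta = 0.5 favours precision, a high-precision, low-recall model such as the logistic regression scores noticeably higher on $F_{0.5}$ than on $F_1$.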

<a id='b'></a>
## 1.2 Simple Model
- Logistic
- Decision Tree
- Random Forest
- KNN

[back to contents list](#list)

### 1.2.1 Logistic
Parameter setting:

    'C': 10
    'solver': 'newton-cg'

In [30]:
%%time
fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid = model_selection.GridSearchCV(linear_model.LogisticRegression(random_state = 2020), 
                                    param_grid = {'C': [1, 10, 100, 1000], 
                                                  'solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']}, 
                                    scoring = fhalf_scorer, cv = 10)
grid.fit(x_train, y_train)
print(grid.best_params_)

{'C': 10, 'solver': 'newton-cg'}
CPU times: user 39min 58s, sys: 30.9 s, total: 40min 29s
Wall time: 22min 46s



In [11]:
%%time
# params = grid.best_params_
# params_C = params['C']
# params_solver = params['solver']
params_C = 10
params_solver = 'newton-cg'

logit = linear_model.LogisticRegression(C = params_C, solver = params_solver)
logit.fit(x_train, y_train)
y_pred = logit.predict(x_test)
y_pred_s = logit.predict_proba(x_test)[:,1]
table = gen_table(table, 'logistic', y_test, y_pred, y_pred_s)
print(table)

      model precision    recall    F_beta       AUC
0  logistic  0.709591  0.336463  0.456479  0.854925
CPU times: user 22.5 s, sys: 588 ms, total: 23.1 s
Wall time: 5.97 s


### 1.2.2 Decision Tree
Parameter setting:

    default

In [12]:
%%time
DT = tree.DecisionTreeClassifier()
DT.fit(x_train, y_train)
y_pred = DT.predict(x_test)
y_pred_s = DT.predict_proba(x_test)[:,1]
table = gen_table(table, 'DT', y_test, y_pred, y_pred_s)
print(table)

      model precision    recall    F_beta       AUC
0  logistic  0.709591  0.336463  0.456479  0.854925
1        DT  0.649097  0.606941  0.627311  0.793089
CPU times: user 7.05 s, sys: 78.7 ms, total: 7.13 s
Wall time: 7.14 s


### 1.2.3 Random Forest
Parameter setting:

    'max_features': 'log2'

In [None]:
%%time
fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid = model_selection.GridSearchCV(ensemble.RandomForestClassifier(), 
                                    param_grid = {'max_features':['auto', 'sqrt', 'log2']}, 
                                    scoring = fhalf_scorer, cv = 5)
grid.fit(x_train, y_train)
print(grid.best_params_)

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

# Use a lowercase instance name so the SVC class is not shadowed.
# probability=True is needed for predict_proba (used for the AUC below).
svc = SVC(gamma='scale', class_weight='balanced', probability=True)
fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
param_grid = {
        'kernel': ['linear', 'poly', 'rbf'],
        'C':[10, 100],
        'max_iter':[20, 50, 100],
        }
# The grid holds 3 * 2 * 3 = 18 combinations, so n_iter cannot usefully exceed 18.
grid = RandomizedSearchCV(svc, param_grid, cv = 5, scoring = fhalf_scorer, n_iter = 18, n_jobs = -1)
grid.fit(x_train, y_train)

params = grid.best_params_
params_kernel = params['kernel']
params_C = params['C']
params_max_iter = params['max_iter']

svc = SVC(kernel = params_kernel, C = params_C, max_iter = params_max_iter,
          gamma='scale', class_weight='balanced', probability=True)
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test) # to get precision, recall...
y_pred_s = svc.predict_proba(x_test)[:,1] # to get AUC


    {'max_features': 'log2'}
    CPU times: user 58min 45s, sys: 25.7 s, total: 59min 11s
    Wall time: 59min 16s

In [13]:
%%time
# params = grid.best_params_
# params_max_features = params['max_features']

RF = ensemble.RandomForestClassifier(max_features = 'log2')
RF.fit(x_train, y_train)
y_pred = RF.predict(x_test)
y_pred_s = RF.predict_proba(x_test)[:,1]
table = gen_table(table, 'RF', y_test, y_pred, y_pred_s)
print(table)

      model precision    recall    F_beta       AUC
0  logistic  0.709591  0.336463  0.456479  0.854925
1        DT  0.649097  0.606941  0.627311  0.793089
2        RF  0.679593  0.625024  0.651167  0.922701
CPU times: user 4min 16s, sys: 2.7 s, total: 4min 19s
Wall time: 4min 20s


### 1.2.4 KNN

In [18]:
from sklearn.neighbors import KNeighborsClassifier

In [21]:
%%time
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
y_pred_s = knn.predict_proba(x_test)[:,1]
table = gen_table(table, 'KNN', y_test, y_pred, y_pred_s)
print(table)

               model precision    recall    F_beta       AUC
0           logistic  0.709591  0.336463  0.456479  0.854925
1                 DT  0.649097  0.606941  0.627311  0.793089
2                 RF  0.679593  0.625024  0.651167  0.922701
3   bagging-logistic  0.709628  0.336617  0.456628  0.854954
4         bagging-DT  0.739342  0.671271  0.703664  0.938534
5    boosting-Ada-DT   0.66146  0.617483  0.638715   0.87244
6  boosting-Gradient  0.715461  0.749375  0.732025   0.96119
7                KNN   0.59917  0.577661  0.588219  0.774524
8                KNN  0.708381  0.637682  0.671175  0.899424
CPU times: user 37min 43s, sys: 7.9 s, total: 37min 51s
Wall time: 37min 53s


<a id='c'></a>
## 1.3 Bagging
    1. Logistic based bagging
    2. Decision Tree based bagging
    
[back to contents list](#list)
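
The two bagging parameters tuned below, `max_samples` and `n_estimators`, control how each base model's training bag is drawn and how many bags are combined. The following is a minimal standard-library sketch of the sampling and voting steps, not scikit-learn's implementation: the helper names are hypothetical, and hard voting is used only for brevity (`BaggingClassifier` averages predicted probabilities when the base estimator provides them).

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, max_samples, rng):
    """Draw int(max_samples * n_rows) row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(int(max_samples * n_rows))]

def majority_vote(labels):
    """Hard-vote the class labels predicted by the individual estimators."""
    return Counter(labels).most_common(1)[0][0]

rng = random.Random(2020)
n_rows, max_samples = 1000, 0.3

bag = bootstrap_indices(n_rows, max_samples, rng)
# Share of distinct training rows that a single estimator actually sees
unique_fraction = len(set(bag)) / n_rows

print(len(bag), round(unique_fraction, 3))
print(majority_vote([1, 0, 1]))
```

With `max_samples = 0.3` each estimator sees only roughly a quarter of the distinct rows, so the rows it never saw can serve as its out-of-bag evaluation set.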

### 1.3.1 Logistic based bagging
Parameter setting:

    'C': 10
    'solver': 'newton-cg'
    'max_samples': 0.3
    'n_estimators': 80

In [35]:
%%time
bagging_logistic = ensemble.BaggingClassifier(logit, 
                                              max_features = 1.0, 
                                              oob_score = True, 
                                              random_state = 2020, 
                                              n_jobs = -1)

fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid2 = model_selection.GridSearchCV(bagging_logistic, 
                                     param_grid = {'n_estimators': [10, 20, 50, 80, 100], 
                                                   'max_samples':[0.3,0.5,0.7,0.8]}, 
                                     scoring = fhalf_scorer, cv = 10)
grid2.fit(x_train, y_train)
print(grid2.best_params_)

  warn("Some inputs do not have OOB scores. "
  predictions.sum(axis=1)[:, np.newaxis])


{'max_samples': 0.3, 'n_estimators': 80}
CPU times: user 1h 29min 25s, sys: 3min 26s, total: 1h 32min 52s
Wall time: 7h 57min 44s



In [14]:
%%time
# params = grid2.best_params_
# params_n_estimators = params['n_estimators']
# params_max_samples = params['max_samples']
params_n_estimators = 80
params_max_samples = 0.3

best_bagging_logistic = ensemble.BaggingClassifier(logit, max_features = 1.0, 
                                                   max_samples = params_max_samples, 
                                                   n_estimators = params_n_estimators, 
                                                   oob_score = True, n_jobs = -1)
best_bagging_logistic.fit(x_train, y_train)
y_pred = best_bagging_logistic.predict(x_test)
y_pred_s = best_bagging_logistic.predict_proba(x_test)[:,1]
table = gen_table(table, 'bagging-logistic', y_test, y_pred, y_pred_s)
print(table)

              model precision    recall    F_beta       AUC
0          logistic  0.709591  0.336463  0.456479  0.854925
1                DT  0.649097  0.606941  0.627311  0.793089
2                RF  0.679593  0.625024  0.651167  0.922701
3  bagging-logistic  0.709628  0.336617  0.456628  0.854954
CPU times: user 57 s, sys: 3.03 s, total: 60 s
Wall time: 4min 23s


### 1.3.2 Decision Tree based bagging
Parameter setting:

    'max_samples': 0.3
    'n_estimators': 100

In [37]:
%%time
bagging_DT = ensemble.BaggingClassifier(DT, 
                                        max_features = 1.0, 
                                        oob_score = True, 
                                        random_state = 2020, 
                                        n_jobs = -1)

fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid3 = model_selection.GridSearchCV(bagging_DT, 
                                     param_grid = {'n_estimators': [10, 20, 50, 80, 100], 
                                                   'max_samples':[0.3,0.5,0.7,0.8]}, 
                                     scoring = fhalf_scorer, cv = 10)
grid3.fit(x_train, y_train)
print(grid3.best_params_)

  warn("Some inputs do not have OOB scores. "
  predictions.sum(axis=1)[:, np.newaxis])


{'max_samples': 0.3, 'n_estimators': 100}
CPU times: user 45min 45s, sys: 5min 2s, total: 50min 48s
Wall time: 3h 6min 46s



In [15]:
%%time
# params = grid3.best_params_
# params_n_estimators = params['n_estimators']
# params_max_samples = params['max_samples']
params_n_estimators = 100
params_max_samples = 0.3

best_bagging_DT = ensemble.BaggingClassifier(DT, max_features = 1.0, 
                                             max_samples = params_max_samples, 
                                             n_estimators = params_n_estimators, 
                                             oob_score = True, n_jobs = -1)
best_bagging_DT.fit(x_train, y_train)
y_pred = best_bagging_DT.predict(x_test)
y_pred_s = best_bagging_DT.predict_proba(x_test)[:,1]
table = gen_table(table, 'bagging-DT', y_test, y_pred, y_pred_s)
print(table)

              model precision    recall    F_beta       AUC
0          logistic  0.709591  0.336463  0.456479  0.854925
1                DT  0.649097  0.606941  0.627311  0.793089
2                RF  0.679593  0.625024  0.651167  0.922701
3  bagging-logistic  0.709628  0.336617  0.456628  0.854954
4        bagging-DT  0.739342  0.671271  0.703664  0.938534
CPU times: user 34.7 s, sys: 3.57 s, total: 38.2 s
Wall time: 1min 41s


<a id='d'></a>
## 1.4 Boosting (DecisionTree based)
    1. Adaboost
    2. GradientBoosting

[back to contents list](#list)
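
Boosting differs from bagging in that each round reweights the training rows according to the previous round's mistakes. Below is a minimal sketch of a single round of the SAMME update (the `algorithm` value selected in 1.4.1), written with the standard library only; the function name `samme_round` and the five-row toy labels are illustrative assumptions.

```python
import math

def samme_round(weights, y_true, y_pred, n_classes=2, learning_rate=1.0):
    """One SAMME round: return the estimator weight and the renormalised sample weights.

    Assumes the weighted error is strictly between 0 and 1.
    """
    miss = [int(t != p) for t, p in zip(y_true, y_pred)]
    err = sum(w * m for w, m in zip(weights, miss)) / sum(weights)
    alpha = learning_rate * (math.log((1 - err) / err) + math.log(n_classes - 1))
    # Only misclassified rows are scaled up, by exp(alpha)
    new_w = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]  # the second row is misclassified
alpha, weights = samme_round([0.2] * 5, y_true, y_pred)

print(round(alpha, 4), [round(w, 3) for w in weights])
```

After the update the single misclassified row carries half of the total weight, so the next base tree is pushed to classify it correctly.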

### 1.4.1 Adaboost
Parameter setting:

    'algorithm': 'SAMME'
    'n_estimators': 20

In [48]:
%%time
boosting_Ada = ensemble.AdaBoostClassifier(tree.DecisionTreeClassifier(), random_state = 2020)

fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid4 = model_selection.GridSearchCV(boosting_Ada, 
                                     param_grid = {'algorithm': ['SAMME', 'SAMME.R'], 
                                                   'n_estimators': [10, 20, 50, 80, 100]}, 
                                     scoring = fhalf_scorer, cv = 10)
grid4.fit(x_train, y_train)
print(grid4.best_params_)

{'algorithm': 'SAMME', 'n_estimators': 20}
CPU times: user 8h 33min 50s, sys: 2min 28s, total: 8h 36min 18s
Wall time: 8h 36min 57s



In [16]:
%%time
# params = grid4.best_params_
# params_algorithm = params['algorithm']
# params_n_estimators = params['n_estimators']
params_algorithm = 'SAMME'
params_n_estimators = 20

best_boosting_Ada = ensemble.AdaBoostClassifier(tree.DecisionTreeClassifier(), 
                                                n_estimators = params_n_estimators, 
                                                algorithm = params_algorithm)
best_boosting_Ada.fit(x_train, y_train)
y_pred = best_boosting_Ada.predict(x_test)
y_pred_s = best_boosting_Ada.predict_proba(x_test)[:,1]
table = gen_table(table, 'boosting-Ada-DT', y_test, y_pred, y_pred_s)
print(table)

              model precision    recall    F_beta       AUC
0          logistic  0.709591  0.336463  0.456479  0.854925
1                DT  0.649097  0.606941  0.627311  0.793089
2                RF  0.679593  0.625024  0.651167  0.922701
3  bagging-logistic  0.709628  0.336617  0.456628  0.854954
4        bagging-DT  0.739342  0.671271  0.703664  0.938534
5   boosting-Ada-DT   0.66146  0.617483  0.638715   0.87244
CPU times: user 3min 7s, sys: 1.2 s, total: 3min 8s
Wall time: 3min 9s


### 1.4.2 GradientBoosting
Parameter setting:

    'n_estimators': 100

In [49]:
%%time
boosting_G = ensemble.GradientBoostingClassifier()

fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid5 = model_selection.GridSearchCV(boosting_G, 
                                     param_grid = {'n_estimators': [10, 20, 50, 80, 100]}, 
                                     scoring = fhalf_scorer, cv = 10)
grid5.fit(x_train, y_train)
print(grid5.best_params_)

{'n_estimators': 100}
CPU times: user 1h 4min 35s, sys: 1min 8s, total: 1h 5min 44s
Wall time: 1h 5min 45s



In [17]:
%%time
# params = grid5.best_params_
# params_n_estimators = params['n_estimators']
params_n_estimators = 100

best_boosting_G = ensemble.GradientBoostingClassifier(n_estimators = params_n_estimators)
best_boosting_G.fit(x_train, y_train)
y_pred = best_boosting_G.predict(x_test)
y_pred_s = best_boosting_G.predict_proba(x_test)[:,1]
table = gen_table(table, 'boosting-Gradient', y_test, y_pred, y_pred_s)
print(table)

               model precision    recall    F_beta       AUC
0           logistic  0.709591  0.336463  0.456479  0.854925
1                 DT  0.649097  0.606941  0.627311  0.793089
2                 RF  0.679593  0.625024  0.651167  0.922701
3   bagging-logistic  0.709628  0.336617  0.456628  0.854954
4         bagging-DT  0.739342  0.671271  0.703664  0.938534
5    boosting-Ada-DT   0.66146  0.617483  0.638715   0.87244
6  boosting-Gradient  0.715461  0.749375  0.732025   0.96119
CPU times: user 2min 41s, sys: 2.82 s, total: 2min 44s
Wall time: 2min 44s


<a id='e'></a>
## 1.5 Stacking

For the stacking models, the best-performing simple model is used as the `final_estimator`; in this case that is the random forest.

[back to contents list](#list)
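
To make the mechanics of stacking concrete, here is a small self-contained sketch on synthetic data. It assumes scikit-learn 0.22 or later; the `make_classification` data, the 25-tree forest and the logistic-regression final estimator are illustrative choices and differ from the notebook's configuration, which uses the random forest as `final_estimator`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (stand-in for the loan data)
X, y = make_classification(n_samples=400, n_features=8, random_state=2020)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2020)

base = [
    ('logit', LogisticRegression(max_iter=1000)),
    ('DT', DecisionTreeClassifier(random_state=2020)),
    ('RF', RandomForestClassifier(n_estimators=25, random_state=2020)),
]

# The final estimator is trained on cross-validated predictions of the base models
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_tr, y_tr)

# transform() exposes the meta-features: one column per base estimator in the binary case
meta_features = stack.transform(X_te)
proba = stack.predict_proba(X_te)[:, 1]

print(meta_features.shape, proba[:3])
```

Because the meta-features come from cross-validated predictions, the final estimator does not see base-model outputs for rows those base models were trained on.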

In [18]:
%%time
estimators = [('logit', logit), ('DT', DT), ('RF', RF)]
stacking_1 = ensemble.StackingClassifier(estimators = estimators, final_estimator = RF)
stacking_1.fit(x_train, y_train)
y_pred = stacking_1.predict(x_test)
y_pred_s = stacking_1.predict_proba(x_test)[:,1]
table = gen_table(table, 'stacking_3', y_test, y_pred, y_pred_s)
print(table)

               model precision    recall    F_beta       AUC
0           logistic  0.709591  0.336463  0.456479  0.854925
1                 DT  0.649097  0.606941  0.627311  0.793089
2                 RF  0.679593  0.625024  0.651167  0.922701
3   bagging-logistic  0.709628  0.336617  0.456628  0.854954
4         bagging-DT  0.739342  0.671271  0.703664  0.938534
5    boosting-Ada-DT   0.66146  0.617483  0.638715   0.87244
6  boosting-Gradient  0.715461  0.749375  0.732025   0.96119
7         stacking_3  0.619725  0.568389  0.592948  0.904773
CPU times: user 28min 48s, sys: 20.4 s, total: 29min 8s
Wall time: 27min 36s


In [22]:
from sklearn.linear_model import RidgeClassifier
ridge = RidgeClassifier()

In [23]:
%%time
estimators_more = [('logit', logit), ('DT', DT), ('RF', RF), ('knn', knn)]
stacking_2 = ensemble.StackingClassifier(estimators = estimators_more, final_estimator = RF)
stacking_2.fit(x_train, y_train)
y_pred = stacking_2.predict(x_test)
y_pred_s = stacking_2.predict_proba(x_test)[:,1]
table = gen_table(table, 'stacking_4', y_test, y_pred, y_pred_s)
print(table)

               model precision    recall    F_beta       AUC
0           logistic  0.709591  0.336463  0.456479  0.854925
1                 DT  0.649097  0.606941  0.627311  0.793089
2                 RF  0.679593  0.625024  0.651167  0.922701
3   bagging-logistic  0.709628  0.336617  0.456628  0.854954
4         bagging-DT  0.739342  0.671271  0.703664  0.938534
5    boosting-Ada-DT   0.66146  0.617483  0.638715   0.87244
6  boosting-Gradient  0.715461  0.749375  0.732025   0.96119
7                KNN   0.59917  0.577661  0.588219  0.774524
8                KNN  0.708381  0.637682  0.671175  0.899424
9         stacking_4  0.633232  0.587088  0.609288   0.90763
CPU times: user 2h 26min 22s, sys: 53.6 s, total: 2h 27min 15s
Wall time: 2h 26min 6s


In [None]:
%%time
estimators_morenmore = [('logit', logit), ('DT', DT), ('RF', RF), ('SVM', svc), ('ridge', ridge), ('lasso', lasso)]
stacking_3 = ensemble.StackingClassifier(estimators = estimators_morenmore, final_estimator = RF)
stacking_3.fit(x_train, y_train)
y_pred = stacking_3.predict(x_test)
y_pred_s = stacking_3.predict_proba(x_test)[:,1]
table = gen_table(table, 'stacking_6', y_test, y_pred, y_pred_s)
print(table)

<a id='4'></a>
<a id='f'></a>
## 1.6 Performance Comparison

**Simple Models:**

|| Logistic | Decision Tree | Random Forest | KNN |
|---| --- | --- | --- |---|
|Parameter settings|C: 10<br>solver: 'newton-cg'|default|max_features:'log2'|default|
|Precision|0.709591|0.649097|0.679593|0.708381|
|Recall|0.336463|0.606941|0.625024|0.637682|
|$F_{0.5}$|0.456479|0.627311|0.651167|0.671175|
|AUC|0.854925|0.793089|0.922701|0.899424|

**Ensemble Models:**

|| Bagging | <code style="background:yellow">Bagging</code> | Boosting | <code style="background:yellow">Boosting</code> | <code style="background:yellow">Stacking</code> |Stacking|
| --- | --- | --- | --- | --- | --- | --- |
|Base classifier| Logistic | **Decision Tree** | Decision Tree<br>Adaboost | **Decision Tree<br>Gradient Boost** | **logistic<br>Decision Tree<br>RandomForest** |logistic<br>Decision Tree<br>RandomForest<br>KNN|
|Parameter settings|max_samples: 0.3<br>n_estimators: 80|**max_samples: 0.3<br>n_estimators: 100**|algorithm: 'SAMME'<br>n_estimators: 20|**n_estimators:100**|**final_estimator:RandomForest**|final_estimator:RandomForest|
|Precision|0.709628|**0.739342**|0.66146|**0.715461**|**0.619725**|0.633232|
|Recall|0.336617|**0.671271**|0.617483|**0.749375**|**0.568389**|0.587088|
|$F_{0.5}$|0.456628|**0.703664**|0.638715|**0.732025**|**0.592948**|0.609288|
|AUC|0.854954|**0.938534**|0.87244|**0.96119**|**0.904773**|0.90763|

<code style="background:yellow">**Comments:**</code>

1. I first used GridSearch to tune the simple models where applicable (Decision Tree and KNN use default settings). As the table shows, simple logistic regression has the lowest F_beta score (0.4565), while Random Forest achieves a fairly high F_beta (0.6512) and the highest AUC (0.9227) among the simple models. KNN also performs relatively well (F_beta=0.6712, AUC=0.8994).

2. For bagging, the Decision Tree based model scores much higher (F_beta=0.7037, AUC=0.9385) than the logistic based model (F_beta=0.4566, AUC=0.8549), and bagging brings almost no improvement over simple logistic regression.

3. For boosting, both models use Decision Tree base learners. Gradient boosting performs much better (F_beta=0.7320, AUC=0.9612) than Adaboost (F_beta=0.6387, AUC=0.8724).

4. Stacking is structurally the most complex of the three ensemble approaches. The 4-estimator stacking model gives slightly better results than the 3-estimator model, which is reasonable given the extra base learner.

5. To conclude, the Decision Tree based Gradient Boost model gives the best result, which I think could still be improved by increasing the number of estimators. The best bagging and boosting models outperform the single models; the stacking models, however, score below the simple Random Forest and KNN on F_beta.
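
As a sanity check, the F_beta column can be recomputed from the precision and recall columns. The sketch below uses the standard F-beta formula and the boosting-Gradient row; the helper name `f_beta` is an assumption for illustration and is not the `gen_table` helper used in this notebook.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall of the boosting-Gradient row in the table above.
p, r = 0.715461, 0.749375

f_half = f_beta(p, r, beta=0.5)  # weights precision more heavily than recall
f_one = f_beta(p, r, beta=1.0)   # harmonic mean of precision and recall

print(round(f_half, 4), round(f_one, 4))
```

With these inputs, beta = 1 reproduces the reported 0.732025, whereas beta = 0.5 gives roughly 0.722. The reported F_beta values therefore appear to be computed with beta = 1, so it may be worth checking which beta `gen_table` passes, given that the grid searches score with beta = 0.5.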

<a id='3'></a>
<a id='list2'></a>
# <code style="color:green">[Question 2] Ensemble Learning with External Data</code>

You are given external data with income bracket indicators for each 5-digit zip code region. Our original data has only the first three digits of the zip code, which is at county level rather than town level. Therefore, we need to aggregate the income brackets to estimate the income of loan applicants. Build the ensemble models again and see if there is any improvement.

[2.1 Data preparation](#g)<br>
[2.2 Simple Model (logistic, Decision Tree, Random Forest, KNN)](#h)<br>
[2.3 Bagging](#i)<br>
[2.4 Boosting (Adaboost, Gradient boosting)](#j)<br>
[2.5 Stacking](#k)<br>
[2.6 Performance Comparison](#l)<br>
<br>
<br>
[back to main list](#all)

<a id='g'></a>
## 2.1 Data preparation

[back to contents list](#list2)<br>

In [24]:
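# Read the bracket counts: s[1] = state, s[2] = zip code, s[3] = income bracket, s[4] = count
count = {}
with open('16zpallagi.csv', 'r') as f:
    f.readline()  # skip the header row
    for l in f:
        s = l.split(',')
        if s[2] == '0':
            # zip code 0 marks a state-level row; key it by the state instead
            count[s[1], int(s[3])] = int(s[4])
        else:
            count['%05d' % int(s[2]), int(s[3])] = int(s[4])

# Aggregate the 5-digit zip counts to their 3-digit prefix, one slot per bracket
agg_count = {}
for k, v in count.items():
    if k[0][:3] not in agg_count:
        agg_count[k[0][:3]] = [0] * 7
    agg_count[k[0][:3]][k[1]] += v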

# Assume incomes bracketed are [12500, 37500, 62500, 87500, 150000, 400000]
# 1 = $1 under $25,000
# 2 = $25,000 under $50,000
# 3 = $50,000 under $75,000
# 4 = $75,000 under $100,000
# 5 = $100,000 under $200,000
# 6 = $200,000 or more

bracket_income = [0, 12500, 37500, 62500, 87500, 150000, 400000]
estimated_zip_income = {}
for k, v in agg_count.items():
    sumn, sumd = 0, 0
    for i in range(1, 7):
        sumn += bracket_income[i] * v[i]
        sumd += v[i]
    estimated_zip_income[k] = sumn / sumd
    
# Look up the estimated income by 3-digit zip, falling back to the state-level
# estimate and then to 0 when neither key is present
def lookup_income(row):
    if row['zip3'] in estimated_zip_income:
        return estimated_zip_income[row['zip3']]
    return estimated_zip_income.get(row['state'], 0)

# Insert the new feature at column position 4
approved.insert(4, 'estimated_income', approved[['zip3', 'state']].apply(lookup_income, axis = 1))
rejected.insert(4, 'estimated_income', rejected[['zip3', 'state']].apply(lookup_income, axis = 1))
data_pos = approved.drop(['zip3', 'state'], axis = 1).values
data_neg = rejected.drop(['zip3', 'state'], axis = 1).values
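
The income estimate above is a count-weighted average of the assumed bracket midpoints. A small worked example with made-up counts for one 3-digit zip prefix (the `toy_counts` values are hypothetical, chosen only to make the arithmetic easy to follow):

```python
# Bracket midpoints as assumed in the cell above; index 0 is unused.
bracket_income = [0, 12500, 37500, 62500, 87500, 150000, 400000]

# Hypothetical counts per bracket for one 3-digit zip prefix.
toy_counts = [0, 10, 20, 30, 20, 15, 5]

weighted_sum = sum(bracket_income[i] * toy_counts[i] for i in range(1, 7))
total_count = sum(toy_counts[1:7])
estimate = weighted_sum / total_count

print(estimate)  # 87500.0
```

Since the top bracket is open-ended, the estimate is sensitive to the midpoint assumed for it (400000 here).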

In [25]:
data = np.vstack((data_pos, data_neg))
x_train, x_test, y_train, y_test = model_selection.train_test_split(data[:,:-1], data[:,-1].astype(int), 
                                                                    test_size = 0.25, random_state = 2020)
x_train = min_max_scaler.fit_transform(x_train)
x_test = min_max_scaler.transform(x_test)

In [26]:
table = pd.DataFrame(columns=['model','precision','recall','F_beta','AUC'])

<a id='h'></a>
## 2.2 Simple model
- Logistic
- Decision Tree
- Random Forest
- KNN
    

[back to contents list](#list2)<br>

### 2.2.1 Logistic
Parameter setting:

    'C': 10
    'solver': 'liblinear'

In [54]:
%%time
fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid_0 = model_selection.GridSearchCV(linear_model.LogisticRegression(), 
                                    param_grid = {'C': [1, 10, 100, 1000], 
                                                  'solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']}, 
                                    scoring = fhalf_scorer, cv = 10)
grid_0.fit(x_train, y_train)
print(grid_0.best_params_)



{'C': 10, 'solver': 'liblinear'}
CPU times: user 41min 25s, sys: 36.2 s, total: 42min 1s
Wall time: 23min 44s



In [29]:
%%time
# params = grid_0.best_params_
# params_C = params['C']
# params_solver = params['solver']
params_C = 10
params_solver = 'liblinear'

logit = linear_model.LogisticRegression(C = params_C, solver = params_solver)
logit.fit(x_train, y_train)
y_pred = logit.predict(x_test)
y_pred_s = logit.predict_proba(x_test)[:,1]
table = gen_table(table, 'logistic', y_test, y_pred, y_pred_s)
print(table)

      model precision    recall    F_beta       AUC
0        RF  0.727399  0.665577  0.695116  0.941825
1       KNN  0.693066  0.646455  0.668949  0.896051
2  logistic  0.709613  0.339695  0.459449  0.856003
CPU times: user 3.24 s, sys: 160 ms, total: 3.4 s
Wall time: 2.79 s


### 2.2.2 Decision Tree
Parameter setting: 

    default

In [30]:
%%time
DT = tree.DecisionTreeClassifier()
DT.fit(x_train, y_train)
y_pred = DT.predict(x_test)
y_pred_s = DT.predict_proba(x_test)[:,1]
table = gen_table(table, 'DT', y_test, y_pred, y_pred_s)
print(table)

      model precision    recall    F_beta       AUC
0        RF  0.727399  0.665577  0.695116  0.941825
1       KNN  0.693066  0.646455  0.668949  0.896051
2  logistic  0.709613  0.339695  0.459449  0.856003
3        DT  0.604343  0.601785  0.603061  0.786339
CPU times: user 8.56 s, sys: 85.2 ms, total: 8.65 s
Wall time: 8.65 s


### 2.2.3 Random Forest

In [None]:
%%time
fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid = model_selection.GridSearchCV(ensemble.RandomForestClassifier(), 
                                    param_grid = {'max_features':['auto', 'sqrt', 'log2']}, 
                                    scoring = fhalf_scorer, cv = 10)
grid.fit(x_train, y_train)
print(grid.best_params_)

In [27]:
%%time
# params = grid.best_params_
# params_max_features = params['max_features']

RF = ensemble.RandomForestClassifier(max_features = 'log2')
RF.fit(x_train, y_train)
y_pred = RF.predict(x_test)
y_pred_s = RF.predict_proba(x_test)[:,1]
table = gen_table(table, 'RF', y_test, y_pred, y_pred_s)
print(table)

  model precision    recall    F_beta       AUC
0    RF  0.727399  0.665577  0.695116  0.941825
CPU times: user 4min 36s, sys: 2.2 s, total: 4min 38s
Wall time: 4min 38s


### 2.2.4 KNN

In [28]:
%%time
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
y_pred_s = knn.predict_proba(x_test)[:,1]
table = gen_table(table, 'KNN', y_test, y_pred, y_pred_s)
print(table)   

  model precision    recall    F_beta       AUC
0    RF  0.727399  0.665577  0.695116  0.941825
1   KNN  0.693066  0.646455  0.668949  0.896051
CPU times: user 43min 31s, sys: 6.59 s, total: 43min 37s
Wall time: 43min 31s


<a id='i'></a>
## 2.3 Bagging
    1. Logistic based bagging
    2. Decision Tree based bagging
 
[back to contents list](#list2)<br>

### 2.3.1 Logistic based bagging
Parameter setting:

    'C': 10
    'solver': 'liblinear'
    'max_samples': 0.7
    'n_estimators': 10

In [57]:
%%time
bagging_logistic = ensemble.BaggingClassifier(logit, 
                                              max_features = 1.0, 
                                              oob_score = True, 
                                              random_state = 2020, 
                                              n_jobs = -1)

fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid_1 = model_selection.GridSearchCV(bagging_logistic, 
                                     param_grid = {'n_estimators': [10, 20, 50, 80, 100], 
                                                   'max_samples':[0.3,0.5,0.7,0.8]}, 
                                     scoring = fhalf_scorer, cv = 10)
grid_1.fit(x_train, y_train)
print(grid_1.best_params_)

  warn("Some inputs do not have OOB scores. "
  predictions.sum(axis=1)[:, np.newaxis])

{'max_samples': 0.7, 'n_estimators': 10}
CPU times: user 1h 35min 27s, sys: 5min 19s, total: 1h 40min 47s
Wall time: 8h 31min 31s



In [27]:
%%time
# params = grid_1.best_params_
# params_n_estimators = params['n_estimators']
# params_max_samples = params['max_samples']
params_n_estimators = 10
params_max_samples = 0.7

best_bagging_logistic = ensemble.BaggingClassifier(logit, max_features = 1.0, 
                                                   max_samples = params_max_samples, 
                                                   n_estimators = params_n_estimators, 
                                                   oob_score = True, n_jobs = -1)
best_bagging_logistic.fit(x_train, y_train)
y_pred = best_bagging_logistic.predict(x_test)
y_pred_s = best_bagging_logistic.predict_proba(x_test)[:,1]
table = gen_table(table, 'bagging-logistic', y_test, y_pred, y_pred_s)
print(table)

  warn("Some inputs do not have OOB scores. "
  predictions.sum(axis=1)[:, np.newaxis])


              model precision    recall    F_beta       AUC
0          logistic  0.709613  0.339695  0.459449  0.856003
1                DT  0.604286   0.60217  0.603226  0.786499
2  bagging-logistic  0.709973  0.339348  0.459208  0.856001
CPU times: user 8.4 s, sys: 1.45 s, total: 9.85 s
Wall time: 23 s
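
The `Some inputs do not have OOB scores` warning above is expected with few estimators: a training row only gets an out-of-bag prediction if at least one estimator did not sample it. The sketch below estimates how likely a row is to be in-bag for every estimator, assuming sampling with replacement (the `BaggingClassifier` default). The function name and the training-set size are hypothetical.

```python
def prob_never_oob(n_samples, max_samples, n_estimators):
    """Chance that one training row is in-bag for every estimator.

    Assumes sampling with replacement, drawing
    int(max_samples * n_samples) rows per estimator.
    """
    n_draws = int(max_samples * n_samples)
    p_in_bag = 1 - (1 - 1 / n_samples) ** n_draws
    return p_in_bag ** n_estimators

# n_samples is a hypothetical size; the result barely depends on it.
p_10 = prob_never_oob(n_samples=100_000, max_samples=0.7, n_estimators=10)
p_100 = prob_never_oob(n_samples=100_000, max_samples=0.3, n_estimators=100)

print(f"{p_10:.5f} {p_100:.2e}")
```

With `max_samples = 0.7` and `n_estimators = 10`, roughly one row in a thousand never falls out of bag, which is enough to trigger the warning. With 100 estimators and `max_samples = 0.3` the probability is negligible.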


### 2.3.2 Decision Tree based bagging
Parameter setting:

    'n_estimators': 100
    'max_samples': 0.3

In [59]:
%%time
bagging_DT = ensemble.BaggingClassifier(DT, 
                                        max_features = 1.0, 
                                        oob_score = True, 
                                        random_state = 2020, 
                                        n_jobs = -1)

fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid_2 = model_selection.GridSearchCV(bagging_DT, 
                                     param_grid = {'n_estimators': [10, 20, 50, 80, 100], 
                                                   'max_samples':[0.3,0.5,0.7,0.8]}, 
                                     scoring = fhalf_scorer, cv = 10)
grid_2.fit(x_train, y_train)
print(grid_2.best_params_)

  warn("Some inputs do not have OOB scores. "
  predictions.sum(axis=1)[:, np.newaxis])

{'max_samples': 0.3, 'n_estimators': 100}
CPU times: user 46min 53s, sys: 5min 30s, total: 52min 23s
Wall time: 3h 43min 16s



In [28]:
%%time
# params = grid_2.best_params_
# params_n_estimators = params['n_estimators']
# params_max_samples = params['max_samples']
params_n_estimators = 100
params_max_samples = 0.3

best_bagging_DT = ensemble.BaggingClassifier(DT, max_features = 1.0, 
                                             max_samples = params_max_samples, 
                                             n_estimators = params_n_estimators, 
                                             oob_score = True, n_jobs = -1)
best_bagging_DT.fit(x_train, y_train)
y_pred = best_bagging_DT.predict(x_test)
y_pred_s = best_bagging_DT.predict_proba(x_test)[:,1]
table = gen_table(table, 'bagging-DT', y_test, y_pred, y_pred_s)
print(table)

              model precision    recall    F_beta       AUC
0          logistic  0.709613  0.339695  0.459449  0.856003
1                DT  0.604286   0.60217  0.603226  0.786499
2  bagging-logistic  0.709973  0.339348  0.459208  0.856001
3        bagging-DT  0.741219  0.685237  0.712129  0.948858
CPU times: user 48.8 s, sys: 6.73 s, total: 55.5 s
Wall time: 4min 46s


<a id='j'></a>   
## 2.4 Boosting (DecisionTree based)
    1. Adaboost
    2. GradientBoosting
    

[back to contents list](#list2)<br>    

### 2.4.1 Adaboost
Parameter setting:

    'algorithm': 'SAMME'
    'n_estimators': 50

In [61]:
%%time
boosting_Ada = ensemble.AdaBoostClassifier(tree.DecisionTreeClassifier(), random_state = 2020)

fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid_3 = model_selection.GridSearchCV(boosting_Ada, 
                                     param_grid = {'algorithm': ['SAMME', 'SAMME.R'], 
                                                   'n_estimators': [10, 20, 50, 80, 100]}, 
                                     scoring = fhalf_scorer, cv = 10)
grid_3.fit(x_train, y_train)
print(grid_3.best_params_)

{'algorithm': 'SAMME', 'n_estimators': 50}
CPU times: user 11h 15min 9s, sys: 4min 14s, total: 11h 19min 24s
Wall time: 11h 22min 10s



In [30]:
%%time
# params = grid_3.best_params_
# params_algorithm = params['algorithm']
# params_n_estimators = params['n_estimators']
params_algorithm = 'SAMME'
params_n_estimators = 50

best_boosting_Ada = ensemble.AdaBoostClassifier(tree.DecisionTreeClassifier(), 
                                                n_estimators = params_n_estimators, 
                                                algorithm = params_algorithm)
best_boosting_Ada.fit(x_train, y_train)
y_pred = best_boosting_Ada.predict(x_test)
y_pred_s = best_boosting_Ada.predict_proba(x_test)[:,1]
table = gen_table(table, 'boosting-Ada-DT', y_test, y_pred, y_pred_s)
print(table)

              model precision    recall    F_beta       AUC
0          logistic  0.709613  0.339695  0.459449  0.856003
1                DT  0.604286   0.60217  0.603226  0.786499
2  bagging-logistic  0.709973  0.339348  0.459208  0.856001
3        bagging-DT  0.741219  0.685237  0.712129  0.948858
4   boosting-Ada-DT  0.643488  0.622023  0.632574  0.896162
CPU times: user 18min 18s, sys: 23.2 s, total: 18min 42s
Wall time: 19min 8s


### 2.4.2 GradientBoosting
Parameter setting:

    'n_estimators': 100

In [63]:
%%time
boosting_G = ensemble.GradientBoostingClassifier()

fhalf_scorer = metrics.make_scorer(metrics.fbeta_score, beta = 0.5)
grid_4 = model_selection.GridSearchCV(boosting_G, 
                                     param_grid = {'n_estimators': [10, 20, 50, 80, 100]}, 
                                     scoring = fhalf_scorer, cv = 10)
grid_4.fit(x_train, y_train)
print(grid_4.best_params_)

{'n_estimators': 100}
CPU times: user 1h 17min 3s, sys: 1min 14s, total: 1h 18min 17s
Wall time: 1h 18min 18s



In [31]:
%%time
# params = grid_4.best_params_
# params_n_estimators = params['n_estimators']
params_n_estimators = 100

best_boosting_G = ensemble.GradientBoostingClassifier(n_estimators = params_n_estimators)
best_boosting_G.fit(x_train, y_train)
y_pred = best_boosting_G.predict(x_test)
y_pred_s = best_boosting_G.predict_proba(x_test)[:,1]
table = gen_table(table, 'boosting-Gradient', y_test, y_pred, y_pred_s)
print(table)

               model precision    recall    F_beta       AUC
0           logistic  0.709613  0.339695  0.459449  0.856003
1                 DT  0.604286   0.60217  0.603226  0.786499
2   bagging-logistic  0.709973  0.339348  0.459208  0.856001
3         bagging-DT  0.741219  0.685237  0.712129  0.948858
4    boosting-Ada-DT  0.643488  0.622023  0.632574  0.896162
5  boosting-Gradient  0.714218  0.750875  0.732088  0.960931
CPU times: user 4min 19s, sys: 9.66 s, total: 4min 28s
Wall time: 4min 34s


<a id='k'></a>  
## 2.5 Stacking

[back to contents list](#list2)<br>  

In [31]:
%%time
estimators = [('logit', logit), ('DT', DT), ('RF', RF)]
stacking_1 = ensemble.StackingClassifier(estimators = estimators, final_estimator = RF)
stacking_1.fit(x_train, y_train)
y_pred = stacking_1.predict(x_test)
y_pred_s = stacking_1.predict_proba(x_test)[:,1]
table = gen_table(table, 'stacking_3', y_test, y_pred, y_pred_s)
print(table)

        model precision    recall    F_beta       AUC
0          RF  0.727399  0.665577  0.695116  0.941825
1         KNN  0.693066  0.646455  0.668949  0.896051
2    logistic  0.709613  0.339695  0.459449  0.856003
3          DT  0.604343  0.601785  0.603061  0.786339
4  stacking_3  0.598557  0.571429  0.584678  0.900239
CPU times: user 28min 36s, sys: 15 s, total: 28min 51s
Wall time: 28min 48s


In [32]:
%%time
estimators_more = [('logit', logit), ('DT', DT), ('RF', RF), ('knn', knn)]
stacking_2 = ensemble.StackingClassifier(estimators = estimators_more, final_estimator = RF)
stacking_2.fit(x_train, y_train)
y_pred = stacking_2.predict(x_test)
y_pred_s = stacking_2.predict_proba(x_test)[:,1]
table = gen_table(table, 'stacking_4', y_test, y_pred, y_pred_s)
print(table)

        model precision    recall    F_beta       AUC
0          RF  0.727399  0.665577  0.695116  0.941825
1         KNN  0.693066  0.646455  0.668949  0.896051
2    logistic  0.709613  0.339695  0.459449  0.856003
3          DT  0.604343  0.601785  0.603061  0.786339
4  stacking_3  0.598557  0.571429  0.584678  0.900239
5  stacking_4  0.602709  0.576969  0.589558  0.902733
CPU times: user 2h 42min 7s, sys: 46.3 s, total: 2h 42min 53s
Wall time: 2h 42min 58s


<a id='5'></a>
<a id='l'></a>  
## 2.6 Performance Comparison

**Simple Models:**

|| Logistic | Decision Tree | **Random Forest** | **KNN** |
|---| --- | --- | --- |---|
|Parameter settings|C: 10<br>solver: 'liblinear'|default|max_features:'log2'|default|
|Precision|0.709613|0.604343|0.727399|0.693066|
|Recall|0.339695|0.601785|0.665577|0.646455|
|$F_{0.5}$|0.459449|0.603061|0.695116|0.668949|
|AUC|0.856003|0.786339|0.941825|0.896051|

**Ensemble Models with external data:**

|| Bagging | <code style="background:yellow">Bagging</code> | Boosting | <code style="background:yellow">Boosting</code> | <code style="background:yellow">Stacking</code> |Stacking|
| --- | --- | --- | --- | --- | --- | --- |
|Base classifier| Logistic | **Decision Tree** | Decision Tree<br>Adaboost | **Decision Tree<br>Gradient Boost** | **logistic<br>Decision Tree<br>RandomForest**|logistic<br>Decision Tree<br>RandomForest<br>KNN|
|Parameter settings|max_samples: 0.7<br>n_estimators: 10|**max_samples: 0.3<br>n_estimators: 100**|algorithm: 'SAMME'<br>n_estimators: 50|**n_estimators:100**|**final_estimator:RandomForest**|final_estimator:RandomForest|
|Precision|0.709973|**0.741219**|0.643488|**0.714218**|**0.598557**|0.602709|
|Recall|0.339348|**0.685237**|0.622023|**0.750875**|**0.571429**|0.576969|
|$F_{0.5}$|0.459208|**0.712129**|0.632574|**0.732088**|**0.584678**|0.589558|
|AUC|0.856001|**0.948858**|0.896162|**0.960931**|**0.900239**|0.902733|

To clearly see the improvements after adding external data, here we show the result of **Ensemble Models without external data** again.

|| Bagging | <code style="background:yellow">Bagging</code> | Boosting | <code style="background:yellow">Boosting</code> | <code style="background:yellow">Stacking</code> |Stacking|
| --- | --- | --- | --- | --- | --- | --- |
|Base classifier| Logistic | **Decision Tree** | Decision Tree<br>Adaboost | **Decision Tree<br>Gradient Boost** | **logistic<br>Decision Tree<br>RandomForest** |logistic<br>Decision Tree<br>RandomForest<br>KNN|
|Parameter settings|max_samples: 0.3<br>n_estimators: 80|**max_samples: 0.3<br>n_estimators: 100**|algorithm: 'SAMME'<br>n_estimators: 20|**n_estimators:100**|**final_estimator:RandomForest**|final_estimator:RandomForest|
|Precision|0.709628|**0.739342**|0.66146|**0.715461**|**0.619725**|0.633232|
|Recall|0.336617|**0.671271**|0.617483|**0.749375**|**0.568389**|0.587088|
|$F_{0.5}$|0.456628|**0.703664**|0.638715|**0.732025**|**0.592948**|0.609288|
|AUC|0.854954|**0.938534**|0.87244|**0.96119**|**0.904773**|0.90763|
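
To read the two ensemble tables side by side, the change in each metric can be computed directly. The sketch below copies the F_beta and AUC values from the tables above; the dictionary names are illustrative.

```python
# (F_beta, AUC) copied from the two ensemble tables above.
without_ext = {
    "bagging-logistic": (0.456628, 0.854954),
    "bagging-DT": (0.703664, 0.938534),
    "boosting-Ada-DT": (0.638715, 0.872440),
    "boosting-Gradient": (0.732025, 0.961190),
    "stacking_3": (0.592948, 0.904773),
    "stacking_4": (0.609288, 0.907630),
}
with_ext = {
    "bagging-logistic": (0.459208, 0.856001),
    "bagging-DT": (0.712129, 0.948858),
    "boosting-Ada-DT": (0.632574, 0.896162),
    "boosting-Gradient": (0.732088, 0.960931),
    "stacking_3": (0.584678, 0.900239),
    "stacking_4": (0.589558, 0.902733),
}

# Positive values mean the model improved after adding the external data.
delta = {
    name: (with_ext[name][0] - f, with_ext[name][1] - auc)
    for name, (f, auc) in without_ext.items()
}
for name, (d_f, d_auc) in delta.items():
    print(f"{name:18s} dF_beta={d_f:+.4f} dAUC={d_auc:+.4f}")
```

These are single train/test splits, so small differences (a few thousandths) may be within run-to-run noise.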

**Comments:**

1. As before, I first used GridSearch to tune the simple models. As the table shows, the logistic and Random Forest models improve slightly in F_beta and AUC.


2. For bagging, after introducing the external data, both the logistic and the Decision Tree based models improve slightly, especially the Decision Tree based model, which reaches F_beta=0.7121 and AUC=0.9489.


3. For boosting, Adaboost shows a clear AUC improvement (from 0.8724 to 0.8962) with the new data, although its F_beta drops slightly (from 0.6387 to 0.6326). Gradient Boost still outperforms Adaboost but barely changes: F_beta is essentially flat (0.7320 to 0.7321) and AUC dips slightly (0.9612 to 0.9609). One likely reason is that the model selection grid was set to {'n_estimators': [10, 20, 50, 80, 100]} and the best value sits at its upper bound, so Gradient Boost may score higher with n_estimators above 100.


4. For stacking, the external data does not help: both stacking models score slightly lower than before (for 3 estimators, F_beta goes from 0.5929 to 0.5847 and AUC from 0.9048 to 0.9002). Adding KNN gives only a marginal gain, and neither stacked model beats Random Forest, its final estimator, used on its own.


5. To conclude, for this binary classification task, the Decision Tree based Gradient Boost model gives the best results in terms of $F_{0.5}$ and AUC, even though the dataset is not balanced. Both precision and recall are relatively high. It can likely be improved further by expanding the grid search range, e.g. the number of estimators, at the cost of more computing time.
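
Whether more estimators would help can be checked without refitting one model per setting: `GradientBoostingClassifier.staged_predict_proba` yields the predictions after each boosting stage. A minimal sketch on synthetic data; the dataset, the validation split and the variable names are assumptions, not this assignment's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (assumption, for illustration only).
X, y = make_classification(n_samples=600, n_features=10, random_state=2020)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=2020)

# Fit once with more trees than the original grid allowed.
gb = GradientBoostingClassifier(n_estimators=200, random_state=2020)
gb.fit(X_fit, y_fit)

# One AUC value per stage, i.e. after 1, 2, ..., 200 trees.
auc_by_stage = [
    roc_auc_score(y_val, stage_proba[:, 1])
    for stage_proba in gb.staged_predict_proba(X_val)
]
best_n = max(range(len(auc_by_stage)), key=auc_by_stage.__getitem__) + 1
print(best_n, round(auc_by_stage[best_n - 1], 4))
```

The stage should be chosen on a validation split rather than the test set, so that the final test score stays unbiased.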

[back to contents list](#list2)<br> 
[back to main list](#all)