In this script, we apply ensemble learning to the Boston test set, based on our analysis of each single model's confusion matrix and performance. After trying (1) averaging ensembles, (2) conditional ensembles and (3) weighted ensembles, we produced an "All results" table. We also tried (4) a subdistrict conditional ensemble, but decided not to consider it.

We also used our final prediction to filter out some badly mispredicted images for a qualitative analysis of the failure cases.


##### 1. Averaging ensemble

We tried four averaging ensembles:

- all single models; 

- models with best accuracy (cv0cv2cv4full);

- models best at predicting safety = 0 (cv3full);

- models best at predicting safety = 1 (cv0cv1cv2cv4own).

The averaging ensembles had better accuracy than the single models. Model 'cv3full' was best at predicting safety = 0 (TN_Rate 0.756619) and model 'cv0cv1cv2cv4own' was best at predicting safety = 1 (TP_Rate 0.779363), but each was weaker at predicting the other target. We therefore used these two models in the conditional and weighted ensembles.
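As a minimal sketch of what averaging means here, the snippet below averages the class probabilities of two models per image and then thresholds the averaged probability of class 0 at 0.5. The two small DataFrames and their file names are hypothetical; only the column layout (`_file`, `0`, `1`) follows the prediction files used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical per-image class probabilities from two models.
model_a = pd.DataFrame({"_file": ["img_a.jpg", "img_b.jpg"],
                        "0": [0.80, 0.30], "1": [0.20, 0.70]})
model_b = pd.DataFrame({"_file": ["img_a.jpg", "img_b.jpg"],
                        "0": [0.40, 0.60], "1": [0.60, 0.40]})

# Stack the predictions and average the probabilities image by image.
averaged = pd.concat([model_a, model_b]).groupby("_file").mean().reset_index()

# Predict 0 when the averaged probability of class 0 exceeds 0.5, else 1.
averaged["pred_safety"] = np.where(averaged["0"] > 0.5, 0, 1)
print(averaged)
```

In this toy case the two models disagree on both images, and the averaged probability decides each label.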

##### 2. Averaging + Conditional ensemble

'Conditional ensemble1': When 'cv3full' predicts safety = 0, we use that prediction; otherwise we use the prediction by 'cv0cv1cv2cv4own'. It achieved a TN_Rate of 0.800279, but its TP_Rate was as low as 0.643249.

'Conditional ensemble2': When 'cv0cv1cv2cv4own' predicts safety = 1, we use that prediction; otherwise we use the prediction by 'cv3full'. It achieved a TP_Rate of 0.819978, but its TN_Rate was as low as 0.647933.

Compared with 1. Averaging ensemble, each of these two conditional ensembles was even better at predicting one of the targets, but weaker at predicting the other. Therefore, we looked at weighted ensembles.

##### 3. Averaging + Weighted ensemble

'weighted ensemble1': We gave the prediction by model 'cv3full' a weight of 0.4 and the prediction by model 'cv0cv1cv2cv4own' a weight of 0.6. It achieved the best accuracy among all models: 0.741132. However, this model was better at predicting safety = 1 than safety = 0 (TP_Rate 0.750274; TN_Rate 0.733395).

'weighted ensemble2': We gave the prediction by model 'cv3full' a weight of 0.45 and the prediction by model 'cv0cv1cv2cv4own' a weight of 0.55. It achieved the 2nd best accuracy among all models: 0.740881. This model's performance was relatively even across safety = 1 and safety = 0 (TP_Rate 0.747530; TN_Rate 0.735253).

'weighted ensemble3': We gave the prediction by model 'cv3full' a weight of 0.55 and the prediction by model 'cv0cv1cv2cv4own' a weight of 0.45. It achieved the 3rd best accuracy among all models: 0.735094. This model was better at predicting safety = 0 than safety = 1 (TP_Rate 0.731065; TN_Rate 0.738504).
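The weighting can be sketched as a small helper that blends the two models' probabilities of safety = 0 and thresholds the blend at 0.5. The function name `weighted_vote` and its parameters are hypothetical, introduced only for illustration; the input probabilities are the `pred0_0` and `pred1_0` values of gsv_1.jpg and gsv_100.jpg from the final prediction table in this notebook.

```python
import numpy as np

def weighted_vote(p0_model_a, p0_model_b, weight_a, threshold=0.5):
    """Blend two arrays of P(safety = 0) and threshold the result.

    weight_a is applied to model A; model B receives 1 - weight_a.
    Returns the blended probabilities and the hard labels (0 or 1).
    """
    p0_a = np.asarray(p0_model_a, dtype=float)
    p0_b = np.asarray(p0_model_b, dtype=float)
    blended_p0 = weight_a * p0_a + (1.0 - weight_a) * p0_b
    labels = np.where(blended_p0 > threshold, 0, 1)
    return blended_p0, labels

# P(safety = 0) for gsv_1.jpg and gsv_100.jpg.
cv3full_p0 = [0.625578, 0.588373]     # averaged 'cv3full' model
own_group_p0 = [0.439558, 0.326131]   # averaged 'cv0cv1cv2cv4own' model

# The three weight settings tried in this notebook.
for w in (0.40, 0.45, 0.55):
    blended, labels = weighted_vote(cv3full_p0, own_group_p0, weight_a=w)
    print(w, blended.round(6), labels)
```

With a weight of 0.45 on 'cv3full', the blended values are about 0.523267 and 0.44414, which agree with `weight_pred_0` for these two images in the final prediction table.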

##### Conclusion
Examining the "All results" table, we chose 'weighted ensemble2' as our final model, because it balances overall accuracy with the prediction of both safety = 0 and safety = 1.

Next, we use the ensemble strategy of 'weighted ensemble2' to predict Toronto street views (ensemble_toronto.ipynb).

In [5]:
import pandas as pd
import numpy as np
import glob
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

### Before ensembling: single model performance

#### Single model performance of models produced in Transfer Learning
Six models (the best models from cv0, cv1, cv2, cv3, cv4 and the whole training dataset) were produced in the transfer learning process. Validation accuracy showed that the models' accuracy was around 70%. Later we show that an averaging ensemble boosts the accuracy.

In [161]:
cv_score = pd.read_pickle("/Users/zhanglingling/Desktop/ML1030/boston_train_evaluate/cv_score.pickle")
cv_score 

CV round,0,1,2,3,4,mean,std
train_loss,0.167847,0.070979,0.085303,0.051624,0.052649,0.08568,0.042939
train_acc,0.9471,0.972844,0.970884,0.980929,0.979752,0.970302,0.012226
val_loss,1.802394,1.540489,1.852702,1.808886,1.562203,1.713335,0.133567
val_acc,0.669805,0.672003,0.691994,0.672214,0.692622,0.679727,0.010308


In [163]:
wholedata_score = pd.read_pickle("/Users/zhanglingling/Desktop/ML1030/boston_train_evaluate/wholedata_score.pickle")
wholedata_score 

Unnamed: 0,loss,acc
0,0.058055,0.978464


#### Single model performance of model produced by our own cnn
Training loss: 0.3158, Training acc: 0.8696, val_loss: 0.6401, val_acc: 0.680

We calculate the confusion matrix of each single model

#### Data preparation

As boston_prediction_own_cnn.csv and the other predictions were produced on two different VMs, we need to reconcile them first.

boston_prediction_own_cnn.csv has one more row than the other predictions. That is because on the VM that ran transfer learning, one test image could not be fetched by the Google Street View API. So we removed that row from boston_prediction_own_cnn.csv as well.


In [166]:
# p = pd.read_csv('/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston_prediction_own_cnn.csv')     
# t = pd.read_csv('/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv4.prediction.csv')
# df = pd.merge(p, t, how='left', on='_file', 
#                    indicator=True)
# df[df['_merge'] == 'left_only']
# p = p[p['_file'] != 'gsv_1578.jpg']
# p.to_csv('/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston_prediction_own_cnn.csv', index = False)

In [191]:
predict_dir = '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/'
file_list = list(glob.glob(predict_dir + "*.csv*"))
file_list.sort()
file_list

['/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv0.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv1.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv2.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv3.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv4.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.wholedata.hdf5.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston_prediction_own_cnn.csv']

In [192]:
df_list = []
for f in file_list:
    df = pd.read_csv(f)
    df = df.sort_values("_file")
    df_list.append(df)

In [193]:
test_csv = "/Users/zhanglingling/Desktop/ML1030/us_safety/boston_test_fetched_with_target.csv"  
test_df = pd.read_csv(test_csv)
test_df = test_df.sort_values("_file")
target = "safety"
img_name_col = "_file"
test_df = test_df[[img_name_col, target]]
print(test_df.shape)
test_df.head()

(3976, 2)


Unnamed: 0,_file,safety
0,gsv_0.jpg,1
1,gsv_1.jpg,1
10,gsv_10.jpg,0
99,gsv_100.jpg,1
992,gsv_1000.jpg,1


Note that test_df also has 3976 samples, one more than the predictions. In the later functions we use df = df[df['_merge'] == 'both'] to resolve this.

In [194]:
df = pd.merge(test_df, df_list[0], how='left', on='_file', 
                   indicator=True)
df[df['_merge'] == 'left_only']

Unnamed: 0,_file,safety,0,1,_merge
640,gsv_1578.jpg,0,,,left_only


In the test set, we have 2153 samples with actual safety = 0 and 1822 samples with actual safety = 1.

In [195]:
df = df[df['_merge'] == 'both']
df.groupby(['safety']).count()

safety,_file,0,1,_merge
0,2153,2153,2153,2153
1,1822,1822,1822,1822


In [202]:
# function that calculates each single model's confusion matrix and performance tables
def matrix_performance_singlemodel(test_df, prediction):
    
    #for a single model, no need to average
    prediction['pred_safety'] =  np.where(prediction['0'] > 0.5, 0, 1)
    
    #prepare y_true, y_pred
    df = pd.merge(test_df, prediction, how='left', on='_file', 
                   indicator=True)
    df[df['_merge'] == 'left_only']
    df = df[df['_merge'] == 'both']
    df['pred_safety'] = df['pred_safety'].astype(int)
    
    y_true = df['safety']
    y_pred = df['pred_safety']
    
    #confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    matrix = pd.DataFrame([{'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp}])
    matrix['tn_rate'] = matrix['tn'] / (matrix['tn'] + matrix['fp'])
    matrix['tp_rate_recall'] = matrix['tp'] / (matrix['fn'] + matrix['tp'])

    #matrix['fp_rate'] = matrix['fp'] / (matrix['tn'] + matrix['fp'])
    #matrix['fn_rate'] = matrix['fn'] / (matrix['fn'] + matrix['tp'])
  
    #performance
    matrix['accuracy'] = accuracy_score(y_true, y_pred) # accuracy: (tp + tn) / (p + n)
    matrix['f1_score'] = f1_score(y_true, y_pred) # f1: 2 tp / (2 tp + fp + fn)
    matrix['precision'] = precision_score(y_true, y_pred) # precision tp / (tp + fp)

    
    return matrix

In [203]:
conf_matrix_table = pd.DataFrame(columns=['fn','fp','tn', 'tp', 'tn_rate', 'tp_rate_recall', 'accuracy', 'f1_score', 'precision'])

for i in range(len(df_list)):
    conf_matrix = matrix_performance_singlemodel(test_df, df_list[i])
    conf_matrix_table = pd.concat([conf_matrix_table, conf_matrix])


conf_matrix_table['model'] = ['cv0', 'cv1', 'cv2', 'cv3', 'cv4', 'full', 'own']


In [204]:
conf_matrix_table

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,485,704,1449,1337,0.673014,0.733809,0.700881,0.692208,0.655071,cv0
0,442,775,1378,1380,0.640037,0.757409,0.693836,0.69399,0.640371,cv1
0,537,649,1504,1285,0.69856,0.705269,0.701635,0.684239,0.664426,cv2
0,653,558,1595,1169,0.740827,0.641603,0.695346,0.658777,0.676896,cv3
0,505,684,1469,1317,0.682304,0.722832,0.700881,0.688988,0.658171,cv4
0,579,536,1617,1243,0.751045,0.682217,0.719497,0.690364,0.698707,full
0,552,741,1412,1270,0.655829,0.697036,0.674717,0.662666,0.631527,own


From the results above, we can see that among the 7 single models:

2 models (cv3, full) are better than the others at predicting 0;

5 models (cv0, cv1, cv2, cv4, own) are better than the others at predicting 1.
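To make the columns of the table concrete, the sketch below recomputes the rates from the four confusion-matrix counts, using the formulas given in the comments of `matrix_performance_singlemodel`. The helper name `rates_from_counts` is hypothetical; the counts are those of the cv0 row.

```python
def rates_from_counts(tn, fp, fn, tp):
    """Derive the table's rate columns from raw confusion-matrix counts."""
    return {
        "tn_rate": tn / (tn + fp),                   # share of actual 0s predicted as 0
        "tp_rate_recall": tp / (tp + fn),            # share of actual 1s predicted as 1
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp),
        "f1_score": 2 * tp / (2 * tp + fp + fn),
    }

# Counts taken from the cv0 row of the table above.
cv0 = rates_from_counts(tn=1449, fp=704, fn=485, tp=1337)
for name, value in cv0.items():
    print(f"{name}: {value:.6f}")
```

The printed values agree with the cv0 row to six decimals, and tn + fp = 2153 and fn + tp = 1822 match the class counts of the test set.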

## 1. Averaging ensemble

In [205]:
# function that calculates the averaging ensemble model's confusion matrix and performance tables
def matrix_performance(test_df, file_list):
    
    #produce averaging ensembled model's prediction - 'averaged_prediction'
    df_list = []
    for f in file_list:
        df = pd.read_csv(f)
        df = df.sort_values("_file")
        df_list.append(df)
        
    averaged_prediction = pd.concat(df_list).groupby('_file').mean()
    averaged_prediction.reset_index(level=0, inplace=True)
    averaged_prediction['pred_safety'] =  np.where(averaged_prediction['0'] > 0.5, 0, 1)
    
    #prepare y_true, y_pred
    df = pd.merge(test_df, averaged_prediction, how='left', on='_file', 
                   indicator=True)
    df[df['_merge'] == 'left_only']
    df = df[df['_merge'] == 'both']
    df['pred_safety'] = df['pred_safety'].astype(int)
    
    
    y_true = df['safety']
    y_pred = df['pred_safety']
    
    #confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    matrix = pd.DataFrame([{'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp}])
    
    matrix['tn_rate'] = matrix['tn'] / (matrix['tn'] + matrix['fp'])
    matrix['tp_rate_recall'] = matrix['tp'] / (matrix['fn'] + matrix['tp'])
    
    #matrix['fp_rate'] = matrix['fp'] / (matrix['tn'] + matrix['fp'])
    #matrix['fn_rate'] = matrix['fn'] / (matrix['fn'] + matrix['tp'])
    
    #performance
    matrix['accuracy'] = accuracy_score(y_true, y_pred) # accuracy: (tp + tn) / (p + n)
    matrix['f1_score'] = f1_score(y_true, y_pred) # f1: 2 tp / (2 tp + fp + fn)
    matrix['precision'] = precision_score(y_true, y_pred) # precision tp / (tp + fp)
    #recall = recall_score(y_true, y_pred) # recall: tp / (tp + fn)
    
    
    return matrix


### 1.1 Averaging ensemble of all single models

In [206]:
predict_dir = '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/'
file_list = list(glob.glob(predict_dir + "*.csv*"))
file_list.sort()
file_list

['/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv0.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv1.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv2.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv3.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv4.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.wholedata.hdf5.prediction.csv',
 '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston_prediction_own_cnn.csv']

In [207]:
conf_matrix_all = matrix_performance(test_df, file_list)
conf_matrix_all['model'] = ['all models']
display(conf_matrix_all)

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,435,602,1551,1387,0.72039,0.761251,0.739119,0.727893,0.697335,all models


### 1.2 Averaging ensemble of models with the best accuracy only (cv0, cv2, cv4, full)

In [208]:
file_list = ['/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv0.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv2.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv4.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.wholedata.hdf5.prediction.csv']
conf_matrix_cv0cv2cv4full = matrix_performance(test_df, file_list)
conf_matrix_cv0cv2cv4full['model'] = ['cv0cv2cv4full']

display(conf_matrix_cv0cv2cv4full)

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,462,620,1533,1360,0.71203,0.746432,0.727799,0.715413,0.686869,cv0cv2cv4full


### 1.3 Averaging ensemble of models best at predicting target 0 only (cv3, full)


In [209]:
file_list = ['/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv3.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.wholedata.hdf5.prediction.csv']
conf_matrix_cv3full = matrix_performance(test_df, file_list)
conf_matrix_cv3full['model'] = ['cv3full']

display(conf_matrix_cv3full)

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,576,524,1629,1246,0.756619,0.683864,0.72327,0.693764,0.703955,cv3full


### 1.4 Averaging ensemble of models best at predicting target 1 only (cv0, cv1, cv2, cv4, own)

In [210]:
file_list = ['/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv0.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv1.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv2.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv4.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston_prediction_own_cnn.csv']
conf_matrix_cv0cv1cv2cv4own = matrix_performance(test_df, file_list)
conf_matrix_cv0cv1cv2cv4own['model'] = ['cv0cv1cv2cv4own']

display(conf_matrix_cv0cv1cv2cv4own)

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,402,664,1489,1420,0.691593,0.779363,0.731824,0.727087,0.681382,cv0cv1cv2cv4own


In [211]:
pd.concat([conf_matrix_table, 
          conf_matrix_all,  conf_matrix_cv0cv2cv4full,
          conf_matrix_cv3full,
          conf_matrix_cv0cv1cv2cv4own])

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,485,704,1449,1337,0.673014,0.733809,0.700881,0.692208,0.655071,cv0
0,442,775,1378,1380,0.640037,0.757409,0.693836,0.69399,0.640371,cv1
0,537,649,1504,1285,0.69856,0.705269,0.701635,0.684239,0.664426,cv2
0,653,558,1595,1169,0.740827,0.641603,0.695346,0.658777,0.676896,cv3
0,505,684,1469,1317,0.682304,0.722832,0.700881,0.688988,0.658171,cv4
0,579,536,1617,1243,0.751045,0.682217,0.719497,0.690364,0.698707,full
0,552,741,1412,1270,0.655829,0.697036,0.674717,0.662666,0.631527,own
0,435,602,1551,1387,0.72039,0.761251,0.739119,0.727893,0.697335,all models
0,462,620,1533,1360,0.71203,0.746432,0.727799,0.715413,0.686869,cv0cv2cv4full
0,576,524,1629,1246,0.756619,0.683864,0.72327,0.693764,0.703955,cv3full
0,402,664,1489,1420,0.691593,0.779363,0.731824,0.727087,0.681382,cv0cv1cv2cv4own


### Conclusion of averaging ensemble

The results above show that:

1) In terms of accuracy, every averaging ensemble performed better than every single model.

2) 'all models' yielded the best accuracy (73.9%), because it achieves both a good tn_rate (72.0%) and a good tp_rate (76.1%). However, this ensemble is not the best at predicting safety = 0 alone or safety = 1 alone.

3) When predicting safety = 0, 'cv3full' yielded the best prediction (tn_rate 75.7%).

4) When predicting safety = 1, 'cv0cv1cv2cv4own' yielded the best prediction (tp_rate 77.9%).


## 2. Conditional ensemble
Next, we test whether a conditional ensemble can further boost model performance.

In [212]:
# function that builds the two averaged predictions combined by the conditional and weighted ensembles
def cond_ensemble(file_list0, file_list1):
    
    #produce each averaging ensembled model's prediction - 'prediction0' and 'prediction1'
    df_list0 = []
    for f in file_list0:
        df = pd.read_csv(f)
        df = df.sort_values("_file")
        df_list0.append(df)
        
    prediction0 = pd.concat(df_list0).groupby('_file').mean()
    prediction0.reset_index(level=0, inplace=True)
    prediction0['pred_safety'] =  np.where(prediction0['0'] > 0.5, 0, 1)
    
    
    df_list1 = []
    for f in file_list1:
        df = pd.read_csv(f)
        df = df.sort_values("_file")
        df_list1.append(df)
        
    prediction1 = pd.concat(df_list1).groupby('_file').mean()
    prediction1.reset_index(level=0, inplace=True)
    prediction1['pred_safety'] =  np.where(prediction1['0'] > 0.5, 0, 1)
    
    
    pred = pd.merge(prediction0, prediction1, how='outer', on='_file')
    pred.columns = ['_file', 'pred0_0', 'pred0_1', 'pred0_pred_safety',  'pred1_0', 'pred1_1', 'pred1_pred_safety']

    return pred


In [213]:
# function that calculates each model's confusion matrix and performance tables
def matrix_performance_cond_ensemble_model(test_df, prediction):
    
    #prepare y_true, y_pred
    df = pd.merge(test_df, prediction, how='left', on='_file', 
                   indicator=True)
    df[df['_merge'] == 'left_only']
    df = df[df['_merge'] == 'both']
    df['pred_safety'] = df['pred_safety'].astype(int)
    
    y_true = df['safety']
    y_pred = df['pred_safety']
    
    #confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    matrix = pd.DataFrame([{'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp}])
    matrix['tn_rate'] = matrix['tn'] / (matrix['tn'] + matrix['fp'])
    matrix['tp_rate_recall'] = matrix['tp'] / (matrix['fn'] + matrix['tp'])
    

    #matrix['fp_rate'] = matrix['fp'] / (matrix['tn'] + matrix['fp'])
    #matrix['fn_rate'] = matrix['fn'] / (matrix['fn'] + matrix['tp'])
    
    
    #performance
    matrix['accuracy'] = accuracy_score(y_true, y_pred) # accuracy: (tp + tn) / (p + n)
    matrix['f1_score'] = f1_score(y_true, y_pred) # f1: 2 tp / (2 tp + fp + fn)
    matrix['precision'] = precision_score(y_true, y_pred) # precision tp / (tp + fp)
    #recall = recall_score(y_true, y_pred) # recall: tp / (tp + fn)
    
    
    return matrix

### 2.1 Conditional ensemble 1
When 'cv3full' predicts safety = 0, we use that prediction; otherwise we use the prediction by 'cv0cv1cv2cv4own'.

In [214]:
file_list0 = ['/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv3.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.wholedata.hdf5.prediction.csv']


file_list1 = ['/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv0.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv1.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv2.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston.test_bestmodel.hdf5.cv4.prediction.csv',
             '/Users/zhanglingling/Desktop/ML1030/boston_test_prediction/boston_prediction_own_cnn.csv']

In [215]:
pred = cond_ensemble(file_list0, file_list1)

pred['pred_safety'] = np.where(pred['pred0_pred_safety'] == 0, 0, pred['pred1_pred_safety'])
conf_matrix_cond1 = matrix_performance_cond_ensemble_model(test_df, pred)
conf_matrix_cond1['model'] = ['conditional ensemble1']

display(conf_matrix_cond1)

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,650,430,1723,1172,0.800279,0.643249,0.728302,0.684579,0.731586,conditional ensemble1


### 2.2 Conditional ensemble 2
When 'cv0cv1cv2cv4own' predicts safety = 1, we use that prediction; otherwise we use the prediction by 'cv3full'.

In [216]:
pred = cond_ensemble(file_list0, file_list1)
pred['pred_safety'] = np.where(pred['pred1_pred_safety'] == 1, 1, pred['pred0_pred_safety'])
conf_matrix_cond2 = matrix_performance_cond_ensemble_model(test_df, pred)
conf_matrix_cond2['model'] = ['conditional ensemble2']

display(conf_matrix_cond2)

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,328,758,1395,1494,0.647933,0.819978,0.726792,0.733432,0.66341,conditional ensemble2


### Conclusion of conditional ensemble
The results above show that the 'conditional ensemble 1' model is good at predicting target 0, at the expense of predicting target 1.

Conversely, the 'conditional ensemble 2' model is good at predicting target 1, at the expense of predicting target 0.
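One way to read this trade-off: with binary labels, the rule used in conditional ensemble 1 is a logical AND of the two models' labels (an image is labelled 1 only when both models say 1), and the rule used in conditional ensemble 2 is a logical OR (an image is labelled 1 when either model says 1). The sketch below checks this on four hypothetical images that cover every combination of votes; the arrays are illustrative, not taken from the prediction files.

```python
import numpy as np

# Hard labels from the two averaged models, one entry per hypothetical image.
pred0 = np.array([0, 0, 1, 1])  # stands in for 'cv3full'
pred1 = np.array([0, 1, 0, 1])  # stands in for 'cv0cv1cv2cv4own'

# Conditional ensemble 1: trust pred0 whenever it says 0, otherwise use pred1.
cond1 = np.where(pred0 == 0, 0, pred1)

# Conditional ensemble 2: trust pred1 whenever it says 1, otherwise use pred0.
cond2 = np.where(pred1 == 1, 1, pred0)

print("cond1:", cond1, "AND:", pred0 & pred1)
print("cond2:", cond2, "OR: ", pred0 | pred1)
```

The AND rule labels fewer images as 1, which raises tn_rate and lowers tp_rate. The OR rule does the opposite.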

## 3. Weighted ensemble

### 3.1 Weighted ensemble 1

We give the prediction by model 'cv3full' a weight of 0.4 and the prediction by model 'cv0cv1cv2cv4own' a weight of 0.6.

In [217]:
pred = cond_ensemble(file_list0, file_list1)
pred['weight_pred_0'] = pred['pred0_0'] * 0.4 + pred['pred1_0'] * 0.6
pred['weight_pred_1'] = pred['pred0_1'] * 0.4 + pred['pred1_1'] * 0.6
pred['pred_safety'] =  np.where(pred['weight_pred_0'] > 0.5, 0, 1)
conf_matrix_w1 = matrix_performance_cond_ensemble_model(test_df, pred)
conf_matrix_w1['model'] = ['weighted ensemble1']

display(conf_matrix_w1)

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,455,574,1579,1367,0.733395,0.750274,0.741132,0.726548,0.704276,weighted ensemble1


### 3.2 Weighted ensemble 2
We give the prediction by model 'cv3full' a weight of 0.45 and the prediction by model 'cv0cv1cv2cv4own' a weight of 0.55.

In [218]:
pred = cond_ensemble(file_list0, file_list1)
pred['weight_pred_0'] = pred['pred0_0'] * 0.45 + pred['pred1_0'] * 0.55
pred['weight_pred_1'] = pred['pred0_1'] * 0.45 + pred['pred1_1'] * 0.55
pred['pred_safety'] =  np.where(pred['weight_pred_0'] > 0.5, 0, 1)
conf_matrix_w2 = matrix_performance_cond_ensemble_model(test_df, pred)
conf_matrix_w2['model'] = ['weighted ensemble2']

display(conf_matrix_w2)

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,460,570,1583,1362,0.735253,0.74753,0.740881,0.725626,0.704969,weighted ensemble2


### 3.3 Weighted ensemble 3
We give the prediction by model 'cv3full' a weight of 0.55 and the prediction by model 'cv0cv1cv2cv4own' a weight of 0.45.

In [219]:
pred = cond_ensemble(file_list0, file_list1)
pred['weight_pred_0'] = pred['pred0_0'] * 0.55 + pred['pred1_0'] * 0.45
pred['weight_pred_1'] = pred['pred0_1'] * 0.55 + pred['pred1_1'] * 0.45
pred['pred_safety'] =  np.where(pred['weight_pred_0'] > 0.5, 0, 1)
conf_matrix_w3 = matrix_performance_cond_ensemble_model(test_df, pred)
conf_matrix_w3['model'] = ['weighted ensemble3']

display(conf_matrix_w3)

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,490,563,1590,1332,0.738504,0.731065,0.735094,0.716707,0.702902,weighted ensemble3


### All results

In [220]:
pd.concat([conf_matrix_table, 
          conf_matrix_all,  conf_matrix_cv0cv2cv4full,
          conf_matrix_cv3full,
          conf_matrix_cv0cv1cv2cv4own,
          conf_matrix_cond1, conf_matrix_cond2, conf_matrix_w1, conf_matrix_w2, conf_matrix_w3])

Unnamed: 0,fn,fp,tn,tp,tn_rate,tp_rate_recall,accuracy,f1_score,precision,model
0,485,704,1449,1337,0.673014,0.733809,0.700881,0.692208,0.655071,cv0
0,442,775,1378,1380,0.640037,0.757409,0.693836,0.69399,0.640371,cv1
0,537,649,1504,1285,0.69856,0.705269,0.701635,0.684239,0.664426,cv2
0,653,558,1595,1169,0.740827,0.641603,0.695346,0.658777,0.676896,cv3
0,505,684,1469,1317,0.682304,0.722832,0.700881,0.688988,0.658171,cv4
0,579,536,1617,1243,0.751045,0.682217,0.719497,0.690364,0.698707,full
0,552,741,1412,1270,0.655829,0.697036,0.674717,0.662666,0.631527,own
0,435,602,1551,1387,0.72039,0.761251,0.739119,0.727893,0.697335,all models
0,462,620,1533,1360,0.71203,0.746432,0.727799,0.715413,0.686869,cv0cv2cv4full
0,576,524,1629,1246,0.756619,0.683864,0.72327,0.693764,0.703955,cv3full
0,402,664,1489,1420,0.691593,0.779363,0.731824,0.727087,0.681382,cv0cv1cv2cv4own
0,650,430,1723,1172,0.800279,0.643249,0.728302,0.684579,0.731586,conditional ensemble1
0,328,758,1395,1494,0.647933,0.819978,0.726792,0.733432,0.66341,conditional ensemble2
0,455,574,1579,1367,0.733395,0.750274,0.741132,0.726548,0.704276,weighted ensemble1
0,460,570,1583,1362,0.735253,0.74753,0.740881,0.725626,0.704969,weighted ensemble2
0,490,563,1590,1332,0.738504,0.731065,0.735094,0.716707,0.702902,weighted ensemble3


### Conclusion:
We consider "weighted ensemble2" the best ensembling method: it has an accuracy of 74.1% and is good at predicting both safety = 0 and safety = 1.

In [223]:
# save final model "weighted ensemble2"
# cond_ensemble takes only the two file lists; test_df is not one of its parameters
pred = cond_ensemble(file_list0, file_list1)
pred['weight_pred_0'] = pred['pred0_0'] * 0.45 + pred['pred1_0'] * 0.55
pred['weight_pred_1'] = pred['pred0_1'] * 0.45 + pred['pred1_1'] * 0.55
pred['pred_safety'] =  np.where(pred['weight_pred_0'] > 0.5, 0, 1)
pred.head()

Unnamed: 0,_file,pred0_0,pred0_1,pred0_pred_safety,pred1_0,pred1_1,pred1_pred_safety,weight_pred_0,weight_pred_1,pred_safety
0,gsv_0.jpg,0.00124,0.998759,1,0.172197,0.827803,1,0.095266,0.904734,1
1,gsv_1.jpg,0.625578,0.374422,0,0.439558,0.560442,1,0.523267,0.476733,0
2,gsv_10.jpg,0.00035,0.99965,1,0.144033,0.855967,1,0.079376,0.920624,1
3,gsv_100.jpg,0.588373,0.411627,0,0.326131,0.673869,1,0.44414,0.55586,1
4,gsv_1000.jpg,0.032821,0.967179,1,0.186796,0.813204,1,0.117507,0.882493,1


In [233]:
final_pred_weighted_ensemble2 = pred[['_file', 'weight_pred_0', 'weight_pred_1', 'pred_safety']]
final_pred_weighted_ensemble2.columns = ['_file', '0', '1', 'pred_safety']

In [234]:
final_pred_weighted_ensemble2.head()

Unnamed: 0,_file,0,1,pred_safety
0,gsv_0.jpg,0.095266,0.904734,1
1,gsv_1.jpg,0.523267,0.476733,0
2,gsv_10.jpg,0.079376,0.920624,1
3,gsv_100.jpg,0.44414,0.55586,1
4,gsv_1000.jpg,0.117507,0.882493,1


In [235]:
final_pred_weighted_ensemble2.to_csv("/Users/zhanglingling/Desktop/ML1030/final_pred_boston_test_weighted_ensemble2.csv", index=False)

### Examine some failure cases by our final prediction - weighted ensemble2

In [238]:
df = pd.merge(test_df, pred, how='left', on='_file', 
                   indicator=True)
df = df[df['_merge'] == 'both']
df['pred_safety'] = df['pred_safety'].astype(int)

In [249]:
final_pred_wrong = df[df['safety'] != df['pred_safety']]

In [251]:
print(final_pred_wrong.shape)
(df.shape[0] - final_pred_wrong.shape[0])/df.shape[0]

(1030, 12)


0.740880503144654

#### 73 samples were confidently mispredicted as 0 (weight_pred_0 > 0.9), and 99 samples were confidently mispredicted as 1 (weight_pred_1 > 0.9).

In [261]:
print(final_pred_wrong[final_pred_wrong['weight_pred_0'] > 0.9].shape)
print(final_pred_wrong[final_pred_wrong['weight_pred_1'] > 0.9].shape)

(73, 12)
(99, 12)


#### We randomly choose some image ids among the extremely wrong predictions and download those images from our VM's boston_test/cropped_image folder to inspect.

In [267]:
## Very wrongly predicted as 0
final_pred_wrong[final_pred_wrong['weight_pred_0'] > 0.9].sort_values(by = ['weight_pred_0'], ascending=False)

Unnamed: 0,_file,safety,pred0_0,pred0_1,pred0_pred_safety,pred1_0,pred1_1,pred1_pred_safety,weight_pred_0,weight_pred_1,pred_safety,_merge
685,gsv_1618.jpg,1,0.999589,4.111776e-04,0.0,0.996554,0.003446,0.0,0.997920,0.002080,0,both
2655,gsv_3400.jpg,1,0.999999,7.340673e-07,0.0,0.995372,0.004628,0.0,0.997454,0.002546,0,both
2884,gsv_3608.jpg,1,0.999996,4.290905e-06,0.0,0.994700,0.005300,0.0,0.997083,0.002917,0,both
1861,gsv_2682.jpg,1,0.999962,3.766300e-05,0.0,0.987417,0.012583,0.0,0.993062,0.006938,0,both
2378,gsv_3149.jpg,1,0.999985,1.533803e-05,0.0,0.985764,0.014236,0.0,0.992163,0.007837,0,both
299,gsv_1268.jpg,1,0.982669,1.733082e-02,0.0,0.996175,0.003825,0.0,0.990097,0.009903,0,both
2711,gsv_3451.jpg,1,1.000000,4.589439e-08,0.0,0.980594,0.019406,0.0,0.989327,0.010673,0,both
1005,gsv_1907.jpg,1,0.997898,2.102504e-03,0.0,0.974365,0.025635,0.0,0.984955,0.015045,0,both
1915,gsv_2730.jpg,1,0.998798,1.202275e-03,0.0,0.968680,0.031320,0.0,0.982233,0.017767,0,both
3773,gsv_813.jpg,1,0.999956,4.359570e-05,0.0,0.963780,0.036220,0.0,0.980059,0.019941,0,both


In [268]:
## Very wrongly predicted as 1
final_pred_wrong[final_pred_wrong['weight_pred_1'] > 0.9].sort_values(by = ['weight_pred_1'], ascending=False)

Unnamed: 0,_file,safety,pred0_0,pred0_1,pred0_pred_safety,pred1_0,pred1_1,pred1_pred_safety,weight_pred_0,weight_pred_1,pred_safety,_merge
1509,gsv_2362.jpg,0,0.009144,0.990856,1.0,0.004879,0.995121,1.0,0.006799,0.993201,1,both
385,gsv_1345.jpg,0,0.002280,0.997720,1.0,0.012551,0.987449,1.0,0.007929,0.992071,1,both
2515,gsv_3274.jpg,0,0.009274,0.990726,1.0,0.008415,0.991585,1.0,0.008802,0.991198,1,both
2748,gsv_3485.jpg,0,0.000405,0.999595,1.0,0.015833,0.984167,1.0,0.008890,0.991110,1,both
3967,gsv_991.jpg,0,0.030527,0.969473,1.0,0.007650,0.992350,1.0,0.017945,0.982055,1,both
2716,gsv_3456.jpg,0,0.000011,0.999989,1.0,0.033031,0.966969,1.0,0.018172,0.981828,1,both
3183,gsv_3879.jpg,0,0.001118,0.998882,1.0,0.033804,0.966196,1.0,0.019095,0.980905,1,both
3001,gsv_3713.jpg,0,0.000506,0.999494,1.0,0.034928,0.965072,1.0,0.019438,0.980562,1,both
1687,gsv_2523.jpg,0,0.013274,0.986726,1.0,0.027835,0.972165,1.0,0.021283,0.978717,1,both
1194,gsv_2078.jpg,0,0.006370,0.993630,1.0,0.035453,0.964547,1.0,0.022366,0.977634,1,both
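The random selection of image ids for inspection could be sketched as follows. The helper `sample_confident_errors`, its parameters and the seed are hypothetical; the small `wrong` DataFrame is a stand-in for `final_pred_wrong`, with a few rows copied from the two tables above.

```python
import pandas as pd

# Stand-in for final_pred_wrong, with rows copied from the tables above.
wrong = pd.DataFrame({
    "_file": ["gsv_1618.jpg", "gsv_3400.jpg", "gsv_3608.jpg",
              "gsv_2362.jpg", "gsv_1345.jpg"],
    "safety": [1, 1, 1, 0, 0],
    "weight_pred_0": [0.997920, 0.997454, 0.997083, 0.006799, 0.007929],
    "weight_pred_1": [0.002080, 0.002546, 0.002917, 0.993201, 0.992071],
})

def sample_confident_errors(wrong_df, score_col, threshold=0.9, n=2, seed=0):
    """Randomly pick up to n file names whose score_col exceeds threshold."""
    confident = wrong_df[wrong_df[score_col] > threshold]
    picked = confident.sample(n=min(n, len(confident)), random_state=seed)
    return picked["_file"].tolist()

# Image ids to download and inspect, one list per direction of error.
as_zero = sample_confident_errors(wrong, "weight_pred_0")
as_one = sample_confident_errors(wrong, "weight_pred_1")
print(as_zero, as_one)
```

Fixing `random_state` keeps the selection reproducible, so the same images can be fetched again later.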
