
# Generalized Ensemble Model (GEM)

I use four base models: LightGBM (LGBM), Gaussian naive Bayes, Logit, and Probit. To ensemble them, I tried the Generalized Ensemble Model (GEM) proposed by [Perrone and Cooper (1992)](https://scholar.google.com/scholar_lookup?title=When%20networks%20disagree%3A%20Ensemble%20methods%20for%20hybrid%20neural%20networks&author=M.P.%20Perrone&publication_year=1992), which finds the ensemble weights that minimize MSE (equivalently, RMSE). However, the optimal weights computed by the GEM algorithm put all the weight on Logit, i.e., the Logit-only model had the lowest MSE/RMSE. Even more interesting, these weights do not maximize the ROC AUC score. So although I computed optimal weights with GEM, I assigned hand-picked weights in my final ensemble.
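
For reference, when the only constraint is that the weights sum to one, GEM has a closed-form solution: with misfit m_i = y - f_i and C_ij = E[m_i m_j], the weights are w = C^{-1} 1 / (1^T C^{-1} 1). The following is a minimal numpy sketch on synthetic data; `gem_weights` and the toy predictions are illustrative names that do not appear elsewhere in this notebook. Note that this form allows negative weights, unlike the bounded optimization in section 2.2.

```python
import numpy as np

def gem_weights(preds, y_true):
    """Closed-form GEM weights (sum to 1, sign unconstrained).

    preds  : (n_samples, n_models) array of base-model predictions
    y_true : (n_samples,) array of targets
    """
    misfit = y_true[:, None] - preds          # m_i = y - f_i
    C = misfit.T @ misfit / len(y_true)       # C_ij = E[m_i * m_j]
    ones = np.ones(C.shape[0])
    w = np.linalg.solve(C, ones)              # C^{-1} 1
    return w / w.sum()                        # normalise so the weights sum to 1

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500).astype(float)
# three hypothetical base models with different noise levels
toy_preds = np.column_stack([y + rng.normal(0, s, size=500) for s in (0.3, 0.5, 0.8)])

w = gem_weights(toy_preds, y)
mse_each = ((toy_preds - y[:, None]) ** 2).mean(axis=0)
mse_gem = ((toy_preds @ w - y) ** 2).mean()
print(w, mse_each, mse_gem)
```

On the data used to fit the weights, the blended MSE cannot exceed that of any single base model, because each single-model weighting is itself a feasible solution.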

Finding weights that directly maximize the ROC AUC score seems infeasible with gradient-based optimizers, since the score depends only on the ranking of the predictions and is not differentiable. Some approaches optimize a surrogate measure instead, such as binary cross-entropy (BCE). Others use differentiable losses that mimic the behavior of ROC AUC (and can be written in a quadratic form, which has a single global optimum), such as [ROC-star](https://github.com/iridiumblue/roc-star). I'm going to implement BCE and ROC-star later.
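
Because ROC AUC is piecewise constant in the weights, a gradient gives no signal, but a derivative-free search is still possible. The sketch below samples candidate weights on the simplex from a Dirichlet distribution and keeps the set with the highest AUC. `auc_rank`, `random_search_auc`, and the synthetic data are assumptions for illustration, and the result is only approximately optimal.

```python
import numpy as np

def auc_rank(y_true, score):
    """ROC AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    ranks = np.empty(len(score))
    ranks[np.argsort(score)] = np.arange(1, len(score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def random_search_auc(preds, y_true, n_trials=500, seed=0):
    """Sample weights on the simplex and keep the set with the highest AUC."""
    rng = np.random.default_rng(seed)
    n_models = preds.shape[1]
    # include the single-model corners so the result is never worse than any base model
    candidates = np.vstack([np.eye(n_models), rng.dirichlet(np.ones(n_models), size=n_trials)])
    scores = np.array([auc_rank(y_true, preds @ w) for w in candidates])
    best = scores.argmax()
    return candidates[best], scores[best]

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=400)
# three hypothetical base models with different noise levels
toy_preds = np.column_stack([y + rng.normal(0, s, size=400) for s in (1.0, 1.5, 2.0)])

best_w, best_auc = random_search_auc(toy_preds, y)
single_auc = [auc_rank(y, toy_preds[:, j]) for j in range(3)]
print(best_w, best_auc, single_auc)
```

Weights chosen this way are tuned to the sample they were searched on, so they would need to be checked on held-out folds before being trusted.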

An improved algorithm, the Generalized Ensemble Model with Internally Tuned Hyperparameters (GEM-ITH), was proposed by [Shahhosseini, Hu and Pham (2022)](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C14&q=Optimizing+ensemble+weights+and+hyperparameters+of+machine+learning+models+for+regression+problems&btnG=). It tunes not only the weights of the base models but also the hyperparameters of each base model inside the same loop. I did not implement it and do not intend to.

## 0. Preparing Notebook

### 0.1. Import Packages

In [1]:
import gc
import os

import numpy as np 
import pandas as pd

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import roc_auc_score, mean_squared_error, log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

from lightgbm import LGBMClassifier, LGBMRegressor

import statsmodels.api as sm

from scipy.optimize import minimize

### 0.2. Define Custom Functions

In [2]:
def data_cleaning(df):
    df_copy = df.copy()
    df_copy.set_index('id', inplace=True)
    
    # keep only the last character of the string attributes and cast it to int
    for i in '01':
        col = 'attribute_'+i
        df_copy[col] = df_copy[col].str[-1].astype('int')
    
    for col in ['product_code', 'attribute_0']:
        df_copy[col] = df_copy[col].astype('category')
    
    for col in [f'attribute_{i}' for i in range(1, 4)] + [f'measurement_{i}' for i in range(3)]:
        df_copy[col] = df_copy[col].astype('float64')
    
    df_info = pd.DataFrame([df_copy.dtypes, df_copy.isna().sum()]).T
    df_info.columns = ['dtypes', 'n_of_nan']
        
    return df_copy, df_info


def LGBMImputer(train_df, test_df, y_column_list, X_column_list):

    train_df_copy = train_df.copy()
    test_df_copy = test_df.copy()
    
    for y_col in y_column_list:
        train_na_index = train_df.index[train_df.loc[:, y_col].isna()].tolist()
        train_not_na_index = train_df.index[train_df.loc[:, y_col].notna()].tolist()
        test_na_index = test_df.index[test_df.loc[:, y_col].isna()].tolist()
        
        LGBM_model = LGBMRegressor(n_jobs=-1)
        param_grid = {'learning_rate': [0.1, 0.06, 0.03, 0.01, 0.006, 0.003, 0.001]}

        grid_search = GridSearchCV(LGBM_model,
                                   param_grid,
                                   cv=5
                                  )
        grid_search.fit(train_df_copy.loc[train_not_na_index, X_column_list], train_df_copy.loc[train_not_na_index, y_col])
        final_model = grid_search.best_estimator_
        train_df.loc[train_na_index, y_col] = final_model.predict(train_df_copy.loc[train_na_index, X_column_list])
        test_df.loc[test_na_index, y_col] = final_model.predict(test_df_copy.loc[test_na_index, X_column_list])
    
    return train_df, test_df


def LogitDataTransformer(train_df, test_df, cat_col_list):
    '''
    Custom one-hot encoder that keeps the pd.DataFrame format instead of returning an np.ndarray.
    '''
    
    train_df_copy = train_df.copy()
    test_df_copy = test_df.copy()
    
    drop_col_list = [col+'_'+train_df.loc[:, col].astype(str).min() for col in cat_col_list]
    train_df_copy = pd.get_dummies(train_df_copy, columns=cat_col_list)
    test_df_copy = pd.get_dummies(test_df_copy, columns=cat_col_list)
    train_df_copy.drop(columns=drop_col_list, inplace=True)
    test_df_copy.drop(columns=drop_col_list, inplace=True)
    
    return train_df_copy, test_df_copy


def FeatureEngineering(X_train, X_test):
    '''
    Add interaction terms
    '''
    
    X_train_copy = X_train.copy()
    X_test_copy = X_test.copy()
    
    col_list = X_test_copy.columns.tolist()
    for i1, c1 in enumerate(col_list[:-1]):
        for i2, c2 in enumerate(col_list[i1+1:]):
            X_train_copy.loc[:, c1+'_'+c2] = X_train_copy.loc[:, c1] * X_train_copy.loc[:, c2]
            X_test_copy.loc[:, c1+'_'+c2] = X_test_copy.loc[:, c1] * X_test_copy.loc[:, c2]
    
    return X_train_copy, X_test_copy

### 0.3. ROC-star

credit: IRIDIUMBLUE  
Github: [https://github.com/iridiumblue/roc-star](https://github.com/iridiumblue/roc-star)  
Kaggle: [https://www.kaggle.com/code/iridiumblue/roc-star-an-auc-loss-function-to-challenge-bxe/notebook](https://www.kaggle.com/code/iridiumblue/roc-star-an-auc-loss-function-to-challenge-bxe/notebook)

The original code is written in PyTorch, so I ported it to numpy.
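
As described in the linked repository, the core of ROC-star is a squared hinge penalty on (positive, negative) pairs whose scores are not separated by at least a margin gamma. Here is a minimal numpy sketch of that pairwise term alone, without the epoch bookkeeping, subsampling, or normalisation of the full port; `pairwise_margin_loss` is an illustrative name.

```python
import numpy as np

def pairwise_margin_loss(y_true, y_pred, gamma=0.2):
    """Mean squared hinge over all (positive, negative) pairs.

    A pair contributes when neg - pos + gamma > 0, i.e. the positive
    does not outscore the negative by at least gamma.
    """
    pos = y_pred[y_true == 1]
    neg = y_pred[y_true == 0]
    if pos.size == 0 or neg.size == 0:
        return 0.0                                # no pairs to compare
    diff = neg[None, :] - pos[:, None] + gamma    # shape (n_pos, n_neg)
    return float(np.mean(np.clip(diff, 0.0, None) ** 2))

y = np.array([1, 1, 0, 0])
well_separated = np.array([0.9, 0.8, 0.1, 0.2])   # every positive beats every negative by > gamma
misranked = np.array([0.3, 0.8, 0.6, 0.2])        # first positive scores below a negative

loss_good = pairwise_margin_loss(y, well_separated)
loss_bad = pairwise_margin_loss(y, misranked)
print(loss_good, loss_bad)
```

Unlike AUC itself, this quantity changes smoothly with the predictions, which is what makes it usable as a training loss.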

In [3]:
def epoch_update_gamma(y_true, y_pred, epoch=-1, delta=2):
    """
    numpy-backed version.
    
    Calculate gamma from last epoch's targets and predictions.
    Gamma is updated at the end of each epoch.
    y_true: np.ndarray. Targets (labels). int either 0 or 1.
    y_pred: np.ndarray. Predictions.
    """
    DELTA = delta
    SUB_SAMPLE_SIZE = 2000.0
    pos = y_pred[y_true == 1]
    neg = y_pred[y_true == 0]
    
    # subsample the training set for performance
    cap_pos = pos.shape[0]
    cap_neg = neg.shape[0]
    pos = pos[np.random.uniform(size=cap_pos) < SUB_SAMPLE_SIZE/cap_pos]
    neg = neg[np.random.uniform(size=cap_neg) < SUB_SAMPLE_SIZE/cap_neg]
    ln_pos = pos.shape[0]
    ln_neg = neg.shape[0]
    pos_expand = np.tile(pos.reshape(-1,1), (1,ln_neg)).reshape(-1)
    neg_expand = neg.repeat(ln_pos)
    diff = neg_expand - pos_expand
    ln_All = diff.shape[0]
    Lp = diff[diff>0] # because we're taking positive diffs, we got pos and neg flipped.
    ln_Lp = Lp.shape[0]-1
    diff_neg = -1.0 * diff[diff<0]
    diff_neg = np.sort(diff_neg)
    ln_neg = diff_neg.shape[0]-1
    ln_neg = max([ln_neg, 0])
    left_wing = int(ln_Lp*DELTA)
    left_wing = max([0,left_wing])
    left_wing = min([ln_neg,left_wing])
    default_gamma = np.array([0.2])
    if diff_neg.shape[0] > 0:
        gamma_value = diff_neg[left_wing]
    else:
        gamma_value = default_gamma # fall back to the default gamma
    # pairs that would contribute to the loss at this gamma (not used below)
    L1 = diff[diff > -1.0*gamma_value]
    ln_L1 = L1.shape[0]
    if epoch > -1:
        return gamma_value
    else :
        return default_gamma

def roc_star_loss( _y_true, y_pred, gamma, _epoch_true, epoch_pred):
        """
        numpy-backed version.
        
        _y_true     : np.ndarray. Targets (labels). int either 0 or 1.
        y_pred      : np.ndarray. Predictions.
        gamma       : Gamma, as derived from last epoch.
        _epoch_true : np.ndarray. Targets (labels) from last epoch.
        epoch_pred  : np.ndarray. Predictions from last epoch.
        
        original code:
        https://github.com/iridiumblue/articles/blob/master/roc_star.md
        
        """
        #convert labels to boolean
        y_true = (_y_true>=0.50)
        epoch_true = (_epoch_true>=0.50)

        # if batch is either all true or false return small random stub value.
        if y_true.sum() == 0 or y_true.sum() == y_true.shape[0]:
            return y_pred.sum() * 1e-8

        pos = y_pred[y_true]
        neg = y_pred[~y_true]

        epoch_pos = epoch_pred[epoch_true]
        epoch_neg = epoch_pred[~epoch_true]

        # Take random subsamples of the training set, both positive and negative.
        max_pos = 1000 # Max number of positive training samples
        max_neg = 1000 # Max number of negative training samples
        cap_pos = epoch_pos.shape[0]
        cap_neg = epoch_neg.shape[0]
        epoch_pos = epoch_pos[np.random.uniform(size=cap_pos) < max_pos/cap_pos]
        epoch_neg = epoch_neg[np.random.uniform(size=cap_neg) < max_neg/cap_neg]

        ln_pos = pos.shape[0]
        ln_neg = neg.shape[0]

        # sum positive batch elements against (subsampled) negative elements
        if ln_pos>0:
            pos_expand = np.tile(pos.reshape(-1,1), (1,epoch_neg.shape[0])).reshape(-1)
            neg_expand = epoch_neg.repeat(ln_pos)

            diff2 = neg_expand - pos_expand + gamma
            l2 = diff2[diff2>0]
            m2 = l2 * l2
            len2 = l2.shape[0]
        else:
            m2 = np.array([0.0])
            len2 = 0

        # Similarly, compare negative batch elements against (subsampled) positive elements
        if ln_neg>0 :
            pos_expand = np.tile(epoch_pos.reshape(-1,1), (1, ln_neg)).reshape(-1)
            neg_expand = neg.repeat(epoch_pos.shape[0])

            diff3 = neg_expand - pos_expand + gamma
            l3 = diff3[diff3>0]
            m3 = l3*l3
            len3 = l3.shape[0]
        else:
            m3 = np.array([0.0])
            len3=0

        if (m2.sum()+m3.sum()) != 0 :
           res2 = m2.sum() / max_pos + m3.sum() / max_neg
           #code.interact(local=dict(globals(), **locals()))
        else:
           res2 = m2.sum()+m3.sum()

        res2 = np.where(np.isnan(res2), np.zeros_like(res2), res2)

        return res2


## This part is incomplete, so may contain errors.

## 1. Data

### 1.1. Load Data

In [4]:
train_df, train_df_info = data_cleaning(pd.read_csv('/kaggle/input/tabular-playground-series-aug-2022/train.csv'))
test_df, test_df_info = data_cleaning(pd.read_csv('/kaggle/input/tabular-playground-series-aug-2022/test.csv'))

train_df_y = train_df.loc[:, 'failure']
train_df_X = train_df.drop(columns=['failure'])

submission_intermediate = pd.DataFrame(0, index=test_df.index, columns=['M'+n for n in '0123'])
GEM_df = pd.DataFrame(0, index=train_df.index, columns=['M'+n for n in '0123'])

In [5]:
# train_df_info
# test_df_info
# train_df.head(20)
# test_df.head()
# train_df.columns

#### 1.1.1. Explore Data

In [6]:
product_code_compare = pd.merge(
    train_df_X.product_code.value_counts(), 
    test_df.product_code.value_counts(), 
    left_index=True,
    right_index=True,
    how='outer'
)
product_code_compare.columns = ['train', 'test']
print('Comparison of product_code between train and test\n')
print(product_code_compare)

del product_code_compare
_ = gc.collect()

Comparison of product_code between train and test

    train    test
A  5100.0     NaN
B  5250.0     NaN
C  5765.0     NaN
D  5112.0     NaN
E  5343.0     NaN
F     NaN  5422.0
G     NaN  5107.0
H     NaN  5018.0
I     NaN  5228.0


Since train and test share no product codes, I will not use product_code as a feature.
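
The same train/test comparison is repeated for each attribute in the cells that follow. A small helper could replace the repeated code. This is a sketch with a hypothetical name (`compare_value_counts`) and toy frames standing in for `train_df_X` and `test_df`.

```python
import pandas as pd

def compare_value_counts(train, test, col):
    """Side-by-side value counts of `col` in train and test (NaN = level absent)."""
    out = pd.concat(
        [train[col].value_counts(), test[col].value_counts()],
        axis=1,
        keys=['train', 'test'],
    )
    return out.sort_index()

# toy data: train and test have disjoint product codes, as in the competition data
toy_train = pd.DataFrame({'product_code': list('AABBC')})
toy_test = pd.DataFrame({'product_code': list('FFG')})

table = compare_value_counts(toy_train, toy_test, 'product_code')
overlap = set(toy_train['product_code']) & set(toy_test['product_code'])
print(table)
print('shared levels:', overlap)
```

An empty set of shared levels is the same signal as the all-NaN pattern in the printed table.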

In [7]:
attribute_0_compare = pd.merge(
    train_df_X.attribute_0.value_counts(), 
    test_df.attribute_0.value_counts(), 
    left_index=True,
    right_index=True,
    how='outer'
)
attribute_0_compare.columns = ['train', 'test']
print('Comparison of attribute_0 between train and test\n')
print(attribute_0_compare)

del attribute_0_compare
_ = gc.collect()

Comparison of attribute_0 between train and test

   train   test
5   5250  10529
7  21320  10246


In [8]:
attribute_1_compare = pd.merge(
    train_df_X.attribute_1.value_counts(), 
    test_df.attribute_1.value_counts(), 
    left_index=True,
    right_index=True,
    how='outer'
)
attribute_1_compare.columns = ['train', 'test']
print('Comparison of attribute_1 between train and test\n')
print(attribute_1_compare)

del attribute_1_compare
_ = gc.collect()

Comparison of attribute_1 between train and test

       train     test
5.0  10362.0   5228.0
6.0   5343.0  10529.0
7.0      NaN   5018.0
8.0  10865.0      NaN


In [9]:
attribute_2_compare = pd.merge(
    train_df_X.attribute_2.value_counts(), 
    test_df.attribute_2.value_counts(), 
    left_index=True,
    right_index=True,
    how='outer'
)
attribute_2_compare.columns = ['train', 'test']
print('Comparison of attribute_2 between train and test\n')
print(attribute_2_compare)

del attribute_2_compare
_ = gc.collect()

Comparison of attribute_2 between train and test

       train     test
5.0   5765.0      NaN
6.0  10455.0   5422.0
7.0      NaN   5018.0
8.0   5250.0      NaN
9.0   5100.0  10335.0


In [10]:
attribute_3_compare = pd.merge(
    train_df_X.attribute_3.value_counts(), 
    test_df.attribute_3.value_counts(), 
    left_index=True,
    right_index=True,
    how='outer'
)
attribute_3_compare.columns = ['train', 'test']
print('Comparison of attribute_3 between train and test\n')
print(attribute_3_compare)

del attribute_3_compare
_ = gc.collect()

Comparison of attribute_3 between train and test

       train    test
4.0      NaN  5422.0
5.0   5100.0  5228.0
6.0   5112.0     NaN
7.0      NaN  5107.0
8.0  11015.0     NaN
9.0   5343.0  5018.0


In [11]:
train_df_X.drop(columns=['product_code'], inplace=True)
train_df.drop(columns=['product_code'], inplace=True)
test_df.drop(columns=['product_code'], inplace=True)

### 1.2. Split Data

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    train_df_X,
    train_df_y,
    test_size=0.2,
    random_state=42
)

### 1.3. Impute and Scale Data

I tried both my custom LGBM imputer and scikit-learn's built-in iterative imputer; the LGBM imputer worked better.

In [13]:
X_train_na_info = X_train.isna().sum()
X_test_na_info = X_test.isna().sum()

train_na_col_set = set(X_train_na_info.index[X_train_na_info>0].tolist())
test_na_col_set = set(X_test_na_info.index[X_test_na_info>0].tolist())
na_col_list = sorted(train_na_col_set | test_na_col_set)
non_na_col_list = [col for col in X_train.columns if col not in na_col_list]

X_train, X_test = LGBMImputer(X_train, X_test, na_col_list, non_na_col_list)

# IterImputer = IterativeImputer(random_state=42)
# X_train.loc[:,:] = IterImputer.fit_transform(X_train.values)
# X_test.loc[:,:] = IterImputer.transform(X_test.values)

In [14]:
train_na_col_set = set(train_df_info.index[train_df_info.n_of_nan>0].tolist())
test_na_col_set = set(test_df_info.index[test_df_info.n_of_nan>0].tolist())
na_col_list = sorted(train_na_col_set | test_na_col_set)
non_na_col_list = [f'attribute_{i}' for i in range(4)] + [f'measurement_{i}' for i in range(3)]

train_df_X, test_df = LGBMImputer(train_df_X, test_df, na_col_list, non_na_col_list)

# IterImputer = IterativeImputer(random_state=42)
# train_df_X.loc[:,:] = IterImputer.fit_transform(train_df_X.values)
# test_df.loc[:,:] = IterImputer.transform(test_df.values)

In [15]:
scaler = StandardScaler()

scale_col_list = X_train.select_dtypes(include='number').columns.tolist()
X_train.loc[:, scale_col_list] = scaler.fit_transform(X_train.loc[:, scale_col_list])
X_test.loc[:, scale_col_list] = scaler.transform(X_test.loc[:, scale_col_list])

In [16]:
scaler = StandardScaler()

scale_col_list = train_df.select_dtypes(include='number').columns.tolist()[:-1]
train_df_X.loc[:, scale_col_list] = scaler.fit_transform(train_df_X.loc[:, scale_col_list])
test_df.loc[:, scale_col_list] = scaler.transform(test_df.loc[:, scale_col_list])

## 2. Analysis

### 2.1. Estimate Model

In [17]:
skf = StratifiedKFold(n_splits=5)
skf.get_n_splits(train_df_X, train_df_y)

5

#### 2.1.1. LGBM Classifier

In [18]:
# LGBM_model = LGBMClassifier(n_jobs=-1)
# param_grid = {'boosting_type': ['gbdt', 'dart'],
#               'num_leaves': range(2, 54, 2),
#               'learning_rate': [0.1, 0.06, 0.03, 0.01, 0.006, 0.003, 0.001]
#              }

# grid_search = GridSearchCV(
#     LGBM_model,
#     param_grid,
#     cv=5
# )

# grid_search.fit(
#     X_train, 
#     y_train, 
#     eval_metric='l2', 
#     eval_set=[(X_test, y_test)]
# )

# print(grid_search.best_params_)
# final_model_LGBM = grid_search.best_estimator_

final_model_LGBM = LGBMClassifier(
    n_jobs=-1,
    learning_rate=0.1,
    num_leaves=2
)

for train_index, test_index in skf.split(train_df_X, train_df_y):
    final_model_LGBM.fit(train_df_X.iloc[train_index,:], train_df_y.iloc[train_index])
    y_pred_proba = final_model_LGBM.predict_proba(train_df_X.iloc[test_index,:])[:,1]
    print('roc score is', roc_auc_score(train_df_y.iloc[test_index].values, y_pred_proba))
    GEM_df.iloc[test_index, 0] = y_pred_proba
    submission_intermediate.iloc[:, 0] += (final_model_LGBM.predict_proba(test_df)[:,1])/5

roc score is 0.6171376325037858
roc score is 0.5814157176940389
roc score is 0.5856749902705629
roc score is 0.59398318922486
roc score is 0.5886455354574527
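
The fold loop above has the same shape for every base model: fit on the training folds, store out-of-fold (OOF) probabilities for the weight estimation, and average the test predictions over the folds. A generic helper might look like the following sketch. `oof_predict` and the synthetic data are assumptions, and it uses scikit-learn's `clone` so that each fold starts from an unfitted estimator.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def oof_predict(model, X, y, X_new, n_splits=5):
    """Return (out-of-fold probabilities for X, fold-averaged probabilities for X_new)."""
    skf = StratifiedKFold(n_splits=n_splits)
    oof = np.full(len(X), np.nan)
    new_pred = np.zeros(len(X_new))
    for train_idx, valid_idx in skf.split(X, y):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        oof[valid_idx] = fitted.predict_proba(X[valid_idx])[:, 1]
        new_pred += fitted.predict_proba(X_new)[:, 1] / n_splits
    return oof, new_pred

# synthetic stand-ins for the training features, labels, and test features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] + rng.normal(size=300) > 0).astype(int)
X_new_toy = rng.normal(size=(50, 4))

oof, new_pred = oof_predict(LogisticRegression(), X_toy, y_toy, X_new_toy)
print(oof[:5], new_pred[:5])
```

Every training row receives exactly one OOF prediction, from a model that never saw that row.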


#### 2.1.2. Gaussian Naive Bayes

In [19]:
model_GNB = GaussianNB()

for train_index, test_index in skf.split(train_df_X, train_df_y):
    model_GNB.fit(train_df_X.iloc[train_index,:], train_df_y.iloc[train_index])
    y_pred_proba = model_GNB.predict_proba(train_df_X.iloc[test_index,:])[:,1]
    print('roc score is', roc_auc_score(train_df_y.iloc[test_index].values, y_pred_proba))
    GEM_df.iloc[test_index, 1] = y_pred_proba
    submission_intermediate.iloc[:, 1] += (model_GNB.predict_proba(test_df)[:,1])/5

roc score is 0.5976839549913067
roc score is 0.5072585830555509
roc score is 0.5725589688488807
roc score is 0.5870342560787831
roc score is 0.576526887087768


#### 2.1.3. Logistic Regression

In [20]:
X_train_logit, X_test_logit = LogitDataTransformer(X_train, X_test, ['attribute_0'])
train_df_X_logit, test_df_logit = LogitDataTransformer(train_df_X, test_df, ['attribute_0'])

In [21]:
logit_model = LogisticRegression(max_iter=10000, solver='newton-cg')
logit_params = {'C':np.logspace(-10,0)}

logit_cv = GridSearchCV(
    estimator=logit_model, 
    param_grid=logit_params, 
    scoring='roc_auc',
    cv=5
)
# scoring can be 'neg_log_loss' or 'roc_auc'

logit_cv.fit(X_train, y_train)
print(logit_cv.best_params_)

final_model_Logit = LogisticRegression(C=logit_cv.best_params_['C'], max_iter=10000, solver='newton-cg')
# final_model_Logit = LogisticRegression(C=0.000868511373751352, max_iter=10000, solver='newton-cg')

for train_index, test_index in skf.split(train_df_X_logit, train_df_y):
    final_model_Logit.fit(train_df_X_logit.iloc[train_index,:], train_df_y.iloc[train_index])
    y_pred_proba = final_model_Logit.predict_proba(train_df_X_logit.iloc[test_index,:])[:,1]
    print('roc score is', roc_auc_score(train_df_y.iloc[test_index].values, y_pred_proba))
    GEM_df.iloc[test_index, 2] = y_pred_proba
    submission_intermediate.iloc[:, 2] += (final_model_Logit.predict_proba(test_df_logit)[:,1])/5



{'C': 0.000868511373751352}
roc score is 0.6014389829127393
roc score is 0.5856839794243558
roc score is 0.582732787356808
roc score is 0.5944322238954973
roc score is 0.5851807983214606


In [22]:
logit_cv.best_params_['C']

0.000868511373751352

#### 2.1.4. Probit Regression


In [23]:
for train_index, test_index in skf.split(train_df_X_logit, train_df_y):
    probit_model = sm.Probit(train_df_y.iloc[train_index], train_df_X_logit.iloc[train_index,:]).fit(maxiter=10000)
    y_pred_proba = probit_model.predict(train_df_X_logit.iloc[test_index,:])
    print('roc score is', roc_auc_score(train_df_y.iloc[test_index].values, y_pred_proba))
    GEM_df.iloc[test_index, 3] = y_pred_proba
    submission_intermediate.iloc[:, 3] += probit_model.predict(test_df_logit)/5

Optimization terminated successfully.
         Current function value: 0.521913
         Iterations 5
roc score is 0.550995848558636
Optimization terminated successfully.
         Current function value: 0.512617
         Iterations 5
roc score is 0.568236560686306
Optimization terminated successfully.
         Current function value: 0.512786
         Iterations 5
roc score is 0.5813397857831774
Optimization terminated successfully.
         Current function value: 0.506558
         Iterations 5
roc score is 0.5900957503511057
Optimization terminated successfully.
         Current function value: 0.507271
         Iterations 5
roc score is 0.5922572716966447


### 2.2. Estimate Optimal Weight for Ensemble using Generalized Ensemble Model (GEM) Algorithm

In [24]:
GEM_df.head()

          M0        M1        M2        M3
id
0   0.838986  0.244903  0.201785  0.011233
1   0.826552  0.252257  0.196445  0.009262
2   0.833690  0.269076  0.204666  0.009568
3   0.876592  0.291542  0.229822  0.011740
4   0.913360  0.358663  0.285113  0.025414


In [25]:
train_df_y.head()

id
0    0
1    0
2    0
3    0
4    0
Name: failure, dtype: int64

In [26]:
def GEM_obj(weight, y_true, GEM_df):
    y_pred = pd.Series(0, index=y_true.index)
    for i in range(4):
        y_pred += weight[i] * GEM_df.iloc[:, i]
    
    return log_loss(y_true, y_pred)

init_params = np.array([1/4, 1/4, 1/4, 1/4])
constraint = {'type':'eq', 'fun': lambda x: sum(x)-1}
bound = [(0,1), (0,1), (0,1), (0,1)]
opt_result = minimize(
    fun=GEM_obj,
    x0=init_params,
    args=(train_df_y, GEM_df),
    method='SLSQP',
    bounds=bound,
    constraints=constraint,
    options={'maxiter':10000}
)
print(opt_result)

     fun: 0.5100997593174399
     jac: array([0.01975389, 0.00333703, 0.00132555, 0.00541636])
 message: 'Optimization terminated successfully'
    nfev: 45
     nit: 9
    njev: 9
  status: 0
 success: True
       x: array([0.00000000e+00, 1.03449069e-16, 1.00000000e+00, 0.00000000e+00])


Earlier runs, in which the objective minimized MSE, RMSE, or BCE, gave the same result: M2 (Logit) receives a weight of 1.
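
To make the MSE variant of the objective concrete, here is a sketch of the same constrained problem (weights in [0, 1] that sum to 1) written against plain arrays rather than `GEM_df`. The synthetic predictions and the name `fit_simplex_weights` are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def fit_simplex_weights(preds, y_true):
    """Minimise MSE of the weighted blend subject to 0 <= w_i <= 1 and sum(w) = 1."""
    n_models = preds.shape[1]

    def mse(w):
        return np.mean((preds @ w - y_true) ** 2)

    result = minimize(
        fun=mse,
        x0=np.full(n_models, 1 / n_models),   # start from equal weights
        method='SLSQP',
        bounds=[(0, 1)] * n_models,
        constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1},
    )
    return result.x, result.fun

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500).astype(float)
# three hypothetical base models with different noise levels
toy_preds = np.column_stack([y + rng.normal(0, s, size=500) for s in (0.3, 0.5, 0.8)])

w, blend_mse = fit_simplex_weights(toy_preds, y)
equal_mse = np.mean((toy_preds.mean(axis=1) - y) ** 2)
print(w.round(3), blend_mse, equal_mse)
```

Comparing against the equal-weight blend gives a quick sanity check that the optimizer actually improved on its starting point.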

## 3. Create Submission File

In [27]:
def UnitScaler(array):
    q1 = array.min()
    q2 = array.max() - q1
    return (array - q1) / q2

for i in range(4):
    submission_intermediate.iloc[:, i] = UnitScaler(submission_intermediate.iloc[:, i].values)
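
Min-max scaling is a monotonic transformation, so it should leave the ordering of each model's predictions (and therefore that model's own ROC AUC) unchanged. It only changes how much each model contributes to the weighted sum. Here is a quick check on synthetic scores; the data are made up, and `unit_scale` repeats the arithmetic of `UnitScaler`, which would divide by zero for a constant column.

```python
import numpy as np

def unit_scale(array):
    """Min-max scale to [0, 1]; same arithmetic as UnitScaler."""
    low = array.min()
    return (array - low) / (array.max() - low)

rng = np.random.default_rng(0)
raw = rng.uniform(0.005, 0.03, size=200)   # hypothetical model with a narrow output range
scaled = unit_scale(raw)

# identical argsort means identical ranking of the test rows
same_order = np.array_equal(np.argsort(raw), np.argsort(scaled))
print(scaled.min(), scaled.max(), same_order)
```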

In [28]:
submission_file = pd.Series(0, index=test_df.index, name='failure')
# for i in range(4):
    # submission_file += submission_intermediate.iloc[:, i] * opt_result.x[i]
    # submission_file += submission_intermediate.iloc[:, i] / 4

submission_file += submission_intermediate.iloc[:, 0] * 0.1
submission_file += submission_intermediate.iloc[:, 1] * 0.1
submission_file += submission_intermediate.iloc[:, 2] * 0.5
submission_file += submission_intermediate.iloc[:, 3] * 0.3
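
The four weighted additions can also be written as one matrix-vector product, which keeps the weights in a single place and makes it easy to check that they sum to 1. This is a sketch with a toy frame standing in for `submission_intermediate`; the values are made up.

```python
import numpy as np
import pandas as pd

weights = np.array([0.1, 0.1, 0.5, 0.3])   # the hand-picked weights used in this notebook

# toy stand-in for submission_intermediate (two test rows, four base models)
toy_intermediate = pd.DataFrame(
    [[0.2, 0.4, 0.6, 0.8],
     [1.0, 0.0, 0.5, 0.5]],
    index=pd.Index([26570, 26571], name='id'),
    columns=['M0', 'M1', 'M2', 'M3'],
)

blend = pd.Series(
    toy_intermediate.values @ weights,
    index=toy_intermediate.index,
    name='failure',
)
print(blend)
```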


In [29]:
submission_file.head(10)

id
26570    0.423524
26571    0.353054
26572    0.384163
26573    0.382379
26574    0.611270
26575    0.353534
26576    0.327974
26577    0.461960
26578    0.291347
26579    0.372029
Name: failure, dtype: float64

In [30]:
submission_file.to_csv('submission.csv')