https://www.kaggle.com/remekkinas/ensemble-learning-meta-classifier-for-stacking  
https://www.kaggle.com/hiro5299834/tps-apr-2021-pseudo-labeling-voting-ensemble  
https://www.kaggle.com/c/tabular-playground-series-apr-2021/discussion/231738  

Alright, we're diving in blind to study and learn this.

## 1. Description

ensemble-learning meta-classifier for stacking using cross-validation

rubber duck method
> Stacking is an ensemble learning technique that combines multiple classification models via a meta-classifier. The base models form the **First Level**; the meta-classifier is the **Second Level** and learns from the first-level predictions.  
In standard stacking, the **First Level** classifiers are fit on the same training set that is then used to prepare the inputs for the **Second Level** classifier, which may lead to overfitting. To avoid that, **StackingCVClassifier** uses k-fold cross-validation: in each of k rounds, k-1 folds are used to fit the **First Level** classifiers, which then predict the held-out fold.
The held-out predictions are stacked and used as input data for the **Second Level** classifier. After the **StackingCVClassifier** is trained, the first-level classifiers are refit on the entire dataset.
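
To make the description above concrete, here is a minimal sketch of cross-validated stacking built by hand with scikit-learn only. It is not the mlxtend implementation: the synthetic data, the two base models, and the choice to keep only the class-1 probability as a meta-feature are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the real training set (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# First level: two arbitrary base classifiers (illustrative choice).
first_level = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(),
]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities: every training row is predicted by a model
# that was fit on the other k-1 folds, so it never saw that row.
meta_train = np.column_stack([
    cross_val_predict(clf, X_tr, y_tr, cv=cv, method="predict_proba")[:, 1]
    for clf in first_level
])

# Second level: the meta-classifier learns from the stacked predictions.
meta_clf = LogisticRegression().fit(meta_train, y_tr)

# After training, refit the first-level models on the whole training set
# and use them to build the meta-features for unseen data.
for clf in first_level:
    clf.fit(X_tr, y_tr)
meta_test = np.column_stack(
    [clf.predict_proba(X_te)[:, 1] for clf in first_level]
)

stacked_accuracy = meta_clf.score(meta_test, y_te)
print(f"stacked accuracy: {stacked_accuracy:.3f}")
```

The meta-classifier only ever sees predictions made on held-out folds, which is what keeps the second level from learning how well the first level memorised its own training rows.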

## 2. Load The Dataset



In [38]:
import numpy as np
import pandas as pd
import itertools    

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score,  KFold, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from mlxtend.classifier import StackingCVClassifier

from sklearn.metrics import roc_auc_score
import pickle

In [39]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
submission_df = pd.read_csv('sample_submission.csv')

test_df['Survived'] = pd.read_csv('kaggle_submission.csv')['Survived']

In [40]:
all_df = pd.concat([train_df,test_df], axis=0)

### Pseudolabelling  

> Pseudo-labelling is the technique of using unlabelled (in some cases test) data together with labelled train data to create better models. But how can we use unlabelled data if we need to train a supervised model? The answer is a two-stage approach:  
◽ in stage 1 we train a model on the train data only and predict labels for the unlabelled data  
◽ in stage 2 we use the stage 1 predictions as «pseudo» labels for the unlabelled data, concatenate the pseudo-labelled data with the labelled train data, and fit the model on the combined dataset
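
The two stages can be sketched as follows. This is a toy illustration under stated assumptions: the data is synthetic, the 200/300 labelled/unlabelled split is arbitrary, and `LogisticRegression` is used for both stages only for brevity. In this notebook the pseudo-labels are instead read from `kaggle_submission.csv`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data; the first 200 rows play the labelled train set and the
# remaining 300 rows play the unlabelled (test) set (assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_labelled, y_labelled = X[:200], y[:200]
X_unlabelled = X[200:]  # their true labels are treated as unknown

# Stage 1: fit on labelled data only, then predict the unlabelled rows.
stage1 = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)
pseudo_labels = stage1.predict(X_unlabelled)

# Stage 2: concatenate labelled and pseudo-labelled rows and refit.
X_combined = np.vstack([X_labelled, X_unlabelled])
y_combined = np.concatenate([y_labelled, pseudo_labels])
stage2 = LogisticRegression(max_iter=1000).fit(X_combined, y_combined)

print(X_combined.shape, y_combined.shape)
```

Because the stage 2 model is trained partly on its predecessor's guesses, any score measured on the pseudo-labelled rows is optimistic; a fair estimate should use only rows with true labels.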

In [41]:
test_df

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,100000,3,"Holliday, Daniel",male,19.0,0,0,24745,63.01,,S,0
1,100001,3,"Nguyen, Lorraine",female,53.0,0,0,13264,5.81,,S,1
2,100002,1,"Harris, Heather",female,19.0,0,0,25990,38.91,B15315,C,1
3,100003,2,"Larsen, Eric",male,25.0,0,0,314011,12.93,,S,0
4,100004,1,"Cleary, Sarah",female,17.0,0,2,26203,26.89,B22515,C,1
...,...,...,...,...,...,...,...,...,...,...,...,...
99995,199995,3,"Cash, Cheryle",female,27.0,0,0,7686,10.12,,Q,1
99996,199996,1,"Brown, Howard",male,59.0,1,0,13004,68.31,,S,0
99997,199997,3,"Lightfoot, Cameron",male,47.0,0,0,4383317,10.87,,S,0
99998,199998,1,"Jacobsen, Margaret",female,49.0,1,2,PC 26988,29.68,B20828,C,1


In [42]:
np.random.seed(42)

# shuffle the dataset
all_df = all_df.sample(frac=1).reset_index(drop=True)

# dropping the unneeded columns in the dataset
unneeded_cols = ['Cabin', 'PassengerId', 'Name', 'Ticket']
all_df = all_df.drop(columns=unneeded_cols)
train_df = train_df.drop(columns=unneeded_cols)
test_df = test_df.drop(columns=unneeded_cols)
    
all_df = all_df[all_df['Embarked'].notna()]
all_df = all_df[all_df['Fare'].notna()]
#test_df = test_df[test_df['Embarked'].notna()]

# converting categorical values to numeric    
cat_features = ['Pclass','Sex', 'Embarked', 'SibSp', 'Parch']

# categorical_transformer = Pipeline(steps=[
#     ("imputer", SimpleImputer(strategy='most_frequent')),
#     ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# # filling missing values
# num_features = ['Age']
# numeric_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='most_frequent'))])

# # preprocess and modeling pipeline
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('cat',categorical_transformer, cat_features),
#         ('num', numeric_transformer, num_features),
#     ])


In [44]:
imp_freq = SimpleImputer(strategy='most_frequent')
one_hot = OneHotEncoder(handle_unknown="ignore")
for cat in cat_features:
    all_df[cat] = all_df[cat].fillna(all_df[cat].value_counts().index[0])
    test_df[cat] = test_df[cat].fillna(test_df[cat].value_counts().index[0])

all_df['Age'] = all_df['Age'].fillna(all_df['Age'].mode()[0])
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].mode()[0])
all_df = pd.get_dummies(all_df, columns=cat_features)
test_df = pd.get_dummies(test_df, columns=cat_features)


In [45]:
X = all_df.drop('Survived', axis=1)
y = all_df.Survived

# split using train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [33]:
models = {
    'Logistic Regression' : LogisticRegression(solver='liblinear'),
    'Ridge Classifier' : RidgeClassifier(),
    'SGDClassifier' : SGDClassifier(),
    'RandomForest' : RandomForestClassifier(),
    'ExtraTreesClassifier' : ExtraTreesClassifier(),
    'KNC': KNeighborsClassifier(),
    'DecisionTreeClassifier':DecisionTreeClassifier(),
    'LGBMClassifier' : LGBMClassifier(verbose=-1),
    'XGBClassifier' : XGBClassifier(verbosity=0, use_label_encoder=False),  # XGBoost's flag is `verbosity`; `verbose` is not a recognised parameter
    'CatBoostClassifier' : CatBoostClassifier(verbose=0),
}


In [34]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    model_scores = {}
    for name, model in models.items(): 
        model.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

In [17]:
model_scores = fit_and_score(models=models, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)
model_scores

Parameters: { "verbose" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.






{'Logistic Regression': 0.8534210130013553,
 'Ridge Classifier': 0.8339775446346401,
 'SGDClassifier': 0.754346334688687,
 'RandomForest': 0.8464099861118083,
 'ExtraTreesClassifier': 0.8306979234643509,
 'KNC': 0.8397168816826465,
 'DecisionTreeClassifier': 0.8129444639659991,
 'LGBMClassifier': 0.8795408530361595,
 'XGBClassifier': 0.8783862925221291,
 'CatBoostClassifier': 0.8790388702039723}

In [19]:
import operator
max(model_scores.items(), key=operator.itemgetter(1))[0]

'LGBMClassifier'

In [20]:
test_df = test_df.drop('Survived', axis=1)

In [21]:
# out-of-fold predictions for the train rows, fold predictions for the pseudo-labelled test rows
dtm_oof = np.zeros(train_df.shape[0])
dtm_preds = np.zeros(test_df.shape[0])
feature_importances = pd.DataFrame()


skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# NOTE: positions below train_df.shape[0] are treated as train rows and the rest as test rows,
# which only holds while all_df keeps the original train-then-test row order
for fold, (train_idx, valid_idx) in enumerate(skf.split(all_df, all_df['Survived'])):
    print(f"===== FOLD {fold} =====")
    # validation rows that came from train.csv vs. from test.csv
    oof_idx = np.array([idx for idx in valid_idx if idx < train_df.shape[0]])
    preds_idx = np.array([idx for idx in valid_idx if idx >= train_df.shape[0]])

    X_train, y_train = all_df.iloc[train_idx].drop('Survived', axis=1), all_df.iloc[train_idx]['Survived']
    X_valid, y_valid = all_df.iloc[oof_idx].drop('Survived', axis=1), all_df.iloc[oof_idx]['Survived']
    X_test = all_df.iloc[preds_idx].drop('Survived', axis=1)

    model = LGBMClassifier()
    model.fit(X_train, y_train)

    # keep the predicted probability of class 1
    dtm_oof[oof_idx] = model.predict_proba(X_valid)[:,1]
    dtm_preds[preds_idx-train_df.shape[0]] = model.predict_proba(X_test)[:,1]

    acc_score = accuracy_score(y_valid, np.where(dtm_oof[oof_idx]>0.5, 1, 0))
    print(f"===== ACCURACY SCORE {acc_score:.6f} =====\n")

# overall out-of-fold accuracy on the train rows
acc_score = accuracy_score(all_df[:train_df.shape[0]]['Survived'], np.where(dtm_oof>0.5, 1, 0))
print(f"===== ACCURACY SCORE {acc_score:.6f} =====")

===== FOLD 0 =====
===== ACCURACY SCORE 0.879594 =====

===== FOLD 1 =====
===== ACCURACY SCORE 0.886933 =====

===== FOLD 2 =====
===== ACCURACY SCORE 0.877649 =====

===== FOLD 3 =====
===== ACCURACY SCORE 0.880572 =====

===== FOLD 4 =====
===== ACCURACY SCORE 0.881900 =====

===== FOLD 5 =====
===== ACCURACY SCORE 0.881951 =====

===== FOLD 6 =====
===== ACCURACY SCORE 0.886760 =====

===== FOLD 7 =====
===== ACCURACY SCORE 0.877648 =====

===== FOLD 8 =====
===== ACCURACY SCORE 0.876981 =====

===== FOLD 9 =====
===== ACCURACY SCORE 0.879043 =====

===== ACCURACY SCORE 0.880900 =====


In [49]:
test_df = test_df.drop(columns='Survived', errors='ignore')  # no-op if 'Survived' was already dropped in an earlier cell

In [52]:
submission_df['Survived'] = model.predict(test_df)
submission_df.to_csv('LGBMClassifier.csv', index = False)

In [53]:
submission_df

Unnamed: 0,PassengerId,Survived
0,100000,0
1,100001,1
2,100002,1
3,100003,0
4,100004,1
...,...,...
99995,199995,1
99996,199996,0
99997,199997,0
99998,199998,1


### STACKING

In [54]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
submission_df = pd.read_csv('sample_submission.csv')

test_df['Survived'] = pd.read_csv('kaggle_submission.csv')['Survived']

In [55]:
all_df = pd.concat([train_df,test_df], axis=0)

In [56]:
all_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,1,"Oconnor, Frankie",male,,2,0,209245,27.14,C12239,S
1,1,0,3,"Bryan, Drew",male,,0,0,27323,13.35,,S
2,2,0,3,"Owens, Kenneth",male,0.33,1,2,CA 457703,71.29,,S
3,3,0,3,"Kramer, James",male,19.00,0,0,A. 10866,13.04,,S
4,4,1,3,"Bond, Michael",male,25.00,0,0,427635,7.76,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
99995,199995,1,3,"Cash, Cheryle",female,27.00,0,0,7686,10.12,,Q
99996,199996,0,1,"Brown, Howard",male,59.00,1,0,13004,68.31,,S
99997,199997,0,3,"Lightfoot, Cameron",male,47.00,0,0,4383317,10.87,,S
99998,199998,1,1,"Jacobsen, Margaret",female,49.00,1,2,PC 26988,29.68,B20828,C


In [57]:
score_list = [[name, score] for name, score in model_scores.items()]

In [58]:
# sort by score, highest first
score_list.sort(key=lambda pair: pair[1], reverse=True)
score_list

[['LGBMClassifier', 0.8795408530361595],
 ['CatBoostClassifier', 0.8790388702039723],
 ['XGBClassifier', 0.8783862925221291],
 ['Logistic Regression', 0.8534210130013553],
 ['RandomForest', 0.8464099861118083],
 ['KNC', 0.8397168816826465],
 ['Ridge Classifier', 0.8339775446346401],
 ['ExtraTreesClassifier', 0.8306979234643509],
 ['DecisionTreeClassifier', 0.8129444639659991],
 ['SGDClassifier', 0.754346334688687]]

In [78]:
cl1 = LogisticRegression(solver='liblinear')
cl2 = RidgeClassifier()
cl3 = SGDClassifier()
cl4 = RandomForestClassifier()
cl5 = ExtraTreesClassifier()
cl6 = KNeighborsClassifier()
cl7 = DecisionTreeClassifier()
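# device='gpu' needs a LightGBM build compiled with GPU support; use LGBMClassifier() on a CPU-only build
cl8 = LGBMClassifier(device = 'gpu',gpu_platform_id = '1', gpu_device_id = '1')
cl9 = XGBClassifier(eval_metric='logloss')  # 'binary:logistic' is an objective, not an eval metric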
cl10 = CatBoostClassifier(verbose = None, logging_level = 'Silent')

In [73]:
classifiers = {
    "LGBMClassifier": cl8,
    "XGBClassifier": cl9,
    "CatBoostClassifier": cl10,
    "RandomForest": cl4,
    "LogisticRegression": cl1,
}

In [74]:
taken_classifiers = ['LGBMClassifier', 'XGBClassifier', 'CatBoostClassifier', 'RandomForest', 'LogisticRegression']

In [75]:
mlr=LogisticRegression()

def best_stacking_search():
    cls_list = []
    best_score = -1  # best hold-out accuracy seen so far
    i=0

    best_cls_experiment = list()

    print(">>>> Training started <<<<")

    # every subset of 2..N candidate classifiers
    for cls_comb in range(2, len(taken_classifiers)+1):
        for subset in itertools.combinations(taken_classifiers, cls_comb):
            cls_list.append(subset)

    print(f"Total number of model combination: {len(cls_list)}")


    for cls_exp in cls_list:
        # look up the estimators that belong to this particular subset
        classifier_exp = [classifiers[label] for label in cls_exp]

        sclf = StackingCVClassifier(classifiers = classifier_exp,
                                    shuffle = False,
                                    use_probas = True,
                                    cv = 5,
                                    meta_classifier = mlr,
                                    n_jobs = -1)

        sclf.fit(X_train, y_train)
        score = sclf.score(X_test, y_test)  # accuracy on the hold-out split

        # remember the best combination and its score
        if score > best_score:
            best_score = score
            best_cls_experiment = list(cls_exp)
        i += 1
        print(f"  {i} - Stacked combination - Acc {cls_exp}: {score:.5f}")

    return best_cls_experiment

In [76]:
import warnings
warnings.filterwarnings('ignore')

In [79]:
best_stacking_search()

>>>> Training started <<<<
Total number of model combination: 26


LightGBMError: GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1
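
The count of 26 combinations reported in the output can be checked by hand: with five candidate classifiers and subset sizes 2 through 5, the total is C(5,2) + C(5,3) + C(5,4) + C(5,5) = 10 + 10 + 5 + 1 = 26. A small standalone sketch that reproduces the enumeration (the list of names is copied from the cell above; nothing is trained here):

```python
import itertools
from math import comb

taken_classifiers = [
    "LGBMClassifier",
    "XGBClassifier",
    "CatBoostClassifier",
    "RandomForest",
    "LogisticRegression",
]

# Enumerate every subset of size 2..5, as the search loop does.
subsets = [
    subset
    for size in range(2, len(taken_classifiers) + 1)
    for subset in itertools.combinations(taken_classifiers, size)
]

# Number of subsets per size, from the binomial coefficient.
per_size = {
    size: comb(len(taken_classifiers), size)
    for size in range(2, len(taken_classifiers) + 1)
}

print(per_size)      # {2: 10, 3: 10, 4: 5, 5: 1}
print(len(subsets))  # 26
```

Each of these 26 subsets triggers a full 5-fold stacking fit, so the search cost grows quickly as more candidate classifiers are added.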