# **Ensembles**

### **What is it?**
**Ensemble methods** are techniques in machine learning that leverage the power of **multiple models** to enhance predictive performance. These methods combine various individual models to create a **more robust and accurate predictive model**. By aggregating the predictions of several models, ensemble methods aim to mitigate the weaknesses of individual models and **reduce the risk of overfitting**, ultimately leading to **improved accuracy** and robustness in predictions.

### **Why are they better?**

##### **Greater accuracy**
* By combining the predictions of multiple models, ensemble methods can **reduce the errors** that are present in **individual models**. This combination allows for leveraging the **best aspects of each model**, resulting in more precise predictions.

##### **Harder to overfit**
* Individual models may be prone to overfitting, but using multiple models helps **reduce the likelihood** of any single **model overfitting** to the training data. This ensemble approach balances the tendencies of different models and creates a **more stable** and generalized **output**.

##### **Better generalization**
* Due to the aforementioned advantages, ensemble methods tend to **generalize better on new data**. This makes them **highly valuable** for practical applications where the model's performance on **unseen data is crucial**.

### **Examples**

##### **Bagging**
* Uses **sampling with replacement** to create different **subsets** of the data, trains a model on each subset, and **combines** the models' **predictions** (e.g., **Random Forest**).
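
Sampling with replacement can be sketched as follows. This is a minimal illustration of the bootstrap step only, assuming a hypothetical dataset of 1000 rows and 10 models; the variable names are made up for the example and no model is trained.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_models = 1000, 10  # hypothetical sizes, chosen for illustration

# One bootstrap sample per model: n_rows row indices drawn with replacement,
# so some rows appear several times and others not at all.
bootstrap_indices = [rng.integers(0, n_rows, size=n_rows) for _ in range(n_models)]

# Fraction of distinct rows that each bootstrap sample actually contains.
unique_fraction = [np.unique(idx).size / n_rows for idx in bootstrap_indices]
print(np.round(unique_fraction, 3))
```

Each sample typically covers roughly 63% of the distinct rows, so every model in the bag sees a slightly different view of the data.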

##### **Boosting**
* Models are trained **iteratively**, and each model attempts to **correct the errors of previous** models. Examples include **AdaBoost**, **Gradient Boosting**, and **XGBoost** (we've already used 2 of these in previous notebooks).
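
The idea of each model correcting its predecessors can be sketched in a few lines. The example below is a minimal gradient-boosting loop for regression with squared loss, using one-split decision stumps on synthetic data. It is an illustration under those assumptions, not the exact algorithm behind AdaBoost or XGBoost; the helpers `fit_stump` and `predict_stump` and the toy data are made up for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the maximum so both sides are non-empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

# Synthetic 1-D regression problem (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # round 0: a constant model
mse_per_round = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residual = y - prediction            # errors of the ensemble so far
    stump = fit_stump(x, residual)       # the next weak model targets those errors
    prediction += learning_rate * predict_stump(stump, x)
    mse_per_round.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_per_round[0]:.4f}, after 20 rounds: {mse_per_round[-1]:.4f}")
```

The training error shrinks round by round because every new stump is fitted to what the current ensemble still gets wrong.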

##### **Stacking**
* Uses a **meta-model** that learns how to **combine the predictions of multiple different models**. The individual models are called base models, and their **predictions** serve as the **input for the meta-model**. In this notebook, the meta-model (**meta-learner**) is the `final_estimator` of `StackingClassifier`.
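
To make the flow of predictions concrete, here is a hand-rolled sketch of stacking. It assumes synthetic data from `make_classification` and two base models picked for illustration. It roughly mirrors what `StackingClassifier` automates, but it is a simplified sketch rather than that class's actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

base_models = [LogisticRegression(max_iter=1000), KNeighborsClassifier()]

# Out-of-fold class-1 probabilities: every training row is scored by a model
# that did not see that row, which keeps the meta-model from learning on leaked labels.
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-model learns how to weight the base models' predictions.
meta_model = LogisticRegression().fit(meta_train, y_tr)

# At prediction time, base models refitted on all training rows feed the meta-model.
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])
stacked_pred = meta_model.predict(meta_test)
print("Stacked accuracy:", (stacked_pred == y_te).mean())
```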

##### **Voting**
* Combines predictions by **"voting"**. It can use **"hard voting"** or **"soft voting"**. The former uses **majority voting**, i.e., the class predicted by most models wins, while **soft voting** averages the models' predicted class probabilities and selects the class with the **highest average probability.**
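
The two voting rules can disagree, as this small sketch shows. The probabilities are made-up numbers for three hypothetical classifiers and four samples, not outputs of the models in this notebook.

```python
import numpy as np

# Hypothetical class-1 probabilities (rows = classifiers, columns = samples).
proba = np.array([
    [0.90, 0.40, 0.45, 0.20],
    [0.40, 0.45, 0.60, 0.30],
    [0.45, 0.95, 0.55, 0.10],
])

# Hard voting: each classifier casts one vote (its own thresholded label),
# and the label chosen by more than half of the classifiers wins.
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=0) > proba.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then threshold the mean.
soft = (proba.mean(axis=0) >= 0.5).astype(int)

print("hard:", hard)  # hard: [0 0 1 0]
print("soft:", soft)  # soft: [1 1 1 0]
```

In the first two samples a single very confident classifier is outvoted under hard voting, but it pulls the average above 0.5 under soft voting.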

There are also other approaches, such as:
* **Blending**
* **Bucket of Models**
* **Cascade Generalization**

These will not be covered in this notebook.

**Let's get to work!**

### Data loading

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import warnings
import logging
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/dapprojekt24-1/train.csv
/kaggle/input/dapprojekt24-1/test.csv


In [2]:
# Suppress all warnings
warnings.filterwarnings("ignore")

# Disable LightGBM info messages
logging.getLogger('lightgbm').setLevel(logging.ERROR)

In [3]:
train_data = pd.read_csv("/kaggle/input/dapprojekt24-1/train.csv")
test_data = pd.read_csv("/kaggle/input/dapprojekt24-1/test.csv")


In [4]:
train_data.head()

|  | Date | Symbol | Adj Close | Close | High | Low | Open | Volume | Target | Id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-01-04 | MMM | 53.29538 | 83.019997 | 83.449997 | 82.669998 | 83.089996 | 3043700.0 | 0 | 0 |
| 1 | 2010-01-05 | MMM | 52.961575 | 82.5 | 83.230003 | 81.699997 | 82.800003 | 2847000.0 | 0 | 1 |
| 2 | 2010-01-06 | MMM | 53.712681 | 83.669998 | 84.599998 | 83.510002 | 83.879997 | 5268500.0 | 0 | 2 |
| 3 | 2010-01-07 | MMM | 53.751179 | 83.730003 | 83.760002 | 82.120003 | 83.32 | 4470100.0 | 0 | 3 |
| 4 | 2010-01-08 | MMM | 54.129955 | 84.32 | 84.32 | 83.300003 | 83.690002 | 3405800.0 | 0 | 4 |


### Imports

In [8]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score


### Models
I will use two sets of models: the first contains the individual models I used in the mandatory notebook, and the second contains ensembles, so that we can compare them.

In [19]:
models = {
    'GaussianNB': GaussianNB(),
    'LogisticRegression': LogisticRegression(),
    'XGBClassifier': XGBClassifier()
}

ensembles = {
    # Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4).
    'Bagging_DT': BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42),
    'Bagging_LR': BaggingClassifier(estimator=LogisticRegression(), n_estimators=10, random_state=42),
    'AdaBoost': AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=4), n_estimators=50, learning_rate=1.0, random_state=42),
    'Voting': VotingClassifier(estimators=[
        ('lr', LogisticRegression()),
        ('dt', DecisionTreeClassifier()),
        ('gnb', XGBClassifier())  # the label says 'gnb', but this estimator is XGBoost
    ], voting='hard'),
    # The meta-model (final_estimator) combines the predictions of the two base models.
    'Stacking': StackingClassifier(estimators=[
        ('lr', LogisticRegression()),
        ('knn', KNeighborsClassifier())
    ], final_estimator=LogisticRegression())
}


### Data preparation

In [16]:
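all_features = ['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume', 'Target']
features = ['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume']
train_data1 = train_data[all_features]

# Keep only rows where every value is non-negative. NaN >= 0 evaluates to False,
# so rows with missing values are dropped here as well.
train_data1 = train_data1[(train_data1 >= 0).all(axis=1)].copy()

# Safeguard only: after the filter above, no NaN values remain.
train_data1.fillna(-1, inplace=True)

X = train_data1[features]
y = train_data1['Target']

# Random 80/20 split; rows are shuffled, so dates are mixed between the two parts.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)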
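
The effect of the `>= 0` filter on missing values can be checked on a toy frame. This is a small sketch with made-up values, not the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame: one clean row, one row with a missing value, one with a negative value.
df = pd.DataFrame({"Close": [10.0, np.nan, -1.0], "Target": [1, 0, 1]})

# NaN >= 0 is False, so the row with the missing value fails the check
# just like the row with the negative value does.
kept = df[(df >= 0).all(axis=1)]

print(kept)
print("Missing values left:", int(kept.isna().sum().sum()))
```

Only the first row survives, which is why filling NaN values after this filter has nothing left to fill.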

### First set of models

In [17]:
results1 = []

best_model1 = None
best_f1_score1 = 0

for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # predict() already returns 0/1 class labels, so this threshold leaves them unchanged
    rounded_predictions = (predictions >= 0.5).astype(int)
    f1 = f1_score(y_test, rounded_predictions)
    print(f'{name} - F1 score: {f1}')
    
    if f1 > best_f1_score1:
        best_model1 = model
        best_f1_score1 = f1

    results1.append({'Model': name, 'F1 Score': f1})

print(f'Best model: {best_model1}')
print(f'Best F1 score: {best_f1_score1}')


GaussianNB - F1 score: 0.877086296823139
LogisticRegression - F1 score: 0.8784048361242393
XGBClassifier - F1 score: 0.8798497080960251
Best model: XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
Best F1 score: 0.8798497080960251


### New set of models

In [20]:
results2 = []

best_model2 = None
best_f1_score2 = 0

for name, model in ensembles.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # predict() already returns 0/1 class labels, so this threshold leaves them unchanged
    rounded_predictions = (predictions >= 0.5).astype(int)
    f1 = f1_score(y_test, rounded_predictions)
    print(f'{name} - F1 score: {f1}')
    
    if f1 > best_f1_score2:
        best_model2 = model
        best_f1_score2 = f1

    results2.append({'Model': name, 'F1 Score': f1})

print(f'Best model: {best_model2}')
print(f'Best F1 score: {best_f1_score2}')


Bagging_DT - F1 score: 0.8622186619809298
Bagging_LR - F1 score: 0.8784048361242393
AdaBoost - F1 score: 0.8781778209398069
Voting - F1 score: 0.880048443687182
Stacking - F1 score: 0.8784048361242393
Best model: VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('dt', DecisionTreeClassifier()),
                             ('gnb',
                              XGBClassifier(base_score=None, booster=None,
                                            callbacks=None,
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None, device=None,
                                            early_stopping_rounds=None,
                                            enable_categorical=False,
                                            eval_metric=None,
                                            feature_types=None, gamma=None

### **Results**

In [21]:
best_model = best_model1 if best_f1_score1 > best_f1_score2 else best_model2
best_f1_score = max(best_f1_score1, best_f1_score2)

print(f"Best model in the notebook is: {best_model}\n")
print(f"Best f1 score that I got is: {best_f1_score}\n")

results = pd.concat([pd.DataFrame(results1), pd.DataFrame(results2)], ignore_index=True)
results

Best model in the notebook is: VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('dt', DecisionTreeClassifier()),
                             ('gnb',
                              XGBClassifier(base_score=None, booster=None,
                                            callbacks=None,
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None, device=None,
                                            early_stopping_rounds=None,
                                            enable_categorical=False,
                                            eval_metric=None,
                                            feature_types=None, gamma=None,
                                            grow_policy=None,
                                            importance_type=None,
                                            interact

|  | Model | F1 Score |
|---|---|---|
| 0 | GaussianNB | 0.877086 |
| 1 | LogisticRegression | 0.878405 |
| 2 | XGBClassifier | 0.87985 |
| 3 | Bagging_DT | 0.862219 |
| 4 | Bagging_LR | 0.878405 |
| 5 | AdaBoost | 0.878178 |
| 6 | Voting | 0.880048 |
| 7 | Stacking | 0.878405 |


One of my **ensemble models** yielded the **best result**: the **Voting model**, with an F1 score of 0.880048. This is fitting, since it **combines XGBoost**, the top performer among the base models, **with two other models**. The margin is small, though: XGBoost on its own scored 0.879850, a difference of about 0.0002.

##### Code for submission

In [23]:
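# Note: X_test now refers to the competition test set, not the validation split used above
X_test = test_data[features].fillna(-1)
IDs = test_data['Id']

predictions = best_model.predict(X_test)
rounded_predictions = (predictions >= 0.5).astype(int)

# Rows with a negative value (including NaN filled with -1) were excluded from training,
# so predict 0 for them
negative_mask = (X_test < 0).any(axis=1).to_numpy()
rounded_predictions[negative_mask] = 0

predictions_df = pd.DataFrame({'Id': IDs, 'TARGET': rounded_predictions})

predictions_df.to_csv('/kaggle/working/predictions.csv', index=False)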