### **what is ensemble learning**
ensemble means combining multiple models (called base learners) to improve accuracy and robustness

-> the idea is that weak learners, when combined properly, form a strong learner

ensemble learning is used to:
- reduce variance (i.e. overfitting), mainly through bagging

- reduce bias (i.e. underfitting), mainly through boosting

- improve generalization
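a small simulation can make the "weak learners -> strong learner" idea concrete. this is only a sketch under a strong assumption: every learner is correct 60% of the time and the learners make their mistakes independently (real models are usually correlated, so the gain is smaller in practice). the numbers `n_learners` and `p_correct` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10_000   # hypothetical number of test points
n_learners = 25      # odd, so a strict majority always exists
p_correct = 0.6      # assumed accuracy of each independent weak learner

# correct[i, j] is True when learner j classifies sample i correctly
correct = rng.random((n_samples, n_learners)) < p_correct

single_acc = correct[:, 0].mean()
# the majority vote is right when more than half of the learners are right
ensemble_acc = (correct.sum(axis=1) > n_learners / 2).mean()

print(f"single weak learner: {single_acc:.3f}")
print(f"majority of {n_learners}: {ensemble_acc:.3f}")
```

the majority vote comes out clearly more accurate than any single learner, which is the intuition behind all three ensemble families below.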

##### **3 core families of ensemble methods:**
- **bagging**
  - *type: parallel*
  - trains many models independently on random bootstrap samples of the data (like **RandomForest**)

- **boosting**
  - *type: sequential*
  - trains models one after another, where each model corrects the errors of its predecessor (like **AdaBoost**, **XGBoost**)

- **voting**
  - *type: simple aggregation (voting / averaging)*
  - combines the predictions of multiple independently trained models (for example with **majority voting**)
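to see what "parallel, on random subsets" means for bagging, here is a hand-rolled sketch: bootstrap the training rows, fit one decision tree per sample, then take a majority vote. the synthetic data from `make_classification` and the value `n_models = 15` are assumptions for illustration; in practice `RandomForestClassifier` or `BaggingClassifier` does this for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic binary data, a stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 15  # odd, so the vote on 0/1 labels cannot tie
all_preds = []

for _ in range(n_models):
    # bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

# labels are 0/1, so the mean over models is the share of models voting 1
votes = np.mean(all_preds, axis=0)
bagged_pred = (votes > 0.5).astype(int)

bagged_acc = (bagged_pred == y_test).mean()
print("bagged accuracy:", bagged_acc)
```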


### overview of algorithms (simple intuition)
1. **bagging (bootstrap aggregation)**
   - randomly sample the data (with replacement)
   - train multiple models (often decision trees)
   - combine predictions by **majority voting** (classification) or averaging (regression), e.g. **RandomForest**

2. **boosting**
   - train models sequentially
   - each model learns to fix the mistakes of the previous model
   - e.g. **AdaBoost & XGBoost**

3. **voting**
   - combine the predictions of several trained models
   - **hard voting**: uses the majority class
   - **soft voting**: averages the predicted probabilities
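hard and soft voting can disagree on the same inputs. the sketch below uses made-up probabilities for class 1 from three hypothetical models to show where the two rules differ.

```python
import numpy as np

# hypothetical P(class 1) from three models (columns) for four samples (rows)
proba_class1 = np.array([
    [0.90, 0.40, 0.45],   # one confident "yes", two mild "no"
    [0.20, 0.30, 0.60],
    [0.55, 0.60, 0.10],   # two mild "yes", one confident "no"
    [0.70, 0.80, 0.90],
])

# hard voting: each model casts a 0/1 vote, the majority wins
hard_votes = (proba_class1 > 0.5).astype(int)
hard_pred = (hard_votes.sum(axis=1) >= 2).astype(int)

# soft voting: average the probabilities first, then threshold
soft_pred = (proba_class1.mean(axis=1) > 0.5).astype(int)

print("hard:", hard_pred)   # hard: [0 0 1 1]
print("soft:", soft_pred)   # soft: [1 0 0 1]
```

rows 1 and 3 flip because soft voting lets one confident model outweigh two unsure ones, while hard voting only counts heads.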

### **handling null or categorical data**

In [None]:
df.isnull().sum() #check for missing values

df.duplicated().sum() #check for duplicates

df = df.drop_duplicates() #remove duplicates if any

In [None]:
#encode any object (string) or category columns as integer codes

for col in df.select_dtypes(include=['object', 'category']).columns:
    df[col] = df[col].astype('category').cat.codes
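the cells above only count the missing values. one common way to actually fill them (an assumption here, not the only option) is the median for numeric columns and the most frequent value for categorical ones. the toy frame and its `age` / `city` columns are hypothetical. note that `cat.codes` encodes a missing category as `-1`, so it is usually safer to fill before encoding.

```python
import numpy as np
import pandas as pd

# toy frame with one missing value in each column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Oslo", None, "Paris"],
})

# without filling first, the missing category is encoded as -1
raw_codes = toy["city"].astype("category").cat.codes.tolist()
print(raw_codes)  # [1, 0, -1, 1]

# numeric gap -> median, categorical gap -> most frequent value
toy["age"] = toy["age"].fillna(toy["age"].median())
toy["city"] = toy["city"].fillna(toy["city"].mode()[0])

# categories are sorted before being numbered: Oslo -> 0, Paris -> 1
toy["city_code"] = toy["city"].astype("category").cat.codes
print(toy)
```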

### **ensemble specific coding basics**
#### **random forest (bagging)**

In [None]:
from sklearn.ensemble import RandomForestClassifier

#n_estimators is the number of trees
rf = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=42)

rf.fit(X_train, y_train)

rf_pred = rf.predict(X_test)
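once the forest is fitted, two quick checks are the test accuracy and which features the trees relied on. this is a self-contained sketch on synthetic data (`make_classification` is a stand-in for your own `X_train` / `X_test`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic data, only for illustration
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

acc = accuracy_score(y_test, rf.predict(X_test))
print("test accuracy:", acc)

# feature_importances_ sums to 1; a larger value means the feature contributed more to the splits
ranking = np.argsort(rf.feature_importances_)[::-1]
for i in ranking[:3]:
    print(f"feature {i}: {rf.feature_importances_[i]:.3f}")
```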

#### **AdaBoost (boosting)**

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)

ada.fit(X_train, y_train)

ada_pred = ada.predict(X_test)
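because boosting is sequential, it is possible to look at the ensemble after each round. the sketch below uses `staged_score`, which yields the accuracy after every boosting iteration; the synthetic data is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# synthetic data, only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X_train, y_train)

# one accuracy value per boosting round (boosting may stop before n_estimators)
staged = list(ada.staged_score(X_test, y_test))

print(f"after 1 round: {staged[0]:.3f}")
print(f"after {len(staged)} rounds: {staged[-1]:.3f}")
```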

#### **XGBoost**

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

xgb.fit(X_train, y_train)

xgb_pred = xgb.predict(X_test)

#### **voting classifier**

In [None]:
from sklearn.ensemble import VotingClassifier

#VotingClassifier takes a list of (name, model) pairs through `estimators`
voting = VotingClassifier(estimators=[('rf', rf), ('ada', ada), ('xgb', xgb)],
                          voting='soft')

voting.fit(X_train, y_train)

vote_pred = voting.predict(X_test)

#### **max voting - code**

In [2]:
import sklearn
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression 
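the cells in the rest of these notes use `x_train`, `x_test`, `y_train` and `y_test` without defining them. one possible setup, assuming scikit-learn's built-in breast cancer dataset as a stand-in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# built-in binary dataset, used here only as a placeholder
x, y = load_breast_cancer(return_X_y=True)

# stratify keeps the class ratio the same in both splits
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=1
)

# scaling helps KNN and LogisticRegression; fit the scaler on the training split only
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

print(x_train.shape, x_test.shape)
```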

In [None]:
#model creation
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()


model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

#prediction
pred1 = model1.predict(x_test)
pred2 = model2.predict(x_test)
pred3 = model3.predict(x_test)

#final prediction: for each sample, take the most common label across the three models
import statistics as st

final_pred = np.array([st.mode([p1, p2, p3])
                       for p1, p2, p3 in zip(pred1, pred2, pred3)])

print(final_pred)

#### **averaging - code**

In [None]:
#model creation
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3 = LogisticRegression()


model1.fit(x_train, y_train)
model2.fit(x_train, y_train)
model3.fit(x_train, y_train)

#predict_proba -> returns one probability per class for every sample (e.g. columns for No and Yes)
pred1 = model1.predict_proba(x_test)
pred2 = model2.predict_proba(x_test)
pred3 = model3.predict_proba(x_test)

final_pred = (pred1 + pred2 + pred3) / 3

final_pred
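the averaged result is still a table of probabilities, not class labels. to turn it into labels, pick the column with the highest average per row. the arrays below are made-up stand-ins for `final_pred` and for `model.classes_`.

```python
import numpy as np

# hypothetical averaged probabilities: one row per sample, one column per class
avg_proba = np.array([
    [0.70, 0.30],
    [0.45, 0.55],
    [0.20, 0.80],
])
# column order of predict_proba, as stored in model.classes_ after fitting
classes = np.array(["No", "Yes"])

# index of the most probable class per row, mapped back to its label
final_labels = classes[np.argmax(avg_proba, axis=1)]
print(final_labels)  # ['No' 'Yes' 'Yes']
```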

#### **voting classifier using sklearn**

In [3]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

model1 = LogisticRegression(random_state=1)
model2 = DecisionTreeClassifier(random_state=1)

estimators = [('lr', model1), ('DT', model2)]

#### **hard voting**

In [None]:
model = VotingClassifier(estimators=[('lr', model1), ('DT', model2)], voting='hard')

model.fit(x_train, y_train)

model.score(x_test, y_test)

#### **soft voting**

In [None]:
model = VotingClassifier(estimators=[('lr', model1), ('DT', model2)], voting='soft')

model.fit(x_train, y_train)

model.score(x_test, y_test)
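soft voting can also weight the models unequally through the `weights` parameter of `VotingClassifier`. a minimal sketch, assuming synthetic data and an arbitrary 2:1 weighting chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data, only for illustration
x, y = make_classification(n_samples=500, n_features=10, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

estimators = [
    ("lr", LogisticRegression(max_iter=1000, random_state=1)),
    ("DT", DecisionTreeClassifier(random_state=1)),
]

# weights=[2, 1]: the logistic regression probabilities count twice as much as the tree's
weighted = VotingClassifier(estimators=estimators, voting="soft", weights=[2, 1])
weighted.fit(x_train, y_train)

weighted_acc = weighted.score(x_test, y_test)
print("weighted soft voting accuracy:", weighted_acc)
```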

#### **bagging meta estimator - code**

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

model = BaggingClassifier(DecisionTreeClassifier(random_state=1))

model.fit(x_train, y_train)

model.score(x_test, y_test)
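`BaggingClassifier` exposes the knobs from the bagging intuition directly. the sketch below sets the number of models and the sample size, and uses the out-of-bag score as a built-in validation estimate. the synthetic data and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data, only for illustration
x, y = make_classification(n_samples=500, n_features=10, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=1),
    n_estimators=50,   # number of bootstrapped trees
    max_samples=0.8,   # each tree draws 80% of the rows, with replacement
    oob_score=True,    # score each row using only the trees that did not train on it
    random_state=1,
)
bag.fit(x_train, y_train)

print("out-of-bag accuracy:", bag.oob_score_)
print("test accuracy:", bag.score(x_test, y_test))
```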

#### **adaboost - code**

In [None]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state=1)

model.fit(x_train, y_train)

model.score(x_test, y_test)