Ensemble methods combine multiple models to achieve better performance and robustness than any individual model. The three main types of ensemble methods are:

1. **Bagging (Bootstrap Aggregating)**:
   - Reduces variance by averaging the predictions of multiple models.
   - Common example: Random Forest.

2. **Boosting**:
   - Reduces bias by training weak models sequentially, with each new model correcting the errors of the ones before it.
   - Common examples: Gradient Boosting, AdaBoost.

3. **Stacking**:
   - Combines multiple models using a meta-model to improve predictions.
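
As a minimal sketch of the variance-reduction idea behind bagging, the snippet below compares a single decision tree with a bagged ensemble of 50 trees. It uses synthetic data from `make_classification` rather than the dataset in this notebook, and the `_demo` names and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data used only for illustration (hypothetical settings)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

# A single, fully grown tree: low bias but high variance
single_tree = DecisionTreeClassifier(random_state=0)

# Bagging: 50 trees, each fitted on a bootstrap sample, predictions aggregated
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
)

tree_scores = cross_val_score(single_tree, X_demo, y_demo, cv=5)
bag_scores = cross_val_score(bagged_trees, X_demo, y_demo, cv=5)

print(f"Single tree : mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"Bagged trees: mean={bag_scores.mean():.3f}, std={bag_scores.std():.3f}")
```

Comparing the mean and spread of the two sets of fold scores gives a rough feel for how much aggregation helps on a given dataset.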



## Establishing a Baseline Performance with a Simple Model

#### Purpose

1. **Benchmarking**:
   - Establishing a baseline performance provides a point of reference to compare more complex models and advanced techniques against. It allows you to understand the minimum performance level you can achieve with straightforward methods.

2. **Simplification**:
   - A simple model like Logistic Regression helps to quickly understand the relationship between features and the target variable. It simplifies the initial stages of model development and data exploration.

3. **Error Analysis**:
   - A baseline model can help in identifying the strengths and weaknesses of your dataset. By analyzing the errors and shortcomings of the baseline model, you can gain insights into the data quality and areas that need improvement.

4. **Efficiency**:
   - Simple models are computationally efficient and faster to train. This is especially useful in the initial stages of a project to get quick results and set expectations.

#### Benefits

1. **Reference Point**:
   - Provides a benchmark to evaluate the effectiveness of more complex models. If a complex model does not significantly outperform the baseline, it might not be worth the additional complexity.

2. **Understandability**:
   - Simple models are easier to interpret and explain. This helps in communicating initial findings to stakeholders and understanding the basic patterns in the data.

3. **Quick Feedback**:
   - Baseline models give quick feedback on the data and the problem at hand. This is valuable for iterative development and early detection of data issues.

4. **Foundational Knowledge**:
   - Helps in establishing a foundational understanding of the problem. It provides a starting point for further experimentation and refinement of more sophisticated models.

### Example of Baseline Model: Logistic Regression

Logistic Regression is commonly used as a baseline model for binary classification because it is simple and interpretable. In this notebook it is fitted on the Adult Census Income dataset inside a preprocessing pipeline, and its scores serve as the benchmark that the more complex ensemble models must beat to justify their added complexity.
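
An even simpler reference point than Logistic Regression is a model that always predicts the majority class. The sketch below, on synthetic imbalanced data (an assumption, not the notebook's dataset), uses scikit-learn's `DummyClassifier` to show the accuracy that any useful model has to exceed.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 75/25 class split (hypothetical settings)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=2, n_clusters_per_class=1,
    class_sep=1.5, weights=[0.75, 0.25], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Majority-class predictor: ignores the features entirely
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Simple learned baseline
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

dummy_acc = dummy.score(X_te, y_te)
logreg_acc = logreg.score(X_te, y_te)
print(f"Majority-class accuracy     : {dummy_acc:.3f}")
print(f"Logistic Regression accuracy: {logreg_acc:.3f}")
```

On imbalanced data the majority-class accuracy can already look high, which is why the classification reports in this notebook also show per-class precision and recall.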

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Print unique values of the target variable in the original dataset
print("Unique values in the original target variable:", df['class'].unique())

# Select features and target
target = 'class'
X = df.drop(columns=[target])
# Convert target to binary, strip any extra whitespace
y = df[target].apply(lambda x: 1 if x.strip() == '>50K' else 0)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the pipeline with LogisticRegression
baseline_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train and evaluate the baseline model
baseline_pipeline.fit(X_train, y_train)
y_pred_baseline = baseline_pipeline.predict(X_test)
print("Baseline Logistic Regression Model Performance")
print(classification_report(y_test, y_pred_baseline))

# Perform cross-validation for the baseline model
cv_scores_baseline = cross_val_score(baseline_pipeline, X_train, y_train, cv=5)
print("Baseline Logistic Regression Cross-validation scores: ", cv_scores_baseline)
print("Baseline Logistic Regression Mean cross-validation score: ", cv_scores_baseline.mean())


Unique values in the original target variable: ['<=50K', '>50K']
Categories (2, object): ['<=50K', '>50K']
Baseline Logistic Regression Model Performance
              precision    recall  f1-score   support

           0       0.89      0.93      0.91      7479
           1       0.74      0.61      0.66      2290

    accuracy                           0.86      9769
   macro avg       0.81      0.77      0.79      9769
weighted avg       0.85      0.86      0.85      9769

Baseline Logistic Regression Cross-validation scores:  [0.85323097 0.84926424 0.85527831 0.84873304 0.85026875]
Baseline Logistic Regression Mean cross-validation score:  0.8513550608264019


### Step-by-Step Guide to Implement Ensemble Methods

### 1. Bagging: Random Forest

First, let's implement and evaluate a Random Forest, a bagging ensemble of decision trees, using the same preprocessing pipeline.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create the pipeline with RandomForest
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train and evaluate the RandomForest model
rf_pipeline.fit(X_train, y_train)
y_pred_rf = rf_pipeline.predict(X_test)
print("Random Forest Model Performance")
print(classification_report(y_test, y_pred_rf))

# Perform cross-validation for the RandomForest model
cv_scores_rf = cross_val_score(rf_pipeline, X_train, y_train, cv=5)
print("Random Forest Cross-validation scores: ", cv_scores_rf)
print("Random Forest Mean cross-validation score: ", cv_scores_rf.mean())

Random Forest Model Performance
              precision    recall  f1-score   support

           0       0.89      0.93      0.91      7479
           1       0.73      0.64      0.68      2290

    accuracy                           0.86      9769
   macro avg       0.81      0.78      0.80      9769
weighted avg       0.86      0.86      0.86      9769

Random Forest Cross-validation scores:  [0.85642994 0.8509277  0.85284709 0.85231636 0.85564372]
Random Forest Mean cross-validation score:  0.853632961230241


### 2. Boosting: Gradient Boosting and AdaBoost

Next, let's implement and evaluate boosting models, Gradient Boosting and AdaBoost, using the same preprocessing pipeline.
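
To illustrate the sequential nature of boosting, the sketch below fits a `GradientBoostingClassifier` on synthetic data (an assumption, not the notebook's dataset) and prints the training loss after selected stages. The `train_score_` attribute holds the loss at each boosting iteration; the `_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data used only for illustration (hypothetical settings)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=0
)
gb_demo.fit(X_demo, y_demo)

# train_score_[i] is the training loss after stage i (subsample=1.0 by default)
for stage in (0, 9, 49, 99):
    loss = gb_demo.train_score_[stage]
    print(f"after {stage + 1:3d} trees: training loss = {loss:.4f}")
```

Each added tree fits the errors left by the trees before it, so the training loss keeps shrinking; whether the test performance follows is what the cross-validation scores in this notebook check.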



### Gradient Boosting

In [None]:
import time
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Create the pipeline with GradientBoosting
gb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

# Record the start time
start_time = time.time()

# Train and evaluate the Gradient Boosting model
gb_pipeline.fit(X_train, y_train)
y_pred_gb = gb_pipeline.predict(X_test)
print("Gradient Boosting Model Performance")
print(classification_report(y_test, y_pred_gb))

# Perform cross-validation for the Gradient Boosting model
cv_scores_gb = cross_val_score(gb_pipeline, X_train, y_train, cv=5)
print("Gradient Boosting Cross-validation scores: ", cv_scores_gb)
print("Gradient Boosting Mean cross-validation score: ", cv_scores_gb.mean())

# Record the end time
end_time = time.time()

# Calculate and print the duration
duration = end_time - start_time
print(f"Total execution time: {duration:.2f} seconds")


Gradient Boosting Model Performance
              precision    recall  f1-score   support

           0       0.89      0.95      0.92      7479
           1       0.79      0.62      0.70      2290

    accuracy                           0.87      9769
   macro avg       0.84      0.79      0.81      9769
weighted avg       0.87      0.87      0.87      9769

Gradient Boosting Cross-validation scores:  [0.8678183  0.86295585 0.8678183  0.86217046 0.86447402]
Gradient Boosting Mean cross-validation score:  0.8650473869349777
Total execution time: 47.80 seconds


### Histogram-based Gradient Boosting (HistGB)

#### Overview

Histogram-based Gradient Boosting (HistGB) is an advanced variant of the traditional Gradient Boosting algorithm. It is designed to handle large datasets more efficiently by using histogram-based techniques to approximate the distribution of continuous features, which significantly reduces the computational burden.

#### Key Features

1. **Histogram Binning**:
   - Continuous features are binned into discrete intervals (histograms). This reduces the number of possible split points and speeds up the training process.
   - The algorithm creates histograms for each feature and determines the best split based on these histograms.

2. **Efficiency**:
   - By using histograms, the algorithm reduces the complexity of finding the best splits, making it much faster and more memory-efficient.
   - This efficiency allows HistGB to scale to much larger datasets compared to traditional Gradient Boosting.

3. **Parallel Processing**:
   - HistGB can take advantage of parallel processing, further improving its speed and scalability.

4. **Regularization Techniques**:
   - Like traditional Gradient Boosting, HistGB includes regularization options to prevent overfitting, such as a learning rate (shrinkage), L2 regularization, and early stopping.
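
The effect of histogram binning on the split search can be sketched with plain NumPy. This is a simplified illustration, not scikit-learn's internal implementation; the choice of 255 bins mirrors the default `max_bins` of `HistGradientBoostingClassifier`, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=100_000)  # one continuous feature

# Exact split search: one candidate threshold between each pair of
# adjacent distinct values
n_exact_candidates = np.unique(feature).size - 1

# Histogram approach: map every value to one of 255 quantile-based bins
n_bins = 255
bin_edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
binned = np.searchsorted(bin_edges, feature, side="right").astype(np.uint8)

# After binning, candidate thresholds exist only between adjacent bins
n_binned_candidates = np.unique(binned).size - 1

print("Candidate splits without binning:", n_exact_candidates)
print("Candidate splits with binning   :", n_binned_candidates)
print("Bytes per value: float64 =", feature.itemsize, "| uint8 =", binned.itemsize)
```

The binned feature needs far fewer split evaluations per tree node and one byte per value instead of eight, which is where the speed and memory gains come from.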

#### Benefits of Using HistGB

1. **Speed and Scalability**:
   - **Faster Training**: The histogram-based approach significantly reduces the time required to find the best splits, making training much faster.
   - **Large Datasets**: HistGB is particularly well-suited for large datasets that would be computationally prohibitive for traditional Gradient Boosting.

2. **Memory Efficiency**:
   - Binned features are stored as small integers rather than floating-point values, and far fewer split points need to be tracked. Both lower memory usage, which is critical when working with large datasets.

3. **Performance**:
   - **Competitive Accuracy**: HistGB often achieves similar or better predictive performance compared to traditional Gradient Boosting, thanks to its efficient handling of data.
   - **Regularization**: The inclusion of regularization techniques helps in maintaining high performance without overfitting.

4. **Implementation**:
   - **Ease of Use**: Integrated into popular machine learning libraries like `scikit-learn`, making it easy to use with existing workflows.
   - **Compatibility**: Works seamlessly with other preprocessing and modeling tools in these libraries.

5. **Flexibility**:
   - HistGB supports various loss functions and can be used for both classification and regression tasks.


### Conclusion

Histogram-based Gradient Boosting (HistGB) is a highly efficient and scalable variant of traditional Gradient Boosting, making it particularly well-suited for large datasets. Its histogram binning approach significantly speeds up the training process while maintaining or even improving predictive performance. By understanding and leveraging the benefits of HistGB, you can handle large-scale machine learning tasks more effectively and efficiently.

The `HistGradientBoostingClassifier` does not accept sparse matrices, which is what `OneHotEncoder` produces by default. To fix this, set `sparse_output=False` in the `OneHotEncoder` (the parameter was named `sparse` before scikit-learn 1.2).

In [None]:
import time
# HistGradientBoostingClassifier is stable since scikit-learn 1.0,
# so the sklearn.experimental import is no longer needed
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import classification_report

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns with sparse_output=False
# (dense output, because HistGradientBoostingClassifier rejects sparse input)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the pipeline with HistGradientBoosting
hgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', HistGradientBoostingClassifier(random_state=42))
])

# Record the start time
start_time = time.time()

# Train and evaluate the Hist Gradient Boosting model
hgb_pipeline.fit(X_train, y_train)
y_pred_hgb = hgb_pipeline.predict(X_test)
print("Hist Gradient Boosting Model Performance")
print(classification_report(y_test, y_pred_hgb))

# Perform cross-validation for the Hist Gradient Boosting model
cv_scores_hgb = cross_val_score(hgb_pipeline, X_train, y_train, cv=5)
print("Hist Gradient Boosting Cross-validation scores: ", cv_scores_hgb)
print("Hist Gradient Boosting Mean cross-validation score: ", cv_scores_hgb.mean())

# Record the end time
end_time = time.time()

# Calculate and print the duration
duration = end_time - start_time
print(f"Total execution time: {duration:.2f} seconds")


Hist Gradient Boosting Model Performance
              precision    recall  f1-score   support

           0       0.90      0.94      0.92      7479
           1       0.78      0.67      0.73      2290

    accuracy                           0.88      9769
   macro avg       0.84      0.81      0.82      9769
weighted avg       0.88      0.88      0.88      9769

Hist Gradient Boosting Cross-validation scores:  [0.87549584 0.86756238 0.87498401 0.86703353 0.87279242]
Hist Gradient Boosting Mean cross-validation score:  0.8715736359808937
Total execution time: 26.28 seconds


### AdaBoost


In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Create the pipeline with AdaBoost
ada_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', AdaBoostClassifier(random_state=42))
])

# Train and evaluate the AdaBoost model
ada_pipeline.fit(X_train, y_train)
y_pred_ada = ada_pipeline.predict(X_test)
print("AdaBoost Model Performance")
print(classification_report(y_test, y_pred_ada))

# Perform cross-validation for the AdaBoost model
cv_scores_ada = cross_val_score(ada_pipeline, X_train, y_train, cv=5)
print("AdaBoost Cross-validation scores: ", cv_scores_ada)
print("AdaBoost Mean cross-validation score: ", cv_scores_ada.mean())

AdaBoost Model Performance
              precision    recall  f1-score   support

           0       0.89      0.94      0.92      7479
           1       0.77      0.63      0.69      2290

    accuracy                           0.87      9769
   macro avg       0.83      0.78      0.80      9769
weighted avg       0.86      0.87      0.86      9769

AdaBoost Cross-validation scores:  [0.86103647 0.85732566 0.86269994 0.85641157 0.86076273]
AdaBoost Mean cross-validation score:  0.8596472725349337


### 3. Stacking

Finally, let's implement and evaluate a stacking model. Stacking involves combining multiple models using a meta-model to improve predictions.

### How Stacking Works

Stacking is an ensemble learning technique that combines multiple machine learning models to improve predictive performance. Unlike bagging or boosting, stacking involves training a meta-model (or meta-learner) to aggregate the predictions of several base models (or base learners).

#### Steps in Stacking

1. **Base Models**:
   - Train multiple different models (base learners) on the same dataset. These models can be of different types (e.g., decision trees, logistic regression, SVMs) or the same type with different hyperparameters.

2. **Generate Predictions**:
   - Each base model produces predictions for the training instances. These predictions become the input features for the next step.
   - To keep the meta-learner from overfitting, the predictions are generated out-of-fold: each base model is trained on part of the training data and predicts the held-out fold. scikit-learn's `StackingClassifier` does this internally using cross-validation.

3. **Meta-Model**:
   - A new model (meta-learner) is trained on the predictions made by the base models. The meta-learner learns to combine these predictions to produce a final output.
   - The meta-learner can be any machine learning model, but a simple model like logistic regression is often used.

4. **Final Prediction**:
   - For new data, each base model makes predictions, which are then used as input features for the meta-learner. The meta-learner combines these predictions to make the final prediction.
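
The four steps above can be sketched by hand with `cross_val_predict`. This is a simplified illustration on synthetic data, with two arbitrarily chosen base models and hypothetical variable names; it is not how the notebook's own stacking example is built, which relies on `StackingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration (hypothetical settings)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Step 1: base models of different types
base_models = [DecisionTreeClassifier(max_depth=5, random_state=0), GaussianNB()]

# Step 2: out-of-fold probability of class 1 from each base model
oof_columns = [
    cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for model in base_models
]
meta_train = np.column_stack(oof_columns)

# Step 3: the meta-learner is trained on the out-of-fold predictions
meta_model = LogisticRegression().fit(meta_train, y_tr)

# Step 4: refit base models on all training data, then stack their
# predictions for new data and pass them to the meta-learner
for model in base_models:
    model.fit(X_tr, y_tr)
meta_test = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base_models])
stacked_pred = meta_model.predict(meta_test)
stacked_acc = (stacked_pred == y_te).mean()

print("Meta-feature matrix shape (train):", meta_train.shape)
print(f"Stacked test accuracy: {stacked_acc:.3f}")
```

Using out-of-fold predictions in step 2 means the meta-learner never sees a prediction made for an instance the base model was trained on.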


### Explanation

1. **Base Models**:
   - Three different models are used as base models: RandomForestClassifier, GradientBoostingClassifier, and SVC.
   
2. **Meta-Model**:
   - A LogisticRegression model is used as the meta-learner to combine the predictions of the base models.

3. **Training and Evaluation**:
   - The stacking pipeline is trained on the training data and evaluated on the test data.
   - Cross-validation is performed to assess the stability and generalizability of the stacking model.

4. **Timing**:
   - The execution time is recorded to measure the duration of the training and evaluation process.

### Benefits of Stacking

1. **Improved Performance**:
   - By combining multiple models, stacking can often achieve better performance than any single model by leveraging their individual strengths.

2. **Robustness**:
   - Stacking increases the robustness of the model by reducing the risk of overfitting to a single model's biases.

3. **Flexibility**:
   - You can combine different types of models (e.g., decision trees, linear models, SVMs) to capture different aspects of the data.

4. **Customizability**:
   - Both the base models and the meta-model can be customized to suit the specific needs of the problem.

### Conclusion

Stacking is a powerful ensemble method that combines multiple base models and uses a meta-model to improve predictive performance. It can offer improved accuracy, robustness, and flexibility, making it a valuable tool in the machine learning practitioner's toolkit. Stacking is also expensive to train, so the example that follows fits it on a 20% sample of the data; prototyping on a sample lets you test the stacking model quickly before scaling up to the full dataset.



In [None]:
import pandas as pd
import time
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Select a sample of the data to reduce run time
df_sample = df.sample(frac=0.2, random_state=42)  # Adjust frac to 0.1 (10%) for a smaller sample

# Select features and target
target = 'class'
X = df_sample.drop(columns=[target])
y = df_sample[target].apply(lambda x: 1 if x.strip() == '>50K' else 0)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns with sparse_output=False
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Define base models for stacking
estimators = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('gb', GradientBoostingClassifier(random_state=42)),
    ('svm', SVC(probability=True))
]

# Define the stacking classifier with a logistic regression meta-model
stacking_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(max_iter=1000)))
])

# Record the start time
start_time = time.time()

# Train and evaluate the stacking model
stacking_pipeline.fit(X_train, y_train)
y_pred_stack = stacking_pipeline.predict(X_test)
print("Stacking Model Performance")
print(classification_report(y_test, y_pred_stack))

# Perform cross-validation for the stacking model
cv_scores_stack = cross_val_score(stacking_pipeline, X_train, y_train, cv=5)
print("Stacking Cross-validation scores: ", cv_scores_stack)
print("Stacking Mean cross-validation score: ", cv_scores_stack.mean())

# Record the end time
end_time = time.time()

# Calculate and print the duration
duration = end_time - start_time
print(f"Total execution time: {duration:.2f} seconds")


Stacking Model Performance
              precision    recall  f1-score   support

           0       0.89      0.95      0.92      1485
           1       0.79      0.64      0.71       469

    accuracy                           0.87      1954
   macro avg       0.84      0.79      0.81      1954
weighted avg       0.87      0.87      0.87      1954

Stacking Cross-validation scores:  [0.86308381 0.87460013 0.87332054 0.88547665 0.86427657]
Stacking Mean cross-validation score:  0.8721515389083176
Total execution time: 384.68 seconds


In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Take a sample of the dataset to reduce run time
df = df.sample(frac=0.2, random_state=42)

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Print unique values of the target variable in the original dataset
print("Unique values in the original target variable:", df['class'].unique())

# Select features and target
target = 'class'
X = df.drop(columns=[target])
# Convert target to binary, strip any extra whitespace
y = df[target].apply(lambda x: 1 if x.strip() == '>50K' else 0)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Function to evaluate and compare models
def evaluate_model(model, model_name):
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    # Train the model
    pipeline.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = pipeline.predict(X_test)
    print(f"{model_name} Model Performance")
    print(classification_report(y_test, y_pred))

    # Perform cross-validation
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
    print(f"{model_name} Cross-validation scores: ", cv_scores)
    print(f"{model_name} Mean cross-validation score: ", cv_scores.mean())
    print("\n")

# Models to evaluate
models = [
    (LogisticRegression(max_iter=1000), "Logistic Regression"),
    (SVC(), "SVM"),
    (GradientBoostingClassifier(random_state=42), "Gradient Boosting"),
    (RandomForestClassifier(random_state=42), "Random Forest"),
    (AdaBoostClassifier(random_state=42), "AdaBoost"),
    (StackingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(random_state=42)),
            ('gb', GradientBoostingClassifier(random_state=42)),
            ('svm', SVC(probability=True))
        ],
        final_estimator=LogisticRegression(max_iter=1000)
    ), "Stacking")
]

# Evaluate each model
for model, model_name in models:
    evaluate_model(model, model_name)


Unique values in the original target variable: ['<=50K', '>50K']
Categories (2, object): ['<=50K', '>50K']
Logistic Regression Model Performance
              precision    recall  f1-score   support

           0       0.88      0.94      0.91      1485
           1       0.78      0.61      0.68       469

    accuracy                           0.86      1954
   macro avg       0.83      0.78      0.80      1954
weighted avg       0.86      0.86      0.86      1954

Logistic Regression Cross-validation scores:  [0.84325016 0.85604607 0.85924504 0.8643634  0.85019206]
Logistic Regression Mean cross-validation score:  0.8546193463930212


SVM Model Performance
              precision    recall  f1-score   support

           0       0.88      0.95      0.91      1485
           1       0.79      0.59      0.68       469

    accuracy                           0.86      1954
   macro avg       0.83      0.77      0.79      1954
weighted avg       0.86      0.86      0.86      1954

SVM C

### How the Meta-Model Uses Predictions of Base Models

In stacking, the meta-model (or meta-learner) uses the predictions of the base models as its input features. Here’s a step-by-step explanation of how this works:

1. **Training Base Models**:
   - Each base model is trained on the training data.

2. **Generating Predictions**:
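   - The base models generate a prediction for every instance in the training set. To keep the meta-model from learning on predictions that the base models made for data they had already seen, these are usually out-of-fold predictions obtained through cross-validation.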

3. **Creating Meta-Features**:
   - These predictions (also known as meta-features) are then used as the input features for the meta-model. The meta-features are typically the predicted probabilities or the raw predictions from the base models.
   - For example, with three base models and 10,000 training instances in a binary classification problem, the meta-model's training set has 10,000 instances and 3 features (one predicted probability per base model).

4. **Training the Meta-Model**:
   - The meta-model is trained on these meta-features (the predictions of the base models) to learn how to combine them to make the final prediction.

5. **Final Prediction**:
   - When making predictions on new data, each base model first generates its predictions for the new instances.
   - These predictions are then used as input to the meta-model, which combines them to produce the final prediction.
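
The meta-features described above can be inspected on a fitted `StackingClassifier` through its `transform` method. The sketch below uses synthetic binary data and three small base models chosen only for speed (assumptions, not the models used in this notebook).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical settings)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

stack_demo = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions are used to train the meta-model
)
stack_demo.fit(X_demo, y_demo)

# transform() returns what the meta-model sees: for a binary problem,
# one predicted probability per base model
meta_features = stack_demo.transform(X_demo)
print("Meta-feature matrix shape:", meta_features.shape)

# The meta-model's coefficients show how strongly each base model is weighted
print("Meta-model coefficients:", stack_demo.final_estimator_.coef_)
```

A base model with a coefficient near zero contributes little to the final prediction, which can hint at models that are redundant in the ensemble.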

### Benefits of Using Meta-Model with Predictions of Base Models

1. **Combining Strengths of Different Models**:
   - Each base model may capture different patterns and relationships in the data. By combining their predictions, the meta-model can leverage the strengths of each base model, leading to better overall performance.

2. **Reducing Overfitting**:
   - Base models are often prone to overfitting on their own, especially complex ones. The meta-model helps to mitigate this by learning a more generalized way to combine the base models' predictions.

3. **Improved Predictive Performance**:
   - The meta-model can identify and correct the weaknesses of individual base models, leading to improved predictive performance. This is especially useful when the base models are diverse and have complementary strengths.

4. **Flexibility and Robustness**:
   - Stacking allows the use of different types of models, making the ensemble more robust to various types of data and scenarios. It is flexible in terms of the choice of base models and the meta-model.


### Explanation

1. **Preprocessing**:
   - Preprocessing steps are applied to both numeric and categorical features using `ColumnTransformer`.

2. **Base Models**:
   - RandomForestClassifier, GradientBoostingClassifier, and SVC are used as base models.

3. **Meta-Model**:
   - Logistic Regression is used as the meta-model to combine the predictions of the base models.

4. **Training and Prediction**:
   - The stacking pipeline is trained on the training data. Predictions are made on the test data, and performance is evaluated using a classification report.

5. **Cross-Validation**:
   - Cross-validation is performed to assess the generalizability and stability of the stacking model.

### Benefits

- **Combining Strengths**: By combining different base models, the meta-model can leverage their strengths and compensate for their weaknesses.
- **Reducing Overfitting**: The meta-model can learn to generalize better by combining the predictions of base models trained on different data folds.
- **Improved Performance**: The overall predictive performance is often better than any single model due to the aggregation of diverse models.

Stacking effectively enhances model performance and robustness by leveraging multiple models and a meta-model to combine their predictions.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Load the dataset
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Rename columns and take a sample
df.columns = [col.lower().replace('-', '_') for col in df.columns]
df_sample = df.sample(frac=0.1, random_state=42)

# Select features and target
target = 'class'
X = df_sample.drop(columns=[target])
# Vectorized comparison yields a plain integer target even when the column is categorical
y = (df_sample[target].astype(str).str.strip() == '>50K').astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Define base models
estimators = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('gb', GradientBoostingClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42))  # seed the internal probability calibration
]

# Define the stacking classifier with a logistic regression meta-model
stacking_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(max_iter=1000)))
])

# Train the stacking model
stacking_pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred_stack = stacking_pipeline.predict(X_test)
print("Stacking Model Performance")
print(classification_report(y_test, y_pred_stack))

# Perform 5-fold cross-validation on the training set (default scoring for classifiers is accuracy)
cv_scores_stack = cross_val_score(stacking_pipeline, X_train, y_train, cv=5)
print("Stacking Cross-validation scores: ", cv_scores_stack)
print("Stacking Mean cross-validation score: ", cv_scores_stack.mean())



## Benefits of Ensemble Methods

1. **Improved Accuracy**:
   - Combining multiple models often leads to better performance than any individual model. As long as the models' errors are not strongly correlated, they tend to partially cancel each other out.

2. **Reduced Overfitting**:
   - Ensemble methods like bagging help reduce overfitting by averaging predictions from multiple models trained on different bootstrap samples of the data.

3. **Enhanced Robustness**:
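   - Ensembles are generally more robust to noise and outliers, since no single model determines the final prediction. Boosting can be an exception, because it concentrates on hard examples, which may include mislabeled ones.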

4. **Flexibility**:
   - Different ensemble methods can be tailored to specific problems, offering flexibility in addressing various types of data and model weaknesses.

By integrating and evaluating different ensemble methods, you can leverage their strengths to improve model performance and robustness, ultimately leading to more reliable predictions.
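
The claim that errors can cancel out can be checked with a small calculation. The sketch below computes the exact accuracy of a majority vote over `n_models` classifiers under the strong assumption that each is correct with the same probability `p` and that their errors are fully independent. The function name `majority_vote_accuracy` is hypothetical. Real models trained on the same data have correlated errors, so actual gains are typically smaller.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes every model is correct with probability p, independently of the
    others (an idealization that rarely holds in practice).
    """
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Odd ensemble sizes avoid ties in the vote
for n in (1, 5, 25, 101):
    print(f"{n:>3} models -> accuracy {majority_vote_accuracy(n, 0.6):.3f}")
```

With `p` above 0.5 the majority-vote accuracy rises as models are added. With `p` at or below 0.5, adding models does not help, which is why base models need to be better than chance.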

### Additional Considerations for Ensemble Methods

Keep the following considerations in mind when working with ensemble methods:

#### 1. **Types of Ensemble Methods**

- **Bagging**:
  - **Bootstrap Aggregating (Bagging)**: Trains multiple instances of the same model on different bootstrap samples (drawn with replacement) of the training data and averages or votes on their predictions.
  - **Random Forest**: An extension of bagging that uses decision trees with random feature selection.
  
- **Boosting**:
  - **AdaBoost**: Adjusts the weights of misclassified instances and combines weak learners sequentially.
  - **Gradient Boosting**: Sequentially builds models to correct the errors of the previous models.
  - **XGBoost**: An optimized and efficient implementation of gradient boosting.
  
- **Stacking**:
  - Combines the predictions of multiple base models using a meta-model to improve overall performance.
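
The bagging and boosting variants listed above can be compared side by side with scikit-learn. This is a rough sketch on a synthetic dataset (an assumption for illustration), with `n_estimators=50` picked arbitrarily. XGBoost is left out because it lives in a separate package. The scores are only meaningful relative to each other on this toy data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=15, n_informative=5, random_state=0
)

models = {
    # Reference point: one unpruned decision tree
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: the same tree fitted on 50 bootstrap samples
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
    ),
    # Random forest: bagging plus random feature selection at each split
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # AdaBoost: reweights misclassified samples between rounds
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Gradient boosting: each tree fits the errors of the current ensemble
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:<18} mean CV accuracy: {score:.3f}")
```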

#### 2. **Handling Overfitting**

- **Bagging**: Helps reduce overfitting by averaging predictions and increasing model diversity.
- **Boosting**: Can overfit if the base learners are too complex, the learning rate is too high, or too many boosting rounds are used. Regularization, early stopping, and parameter tuning help mitigate this.
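
One way to apply early stopping is through the `n_iter_no_change` and `validation_fraction` parameters of scikit-learn's `GradientBoostingClassifier`. The sketch below assumes a synthetic dataset with some label noise (`flip_y=0.1`), and the specific values for the learning rate, depth, and patience are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy data (assumption for illustration)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,        # smaller values shrink each tree's contribution
    max_depth=3,              # shallow trees keep each learner weak
    subsample=0.8,            # row subsampling adds regularization
    validation_fraction=0.2,  # held-out share of the training data
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

print("Boosting rounds actually used:", gb.n_estimators_)
print(f"Test accuracy: {gb.score(X_test, y_test):.3f}")
```

The fitted attribute `n_estimators_` reports how many rounds were kept, which is usually well below the upper bound when early stopping triggers.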

#### 3. **Hyperparameter Tuning**

- Ensemble methods often involve tuning multiple hyperparameters to achieve optimal performance. Use techniques like `GridSearchCV` or `RandomizedSearchCV` to find the best combination of parameters.
- For example, in Random Forest, you can tune parameters like the number of trees (`n_estimators`), maximum depth (`max_depth`), and the number of features to consider (`max_features`).
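
As a rough sketch of the tuning described above, the following uses `RandomizedSearchCV` over the three Random Forest parameters mentioned. The search ranges, the number of sampled candidates (`n_iter=8`), and the synthetic dataset are assumptions kept small so the example runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption for illustration)
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=0
)

# Ranges are illustrative; sensible values depend on the dataset
param_distributions = {
    "n_estimators": randint(20, 100),         # integers in [20, 100)
    "max_depth": [None, 3, 5, 10],            # None lets trees grow fully
    "max_features": ["sqrt", "log2", 0.5],    # features considered per split
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,            # number of sampled parameter combinations
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

`GridSearchCV` accepts a similar setup but evaluates every combination of explicitly listed values, which becomes expensive as the grid grows.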

#### 4. **Combining Different Types of Models**

- In stacking, combining different types of models (e.g., decision trees, linear models, SVMs) can capture different aspects of the data, leading to better performance.
- Ensure the base models are diverse to benefit from the ensemble approach.
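
One informal way to gauge diversity is to look at how strongly the base models' out-of-fold predictions correlate with each other. The sketch below is a heuristic, not a standard diagnostic, and the dataset, the choice of models, and the names `oof` and `corr` are assumptions. Correlations close to 1 suggest the models make similar errors and add little to each other.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic data (assumption for illustration)
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

models = {
    "tree ensemble": RandomForestClassifier(n_estimators=50, random_state=0),
    "linear": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True, random_state=0),
}

# Out-of-fold positive-class probabilities, one column per model
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in models.values()
])

# Pairwise Pearson correlation between the models' predictions
corr = np.corrcoef(oof, rowvar=False)

names = list(models)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: correlation {corr[i, j]:.3f}")
```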

#### 5. **Interpretability**

- Ensemble methods, especially those with many or heterogeneous base models, are usually less interpretable than a single model.
- Techniques like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) can help interpret the predictions of ensemble models.
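
SHAP and LIME are separate packages, so the sketch below swaps in a different model-agnostic technique that ships with scikit-learn: permutation importance, which measures how much the test score drops when one feature's values are shuffled. It gives global feature importance rather than the per-prediction explanations SHAP and LIME provide. The synthetic dataset is an assumption, and with `shuffle=False` the first three columns are the informative ones by construction.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: columns 0-2 are informative, columns 3-7 are pure noise
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the score drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Features sorted from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(
        f"feature {idx}: "
        f"{result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}"
    )
```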

#### 6. **Computational Resources**

- Ensemble methods, particularly those involving many base models or complex algorithms like gradient boosting, can be computationally intensive.
- Consider the computational resources available and the trade-off between model performance and training time.
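
A quick way to see the trade-off is to time training at a few ensemble sizes and compare the resulting accuracy. The sketch below assumes a synthetic dataset and arbitrary tree counts. Timings depend on the machine, so only the trend is meaningful. `n_jobs=-1` asks scikit-learn to use all available cores.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = []
for n_trees in (10, 50, 200):
    model = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1, random_state=0)

    # Time only the fit call
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start

    results.append((n_trees, elapsed, model.score(X_test, y_test)))

for n_trees, elapsed, accuracy in results:
    print(f"{n_trees:>4} trees: fit in {elapsed:.2f}s, test accuracy {accuracy:.3f}")
```

Accuracy typically levels off well before training time does, so it is worth checking where additional trees stop paying for themselves.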

#### 7. **Use Cases and Applications**

- Ensemble methods are widely used in various applications such as classification, regression, and anomaly detection.
- They are particularly effective in competitions and real-world scenarios where high accuracy and robustness are required.


### Conclusion

Ensemble methods are a cornerstone of modern machine learning due to their ability to combine multiple models for improved performance and robustness. By understanding the various types of ensemble techniques, handling overfitting, performing hyperparameter tuning, and combining diverse models, you can leverage the full potential of ensemble methods for your machine learning tasks. Always consider the trade-offs between model complexity, interpretability, and computational resources when deploying ensemble methods in practice.