<a href="https://colab.research.google.com/github/micah-shull/pipelines/blob/main/pipelines_00_why_use_piplelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What are sklearn Pipelines?

In machine learning, pipelines help streamline the process of preprocessing data and training models. A pipeline chains together multiple steps into a single object, making the workflow cleaner, more maintainable, and less error-prone.

### Why use sklearn Pipelines?

1. **Consistency**:
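   - **End-to-End Workflow**: Pipelines ensure that the same preprocessing steps are applied, in the same order, during both training and testing.
   - **No Data Leakage**: When a pipeline is fit, each preprocessing step learns its statistics (for example, the medians used for imputation or the means used for scaling) from the training data only. The fitted steps are then applied unchanged to the test data, so no information from the test set leaks into training.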

2. **Clean and Maintainable Code**:
   - **Modular Design**: Pipelines organize code into modular, reusable components, making it easier to read, debug, and maintain.
   - **Single Object**: The entire workflow, including preprocessing and modeling, is encapsulated in a single object, simplifying the overall process.

3. **Simplified Workflow**:
   - **Chaining Steps**: Pipelines allow you to chain multiple preprocessing steps and the estimator into a single object, simplifying the workflow.
   - **Ease of Use**: Once defined, pipelines can be easily used for fitting, predicting, and evaluating without repeatedly applying the same preprocessing steps.
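
To make the leakage point concrete, here is a minimal sketch on synthetic data. The arrays and the names `X_demo`, `X_tr`, and `demo_pipeline` are illustrative assumptions, not part of the dataset used in the rest of this notebook. It checks that a scaler placed inside a pipeline learns its mean from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): one feature and a binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

# Simple split: first 80 rows for training, last 20 for testing
X_tr, X_te = X_demo[:80], X_demo[80:]
y_tr, y_te = y_demo[:80], y_demo[80:]

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
demo_pipeline.fit(X_tr, y_tr)

# The scaler's mean comes from the training rows, not from the full dataset
learned_mean = demo_pipeline.named_steps['scaler'].mean_[0]
print(f'Scaler mean:        {learned_mean:.4f}')
print(f'Training-set mean:  {X_tr.mean():.4f}')
print(f'Full-dataset mean:  {X_demo.mean():.4f}')

# predict() reuses the fitted scaler on the test rows without refitting it
test_predictions = demo_pipeline.predict(X_te)
```

Scaling the full dataset before splitting would instead bake test-set statistics into the training features, which is the mistake the pipeline guards against.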

### Basic Example

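Let's create a basic pipeline that includes data preprocessing and a simple classifier. A `ColumnTransformer` routes the numeric and categorical columns to their own preprocessing steps.

1. **SimpleImputer:** Fills in missing values, using the median for numeric columns and the most frequent value for categorical columns.
2. **StandardScaler:** Standardizes the numeric features by removing the mean and scaling to unit variance.
3. **OneHotEncoder:** Encodes the categorical features as binary indicator columns.
4. **LogisticRegression:** This is the classifier used in the pipeline.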
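
As a small worked check of the standardization step, the sketch below compares `StandardScaler` with the formula it implements, `(x - mean) / std`, on a toy column. The values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column of values (illustrative only)
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaled = StandardScaler().fit_transform(values)

# Manual standardization: subtract the column mean, divide by the column standard deviation
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))  # True
```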



In [None]:
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Fetch the data
data = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = data.frame

# Features and target
X = df.drop(columns='class')
y = df['class']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify categorical and numeric columns
categorical_features = X.select_dtypes(include=['category']).columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=300))
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
# print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       <=50K       0.88      0.93      0.91      7479
        >50K       0.73      0.60      0.66      2290

    accuracy                           0.86      9769
   macro avg       0.81      0.77      0.78      9769
weighted avg       0.85      0.86      0.85      9769



In [None]:
pipeline



### Combined Benefits of GridSearchCV and Pipelines

1. **Integrated Hyperparameter Tuning**:
   - **Simultaneous Optimization**: You can tune hyperparameters for both preprocessing steps and the model within a single grid search, optimizing the entire workflow.
   - **Comprehensive Search**: Allows for a comprehensive search over multiple preprocessing and model parameters simultaneously.

2. **Enhanced Reproducibility**:
   - **Consistent Application**: Ensures that all preprocessing and model steps are consistently applied across different runs, enhancing reproducibility.
   - **Documented Workflow**: A pipeline combined with GridSearchCV provides a clear and documented workflow, making it easier to reproduce results.

3. **Reduced Risk of Errors**:
   - **Automated Handling**: Automates the handling of data preprocessing and model fitting, reducing the risk of manual errors.
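   - **Improved Validation**: Cross-validation within GridSearchCV scores every candidate on held-out folds, so hyperparameters are selected on data the model was not trained on. This makes the selection less likely to overfit a single train/validation split.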

### Example Scenario

Imagine you are building a machine learning model to predict whether an individual earns more than $50,000 a year based on various demographic features (like age, education, and occupation).

- **Without Pipelines and GridSearchCV**: You might manually preprocess the data (handling missing values, scaling features, encoding categories), split the data, fit the model, and manually tune hyperparameters. This approach can lead to inconsistencies, data leakage, and manual errors.
  
- **With Pipelines and GridSearchCV**: You define a pipeline that handles all preprocessing steps and the model. GridSearchCV is used to automatically tune hyperparameters for both preprocessing steps (like the imputation strategy) and the model. This gives a consistent, less error-prone, and more efficient workflow, and a more trustworthy estimate of model performance.


In [None]:
# Fetch the data
data = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = data.frame

# Features and target
X = df.drop(columns='class')
y = df['class']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify categorical and numeric columns
categorical_features = X.select_dtypes(include=['category']).columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=300))
])

# Define the grid search parameters
param_grid = {
    'classifier__C': [0.1, 1.0, 10],
    'classifier__solver': ['liblinear', 'saga']
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)

# Train the pipeline with grid search
grid_search.fit(X_train, y_train)

# Make predictions
y_pred = grid_search.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
print(f'Best parameters: {grid_search.best_params_}')

Fitting 5 folds for each of 6 candidates, totalling 30 fits
              precision    recall  f1-score   support

       <=50K       0.88      0.93      0.91      7479
        >50K       0.73      0.60      0.66      2290

    accuracy                           0.86      9769
   macro avg       0.81      0.77      0.78      9769
weighted avg       0.85      0.86      0.85      9769

Best parameters: {'classifier__C': 1.0, 'classifier__solver': 'liblinear'}



### How GridSearchCV Reduces Overfitting

Overfitting occurs when a model learns the noise in the training data rather than the actual underlying patterns. This leads to excellent performance on the training data but poor generalization to new, unseen data.

1. **Cross-Validation**:
   - **K-Fold Cross-Validation**: GridSearchCV uses k-fold cross-validation, where the training data is split into `k` subsets (folds). The model is trained on `k-1` folds and validated on the remaining fold. This process is repeated `k` times, each time with a different fold as the validation set. The average performance across all `k` folds provides a more reliable estimate of the model's performance.
   - **Model Generalization**: By evaluating the model on multiple folds, GridSearchCV ensures that the hyperparameters are chosen based on their performance on multiple data subsets, not just a single training/validation split. This helps in selecting hyperparameters that generalize well to new data.

2. **Robust Hyperparameter Tuning**:
   - **Comprehensive Search**: GridSearchCV performs an exhaustive search over a specified hyperparameter space. This helps in identifying the best combination of hyperparameters that leads to optimal model performance.
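   - **Avoiding Overfitting**: By considering multiple hyperparameter combinations and evaluating them through cross-validation, GridSearchCV reduces the risk of overfitting to the training data. It helps in selecting hyperparameters that perform well across different validation sets, indicating better generalization. Because the best combination is still chosen using the validation folds, the final score should be reported on a separate held-out test set, as the cells in this notebook do.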

3. **Automated and Consistent Process**:
   - **Consistent Evaluation**: GridSearchCV ensures that each hyperparameter combination is evaluated consistently using the same cross-validation procedure. This uniformity reduces the likelihood of overfitting caused by inconsistent evaluation practices.
   - **Automated Handling**: The automated process reduces the risk of manual errors that can lead to overfitting, such as reusing validation data or improperly splitting the dataset.
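
The k-fold procedure described above can be sketched with `KFold` on a ten-row toy array (the data and variable names are illustrative assumptions). Every row is used for validation exactly once. Note that when `cv` is an integer and the estimator is a classifier, GridSearchCV uses the stratified variant, `StratifiedKFold`, which also keeps class proportions similar across folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy rows (illustrative only)
X_demo = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
validation_counts = np.zeros(len(X_demo), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X_demo)):
    # Train on k-1 folds, validate on the remaining fold
    validation_counts[val_idx] += 1
    print(f'fold {fold}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows')

# Each row was part of the validation fold exactly once
print(validation_counts)
```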


### Optimize Model Params and Preprocessing

Integrated hyperparameter tuning allows you to optimize hyperparameters for both the preprocessing steps and the model itself within a single grid search. This is one of the most useful features of combining scikit-learn's `Pipeline` with `GridSearchCV`.

1. **Simultaneous Optimization**:
   - **Preprocessing and Model**: You can optimize parameters for data preprocessing (e.g., imputation strategy, scaling) and model hyperparameters (e.g., regularization strength, solver) together.
   - **Unified Search**: Instead of separately tuning preprocessing and model parameters, you define a single parameter grid that includes both, and GridSearchCV will search through all combinations.

2. **Comprehensive Workflow Optimization**:
   - **Holistic Approach**: The pipeline allows you to consider how preprocessing and model parameters interact and find the best combination that works together optimally.
   - **Efficiency**: This approach is more efficient and streamlined, as it avoids the need for separate tuning stages for preprocessing and modeling.
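
A unified grid grows multiplicatively with each parameter, so it helps to count the candidates before running the search. The sketch below uses `ParameterGrid` on the same grid that the code cell in this section defines. The names `demo_grid` and `candidates` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same parameter names and values as the grid used in this section
demo_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10],
    'classifier__solver': ['liblinear', 'saga'],
}

candidates = list(ParameterGrid(demo_grid))
n_folds = 5

print(len(candidates))            # 12 combinations (2 * 3 * 2)
print(len(candidates) * n_folds)  # 60 model fits with 5-fold cross-validation
```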

### Explanation

1. **Hyperparameter Grid**:
   - **Preprocessing Parameters**: `preprocessor__num__imputer__strategy` specifies the imputation strategy for numeric features. This allows GridSearchCV to test both 'mean' and 'median' strategies.
   - **Model Parameters**: `classifier__C` and `classifier__solver` are hyperparameters for the logistic regression model.

2. **Nested Parameters**:
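   - Parameters are specified in a hierarchical manner using double underscores (`__`). This notation tells GridSearchCV which parameter belongs to which step in the pipeline. For example, `preprocessor__num__imputer__strategy` reads as: the `strategy` parameter of the `imputer` step, inside the `num` transformer, inside the `preprocessor` step.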

3. **Grid Search Execution**:
   - **Unified Tuning**: GridSearchCV searches over the combined hyperparameter space, optimizing both preprocessing and model parameters simultaneously.
   - **Cross-Validation**: Each combination of hyperparameters is evaluated using cross-validation, ensuring robust performance estimates.
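
The double-underscore names can be checked directly with `get_params` and `set_params`, which is the mechanism a grid search relies on. The sketch below builds a reduced pipeline with the same step names. The column indices `[0, 1]` are placeholders, since nothing is fit here.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer()),
            ('scaler', StandardScaler()),
        ]), [0, 1]),  # placeholder column indices
    ])),
    ('classifier', LogisticRegression()),
])

# Every tunable parameter is exposed under its nested name
print('preprocessor__num__imputer__strategy' in demo_pipeline.get_params())  # True

# set_params accepts the same nested names
demo_pipeline.set_params(
    preprocessor__num__imputer__strategy='median',
    classifier__C=0.1,
)
print(demo_pipeline.get_params()['preprocessor__num__imputer__strategy'])  # median
print(demo_pipeline.named_steps['classifier'].C)                           # 0.1
```

Listing `sorted(demo_pipeline.get_params())` is a quick way to find the exact name of a parameter before adding it to a grid.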

### Benefits

- **Comprehensive Optimization**: Finds the best combination of preprocessing and model hyperparameters together, ensuring they work well in concert.
- **Efficiency**: Streamlines the tuning process into a single operation, avoiding separate tuning stages.
- **Consistency**: Ensures consistent application of preprocessing and model fitting steps across all data splits during cross-validation.

By integrating hyperparameter tuning within pipelines, you achieve a more efficient, comprehensive, and reliable machine learning workflow, leading to models that generalize better to new data.

In [None]:
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Fetch the data
data = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = data.frame

# Features and target
X = df.drop(columns='class')
y = df['class']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify categorical and numeric columns
categorical_features = X.select_dtypes(include=['category']).columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),  # Imputer with default strategy (mean)
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=200))
])

# Define the grid search parameters
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],  # Tuning imputer strategy
    'classifier__C': [0.1, 1.0, 10],  # Tuning regularization strength
    'classifier__solver': ['liblinear', 'saga']  # Tuning solver
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)

# Train the pipeline with grid search
grid_search.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Best parameters: {grid_search.best_params_}')


Fitting 5 folds for each of 12 candidates, totalling 60 fits
              precision    recall  f1-score   support

       <=50K       0.88      0.93      0.91      7479
        >50K       0.73      0.60      0.66      2290

    accuracy                           0.86      9769
   macro avg       0.81      0.77      0.78      9769
weighted avg       0.85      0.86      0.85      9769

Best parameters: {'classifier__C': 1.0, 'classifier__solver': 'liblinear', 'preprocessor__num__imputer__strategy': 'mean'}


### Custom Transformers: Overview and Importance

#### What Are Custom Transformers?

Custom transformers are user-defined preprocessing components in a machine learning pipeline. They extend the functionality of scikit-learn's built-in transformers, allowing you to implement specific data transformation tasks that are not covered by the default options. These transformers follow the same interface as scikit-learn's built-in transformers, typically by inheriting from `BaseEstimator` and `TransformerMixin`.
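
As a minimal sketch of what the two base classes provide, consider a hypothetical `ClipTransformer` (not part of scikit-learn or of this notebook's later code). `BaseEstimator` supplies `get_params` and `set_params`, which also makes the object clonable, and `TransformerMixin` supplies `fit_transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ClipTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip every value to the range [lower, upper]."""

    def __init__(self, lower=0.0, upper=1.0):
        # Store constructor arguments unchanged so get_params/clone work
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Nothing to learn for this transformer
        return self

    def transform(self, X):
        return np.clip(X, self.lower, self.upper)

clipper = ClipTransformer(lower=0.0, upper=10.0)

# get_params comes from BaseEstimator
print(clipper.get_params())

# fit_transform comes from TransformerMixin
clipped = clipper.fit_transform(np.array([[-5.0], [3.0], [20.0]]))
print(clipped.ravel())

# clone builds a fresh, unfitted copy with the same parameters,
# which is what GridSearchCV does for every candidate
clipper_copy = clone(clipper)
```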

#### Why Are Custom Transformers Important?

1. **Flexibility**:
   - Custom transformers allow you to handle unique preprocessing tasks tailored to your specific dataset and problem, which may not be achievable using existing scikit-learn transformers.

2. **Reusability**:
   - Once created, custom transformers can be reused across different projects and datasets, promoting code reuse and modularity.

3. **Maintainability**:
   - Encapsulating preprocessing logic within custom transformers makes your code more organized and maintainable. This separation of concerns helps in managing complex preprocessing workflows more effectively.

4. **Pipeline Integration**:
   - Custom transformers seamlessly integrate with scikit-learn's pipeline API, enabling a unified and consistent preprocessing and modeling workflow. This ensures that preprocessing steps are consistently applied during both training and inference.

5. **Enhanced Functionality**:
   - They can perform complex data transformations, feature engineering, and data augmentation techniques that go beyond standard preprocessing steps like scaling or encoding.

6. **Automated Hyperparameter Tuning**:
   - By including custom transformers in a pipeline, you can leverage tools like `GridSearchCV` to tune hyperparameters for both preprocessing steps and model parameters simultaneously, optimizing the entire machine learning workflow.

#### Examples of Custom Transformer Use Cases

1. **Feature Engineering**:
   - Creating interaction terms, polynomial features, or domain-specific features that capture important relationships within the data.

2. **Custom Scaling**:
   - Implementing scaling methods not provided by default, such as scaling by the maximum absolute value or custom normalization techniques.

3. **Data Augmentation**:
   - Generating synthetic data points or applying transformations that increase the diversity of the training data, especially useful in computer vision and natural language processing.

4. **Data Cleaning**:
   - Applying domain-specific cleaning operations, such as correcting data entry errors or handling rare categories in categorical features.

5. **Statistical Transformations**:
   - Implementing transformations based on statistical properties of the data, such as log transformations or Box-Cox transformations to handle skewed distributions.

By creating and using custom transformers, data scientists and machine learning engineers can build more robust, flexible, and maintainable preprocessing pipelines tailored to their specific needs.
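
The "handling rare categories" use case can be sketched as a transformer that learns state in `fit`. The class `RareCategoryGrouper`, its `min_count` parameter, and the toy frames below are hypothetical and only for illustration. The sketch assumes a string (object) column, since a pandas categorical column would need the `'other'` category added first.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RareCategoryGrouper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replace infrequent categories with 'other'."""

    def __init__(self, column, min_count=2):
        self.column = column
        self.min_count = min_count

    def fit(self, X, y=None):
        counts = X[self.column].value_counts()
        # Learned state gets a trailing underscore, following scikit-learn convention
        self.frequent_ = counts[counts >= self.min_count].index.tolist()
        return self

    def transform(self, X):
        X = X.copy()
        keep = X[self.column].isin(self.frequent_)
        # Keep frequent categories, replace everything else
        X[self.column] = X[self.column].where(keep, 'other')
        return X

# Toy data (illustrative only)
train = pd.DataFrame({'occupation': ['sales', 'sales', 'tech', 'tech', 'farming']})
test = pd.DataFrame({'occupation': ['tech', 'farming', 'pilot']})

grouper = RareCategoryGrouper(column='occupation').fit(train)
grouped = grouper.transform(test)['occupation'].tolist()
print(grouped)  # ['tech', 'other', 'other']
```

The set of frequent categories is decided from the training frame alone, so a category that is new at prediction time (`'pilot'`) is handled the same way as one that was rare during training.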

In [None]:
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# Fetch the data
data = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = data.frame

# Features and target
X = df.drop(columns='class')
y = df['class']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Custom Transformer to apply a log transformation to a specified column
class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.column] = np.log1p(X[self.column])
        return X

# Identify categorical and numeric columns
categorical_features = X.select_dtypes(include=['category']).columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a pipeline including the custom transformer.
# LogTransformer selects its column by name, so it has to run on the DataFrame,
# before the ColumnTransformer turns the data into an array.
pipeline = Pipeline(steps=[
    ('log_transform', LogTransformer(column='age')),  # Example: Log-transform the 'age' column
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=200))
])

# Define the grid search parameters
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],  # Tuning imputer strategy
    'classifier__C': [0.1, 1.0, 10],  # Tuning regularization strength
    'classifier__solver': ['liblinear', 'saga']  # Tuning solver
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)

# Train the pipeline with grid search
grid_search.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Best parameters: {grid_search.best_params_}')



### Feature Union

1. **FeatureUnion**:
   - Applies several transformers to the same input in parallel and concatenates their outputs column-wise into a single feature space.
   - In this example, the standard numeric and categorical preprocessing is combined with a custom log transformation and a custom binning step.

2. **Custom Transformers**:
   - `LogTransformer`: Applies a logarithmic transformation (`np.log1p`) to the specified column.
   - `BinningTransformer`: Bins the specified numeric column into labeled categories.

3. **ColumnTransformer**:
   - Applies the numeric and categorical preprocessing to their respective columns. It is one of the branches combined by the feature union.

4. **Pipeline Integration**:
   - Integrates the combined preprocessing steps and logistic regression model into a single pipeline.

This code demonstrates how to use Feature Union to combine different feature extraction methods within a scikit-learn pipeline, enhancing the flexibility and power of your preprocessing steps.
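
Before the full example, here is a minimal sketch of what `FeatureUnion` does to the shape of the data. The toy array and the branch names `scaled` and `logged` are illustrative assumptions. Each branch receives the same two input columns, and the outputs are stacked side by side.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data (illustrative only): 3 rows, 2 columns
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

union = FeatureUnion(transformer_list=[
    ('scaled', StandardScaler()),               # 2 standardized columns
    ('logged', FunctionTransformer(np.log1p)),  # 2 log-transformed columns
])

combined = union.fit_transform(X_demo)
print(combined.shape)  # (3, 4)
```

The first two output columns come from the scaler and the last two from the log transform, so the original features are represented twice in different forms.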



In [None]:
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# Fetch the data
data = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = data.frame

# Features and target
X = df.drop(columns='class')
y = df['class']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Custom Transformer to apply a log transformation to a specified column
class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.column] = np.log1p(X[self.column])
        return X

# Custom Transformer to bin a numeric column into categories
class BinningTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column, bins, labels):
        self.column = column
        self.bins = bins
        self.labels = labels

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.column] = pd.cut(X[self.column], bins=self.bins, labels=self.labels)
        return X

# Identify categorical and numeric columns
categorical_features = X.select_dtypes(include=['category']).columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Feature Union to combine multiple feature extraction methods.
# Each custom branch keeps only the column it derives, so every branch
# contributes a numeric block that can be concatenated with the others.
feature_union = FeatureUnion(transformer_list=[
    ('preprocessor', preprocessor),
    ('log_transform', Pipeline(steps=[
        ('log', LogTransformer(column='age')),
        ('select', ColumnTransformer(transformers=[('keep', 'passthrough', ['age'])]))
    ])),
    ('binning', Pipeline(steps=[
        ('bin', BinningTransformer(column='education-num', bins=[0, 5, 10, 15, 20], labels=['low', 'medium', 'high', 'very high'])),
        ('encode', ColumnTransformer(transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['education-num'])]))
    ])),
])

# Create a pipeline including the feature union
pipeline = Pipeline(steps=[
    ('features', feature_union),
    ('classifier', LogisticRegression(max_iter=200))
])

# Define the grid search parameters
param_grid = {
    'classifier__C': [0.1, 1.0, 10],  # Tuning regularization strength
    'classifier__solver': ['liblinear', 'saga']  # Tuning solver
}

# Create a GridSearchCV object
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)

# Train the pipeline with grid search
grid_search.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Best parameters: {grid_search.best_params_}')



### Model Ensembling with Voting Classifier

#### What Are They?

**Model ensembling** is a machine learning technique where multiple models (often called base models or learners) are combined to produce a single, stronger model. A **Voting Classifier** is a specific type of ensemble method that aggregates the predictions of multiple classifiers and makes a final prediction by majority vote or by averaging predicted probabilities. For regression, the analogous approach averages the base models' predictions (scikit-learn provides `VotingRegressor` for this).

#### What Do They Do?

A Voting Classifier combines the predictions from different models to improve the overall performance. There are two main types of voting:

1. **Hard Voting**: Each base model makes a prediction (a class label for classification problems), and the final prediction is the one that receives the majority of the votes.
2. **Soft Voting**: Each base model outputs a probability (a confidence level for each class), and the final prediction is based on the average of these probabilities.
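
The two voting rules can disagree. The sketch below works through one sample by hand, using made-up probabilities from three hypothetical models for a binary problem with a 0.5 decision threshold.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class from three models, for one sample
positive_probs = np.array([0.45, 0.45, 0.90])

# Hard voting: each model votes for a label, and the majority wins
votes = (positive_probs >= 0.5).astype(int)          # votes are [0, 0, 1]
hard_prediction = int(votes.sum() > len(votes) / 2)  # 1 vote out of 3, so class 0

# Soft voting: average the probabilities, then apply the threshold
mean_prob = positive_probs.mean()                    # (0.45 + 0.45 + 0.90) / 3 = 0.6
soft_prediction = int(mean_prob >= 0.5)              # class 1

print(f'hard voting predicts {hard_prediction}, soft voting predicts {soft_prediction}')
```

Soft voting lets one confident model outweigh two uncertain ones, which is why it needs base models that can output probabilities.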

#### Benefits of Using Them

1. **Improved Accuracy**:
   - By combining the strengths of multiple models, Voting Classifiers can often achieve better accuracy than any single model, especially if the individual models are diverse and perform differently on different parts of the data.

2. **Robustness**:
   - Combining predictions from multiple models can lead to more robust results. It reduces the risk of a poor model impacting the overall performance because the ensemble can compensate for weaker models.

3. **Reduced Overfitting**:
   - Ensembling can reduce overfitting, particularly when individual models tend to overfit the data in different ways. Combining their predictions averages out part of the individual models' errors, which mainly lowers variance.

4. **Ease of Implementation**:
   - Voting Classifiers are relatively simple to implement using libraries like scikit-learn. They allow you to leverage existing models without needing to create entirely new algorithms.

#### Summary

A Voting Classifier is a simple yet powerful ensemble technique that improves model performance by combining multiple base models. It enhances accuracy, robustness, and generalization by leveraging the strengths of different models and mitigating their individual weaknesses.

In [None]:
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, classification_report

# Fetch the data
data = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = data.frame

# Features and target
X = df.drop(columns='class')
y = df['class']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify categorical and numeric columns
categorical_features = X.select_dtypes(include=['category']).columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define individual models
logistic = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=300))
])

decision_tree = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

random_forest = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Combine models in a VotingClassifier
voting_clf = VotingClassifier(estimators=[
    ('lr', logistic),
    ('dt', decision_tree),
    ('rf', random_forest)
], voting='hard')

# Train the VotingClassifier
voting_clf.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = voting_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Evaluate the model
print(classification_report(y_test, y_pred))


Accuracy: 0.8628314054662709
              precision    recall  f1-score   support

       <=50K       0.89      0.93      0.91      7479
        >50K       0.74      0.64      0.69      2290

    accuracy                           0.86      9769
   macro avg       0.82      0.78      0.80      9769
weighted avg       0.86      0.86      0.86      9769

