### Why Perform Feature Selection Using Pipelines?

**Feature selection** is the process of selecting a subset of relevant features for use in model construction. Integrating feature selection within a pipeline offers several benefits:

1. **Consistency and Reproducibility**:
   - By including feature selection in the pipeline, you ensure that the same selection process is applied consistently across different stages of model development (training, validation, testing).
   - This reproducibility is crucial when moving from development to production, as the same steps will be applied without manual intervention.

2. **Simplified Workflow**:
   - Pipelines encapsulate the entire workflow, reducing the complexity of the code and making it more maintainable.
   - This simplification helps in debugging and makes the process easier to understand and document.

3. **Avoid Data Leakage**:
   - Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates.
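   - When a pipeline is passed to `cross_val_score` or `GridSearchCV`, the feature selector is refit on the training portion of each fold, so held-out rows never influence which features are kept. Selecting features on the full dataset before splitting leaks test information into training.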

4. **Automated Hyperparameter Tuning**:
   - When you use pipelines, you can easily integrate hyperparameter tuning (e.g., using `GridSearchCV` or `RandomizedSearchCV`) to select the best parameters for both feature selection and the model.
   - This integration allows for a more comprehensive search of the best configuration, optimizing both the preprocessing and model training stages simultaneously.

5. **Modularity and Reusability**:
   - Pipelines make it easy to modify or extend your workflow. For example, you can switch out the feature selection method or add new preprocessing steps with minimal changes to the overall structure.
   - This modularity is beneficial when experimenting with different models or preprocessing techniques.
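
The leakage point above can be made concrete with a small experiment. The sketch below is illustrative only: it uses synthetic noise features and random labels (not the Adult dataset), and the sizes and `k` are arbitrary choices. Because the labels are random, any accuracy noticeably above chance comes from the selector having seen the validation rows.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: features and labels are independent, so no real signal exists.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)

# Leaky: the selector is fit on ALL rows before cross-validation,
# so every validation fold has already influenced which columns survive.
X_leaky = SelectKBest(score_func=f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Safe: the selector lives inside the pipeline and is refit on each training fold.
pipe = Pipeline(steps=[
    ("feature_selection", SelectKBest(score_func=f_classif, k=20)),
    ("classifier", LogisticRegression(max_iter=1000)),
])
safe_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"Selection before CV (leaky): {leaky_score:.2f}")
print(f"Selection inside pipeline:   {safe_score:.2f}")
```

The leaky estimate should come out well above the pipeline estimate, which should stay near chance level.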

### Why Use a Pipeline Instead of Doing It Manually?

**Manual Feature Selection**:
- **Error-Prone**: Manually applying feature selection can lead to inconsistencies and errors, especially in complex workflows.
- **Time-Consuming**: Requires more effort to ensure that the same steps are applied consistently across different datasets (training, validation, testing).
- **Less Reproducible**: Manual processes are harder to document and reproduce, leading to potential issues when transitioning from development to production.

**Using Pipelines**:
- **Automated Workflow**: Encapsulates the entire process, from preprocessing to feature selection to model training, into a single, cohesive workflow.
- **Consistent Application**: Ensures that all steps are applied in a consistent manner, reducing the risk of errors and improving reproducibility.
- **Ease of Use**: Simplifies the experimentation process, making it easy to swap out or modify different components of the workflow.
- **Integrated Hyperparameter Tuning**: Allows for comprehensive optimization of both preprocessing and model training parameters in a unified framework.
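
To show what the pipeline automates, here is a minimal sketch that performs scaling, selection, and training by hand and then repeats the same steps with a `Pipeline`. The synthetic dataset, the step names, and `k=5` are assumptions made for illustration. The manual version has to remember to call `transform` (never `fit`) on the test set at every step, which is where mistakes tend to creep in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Adult dataset)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Manual workflow: each step is fit on the training data, then applied to the test data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
selector = SelectKBest(score_func=f_classif, k=5).fit(X_train_scaled, y_train)
clf = LogisticRegression(max_iter=1000).fit(selector.transform(X_train_scaled), y_train)
manual_pred = clf.predict(selector.transform(scaler.transform(X_test)))

# Pipeline workflow: the same three steps, chained and fit in one call.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_classif, k=5)),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
pipeline_pred = pipe.predict(X_test)

print("Predictions identical:", np.array_equal(manual_pred, pipeline_pred))
```

Both routes compute the same thing; the pipeline simply removes the bookkeeping.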

### Example Benefits of Using a Pipeline for Feature Selection

1. **Consistent Training and Evaluation**: Ensures that feature selection is applied consistently during cross-validation, avoiding data leakage and providing a more accurate estimate of model performance.
2. **Simplified Code**: Reduces boilerplate code and the potential for bugs by encapsulating feature selection within the pipeline.
3. **Efficient Hyperparameter Tuning**: Facilitates the optimization of feature selection parameters alongside model parameters, leading to better overall performance.

By leveraging pipelines for feature selection, you gain a more robust, maintainable, and scalable approach to building and evaluating machine learning models.

### No Feature Selection


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Select features and target
target = 'class'
X = df.drop(columns=[target])
y = df[target].apply(lambda x: 1 if x == '>50K' else 0)  # Convert target to binary

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the full pipeline with a classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))])

# Train the model
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# Perform cross-validation to check for overfitting
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("Cross-validation scores: ", cv_scores)
print("Mean cross-validation score: ", cv_scores.mean())

              precision    recall  f1-score   support

           0       0.89      0.93      0.91      7479
           1       0.74      0.61      0.66      2290

    accuracy                           0.86      9769
   macro avg       0.81      0.77      0.79      9769
weighted avg       0.85      0.86      0.85      9769

Cross-validation scores:  [0.85323097 0.84926424 0.85527831 0.84873304 0.85026875]
Mean cross-validation score:  0.8513550608264019


### Feature Selection - Select K Best

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Number of features to select
n = 3

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Select features and target
target = 'class'
X = df.drop(columns=[target])
y = df[target].apply(lambda x: 1 if x == '>50K' else 0)  # Convert target to binary

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the full pipeline with feature selection and a classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=n)),  # Adjust k as needed
    ('classifier', LogisticRegression(max_iter=1000))])

# Train the model
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# Perform cross-validation to check for overfitting
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("Cross-validation scores: ", cv_scores)
print("Mean cross-validation score: ", cv_scores.mean())

# Get the selected feature names
selected_features_mask = pipeline.named_steps['feature_selection'].get_support()
all_features = numeric_features + pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features).tolist()
selected_features = [feature for feature, selected in zip(all_features, selected_features_mask) if selected]

print("Selected features:", selected_features)


              precision    recall  f1-score   support

           0       0.86      0.93      0.89      7479
           1       0.68      0.49      0.57      2290

    accuracy                           0.83      9769
   macro avg       0.77      0.71      0.73      9769
weighted avg       0.82      0.83      0.82      9769

Cross-validation scores:  [0.82072937 0.81445937 0.81535509 0.81545943 0.81609931]
Mean cross-validation score:  0.8164205133394938
Selected features: ['education_num', 'marital_status_Married-civ-spouse', 'relationship_Husband']


### Grid Search Feature Selection

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Number of features to select
n = 3

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Select features and target
target = 'class'
X = df.drop(columns=[target])
y = df[target].apply(lambda x: 1 if x == '>50K' else 0)  # Convert target to binary

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create base logistic regression model
logreg = LogisticRegression(max_iter=1000)

# Create the full pipeline with feature selection and a classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=n)),  # Placeholder for feature selection
    ('classifier', logreg)])

# Define parameter grid for GridSearchCV
param_grid = [
    {
        'feature_selection': [SelectKBest(score_func=f_classif)],
        'feature_selection__k': [5, 10, 15]
    },
    {
        'feature_selection': [RFE(estimator=logreg)],
        'feature_selection__n_features_to_select': [5, 10, 15]
    }
]

# Perform grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters and best estimator
print("Best parameters found: ", grid_search.best_params_)
print("Best estimator found: ", grid_search.best_estimator_)

# Predict and evaluate
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))

# Get the selected feature names (SelectKBest and RFE both expose get_support())
best_pipeline = grid_search.best_estimator_
best_feature_selector = best_pipeline.named_steps['feature_selection']
onehot = best_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
all_features = numeric_features + onehot.get_feature_names_out(categorical_features).tolist()
selected_features_mask = best_feature_selector.get_support()
selected_features = [feature for feature, selected in zip(all_features, selected_features_mask) if selected]

print("Selected features: ", selected_features)


Best parameters found:  {'feature_selection': SelectKBest(k=15), 'feature_selection__k': 15}
Best estimator found:  Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fnlwgt',
                                                   'education_num',
                                                   'capital_gain',
                                                   'capital_loss',
                                                   'hours_per_week']),
                                                 ('cat',
                                                  P

There are several feature selection methods available in scikit-learn, each with different strategies for selecting the most relevant features. Here are some common methods:

1. **Univariate Feature Selection**:
   - **SelectKBest**: Selects the top k features based on univariate statistical tests.
   - **SelectPercentile**: Keeps a user-specified percentage of the features, ranked by the same univariate scores.

2. **Recursive Feature Elimination (RFE)**:
   - **RFE**: Repeatedly fits a model, drops the lowest-ranked features, and refits on the remaining ones until the requested number is left. Features are ranked by the model's `coef_` or `feature_importances_`.
   - **RFECV**: RFE with cross-validation to find the optimal number of features.

3. **Model-Based Feature Selection**:
   - **SelectFromModel**: Selects features based on importance weights. Can be used with any estimator that exposes a `coef_` or `feature_importances_` attribute.
   - **Lasso**: Uses L1 regularization to shrink some coefficients to zero, effectively performing feature selection.
   - **Tree-based methods**: Tree-based estimators such as RandomForest, GradientBoosting, etc., can be used to evaluate the importance of features.

4. **Principal Component Analysis (PCA)**:
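   - **PCA**: Transforms the features into a lower-dimensional space. Strictly speaking it is feature extraction rather than feature selection, because each component is a combination of all original features, but it is often considered alongside selection methods as a way to reduce dimensionality.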
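
As a rough illustration of how these families differ in practice, the sketch below fits one representative of each on synthetic data. The dataset, the percentile, the forest size, and the number of PCA components are arbitrary assumptions. Selectors report which original columns they kept via `get_support()`, whereas PCA has no such method because it builds new columns.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectFromModel, SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Adult dataset)
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=2, random_state=0
)
X = StandardScaler().fit_transform(X)

reducers = {
    "SelectPercentile": SelectPercentile(score_func=f_classif, percentile=25),
    "RFECV": RFECV(estimator=LogisticRegression(max_iter=1000), cv=5),
    "SelectFromModel (forest)": SelectFromModel(
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median",
    ),
    "PCA": PCA(n_components=3),
}

shapes = {}
for name, reducer in reducers.items():
    X_reduced = reducer.fit_transform(X, y)
    shapes[name] = X_reduced.shape
    if hasattr(reducer, "get_support"):
        # Selection: a subset of the original columns survives unchanged
        kept = reducer.get_support(indices=True).tolist()
        print(f"{name}: kept original columns {kept}")
    else:
        # Extraction: new columns are computed from all original columns
        print(f"{name}: built {X_reduced.shape[1]} components from all {X.shape[1]} columns")
```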

Let's integrate a few more feature selection methods into the grid search for comparison:



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Select features and target
target = 'class'
X = df.drop(columns=[target])
y = df[target].apply(lambda x: 1 if x == '>50K' else 0)  # Convert target to binary

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create base logistic regression model
logreg = LogisticRegression(max_iter=1000)

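# Candidate feature counts, derived from the number of raw input columns.
# Note: selection runs after one-hot encoding, so the encoded matrix has more columns than total_features.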
total_features = len(numeric_features) + len(categorical_features)
feature_selection_params = {
    'select_kbest': [total_features // i for i in range(2, 6)],
    'rfe': [total_features // i for i in range(2, 6)]
}

# Create the full pipeline with a placeholder for feature selection
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=total_features // 2)),  # Placeholder for feature selection
    ('classifier', logreg)])

# Define parameter grid for GridSearchCV
param_grid = [
    {
        'feature_selection': [SelectKBest(score_func=f_classif)],
        'feature_selection__k': feature_selection_params['select_kbest']
    },
    {
        'feature_selection': [RFE(estimator=logreg)],
        'feature_selection__n_features_to_select': feature_selection_params['rfe']
    },
    {
        'feature_selection': [SelectFromModel(estimator=LogisticRegression(penalty="l1", solver='liblinear'))],
        'feature_selection__threshold': ["mean", "median"]
    },
    {
        'feature_selection': [SelectFromModel(estimator=RandomForestClassifier(n_estimators=100))],
        'feature_selection__threshold': ["mean", "median"]
    }
]

# Perform grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters and best estimator
print("Best parameters found: ", grid_search.best_params_)
print("Best estimator found: ", grid_search.best_estimator_)

# Predict and evaluate
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))

# Get the selected feature names (SelectKBest, RFE, and SelectFromModel all expose get_support())
best_pipeline = grid_search.best_estimator_
best_feature_selector = best_pipeline.named_steps['feature_selection']
onehot = best_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
all_features = numeric_features + onehot.get_feature_names_out(categorical_features).tolist()
selected_features_mask = best_feature_selector.get_support()
selected_features = [feature for feature, selected in zip(all_features, selected_features_mask) if selected]

print("Selected features: ", selected_features)



Best parameters found:  {'feature_selection': SelectFromModel(estimator=RandomForestClassifier(), threshold='median'), 'feature_selection__threshold': 'median'}
Best estimator found:  Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fnlwgt',
                                                   'education_num',
                                                   'capital_gain',
                                                   'capital_loss',
                                                   'hours_per_week']),
                                        