## Why Use Pipelines for Hyperparameter Tuning?

Using pipelines for hyperparameter tuning offers several advantages:

1. **Consistency and Reproducibility**:
   - Pipelines ensure that the same preprocessing steps are consistently applied to all data splits during cross-validation, preventing data leakage and ensuring reproducibility.

2. **Simplified Workflow**:
   - Pipelines encapsulate the entire machine learning workflow (preprocessing, feature selection, modeling) into a single object. This simplifies the code and makes it easier to manage complex workflows.

3. **Hyperparameter Optimization**:
   - With pipelines, you can optimize hyperparameters for both preprocessing steps and the model simultaneously. For example, you can tune the parameters for scaling, encoding, and the classifier in one go.

4. **Avoiding Data Leakage**:
   - Pipelines prevent data leakage by ensuring that transformations (e.g., scaling, encoding) are fitted on the training folds only during cross-validation. The fitted transformations are then applied to the validation fold, so the validation data never influences them.

5. **Modularity and Flexibility**:
   - Pipelines are modular, meaning you can easily swap out or modify steps (e.g., changing the model or adding new preprocessing steps) without altering the rest of the workflow.
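
As a minimal sketch of tuning preprocessing and model hyperparameters together, the example below searches over imputation, scaling, and classifier settings in one grid. The synthetic data, the `demo_` names, and the choice of `LogisticRegression` are assumptions made for illustration; they are not part of the notebook's own workflow.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Parameter names follow the pattern <step name>__<parameter name>
demo_grid = {
    'imputer__strategy': ['mean', 'median'],   # preprocessing hyperparameter
    'scaler__with_mean': [True, False],        # preprocessing hyperparameter
    'classifier__C': [0.1, 1.0, 10.0],         # model hyperparameter
}

# One search covers all 2 * 2 * 3 combinations across the three steps
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

Calling `demo_pipeline.get_params().keys()` lists every name that can be used as a key in such a grid.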

### Example Scenario: Without Pipelines

When not using pipelines, you have to apply each preprocessing step to your training and validation sets by hand. This invites inconsistencies and data leakage, for example when a scaler is fitted on the full dataset before it is split. You also have to manage and tune the hyperparameters of each step separately, which is cumbersome and error-prone.
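
To make the manual bookkeeping concrete, the sketch below runs cross-validation by hand, re-fitting the scaler on each training fold, and then repeats it with a pipeline. The synthetic data, the `LogisticRegression` model, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual version: the scaler must be re-fitted on the training fold in every split
manual_scores = []
for train_idx, val_idx in folds.split(X_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])  # fit on the training fold only
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    manual_scores.append(
        model.score(scaler.transform(X_demo[val_idx]), y_demo[val_idx])
    )

# Pipeline version: the same fit/transform bookkeeping happens automatically
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
pipeline_scores = cross_val_score(pipe, X_demo, y_demo, cv=folds)

# Both approaches should give the same per-fold accuracy
print(np.allclose(manual_scores, pipeline_scores))
```

The manual loop is correct only as long as every transformer is fitted inside the loop; fitting the scaler once on `X_demo` before the loop would be the leaking variant.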

### Example Scenario: With Pipelines

With pipelines, you define your preprocessing and modeling steps once. During hyperparameter tuning, all steps are applied consistently, and you can easily optimize the hyperparameters for the entire pipeline.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Take a sample of the dataset to reduce run time
df = df.sample(frac=0.3, random_state=42)

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Select features and target
target = 'class'
X = df.drop(columns=[target])
y = df[target].apply(lambda x: 1 if x == '>50K' else 0)  # Convert target to binary

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the full pipeline with a classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())])

# Train the model
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# Perform cross-validation to check for overfitting
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("Cross-validation scores: ", cv_scores)
print("Mean cross-validation score: ", cv_scores.mean())


              precision    recall  f1-score   support

           0       0.89      0.94      0.91      2236
           1       0.76      0.61      0.67       695

    accuracy                           0.86      2931
   macro avg       0.82      0.77      0.79      2931
weighted avg       0.86      0.86      0.86      2931

Cross-validation scores:  [0.85884861 0.86268657 0.8587884  0.85622867 0.8587884 ]
Mean cross-validation score:  0.8590681283975055


## Random Search

**Description**:
- Random search samples a fixed number of hyperparameter combinations from the specified distributions.

**Strengths**:
1. **Efficient**: Typically requires less computational time compared to grid search as it evaluates fewer combinations.
2. **Good Performance**: Can find good hyperparameter combinations faster, especially when the hyperparameter space is large.
3. **Flexible**: Can easily be extended to more iterations if needed.

**Weaknesses**:
1. **Stochastic Nature**: Results can vary between runs because it samples randomly, unless you fix `random_state`.
2. **Not Exhaustive**: May miss the best combination as it does not evaluate all possible combinations.

**When to Choose**:
- When the hyperparameter space is large or when computational resources are limited.
- When you want a quick and reasonably good set of hyperparameters rather than the absolute best.
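
To see what "sampling from the specified distributions" looks like, the sketch below draws a few candidate settings with `ParameterSampler`, the utility scikit-learn provides for this kind of sampling. The distributions mirror the style used in this notebook, but the `demo_distributions` name and the choice of five draws are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

demo_distributions = {
    'n_estimators': randint(50, 200),   # integers from 50 to 199 (upper bound is exclusive)
    'max_depth': [10, 20, None],        # lists are sampled uniformly
}

# Draw 5 candidate settings; a fixed random_state makes the draw repeatable
candidates = list(ParameterSampler(demo_distributions, n_iter=5, random_state=42))
for params in candidates:
    print(params)
```

Each dictionary is one candidate that a randomized search would fit and score; raising `n_iter` simply draws more of them.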


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml
from scipy.stats import randint

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Take a sample of the dataset to reduce run time
df = df.sample(frac=0.3, random_state=42)

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Select features and target
target = 'class'
X = df.drop(columns=[target])
y = df[target].apply(lambda x: 1 if x == '>50K' else 0)  # Convert target to binary

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the pipeline with RandomForestClassifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))])

# Define the parameter distributions for RandomizedSearchCV
param_distributions = {
    'classifier__n_estimators': randint(50, 200),
    'classifier__max_depth': [10, 20, None],
    'classifier__min_samples_split': randint(2, 11),
    'classifier__min_samples_leaf': randint(1, 5),
    'classifier__bootstrap': [True, False]
}

# Perform randomized search
random_search = RandomizedSearchCV(pipeline, param_distributions, n_iter=50, cv=5, n_jobs=-1, verbose=2, random_state=42)
random_search.fit(X_train, y_train)

# Best parameters and best estimator
print("Best parameters found: ", random_search.best_params_)
print("Best estimator found: ", random_search.best_estimator_)

# Predict and evaluate using the best estimator
y_pred = random_search.predict(X_test)
print(classification_report(y_test, y_pred))

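# Re-run cross-validation on the tuned pipeline; the search already used this
# training data, so these scores are optimistic (see also random_search.best_score_)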
cv_scores = cross_val_score(random_search.best_estimator_, X_train, y_train, cv=5)
print("Cross-validation scores: ", cv_scores)
print("Mean cross-validation score: ", cv_scores.mean())


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters found:  {'classifier__bootstrap': True, 'classifier__max_depth': None, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 178}
Best estimator found:  Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fnlwgt',
                                                   'education_num',
                                                   'capital_gain',
                                                   'capital_loss',
                   

## Grid Search

**Description**:
- Grid search is an exhaustive search method where all possible combinations of the specified hyperparameters are evaluated.

**Strengths**:
1. **Comprehensive**: Evaluates every combination of hyperparameters in the specified grid, so the combination with the best cross-validation score within that grid is always found.
2. **Deterministic**: Given the same grid, dataset, and cross-validation splits, and an estimator with a fixed `random_state`, it produces the same results on every run.

**Weaknesses**:
1. **Computationally Expensive**: Can be very slow and resource-intensive, especially with a large number of hyperparameters or a wide range of values.
2. **Inefficient**: Often evaluates many combinations that do not improve the model performance.

**When to Choose**:
- When the hyperparameter space is small and computational resources are not a constraint.
- When you need to ensure that the absolute best combination within the specified range is found.
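
The cost of an exhaustive search can be estimated before running it. The sketch below counts the candidates in a grid like the one used in this notebook with `ParameterGrid`; the `demo_grid` name and the assumption of 5-fold cross-validation are for illustration.

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [10, 20, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__bootstrap': [True, False],
}

n_candidates = len(ParameterGrid(demo_grid))  # 3 * 3 * 3 * 3 * 2 combinations
n_fits = n_candidates * 5                     # one fit per fold with cv=5
print(n_candidates, n_fits)                   # prints: 162 810
```

Adding one more three-valued hyperparameter would triple both numbers, which is why grid search scales poorly as the space grows.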



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml

# Load the Adult Census Income dataset from OpenML
adult = fetch_openml(data_id=1590, as_frame=True, parser='auto')
df = adult.frame

# Take a sample of the dataset
df = df.sample(frac=0.1, random_state=42)  # Adjust frac to 0.1 (10%) for a smaller sample

# Rename columns to lower case and replace hyphens with underscores
df.columns = [col.lower().replace('-', '_') for col in df.columns]

# Select features and target
target = 'class'
X = df.drop(columns=[target])
y = df[target].apply(lambda x: 1 if x == '>50K' else 0)  # Convert target to binary

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the pipeline with RandomForestClassifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [10, 20, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__bootstrap': [True, False]
}

# Perform grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best parameters found: ", grid_search.best_params_)


Fitting 5 folds for each of 162 candidates, totalling 810 fits
Best parameters found:  {'classifier__bootstrap': True, 'classifier__max_depth': None, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 10, 'classifier__n_estimators': 50}


### Choosing Between Grid Search and Random Search

**Grid Search**:
- Use when you have a relatively small hyperparameter space and you want to ensure that the optimal parameters are found within that space.
- Suitable for cases where computational resources and time are not major constraints.

**Random Search**:
- Use when the hyperparameter space is large, making grid search computationally impractical.
- Suitable for cases where you need a quicker solution and can accept sampling a large hyperparameter space rather than evaluating it exhaustively.
- Often used as a first step to identify promising regions of the hyperparameter space, which can then be fine-tuned using more focused searches.

### Practical Example

In practice, you might start with random search to identify a promising set of hyperparameters and then use grid search around that region for fine-tuning. This combined approach can provide a good balance between efficiency and thoroughness.
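
One way to sketch this two-stage approach is shown below: a coarse random search over a wide range, followed by a small grid centred on the best value it found. The synthetic data, the single tuned hyperparameter, the range, and the offsets are all assumptions chosen to keep the example small.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0)

# Stage 1: coarse random search over a wide range
coarse = RandomizedSearchCV(
    forest,
    {'min_samples_split': randint(2, 41)},
    n_iter=8, cv=3, random_state=0,
)
coarse.fit(X_demo, y_demo)
center = int(coarse.best_params_['min_samples_split'])

# Stage 2: small grid around the stage-1 winner (values kept >= 2, the minimum allowed)
fine_values = sorted({max(2, center + offset) for offset in (-2, -1, 0, 1, 2)})
fine = GridSearchCV(forest, {'min_samples_split': fine_values}, cv=3)
fine.fit(X_demo, y_demo)

print(center, fine.best_params_)
```

Because the fine grid contains the stage-1 winner and uses the same folds, its best cross-validation score cannot be lower than the one from the coarse search.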

### Summary

- **Grid Search**: Comprehensive but computationally expensive; best for small hyperparameter spaces.
- **Random Search**: More efficient and suitable for large hyperparameter spaces; good for quick results and exploring large spaces.

## Final Model with Best Params

In [None]:
# Extract the best parameters
best_params = grid_search.best_params_

# Create a new pipeline with the best parameters
final_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        random_state=42,
        n_estimators=best_params['classifier__n_estimators'],
        max_depth=best_params['classifier__max_depth'],
        min_samples_split=best_params['classifier__min_samples_split'],
        min_samples_leaf=best_params['classifier__min_samples_leaf'],
        bootstrap=best_params['classifier__bootstrap']
    ))
])

# Train the final model on the entire training dataset
final_pipeline.fit(X_train, y_train)

# Predict and evaluate using the final model
y_pred_final = final_pipeline.predict(X_test)
print("Final Model Performance")
print(classification_report(y_test, y_pred_final))

# Optionally, you can save the final model
import joblib
joblib.dump(final_pipeline, '/content/sample_data/final_model.pkl')

Final Model Performance
              precision    recall  f1-score   support

           0       0.89      0.96      0.92       747
           1       0.81      0.60      0.69       230

    accuracy                           0.87       977
   macro avg       0.85      0.78      0.81       977
weighted avg       0.87      0.87      0.87       977



['/content/sample_data/final_model.pkl']
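
Rebuilding the pipeline by hand, as above, is not the only way to obtain the final model. The sketch below shows two alternatives that should be equivalent: using `best_estimator_` directly, or cloning the unfitted pipeline and applying the best parameters with `set_params`. The synthetic data and the `demo_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=20, random_state=0)),
])
demo_search = GridSearchCV(demo_pipeline, {'classifier__max_depth': [3, None]}, cv=3)
demo_search.fit(X_demo, y_demo)

# Option 1: with refit=True (the default), best_estimator_ has already been
# refitted on all the data passed to fit()
refit_model = demo_search.best_estimator_

# Option 2: copy the unfitted pipeline and apply the best parameters by name
rebuilt_model = clone(demo_pipeline).set_params(**demo_search.best_params_)
rebuilt_model.fit(X_demo, y_demo)

same = np.array_equal(refit_model.predict(X_demo), rebuilt_model.predict(X_demo))
print(same)
```

Both options avoid copying each hyperparameter into the constructor by hand, so adding a parameter to the search later needs no change here.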


### Saving the Pipeline:
1. **`joblib.dump`**:
   - `joblib.dump` is used to serialize the pipeline object and save it to a file.
   - The method takes two arguments:
     - The first argument is the object to be saved (in this case, `final_pipeline`).
     - The second argument is the file path where the object should be saved (here, `'/content/sample_data/final_model.pkl'`).

### Loading the Pipeline:
1. **`joblib.load`**:
   - To load the saved pipeline, you use `joblib.load`.
   - This method reads the serialized object from the file and deserializes it back into a usable Python object.


### Explanation:

1. **Saving the Pipeline**:
   - `joblib.dump(final_pipeline, 'final_model.pkl')`: Saves the entire pipeline (including preprocessing steps and the trained model) to the file `final_model.pkl`.

2. **Loading the Pipeline**:
   - `loaded_pipeline = joblib.load('final_model.pkl')`: Loads the saved pipeline from the file `final_model.pkl`.
   - The loaded pipeline can be used exactly like the original pipeline to make predictions. Calling `fit` on it again retrains the model from scratch rather than continuing the earlier training.
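
The save and load steps can be checked with a quick round trip. The sketch below writes a small fitted pipeline to a temporary directory, loads it back, and compares predictions. The synthetic data, the `LogisticRegression` model, and the file name `demo_model.pkl` are hypothetical stand-ins, not the notebook's own objects.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical file name
    joblib.dump(demo_pipeline, model_path)                # serialize the fitted pipeline
    restored_pipeline = joblib.load(model_path)           # deserialize it again

# The restored pipeline should reproduce the original predictions
round_trip_ok = np.array_equal(
    demo_pipeline.predict(X_demo), restored_pipeline.predict(X_demo)
)
print(round_trip_ok)
```

Using a temporary directory keeps the example independent of any particular environment's file layout.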

### Benefits of Saving the Pipeline:

1. **Consistency**:
   - Ensures that the exact same preprocessing steps and model configuration are used when the model is deployed or reused, maintaining consistency across different stages of the machine learning workflow.

2. **Reusability**:
   - Allows you to reuse the trained model and preprocessing steps without retraining, saving time and computational resources.

3. **Portability**:
   - Makes it easy to share the trained model and preprocessing steps with others, enabling collaboration and reproducibility, provided they load it with compatible versions of scikit-learn and its dependencies.

4. **Deployment**:
   - Simplifies the deployment process by providing a single object that includes all necessary steps for making predictions on new data.

By saving the entire pipeline, you encapsulate the entire data preprocessing and modeling workflow, ensuring that it can be reliably and efficiently reused in the future.

### Loading the Pipeline and Making Predictions

In [None]:
import joblib
from sklearn.metrics import classification_report

# Load the saved pipeline
loaded_pipeline = joblib.load('/content/sample_data/final_model.pkl')

# Use the loaded pipeline to make predictions on the test set
y_pred_loaded = loaded_pipeline.predict(X_test)

# Evaluate the loaded model
print("Loaded Model Performance")
print(classification_report(y_test, y_pred_loaded))

Loaded Model Performance
              precision    recall  f1-score   support

           0       0.89      0.96      0.92       747
           1       0.81      0.60      0.69       230

    accuracy                           0.87       977
   macro avg       0.85      0.78      0.81       977
weighted avg       0.87      0.87      0.87       977

