### Pipeline

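This section demonstrates scikit-learn's Pipeline on a built-in dataset, chaining several estimators into a single machine learning workflow. The example uses the Iris dataset and applies imputation, standardization, dimensionality reduction with PCA, and a random forest classifier, then evaluates the whole chain with cross-validation. Together these make five components: four pipeline steps plus the cross-validation used to evaluate them.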

In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

In [3]:
# Load Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

In [None]:
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [None]:
# Define preprocessing for the numeric columns: impute, then standardize.
# All four Iris features are numeric, so no categorical encoding is needed here;
# a dataset with categorical columns would need an encoder as well.
numeric_features = X.columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Handle missing values
    ('scaler', StandardScaler())                 # Standardize features
])
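
For a dataset that mixes numeric and categorical columns, one common arrangement is to route each column group through its own transformer with ColumnTransformer. The sketch below is illustrative only: the small DataFrame and the `site` column are hypothetical and are not part of the Iris data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type frame; the real Iris features are all numeric
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, float("nan"), 6.3],
    "petal_length": [1.4, 1.4, 4.7, 6.0],
    "site": ["north", "south", "north", "east"],
})

numeric_cols = ["sepal_length", "petal_length"]
categorical_cols = ["site"]

# Same numeric treatment as in the notebook: impute, then standardize
numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Each transformer only sees the columns listed for it
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipe, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

features = preprocessor.fit_transform(df)
print(features.shape)  # (4, 5): 2 scaled numeric columns + 3 one-hot columns
```

A preprocessor built this way could take the place of the numeric-only transformer as the first step of a larger pipeline.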

In [None]:
# Define the full pipeline: preprocessing, then PCA, then a random forest classifier
# PCA reduces the four scaled features to two components
full_pipeline = Pipeline(steps=[
    ('preprocessor', numeric_transformer),
    ('pca', PCA(n_components=2)),                  # Dimensionality reduction to 2 components
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))  # Classifier
])

In [None]:
# Cross-validation to evaluate the pipeline
cv_scores = cross_val_score(full_pipeline, X_train, y_train, cv=5)

# Print cross-validation scores
print(f'Cross-validation scores: {cv_scores}')
print(f'Mean cross-validation score: {np.mean(cv_scores)}')

# Fit the pipeline on the training data
full_pipeline.fit(X_train, y_train)

# Evaluate the pipeline on the test set
test_score = full_pipeline.score(X_test, y_test)
print(f'Test set score: {test_score}')

Breakdown of the Pipeline:

Imputation (SimpleImputer):
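This step fills missing values in each feature with that feature's mean, computed on the data the pipeline is fitted on. The Iris dataset has no missing values, so the step has no effect here; it is included to show where imputation belongs when a dataset does have gaps.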
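
Because Iris has nothing to impute, the effect of this step is easier to see on a toy array. The following is a minimal sketch; `X_toy` is made-up data, not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column (hypothetical data, not Iris)
X_toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_toy)

# statistics_ holds the per-column means learned during fit,
# computed from the non-missing entries only
print(imputer.statistics_)  # column means: 2.0 and 15.0
print(X_filled)             # NaNs replaced by those means
```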

Standardization (StandardScaler):
Each feature is rescaled to a mean of 0 and a standard deviation of 1. This matters for PCA, which is sensitive to feature scale, and for other scale-sensitive algorithms; the random forest itself does not depend on feature scaling.
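
As a quick check of what standardization does, the sketch below scales the full Iris feature matrix outside the pipeline and inspects the column statistics. Fitting on the full dataset is for illustration only; inside the pipeline the scaler is fitted on training data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_iris = load_iris().data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)

# After scaling, every column should have mean ~0 and standard deviation ~1.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```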

Dimensionality Reduction (PCA):
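Principal Component Analysis (PCA) projects the four standardized features onto two principal components. Two components make the data easy to plot and reduce computation, though discarding components can also discard information the classifier could have used, so improved accuracy is not guaranteed.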
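
One way to judge how much information two components retain is to look at the explained variance ratio. This sketch runs the scaler and PCA on the full Iris data outside the pipeline, purely for inspection.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris = load_iris().data
X_scaled = StandardScaler().fit_transform(X_iris)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Four features in, two components out
print(X_reduced.shape)

# Fraction of the total variance captured by each component, and by both together
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the summed ratio is close to 1, little variance is lost by dropping the remaining components.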

Classification (RandomForestClassifier):
A random forest of 100 trees is trained on the two principal components produced by the previous steps and predicts the target class (the Iris species).
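
The sketch below shows how a fitted pipeline can be used for prediction and how an individual step can be inspected through `named_steps`. It rebuilds the pipeline as a flat list of steps (rather than the nested preprocessor used in the notebook) so the block runs on its own; the transformations and their order are the same.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)

# predict() pushes raw rows through every transformer before classifying
pred = pipe.predict(X_test[:5])
proba = pipe.predict_proba(X_test[:5])
print(iris.target_names[pred])
print(proba.round(2))

# The forest only ever sees the two PCA components, not the four raw features
forest = pipe.named_steps["classifier"]
print(forest.n_features_in_)  # 2
```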

Cross-validation (cross_val_score):
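This function is not a pipeline step; it evaluates the whole pipeline on five different splits of the training data. Each fold refits every step, so the imputer, scaler, and PCA never see that fold's held-out samples, which keeps the scores an honest estimate of how well the model generalizes.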
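
To make the per-fold refitting concrete, the sketch below writes the cross-validation loop by hand and compares it with `cross_val_score`. It is a simplified illustration: the imputer is omitted for brevity, the full Iris data is used instead of the training split, and `StratifiedKFold(n_splits=5)` is passed explicitly, which is the splitter scikit-learn uses for classifiers when `cv=5`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

cv = StratifiedKFold(n_splits=5)

# Manual loop: a fresh, unfitted copy of the pipeline is trained on each fold,
# so scaling and PCA are learned from that fold's training rows only
manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    fold_pipe = clone(pipe)
    fold_pipe.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_pipe.score(X[val_idx], y[val_idx]))

# Built-in helper using the same splitter
auto_scores = cross_val_score(pipe, X, y, cv=cv)

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

With the fixed `random_state`, the two sets of scores should match, which shows that `cross_val_score` is doing the same clone-fit-score cycle on each fold.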