In [1]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [1]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Sample data
data = {
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
    'Age': [25, 30, 35, 28, 32],
    'Income': [50000, 60000, 75000, 55000, 80000],
    'Loan_Status': ['Approved', 'Not Approved', 'Approved', 'Approved', 'Not Approved']
}

df = pd.DataFrame(data)

# Split the data into features and target
X = df.drop(columns=['Loan_Status'])
y = df['Loan_Status']

# Define which columns to preprocess using specific transformers
categorical_cols = ['Gender']
numeric_cols = ['Age', 'Income']

# Create transformers for each type of column
# (LabelEncoder is meant for 1-D targets; OrdinalEncoder is its counterpart for feature columns)
categorical_transformer = OrdinalEncoder()
numeric_transformer = StandardScaler()

# Create a ColumnTransformer to apply the transformers to the specified columns
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols),
        ('num', numeric_transformer, numeric_cols)
    ],
    remainder='passthrough'  # Include other columns as-is
)

# Fit the ColumnTransformer and transform the data in one step.
# Fitting the individual transformers separately leaves the ColumnTransformer
# itself unfitted, so calling transform() on it raises NotFittedError.
X_transformed = preprocessor.fit_transform(X)

# Now, X_transformed contains the preprocessed data
print(X_transformed)
print(y)

In [2]:
# Create a hypothetical dataset
data = pd.DataFrame({
    'age': [30, 25, 35, 40, 28],
    'income': [50000, 60000, 75000, 90000, 55000],
    'gender': ['male', 'female', 'male', 'female', 'male'],
    'education': ['bachelor', 'master', 'high_school', 'phd', 'bachelor'],
    'region': ['north', 'south', 'east', 'west', 'north'],
    'target': [1, 0, 1, 0, 1]
})

In [3]:
# Split the data into features (X) and target (y)
X = data.drop(columns=['target'])
y = data['target']

In [15]:
# Define preprocessing steps for numerical and categorical features
numerical_features = ['age', 'income']
categorical_features = ['gender', 'education', 'region']

In [16]:
numerical_transformer = Pipeline([
    ('scaler', StandardScaler(with_mean=False, with_std=True)),  # Scale to unit variance without centering
])

In [17]:
categorical_transformer = Pipeline([
    # One-hot encode; categories unseen during fit are encoded as all zeros instead of raising
    ('onehot', OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore'))
])

In [18]:
# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

In [19]:
# Create the final pipeline with preprocessing and a classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

In [20]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

In [21]:
# Predict on the held-out test set
y_pred = pipeline.predict(X_test)

In [None]:

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
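
With only five rows, a test split can easily contain a category that the encoder never saw during fitting, and `OneHotEncoder` raises on such values by default. The following is a minimal sketch of the `handle_unknown='ignore'` option; the `train` and `test` frames are made up for illustration and are not the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frames: 'master' appears only in the test frame
train = pd.DataFrame({'education': ['bachelor', 'high_school', 'phd', 'bachelor']})
test = pd.DataFrame({'education': ['master']})

# Default behaviour: transform() raises ValueError on an unseen category
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
except ValueError as err:
    print("Default encoder:", err)

# handle_unknown='ignore' encodes unseen categories as an all-zero row instead
encoder = OneHotEncoder(handle_unknown='ignore').fit(train)
encoded_test = encoder.transform(test).toarray()

print(encoder.categories_)
print(encoded_test)
```

The trade-off is that an unseen category carries no information for the model, so this is a way to keep prediction from failing rather than a substitute for representative training data.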


# Theory

`Pipeline` and `ColumnTransformer` are two important tools in scikit-learn for building and managing machine learning workflows, especially when working with complex datasets and multiple preprocessing steps. Here's an explanation of both:

1. **Pipeline**:

   A scikit-learn `Pipeline` chains a sequence of data processing steps into a single estimator. Every step except the last must be a transformer, and the last step is typically a model. The steps can include data preprocessing, feature selection, and model training. The primary advantages of using pipelines are:

   - **Simplicity**: It provides a cleaner and more organized way to structure your code by chaining together multiple processing steps.
   - **Reproducibility**: You can easily reproduce your entire machine learning workflow, including preprocessing and modeling, with just a few lines of code.
   - **Safety**: Pipelines help prevent data leakage during cross-validation: preprocessing is fit on each training fold only and then applied to the corresponding held-out fold.

   Here's a basic example of a pipeline:
   ```python
   from sklearn.pipeline import Pipeline
   from sklearn.preprocessing import StandardScaler
   from sklearn.linear_model import LogisticRegression

   pipeline = Pipeline([
       ('scaler', StandardScaler()),  # Preprocessing step
       ('classifier', LogisticRegression())  # Modeling step
   ])
   ```

2. **ColumnTransformer**:

   The `ColumnTransformer` is used when you have different preprocessing steps for different subsets of your dataset, especially when working with structured data with columns of different types (e.g., numerical and categorical). It allows you to specify which transformations should be applied to each subset of columns.

   Key features of `ColumnTransformer`:

   - **Flexibility**: You can specify different preprocessing steps for different subsets of columns. For example, you can apply one set of transformations to numerical columns and another set to categorical columns.
   - **Integration with Pipelines**: `ColumnTransformer` can be seamlessly integrated into a pipeline, allowing you to create a unified workflow.

   Here's an example of using `ColumnTransformer` within a pipeline:

   ```python
   from sklearn.compose import ColumnTransformer
   from sklearn.pipeline import Pipeline
   from sklearn.preprocessing import StandardScaler, OneHotEncoder
   from sklearn.ensemble import RandomForestClassifier

   # Specify which columns should undergo which transformations
   transformers = [
       ('num', StandardScaler(), ['age', 'income']),  # Standardize numerical columns
       ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'education'])  # One-hot encode categorical columns; ignore categories unseen during fit
   ]

   # Create a ColumnTransformer
   column_transformer = ColumnTransformer(transformers)

   # Create a pipeline with the ColumnTransformer and a classifier
   pipeline = Pipeline([
       ('preprocessor', column_transformer),
       ('classifier', RandomForestClassifier())
   ])
   ```

In the above example, `ColumnTransformer` allows you to apply different preprocessing steps to numerical and categorical columns before passing them to the classifier within the pipeline.

Both `Pipeline` and `ColumnTransformer` are valuable for creating structured and organized machine learning workflows, making it easier to build complex models and ensure consistency in data preprocessing across different parts of your dataset.
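
The consistency point can be checked directly. In the sketch below (the Iris data, the logistic regression model, and the variable names are chosen only for illustration), the scaler inside a fitted pipeline holds statistics computed from the training rows, and the same fitted scaler is reused when scoring the test rows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# fit() fits the scaler and then the classifier, using X_train only
pipeline.fit(X_train, y_train)

# The scaler's learned mean matches the mean of the training rows
scaler_mean = pipeline.named_steps['scaler'].mean_
print(np.allclose(scaler_mean, X_train.mean(axis=0)))

# score() applies the already-fitted scaler to X_test; nothing is refit
test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```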

# 2. Cross-validation with a Pipeline

Cross-validation is a crucial technique for evaluating the performance of machine learning models. In scikit-learn, you can pass a `Pipeline` to the `cross_val_score` or `cross_validate` functions, so that every preprocessing step is refit on the training portion of each fold. Here's a step-by-step example of how to do this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Load a sample dataset (Iris dataset in this case)
data = load_iris()
X = data.data
y = data.target

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: Standardize the features
    ('pca', PCA(n_components=2)),  # Step 2: Apply Principal Component Analysis
    ('classifier', SVC())         # Step 3: Use a Support Vector Classifier
])

# Perform cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)  # 5-fold cross-validation
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())
```

In this example:

1. We load the Iris dataset as an example.
2. We create a pipeline using the `Pipeline` class, which consists of three steps:
   - Step 1: Standardize the features using `StandardScaler`.
   - Step 2: Apply Principal Component Analysis (PCA) for dimensionality reduction.
   - Step 3: Use a Support Vector Classifier (SVC) as the final estimator.

3. We then use `cross_val_score` to perform 5-fold cross-validation on the pipeline. You can change the `cv` parameter to specify the number of folds for cross-validation.

This code snippet demonstrates how to create a pipeline that includes preprocessing and modeling steps and perform cross-validation to assess the model's performance. You can replace the dataset, preprocessing, and classifier components with your own data and model as needed.
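
`cross_validate`, mentioned above, works the same way but can report several metrics and the training scores in one call. Here is a minimal sketch, assuming the same Iris pipeline; the choice of metrics and the explicit `StratifiedKFold` splitter are illustrative, not requirements.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', SVC()),
])

# Shuffled, stratified folds with a fixed seed for repeatable splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(
    pipeline, X, y,
    cv=cv,
    scoring=['accuracy', 'f1_macro'],
    return_train_score=True,
)

print("Test accuracy per fold:", results['test_accuracy'])
print("Mean macro F1:", results['test_f1_macro'].mean())
print("Mean train accuracy:", results['train_accuracy'].mean())
```

Comparing the train and test scores gives a rough indication of overfitting: a large gap suggests the model fits the training folds much better than unseen data.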