# **Problem Statement**  
## **17. Create a pipeline using sklearn.pipeline for preprocessing + training.**

Create a machine learning pipeline using sklearn.pipeline that performs the following steps in a single, reproducible workflow:
1. Data preprocessing
2. Model training
3. Prediction & evaluation

### Constraints & Example Inputs/Outputs

### Constraints
- The dataset may contain both numerical and categorical features
- Preprocessing must be fitted on the training data only, then applied unchanged to the test data
- The model must be trained and evaluated through the same pipeline
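
The second constraint can be illustrated with a minimal sketch. The values below are hypothetical toy numbers, not the dataset used later in this notebook: the scaler learns its mean and standard deviation from the training rows and reuses them on the test row without re-fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
X_train_demo = np.array([[20.0], [30.0], [40.0]])
X_test_demo = np.array([[50.0]])

scaler_demo = StandardScaler()
scaler_demo.fit(X_train_demo)  # statistics come from the training rows only

# transform() reuses the training mean and standard deviation; it does not re-fit.
X_test_scaled_demo = scaler_demo.transform(X_test_demo)

print("Training mean:", scaler_demo.mean_[0])
print("Scaled test value:", X_test_scaled_demo[0, 0])
```

Calling fit_transform on the test data instead would compute new statistics from the test rows, which is the leakage the constraint is meant to rule out.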

### Example Input:
```text
Features:
- Numerical: Age, Salary
- Categorical: City

Target:
- Purchased (0 or 1)
```

### Expected Output:
- Trained pipeline
- Predictions on test data
- Accuracy score


### Solution Approach

### Why Use Pipelines?
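Pipelines:
- Prevent data leakage by fitting every preprocessing step on the training data only
- Ensure the same preprocessing is applied at training and inference time
- Reduce training and inference to single fit and predict calls
- Work directly with cross-validation and grid search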
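
As a rough sketch of the cross-validation point, the example below passes a whole pipeline to cross_val_score, so the scaler is re-fitted on the training portion of each fold. The synthetic data from make_classification and the names X_demo, y_demo, and cv_pipeline are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption; the notebook's own dataset is too small for 5 folds).
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

cv_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# Each fold fits the scaler and the classifier on that fold's training split only.
scores = cross_val_score(cv_pipeline, X_demo, y_demo, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```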

### Pipeline Components
1. Preprocessing
   - Scaling numerical features
   - Encoding categorical features

2. Model
   - Logistic Regression (example)

### Tools Used
- ColumnTransformer
- StandardScaler
- OneHotEncoder
- Pipeline

### Solution Code

In [6]:
# Step1: Import Required Libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [7]:
# Step2: Create Sample Dataset
data = pd.DataFrame({
    "Age": [22, 25, 47, 52, 46, 56],
    "Salary": [25000, 30000, 50000, 70000, 60000, 80000],
    "City": ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai", "Delhi"],
    "Purchased": [0, 0, 1, 1, 1, 1]
})

X = data.drop("Purchased", axis=1)
y = data["Purchased"]


In [8]:
# Step3: Train-Test Split
# stratify=y keeps both classes in the training and test sets on this tiny dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


In [None]:
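# Approach1: Brute Force Approach (Without Pipeline)
# Manual preprocessing: one-hot encode the training data, then encode the test
# data and align it to the training columns so both have identical features.
X_train_encoded = pd.get_dummies(X_train, drop_first=True, dtype=int)
X_test_encoded = pd.get_dummies(X_test, dtype=int).reindex(
    columns=X_train_encoded.columns, fill_value=0
)

# Fit the scaler on the training data only, then reuse it on the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print("Accuracy (Brute Force):", accuracy_score(y_test, y_pred))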


### Alternative Solution

In [None]:
# Approach2: Optimized Approach (Using sklearn Pipeline)
numeric_features = ["Age", "Salary"]
categorical_features = ["City"]

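preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        # handle_unknown="ignore" encodes cities unseen during fit as all zeros
        # instead of raising an error at prediction time
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)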

pipeline = Pipeline(
    steps=[
        ("preprocessing", preprocessor),
        ("classifier", LogisticRegression())
    ]
)

pipeline.fit(X_train, y_train)


### Alternative Approaches

- make_pipeline() (shorter syntax)
- Pipeline + GridSearchCV
- FeatureUnion for parallel pipelines
- Using pipelines with cross-validation
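
The first two alternatives can be combined in a short sketch. It assumes synthetic data from make_classification and an illustrative grid over the regularization strength C; neither comes from this notebook. make_pipeline names each step after its lowercased class name, which is why the grid key starts with logisticregression.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption, used only so the search has enough rows per fold).
X_demo, y_demo = make_classification(n_samples=120, n_features=4, random_state=0)

# make_pipeline generates the step names "standardscaler" and "logisticregression".
short_pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Parameters of a step are addressed as <step name>__<parameter name>.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(short_pipeline, param_grid, cv=3)
search.fit(X_demo, y_demo)

print("Step names:", list(short_pipeline.named_steps))
print("Best parameters:", search.best_params_)
```

Because the search wraps the whole pipeline, the scaler is re-fitted inside every fold for every candidate value of C.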

### Test Case

In [None]:
# Test Case 1: Model Prediction

y_pred_pipeline = pipeline.predict(X_test)
print("Predictions:", y_pred_pipeline)


In [4]:
# Test Case 2: Accuracy Score

accuracy = accuracy_score(y_test, y_pred_pipeline)
print("Pipeline Accuracy:", accuracy)




In [6]:
# Test Case 3: New Data Prediction (Inference)

new_data = pd.DataFrame({
    "Age": [30],
    "Salary": [40000],
    "City": ["Delhi"]
})

prediction = pipeline.predict(new_data)
print("New Prediction:", prediction)




### Expected Outputs
- Model trains successfully
- Same preprocessing applied to train, test, and new data
- Clean, reproducible ML workflow
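
One way to check the second point is to inspect the fitted pipeline. The sketch below uses a hypothetical four-row training frame and a made-up city, "Chennai", that never appears during fitting; it assumes the encoder is configured with handle_unknown="ignore".

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows, for illustration only.
train_demo = pd.DataFrame(
    {
        "Age": [22, 25, 47, 52],
        "Salary": [25000, 30000, 50000, 70000],
        "City": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    }
)
target_demo = [0, 0, 1, 1]

demo_pipeline = Pipeline(
    steps=[
        (
            "preprocessing",
            ColumnTransformer(
                transformers=[
                    ("num", StandardScaler(), ["Age", "Salary"]),
                    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
                ]
            ),
        ),
        ("classifier", LogisticRegression()),
    ]
)
demo_pipeline.fit(train_demo, target_demo)

# The scaler inside the fitted pipeline stores the statistics of the training rows.
fitted_scaler = demo_pipeline.named_steps["preprocessing"].named_transformers_["num"]
print("Means learned from training rows:", fitted_scaler.mean_)

# A new row, including an unseen city, goes through the same fitted steps.
unseen = pd.DataFrame({"Age": [30], "Salary": [40000], "City": ["Chennai"]})
unseen_prediction = demo_pipeline.predict(unseen)
print("Prediction for unseen row:", unseen_prediction)
```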


## Complexity Analysis

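| Step           | Time Complexity              |
| -------------- | ---------------------------- |
| Scaling        | O(n·d)                       |
| Encoding       | O(n·k)                       |
| Model Training | O(n·d) per solver iteration  |

Here n is the number of samples, d the number of features after preprocessing, and k the number of one-hot columns produced by the encoder.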

