# ⚙️ Pipelines in Machine Learning

## 🧠 What’s the Problem?

So far, every time we trained a model, we had to do this manually:

1. Encode categorical data  
2. Scale numerical features  
3. Split data  
4. Train model  
5. Predict  

But in real workflows:
- **We don't want to repeat all these steps** for every dataset.
- **There is a risk of data leakage**: for example, accidentally including test data when calling `.fit()` on the scaler.
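To make the leakage risk concrete, here is a minimal sketch that fits one scaler on all rows and another on the training rows only. The names `leaky_scaler` and `clean_scaler` are made up for this illustration, and the split settings are assumed to match the ones used later in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Leaky: the scaler sees every row, including the rows we later test on.
leaky_scaler = StandardScaler().fit(X)

# Correct: the scaler learns its mean and std from the training rows only.
clean_scaler = StandardScaler().fit(X_train)

print("Mean learned from all rows:     ", leaky_scaler.mean_)
print("Mean learned from training rows:", clean_scaler.mean_)
```

The leaky scaler's statistics are computed from the full dataset, so information about the test rows has already influenced preprocessing before the model is ever evaluated.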

That’s where **Pipelines** come in.

---

## 💡 What is a Pipeline?

A **Pipeline** in `sklearn` is a tool that chains multiple steps together into a single workflow. Each step in the pipeline performs a specific task, such as preprocessing or modeling.

Think of it as a **conveyor belt** — raw data goes in, predictions come out.

The key idea is:

- We pass our raw data into the pipeline.
- The pipeline applies each step sequentially (e.g., scaling → training a model).
- The final output is the result of the last step (e.g., predictions).

This ensures that all steps are applied consistently and helps avoid mistakes like data leakage.

---

## 🛠️ Why Use Pipelines?

1. **Automation**: Automate repetitive tasks like encoding, scaling, and training.
2. **Avoid Data Leakage**: Ensures that preprocessing (e.g., scaling) is fitted only on the training data and then merely applied to the test data.
3. **Clean Code**: Keeps your code organized and reproducible.
4. **Easy to Deploy**: Simplifies the process of deploying models into production.
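One place where the leakage benefit shows up is cross-validation. The sketch below assumes the Iris data and a scaler plus logistic regression pipeline, the same combination used later in this notebook. Passing the whole pipeline to `cross_val_score` means the scaler is refitted on the training portion of each fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# For each of the 5 folds, a fresh copy of the pipeline is fitted on the
# training portion only and then scored on the held-out portion.
scores = cross_val_score(pipe, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Scaling `X` once up front and then cross-validating only the model would let every fold's held-out rows influence the scaler, which is the leakage the pipeline avoids.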

---

## 🌟 How Does It Work?

A Pipeline consists of multiple steps, each defined as a tuple `(name, transformer/model)`:
- **Transformers**: Handle preprocessing steps like encoding or scaling.
- **Model**: The final step is typically the model itself, an estimator such as `LogisticRegression`. All earlier steps must be transformers.

Example Workflow:
1. **Step 1**: Encode categorical variables.
2. **Step 2**: Scale numerical features.
3. **Step 3**: Train a machine learning model.
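The workflow above could look roughly like the sketch below. Two things here are assumptions rather than part of this notebook: the toy DataFrame is invented purely for illustration, and `ColumnTransformer` is swapped in to route the categorical column to the encoder and the numeric columns to the scaler. With that choice, encoding and scaling run side by side inside one preprocessing step instead of strictly one after the other.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris", "Oslo", "Rome", "Oslo"],
    "age": [25, 40, 31, 58, 22, 47],
    "income": [30_000, 52_000, 41_000, 75_000, 28_000, 60_000],
})
y = [0, 1, 0, 1, 0, 1]

# Encode the categorical column and scale the numeric columns.
preprocess = ColumnTransformer([
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("scale", StandardScaler(), ["age", "income"]),
])

# Each step is a (name, transformer/model) tuple; the model comes last.
pipe = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])

pipe.fit(df, y)
print(pipe.predict(df))
```

The raw DataFrame goes in as-is; the pipeline handles encoding, scaling, and training in a single `fit()` call.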

## Without Pipelines 👇

In [5]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [6]:
# Load data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 1.0


## With Pipelines 👇

In [7]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [None]:
# Loading dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Creating pipeline
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

'''
- Pipeline([...]) creates a new pipeline object. Inside it, we pass a list of tuples, where each tuple represents one step in the workflow.

- First tuple: ("scaler", StandardScaler())
  - "scaler": The name of the first step. We can name it anything we want (e.g., "scaling_step", "preprocessing"), but it should be descriptive and unique within the pipeline.
  - StandardScaler(): The actual transformer for the first step. It scales each feature to have a mean of 0 and a standard deviation of 1.

- Second tuple: ("model", LogisticRegression())
  - "model": The name of the second step. Again, we can name it anything (e.g., "classifier", "logistic_regression").
  - LogisticRegression(): The actual estimator (machine learning model) for the second step. It trains on the scaled data from the previous step.
'''

# Train
pipe.fit(X_train, y_train)

# Predict (stored as y_pred so the accuracy line below scores this pipeline's output)
y_pred = pipe.predict(X_test)


'''
## What Happens Here?

- After scaling the data, the pipeline passes the scaled data to the logistic regression model.
- The model is trained on the scaled data during .fit() or used to make predictions during .predict().
'''

'''
🌟 How Does the Pipeline Work Internally?

When we call methods like .fit() or .predict() on the pipeline, here’s what happens internally:

1. During .fit(X_train, y_train)

- Step 1: The pipeline calls fit_transform(X_train) on its StandardScaler step, which learns the mean and standard deviation from the training data and scales it.
- Step 2: The pipeline passes the scaled training data (the transformed X_train, not the raw one) and y_train to the fit() method of its LogisticRegression step to train the model.

2. During .predict(X_test)

- Step 1: The pipeline calls transform(X_test) on the already fitted StandardScaler, so the test data is scaled with the mean and standard deviation learned from the training data.
- Step 2: The pipeline passes the scaled test data to the predict() method of the fitted LogisticRegression to make predictions.

'''

print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 1.0
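The internal behaviour described above can be checked by pulling the fitted steps back out of the pipeline through `named_steps`. This is a minimal, self-contained sketch that rebuilds the same pipeline; `manual_pred` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# The fitted steps are available by the names given in the tuples.
scaler = pipe.named_steps["scaler"]
model = pipe.named_steps["model"]

# The scaler's statistics come from the training rows only.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))

# Replaying the two steps by hand matches pipe.predict().
manual_pred = model.predict(scaler.transform(X_test))
print(np.array_equal(manual_pred, pipe.predict(X_test)))
```

Both checks should print `True`, which confirms that `.predict()` reuses the scaler fitted during `.fit()` rather than fitting a new one on the test data.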
