# ⚙️ Pipelines in Machine Learning

## 🧠 What’s the Problem?

So far, every time we trained a model, we had to do this manually:

1. Split data  
2. Encode categorical data  
3. Scale numerical features  
4. Train model  
5. Predict  

But in real workflows:
- **We don’t want to repeat all these steps** for every dataset.
- **Risk of data leakage**: For example, accidentally calling `.fit()` on the scaler with data that includes the test set.

That’s where **Pipelines** come in.
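To make the leakage risk concrete, here is a minimal sketch on the iris data. The names `leaky_scaler` and `clean_scaler` are illustrative only. It compares the statistics a `StandardScaler` learns when fitted on all rows versus on the training rows only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Leaky: the mean and std are computed from ALL rows, so information
# about the test rows ends up baked into the training features.
leaky_scaler = StandardScaler().fit(X)

# Correct: the scaler only ever sees the training rows.
clean_scaler = StandardScaler().fit(X_train)

# The learned statistics differ, which is the leak made visible.
print("mean from all data:  ", leaky_scaler.mean_)
print("mean from train only:", clean_scaler.mean_)
```

The difference is small on a dataset like this, but the habit matters: anything learned from the test rows makes the test score less trustworthy.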

---

## 💡 What is a Pipeline?

A **Pipeline** in `sklearn` chains preprocessing and modeling steps into a single object that you can `.fit()` and `.predict()` with, just like a plain model.

Think of it as a **conveyor belt** — raw data goes in, predictions come out.

---

## 🛠️ Why Use Pipelines?

1. **Automation**: Automate repetitive tasks like encoding, scaling, and training.
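2. **Avoid Data Leakage**: Preprocessing (e.g., scaling) is fitted only on the data passed to `.fit()`, so as long as you fit on the training split, test data is never used to learn scaling parameters.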
3. **Clean Code**: Keeps your code organized and reproducible.
4. **Easy to Deploy**: Simplifies the process of deploying models into production.
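The leakage point shows up most clearly in cross-validation. The sketch below assumes the iris dataset and a 5-fold split, both chosen here only for illustration. Because the whole pipeline is passed to `cross_val_score`, the scaler is refitted on the training part of each fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# For each of the 5 folds, the pipeline is cloned, fitted on the
# training part of the fold, and scored on the held-out part.
scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Scaling `X` once up front and then cross-validating only the model would let every fold's scaler see its own validation rows.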

---

## 🌟 How Does It Work?

A Pipeline consists of multiple steps, each defined as a tuple `(name, transformer/model)`:
- **Transformers**: Handle preprocessing steps like encoding or scaling.
- **Model**: The final step is an estimator, usually the model (e.g., `LogisticRegression`). Every step before it must be a transformer.
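As a small sketch of the `(name, transformer/model)` structure, the snippet below builds a two-step pipeline and looks up its steps by name. The step names `"scaler"` and `"model"` are arbitrary labels chosen for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),      # transformer: has fit/transform
    ("model", LogisticRegression()),   # final estimator: has fit/predict
])

# `steps` is the list of (name, object) tuples in execution order.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# `named_steps` gives access to a step by its name.
scaler_step = pipe.named_steps["scaler"]
final_step = pipe.steps[-1][1]
print(type(scaler_step).__name__, type(final_step).__name__)
```

The names matter later when you want to inspect a fitted step, for example `pipe.named_steps["scaler"].mean_` after calling `pipe.fit(...)`.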

Example Workflow:
1. **Step 1**: Encode categorical variables.
2. **Step 2**: Scale numerical features.
3. **Step 3**: Train a machine learning model.
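One possible way to wire up this three-step workflow is with `ColumnTransformer`, which applies different preprocessing to different columns. This is only a sketch: the DataFrame, its column names (`city`, `age`, `income`), and the labels are hypothetical toy values made up for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris", "Berlin", "Rome", "Berlin"],
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30_000, 42_000, 58_000, 61_000, 45_000, 33_000],
})
y = [0, 0, 1, 1, 1, 0]

# Steps 1 and 2: encode the categorical column, scale the numeric ones.
preprocess = ColumnTransformer([
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("scale", StandardScaler(), ["age", "income"]),
])

# Step 3: the model is the final step of the pipeline.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

pipe.fit(df, y)
predictions = pipe.predict(df)
print(predictions)
```

Predicting on the training rows is done here only to show the call. In a real workflow you would split first and predict on held-out data.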

## Without Pipelines 👇

In [25]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [26]:
# Load data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 1.0


## With Pipelines 👇

In [27]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [28]:
# Loading dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Creating pipeline
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# Train
pipe.fit(X_train, y_train)

# Predict
y_predict = pipe.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_predict))

Accuracy: 1.0
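Because the fitted pipeline is a single object, the "easy to deploy" point can be sketched as a save-and-reload round trip. This example uses Python's built-in `pickle` and keeps the bytes in memory; writing them to a file works the same way. Treat it as an illustration rather than a deployment recipe: pickled models should only be loaded from trusted sources and with a matching `sklearn` version.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# Serialize the fitted pipeline (scaler statistics + model weights together).
blob = pickle.dumps(pipe)

# Later, or in another process: restore it and predict on raw, unscaled data.
restored = pickle.loads(blob)
same = bool((restored.predict(X_test) == pipe.predict(X_test)).all())
print("Same predictions after reload:", same)
```

The restored object carries its own preprocessing, so the code that serves predictions does not need to remember how the training data was scaled.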
