### ML Pipelines with Scikit-learn
#### What Is Scikit-learn?
Scikit-learn is one of Python's most widely used machine learning libraries. It gives you tools for:

- Regression & classification (e.g., LinearRegression, RandomForestRegressor, RandomForestClassifier)

- Preprocessing (e.g., StandardScaler, OneHotEncoder)

- Model evaluation (e.g., R², accuracy)

- Pipelines and hyperparameter tuning

A `Pipeline` in scikit-learn streamlines a machine learning workflow by chaining the preprocessing steps and the model itself into a single object that is fitted and used for prediction as one unit.
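
To make the chaining concrete, here is a minimal sketch on made-up numbers (the names `X_toy`, `y_toy`, and `toy_pipeline` are illustrative and not part of the notebook's dataset). It assumes a single numeric feature and an exactly linear target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): y = 3 * x + 2 exactly
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 3 * X_toy.ravel() + 2

# Each step is a (name, estimator) pair; every step except the last is a transformer
toy_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("regressor", LinearRegression()),
])

# fit() fits the scaler, transforms X_toy, then fits the regressor on the scaled values
toy_pipeline.fit(X_toy, y_toy)

# predict() applies the already-fitted scaler before calling the regressor
toy_pred = toy_pipeline.predict(np.array([[5.0]]))
print(toy_pred)
```

Because scaling happens inside the pipeline, new inputs are passed in their original units and the same fitted scaler is reused automatically.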


#### 📦 What Is a ColumnTransformer?
ColumnTransformer lets you apply different preprocessing steps to different columns.

- Numeric columns → scale

- Categorical columns → encode

- Drop or pass-through other columns
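
The three cases above can be sketched on a small, hypothetical DataFrame (the column names `team`, `laps`, and `notes` are invented for this illustration). The `remainder` parameter controls what happens to columns that no transformer claims.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values, not taken from the F1 dataset)
toy_df = pd.DataFrame({
    "team": ["A", "B", "A", "C"],
    "laps": [50, 60, 70, 80],
    "notes": ["x", "y", "z", "w"],
})

toy_ct = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(), ["team"]),   # categorical column: encode
        ("num", StandardScaler(), ["laps"]),  # numeric column: scale
    ],
    remainder="drop",  # the default; "passthrough" would keep "notes" unchanged
)

toy_out = toy_ct.fit_transform(toy_df)

# 3 one-hot columns for "team" + 1 scaled column for "laps"; "notes" is dropped
print(toy_out.shape)
```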

#### ⚙️ Objective
We will:

- Predict lap time in milliseconds (the `milliseconds` column)

- Use the features `Constructor name`, `Driver's forename`, and `laps`

- Apply encoding + scaling

- Train a Linear Regression model

#### 1. Load the data into a pandas DataFrame.

In [None]:
# Load required packages
import pandas as pd

# Load the dataset
df = pd.read_csv(r"C:\Users\p.muthusenapathy\VSCode_Projects\Python_Training\datasets\F1 data.csv")
# Use a raw string (r"...") or doubled backslashes for Windows paths so that
# sequences such as \U are not parsed as escapes (which raises a unicode error)

# Select the relevant columns, then drop rows with missing values
df = df[['milliseconds', 'Constructor name', "Driver's forename", 'laps']]
df = df.dropna()

| | milliseconds | Constructor name | Driver's forename | laps |
|---|---|---|---|---|
| 0 | 5690616.0 | McLaren | Lewis | 58 |
| 1 | 5696094.0 | BMW Sauber | Nick | 58 |
| 2 | 5698779.0 | Williams | Nico | 58 |
| 3 | 5707797.0 | Renault | Fernando | 58 |
| 4 | 5708630.0 | McLaren | Heikki | 58 |
| ... | ... | ... | ... | ... |
| 25386 | 5476545.0 | McLaren | Lando | 58 |
| 25387 | 5479053.0 | Alpine F1 Team | Fernando | 58 |
| 25388 | 5481371.0 | Alpine F1 Team | Esteban | 58 |
| 25389 | 5483402.0 | Ferrari | Charles | 58 |
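
The comment about Windows paths in the loading cell can be illustrated with a short sketch. The path below is hypothetical and is never opened; it only shows how raw strings and doubled backslashes produce the same text, while a plain string silently turns `\n` into a newline.

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only to illustrate string escaping (no file is read)
raw_path = r"C:\data\new\F1 data.csv"
escaped_path = "C:\\data\\new\\F1 data.csv"

# Both spellings yield exactly the same string
print(raw_path == escaped_path)

# In a plain string "\n" becomes one newline character, so the path is corrupted
plain = "C:\new"
raw = r"C:\new"
print(len(plain), len(raw))

# pathlib parses Windows-style paths on any operating system
file_name = PureWindowsPath(raw_path).name
print(file_name)
```

A plain string containing `\U`, as in `"C:\Users"`, does not even compile, because Python tries to read it as a unicode escape.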


#### 2. Define Features and Target

In [10]:
X = df[['Constructor name', "Driver's forename", 'laps']]
y = df['milliseconds']

#### 3. Create the ColumnTransformer

In [11]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Categorical and numerical columns
cat_features = ['Constructor name', "Driver's forename"]
num_features = ['laps']

# ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
        ('num', StandardScaler(), num_features)
    ])
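
The `handle_unknown='ignore'` setting matters once the model sees a category that was absent during fitting. The sketch below uses a few constructor names plus an invented one ("Hypothetical Team") to show that an unseen category is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Categories seen at fit time
train_cats = pd.DataFrame({"Constructor name": ["McLaren", "Ferrari", "Williams"]})

# New data containing one known and one unseen (invented) category
new_cats = pd.DataFrame({"Constructor name": ["Ferrari", "Hypothetical Team"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)

# transform() returns a sparse matrix by default; convert it for display
encoded = encoder.transform(new_cats).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # the unseen category becomes an all-zero row
```

With the default `handle_unknown='error'`, the same `transform` call would raise a `ValueError`.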


#### 4. Create the Full Pipeline

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Pipeline: preprocessor + model
model_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('regressor', LinearRegression())
])


#### 5. Train the Pipeline

In [13]:
model_pipeline.fit(X, y)


#### 6. Make Predictions

In [14]:
preds = model_pipeline.predict(X.head())
print("Predicted lap times:", preds)


Predicted lap times: [5550285.07859792 5539957.88513717 5714260.09906229 5589905.57787934
 5764671.56934242]


#### 7. Evaluate Model (R²)

In [15]:
from sklearn.metrics import r2_score

# Note: this scores the same rows the model was trained on, so it measures fit, not generalization
r2 = r2_score(y, model_pipeline.predict(X))
print(f"R² Score: {r2:.3f}")


R² Score: 0.665
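
The score above is computed on the training rows. One common way to estimate performance on unseen data is to hold out a test set with `train_test_split`. The sketch below is not part of the original notebook: it uses synthetic data (`demo_X`, `demo_y`, and the `team` column are invented) so that it runs without the CSV file, and the same pattern could be applied to `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: target depends on laps plus a per-team offset and noise
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "team": rng.choice(["A", "B", "C"], size=n),
    "laps": rng.integers(50, 80, size=n),
})
team_effect = demo_X["team"].map({"A": 0.0, "B": 5.0, "C": 10.0})
demo_y = 100.0 * demo_X["laps"] + team_effect + rng.normal(0, 1.0, size=n)

# Hold out 25% of the rows; the model never sees them during fit()
X_train, X_test, y_train, y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0
)

demo_pipeline = Pipeline(steps=[
    ("preprocessing", ColumnTransformer(transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["team"]),
        ("num", StandardScaler(), ["laps"]),
    ])),
    ("regressor", LinearRegression()),
])
demo_pipeline.fit(X_train, y_train)

# Compare fit on the training rows with performance on the held-out rows
train_r2 = r2_score(y_train, demo_pipeline.predict(X_train))
test_r2 = r2_score(y_test, demo_pipeline.predict(X_test))
print(f"Train R²: {train_r2:.3f}  Test R²: {test_r2:.3f}")
```

Because the preprocessing lives inside the pipeline, the encoder and scaler are fitted on the training rows only, which avoids leaking information from the test set.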
