## <b>Model Persistence and Inference with Joblib</b>
Let's now summarize how to <b>train a Random Forest model on California housing data</b>, save the model and preprocessing pipeline using `joblib`, and reuse the model later for inference on new data (`input.csv`). This approach avoids retraining the model on every run, which saves time and keeps predictions reproducible because the same fitted objects are reused.

### <b>Why These Steps?</b>
#### <b>1. Why Train Once and Save?</b>
- Training models repeatedly is <b>time-consuming</b> and <b>computationally expensive</b>.
- Saving the model (`model.pkl`) and preprocessing pipeline (`pipeline.pkl`) ensures you can <b>quickly load and run inference</b> anytime in the future.

#### <b>2. Why Use a Preprocessing Pipeline?</b>
- Raw data needs to be cleaned, scaled, and encoded before model training.
- A `Pipeline` automates this transformation and ensures <b>identical preprocessing</b> during inference.
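
To make "identical preprocessing" concrete, here is a minimal sketch using toy values that are not part of the housing data. It assumes a numeric pipeline like the one used in this section and shows that `transform` reuses the median, mean, and standard deviation learned during `fit_transform` rather than recomputing them on new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only (not the housing dataset).
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_new = np.array([[np.nan], [4.0]])

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# fit_transform learns the median and the mean/std from the training data.
num_pipeline.fit_transform(X_train)

# transform applies those learned statistics; nothing is re-fitted on X_new.
X_new_prepared = num_pipeline.transform(X_new)

learned_median = num_pipeline.named_steps["imputer"].statistics_[0]
learned_mean = num_pipeline.named_steps["scaler"].mean_[0]
print(learned_median, learned_mean)
print(X_new_prepared)
```

The missing value in `X_new` is filled with the training median, and both rows are scaled with the training mean and standard deviation.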

#### <b>3. Why Use Joblib?</b>
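- `joblib` efficiently serializes objects that hold large NumPy arrays, as fitted `scikit-learn` models do.
- It builds on `pickle` and is typically faster for such array-heavy objects, which makes it a common choice for persisting `scikit-learn` models.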
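
As a rough illustration of the save and load round trip, the sketch below trains a small forest on synthetic data (an assumption made purely for the example), writes it to a temporary directory with `joblib.dump`, and reads it back with `joblib.load`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)       # serialize the fitted model to disk
    restored = joblib.load(path)   # load it back into memory

# The restored model should predict exactly what the original predicts.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

Like `pickle`, `joblib.load` can execute arbitrary code, so it should only be used on files from a trusted source.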

#### <b>4. Why the If-Else Logic?</b>
- The program checks if a saved model exists.
  - If <b>not</b>, it trains and saves the model.
  - If it <b>does</b>, it skips training and <b>only runs inference</b>, saving time.
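
The branch decision can be sketched in isolation as follows. The helper `artifacts_exist` is a hypothetical name introduced here, and checking both files (rather than the model file alone) is a variation on the logic described above; placeholder files in a temporary directory stand in for the real artifacts.

```python
import tempfile
from pathlib import Path


def artifacts_exist(*paths):
    """Return True only if every given artifact file exists (hypothetical helper)."""
    return all(Path(p).is_file() for p in paths)


with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = Path(tmp_dir) / "model.pkl"
    pipeline_file = Path(tmp_dir) / "pipeline.pkl"

    # First run: nothing has been saved yet, so training would happen.
    first_run = artifacts_exist(model_file, pipeline_file)

    # Only one artifact present: still treated as "needs training".
    model_file.write_bytes(b"placeholder")
    partial = artifacts_exist(model_file, pipeline_file)

    # Both artifacts present: training can be skipped.
    pipeline_file.write_bytes(b"placeholder")
    second_run = artifacts_exist(model_file, pipeline_file)

print(first_run, partial, second_run)
```

Requiring both files guards against the case where one artifact was deleted or never written, which would otherwise cause the load step to fail.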


In [None]:
import os
import pandas as pd
import numpy as np
import joblib

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

MODEL_FILE = "model.pkl"
PIPELINE_FILE = "pipeline.pkl"

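def build_pipeline(num_attribs, cat_attribs):
    """Build the preprocessing transformer for numeric and categorical columns."""
    # Numeric columns: fill missing values with the median, then standardize.
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])
    # Categorical columns: one-hot encode; unseen categories become all zeros.
    cat_pipeline = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])
    # Route each group of columns to its own sub-pipeline.
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])
    return full_pipeline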

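# Train only when a saved artifact is missing; inference needs both files.
if not (os.path.exists(MODEL_FILE) and os.path.exists(PIPELINE_FILE)):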
    # TRAINING PHASE
    housing = pd.read_csv("housing.csv")
    # Bucket median income so the split preserves the income distribution.
    housing["income_cat"] = pd.cut(housing["median_income"],
                                   bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                                   labels=[1, 2, 3, 4, 5])
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    # split() yields positional indices, so select the training rows with iloc.
    train_index, _ = next(split.split(housing, housing["income_cat"]))
    housing = housing.iloc[train_index].drop("income_cat", axis=1)

    housing_labels = housing["median_house_value"].copy()
    housing_features = housing.drop("median_house_value", axis=1)

    num_attribs = housing_features.drop("ocean_proximity", axis=1).columns.tolist()
    cat_attribs = ["ocean_proximity"]

    pipeline = build_pipeline(num_attribs, cat_attribs)
    housing_prepared = pipeline.fit_transform(housing_features)

    model = RandomForestRegressor(random_state=42)
    model.fit(housing_prepared, housing_labels)

    # Save model and pipeline
    joblib.dump(model, MODEL_FILE)
    joblib.dump(pipeline, PIPELINE_FILE)

    print("Model trained and saved.")

else:
    # INFERENCE PHASE
    model = joblib.load(MODEL_FILE)
    pipeline = joblib.load(PIPELINE_FILE)

    input_data = pd.read_csv("input.csv")
    transformed_input = pipeline.transform(input_data)
    predictions = model.predict(transformed_input)
    input_data["median_house_value"] = predictions

    input_data.to_csv("output.csv", index=False)
    print("Inference complete. Results saved to output.csv")

Inference complete. Results saved to output.csv


### <b>Summary</b>
With this setup, our ML pipeline is:

- <b>Efficient</b> – No retraining needed if the model exists.
- <b>Reproducible</b> – Same preprocessing logic every time.
- <b>Production-ready</b> – The saved files can be deployed or reused on other systems that run a compatible `scikit-learn` version.