### Ensuring Feature Consistency Between Training & InferencePipelines:

**Task 1**: Consistent Feature Preparation
- Step 1: Write a function for data preprocessing and imputation shared by both training and inference pipelines.
- Step 2: Demonstrate consistent application on both datasets.

In [None]:
# write your code from here

**Task 2**: Pipeline Integration
- Step 1: Use sklearn pipelines to encapsulate the preprocessing steps.
- Step 2: Configure identical pipelines for both training and building inference models.

In [None]:
# write your code from here

**Task 3**: Saving and Loading Preprocessing Models
- Step 1: Save the transformation model after fitting it to the training data.
- Step 2: Load and apply the saved model during inference.

In [1]:
# write your code from here
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import joblib  # For saving/loading pipeline

# --- Task 1: Consistent Feature Preparation ---

def load_data_with_missing():
    """Load Boston housing dataset and artificially inject missing values."""
    boston = load_boston()
    df = pd.DataFrame(boston.data, columns=boston.feature_names)
    # Inject missing values in 'CRIM' column (every 10th row)
    df.loc[::10, 'CRIM'] = np.nan
    return df

def preprocess_data(df, pipeline=None, fit_pipeline=True):
    """
    Preprocess dataset using given pipeline.
    If no pipeline provided, create a new one (SimpleImputer + StandardScaler).
    If fit_pipeline=True, fit and transform; else only transform.
    """
    if pipeline is None:
        pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ])
    if fit_pipeline:
        processed = pipeline.fit_transform(df)
    else:
        processed = pipeline.transform(df)
    return processed, pipeline

# --- Demonstrate consistent application on train & inference ---

# Load full dataset and split into train/test to simulate inference data
df = load_data_with_missing()
X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)

# Preprocess training data (fit pipeline here)
X_train_processed, pipeline = preprocess_data(X_train, fit_pipeline=True)

# Preprocess inference data (use same pipeline, no fitting)
X_test_processed, _ = preprocess_data(X_test, pipeline=pipeline, fit_pipeline=False)

print("Training data processed shape:", X_train_processed.shape)
print("Inference data processed shape:", X_test_processed.shape)


# --- Task 2: Pipeline Integration ---
# Already done above by creating and using sklearn Pipeline
# This ensures identical steps applied during train and inference

# --- Task 3: Saving and Loading Preprocessing Models ---

# Save the fitted pipeline to disk
pipeline_filename = 'preprocessing_pipeline.joblib'
joblib.dump(pipeline, pipeline_filename)
print(f"\nPipeline saved to {pipeline_filename}")

# Load the pipeline from disk and apply to new data (simulate inference)
loaded_pipeline = joblib.load(pipeline_filename)
X_new_processed = loaded_pipeline.transform(X_test)

print("Loaded pipeline processed inference data shape:", X_new_processed.shape)


ImportError: 
`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>
