<font size="36"><b>Sklearn Pipelines - Self Study - Assignment</b></font>

In this exercise we will work with the Automobile dataset from <a href = "https://archive.ics.uci.edu/ml/datasets/Automobile">UCI</a>. We revised it for your convenience, so please use the attached files.

We will try to predict the automobile **price**.

The data dictionary is attached (`imports-85.names` file).

Explanation of some other columns (see also `imports-85.names` file):
- **symboling** - Risk rating. Corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky; -3 indicates that it is probably pretty safe.
- **normalized-losses** - Relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc.) and represents the average loss per car per year.

# Load the dataset and perform initial EDA

In [None]:
import pandas as pd

df = pd.read_csv("Automobile.csv")

print("First few rows of the dataset:")
display(df.head())

print("\nMissing values in the dataset:")
display(df.isnull().sum())

print("\nSummary statistics of the dataset:")
display(df.describe())


# Do initial critical transformations

Check your **target** variable, and remove samples that will not allow us to train and predict.

In [None]:
df.dropna(subset=['price'], inplace=True)

# Split the data set to train and test sets

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['price'])
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# Split the features into different types
Group the features by data type.

This will help you do EDA on each group separately and preprocess each group separately.

**Hint:** Use the attached data dictionary.

In [None]:
file_path = "imports-85.names"

with open(file_path, "r") as file:
    file_contents = file.read()

print(file_contents)

# Perform more LIGHT EDA 
- This is to help decide how to preprocess the data.
- Focus on **minimal** things that will help you do encoding, and handling NaNs.
- **No need** to understand correlations between features, and between features and target variable, etc.

**Warning:** Don't overdo it, this exercise is about learning pipelines, not about EDA.

**Important hint**: Which feature almost always has the same value? It's important to recognize it and remove it later. Otherwise it can cause a lot of issues, especially when transforming with cross-validation, since the rare value will often be missing from a training fold and can then cause problems when preprocessing it.

In [None]:
# Identify categorical and numerical features for encoding
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
numerical_features = df.select_dtypes(exclude=['object']).columns.tolist()
numerical_features.remove('price')

print("\nCategorical features:")
print(categorical_features)
print("\nNumerical features:")
print(numerical_features)

# Check the value counts for categorical features
for column in categorical_features:
    print("\nValue counts for", column, ":")
    print(df[column].value_counts())

We can see that `engine-location` almost always has the same value, so it carries very little information. We will drop it in the pipeline.
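One way to make this check less dependent on reading the value counts by eye is to compute, for each categorical column, the share of rows taken by its most common value. The snippet below is a minimal sketch on made-up toy data; the helper `dominant_share` and the 0.9 threshold are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "engine-location": ["front"] * 19 + ["rear"],
    "fuel-type": ["gas"] * 12 + ["diesel"] * 8,
})


def dominant_share(frame):
    """Return, per column, the fraction of rows taken by the most common value."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )


shares = dominant_share(toy)

# Hypothetical threshold: flag columns where one value covers at least 90% of rows.
near_constant = shares[shares >= 0.9].index.tolist()

print(shares)
print("Near-constant columns:", near_constant)
```

On the real data the same helper could be applied to `df[categorical_features]`, assuming that list has been built as in the cell above.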

# Define pipeline logical steps
With words and explanations, define the specific steps you want to take as a part of your pipeline and explain the reason for each.

**Hints:**
- Think **not** about specific columns, but about how to apply steps that are as similar as possible to **multiple** columns using built-in transformers
- To keep this exercise simple, it's **OK to make sub-optimal preprocessing**, as long as it's reasonable.  For example, you don't need to change distribution shapes of continuous features

**You MUST include:**
* NA care
* Removing 1 problematic feature discussed in EDA step above
* Categorical feature encoding
* Data Normalization
* Feature selection / dimensionality reduction
* Modeling

    1. NA Care:
        Impute missing values: We will fill in missing values in numerical features with mean. For categorical features, we will use a separate category for missing values.

    2. Removing 1 Problematic Feature:
        Remove feature with almost always the same value: we will eliminate 'engine-location' since it was identified as redundant during EDA.

    3. Categorical Feature Encoding:
        One-Hot Encoding or Ordinal Encoding: Convert categorical features into numerical format using one-hot encoding for nominal variables and ordinal encoding for ordinal variables.

    4. Data Normalization:
        Standardization or Min-Max Scaling: Scale numerical features to a similar scale using standardization or min-max scaling.

    5. Feature Selection / Dimensionality Reduction:
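        SelectKBest: Reduce dimensionality by selecting the top k features based on a univariate score suited to a continuous target (e.g., the F-test in `f_regression`, or mutual information) to improve model performance and reduce overfitting. Chi-square and ANOVA scores are meant for classification targets, so they do not fit a price prediction task.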

    6. Modeling:
        Choose a regression algorithm: Train a regression model using the selected features. We will use Random Forest Regression.
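The plan above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the assignment solution: the column names (`size`, `weight`, `body`), the fill value `"missing"`, and `k=3` are all assumptions chosen for the example. It shows a separate category for missing categorical values and `SelectKBest` with `f_regression` placed after the preprocessing step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic toy data (made up): two numeric columns and one categorical column with NaNs.
rng = np.random.default_rng(0)
n = 60
body = rng.choice(["sedan", "wagon"], size=n).astype(object)
body[::7] = np.nan  # inject some missing categorical values
toy_X = pd.DataFrame({
    "size": rng.normal(size=n),
    "weight": rng.normal(size=n),
    "body": body,
})
toy_y = 3 * toy_X["size"] + rng.normal(scale=0.1, size=n)

numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # NA care for numeric features
    ("scaler", StandardScaler()),                 # normalization
])
categorical_steps = Pipeline([
    # NA care: missing values become their own category
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["size", "weight"]),
    ("cat", categorical_steps, ["body"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    # Univariate feature selection with a score suited to a continuous target
    ("select", SelectKBest(score_func=f_regression, k=3)),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])
model.fit(toy_X, toy_y)

kept = model.named_steps["select"].get_support()
print("Features kept:", int(kept.sum()), "of", kept.size)
```

Placing the selector after the `ColumnTransformer` means it scores the encoded columns, so individual one-hot columns can be kept or dropped independently.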

# Implement a pipeline 
Include all of the steps you mentioned above.

**Hints:**
- You can make some changes to your decisions to make the pipeline simpler, but explain all changes, steps and decisions.
- Notice that by default `ColumnTransformer` will drop all features that were not explicitly given to it in one of the transformers.  This is one way to always drop some columns
- If you are having issues running the pipeline, try to debug parts of the pipeline and their outputs
- If you are having issues, make sure you understand what `handle_unknown` parameter options do for various transformers

In [None]:
print(categorical_features)

In [None]:
categorical_features.remove('engine-location')

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor



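# Numerical features: impute with the mean, standardize, then reduce dimensionality.
# PCA is used here for dimensionality reduction instead of the SelectKBest step mentioned in the plan.
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3)),
])

# Categorical features: impute with the most frequent value (simpler than a separate
# "missing" category), then one-hot encode, ignoring categories unseen during fit.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Create a preprocessor to handle both numerical and categorical features.
# 'engine-location' is in neither list, so ColumnTransformer drops it by default.
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features),
])

# Create the final pipeline with the preprocessor and the model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42)),
])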


# Use the pipeline
- Fit the pipeline
- Evaluate the resulting model. Are you satisfied with your score?
- Print your pipeline

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)


print("Evaluation metrics:")
print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")
print(f"R-squared: {r2}")

print("\nPipeline:")
print(pipeline)


I am not fully satisfied with the results: although the R² score is relatively good, the MSE and MAE are still quite large.
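Whether an error is "large" depends on the scale of the target. MSE is in squared price units, so it helps to look at its square root (RMSE) and at MAE relative to the typical price. The sketch below uses made-up prices and predictions purely to illustrate the arithmetic; none of these numbers come from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true prices and predictions, for illustration only.
y_true = np.array([10000.0, 15000.0, 20000.0, 30000.0])
y_hat = np.array([11000.0, 14000.0, 22000.0, 27000.0])

mse = mean_squared_error(y_true, y_hat)    # squared price units
rmse = np.sqrt(mse)                        # back in price units
mae = mean_absolute_error(y_true, y_hat)   # price units
relative_mae = mae / y_true.mean()         # MAE as a fraction of the mean price

print(f"MSE:  {mse:.0f}")
print(f"RMSE: {rmse:.0f}")
print(f"MAE:  {mae:.0f}")
print(f"MAE / mean price: {relative_mae:.1%}")
```

Applied to `y_test` and `y_pred` from the cell above, the same ratios would show how large the errors are relative to typical car prices.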

# Model selection / hyperparam tuning
- Try a few different options for preprocessing and/or modeling that you think have a good chance of improving the metric of the final model.  Use `RandomizedSearchCV`
- Is the score better now?
- Print the pipeline chosen by the search
- Print the best hyperparameters of the search

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Define different preprocessing options (each dict is one sub-space of the search)
preprocessing_options = [
    {
        'preprocessor__num__imputer__strategy': ['mean', 'median'],
        'preprocessor__num__pca__n_components': [3, 5, 7],
        'preprocessor__cat__imputer__strategy': ['most_frequent', 'constant'],
        # 'error' is left out: a category unseen in a training fold would raise an error
        'preprocessor__cat__onehot__handle_unknown': ['ignore']
    },
    {
        'preprocessor__num__imputer__strategy': ['mean'],
        'preprocessor__num__pca__n_components': [3, 4, 5, 8],
        'preprocessor__cat__imputer__strategy': ['most_frequent'],
        'preprocessor__cat__onehot__handle_unknown': ['ignore']
    }
]

# Model hyperparameters
param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    'model__bootstrap': [True, False]
}

# Combine each preprocessing sub-space with the model hyperparameters, so that the
# preprocessing options are actually searched. RandomizedSearchCV accepts a list of dicts.
param_distributions = [{**options, **param_grid} for options in preprocessing_options]

random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=10,  # Number of parameter settings that are sampled
    scoring='neg_mean_squared_error',  # Negated MSE: higher (closer to 0) is better
    cv=5,  # Cross-validation folds
    random_state=42,
    verbose=2  # Verbosity level
)

random_search.fit(X_train, y_train)

print("Best CV score (negative MSE):", random_search.best_score_)
print("Best parameters:", random_search.best_params_)

print("Best pipeline:", random_search.best_estimator_)
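To answer "Is the score better now?", the tuned model should be compared with the earlier one on the same metric and the same held-out data. Note that `best_score_` is a cross-validated negative MSE, so it has to be negated before comparing it with a test-set MSE. The sketch below illustrates this on synthetic data from `make_regression`; the dataset, the small parameter space, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data (made up), split into train and held-out test sets.
X_toy, y_toy = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [20, 40], "max_depth": [None, 5]},
    n_iter=3,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated *negative* MSE of the best candidate.
cv_mse = -search.best_score_

# The search object refits the best candidate on all training data by default,
# so it can predict directly on the held-out set.
test_mse = mean_squared_error(y_te, search.predict(X_te))

print("Best params:", search.best_params_)
print(f"CV MSE:   {cv_mse:.1f}")
print(f"Test MSE: {test_mse:.1f}")
```

In the notebook, the equivalent comparison would use `random_search.predict(X_test)` against `y_test`, next to the metrics printed for the untuned pipeline.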
