### A. Data Cleaning 

**Instructions:**
1. Import `pandas` and `numpy`. Import `SimpleImputer` from `sklearn.impute`. Import `StandardScaler` from `sklearn.preprocessing`. Import `train_test_split` from `sklearn.model_selection`.
2. Use `read_csv` to load `AmesHousing.csv` as `housing`. 
3. Use the `.select_dtypes` and `.columns` commands to create `numerical_cols`, a list containing all of the `float64` and `int64` columns.
4. Use the `.select_dtypes` and `.columns` commands to create `categorical_cols`, a list containing all of the `object` columns.
5. Define `num_imputer` as `SimpleImputer` with a `median` strategy. Define `cat_imputer` as `SimpleImputer(strategy='constant', fill_value='missing')`. 
6. Replace the `numerical_cols` of `housing` with imputed columns using the `.fit_transform` function of `num_imputer` on the numerical columns of the `housing` dataset.
7. Replace the `categorical_cols` of `housing` with imputed columns using the `.fit_transform` function of `cat_imputer` on the categorical columns of the `housing` dataset.
8. Define a new list of `categorical_columns` using `.select_dtypes` and `.columns`, searching for all of the `object` columns. Now, use this list of categorical columns in the `columns` argument of `.get_dummies` to create a new dataset `housing_dummies` that contains all of the imputed columns, with the categorical columns having been converted to dummy variables. 
9. Define label variable `y` as `SalePrice` and feature matrix `X` as all other variables.
10. Use the `.fit_transform()` function of the `StandardScaler` on `X` to create `X_scaled`.
11. Divide the data into `X_train`, `X_test`, `y_train`, `y_test` using `train_test_split` with 20% of the data held out for testing. For replicability, fix a random state.


In [7]:
# Load Packages
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load data
housing = pd.read_csv('AmesHousing.csv')

# Identify numerical and categorical columns
numerical_cols = housing.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = housing.select_dtypes(include=['object']).columns

# Impute missing values
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')

# Steps 6 and 7: replace the columns with their imputed versions
housing[numerical_cols] = num_imputer.fit_transform(housing[numerical_cols])
housing[categorical_cols] = cat_imputer.fit_transform(housing[categorical_cols])

# Convert categorical variables to dummies 
categorical_cols_new = housing.select_dtypes(include=['object']).columns
housing_dummies = pd.get_dummies(housing, columns=categorical_cols_new)

# Define the label and feature 
y = housing_dummies['SalePrice']
X = housing_dummies.drop('SalePrice', axis=1)
# Normalize features 
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=8)
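
The imputation and dummy-encoding steps above can be checked on a tiny example. This is a minimal sketch on a hypothetical three-row DataFrame (the `toy` data and its column names are made up, not taken from `AmesHousing.csv`). It selects non-numeric columns with `exclude='number'` as a stand-in for the `object` selection used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data (assumption), with one missing value per column
toy = pd.DataFrame({
    "area": [1000.0, np.nan, 3000.0],
    "style": ["ranch", np.nan, "colonial"],
})

# Split columns by type, mirroring numerical_cols / categorical_cols
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Median for numbers, a constant 'missing' label for categories
toy[num_cols] = SimpleImputer(strategy="median").fit_transform(toy[num_cols])
toy[cat_cols] = SimpleImputer(
    strategy="constant", fill_value="missing"
).fit_transform(toy[cat_cols])

# One indicator column per category, including the 'missing' label
toy_dummies = pd.get_dummies(toy, columns=list(cat_cols))
print(toy_dummies)
```

The missing `area` is filled with the median of the observed values, and the missing `style` becomes its own `style_missing` indicator column rather than being dropped.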


### B. Define Models and Hyperparameter Grid

**Instructions:**
1. Import the following regression models from the `linear_model` module of scikit-learn:
   - `Lasso`
   - `Ridge`
   - `ElasticNet`
   - `LinearRegression`
2. Create a dictionary named `models` to store instances of `Lasso`, `Ridge`, and `ElasticNet` where the key is the model name and the value is the model instance.
3. Define a dictionary named `param_grid` that specifies the hyperparameter search space for each model:
   - `Lasso`: `alpha` values of `[10, 100, 1000, 10000]` and `max_iter` set to `[10000]`.
   - `Ridge`: `alpha` values of `[10, 100, 1000, 10000]`.
   - `ElasticNet`: `alpha` values of `[0.01, 0.1, 1, 10]`, `l1_ratio` values of `[0.2, 0.5, 0.8]`, and `max_iter` set to `[10000]`.




In [9]:
# Define models and hyperparameter grid
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression

models = {
    'Lasso': Lasso(),
    'Ridge': Ridge(),
    'ElasticNet': ElasticNet(),
}

param_grid = {
    'Lasso': {
                'model__alpha': [10, 100, 1000, 10000],
                'model__max_iter': [10000]
        },
    'Ridge': {'model__alpha': [10, 100, 1000, 10000]
        },
    'ElasticNet': {
        'model__alpha': [0.01, 0.1, 1, 10],
        'model__l1_ratio': [0.2, 0.5, 0.8],
        'model__max_iter': [10000]
    }
}
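
The grid keys carry a `model__` prefix because the search is later run over a `Pipeline` whose single step is named `'model'`. scikit-learn addresses a step's parameter as `<step name>__<parameter name>`. A minimal sketch of that naming rule, using a throwaway pipeline (`pipe` is a name introduced here for illustration):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# A one-step pipeline whose step is named 'model'
pipe = Pipeline([("model", Ridge())])

# The step's alpha is exposed under the key 'model__alpha'
print("model__alpha" in pipe.get_params())

# Setting it through the pipeline changes the underlying Ridge instance
pipe.set_params(model__alpha=100)
print(pipe.named_steps["model"].alpha)
```

If the step were named something else, the grid keys would need the matching prefix, otherwise `GridSearchCV` raises an error about an invalid parameter.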


### C. Perform Cross-Validation and Hyperparameter Tuning
**Instructions:**
1. Import the necessary components for cross-validation and hyperparameter tuning:
   - `Pipeline` from `sklearn.pipeline`
   - `GridSearchCV` from `sklearn.model_selection`
2. Create an empty dictionary called `best_models` to store the best-tuned models.
3. Using a `for` loop, iterate over the  model name and model instance in the `models` dictionary. To iterate over a dictionary, use `.items()`:
   - Create a `Pipeline()` called `pipeline`. Within the `Pipeline()`, pass in the model instance from the `models` dictionary that you are iterating over. 
   - Create an instance of `GridSearchCV()` called `grid_search` to perform hyperparameter tuning:
      - Pass in the pipeline 
      - Pass in corresponding hyperparameter grid from `param_grid` 
      - Use 5-fold cross-validation 
      - Set `scoring` to `'neg_mean_squared_error'` to optimize for minimizing mean squared error.
   - Fit `GridSearchCV` to the training data, `X_train` and `y_train` 
   - Store the best estimator (the `best_estimator_` attribute of `grid_search`) in `best_models`.
   - Print the best hyperparameters for each model.
4. After finding the best models, retrain each model in `best_models` on the full training data (`X_train`, `y_train`) using a `for` loop that iterates over the name and model in `best_models.items()` 


In [10]:
# Import Functions  
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# Create empty best models dictionary 
best_models = {}

# Loop over models dictionary 
for name, model in models.items():
    # Create pipeline object 
    pipeline = Pipeline([('model', model)])
    
    # Create GridSearchCV object, passing in pipeline and param_grid[name] 
    grid_search = GridSearchCV(pipeline, param_grid[name], cv=5, scoring='neg_mean_squared_error')
    
    # Fit GridSearchCV
    grid_search.fit(X_train, y_train)
    
    # Report Best Model/Parameters 
    best_models[name] = grid_search.best_estimator_
    print(f"Best parameters for {name}: {grid_search.best_params_}")

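# Refit the best models on the full training set, as step 4 asks
# (GridSearchCV's default refit=True has already done this once)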
for name, model in best_models.items():
    model.fit(X_train, y_train)



Best parameters for Lasso: {'model__alpha': 1000, 'model__max_iter': 10000}
Best parameters for Ridge: {'model__alpha': 1000}
Best parameters for ElasticNet: {'model__alpha': 1, 'model__l1_ratio': 0.8, 'model__max_iter': 10000}
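
To see what the search reports beyond `best_params_`, here is a small sketch on synthetic data. The data from `make_regression` and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration, and the grid is not the one used above. It shows how to turn the negative-MSE score back into an RMSE and that the best estimator comes back already fitted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption), not the Ames housing data
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

search = GridSearchCV(
    Pipeline([("model", Ridge())]),
    {"model__alpha": [0.1, 1, 10, 100]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is a negative MSE: flip the sign and take the root for a CV RMSE
cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, round(cv_rmse, 2))

# With the default refit=True, best_estimator_ is already fitted
# on all of the data passed to fit()
coef_shape = search.best_estimator_.named_steps["model"].coef_.shape
print(coef_shape)
```

The scorer is negated because `GridSearchCV` always maximizes its score, so minimizing MSE is expressed as maximizing its negative.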


### D. Evaluate Models on the Test Set

**Question:**  
After tuning our models using cross-validation, we now want to assess their performance on the test set and compare them to baseline (unoptimized) models.

1. Use the best-tuned models stored in `best_models` to make predictions on `X_test`.
2. Compute the root mean squared error (RMSE) for each optimized model using `mean_squared_error` from `sklearn.metrics`.  
   - Store the RMSE values in a dictionary called `rmse_results`, with keys formatted as `"Optimized {Model Name}"`.
3. To establish a performance baseline, evaluate **unoptimized models**:
   - Define a dictionary `unoptimized_models` containing:
     - `Lasso` with `alpha=1.0`
     - `Ridge` with `alpha=1.0`
     - `LinearRegression` as a reference model
   - Train each model on `X_train` and predict on `X_test`.
   - Compute and store their RMSE values in `rmse_results` with appropriate labels.
4. Compare the RMSE values of the optimized vs. unoptimized models:
   - Which models improved the most after hyperparameter tuning?
   - Did any models perform worse after tuning?


In [12]:
# Import Function 
from sklearn.metrics import mean_squared_error

# Create dictionary of RMSE for optimized models 
rmse_results = {}

for name, model in best_models.items():
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    rmse_results[f"Optimized {name}"] = rmse
    print(f"Optimized {name} RMSE on test set: {rmse:.2f}")

# Compare with unoptimized models
unoptimized_models = {
    'Unoptimized Lasso': Lasso(alpha=1.0),
    'Unoptimized Ridge': Ridge(alpha=1.0),
    'Linear Regression': LinearRegression()
}

for name, model in unoptimized_models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    rmse_results[name] = rmse
    print(f"{name} RMSE on test set: {rmse:.2f}")

# Define print order
print_order = [
    "Linear Regression",
    "Unoptimized Lasso",
    "Optimized Lasso",
    "Unoptimized Ridge",
    "Optimized Ridge",
    "Optimized ElasticNet"
]

# Print Results 
print("\nRMSE Comparison (in dollars):")
for model in print_order:
    if model in rmse_results:
        print(f"{model}: ${rmse_results[model]:,.0f}")

Optimized Lasso RMSE on test set: 29925.28
Optimized Ridge RMSE on test set: 28148.74
Optimized ElasticNet RMSE on test set: 28666.65
Unoptimized Lasso RMSE on test set: 30418.04
Unoptimized Ridge RMSE on test set: 30383.56
Linear Regression RMSE on test set: 30397.98

RMSE Comparison (in dollars):
Linear Regression: $30,398
Unoptimized Lasso: $30,418
Optimized Lasso: $29,925
Unoptimized Ridge: $30,384
Optimized Ridge: $28,149
Optimized ElasticNet: $28,667
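
As a sanity check on the metric and the comparison, the sketch below computes RMSE two ways on made-up prices (the `y_true` and `y_pred` values are hypothetical), then expresses the tuned-versus-untuned gap as a percentage using the test RMSE values printed above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical sale prices and predictions in dollars (not from the Ames data)
y_true = np.array([200_000, 150_000, 300_000])
y_pred = np.array([210_000, 140_000, 320_000])

# RMSE is the square root of the mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"${rmse_sklearn:,.0f}")  # $14,142

# Test RMSE values copied from the output above: (unoptimized, optimized)
reported = {
    "Lasso": (30418.04, 29925.28),
    "Ridge": (30383.56, 28148.74),
}
pct_improvement = {
    name: 100 * (before - after) / before
    for name, (before, after) in reported.items()
}
for name, pct in pct_improvement.items():
    print(f"{name}: RMSE reduced by {pct:.1f}%")
```

Because RMSE is in the same units as `SalePrice`, the differences can be read directly as dollars of typical prediction error.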




### E. Qualitative Questions 
1. **Why do we use cross-validation instead of simply evaluating the model once on the training data? How does cross-validation help us assess model performance more reliably?**  
2. **What does GridSearchCV do conceptually? Why is it important to tune hyperparameters rather than using default values?**  
3. **In this problem set, we tested both optimized (tuned) and unoptimized (default) models. What patterns do you notice when comparing their performance on the test set? Does hyperparameter tuning always improve performance? Why or why not?**  
4. **In the param_grid dictionary, we varied alpha for Ridge and Lasso regression. What does alpha control in these models? How would an extremely large or small value of alpha affect the model's predictions?**  
5. **What are some reasons we might prefer to choose Lasso over Ridge or vice-versa?**  


1. Cross-validation helps us detect overfitting. A model evaluated only on the data it was trained on can look nearly perfect yet fail on real-world data, or on any data outside the set it was trained on. Cross-validation rotates which subset is held out for validation, so every point is used for both training and validation, and averaging the error across folds gives a more reliable estimate than a single evaluation.
2. GridSearchCV runs through every combination of settings in the grid, scores each one with cross-validation, and keeps the one with the lowest error. The default values are only a starting point. How strongly to regularize depends on the data, such as its noise level, so we need to search for the right balance between complexity and generalization.
3. I noticed that the lower RMSE values belong to the optimized models. Tuning lowered the test RMSE from $30,418 to $29,925 for Lasso and from $30,384 to $28,149 for Ridge, so Ridge improved the most and neither model performed worse. This is because the tuned alpha balances fitting the training data closely against being able to generalize. Tuning is not guaranteed to help, though. The hyperparameters are chosen by cross-validation error on the training data, so if the chosen setting fits the training folds better than it generalizes, the tuned model can do worse on the test set than the unoptimized one.
4. Alpha is the regularization strength: it controls the penalty on the size of the coefficients. An extremely large alpha penalizes the coefficients heavily and pushes them toward zero, resulting in lower complexity and possible underfitting. An extremely small alpha applies almost no penalty, so the model behaves much like ordinary linear regression and can become too sensitive to noise in the training data, giving high variance.
5. We might prefer Lasso when we believe only a few of the features make a difference in prediction. Lasso drives the coefficients of the other features to exactly zero, which effectively selects the features to focus on. We might prefer Ridge when many features contribute similarly to the prediction. Ridge shrinks coefficients but does not usually push them all the way to zero.
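
The points about alpha and about Lasso versus Ridge can be illustrated with a short experiment. This is a sketch on synthetic data, not the Ames data. The `make_regression` settings, the alpha values, and the names `X_demo`, `y_demo`, and `summary` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 20 features, only 5 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

summary = {}
for alpha in [0.01, 1, 100]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    summary[alpha] = {
        # Lasso can set coefficients to exactly zero
        "lasso_zeros": int(np.sum(lasso.coef_ == 0)),
        # Ridge shrinks the overall size of the coefficient vector
        "ridge_norm": float(np.linalg.norm(ridge.coef_)),
    }
    print(alpha, summary[alpha])
```

As alpha grows, the count of exactly-zero Lasso coefficients should rise while the Ridge coefficient norm falls, which is the shrinkage-versus-selection contrast described in answers 4 and 5.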

Gemini Transcript: https://gemini.google.com/share/cbf718eec538
