## Group Name: MLOPS Group 63

## Group Member Names:
1.   Sivakumar G - 2023aa05486
2.   Pabbisetty Jayakrishna - 2023aa05487
3.   Ravi shankar S - 2023aa05488
4.   Srivatsan V R - 2023aa05962

## M3: Model Experimentation and Packaging

**Objective**: Train a machine learning model, perform hyperparameter tuning, and package the model for deployment.

### Tasks:

#### 1. Hyperparameter Tuning:
- Use a library like Optuna or Scikit-learn’s GridSearchCV to perform hyperparameter tuning on a chosen model.
- Document the tuning process and the best parameters found.

#### 2. Model Packaging:
- Package the best-performing model using tools like Docker and Flask.
- Create a Dockerfile and a simple Flask application to serve the model.

### Deliverables:
- A report on hyperparameter tuning results.
- A Dockerfile and Flask application code.
- Screenshots of the model running in a Docker container.


This notebook covers the first task:

#### 1. Hyperparameter Tuning:
- The notebook uses Optuna to tune a `RandomForestRegressor` on the California Housing dataset.
- It documents the tuning process, including the hyperparameters tested and the best-performing values found.



### Imports: Libraries needed for Optuna, scikit-learn, and evaluation.


In [13]:
# Import necessary libraries
import optuna
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

### Data Loading: Fetch and split the California Housing dataset.


In [5]:
# Load California Housing dataset
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target

In [6]:
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Objective Function:
The objective function is what Optuna minimizes. For each trial, it suggests values for four hyperparameters of the RandomForest model, evaluates the model with 3-fold cross-validation on the training set, and returns the average MSE.


In [7]:
# Define the objective function for Optuna
def objective(trial):
    # Suggest hyperparameters
    n_estimators = trial.suggest_int("n_estimators", 50, 500, step=50)
    max_depth = trial.suggest_int("max_depth", 5, 30, step=5)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 20, step=2)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 10)

    # Initialize the model with suggested hyperparameters
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=42,
    )

    # Perform cross-validation
    scores = cross_val_score(model, X_train, y_train, cv=3, scoring="neg_mean_squared_error")
    avg_mse = -np.mean(scores)
    return avg_mse
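
The sign flip in `avg_mse = -np.mean(scores)` follows from scikit-learn's convention that higher scores are better, so `"neg_mean_squared_error"` returns negated MSE values. The sketch below illustrates this on a small synthetic dataset; the data from `make_regression` and the names `X_demo`, `y_demo`, and `demo_model` are assumptions chosen only to keep the example fast, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: used only for illustration).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=8, noise=10.0, random_state=42
)
demo_model = RandomForestRegressor(n_estimators=20, random_state=42)

# Each fold's score is the negated MSE, so every entry is <= 0.
neg_mse_scores = cross_val_score(
    demo_model, X_demo, y_demo, cv=3, scoring="neg_mean_squared_error"
)

# Negating the mean recovers a positive MSE that Optuna can minimize.
avg_mse = -np.mean(neg_mse_scores)
print("Per-fold scores (negated MSE):", neg_mse_scores)
print("Average MSE:", avg_mse)
```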

### Optuna Optimization:
This part runs the optimization process, trying 10 different sets of hyperparameters (`n_trials=10`).


In [8]:
# Create Optuna study and perform optimization
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10)

[I 2025-01-26 23:50:16,499] A new study created in memory with name: no-name-3cfeb86f-6c24-4f90-b19c-976c0748f6b0
[I 2025-01-26 23:51:06,220] Trial 0 finished with value: 0.2799784137895807 and parameters: {'n_estimators': 350, 'max_depth': 25, 'min_samples_split': 14, 'min_samples_leaf': 8}. Best is trial 0 with value: 0.2799784137895807.
[I 2025-01-26 23:52:28,019] Trial 1 finished with value: 0.27457571283102794 and parameters: {'n_estimators': 400, 'max_depth': 25, 'min_samples_split': 10, 'min_samples_leaf': 6}. Best is trial 1 with value: 0.27457571283102794.
[I 2025-01-26 23:53:12,211] Trial 2 finished with value: 0.2785342581544273 and parameters: {'n_estimators': 250, 'max_depth': 15, 'min_samples_split': 8, 'min_samples_leaf': 7}. Best is trial 1 with value: 0.27457571283102794.
[I 2025-01-26 23:53:40,674] Trial 3 finished with value: 0.44420870693335307 and parameters: {'n_estimators': 350, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 2}. Best is trial 1 with

### Results: 
Prints the best hyperparameters and the corresponding MSE.


In [9]:
# Display best parameters and best score
print("Best Parameters:")
print(study.best_params)
print("\nBest MSE:", study.best_value)

Best Parameters:
{'n_estimators': 400, 'max_depth': 25, 'min_samples_split': 10, 'min_samples_leaf': 6}

Best MSE: 0.27457571283102794


In [10]:
# Train and evaluate the model with the best parameters
best_params = study.best_params
best_model = RandomForestRegressor(**best_params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
print("\nTest MSE:", test_mse)


Test MSE: 0.2649736667996871
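
MSE is expressed in squared target units, which makes it hard to read directly. Since the California Housing target is the median house value in units of $100,000, taking the square root gives an RMSE in those same units. The short calculation below uses only the two MSE values reported in this notebook; the RMSE comes out at roughly 0.51 on the test set.

```python
import math

# Values copied from the notebook outputs above.
cv_mse = 0.27457571283102794   # best cross-validated MSE (Trial 1)
test_mse = 0.2649736667996871  # MSE on the held-out test set

# RMSE is the square root of MSE and is in target units ($100,000).
cv_rmse = math.sqrt(cv_mse)
test_rmse = math.sqrt(test_mse)

print(f"CV RMSE:   {cv_rmse:.4f} (about ${cv_rmse * 100_000:,.0f})")
print(f"Test RMSE: {test_rmse:.4f} (about ${test_rmse * 100_000:,.0f})")
```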


### Export the best-performing model

In [12]:
import joblib
joblib.dump(best_model, 'best_rf_model.joblib')
print("Model saved as 'best_rf_model.joblib'")


Model saved as 'best_rf_model.joblib'
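
Before packaging, it can be worth confirming that a saved model reloads and predicts identically. The sketch below shows one way to do such a round-trip check with `joblib.load`. It is a minimal illustration under stated assumptions: it trains a small throwaway model on synthetic data and writes to a temporary file named `demo_rf_model.joblib` (a hypothetical name), rather than touching `best_rf_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model (assumption: stands in for best_model).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=0.1, random_state=42
)
demo_model = RandomForestRegressor(n_estimators=20, random_state=42)
demo_model.fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_rf_model.joblib")
    joblib.dump(demo_model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions.
predictions_match = bool(
    np.allclose(demo_model.predict(X_demo[:5]), reloaded_model.predict(X_demo[:5]))
)
print("Predictions match after reload:", predictions_match)
```

A serving application would perform the same `joblib.load` step once at startup, using the same scikit-learn version that was used to save the model.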


### Final Summary of Hyperparameter Tuning:

The hyperparameter tuning process involved using Optuna to optimize the parameters of a Random Forest model. The following parameters were tuned during the process:

- **n_estimators**: Number of trees in the forest
- **max_depth**: Maximum depth of the trees
- **min_samples_split**: Minimum number of samples required to split an internal node
- **min_samples_leaf**: Minimum number of samples required to be at a leaf node

#### Tuning Process:
The tuning was performed with Optuna, which ran 10 trials to search for the best combination of hyperparameters. Each trial trained the model with one set of parameters and scored it by the mean squared error averaged over 3-fold cross-validation on the training set. The goal was to minimize this validation error, which is the `value` reported for each trial in the Optuna log.

The trials ran with the following results:

- **Trial 0**: `{'n_estimators': 350, 'max_depth': 25, 'min_samples_split': 14, 'min_samples_leaf': 8}` resulted in a validation error of `0.2799784137895807`.
- **Trial 1**: `{'n_estimators': 400, 'max_depth': 25, 'min_samples_split': 10, 'min_samples_leaf': 6}` achieved the best result so far with a validation error of `0.27457571283102794`.
- **Trial 2**: `{'n_estimators': 250, 'max_depth': 15, 'min_samples_split': 8, 'min_samples_leaf': 7}` resulted in `0.2785342581544273`.
- **Trial 3**: `{'n_estimators': 350, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 2}` resulted in a much higher validation error of `0.44420870693335307`.
- **Trial 4**: `{'n_estimators': 300, 'max_depth': 20, 'min_samples_split': 8, 'min_samples_leaf': 10}` gave a validation error of `0.28578177814243194`.
- **Trial 5**: `{'n_estimators': 350, 'max_depth': 15, 'min_samples_split': 18, 'min_samples_leaf': 3}` gave a validation error of `0.2747816747063055`, slightly worse than Trial 1.
- **Trial 6**: `{'n_estimators': 300, 'max_depth': 10, 'min_samples_split': 6, 'min_samples_leaf': 5}` resulted in a validation error of `0.29287349128411705`.
- **Trial 7**: `{'n_estimators': 500, 'max_depth': 5, 'min_samples_split': 16, 'min_samples_leaf': 1}` produced a much higher validation error of `0.44479343715089775`.
- **Trial 8**: `{'n_estimators': 150, 'max_depth': 30, 'min_samples_split': 12, 'min_samples_leaf': 7}` resulted in `0.2793117434904573`.
- **Trial 9**: `{'n_estimators': 400, 'max_depth': 15, 'min_samples_split': 14, 'min_samples_leaf': 7}` resulted in `0.2778982033241502`.
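
The trial results listed above can be ranked with a few lines of plain Python, which makes the gap between shallow and deeper trees easier to see. The values and `max_depth` settings below are copied from the list; the variable names are illustrative only.

```python
# Validation error (cross-validated MSE) per trial, copied from the list above.
trial_mse = {
    0: 0.2799784137895807,
    1: 0.27457571283102794,
    2: 0.2785342581544273,
    3: 0.44420870693335307,
    4: 0.28578177814243194,
    5: 0.2747816747063055,
    6: 0.29287349128411705,
    7: 0.44479343715089775,
    8: 0.2793117434904573,
    9: 0.2778982033241502,
}
# max_depth used in each trial, also copied from the list above.
trial_max_depth = {0: 25, 1: 25, 2: 15, 3: 5, 4: 20, 5: 15, 6: 10, 7: 5, 8: 30, 9: 15}

# Rank trials from lowest to highest validation error.
ranked = sorted(trial_mse.items(), key=lambda item: item[1])
best_trial = ranked[0][0]

for number, mse in ranked:
    print(f"Trial {number}: MSE={mse:.4f}, max_depth={trial_max_depth[number]}")

# The two worst trials are the only ones with max_depth=5.
worst_two = [number for number, _ in ranked[-2:]]
print("Best trial:", best_trial)
print("Worst two trials:", worst_two)
```

Inside the notebook, Optuna's `study.trials_dataframe()` offers a similar tabular view directly from the study object.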

#### Best Hyperparameters:
The best combination of hyperparameters found during the tuning process was:
- **n_estimators**: 400
- **max_depth**: 25
- **min_samples_split**: 10
- **min_samples_leaf**: 6

This combination gave the lowest validation error, **0.27457571283102794**, in Trial 1, making it the best of the 10 configurations tried.

### Conclusion:
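The hyperparameter tuning process identified the best-performing set of parameters among the 10 trials. A model refit with these parameters on the full training set reached a test MSE of 0.2649736667996871, close to the cross-validated estimate. These parameters will be used in the final model for deployment and further evaluation.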