In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Install the ucimlrepo package to easily import dataset
!pip install ucimlrepo

from ucimlrepo import fetch_ucirepo 



### Building a baseline model

We first build a baseline model, which serves as a benchmark against which the more complex models developed later can be compared.
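To make the role of the baseline concrete, here is a minimal sketch assuming the baseline is scikit-learn's `DummyRegressor` with its default `mean` strategy (the actual configuration lives in `train_and_evaluate_model.py`, which is not shown here). The toy arrays `X_toy` and `y_toy` are illustrative values, not the bike-sharing data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Toy targets standing in for hourly rental counts (illustrative only)
y_toy = np.array([10.0, 20.0, 30.0, 40.0, 100.0])
# DummyRegressor ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# The "mean" strategy always predicts the mean of the training targets
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

# Any useful model should achieve a lower MAE than this constant prediction
baseline_mae = mean_absolute_error(y_toy, pred)
print(pred[0], baseline_mae)  # 40.0 24.0
```

Because the prediction is a constant, the baseline MAE only reflects how spread out the target is; it is the error a model has to beat to show that it learned anything from the features.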

### Building a linear model

Next, we fit a linear regression model that predicts the total count of rental bikes (`cnt`), covering both casual and registered users, from the input variables. The model is trained on the training set.

### Building tree-based models and selecting the best

Next, we develop tree-based models: a single decision tree and ensemble methods such as Random Forests and Gradient Boosting.

First, let's check the generalization performance of the decision tree regressor and the ensemble methods with default parameters.
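The comparison itself is carried out by `train_and_evaluate_model.py`, whose contents are not shown. The sketch below is one plausible way to run such a comparison, not the script's actual code. It assumes `cross_validate` with MAE scoring and uses synthetic data from `make_regression` (the names `X_demo`, `y_demo`, `models`, and `results` are hypothetical). Note that the training MAE here is averaged over the training folds, which may differ from how the script computes it.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bike-sharing features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

models = {
    "Dummy": DummyRegressor(),
    "Linear": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MAE is exposed as a negative value
    cv = cross_validate(model, X_demo, y_demo, cv=5,
                        scoring="neg_mean_absolute_error",
                        return_train_score=True)
    train_mae = -cv["train_score"].mean()
    cv_mae = -cv["test_score"].mean()
    cv_std = cv["test_score"].std()
    results[name] = (train_mae, cv_mae, cv_std)
    print(f"{name}: train MAE {train_mae:.2f}, CV MAE {cv_mae:.2f} ± {cv_std:.2f}")
```

Comparing the two columns per model is what reveals overfitting: a large gap between training and cross-validated MAE means the model fits the training folds much better than unseen folds.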

In [3]:
# Import necessary modules
from sklearn.model_selection import train_test_split

# Load data
hourly_dataset  = fetch_ucirepo(id=275)
features = hourly_dataset.data.features
features = features.drop(['dteday'], axis=1) 
target = hourly_dataset.data.targets
target = target['cnt'].values
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Execute the train_and_evaluate_model.py script
%run train_and_evaluate_model.py



Dummy Regressor - Mean Absolute Error (MSE) on training set: 143.26
Dummy Regressor - Mean Absolute Error (MAE) using cross-validation: 143.27 ± 3.30

Linear Regressor - Mean Absolute Error (MSE) on training set: 106.55
Linear Regressor - Mean Absolute Error (MAE) using cross-validation: 106.63 ± 2.72

Decision Tree Regressor - Mean Absolute Error (MSE) on training set: 0.02
Decision Tree Regressor - Mean Absolute Error (MAE) using cross-validation: 35.58 ± 1.16

Random Forest Regressor - Mean Absolute Error (MSE) on training set: 9.60
Random Forest Regressor - Mean Absolute Error (MAE) using cross-validation: 26.18 ± 0.73

Gradient Boosting Regressor - Mean Absolute Error (MSE) on training set: 48.74
Gradient Boosting Regressor - Mean Absolute Error (MAE) using cross-validation: 48.92 ± 1.54



The linear regression model performs consistently on the training set and under cross-validation, with a mean absolute error (MAE) of about 106.6 in both cases (106.55 and 106.63). Note that the script output labels the training-set figure "MSE", but the metric reported is MAE. The close agreement between the two values indicates that the model is not overfitting, and it improves on the dummy baseline (143.26). The error is still large, however, so there is clear room for improvement.


- Decision Tree Regressor: The training MAE is extremely low (0.02), indicating a near-perfect fit to the training data. The cross-validated MAE is far higher (35.58), a clear sign of overfitting.
- Random Forest Regressor: The training MAE (9.60) is higher than the Decision Tree Regressor's but still low. The cross-validated MAE (26.18) is lower than the Decision Tree Regressor's, indicating better generalization, although the gap between training and cross-validated error still points to some overfitting.
- Gradient Boosting Regressor: The training MAE (48.74) is noticeably higher than that of both the Decision Tree Regressor and the Random Forest Regressor, and it is almost identical to the cross-validated MAE (48.92), so this model does not overfit. Its cross-validated error is nevertheless the highest of the three tree-based models, which suggests that it underfits with default parameters.

Among the models evaluated, the Random Forest Regressor performs best: it achieves the lowest cross-validated MAE (26.18 ± 0.73), ahead of the Decision Tree (35.58) and Gradient Boosting (48.92) models.

However, the gap between its training MAE (9.60) and its cross-validated MAE suggests some overfitting, so further investigation is warranted. Hyperparameter tuning and feature engineering could improve the model's performance and make it more robust for real-world applications.

#### Hyperparameter tuning - Linear regression model - Ridge or Lasso
- We tune the regularization strength (`alpha`) for both Lasso and Ridge regression using grid search with cross-validation.
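Two details of such a search are worth keeping in mind. When `scoring` is not set, `GridSearchCV` ranks candidates by the estimator's default score, which is R² for regressors, not MAE. In addition, the Ridge and Lasso penalties depend on the scale of the features. The sketch below is a possible variant, not the notebook's method: it assumes standardized features via a `Pipeline` and selects `alpha` by MAE, on synthetic data (`X_demo`, `y_demo`, `pipe`, and `search` are hypothetical names).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Scaling happens inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names the step after the lowercased class name: "ridge"
param_grid_demo = {"ridge__alpha": [0.001, 0.01, 0.1, 0.5, 1, 10, 100]}

# Select alpha by MAE so the search optimizes the metric that is reported
search = GridSearchCV(pipe, param_grid_demo, cv=10,
                      scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["ridge__alpha"])
print(f"Cross-validated MAE: {-search.best_score_:.2f}")
```

The same pattern applies to `Lasso` by swapping the estimator and using the `lasso__alpha` key.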

In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_absolute_error


# Define hyperparameters to tune
param_grid = {'alpha': [0.001, 0.01, 0.1, 0.5, 1, 10, 100]}

# Lasso Regression
lasso = Lasso()
lasso_cv = GridSearchCV(lasso, param_grid, cv=10)
lasso_cv.fit(X_train, y_train)
# Compute predictions on the training set
y_train_pred_lasso= lasso_cv.predict(X_train)

# Ridge Regression
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=10)
ridge_cv.fit(X_train, y_train)
y_train_pred_ridge= ridge_cv.predict(X_train)



# Evaluate the models
print("Lasso Regression:")
print("Best alpha:", lasso_cv.best_params_)
print(f"Training MAE: {mean_absolute_error(y_train, y_train_pred_lasso):.2f}")


print("\nRidge Regression:")
print("Best alpha:", ridge_cv.best_params_)
print(f"Training MAE: {mean_absolute_error(y_train, y_train_pred_ridge):.2f}")



Lasso Regression:
Best alpha: {'alpha': 0.01}
Training MAE: 106.55

Ridge Regression:
Best alpha: {'alpha': 1}
Training MAE: 106.55


Tuning the hyperparameters brings little improvement: the training MAE remains 106.55 for both Lasso and Ridge.

#### Hyperparameter tuning - Random Forest Regressor

In [5]:
from sklearn.ensemble import RandomForestRegressor

# Define the Random Forest Regression model
rf_model = RandomForestRegressor(random_state=42)

# Define the hyperparameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV with the model, hyperparameter grid, and scoring metric
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid,
                           cv=10, scoring='neg_mean_absolute_error', n_jobs=-1)

# Perform grid search to find the best hyperparameters
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and best mean absolute error
print("Best Hyperparameters:", grid_search.best_params_)
print(f"Cross-validated score of the best estimator: {-grid_search.best_score_:.2f}")

# Evaluate the model with best hyperparameters on the training set
best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(X_train)
train_mae = mean_absolute_error(y_train, y_train_pred)
print(f"Mean Absolute Error on Training Set: {train_mae:.2f}")


Best Hyperparameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 300}
Cross-validated score of the best estimator: 26.06
Mean Absolute Error on Training Set: 9.46


Again, tuning brought little improvement: the cross-validated MAE dropped only from 26.18 to 26.06, and the training MAE from 9.60 to 9.46.
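One way to see why tuning helps so little is to look at every grid point rather than only the winner. The sketch below reads `cv_results_` from a fitted `GridSearchCV` into a pandas table. It uses a deliberately small, hypothetical grid and synthetic data so that it runs quickly; the names `small_grid`, `demo_search`, and `summary` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=42)

# A reduced grid (assumption: chosen only to keep the example fast)
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}
demo_search = GridSearchCV(RandomForestRegressor(random_state=42), small_grid,
                           cv=3, scoring="neg_mean_absolute_error")
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination
cv_table = pd.DataFrame(demo_search.cv_results_)
cv_table["cv_mae"] = -cv_table["mean_test_score"]  # flip the sign back to MAE
summary = cv_table[["param_n_estimators", "param_max_depth",
                    "cv_mae", "std_test_score"]].sort_values("cv_mae")
print(summary.to_string(index=False))

# A small spread relative to the fold-to-fold standard deviation means
# the grid is essentially flat and further tuning is unlikely to help
spread = summary["cv_mae"].max() - summary["cv_mae"].min()
print(f"Spread of CV MAE across the grid: {spread:.2f}")
```

If the spread across the grid is comparable to `std_test_score`, the differences between settings are within noise, which is consistent with the default parameters already being close to the best ones found.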


### Models
| Model Name | Mean Absolute Error (MAE) on training set | MAE via cross-validation |
|---------------|---------------|---------------|
| 1. Dummy Regressor (Baseline) | 143.26 | 143.27 ± 3.30 |
| 2. Linear Regression | 106.554 | 106.630 ± 2.722 |
| 3. Decision Tree | 0.015 | 35.621 ± 1.119 |
| 4. Random Forest | 9.604 | 26.213 ± 0.640 |
| 5. Random Forest (after tuning) | 9.459 | 26.063 |
| 6. Gradient Boosting | 48.744 | 48.916 ± 1.537 |
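The table reports training and cross-validated errors only; the held-out split (`X_test`, `y_test`) created earlier is not used in this section. A minimal sketch of a final check on held-out data is shown below, assuming a Random Forest with 300 trees as selected by the grid search. It runs on synthetic data, so the names `X_tr`, `X_te`, `y_tr`, `y_te`, and `final_model` are hypothetical and the printed value says nothing about the bike-sharing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Refit the selected configuration on the full training split
final_model = RandomForestRegressor(n_estimators=300, random_state=42)
final_model.fit(X_tr, y_tr)

# The test split is touched once, after all model selection is finished
test_mae = mean_absolute_error(y_te, final_model.predict(X_te))
print(f"Test MAE: {test_mae:.2f}")
```

Keeping the test split out of all tuning decisions is what makes this number an unbiased estimate of performance on new data.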
