# Predict Bike Sharing Demand with AutoGluon Template

## Project: Predict Bike Sharing Demand with AutoGluon
This notebook is a template with each step that you need to complete for the project.

Please fill in your code where there are explicit `?` markers in the notebook. You are welcome to add more cells and code as you see fit.

Once you have completed all the code implementations, please export your notebook as an HTML file so the reviewers can view your code. Make sure all cell outputs are displayed correctly.

`File-> Export Notebook As... -> Export Notebook as HTML`

There is a writeup to complete as well after all code implementation is done. Please answer all questions and attach the necessary tables and charts. You can complete the writeup in either markdown or PDF.

Completing the code template and writeup template will cover all of the rubric points for this project.

The rubric contains "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. The stand out suggestions are optional. If you decide to pursue the "stand out suggestions", you can include the code in this notebook and also discuss the results in the writeup file.

## Step 1: Create an account with Kaggle

## Step 2: Download the Kaggle dataset using the kaggle python library

1. Notebook should be using a `ml.t3.medium` instance (2 vCPU + 4 GiB)
2. Notebook should be using kernel: `Python 3 (MXNet 1.8 Python 3.7 CPU Optimized)`

### Install packages

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
!pip install -U pip setuptools wheel
!pip install autogluon --no-cache-dir
# Without --no-cache-dir, smaller AWS instances may have trouble installing

import numpy
print(numpy.__version__)

1.26.4


### Setup Kaggle API Key

In [12]:
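# Create the .kaggle directory and an empty kaggle.json file
import os
import json

kaggle_dir = os.path.join(os.path.expanduser('~'), '.kaggle')
os.makedirs(kaggle_dir, exist_ok=True)
kaggle_json_path = os.path.join(kaggle_dir, 'kaggle.json')
# Create the file if it does not exist yet; os.chmod fails on a missing path
open(kaggle_json_path, 'a').close()
# Restrict access to the owner; the Kaggle CLI warns about wider permissions
os.chmod(kaggle_json_path, 0o600)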

print("Kaggle config ready.")

Kaggle config ready.


In [16]:
# Fill in your user name and key from creating the kaggle account and API token file
import json
kaggle_username = "maryamadibi"
kaggle_key = "1ea64e8b20324c1b0268303ee0c03486"

# Save the API credentials to the kaggle.json file so the kaggle CLI can read them
with open(kaggle_json_path, "w") as f:
    json.dump({"username": kaggle_username, "key": kaggle_key}, f)
os.chmod(kaggle_json_path, 0o600)


### Download and explore dataset

### Go to the bike sharing demand competition and agree to the terms

In [None]:
# Download the dataset, it will be in a .zip file so you'll need to unzip it as well.
!kaggle competitions download -c bike-sharing-demand
# If you already downloaded it, the -o flag tells unzip to overwrite the extracted files
!unzip -o bike-sharing-demand.zip

In [None]:
!pip uninstall -y numpy
!pip install numpy==1.23.5

# Re-install autogluon to ensure compatibility with the downgraded numpy
!pip install autogluon --no-cache-dir

In [None]:
import pandas as pd
from autogluon.tabular import TabularPredictor

In [None]:
# Create the train dataset in pandas by reading the csv
# The datetime column is read as a string here; it is parsed with pd.to_datetime in Step 4
train = pd.read_csv('train.csv')
train.head()

In [None]:
# Simple output of the train dataset to view the min/max/variation of the dataset features.
train.describe()

In [None]:
# Create the test dataframe by reading the csv; the datetime column is parsed later in Step 4
test = pd.read_csv('test.csv')
test.head()

In [None]:
# Same thing as train and test dataset
submission = pd.read_csv('sampleSubmission.csv')
submission.head()

## Step 3: Train a model using AutoGluon’s Tabular Prediction

Requirements:
* We are predicting `count`, so it is the label we are setting.
* Ignore `casual` and `registered` columns as they are also not present in the test dataset.
* Use the `root_mean_squared_error` as the metric to use for evaluation.
* Set a time limit of 10 minutes (600 seconds).
* Use the preset `best_quality` to focus on creating the best model.
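
To make the evaluation metric in the requirements above concrete, here is a minimal sketch of root mean squared error computed by hand with NumPy. The helper name `rmse` and the sample counts are hypothetical and only illustrate the formula; AutoGluon computes the metric internally.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical hourly rental counts and predictions.
actual = [16, 40, 32, 13]
predicted = [20, 38, 30, 10]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses (for example during rush hours) weigh more heavily than many small ones.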

In [None]:
eval_metric = 'root_mean_squared_error'
label = 'count'
ignored_columns = ["casual", "registered"]
train_data = train
time_limit = 600
presets = "best_quality"

In [None]:
predictor = TabularPredictor(
    label=label,
    problem_type='regression',
    eval_metric=eval_metric,
    learner_kwargs={'ignored_columns': ignored_columns}
).fit(
    train_data=train_data,
    time_limit=time_limit,
    presets=presets
)

### Review AutoGluon's training run with ranking of models that did the best.

In [None]:
predictor.fit_summary()

In [None]:
# Leaderboard dataframe
leaderboard_df = pd.DataFrame(predictor.leaderboard(silent=True))
leaderboard_df

In [None]:
# Output the model's `score_val` in a bar chart to compare performance
import matplotlib.pyplot as plt
leaderboard_df.plot(kind="bar", x="model", y="score_val", figsize=(14, 7))
plt.ylabel("RMSE Scores")
plt.show()
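
AutoGluon reports `score_val` in a higher-is-better form, so RMSE values appear negated in the leaderboard. The sketch below, using a small hypothetical leaderboard (the model names and scores are made up for illustration), shows one way to recover positive RMSE values and rank the models.

```python
import pandas as pd

# Hypothetical leaderboard rows; real values come from predictor.leaderboard().
leaderboard = pd.DataFrame({
    "model": ["WeightedEnsemble_L3", "LightGBM_BAG_L2", "KNeighborsUnif_BAG_L1"],
    "score_val": [-55.0, -57.2, -101.5],
})

# Flip the sign to recover RMSE, then sort so the best (lowest) comes first.
leaderboard["rmse"] = -leaderboard["score_val"]
ranked = leaderboard.sort_values("rmse").reset_index(drop=True)
print(ranked[["model", "rmse"]])
```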

### Create predictions from test dataset

In [None]:
predictions = predictor.predict(test)
predictions.head()

#### NOTE: Kaggle will reject the submission if any prediction is negative, so negative values are set to 0.
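
As a compact alternative to counting with `apply`, the check and the fix can each be written as one vectorized pandas call. This is a sketch on hypothetical values, not the predictions from this run.

```python
import pandas as pd

# Hypothetical predictions, including two negative values.
preds = pd.Series([12.4, -3.1, 0.0, 87.9, -0.2])

n_negative = int((preds < 0).sum())   # count negatives without apply()
clipped = preds.clip(lower=0)         # negatives become 0, others are unchanged

print(n_negative)
print(clipped.tolist())
```

`clip` returns a new Series, so the original predictions stay available for inspection.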

In [None]:
# Describe the `predictions` series to see if there are any negative values
predictions.describe()

In [None]:
# How many negative values do we have?
negative_pred_count = predictions.apply(lambda x: 1 if x<0 else 0)

pred_pos_count = (negative_pred_count==0).sum()
pred_neg_count = (negative_pred_count==1).sum()

print("Total predictions                :", len(predictions.index))
print("Total positive prediction values :", pred_pos_count)
print("Total negative prediction values :", pred_neg_count)

In [None]:
# Set them to zero
predictions[predictions<0] = 0

# Rechecking, if no predictions are less than 0
negative_pred_count = predictions.apply(lambda x: 1 if x<0 else 0)
pred_neg_count = (negative_pred_count==1).sum()
print(f"No. of negative predictions: {pred_neg_count}")

### Set predictions to submission dataframe, save, and submit

In [None]:
submission["count"] = predictions
submission.to_csv("submission.csv", index=False)

In [None]:
!kaggle competitions submit -c bike-sharing-demand -f submission.csv -m "first raw submission"

#### View submission via the command line or in the web browser under the competition's page - `My Submissions`

In [None]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

#### Initial score of 1.83385
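
The Kaggle score is on a different scale from the validation RMSE because, as far as the competition's evaluation page describes, submissions are scored with root mean squared logarithmic error (RMSLE). The helper below is a hypothetical sketch of that formula, not Kaggle's own code; it also shows why negative predictions cannot be scored.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values")
    # log1p(x) = log(1 + x), which keeps zero counts well defined.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical counts: the same absolute miss costs more at low demand.
print(rmsle([5], [15]))
print(rmsle([500], [510]))
```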

## Step 4: Exploratory Data Analysis and Creating an additional feature
* Any additional feature will do, but a great suggestion would be to separate out the datetime into hour, day, or month parts.
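
A minimal sketch of the suggestion above, using two hypothetical timestamps in the same format as the competition CSVs. Once the column is parsed, the pandas `dt` accessor exposes each part.

```python
import pandas as pd

# Hypothetical rows in the same timestamp format as the competition CSVs.
df = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-07-04 17:00:00"]})
df["datetime"] = pd.to_datetime(df["datetime"])

df["year"] = df["datetime"].dt.year
df["month"] = df["datetime"].dt.month
df["dayofweek"] = df["datetime"].dt.dayofweek  # 0 = Monday ... 6 = Sunday
df["hour"] = df["datetime"].dt.hour
print(df)
```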

In [None]:
# Create a histogram of all features to show the distribution of each one. This is part of the exploratory data analysis
train.hist(figsize=(15,20))
plt.tight_layout()
plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Keep only numeric columns to avoid string/datetime errors
corr_data = train.select_dtypes(include=[np.number])

# Compute correlation matrix
corr_matrix = corr_data.corr()

# Create a mask to eliminate redundant lower triangle
corr_mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Plot heatmap
plt.figure(figsize=(15, 15), dpi=120)
sns.heatmap(
    corr_matrix,
    cmap='RdPu',
    cbar_kws={"shrink": .5},
    vmin=-1, vmax=1, center=0,
    square=True,
    mask=~corr_mask,  # only upper triangle
    annot=True,
    linewidths=0.5,
    annot_kws={"size": 13}
)
plt.xticks(fontsize=14, rotation=90)
plt.yticks(fontsize=14, rotation=0)
plt.title("HeatMap: Correlation Matrix", fontsize=20, fontweight='bold')
plt.tight_layout()
plt.show()


In [None]:
# create a new feature
train['datetime'] = pd.to_datetime(train['datetime'])
test['datetime'] = pd.to_datetime(test['datetime'])

In [None]:
train['datetime'].head()

In [None]:
train["year"] = train["datetime"].dt.year
train["month"] = train["datetime"].dt.month
train["day"] = train["datetime"].dt.dayofweek
train["hour"] = train["datetime"].dt.hour
train.drop(["datetime"], axis=1, inplace=True)
train.head()

In [None]:
test["year"] = test["datetime"].dt.year
test["month"] = test["datetime"].dt.month
test["day"] = test["datetime"].dt.dayofweek
test["hour"] = test["datetime"].dt.hour
test.drop(["datetime"], axis=1, inplace=True)
test.head()

In [None]:
train.info()

## Make category types for these so models know they are not just numbers
* AutoGluon originally sees these as ints, but in reality they are int representations of a category.
* Setting the dtype to category will classify these as categories in AutoGluon.
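
The effect of the conversion can be checked on a small hypothetical frame: the stored values stay the same, but the dtype changes from integer to `category`.

```python
import pandas as pd

# Hypothetical integer-coded columns, as in the raw dataset.
df = pd.DataFrame({"season": [1, 2, 3, 4, 1], "weather": [1, 1, 2, 3, 1]})
print(df.dtypes)

for col in ["season", "weather"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
print(df["season"].cat.categories.tolist())
```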

In [None]:
train["season"] = train["season"].astype("category")
train["weather"] = train["weather"].astype("category")
test["season"] = test["season"].astype("category")
test["weather"] = test["weather"].astype("category")

In [None]:
# View our new features
train.head()

In [None]:
# View histogram of all features again now with the hour feature
train.hist(figsize=(15,20))
plt.tight_layout()
plt.show()

In [None]:
#Variation in target variable count with respect to new features derived from datetime feature
sns.catplot(x="hour",y="count",data=train,kind='bar',height=5,aspect=1.5)
plt.tight_layout()
plt.show()

In [None]:
# Variation in `count` w.r.t. `day` (dayofweek) [0: Monday -> 6: Sunday]
sns.catplot(x="day",y="count",data=train,kind='bar',height=5,aspect=1.5)
plt.tight_layout()
plt.xticks(ticks=range(0,7), labels=["Monday", "Tuesday", "Wednesday",
                                     "Thursday", "Friday", "Saturday", "Sunday"])
plt.show()

In [None]:
sns.catplot(x="month",y="count",data=train,kind='bar',height=5,aspect=1.5)
sns.catplot(x="season",y="count",data=train,kind='bar',height=5,aspect=1.5)
plt.tight_layout()
plt.show()

In [None]:
sns.catplot(x="year",y="count",data=train,kind='bar',height=5,aspect=1.5)
plt.tight_layout()
plt.show()

In [None]:
sns.catplot(x="weather",y="count",data=train,kind='bar',height=5,aspect=1.5)
plt.xticks(ticks=range(0,4), labels=["Clear","Misty",
                                     "Light_\nRain/Snow\n/Thunderstorm",
                                     "Heavy_\nRain/Snow\n/Thunderstorm"])
plt.tight_layout()
plt.show()

In [None]:
# Adding a new feature 'day_type' to the train data
train["day_type"]=""
train.loc[(train.holiday==1),"day_type"] = "holiday"
train.loc[((train.holiday==0) & (train.workingday==1)), "day_type"] = "weekday"
train.loc[((train.holiday==0) & (train.workingday==0)), "day_type"] = "weekend"

# Adding features - 'day_type' in test data
test["day_type"]=""
test.loc[(test.holiday==1),"day_type"] = "holiday"
test.loc[((test.holiday==0) & (test.workingday==1)), "day_type"] = "weekday"
test.loc[((test.holiday==0) & (test.workingday==0)), "day_type"] = "weekend"

# Change the datatype to category
train["day_type"] = train["day_type"].astype("category")
test["day_type"] = test["day_type"].astype("category")

train.head()
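
Since the same three rules are applied to both `train` and `test`, they could also be wrapped in one function. The sketch below is an alternative formulation using `np.select`; the helper name `add_day_type` and the `"unknown"` fallback label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def add_day_type(df):
    """Return a copy of df with a categorical day_type column (hypothetical helper)."""
    out = df.copy()
    conditions = [
        out["holiday"] == 1,
        (out["holiday"] == 0) & (out["workingday"] == 1),
        (out["holiday"] == 0) & (out["workingday"] == 0),
    ]
    labels = ["holiday", "weekday", "weekend"]
    # Rows matching no rule get the fallback label instead of an empty string.
    out["day_type"] = pd.Categorical(np.select(conditions, labels, default="unknown"))
    return out

sample = pd.DataFrame({"holiday": [1, 0, 0], "workingday": [0, 1, 0]})
print(add_day_type(sample)["day_type"].tolist())
```

Calling the same function on both frames keeps the feature definition in one place, which helps when the test data has to be reloaded later.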

In [None]:
# Statistics of all features within the train data
train.describe()

In [None]:
# Dropping highly correlated independent feature 'atemp' from train and test datasets
train.drop(["atemp"], axis=1, inplace=True)
test.drop(["atemp"], axis=1, inplace=True)
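
One way to find such redundant features programmatically is to scan the upper triangle of the correlation matrix. The sketch below uses hypothetical sample values, and the helper name `correlated_pairs` and the 0.95 threshold are assumptions rather than settings from this project.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.95):
    """List (col_a, col_b, corr) for numeric column pairs with |corr| above threshold."""
    corr = df.select_dtypes(include=[np.number]).corr()
    # Keep the strict upper triangle so each pair is reported once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = []
    for i, j in zip(*np.where(upper)):
        value = corr.iloc[i, j]
        if abs(value) > threshold:
            pairs.append((corr.index[i], corr.columns[j], round(float(value), 4)))
    return pairs

# Hypothetical readings where atemp closely tracks temp.
sample = pd.DataFrame({
    "temp": [10, 12, 15, 20, 25, 30],
    "atemp": [12, 14, 18, 24, 29, 35],
    "humidity": [80, 30, 60, 45, 90, 35],
})
print(correlated_pairs(sample))
```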

In [None]:
train.info()

In [None]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Exclude unwanted columns and select only numeric features
feature_numeric = [i for i in train.columns if i not in ['casual', 'registered']]
corr_data = train[feature_numeric].select_dtypes(include=[np.number])

# Calculate the correlation matrix
corr_matrix = corr_data.corr()

# Create a mask to hide the lower triangle of the correlation matrix
corr_mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Plot the heatmap
plt.figure(figsize=(15, 15), dpi=120)
sns.heatmap(
    corr_matrix,
    cmap='RdPu',
    cbar_kws={"shrink": .5},
    vmin=-1, vmax=1, center=0,
    square=True,
    mask=~corr_mask,
    annot=True,
    linewidths=0.5,
    annot_kws={"size": 13}
)
plt.xticks(fontsize=14, rotation=90)
plt.yticks(fontsize=14, rotation=0)
plt.title("HeatMap: Correlation Matrix", fontsize=20, fontweight='bold')
plt.tight_layout()
plt.show()

## Step 5: Rerun the model with the same settings as before, just with more features

In [None]:
eval_metric = 'root_mean_squared_error'
label = 'count'
ignored_columns = ["casual", "registered"]   # Ignored columns while training
train_data = train                           # 'casual' and 'registered' stay in the dataframe; AutoGluon skips them via ignored_columns
time_limit = 600
presets = "best_quality"

In [None]:
predictor_new_features = TabularPredictor(
    label=label,
    problem_type='regression',
    eval_metric=eval_metric,
    learner_kwargs={'ignored_columns': ignored_columns}
).fit(
    train_data=train_data,
    time_limit=time_limit,
    presets=presets
)

In [None]:
predictor_new_features.fit_summary()

In [None]:
leaderboard_new_features_df = pd.DataFrame(predictor_new_features.leaderboard(silent=True))
leaderboard_new_features_df

In [None]:
import matplotlib.pyplot as plt
leaderboard_new_features_df.plot(kind="bar", x="model", y="score_val", figsize=(14, 7))
plt.ylabel("RMSE Scores")
plt.show()

In [None]:
predictions_new_features = predictor_new_features.predict(test)
predictions_new_features.head()

In [None]:
predictions_new_features.describe()

In [None]:
negative_pred_count = predictions_new_features.apply(lambda x: 1 if x<0 else 0)

pred_pos_count = (negative_pred_count==0).sum()
pred_neg_count = (negative_pred_count==1).sum()

print("Total predictions                :", len(predictions_new_features.index))
print("Total positive prediction values :", pred_pos_count)
print("Total negative prediction values :", pred_neg_count)

In [None]:
predictions_new_features[predictions_new_features<0] = 0

# Rechecking, if no predictions are less than 0
negative_pred_count = predictions_new_features.apply(lambda x: 1 if x<0 else 0)
pred_neg_count = (negative_pred_count==1).sum()
print(f"No. of negative predictions: {pred_neg_count}")
print("All negative values in the predictions (if any) are set to zero successfully.")

In [None]:
submission_new_features = pd.read_csv('sampleSubmission.csv', parse_dates = ['datetime'])
submission_new_features.head()

In [None]:
# Same submitting predictions
submission_new_features["count"] = predictions_new_features
submission_new_features.to_csv("submission_new_features.csv", index=False)

In [None]:
!kaggle competitions submit -c bike-sharing-demand -f submission_new_features.csv -m "new features"

In [None]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

#### New Score of 0.48747

## Step 6: Hyperparameter optimization
* There are many options for hyperparameter optimization.
* You can change AutoGluon's higher-level parameters or the hyperparameters of the individual models.
* Tuning the hyperparameters of the models inside AutoGluon requires the `hyperparameters` and `hyperparameter_tune_kwargs` arguments.
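
To illustrate what a tuning run with a fixed number of trials does conceptually, here is a small random-search sketch in plain Python. It is not AutoGluon's searcher: the function `random_search`, the search space, and the toy loss are all hypothetical stand-ins for model training and validation RMSE.

```python
import random

def random_search(objective, search_space, num_trials, seed=0):
    """Sample num_trials configurations and return the one with the lowest loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(num_trials):
        # Draw one candidate value per hyperparameter.
        config = {name: rng.choice(values) for name, values in search_space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

# Hypothetical search space and a toy loss standing in for validation RMSE.
search_space = {
    "learning_rate": [0.01, 0.03, 0.1],
    "num_leaves": [36, 64, 128],
}

def toy_loss(config):
    return abs(config["learning_rate"] - 0.03) + abs(config["num_leaves"] - 64) / 100

best_config, best_loss = random_search(toy_loss, search_space, num_trials=20)
print(best_config, best_loss)
```

More trials can only match or improve the best loss found for a given seed, at the cost of training time, which is why the trial count has to be balanced against the overall time limit.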

In [None]:
eval_metric = 'root_mean_squared_error'
label = 'count'
ignored_columns = ["casual", "registered"]
train_data = train
time_limit = 600
presets = "optimize_for_deployment"

In [None]:
from autogluon.tabular import TabularPredictor

# General settings
label = 'count'
ignored_columns = ['casual', 'registered']
time_limit = 600  # Maximum training time in seconds (10 minutes)
presets = 'optimize_for_deployment'

# Define full model configurations and hyperparameters (no shorthand strings)
hyperparameters = {
    'GBM': [
        {
            'extra_trees': True,
            'num_boost_round': 100,
            'num_leaves': 36,
            'ag_args': {'name_suffix': 'XT'}
        },
        {},  # Default GBM settings
        {
            'learning_rate': 0.03,
            'num_leaves': 128,
            'feature_fraction': 0.9,
            'min_data_in_leaf': 3,
            'ag_args': {'name_suffix': 'Large', 'priority': 0}
        }
    ],
    'NN_TORCH': {
        'num_epochs': 5,
        'learning_rate': [1e-4, 5e-4, 1e-3],
        'activation': ['relu', 'softrelu', 'tanh'],
        'dropout_prob': [0.0, 0.1, 0.3]
    }
}

# Hyperparameter tuning configuration
hyperparameter_tune_kwargs = {
    'num_trials': 20,
    'scheduler': 'local',
    'searcher': 'auto'
}

# Train the model
predictor_new_hpo = TabularPredictor(
    label=label,
    problem_type='regression',
    eval_metric='root_mean_squared_error',
    learner_kwargs={'ignored_columns': ignored_columns}
).fit(
    train_data=train,
    time_limit=time_limit,
    presets=presets,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
    refit_full='best'
)

# Show leaderboard
predictor_new_hpo.leaderboard(silent=True)


In [None]:
predictor_new_hpo.fit_summary()

In [None]:
leaderboard_new_hpo_df = pd.DataFrame(predictor_new_hpo.leaderboard(silent=True))
leaderboard_new_hpo_df

In [None]:
import matplotlib.pyplot as plt
leaderboard_new_hpo_df.plot(kind="bar", x="model", y="score_val", figsize=(12, 6))
plt.ylabel("RMSE Scores")
plt.show()

In [None]:
predictions_new_hpo = predictor_new_hpo.predict(test)
predictions_new_hpo.head()

In [None]:
predictions_new_hpo.describe()

In [None]:
negative_pred_count = predictions_new_hpo.apply(lambda x: 1 if x<0 else 0)

pred_pos_count = (negative_pred_count==0).sum()
pred_neg_count = (negative_pred_count==1).sum()

print("Total predictions                :", len(predictions_new_hpo.index))
print("Total positive prediction values :", pred_pos_count)
print("Total negative prediction values :", pred_neg_count)

In [None]:
predictions_new_hpo[predictions_new_hpo<0] = 0

# Rechecking, if no predictions are less than 0
negative_pred_count = predictions_new_hpo.apply(lambda x: 1 if x<0 else 0)
pred_neg_count = (negative_pred_count==1).sum()
print(f"No. of negative predictions: {pred_neg_count}")
print("All negative values in the predictions (if any) are set to zero successfully.")

In [None]:
submission_new_hpo = pd.read_csv('sampleSubmission.csv', parse_dates = ['datetime'])
submission_new_hpo.head()

In [None]:
# Same submitting predictions
submission_new_hpo["count"] = predictions_new_hpo
submission_new_hpo.to_csv("submission_new_hpo.csv", index=False)

In [None]:
!kaggle competitions submit -c bike-sharing-demand -f submission_new_hpo.csv -m "new features with hyperparameters"

In [None]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

#### New Score of 0.54715

In [None]:
#next hyperparameter optimization
eval_metric = 'root_mean_squared_error'
label = 'count'
ignored_columns = ["casual", "registered"]
train_data = train
time_limit = 600
presets = "optimize_for_deployment"

In [None]:
from autogluon.tabular import TabularPredictor

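# Note: this list is defined but never passed to fit(); NN_TORCH is left out because it is not in `hyperparameters`
excluded_model_types = ['NN_TORCH']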

# General settings
label = 'count'
ignored_columns = ['casual', 'registered']
time_limit = 600  # Maximum training time in seconds (10 minutes)
presets = 'optimize_for_deployment'

# Define full model configurations and hyperparameters (no shorthand strings)
hyperparameters = {
    'GBM': [
        {
            'extra_trees': True,
            'num_boost_round': 100,
            'num_leaves': 36,
            'ag_args': {'name_suffix': 'XT'}
        },
        {},
        {
            'learning_rate': 0.03,
            'num_leaves': 128,
            'feature_fraction': 0.9,
            'min_data_in_leaf': 3,
            'ag_args': {'name_suffix': 'Large', 'priority': 0}
        }
    ],
    'XT' : [
        {
            'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}
        }
    ],
    'XGB' : [
        {
            'objective': 'reg:squarederror',
            'eval_metric': 'rmse',
            'max_depth' : 6,
            'n_estimators' : 100,
            'eta': 0.3,
            'subsample': 1,
            'colsample_bytree': 1
        }
    ]
    }


# Hyperparameter tuning configuration
hyperparameter_tune_kwargs = {
    'num_trials': 20,
    'scheduler': 'local',
    'searcher': 'auto'
}

# Train the model
predictor_new_hpo1 = TabularPredictor(
    label=label,
    problem_type='regression',
    eval_metric='root_mean_squared_error',
    learner_kwargs={'ignored_columns': ignored_columns}
).fit(
    train_data=train,
    time_limit=time_limit,
    presets=presets,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
    refit_full='best'
)

# Show leaderboard
predictor_new_hpo1.leaderboard(silent=True)


In [None]:
predictor_new_hpo1.fit_summary()

In [None]:
leaderboard_new_hpo1_df = pd.DataFrame(predictor_new_hpo1.leaderboard(silent=True))
leaderboard_new_hpo1_df

In [None]:
import matplotlib.pyplot as plt
leaderboard_new_hpo1_df.plot(kind="bar", x="model", y="score_val", figsize=(12, 6))
plt.ylabel("RMSE Scores")
plt.show()


In [None]:
# Load test data
test = pd.read_csv("test.csv")

# Preserve 'datetime' column before dropping or modifying
test['datetime'] = pd.to_datetime(test['datetime'])

# Reapply the feature engineering steps to the test DataFrame
test["year"] = test["datetime"].dt.year
test["month"] = test["datetime"].dt.month
test["day"] = test["datetime"].dt.dayofweek
test["hour"] = test["datetime"].dt.hour
test.drop(["datetime"], axis=1, inplace=True)

test["season"] = test["season"].astype("category")
test["weather"] = test["weather"].astype("category")

test["day_type"]=""
test.loc[(test.holiday==1),"day_type"] = "holiday"
test.loc[((test.holiday==0) & (test.workingday==1)), "day_type"] = "weekday"
test.loc[((test.holiday==0) & (test.workingday==0)), "day_type"] = "weekend"
test["day_type"] = test["day_type"].astype("category")

# Drop the 'atemp' column as it was dropped from the training data
test.drop(["atemp"], axis=1, inplace=True)

In [None]:
predictions_new_hpo1 = predictor_new_hpo1.predict(test)
predictions_new_hpo1.head()

In [None]:
predictions_new_hpo1.describe()

In [None]:
negative_pred_count = predictions_new_hpo1.apply(lambda x: 1 if x<0 else 0)

pred_pos_count = (negative_pred_count==0).sum()
pred_neg_count = (negative_pred_count==1).sum()

print("Total predictions                :", len(predictions_new_hpo1.index))
print("Total positive prediction values :", pred_pos_count)
print("Total negative prediction values :", pred_neg_count)

In [None]:
predictions_new_hpo1[predictions_new_hpo1<0] = 0
# Rechecking, if no predictions are less than 0
negative_pred_count = predictions_new_hpo1.apply(lambda x: 1 if x<0 else 0)
pred_neg_count = (negative_pred_count==1).sum()
print(f"No.of negative predictions: {pred_neg_count}")
print("All negative values in the predictions (if any) are set to zero successfully.")

In [None]:
submission_new_hpo1 = pd.read_csv('sampleSubmission.csv', parse_dates = ['datetime'])
submission_new_hpo1.head()

In [None]:
# Same submitting predictions
submission_new_hpo1["count"] = predictions_new_hpo1
submission_new_hpo1.to_csv("submission_new_hpo1.csv", index=False)

In [None]:
!kaggle competitions submit -c bike-sharing-demand -f submission_new_hpo1.csv -m "new features with hyperparameters 1"

In [None]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

#### New Score of 0.51152

In [None]:
# Same core settings as the earlier HPO runs; only the `hyperparameters` dict changes
eval_metric = 'root_mean_squared_error'
label = 'count'
ignored_columns = ["casual", "registered"]
train_data = train
time_limit = 600
presets = "optimize_for_deployment"

In [None]:
from autogluon.tabular import TabularPredictor

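# Note: this list is defined but never passed to fit(); NN_TORCH is left out because it is not in `hyperparameters`
excluded_model_types = ['NN_TORCH']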

# General settings
label = 'count'
ignored_columns = ['casual', 'registered']
time_limit = 600  # Maximum training time in seconds (10 minutes)
presets = 'optimize_for_deployment'

# Define full model configurations and hyperparameters (no shorthand strings)
hyperparameters = {
    'GBM': [
        {
            'extra_trees': True,
            'num_boost_round': 100,
            'num_leaves': 36,
            'ag_args': {'name_suffix': 'XT'}
        },
        {},
        {
            'learning_rate': 0.03,
            'num_leaves': 128,
            'feature_fraction': 0.9,
            'min_data_in_leaf': 3,
            'ag_args': {'name_suffix': 'Large', 'priority': 0}
        }
    ],
    'XT' : [
        {
            'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}
        }
    ],
    'XGB' : [
        {
            'objective': 'reg:squarederror',
            'eval_metric': 'rmse',
            'max_depth' : 6,
            'n_estimators' : 100,
            'eta': 0.3,
            'subsample': 1,
            'colsample_bytree': 1
        }
    ],
    'RF' : [
        {
            'criterion': 'squared_error',
            'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}
        }
    ],
    'KNN' : [
        {'weights': 'uniform', 'ag_args': {'name_suffix': 'Uniform'}},
        {'weights': 'distance', 'ag_args': {'name_suffix': 'Distance'}}
    ]
    }


# Hyperparameter tuning configuration
hyperparameter_tune_kwargs = {
    'num_trials': 20,
    'scheduler': 'local',
    'searcher': 'auto'
}

# Train the model
predictor_new_hpo2 = TabularPredictor(
    label=label,
    problem_type='regression',
    eval_metric='root_mean_squared_error',
    learner_kwargs={'ignored_columns': ignored_columns}
).fit(
    train_data=train,
    time_limit=time_limit,
    presets=presets,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
    refit_full='best'
)

# Show leaderboard
predictor_new_hpo2.leaderboard(silent=True)

In [None]:
predictor_new_hpo2.fit_summary()

In [None]:
# Leaderboard dataframe
leaderboard_new_hpo2_df = pd.DataFrame(predictor_new_hpo2.leaderboard(silent=True))
leaderboard_new_hpo2_df

In [None]:
import matplotlib.pyplot as plt
leaderboard_new_hpo2_df.plot(kind="bar", x="model", y="score_val", figsize=(12, 6))
plt.ylabel("RMSE Scores")
plt.show()

In [None]:
predictions_new_hpo2 = predictor_new_hpo2.predict(test)
predictions_new_hpo2.head()

In [None]:
predictions_new_hpo2.describe()

In [None]:
negative_pred_count = predictions_new_hpo2.apply(lambda x: 1 if x<0 else 0)

pred_pos_count = (negative_pred_count==0).sum()
pred_neg_count = (negative_pred_count==1).sum()

print("Total predictions                :", len(predictions_new_hpo2.index))
print("Total positive prediction values :", pred_pos_count)
print("Total negative prediction values :", pred_neg_count)

In [None]:
predictions_new_hpo2[predictions_new_hpo2<0] = 0

# Rechecking, if no predictions are less than 0
negative_pred_count = predictions_new_hpo2.apply(lambda x: 1 if x<0 else 0)
pred_neg_count = (negative_pred_count==1).sum()
print(f"No. of negative predictions: {pred_neg_count}")
print("All negative values in the predictions (if any) are set to zero successfully.")

In [None]:
# Same thing as train and test dataset
submission_new_hpo2 = pd.read_csv('sampleSubmission.csv', parse_dates = ['datetime'])
submission_new_hpo2.head()

In [None]:
# Same submitting predictions
submission_new_hpo2["count"] = predictions_new_hpo2
submission_new_hpo2.to_csv("submission_new_hpo2.csv", index=False)

In [None]:
!kaggle competitions submit -c bike-sharing-demand -f submission_new_hpo2.csv -m "new features with hyperparameters 2"

In [None]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 7

#### New Score of 0.51464

## Step 7: Write a Report
### Refer to the markdown file for the full report
### Creating plots and table for report

In [None]:
# Taking the top model score from each training run and creating a line plot to show improvement
# You can create these in the notebook and save them to PNG or use some other tool (e.g. google sheets, excel)
fig = pd.DataFrame(
    {
        "model": ["initial", "add_features", "hpo", "hpo1", "hpo2"],
        "score": [55.036685 ,34.382987 ,38.058735 ,37.797224, 37.830584]
    }
).plot(x="model", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_train_score.png')

In [None]:
# Take the 5 kaggle scores and create a line plot to show improvement
fig = pd.DataFrame(
    {
        "test_eval": ["initial", "add_features", "hpo", "hpo1", "hpo2"],
        "score": [1.83835, 0.48747, 0.54715, 0.51152, 0.51464]
    }
).plot(x="test_eval", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_test_score.png')
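
For the report, the same Kaggle scores can also be summarized numerically. The sketch below reuses the values from the plotting cell above; the `improvement_pct` column name is an assumption added for illustration.

```python
import pandas as pd

# Kaggle scores as recorded in the plotting cell above (lower is better).
scores = pd.DataFrame({
    "run": ["initial", "add_features", "hpo", "hpo1", "hpo2"],
    "kaggle_score": [1.83835, 0.48747, 0.54715, 0.51152, 0.51464],
})

# Percentage improvement of each run relative to the initial submission.
baseline = scores.loc[0, "kaggle_score"]
scores["improvement_pct"] = (100 * (baseline - scores["kaggle_score"]) / baseline).round(1)
best_run = scores.loc[scores["kaggle_score"].idxmin(), "run"]

print(scores)
print(best_run)
```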

### Hyperparameter table

In [None]:
# The hyperparameter settings we changed, with the kaggle score as the result
pd.DataFrame({
    "model": ["initial", "add_features", "hpo"],
    "hpo1": ["prescribed_values", "prescribed_values", "Tree-Based Models: (GBM, XT, XGB & RF)"],
    "hpo2": ["prescribed_values", "prescribed_values", "KNN"],
    "hpo3": ["presets: 'best_quality' (auto_stack=True)", "presets: 'best_quality' (auto_stack=True)", "presets: 'optimize_for_deployment'"],
    "score": [1.83835, 0.48747, 0.54715]
})

In [None]:
%%shell
jupyter nbconvert --to html /content/UdacityFinal.ipynb