# Data Understanding and Predictive Model Building

## Quick intro

In this notebook we will explore the [Capital Bikeshare](https://capitalbikeshare.com/) service, a bicycle rental system accessible through a mobile application.

We will use a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset). It contains hourly and daily records of bike rentals between 2011 and 2012, along with weather and seasonal information. This data will help us understand how weather and seasonality influence demand. The dataset page linked above describes each variable in more detail.

Our goal is to understand the role of climatic and temporal factors in bicycle demand. We will examine how variables like temperature, humidity, and weather conditions affect rental counts, and how usage differs across weekdays, weekends, and seasons. Ultimately, we want to predict the number of rentals given certain environmental conditions.

We will use the CatBoost algorithm to make these predictions, and compare its performance against Decision Tree, Linear Regression, and Random Forest models. We aim to identify the model that best captures the patterns in the data.

We hypothesize that factors like temperature, humidity, time of day, and whether it is a workday or a holiday significantly impact demand. By predicting bicycle demand, we hope to provide insights that optimize resource allocation, improve user experience, and support a more efficient system.
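
The model comparison described above could be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual evaluation: the feature matrix `X`, the target `y`, and the hyperparameters are all assumptions made up for the example. `CatBoostRegressor` is left out to keep the sketch light; it is assumed it could be added to the `models` dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bike data (assumption: three numeric features)
rng = np.random.default_rng(42)
X = rng.uniform(size=(300, 3))
y = 100 * X[:, 0] + 20 * np.sin(6 * X[:, 1]) + rng.normal(scale=5, size=300)

# Candidate models; hyperparameters are illustrative, not tuned
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Same folds for every model so the scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=0)

rmse_by_model = {}
for name, model in models.items():
    # scikit-learn returns negated RMSE so that "higher is better"
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse_by_model[name] = -scores.mean()

# Report models from lowest to highest cross-validated RMSE
for name, rmse in sorted(rmse_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {rmse:.2f}")
```

Reusing one `KFold` object across models keeps the train/validation splits identical, so differences in RMSE reflect the models rather than the folds.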

## Prelude

We import the libraries used for analysis and modeling.

In [1]:
# Data handling
import pandas as pd
import numpy as np

# Visuals and charts
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from catboost import CatBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error, r2_score, make_scorer
from sklearn.model_selection import cross_val_score, KFold

# Interpretability
import shap

# Data pull
from ucimlrepo import fetch_ucirepo

## Get the data

We download the data from the UCI ML Repository, identifying the dataset by its numeric ID (275).

In [2]:
# Fetch the Bike Sharing dataset (UCI ID 275)
bike_sharing = fetch_ucirepo(id=275)

Extract and display the predictor variables.

In [3]:
bike_sharing_df = bike_sharing.data.features
bike_sharing_df

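| | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0000 |
| 1 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 |
| 2 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 |
| 3 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 |
| 4 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17374 | 2012-12-31 | 1 | 1 | 12 | 19 | 0 | 1 | 1 | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 |
| 17375 | 2012-12-31 | 1 | 1 | 12 | 20 | 0 | 1 | 1 | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 |
| 17376 | 2012-12-31 | 1 | 1 | 12 | 21 | 0 | 1 | 1 | 1 | 0.26 | 0.2576 | 0.60 | 0.1642 |
| 17377 | 2012-12-31 | 1 | 1 | 12 | 22 | 0 | 1 | 1 | 1 | 0.26 | 0.2727 | 0.56 | 0.1343 |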


In [4]:
# Target variable
target = bike_sharing.data.targets
target

| | cnt |
|---|---|
| 0 | 16 |
| 1 | 40 |
| 2 | 32 |
| 3 | 13 |
| 4 | 1 |
| ... | ... |
| 17374 | 119 |
| 17375 | 89 |
| 17376 | 90 |
| 17377 | 61 |


## Data pre-processing

We could leave the seasons coded as numbers to save memory. Still, spending a little more memory buys a more expressive analysis, and CatBoost handles categorical predictors well. Readable labels could also be helpful when we deploy the model as a REST API service.

In [5]:
# Map season codes to readable labels
season_mapping = {1: "Winter", 2: "Spring", 3: "Summer", 4: "Fall"}

# Work on a copy so the frame returned by ucimlrepo is left untouched
bike_sharing_df = bike_sharing_df.copy()
bike_sharing_df['season'] = bike_sharing_df['season'].map(season_mapping)
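
One thing worth checking after a dictionary-based `map` is whether every code had a label, because `Series.map` returns `NaN` for keys missing from the dictionary instead of raising an error. The sketch below illustrates this on a toy frame; the frame `toy` and the extra code `9` are made up for the example and do not come from the dataset.

```python
import pandas as pd

# Toy frame; the code 9 is deliberately absent from the mapping
toy = pd.DataFrame({"season": [1, 2, 3, 4, 9]})
season_mapping = {1: "Winter", 2: "Spring", 3: "Summer", 4: "Fall"}

mapped = toy["season"].map(season_mapping)

# Codes without an entry in the mapping come back as NaN
unmapped_codes = sorted(toy.loc[mapped.isna(), "season"].unique().tolist())
print(unmapped_codes)
```

An empty list here would indicate that the mapping covers every code present in the column.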

We also relabel `workingday` and store it as a categorical column.

In [6]:
bike_sharing_df['workingday'] = bike_sharing_df['workingday'].map({1: 'Yes', 0: 'No'}).astype('category')
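
The memory trade-off mentioned earlier can be measured directly. The sketch below compares the same synthetic 0/1 flag stored as integer codes, as plain strings, and as a `category` column. The data is randomly generated for illustration, and the exact byte counts depend on the pandas version and string backend, so treat them as indicative only.

```python
import numpy as np
import pandas as pd

# Synthetic 0/1 flags standing in for a column such as workingday
rng = np.random.default_rng(0)
codes = pd.Series(rng.integers(0, 2, size=10_000))

as_text = codes.map({1: "Yes", 0: "No"})   # plain string labels
as_category = as_text.astype("category")   # labels stored once, rows as small codes

# deep=True also counts the memory held by the string values themselves
memory_bytes = {
    "integer codes": codes.memory_usage(deep=True),
    "plain strings": as_text.memory_usage(deep=True),
    "category": as_category.memory_usage(deep=True),
}

for name, size in memory_bytes.items():
    print(f"{name}: {size} bytes")
```

With only two distinct labels, the categorical column stores each label once, which is why converting the mapped strings to `category` tends to be cheaper than keeping them as plain strings.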

For exploratory analysis, having the predictors and the target variable in the same data frame is most convenient.

In [7]:
full_df = pd.concat([bike_sharing_df, target], axis=1)
full_df

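| | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | Winter | 0 | 1 | 0 | 0 | 6 | No | 1 | 0.24 | 0.2879 | 0.81 | 0.0000 | 16 |
| 1 | 2011-01-01 | Winter | 0 | 1 | 1 | 0 | 6 | No | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 40 |
| 2 | 2011-01-01 | Winter | 0 | 1 | 2 | 0 | 6 | No | 1 | 0.22 | 0.2727 | 0.80 | 0.0000 | 32 |
| 3 | 2011-01-01 | Winter | 0 | 1 | 3 | 0 | 6 | No | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 13 |
| 4 | 2011-01-01 | Winter | 0 | 1 | 4 | 0 | 6 | No | 1 | 0.24 | 0.2879 | 0.75 | 0.0000 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17374 | 2012-12-31 | Winter | 1 | 12 | 19 | 0 | 1 | Yes | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 | 119 |
| 17375 | 2012-12-31 | Winter | 1 | 12 | 20 | 0 | 1 | Yes | 2 | 0.26 | 0.2576 | 0.60 | 0.1642 | 89 |
| 17376 | 2012-12-31 | Winter | 1 | 12 | 21 | 0 | 1 | Yes | 1 | 0.26 | 0.2576 | 0.60 | 0.1642 | 90 |
| 17377 | 2012-12-31 | Winter | 1 | 12 | 22 | 0 | 1 | Yes | 1 | 0.26 | 0.2727 | 0.56 | 0.1343 | 61 |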
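
Because `pd.concat(..., axis=1)` aligns rows by index label rather than by position, it can be worth confirming that the predictors and the target share the same index before joining them. The sketch below shows the check on two toy frames; `features`, `targets`, and `shifted` are illustrative names, not objects from the notebook.

```python
import pandas as pd

features = pd.DataFrame({"hr": [0, 1, 2]}, index=[0, 1, 2])
targets = pd.DataFrame({"cnt": [16, 40, 32]}, index=[0, 1, 2])

# Matching indexes: the column-wise concat keeps one row per label
assert features.index.equals(targets.index)
combined = pd.concat([features, targets], axis=1)

# A shifted index does not raise; it yields extra rows padded with NaN
shifted = targets.set_axis([1, 2, 3], axis=0)
misaligned = pd.concat([features, shifted], axis=1)

print(combined.shape, misaligned.shape)
print(misaligned.isna().sum().sum(), "missing values after misaligned concat")
```

If the indexes differ, resetting them on both frames before concatenating is one way to fall back to positional alignment, assuming the row order is known to match.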


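We apply the same treatment to the remaining categorical variables: `weekday`, `holiday`, and `mnth`. Note that these mappings modify `bike_sharing_df`; `full_df` was built in the previous step and keeps the numeric codes for these columns unless it is rebuilt.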

In [8]:
# Week days mapping
weekday_mapping = {
    0: "Sunday",
    1: "Monday",
    2: "Tuesday",
    3: "Wednesday",
    4: "Thursday",
    5: "Friday",
    6: "Saturday"
}

# Apply mapping
bike_sharing_df['weekday'] = bike_sharing_df['weekday'].map(weekday_mapping).astype('category')


# Holiday mapping
holiday_mapping = {
    1: "Yes",
    0: "No"
}

# Apply mapping
bike_sharing_df['holiday'] = bike_sharing_df['holiday'].map(holiday_mapping).astype('category')


# Month mapping
month_mapping = {
    1: "January",
    2: "February",
    3: "March",
    4: "April",
    5: "May",
    6: "June",
    7: "July",
    8: "August",
    9: "September",
    10: "October",
    11: "November",
    12: "December"
}

# Apply
bike_sharing_df['mnth'] = bike_sharing_df['mnth'].map(month_mapping).astype('category')
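
A plain `category` column orders its labels alphabetically, so sorted tables and plot axes would list "Friday" before "Monday". If calendar order matters for the exploratory analysis, one option is an ordered categorical dtype. The sketch below is an assumption about how this could be done, shown on a toy series rather than on the notebook's frame; `weekday_order` and `weekday_dtype` are names introduced only for this example.

```python
import pandas as pd

weekday_order = [
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday",
]
# Ordered dtype: category order follows the list above, not the alphabet
weekday_dtype = pd.CategoricalDtype(categories=weekday_order, ordered=True)

toy = pd.Series([6, 0, 3, 1])                     # numeric weekday codes
labels = toy.map(dict(enumerate(weekday_order)))  # 0 -> "Sunday", ..., 6 -> "Saturday"

calendar_sorted = labels.astype(weekday_dtype).sort_values().tolist()
alphabet_sorted = labels.astype("category").sort_values().tolist()

print(calendar_sorted)
print(alphabet_sorted)
```

The same pattern would apply to `mnth`, using the month names in calendar order as the category list.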