# Introduction
<hr style = "border:2px solid black" > </hr >


**What?** Gradient boosting applied to the bike rental dataset



# Import modules
<hr style = "border:2px solid black" > </hr >

In [2]:
import warnings
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Import dataset
<hr style = "border:2px solid black" > </hr >

In [3]:
# Load 'bike_rentals.csv' into a DataFrame
df_bikes = pd.read_csv('../DATASETS/bike_rentals.csv')

In [4]:
# Display first 5 rows
df_bikes.head(5)

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600



- Show the descriptive statistics of `df_bikes`.
- Comparing the mean and the median (50%) gives an indication of skewness. Here the mean and median are close 
to one another for most columns, so the data is roughly symmetrical.
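
- As a minimal sketch of the mean-versus-median check, the toy frame below (hypothetical values, not taken from the bike dataset) contrasts a symmetric column with a right-skewed one. The column names are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy columns: one symmetric, one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 2, 2, 3, 4, 50],
})

# Mean, median and sample skewness for every column
stats = toy.agg(["mean", "median", "skew"])
print(stats.round(2))

# A mean well above the median suggests a long right tail
gap = stats.loc["mean"] - stats.loc["median"]
print(gap.round(2))
```

- A gap near zero points to a roughly symmetric column, while a large positive gap points to right skew.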



In [5]:
df_bikes.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,730.0,730.0,731.0,731.0,731.0,731.0,730.0,730.0,728.0,726.0,731.0,731.0,731.0
mean,366.0,2.49658,0.5,6.512329,0.028728,2.997264,0.682627,1.395349,0.495587,0.474512,0.627987,0.190476,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500343,3.448303,0.167155,2.004787,0.465773,0.544894,0.183094,0.163017,0.142331,0.077725,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.336875,0.337794,0.521562,0.134494,315.5,2497.0,3152.0
50%,366.0,3.0,0.5,7.0,0.0,3.0,1.0,1.0,0.499167,0.487364,0.627083,0.180971,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,9.75,0.0,5.0,1.0,2.0,0.655625,0.608916,0.730104,0.233218,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0


# Data checking & cleaning
<hr style = "border:2px solid black" > </hr >


- As you can see, `.info()` gives the number of rows, number of columns, column types, and non-null values.
Since the number of non-null values differs between columns, null values must be present.



In [7]:
df_bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    float64
 3   yr          730 non-null    float64
 4   mnth        730 non-null    float64
 5   holiday     731 non-null    float64
 6   weekday     731 non-null    float64
 7   workingday  731 non-null    float64
 8   weathersit  731 non-null    int64  
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         728 non-null    float64
 12  windspeed   726 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(10), int64(5), object(1)
memory usage: 91.5+ KB



- If null values are not corrected, unexpected errors may arise down the road.
- The following code displays the total number of null values.
- Note that two `.sum()` calls are required: the first counts the null values in each column, while the second adds up those column counts.
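
- To make the two-step sum concrete, here is a small sketch on a hypothetical three-column frame. The intermediate name `per_column` is introduced for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing entries
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

per_column = toy.isna().sum()   # Series: null count for each column
total = per_column.sum()        # scalar: null count for the whole frame

print(per_column)
print(total)
```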



In [9]:
df_bikes.isna().sum().sum()

12


- Next, display the rows that contain those 12 missing values, which pandas shows as `NaN` (Not a Number).

- The code can be broken down as follows: `df_bikes[condition]` is the subset of `df_bikes` that meets the condition in 
brackets. `df_bikes.isna().any(axis=1)` is `True` for every row that holds at least one null value: `axis=1` tells `.any()` to look across the columns of each row. 
In pandas, rows are axis 0 and columns are axis 1. 
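
- The effect of the `axis` argument can be sketched on a hypothetical two-column frame. With `axis=1` the result has one boolean per row, with `axis=0` one per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing temperature
toy = pd.DataFrame({
    "temp": [0.30, np.nan, 0.50],
    "hum": [0.80, 0.60, 0.70],
})

row_has_null = toy.isna().any(axis=1)   # one boolean per row
col_has_null = toy.isna().any(axis=0)   # one boolean per column

# Boolean indexing keeps only the rows flagged True
print(toy[row_has_null])
```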



In [11]:
df_bikes[df_bikes.isna().any(axis=1)]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
56,57,2011-02-26,1.0,0.0,2.0,0.0,6.0,0.0,1,0.2825,0.282192,0.537917,,424,1545,1969
81,82,2011-03-23,2.0,0.0,3.0,0.0,3.0,1.0,2,0.346957,0.337939,0.839565,,203,1918,2121
128,129,2011-05-09,2.0,0.0,5.0,0.0,1.0,1.0,1,0.5325,0.525246,0.58875,,664,3698,4362
129,130,2011-05-10,2.0,0.0,5.0,0.0,2.0,1.0,1,0.5325,0.522721,,0.115671,694,4109,4803
213,214,2011-08-02,3.0,0.0,8.0,0.0,2.0,1.0,1,0.783333,0.707071,,0.20585,801,4044,4845
298,299,2011-10-26,4.0,0.0,10.0,0.0,3.0,1.0,2,0.484167,0.472846,0.720417,,404,3490,3894
388,389,2012-01-24,1.0,1.0,1.0,0.0,2.0,1.0,1,0.3425,0.349108,,0.123767,439,3900,4339
528,529,2012-06-12,2.0,1.0,6.0,0.0,2.0,1.0,2,0.653333,0.597875,0.833333,,477,4495,4972
701,702,2012-12-02,4.0,1.0,12.0,0.0,0.0,0.0,2,,,0.823333,0.124379,892,3757,4649
730,731,2012-12-31,1.0,,,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729



- `df_bikes['windspeed'].fillna(...)` fills the null values of the 'windspeed' column
- `df_bikes['windspeed'].median()` is the median of the 'windspeed' column
- Assigning the result back to `df_bikes['windspeed']` makes the change permanent

- Mean vs. median? The median is often a better choice than the mean for filling gaps. Half of the data lies at or above the median 
and half at or below it, so a few extreme values barely move it. The mean, by contrast, is vulnerable to outliers.



In [13]:
# Fill windspeed null values with median
df_bikes['windspeed'] = df_bikes['windspeed'].fillna(df_bikes['windspeed'].median())
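
- The sensitivity of the mean to outliers can be sketched with a hypothetical series that contains one extreme reading and one missing value. The numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wind speeds with one extreme reading and one missing value
wind = pd.Series([0.10, 0.12, 0.14, 0.16, 5.00, np.nan])

print(wind.mean())     # pulled upwards by the 5.00 outlier
print(wind.median())   # stays near the typical values

# fillna returns a new Series; the missing entry takes the median
filled = wind.fillna(wind.median())
print(filled)
```

- Here the mean lands far above four of the five observed readings, whereas the median remains a plausible fill value.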

In [14]:
# Display rows 56, 81, 128. Just checking if it has worked
df_bikes.iloc[[56, 81, 128]]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
56,57,2011-02-26,1.0,0.0,2.0,0.0,6.0,0.0,1,0.2825,0.282192,0.537917,0.180971,424,1545,1969
81,82,2011-03-23,2.0,0.0,3.0,0.0,3.0,1.0,2,0.346957,0.337939,0.839565,0.180971,203,1918,2121
128,129,2011-05-09,2.0,0.0,5.0,0.0,1.0,1.0,1,0.5325,0.525246,0.58875,0.180971,664,3698,4362



- Grouping by `season` and taking the median gives a per-season humidity value that can be used to fill the missing 
humidity readings shown earlier. 



In [16]:
# numeric_only=True skips the non-numeric 'dteday' column
df_bikes.groupby(['season']).median(numeric_only=True)

season,instant,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1.0,366.0,0.5,2.0,0.0,3.0,1.0,1.0,0.285833,0.282821,0.54375,0.20275,218.0,1867.0,2209.0
2.0,308.5,0.5,5.0,0.0,3.0,1.0,1.0,0.562083,0.538212,0.646667,0.191546,867.0,3844.0,4941.5
3.0,401.5,0.5,8.0,0.0,3.0,1.0,1.0,0.714583,0.656575,0.635833,0.165115,1050.5,4110.5,5353.5
4.0,493.0,0.5,11.0,0.0,3.0,1.0,1.0,0.41,0.409708,0.661042,0.167918,544.5,3815.0,4634.5



- To correct the null values in the `hum` (humidity) column, replace each one with the median humidity of its season.  
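
- A minimal sketch of the group-wise fill on a hypothetical frame: `transform('median')` returns one value per original row, so the result lines up with the column being filled. The intermediate name `season_median` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two seasons, one missing humidity value in each
toy = pd.DataFrame({
    "season": [1, 1, 1, 2, 2, 2],
    "hum": [0.50, 0.60, np.nan, 0.70, np.nan, 0.90],
})

# One value per original row: the median humidity of that row's season
season_median = toy.groupby("season")["hum"].transform("median")

# Only the null entries are replaced; existing values are untouched
toy["hum"] = toy["hum"].fillna(season_median)
print(toy)
```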



In [18]:
df_bikes['hum'] = df_bikes['hum'].fillna(df_bikes.groupby('season')['hum'].transform('median'))

In [19]:
# Show null values of 'temp' column
df_bikes[df_bikes['temp'].isna()]

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
701,702,2012-12-02,4.0,1.0,12.0,0.0,0.0,0.0,2,,,0.823333,0.124379,892,3757,4649



- When correcting temperature, aside from consulting historical records, taking the mean temperature of the day 
before and the day after should give a good estimate.  
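
- For a single isolated gap in date-ordered data, averaging the two neighbours gives the same value as linear interpolation. The sketch below uses hypothetical temperatures. `Series.interpolate` is shown as an alternative and is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical normalised temperatures for three consecutive days
temp = pd.Series([0.30, np.nan, 0.40])

# Average of the day before and the day after
manual = (temp.iloc[0] + temp.iloc[2]) / 2

# Linear interpolation fills the gap with the same value
interpolated = temp.interpolate(method="linear")

print(manual)
print(interpolated)
```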



In [21]:
# Compute mean temp and atemp of the neighbouring rows (700 and 702)
mean_temp = (df_bikes.iloc[700]['temp'] + df_bikes.iloc[702]['temp'])/2
mean_atemp = (df_bikes.iloc[700]['atemp'] + df_bikes.iloc[702]['atemp'])/2

# Replace null values with mean temperatures (assign back instead of inplace)
df_bikes['temp'] = df_bikes['temp'].fillna(mean_temp)
df_bikes['atemp'] = df_bikes['atemp'].fillna(mean_atemp)

In [22]:
# Convert 'dteday' to datetime object
df_bikes['dteday'] = pd.to_datetime(df_bikes['dteday'])

In [23]:
# Verify the conversion; errors='coerce' turns any unparseable entry into NaT
pd.to_datetime(df_bikes['dteday'], errors='coerce')

0     2011-01-01
1     2011-01-02
2     2011-01-03
3     2011-01-04
4     2011-01-05
         ...    
726   2012-12-27
727   2012-12-28
728   2012-12-29
729   2012-12-30
730   2012-12-31
Name: dteday, Length: 731, dtype: datetime64[ns]

In [24]:
# Import datetime
import datetime as dt

In [25]:
# Rebuild 'mnth' from the date, which also fills its missing value in row 730
df_bikes['mnth'] = df_bikes['dteday'].dt.month

In [26]:
# Show last 5 rows
df_bikes.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
726,727,2012-12-27,1.0,1.0,12,0.0,4.0,1.0,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
727,728,2012-12-28,1.0,1.0,12,0.0,5.0,1.0,2,0.253333,0.255046,0.59,0.155471,644,2451,3095
728,729,2012-12-29,1.0,1.0,12,0.0,6.0,0.0,2,0.253333,0.2424,0.752917,0.124383,159,1182,1341
729,730,2012-12-30,1.0,1.0,12,0.0,0.0,0.0,1,0.255833,0.2317,0.483333,0.350754,364,1432,1796
730,731,2012-12-31,1.0,,12,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [27]:
# Change row 730, column 'yr' to 1.0
df_bikes.loc[730, 'yr'] = 1.0

In [28]:
# Show last 5 rows
df_bikes.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
726,727,2012-12-27,1.0,1.0,12,0.0,4.0,1.0,2,0.254167,0.226642,0.652917,0.350133,247,1867,2114
727,728,2012-12-28,1.0,1.0,12,0.0,5.0,1.0,2,0.253333,0.255046,0.59,0.155471,644,2451,3095
728,729,2012-12-29,1.0,1.0,12,0.0,6.0,0.0,2,0.253333,0.2424,0.752917,0.124383,159,1182,1341
729,730,2012-12-30,1.0,1.0,12,0.0,0.0,0.0,1,0.255833,0.2317,0.483333,0.350754,364,1432,1796
730,731,2012-12-31,1.0,1.0,12,0.0,1.0,0.0,2,0.215833,0.223487,0.5775,0.154846,439,2290,2729


In [29]:
# Drop 'dteday' column
df_bikes = df_bikes.drop('dteday', axis=1)

In [30]:
# Drop 'casual', 'registered' columns
df_bikes = df_bikes.drop(['casual', 'registered'], axis=1)

In [31]:
# Export 'bike_rentals_cleaned' csv file
df_bikes.to_csv('../DATASETS/bike_rentals_cleaned.csv', index=False)

# Splitting data
<hr style = "border:2px solid black" > </hr >

In [32]:
# Reminder of the column names. The target is the last column, 'cnt' (count)
df_bikes.head(0)

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt


In [33]:
# Split data into X and y
X = df_bikes.iloc[:,:-1]
y = df_bikes.iloc[:,-1]

In [34]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Scikit linear regression
<hr style = "border:2px solid black" > </hr >


- Before building the first machine learning model, silence all warnings. 
- Scikit-learn includes warnings to notify users of future changes. In general, it's not advisable to silence warnings, but since this code has 
been tested, they are silenced here to save space in the Jupyter Notebook.
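
- A narrower alternative to a global filter, sketched here with the standard library only, is to suppress warnings inside a `with` block. The `noisy` function is a hypothetical stand-in for a library call that emits a `FutureWarning`.

```python
import warnings

def noisy():
    """Hypothetical helper that emits a FutureWarning, as a library might."""
    warnings.warn("this behaviour will change", FutureWarning)
    return 42

# The filter applies only inside this block and is restored on exit
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy()

# Outside the first block the filter is gone; record the warning to show
# that it is emitted again
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()

print(result, len(caught))
```

- Scoping the filter this way keeps warnings visible in the rest of the notebook.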



In [36]:
warnings.filterwarnings('ignore')

In [37]:
# Initialize LinearRegression model
lin_reg = LinearRegression()

# Fit lin_reg on training data
lin_reg.fit(X_train, y_train)

# Predict X_test using lin_reg
y_pred = lin_reg.predict(X_test)

# Import mean_squared_error
from sklearn.metrics import mean_squared_error

# Import numpy
import numpy as np

# Compute mean_squared_error as mse
mse = mean_squared_error(y_test, y_pred)

# Compute root mean squared error as rmse
rmse = np.sqrt(mse)

# Display mean squared error and root mean squared error
print("MSE: %0.2f" % (mse))
print("RMSE: %0.2f" % (rmse))

MSE: 806776.98
RMSE: 898.21



- It's hard to know whether an error of 898 rentals is good or bad without knowing the expected range of 
rentals per day. With a range of 22 to 8714, a mean of 4504, and a standard deviation of 1937, an RMSE 
of 898 isn't bad, but it's not great either. 
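
- One way to put the RMSE in context is to divide it by the mean and by the standard deviation of the target. The sketch below uses the rounded figures quoted above. The variable names are introduced for illustration.

```python
rmse = 898.21        # linear regression RMSE on the test split
cnt_mean = 4504.35   # mean daily rentals
cnt_std = 1937.21    # standard deviation of daily rentals

rmse_vs_mean = rmse / cnt_mean   # error relative to a typical day
rmse_vs_std = rmse / cnt_std     # error relative to the spread of the target

print(f"RMSE / mean: {rmse_vs_mean:.2f}")
print(f"RMSE / std:  {rmse_vs_std:.2f}")
```

- The error is about a fifth of the mean and a little under half of the standard deviation, which matches the "not bad, not great" reading.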



In [39]:
# Display bike rental stats
df_bikes['cnt'].describe()

count     731.000000
mean     4504.348837
std      1937.211452
min        22.000000
25%      3152.000000
50%      4548.000000
75%      5956.000000
max      8714.000000
Name: cnt, dtype: float64

# XGBoost regressor
<hr style = "border:2px solid black" > </hr >

In [40]:
# Instantiate the XGBRegressor, xg_reg
xg_reg = XGBRegressor()

# Fit xg_reg to training set
xg_reg.fit(X_train, y_train)

# Predict labels of test set, y_pred
y_pred = xg_reg.predict(X_test)

# Compute the mean_squared_error, mse
mse = mean_squared_error(y_test, y_pred)

# Compute the root mean squared error, rmse
rmse = np.sqrt(mse)

# Display the root mean squared error
print("RMSE: %0.2f" % (rmse))

RMSE: 705.11



- One test score is not reliable because splitting the data into different training and test sets would give 
different results. In effect, splitting the data into a training set and a test set is arbitrary, and a 
different random_state will give a different RMSE. One way to address the score discrepancies between 
different splits is k-fold cross-validation.

- Scikit-learn's scoring API follows the convention that a higher score is better. This works well for accuracy, but 
not for errors, where the lowest is best. By taking the negative of each mean squared error, the lowest error ends 
up being the highest score. This is compensated for later with `rmse = np.sqrt(-scores)`, so the final results are 
positive.  
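
- The sign convention can be checked by running the folds by hand. The sketch below uses synthetic data and 5 unshuffled folds, both chosen for illustration only, and compares a manual loop with `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kfold = KFold(n_splits=5)

# Manual loop: fit on k-1 folds, score on the held-out fold
manual_mse = []
for train_idx, test_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    manual_mse.append(mean_squared_error(y[test_idx], y_pred))

# cross_val_score returns the same errors with the sign flipped
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=kfold)

# Negate before taking the square root to get positive RMSE values
rmse = np.sqrt(-scores)
print(np.round(rmse, 3))
```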



In [42]:
# Instantiate Linear Regression
model = LinearRegression()

# Obtain scores of cross-validation using 10 splits and mean squared error
scores = cross_val_score(model, X, y, scoring = 'neg_mean_squared_error', cv = 10)

# Take square root of the scores
rmse = np.sqrt(-scores)

# Display root mean squared error
print('RMSE of the 10-fold batches:', np.round(rmse, 2))

# Display mean score
print('RMSE mean: %0.2f' % (rmse.mean()))

RMSE of the 10-fold batches: [ 504.01  840.55 1140.88  728.39  640.2   969.95 1133.45 1252.85 1084.64
 1425.33]
RMSE mean: 972.02



- Linear regression has a mean RMSE of 972.02 under 10-fold cross-validation, which is higher than the 898.21 obtained from the single train/test split. 
The point here is not whether the score is better or worse. 
- The point is that it's a better estimate of 
how linear regression will perform on unseen data. Using cross-validation is always recommended for a 
better estimate of the score.

- With the same 10-fold setup, XGBRegressor reaches a mean RMSE of 887.31, roughly 9% lower than linear regression. 



In [44]:
# Instantiate XGBRegressor
model = XGBRegressor(objective="reg:squarederror")

# Obtain scores of cross-validation using 10 splits and mean squared error
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)

# Take square root of the scores
rmse = np.sqrt(-scores)

# Display root mean squared error
print('RMSE of the 10-fold batches:', np.round(rmse, 2))

# Display mean score
print('RMSE mean: %0.2f' % (rmse.mean()))

RMSE of the 10-fold batches: [ 717.65  692.8   520.7   737.68  835.96 1006.24  991.34  747.61  891.99
 1731.13]
RMSE mean: 887.31
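
- The relative gain of XGBRegressor over linear regression can be checked from the two cross-validated means reported above. The variable names are introduced for illustration.

```python
rmse_linear = 972.02   # 10-fold mean RMSE, LinearRegression
rmse_xgb = 887.31      # 10-fold mean RMSE, XGBRegressor

# Fraction by which the mean RMSE drops when switching models
improvement = (rmse_linear - rmse_xgb) / rmse_linear
print(f"Relative RMSE reduction: {improvement:.1%}")
```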


# Reference
<hr style = "border:2px solid black" > </hr >


- Corey Wade. “Hands-On Gradient Boosting with XGBoost and scikit-learn”
- https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn

