# [Kaggle Competition: Bike Sharing Demand](https://www.kaggle.com/competitions/bike-sharing-demand/overview)

### Forecasting use of a city bikeshare system

In [5]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

## 1. Data description:

First, let's read the data into a pandas DataFrame:

In [7]:
df = pd.read_csv('train.csv')
df.head()

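| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |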


The training dataset for this Kaggle competition consists of 10,886 hourly samples with 12 columns (a mixture of continuous and categorical data), including the total number of bike rentals in each hour (`count`). Our task is to predict this integer value.

We can first check for missing (null) values in our dataset:

In [8]:
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
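
As a more compact alternative to reading the `info()` table, missing values can also be counted directly. The sketch below is only illustrative: it uses a small hand-built `sample` frame (an assumption, copied from the first rows shown earlier) in place of `train.csv`, and the names `missing_per_column` and `total_missing` are hypothetical.

```python
import pandas as pd

# Small stand-in for train.csv, built from the first rows shown earlier.
sample = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00", "2011-01-01 02:00:00"],
    "temp": [9.84, 9.02, 9.02],
    "count": [16, 40, 32],
})

# Number of missing values in each column; all zeros means nothing to impute.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(f"Total missing values: {total_missing}")
```

Running the same two lines on the full `df` should report zero for every column if the `info()` output above is accurate.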


Every column has 10,886 non-null entries, so there are no missing values. We can use **pandas.DatetimeIndex** to split the `datetime` column into separate hour, day, dayofweek, month, and year variables:

In [10]:
# Parse the datetime strings once and reuse the resulting index
datetime_index = pd.DatetimeIndex(df['datetime'])

df['year'] = datetime_index.year
df['month'] = datetime_index.month
df['day'] = datetime_index.day
df['dayofweek'] = datetime_index.dayofweek
df['hour'] = datetime_index.hour
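
To make the `dayofweek` values easier to interpret, note that pandas numbers the days from Monday=0 to Sunday=6. The following minimal sketch, which assumes a two-row `sample` frame rather than the full dataset, shows the integer code next to the day name using the `.dt` accessor.

```python
import pandas as pd

# Two timestamps taken from the first rows of the training data.
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"]})

# Parse the strings into datetime values.
timestamps = pd.to_datetime(sample["datetime"])

# dayofweek uses Monday=0 ... Sunday=6; day_name() gives the readable label.
sample["dayofweek"] = timestamps.dt.dayofweek
sample["day_name"] = timestamps.dt.day_name()

print(sample)
```

Here 1 January 2011 maps to code 5, a Saturday.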

The dataframe now has five extra columns for year, month, day, day of the week, and hour, so we can drop the datetime column:

In [11]:
df = df.drop("datetime", axis=1)
df.head()

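| | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | year | month | day | dayofweek | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 2011 | 1 | 1 | 5 | 0 |
| 1 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 2011 | 1 | 1 | 5 | 1 |
| 2 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2011 | 1 | 1 | 5 | 2 |
| 3 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 2011 | 1 | 1 | 5 | 3 |
| 4 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 2011 | 1 | 1 | 5 | 4 |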


A description of the remaining features is given below:

**season**: 1: Spring, 2: Summer, 3: Fall, 4: Winter <br>
**holiday**: 1: Holiday, 0: Non-holiday <br>
**workingday**: 1: Workday, 0: Non-workday <br>
**weather**: 1: Clear, 2: Mist, 3: Light rain, 4: Heavy rain <br>
**temp**: Temperature in Celsius <br>
**atemp**: "Feels like" temperature in Celsius <br>
**humidity**: Relative humidity <br>
**windspeed**: Wind speed <br>
**casual**: Number of non-registered user rentals <br>
**registered**: Number of registered user rentals <br>
**count**: Total number of rentals <br>
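
The integer codes above can be turned into readable labels when plotting or inspecting the data. The sketch below is one possible way to do this, not part of the original notebook: `SEASON_LABELS` and `WEATHER_LABELS` are hypothetical lookup tables built from the descriptions, and `sample` is a hand-copied stand-in for the first three rows. It also checks that, on these rows, `count` equals `casual` plus `registered`.

```python
import pandas as pd

# First three rows of the training data, copied from the table shown earlier.
sample = pd.DataFrame({
    "season": [1, 1, 1],
    "weather": [1, 1, 1],
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# Hypothetical lookup tables based on the feature descriptions.
SEASON_LABELS = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}
WEATHER_LABELS = {1: "Clear", 2: "Mist", 3: "Light rain", 4: "Heavy rain"}

# Map each integer code to its label.
sample["season_label"] = sample["season"].map(SEASON_LABELS)
sample["weather_label"] = sample["weather"].map(WEATHER_LABELS)

# Check whether the total equals casual + registered on every row.
totals_match = bool((sample["casual"] + sample["registered"] == sample["count"]).all())

print(sample)
print(f"count == casual + registered: {totals_match}")
```

If the same check holds on the full dataset, `casual` and `registered` together determine the target exactly, which is worth keeping in mind before using them as predictors.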

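The variables are a mixture of categorical and continuous types. Note that humidity and the rental counts (casual, registered, count) are stored as integers, but we treat them as continuous here: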

In [12]:
categorical_variables = ["season", "holiday", "workingday", "weather", "year", "month", "day", "dayofweek", "hour"]
continuous_variables = ["temp", "atemp", "humidity", "windspeed", "casual", "registered", "count"]

print(f'There are {len(categorical_variables)} categorical variables.')
print(f'There are {len(continuous_variables)} continuous variables.')

There are 9 categorical variables.
There are 7 continuous variables.
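
One optional way to record this split in the dataframe itself is to cast the categorical columns to the pandas `category` dtype, so that later steps can select them by type. This is a sketch under assumptions: `sample` is a tiny stand-in with only a few of the columns, and the shortened `categorical_variables` list here is illustrative.

```python
import pandas as pd

# Tiny stand-in frame with a subset of the columns.
sample = pd.DataFrame({
    "season": [1, 1],
    "holiday": [0, 0],
    "workingday": [0, 0],
    "weather": [1, 1],
    "temp": [9.84, 9.02],
    "humidity": [81, 80],
    "hour": [0, 1],
})

categorical_variables = ["season", "holiday", "workingday", "weather", "hour"]

# Cast the listed columns to the category dtype in one call.
sample = sample.astype({column: "category" for column in categorical_variables})

# Columns can now be selected by dtype instead of by hand-written lists.
categorical_columns = sample.select_dtypes(include="category").columns.tolist()
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

print(categorical_columns)
print(numeric_columns)
```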
