# Bike Sharing Assignment

### Problem Statement
<br>A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems allow people to borrow a bike from a "dock", which is usually computer-controlled: the user enters payment information and the system unlocks a bike. The bike can then be returned to another dock belonging to the same system. <br><br> A US bike-sharing provider, <b>BoomBikes</b>, has recently suffered considerable dips in its revenues due to the ongoing Corona pandemic. The company is finding it very difficult to sustain itself in the current market scenario. So, it has decided to come up with a mindful business plan to accelerate its revenue as soon as the ongoing lockdown comes to an end and the economy returns to a healthy state. <br><br>To that end, BoomBikes aspires to understand the demand for shared bikes once the nationwide quarantine due to Covid-19 ends. It plans to use this understanding to cater to people's needs once the situation improves, stand out from other service providers, and make huge profits.


They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:
- <i> Which variables are significant in predicting the demand for shared bikes.</i>
- <i> How well those variables describe the bike demands</i>

Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors. 

#### Business Goal:
You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market. 

<hr>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("day.csv")
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

The dataset consists of 730 rows and 16 columns. <br>There are no null values. <br>The columns are of type `int64` and `float64`, with one column of type `object`.

In [None]:
df.dteday.head()

The `object` type column `dteday` consists of date values

### Exploring Data

#### 1. Drop unnecessary columns

- `instant` is an index column, so it can be removed
- we don't really need `dteday`, as its information is already carried by the columns `yr`, `mnth` and `weekday`
- `casual` and `registered` are already summarized by the column `cnt` (their total), so keeping them would be redundant
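
The redundancy noted in the last bullet can be checked before dropping anything. A minimal sketch on a small made-up frame (the values are illustrative and not taken from `day.csv`), assuming `cnt` is meant to be the sum of `casual` and `registered`:

```python
import pandas as pd

# Toy rows with made-up values; the real check would run on the loaded `df`.
toy = pd.DataFrame({
    "casual": [10, 25, 40],
    "registered": [90, 175, 260],
    "cnt": [100, 200, 300],
})

# True when every row satisfies cnt == casual + registered.
is_redundant = bool((toy["casual"] + toy["registered"] == toy["cnt"]).all())
print(is_redundant)
```

If the same expression evaluates to `True` on the real data, the two columns add no information beyond `cnt`.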

In [None]:
drop_columns = ["instant", "dteday", "casual", "registered"]
df.drop(drop_columns, axis=1, inplace=True)
df.head()

#### 2. Checking for outliers

In [None]:
continuous_var_columns = ['temp', 'atemp', 'hum', 'windspeed']
plt.figure(figsize=(18,4))

# One boxplot per continuous variable, side by side
for pos, col in enumerate(continuous_var_columns, start=1):
    plt.subplot(1,4,pos)
    sns.boxplot(y=col, data=df)

There do not appear to be any outlying values in these continuous variables.
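
The visual check can be backed by a count. The sketch below applies the usual boxplot whisker rule (points beyond 1.5 × IQR from the quartiles) to a made-up sample; `iqr_outlier_count` is a hypothetical helper, and the factor 1.5 is assumed to match the plotting default.

```python
import numpy as np

def iqr_outlier_count(values, k=1.5):
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())

# Made-up sample with one obviously extreme value (2.5).
sample = [0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 2.5]
print(iqr_outlier_count(sample))
```

On the real data, the same function could be called once per continuous column, e.g. `iqr_outlier_count(df['hum'])`.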

#### 3. Categorizing variables

In [None]:
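# Map numeric codes to readable labels. Assigning the result back avoids
# calling replace(..., inplace=True) on a column pulled out of the frame,
# which newer pandas versions warn about and may not apply to `df` itself.
df['season'] = df['season'].replace({1:"spring", 2:"summer", 3:"fall", 4:"winter"})

df['weathersit'] = df['weathersit'].replace({1:'clear',2:'moderate',3:'bad',4:'severe'})

df['mnth'] = df['mnth'].replace({1: 'jan',2: 'feb',3: 'mar',4: 'apr',5: 'may',6: 'jun',
                  7: 'jul',8: 'aug',9: 'sept',10: 'oct',11: 'nov',12: 'dec'})

df['weekday'] = df['weekday'].replace({0: 'sun',1: 'mon',2: 'tue',3: 'wed',4: 'thu',5: 'fri',6: 'sat'})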

# df.yr.replace({0: "2018", 1: "2019"}, inplace=True)

df.head()



In [None]:
df.info()

### Visualizing the data

In [None]:
sns.pairplot(df)

Looking at the scatterplots, it seems that `temp` and `atemp` have the highest correlation with the target variable `cnt`.<br><br>Also, `temp` and `atemp` are highly correlated with each other.

In [None]:
plt.figure(figsize=(20,20))
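# Restrict to numeric columns: the relabelled categorical columns are strings
sns.heatmap(df.select_dtypes('number').corr(),annot=True)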
plt.show()

Looking at a few of the continuous variables up close:

In [None]:
sns.heatmap(df[['temp','atemp','hum','windspeed','cnt']].corr(), cmap='BuGn_r', annot = True)
plt.show()

We see that `temp` and `atemp` have a correlation of almost 1
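
Pairs like this can also be listed programmatically instead of read off the heatmap. A minimal sketch on made-up values, where `correlated_pairs` is a hypothetical helper and the 0.9 cut-off is an assumption chosen for illustration:

```python
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Made-up values: atemp tracks temp closely, windspeed does not.
toy = pd.DataFrame({
    "temp": [10.0, 15.0, 20.0, 25.0, 30.0],
    "atemp": [12.0, 17.5, 22.0, 27.5, 31.0],
    "windspeed": [12.0, 9.0, 14.0, 8.0, 11.0],
})
pairs = correlated_pairs(toy)
print(pairs)
```

Near-duplicate predictors such as these are candidates for removal, since they carry almost the same information.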

Let's take a look at some categorical variables as well

In [None]:
categorical = ['season','yr','mnth','holiday','weekday','workingday','weathersit']
plt.figure(figsize=(15, 15))
for pos, col in enumerate(categorical, start=1):
    plt.subplot(3,3,pos)
    sns.boxplot(data=df, x=col, y='cnt')
plt.show()

Inference:
1. Across seasons, fall has the highest demand for rental bikes.
2. Demand in 2019 was considerably higher than in 2018.
3. Month-wise, demand rises steadily until June, is inconsistent over the following months, and falls off in November and December.
4. Demand seems to be higher on non-holidays.
5. Demand does not seem to change much across weekdays.
6. Bikes are rented more often when the weather allows it, i.e. when the weather is clear.
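
Readings like these can be cross-checked numerically, since the centre line of each box is the group median. A small sketch with made-up counts (not the notebook's data), assuming the same `season` and `cnt` column names:

```python
import pandas as pd

# Made-up counts for illustration only.
toy = pd.DataFrame({
    "season": ["spring", "spring", "summer", "summer", "fall", "fall", "winter", "winter"],
    "cnt": [1500, 2500, 4800, 5200, 5300, 5700, 4200, 4600],
})

# Median demand per level, highest first.
median_by_season = toy.groupby("season")["cnt"].median().sort_values(ascending=False)
print(median_by_season)
```

The same `groupby(...).median()` call works for any of the categorical columns plotted above.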

### Preparing Data

#### Creating Dummy Variables
Convert categorical variables into dummy/indicator variables.
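
For a variable with k levels, k-1 dummy columns are enough: the omitted level becomes the baseline that the others are compared against. A minimal sketch on a toy column showing what `drop_first=True` does (the `dtype=int` argument is used here only to get 0/1 values rather than booleans):

```python
import pandas as pd

toy = pd.DataFrame({"season": ["fall", "spring", "summer", "winter", "fall"]})

# drop_first=True removes the first level in sorted order ("fall" here),
# so a row of all zeros means "fall".
dummies = pd.get_dummies(toy, columns=["season"], drop_first=True, dtype=int)
print(list(dummies.columns))
print(dummies)
```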

In [None]:
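# dtype=int keeps the dummies as 0/1 integers; newer pandas versions return booleans by default
df = pd.get_dummies(data=df,columns=["season","mnth","weekday"],drop_first=True,dtype=int)
# All weathersit levels are kept here (no drop_first)
df = pd.get_dummies(data=df,columns=["weathersit"],dtype=int)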
df.columns

In [None]:
df.head()

### Model Building

#### Splitting Data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
y=df.pop('cnt')

X=df

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=77)

In [None]:
X.head()

In [None]:
y.head()

In [None]:

print(X_train.shape)
print(X_test.shape)

#### Feature Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
continuous_vars = ['temp','atemp','hum','windspeed']

# Use min-max scaling to bring the continuous variables into the range [0, 1]
scaler = MinMaxScaler()

# Fit on the training set only, then transform it
X_train[continuous_vars] = scaler.fit_transform(X_train[continuous_vars])

In [None]:
X_train.describe()
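
With its default settings, min-max scaling maps each value x to (x - min) / (max - min), where min and max are learned from the data the scaler was fitted on. The NumPy sketch below reproduces that arithmetic on made-up numbers to show why the scaler is fitted on the training rows only; it illustrates the formula and is not scikit-learn's code.

```python
import numpy as np

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [35.0]])

# Learn min and max from the training rows only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

train_scaled = (train - col_min) / (col_max - col_min)
# Reuse the training min/max on the test rows; values outside the training
# range land outside [0, 1], which is expected.
test_scaled = (test - col_min) / (col_max - col_min)
print(train_scaled.ravel(), test_scaled.ravel())
```

Fitting on the full data instead would let information from the test rows leak into the training features.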

#### RFE

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

In [None]:
lr = LinearRegression()
lr.fit(X_train,y_train)

In [None]:
# Using Linear Regression as an estimator, selecting 15 features
rfe = RFE(lr,n_features_to_select=15)
rfe.fit(X_train,y_train)

In [None]:
# Features selected by RFE
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
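
RFE repeatedly fits the estimator and discards the weakest feature until the requested number remains. The sketch below re-implements that idea in plain NumPy on synthetic data, using the absolute least-squares coefficient as the importance score. It is a simplified illustration, not scikit-learn's implementation, and `backward_eliminate` is a hypothetical helper name.

```python
import numpy as np

def backward_eliminate(X, y, names, n_keep):
    """Repeatedly fit least squares and drop the feature with the smallest |coefficient|."""
    names = list(names)
    X = np.asarray(X, dtype=float)
    while len(names) > n_keep:
        design = np.column_stack([np.ones(len(X)), X])  # intercept + features
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[1:])))      # skip the intercept
        X = np.delete(X, weakest, axis=1)
        names.pop(weakest)
    return names

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
# Synthetic target: only columns "a" and "c" matter.
y = 5.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(0, 0.1, size=200)
selected = backward_eliminate(X, y, ["a", "b", "c", "d"], n_keep=2)
print(selected)
```

Coefficient magnitudes are only comparable when the features share a similar scale, which is one reason scaling comes before this step.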

#### Manual Elimination

In [None]:
X_train.columns[rfe.support_]

In [None]:
X_train.columns[~rfe.support_]

In [None]:
# Taking 15 features selected by RFE for regression
X_train_rfe = X_train[['yr', 'holiday', 'workingday', 'temp', 'hum', 'windspeed', 'season_spring',
       'season_summer', 'season_winter', 'mnth_jan', 'mnth_jul', 'mnth_sept', 'weekday_sat',
       'weathersit_bad', 'weathersit_moderate']]

##### Model I

In [None]:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# For model-I we add all columns selected by RFE
target_columns = ['yr', 'holiday', 'workingday', 'temp', 'hum', 'windspeed', 'season_spring',
       'season_summer', 'season_winter', 'mnth_jan', 'mnth_jul', 'mnth_sept', 'weekday_sat',
       'weathersit_bad', 'weathersit_moderate']

X_train_sm = sm.add_constant(X_train[target_columns])
lm = sm.OLS(y_train, X_train_sm).fit()
print(lm.summary())

df1 = X_train[target_columns]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)

print(vif.sort_values(by='VIF',ascending=False))
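
The VIF of a feature measures how well it is explained by the other features: VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on the rest. A minimal NumPy sketch of that formula on synthetic data follows. It includes an intercept in each auxiliary regression, which is an assumption of this sketch; `variance_inflation_factor` regresses on exactly the columns it is given, so its numbers can differ when no constant column is passed.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of `a`
c = rng.normal(size=500)                 # unrelated to both
values = vif(np.column_stack([a, b, c]))
print(np.round(values, 2))
```

The two near-duplicate columns get a large VIF while the unrelated one stays close to 1, which is the pattern to look for in the tables printed for each model.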

##### Model II

Dropping `mnth_jan` for its high p-value

In [None]:
target_columns = ['yr', 'holiday', 'workingday', 'temp', 'hum', 'windspeed', 'season_spring',
       'season_summer', 'season_winter', 'mnth_jul', 'mnth_sept', 'weekday_sat',
       'weathersit_bad', 'weathersit_moderate']

X_train_sm = sm.add_constant(X_train[target_columns])
lm = sm.OLS(y_train, X_train_sm).fit()
print(lm.summary())

df1 = X_train[target_columns]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)

print(vif.sort_values(by='VIF',ascending=False))

##### Model III

Dropping `holiday` for its high p-value

In [None]:
target_columns = ['yr', 'workingday', 'temp', 'hum', 'windspeed', 'season_spring',
       'season_summer', 'season_winter', 'mnth_jul', 'mnth_sept', 'weekday_sat',
       'weathersit_bad', 'weathersit_moderate']

X_train_sm = sm.add_constant(X_train[target_columns])
lm = sm.OLS(y_train, X_train_sm).fit()
print(lm.summary())

df1 = X_train[target_columns]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)

print(vif.sort_values(by='VIF',ascending=False))

##### Model IV

Since all the p-values appear to be below 0.05, we can consider the VIF values

Dropping `hum` for its high VIF
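
The elimination order used across these models (p-values first, then VIF) can be written down as a small rule. In the sketch below, the 0.05 p-value cut-off comes from the text, while the VIF cut-off of 5 is an assumed rule of thumb that the notebook does not state; `next_feature_to_drop` is a hypothetical helper and the numbers are made up.

```python
def next_feature_to_drop(pvalues, vifs, p_max=0.05, vif_max=5.0):
    """Return the feature to drop next, or None if every feature passes both checks."""
    # 1. Highest p-value first, if it exceeds the significance cut-off.
    worst_p = max(pvalues, key=pvalues.get)
    if pvalues[worst_p] > p_max:
        return worst_p
    # 2. Otherwise the highest VIF, if it exceeds the VIF cut-off.
    worst_vif = max(vifs, key=vifs.get)
    if vifs[worst_vif] > vif_max:
        return worst_vif
    return None

# Made-up numbers for illustration only.
pvalues = {"temp": 0.000, "hum": 0.003, "windspeed": 0.001}
vifs = {"temp": 12.4, "hum": 25.1, "windspeed": 4.2}
choice = next_feature_to_drop(pvalues, vifs)
print(choice)
```

Only one feature is dropped per refit, because both p-values and VIFs change once a column is removed.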

In [None]:
target_columns = ['yr', 'workingday', 'temp', 'windspeed', 'season_spring',
       'season_summer', 'season_winter', 'mnth_jul', 'mnth_sept', 'weekday_sat',
       'weathersit_bad', 'weathersit_moderate']

X_train_sm = sm.add_constant(X_train[target_columns])
lm = sm.OLS(y_train, X_train_sm).fit()
print(lm.summary())

df1 = X_train[target_columns]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)

print(vif.sort_values(by='VIF',ascending=False))

##### Model V

Dropping `mnth_jul` for its high p-value

In [None]:
target_columns = ['yr', 'workingday', 'temp', 'windspeed', 'season_spring',
       'season_summer', 'season_winter', 'mnth_sept', 'weekday_sat',
       'weathersit_bad', 'weathersit_moderate']

X_train_sm = sm.add_constant(X_train[target_columns])
lm = sm.OLS(y_train, X_train_sm).fit()
print(lm.summary())

df1 = X_train[target_columns]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)

print(vif.sort_values(by='VIF',ascending=False))

##### Model VI

Removing `temp` as it has a high VIF

In [None]:
target_columns = ['yr', 'workingday', 'windspeed', 'season_spring',
       'season_summer', 'season_winter', 'mnth_sept', 'weekday_sat',
       'weathersit_bad', 'weathersit_moderate']

X_train_sm = sm.add_constant(X_train[target_columns])
lm = sm.OLS(y_train, X_train_sm).fit()
print(lm.summary())

df1 = X_train[target_columns]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)

print(vif.sort_values(by='VIF',ascending=False))

Here, the VIF and p-values seem to be in an acceptable range. <br>But the R-squared value is only about 0.76, so let's try to improve it.

##### Model VII

We can try replacing `weekday_sat` with `weekday_sun`

In [None]:
target_columns = ['yr', 'workingday', 'windspeed', 'season_spring',
       'season_summer', 'season_winter', 'mnth_sept', 'weekday_sun',
       'weathersit_bad', 'weathersit_moderate']

X_train_sm = sm.add_constant(X_train[target_columns])
lm = sm.OLS(y_train, X_train_sm).fit()
print(lm.summary())

df1 = X_train[target_columns]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)

print(vif.sort_values(by='VIF',ascending=False))

##### Model VIII

Dropping `workingday` from the previous feature list

In [None]:
target_columns = ['yr', 'windspeed', 'season_spring',
       'season_summer', 'season_winter', 'mnth_sept', 'weekday_sun',
       'weathersit_bad', 'weathersit_moderate']

X_train_sm = sm.add_constant(X_train[target_columns])
lm = sm.OLS(y_train, X_train_sm).fit()
print(lm.summary())

df1 = X_train[target_columns]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)

print(vif.sort_values(by='VIF',ascending=False))

In [None]:

lr = LinearRegression()
lr.fit(X_train[target_columns],y_train)
print(lr.intercept_,lr.coef_)

### Model Evaluation

In [None]:
from sklearn.metrics import r2_score

In [None]:

y_train_pred = lr.predict(X_train[target_columns])


In [None]:
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(y_train-y_train_pred, kde=True)
plt.title('Error Terms')
plt.xlabel('Errors')

The errors appear to be approximately normally distributed around a mean of 0.

In [None]:
r2_score(y_train,y_train_pred)

In [None]:
num_vars = ['temp','atemp','hum','windspeed']

#Test data to be transformed only, no fitting
X_test[num_vars] = scaler.transform(X_test[num_vars])

In [None]:
#Predict the values for test data
y_test_pred = lr.predict(X_test[target_columns])

In [None]:
r2_score(y_test,y_test_pred)
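
When comparing the train and test scores, it helps to recall what R-squared measures: 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers, together with the adjusted variant that penalizes the number of features. Both helper names are hypothetical and the data is illustrative only.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2_value, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]
score = r2(y_true, y_pred)
adj = adjusted_r2(score, n_samples=4, n_features=1)
print(round(score, 3), round(adj, 3))
```

A test score close to the training score suggests the selected features generalize; a large gap would point to overfitting.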