Bike-sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different one on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

***What does the data actually consist of?***

We are provided with hourly rental data spanning two years. For this competition, the training set comprises the first 19 days of each month, while the test set runs from the 20th to the end of the month. We need to predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
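The day-of-month split described above can be sketched on synthetic timestamps. This is only an illustration: the names `stamps`, `frame`, `train_part`, and `test_part` are made up here, and the real split is already done for us in `train.csv` and `test.csv`.

```python
import pandas as pd

# Synthetic hourly timestamps covering two months (a stand-in for the real data).
stamps = pd.date_range("2011-01-01", "2011-02-28 23:00", freq="h")
frame = pd.DataFrame({"datetime": stamps})

# Days 1-19 of each month play the role of the training set,
# day 20 onward plays the role of the test set.
day = frame["datetime"].dt.day
train_part = frame[day <= 19]
test_part = frame[day >= 20]

print(len(train_part), len(test_part))
```

The same day-of-month mask could also be used to carve a validation set out of the training data that resembles the competition split more closely than a random split does.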

In [1]:
# An Overview of the features given.
# datetime - hourly date + timestamp  
# season -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
# holiday - whether the day is considered a holiday
# workingday - whether the day is neither a weekend nor holiday
# weather - 1: Clear, Few clouds, Partly cloudy
# 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
# 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
# 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
# temp - temperature in Celsius
# atemp - "feels like" temperature in Celsius
# humidity - relative humidity
# windspeed - wind speed
# casual - number of non-registered user rentals initiated
# registered - number of registered user rentals initiated
# count - number of total rentals

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
train_data=pd.read_csv('train.csv',usecols=['datetime','season','holiday','workingday','weather','temp','atemp','humidity','windspeed','count'])
train_data.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,1


In [4]:
#Creating new features from the "datetime" column, since a raw timestamp string can't be used directly with ML algos.
df=train_data.copy()

In [5]:
dt=pd.to_datetime(df['datetime'])  # parse the timestamp column once and reuse it
df['year']=dt.dt.year
df['month']=dt.dt.month
df['weekday']=dt.dt.weekday
df['hour']=dt.dt.hour
df['minutes']=dt.dt.minute

In [6]:
# The data is hourly, so the minute component carries no information.
df.drop(columns='minutes',inplace=True)

In [7]:
df.head(3)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,year,month,weekday,hour
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,16,2011,1,5,0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,40,2011,1,5,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,32,2011,1,5,2


In [8]:
#Missing Values Analysis: report the NaN count for every column, not only the clean ones
for i in df.columns:
    print("Feature {",(i),"} has",df[i].isnull().sum(),"Nan values")

Feature { datetime } has 0 Nan values
Feature { season } has 0 Nan values
Feature { holiday } has 0 Nan values
Feature { workingday } has 0 Nan values
Feature { weather } has 0 Nan values
Feature { temp } has 0 Nan values
Feature { atemp } has 0 Nan values
Feature { humidity } has 0 Nan values
Feature { windspeed } has 0 Nan values
Feature { count } has 0 Nan values
Feature { year } has 0 Nan values
Feature { month } has 0 Nan values
Feature { weekday } has 0 Nan values
Feature { hour } has 0 Nan values


In [9]:
df.shape

(10886, 14)

In [10]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
season,10886.0,2.506614,1.116174,1.0,2.0,3.0,4.0,4.0
holiday,10886.0,0.028569,0.166599,0.0,0.0,0.0,0.0,1.0
workingday,10886.0,0.680875,0.466159,0.0,0.0,1.0,1.0,1.0
weather,10886.0,1.418427,0.633839,1.0,1.0,1.0,2.0,4.0
temp,10886.0,20.23086,7.79159,0.82,13.94,20.5,26.24,41.0
atemp,10886.0,23.655084,8.474601,0.76,16.665,24.24,31.06,45.455
humidity,10886.0,61.88646,19.245033,0.0,47.0,62.0,77.0,100.0
windspeed,10886.0,12.799395,8.164537,0.0,7.0015,12.998,16.9979,56.9969
count,10886.0,191.574132,181.144454,1.0,42.0,145.0,284.0,977.0
year,10886.0,2011.501929,0.500019,2011.0,2011.0,2012.0,2012.0,2012.0


In [11]:
from sklearn.model_selection import train_test_split
X=df.drop(columns=['count','datetime'])
y=df['count']

In [12]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
X_train.shape,X_test.shape

((8708, 12), (2178, 12))

In [13]:
from sklearn.ensemble import RandomForestRegressor
reg=RandomForestRegressor()
reg.fit(X_train,y_train)

RandomForestRegressor()

In [14]:
reg.score(X_train,y_train)

0.9926566274325974

In [15]:
predictions_test=reg.predict(X_test)
from sklearn.metrics import r2_score,mean_squared_error as MSE

In [16]:
print(r2_score(y_test,predictions_test))
print(np.sqrt(MSE(y_test,predictions_test)))

0.9445945695467054
42.90562079416331


In [17]:
print((MSE(y_test,predictions_test)))

1840.8922957325392
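RMSE is dominated by errors in the busiest hours. If the leaderboard is scored with root mean squared logarithmic error (an assumption here; the competition page is the authority), a local check along the following lines may be closer to what the leaderboard reports. `rmsle` is a hypothetical helper name, and the numbers fed to it are toy values, not results from this notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip at zero so log1p stays defined even if a model predicts a negative count.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Toy values only.
print(rmsle([16, 40, 32], [13.48, 45.0, 30.0]))
```

In this notebook the call would presumably be `rmsle(y_test, predictions_test)`.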


In [18]:
from sklearn.linear_model import Ridge
reg1=Ridge(alpha=1.0)
reg1.fit(X_train,y_train)
reg1.score(X_train,y_train)

0.3878485146169435

In [19]:
preds_test=reg1.predict(X_test)
print(r2_score(y_test,preds_test))
print(np.sqrt(MSE(y_test,preds_test)))

0.39339662923205276
141.96798982739367


In [20]:
print((MSE(y_test,preds_test)))

20154.910135630955


In [21]:
#Checking Whether there is some correlation between the predictor variables.
# find and remove correlated features
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

In [22]:
correlation(X_train,0.85)

{'atemp', 'month'}

In [23]:
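#{'atemp', 'month'}: each of these features has an absolute correlation above the 0.85 threshold with at least one earlier column (presumably atemp with temp and month with season)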
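The `correlation` function returns only the names of the flagged columns, not the columns they are correlated with. A small variant that also reports the partner column and the coefficient might look like the sketch below. `correlated_pairs` is a hypothetical helper, and the data frame `toy` is synthetic, built so that `atemp` tracks `temp`.

```python
import numpy as np
import pandas as pd

def correlated_pairs(dataset, threshold):
    """List (feature_a, feature_b, correlation) for pairs above the threshold."""
    corr_matrix = dataset.corr()
    cols = corr_matrix.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i):  # lower triangle only, so each pair is visited once
            value = corr_matrix.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[j], cols[i], round(float(value), 3)))
    return pairs

# Synthetic stand-in data: 'atemp' is constructed to follow 'temp' closely.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=500)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 1, size=500),
    "humidity": rng.uniform(0, 100, size=500),
})
print(correlated_pairs(toy, 0.85))
```

Calling `correlated_pairs(X_train, 0.85)` on the real features would show which columns `atemp` and `month` are actually paired with.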

In [24]:
X_train_corr=X_train.drop(columns=['atemp','month'])
X_test_corr=X_test.drop(columns=['atemp','month'])

In [25]:
from sklearn.ensemble import RandomForestRegressor
reg_corr=RandomForestRegressor()
reg_corr.fit(X_train_corr,y_train)

RandomForestRegressor()

In [26]:
reg_corr.score(X_train_corr,y_train)

0.9920788245386891

In [27]:
predictions_corr=reg_corr.predict(X_test_corr)
from sklearn.metrics import r2_score,mean_squared_error as MSE

In [31]:
print(r2_score(y_test,predictions_corr))
print(np.sqrt(MSE(y_test,predictions_corr)))

0.9368575941130266
45.80349167289611


In [32]:
from sklearn.linear_model import Ridge
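reg1_corr=Ridge(alpha=1.0)
# Note: this model is fitted and evaluated on the full feature set (X_train/X_test),
# not on X_train_corr/X_test_corr, so its scores equal those of reg1 above.
reg1_corr.fit(X_train,y_train)
reg1_corr.score(X_train,y_train)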

0.3878485146169435

In [33]:
preds_corr=reg1_corr.predict(X_test)
print(r2_score(y_test,preds_corr))
print(np.sqrt(MSE(y_test,preds_corr)))

0.39339662923205276
141.96798982739367


In [34]:
test_data=pd.read_csv('test.csv')
test_data.head(3)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
0,2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
1,2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,0.0
2,2011-01-20 02:00:00,1,0,1,1,10.66,13.635,56,0.0


In [35]:
test_dt=pd.to_datetime(test_data['datetime'])  # parse the timestamp column once and reuse it
test_data['year']=test_dt.dt.year
test_data['month']=test_dt.dt.month
test_data['weekday']=test_dt.dt.weekday
test_data['hour']=test_dt.dt.hour

In [36]:
test_data.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
year          0
month         0
weekday       0
hour          0
dtype: int64

In [37]:
test_data.drop(columns='datetime',inplace=True)

In [43]:
final_predictions=pd.DataFrame(reg.predict(test_data))

In [44]:
final_predictions

Unnamed: 0,0
0,13.48
1,5.49
2,4.91
3,3.72
4,2.96
...,...
6488,319.23
6489,208.45
6490,148.54
6491,113.74


In [46]:
output=round(final_predictions,0)

In [47]:
output.to_csv('output.csv')

In [48]:
output.shape

(6493, 1)
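The file written above contains only a row index and an unnamed prediction column. If the competition expects a `datetime` column next to a `count` column (an assumption; the sample submission file is the reference), the frame could be assembled roughly as follows. The timestamps and predictions are stand-in values, and in the notebook the `datetime` column would have to be saved before it is dropped from `test_data`.

```python
import numpy as np
import pandas as pd

# Stand-in values; in the notebook these would come from test.csv and reg.predict.
timestamps = pd.Series(["2011-01-20 00:00:00", "2011-01-20 01:00:00", "2011-01-20 02:00:00"])
raw_predictions = np.array([13.48, 5.49, 4.91])

submission = pd.DataFrame({
    "datetime": timestamps,
    # Rental counts are whole, non-negative numbers.
    "count": np.clip(np.rint(raw_predictions), 0, None).astype(int),
})

# index=False keeps the row index out of the CSV; pass a file path to write to disk.
csv_text = submission.to_csv(index=False)
print(csv_text)
```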