# Bike Sharing Count Prediction (Hour)

This workbook provides the template for the final project. 

## Instructions
- Work individually or in pairs
- Each team is to complete 1 copy of this template.
  - Complete all sections.
  - Feel free to include supporting material / slides / documents as needed.
- At the end of the project, you will get 15 minutes to present this workbook to the class.

### Submission Instructions
- Submit the .ipynb with the Output cells showing the results
  - Naming convention:
  ```
      <name1>-<name2>-<project_short_name>.ipynb
  ```
- If you provide your own datasets, include the data with your .ipynb

In [248]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.model_selection import learning_curve

## Section 0: Team Members
- Member 1
- Member 2

## Section 1: Project Title

- The title should be 1 sentence that describes the goal of this project.
- Example: Clustering analysis of COE premiums and quotas

## Section 2: Project Definition

### Goals

Describe the goal of this project.

Example:
The goal of this project is to determine if the bid quotas and premiums can be used to predict the vehicle category.

Important:
- If this is your first project, keep the project definition as simple as possible. 
- As a rule of thumb, pick something that can be completed in 2-3 days. There is always more you can add to it if you finish early. 
- If you are not sure, use the workshop problems as a reference.

### Dataset

Briefly describe the source(s) of data you are using.
- Provide the URL to the data source.
- If you are providing your own data set, include the data with your project submission.
- You can find sample datasets from:
  - http://archive.ics.uci.edu/ml/datasets.html
  - http://data.gov.sg/

Example:

We will use the dataset from: https://data.gov.sg/dataset/coe-bidding-results

#### Format: CSV

#### Columns:
 
|Name|Type|Unit of Measure|Description|
|--|--|--|--|
|month|Datetime, YYYY-MM|none|date range: Jan 1, 2010 to Mar 31, 2018|
|bidding_no|Numeric|No. of Bids|Number of Bids|
|vehicle_class|Text|none|Vehicle category: A to E|
|quota|Numeric|No. of Bids|Number of Quota|
|bids_success|Numeric|No. of Bids|Number of Successful Bids|
|bids_received|Numeric|No. of Bids|Number of Bids Received|
|premium|Numeric|S$|COE premium|

### Tasks

List the tasks you will perform. 

Example:
 
1. Process the dataset to convert strings into labels.
2. Shuffle and split into train and test sets
3. Train a clustering algorithm, using Gaussian Mixture Model with 5 components, where each component is a vehicle category.
4. Compute the metrics for the algorithm.
5. Perform analysis for possible improvements.

## Section 3: Prepare Dataset

Write your code below to prepare the dataset using pandas

In [249]:
df = pd.read_csv("C:\\courses\\data\\bike-sharing\\hour.csv", parse_dates=True, encoding='latin-1')
#df=df1[(df1['hr']>=0) & (df1['hr']<4)]
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,1/1/2011,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,1/1/2011,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,1/1/2011,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,1/1/2011,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,1/1/2011,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [250]:
df.describe()

Unnamed: 0,instant,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0,17370.0
mean,8685.5,2.501957,0.502533,6.538169,11.547323,0.028728,3.003742,0.682671,1.425273,0.496987,0.47577,0.627187,0.190102,35.681405,153.796891,189.478296
std,5014.431423,1.106938,0.500008,3.438639,6.91426,0.167045,2.00599,0.46545,0.639422,0.192548,0.171841,0.192954,0.122346,49.314213,151.368927,181.403466
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
25%,4343.25,2.0,0.0,4.0,6.0,0.0,1.0,0.0,1.0,0.34,0.3333,0.48,0.1045,4.0,34.0,40.0
50%,8685.5,3.0,1.0,7.0,12.0,0.0,3.0,1.0,1.0,0.5,0.4848,0.63,0.194,17.0,115.0,142.0
75%,13027.75,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.66,0.6212,0.78,0.2537,48.0,220.0,281.0
max,17370.0,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0


In [251]:
# Convert dates (Optional)
#df['dteday'] = pd.to_datetime(df['dteday'], format='%Y/%m/%d')
#df['dteday']
df.drop(['instant'],axis=1,inplace=True)
#df.index = df['instant']


In [252]:
df.columns

Index(['dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')

In [253]:
#casual_columns=['season',
#       'workingday', 'temp', 'atemp', 'hum', 'windspeed']
casual_columns=['hr','workingday', 'temp', 'hum', 'windspeed']
print(casual_columns)
regd_columns=['mnth',
       'workingday', 'temp', 'atemp', 'hum', 'windspeed','casual']
print(regd_columns)

['hr', 'workingday', 'temp', 'hum', 'windspeed']
['mnth', 'workingday', 'temp', 'atemp', 'hum', 'windspeed', 'casual']


In [254]:

#df.drop(columns='No',inplace=True)
#X = df.loc[:,['season', 'yr', 'mnth', 'holiday', 'weekday',
       #'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       #'casual', 'registered']]

#X = df.loc[:,['season', 'mnth', 'holiday', 'weekday',
       #'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed']]


df2 = df.loc[:,casual_columns]

#[df[(df['hr']>=0) & (df['hr']<4)],df[(df['hr']>=4) & (df['hr']<9)],df[(df['hr']>=9) & (df['hr']<20)]
#     ,df[(df['hr']>=20)]]

X = [df2[(df['hr']>=0) & (df['hr']<4)],df2[(df['hr']>=4) & (df['hr']<9)],df2[(df['hr']>=9) & (df['hr']<20)]
     ,df2[(df['hr']>=20)]]


#print(X.shape)
#print(df.describe())


#print(X.columns)

df3 = df.casual
y_casual = [df3[(df['hr']>=0) & (df['hr']<4)],df3[(df['hr']>=4) & (df['hr']<9)],df3[(df['hr']>=9) & (df['hr']<20)]
     ,df3[(df['hr']>=20)]]
df4 = df.registered
y_regd = [df4[(df['hr']>=0) & (df['hr']<4)],df4[(df['hr']>=4) & (df['hr']<9)],df4[(df['hr']>=9) & (df['hr']<20)]
     ,df4[(df['hr']>=20)]]

temp = df2[(df['hr']>=0) & (df['hr']<4)]

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X[0])
X_2d.shape
fig, ax = plt.subplots()
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y_casual[0])  # colour by the casual counts of the same segment as X[0]
plt.show()

from pandas.plotting import scatter_matrix
#df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])
scatter_matrix(df, alpha=0.2, figsize=(15, 15), diagonal='kde')
plt.show()
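The four hour-of-day segments built above with repeated boolean masks could also be produced in one step. The following is a minimal sketch, assuming the same boundaries ([0, 4), [4, 9), [9, 20), [20, 24)); the helper `hour_segment` and the `demo` frame are hypothetical and use made-up values rather than rows from `hour.csv`.

```python
import pandas as pd

# Hypothetical helper: label each hour 0..23 with a segment index 0..3,
# using the same boundaries as the masks above.
def hour_segment(hours: pd.Series) -> pd.Series:
    return pd.cut(hours, bins=[0, 4, 9, 20, 24], right=False, labels=False)

# Tiny illustrative frame (made-up values, not taken from hour.csv).
demo = pd.DataFrame({"hr": [0, 3, 4, 8, 9, 19, 20, 23],
                     "casual": [3, 1, 0, 12, 30, 55, 20, 9]})
demo["segment"] = hour_segment(demo["hr"])

# One sub-frame per segment, equivalent to the list of four masked frames.
parts = [group for _, group in demo.groupby("segment")]
print([len(p) for p in parts])
```

Keeping the segment as a column means features and targets are always split with the same rule, which avoids repeating the mask for `X`, `y_casual`, and `y_regd`.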

## Section 4: Select Features

Write your code below to create X_train, X_test, y_train, y_test

In [255]:
#X_train, X_test, y_train_casual, y_test_casual = train_test_split(X, y_casual,random_state=42)

In [256]:
print(len(X))
print(len(y_casual))
for val in X:
    print(val.shape)

4
4
(2860, 5)
(3591, 5)
(8009, 5)
(2910, 5)


In [257]:
i=0
X_train = []
X_test = []
y_train_casual = []
y_test_casual = []
for val in X:
    a, b, c, d = train_test_split(val, y_casual[i],random_state=42)
    X_train.append(a)
    X_test.append(b)
    y_train_casual.append(c)
    y_test_casual.append(d)
    i+=1
    print(a.shape)
    
print(len(X_train))

(2145, 5)
(2693, 5)
(6006, 5)
(2182, 5)
4


In [258]:
print(len(X_train))

4


In [259]:
scaler_X = [StandardScaler(), StandardScaler(), StandardScaler(), StandardScaler()]

#i = 0;
#while i<4:
 #   scaler_X.fit(X_train[i])
 #   a1 = scaler_X.transform(X_train[i])
 #   X_scaled_train.append(a1)
 #   i+=1
X_scaled_train = []
for val, scaler in zip(X_train,scaler_X):
    scaler.fit(val)
    X_scaled_train.append(scaler.transform(val))
    print(len(val))
    
X_scaled_test = []
for testval, scaler in zip(X_test,scaler_X):    
    X_scaled_test.append(scaler.transform(testval))

# 2. scale both y series
scaler_y = [StandardScaler(), StandardScaler(), StandardScaler(), StandardScaler()]
#y_scaled_train = scaler_y.transform(y_train.values.reshape(-1, 1))
#y_scaled_test = scaler_y.transform(y_test.values.reshape(-1, 1))

# Scale y for each hour segment (each scaler is fitted on the training targets only)
y_scaled_casual_train = []
for val, scaler in zip(y_train_casual, scaler_y):
    scaler.fit(val.values.reshape(-1, 1))
    y_scaled_casual_train.append(scaler.transform(val.values.reshape(-1, 1)))

#y_scaled_casual_train = y_train_casual.values
#y_scaled_casual_test = y_test_casual.values
#y_scaled_casual_train[0]

y_scaled_casual_test = []
for testval, scaler in zip(y_test_casual,scaler_y):
    y_scaled_casual_test.append(scaler.transform(testval.values.reshape(-1, 1)))

print(len(X_scaled_train))
print(len(X_scaled_test))
print(len(y_scaled_casual_train))
print(len(y_scaled_casual_test))

2145
2693
6006
2182
4
4
4
4




X_regd = df.loc[:,regd_columns]

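# y_regd is a list of four per-segment Series; split against the full 'registered' column instead
X_train_regd, X_test_regd, y_train_regd, y_test_regd = train_test_split(X_regd, df['registered'], random_state=42)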

scaler_X_regd = StandardScaler()
scaler_X_regd.fit(X_train_regd)
X_scaled_train_regd = scaler_X_regd.transform(X_train_regd)
X_scaled_test_regd = scaler_X_regd.transform(X_test_regd)
print(X_scaled_train_regd.shape)
print(X_scaled_test_regd.shape)

# Scale y (the scaler is fitted on the training targets only)
#scaler_y.fit(y_train.values.reshape(-1, 1))
#print(y_train.values.reshape(-1,1).shape)
scaler_y_regd = StandardScaler()
scaler_y_regd.fit(y_train_regd.values.reshape(-1, 1))
y_scaled_regd_train = scaler_y_regd.transform(y_train_regd.values.reshape(-1, 1))
y_scaled_regd_test = scaler_y_regd.transform(y_test_regd.values.reshape(-1, 1))
y_scaled_regd_train[0]
print(y_scaled_regd_train.shape)
print(y_scaled_regd_test.shape)

## Section 5: Train the algorithm(s)

Write your code below to initialize and train the algorithm(s)

In [260]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lin = [LinearRegression(), LinearRegression(), LinearRegression(), LinearRegression()]
for a, b, c, d, model in zip(X_scaled_train, y_scaled_casual_train, X_scaled_test, y_scaled_casual_test, lin):
    model.fit(a, b)
    pred_scaled = model.predict(c)
    #print(scaler_X.inverse_transform(a[0]))
    print('MSE', mean_squared_error(d, pred_scaled))
    print('R2', r2_score(d, pred_scaled))
    


MSE 0.5421850683897761
R2 0.48459547603778774
MSE 0.37633408722596345
R2 0.591033793874605
MSE 0.46460599738943875
R2 0.5164060154951338
MSE 0.30312676810929073
R2 0.6209995303544691
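Because each segment's target was standardized before training, the MSE values above are expressed relative to that segment's training variance, not in riders. A minimal sketch of converting a scaled MSE back to an RMSE in riders is shown below; it uses synthetic counts and stand-in predictions (both assumptions, not data from this notebook) and relies on the identity RMSE_original = sqrt(MSE_scaled) × `scaler.scale_`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic rider counts standing in for one segment's y_train / y_test.
y_train = rng.poisson(30, size=200).astype(float).reshape(-1, 1)
y_test = rng.poisson(30, size=50).astype(float).reshape(-1, 1)

scaler = StandardScaler().fit(y_train)
y_test_scaled = scaler.transform(y_test)

# Stand-in predictions in scaled units (true value plus noise).
pred_scaled = y_test_scaled + rng.normal(0, 0.5, size=y_test_scaled.shape)

mse_scaled = mean_squared_error(y_test_scaled, pred_scaled)

# Rescale: multiply the scaled RMSE by the training standard deviation.
rmse_riders = np.sqrt(mse_scaled) * scaler.scale_[0]

# Cross-check by inverse-transforming the predictions first.
rmse_direct = np.sqrt(mean_squared_error(y_test, scaler.inverse_transform(pred_scaled)))
print(rmse_riders, rmse_direct)
```

The same conversion could be applied per segment with `scaler_y[k].scale_[0]`. R2 needs no conversion, since it is unchanged when the targets and predictions are both scaled by the same linear transform.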


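from sklearn.linear_model import SGDRegressor

# X_scaled_train and y_scaled_casual_train are lists with one array per hour segment,
# so train one SGD regressor per segment, mirroring the LinearRegression models above.
sgd = [SGDRegressor(verbose=True,
                    tol=1e-4,       # stop once the loss no longer improves by at least 1e-4
                    max_iter=1000)  # upper bound on the number of epochs
       for _ in X_scaled_train]
for a, b, c, d, model in zip(X_scaled_train, y_scaled_casual_train, X_scaled_test, y_scaled_casual_test, sgd):
    model.fit(a, b.ravel()) # ravel converts the (n, 1) target array to a 1-D vector
    pred_scaled_sgd = model.predict(c)
    print('MSE', mean_squared_error(d, pred_scaled_sgd))
    print('R2', r2_score(d, pred_scaled_sgd))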

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lin_regd = LinearRegression()
lin_regd.fit(X_scaled_train_regd, y_scaled_regd_train)

pred_scaled_regd = lin_regd.predict(X_scaled_test_regd)

print('MSE', mean_squared_error(y_scaled_regd_test, pred_scaled_regd))
print('R2', r2_score(y_scaled_regd_test, pred_scaled_regd))

from sklearn.linear_model import SGDRegressor

sgd_regd = SGDRegressor(verbose=True,
                   tol=1e-4, # stop training when |new_loss - loss| < 1e-4
                   max_iter = 1000) # sklearn forces us to set max_iter
sgd_regd.fit(X_scaled_train_regd, y_scaled_regd_train.ravel()) # ravel converts 2-D array to 1-D vector

pred_scaled_regd_sgd = sgd_regd.predict(X_scaled_test_regd)

print('MSE', mean_squared_error(y_scaled_regd_test, pred_scaled_regd_sgd))
print('R2', r2_score(y_scaled_regd_test, pred_scaled_regd_sgd))

## Section 6: Evaluate metrics

Write your code below to evaluate metrics for the trained algorithm(s).

Feel free to plot the algorithm to visualize it, as appropriate.

In [261]:
df_predict = pd.read_csv('C:\\courses\\data\\bike-sharing\\hour_Test.csv',na_values=['?', 'nan'])
df_predict

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,1/1/2011,1,0,1,6,0,6,0,1,0.22,0.2727,0.8,0.0,2,0,2
1,2,18/5/2011,2,0,5,0,0,3,1,2,0.54,0.5152,0.88,0.2239,8,15,23
2,3,3/5/2011,2,0,5,0,0,2,1,2,0.56,0.5303,0.83,0.2239,0,16,16
3,4,15/5/2012,2,1,5,14,0,2,1,2,0.64,0.6061,0.73,0.194,39,145,184
4,5,16/5/2012,2,1,5,20,0,3,1,1,0.66,0.6212,0.65,0.2537,61,398,459
5,6,29/5/2012,2,1,5,12,0,2,1,1,0.8,0.7576,0.55,0.3284,56,181,237
6,7,3/12/2012,4,1,12,14,0,1,1,1,0.6,0.6212,0.53,0.0,51,209,260
7,8,25/12/2012,1,1,12,20,1,2,0,2,0.32,0.303,0.66,0.2836,11,29,40
8,9,4/2/2011,1,0,2,8,0,5,1,1,0.14,0.1515,0.74,0.1343,3,217,220


In [262]:
X_predict = df_predict.loc[:,casual_columns]

[df2[(df['hr']>=0) & (df['hr']<4)],df2[(df['hr']>=4) & (df['hr']<9)],df2[(df['hr']>=9) & (df['hr']<20)]
     ,df2[(df['hr']>=20)]]

In [263]:
np.reshape?

In [264]:
scaler_X[3]

StandardScaler(copy=True, with_mean=True, with_std=True)

In [265]:
for i in range(len(X_predict)):
    row = X_predict.iloc[[i]]      # keep the row as a 1-row DataFrame so the scaler gets 2-D input
    hour = row['hr'].iloc[0]
    # Pick the model for the row's hour segment: [0, 4), [4, 9), [9, 20), [20, 24)
    if hour < 4:
        index = 0
    elif hour < 9:
        index = 1
    elif hour < 20:
        index = 2
    else:
        index = 3

    test_scaled = scaler_X[index].transform(row)
    y_pred = lin[index].predict(test_scaled)
    # Convert the scaled prediction back to a rider count
    print(scaler_y[index].inverse_transform(y_pred))

[[2.23189396]]
[[7.67067278]]
[[8.13790264]]
[[46.52242676]]
[[44.30917557]]
[[75.56469239]]
[[55.56915225]]
[[24.7837527]]
[[10.74727565]]
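The predictions printed above can be compared with the `casual` column of `hour_Test.csv` shown earlier. Below is a minimal sketch that computes the mean absolute error, with both arrays transcribed by hand from the outputs in this notebook; the variable names are illustrative.

```python
import numpy as np

# Actual 'casual' values from the hour_Test.csv table above.
actual_casual = np.array([2, 8, 0, 39, 61, 56, 51, 11, 3], dtype=float)

# Per-segment linear regression predictions printed above.
predicted_casual = np.array([2.23189396, 7.67067278, 8.13790264,
                             46.52242676, 44.30917557, 75.56469239,
                             55.56915225, 24.7837527, 10.74727565])

abs_err = np.abs(predicted_casual - actual_casual)
mae = abs_err.mean()              # average miss, in riders
worst = int(abs_err.argmax())     # row index with the largest miss

print(f"MAE: {mae:.2f} riders")
print(f"Largest error at row {worst}: {abs_err[worst]:.2f} riders")
```

Nine rows are too few to draw conclusions from, but reporting the error in riders makes it easier to relate to the scaled MSE values from Section 5.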


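# Batch version of the loop above: scaler_X, scaler_y and lin are lists with one entry per
# hour segment, so route each row to its segment (0: hr<4, 1: hr<9, 2: hr<20, 3: otherwise).
# Assumes sgd also holds one fitted SGDRegressor per segment, like lin.
segment = np.digitize(X_predict['hr'], [4, 9, 20])
for k in np.unique(segment):
    X_scaled_predict = scaler_X[k].transform(X_predict[segment == k])

    y_pred1_casual_lin = lin[k].predict(X_scaled_predict)
    print(scaler_y[k].inverse_transform(y_pred1_casual_lin))

    y_pred1_casual_sgd = sgd[k].predict(X_scaled_predict)
    # SGDRegressor returns a 1-D vector; reshape before undoing the scaling
    print(scaler_y[k].inverse_transform(y_pred1_casual_sgd.reshape(-1, 1)))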

X_predict = df_predict.loc[:,regd_columns]
#X_predict.drop(['yr'],axis=1,inplace=True)
#X_predict.drop(['yr','season','holiday','weekday','weathersit'],axis=1,inplace=True)

X_scaled_predict = scaler_X_regd.transform(X_predict)

y_pred1_regd_lin = lin_regd.predict(X_scaled_predict)

print(scaler_y_regd.inverse_transform(y_pred1_regd_lin))
#print(y_pred1_regd_lin)

## Section 7: Observations and analysis

Answer the following questions:
1. How did you measure the algorithm? Specify the metrics you used.

2. What is the outcome of the measurement? Explain the interpretation of the metrics.

  - Is there overfitting or underfitting?
  - Is there low accuracy or high error? If so, why do you think this is the case?

3. What improvements do you propose? 

4. What is the most challenging part of this project?
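
One possible way to approach the overfitting/underfitting question is to compare training and validation scores at increasing training-set sizes with `learning_curve`, which is imported at the top of this workbook. The sketch below runs on synthetic data from `make_regression` (an assumption, used only to keep the example self-contained); in the project, the scaled features and target of one hour segment would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for one segment's features and target.
X_demo, y_demo = make_regression(n_samples=500, n_features=5,
                                 noise=20.0, random_state=0)

# 5-fold cross-validated R2 at five training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="r2")

# Average over the folds and print one line per training-set size.
for n, tr, va in zip(train_sizes,
                     train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={int(n):4d}  train R2={tr:.3f}  validation R2={va:.3f}")
```

As a rough guide, a training score well above the validation score suggests overfitting, while two similar but low scores suggest underfitting, for example because the features or a linear model cannot capture the pattern.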