#### Case Study
1. Analyze the _bike sharing_ data in `bike-sharing.csv`; the target column is `count`.
2. Use the following models as a first experiment:
 * Linear regression
 * SVM with a linear kernel
 * MLP with only 1 _hidden layer_
3. Evaluate the three models.
4. If needed, perform _feature selection_ before fitting each method.
#### Additional Task
5. Serialize the best model (with Pickle or Joblib) and save it in the `/model` folder.
6. Create a second Jupyter notebook that imports the saved model and runs predictions (the `X_test` data from the first notebook can be reused).
 * If possible, the second notebook can be replaced by a simple backend/app built with Flask, FastAPI, or Streamlit.
 * If that is not possible, it can be considered as a final project instead.


Analyze the bike sharing data in bike-sharing.csv; the target column is count.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('../../data/input/bikesharing_data.csv')
data.head(10)

,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1
5,2011-01-01 05:00:00,1,0,0,2,9.84,12.88,75,6.0032,0,1,1
6,2011-01-01 06:00:00,1,0,0,1,9.02,13.635,80,0.0,2,0,2
7,2011-01-01 07:00:00,1,0,0,1,8.2,12.88,86,0.0,1,2,3
8,2011-01-01 08:00:00,1,0,0,1,9.84,14.395,75,0.0,1,7,8
9,2011-01-01 09:00:00,1,0,0,1,13.12,17.425,76,0.0,8,6,14


In [3]:
data.drop('datetime', axis=1, inplace=True)

In [4]:
data.head(10)

,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,1,0,0,1,9.84,14.395,75,0.0,0,1,1
5,1,0,0,2,9.84,12.88,75,6.0032,0,1,1
6,1,0,0,1,9.02,13.635,80,0.0,2,0,2
7,1,0,0,1,8.2,12.88,86,0.0,1,2,3
8,1,0,0,1,9.84,14.395,75,0.0,1,7,8
9,1,0,0,1,13.12,17.425,76,0.0,8,6,14


In [5]:
target = 'count'
X = data.drop(target, axis=1)
y = data[target]

Use the following models as a first experiment:

Linear regression

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
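
The split above draws a different random partition on every run, so the scores reported later will vary slightly between executions. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable; the toy frame and the value `random_state=42` are illustrative and not from the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the bike sharing features and target (hypothetical values).
X_demo = pd.DataFrame({"temp": np.arange(20.0), "humidity": np.arange(20.0)[::-1]})
y_demo = pd.Series(np.arange(20), name="count")

# Fixing random_state makes the 80/20 partition identical on every run.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Compare the row indices of the two test sets.
same_rows = split_a[1].index.equals(split_b[1].index)
print(len(split_a[0]), len(split_a[1]), same_rows)
```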

In [7]:
from sklearn.linear_model import LinearRegression

clf1 = LinearRegression()
clf1.fit(X_train, y_train)

LinearRegression()

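Supervised learning (SL) covers regression, which predicts numerical (continuous) values, and classification, which predicts discrete (categorical) labels such as 0 or 1. The target count is numerical, so this is a regression task.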

SVM with a linear kernel

In [8]:
from sklearn.svm import SVR

clf2 = SVR(kernel='linear')
clf2.fit(X_train, y_train)

SVR(kernel='linear')

MLP with only 1 hidden layer

In [9]:
from sklearn.neural_network import MLPRegressor

# (100) is the int 100, not a tuple; scikit-learn still builds one hidden layer of 100 units
clf3 = MLPRegressor(activation='tanh', hidden_layer_sizes=(100), max_iter=1000)
clf3.fit(X_train, y_train)

MLPRegressor(activation='tanh', hidden_layer_sizes=100, max_iter=1000)
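
SVR and MLP are both sensitive to feature scale, and the columns here range from 0/1 flags to counts in the hundreds. One common option is to standardize inside a pipeline so the scaler is fit on the training split only. A hedged sketch on synthetic data; the feature ranges, the seed, and the names `svr_pipe` and `mlp_pipe` are assumptions for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (hypothetical ranges).
X_demo = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 500, 200)])
y_demo = 10 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 1, 200)

# StandardScaler is fit only on the data passed to fit(), then reused in predict().
svr_pipe = make_pipeline(StandardScaler(), SVR(kernel="linear"))
mlp_pipe = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000, random_state=0),
)

# Train on the first 160 rows, predict on the remaining 40.
svr_pipe.fit(X_demo[:160], y_demo[:160])
mlp_pipe.fit(X_demo[:160], y_demo[:160])

svr_pred = svr_pipe.predict(X_demo[160:])
mlp_pred = mlp_pipe.predict(X_demo[160:])
print(svr_pred.shape, mlp_pred.shape)
```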

Evaluate the three models.

In [10]:
clf1.score(X_test, y_test)  # R^2 on the test set; highest of the three models

1.0

In [11]:
clf2.score(X_test, y_test)

0.9999998271471009

In [12]:
clf3.score(X_test, y_test)

0.9997051896183984
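
An R² of exactly 1.0 is usually a sign that the target is recoverable from the features rather than a sign of a good model. In every row of the `head(10)` output above, `count` equals `casual + registered`, and both columns are still in `X`. The sketch below checks this on those ten rows; whether it holds for the whole file is an assumption that would need to be verified on the full data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows printed by data.head(10) above (only the relevant columns).
head = pd.DataFrame({
    "casual":     [3, 8, 5, 3, 0, 0, 2, 1, 1, 8],
    "registered": [13, 32, 27, 10, 1, 1, 0, 2, 7, 6],
    "count":      [16, 40, 32, 13, 1, 1, 2, 3, 8, 14],
})

# Does the target equal the sum of the two user-type columns?
leaks = (head["casual"] + head["registered"] == head["count"]).all()
print("count == casual + registered:", leaks)

# A linear model given both columns can reproduce the target exactly.
probe = LinearRegression().fit(head[["casual", "registered"]], head["count"])
print(probe.coef_.round(6), round(float(probe.intercept_), 6))
```

If the relation holds on the full data, dropping `casual` and `registered` from `X` before splitting would give a more honest comparison of the three models.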

In [13]:
from sklearn.metrics import mean_absolute_error

In [14]:
mean_absolute_error(y_test, clf1.predict(X_test))

1.0292363839073554e-13

In [15]:
mean_absolute_error(y_test, clf2.predict(X_test))  # second-lowest MAE

0.07038888942150909

In [16]:
mean_absolute_error(y_test, clf3.predict(X_test))

1.2586591535276377

In [17]:
from sklearn.metrics import mean_squared_error

In [18]:
mean_squared_error(y_test, clf1.predict(X_test))

2.0672186636876722e-26

In [19]:
mean_squared_error(y_test, clf2.predict(X_test))  # second-lowest MSE

0.0055909600144281296

In [20]:
mean_squared_error(y_test, clf3.predict(X_test))

9.535698064979641
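
The nine evaluation cells above can be collapsed into one comparison table. A minimal sketch, assuming the fitted regressors are collected in a dict; the helper name `evaluate_models` and the toy data are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.svm import SVR


def evaluate_models(models, X_test, y_test):
    """Return one row per fitted model with R^2, MAE and MSE on the test set."""
    rows = []
    for name, model in models.items():
        pred = model.predict(X_test)
        rows.append({
            "model": name,
            "r2": r2_score(y_test, pred),
            "mae": mean_absolute_error(y_test, pred),
            "mse": mean_squared_error(y_test, pred),
        })
    return pd.DataFrame(rows).set_index("model")


# Toy data (hypothetical): a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)

models = {
    "linear": LinearRegression().fit(X_demo[:80], y_demo[:80]),
    "svr_linear": SVR(kernel="linear").fit(X_demo[:80], y_demo[:80]),
}
report = evaluate_models(models, X_demo[80:], y_demo[80:])
print(report)
```

In the notebook, the same helper could be called with `{"linear": clf1, "svr": clf2, "mlp": clf3}` together with `X_test` and `y_test`.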

If needed, perform feature selection before fitting each method.

In [21]:
from sklearn.feature_selection import chi2
# chi2 is a classification score: each distinct value of y is treated as a class label
chi2(X,y)

(array([6.10371058e+02, 6.81355480e+02, 3.20135759e+02, 2.34265530e+02,
        8.17418862e+03, 8.15250147e+03, 1.22860675e+04, 5.28275691e+03,
        4.62518232e+05, 1.52880232e+06]),
 array([0.99999999, 0.99986728, 1.        , 1.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]))

In [22]:
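from sklearn.feature_selection import mutual_info_classif
# mutual_info_classif treats y as discrete class labels; mutual_info_regression is the variant for a continuous target
mutual_info_classif(X,y)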

array([0.13741393, 0.00861652, 0.31843165, 0.46509974, 0.13249691,
       0.13009019, 0.0899838 , 0.02719872, 0.68229086, 2.19872992])
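
`chi2` and `mutual_info_classif` are classification scorers, so they treat each distinct `count` value as a class. For a continuous target, scikit-learn also provides `f_regression` and `mutual_info_regression`. A sketch on synthetic data where only the first feature is informative; the data and seed are assumptions:

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
# Only column 0 drives the target; the other three columns are noise.
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 0.1, 300)

# Univariate F-test and mutual information, both for a continuous target.
f_scores, p_values = f_regression(X_demo, y_demo)
mi_scores = mutual_info_regression(X_demo, y_demo, random_state=0)

print("F-scores:", f_scores.round(2))
print("MI scores:", mi_scores.round(3))
print("top feature by each score:", int(np.argmax(f_scores)), int(np.argmax(mi_scores)))
```

Either function can be passed to `SelectKBest` in place of `mutual_info_classif`.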

In [23]:
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(mutual_info_classif,k=3)
selector.fit(X_train, y_train)
X_new = selector.transform(X_train)
X_new.shape

(8708, 3)

In [24]:
X_train.shape

(8708, 10)

In [25]:
selector.get_support()

array([False, False, False,  True, False, False, False, False,  True,
        True])
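
The boolean mask is easier to read once it is mapped back to column names. The sketch below does that for the mask printed above, assuming `X` holds the ten named feature columns in the order of the `head(10)` output (the leading column being the row index); in the notebook itself, `X_train.columns[selector.get_support()]` would do the same:

```python
import numpy as np

# Feature order as shown in data.head(10), minus datetime and count (assumed order of X).
feature_names = np.array([
    "season", "holiday", "workingday", "weather", "temp",
    "atemp", "humidity", "windspeed", "casual", "registered",
])

# Mask printed by selector.get_support() above.
mask = np.array([False, False, False, True, False,
                 False, False, False, True, True])

# Boolean indexing keeps only the names where the mask is True.
selected = feature_names[mask].tolist()
print(selected)
```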

In [42]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()  # treats each count value as a class; DecisionTreeRegressor is the regression counterpart
selector1 = RFE(clf,n_features_to_select=3, step=1)
selector1.fit(X_train, y_train)
X_new1 = selector1.transform(X_train)
X_new1.shape

(8708, 3)

In [43]:
X_new1

array([[ 46., 134., 311.],
       [ 44.,  31., 191.],
       [ 69.,  66., 128.],
       ...,
       [ 67.,  16., 303.],
       [ 53.,  51., 209.],
       [ 83.,   1.,  10.]])

In [44]:
X_train.shape

(8708, 10)

In [45]:
selector1.get_support()  # mask of the features kept by RFE

array([False, False, False, False, False, False,  True, False,  True,
        True])
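
Since `count` is a continuous target, RFE can also be driven by a regressor. A hedged sketch that swaps in `DecisionTreeRegressor` on synthetic data; the data, the seed, and `n_features_to_select=2` are assumptions for illustration, not the notebook's settings:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
# Columns 0 and 1 are informative; columns 2-4 are pure noise.
y_demo = 5 * X_demo[:, 0] + 3 * X_demo[:, 1] + rng.normal(0, 0.1, 300)

# RFE refits the tree and drops the least important feature one step at a time.
rfe = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=2, step=1)
rfe.fit(X_demo, y_demo)

support = rfe.get_support()
reduced = rfe.transform(X_demo)
print(support)
print(reduced.shape)
```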

Serialize the best model (with Pickle or Joblib) and save it in the /model folder.

In [46]:
from sklearn.svm import SVR

# define (note: this fits a new RBF-kernel SVR, not one of the three models evaluated above)
rgr_svr = SVR(kernel='rbf')

#fit
rgr_svr.fit(X_train, y_train)

#predict
rgr_svr.predict(X_test)

array([341.07552414,   9.46763484,   4.71582277, ..., 131.8214766 ,
        72.56025214,  33.62427045])

In [47]:
import joblib

with open('model_svm.pkl', 'wb') as file:
    joblib.dump(rgr_svr, file)
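
The task asks for the model to be stored in a `model` folder, and a second notebook then only needs the load-and-predict half. A minimal sketch with `joblib`, which also accepts a path directly; the folder is created relative to the working directory, and the stand-in model and toy data are assumptions:

```python
from pathlib import Path

import joblib
import numpy as np
from sklearn.svm import SVR

# Stand-in for the fitted model from the cell above (toy data, hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo.sum(axis=1)
model = SVR(kernel="rbf").fit(X_demo, y_demo)

# Save: create the folder if needed, then dump to a path inside it.
model_dir = Path("model")
model_dir.mkdir(exist_ok=True)
model_path = model_dir / "model_svm.pkl"
joblib.dump(model, model_path)

# Load (what a second notebook would run) and predict.
restored = joblib.load(model_path)
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(model_path, same)
```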

Create a second Jupyter notebook that imports the saved model and runs predictions (the X_test data from the first notebook can be reused). If possible, the second notebook can be replaced by a simple backend/app built with Flask, FastAPI, or Streamlit. If that is not possible, it can be considered as a final project instead.
The second notebook is in a separate file.