# 장애인콜택시 대기시간 예측
## 단계3. 모델링

## 0.미션

* 1.시계열 데이터에 맞는 전처리
    * 데이터가 2015 ~ 2022년까지 데이터 입니다.
    * 이 중 2022년 10월 ~ 2022년 12월까지(3개월)의 데이터는 성능 검증용으로 사용합니다.
    * 나머지 데이터를 이용하여 학습 및 튜닝을 수행합니다.
    * 이를 위한 데이터 전처리를 수행하시오.
* 2.모델 최적화
    * 머신러닝 
        * 알고리즘 중 3가지 이상을 선정하여 모델링을 수행합니다.
        * 각각 알고리즘에 맞는 튜닝 방법으로 최적의 모델을 만듭니다.
    * 딥러닝
        * 모델 구조를 2가지 이상을 생성하고 모델링을 수행합니다.
        * epochs, learning_rate 등을 조절하며 성능을 튜닝합니다.
    * 성능 평가
        * 시계열 그래프로 모델의 실제값과 예측값을 시각화하여 비교합니다.
        * 성능 지표 (MAE, MAPE) 를 사용해 가장 성능이 높은 모델을 선정합니다.
        * 성능 가이드
            * MAE : 4 ~ 6
            * MAPE : 0.09~0.14

## 1.환경설정

* 세부 요구사항
    - 경로 설정 : 다음의 두가지 방법 중 하나를 선택하여 폴더를 준비하고 데이터를 로딩하시오.
        * 1) 로컬 수행(Ananconda)
            * 제공된 압축파일을 다운받아 압축을 풀고
            * anaconda의 root directory(보통 C:/Users/< ID > 에 project 폴더를 만들고, 복사해 넣습니다.
        * 2) 구글콜랩
            * 구글 드라이브 바로 밑에 project 폴더를 만들고, 
            * 데이터 파일을 복사해 넣습니다.
    - 라이브러리 설치 및 로딩
        * requirements.txt 파일로 부터 라이브러리 설치
    - 기본적으로 필요한 라이브러리를 import 하도록 코드가 작성되어 있습니다. 
        * 필요하다고 판단되는 라이브러리를 추가하세요.

### (1) 경로 설정

#### 1) 로컬 수행(Anaconda)
* project 폴더에 필요한 파일들을 넣고, 본 파일을 열었다면, 별도 경로 지정이 필요하지 않습니다.

In [None]:
path = ''

#### 2) 구글 콜랩 수행

* 구글 드라이브 연결

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = '/content/drive/MyDrive/project/'

### (2) 라이브러리 설치 및 불러오기

#### 1) 설치

* requirements.txt 파일을 아래 위치에 두고 다음 코드를 실행하시오.
    * 로컬 : 다음 코드셀 실행
    * 구글콜랩 : requirements.txt 파일을 왼쪽 [파일]탭에 복사해 넣고 다음 코드셀 실행

In [None]:
!pip install -r requirements.txt

#### 2) 라이브러리 로딩

* **세부 요구사항**
    - 기본적으로 필요한 라이브러리를 import 하도록 코드가 작성되어 있습니다.
    - 필요하다고 판단되는 라이브러리를 추가하세요.

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import joblib

# 필요하다고 판단되는 라이브러리를 추가하세요.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score as recall


### (3) 데이터 불러오기
* 주어진 데이터셋
    * [2.탐색적 데이터분석] 단계에서 저장한 파일 : data2.pkl

In [22]:
file1 = 'data2.pkl'

In [23]:
data = joblib.load(file1)

In [24]:
data = data.reset_index() # 인덱스 취소
data.head()

Unnamed: 0,Date,car_cnt,request_cnt,ride_cnt,fare,distance,weekday,month,week,year,...,temp_min,rain(mm),humidity_max(%),humidity_min(%),sunshine(MJ/m2),season,Holiday_Name,7day_waiting_time_mean,30day_waiting_time_mean,ride_percentage
0,2015-01-01,213,1023,924,2427,10764,Thr,1,1,2015,...,-9.8,0.0,52.0,33.0,9.79,겨울,New year,17.2,17.2,90.322581
1,2015-01-02,420,3158,2839,2216,8611,Fri,1,1,2015,...,-8.9,0.0,63.0,28.0,9.07,겨울,0,21.7,21.7,89.89867
2,2015-01-03,209,1648,1514,2377,10198,Sat,1,1,2015,...,-9.2,0.0,73.0,37.0,8.66,겨울,0,22.633333,22.633333,91.868932
3,2015-01-04,196,1646,1526,2431,10955,Sun,1,1,2015,...,0.2,0.0,89.0,58.0,5.32,겨울,0,23.525,23.525,92.709599
4,2015-01-05,421,4250,3730,2214,8663,Mon,1,2,2015,...,-0.9,0.0,95.0,52.0,6.48,겨울,0,23.54,23.54,87.764706


## 2.데이터 준비
* **세부요구사항**
    * NaN에 대한 조치를 수행하시오.
        * rolling 혹은 shift로 발생된 초기 행의 NaN은 삭제해도 무방합니다.
    * 가변수화 : 범주형에 대해서 가변수화를 수행합니다.
    * 데이터분할
        * 시계열 데이터 특성에 맞게 분할합니다.
        * 마지막 91일(3개월) 데이터를 검증셋으로 사용합니다.

### (1) NA 조치

In [25]:
data.isna().sum()

Date                       0
car_cnt                    0
request_cnt                0
ride_cnt                   0
fare                       0
distance                   0
weekday                    0
month                      0
week                       0
year                       0
target                     0
temp_max                   0
temp_min                   0
rain(mm)                   0
humidity_max(%)            0
humidity_min(%)            0
sunshine(MJ/m2)            0
season                     0
Holiday_Name               0
7day_waiting_time_mean     0
30day_waiting_time_mean    0
ride_percentage            0
dtype: int64

In [26]:
data.columns

Index(['Date', 'car_cnt', 'request_cnt', 'ride_cnt', 'fare', 'distance',
       'weekday', 'month', 'week', 'year', 'target', 'temp_max', 'temp_min',
       'rain(mm)', 'humidity_max(%)', 'humidity_min(%)', 'sunshine(MJ/m2)',
       'season', 'Holiday_Name', '7day_waiting_time_mean',
       '30day_waiting_time_mean', 'ride_percentage'],
      dtype='object')

### (2) 가변수화

In [27]:
# 가변수 대상 변수 식별  
dumm_cols = ['weekday', 'season', 'Holiday_Name']

# 가변수화
data = pd.get_dummies(data, columns=dumm_cols, drop_first=True)

# 확인
data.head()

Unnamed: 0,Date,car_cnt,request_cnt,ride_cnt,fare,distance,month,week,year,target,...,Holiday_Name_Children's Day,Holiday_Name_Christmas Day,Holiday_Name_Hangul Day,Holiday_Name_Independence Day,Holiday_Name_Korean New Year's Day,Holiday_Name_Liberation Day,Holiday_Name_Memorial Day,Holiday_Name_Midautumn Festival,Holiday_Name_National Foundation Day,Holiday_Name_New year
0,2015-01-01,213,1023,924,2427,10764,1,1,2015,17.2,...,0,0,0,0,0,0,0,0,0,1
1,2015-01-02,420,3158,2839,2216,8611,1,1,2015,26.2,...,0,0,0,0,0,0,0,0,0,0
2,2015-01-03,209,1648,1514,2377,10198,1,1,2015,24.5,...,0,0,0,0,0,0,0,0,0,0
3,2015-01-04,196,1646,1526,2431,10955,1,1,2015,26.2,...,0,0,0,0,0,0,0,0,0,0
4,2015-01-05,421,4250,3730,2214,8663,1,2,2015,23.6,...,0,0,0,0,0,0,0,0,0,0


### (3) 데이터분할
* **세부요구사항**
    * 마지막 91일 간의 데이터를 검증 셋으로 만듭니다. (2022-10-01 ~ )
    * 이 기간의 날짜 리스트를 별도로 저장하여, 모델 검증시 시각화할 때 활용합니다.

In [28]:
# 2022-10-01부터 마지막 91일 간의 데이터 추출
start_date = '2022-10-01'
modeling_data = data[data['Date'] >= start_date]

modeling_data = modeling_data.set_index(keys='Date') # 'Date' 인덱스 처리
modeling_data

Unnamed: 0_level_0,car_cnt,request_cnt,ride_cnt,fare,distance,month,week,year,target,temp_max,...,Holiday_Name_Children's Day,Holiday_Name_Christmas Day,Holiday_Name_Hangul Day,Holiday_Name_Independence Day,Holiday_Name_Korean New Year's Day,Holiday_Name_Liberation Day,Holiday_Name_Memorial Day,Holiday_Name_Midautumn Festival,Holiday_Name_National Foundation Day,Holiday_Name_New year
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2022-10-01,345,2528,2037,2487,10845,10,39,2022,36.4,27.5,...,0,0,0,0,0,0,0,0,0,0
2022-10-02,249,1935,1631,2495,10803,10,39,2022,24.9,21.4,...,0,0,0,0,0,0,0,0,0,0
2022-10-03,267,1707,1374,2367,9868,10,40,2022,41.0,23.3,...,0,0,0,0,0,0,0,0,1,0
2022-10-04,650,5923,4968,2218,8345,10,40,2022,48.4,23.0,...,0,0,0,0,0,0,0,0,0,0
2022-10-05,638,5916,4935,2214,8355,10,40,2022,46.5,21.3,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-27,669,5635,4654,2198,8178,12,52,2022,44.8,3.0,...,0,0,0,0,0,0,0,0,0,0
2022-12-28,607,5654,4648,2161,7882,12,52,2022,52.5,-0.3,...,0,0,0,0,0,0,0,0,0,0
2022-12-29,581,5250,4247,2229,8433,12,52,2022,38.3,1.7,...,0,0,0,0,0,0,0,0,0,0
2022-12-30,600,5293,4200,2183,8155,12,52,2022,33.7,2.1,...,0,0,0,0,0,0,0,0,0,0


In [29]:
data = data.set_index(keys='Date') # 'Date' 인덱스 처리
data.head()

Unnamed: 0_level_0,car_cnt,request_cnt,ride_cnt,fare,distance,month,week,year,target,temp_max,...,Holiday_Name_Children's Day,Holiday_Name_Christmas Day,Holiday_Name_Hangul Day,Holiday_Name_Independence Day,Holiday_Name_Korean New Year's Day,Holiday_Name_Liberation Day,Holiday_Name_Memorial Day,Holiday_Name_Midautumn Festival,Holiday_Name_National Foundation Day,Holiday_Name_New year
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2015-01-01,213,1023,924,2427,10764,1,1,2015,17.2,-4.3,...,0,0,0,0,0,0,0,0,0,1
2015-01-02,420,3158,2839,2216,8611,1,1,2015,26.2,-2.0,...,0,0,0,0,0,0,0,0,0,0
2015-01-03,209,1648,1514,2377,10198,1,1,2015,24.5,2.4,...,0,0,0,0,0,0,0,0,0,0
2015-01-04,196,1646,1526,2431,10955,1,1,2015,26.2,8.2,...,0,0,0,0,0,0,0,0,0,0
2015-01-05,421,4250,3730,2214,8663,1,2,2015,23.6,7.9,...,0,0,0,0,0,0,0,0,0,0


In [30]:
target = 'target'

#### 1) x, y 나누기

In [31]:
x = data.drop(columns=[target], axis=1)
y = data.loc[:, target]

#### 2) train : validation 나누기
* 힌트 : train_test_split(  ,   ,  test_size = 91, shuffle = False) 

In [32]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 91, shuffle=False)

In [33]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2832 entries, 2015-01-01 to 2022-10-01
Data columns (total 37 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   car_cnt                               2832 non-null   int64  
 1   request_cnt                           2832 non-null   int64  
 2   ride_cnt                              2832 non-null   int64  
 3   fare                                  2832 non-null   int64  
 4   distance                              2832 non-null   int64  
 5   month                                 2832 non-null   int64  
 6   week                                  2832 non-null   UInt32 
 7   year                                  2832 non-null   int64  
 8   temp_max                              2832 non-null   float64
 9   temp_min                              2832 non-null   float64
 10  rain(mm)                              2832 non-null   float64
 11 

### (4) Scaling
* KNN, SVM 알고리즘 및 DL을 적용하기 위해서는 스케일링을 해야 합니다.

In [34]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)

In [35]:
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (2832, 37)
x_test shape: (91, 37)


## 3.모델링
* **세부요구사항**
    * 머신러닝 알고리즘 중 3가지 이상을 사용하여 모델을 만들고 튜닝을 수행합니다.
    * 딥러닝 모델 구조 2가지 이상을 설계하고 모델을 생성합니다.
    * 성능 측정은 MAE, MAPE로 수행합니다.
    * 모델링 후 실제값과 예측값을 시각화(라인차트)하여 분석합니다.

### (1) 머신러닝

#### 1) 모델1 - RandomForestRegressor

In [36]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score

model_rf = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=1)
model_rf.fit(x_train, y_train)
ypred_rf = model_rf.predict(x_test)

print('MAE : ', mean_absolute_error(ypred_rf, y_test))
print('MAPE : ', mean_absolute_percentage_error(ypred_rf, y_test))
print('R2 : ', r2_score(ypred_rf, y_test))

MAE :  8.361947078959876
MAPE :  0.24323751309336245
R2 :  0.14743525774838107


#### 2) 모델2 - LinearRegression

In [37]:
model_LR = LinearRegression()
model_LR.fit(x_train, y_train)
ypred_LR = model_LR.predict(x_test)

print('MSE:', mse(y_test, ypred_LR))
print('R2:', r2_score(y_test, ypred_LR))

MSE: 122.29280468638738
R2: -1.1354911673195347


#### 3) 모델3 - GradientBoostingRegressor

In [38]:
model_gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=1)
model_gb.fit(x_train, y_train)
y_pred_gb = model_gb.predict(x_test)

print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred_gb)))
print('R2:', r2_score(y_test, y_pred_gb))

RMSE: 10.51070797355679
R2: -0.9291269760379053


#### 4) 모델4 - DecisionTreeRegressor

In [39]:
model_DT = DecisionTreeRegressor(max_depth=5, random_state=1)
model_DT.fit(x_train, y_train)
y_pred_DT = model_DT.predict(x_test)

print('MAE:', mae(y_test, y_pred_DT))
print('R2:', r2_score(y_test, y_pred_DT))

MAE: 8.925627358543206
R2: -1.0489067161650816


### (2) 딥러닝

#### 1) 모델1

In [44]:
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Input, Flatten, Dropout

keras.backend.clear_session()

model1 = Sequential()
model1.add(Input(shape=(37,)))
model1.add(Dense(512, activation='relu'))
model1.add(Dropout(0.2))
model1.add(Dense(256, activation='relu'))
model1.add(Dropout(0.2))
model1.add(Dense(1, activation='linear'))

model1.compile(loss='mae', optimizer='adam', metrics=['MAE'])
model1.summary()

from keras.callbacks import EarlyStopping

es = EarlyStopping(monitor='val_loss',
                   min_delta=0,
                   patience=5,
                   verbose=1,
                   restore_best_weights=True)

model1.fit(x_train, y_train, validation_split=0.2, epochs=30, callbacks=[es], verbose=1, batch_size=32)

pred1 = model1.predict(x_test)
print('MAE : ', mean_absolute_error(pred1, y_test))
print('MAPE : ', mean_absolute_percentage_error(pred1, y_test))
print('R2 : ', r2_score(pred1, y_test))


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 512)               19456     
                                                                 
 dropout (Dropout)           (None, 512)               0         
                                                                 
 dense_1 (Dense)             (None, 256)               131328    
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 257       
                                                                 
Total params: 151041 (590.00 KB)
Trainable params: 151041 (590.00 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/30
Epoch

#### 2) 모델2  - GRU

In [64]:
# Assuming you have correctly reshaped your data as follows
X_train = x_train.reshape(-1, 1, 37)
X_test = x_test.reshape(-1, 1, 37)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

model3 = Sequential()
model3.add(LSTM(32, input_shape=(1, 37)))  # Update input shape to match your reshaped data
model3.add(Dense(1))
model3.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
model3.fit(X_train, y_train, epochs=30, batch_size=32)  # Use X_train and y_train
loss_acc = model3.evaluate(X_train, y_train)
loss_test = model3.evaluate(X_test, y_test)  # Use X_test and y_test
print(loss_acc)
print(loss_test)

pred2 = model2.predict(x_test)
print('MAE : ', mean_absolute_error(pred2, y_test))
print('MAPE : ', mean_absolute_percentage_error(pred2, y_test))
print('R2 : ', r2_score(pred2, y_test))

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
[112.1762924194336, 7.144388675689697]
[127.44291687011719, 9.486318588256836]
MAE :  10.51224905206607
MAPE :  1.435467605984939
R2 :  0.008241705833980872


In [65]:
# Assuming you have correctly reshaped your data as follows
x_train = x_train.reshape(-1, 1, 37)
x_test = x_test.reshape(-1, 1, 37)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

model3 = Sequential()
model3.add(LSTM(32, input_shape=(1, 37)))
model3.add(Dense(1))
model3.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
model3.fit(x_train, y_train, epochs=30, batch_size=32)
loss_acc = model3.evaluate(x_train, y_train)
loss_test = model3.evaluate(x_test, y_test) 
print(loss_acc)
print(loss_test)

pred3 = model3.predict(x_test)
print('MAE : ', mean_absolute_error(pred3, y_test))
print('MAPE : ', mean_absolute_percentage_error(pred3, y_test))
print('R2 : ', r2_score(pred3, y_test))

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
[125.77813720703125, 7.591837406158447]
[141.4839324951172, 9.936774253845215]
MAE :  9.936773794823951
MAPE :  0.36725753695407876
R2 :  -0.042870447883183616


## 4.모델 비교
* **세부요구사항**
    * 모델링 단계에서 생성한 모든 모델의 성능을 하나로 모아서 비교합니다.
    * 가장 성능이 높은 모델을 선정합니다.