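## Goal

Predict the number of tourists in April using the model listed below.

The target is a tourist-count level with five classes, A, B, C, D, E, obtained by dividing the counts into ranges.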
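The notebook does not say how the ranges behind the levels A to E were chosen. As a minimal sketch only, the block below bins a made-up series of counts into five equal-frequency levels with `pd.qcut`. The values and the choice of quantile bins are assumptions for illustration, not the method used to build `ST_level`.

```python
import pandas as pd

# Made-up tourist counts; the real 'ST' column is not reproduced here.
counts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Split into five equal-frequency bins and label them A to E.
# Quantile binning is an assumption; fixed edges via pd.cut would also work.
levels = pd.qcut(counts, q=5, labels=list("ABCDE"))

print(levels.value_counts().sort_index())
```

With fixed, domain-chosen thresholds, `pd.cut` with explicit bin edges would be the closer match.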

**Model used**

- Logistic Regression

## Data Preprocessing

### Loading the data

In [488]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [522]:
timeseries_data = pd.read_excel('seoul_merge_clean_monthly_yesapril_new2.xlsx')
# test_data = pd.read_excel('seoul_merge_clean_monthly_yesapril.xlsx')
# Data through March
timeseries_data = timeseries_data.dropna(axis=0)

In [523]:
timeseries_data.head()

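| | Date | Day | ST | rain_mean | temper_mean | event_total | dust | warninglevel | season | ST_level | rain_level | temper_level | monthly_predict | daily_predict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2009-01-02 | 금 | 2115.1 | 0.0 | -3.3 | 1 | 좋음 | 없음 | 겨울 | A | 안 옴 | 추움 | 73966 | 5857.0 |
| 2 | 2009-01-03 | 토 | 2430.2 | 0.0 | -1.7 | 1 | 좋음 | 없음 | 겨울 | A | 안 옴 | 추움 | 73966 | 3660.0 |
| 3 | 2009-01-04 | 일 | 2109.2 | 0.0 | -0.2 | 1 | 좋음 | 없음 | 겨울 | A | 안 옴 | 추움 | 73966 | 3757.0 |
| 4 | 2009-01-05 | 월 | 2176.666667 | 0.0 | -2.0 | 1 | 좋음 | 없음 | 겨울 | A | 안 옴 | 추움 | 73966 | 3342.0 |
| 5 | 2009-01-06 | 화 | 6568.0 | 0.0 | -2.9 | 1 | 좋음 | 없음 | 겨울 | A | 안 옴 | 추움 | 73966 | 3246.0 |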


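### Converting categorical variables to dummy variables

> Reason: the inputs to a regression equation must be numeric, so each categorical variable is converted into numeric indicator columns.
>
> Variables:
- 'Day'
- 'dust'
- 'warninglevel'
- 'season'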
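To make the conversion concrete, here is a minimal sketch of dummy encoding on a toy frame. The column name `Day` and its values mirror the notebook, but the rows are made up. The `dtype=int` argument is an addition for illustration, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Toy frame; the rows are made up for illustration.
toy = pd.DataFrame({"Day": ["월", "화", "월"], "rain_mean": [0.0, 1.2, 0.0]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
day_dummies = pd.get_dummies(toy["Day"], prefix="Day", dtype=int)

# Attach the indicators and drop the original text column.
encoded = pd.concat([toy.drop(columns="Day"), day_dummies], axis=1)

print(encoded)
```

Each row ends up with exactly one `1` among the `Day_` columns, which is the form the regression can consume.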

In [524]:
dummies_Day = pd.get_dummies(timeseries_data['Day'])
dummies_dust = pd.get_dummies(timeseries_data['dust'])
dummies_warninglevel = pd.get_dummies(timeseries_data['warninglevel'])
dummies_season = pd.get_dummies(timeseries_data['season'])

In [525]:
timeseries_data = timeseries_data.join(dummies_Day.add_prefix('Day_'))
timeseries_data = timeseries_data.join(dummies_dust.add_prefix('dust_'))
timeseries_data = timeseries_data.join(dummies_warninglevel.add_prefix('warninglevel_'))
timeseries_data = timeseries_data.join(dummies_season.add_prefix('season_'))

In [526]:
re_columnlist = ['Day_월', 'Day_화', 'Day_수', 'Day_목', 'Day_금', 'Day_토', 'Day_일'
             ,'dust_초미세먼지', 'dust_미세먼지', 'dust_둘 다 있음', 'dust_좋음'
             ,'warninglevel_경보', 'warninglevel_주의보', 'warninglevel_없음'
             ,'season_봄', 'season_여름', 'season_가을', 'season_겨울'
             ,'event_total'
             ,'rain_mean', 'temper_mean', 'daily_predict'
             ,'ST_level']

In [527]:
timeseries_data = timeseries_data[re_columnlist]

In [528]:
timeseries_data.head()

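| | Day_월 | Day_화 | Day_수 | Day_목 | Day_금 | Day_토 | Day_일 | dust_초미세먼지 | dust_미세먼지 | dust_둘 다 있음 | ... | warninglevel_없음 | season_봄 | season_여름 | season_가을 | season_겨울 | event_total | rain_mean | temper_mean | daily_predict | ST_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0.0 | -3.3 | 5857.0 | A |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0.0 | -1.7 | 3660.0 | A |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0.0 | -0.2 | 3757.0 | A |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0.0 | -2.0 | 3342.0 | A |
| 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0.0 | -2.9 | 3246.0 | A |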


### Splitting into train/test sets (7:3)

In [533]:
train_test_timeseries_data = timeseries_data[timeseries_data.index < 3742]
month4_timeseries_data = timeseries_data[timeseries_data.index >= 3742]

In [534]:
X = train_test_timeseries_data[['Day_월', 'Day_화', 'Day_수', 'Day_목', 'Day_금', 'Day_토', 'Day_일'
             ,'dust_초미세먼지', 'dust_미세먼지', 'dust_둘 다 있음', 'dust_좋음'
             ,'warninglevel_경보', 'warninglevel_주의보', 'warninglevel_없음'
             ,'season_봄', 'season_여름', 'season_가을', 'season_겨울'
             ,'event_total'
             ,'rain_mean', 'temper_mean', 'daily_predict']]
Y = train_test_timeseries_data['ST_level']

In [535]:
X_train, X_test, Y_train, Y_test = train_test_split(X.values, Y.values, test_size=0.30)

In [536]:
print('X_train shape', np.shape(X_train))
print('Y_train shape', np.shape(Y_train))
print('X_test shape', np.shape(X_test))
print('Y_test shape', np.shape(Y_test))

X_train shape (2618, 22)
Y_train shape (2618,)
X_test shape (1123, 22)
Y_test shape (1123,)
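The split above sets no `random_state`, so the rows in each set, and therefore the reported accuracy, change from run to run. A minimal sketch of a reproducible, class-balanced 7:3 split on made-up data follows. The seed value `42` and the use of `stratify` are assumptions for illustration, not settings from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features, five balanced classes A to E.
X_toy = np.arange(300, dtype=float).reshape(100, 3)
y_toy = np.repeat(list("ABCDE"), 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.30,
    random_state=42,  # fixed seed (arbitrary value) so the split is repeatable
    stratify=y_toy,   # keep the class proportions equal in both sets
)

print(X_tr.shape, X_te.shape)
```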


## Logistic Regression Model Design

In [537]:
from sklearn.linear_model import LogisticRegression

In [538]:
LRC = LogisticRegression()
LRC.fit(X_train, Y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
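The feature matrix mixes 0/1 indicator columns with `daily_predict`, whose values are in the thousands. Regularized logistic regression is sensitive to feature scale, so one option is to standardize the inputs first. The sketch below does this on made-up data. The pipeline, the `max_iter=1000` setting, and the toy columns are assumptions for illustration, not the configuration used in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up features: a 0/1 indicator and a large-valued numeric column.
flag = rng.integers(0, 2, size=200)
big = rng.normal(4000.0, 1500.0, size=200)
X_toy = np.column_stack([flag, big])

# Made-up two-class target that depends on the large-valued column.
y_toy = np.where(big > 4000.0, "B", "A")

# Standardize every column, then fit the classifier on the scaled values.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_toy, y_toy)

print(model.score(X_toy, y_toy))
```

Because the scaler sits inside the pipeline, the same transformation is applied automatically whenever `predict` or `score` is called on new rows.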

### Model evaluation (Logistic Regression)

In [539]:
predicted = LRC.predict(X_test)

In [540]:
accuracy = LRC.score(X_test, Y_test)
print("Logistic Regression test file accuracy:" + str(accuracy))

Logistic Regression test file accuracy:0.7693677649154052
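`LRC.score` returns plain accuracy, the fraction of rows whose predicted level equals the true level. A single number hides which levels are confused with which, so a per-class breakdown can help. The sketch below uses made-up labels, not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted levels for six rows.
y_true = ["A", "A", "B", "C", "B", "A"]
y_pred = ["A", "B", "B", "C", "B", "A"]
labels = ["A", "B", "C"]

# Accuracy: share of rows where the prediction matches the truth.
acc = accuracy_score(y_true, y_pred)

# Confusion matrix: rows are true levels, columns are predicted levels.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(acc)
print(cm)
```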


### Predicting the April data

In [542]:
month4_X = month4_timeseries_data[['Day_월', 'Day_화', 'Day_수', 'Day_목', 'Day_금', 'Day_토', 'Day_일'
                     ,'dust_초미세먼지', 'dust_미세먼지', 'dust_둘 다 있음', 'dust_좋음'
                     ,'warninglevel_경보', 'warninglevel_주의보', 'warninglevel_없음'
                     ,'season_봄', 'season_여름', 'season_가을', 'season_겨울'
                     ,'event_total'
                     ,'rain_mean', 'temper_mean', 'daily_predict']]

month4_Y = month4_timeseries_data['ST_level']

In [544]:
predicted = LRC.predict(month4_X)

accuracy = LRC.score(month4_X, month4_Y)
print("Logistic Regression test file accuracy:" + str(accuracy))

Logistic Regression test file accuracy:0.6666666666666666


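Problem: with a random train/test split, the Logistic Regression model reaches about 77% accuracy (0.769), but on the new, unseen April data accuracy drops to about 67% (0.667), so the model does not fit new data well.

Proposed solution: find more informative variables and collect more data for training.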
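A random split mixes past and future rows, so it can look better than performance on a later month. One way to estimate that gap in advance is to validate on time-ordered folds, where each fold trains on earlier rows and tests on later ones. The sketch below does this with `TimeSeriesSplit` on made-up data. It is an illustration of the idea, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Made-up data, assumed to be sorted by date (oldest row first).
x = np.arange(120) % 10
X_toy = x.reshape(-1, 1).astype(float)
y_toy = np.where(x < 5, "A", "B")

# Each fold trains on an initial stretch and tests on the rows right after it.
splitter = TimeSeriesSplit(n_splits=3)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=splitter
)

print(scores)
```

If these time-ordered scores sit well below the random-split accuracy, the drop on April is more likely a property of the evaluation setup than of that one month.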