**AIVLE School Mini Project**
### **Predicting <span style="color:darkgreen">Fine Dust Concentration</span> Using Public Data**
---

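#### **<span style="color:red">[Mission Guide]</span>**
* Individual mission: Build a machine learning model that predicts fine dust concentration. <br> (Required: [1-1] to [1-2], [2-1] to [2-8], and [3-1] to [3-2]; optional: [4-1] onward)
* Team mission: Discuss the results of the individual missions and prepare a presentation.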

#### **<span style="color:red">[Data Description]</span>**

* Training data
    * air_2021.csv : 2021 fine dust data
    * weather_2021.csv : 2021 weather data
* Test data
    * air_2022.csv : 2022 fine dust data
    * weather_2022.csv : 2022 weather data

# [Step 1] Exploratory Data Analysis

In [48]:
# Install and import the required libraries
!pip install pandas

import pandas as pd
import datetime
pd.options.display.max_columns = None



---

#### **<span style="color:blue">[1-1] Load the air_21, air_22, weather_21, and weather_22 data</span>**

In [157]:
# Load the data (the air files are UTF-8, the weather files are CP949)

air_21 = pd.read_csv("air_2021.csv", sep=',', index_col = 0, encoding = 'utf-8' )
air_22 = pd.read_csv("air_2022.csv", sep=',', index_col = 0, encoding = 'utf-8' )
weather_21 = pd.read_csv("weather_2021.csv", sep = ',', encoding='cp949')
weather_22 = pd.read_csv("weather_2022.csv", sep = ',', encoding='cp949')

#### **<span style="color:blue">[1-2] Carry out the necessary data analysis</span>**

In [158]:
air_21.head()

Unnamed: 0,지역,망,측정소코드,측정소명,측정일시,SO2,CO,O3,NO2,PM10,PM25,주소
0,서울 종로구,도시대기,111123,종로구,2021100101,0.003,0.6,0.002,0.039,31.0,18.0,서울 종로구 종로35가길 19
1,서울 종로구,도시대기,111123,종로구,2021100102,0.003,0.6,0.002,0.035,27.0,16.0,서울 종로구 종로35가길 19
2,서울 종로구,도시대기,111123,종로구,2021100103,0.003,0.6,0.002,0.033,28.0,18.0,서울 종로구 종로35가길 19
3,서울 종로구,도시대기,111123,종로구,2021100104,0.003,0.6,0.002,0.03,26.0,16.0,서울 종로구 종로35가길 19
4,서울 종로구,도시대기,111123,종로구,2021100105,0.003,0.5,0.003,0.026,26.0,16.0,서울 종로구 종로35가길 19


In [159]:
# Write the necessary code below and check the results.
# Use head, tail, info, plot-based visualization, and so on.

air_22.head()


Unnamed: 0,지역,망,측정소코드,측정소명,측정일시,SO2,CO,O3,NO2,PM10,PM25,주소
0,서울 종로구,도시대기,111123,종로구,2022010101,0.003,0.4,0.026,0.016,23.0,12.0,서울 종로구 종로35가길 19
1,서울 종로구,도시대기,111123,종로구,2022010102,0.003,0.4,0.022,0.02,20.0,9.0,서울 종로구 종로35가길 19
2,서울 종로구,도시대기,111123,종로구,2022010103,0.003,0.5,0.014,0.028,20.0,9.0,서울 종로구 종로35가길 19
3,서울 종로구,도시대기,111123,종로구,2022010104,0.003,0.5,0.016,0.027,19.0,10.0,서울 종로구 종로35가길 19
4,서울 종로구,도시대기,111123,종로구,2022010105,0.003,0.5,0.005,0.04,24.0,11.0,서울 종로구 종로35가길 19


In [160]:
air_21.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8760 entries, 0 to 8759
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   지역      8760 non-null   object 
 1   망       8760 non-null   object 
 2   측정소코드   8760 non-null   int64  
 3   측정소명    8760 non-null   object 
 4   측정일시    8760 non-null   int64  
 5   SO2     8648 non-null   float64
 6   CO      8680 non-null   float64
 7   O3      8663 non-null   float64
 8   NO2     8680 non-null   float64
 9   PM10    8655 non-null   float64
 10  PM25    8663 non-null   float64
 11  주소      8760 non-null   object 
dtypes: float64(6), int64(2), object(4)
memory usage: 889.7+ KB
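
The `info()` output above shows that only the pollutant columns (SO2 through PM25) contain missing values. A minimal sketch of how the same information could be summarized per column is shown here. The frame `toy` and its values are made up for illustration and are not taken from air_2021.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for air_21; the values are made up for illustration.
toy = pd.DataFrame({
    "PM10": [31.0, np.nan, 28.0, 26.0],
    "PM25": [18.0, 16.0, np.nan, np.nan],
    "CO": [0.6, 0.6, 0.6, 0.5],
})

# Count and share of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Running the same two expressions on `air_21` or `weather_21` would likely make it easier to decide which columns are worth keeping before the preprocessing step.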


In [161]:
weather_21.head()

Unnamed: 0,지점,지점명,일시,기온(°C),기온 QC플래그,강수량(mm),강수량 QC플래그,풍속(m/s),풍속 QC플래그,풍향(16방위),풍향 QC플래그,습도(%),습도 QC플래그,증기압(hPa),이슬점온도(°C),현지기압(hPa),현지기압 QC플래그,해면기압(hPa),해면기압 QC플래그,일조(hr),일조 QC플래그,일사(MJ/m2),일사 QC플래그,적설(cm),3시간신적설(cm),전운량(10분위),중하층운량(10분위),운형(운형약어),최저운고(100m ),시정(10m),지면상태(지면상태코드),현상번호(국내식),지면온도(°C),지면온도 QC플래그,5cm 지중온도(°C),10cm 지중온도(°C),20cm 지중온도(°C),30cm 지중온도(°C)
0,108,서울,2021-01-01 01:00,-8.7,,,,2.4,,270.0,,68,,2.2,-13.5,1016.4,,1027.7,,,9.0,,9.0,,,0.0,0,,,2000,,,-6.9,,-1.0,-0.8,0.3,1.6
1,108,서울,2021-01-01 02:00,-9.1,,,,1.6,,270.0,,69,,2.1,-13.7,1016.2,,1027.5,,,9.0,,9.0,,,0.0,0,,,2000,,,-7.1,,-1.1,-0.8,0.3,1.6
2,108,서울,2021-01-01 03:00,-9.3,,,,1.1,,250.0,,70,,2.1,-13.7,1016.8,,1028.1,,,9.0,,9.0,,,0.0,0,,,2000,,,-7.3,,-1.2,-0.9,0.3,1.6
3,108,서울,2021-01-01 04:00,-9.3,,,,0.3,,0.0,,71,,2.2,-13.5,1016.2,,1027.5,,,9.0,,9.0,,,0.0,0,,,2000,,,-7.5,,-1.3,-1.0,0.2,1.5
4,108,서울,2021-01-01 05:00,-9.7,,,,1.9,,20.0,,72,,2.1,-13.8,1015.6,,1026.9,,,9.0,,9.0,,,0.0,0,,,2000,,,-7.6,,-1.3,-1.0,0.2,1.5


In [162]:
weather_21.tail()

Unnamed: 0,지점,지점명,일시,기온(°C),기온 QC플래그,강수량(mm),강수량 QC플래그,풍속(m/s),풍속 QC플래그,풍향(16방위),풍향 QC플래그,습도(%),습도 QC플래그,증기압(hPa),이슬점온도(°C),현지기압(hPa),현지기압 QC플래그,해면기압(hPa),해면기압 QC플래그,일조(hr),일조 QC플래그,일사(MJ/m2),일사 QC플래그,적설(cm),3시간신적설(cm),전운량(10분위),중하층운량(10분위),운형(운형약어),최저운고(100m ),시정(10m),지면상태(지면상태코드),현상번호(국내식),지면온도(°C),지면온도 QC플래그,5cm 지중온도(°C),10cm 지중온도(°C),20cm 지중온도(°C),30cm 지중온도(°C)
8754,108,서울,2021-12-31 19:00,-6.4,,,,2.2,,250.0,,34,,1.3,-19.7,1021.1,,1032.3,,,9.0,,9.0,,,0.0,0,,,2000,,,-4.5,,-0.5,-0.7,-0.1,1.0
8755,108,서울,2021-12-31 20:00,-6.3,,,,4.1,,320.0,,35,,1.3,-19.2,1021.2,,1032.4,,,9.0,,9.0,,,0.0,0,,,2000,,,-5.3,,-0.6,-0.8,-0.1,1.0
8756,108,서울,2021-12-31 21:00,-6.7,,,,4.8,,320.0,,36,,1.3,-19.3,1021.2,,1032.4,,,9.0,,9.0,,,0.0,0,,,2000,,,-5.7,,-0.7,-0.8,-0.1,1.0
8757,108,서울,2021-12-31 22:00,-7.5,,,,3.0,,320.0,,37,,1.3,-19.7,1021.8,,1033.1,,,9.0,,9.0,,,0.0,0,,,2000,,,-6.2,,-0.8,-0.9,-0.2,1.0
8758,108,서울,2021-12-31 23:00,-7.7,,,,2.9,,320.0,,38,,1.3,-19.5,1021.9,,1033.2,,,9.0,,9.0,,,0.0,0,,,2000,,,-6.5,,-0.9,-0.9,-0.2,1.0


In [163]:
weather_21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8759 entries, 0 to 8758
Data columns (total 38 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   지점             8759 non-null   int64  
 1   지점명            8759 non-null   object 
 2   일시             8759 non-null   object 
 3   기온(°C)         8759 non-null   float64
 4   기온 QC플래그       0 non-null      float64
 5   강수량(mm)        949 non-null    float64
 6   강수량 QC플래그      1763 non-null   float64
 7   풍속(m/s)        8757 non-null   float64
 8   풍속 QC플래그       2 non-null      float64
 9   풍향(16방위)       8757 non-null   float64
 10  풍향 QC플래그       2 non-null      float64
 11  습도(%)          8759 non-null   int64  
 12  습도 QC플래그       0 non-null      float64
 13  증기압(hPa)       8759 non-null   float64
 14  이슬점온도(°C)      8759 non-null   float64
 15  현지기압(hPa)      8759 non-null   float64
 16  현지기압 QC플래그     0 non-null      float64
 17  해면기압(hPa)      8759 non-null   float64
 18  해면기압 QC플

# [Step 2] Data Preprocessing

#### **<span style="color:blue">[2-1] Create a 'time' variable from '측정일시' in air_21 and air_22</span>**

* Create a 'time' variable from '측정일시' (measurement timestamp) in each of air_21 and air_22
    * Note: The fine dust data uses hours 1-24, while the weather data uses hours 0-23. To merge the two on time in [2-3], both must follow the same convention. Account for this when creating the time variable in the fine dust data (e.g., air_21['측정일시'] - 1).
* Convert the time variable with pd.to_datetime
    * Note: format = '%Y%m%d%H'

In [164]:
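# Write the necessary code below and check the results.
# Subtract 1 so the air data (hours 1-24) lines up with the weather data (hours 0-23),
# then parse the YYYYMMDDHH integer as a datetime.
air_21['time'] = air_21['측정일시'] - 1
air_21['time'] = pd.to_datetime(air_21['time'], format='%Y%m%d%H')

air_22['time'] = air_22['측정일시'] - 1
air_22['time'] = pd.to_datetime(air_22['time'], format='%Y%m%d%H')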
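
The hour shift matters because `%H` only accepts hours 00 to 23, so a raw value ending in 24 cannot be parsed directly. The sketch below applies the subtract-then-parse approach to two sample values and compares it with an alternative that parses the date part and adds the hour as a timedelta. The names `raw`, `shifted`, `date_part`, and `alt` are illustrative only.

```python
import pandas as pd

# Two sample values in the '측정일시' layout (YYYYMMDDHH, hours 1-24).
raw = pd.Series([2021100101, 2021093024])

# Approach used in this notebook: subtract 1, then parse with the hour format.
shifted = pd.to_datetime((raw - 1).astype(str), format="%Y%m%d%H")

# Alternative sketch: parse the date part, then add (hour - 1) as a timedelta.
date_part = pd.to_datetime((raw // 100).astype(str), format="%Y%m%d")
alt = date_part + pd.to_timedelta(raw % 100 - 1, unit="h")

print(shifted.tolist())
print(alt.tolist())
```

Both approaches give the same timestamps here. The timedelta variant keeps the date and hour arithmetic separate, which may be easier to reason about if the raw format ever changes.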

---

In [165]:
air_21.tail()

Unnamed: 0,지역,망,측정소코드,측정소명,측정일시,SO2,CO,O3,NO2,PM10,PM25,주소,time
8755,서울 종로구,도시대기,111123,종로구,2021093020,0.003,0.7,0.02,0.036,35.0,24.0,서울 종로구 종로35가길 19,2021-09-30 19:00:00
8756,서울 종로구,도시대기,111123,종로구,2021093021,0.003,0.6,0.016,0.035,34.0,21.0,서울 종로구 종로35가길 19,2021-09-30 20:00:00
8757,서울 종로구,도시대기,111123,종로구,2021093022,0.003,0.6,0.012,0.036,30.0,19.0,서울 종로구 종로35가길 19,2021-09-30 21:00:00
8758,서울 종로구,도시대기,111123,종로구,2021093023,0.003,0.6,0.004,0.042,33.0,19.0,서울 종로구 종로35가길 19,2021-09-30 22:00:00
8759,서울 종로구,도시대기,111123,종로구,2021093024,0.003,0.6,0.003,0.042,29.0,17.0,서울 종로구 종로35가길 19,2021-09-30 23:00:00


In [166]:
air_21.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8760 entries, 0 to 8759
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   지역      8760 non-null   object        
 1   망       8760 non-null   object        
 2   측정소코드   8760 non-null   int64         
 3   측정소명    8760 non-null   object        
 4   측정일시    8760 non-null   int64         
 5   SO2     8648 non-null   float64       
 6   CO      8680 non-null   float64       
 7   O3      8663 non-null   float64       
 8   NO2     8680 non-null   float64       
 9   PM10    8655 non-null   float64       
 10  PM25    8663 non-null   float64       
 11  주소      8760 non-null   object        
 12  time    8760 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 958.1+ KB


#### **<span style="color:blue">[2-2] Create a 'time' variable from '일시' in weather_21 and weather_22</span>**

* Create a 'time' variable from '일시' (timestamp) in weather_21 and weather_22
* Convert the time variable with pd.to_datetime

In [167]:
# Write the necessary code below and check the results.
weather_21['time'] = pd.to_datetime(weather_21['일시'])
weather_22['time'] = pd.to_datetime(weather_22['일시'])

In [168]:
weather_21.tail()

Unnamed: 0,지점,지점명,일시,기온(°C),기온 QC플래그,강수량(mm),강수량 QC플래그,풍속(m/s),풍속 QC플래그,풍향(16방위),풍향 QC플래그,습도(%),습도 QC플래그,증기압(hPa),이슬점온도(°C),현지기압(hPa),현지기압 QC플래그,해면기압(hPa),해면기압 QC플래그,일조(hr),일조 QC플래그,일사(MJ/m2),일사 QC플래그,적설(cm),3시간신적설(cm),전운량(10분위),중하층운량(10분위),운형(운형약어),최저운고(100m ),시정(10m),지면상태(지면상태코드),현상번호(국내식),지면온도(°C),지면온도 QC플래그,5cm 지중온도(°C),10cm 지중온도(°C),20cm 지중온도(°C),30cm 지중온도(°C),time
8754,108,서울,2021-12-31 19:00,-6.4,,,,2.2,,250.0,,34,,1.3,-19.7,1021.1,,1032.3,,,9.0,,9.0,,,0.0,0,,,2000,,,-4.5,,-0.5,-0.7,-0.1,1.0,2021-12-31 19:00:00
8755,108,서울,2021-12-31 20:00,-6.3,,,,4.1,,320.0,,35,,1.3,-19.2,1021.2,,1032.4,,,9.0,,9.0,,,0.0,0,,,2000,,,-5.3,,-0.6,-0.8,-0.1,1.0,2021-12-31 20:00:00
8756,108,서울,2021-12-31 21:00,-6.7,,,,4.8,,320.0,,36,,1.3,-19.3,1021.2,,1032.4,,,9.0,,9.0,,,0.0,0,,,2000,,,-5.7,,-0.7,-0.8,-0.1,1.0,2021-12-31 21:00:00
8757,108,서울,2021-12-31 22:00,-7.5,,,,3.0,,320.0,,37,,1.3,-19.7,1021.8,,1033.1,,,9.0,,9.0,,,0.0,0,,,2000,,,-6.2,,-0.8,-0.9,-0.2,1.0,2021-12-31 22:00:00
8758,108,서울,2021-12-31 23:00,-7.7,,,,2.9,,320.0,,38,,1.3,-19.5,1021.9,,1033.2,,,9.0,,9.0,,,0.0,0,,,2000,,,-6.5,,-0.9,-0.9,-0.2,1.0,2021-12-31 23:00:00


---

In [169]:
weather_21.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8759 entries, 0 to 8758
Data columns (total 39 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   지점             8759 non-null   int64         
 1   지점명            8759 non-null   object        
 2   일시             8759 non-null   object        
 3   기온(°C)         8759 non-null   float64       
 4   기온 QC플래그       0 non-null      float64       
 5   강수량(mm)        949 non-null    float64       
 6   강수량 QC플래그      1763 non-null   float64       
 7   풍속(m/s)        8757 non-null   float64       
 8   풍속 QC플래그       2 non-null      float64       
 9   풍향(16방위)       8757 non-null   float64       
 10  풍향 QC플래그       2 non-null      float64       
 11  습도(%)          8759 non-null   int64         
 12  습도 QC플래그       0 non-null      float64       
 13  증기압(hPa)       8759 non-null   float64       
 14  이슬점온도(°C)      8759 non-null   float64       
 15  현지기압(hPa)      8759 n

#### **<span style="color:blue">[2-3] Merge the data on 'time'</span>**

* Merge the fine dust data and the weather data on 'time'.
    * Store the merged 2021 fine dust and weather data in df_21.
    * Store the merged 2022 fine dust and weather data in df_22.

In [170]:
# Write the necessary code below and check the results.
df_21 = pd.merge(air_21,weather_21,on='time',how='inner')
df_22 = pd.merge(air_22,weather_22,on='time',how='inner')
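
An inner merge silently drops timestamps that appear in only one frame. In this notebook, air_21 has 8760 rows while weather_21 has 8759, so at least one air row has no weather match. A minimal sketch of how unmatched rows could be counted before merging is shown below, assuming toy frames `air` and `weather` with made-up values.

```python
import pandas as pd

# Toy stand-ins for the air and weather frames; values are made up.
air = pd.DataFrame({
    "time": pd.date_range("2021-01-01 00:00", periods=3, freq="h"),
    "PM10": [25.0, 27.0, 23.0],
})
weather = pd.DataFrame({
    "time": pd.date_range("2021-01-01 01:00", periods=3, freq="h"),
    "temp": [-8.7, -9.1, -9.3],
})

# An outer merge with indicator=True labels each row by where it came from;
# validate raises MergeError if 'time' is duplicated on either side.
check = pd.merge(air, weather, on="time", how="outer",
                 indicator=True, validate="one_to_one")
counts = check["_merge"].value_counts()

# The inner merge keeps only timestamps present in both frames.
merged = pd.merge(air, weather, on="time", how="inner")

print(counts)
print(len(merged))
```

The `left_only` and `right_only` counts show how many rows each side would lose in the inner merge.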

---

In [171]:
df_21.columns

Index(['지역', '망', '측정소코드', '측정소명', '측정일시', 'SO2', 'CO', 'O3', 'NO2', 'PM10',
       'PM25', '주소', 'time', '지점', '지점명', '일시', '기온(°C)', '기온 QC플래그',
       '강수량(mm)', '강수량 QC플래그', '풍속(m/s)', '풍속 QC플래그', '풍향(16방위)', '풍향 QC플래그',
       '습도(%)', '습도 QC플래그', '증기압(hPa)', '이슬점온도(°C)', '현지기압(hPa)', '현지기압 QC플래그',
       '해면기압(hPa)', '해면기압 QC플래그', '일조(hr)', '일조 QC플래그', '일사(MJ/m2)',
       '일사 QC플래그', '적설(cm)', '3시간신적설(cm)', '전운량(10분위)', '중하층운량(10분위)',
       '운형(운형약어)', '최저운고(100m )', '시정(10m)', '지면상태(지면상태코드)', '현상번호(국내식)',
       '지면온도(°C)', '지면온도 QC플래그', '5cm 지중온도(°C)', '10cm 지중온도(°C)',
       '20cm 지중온도(°C)', '30cm 지중온도(°C)'],
      dtype='object')

In [172]:
df_21.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8759 entries, 0 to 8758
Data columns (total 51 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   지역             8759 non-null   object        
 1   망              8759 non-null   object        
 2   측정소코드          8759 non-null   int64         
 3   측정소명           8759 non-null   object        
 4   측정일시           8759 non-null   int64         
 5   SO2            8647 non-null   float64       
 6   CO             8679 non-null   float64       
 7   O3             8662 non-null   float64       
 8   NO2            8679 non-null   float64       
 9   PM10           8654 non-null   float64       
 10  PM25           8662 non-null   float64       
 11  주소             8759 non-null   object        
 12  time           8759 non-null   datetime64[ns]
 13  지점             8759 non-null   int64         
 14  지점명            8759 non-null   object        
 15  일시             8759 n

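#### **<span style="color:blue">[2-4] Remove unused variables</span>**

* Remove the variables that will not be used for machine learning.
    * Keep only the variables to be used in df_21 and df_22.
* Set the time variable as the index (set_index). Because the data is not sorted, sort it by the index (sort_index).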

In [173]:
# Keep only the variables to be used in df_21 and df_22
# Backup copies of the merged frames before dropping columns
df_21_2 = df_21.copy()
df_22_2 = df_22.copy()
# Columns to drop (duplicate entries from the original list removed)
del_cols = ['지역', '망', '측정소명', '지점명', '일시', '주소', '기온 QC플래그', '강수량(mm)',
       '강수량 QC플래그', '풍속 QC플래그', '풍향 QC플래그',
       '습도 QC플래그', '증기압(hPa)', '이슬점온도(°C)', '현지기압(hPa)', '현지기압 QC플래그',
       '해면기압(hPa)', '해면기압 QC플래그', '일조(hr)', '일조 QC플래그', '일사(MJ/m2)',
       '일사 QC플래그', '적설(cm)', '3시간신적설(cm)', '전운량(10분위)', '중하층운량(10분위)',
       '운형(운형약어)', '최저운고(100m )', '시정(10m)', '지면상태(지면상태코드)', '현상번호(국내식)',
       '지면온도(°C)', '지면온도 QC플래그', '5cm 지중온도(°C)', '10cm 지중온도(°C)',
       '20cm 지중온도(°C)', '30cm 지중온도(°C)']
df_21.drop(columns=del_cols, inplace=True)
df_22.drop(columns=del_cols, inplace=True)
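
When far more columns are dropped than kept, listing the columns to keep can be shorter and less error-prone than listing the ones to drop. The sketch below illustrates this on a toy frame and chains the set_index and sort_index steps described above. The `keep_cols` list and the toy values are assumptions for illustration, not the notebook's actual selection.

```python
import pandas as pd

# Toy merged frame; column names mirror a few of the real ones, values are made up.
toy = pd.DataFrame({
    "time": pd.to_datetime(["2021-01-01 02:00", "2021-01-01 00:00", "2021-01-01 01:00"]),
    "PM10": [23.0, 25.0, 27.0],
    "기온(°C)": [-9.3, -8.7, -9.1],
    "기온 QC플래그": [None, None, None],
    "주소": ["a", "a", "a"],
})

# Hypothetical keep-list: name the columns to retain instead of those to drop.
keep_cols = ["time", "PM10", "기온(°C)"]
tidy = toy[keep_cols].set_index("time").sort_index()

print(tidy)
```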


In [174]:
df_21

Unnamed: 0,측정소코드,측정일시,SO2,CO,O3,NO2,PM10,PM25,time,지점,기온(°C),풍속(m/s),풍향(16방위),습도(%)
0,111123,2021100101,0.003,0.6,0.002,0.039,31.0,18.0,2021-10-01 00:00:00,108,19.2,1.3,360.0,83
1,111123,2021100102,0.003,0.6,0.002,0.035,27.0,16.0,2021-10-01 01:00:00,108,18.7,1.0,20.0,85
2,111123,2021100103,0.003,0.6,0.002,0.033,28.0,18.0,2021-10-01 02:00:00,108,18.3,0.3,0.0,89
3,111123,2021100104,0.003,0.6,0.002,0.030,26.0,16.0,2021-10-01 03:00:00,108,17.7,2.0,20.0,92
4,111123,2021100105,0.003,0.5,0.003,0.026,26.0,16.0,2021-10-01 04:00:00,108,17.4,1.0,50.0,91
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8754,111123,2021093020,0.003,0.7,0.020,0.036,35.0,24.0,2021-09-30 19:00:00,108,22.7,0.2,0.0,71
8755,111123,2021093021,0.003,0.6,0.016,0.035,34.0,21.0,2021-09-30 20:00:00,108,21.7,0.9,320.0,79
8756,111123,2021093022,0.003,0.6,0.012,0.036,30.0,19.0,2021-09-30 21:00:00,108,20.9,0.4,0.0,83
8757,111123,2021093023,0.003,0.6,0.004,0.042,33.0,19.0,2021-09-30 22:00:00,108,20.4,0.8,70.0,81


In [175]:
# Set the time variable as the index
df_21.set_index('time',inplace=True)
df_22.set_index('time',inplace=True)

---

In [176]:
df_21 = df_21.sort_index(ascending=True)
df_22 = df_22.sort_index(ascending=True)

In [177]:
df_21

time,측정소코드,측정일시,SO2,CO,O3,NO2,PM10,PM25,지점,기온(°C),풍속(m/s),풍향(16방위),습도(%)
2021-01-01 01:00:00,111123,2021010102,0.002,0.6,0.018,0.020,25.0,14.0,108,-8.7,2.4,270.0,68
2021-01-01 02:00:00,111123,2021010103,0.002,0.6,0.013,0.025,27.0,16.0,108,-9.1,1.6,270.0,69
2021-01-01 03:00:00,111123,2021010104,0.003,0.6,0.011,0.027,23.0,13.0,108,-9.3,1.1,250.0,70
2021-01-01 04:00:00,111123,2021010105,0.003,0.6,0.008,0.032,24.0,14.0,108,-9.3,0.3,0.0,71
2021-01-01 05:00:00,111123,2021010106,0.002,0.7,0.003,0.037,26.0,16.0,108,-9.7,1.9,20.0,72
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-12-31 19:00:00,111123,2021123120,0.003,0.4,0.025,0.020,26.0,8.0,108,-6.4,2.2,250.0,34
2021-12-31 20:00:00,111123,2021123121,0.003,0.4,0.030,0.014,27.0,9.0,108,-6.3,4.1,320.0,35
2021-12-31 21:00:00,111123,2021123122,0.003,0.4,0.033,0.011,20.0,8.0,108,-6.7,4.8,320.0,36
2021-12-31 22:00:00,111123,2021123123,0.003,0.4,0.029,0.015,20.0,9.0,108,-7.5,3.0,320.0,37


#### **<span style="color:blue">[2-5] Handle missing values</span>**

In [178]:
# Check missing values in df_21 and df_22
df_21.isna().sum()


측정소코드       0  
측정일시        0  
SO2         112
CO          80 
O3          97 
NO2         80 
PM10        105
PM25        97 
지점          0  
기온(°C)      0  
풍속(m/s)     2  
풍향(16방위)    2  
습도(%)       0  
dtype: int64

In [179]:
df_22.isna().sum()

측정소코드       0 
측정일시        0 
SO2         21
CO          21
O3          21
NO2         21
PM10        38
PM25        25
지점          0 
기온(°C)      0 
풍속(m/s)     0 
풍향(16방위)    0 
습도(%)       0 
dtype: int64

In [180]:
# Fill missing values in df_21 and df_22 with a forward fill
# (each gap takes the last observed value before it)
fill_cols_21 = ['SO2', 'CO', 'O3', 'NO2', 'PM10', 'PM25', '풍속(m/s)', '풍향(16방위)']
fill_cols_22 = ['SO2', 'CO', 'O3', 'NO2', 'PM10', 'PM25']
df_21[fill_cols_21] = df_21[fill_cols_21].ffill()
df_22[fill_cols_22] = df_22[fill_cols_22].ffill()
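
Forward fill is one option among several for hourly sensor data. The sketch below compares it with time-based linear interpolation on a small made-up series so the difference is visible. It is not a claim about which method suits this dataset better.

```python
import numpy as np
import pandas as pd

# Made-up hourly PM10 values with two gaps.
idx = pd.date_range("2021-01-01", periods=6, freq="h")
pm10 = pd.Series([30.0, np.nan, np.nan, 36.0, np.nan, 40.0], index=idx)

# Forward fill repeats the last observed value across the gap.
filled = pm10.ffill()

# Time-based linear interpolation draws a straight line between neighbours.
interp = pm10.interpolate(method="time")

print(pd.DataFrame({"raw": pm10, "ffill": filled, "interp": interp}))
```

Forward fill produces flat steps, while interpolation produces a gradual change between the two observed values on either side of the gap.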

In [181]:
# Handle any remaining missing values in df_21 and df_22 (here: confirm none remain in df_22)
df_22.isna().sum()


측정소코드       0
측정일시        0
SO2         0
CO          0
O3          0
NO2         0
PM10        0
PM25        0
지점          0
기온(°C)      0
풍속(m/s)     0
풍향(16방위)    0
습도(%)       0
dtype: int64

In [182]:
# Re-check missing values in df_21 and df_22
df_21.isna().sum()


측정소코드       0
측정일시        0
SO2         0
CO          0
O3          0
NO2         0
PM10        0
PM25        0
지점          0
기온(°C)      0
풍속(m/s)     0
풍향(16방위)    0
습도(%)       0
dtype: int64

---

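#### **<span style="color:blue">[2-6] Add the fine dust concentration at the same hour on the previous day</span>**

* First, add month, day, and hour variables to df_21 and df_22.
    * Example: use dt.month, dt.day, dt.hour, or, for a DatetimeIndex, df.index.month and so on.
* Add the fine dust concentration at the same hour on the previous day (24 hours earlier) as a feature that is useful for modeling.
    * See the shift operation for time series data.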

In [183]:
# Split the index (time) of df_21 and df_22 into month, day, and hour (year is not needed). The index (time) is not included when saving later.
df_21['month'] = df_21.index.month
df_21['day'] = df_21.index.day
df_21['hour'] = df_21.index.hour

df_22['month'] = df_22.index.month
df_22['day'] = df_22.index.day
df_22['hour'] = df_22.index.hour

In [184]:
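# Add the previous-day, same-hour fine dust concentration (PM10_lag1) to df_21 and df_22
# "Same hour on the previous day" means 24 hours earlier, i.e. 24 rows earlier in hourly data.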
df_21['PM10_lag1'] = df_21['PM10'].shift(24)
df_22['PM10_lag1'] = df_22['PM10'].shift(24)
pd.options.display.max_columns = None
df_21.head()

time,측정소코드,측정일시,SO2,CO,O3,NO2,PM10,PM25,지점,기온(°C),풍속(m/s),풍향(16방위),습도(%),month,day,hour,PM10_lag1
2021-01-01 01:00:00,111123,2021010102,0.002,0.6,0.018,0.02,25.0,14.0,108,-8.7,2.4,270.0,68,1,1,1,
2021-01-01 02:00:00,111123,2021010103,0.002,0.6,0.013,0.025,27.0,16.0,108,-9.1,1.6,270.0,69,1,1,2,
2021-01-01 03:00:00,111123,2021010104,0.003,0.6,0.011,0.027,23.0,13.0,108,-9.3,1.1,250.0,70,1,1,3,
2021-01-01 04:00:00,111123,2021010105,0.003,0.6,0.008,0.032,24.0,14.0,108,-9.3,0.3,0.0,71,1,1,4,
2021-01-01 05:00:00,111123,2021010106,0.002,0.7,0.003,0.037,26.0,16.0,108,-9.7,1.9,20.0,72,1,1,5,
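
`shift(24)` moves values by 24 rows, which equals 24 hours only if every hourly row is present. The sketch below shows what happens when one row is missing and compares the positional lag with a lag computed from the timestamps. The series `pm10` is synthetic (values 0 to 71) so the misalignment is easy to read off.

```python
import pandas as pd

# 72 hourly values 0..71, then remove one timestamp to simulate a missing row.
idx = pd.date_range("2021-01-01", periods=72, freq="h")
pm10 = pd.Series(range(72), index=idx, dtype="float64").drop(idx[30])

# Positional lag: the value 24 rows earlier (what shift(24) does).
lag_by_rows = pm10.shift(24)

# Time-based lag: move the index forward 24 hours, then align back to the original index.
lag_by_time = pm10.shift(freq=pd.Timedelta(hours=24)).reindex(pm10.index)

t = idx[40]
print(lag_by_rows[t], lag_by_time[t])
```

At hour 40 the positional lag returns the value from hour 15, while the time-based lag returns the value from hour 16, which is the true 24-hours-earlier observation. If the merged frames have no gaps, both approaches give the same result.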


---

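#### **<span style="color:blue">[2-7] Create the fine dust concentration at time t+1</span>**

* Time t+1 means one hour later.
* Create a variable (PM10_1) holding the fine dust concentration at time t+1.
* The fine dust concentration at time t+1 is the y value (target) that the machine learning model will predict.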

In [185]:
# Add the t+1 variable (PM10_1) to df_21 and df_22
df_21['PM10_1']=df_21['PM10'].shift(-1)
df_22['PM10_1']=df_22['PM10'].shift(-1)
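
Shifting by -1 leaves the last row without a target, because no later observation exists for it. One way to handle that row, sketched here on made-up values, is to drop it rather than fill it. This is an alternative to consider, not necessarily what the assignment expects.

```python
import pandas as pd

# Made-up hourly PM10 values.
idx = pd.date_range("2021-01-01", periods=5, freq="h")
toy = pd.DataFrame({"PM10": [25.0, 27.0, 23.0, 24.0, 26.0]}, index=idx)

# Target: the PM10 value one row (one hour) later.
toy["PM10_1"] = toy["PM10"].shift(-1)

# The final row has no "next hour", so its target is NaN; one option is to
# drop it instead of filling it with a substitute value.
supervised = toy.dropna(subset=["PM10_1"])

print(supervised)
```

Dropping keeps every remaining target equal to a real measurement, at the cost of one row per year of data.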

In [186]:
# Handle missing values if there are any (only the last expression, df_22, is displayed)
df_21.isna().sum()
df_22.isna().sum()


측정소코드        0 
측정일시         0 
SO2          0 
CO           0 
O3           0 
NO2          0 
PM10         0 
PM25         0 
지점           0 
기온(°C)       0 
풍속(m/s)      0 
풍향(16방위)     0 
습도(%)        0 
month        0 
day          0 
hour         0 
PM10_lag1    24
PM10_1       1 
dtype: int64

In [187]:
# Fill the NaNs created by shift() with the mean PM10 of the 2021 (training) data
mean_pm10 = df_21['PM10'].mean()

df_21['PM10_lag1'] = df_21['PM10_lag1'].fillna(mean_pm10)
df_21['PM10_1'] = df_21['PM10_1'].fillna(mean_pm10)

df_22['PM10_lag1'] = df_22['PM10_lag1'].fillna(mean_pm10)
df_22['PM10_1'] = df_22['PM10_1'].fillna(mean_pm10)

In [188]:
df_21.isna().sum()

측정소코드        0
측정일시         0
SO2          0
CO           0
O3           0
NO2          0
PM10         0
PM25         0
지점           0
기온(°C)       0
풍속(m/s)      0
풍향(16방위)     0
습도(%)        0
month        0
day          0
hour         0
PM10_lag1    0
PM10_1       0
dtype: int64

In [189]:
df_22.isna().sum()

측정소코드        0
측정일시         0
SO2          0
CO           0
O3           0
NO2          0
PM10         0
PM25         0
지점           0
기온(°C)       0
풍속(m/s)      0
풍향(16방위)     0
습도(%)        0
month        0
day          0
hour         0
PM10_lag1    0
PM10_1       0
dtype: int64

In [190]:
df_21.head(30)

time,측정소코드,측정일시,SO2,CO,O3,NO2,PM10,PM25,지점,기온(°C),풍속(m/s),풍향(16방위),습도(%),month,day,hour,PM10_lag1,PM10_1
2021-01-01 01:00:00,111123,2021010102,0.002,0.6,0.018,0.02,25.0,14.0,108,-8.7,2.4,270.0,68,1,1,1,38.786049,27.0
2021-01-01 02:00:00,111123,2021010103,0.002,0.6,0.013,0.025,27.0,16.0,108,-9.1,1.6,270.0,69,1,1,2,38.786049,23.0
2021-01-01 03:00:00,111123,2021010104,0.003,0.6,0.011,0.027,23.0,13.0,108,-9.3,1.1,250.0,70,1,1,3,38.786049,24.0
2021-01-01 04:00:00,111123,2021010105,0.003,0.6,0.008,0.032,24.0,14.0,108,-9.3,0.3,0.0,71,1,1,4,38.786049,26.0
2021-01-01 05:00:00,111123,2021010106,0.002,0.7,0.003,0.037,26.0,16.0,108,-9.7,1.9,20.0,72,1,1,5,38.786049,27.0
2021-01-01 06:00:00,111123,2021010107,0.003,0.7,0.002,0.039,27.0,18.0,108,-9.7,2.0,50.0,75,1,1,6,38.786049,30.0
2021-01-01 07:00:00,111123,2021010108,0.003,0.8,0.002,0.041,30.0,18.0,108,-9.3,1.6,50.0,71,1,1,7,38.786049,33.0
2021-01-01 08:00:00,111123,2021010109,0.003,0.8,0.004,0.04,33.0,19.0,108,-9.3,1.6,50.0,72,1,1,8,38.786049,35.0
2021-01-01 09:00:00,111123,2021010110,0.004,0.9,0.007,0.039,35.0,19.0,108,-8.6,2.5,20.0,74,1,1,9,38.786049,44.0
2021-01-01 10:00:00,111123,2021010111,0.004,0.8,0.01,0.036,44.0,27.0,108,-6.1,1.1,50.0,68,1,1,10,38.786049,42.0


---

#### **<span style="color:blue">[2-8] Split the train and test data</span>**

* Use the 2021 data (df_21) as the train data. Store everything except the y value in train_x and the y value in train_y.
* Use the 2022 data (df_22) as the test data. Store everything except the y value in test_x and the y value in test_y.
* Save each DataFrame as a csv file (train_x.csv / train_y.csv / test_x.csv / test_y.csv), excluding the index.
* The y value is 'PM10_1', i.e. the fine dust concentration at time t+1.

In [191]:
# Write the necessary code below and check the results.
target = 'PM10_1'
train_x = df_21.drop(target, axis=1)
train_y = df_21[target]
test_x = df_22.drop(target, axis=1)
test_y = df_22[target]


In [192]:
# Save each DataFrame as a csv file (train_x.csv / train_y.csv / test_x.csv / test_y.csv)
train_x.to_csv("train_x.csv",index=False)
train_y.to_csv("train_y.csv",index=False)
test_x.to_csv("test_x.csv",index=False)
test_y.to_csv("test_y.csv",index=False)

PermissionError: [Errno 13] Permission denied: 'train_x.csv'
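
A `PermissionError` on `to_csv` usually means the target file cannot be overwritten, for example because it is open in another program or the location is read-only. Closing the file and re-running the cell is often enough. The sketch below shows one defensive option: a helper named `save_csv` (a hypothetical name, not part of pandas) that falls back to a sibling file name when the original target is locked.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Hypothetical helper: write df to path, or to a '_new' sibling if path is locked."""
    path = Path(path)
    try:
        df.to_csv(path, index=False)
        return path
    except PermissionError:
        # Typically the target is open in another program or is read-only.
        fallback = path.with_name(path.stem + "_new" + path.suffix)
        df.to_csv(fallback, index=False)
        return fallback


# Demo in a temporary directory so no real file is touched.
toy_x = pd.DataFrame({"PM10": [25.0, 27.0], "hour": [1, 2]})
out_dir = Path(tempfile.mkdtemp())
written = save_csv(toy_x, out_dir / "train_x.csv")
restored = pd.read_csv(written)

print(written.name, restored.shape)
```

Reading the file back with `pd.read_csv` is a quick check that the saved columns match what the modeling step expects.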