# 미세먼지 농도에 따른 WHO 예보 등급 구하기

#### Data
주어진 데이터는 다음과 같습니다.

* stn_code: 데이터 측정 장소 아이디
* state: 주(州)
* location: 도시
* type: 측정 장소의 타입(주거지대, 공업지대 등)
* so2: 아황산가스 농도(μg/m3)
* no2: 이산화질소 농도(μg/m3)
* location\_monitoring\_station: 관측소 상세 주소
* pm2_5: 미세먼지 농도(μg/m3)
* date: 관측일(YYYY-MM-DD 형식)



### 초기코드 선택

Python, R 중 본인의 선호 언어에 따라 초기 코드를 선택하세요.
Python을 선호한다면 Python 초기 코드를, R을 선호한다면 R 초기 코드만 남기면 됩니다.

In [1]:
# 초기코드 - python

# 모듈 import
import pandas as pd

# 데이터 로드
df = pd.read_csv(
    'C:/Users/82104/Time series/data/station_day.csv'
)
df.head()

Unnamed: 0,StationId,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,AP001,2017-11-24,71.36,115.75,1.75,20.65,12.4,12.19,0.1,10.76,109.26,0.17,5.92,0.1,,
1,AP001,2017-11-25,81.4,124.5,1.44,20.5,12.08,10.72,0.12,15.24,127.09,0.2,6.5,0.06,184.0,Moderate
2,AP001,2017-11-26,78.32,129.06,1.26,26.0,14.85,10.28,0.14,26.96,117.44,0.22,7.95,0.08,197.0,Moderate
3,AP001,2017-11-27,88.76,135.32,6.6,30.85,21.77,12.91,0.11,33.59,111.81,0.29,7.63,0.12,198.0,Moderate
4,AP001,2017-11-28,64.18,104.09,2.56,28.07,17.01,11.42,0.09,19.0,138.18,0.17,5.02,0.07,188.0,Moderate


## 과제

WHO의 대기질 가이드라인에 따르면 대기 중 미세먼지 농도에 따른 예보 등급은 다음과 같습니다. 주어진 데이터를 활용해, 각 예보 등급이 나타난 횟수를 알아내는 것이 이번 과제의 목표입니다.

| pm 2.5 농도(μg/m3)     | 예보 등급  |
|-----------------|------------|
| 76 이상         |  매우 나쁨 |
| 36 이상 76 미만 | 나쁨       |
| 16 이상 36 미만 | 보통       |
| 16 미만         | 좋음       |

## 할 일

##### 1. 결측치 처리

그래프를 그리기 앞서, `PM2.5`이 없는 row를 지워주세요.

In [2]:
df = df[df['PM2.5'].isnull() == False]
df.reset_index(inplace = True)
df

Unnamed: 0,index,StationId,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,0,AP001,2017-11-24,71.36,115.75,1.75,20.65,12.40,12.19,0.10,10.76,109.26,0.17,5.92,0.10,,
1,1,AP001,2017-11-25,81.40,124.50,1.44,20.50,12.08,10.72,0.12,15.24,127.09,0.20,6.50,0.06,184.0,Moderate
2,2,AP001,2017-11-26,78.32,129.06,1.26,26.00,14.85,10.28,0.14,26.96,117.44,0.22,7.95,0.08,197.0,Moderate
3,3,AP001,2017-11-27,88.76,135.32,6.60,30.85,21.77,12.91,0.11,33.59,111.81,0.29,7.63,0.12,198.0,Moderate
4,4,AP001,2017-11-28,64.18,104.09,2.56,28.07,17.01,11.42,0.09,19.00,138.18,0.17,5.02,0.07,188.0,Moderate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86405,108030,WB013,2020-06-27,8.65,16.46,,,,,0.69,4.36,30.59,1.32,7.26,,50.0,Good
86406,108031,WB013,2020-06-28,11.80,18.47,,,,,0.68,3.49,38.95,1.42,7.92,,65.0,Satisfactory
86407,108032,WB013,2020-06-29,18.60,32.26,13.65,200.87,214.20,11.40,0.78,5.12,38.17,3.52,8.64,,63.0,Satisfactory
86408,108033,WB013,2020-06-30,16.07,39.30,7.56,29.13,36.69,29.26,0.69,5.88,29.64,1.86,8.40,,57.0,Satisfactory


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86410 entries, 0 to 86409
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   index       86410 non-null  int64  
 1   StationId   86410 non-null  object 
 2   Date        86410 non-null  object 
 3   PM2.5       86410 non-null  float64
 4   PM10        61525 non-null  float64
 5   NO          84336 non-null  float64
 6   NO2         85008 non-null  float64
 7   NOx         82658 non-null  float64
 8   NH3         58100 non-null  float64
 9   CO          83460 non-null  float64
 10  SO2         76848 non-null  float64
 11  O3          78770 non-null  float64
 12  Benzene     68896 non-null  float64
 13  Toluene     62284 non-null  float64
 14  Xylene      19935 non-null  float64
 15  AQI         83537 non-null  float64
 16  AQI_Bucket  83537 non-null  object 
dtypes: float64(13), int64(1), object(3)
memory usage: 11.2+ MB


##### 2. 각 예보 등급이 나타난 횟수 구하기

주어진 데이터를 활용해, "매우 나쁨" ,"나쁨", "보통", "좋음" 예보 등급이 각각 나타난 횟수를 알아내 주세요. 결과는 대략 다음과 같은 형식으로 나올 겁니다.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>count</th>
    </tr>
    <tr>
      <th>예보 등급</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>매우 나쁨</th>
      <td>0</td>
    </tr>
    <tr>
      <th>나쁨</th>
      <td>0</td>
    </tr>
    <tr>
      <th>보통</th>
      <td>0</td>
    </tr>
    <tr>
      <th>좋음</th>
      <td>0</td>
    </tr>
  </tbody>
</table>

| pm 2.5 농도(μg/m3)     | 예보 등급  |
|-----------------|------------|
| 76 이상         |  매우 나쁨 |
| 36 이상 76 미만 | 나쁨       |
| 16 이상 36 미만 | 보통       |
| 16 미만         | 좋음       |

In [4]:
import numpy as np

df['예보 등급'] = np.nan

In [5]:
f1 = df[df['PM2.5'] >= 76]['PM2.5']
f2 = df[(df['PM2.5'] >= 36) & (df['PM2.5'] < 76)]['PM2.5']
f3 = df[(df['PM2.5'] >= 16) & (df['PM2.5'] < 36)]['PM2.5']
f4 = df[df['PM2.5'] < 16]['PM2.5']

In [6]:
cnt1 = df[df['PM2.5'] >= 76]['PM2.5'].count()
cnt2 = df[(df['PM2.5'] >= 36) & (df['PM2.5'] < 76)]['PM2.5'].count()
cnt3 = df[(df['PM2.5'] >= 16) & (df['PM2.5'] < 36)]['PM2.5'].count()
cnt4 = df[df['PM2.5'] < 16]['PM2.5'].count()

In [7]:
print(cnt1, cnt2, cnt3, cnt4)

30509 30264 18894 6743


In [8]:
df['예보 등급'][f1.index] = '매우 나쁨'
df['예보 등급'][f2.index] = '나쁨'
df['예보 등급'][f3.index] = '보통'
df['예보 등급'][f4.index] = '좋음'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['예보 등급'][f1.index] = '매우 나쁨'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['예보 등급'][f2.index] = '나쁨'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['예보 등급'][f3.index] = '보통'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['예보 등급'][f4.index] = '좋음'


In [9]:
dfNew = pd.DataFrame({'예보 등급' : ['매우 나쁨', '나쁨', '보통', '좋음'], 'count' : [cnt1, cnt2, cnt3, cnt4]})
dfNew

Unnamed: 0,예보 등급,count
0,매우 나쁨,30509
1,나쁨,30264
2,보통,18894
3,좋음,6743


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86410 entries, 0 to 86409
Data columns (total 18 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   index       86410 non-null  int64  
 1   StationId   86410 non-null  object 
 2   Date        86410 non-null  object 
 3   PM2.5       86410 non-null  float64
 4   PM10        61525 non-null  float64
 5   NO          84336 non-null  float64
 6   NO2         85008 non-null  float64
 7   NOx         82658 non-null  float64
 8   NH3         58100 non-null  float64
 9   CO          83460 non-null  float64
 10  SO2         76848 non-null  float64
 11  O3          78770 non-null  float64
 12  Benzene     68896 non-null  float64
 13  Toluene     62284 non-null  float64
 14  Xylene      19935 non-null  float64
 15  AQI         83537 non-null  float64
 16  AQI_Bucket  83537 non-null  object 
 17  예보 등급       86410 non-null  object 
dtypes: float64(13), int64(1), object(4)
memory usage: 11.9+ MB


In [11]:
df['StationId'] = df['StationId'].astype('category')

In [12]:
df['Date'] = pd.to_datetime(df['Date'])

In [13]:
F1 = pd.get_dummies(df['AQI_Bucket'], prefix = 'AQI_Bucket')
F1

Unnamed: 0,AQI_Bucket_Good,AQI_Bucket_Moderate,AQI_Bucket_Poor,AQI_Bucket_Satisfactory,AQI_Bucket_Severe,AQI_Bucket_Very Poor
0,0,0,0,0,0,0
1,0,1,0,0,0,0
2,0,1,0,0,0,0
3,0,1,0,0,0,0
4,0,1,0,0,0,0
...,...,...,...,...,...,...
86405,1,0,0,0,0,0
86406,0,0,0,1,0,0
86407,0,0,0,1,0,0
86408,0,0,0,1,0,0


In [14]:
F1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86410 entries, 0 to 86409
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   AQI_Bucket_Good          86410 non-null  uint8
 1   AQI_Bucket_Moderate      86410 non-null  uint8
 2   AQI_Bucket_Poor          86410 non-null  uint8
 3   AQI_Bucket_Satisfactory  86410 non-null  uint8
 4   AQI_Bucket_Severe        86410 non-null  uint8
 5   AQI_Bucket_Very Poor     86410 non-null  uint8
dtypes: uint8(6)
memory usage: 506.4 KB


In [15]:
#
df = pd.merge(df, F1, left_index = True, right_index = True, how = 'left')
df

Unnamed: 0,index,StationId,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,...,Xylene,AQI,AQI_Bucket,예보 등급,AQI_Bucket_Good,AQI_Bucket_Moderate,AQI_Bucket_Poor,AQI_Bucket_Satisfactory,AQI_Bucket_Severe,AQI_Bucket_Very Poor
0,0,AP001,2017-11-24,71.36,115.75,1.75,20.65,12.40,12.19,0.10,...,0.10,,,나쁨,0,0,0,0,0,0
1,1,AP001,2017-11-25,81.40,124.50,1.44,20.50,12.08,10.72,0.12,...,0.06,184.0,Moderate,매우 나쁨,0,1,0,0,0,0
2,2,AP001,2017-11-26,78.32,129.06,1.26,26.00,14.85,10.28,0.14,...,0.08,197.0,Moderate,매우 나쁨,0,1,0,0,0,0
3,3,AP001,2017-11-27,88.76,135.32,6.60,30.85,21.77,12.91,0.11,...,0.12,198.0,Moderate,매우 나쁨,0,1,0,0,0,0
4,4,AP001,2017-11-28,64.18,104.09,2.56,28.07,17.01,11.42,0.09,...,0.07,188.0,Moderate,나쁨,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86405,108030,WB013,2020-06-27,8.65,16.46,,,,,0.69,...,,50.0,Good,좋음,1,0,0,0,0,0
86406,108031,WB013,2020-06-28,11.80,18.47,,,,,0.68,...,,65.0,Satisfactory,좋음,0,0,0,1,0,0
86407,108032,WB013,2020-06-29,18.60,32.26,13.65,200.87,214.20,11.40,0.78,...,,63.0,Satisfactory,보통,0,0,0,1,0,0
86408,108033,WB013,2020-06-30,16.07,39.30,7.56,29.13,36.69,29.26,0.69,...,,57.0,Satisfactory,보통,0,0,0,1,0,0


In [16]:
df.columns

Index(['index', 'StationId', 'Date', 'PM2.5', 'PM10', 'NO', 'NO2', 'NOx',
       'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI',
       'AQI_Bucket', '예보 등급', 'AQI_Bucket_Good', 'AQI_Bucket_Moderate',
       'AQI_Bucket_Poor', 'AQI_Bucket_Satisfactory', 'AQI_Bucket_Severe',
       'AQI_Bucket_Very Poor'],
      dtype='object')

In [17]:
df.drop('AQI_Bucket', axis = 1, inplace = True)
df

Unnamed: 0,index,StationId,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,...,Toluene,Xylene,AQI,예보 등급,AQI_Bucket_Good,AQI_Bucket_Moderate,AQI_Bucket_Poor,AQI_Bucket_Satisfactory,AQI_Bucket_Severe,AQI_Bucket_Very Poor
0,0,AP001,2017-11-24,71.36,115.75,1.75,20.65,12.40,12.19,0.10,...,5.92,0.10,,나쁨,0,0,0,0,0,0
1,1,AP001,2017-11-25,81.40,124.50,1.44,20.50,12.08,10.72,0.12,...,6.50,0.06,184.0,매우 나쁨,0,1,0,0,0,0
2,2,AP001,2017-11-26,78.32,129.06,1.26,26.00,14.85,10.28,0.14,...,7.95,0.08,197.0,매우 나쁨,0,1,0,0,0,0
3,3,AP001,2017-11-27,88.76,135.32,6.60,30.85,21.77,12.91,0.11,...,7.63,0.12,198.0,매우 나쁨,0,1,0,0,0,0
4,4,AP001,2017-11-28,64.18,104.09,2.56,28.07,17.01,11.42,0.09,...,5.02,0.07,188.0,나쁨,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86405,108030,WB013,2020-06-27,8.65,16.46,,,,,0.69,...,7.26,,50.0,좋음,1,0,0,0,0,0
86406,108031,WB013,2020-06-28,11.80,18.47,,,,,0.68,...,7.92,,65.0,좋음,0,0,0,1,0,0
86407,108032,WB013,2020-06-29,18.60,32.26,13.65,200.87,214.20,11.40,0.78,...,8.64,,63.0,보통,0,0,0,1,0,0
86408,108033,WB013,2020-06-30,16.07,39.30,7.56,29.13,36.69,29.26,0.69,...,8.40,,57.0,보통,0,0,0,1,0,0


In [18]:
df['예보 등급'] = df['예보 등급'].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86410 entries, 0 to 86409
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   index                    86410 non-null  int64         
 1   StationId                86410 non-null  category      
 2   Date                     86410 non-null  datetime64[ns]
 3   PM2.5                    86410 non-null  float64       
 4   PM10                     61525 non-null  float64       
 5   NO                       84336 non-null  float64       
 6   NO2                      85008 non-null  float64       
 7   NOx                      82658 non-null  float64       
 8   NH3                      58100 non-null  float64       
 9   CO                       83460 non-null  float64       
 10  SO2                      76848 non-null  float64       
 11  O3                       78770 non-null  float64       
 12  Benzene                  68896 n

In [19]:
df.fillna(method = 'ffill', inplace = True)
df

Unnamed: 0,index,StationId,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,...,Toluene,Xylene,AQI,예보 등급,AQI_Bucket_Good,AQI_Bucket_Moderate,AQI_Bucket_Poor,AQI_Bucket_Satisfactory,AQI_Bucket_Severe,AQI_Bucket_Very Poor
0,0,AP001,2017-11-24,71.36,115.75,1.75,20.65,12.40,12.19,0.10,...,5.92,0.10,,나쁨,0,0,0,0,0,0
1,1,AP001,2017-11-25,81.40,124.50,1.44,20.50,12.08,10.72,0.12,...,6.50,0.06,184.0,매우 나쁨,0,1,0,0,0,0
2,2,AP001,2017-11-26,78.32,129.06,1.26,26.00,14.85,10.28,0.14,...,7.95,0.08,197.0,매우 나쁨,0,1,0,0,0,0
3,3,AP001,2017-11-27,88.76,135.32,6.60,30.85,21.77,12.91,0.11,...,7.63,0.12,198.0,매우 나쁨,0,1,0,0,0,0
4,4,AP001,2017-11-28,64.18,104.09,2.56,28.07,17.01,11.42,0.09,...,5.02,0.07,188.0,나쁨,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86405,108030,WB013,2020-06-27,8.65,16.46,23.51,16.50,40.02,25.09,0.69,...,7.26,2.42,50.0,좋음,1,0,0,0,0,0
86406,108031,WB013,2020-06-28,11.80,18.47,23.51,16.50,40.02,25.09,0.68,...,7.92,2.42,65.0,좋음,0,0,0,1,0,0
86407,108032,WB013,2020-06-29,18.60,32.26,13.65,200.87,214.20,11.40,0.78,...,8.64,2.42,63.0,보통,0,0,0,1,0,0
86408,108033,WB013,2020-06-30,16.07,39.30,7.56,29.13,36.69,29.26,0.69,...,8.40,2.42,57.0,보통,0,0,0,1,0,0


In [20]:
df.drop('Date', axis = 1, inplace = True)
df

Unnamed: 0,index,StationId,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,...,Toluene,Xylene,AQI,예보 등급,AQI_Bucket_Good,AQI_Bucket_Moderate,AQI_Bucket_Poor,AQI_Bucket_Satisfactory,AQI_Bucket_Severe,AQI_Bucket_Very Poor
0,0,AP001,71.36,115.75,1.75,20.65,12.40,12.19,0.10,10.76,...,5.92,0.10,,나쁨,0,0,0,0,0,0
1,1,AP001,81.40,124.50,1.44,20.50,12.08,10.72,0.12,15.24,...,6.50,0.06,184.0,매우 나쁨,0,1,0,0,0,0
2,2,AP001,78.32,129.06,1.26,26.00,14.85,10.28,0.14,26.96,...,7.95,0.08,197.0,매우 나쁨,0,1,0,0,0,0
3,3,AP001,88.76,135.32,6.60,30.85,21.77,12.91,0.11,33.59,...,7.63,0.12,198.0,매우 나쁨,0,1,0,0,0,0
4,4,AP001,64.18,104.09,2.56,28.07,17.01,11.42,0.09,19.00,...,5.02,0.07,188.0,나쁨,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86405,108030,WB013,8.65,16.46,23.51,16.50,40.02,25.09,0.69,4.36,...,7.26,2.42,50.0,좋음,1,0,0,0,0,0
86406,108031,WB013,11.80,18.47,23.51,16.50,40.02,25.09,0.68,3.49,...,7.92,2.42,65.0,좋음,0,0,0,1,0,0
86407,108032,WB013,18.60,32.26,13.65,200.87,214.20,11.40,0.78,5.12,...,8.64,2.42,63.0,보통,0,0,0,1,0,0
86408,108033,WB013,16.07,39.30,7.56,29.13,36.69,29.26,0.69,5.88,...,8.40,2.42,57.0,보통,0,0,0,1,0,0


In [21]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

In [22]:
target = df['예보 등급']
df.drop('예보 등급', axis = 1, inplace = True)

In [23]:
# F2 = pd.get_dummies(df['StationId'], prefix = 'StationId')
# F2

In [24]:
# df = pd.merge(df, F2, left_index = True, right_index = True, how = 'left')
# df

In [25]:
df.drop('StationId', axis = 1, inplace = True)
df

Unnamed: 0,index,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket_Good,AQI_Bucket_Moderate,AQI_Bucket_Poor,AQI_Bucket_Satisfactory,AQI_Bucket_Severe,AQI_Bucket_Very Poor
0,0,71.36,115.75,1.75,20.65,12.40,12.19,0.10,10.76,109.26,0.17,5.92,0.10,,0,0,0,0,0,0
1,1,81.40,124.50,1.44,20.50,12.08,10.72,0.12,15.24,127.09,0.20,6.50,0.06,184.0,0,1,0,0,0,0
2,2,78.32,129.06,1.26,26.00,14.85,10.28,0.14,26.96,117.44,0.22,7.95,0.08,197.0,0,1,0,0,0,0
3,3,88.76,135.32,6.60,30.85,21.77,12.91,0.11,33.59,111.81,0.29,7.63,0.12,198.0,0,1,0,0,0,0
4,4,64.18,104.09,2.56,28.07,17.01,11.42,0.09,19.00,138.18,0.17,5.02,0.07,188.0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86405,108030,8.65,16.46,23.51,16.50,40.02,25.09,0.69,4.36,30.59,1.32,7.26,2.42,50.0,1,0,0,0,0,0
86406,108031,11.80,18.47,23.51,16.50,40.02,25.09,0.68,3.49,38.95,1.42,7.92,2.42,65.0,0,0,0,1,0,0
86407,108032,18.60,32.26,13.65,200.87,214.20,11.40,0.78,5.12,38.17,3.52,8.64,2.42,63.0,0,0,0,1,0,0
86408,108033,16.07,39.30,7.56,29.13,36.69,29.26,0.69,5.88,29.64,1.86,8.40,2.42,57.0,0,0,0,1,0,0


In [26]:
df.fillna(method = 'ffill', inplace = True, axis = 0)
df

Unnamed: 0,index,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket_Good,AQI_Bucket_Moderate,AQI_Bucket_Poor,AQI_Bucket_Satisfactory,AQI_Bucket_Severe,AQI_Bucket_Very Poor
0,0,71.36,115.75,1.75,20.65,12.40,12.19,0.10,10.76,109.26,0.17,5.92,0.10,,0,0,0,0,0,0
1,1,81.40,124.50,1.44,20.50,12.08,10.72,0.12,15.24,127.09,0.20,6.50,0.06,184.0,0,1,0,0,0,0
2,2,78.32,129.06,1.26,26.00,14.85,10.28,0.14,26.96,117.44,0.22,7.95,0.08,197.0,0,1,0,0,0,0
3,3,88.76,135.32,6.60,30.85,21.77,12.91,0.11,33.59,111.81,0.29,7.63,0.12,198.0,0,1,0,0,0,0
4,4,64.18,104.09,2.56,28.07,17.01,11.42,0.09,19.00,138.18,0.17,5.02,0.07,188.0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86405,108030,8.65,16.46,23.51,16.50,40.02,25.09,0.69,4.36,30.59,1.32,7.26,2.42,50.0,1,0,0,0,0,0
86406,108031,11.80,18.47,23.51,16.50,40.02,25.09,0.68,3.49,38.95,1.42,7.92,2.42,65.0,0,0,0,1,0,0
86407,108032,18.60,32.26,13.65,200.87,214.20,11.40,0.78,5.12,38.17,3.52,8.64,2.42,63.0,0,0,0,1,0,0
86408,108033,16.07,39.30,7.56,29.13,36.69,29.26,0.69,5.88,29.64,1.86,8.40,2.42,57.0,0,0,0,1,0,0


In [27]:
df.reset_index(inplace = True)

In [28]:
df.drop('index', axis = 1, inplace = True)

In [29]:
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size = 0.2, random_state = 42)

In [30]:
X_train

Unnamed: 0,level_0,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket_Good,AQI_Bucket_Moderate,AQI_Bucket_Poor,AQI_Bucket_Satisfactory,AQI_Bucket_Severe,AQI_Bucket_Very Poor
46368,46368,92.79,155.42,3.26,18.43,21.69,9.11,0.61,14.02,63.98,0.00,0.00,6.99,123.0,0,1,0,0,0,0
46867,46867,23.75,155.42,4.15,22.33,26.48,9.11,0.85,8.22,25.91,0.00,0.00,6.99,52.0,0,0,0,1,0,0
5780,5780,46.03,155.58,14.99,38.94,161.73,51.07,0.62,44.65,10.95,0.68,9.36,0.00,202.0,0,0,1,0,0,0
64654,64654,79.85,164.87,8.23,18.10,26.32,28.82,0.34,8.43,52.85,0.14,3.03,8.82,172.0,0,1,0,0,0,0
8974,8974,85.02,106.61,8.80,2.81,11.31,33.15,1.28,23.12,43.27,0.38,17.62,0.00,301.0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6265,6265,197.56,343.59,83.07,71.39,124.25,112.10,2.66,19.89,16.11,8.50,23.29,0.00,452.0,0,0,0,0,1,0
54886,54886,13.47,28.95,1.38,2.83,10.62,2.40,0.55,1.60,17.67,0.00,0.00,0.00,41.0,1,0,0,0,0,0
76820,76820,50.32,4.85,26.75,48.57,42.74,45.05,0.62,35.54,45.27,1.38,2.26,0.40,80.0,0,0,0,1,0,0
860,860,19.96,58.07,1.42,5.43,4.05,10.08,0.55,9.11,21.51,0.05,0.34,0.30,62.0,0,0,0,1,0,0


In [31]:
# # stationId때문에 Scaler는 끝나고 여유 있을 때 처리
# scaler = StandardScaler()
# X_train_scaler = scaler.fit_transform(X_train)
# X_test_scaler = scaler.fit_transform(X_test)


In [32]:
X_train.drop('level_0', axis = 1, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [33]:
X_test.drop('level_0', axis = 1, inplace = True)

In [34]:
X_train.isnull()
y_train.isnull()

46368    False
46867    False
5780     False
64654    False
8974     False
         ...  
6265     False
54886    False
76820    False
860      False
15795    False
Name: 예보 등급, Length: 69128, dtype: bool

In [35]:
from sklearn.metrics import confusion_matrix

In [46]:
forest = RandomForestClassifier(random_state = 42)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)

confusion_matrix(y_true = y_test, y_pred = y_pred)

array([[6065,    0,    0,    0],
       [   0, 6136,    0,    0],
       [   0,    0, 3712,    1],
       [   0,    0,    0, 1368]], dtype=int64)

In [None]:
np.any(np.isnan(X_train))

In [None]:
cross_val_score(estimator = forest, X = X_test, y = y_test, cv = 5)

In [None]:
X_train.fillna(method = 'bfill')
X_train

In [None]:
X_train[np.isnan(X_train)] = X_train.median(axis = 1)

In [45]:
X_train['AQI'][0] = X_train['AQI'][1]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['AQI'][0] = X_train['AQI'][1]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exec(code_obj, self.user_global_ns, self.user_ns)


In [43]:
X_train[np.isnan(X_train) == False].count()

PM2.5                      69128
PM10                       69128
NO                         69128
NO2                        69128
NOx                        69128
NH3                        69128
CO                         69128
SO2                        69128
O3                         69128
Benzene                    69128
Toluene                    69128
Xylene                     69128
AQI                        69127
AQI_Bucket_Good            69128
AQI_Bucket_Moderate        69128
AQI_Bucket_Poor            69128
AQI_Bucket_Satisfactory    69128
AQI_Bucket_Severe          69128
AQI_Bucket_Very Poor       69128
dtype: int64