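## Multi-class classification problem
- Predict the heating load level!

- Target to predict (y): Heat_Load (Very Low, Low, Medium, High, Very High)
- Evaluation metric: f1-macro
- Data: train.csv, test.csv (loaded in this notebook as energy_train.csv and energy_test.csv)
- Submission format: submit a result.csv file in the format below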
~~~
pred
Very Low
Low
High
...
Very High
~~~

### Submission notes
- Verify the submitted file by reading it back with pd.read_csv('result.csv')

# 1. Problem definition
- Problem: multi-class classification
- Evaluation metric: f1-macro
- Target: string labels (5 classes)
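
Because the metric is f1-macro, every class counts equally toward the score regardless of how many rows it has. The sketch below computes it by hand on a made-up six-row example (the labels and predictions are illustrative, not taken from the dataset). `macro_f1` is a hypothetical helper written only for this illustration; the notebook itself relies on `sklearn.metrics.f1_score` with `average='macro'`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treat an empty class as 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only
y_true = ["Low", "Low", "High", "High", "Medium", "Medium"]
y_pred = ["Low", "High", "High", "High", "Medium", "Low"]

score = macro_f1(y_true, y_pred)
print(round(score, 4))  # per-class F1: High 0.8, Low 0.5, Medium 0.6667
```

A class that the model predicts poorly pulls the average down just as much as a large class would, which matters here because the five Heat_Load levels are not equally frequent.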

# 2. Loading libraries and data

In [1]:
# Load the data
import pandas as pd
train = pd.read_csv("energy_train.csv")
test = pd.read_csv("energy_test.csv")

# 3. Exploratory data analysis (EDA)

In [2]:
# Check the data sizes
train.shape, test.shape

((537, 10), (231, 9))

In [3]:
# Inspect train samples
train.head(2)

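| | Compac | Surf_Area | Wall_Area | Roof | Height | Orient | Glaze_Area | Glaze_Distr | Cool_Load | Heat_Load |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.74 | 686.0 | 245.0 | 220.5 | Short | South | 0.25 | 3 | 14.72 | Very Low |
| 1 | 0.98 | 514.5 | 294.0 | Small | Tall | South | 0.4 | 2 | 33.94 | High |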


In [4]:
# Inspect a test sample
test.head(1)

| | Compac | Surf_Area | Wall_Area | Roof | Height | Orient | Glaze_Area | Glaze_Distr | Cool_Load |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.64 | 784.0 | 343.0 | 220.5 | Short | South | 0.4 | 4 | 22.25 |


In [5]:
# Check column dtypes
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537 entries, 0 to 536
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Compac       537 non-null    float64
 1   Surf_Area    537 non-null    float64
 2   Wall_Area    537 non-null    float64
 3   Roof         537 non-null    object 
 4   Height       537 non-null    object 
 5   Orient       537 non-null    object 
 6   Glaze_Area   537 non-null    float64
 7   Glaze_Distr  537 non-null    int64  
 8   Cool_Load    537 non-null    float64
 9   Heat_Load    537 non-null    object 
dtypes: float64(5), int64(1), object(4)
memory usage: 42.1+ KB


In [6]:
train['Roof'].value_counts()

220.5     257
Large     141
Medium     92
Small      47
Name: Roof, dtype: int64
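
The counts above show that `Roof` mixes a numeric-looking value (`220.5`) with size categories, which is why the column is loaded as `object`. The following is a minimal sketch, on a made-up Series rather than the real column, of how one might flag which entries parse as numbers. The notebook itself leaves the column untouched and lets one-hot encoding treat `220.5` as just another category.

```python
import pandas as pd

# Made-up values that mimic the mix seen in the Roof column
roof = pd.Series(["220.5", "Large", "Medium", "Small", "220.5"], name="Roof")

# Entries that cannot be parsed as numbers become NaN
as_number = pd.to_numeric(roof, errors="coerce")
numeric_mask = as_number.notna()

print(roof[numeric_mask].unique())   # entries that look numeric
print(roof[~numeric_mask].unique())  # entries that are category names
```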

In [7]:
# Summary statistics for train (object columns)
train.describe(include='O')

| | Roof | Height | Orient | Heat_Load |
|---|---|---|---|---|
| count | 537 | 537 | 537 | 537 |
| unique | 4 | 2 | 4 | 5 |
| top | 220.5 | Tall | South | Very Low |
| freq | 257 | 280 | 145 | 142 |


In [8]:
# Summary statistics for test (object columns)
test.describe(include='O')

| | Roof | Height | Orient |
|---|---|---|---|
| count | 231 | 231 | 231 |
| unique | 4 | 2 | 4 |
| top | 220.5 | Short | North |
| freq | 127 | 127 | 74 |


In [9]:
# Check missing values (train)
train.isnull().sum()

Compac         0
Surf_Area      0
Wall_Area      0
Roof           0
Height         0
Orient         0
Glaze_Area     0
Glaze_Distr    0
Cool_Load      0
Heat_Load      0
dtype: int64

In [10]:
# Check missing values (test)
test.isnull().sum().sum()

0

In [11]:
# Check the target distribution
train['Heat_Load'].value_counts()

Very Low     142
Low          123
High         122
Very High     79
Medium        71
Name: Heat_Load, dtype: int64

# 4. Data preprocessing

In [12]:
# Separate the target column from the features
target = train.pop('Heat_Load')

In [13]:
# One-hot encoding (pandas)
print(train.shape, test.shape)
train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(train.shape, test.shape)

(537, 9) (231, 9)
(537, 16) (231, 16)
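
Here both frames end up with 16 columns, but encoding train and test separately only lines up when every category appears in both files. A hedged sketch of one way to guard against a mismatch, using toy frames (`train_toy` and `test_toy` are made up for illustration) and `DataFrame.reindex` to force the test columns to match train:

```python
import pandas as pd

# Toy frames: "East" and "North" appear only in train, "West" only in test
train_toy = pd.DataFrame({"Orient": ["South", "North", "East"],
                          "Glaze_Area": [0.25, 0.4, 0.1]})
test_toy = pd.DataFrame({"Orient": ["South", "West"],
                         "Glaze_Area": [0.4, 0.25]})

train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)
print(train_enc.shape, test_enc.shape)  # column counts differ

# Keep exactly the train columns, in train order; categories missing
# from test are filled with 0, categories unseen in train are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Dropping a category the model never saw during training is a design choice in this sketch, not something the notebook prescribes.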


# 5. Validation split

In [14]:
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train,
                                           target,
                                           test_size=0.2,
                                           random_state=0)
X_tr.shape, X_val.shape, y_tr.shape, y_val.shape

((429, 16), (108, 16), (429,), (108,))
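
The target classes are uneven (from 142 rows for Very Low down to 71 for Medium), so a purely random split can shift the class ratios in a validation set of only 108 rows. One option, shown as a sketch on made-up data rather than as what this notebook does, is to pass `stratify` to `train_test_split` so each split keeps the original proportions. The names `X_toy`, `y_toy` and the `_s` suffixed variables are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, deliberately uneven target with the same five labels
labels = (["Very Low"] * 20 + ["Low"] * 15 + ["Medium"] * 5
          + ["High"] * 5 + ["Very High"] * 5)
y_toy = pd.Series(labels, name="Heat_Load")
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

# stratify=y_toy keeps the class proportions in both splits
X_tr_s, X_val_s, y_tr_s, y_val_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

print(y_val_s.value_counts().to_dict())
```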

# 6. Model training and evaluation

In [17]:
# Evaluation function
from sklearn.metrics import f1_score

In [19]:
# Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_val)

f1_score(y_val, pred, average='macro')

0.9277616846430405
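
The score above comes from a single 108-row validation split, so it can move with a different `random_state`. As a rough cross-check, one could average macro F1 over several folds with `cross_val_score`. The sketch below uses synthetic data from `make_classification` as a stand-in for the encoded training frame, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 5-class data, used only so the example runs on its own
X_syn, y_syn = make_classification(n_samples=300, n_features=10,
                                   n_informative=6, n_classes=5,
                                   random_state=0)

rf_cv = RandomForestClassifier(n_estimators=50, random_state=0)

# One macro-F1 score per fold
scores = cross_val_score(rf_cv, X_syn, y_syn, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())
```

With the notebook's data, the same call would take the encoded `train` frame and `target` in place of `X_syn` and `y_syn`.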

# 7. Prediction and result file creation

In [20]:
# Predict on test
pred = rf.predict(test)
submit = pd.DataFrame({
    'pred':pred
})
submit.to_csv('result.csv', index=False)

In [21]:
pd.read_csv("result.csv")

| | pred |
|---|---|
| 0 | Low |
| 1 | High |
| 2 | High |
| 3 | Low |
| 4 | Low |
| ... | ... |
| 226 | Very Low |
| 227 | Medium |
| 228 | Very Low |
| 229 | Low |
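
Before submitting, it can help to check the file against the stated format: a single `pred` column, one row per test row, and only the five allowed labels. A small sketch with a hypothetical `check_submission` helper, run here on made-up frames:

```python
import pandas as pd

ALLOWED = {"Very Low", "Low", "Medium", "High", "Very High"}


def check_submission(df, expected_rows):
    """Return a list of problems found; empty if none (hypothetical helper)."""
    problems = []
    if list(df.columns) != ["pred"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if "pred" in df.columns:
        unknown = set(df["pred"]) - ALLOWED
        if unknown:
            problems.append(f"unknown labels: {sorted(unknown)}")
    return problems


# Made-up frames for illustration
good = pd.DataFrame({"pred": ["Low", "High", "Very Low"]})
bad = pd.DataFrame({"pred": ["Low", "high", "Very Low"]})  # wrong case

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

For the real file, the call would look like `check_submission(pd.read_csv('result.csv'), expected_rows=len(test))`.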
