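### 🧠 **Machine Learning Practice - Classification**

Machine learning (ML): a branch of artificial intelligence that learns from data, discovers patterns, and uses them to make decisions.

#### 📊 **Machine Learning Workflow**
1. Define the problem (read and understand the task)
2. Import the required libraries and load the data
3. EDA (exploratory data analysis)
    - Inspect sample rows, data size, dtypes, summary statistics (numeric and categorical), missing values, etc.
4. Data preprocessing
    - Handle missing values and outliers, encoding, scaling, etc.
5. Split off validation data
6. Train and evaluate the model
7. Predict and create the result file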



| Variable    | Meaning                                |
|-------------|----------------------------------------|
| x_train     | Training input data (features)         |
| y_train     | Training target data (labels)          |
| x_test      | Test input data (features)             |
| y_test      | Test target data (labels)              |


---
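#### **1. Problem Definition**

**[Problem]**
- Data: US census data
- Target: each person's income
    - The `income` column has two classes: annual income above 50K (`>50K`) or at most 50K (`<=50K`).
- Evaluation metric: ROC-AUC
- Submission file: write only the predictions to 'result.csv' (a single column named `pred`).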

---


#### **2. Import Libraries and Load Data**

In [208]:
import pandas as pd

In [209]:
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

---
#### **3. EDA (Exploratory Data Analysis)**

In [210]:
train.head()

Unnamed: 0,id,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,3331,34.0,State-gov,177331,Some-college,10,Married-civ-spouse,Prof-specialty,Husband,Black,Male,4386,0,40.0,United-States,>50K
1,19749,58.0,Private,290661,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40.0,United-States,<=50K
2,1157,48.0,Private,125933,Some-college,10,Widowed,Exec-managerial,Unmarried,Black,Female,0,1669,38.0,United-States,<=50K
3,693,58.0,Private,100313,Some-college,10,Married-civ-spouse,Protective-serv,Husband,White,Male,0,1902,40.0,United-States,>50K
4,12522,41.0,Private,195661,Some-college,10,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,54.0,United-States,<=50K


In [211]:
test.head()

Unnamed: 0,id,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
0,11574,39.0,State-gov,114055,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,40.0,United-States
1,15847,38.0,Private,254114,Some-college,10,Married-spouse-absent,Prof-specialty,Own-child,Black,Female,0,0,40.0,United-States
2,17655,44.0,State-gov,55395,HS-grad,9,Never-married,Craft-repair,Not-in-family,White,Male,0,0,,United-States
3,19790,47.0,Private,28035,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,50.0,United-States
4,31812,62.0,,186611,HS-grad,9,Never-married,,Not-in-family,White,Male,0,0,40.0,United-States


In [212]:
## Check the data size (number of rows and columns)
print(train.shape)
print(test.shape)

# train has exactly one more column than test,
# because test does not include the target column (income).

(29304, 16)
(3257, 15)


In [213]:
## Inspect column dtypes and non-null counts
print(train.info())

# object (categorical) columns -> must be converted to numeric before modeling.
# int/float (numeric) columns -> likely to need scaling.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29304 entries, 0 to 29303
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              29304 non-null  int64  
 1   age             29292 non-null  float64
 2   workclass       27642 non-null  object 
 3   fnlwgt          29304 non-null  int64  
 4   education       29304 non-null  object 
 5   education.num   29304 non-null  int64  
 6   marital.status  29304 non-null  object 
 7   occupation      27636 non-null  object 
 8   relationship    29304 non-null  object 
 9   race            29304 non-null  object 
 10  sex             29304 non-null  object 
 11  capital.gain    29304 non-null  int64  
 12  capital.loss    29304 non-null  int64  
 13  hours.per.week  29291 non-null  float64
 14  native.country  28767 non-null  object 
 15  income          29304 non-null  object 
dtypes: float64(2), int64(5), object(9)
memory usage: 3.6+ MB
None


In [214]:
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3257 entries, 0 to 3256
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              3257 non-null   int64  
 1   age             3251 non-null   float64
 2   workclass       3083 non-null   object 
 3   fnlwgt          3257 non-null   int64  
 4   education       3257 non-null   object 
 5   education.num   3257 non-null   int64  
 6   marital.status  3257 non-null   object 
 7   occupation      3082 non-null   object 
 8   relationship    3257 non-null   object 
 9   race            3257 non-null   object 
 10  sex             3257 non-null   object 
 11  capital.gain    3257 non-null   int64  
 12  capital.loss    3257 non-null   int64  
 13  hours.per.week  3248 non-null   float64
 14  native.country  3211 non-null   object 
dtypes: float64(2), int64(5), object(8)
memory usage: 381.8+ KB
None


In [215]:
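## Check missing values
train.isnull().sum()

# Missing values need to be imputed.
# Replace them with a value that suits the situation: mean, median, mode, minimum, maximum, a value from other data, etc.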

id                   0
age                 12
workclass         1662
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1668
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week      13
native.country     537
income               0
dtype: int64

In [216]:
test.isnull().sum()

id                  0
age                 6
workclass         174
fnlwgt              0
education           0
education.num       0
marital.status      0
occupation        175
relationship        0
race                0
sex                 0
capital.gain        0
capital.loss        0
hours.per.week      9
native.country     46
dtype: int64

In [217]:
## Check summary statistics
train.describe()

Unnamed: 0,id,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
count,29304.0,29292.0,29304.0,29304.0,29304.0,29304.0,29291.0
mean,16264.02788,38.553223,189748.8,10.080842,1093.858722,86.744506,40.434229
std,9384.518323,13.628811,105525.0,2.570824,7477.43564,401.518928,12.324036
min,0.0,-38.0,12285.0,1.0,0.0,0.0,1.0
25%,8145.75,28.0,117789.0,9.0,0.0,0.0,40.0
50%,16253.5,37.0,178376.5,10.0,0.0,0.0,40.0
75%,24374.25,48.0,237068.2,12.0,0.0,0.0,45.0
max,32560.0,90.0,1484705.0,16.0,99999.0,4356.0,99.0


---
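#### **4. Data Preprocessing**

In [218]:
## Drop rows with missing values (alternative, left disabled)

# train = train.dropna()
# print(train.shape)

Imputing missing values (categorical)
- Categorical columns are usually filled with the mode (most frequent value).
- Any column imputed in train must be imputed the same way in test.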

In [219]:
## Impute missing values (categorical columns, filled with the mode)
train['workclass'] = train['workclass'].fillna(train['workclass'].mode()[0])
train['native.country'] = train['native.country'].fillna(train['native.country'].mode()[0])
train['occupation'] = train['occupation'].fillna(train['occupation'].mode()[0])

test['workclass'] = test['workclass'].fillna(test['workclass'].mode()[0])
test['native.country'] = test['native.country'].fillna(test['native.country'].mode()[0])
test['occupation'] = test['occupation'].fillna(test['occupation'].mode()[0])

In [220]:
train.isnull().sum()

id                 0
age               12
workclass          0
fnlwgt             0
education          0
education.num      0
marital.status     0
occupation         0
relationship       0
race               0
sex                0
capital.gain       0
capital.loss       0
hours.per.week    13
native.country     0
income             0
dtype: int64

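Imputing missing values (numeric)
- Numeric columns are filled with a suitable statistic (here, the mean for `age` and the median for `hours.per.week`).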

In [221]:
train['age'] = train['age'].fillna(train['age'].mean())
train['hours.per.week'] = train['hours.per.week'].fillna(train['hours.per.week'].median())

test['age'] = test['age'].fillna(test['age'].mean())
test['hours.per.week'] = test['hours.per.week'].fillna(test['hours.per.week'].median())
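
The cells above fill `test` with statistics computed from `test` itself. A common alternative is to compute the fill values on `train` only and reuse them for `test`, so both sets are treated identically. The following is a minimal sketch of that idea on toy data; the frames `toy_train` and `toy_test` and the dictionary `fill_values` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for train/test
toy_train = pd.DataFrame({
    'workclass': ['Private', 'Private', None, 'State-gov'],
    'age': [30.0, None, 50.0, 40.0],
})
toy_test = pd.DataFrame({
    'workclass': [None, 'State-gov'],
    'age': [None, 20.0],
})

# Fill values are computed from the training data only
fill_values = {
    'workclass': toy_train['workclass'].mode()[0],  # most frequent category
    'age': toy_train['age'].mean(),                 # mean of the observed ages
}

# fillna accepts a dict mapping column name -> fill value
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)

print(toy_test)
```

With this approach the same value fills a given column in both frames, which matches the rule that train and test must be imputed the same way.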

In [222]:
train.isnull().sum() # confirm that every missing value has been filled

id                0
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64

In [223]:
test.isnull().sum()

id                0
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
dtype: int64

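Handling outliers
- Check with `describe()`.
- `age` contains negative values (the minimum is -38). An age cannot be negative, so those rows must be removed.
- test has no negative `age` values.
- After running the code, always verify the result with `print(df.shape)`.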

In [224]:
train.describe()

Unnamed: 0,id,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
count,29304.0,29304.0,29304.0,29304.0,29304.0,29304.0,29304.0
mean,16264.02788,38.553223,189748.8,10.080842,1093.858722,86.744506,40.434036
std,9384.518323,13.62602,105525.0,2.570824,7477.43564,401.518928,12.321306
min,0.0,-38.0,12285.0,1.0,0.0,0.0,1.0
25%,8145.75,28.0,117789.0,9.0,0.0,0.0,40.0
50%,16253.5,37.0,178376.5,10.0,0.0,0.0,40.0
75%,24374.25,48.0,237068.2,12.0,0.0,0.0,45.0
max,32560.0,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [225]:
test.describe()

Unnamed: 0,id,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
count,3257.0,3257.0,3257.0,3257.0,3257.0,3257.0,3257.0
mean,16423.704943,38.80283,190044.7,10.079214,931.804728,92.336199,40.466994
std,9535.416746,13.904759,105790.2,2.590118,6496.962999,415.732721,12.581146
min,3.0,17.0,18827.0,1.0,0.0,0.0,1.0
25%,8078.0,28.0,118652.0,9.0,0.0,0.0,40.0
50%,16626.0,37.0,178319.0,10.0,0.0,0.0,40.0
75%,24743.0,48.0,236436.0,12.0,0.0,0.0,45.0
max,32559.0,90.0,1033222.0,16.0,99999.0,3900.0,99.0


In [226]:
print(train.shape)
train = train[train['age'] > 0]
print(train.shape)

(29304, 16)
(29301, 16)


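#### **Encoding**



**One-Hot Encoding**
- Converts categorical data into numeric data by creating one 0/1 column per category.
- After one-hot encoding, train and test can end up with different numbers of columns, so the two are concatenated before encoding.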


In [227]:
y_train = train.pop('income')
# pop() removes the income column from train and returns it, so the assignment and the deletion happen in one step

In [228]:
# One-hot encoding
train_oh = pd.get_dummies(train)
test_oh = pd.get_dummies(test)
print(train.shape, train_oh.shape, test.shape, test_oh.shape)

(29301, 15) (29301, 106) (3257, 15) (3257, 102)


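train_oh and test_oh have different column counts (106 vs. 102), so train and test are concatenated, encoded together, and split again so that the columns match.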

In [229]:
print(train.shape, test.shape)
data = pd.concat([train, test], axis=0) # axis=0: stack train on top of test (row-wise)
data_oh = pd.get_dummies(data)

train_oh = data_oh.iloc[:len(train)].copy() # train part
test_oh = data_oh.iloc[len(train):].copy() # test part
print(train_oh.shape, test_oh.shape)

(29301, 15) (3257, 15)
(29301, 106) (3257, 106)
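
Another way to align the columns, without concatenating, is to reindex the encoded test frame to the training columns. This is only a sketch on hypothetical toy data (`toy_train`, `toy_test`); with this approach, categories that appear only in test are dropped and categories missing from test become all-zero columns.

```python
import pandas as pd

# Hypothetical toy frames: test is missing two of the categories
toy_train = pd.DataFrame({'race': ['White', 'Black', 'Other']})
toy_test = pd.DataFrame({'race': ['White', 'White']})

train_oh = pd.get_dummies(toy_train)
test_oh = pd.get_dummies(toy_test)
print(train_oh.shape, test_oh.shape)   # column counts differ

# Force test to have exactly the training columns, in the same order
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)
print(train_oh.shape, test_oh.shape)   # column counts now match
```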


**Label Encoding**
- Converts categorical data into numeric data.
- Maps each unique value to an integer.

In [230]:
from sklearn.preprocessing import LabelEncoder

cols = train.select_dtypes(include='object').columns # select only the categorical columns

for col in cols :
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col])
    test[col] = le.transform(test[col]) # the encoder is fitted on train only, so test uses transform without fit

train.head()

Unnamed: 0,id,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
0,3331,34.0,6,177331,15,10,2,9,0,2,1,4386,0,40.0,38
1,19749,58.0,3,290661,11,9,2,2,0,4,1,0,0,40.0,38
2,1157,48.0,3,125933,15,10,6,3,4,2,0,0,1669,38.0,38
3,693,58.0,3,100313,15,10,2,10,0,4,1,0,1902,40.0,38
4,12522,41.0,3,195661,15,10,2,13,0,4,1,0,0,54.0,38
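
To see which integer each category received, the fitted encoder's `classes_` attribute can be inspected: the position of a label in `classes_` is its code. The sketch below uses a few made-up labels rather than the census columns, and also shows that `transform` raises a `ValueError` for a label that was not seen during `fit`.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Private', 'State-gov', 'Private', 'Local-gov'])

# classes_ is sorted; the index of a label in classes_ is its encoded integer
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(codes)

try:
    le.transform(['Never-worked'])   # label not seen during fit
except ValueError as err:
    print('unseen label:', err)
```

This is why a category that exists only in test would make the label-encoding cell above fail.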


#### **Scaling**
- Adjusts the range of numeric data.



**Min-Max Scaling**
- Rescales the data to the range 0 to 1, based on the minimum and maximum of the training data.

In [231]:
def get_df() :
    train_copy = train.copy()
    test_copy = test.copy()
    return train_copy, test_copy

In [232]:
from sklearn.preprocessing import MinMaxScaler

train_copy, test_copy = get_df()
cols = ['age','fnlwgt','education.num','capital.gain','capital.loss','hours.per.week'] # numeric columns

scaler = MinMaxScaler()
train_copy[cols] = scaler.fit_transform(train_copy[cols])
test_copy[cols] = scaler.transform(test_copy[cols])

train_copy[cols].head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
0,0.232877,0.112092,0.6,0.04386,0.0,0.397959
1,0.561644,0.18906,0.533333,0.0,0.0,0.397959
2,0.424658,0.077184,0.6,0.0,0.38315,0.377551
3,0.561644,0.059785,0.6,0.0,0.436639,0.397959
4,0.328767,0.124541,0.6,0.0,0.0,0.540816
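
Min-max scaling applies (x - min) / (max - min) to each column. The sketch below reproduces the scaled `age` values shown above, assuming the filtered training ages range from 17 to 90. That range is an inference from the scaled output, not something the notebook prints, and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(x, lo, hi):
    """Rescale x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

# Assumed range of train['age'] after the negative ages were removed
AGE_MIN, AGE_MAX = 17, 90

# Ages of the first four training rows
for age in [34, 58, 48, 41]:
    print(age, round(min_max_scale(age, AGE_MIN, AGE_MAX), 6))
```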


In [233]:
train_copy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29301 entries, 0 to 29303
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              29301 non-null  int64  
 1   age             29301 non-null  float64
 2   workclass       29301 non-null  int32  
 3   fnlwgt          29301 non-null  float64
 4   education       29301 non-null  int32  
 5   education.num   29301 non-null  float64
 6   marital.status  29301 non-null  int32  
 7   occupation      29301 non-null  int32  
 8   relationship    29301 non-null  int32  
 9   race            29301 non-null  int32  
 10  sex             29301 non-null  int32  
 11  capital.gain    29301 non-null  float64
 12  capital.loss    29301 non-null  float64
 13  hours.per.week  29301 non-null  float64
 14  native.country  29301 non-null  int32  
dtypes: float64(6), int32(8), int64(1)
memory usage: 2.7 MB


**Standard Scaling**
- Transforms the data so that each column has mean 0 and standard deviation 1.

In [234]:
from sklearn.preprocessing import StandardScaler

train_copy, test_copy = get_df()

scaler = StandardScaler()
train_copy[cols] = scaler.fit_transform(train_copy[cols])
test_copy[cols] = scaler.transform(test_copy[cols])

display(train_copy[cols].head())

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
0,-0.335138,-0.117705,-0.031462,0.440247,-0.216056,-0.035121
1,1.428574,0.956277,-0.42043,-0.146298,-0.216056,-0.035121
2,0.693694,-0.604783,-0.031462,-0.146298,3.940528,-0.19745
3,1.428574,-0.847573,-0.031462,-0.146298,4.520806,-0.035121
4,0.179278,0.056001,-0.031462,-0.146298,-0.216056,1.101181
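
Standard scaling applies z = (x - mean) / std to each column, where `StandardScaler` uses the population standard deviation (ddof=0). A minimal sketch on made-up numbers, using only the standard library:

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up numbers

mean = statistics.fmean(values)
std = statistics.pstdev(values)   # population standard deviation (ddof=0)

# Subtract the mean and divide by the standard deviation
scaled = [(v - mean) / std for v in values]
print(mean, std)
print(scaled)
```

After scaling, the values have mean 0 and standard deviation 1, which is the property described above.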


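**Robust Scaling**
- Subtracts the median from each value and divides by the interquartile range (IQR = Q3 - Q1).
- Less affected by outliers.
- When a column's IQR is 0, as for `capital.gain` and `capital.loss` here, scikit-learn leaves the values undivided, which is why those columns appear unchanged in the output.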

In [235]:
from sklearn.preprocessing import RobustScaler

train_copy, test_copy = get_df()

scaler = RobustScaler()
train_copy[cols] = scaler.fit_transform(train_copy[cols])
test_copy[cols] = scaler.transform(test_copy[cols])

display(train_copy[cols].head())

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
0,-0.15,-0.008711,0.0,4386.0,0.0,0.0
1,1.05,0.941438,-0.333333,0.0,0.0,0.0
2,0.55,-0.439627,0.0,0.0,1669.0,-0.4
3,1.05,-0.654423,0.0,0.0,1902.0,0.0
4,0.2,0.144966,0.0,0.0,0.0,2.8
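
Robust scaling can be checked by hand with the quartiles from the `describe()` output earlier (Q1 = 28, median = 37, Q3 = 48 for `age`). Those quartiles were printed before the negative-age rows were removed, so treat them as approximate; `robust_scale` is a hypothetical helper written for this sketch.

```python
def robust_scale(x, median, q1, q3):
    """Center on the median and divide by the interquartile range."""
    return (x - median) / (q3 - q1)

# Quartiles of age, read from the describe() output
AGE_Q1, AGE_MEDIAN, AGE_Q3 = 28.0, 37.0, 48.0

# Ages of the first four training rows
for age in [34, 58, 48, 41]:
    print(age, round(robust_scale(age, AGE_MEDIAN, AGE_Q1, AGE_Q3), 6))
```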


---
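#### **5. Splitting Off Validation Data**

- The validation data is taken from part of the training data.

`X_train, X_val, y_train, y_val = train_test_split(train, y_train, test_size=0.2, random_state=1)`


- After the split, `X_train.shape` and `X_val.shape` must have the same number of columns.
- After the split, `y_train.shape` and `y_val.shape` should not show a column dimension.
    - y holds the target labels, so it is a single column.
    - Because it is one-dimensional, the shape has no second number.


(23440,), (5861,) => indicates a Series (1-D)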

In [236]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(train, y_train, test_size=0.2, random_state=1)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((23440, 15), (5861, 15), (23440,), (5861,))
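
When the classes are imbalanced, one optional variation is a stratified split, which keeps the class ratio the same in both parts. The notebook does not do this; the sketch below only illustrates the `stratify` parameter of `train_test_split` on made-up labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 15 samples of one class, 5 of the other
X = pd.DataFrame({'feature': range(20)})
y = pd.Series(['<=50K'] * 15 + ['>50K'] * 5)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(X_tr.shape, X_va.shape, y_tr.shape, y_va.shape)
print(y_va.value_counts())   # class ratio in the validation part
```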

---
#### **6. Model Training and Evaluation**


**Random Forest**
- An ensemble learning algorithm built from many decision trees.


1. `rf = RandomForestClassifier() # choose the model`
2. `rf.fit(X_train, y_train) # train`
3. `pred = rf.predict_proba(X_val) # predict class probabilities`

In [237]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0) # fixing random_state makes the results reproducible
rf.fit(X_train, y_train) # train
pred = rf.predict_proba(X_val) # predict class probabilities (columns follow rf.classes_)

print(rf.classes_)
pred[:10]

['<=50K' '>50K']


array([[0.86, 0.14],
       [0.53, 0.47],
       [1.  , 0.  ],
       [0.64, 0.36],
       [0.88, 0.12],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.91, 0.09],
       [0.82, 0.18],
       [0.91, 0.09]])
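
The columns of `predict_proba` follow the order of `classes_`, so the column for the positive class can be looked up by name instead of being hard-coded as index 1. A sketch with a tiny made-up dataset (the names `model`, `pos_idx`, and `proba_pos` are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up dataset with the same two labels as the income column
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array(['<=50K', '<=50K', '<=50K', '>50K', '>50K', '>50K'])

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Find the column that holds the probability of '>50K'
pos_idx = list(model.classes_).index('>50K')
proba_pos = model.predict_proba(X)[:, pos_idx]

print(model.classes_, pos_idx)
print(proba_pos)
```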

**Evaluation Metrics**
- After training, the model must be evaluated to check how well it learned and how well it predicts.
- Accuracy, Precision, Recall, F1 score, ROC-AUC, etc.

In [238]:
# ROC_AUC
from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_val, pred[:,1]) # column 1 is the probability of '>50K'
print('roc_auc:', roc_auc)

roc_auc: 0.9157221689432705


In [239]:
# Accuracy
from sklearn.metrics import accuracy_score

pred = rf.predict(X_val)
accuracy = accuracy_score(y_val, pred)
print('accuracy_score:', accuracy)

accuracy_score: 0.86316328271626


In [240]:
# F1 score
from sklearn.metrics import f1_score

f1 = f1_score(y_val, pred, pos_label='>50K') # specify the positive class
print('f1_score:',f1)

f1_score: 0.6920122887864824


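**LightGBM**

- A gradient boosting framework based on decision trees. On this validation set it scores higher than Random Forest on all three metrics.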

In [241]:
%pip install lightgbm -q

Note: you may need to restart the kernel to use updated packages.


In [242]:
import lightgbm as lgb

lgbmc = lgb.LGBMClassifier(random_state=0, verbose=-1) # verbose=-1 hides log messages
lgbmc.fit(X_train, y_train)
pred = lgbmc.predict_proba(X_val)

roc_auc = roc_auc_score(y_val, pred[:,1])
print('roc_auc:', roc_auc)

pred = lgbmc.predict(X_val)
accuracy = accuracy_score(y_val, pred)
print('accuracy:', accuracy)

f1 = f1_score(y_val, pred, pos_label='>50K')
print('f1_score:', f1)

roc_auc: 0.9283022976825719
accuracy: 0.8740829210032418
f1_score: 0.716589861751152


---
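#### **7. Prediction and Result File**
- Up to the previous step, predictions were made on the validation data; in this final step they are made on the test data.
- The final CSV file must have the same number of rows as the test data.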

In [243]:
## Predict with LightGBM
pred = lgbmc.predict_proba(test)
pred

array([[0.91420997, 0.08579003],
       [0.96821076, 0.03178924],
       [0.97920447, 0.02079553],
       ...,
       [0.90836567, 0.09163433],
       [0.9877401 , 0.0122599 ],
       [0.99154646, 0.00845354]])

In [244]:
## Predict with RandomForest
pred2 = rf.predict_proba(test)
pred2

array([[0.99, 0.01],
       [0.99, 0.01],
       [0.97, 0.03],
       ...,
       [0.84, 0.16],
       [0.97, 0.03],
       [0.94, 0.06]])

In [None]:
result = pd.DataFrame({'pred':pred[:,1]})
result.to_csv('result.csv', index=False)

In [None]:
new = pd.read_csv('result.csv')
new

Unnamed: 0,pred
0,0.085790
1,0.031789
2,0.020796
3,0.836385
4,0.051083
...,...
3252,0.014245
3253,0.377803
3254,0.091634
3255,0.012260


In [248]:
new.shape

(3257, 1)
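
The two submission requirements (a single `pred` column and one row per test sample) can also be checked in code before submitting. A minimal sketch with made-up probabilities; `n_test_rows` stands in for `len(test)` and the file is written to a temporary directory rather than the working directory.

```python
import os
import tempfile

import pandas as pd

n_test_rows = 5                       # stands in for len(test)
proba = [0.1, 0.8, 0.3, 0.5, 0.9]     # made-up positive-class probabilities

# Write the file the same way as the notebook: one column, no index
path = os.path.join(tempfile.mkdtemp(), 'result.csv')
pd.DataFrame({'pred': proba}).to_csv(path, index=False)

# Read it back and verify the submission format
check = pd.read_csv(path)
assert list(check.columns) == ['pred']
assert len(check) == n_test_rows
print(check.shape)
```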

---
### **Machine Learning Evaluation Metrics**

#### **Binary Classification Metrics**
- Accuracy
- Precision
- Recall
- F1 score
- ROC-AUC

In [1]:
import pandas as pd

In [3]:
# Binary classification data
y_true = pd.DataFrame([1, 1, 1, 0, 0, 1, 1, 1, 1, 0]) # actual values
y_pred = pd.DataFrame([1, 0, 1, 1, 0, 0, 0, 1, 1, 0]) # predicted values

y_true_str = pd.DataFrame(['A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'B']) # actual values
y_pred_str = pd.DataFrame(['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B']) # predicted values

**Accuracy**
- The proportion of all samples that were predicted correctly.

In [4]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_true, y_pred)
print('Accuracy:', accuracy)

accuracy = accuracy_score(y_true_str, y_pred_str)
print('Accuracy:', accuracy)

Accuracy: 0.6
Accuracy: 0.6


**Precision**
- Of the samples predicted as positive, the proportion that are actually positive.

In [6]:
from sklearn.metrics import precision_score

precision = precision_score(y_true, y_pred)
print('Precision:', precision)

precision = precision_score(y_true_str, y_pred_str, pos_label='A')
print('Precision:', precision)

Precision: 0.8
Precision: 0.8


**Recall**
- Of the samples that are actually positive, the proportion the model correctly predicted as positive.

In [7]:
from sklearn.metrics import recall_score

recall = recall_score(y_true, y_pred)
print('Recall:', recall)

recall = recall_score(y_true_str, y_pred_str, pos_label='A')
print('Recall:', recall)

Recall: 0.5714285714285714
Recall: 0.5714285714285714


**F1 score**
- The harmonic mean of precision and recall.
- Useful for evaluating models on imbalanced data, where accuracy alone can be misleading.

In [8]:
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print('f1-score:', f1)

f1 = f1_score(y_true_str, y_pred_str, pos_label='A')
print('f1-score:', f1)

f1-score: 0.6666666666666666
f1-score: 0.6666666666666666
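
The four metrics above can be derived by hand from the confusion counts (TP, FP, FN, TN). The sketch below uses the same numeric labels as the binary example, stored as plain lists, and should reproduce the printed values.

```python
y_true = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]   # actual values
y_pred = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]   # predicted values

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)   # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)   # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)   # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)   # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```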


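**ROC-AUC**
- The area under the ROC curve. It measures how well the predicted probabilities rank positive samples above negative ones (0.5 is random guessing, 1.0 is perfect).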

In [10]:
from sklearn.metrics import roc_auc_score

y_true = pd.DataFrame([0, 1, 0, 1, 1, 0, 0, 0, 1, 1]) # actual values
y_pred_proba = pd.DataFrame([0.4, 0.9, 0.1, 0.3, 0.8, 0.6, 0.4, 0.2, 0.7, 0.6]) # predicted probability of the positive class (1)

roc_auc = roc_auc_score(y_true, y_pred_proba)
print('ROC-AUC:', roc_auc)

# The same approach works for string labels.

ROC-AUC: 0.86
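
ROC-AUC equals the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The sketch below applies that pairwise definition to the same data as plain lists; it illustrates the meaning of the metric and is not how scikit-learn computes it internally.

```python
y_true = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1]
y_score = [0.4, 0.9, 0.1, 0.3, 0.8, 0.6, 0.4, 0.2, 0.7, 0.6]

# Separate the scores of positive and negative samples
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]

wins = 0.0
for p in pos:
    for n in neg:
        if p > n:
            wins += 1.0      # positive ranked above negative
        elif p == n:
            wins += 0.5      # a tie counts as half

auc = wins / (len(pos) * len(neg))
print('ROC-AUC:', auc)
```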


---
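#### **Multiclass Classification Metrics**
- Similar to the binary classification metrics.
- Precision, recall, and F1 score require an averaging method, passed through the `average` parameter.
    - Macro average: compute the metric for each class, then take the unweighted mean.
    - Micro average: pool the true positives, false positives, and false negatives of all classes, then compute the metric once.
    - Weighted average: compute the metric for each class, then average them weighted by the number of actual samples in each class.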

In [11]:
y_true = pd.DataFrame([1, 2, 3, 3, 2, 1, 3, 3, 2, 1]) # actual values
y_pred = pd.DataFrame([1, 2, 1, 3, 2, 1, 1, 2, 2, 1]) # predicted values

y_true_str = pd.DataFrame(['A', 'B', 'C', 'C', 'B', 'A', 'C', 'C', 'B', 'A']) # actual values
y_pred_str = pd.DataFrame(['A', 'B', 'A', 'C', 'B', 'A', 'A', 'B', 'B', 'A']) # predicted values

In [12]:
# Precision
from sklearn.metrics import precision_score

precision = precision_score(y_true, y_pred, average='macro')
print('Precision:', precision)

precision = precision_score(y_true_str, y_pred_str, average='macro')
print('Precision:', precision)

Precision: 0.7833333333333333
Precision: 0.7833333333333333


In [13]:
# Recall
from sklearn.metrics import recall_score

recall = recall_score(y_true, y_pred, average='micro')
print('Recall:', recall)

recall = recall_score(y_true_str, y_pred_str, average='micro')
print('Recall:', recall)

Recall: 0.7
Recall: 0.7


In [14]:
# f1-score
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred, average='weighted')
print('f1-score:', f1)

f1 = f1_score(y_true_str, y_pred_str, average='weighted')
print('f1-score:', f1)

f1-score: 0.6421428571428571
f1-score: 0.6421428571428571
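
The three averaging modes can be worked out by hand from per-class counts. The sketch below uses the numeric multiclass labels from above as plain lists and should reproduce the macro precision, micro recall, and weighted F1 printed by scikit-learn; the helper names `counts` and `f1_from` are hypothetical.

```python
y_true = [1, 2, 3, 3, 2, 1, 3, 3, 2, 1]
y_pred = [1, 2, 1, 3, 2, 1, 1, 2, 2, 1]
classes = sorted(set(y_true))

def counts(c):
    """Return (TP, FP, FN) treating class c as the positive class."""
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_from(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

per_class = {c: counts(c) for c in classes}

# Macro: metric per class, then the unweighted mean
macro_precision = sum(tp / (tp + fp) for tp, fp, fn in per_class.values()) / len(classes)

# Micro: pool the counts of all classes, then compute once
total_tp = sum(tp for tp, fp, fn in per_class.values())
total_fn = sum(fn for tp, fp, fn in per_class.values())
micro_recall = total_tp / (total_tp + total_fn)

# Weighted: metric per class, weighted by the number of actual samples
support = {c: y_true.count(c) for c in classes}
weighted_f1 = sum(support[c] * f1_from(*per_class[c]) for c in classes) / len(y_true)

print(macro_precision, micro_recall, weighted_f1)
```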


---
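#### **Regression Metrics**
- Most of them measure error: the closer to 0, the better the model.
    - R-squared (coefficient of determination) is the exception: the closer to 1, the better.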

In [15]:
# Regression data
import pandas as pd

y_true = pd.DataFrame([1, 2, 5, 2, 4, 4, 7, 9]) # actual values
y_pred = pd.DataFrame([1.14, 2.53, 4.87, 3.08, 4.21, 5.53, 7.51, 10.32]) # predicted values

**MSE**
- Mean Squared Error
- The mean of the squared differences between actual and predicted values.

In [17]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_true, y_pred)
print('MSE:', mse)

MSE: 0.7339125000000001


**MAE**
- Mean Absolute Error
- The mean of the absolute differences between actual and predicted values.

In [18]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_true, y_pred)
print('MAE:', mae)

MAE: 0.68125


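**Coefficient of Determination (R-squared)**
- Indicates how well the regression predicts the target: the proportion of the variance in the actual values that the predictions explain.
- R²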

In [19]:
from sklearn.metrics import r2_score

r2 = r2_score(y_true, y_pred)
print('R²:', r2)

R²: 0.8859941747572815


**RMSE**
- Root Mean Squared Error
- The square root of the mean of the squared differences between actual and predicted values (the square root of MSE).

In [20]:
from sklearn.metrics import root_mean_squared_error # available in scikit-learn 1.4 and later

rmse = root_mean_squared_error(y_true, y_pred)
print('RMSE:', rmse)

RMSE: 0.8566869323154171
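
MSE, MAE, RMSE, and R² can all be computed directly from their definitions. The sketch below does so with the standard library only, on the same regression data stored as plain lists, and should reproduce the values printed above.

```python
import math

y_true = [1, 2, 5, 2, 4, 4, 7, 9]
y_pred = [1.14, 2.53, 4.87, 3.08, 4.21, 5.53, 7.51, 10.32]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n      # mean of squared errors
mae = sum(abs(e) for e in errors) / n      # mean of absolute errors
rmse = math.sqrt(mse)                      # square root of MSE

# R² = 1 - (residual sum of squares) / (total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print('MSE:', mse)
print('MAE:', mae)
print('RMSE:', rmse)
print('R²:', r2)
```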


**MSLE**
- Mean Squared Logarithmic Error
- The mean of the squared differences between log(1 + actual value) and log(1 + predicted value).

In [21]:
from sklearn.metrics import mean_squared_log_error

msle = mean_squared_log_error(y_true, y_pred)
print("MSLE:", msle)

MSLE: 0.027278486182156947


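**RMSLE**
- Root Mean Squared Logarithmic Error
- The square root of MSLE: take log(1 + value) of the actual and predicted values, square the differences, average them, and take the square root.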

In [22]:
from sklearn.metrics import root_mean_squared_log_error # available in scikit-learn 1.4 and later

rmsle = root_mean_squared_log_error(y_true, y_pred)
print('RMSLE:', rmsle)

RMSLE: 0.16516199981278062
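
MSLE and RMSLE follow the same pattern as MSE and RMSE, applied to log(1 + value) instead of the raw values. A sketch with the standard library, using `math.log1p` on the same regression data as plain lists:

```python
import math

y_true = [1, 2, 5, 2, 4, 4, 7, 9]
y_pred = [1.14, 2.53, 4.87, 3.08, 4.21, 5.53, 7.51, 10.32]

# Squared difference of log(1 + actual) and log(1 + predicted)
sq_log_errors = [
    (math.log1p(t) - math.log1p(p)) ** 2
    for t, p in zip(y_true, y_pred)
]

msle = sum(sq_log_errors) / len(y_true)
rmsle = math.sqrt(msle)

print('MSLE:', msle)
print('RMSLE:', rmsle)
```

Because of the logarithm, these metrics penalize relative rather than absolute differences, and they require non-negative values.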


**MAPE**
- Mean Absolute Percentage Error
- The error between predicted and actual values, expressed as a percentage of the actual value.

In [23]:
mape = (abs((y_true - y_pred) / y_true)).mean() * 100
print("MAPE:", mape)

MAPE: 0    20.319048
dtype: float64
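
The cell above prints a one-element Series because `y_true` and `y_pred` are DataFrames. To get a single number, the same formula can be applied to plain lists, as in this sketch:

```python
y_true = [1, 2, 5, 2, 4, 4, 7, 9]
y_pred = [1.14, 2.53, 4.87, 3.08, 4.21, 5.53, 7.51, 10.32]

# Mean of |actual - predicted| / |actual|, expressed as a percentage
mape = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true) * 100
print('MAPE:', mape)
```

scikit-learn also provides `mean_absolute_percentage_error`, which returns a fraction rather than a percentage, so its result would need to be multiplied by 100 to match.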


---
