# [LAB#03] 나이브 베이즈 (naive bayes)

- 분석 데이터: "**제7차 한국인 인체치수 측정 데이터**"
- 나이브 베이즈를 직접 구현 (naive bayes from scratch)

<img src="objective.png" width="700px">

In [16]:
import numpy as np
from scipy.stats import norm
from scipy.stats import skewnorm
import pandas as pd

## 데이터 준비

In [18]:
df = pd.read_csv("2015_7th_korbody.csv",
                 thousands=',')

features = ["ⓞ_02_성별",
            "①_003_키",
            "①_031_몸무게",
            "①_104_손너비"]

sdf = df[features].copy()
sdf = sdf.sample(frac=1).reset_index(drop=True)  # Shuffle the data

sdf.columns = ["sex", "height", "weight", "hand_width"]

sdf.loc[sdf["sex"] == "남", "sex"] = 0
sdf.loc[sdf["sex"] == "여", "sex"] = 1

sdf.dropna(inplace=True)

In [19]:
sdf.head()

Unnamed: 0,sex,height,weight,hand_width
0,1,1580.0,65.0,79.0
1,0,1733.0,63.6,77.0
2,0,1683.0,59.6,75.0
3,1,1612.0,52.6,75.0
4,1,1595.0,50.5,70.0


## 데이터 나누기

In [20]:
iend_train = 51
sdf_train = sdf.loc[:iend_train]
sdf_test = sdf.loc[iend_train:]

sdf_train = sdf_train.reset_index(drop=True) 
sdf_test = sdf_test.reset_index(drop=True)

In [21]:
sdf_train_male = sdf_train[sdf_train["sex"] == 0]
sdf_train_female = sdf_train[sdf_train["sex"] == 1]

sdf_test_male = sdf_test[sdf_test["sex"] == 0]
sdf_test_female = sdf_test[sdf_test["sex"] == 1]

In [22]:
print("[Train]")
print("- Num. People:", sdf_train.shape[0])
print("- Num. Males:", sdf_train_male.shape[0])
print("- Num. Females:", sdf_train_female.shape[0])

print("[Test]")
print("- Num. People:", sdf_test.shape[0])
print("- Num. Males:", sdf_test_male.shape[0])
print("- Num. Females:", sdf_test_female.shape[0])

[Train]
- Num. People: 52
- Num. Males: 27
- Num. Females: 25
[Test]
- Num. People: 6361
- Num. Males: 3164
- Num. Females: 3197


## 모델 훈련
나이브 베이즈에서 모델을 훈련하는 것은 개별 피쳐의 분포를 결정하고 분포의 파라미터를 추정하는 것으로 정의할 수 있음.

- 피쳐의 분포 결정
- 피쳐 분포의 파라미터 추정/적합

In [23]:
mh_mu, mh_std = norm.fit(sdf_train_male["height"])
fh_mu, fh_std = norm.fit(sdf_train_female["height"])

mw_a, mw_loc, mw_scale = skewnorm.fit(sdf_train_male["weight"])
fw_a, fw_loc, fw_scale = skewnorm.fit(sdf_train_female["weight"])

mhw_a, mhw_loc, mhw_scale = skewnorm.fit(sdf_train_male["hand_width"])
fhw_a, fhw_loc, fhw_scale = skewnorm.fit(sdf_train_female["hand_width"])

prob_male = sdf_train_male.shape[0] / sdf_train.shape[0]
prob_female = sdf_train_female.shape[0] / sdf_train.shape[0]

In [24]:
fhw_a, fhw_loc, fhw_scale

(0.1333551313233386, 74.44185939605829, 4.154223752137083)

In [25]:
fw_a, fw_loc, fw_scale

(-5.19722475496709, 68.41645233339759, 14.045343318585413)

## 모델 테스트

In [26]:
# Calculate probabilities for height
x = sdf_test["height"]
prob_height_male = norm.pdf(x, mh_mu, mh_std)

x = sdf_test["height"]
prob_height_female = norm.pdf(x, fh_mu, fh_std)

# Calculate probabilities for weight
x = sdf_test["weight"]
prob_weight_male = skewnorm.pdf(x, mw_a, mw_loc, mw_scale)

x = sdf_test["weight"]
prob_weight_female = skewnorm.pdf(x, fw_a, fw_loc, fw_scale)

# Calculate probabilities for hand width
x = sdf_test["hand_width"]
prob_hw_male = skewnorm.pdf(x, mhw_a, mhw_loc, mhw_scale)

x = sdf_test["hand_width"]
prob_hw_female = skewnorm.pdf(x, fhw_a, fhw_loc, fhw_scale)

In [27]:
prob_male_x = prob_weight_male * prob_weight_male * prob_hw_male * prob_male
prob_female_x = prob_weight_female * prob_weight_female * prob_hw_female * prob_female
prob_y_x = np.stack([prob_male_x, prob_female_x], axis=1)
classified = np.argmax(prob_y_x, axis=1)

In [28]:
res = (classified == sdf_test["sex"])
print("[Result] Accuracy:", res.mean())

[Result] Accuracy: 0.8176387360477912
