In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('data.csv')

In [3]:
data

Unnamed: 0,Day,Outlook,Temperature,Humidity,Wind,Play Tennis
0,1,Sunny,Hot,High,Weak,No
1,2,Sunny,Hot,High,Strong,No
2,3,Overcast,Hot,High,Weak,Yes
3,4,Rain,Mild,High,Weak,Yes
4,5,Rain,Cool,Normal,Weak,Yes
5,6,Rain,Cool,Normal,Strong,No
6,7,Overcast,Cool,Normal,Strong,Yes
7,8,Sunny,Mild,High,Weak,No
8,9,Sunny,Cool,Normal,Weak,Yes
9,10,Rain,Mild,Normal,Weak,Yes


### Naive Bayes Classifier

$v_{NB} = \underset{v_j\in V}{\text{argmax }}P(v_j)\prod_i P(a_i|v_j)$

For the Play Tennis data, $v_{NB}$ above can be written out as follows.

$$
\begin{aligned}
v_{NB} = &\underset{v_j\in\left \{ yes, no \right \}}{\text{argmax }}P(v_j)\prod_i P(a_i|v_j)\\
= &\underset{v_j\in\left \{ yes, no \right \}}{\text{argmax }}P(v_j) \cdot P(\text{Outlook=sunny}|v_j)\cdot P(\text{Temperature=cool}|v_j) \\
 \cdot &P(\text{Humidity=high}|v_j) \cdot P(\text{Wind=strong}|v_j)
\end{aligned}
$$
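As a minimal sketch of the argmax above, the function below scores each class by multiplying its prior by the relative frequency of every attribute value within that class. The name `naive_bayes_predict` and the `rows` list are illustrative assumptions, not part of the notebook. `rows` holds only the ten rows displayed above, so the numbers need not match the notebook's outputs, which are computed from `data.csv`.

```python
import pandas as pd

# Only the ten rows displayed above (assumption: used here for illustration).
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
]
columns = ["Outlook", "Temperature", "Humidity", "Wind", "Play Tennis"]
df = pd.DataFrame(rows, columns=columns)


def naive_bayes_predict(df, target, instance):
    """Return (best_label, scores), where scores[v] = P(v) * prod_i P(a_i | v)."""
    priors = df[target].value_counts(normalize=True)
    scores = {}
    for label, prior in priors.items():
        subset = df[df[target] == label]
        score = float(prior)
        for feature, value in instance.items():
            # Relative frequency of `value` within this class (0.0 if it never occurs).
            score *= float((subset[feature] == value).mean())
        scores[label] = score
    best_label = max(scores, key=scores.get)
    return best_label, scores


instance = {"Outlook": "Sunny", "Temperature": "Cool",
            "Humidity": "High", "Wind": "Strong"}
best_label, scores = naive_bayes_predict(df, "Play Tennis", instance)
print(best_label, scores)
```

Comparing with `==` and taking the mean avoids a `KeyError` when an attribute value never appears within a class; the estimate is then simply 0.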

The prior $P(v_j)$ can be read directly from the data.

In [9]:
play_tennis_prob = data['Play Tennis'].value_counts(normalize=True)

play_tennis_yes = np.round(play_tennis_prob['Yes'], 2)
play_tennis_no = np.round(play_tennis_prob['No'], 2)

play_tennis_yes, play_tennis_no


(0.64, 0.36)

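Next, estimate the conditional probabilities needed to classify a new instance. `Overcast` never occurs together with `No`, so $P(\text{Overcast}|\text{no})$ is left out: looking up `'Overcast'` in that `value_counts` result would raise a `KeyError`.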

In [17]:
prob_sunny_given_yes = data['Outlook'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['Sunny']
prob_sunny_given_no = data['Outlook'][data['Play Tennis'] == 'No'].value_counts(normalize=True)['Sunny']
prob_overcast_given_yes = data['Outlook'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['Overcast']
prob_rain_given_yes = data['Outlook'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['Rain']
prob_rain_given_no = data['Outlook'][data['Play Tennis'] == 'No'].value_counts(normalize=True)['Rain']
prob_hot_given_yes = data['Temperature'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['Hot']
prob_hot_given_no = data['Temperature'][data['Play Tennis'] == 'No'].value_counts(normalize=True)['Hot']
prob_mild_given_yes = data['Temperature'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['Mild']
prob_mild_given_no = data['Temperature'][data['Play Tennis'] == 'No'].value_counts(normalize=True)['Mild']
prob_cool_given_yes = data['Temperature'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['Cool']
prob_cool_given_no = data['Temperature'][data['Play Tennis'] == 'No'].value_counts(normalize=True)['Cool']
prob_high_given_yes = data['Humidity'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['High']
prob_high_given_no = data['Humidity'][data['Play Tennis'] == 'No'].value_counts(normalize=True)['High']
prob_normal_given_yes = data['Humidity'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['Normal']
prob_normal_given_no = data['Humidity'][data['Play Tennis'] == 'No'].value_counts(normalize=True)['Normal']
prob_strongwind_given_yes = data['Wind'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['Strong']
prob_strongwind_given_no = data['Wind'][data['Play Tennis'] == 'No'].value_counts(normalize=True)['Strong']
prob_weakwind_given_yes = data['Wind'][data['Play Tennis'] == 'Yes'].value_counts(normalize=True)['Weak']
prob_weakwind_given_no = data['Wind'][data['Play Tennis'] == 'No'].value_counts(normalize=True)['Weak']
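The per-value lookups above can also be obtained in one step with `pd.crosstab`, which fills in 0 for combinations that never occur instead of raising a `KeyError`. The sketch below is an alternative, not the notebook's own method. It uses only the `Outlook` and `Play Tennis` columns of the ten rows displayed above, so its values need not match probabilities computed from `data.csv`; the names `shown` and `outlook_given_play` are illustrative.

```python
import pandas as pd

# Outlook and Play Tennis columns of the ten rows displayed above (illustration only).
shown = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain",
                "Rain", "Overcast", "Sunny", "Sunny", "Rain"],
    "Play Tennis": ["No", "No", "Yes", "Yes", "Yes",
                    "No", "Yes", "No", "Yes", "Yes"],
})

# normalize="columns" makes every column sum to 1, so each cell
# is P(Outlook = row | Play Tennis = column).
outlook_given_play = pd.crosstab(
    shown["Outlook"], shown["Play Tennis"], normalize="columns"
)
print(outlook_given_play)

# A combination that never occurs is reported as 0.0 rather than being missing.
print(outlook_given_play.loc["Overcast", "No"])
```

The same call works for `Temperature`, `Humidity`, and `Wind` by swapping the first argument.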


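The instance to classify is `<outlook=sunny, temperature=cool, humidity=high, wind=strong>`: will tennis be played under these conditions?

All that remains is to multiply the probabilities computed above, once for each class.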

$$
P(\text{yes})P(\text{sunny}|\text{yes})P(\text{cool}|\text{yes})P(\text{high}|\text{yes})P(\text{strong}|\text{yes})
$$

$$
P(\text{no})P(\text{sunny}|\text{no})P(\text{cool}|\text{no})P(\text{high}|\text{no})P(\text{strong}|\text{no})
$$

In [19]:
Yes_prob = play_tennis_yes * prob_sunny_given_yes * prob_cool_given_yes * prob_high_given_yes * prob_strongwind_given_yes
No_prob = play_tennis_no * prob_sunny_given_no * prob_cool_given_no * prob_high_given_no * prob_strongwind_given_no

Yes_prob, No_prob

(0.005267489711934155, 0.020736)

The score for No is larger, so the classifier predicts No: tennis will not be played.
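The two values printed above are unnormalized scores, so they do not sum to 1. Dividing each by their sum turns them into posterior probabilities for this instance. The short sketch below does this with the numbers from the output above; the variable names are illustrative.

```python
# Unnormalized scores copied from the output above.
yes_score = 0.005267489711934155
no_score = 0.020736

# Normalize so that the two posteriors sum to 1.
total = yes_score + no_score
posterior_yes = yes_score / total
posterior_no = no_score / total

print(round(posterior_yes, 3), round(posterior_no, 3))
```

Under these scores the posterior for No comes out at roughly 0.8, which shows how strongly the classifier favors No beyond the bare comparison of the two products.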