# Bayesian Classifiers

> ref. https://wikidocs.net/22892

## Naive Bayesian Classifiers

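- Conditional probability: $P(X | Y) = {{P(X, Y)} \over {P(Y)}}$
- Bayes' theorem:
  - If $P(X | Y)$ is known, Bayes' theorem gives $P(Y | X)$, provided $P(Y)$ and $P(X)$ are also known.
  - $P(Y | X) = {{P(X | Y) P(Y)} \over {P(X)}}$
- Bayes' theorem can be used to build a classifier:
  - For example, let $X$ be a text and $Y$ its label, ham (legitimate) or spam:
    - $P(\text{ham} | T)$ = the probability that a given text $T$ is ham
    - $P(\text{spam} | T)$ = the probability that a given text $T$ is spam
  - $P(\text{ham} | T) = {{P(T | \text{ham}) P(\text{ham})} \over {P(T)}}$
  - $P(\text{spam} | T) = {{P(T | \text{spam}) P(\text{spam})} \over {P(T)}}$
  - Both expressions share the denominator $P(T)$, so it can be dropped without affecting the comparison between the two. The results are then scores proportional to the posteriors, not probabilities:
    - $P(\text{ham} | T) \propto P(T | \text{ham}) P(\text{ham})$
    - $P(\text{spam} | T) \propto P(T | \text{spam}) P(\text{spam})$
- Multiply together the factors that determine $Y$:
  - For text, the tokenized words can serve as those factors:
    - $P(w_1 | \text{ham})$ is the probability that a particular word ($w_1$) appears in a text, given that the text is ham.
  - Assuming the words are conditionally independent given the class (the "naive" assumption), $P(T | Y)$ factors into per-word terms. For a text of three tokens:
  - $P(\text{ham} | T) \propto P(\text{ham}) P(w_1 | \text{ham}) P(w_2 | \text{ham}) P(w_3 | \text{ham})$
  - $P(\text{spam} | T) \propto P(\text{spam}) P(w_1 | \text{spam}) P(w_2 | \text{spam}) P(w_3 | \text{spam})$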
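
The scoring rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy corpus and the names `TRAIN`, `fit`, `log_scores`, and `predict` are invented for the example, the labels are ham (legitimate) and spam, and two details are assumptions that the notes above do not cover: add-one (Laplace) smoothing so that an unseen word does not zero out the product, and summing log-probabilities instead of multiplying raw probabilities to avoid underflow.

```python
import math
from collections import Counter

# Hypothetical toy training set: (tokens, label) pairs invented for illustration.
TRAIN = [
    (["free", "prize", "click"], "spam"),
    (["free", "offer", "now"], "spam"),
    (["meeting", "schedule", "now"], "ham"),
    (["project", "meeting", "notes"], "ham"),
    (["lunch", "schedule", "free"], "ham"),
]


def fit(train):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(label for _, label in train)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in train:
        word_counts[label].update(tokens)
    vocab = {word for tokens, _ in train for word in tokens}
    return class_counts, word_counts, vocab


def log_scores(tokens, class_counts, word_counts, vocab):
    """Return log(P(y) * prod_i P(w_i | y)) for every class y."""
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior, log P(y)
        for word in tokens:
            # Assumption: add-one smoothing over the training vocabulary.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)  # log likelihood, log P(w_i | y)
        scores[label] = score
    return scores


def predict(tokens, model):
    """Pick the class with the highest (unnormalized) posterior score."""
    scores = log_scores(tokens, *model)
    return max(scores, key=scores.get)


model = fit(TRAIN)
print(predict(["free", "prize", "now"], model))
print(predict(["meeting", "notes"], model))
```

Because $P(T)$ is dropped, the scores are only comparable across classes for the same text; they are not probabilities until they are normalized.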

In [2]:
import pandas as pd

path = 'dataset/bank-marketing.csv'
data = pd.read_csv(path, delimiter=';')  # the CSV is read as semicolon-delimited

data

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41183 | 73 | retired | married | professional.course | no | yes | no | cellular | nov | fri | ... | 1 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | yes |
| 41184 | 46 | blue-collar | married | professional.course | no | no | no | cellular | nov | fri | ... | 1 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | no |
| 41185 | 56 | retired | married | university.degree | no | yes | no | cellular | nov | fri | ... | 2 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | no |
| 41186 | 44 | technician | married | professional.course | no | no | no | cellular | nov | fri | ... | 1 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | yes |


In [4]:
from sklearn.model_selection import train_test_split

x_columns = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']
y_column = ['y']

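x_train, x_test, y_train, y_test = train_test_split(  # defaults: shuffled 75%/25% train/test split
    data[x_columns],
    data[y_column],
)  # no random_state is passed, so the sampled rows differ between runs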

In [6]:
x_train

| | age | job | marital | education | default | housing | loan |
|---|---|---|---|---|---|---|---|
| 13867 | 34 | services | divorced | basic.9y | no | no | no |
| 18528 | 30 | technician | single | university.degree | no | no | no |
| 17244 | 55 | retired | married | professional.course | no | yes | no |
| 21232 | 32 | management | married | university.degree | no | no | no |
| 24926 | 29 | technician | single | high.school | no | yes | no |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 24066 | 30 | admin. | married | university.degree | no | no | no |
| 7066 | 41 | services | married | university.degree | no | yes | yes |
| 1754 | 46 | blue-collar | married | basic.4y | no | yes | no |
| 17194 | 48 | admin. | single | high.school | no | yes | yes |


In [5]:
y_train

| | y |
|---|---|
| 13867 | no |
| 18528 | no |
| 17244 | yes |
| 21232 | no |
| 24926 | no |
| ... | ... |
| 24066 | no |
| 7066 | no |
| 1754 | no |
| 17194 | no |
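
The same Bayes rule carries over from words to categorical columns such as `job`, `marital`, and `housing`: each column value plays the role of a token, and the class is `y`. The sketch below is an illustration under stated assumptions, not the notebook's own next step. The rows in `ROWS` are invented (they are not taken from the dataset), the helpers `fit_categorical_nb` and `posterior` are hypothetical names, and the smoothing parameter `alpha` is an added assumption. A numeric column such as `age` would need binning or a separate likelihood model, which this sketch leaves out.

```python
from collections import Counter, defaultdict

# Hypothetical rows shaped like a subset of the selected columns (values invented).
ROWS = [
    ({"job": "services", "marital": "married", "housing": "no"}, "no"),
    ({"job": "technician", "marital": "single", "housing": "no"}, "no"),
    ({"job": "retired", "marital": "married", "housing": "yes"}, "yes"),
    ({"job": "admin.", "marital": "single", "housing": "yes"}, "no"),
    ({"job": "retired", "marital": "married", "housing": "no"}, "yes"),
    ({"job": "services", "marital": "married", "housing": "yes"}, "no"),
]


def fit_categorical_nb(rows):
    """Count class frequencies and per-class frequencies of each column value."""
    class_counts = Counter(label for _, label in rows)
    # value_counts[label][column][value] = number of rows with that value
    value_counts = defaultdict(lambda: defaultdict(Counter))
    categories = defaultdict(set)  # distinct values seen per column
    for features, label in rows:
        for column, value in features.items():
            value_counts[label][column][value] += 1
            categories[column].add(value)
    return class_counts, value_counts, categories


def posterior(features, model, alpha=1.0):
    """Return normalized P(y | features) under the naive independence assumption."""
    class_counts, value_counts, categories = model
    n_rows = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / n_rows  # prior P(y)
        for column, value in features.items():
            count = value_counts[label][column][value]
            # Assumption: additive smoothing over the values seen for this column.
            score *= (count + alpha) / (n_label + alpha * len(categories[column]))
        scores[label] = score
    total = sum(scores.values())  # plays the role of the dropped denominator
    return {label: score / total for label, score in scores.items()}


model = fit_categorical_nb(ROWS)
result = posterior({"job": "retired", "marital": "married", "housing": "no"}, model)
print(max(result, key=result.get), result)
```

Dividing by the sum of the class scores restores the denominator that was dropped for comparison, so the returned values sum to 1.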
