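## Naive Bayes Classifier

- A probabilistic classification algorithm based on Bayes' theorem
- Assumes that all features are conditionally independent given the class (the "naive" assumption)
- Three variants exist, depending on the type of input features
    - Gaussian naive Bayes classifier (continuous features)
    - Bernoulli naive Bayes classifier (binary features)
    - Multinomial naive Bayes classifier (count features)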
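
As a rough illustration of how the three variants map to feature types, the sketch below fits each scikit-learn estimator on a tiny made-up dataset. The arrays are hypothetical and not taken from this notebook: real-valued features for `GaussianNB`, 0/1 indicators for `BernoulliNB`, and non-negative counts for `MultinomialNB`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# Hypothetical toy labels: two samples per class
y_toy = np.array([0, 0, 1, 1])

# Continuous measurements -> Gaussian likelihood per feature
X_cont = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 4.2], [3.1, 3.8]])
# Binary indicators -> Bernoulli likelihood per feature
X_bin = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
# Non-negative counts -> multinomial likelihood over features
X_cnt = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])

# Fit each variant and predict on its own training data
pred_gaussian = GaussianNB().fit(X_cont, y_toy).predict(X_cont)
pred_bernoulli = BernoulliNB().fit(X_bin, y_toy).predict(X_bin)
pred_multinomial = MultinomialNB().fit(X_cnt, y_toy).predict(X_cnt)

print(pred_gaussian, pred_bernoulli, pred_multinomial)
```

All three estimators share the same `fit`/`predict` interface, so the choice of variant mainly comes down to which likelihood best matches the features.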

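### Probability model of the naive Bayes classifier
- Naive Bayes is a conditional probability model
- Given a vector x = (x_1, ..., x_n) of n features, it outputs a probability for each of the K possible classes C_k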
\begin{equation}
p(C_k | x_1,...,x_n)
\end{equation}

- Applying Bayes' theorem to the expression above gives
\begin{equation}
p(C_k | \textbf{x}) = \frac{p(C_k)p(\textbf{x}|C_k)}{p(\textbf{x})}
\end{equation}
- The denominator p(x) does not depend on the class C_k, so it can be treated as a constant; only the numerator affects which class scores highest

\begin{equation}
\begin{split}
p(C_k | \textbf{x}) & \propto p(C_k)p(\textbf{x}|C_k) \\
& \propto p(C_k, x_1, ..., x_n)
\end{split}
\end{equation}
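
- For reference, the denominator can be expanded with the law of total probability. The sum runs over all K classes, so its value is the same whichever class is being scored
\begin{equation}
p(\textbf{x}) = \sum_{j=1}^{K} p(C_j)p(\textbf{x}|C_j)
\end{equation}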

- By the chain rule, the joint probability can be expanded as follows
\begin{equation}
\begin{split}
p(C_k, x_1, ..., x_n) & = p(C_k)p(x_1, ..., x_n | C_k) \\
& = p(C_k)p(x_1 | C_k)p(x_2, ..., x_n | C_k, x_1) \\
& = p(C_k)p(x_1 | C_k)p(x_2 | C_k, x_1)p(x_3, ..., x_n | C_k, x_1, x_2) \\
& = p(C_k)p(x_1 | C_k)p(x_2 | C_k, x_1)...p(x_n | C_k, x_1, x_2, ..., x_{n-1})
\end{split}
\end{equation}
- Because naive Bayes assumes that the features are conditionally independent given the class, the expression above simplifies to
\begin{equation}
\begin{split}
p(C_k, x_1, ..., x_n) & = p(C_k)p(x_1|C_k)p(x_2|C_k)...p(x_n|C_k) \\
& = p(C_k) \prod_{i=1}^{n} p(x_i|C_k)
\end{split}
\end{equation}
- The predicted class is the one that maximizes this value
\begin{equation}
\hat{y} = \underset{k}{\arg\max} \; p(C_k) \prod_{i=1}^{n} p(x_i|C_k)
\end{equation}
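
In practice the product of many small probabilities can underflow, so implementations commonly compare sums of log-probabilities instead; the argmax is unchanged because the logarithm is monotonic. A minimal sketch, assuming illustrative prior and likelihood values and a hypothetical helper named `predict_class`:

```python
import numpy as np

def predict_class(prior, likelihood):
    """Return the index of the highest-scoring class and the log-scores.

    prior:      shape (K,),   p(C_k) for each class
    likelihood: shape (K, n), p(x_i | C_k) for the observed features
    """
    # log p(C_k) + sum_i log p(x_i | C_k)
    log_scores = np.log(prior) + np.log(likelihood).sum(axis=1)
    return int(np.argmax(log_scores)), log_scores

# Illustrative values (assumed for this sketch): 4 classes, 3 features
prior = np.array([0.45, 0.3, 0.15, 0.1])
likelihood = np.array([[0.3, 0.3, 0.4],
                       [0.7, 0.2, 0.1],
                       [0.15, 0.5, 0.35],
                       [0.6, 0.2, 0.2]])

best, log_scores = predict_class(prior, likelihood)
print(best, log_scores)
```

The returned index is zero-based, so index 0 corresponds to the first class.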

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.datasets import fetch_covtype, fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn import metrics

In [3]:
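prior = [0.45, 0.3, 0.15, 0.1]  # p(C_k) for four classes
# p(x_i | C_k): one row per class, one column per feature
likelihood = [[0.3, 0.3, 0.4], [0.7, 0.2, 0.1], [0.15, 0.5, 0.35], [0.6, 0.2, 0.2]]

# Unnormalized score for each class: p(C_k) * prod_i p(x_i | C_k)
for idx, (c, xs) in enumerate(zip(prior, likelihood), start=1):
    result = 1.

    for x in xs:
        result *= x
    result *= c

    print(f'Score for class {idx}: {result}')

Score for class 1: 0.0162
Score for class 2: 0.0042
Score for class 3: 0.0039375
Score for class 4: 0.0024000000000000002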
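
The scores above are unnormalized. Assuming the four classes are exhaustive, dividing each score by their sum (which plays the role of p(x)) turns them into posterior probabilities without changing the ranking. A small sketch using the printed values:

```python
import numpy as np

# Unnormalized scores p(C_k) * prod_i p(x_i | C_k) from the cell above
scores = np.array([0.0162, 0.0042, 0.0039375, 0.0024])

# The sum over classes acts as the evidence p(x)
evidence = scores.sum()
posterior = scores / evidence

print(posterior)           # posterior probability of each class
print(posterior.argmax())  # zero-based index of the predicted class
```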


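## Forest Cover Type Dataset
- Feature data describing patches of forest
- The task is to predict the cover type of each patch, i.e. the dominant tree species
- A detailed description of the data is available at https://archive.ics.uci.edu/ml/datasets/Covertype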

In [4]:
covtype = fetch_covtype()
print(covtype.DESCR)

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`

In [7]:
pd.DataFrame(covtype.data)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2595.0,45.0,2.0,153.0,-1.0,391.0,220.0,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581007,2396.0,153.0,20.0,85.0,17.0,108.0,240.0,237.0,118.0,837.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581008,2391.0,152.0,19.0,67.0,12.0,95.0,240.0,237.0,119.0,845.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581009,2386.0,159.0,17.0,60.0,7.0,90.0,236.0,241.0,130.0,854.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
581010,2384.0,170.0,15.0,60.0,5.0,90.0,230.0,245.0,143.0,864.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
covtype.target

array([5, 5, 2, ..., 3, 3, 3])

In [9]:
covtype_X = covtype.data
covtype_y = covtype.target

In [11]:
covtype_X_train, covtype_X_test, covtype_y_train, covtype_y_test = train_test_split(covtype_X, covtype_y, test_size=0.2)
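
The split above is unseeded, so the exact rows in each part (and the summary statistics that follow) change from run to run. If a reproducible split that preserves class proportions is wanted, `train_test_split` accepts `random_state` and `stratify`. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical) so that it runs without downloading the dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 5 features, 3 imbalanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.choice([1, 2, 3], size=1000, p=[0.6, 0.3, 0.1])

# random_state fixes the shuffle; stratify keeps class ratios in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class proportions in each part (labels start at 1, so drop bin 0)
train_frac = np.bincount(y_tr)[1:] / len(y_tr)
test_frac = np.bincount(y_te)[1:] / len(y_te)
print(train_frac, test_frac)
```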

In [12]:
print(f'Full data shape: {covtype_X.shape}')
print(f'Training data shape: {covtype_X_train.shape}')
print(f'Test data shape: {covtype_X_test.shape}')

Full data shape: (581012, 54)
Training data shape: (464809, 54)
Test data shape: (116203, 54)


#### Preprocessing

##### Data before preprocessing

In [13]:
covtype_df = pd.DataFrame(data=covtype_X)
covtype_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
count,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,...,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0
mean,2959.365301,155.656807,14.103704,269.428217,46.418855,2350.146611,212.146049,223.318716,142.528263,1980.291226,...,0.044175,0.090392,0.077716,0.002773,0.003255,0.000205,0.000513,0.026803,0.023762,0.01506
std,279.984734,111.913721,7.488242,212.549356,58.295232,1559.25487,26.769889,19.768697,38.274529,1324.19521,...,0.205483,0.286743,0.267725,0.052584,0.056957,0.01431,0.022641,0.161508,0.152307,0.121791
min,1859.0,0.0,0.0,0.0,-173.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2809.0,58.0,9.0,108.0,7.0,1106.0,198.0,213.0,119.0,1024.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2996.0,127.0,13.0,218.0,30.0,1997.0,218.0,226.0,143.0,1710.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3163.0,260.0,18.0,384.0,69.0,3328.0,231.0,237.0,168.0,2550.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,3858.0,360.0,66.0,1397.0,601.0,7117.0,254.0,254.0,254.0,7173.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [14]:
covtype_train_df = pd.DataFrame(data=covtype_X_train)
covtype_train_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
count,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,...,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0
mean,2959.508108,155.683737,14.10795,269.660327,46.458492,2350.467784,212.134397,223.311879,142.531156,1980.232134,...,0.04407,0.090661,0.07789,0.002754,0.003251,0.000207,0.000525,0.026719,0.023853,0.015094
std,280.023953,111.932438,7.491241,212.723773,58.324352,1558.500769,26.787546,19.763753,38.288351,1322.895837,...,0.20525,0.287127,0.267999,0.052405,0.056923,0.01437,0.022906,0.16126,0.152591,0.121929
min,1860.0,0.0,0.0,0.0,-173.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2809.0,58.0,9.0,108.0,7.0,1106.0,198.0,213.0,119.0,1024.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2996.0,127.0,13.0,218.0,30.0,1998.0,218.0,226.0,143.0,1710.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3164.0,261.0,18.0,390.0,69.0,3329.0,231.0,237.0,168.0,2550.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,3858.0,360.0,66.0,1397.0,601.0,7117.0,254.0,254.0,254.0,7173.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [15]:
covtype_test_df = pd.DataFrame(data=covtype_X_test)
covtype_test_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
count,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,...,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0
mean,2958.794076,155.549091,14.086719,268.499781,46.260312,2348.861931,212.192654,223.346067,142.516691,1980.527594,...,0.044594,0.089318,0.07702,0.002848,0.00327,0.000198,0.000465,0.027142,0.023399,0.014922
std,279.828281,111.839238,7.476241,211.848624,58.178585,1562.273668,26.699207,19.788523,38.219354,1329.385652,...,0.206412,0.285203,0.266625,0.053295,0.057092,0.014067,0.021552,0.162498,0.151167,0.121242
min,1859.0,0.0,0.0,0.0,-166.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2809.0,58.0,9.0,108.0,7.0,1104.0,199.0,213.0,119.0,1024.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2995.0,127.0,13.0,218.0,29.0,1989.0,218.0,226.0,143.0,1708.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3162.0,260.0,18.0,382.0,69.0,3325.0,231.0,237.0,168.0,2546.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,3851.0,360.0,65.0,1390.0,598.0,7092.0,254.0,254.0,254.0,7111.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


##### Preprocessing step

In [16]:
scaler = StandardScaler()
covtype_X_train_scale = scaler.fit_transform(covtype_X_train)
covtype_X_test_scale = scaler.transform(covtype_X_test)

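##### Data after preprocessing
- Each feature is standardized so that its mean is 0 and its standard deviation is 1 on the training set; the test set is transformed with the training statistics, so its mean and standard deviation are only approximately 0 and 1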
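
To make the transformation concrete, the sketch below reproduces `StandardScaler` by hand on synthetic stand-in data (the arrays are hypothetical): subtract the training mean and divide by the training standard deviation. It also suggests why `describe()` can report a standard deviation of 1.000001 rather than exactly 1: `StandardScaler` uses the population standard deviation (`ddof=0`), while pandas reports the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 training rows, 50 test rows, 3 features
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_te = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Fit on the training data only, then apply to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Manual equivalent: training mean and population std (ddof=0)
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)
X_te_manual = (X_te - mu) / sigma

# pandas uses ddof=1, so the reported std is sqrt(n / (n - 1)), not 1
pandas_std = pd.DataFrame(X_tr_scaled).std()
expected_std = np.sqrt(200 / 199)
print(pandas_std.values, expected_std)
```

With 464809 training rows the same factor is sqrt(464809 / 464808), which rounds to the 1.000001 shown in the summary table.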

In [17]:
covtype_train_df = pd.DataFrame(data=covtype_X_train_scale)
covtype_train_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
count,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,...,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0,464809.0
mean,-4.819918e-16,2.0178530000000002e-17,8.78072e-17,1.273999e-16,1.1633230000000001e-17,-1.306789e-16,-1.615659e-16,-4.751433e-16,1.471657e-16,7.710646e-17,...,-1.103628e-16,-7.019684e-17,5.808971999999999e-19,-1.5286769999999998e-19,-5.05992e-18,-1.1969540000000001e-17,1.1281630000000001e-17,-4.1488290000000003e-17,-2.1646060000000002e-17,2.3281750000000002e-17
std,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,...,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001
min,-3.926483,-1.390874,-1.883261,-1.267656,-3.762729,-1.508161,-7.919151,-11.29907,-3.722576,-1.496894,...,-0.2147123,-0.3157527,-0.2906362,-0.05254925,-0.05710867,-0.01437286,-0.02291773,-0.1656864,-0.1563191,-0.123797
25%,-0.5374836,-0.8727036,-0.681857,-0.759955,-0.6765362,-0.798504,-0.5276486,-0.5217577,-0.6145781,-0.7228333,...,-0.2147123,-0.3157527,-0.2906362,-0.05254925,-0.05710867,-0.01437286,-0.02291773,-0.1656864,-0.1563191,-0.123797
50%,0.1303172,-0.2562597,-0.1478995,-0.2428519,-0.2821893,-0.2261585,0.2189678,0.1360128,0.0122451,-0.2042734,...,-0.2147123,-0.3157527,-0.2906362,-0.05254925,-0.05710867,-0.01437286,-0.02291773,-0.1656864,-0.1563191,-0.123797
75%,0.7302665,0.9408924,0.5195474,0.5657092,0.3864858,0.6278683,0.7042684,0.6925879,0.6651859,0.4306979,...,-0.2147123,-0.3157527,-0.2906362,-0.05254925,-0.05710867,-0.01437286,-0.02291773,-0.1656864,-0.1563191,-0.123797
max,3.208629,1.825355,6.927037,5.299553,9.5079,3.058412,1.562877,1.552749,2.911302,3.925308,...,4.657394,3.167036,3.440728,19.02977,17.51048,69.57557,43.63433,6.035499,6.39717,8.077738


In [18]:
covtype_test_df = pd.DataFrame(data=covtype_X_test_scale)
covtype_test_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
count,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,...,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0,116203.0
mean,-0.00255,-0.001203,-0.002834,-0.005456,-0.003398,-0.00103,0.002175,0.00173,-0.000378,0.000223,...,0.002556,-0.004678,-0.003245,0.001806,0.00034,-0.000599,-0.00263,0.002627,-0.002976,-0.001412
std,0.999302,0.999168,0.997999,0.995887,0.997502,1.002422,0.996703,1.001254,0.998199,1.004907,...,1.005663,0.993302,0.994875,1.016995,1.002965,0.97895,0.940904,1.007682,0.99067,0.99437
min,-3.930054,-1.390874,-1.883261,-1.267656,-3.64271,-1.508161,-7.919151,-11.299075,-3.722576,-1.496894,...,-0.214712,-0.315753,-0.290636,-0.052549,-0.057109,-0.014373,-0.022918,-0.165686,-0.156319,-0.123797
25%,-0.537484,-0.872704,-0.681857,-0.759955,-0.676536,-0.799787,-0.490318,-0.521758,-0.614578,-0.722833,...,-0.214712,-0.315753,-0.290636,-0.052549,-0.057109,-0.014373,-0.022918,-0.165686,-0.156319,-0.123797
50%,0.126746,-0.25626,-0.1479,-0.242852,-0.299335,-0.231933,0.218968,0.136013,0.012245,-0.205785,...,-0.214712,-0.315753,-0.290636,-0.052549,-0.057109,-0.014373,-0.022918,-0.165686,-0.156319,-0.123797
75%,0.723124,0.931958,0.519547,0.528102,0.386486,0.625302,0.704268,0.692588,0.665186,0.427674,...,-0.214712,-0.315753,-0.290636,-0.052549,-0.057109,-0.014373,-0.022918,-0.165686,-0.156319,-0.123797
max,3.183631,1.825355,6.793548,5.266646,9.456463,3.042371,1.562877,1.552749,2.911302,3.878441,...,4.657394,3.167036,3.440728,19.029767,17.510477,69.575573,43.634332,6.035499,6.39717,8.077738
