# Naive Bayes Algorithm

## Libraries

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline

*Ordinal encoder*: a class that converts the categorical attributes of the input dataset into ordinal attributes represented by non-negative integers.
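As a rough illustration of the idea, the sketch below encodes by hand the `act` values from the first five rows printed by `df.head()`. It assumes the default behaviour of scikit-learn's `OrdinalEncoder`, which sorts the categories of each column and emits floats; the variable names are illustrative only.

```python
# "act" values of the first five rows of the dataset.
act = ["STRETCH", "STRETCH", "STRETCH", "DIP", "DIP"]

# Sort the distinct categories and number them 0, 1, 2, ...
categories = sorted(set(act))
mapping = {category: float(index) for index, category in enumerate(categories)}

# Replace every value with the number assigned to its category.
encoded = [mapping[value] for value in act]
print(mapping)
print(encoded)
```

The numbers carry no meaning beyond identifying the category, which is all that `CategoricalNB` needs.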

## Data

In [2]:
df = pd.read_csv("data/balloons.csv")

In [3]:
df.head()

Unnamed: 0,color,size,act,age,inflated
0,YELLOW,SMALL,STRETCH,ADULT,T
1,YELLOW,SMALL,STRETCH,ADULT,T
2,YELLOW,SMALL,STRETCH,CHILD,F
3,YELLOW,SMALL,DIP,ADULT,F
4,YELLOW,SMALL,DIP,CHILD,F


In [4]:
df.describe()

Unnamed: 0,color,size,act,age,inflated
count,76,76,76,76,76
unique,2,2,2,2,2
top,YELLOW,SMALL,STRETCH,ADULT,F
freq,40,40,38,38,41


In [5]:
features = df.columns[:-1]
print(features)

Index(['color', 'size', 'act', 'age'], dtype='object')


## Model training

In [6]:
X = df[features]
y = df["inflated"]

In [7]:
print(X.shape)
print(y.shape)

(76, 4)
(76,)


It is always useful to check the dimensions of the features and the class labels to confirm that they are aligned, i.e. that `X` and `y` have the same number of rows.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=13, stratify=y)

In [9]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(50, 4)
(26, 4)
(50,)
(26,)
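The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in both splits. The sketch below is a quick arithmetic check of that property. It uses only counts reported elsewhere in this notebook: 41 of 76 rows are `F` according to `df.describe()`, and `model.class_count_` reports 27 `F` and 23 `T` rows in the training split. The helper `proportions` is hypothetical.

```python
# Class counts copied from the notebook outputs.
total = {"F": 41, "T": 76 - 41}
train = {"F": 27, "T": 23}
# Whatever is not in the training split ends up in the test split.
test = {label: total[label] - train[label] for label in total}

def proportions(counts):
    """Convert absolute class counts into rounded relative frequencies."""
    n = sum(counts.values())
    return {label: round(count / n, 3) for label, count in counts.items()}

print("all:  ", proportions(total))
print("train:", proportions(train))
print("test: ", proportions(test))
```

All three proportions of `F` should land close to 0.54, which is what stratification is meant to achieve.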


### Preprocessing the input data

In [10]:
# Fit the encoder on the training set only, then apply the same mapping to both splits.
oe = OrdinalEncoder()
oe.fit(X_train)
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)

In [11]:
pd.DataFrame(X_train, columns=features).head()

Unnamed: 0,color,size,act,age
0,1.0,0.0,0.0,0.0
1,1.0,1.0,1.0,1.0
2,0.0,1.0,1.0,1.0
3,1.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0


### Training

In [12]:
model = CategoricalNB()
model.fit(X_train, y_train)

CategoricalNB()

In [13]:
classes = model.classes_

In [14]:
model.class_count_

array([27., 23.])

In [15]:
model.category_count_

[array([[13., 14.],
        [ 7., 16.]]),
 array([[17., 10.],
        [ 7., 16.]]),
 array([[19.,  8.],
        [ 6., 17.]]),
 array([[ 9., 18.],
        [17.,  6.]])]
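To see how these counts turn into a prediction, the sketch below recomputes the posterior by hand for one encoded sample, `[1, 0, 0, 0]` (the first row of the encoded training set). It assumes the defaults of `CategoricalNB`: Laplace smoothing with `alpha=1` and class priors estimated from the class counts. The function `posterior` is a hypothetical helper, not part of scikit-learn.

```python
# Counts copied from model.class_count_ and model.category_count_.
classes = ["F", "T"]
class_count = [27.0, 23.0]
category_count = [
    [[13, 14], [7, 16]],  # color: rows are classes, columns are categories
    [[17, 10], [7, 16]],  # size
    [[19, 8], [6, 17]],   # act
    [[9, 18], [17, 6]],   # age
]
alpha = 1.0  # assumed: the default smoothing parameter of CategoricalNB

def posterior(sample):
    """Return P(class | sample) for one ordinal-encoded sample."""
    scores = []
    for c, n_c in enumerate(class_count):
        score = n_c / sum(class_count)  # empirical class prior
        for feature, category in enumerate(sample):
            counts = category_count[feature][c]
            # Smoothed estimate of P(feature = category | class).
            score *= (counts[category] + alpha) / (n_c + alpha * len(counts))
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]  # normalize so the values sum to 1

probs = posterior([1, 0, 0, 0])
prediction = classes[probs.index(max(probs))]
print(dict(zip(classes, probs)), prediction)
```

Under these assumptions the result should agree with `model.predict_proba` for the same sample, up to floating-point error.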

### Model performance on the training set

In [16]:
y_train_pred = model.predict(X_train)
pd.DataFrame(confusion_matrix(y_train, y_train_pred), columns=classes, index=classes)

Unnamed: 0,F,T
F,21,6
T,6,17


### Model performance on the test set

In [17]:
y_test_pred = model.predict(X_test)
pd.DataFrame(confusion_matrix(y_test, y_test_pred), columns=classes, index=classes)

Unnamed: 0,F,T
F,12,2
T,2,10
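The two confusion matrices can be condensed into accuracy and per-class recall. The sketch below does so in plain Python from the numbers printed above, relying on scikit-learn's convention that rows are true labels and columns are predicted labels. The helper `summarize` is hypothetical.

```python
def summarize(matrix):
    """Return accuracy and per-class recall for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    # Recall of class i: correctly predicted rows of class i over all rows of class i.
    recall = [row[i] / sum(row) for i, row in enumerate(matrix)]
    return correct / total, recall

train_acc, train_recall = summarize([[21, 6], [6, 17]])
test_acc, test_recall = summarize([[12, 2], [2, 10]])
print(f"train accuracy: {train_acc:.3f}, recall (F, T): {train_recall}")
print(f"test accuracy:  {test_acc:.3f}, recall (F, T): {test_recall}")
```

With a test set of only 26 rows, a difference between the two accuracies can easily come from the particular split.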


## Pipeline
A pipeline can be used to automate preprocessing of the input data and training of the model. When creating a pipeline, we pass a sequence of transformations, and the last element of the sequence must be the final estimator, here a classification model.
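Conceptually, fitting a pipeline fits and applies each transformation in order and then fits the final estimator on the transformed data. Predicting applies the already fitted transformations before calling the estimator. The sketch below mimics that chaining in plain Python. All three classes are hypothetical stand-ins written for illustration and are not scikit-learn's implementation. The toy rows are taken from the `act`, `age` and `inflated` columns of `df.head()`.

```python
class ToyOrdinalEncoder:
    """Hypothetical transformer: maps the sorted categories of each column to integers."""
    def fit(self, X, y=None):
        self.maps_ = [
            {cat: i for i, cat in enumerate(sorted({row[j] for row in X}))}
            for j in range(len(X[0]))
        ]
        return self

    def transform(self, X):
        return [[self.maps_[j][value] for j, value in enumerate(row)] for row in X]


class ToyMajorityClassifier:
    """Hypothetical final estimator: always predicts the most frequent label."""
    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyPipeline:
    """Sketch of the chaining idea only."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        *transformers, (_, final) = self.steps
        for _, step in transformers:
            X = step.fit(X, y).transform(X)  # fit each transformer, pass its output on
        final.fit(X, y)
        return self

    def predict(self, X):
        *transformers, (_, final) = self.steps
        for _, step in transformers:
            X = step.transform(X)  # reuse the mappings learned during fit
        return final.predict(X)


X_raw = [["STRETCH", "ADULT"], ["STRETCH", "CHILD"], ["DIP", "ADULT"]]
y_raw = ["T", "F", "F"]
toy = ToyPipeline([("encoder", ToyOrdinalEncoder()), ("classifier", ToyMajorityClassifier())])
predictions = toy.fit(X_raw, y_raw).predict([["DIP", "CHILD"]])
print(predictions)
```

The benefit of the real `Pipeline` is the same as in this toy version: the raw data goes in once and the preprocessing cannot be accidentally skipped or fitted on the wrong split.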

In [18]:
pipe = Pipeline([("ordinal encoder", OrdinalEncoder()), ("classifier", CategoricalNB())])

In [19]:
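# Caution: X_train was already ordinal-encoded in In [10], so the encoder step sees 0.0/1.0 here, not the raw strings.
pipe.fit(X_train, y_train)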

Pipeline(steps=[('ordinal encoder', OrdinalEncoder()),
                ('classifier', CategoricalNB())])

In [20]:
pipe["ordinal encoder"]

OrdinalEncoder()

In [23]:
y_test_pred = pipe.predict(X_test)
pd.DataFrame(confusion_matrix(y_test, y_test_pred), columns=classes, index=classes)

Unnamed: 0,F,T
F,12,2
T,2,10
