# 4. Naive Bayes: An Example

We will work through an example to illustrate the Naive Bayes classifier.

In this example, we classify texts according to whether they are about China ('zh') or Japan ('ja').

In [1]:
import numpy as np

## Training Data

Suppose we have the following training data:

In [2]:
training = [
    ('chinese beijing chinese', 'zh'),
    ('chinese chinese shangai', 'zh'),
    ('chinese macao', 'zh'),
    ('tokyo japan chinese', 'ja'),
]

In [3]:
X_train = [doc for doc, _ in training]
y_train = [cls for _, cls in training]

In [4]:
X_train

['chinese beijing chinese',
 'chinese chinese shangai',
 'chinese macao',
 'tokyo japan chinese']

In [5]:
classes = ['zh', 'ja']

In [6]:
features = ['chinese', 'beijing', 'shangai', 'macao', 'tokyo', 'japan']

## Naive Bayes Classifier

### Prior Distribution

Let's compute the prior distribution (the probability of each class) using maximum likelihood:

$$P(Y = y) = \frac{Count(Y = y)}{\sum_{y'} Count(Y = y')}$$

In [7]:
from collections import Counter

class_count = Counter(y_train)
class_count

Counter({'zh': 3, 'ja': 1})

In [8]:
prior_prob = {}
for c in classes:
    prior_prob[c] = class_count[c] / len(y_train)
    
    print(f'P({c}) = {prior_prob[c]:0.2f}')

P(zh) = 0.75
P(ja) = 0.25


In [9]:
prior_prob

{'zh': 0.75, 'ja': 0.25}

### Conditional Distributions

Let's compute the conditional distributions, that is, the probability of each feature given each class.

We use maximum likelihood with "add-one" smoothing:

$$P(X_i = x|Y = y) = \frac{Count(X_i = x, Y = y) + 1}{\sum_{x'} Count(X_i = x', Y = y)+ |V|}$$

Here $|V|$ is the number of features (the vocabulary size). First we compute the counts:

In [10]:
feature_count = {}

for doc, cls in training:
    tokens = doc.split()  # list of words
    for feature in tokens:
        if (feature, cls) not in feature_count:
            feature_count[feature, cls] = 0
        feature_count[feature, cls] = feature_count[feature, cls] + 1

Or, more concisely, with `defaultdict`:

In [11]:
from collections import defaultdict
feature_count = defaultdict(int)

for doc, cls in training:
    tokens = doc.split()  # list of words
    for feature in tokens:
        feature_count[feature, cls] += 1

In [12]:
dict(feature_count)

{('chinese', 'zh'): 5,
 ('beijing', 'zh'): 1,
 ('shangai', 'zh'): 1,
 ('macao', 'zh'): 1,
 ('tokyo', 'ja'): 1,
 ('japan', 'ja'): 1,
 ('chinese', 'ja'): 1}
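
As a worked instance of the smoothing formula, take the feature `chinese` in class `zh`. The counts above give 5 occurrences of `chinese` among 5 + 1 + 1 + 1 = 8 tokens in `zh` documents, and the feature list has $|V| = 6$ entries:

$$P(\text{chinese} \mid \text{zh}) = \frac{5 + 1}{8 + 6} = \frac{6}{14} \approx 0.43$$

A feature never seen with a class, such as `tokyo` in `zh`, gets $\frac{0 + 1}{8 + 6} = \frac{1}{14}$ instead of zero, so a single unseen word cannot force a class probability to zero.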

Now we compute the distributions:

In [13]:
V = len(features)

cond_prob = {}
for c in classes:
    cond_prob[c] = {}
    
    count_sum = sum(feature_count[f, c] for f in features)
    denom = count_sum + V

    for f in features:
        num = feature_count[f, c] + 1
        cond_prob[c][f] = num / denom

        print(f'P({f}|{c}) = {num} / {denom} ~ {cond_prob[c][f]:0.2f}')

P(chinese|zh) = 6 / 14 ~ 0.43
P(beijing|zh) = 2 / 14 ~ 0.14
P(shangai|zh) = 2 / 14 ~ 0.14
P(macao|zh) = 2 / 14 ~ 0.14
P(tokyo|zh) = 1 / 14 ~ 0.07
P(japan|zh) = 1 / 14 ~ 0.07
P(chinese|ja) = 2 / 9 ~ 0.22
P(beijing|ja) = 1 / 9 ~ 0.11
P(shangai|ja) = 1 / 9 ~ 0.11
P(macao|ja) = 1 / 9 ~ 0.11
P(tokyo|ja) = 2 / 9 ~ 0.22
P(japan|ja) = 2 / 9 ~ 0.22
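
As a sanity check, each conditional distribution should sum to 1 over the vocabulary. The following is a minimal, self-contained sketch that recomputes the estimates with exact fractions. The function name `conditional_distribution` and its `alpha` parameter are hypothetical, introduced only for this illustration (`alpha=1` corresponds to add-one smoothing).

```python
from collections import Counter
from fractions import Fraction

training = [
    ('chinese beijing chinese', 'zh'),
    ('chinese chinese shangai', 'zh'),
    ('chinese macao', 'zh'),
    ('tokyo japan chinese', 'ja'),
]
features = ['chinese', 'beijing', 'shangai', 'macao', 'tokyo', 'japan']


def conditional_distribution(cls, alpha=1):
    """Smoothed P(feature | cls) as exact fractions (hypothetical helper)."""
    # Count every token that appears in documents of the given class.
    counts = Counter(w for doc, c in training if c == cls for w in doc.split())
    denom = sum(counts[f] for f in features) + alpha * len(features)
    return {f: Fraction(counts[f] + alpha, denom) for f in features}


for cls in ['zh', 'ja']:
    dist = conditional_distribution(cls)
    print(cls, 'sums to', sum(dist.values()))
```

Calling `conditional_distribution('zh', alpha=0)` gives the unsmoothed estimates, where `tokyo` and `japan` have probability zero for `zh`.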


### Prediction

Given a document, let's compute its classification. To do so, we compute the probability of each class, or rather a quantity proportional to it (we skip the denominator $P(X=x)$, which is the same for every class).

$$P(Y=y|X=x) \propto P(Y=y) \prod_{i} P(X_i = x_i|Y=y)$$

In [14]:
doc = 'chinese chinese chinese tokyo japan'.split()

In [15]:
zh_prob = prior_prob['zh']
for w in doc:
    zh_prob = zh_prob * cond_prob['zh'][w]

print(f'P(zh|doc) ~ {zh_prob:0.4f}')

P(zh|doc) ~ 0.0003


In [16]:
ja_prob = prior_prob['ja']
for w in doc:
    ja_prob = ja_prob * cond_prob['ja'][w]

print(f'P(ja|doc) ~ {ja_prob:0.4f}')

P(ja|doc) ~ 0.0001


**What is the classification?**

Normalizing the two scores so that they sum to 1 gives the probability of each class:

In [17]:
zh_prob / (zh_prob + ja_prob), ja_prob / (zh_prob + ja_prob)

(0.6897586117634673, 0.31024138823653263)
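
Multiplying many small probabilities can underflow for long documents, so the product is commonly computed as a sum of logarithms. The following is a minimal sketch of that idea, assuming the smoothed estimates printed above. The helper names `log_scores` and `posterior` are hypothetical, and the probability tables are copied by hand so that the block runs on its own.

```python
import math

# Values copied from the outputs above (add-one smoothed estimates).
prior_prob = {'zh': 3 / 4, 'ja': 1 / 4}
cond_prob = {
    'zh': {'chinese': 6 / 14, 'beijing': 2 / 14, 'shangai': 2 / 14,
           'macao': 2 / 14, 'tokyo': 1 / 14, 'japan': 1 / 14},
    'ja': {'chinese': 2 / 9, 'beijing': 1 / 9, 'shangai': 1 / 9,
           'macao': 1 / 9, 'tokyo': 2 / 9, 'japan': 2 / 9},
}


def log_scores(tokens):
    """Unnormalized log-posterior per class: log P(y) + sum_i log P(x_i | y)."""
    return {
        c: math.log(prior_prob[c]) + sum(math.log(cond_prob[c][w]) for w in tokens)
        for c in prior_prob
    }


def posterior(tokens):
    """Normalize the log scores into probabilities, subtracting the max for stability."""
    scores = log_scores(tokens)
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}


doc = 'chinese chinese chinese tokyo japan'.split()
probs = posterior(doc)
print(max(probs, key=probs.get), probs)
```

For this document the result should agree with the normalized values computed directly from the products; the log-space version also keeps working for documents with thousands of tokens, where the raw product would round to `0.0`.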

## Naive Bayes with Scikit-learn

Let's see how to classify documents in **scikit-learn** using Naive Bayes.

### Bag of Words

We represent documents as vectors using bags of words:

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

We fit the vectorizer (no labels needed) so that it assigns a column to each feature seen in the training data:

In [19]:
vect.fit(X_train)

In [20]:
vect.get_feature_names_out()

array(['beijing', 'chinese', 'japan', 'macao', 'shangai', 'tokyo'],
      dtype=object)

Let's see how the training data is vectorized:

In [21]:
X2 = vect.transform(X_train)

In [22]:
X2  # shape?

<4x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [23]:
X2.todense()

matrix([[1, 2, 0, 0, 0, 0],
        [0, 2, 0, 0, 1, 0],
        [0, 1, 0, 1, 0, 0],
        [0, 1, 1, 0, 0, 1]], dtype=int64)

Internally, the vectorizer stores the mapping from features to columns:

In [24]:
vect.vocabulary_

{'chinese': 1, 'beijing': 0, 'shangai': 4, 'macao': 3, 'tokyo': 5, 'japan': 2}

Now we vectorize a new document:

In [25]:
doc = 'chinese chinese chinese tokyo japan'

In [26]:
X_test = vect.transform([doc])

In [27]:
X_test.todense()

matrix([[0, 3, 1, 0, 0, 1]], dtype=int64)

In [28]:
# what happens if we vectorize this?
doc = 'buenos aires'
X_test = vect.transform([doc])
X_test.todense()

matrix([[0, 0, 0, 0, 0, 0]], dtype=int64)
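
To make the behaviour above concrete, here is a rough pure-Python sketch of what a bag-of-words vectorizer does. It is a simplification, not `CountVectorizer`'s actual implementation: it tokenizes with `str.split()` and returns dense lists, and the helper names `fit_vocabulary` and `transform` are hypothetical.

```python
def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({w for doc in docs for w in doc.split()})
    return {w: i for i, w in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count tokens per document; tokens outside the vocabulary are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for w in doc.split():
            if w in vocabulary:
                row[vocabulary[w]] += 1
        rows.append(row)
    return rows


X_train = [
    'chinese beijing chinese',
    'chinese chinese shangai',
    'chinese macao',
    'tokyo japan chinese',
]
vocabulary = fit_vocabulary(X_train)
print(transform(['chinese chinese chinese tokyo japan'], vocabulary))
print(transform(['buenos aires'], vocabulary))
```

Words that were not seen during fitting have no column, so a document made only of unknown words maps to the all-zeros vector.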

### Multinomial Naive Bayes

We instantiate and train [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes):

In [29]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X2, y_train)

Now we predict. Note that `X_test` was last assigned the vector for 'buenos aires', which is all zeros:

In [30]:
mnb.predict(X_test)

array(['zh'], dtype='<U2')

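We can also obtain the probabilities. With an all-zeros input they coincide with the prior, in the order given by `mnb.classes_` (`'ja'` first, then `'zh'`):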

In [31]:
mnb.predict_proba(X_test)

array([[0.25, 0.75]])
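
To compare against the hand computation, the original test document can be vectorized again and passed to the model. The sketch below rebuilds the vectorizer and classifier so that it runs on its own; the variable name `X_doc` is introduced only for this example, and it assumes the default `alpha=1.0` of `MultinomialNB`, which corresponds to add-one smoothing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

X_train = [
    'chinese beijing chinese',
    'chinese chinese shangai',
    'chinese macao',
    'tokyo japan chinese',
]
y_train = ['zh', 'zh', 'zh', 'ja']

vect = CountVectorizer()
mnb = MultinomialNB()  # default alpha=1.0 (add-one smoothing)
mnb.fit(vect.fit_transform(X_train), y_train)

# Vectorize the document used in the manual computation.
X_doc = vect.transform(['chinese chinese chinese tokyo japan'])
pred = mnb.predict(X_doc)
proba = mnb.predict_proba(X_doc)
print(pred, proba.round(4))
```

The probabilities should match the normalized values obtained by hand (about 0.31 for `ja` and 0.69 for `zh`), with columns ordered as in `mnb.classes_`.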

### Internal Parameters

Let's look at how the Naive Bayes model is represented internally in scikit-learn.

In [32]:
mnb.classes_

array(['ja', 'zh'], dtype='<U2')

In [33]:
mnb.class_count_

array([1., 3.])

In [34]:
mnb.feature_count_

array([[0., 1., 1., 0., 0., 1.],
       [1., 5., 0., 1., 1., 0.]])

In [35]:
np.exp(mnb.class_log_prior_)

array([0.25, 0.75])

In [36]:
np.exp(mnb.feature_log_prob_)

array([[0.11111111, 0.22222222, 0.22222222, 0.11111111, 0.11111111,
        0.22222222],
       [0.14285714, 0.42857143, 0.07142857, 0.14285714, 0.14285714,
        0.07142857]])
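
These two arrays are enough to reproduce a prediction by hand: the per-class score is the log prior plus the count-weighted sum of log feature probabilities. The following is a small NumPy sketch under that assumption, with the parameter values copied from the outputs above so that it runs without a fitted model. The names `class_log_prior`, `feature_log_prob`, and `jll` are local variables of this example.

```python
import numpy as np

# Parameters copied from the outputs above. Columns follow the vectorizer order
# (beijing, chinese, japan, macao, shangai, tokyo); rows follow classes_ (ja, zh).
class_log_prior = np.log([0.25, 0.75])
feature_log_prob = np.log([
    [1 / 9, 2 / 9, 2 / 9, 1 / 9, 1 / 9, 2 / 9],
    [2 / 14, 6 / 14, 1 / 14, 2 / 14, 2 / 14, 1 / 14],
])

# Count vector for 'chinese chinese chinese tokyo japan'.
x = np.array([[0, 3, 1, 0, 0, 1]])

# Joint log-likelihood: log P(y) + sum_j x_j * log P(feature_j | y)
jll = class_log_prior + x @ feature_log_prob.T

# Normalize across classes to obtain posterior probabilities.
proba = np.exp(jll - jll.max(axis=1, keepdims=True))
proba /= proba.sum(axis=1, keepdims=True)
print(proba)
```

Written this way, the classifier is a linear function of the count vector in log space, which is why only these two arrays need to be stored.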

## Exercises

1. Apply Naive Bayes to the problem of handwritten digit recognition.

## References

- [Naive Bayes classifier (Wikipedia)](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

Python:
- [defaultdict](https://docs.python.org/2/library/collections.html#collections.defaultdict)

Scikit-learn:
- [Working With Text Data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
- [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes)