<a href="https://colab.research.google.com/github/gmauricio-toledo/NLP-MCD/blob/main/06-NaiveBayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Naive Bayes</h1>

In this notebook we use the Naive Bayes algorithm for text classification.

For this, we work with the `20newsgroups` corpus.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

We load the corpus, restricted to two categories: *space* (`sci.space`) and *baseball* (`rec.sport.baseball`).

In [2]:
train_data = fetch_20newsgroups(subset='train',
                                remove=('headers', 'footers', 'quotes'),
                                categories=['sci.space', 'rec.sport.baseball']
                                )
test_data = fetch_20newsgroups(subset='test',
                                remove=('headers', 'footers', 'quotes'),
                                categories=['sci.space', 'rec.sport.baseball']
                               )

X_train_raw = train_data.data
y_train = train_data.target

X_test_raw = test_data.data
y_test = test_data.target

The negative class (label 0) is *baseball* and the positive class (label 1) is *space*.

In [3]:
train_data.target_names

['rec.sport.baseball', 'sci.space']

Let's obtain the word counts with the bag-of-words (`BOW`) model.

In [4]:
cv = CountVectorizer(max_features=2000,stop_words='english')
X_train_bow = cv.fit_transform(X_train_raw)
X_test_bow = cv.transform(X_test_raw)

print(X_train_bow.shape)
print(X_test_bow.shape)

(1190, 2000)
(791, 2000)


In [5]:
print(y_train.shape)
print(y_test.shape)

(1190,)
(791,)


We instantiate and train the [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) classifier.

In [6]:
nb = MultinomialNB()
nb.fit(X_train_bow, y_train)
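As a sketch of what the fitted model does at prediction time, the snippet below rebuilds the decision rule by hand on a small made-up count matrix (`X_toy` and `y_toy` are illustrative assumptions, not the notebook's data). The score of a class is its log prior plus the count-weighted sum of the word log-probabilities, and the predicted class is the one with the highest score.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 5 documents, 3 vocabulary terms
X_toy = np.array([[2, 1, 0],
                  [3, 0, 0],
                  [0, 1, 4],
                  [0, 2, 3],
                  [1, 0, 2]])
y_toy = np.array([0, 0, 1, 1, 1])

toy_nb = MultinomialNB().fit(X_toy, y_toy)

# Score of class c for document x:
#   log P(c) + sum over words w of count(w, x) * log P(w | c)
scores = toy_nb.class_log_prior_ + X_toy @ toy_nb.feature_log_prob_.T
manual_pred = toy_nb.classes_[scores.argmax(axis=1)]

print(manual_pred)
print(toy_nb.predict(X_toy))
```

The two printed arrays should coincide, which is a quick way to check that the formula in the comment matches what `predict` computes.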

We make predictions on the test set and evaluate the model's performance.

In [7]:
y_pred = nb.predict(X_test_bow)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[380  17]
 [ 38 356]]
              precision    recall  f1-score   support

           0       0.91      0.96      0.93       397
           1       0.95      0.90      0.93       394

    accuracy                           0.93       791
   macro avg       0.93      0.93      0.93       791
weighted avg       0.93      0.93      0.93       791
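The numbers in the report can be recovered directly from the confusion matrix printed above. The following sketch recomputes precision, recall, F1 and accuracy from those four counts; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
cm = np.array([[380, 17],
               [38, 356]])

true_pos = np.diag(cm)                 # correctly classified documents per class
precision = true_pos / cm.sum(axis=0)  # column sums: documents predicted as each class
recall = true_pos / cm.sum(axis=1)     # row sums: documents that truly belong to each class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(round(float(accuracy), 2))
```

For example, the precision for *space* is 356 / (356 + 17), and its recall is 356 / (38 + 356).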



We access the log-probability of each word in each class.

In [None]:
nb.feature_log_prob_.shape

(2, 2000)
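Each row of `feature_log_prob_` holds the smoothed log-probabilities of the vocabulary terms for one class. As a minimal sketch, assuming the default `alpha=1.0` (add-one smoothing) and a made-up count matrix, the values can be rebuilt by hand and compared with what scikit-learn stores.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count matrix: 4 documents, 3 vocabulary terms
X_toy = np.array([[2, 1, 0],
                  [3, 0, 0],
                  [0, 1, 4],
                  [0, 2, 3]])
y_toy = np.array([0, 0, 1, 1])

toy_nb = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)

manual_log_prob = []
for c in (0, 1):
    term_counts = X_toy[y_toy == c].sum(axis=0)  # total count of each term in class c
    smoothed = term_counts + 1.0                  # add-one (Laplace) smoothing
    manual_log_prob.append(np.log(smoothed / smoothed.sum()))
manual_log_prob = np.array(manual_log_prob)

print(np.allclose(manual_log_prob, toy_nb.feature_log_prob_))
print(np.exp(toy_nb.feature_log_prob_).sum(axis=1))  # each row is a distribution
```

Smoothing is what keeps a term that never occurs in a class (such as the third term in class 0 here) from getting probability zero.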

We define a function that shows, for each class, the words with the highest log-probability.

In [8]:
def show_top_words(classifier, vectorizer, categories, top_n):
  """Print the top_n features with the highest log-probability in each class."""
  feature_names = np.asarray(vectorizer.get_feature_names_out())
  for i, category in enumerate(categories):
    # Feature indices sorted from most to least probable under class i
    prob_sorted = classifier.feature_log_prob_[i, :].argsort()[::-1]
    print(f"{category}:\n\t{' '.join(feature_names[prob_sorted[:top_n]])}")

Let's look at the top words in each class, in decreasing order of probability:

In [9]:
show_top_words(nb, cv, train_data.target_names, 12)

rec.sport.baseball:
	year game good team think don 00 just like games better baseball
sci.space:
	space nasa launch like earth data orbit time shuttle just satellite lunar


Let's try a test document from outside the corpus:

https://en.wikipedia.org/wiki/Main_Page

In [31]:
new_text = '''
First Horizon Park, formerly known as First Tennessee Park, is a baseball park in downtown Nashville, Tennessee, United States. The home of the Triple-A Nashville Sounds of the International League, it opened on April 17, 2015, and can seat up to 10,000 people. It replaced the Sounds' former home, Herschel Greer Stadium, where the team played from its founding in 1978 through 2014.

The park was built on the site of the former Sulphur Dell, a minor league ballpark in use from 1885 to 1963. It is located between Third and Fifth Avenues on the east and west (home plate, the pitcher's mound, and second base are directly in line with Fourth Avenue to the stadium's north and south) and between Junior Gilliam Way and Harrison Street on the north and south. The Nashville skyline can be seen from the stadium to the south.

The design of the park incorporates elements of Nashville's baseball and musical heritage and the use of imagery inspired by Sulphur Dell, the city's former baseball players and teams, and country music. Its most distinctive feature is its guitar-shaped scoreboard—a successor to the original guitar scoreboard at Greer Stadium. The ballpark's wide concourse wraps entirely around the stadium and provides views of the field from every location.

Though primarily a venue for the Nashville Sounds, collegiate and high school baseball teams based in the area, such as the Vanderbilt Commodores and Belmont Bruins, have played some games at the ballpark. Nashville SC, a soccer team of the United Soccer League Championship, played its home matches at the facility from 2018 to 2019. It has also hosted other events, including celebrity softball games and various food and drink festivals.
'''

cv_new_text = cv.transform([new_text])
nb.predict(cv_new_text)

array([0])
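`predict` only returns the label. As a sketch on a made-up corpus (all names below are illustrative), `predict_proba` gives the class probabilities, and `inverse_transform` shows which words of an unseen text the vectorizer actually recognizes. Words outside the fitted vocabulary are silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: label 0 is sports, label 1 is space
toy_docs = ["the team won the game", "the pitcher threw the ball",
            "the rocket reached orbit", "nasa delayed the launch"]
toy_labels = [0, 0, 1, 1]

toy_cv = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_cv.fit_transform(toy_docs), toy_labels)

unseen = toy_cv.transform(["the stadium hosted a baseball game"])
known_terms = toy_cv.inverse_transform(unseen)[0]  # vocabulary terms found in the text
proba = toy_nb.predict_proba(unseen)[0]            # one probability per class

print(sorted(known_terms))
print(proba)
```

Here only two words of the new sentence belong to the vocabulary, so the prediction rests on them alone. The same check on `cv` and `new_text` would show how much of the Wikipedia excerpt survives the 2000-term vocabulary.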

Let's train a *baseline* classifier and compare performance.

In [10]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_bow, y_train)
y_pred = lr.predict(X_test_bow)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[373  24]
 [ 53 341]]
              precision    recall  f1-score   support

           0       0.88      0.94      0.91       397
           1       0.93      0.87      0.90       394

    accuracy                           0.90       791
   macro avg       0.90      0.90      0.90       791
weighted avg       0.90      0.90      0.90       791



Let's look at the words that weigh the most when predicting the positive class (space), and compare them with those from Naive Bayes.

In [11]:
fts_importance = lr.coef_.reshape(-1,)

sorted(zip(fts_importance, cv.get_feature_names_out()),key=lambda x: x[0], reverse=True)[:10]

[(2.141400452192958, 'space'),
 (1.0236870490586576, 'orbit'),
 (0.9172908250150296, 'nasa'),
 (0.9032415599082049, 'solar'),
 (0.9016603334626981, 'quite'),
 (0.8896229078367699, 'stuff'),
 (0.8428052843006433, 'launch'),
 (0.8349719826729955, 'spacecraft'),
 (0.8162961492352656, 'real'),
 (0.7188628154702895, 'sounds')]

Let's compare with another classifier: a Support Vector Machine.

In [12]:
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train_bow, y_train)
y_pred = svm.predict(X_test_bow)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[384  13]
 [109 285]]
              precision    recall  f1-score   support

           0       0.78      0.97      0.86       397
           1       0.96      0.72      0.82       394

    accuracy                           0.85       791
   macro avg       0.87      0.85      0.84       791
weighted avg       0.87      0.85      0.84       791
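Note that `SVC()` uses an RBF kernel by default. For sparse bag-of-words features a linear kernel is a common alternative. The sketch below only illustrates the API on a made-up corpus; it does not show that a linear kernel would do better on this dataset, which would have to be checked on `X_train_bow` and `X_test_bow`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus: label 0 is sports, label 1 is space
toy_docs = ["team won game", "great game team",
            "rocket reached orbit", "orbit rocket launch"]
toy_labels = [0, 0, 1, 1]

toy_cv = CountVectorizer()
X_toy = toy_cv.fit_transform(toy_docs)

# Same estimator as in the notebook, but with an explicit linear kernel
toy_svm = SVC(kernel="linear")
toy_svm.fit(X_toy, toy_labels)

train_pred = toy_svm.predict(X_toy)
new_pred = toy_svm.predict(toy_cv.transform(["the rocket reached orbit"]))

print(train_pred, new_pred)
```

With a linear kernel the fitted model also exposes `coef_`, so the per-word weights can be inspected in the same way as for logistic regression.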



Let's train a model on precomputed embeddings from a transformer-based model, [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta).

In [21]:
!gdown 1qvEuXjb65m6Vh5gn6-XScqtVQz9b1Gme
!gdown 14grKYRMZs96xd-tzEvDpEFc4FBO6Vm2H

Downloading...
From: https://drive.google.com/uc?id=1qvEuXjb65m6Vh5gn6-XScqtVQz9b1Gme
To: /content/imdb-test-space_baseball-roberta.npy
100% 2.43M/2.43M [00:00<00:00, 22.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=14grKYRMZs96xd-tzEvDpEFc4FBO6Vm2H
To: /content/imdb-train-space_baseball-roberta.npy
100% 3.66M/3.66M [00:00<00:00, 85.0MB/s]


In [22]:
train_embeddings = np.load("imdb-train-space_baseball-roberta.npy")
test_embeddings = np.load("imdb-test-space_baseball-roberta.npy")

print(train_embeddings.shape)
print(test_embeddings.shape)

(1190, 768)
(791, 768)


In [24]:
from sklearn.svm import SVC

svm = SVC(C=2.5)
svm.fit(train_embeddings, y_train)
y_pred = svm.predict(test_embeddings)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[383  14]
 [ 54 340]]
              precision    recall  f1-score   support

           0       0.88      0.96      0.92       397
           1       0.96      0.86      0.91       394

    accuracy                           0.91       791
   macro avg       0.92      0.91      0.91       791
weighted avg       0.92      0.91      0.91       791



## With $n$-grams

In [27]:
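# ngram_range=(2, 2) keeps bigrams only.
# Despite the "tfidf" in the variable names, these matrices hold raw counts.
cv_grams = CountVectorizer(max_features=2000,stop_words='english', ngram_range=(2,2))
X_train_tfidf_grams = cv_grams.fit_transform(X_train_raw)
X_test_tfidf_grams = cv_grams.transform(X_test_raw)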

print(X_train_tfidf_grams.shape)
print(X_test_tfidf_grams.shape)

(1190, 2000)
(791, 2000)


In [28]:
cv_grams.get_feature_names_out()

array(['00 00', '00 000', '00 01', ..., 'york times', 'york yankees',
       'zoo toronto'], dtype=object)

In [29]:
nb_grams = MultinomialNB()
nb_grams.fit(X_train_tfidf_grams, y_train)
y_pred = nb_grams.predict(X_test_tfidf_grams)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[386  11]
 [157 237]]
              precision    recall  f1-score   support

           0       0.71      0.97      0.82       397
           1       0.96      0.60      0.74       394

    accuracy                           0.79       791
   macro avg       0.83      0.79      0.78       791
weighted avg       0.83      0.79      0.78       791



In [30]:
show_top_words(nb_grams, cv_grams, train_data.target_names, 10)

rec.sport.baseball:
	00 00 new york red sox cubs suck st louis suck cubs don know 02 03 home runs don think
sci.space:
	nasa gov space station space shuttle sci space anonymous ftp remote sensing commercial space washington dc national space gamma ray
