# Challenge 2 - Naive Bayes Classifier
**Name:** Juan Manuel Gutiérrez Gómez **Student ID:** 2260563

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import re
import nltk
import matplotlib.pyplot as plt

In [3]:
data_source_url = "Rest_Mex_2022_Sentiment_Analysis_Track_Train.csv"
tourist_opinions = pd.read_csv(data_source_url)

In [4]:
tourist_opinions.head()

Unnamed: 0,Title,Opinion,Polarity,Attraction
0,Pésimo lugar,"Piensen dos veces antes de ir a este hotel, te...",1,Hotel
1,No vayas a lugar de Eddie,Cuatro de nosotros fuimos recientemente a Eddi...,1,Restaurant
2,Mala relación calidad-precio,seguiré corta y simple: limpieza\n- bad. Tengo...,1,Hotel
3,Minusválido? ¡No te alojes aquí!,Al reservar un hotel con multipropiedad Mayan ...,1,Hotel
4,Es una porqueria no pierdan su tiempo,"No pierdan su tiempo ni dinero, venimos porque...",1,Hotel


In [14]:
!pip install wordcloud
from wordcloud import WordCloud

# WordCloud.generate expects one string, so join all opinions into a single text.
# Passing the pandas Series directly raises
# "TypeError: expected string or bytes-like object".
text = " ".join(tourist_opinions["Opinion"].astype(str))

# Circle mask: pixels outside the circle are set to 255 (masked out)
x, y = np.ogrid[:300, :300]
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

wc = WordCloud(height = 500, width = 500, background_color = "white",
               repeat = True, mask = mask,
               contour_width = 3, contour_color = "black")
wc.generate(text)

plt.axis("off")
plt.imshow(wc, interpolation = "bilinear")

plt.show()

## Objective 1: Opinion vs. Attraction

**Feature Set**

We extract the features (opinion text) and the labels (attraction type) with the following script:

In [5]:
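# Select columns by name instead of position: opinion text as input,
# attraction type (Hotel, Restaurant, Attractive) as target.
features = tourist_opinions["Opinion"].values
labels = tourist_opinions["Attraction"].values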
labels

array(['Hotel', 'Restaurant', 'Hotel', ..., 'Attractive', 'Attractive',
       'Attractive'], dtype=object)

Once the data is separated into features and labels, we can preprocess the text to clean it. For this we use regular expressions:

In [6]:
processed_features = []

for raw_text in features:
    # Replace every non-word character (punctuation, symbols) with a space
    processed_feature = re.sub(r'\W', ' ', str(raw_text))

    # Remove isolated single ASCII letters surrounded by whitespace
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove a single letter at the start of the text
    # (the caret must be unescaped to anchor at the start of the string)
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Collapse runs of whitespace into a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature)

    # Remove a leading 'b' left over from byte-string representations
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)
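
To see what these cleaning steps do to a single opinion, the sketch below wraps them in a function and applies it to one made-up sentence. The function name `clean_text`, the sample sentence, and the final `strip()` are assumptions added for illustration; they are not part of the notebook.

```python
import re

def clean_text(text):
    """Apply the regex cleaning steps to one opinion (illustrative helper)."""
    text = re.sub(r"\W", " ", str(text))          # punctuation -> spaces
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)   # isolated single letters
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)     # single letter at the start
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    # strip() is an addition here so the result has no edge spaces
    return text.lower().strip()

sample = "¡Pésimo lugar! No vayas a Eddie's."
cleaned = clean_text(sample)
print(cleaned)  # pésimo lugar no vayas eddie
```

Accented letters such as `é` count as word characters in Python 3, so they survive the first substitution, while the single-letter patterns only target unaccented ASCII letters.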

### TF-IDF
We apply TF-IDF weighting. The idea behind TF-IDF is that a term that is frequent within one document but rare across the whole corpus is more informative for classification, so it receives a higher weight than a term that appears in most documents.
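
As a rough illustration of that weighting, the sketch below computes TF-IDF by hand on a three-document toy corpus. The corpus is hypothetical, and the formula used is the smoothed variant that scikit-learn applies by default (`idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation); it is a sketch for intuition, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each string is one already-cleaned document.
docs = [
    "hotel limpio hotel comodo",
    "hotel caro",
    "restaurante caro",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents in which each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in df.items()}

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    raw = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[2])  # "restaurante caro"
print(weights)
```

In the third document both terms occur once, but `restaurante` appears in only one document of the corpus while `caro` appears in two, so `restaurante` ends up with the larger weight.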

In [7]:
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer

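# Keep at most 2500 terms (the most frequent ones) that occur in at least
# 7 documents and in at most 80% of them; drop Spanish stop words.
vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8,
                             stop_words=stopwords.words('spanish'))
# GaussianNB requires a dense array, hence .toarray()
processed_features = vectorizer.fit_transform(processed_features).toarray()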

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JuanMa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


To evaluate the classifier, we split the data into a training set (80%) and a test set (20%):

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)
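
Note that the vectorizer above is fitted on the whole corpus before the split, so document frequencies from the test opinions influence the training features. One possible alternative, sketched here on a hypothetical toy corpus, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training portion only. The texts, labels, and the `stratify` argument are assumptions for illustration, and the scores reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the cleaned opinions.
texts = [
    "habitacion limpia y cama comoda",
    "hotel con alberca y buen desayuno",
    "la habitacion del hotel era ruidosa",
    "recepcion del hotel muy lenta",
    "comida deliciosa y meseros atentos",
    "el restaurante sirve tacos excelentes",
    "platillos caros y porciones pequenas",
    "el mesero tardo con la comida",
]
labels = ["Hotel"] * 4 + ["Restaurant"] * 4

# Split the raw text first; stratify keeps class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits TF-IDF on X_train only, then trains the classifier.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(list(zip(X_test, predictions)))
```

With this layout, `pipeline.predict` transforms unseen text with the vocabulary and IDF values learned from the training set alone.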

### Gaussian Naive Bayes

In [9]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train);

In [10]:
predictions = model.predict(X_test)

In [11]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

[[1005   34   16]
 [  80 2840  328]
 [ 143   87 1510]]
              precision    recall  f1-score   support

  Attractive       0.82      0.95      0.88      1055
       Hotel       0.96      0.87      0.91      3248
  Restaurant       0.81      0.87      0.84      1740

    accuracy                           0.89      6043
   macro avg       0.86      0.90      0.88      6043
weighted avg       0.89      0.89      0.89      6043

0.8861492636107894
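
To make the link between the confusion matrix and the report explicit, the sketch below recomputes accuracy, precision, and recall from the matrix printed above. Rows are true classes and columns are predicted classes, which is the layout `confusion_matrix` uses; the variable names are illustrative.

```python
# Confusion matrix copied from the GaussianNB output above
# (rows = true class, columns = predicted class).
classes = ["Attractive", "Hotel", "Restaurant"]
cm = [
    [1005, 34, 16],
    [80, 2840, 328],
    [143, 87, 1510],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

metrics = {}
for i, name in enumerate(classes):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: predicted as this class
    actual = sum(cm[i])                    # row sum: truly this class
    metrics[name] = (true_pos / predicted, true_pos / actual)

print(f"accuracy={accuracy:.4f}")
for name, (precision, recall) in metrics.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

For example, 328 hotel opinions were predicted as restaurants, which lowers the recall for `Hotel` and the precision for `Restaurant`.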


### Multinomial Naive Bayes

In [11]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train);

In [12]:
predictions = model.predict(X_test)

In [13]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

[[ 999   47    9]
 [   2 3223   23]
 [   3  248 1489]]
              precision    recall  f1-score   support

  Attractive       1.00      0.95      0.97      1055
       Hotel       0.92      0.99      0.95      3248
  Restaurant       0.98      0.86      0.91      1740

    accuracy                           0.95      6043
   macro avg       0.96      0.93      0.95      6043
weighted avg       0.95      0.95      0.94      6043

0.945060400463346


## Objective 2: Opinion vs. Sentiment

**Feature Set**

We extract the features (opinion text) and the labels (polarity, from 1 to 5) with the following script:

In [14]:
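# Select columns by name instead of position: opinion text as input,
# polarity score (1 to 5) as target.
features = tourist_opinions["Opinion"].values
labels = tourist_opinions["Polarity"].values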
labels

array([1, 1, 1, ..., 5, 5, 5], dtype=int64)

Once the data is separated into features and labels, we can preprocess the text to clean it. For this we use regular expressions:

In [15]:
processed_features = []

for raw_text in features:
    # Replace every non-word character (punctuation, symbols) with a space
    processed_feature = re.sub(r'\W', ' ', str(raw_text))

    # Remove isolated single ASCII letters surrounded by whitespace
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove a single letter at the start of the text
    # (the caret must be unescaped to anchor at the start of the string)
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Collapse runs of whitespace into a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature)

    # Remove a leading 'b' left over from byte-string representations
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)

### TF-IDF
We apply TF-IDF weighting. The idea behind TF-IDF is that a term that is frequent within one document but rare across the whole corpus is more informative for classification, so it receives a higher weight than a term that appears in most documents.

In [16]:
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer

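# Keep at most 2500 terms (the most frequent ones) that occur in at least
# 7 documents and in at most 80% of them; drop Spanish stop words.
vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8,
                             stop_words=stopwords.words('spanish'))
# GaussianNB requires a dense array, hence .toarray()
processed_features = vectorizer.fit_transform(processed_features).toarray()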

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JuanMa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


To evaluate the classifier, we split the data into a training set (80%) and a test set (20%):

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)

### Gaussian Naive Bayes

In [18]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train);

In [19]:
predictions = model.predict(X_test)

In [20]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

[[  55   24   17    4    4]
 [  62   46   23   12    2]
 [ 156  118   87   48   13]
 [ 280  272  229  231  151]
 [ 707  637  359  582 1924]]
              precision    recall  f1-score   support

           1       0.04      0.53      0.08       104
           2       0.04      0.32      0.07       145
           3       0.12      0.21      0.15       422
           4       0.26      0.20      0.23      1163
           5       0.92      0.46      0.61      4209

    accuracy                           0.39      6043
   macro avg       0.28      0.34      0.23      6043
weighted avg       0.70      0.39      0.48      6043

0.38772133046500085


### Multinomial Naive Bayes

In [21]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train);

In [22]:
predictions = model.predict(X_test)

In [23]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

[[   2    2   18   30   52]
 [   0    0   19   49   77]
 [   1    0   33  151  237]
 [   0    0   10  296  857]
 [   0    0    8  265 3936]]
              precision    recall  f1-score   support

           1       0.67      0.02      0.04       104
           2       0.00      0.00      0.00       145
           3       0.38      0.08      0.13       422
           4       0.37      0.25      0.30      1163
           5       0.76      0.94      0.84      4209

    accuracy                           0.71      6043
   macro avg       0.44      0.26      0.26      6043
weighted avg       0.64      0.71      0.65      6043

0.7061062386232004
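
The polarity classes are heavily imbalanced, so accuracy alone can be misleading here. As a rough check, the sketch below compares the MultinomialNB accuracy with a trivial baseline that always predicts the most frequent class, using only the support counts and the confusion-matrix diagonal printed above. The variable names are illustrative.

```python
# Per-class support copied from the classification report above (polarity 1..5).
support = {1: 104, 2: 145, 3: 422, 4: 1163, 5: 4209}
# Diagonal of the MultinomialNB confusion matrix above
# (correctly classified opinions per class).
correct = {1: 2, 2: 0, 3: 33, 4: 296, 5: 3936}

total = sum(support.values())

# Baseline: always predict the most frequent polarity.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

model_accuracy = sum(correct.values()) / total

# Macro recall treats every class equally, regardless of its size.
macro_recall = sum(correct[c] / support[c] for c in support) / len(support)

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"model accuracy:    {model_accuracy:.4f}")
print(f"macro recall:      {macro_recall:.2f}")
```

The model's accuracy (about 0.71) is only about one point above the majority-class baseline (about 0.70), and the macro recall of 0.26 shows that polarities 1 to 3 are rarely recovered.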
