# NLP Lab Practice - Topics
Topic: Topic extraction

Introduction:

In this lab, we will work on a project that analyzes complaints about a restaurant. The company wants to know what diners complain about most, so that it can identify areas for improvement.

## 1. Implement topic extraction with LDA

## 2. Try it in the ChatGPT interface and ask it for the topics. Do they differ much from the LDA topics? Comment on the differences.


In [1]:
### Library imports

import spacy
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

nlp = spacy.load("es_core_news_lg")

In [2]:
sentencias = [
    "La comida llegó fría y tardó mucho en ser servida.",
    "El servicio fue lento y el personal parecía desinteresado.",
    "Pedimos un plato vegetariano y nos trajeron uno con carne.",
    "Las mesas estaban sucias y no habían sido limpiadas correctamente.",
    "No había suficiente iluminación en el área de comedor.",
    "El restaurante estaba abarrotado y no se respetaron las reservas.",
    "La música estaba demasiado alta y no se podía mantener una conversación.",
    "La bebida que pedimos nunca llegó a la mesa.",
    "Los platos estaban mal sazonados y carecían de sabor.",
    "El menú tenía errores de ortografía y gramática.",
    "El restaurante tenía un olor desagradable.",
    "No había opciones vegetarianas en el menú.",
    "Se nos cobró de más en la factura.",
    "El baño del restaurante estaba sucio y sin papel higiénico.",
    "El vino que pedimos estaba agrio y parecía estar mal almacenado.",
    "La carne de mi plato estaba cruda en el centro.",
    "Las sillas eran incómodas y difíciles de mover.",
    "El personal fue grosero y poco atento.",
    "El restaurante estaba demasiado caliente y sin aire acondicionado.",
    "No se nos proporcionaron utensilios de mesa ni servilletas.",
    "El postre que pedimos estaba rancio y no era comestible.",
    "El menú era limitado y no ofrecía opciones para alergias alimentarias.",
    "El restaurante tenía una plaga de insectos.",
    "Los platos se veían descuidados y mal presentados.",
    "La factura tenía errores en los cálculos de los precios."
]




In [3]:
import nltk

# The Spanish stopword list must be available locally; download it once if needed
nltk.download('stopwords', quiet=True)

wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('spanish')


def normalize_document(doc):
    # lowercase and strip leading/trailing whitespace
    doc = doc.lower()
    doc = doc.strip()
    # tokenize the document (punctuation marks become separate tokens and are kept)
    tokens = wpt.tokenize(doc)
    # filter stopwords out of the document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create the document from the filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

# Normalize every complaint in the corpus
norm_corpus = [normalize_document(document) for document in sentencias]

norm_corpus


['comida llegó fría tardó ser servida .',
 'servicio lento personal parecía desinteresado .',
 'pedimos plato vegetariano trajeron carne .',
 'mesas sucias sido limpiadas correctamente .',
 'suficiente iluminación área comedor .',
 'restaurante abarrotado respetaron reservas .',
 'música demasiado alta podía mantener conversación .',
 'bebida pedimos nunca llegó mesa .',
 'platos mal sazonados carecían sabor .',
 'menú errores ortografía gramática .',
 'restaurante olor desagradable .',
 'opciones vegetarianas menú .',
 'cobró factura .',
 'baño restaurante sucio papel higiénico .',
 'vino pedimos agrio parecía mal almacenado .',
 'carne plato cruda centro .',
 'sillas incómodas difíciles mover .',
 'personal grosero atento .',
 'restaurante demasiado caliente aire acondicionado .',
 'proporcionaron utensilios mesa servilletas .',
 'postre pedimos rancio comestible .',
 'menú limitado ofrecía opciones alergias alimentarias .',
 'restaurante plaga insectos .',
 'platos veían descuidados
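The normalized documents above still carry a trailing `.` token, because the normalization step lowercases and filters stopwords but does not strip punctuation. With the default `TfidfVectorizer` settings this should not affect the matrix, since single-character and punctuation tokens are ignored by its default token pattern. If a cleaner corpus is wanted anyway, a minimal sketch could look like the following. The function name and the tiny stopword set are illustrative assumptions, not part of the original notebook; in the notebook itself the NLTK `stop_words` list would be passed instead.

```python
import re

# Hypothetical mini stopword set, used only to keep this sketch self-contained
DEMO_STOP_WORDS = {"la", "el", "y", "en", "mucho", "fue"}


def normalize_without_punctuation(doc, stop_words):
    # lowercase and trim surrounding whitespace
    doc = doc.lower().strip()
    # replace every character that is neither a word character nor whitespace
    doc = re.sub(r"[^\w\s]", " ", doc)
    # split on whitespace and drop stopwords
    tokens = [token for token in doc.split() if token not in stop_words]
    return " ".join(tokens)


example = normalize_without_punctuation(
    "La comida llegó fría y tardó mucho en ser servida.", DEMO_STOP_WORDS
)
print(example)
```

Python's `\w` matches accented letters in `str` patterns, so words such as `llegó` and `fría` survive the substitution.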

In [4]:
### Creating the vectorizer and generating the TF-IDF matrix

tfifd_vec = TfidfVectorizer()

TFIDF = tfifd_vec.fit_transform(norm_corpus)

TFIDF

<25x88 sparse matrix of type '<class 'numpy.float64'>'
	with 110 stored elements in Compressed Sparse Row format>

In [5]:
# Inspecting the matrix

# Get the vocabulary so we can label the columns
vocab = tfifd_vec.get_feature_names_out()

print("Palabras en el vocabulario: ", len(vocab))
print(vocab)

Palabras en el vocabulario:  88
['abarrotado' 'acondicionado' 'agrio' 'aire' 'alergias' 'alimentarias'
 'almacenado' 'alta' 'atento' 'baño' 'bebida' 'caliente' 'carecían'
 'carne' 'centro' 'cobró' 'comedor' 'comestible' 'comida' 'conversación'
 'correctamente' 'cruda' 'cálculos' 'demasiado' 'desagradable'
 'descuidados' 'desinteresado' 'difíciles' 'errores' 'factura' 'fría'
 'gramática' 'grosero' 'higiénico' 'iluminación' 'incómodas' 'insectos'
 'lento' 'limitado' 'limpiadas' 'llegó' 'mal' 'mantener' 'menú' 'mesa'
 'mesas' 'mover' 'música' 'nunca' 'ofrecía' 'olor' 'opciones' 'ortografía'
 'papel' 'parecía' 'pedimos' 'personal' 'plaga' 'plato' 'platos' 'podía'
 'postre' 'precios' 'presentados' 'proporcionaron' 'rancio' 'reservas'
 'respetaron' 'restaurante' 'sabor' 'sazonados' 'ser' 'servicio' 'servida'
 'servilletas' 'sido' 'sillas' 'sucias' 'sucio' 'suficiente' 'tardó'
 'trajeron' 'utensilios' 'vegetarianas' 'vegetariano' 'veían' 'vino'
 'área']


In [6]:
# Build a DataFrame to show the result: the TF-IDF weight of each token in each document
pd.DataFrame(TFIDF.toarray(), columns=vocab)

,abarrotado,acondicionado,agrio,aire,alergias,alimentarias,almacenado,alta,atento,baño,...,sucio,suficiente,tardó,trajeron,utensilios,vegetarianas,vegetariano,veían,vino,área
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.415749,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.492489,0.0,0.0,0.492489,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5
5,0.536162,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.415749,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## LDA implementation

In [7]:
# Create an LDA model with 3 topics
lda_model = LatentDirichletAllocation(n_components=3, max_iter=20, random_state=20)

# Fit the model and get the document-topic matrix
X_topics = lda_model.fit_transform(TFIDF)

topic_words = lda_model.components_
topic_words

array([[0.33451763, 0.33447347, 0.77847526, 0.33447347, 0.33438676,
        0.33438676, 0.77847526, 0.74639283, 0.33466421, 0.3343944 ,
        0.82276382, 0.33447347, 0.80521907, 0.76431802, 0.33479153,
        1.07751319, 0.33452126, 0.86088602, 0.33437673, 0.74639283,
        0.77777765, 0.33479153, 0.85951727, 0.69699865, 0.33469415,
        0.33476896, 0.33445881, 0.33450489, 1.27749411, 1.46208147,
        0.33437673, 0.86945273, 0.33466421, 0.3343944 , 0.33452126,
        0.33450489, 0.33469418, 0.33445881, 0.33438676, 0.77777765,
        0.76630833, 1.08686101, 0.74639283, 0.76015949, 0.76574619,
        0.77777765, 0.33450489, 0.74639283, 0.82276382, 0.33438676,
        0.33469415, 0.33589615, 0.86945273, 0.3343944 , 0.72644534,
        1.78937098, 0.33450941, 0.33469418, 0.76431802, 0.7514824 ,
        0.74639283, 0.86088602, 0.85951727, 0.33476896, 0.33464187,
        0.86088602, 0.33451763, 0.33451763, 0.33560106, 0.80521907,
        0.80521907, 0.33437673, 0.33445881, 0.33

## Words in each topic

In [8]:
# Number of words to show for each topic
n_top_words = 4

for i, topic_dist in enumerate(topic_words):
    # Vocabulary indices sorted by ascending weight within this topic
    sorted_topic_dist = np.argsort(topic_dist)
    # Keep the n_top_words highest-weight words, highest first
    top_words = np.array(vocab)[sorted_topic_dist][-n_top_words:][::-1]
    print("Tópico", str(i+1), top_words)

Tópico 1 ['pedimos' 'factura' 'errores' 'mal']
Tópico 2 ['restaurante' 'vegetarianas' 'desagradable' 'olor']
Tópico 3 ['personal' 'restaurante' 'atento' 'grosero']
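The values in `components_` are unnormalized weights, so they are hard to compare across topics. One common way to read them is to divide each row by its sum, which gives a word distribution per topic. The helper below is a hypothetical sketch of that idea on a made-up 2 x 3 matrix; the function name and the toy numbers are assumptions, not values from the fitted model.

```python
import numpy as np


def top_words_per_topic(components, vocab, n_top_words):
    vocab = np.asarray(vocab)
    # Normalize each topic row so its weights sum to 1
    probs = components / components.sum(axis=1, keepdims=True)
    result = []
    for row in probs:
        # Indices of the highest-probability words, highest first
        top_idx = np.argsort(row)[::-1][:n_top_words]
        result.append([(str(vocab[j]), float(row[j])) for j in top_idx])
    return result


# Toy topic-word weights (2 topics, 3 words), for illustration only
toy_components = np.array([[3.0, 1.0, 6.0],
                           [5.0, 4.0, 1.0]])
toy_vocab = ["factura", "menú", "personal"]

toy_top = top_words_per_topic(toy_components, toy_vocab, 2)
for i, words in enumerate(toy_top):
    print("Topic", i + 1, words)
```

In the notebook, the same helper could be called with `lda_model.components_` and `vocab`.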


## Distribution of documents by topic

In [9]:
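# Topic distribution of each document (one row per document, one column per topic)
doc_topic = lda_model.transform(TFIDF)

for n in range(doc_topic.shape[0]):
    # argmax returns a 0-based topic index, so index 0 here is "Tópico 1" in the word listing above
    topic_doc = doc_topic[n].argmax()
    print("Documento", n+1, " -- Tópico:", topic_doc)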

Documento 1  -- Tópico: 2
Documento 2  -- Tópico: 2
Documento 3  -- Tópico: 0
Documento 4  -- Tópico: 0
Documento 5  -- Tópico: 1
Documento 6  -- Tópico: 2
Documento 7  -- Tópico: 0
Documento 8  -- Tópico: 0
Documento 9  -- Tópico: 0
Documento 10  -- Tópico: 0
Documento 11  -- Tópico: 1
Documento 12  -- Tópico: 1
Documento 13  -- Tópico: 0
Documento 14  -- Tópico: 2
Documento 15  -- Tópico: 0
Documento 16  -- Tópico: 1
Documento 17  -- Tópico: 2
Documento 18  -- Tópico: 2
Documento 19  -- Tópico: 1
Documento 20  -- Tópico: 1
Documento 21  -- Tópico: 0
Documento 22  -- Tópico: 2
Documento 23  -- Tópico: 1
Documento 24  -- Tópico: 1
Documento 25  -- Tópico: 0
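Since the goal of the lab is to find out what diners complain about most, it helps to count how many documents fall into each topic. The sketch below does this with `np.bincount`, using the 0-based assignments copied from the output above; in the notebook the same array could be obtained with `doc_topic.argmax(axis=1)`. The variable names are illustrative.

```python
import numpy as np

# Dominant topic (0-based) for documents 1..25, copied from the output above
assigned = np.array([2, 2, 0, 0, 1, 2, 0, 0, 0, 0, 1, 1, 0,
                     2, 0, 1, 2, 2, 1, 1, 0, 2, 1, 1, 0])

# Number of documents whose dominant topic is 0, 1 and 2
docs_per_topic = np.bincount(assigned, minlength=3)

for topic_idx, n_docs in enumerate(docs_per_topic):
    # topic_idx + 1 matches the 1-based numbering used in the word listing
    print("Topic", topic_idx + 1, "->", int(n_docs), "documents")
```

With these assignments the split is 10, 8 and 7 documents for topics 1, 2 and 3. With only 25 short sentences, that ranking should be read with caution.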


## How to visualize the topics in space

In [1]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)

In [10]:
# Import the required libraries
import pyLDAvis
# In pyLDAvis 3.4.x the scikit-learn helper is pyLDAvis.lda_model; the older
# name fails with: ModuleNotFoundError: No module named 'pyLDAvis.sklearn'
import pyLDAvis.lda_model


# Prepare the data for pyLDAvis (fitted model, document-term matrix, fitted vectorizer)
panel = pyLDAvis.lda_model.prepare(lda_model, TFIDF, tfifd_vec, mds='tsne')

# Display the LDA model
pyLDAvis.display(panel)