<a href="https://colab.research.google.com/github/griisnc/NLP_Python/blob/main/Spanish_Topic_Modeling_with_LDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This code was made by Griselda Navarrete

# Topic Modeling with LDA on Spanish Documents

This code demonstrates how to perform topic modeling on a set of Spanish documents using Latent Dirichlet Allocation (LDA). It uses nltk for the Spanish stop word list and scikit-learn (sklearn) for vectorization and the LDA model.

Workflow:



 1.  ##  Preprocessing:

      It begins by installing the nltk library if it's not already present.

      It then downloads Spanish stop words from nltk to filter out common, uninformative words.

      CountVectorizer transforms the documents into a document-term matrix: a numerical representation of the text in which each row is a document, each column is a word, and each cell holds that word's count in that document. Spanish stop words are excluded during this step.

 2.  ## LDA Model:

      An LDA model is initialized with a specified number of topics (in this case, 2) and a random seed for reproducibility.

      The model is then trained on the document-term matrix to identify the underlying topics within the documents.

 3.  ## Topic Interpretation:

      The code extracts the most representative keywords for each topic based on their weights within the model.

      It prints these keywords, providing insight into the themes captured by each topic.

 4.  ## Document Assignment:

      The model is used to predict the topic probabilities for each document.

      Each document is then assigned to the topic with the highest probability.

      This assignment is printed, showing the topic affiliation for each document.





In essence, this code takes a collection of Spanish text documents, identifies the main topics discussed within them, and categorizes each document according to its dominant topic. This technique is widely used in text analysis, information retrieval, and natural language processing to understand the thematic structure of large text corpora.

In [None]:
# Install the NLTK library (pip skips it if it is already installed).
!pip install nltk

# Import NLTK and download the stop word corpus, which includes the Spanish list.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
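
To make the effect of a stop word list concrete, here is a minimal sketch of stop-word filtering in plain Python. The set `SAMPLE_STOP_WORDS` is a small hand-written stand-in chosen for illustration (an assumption); it is not the NLTK list, which is considerably longer. `remove_stop_words` is a hypothetical helper, not part of nltk.

```python
# Hand-picked stand-in for a Spanish stop word list (assumption: NOT the NLTK list).
SAMPLE_STOP_WORDS = {"el", "la", "en", "con", "y", "del", "un"}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


filtered = remove_stop_words("El gato juega con la pelota en el jardín", SAMPLE_STOP_WORDS)
print(filtered)  # ['gato', 'juega', 'pelota', 'jardín']
```

Only the content words survive, which is what lets the topic model focus on terms that distinguish one document from another.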


In [None]:
# Import the CountVectorizer (bag of words) and LatentDirichletAllocation (LDA) classes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
# Define a list of example documents in Spanish.
documents = [
    "El gato juega con la pelota en el jardín",
    "El perro duerme en la casa",
    "La pelota está en el jardín cerca del perro",
    "El gato y el perro juegan juntos",
    "La casa tiene un jardín muy grande",
]

In [None]:
# Step 1: Transform the documents into a document-term matrix (bag of words),
# excluding NLTK's Spanish stop words.
spanish_stop_words = stopwords.words('spanish')
vectorizer = CountVectorizer(stop_words=spanish_stop_words)
X = vectorizer.fit_transform(documents)
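
For intuition about what `X` contains, the sketch below builds a small document-term count matrix by hand. It mimics, in simplified form, CountVectorizer's default behavior of lowercasing the text, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically. The helper `build_count_matrix` and the short stop word set are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical stop word subset for illustration (not the NLTK list).
STOP_WORDS = {"el", "la", "en"}

# Same pattern as CountVectorizer's default token_pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_count_matrix(docs, stop_words):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [
        [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in stop_words]
        for doc in docs
    ]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocabulary, matrix = build_count_matrix(
    ["El perro duerme en la casa", "El gato y el perro juegan juntos"],
    STOP_WORDS,
)
print(vocabulary)  # ['casa', 'duerme', 'gato', 'juegan', 'juntos', 'perro']
print(matrix)      # [[1, 1, 0, 0, 0, 1], [0, 0, 1, 1, 1, 1]]
```

Note that the single-character word "y" disappears even though it is not in the sample stop word set, because the token pattern requires at least two characters.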

In [None]:
# Step 2: Apply LDA to identify topics.
lda = LatentDirichletAllocation(n_components=2, random_state=42)  # We want 2 topics
lda.fit(X)
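
After fitting, `lda.components_` holds one row of word weights (pseudo-counts) per topic. Dividing each row by its sum turns it into a probability distribution over the vocabulary. The sketch below shows that normalization on a made-up array with the same layout; the numbers are assumptions, not output from the model above.

```python
import numpy as np

# Hypothetical pseudo-counts with the same layout as lda.components_:
# one row per topic, one column per vocabulary word. Values are made up.
components = np.array([
    [3.0, 1.0, 1.0],
    [0.5, 0.5, 4.0],
])

# Divide each row by its sum so that every topic's word weights add up to 1.
topic_word_dist = components / components.sum(axis=1)[:, np.newaxis]
print(topic_word_dist)
```

The same expression can be applied to the fitted model by substituting `lda.components_` for `components`.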

In [None]:
# Step 3: Show the keywords of each topic.
words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Tema {idx + 1}:")
    # argsort() sorts ascending, so the 5 highest-weight words print weakest first.
    print(", ".join([words[i] for i in topic.argsort()[-5:]]))
    print()

Tema 1:
duerme, juegan, juntos, casa, perro

Tema 2:
cerca, juega, gato, pelota, jardín
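
Because `argsort()` sorts in ascending order, the keyword lists above end with the strongest word for each topic. If strongest-first order is preferred, reversing the sorted indices before slicing is one option. The sketch below compares both orderings on a made-up vocabulary and weight vector (both are assumptions for illustration).

```python
import numpy as np

# Hypothetical vocabulary and topic weights, made up for illustration.
words = np.array(["casa", "gato", "jardín", "pelota", "perro"])
topic = np.array([0.4, 2.1, 3.0, 1.7, 0.2])

top_ascending = words[topic.argsort()[-3:]]        # weakest of the top 3 first
top_descending = words[topic.argsort()[::-1][:3]]  # strongest first

print(", ".join(top_ascending))   # pelota, gato, jardín
print(", ".join(top_descending))  # jardín, gato, pelota
```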



In [None]:
# Step 4: Assign each document to its most probable topic.
doc_topics = lda.transform(X)
for i, doc in enumerate(documents):
    print(f"Documento {i + 1}: pertenece al tema {doc_topics[i].argmax() + 1}")

Documento 1: pertenece al tema 2
Documento 2: pertenece al tema 1
Documento 3: pertenece al tema 2
Documento 4: pertenece al tema 1
Documento 5: pertenece al tema 1
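
The assignment above reports only the winning topic, which hides how decisive each assignment is. A minimal sketch of reporting the winning probability as well follows; the array stands in for the output of `lda.transform(X)`, and its values are made up for illustration.

```python
import numpy as np

# Hypothetical document-topic probabilities (each row sums to 1), shaped like
# the array returned by lda.transform(X). The numbers are made up.
doc_topics = np.array([
    [0.15, 0.85],
    [0.80, 0.20],
    [0.55, 0.45],
])

assigned = doc_topics.argmax(axis=1)  # index of the most probable topic per document
confidence = doc_topics.max(axis=1)   # probability of that topic

for i, (topic_idx, prob) in enumerate(zip(assigned, confidence)):
    print(f"Document {i + 1}: topic {topic_idx + 1} (probability {prob:.2f})")
```

A document such as the third one, where the two probabilities are close, is a weak assignment and may deserve a closer look than the label alone suggests.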
