# Topic Modeling with LDA

This notebook demonstrates how to perform topic modeling on text data using Latent Dirichlet Allocation (LDA). We'll work with the Cohere/movies dataset from Hugging Face to discover underlying topics in movie-related content.

## Parameters Explanation

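- **alpha**: Dirichlet prior on the document-topic distributions. A lower alpha means each document is concentrated on fewer topics.
- **eta**: Dirichlet prior on the topic-word distributions. A lower eta means each topic is concentrated on fewer words.
- **n_iters**: Number of sampling iterations used to train the model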
- **n_topics**: Number of topics to extract from the corpus

In [4]:
alpha: float = 0.1
eta: float = 0.01
n_iters: int = 10000
n_topics: int = 30
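
To get a feel for what the concentration parameters do, the sketch below draws topic mixtures from a symmetric Dirichlet distribution with NumPy and compares how dominant the largest topic is for a low and a high value. The demo size `n_topics_demo`, the helper `mean_largest_share`, and the comparison value `10.0` are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics_demo = 10  # hypothetical small topic count, chosen only for the demo


def mean_largest_share(concentration, n_samples=1000):
    """Average share of the largest topic across symmetric Dirichlet draws."""
    draws = rng.dirichlet([concentration] * n_topics_demo, size=n_samples)
    return draws.max(axis=1).mean()


sparse_share = mean_largest_share(0.1)   # low alpha: mass piles onto few topics
dense_share = mean_largest_share(10.0)   # high alpha: mass spreads across topics

print(f"alpha=0.1  -> mean largest topic share: {sparse_share:.2f}")
print(f"alpha=10.0 -> mean largest topic share: {dense_share:.2f}")
```

The same reasoning applies to `eta`, with words in a topic taking the place of topics in a document.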

## Model Training

We'll now train our LDA model using the parameters specified earlier. LDA is a probabilistic model; the `lda` package fits it with collapsed Gibbs sampling, which works by:
1. Initially assigning random topics to each word in each document
2. Iteratively improving these assignments through sampling
3. Converging towards a stable distribution of topics
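
The three steps above can be sketched as a toy collapsed Gibbs sampler. This is a minimal illustration under stated assumptions, not the `lda` package's implementation, which is optimized and differs in detail. The function name `gibbs_lda`, the list-of-word-ids input format, and the toy corpus are all hypothetical.

```python
import numpy as np


def gibbs_lda(docs, vocab_size, n_topics, alpha, eta, n_iters, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    ndz = np.zeros((len(docs), n_topics))   # topic counts per document
    nzw = np.zeros((n_topics, vocab_size))  # word counts per topic
    nz = np.zeros(n_topics)                 # total word count per topic

    # Step 1: assign a random topic to every word occurrence
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = rng.integers(n_topics, size=len(doc))
        assignments.append(z_doc)
        for w, z in zip(doc, z_doc):
            ndz[d, z] += 1
            nzw[z, w] += 1
            nz[z] += 1

    # Step 2: resample each assignment conditioned on all the others
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts
                ndz[d, z] -= 1
                nzw[z, w] -= 1
                nz[z] -= 1
                # Unnormalized conditional probability of each topic
                p = (ndz[d] + alpha) * (nzw[:, w] + eta) / (nz + vocab_size * eta)
                z = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment
                assignments[d][i] = z
                ndz[d, z] += 1
                nzw[z, w] += 1
                nz[z] += 1

    # Step 3: turn the final counts into smoothed distributions
    doc_topic = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)
    topic_word = (nzw + eta) / (nzw + eta).sum(axis=1, keepdims=True)
    return doc_topic, topic_word


# Toy corpus: three documents over a four-word vocabulary
docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 0, 1]]
doc_topic, topic_word = gibbs_lda(
    docs, vocab_size=4, n_topics=2, alpha=0.1, eta=0.01, n_iters=50
)
print(doc_topic.round(2))
```

Each row of `doc_topic` is a document's topic mixture and each row of `topic_word` is a topic's word distribution, so both sets of rows sum to 1.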

Note: before running the code below, first run the `01_data_preparation.ipynb` notebook, which saves `doc_term_matrix.pkl` in the `data` folder.

In [5]:
import pickle

# Load the document-term matrix
with open('./data/doc_term_matrix.pkl', 'rb') as file:
    X = pickle.load(file)
print('Document-term matrix shape:', X.shape)

Document-term matrix shape: (4799, 4270)
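
Because LDA works on word counts, a quick sanity check of the matrix before training can catch preparation problems early. The helper below is a sketch: `summarize_counts` is a hypothetical name, and the code assumes `X` is either a dense NumPy array or a SciPy sparse matrix of non-negative counts. It is demonstrated on a toy array rather than the real data.

```python
import numpy as np


def summarize_counts(X):
    """Basic totals for a document-term count matrix (dense or SciPy sparse)."""
    words_per_doc = np.asarray(X.sum(axis=1)).ravel()
    count_per_term = np.asarray(X.sum(axis=0)).ravel()
    return {
        "n_documents": X.shape[0],
        "vocab_size": X.shape[1],
        "n_words": int(words_per_doc.sum()),
        "empty_documents": int((words_per_doc == 0).sum()),
        "unused_terms": int((count_per_term == 0).sum()),
    }


# Toy matrix: the second document is empty and the second term is never used
toy = np.array([[2, 0, 1], [0, 0, 0], [1, 0, 3]])
summary = summarize_counts(toy)
print(summary)
```

The first three entries correspond to the `n_documents`, `vocab_size`, and `n_words` values that the training log reports.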


In [6]:
import numpy as np
from lda import LDA
import pickle
import matplotlib.pyplot as plt

# Create the topic model; keyword arguments make the parameter mapping explicit
topic_model = LDA(n_topics=n_topics, n_iter=n_iters, alpha=alpha, eta=eta)
topic_model.fit(X)

INFO:lda:n_documents: 4799
INFO:lda:vocab_size: 4270
INFO:lda:n_words: 113150
INFO:lda:n_topics: 30
INFO:lda:n_iter: 10000
INFO:lda:<0> log likelihood: -1562499
INFO:lda:<10> log likelihood: -1036278
INFO:lda:<20> log likelihood: -1010096
INFO:lda:<30> log likelihood: -998962
INFO:lda:<40> log likelihood: -990713
INFO:lda:<50> log likelihood: -985038
INFO:lda:<60> log likelihood: -981660
INFO:lda:<70> log likelihood: -977954
INFO:lda:<80> log likelihood: -974699
INFO:lda:<90> log likelihood: -972498
INFO:lda:<100> log likelihood: -971117
INFO:lda:<110> log likelihood: -968598
INFO:lda:<120> log likelihood: -966196
INFO:lda:<130> log likelihood: -966546
INFO:lda:<140> log likelihood: -964363
INFO:lda:<150> log likelihood: -963142
INFO:lda:<160> log likelihood: -963264
INFO:lda:<170> log likelihood: -962168
INFO:lda:<180> log likelihood: -960734
INFO:lda:<190> log likelihood: -960996
INFO:lda:<200> log likelihood: -960100
INFO:lda:<210> log likelihood: -959922
INFO:lda:<220> log likeliho

<lda.lda.LDA at 0x12013daf0>

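One way to judge whether training has settled is to look at the relative change in log likelihood between consecutive checkpoints. The sketch below uses the values for iterations 0 to 210 copied from the log above. The helper `relative_changes` is a hypothetical name. As an assumption to verify against the `lda` documentation, the fitted model should expose the same values through its `loglikelihoods_` attribute.

```python
# Log likelihoods at iterations 0, 10, ..., 210, copied from the training log
log_likelihoods = [
    -1562499, -1036278, -1010096, -998962, -990713, -985038, -981660,
    -977954, -974699, -972498, -971117, -968598, -966196, -966546,
    -964363, -963142, -963264, -962168, -960734, -960996, -960100, -959922,
]


def relative_changes(values):
    """Absolute relative change between consecutive checkpoints."""
    return [abs(b - a) / abs(a) for a, b in zip(values, values[1:])]


changes = relative_changes(log_likelihoods)
print(f"first step: {changes[0]:.4f}")
print(f"last step:  {changes[-1]:.6f}")
```

The large early change and the small later ones show the log likelihood flattening out; note that it is not strictly monotonic, since sampling is stochastic.
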
### Save Your Model
Saving the model allows you to reuse it later without retraining, which can save time and computational resources. 

In [8]:
import os

# Create the models directory if it doesn't exist
os.makedirs('./models', exist_ok=True)

# Save the model; the context manager closes the file even if dumping fails
name = 'model_0.pkl'
with open(f'./models/{name}', 'wb') as file:
    pickle.dump(topic_model, file)
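
Reloading works the same way in reverse. The sketch below defines a small hypothetical helper, `load_model`, and demonstrates the round trip with a stand-in dictionary in a temporary directory, so it runs without a trained model. For the model saved above, the path would be `./models/model_0.pkl`.

```python
import os
import pickle
import tempfile


def load_model(path):
    """Load a pickled object, such as a fitted topic model, from disk."""
    with open(path, 'rb') as file:
        return pickle.load(file)


# Round trip with a stand-in object instead of a real fitted model
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'model_demo.pkl')
    with open(demo_path, 'wb') as file:
        pickle.dump({'n_topics': 30}, file)
    restored = load_model(demo_path)

print(restored)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.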

## Running Multiple Models

To run multiple models sequentially, you can use the `topic_modeller.py` script with a configuration file (`models_configs.jsonl`).

1. **Edit the `models_configs.jsonl` file**: Add one configuration per model you want to run. Each configuration should specify the number of topics (`n_topics`), the number of iterations (`n_iters`), and any other relevant parameters. Give each configuration a unique `run_id`, because each model is saved under its `run_id`.

2. **Run the script**: Use the following command to run the script:
```sh
python src/topic_modeller.py -c configs/model_configs.jsonl
```

Note: The models run sequentially, not in parallel, and are all saved in the `models` folder.
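
As an illustration of what such a configuration file might contain, the sketch below parses two JSONL entries and checks that every `run_id` is unique. The keys `run_id`, `n_topics`, and `n_iters` come from the description above; the `alpha` and `eta` keys, the example values, and the helper `parse_configs` are assumptions, since the exact schema expected by `topic_modeller.py` is not shown here.

```python
import json

# Hypothetical file contents: one JSON object per line
example_jsonl = """\
{"run_id": "model_1", "n_topics": 20, "n_iters": 2000, "alpha": 0.1, "eta": 0.01}
{"run_id": "model_2", "n_topics": 30, "n_iters": 2000, "alpha": 0.1, "eta": 0.01}
"""


def parse_configs(text):
    """Parse JSONL text into a list of dicts and reject duplicate run_id values."""
    configs = [json.loads(line) for line in text.splitlines() if line.strip()]
    run_ids = [config["run_id"] for config in configs]
    duplicates = {run_id for run_id in run_ids if run_ids.count(run_id) > 1}
    if duplicates:
        raise ValueError(f"duplicate run_id values: {sorted(duplicates)}")
    return configs


configs = parse_configs(example_jsonl)
for config in configs:
    print(config["run_id"], config["n_topics"], config["n_iters"])
```

Checking for duplicates up front avoids one run silently overwriting another model saved under the same name.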