# Topic Modeling with LDA

This notebook demonstrates how to perform topic modeling on text data using Latent Dirichlet Allocation (LDA). We'll work with the Cohere/movies dataset from Hugging Face to discover underlying topics in movie-related content.

We'll now train our LDA model using the parameters described below. LDA is a probabilistic model, and the library fits it by sampling, which works roughly as follows:
1. Initially assign a random topic to each word in each document.
2. Iteratively improve these assignments through sampling.
3. Converge towards a stable distribution of topics.

In this notebook, we work with the `lda` library, which runs the steps above under the hood.
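To make the three steps above concrete, here is a minimal sketch of a collapsed Gibbs sampler on a toy corpus. It is an illustration only, not the `lda` library's actual code: the function name `toy_gibbs_lda`, the variable names, and the two toy documents are all made up for this example.

```python
import numpy as np

def toy_gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic_counts = np.zeros((len(docs), n_topics))
    topic_word_counts = np.zeros((n_topics, vocab_size))
    topic_totals = np.zeros(n_topics)

    def update(d, w, k, delta):
        # Add or remove one word occurrence from all count tables
        doc_topic_counts[d, k] += delta
        topic_word_counts[k, w] += delta
        topic_totals[k] += delta

    # Step 1: assign a random topic to every word occurrence
    assignments = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            update(d, w, assignments[d][i], +1)

    # Steps 2 and 3: resample each assignment given all the others, repeatedly
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)  # remove the current assignment
                weights = ((doc_topic_counts[d] + alpha)
                           * (topic_word_counts[:, w] + eta)
                           / (topic_totals + vocab_size * eta))
                k = rng.choice(n_topics, p=weights / weights.sum())
                assignments[d][i] = k
                update(d, w, k, +1)

    # Turn smoothed counts into probability distributions
    doc_topic = doc_topic_counts + alpha
    topic_word = topic_word_counts + eta
    return (doc_topic / doc_topic.sum(axis=1, keepdims=True),
            topic_word / topic_word.sum(axis=1, keepdims=True))

# Two tiny documents over a four-word vocabulary (word ids 0..3)
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3]]
doc_topic, topic_word = toy_gibbs_lda(docs, n_topics=2, vocab_size=4)
print(doc_topic.round(2))
```

Each row of `doc_topic` is one document's topic mixture, and each row of `topic_word` is one topic's distribution over the vocabulary. A pure-Python loop like this is far too slow for a real corpus, which is one reason to rely on an optimized library.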

## Parameters Explanation

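- **alpha**: Dirichlet prior on the document-topic distributions. Lower alpha means each document concentrates on fewer topics.
- **eta**: Dirichlet prior on the topic-word distributions. Lower eta means each topic concentrates on fewer words.
- **n_iters**: Number of sampling iterations used for training (the library calls this parameter `n_iter`).
- **n_topics**: Number of topics to extract from the corpus.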
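The effect of the concentration parameters can be seen by drawing from a symmetric Dirichlet distribution directly. The sketch below is independent of the `lda` library; the helper `mean_largest_share` and the choice of five topics are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo_topics = 5  # illustrative value, unrelated to the notebook's n_topics

def mean_largest_share(concentration, n_samples=2000):
    """Average weight of the dominant component in symmetric Dirichlet draws."""
    samples = rng.dirichlet([concentration] * n_demo_topics, size=n_samples)
    return samples.max(axis=1).mean()

sparse = mean_largest_share(0.1)   # low alpha: most mass lands on very few topics
dense = mean_largest_share(10.0)   # high alpha: mass is spread almost evenly
print(f"alpha=0.1  -> mean largest share {sparse:.2f}")
print(f"alpha=10.0 -> mean largest share {dense:.2f}")
```

The same reasoning applies to `eta`, with words in a topic taking the place of topics in a document.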


## Prerequisite

Before running the code below, first run the notebook `01_data_preparation.ipynb`, which saves `doc_term_matrix.pkl` in the `data` folder.
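Because the sampler works on word counts, it can help to sanity-check the matrix before training. The helper below is a hypothetical sketch, not part of either notebook, and it assumes a dense array; a SciPy sparse matrix would need `.toarray()` first.

```python
import numpy as np

def check_doc_term_matrix(X):
    """Return a list of problems found in a dense document-term count matrix."""
    X = np.asarray(X)
    problems = []
    if X.ndim != 2:
        problems.append(f"expected 2 dimensions, got {X.ndim}")
        return problems
    if not np.issubdtype(X.dtype, np.integer):
        problems.append(f"expected integer counts, got dtype {X.dtype}")
    if (X < 0).any():
        problems.append("negative counts found")
    if (X.sum(axis=1) == 0).any():
        problems.append("at least one document has no words")
    return problems

# Toy matrix: 2 documents, 3 vocabulary terms
print(check_doc_term_matrix(np.array([[2, 0, 1], [0, 3, 0]])))
```

An empty list means none of these checks failed.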


In [5]:
import numpy as np
from lda import LDA
import pickle
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Define variables
alpha: float = 0.1
eta: float = 0.01
n_iters: int = 10000
n_topics: int = 30

# Load the document-term matrix
with open('./data/doc_term_matrix.pkl', 'rb') as file:
    X = pickle.load(file)
print('Document-term matrix shape:', X.shape)

# Create the topic model; keyword arguments avoid relying on parameter order
topic_model = LDA(n_topics=n_topics, n_iter=n_iters, alpha=alpha, eta=eta)
topic_model.fit(X)

INFO:lda:n_documents: 4799
INFO:lda:vocab_size: 4270
INFO:lda:n_words: 113150
INFO:lda:n_topics: 30
INFO:lda:n_iter: 10000
INFO:lda:<0> log likelihood: -1562499


Document-term matrix shape: (4799, 4270)


INFO:lda:<10> log likelihood: -1036859
INFO:lda:<20> log likelihood: -1009551
INFO:lda:<30> log likelihood: -997092
INFO:lda:<40> log likelihood: -989110
INFO:lda:<50> log likelihood: -983677
INFO:lda:<60> log likelihood: -979969
INFO:lda:<70> log likelihood: -976171
INFO:lda:<80> log likelihood: -974125
INFO:lda:<90> log likelihood: -972745
INFO:lda:<100> log likelihood: -970283
INFO:lda:<110> log likelihood: -968486
INFO:lda:<120> log likelihood: -966324
INFO:lda:<130> log likelihood: -965653
INFO:lda:<140> log likelihood: -965132
INFO:lda:<150> log likelihood: -963517
INFO:lda:<160> log likelihood: -962577
INFO:lda:<170> log likelihood: -961632
INFO:lda:<180> log likelihood: -962084
INFO:lda:<190> log likelihood: -961140
INFO:lda:<200> log likelihood: -961082
INFO:lda:<210> log likelihood: -960442
INFO:lda:<220> log likelihood: -959509
INFO:lda:<230> log likelihood: -959107
INFO:lda:<240> log likelihood: -958831
INFO:lda:<250> log likelihood: -957979
INFO:lda:<260> log likelihood: -

<lda.lda.LDA at 0x10eb26150>
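The log above shows the log likelihood improving quickly at first and then levelling off. One rough way to quantify this is the relative change between consecutive logged values, sketched below. The helper `relative_changes` is illustrative; the numbers are copied from the log above. After fitting, the `lda` model should expose the same series through its `loglikelihoods_` attribute, which could be passed to this helper instead.

```python
import numpy as np

def relative_changes(loglikelihoods):
    """Relative change between consecutive log-likelihood values."""
    ll = np.asarray(loglikelihoods, dtype=float)
    return np.abs(np.diff(ll)) / np.abs(ll[:-1])

# Values copied from the training log: iterations 0-30 and 240-250
early = [-1562499, -1036859, -1009551, -997092]
late = [-958831, -957979]
print("early:", relative_changes(early))
print("late: ", relative_changes(late))
```

In this run the first logged step changes the log likelihood by roughly a third, while the step from iteration 240 to 250 changes it by less than 0.1%, which suggests that far fewer than 10000 iterations may be enough for this corpus.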

### Save Your Model
Saving the model allows you to reuse it later without retraining, which can save time and computational resources. 

In [6]:
import os

# Create the models directory if it doesn't exist
os.makedirs('./models', exist_ok=True)

# Save the model; the with-block closes the file even if pickling fails
name = 'model_0.pkl'
with open('./models/{name}'.format(name=name), 'wb') as file:
    pickle.dump(topic_model, file)
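To reuse the saved model in a later session, it can be loaded back with `pickle`. This is a minimal sketch: the helper `load_model` is not part of the notebook, and unpickling the model assumes the `lda` package is installed in the environment that loads it.

```python
import os
import pickle

def load_model(path):
    """Load a pickled object, such as a fitted topic model, from disk."""
    with open(path, 'rb') as file:
        return pickle.load(file)

model_path = os.path.join('models', 'model_0.pkl')
if os.path.exists(model_path):
    topic_model = load_model(model_path)
else:
    print(f'No saved model found at {model_path}')
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.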

## Running Multiple Models

To run multiple models sequentially, you can use the `topic_modeller.py` script with a configuration file (`models_configs.jsonl`).

1. **Edit the `models_configs.jsonl` file**: Add one configuration for each model you want to run. Each configuration should specify the number of topics (`n_topics`), the number of iterations (`n_iters`), and other relevant parameters. Make sure each configuration has a different `run_id`, because each model is saved under its `run_id`.

2. **Run the script**: Use the following command to run the script:
```sh
python src/topic_modeller.py -c configs/model_configs.jsonl
```

Note: The models are run in sequence, not in parallel. All models are saved in the `models` folder.
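The script itself is not shown here, so the following is only a sketch of what a JSONL configuration could look like and how the unique-`run_id` requirement could be checked. The `alpha` and `eta` fields, the values, and the helper `parse_configs` are assumptions for illustration; check `topic_modeller.py` for the fields it actually expects.

```python
import json

def parse_configs(lines):
    """Parse JSONL config lines and check that every run_id is unique."""
    configs = [json.loads(line) for line in lines if line.strip()]
    run_ids = [config['run_id'] for config in configs]
    duplicates = {r for r in run_ids if run_ids.count(r) > 1}
    if duplicates:
        raise ValueError(f"duplicate run_id values: {sorted(duplicates)}")
    return configs

# Hypothetical file contents: one JSON object per line
example_lines = [
    '{"run_id": "model_1", "n_topics": 20, "n_iters": 2000, "alpha": 0.1, "eta": 0.01}',
    '{"run_id": "model_2", "n_topics": 30, "n_iters": 2000, "alpha": 0.1, "eta": 0.01}',
]
for config in parse_configs(example_lines):
    print(config['run_id'], config['n_topics'])
```

Checking for duplicates up front avoids one model silently overwriting another that was saved under the same name.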