# LDA Coherence Analysis (c_v)

This notebook loads **LDA-ready** data, trains LDA models over a range of topic counts, computes the **c_v coherence** score for each model, and plots the results. The plot and the printed scores help in choosing an appropriate number of topics.

## Key Libraries

- **[matplotlib](https://matplotlib.org/stable/index.html)** for plotting
- **[gensim](https://radimrehurek.com/gensim/)** for dictionary creation, LDA modeling, and coherence computation

## Imports

In [1]:
import json
from pathlib import Path

import matplotlib.pyplot as plt
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

## Helper Functions

`compute_coherence_values_c_v` trains one LDA model per topic count and scores each with c_v coherence. `plot_coherence_c_v` plots those scores against the number of topics.

In [2]:
def compute_coherence_values_c_v(dictionary, corpus, texts, start=2, limit=10, step=2):
    """
    Compute c_v coherence for various numbers of topics.
    Returns a list of LDA models and their coherence values.
    """
    model_list = []
    coherence_values = []

    for num_topics in range(start, limit + 1, step):
        model = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            random_state=42,
            update_every=1,
            passes=10,
            alpha='auto',
            per_word_topics=True
        )
        model_list.append(model)
        coherencemodel = CoherenceModel(
            model=model,
            texts=texts,
            dictionary=dictionary,
            coherence='c_v'
        )
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

def plot_coherence_c_v(coherence_values, start=2, limit=10, step=2):
    """
    Plot c_v coherence scores for each tested topic number and display the figure.
    """
    x = range(start, limit + 1, step)
    plt.figure(figsize=(8, 5))
    plt.plot(x, coherence_values, marker='o', label='c_v coherence')
    for i, score in enumerate(coherence_values):
        plt.text(x[i], score + 0.002, f"{score:.3f}", ha='center')
    plt.title('LDA Model Coherence (c_v)')
    plt.xlabel("Number of Topics")
    plt.ylabel("Coherence Score")
    plt.grid(True)
    plt.legend()
    plt.tight_layout()
    plt.show()
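
Both helpers iterate over `range(start, limit + 1, step)`, so `limit` is included whenever it falls on the step grid. A minimal sketch of that grid, using a hypothetical `topic_grid` helper that is not part of the notebook:

```python
def topic_grid(start=2, limit=10, step=2):
    """Return the topic counts the helpers iterate over (hypothetical helper)."""
    # limit + 1 makes the upper bound inclusive, mirroring the helpers above
    return list(range(start, limit + 1, step))


default_grid = topic_grid()        # defaults of the helper functions
main_grid = topic_grid(2, 20, 2)   # values passed in main()
print(default_grid)
print(main_grid)
```

Passing the same `start`, `limit`, and `step` to both helpers keeps the x-axis of the plot aligned with the list of coherence scores.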

## Main Execution

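The `main()` function loads `lda_ready_data.json`, flattens posts and their comments into token lists, builds a Gensim dictionary and bag-of-words corpus, trains LDA models for 2 to 20 topics in steps of 2, computes c_v coherence for each, and plots the results.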
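
The loading step treats each post body and each comment as a separate document. The sketch below reproduces that flattening on a made-up record. The field names (`combined_processed`, `comments`, `comment_processed`) come from the notebook's code, while the sample text and the `flatten_docs` name are assumptions for illustration.

```python
def flatten_docs(posts):
    """Turn posts and their comments into one list of token lists (illustrative helper)."""
    docs = []
    for post in posts:
        text = post.get("combined_processed", "").strip()
        if text:
            docs.append(text.split())
        for comment in post.get("comments", []):
            c_text = comment.get("comment_processed", "").strip()
            if c_text:
                docs.append(c_text.split())
    return docs


# Made-up sample record, not taken from lda_ready_data.json
sample_posts = [
    {
        "combined_processed": "topic model coherence",
        "comments": [
            {"comment_processed": "good coherence score"},
            {"comment_processed": "   "},  # whitespace-only comments are skipped
        ],
    }
]

sample_docs = flatten_docs(sample_posts)
print(sample_docs)
```

Because comments become documents of their own, short comments can outnumber posts in the corpus, which is worth keeping in mind when reading the coherence scores.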

In [17]:
def main():
    """Main function to load data, prepare it for LDA, train multiple models,
    compute coherence scores, and display a coherence plot."""
    processed_data_path = Path(r"C:\Users\laure\Desktop\dissertation_notebook\Data\lda_ready_data.json")
    with open(processed_data_path, 'r', encoding='utf-8') as f:
        processed_data = json.load(f)

    all_docs = []
    for post in processed_data:
        combined_text = post.get('combined_processed', '').strip()
        if combined_text:
            all_docs.append(combined_text.split())
        for c in post.get('comments', []):
            c_text = c.get('comment_processed', '').strip()
            if c_text:
                all_docs.append(c_text.split())

    dictionary = Dictionary(all_docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    corpus = [dictionary.doc2bow(doc) for doc in all_docs]

    start, limit, step = 2, 20, 2

    model_list, coherence_scores = compute_coherence_values_c_v(
        dictionary=dictionary,
        corpus=corpus,
        texts=all_docs,
        start=start,
        limit=limit,
        step=step
    )

    plot_coherence_c_v(
        coherence_values=coherence_scores,
        start=start,
        limit=limit,
        step=step
    )

    print("c_v Coherence Scores:")
    for num_t, score in zip(range(start, limit + 1, step), coherence_scores):
        print(f" Number of Topics = {num_t}: c_v = {score:.4f}")

if __name__ == "__main__":
    main()

c_v Coherence Scores:
 Number of Topics = 2: c_v = 0.2967
 Number of Topics = 4: c_v = 0.3328
 Number of Topics = 6: c_v = 0.3709
 Number of Topics = 8: c_v = 0.3432
 Number of Topics = 10: c_v = 0.3701
 Number of Topics = 12: c_v = 0.3854
 Number of Topics = 14: c_v = 0.3595
 Number of Topics = 16: c_v = 0.3695
 Number of Topics = 18: c_v = 0.3809
 Number of Topics = 20: c_v = 0.4091
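
One simple way to read these scores is to rank the topic counts by c_v. The sketch below does that with the values printed above. Picking the maximum is an assumption about the selection rule, not something the notebook itself does, and the variable names are illustrative.

```python
# c_v scores copied from the output above
scores = {
    2: 0.2967, 4: 0.3328, 6: 0.3709, 8: 0.3432, 10: 0.3701,
    12: 0.3854, 14: 0.3595, 16: 0.3695, 18: 0.3809, 20: 0.4091,
}

# Rank topic counts from highest to lowest coherence
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_k, best_score = ranked[0]

print(f"Highest c_v: {best_score:.4f} at {best_k} topics")
print("Top three:", ranked[:3])
```

Here the highest score sits at the upper edge of the tested range (20 topics), so extending `limit` may be worth trying before settling on a value.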


## References

**Reference:**  
Hunter, J. D. (2007) *Matplotlib: A 2D Graphics Environment* [computer program]. *Computing in Science & Engineering*, 9(3), pp. 90–95 (v3.7.3).  
Available from: [https://matplotlib.org/](https://matplotlib.org/) [Accessed 12 January 2025].

**Git Repo:**  
- [Matplotlib GitHub](https://github.com/matplotlib/matplotlib)

**Reference:**  
Řehůřek, R., and Sojka, P. (2010) *Software Framework for Topic Modelling with Large Corpora*. In *Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks* [computer program].  
Available from: [https://radimrehurek.com/gensim/](https://radimrehurek.com/gensim/) [Accessed 12 January 2025].

**Git Repo:**  
- [Gensim GitHub](https://github.com/RaRe-Technologies/gensim)