# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *J*

**Names:**

* *Kenza Driss*
* *Maximilien Hoffbeck*
* *Jaeyi Jeong*
* *Yoojin Kim*


---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA
import json
import random

## Exercise 4.8: Topics extraction

In [2]:
# Create (or reuse) the Spark session.
spark = SparkSession.builder \
    .appName("LDA Course Topics") \
    .getOrCreate()

# Load the preprocessed courses, one JSON object per line.
with open("preprocessed_courses.txt", "r") as f:
    lines = f.readlines()
local_data = [json.loads(line.strip()) for line in lines]

# Each course is a (courseId, tokens) record.
schema = StructType([
    StructField("courseId", StringType(), True),
    StructField("tokens", ArrayType(StringType()), True)
])
df = spark.createDataFrame(local_data, schema=schema)

# Build bag-of-words count vectors; terms that occur in fewer than 2 courses are dropped.
cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=10000, minDF=2.0)
cv_model = cv.fit(df)
result = cv_model.transform(df)

# Fit LDA with k topics, leaving the Dirichlet priors at Spark's defaults.
k = 10
lda = LDA(k=k, maxIter=10, seed=42, featuresCol="features")
lda_model = lda.fit(result)

# describeTopics() returns, for each topic, the indices of its top terms (10 by default).
topics = lda_model.describeTopics()
vocab = cv_model.vocabulary

def topic_terms(indices):
    """Map vocabulary indices back to the corresponding terms."""
    return [vocab[i] for i in indices]

topics_list = topics.rdd.map(lambda row: topic_terms(row['termIndices'])).collect()

print("Top 10 topics (with top words):")
for i, terms in enumerate(topics_list):
    print(f"Topic {i+1}: {', '.join(terms)}")

Top 10 topics (with top words):
Topic 1: theory, linear, problem, algorithm, concept, data, quantum, prerequisite, introduction, equation
Topic 2: acoustic, audio, room, microphone, loudspeaker, processing, rotation, signal, hearing, training
Topic 3: skill, teaching, assessment, end, outcome, technique, exercise, work, keywords, concept
Topic 4: map, cartography, de, laba, landscape, territorial, la, journal, art, press
Topic 5: energy, research, report, epfl, technology, laboratory, field, lab, semester, scientific
Topic 6: drug, credit, risk, pharmacology, problem, theory, set, evaluate, target, assessment
Topic 7: chemical, process, chemistry, protein, flow, water, transfer, concept, treatment, numerical
Topic 8: material, protein, property, application, organic, presentation, fundamental, exercise, nanomaterials, physical
Topic 9: electron, tem, liquid, microscopy, crystal, linear, 3d, reconstruction, lcd, sem
Topic 10: chemical, biology, drug, biological, invited, semester, speak

## Print k = 10 topics extracted using LDA and give them labels.
From the output, we can label the topics as follows:
- Topic 1: Theoretical Foundations & Algorithms
- Topic 2: Acoustics & Audio Processing
- Topic 3: Pedagogy & Skill Development
- Topic 4: Urbanism & Territorial Planning
- Topic 5: Energy & Scientific Research
- Topic 6: Pharmacology & Risk Modeling
- Topic 7: Chemical Engineering & Fluid Dynamics
- Topic 8: Materials Science & Nanotech
- Topic 9: Electron Microscopy & Imaging
- Topic 10: Life Sciences & Scientific Seminars

## How does it compare with LSI?
In our analysis, both LSI and LDA were applied to the corpus to uncover latent topics. While LSI uses a linear algebraic approach based on SVD, LDA is a probabilistic generative model that assumes documents are mixtures of topics and topics are distributions over words.

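The topics extracted with LSI tended to be harder to interpret, often containing a mix of technical and generic terms from multiple domains. One reason is that SVD components carry both positive and negative term weights, so a single dimension can blend or contrast several themes at once. This made it difficult to assign clear, semantic labels to many of the latent dimensions identified by LSI.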

In contrast, LDA produced more coherent and interpretable topics. For example, some topics clearly aligned with specific domains such as “Acoustics and Audio Processing”, “Urbanism and Territorial Planning”, or “Pharmacology and Risk Modeling”. This thematic clarity made it significantly easier to label the topics and understand the underlying structure of the course offerings.

While LSI can be useful for dimensionality reduction and reveals abstract concept spaces, LDA is better suited for topic interpretation and organization in textual datasets like course descriptions. The results suggest that LDA provides more actionable insights when the goal is to extract and label distinct themes from the data.
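
To make the generative view of LDA described above more concrete, here is a minimal sketch in plain Python. It only illustrates the forward direction (topics to words); Spark's LDA does the reverse and infers the topics from the observed words. The toy topics, the value of `alpha` and the helper names are invented for illustration and are not taken from our corpus.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    k = len(topic_word)
    # Per-document topic mixture, drawn from a symmetric Dirichlet prior.
    theta = sample_dirichlet([alpha] * k, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to theta.
        z = rng.choices(range(k), weights=theta)[0]
        # Pick a word from that topic's word distribution.
        vocab, probs = zip(*topic_word[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Toy topic-word distributions (made up for illustration only).
toy_topics = [
    {"audio": 0.5, "signal": 0.3, "microphone": 0.2},
    {"drug": 0.5, "pharmacology": 0.3, "risk": 0.2},
]

rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=0.5, length=8, rng=rng)
print([round(t, 2) for t in theta], doc)
```

Because every topic is a probability distribution over words, its top words can be read off directly, which is part of why LDA topics are easier to label than SVD components.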

## Exercise 4.9: Dirichlet hyperparameters

In [3]:
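def run_lda(alpha, beta, k=10, max_iter=10, seed=42):
    """Fit LDA with symmetric priors alpha and beta; return each topic's top words."""
    # Spark's EM optimizer only accepts concentration values greater than 1.
    lda = LDA(k=k, maxIter=max_iter, seed=seed, optimizer="em",
              featuresCol="features", docConcentration=[alpha], topicConcentration=beta)
    model = lda.fit(result)
    topics = model.describeTopics()
    vocab = cv_model.vocabulary
    # Translate term indices into words for every topic.
    return topics.rdd.map(lambda row: [vocab[i] for i in row['termIndices']]).collect()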

alphas = [1.1, 2.0, 5.0, 10.0]
beta_fixed = 1.01
print("Varying α with fixed β = 1.01\n")
for alpha in alphas:
    print(f"--- α = {alpha} ---")
    topics_alpha = run_lda(alpha=alpha, beta=beta_fixed)
    for i, terms in enumerate(topics_alpha):
        print(f"Topic {i+1}: {', '.join(terms)}")
    print("\n")

betas = [1.1, 2.0, 5.0, 10.0]
alpha_fixed = 6
print("Varying β with fixed α = 6\n")
for beta in betas:
    print(f"--- β = {beta} ---")
    topics_beta = run_lda(alpha=alpha_fixed, beta=beta)
    for i, terms in enumerate(topics_beta):
        print(f"Topic {i+1}: {', '.join(terms)}")
    print("\n")

Varying α with fixed β = 1.01

--- α = 1.1 ---
Topic 1: cell, material, process, presentation, assessment, data, concept, outcome, keywords, end
Topic 2: circuit, laser, theory, time, quantum, power, introduction, noise, prerequisite, risk
Topic 3: energy, concept, mass, end, principle, theory, assessment, outcome, teaching, week
Topic 4: exercise, data, technique, prerequisite, assessment, introduction, end, keywords, control, linear
Topic 5: process, report, end, research, teaching, outcome, assessment, reaction, concept, control
Topic 6: skill, end, exercise, outcome, concept, report, work, teaching, activity, assessment
Topic 7: problem, prerequisite, report, skill, data, application, work, assessment, solution, outcome
Topic 8: process, material, theory, energy, structure, exercise, concept, application, end, prerequisite
Topic 9: data, end, assessment, activity, policy, teaching, management, skill, outcome, class
Topic 10: architecture, optic, optical, assessment, end, exercise, data, keywords, concept, outcome


--- α = 2.0 ---
Topic 1: cell, material, presentation, process, assessment, concept, data, outcome, end, keywords
Topic 2: laser, circuit, time, theory, quantum, introduction, concept, prerequisite, noise, application
Topic 3: energy, concept, mass, theory, end, assessment, outcome, principle, teaching, prerequisite
Topic 4: exercise, data, prerequisite, assessment, technique, end, outcome, keywords, teaching, skill
Topic 5: process, end, report, outcome, teaching, assessment, research, concept, work, application
Topic 6: skill, end, concept, outcome, exercise, report, work, teaching, assessment, activity
Topic 7: problem, prerequisite, data, skill, report, assessment, end, outcome, work, application
Topic 8: process, material, theory, concept, energy, exercise, application, structure, end, prerequisite
Topic 9: data, end, assessment, activity, policy, teaching, outcome, exercise, skill, treatment
Topic 10: architecture, optic, optical, assessment, end, exercise, keywords, data, concept, outcome


--- α = 5.0 ---
Topic 1: cell, material, concept, assessment, end, presentation, outcome, process, data, keywords
Topic 2: laser, concept, end, outcome, prerequisite, material, application, theory, assessment, time
Topic 3: concept, energy, end, assessment, theory, outcome, teaching, exercise, prerequisite, keywords
Topic 4: exercise, data, assessment, prerequisite, end, outcome, concept, skill, teaching, keywords
Topic 5: process, end, outcome, assessment, teaching, report, concept, application, keywords, work
Topic 6: skill, end, concept, assessment, outcome, work, exercise, teaching, report, activity
Topic 7: problem, data, end, assessment, prerequisite, outcome, skill, work, concept, teaching
Topic 8: material, concept, process, exercise, end, application, assessment, prerequisite, keywords, theory
Topic 9: end, data, assessment, exercise, teaching, activity, outcome, skill, concept, process
Topic 10: architecture, assessment, end, concept, keywords, exercise, outcome, prerequisite, teaching, data

                                                                                

Topic 1: concept, material, end, assessment, outcome, data, presentation, keywords, exercise, process
Topic 2: concept, end, outcome, assessment, material, prerequisite, application, teaching, data, work
Topic 3: concept, end, assessment, outcome, exercise, teaching, prerequisite, theory, keywords, energy
Topic 4: exercise, assessment, data, concept, end, outcome, prerequisite, skill, teaching, keywords
Topic 5: end, outcome, process, assessment, concept, teaching, exercise, application, keywords, prerequisite
Topic 6: concept, end, assessment, outcome, skill, teaching, exercise, work, prerequisite, keywords
Topic 7: end, assessment, outcome, data, prerequisite, concept, teaching, skill, exercise, work
Topic 8: end, concept, material, exercise, assessment, outcome, application, process, prerequisite, keywords
Topic 9: end, exercise, assessment, data, teaching, concept, outcome, skill, activity, prerequisite
Topic 10: architecture, assessment, end, concept, outcome, exercise, keywords, 

                                                                                

Topic 1: end, concept, assessment, outcome, exercise, data, prerequisite, teaching, keywords, skill
Topic 2: end, concept, assessment, outcome, exercise, prerequisite, teaching, data, keywords, material
Topic 3: concept, end, assessment, outcome, exercise, teaching, prerequisite, keywords, skill, data
Topic 4: end, concept, assessment, exercise, outcome, prerequisite, data, teaching, keywords, skill
Topic 5: end, assessment, concept, outcome, exercise, teaching, prerequisite, keywords, application, data
Topic 6: end, concept, assessment, outcome, exercise, teaching, prerequisite, skill, keywords, work
Topic 7: end, assessment, concept, outcome, exercise, teaching, prerequisite, skill, data, keywords
Topic 8: end, concept, assessment, outcome, exercise, prerequisite, teaching, keywords, material, application
Topic 9: end, concept, assessment, outcome, exercise, teaching, prerequisite, data, skill, keywords
Topic 10: concept, end, assessment, outcome, exercise, prerequisite, teaching, ke

## How do these values impact the results?
From the output, we notice that lower values of α give more sharply defined topics: with a small α, each document is dominated by only a few topics, so the topics can specialize. At α = 1.1, for example, Topic 2 contains "circuit", "laser" and "quantum", and Topic 10 contains "architecture", "optic" and "optical". As α increases, the topics overlap more and more, until nearly every topic is led by the same generic course-description terms ("end", "concept", "assessment", "outcome"). This introduces redundancy and reduces clarity.

Similarly, reducing β yields more focused topics with distinct keywords, while increasing β smooths the topic-word distributions, so that topics come to share many general terms and become less distinctive and less useful.

Within the ranges we tested, the lowest values of α and β (around 1.1–2.0) produced the most coherent and informative topics.
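
One way to put a number on the overlap discussed above is the mean pairwise Jaccard similarity between the top-word lists of the topics. The sketch below is an illustration rather than part of our pipeline: it uses only the first three topics of the α = 1.1 block and the first three topics of the last block printed in the output above, copied by hand, and the helper names are our own.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two word lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def mean_pairwise_overlap(topics):
    """Average Jaccard similarity over all pairs of topics."""
    pairs = list(combinations(topics, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# First three topics of the alpha = 1.1 block (copied from the output above).
low_alpha = [
    "cell, material, process, presentation, assessment, data, concept, outcome, keywords, end".split(", "),
    "circuit, laser, theory, time, quantum, power, introduction, noise, prerequisite, risk".split(", "),
    "energy, concept, mass, end, principle, theory, assessment, outcome, teaching, week".split(", "),
]

# First three topics of the last block printed in the output above.
last_block = [
    "end, concept, assessment, outcome, exercise, data, prerequisite, teaching, keywords, skill".split(", "),
    "end, concept, assessment, outcome, exercise, prerequisite, teaching, data, keywords, material".split(", "),
    "concept, end, assessment, outcome, exercise, teaching, prerequisite, keywords, skill, data".split(", "),
]

print("alpha = 1.1:", round(mean_pairwise_overlap(low_alpha), 2))
print("last block :", round(mean_pairwise_overlap(last_block), 2))
```

A value close to 0 means the topics share almost no top words, while a value close to 1 means they are nearly interchangeable.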

## Exercise 4.10: EPFL's taught subjects

In [4]:
# Final hyperparameters for the course corpus.
final_k = 10
final_alpha = [1.1]
final_beta = 0.6

# The online optimizer is used here; unlike EM, it accepts concentration values below 1.
lda_final = LDA(k=final_k, maxIter=10, seed=42, optimizer="online",
                featuresCol="features", docConcentration=final_alpha, topicConcentration=final_beta)
lda_model_final = lda_final.fit(result)

final_topics = lda_model_final.describeTopics()
vocab = cv_model.vocabulary

def topic_terms(indices):
    """Map vocabulary indices back to the corresponding terms."""
    return [vocab[i] for i in indices]

topics_words = final_topics.rdd.map(lambda row: topic_terms(row['termIndices'])).collect()

# Print the top words of each topic.
for i, words in enumerate(topics_words):
    print(f"Topic {i+1}: {', '.join(words)}")

Topic 1: concept, end, exercise, prerequisite, assessment, teaching, theory, outcome, data, process
Topic 2: audio, assessment, linear, processing, theory, introduction, technique, teaching, exercise, prerequisite
Topic 3: processing, power, signal, application, energy, imaging, introduction, exercise, technique, keywords
Topic 4: skill, exercise, ch, tool, research, concept, note, application, space, map
Topic 5: research, report, scientific, work, presentation, skill, epfl, data, plan, laboratory
Topic 6: credit, risk, assessment, information, evaluate, end, skill, application, introduction, form
Topic 7: protein, numerical, flow, structure, concept, finite, fracture, element, problem, interaction
Topic 8: exercise, skill, application, property, end, assessment, chemical, introduction, outcome, keywords
Topic 9: linear, outcome, information, skill, theory, keywords, prerequisite, week, technique, activity
Topic 10: drug, cell, end, prerequisite, teaching, introduction, keywords, conc

## Explain why you chose these values and write your labels for the topics.

After experimenting with various hyperparameter settings, we selected the combination k = 10, α = 1.1, and β = 0.6 for its balance of topic coherence and coverage. We also tried higher values of k, but they produced redundant topics, so we kept k = 10. A lower β value yielded tighter, more distinct word groupings within each topic. Note that this run uses the online optimizer, since Spark's EM optimizer only accepts concentration values above 1.

The resulting topics align well with major academic themes at EPFL, including Biotechnology and Scientific Research Methods. The topics are both interpretable and diverse.

We chose the following labels:
- Topic 1: Introductory Concepts & Pedagogy
- Topic 2: Audio Signal Processing
- Topic 3: Signal & Imaging Systems
- Topic 4: Spatial Analysis & Applied Skills
- Topic 5: Research Methods & Scientific Work
- Topic 6: Risk Analysis & Evaluation Systems
- Topic 7: Structural Mechanics & Biomechanics
- Topic 8: Materials & Chemical Properties
- Topic 9: Mathematical Modeling & Methods
- Topic 10: Life Sciences & Biotech
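
As a rough check of the remaining redundancy between topics, one can count in how many topics each top word appears. The sketch below is an illustration, not part of our pipeline: it uses the nine complete topic lines from the output above (the last line is cut off in the export, so it is left out), and the threshold of 4 topics is an arbitrary choice.

```python
from collections import Counter

# Top words of the nine fully printed topics, copied from the output above.
printed_topics = [
    "concept, end, exercise, prerequisite, assessment, teaching, theory, outcome, data, process",
    "audio, assessment, linear, processing, theory, introduction, technique, teaching, exercise, prerequisite",
    "processing, power, signal, application, energy, imaging, introduction, exercise, technique, keywords",
    "skill, exercise, ch, tool, research, concept, note, application, space, map",
    "research, report, scientific, work, presentation, skill, epfl, data, plan, laboratory",
    "credit, risk, assessment, information, evaluate, end, skill, application, introduction, form",
    "protein, numerical, flow, structure, concept, finite, fracture, element, problem, interaction",
    "exercise, skill, application, property, end, assessment, chemical, introduction, outcome, keywords",
    "linear, outcome, information, skill, theory, keywords, prerequisite, week, technique, activity",
]

# Count, for each word, the number of topics whose top-word list contains it.
topic_counts = Counter(
    word for line in printed_topics for word in set(line.split(", "))
)

# Words shared by at least 4 topics (threshold chosen arbitrarily for this sketch).
shared = {word: n for word, n in topic_counts.items() if n >= 4}
for word, n in sorted(shared.items(), key=lambda item: (-item[1], item[0])):
    print(f"{word}: appears in {n} topics")
```

Words that show up in many topics are generic course-description vocabulary rather than subject matter, so they carry little information for labeling.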

## Exercise 4.11: Wikipedia structure

In [11]:
import re

def clean_tokens(tokens):
    """Keep only purely alphabetic tokens of at least two letters."""
    return [t for t in tokens if re.match(r"^[a-zA-Z]{2,}$", t)]

# Parse one JSON article per line into (title, cleaned tokens) pairs.
raw_rdd = spark.sparkContext.textFile("ix-data/wikipedia-for-schools.txt")
parsed_rdd = raw_rdd.map(lambda line: json.loads(line)).map(lambda d: (d["title"], clean_tokens(d["tokens"])))

schema = StructType([
    StructField("title", StringType(), True),
    StructField("tokens", ArrayType(StringType()), True)
])
wiki_df = spark.createDataFrame(parsed_rdd, schema=schema)

# Larger vocabulary than for the courses; terms that occur in fewer than 5 articles are dropped.
cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=20000, minDF=5.0)
cv_model = cv.fit(wiki_df)
vectorized_df = cv_model.transform(wiki_df)

# LDA with the a priori hyperparameters k = 20, alpha = 0.5, beta = 0.3.
lda = LDA(k=20, maxIter=10, seed=42, optimizer="online",
          featuresCol="features", docConcentration=[0.5], topicConcentration=0.3)
lda_model = lda.fit(vectorized_df)

topics = lda_model.describeTopics()
vocab = cv_model.vocabulary

def map_terms(indices):
    """Map vocabulary indices back to the corresponding terms."""
    return [vocab[i] for i in indices]

topics_words = topics.rdd.map(lambda row: map_terms(row["termIndices"])).collect()

# Print the top words of each topic.
for i, words in enumerate(topics_words):
    print(f"Topic {i+1}: {', '.join(words)}")

                                                                                

Topic 1: nitrogen, star, wars, film, lucas, nitrate, lego, films, gunpowder, osaka
Topic 2: test, bradman, cricket, court, pakistan, qin, india, england, sri, innings
Topic 3: eagle, gruffydd, gwynedd, crocodile, water, llywelyn, ap, owain, crocodiles, nile
Topic 4: alphabet, greek, letters, church, alphabets, south, bacon, part, somalia, malwa
Topic 5: planet, star, stars, planets, acid, orbit, sun, comet, system, objects
Topic 6: water, nuclear, high, gas, oil, energy, number, made, production, carbon
Topic 7: apple, putin, intel, russian, computer, mac, macintosh, russia, software, computers
Topic 8: aircraft, mk, narcissus, tintin, oxford, mosquito, fighter, air, squadron, phantom
Topic 9: time, years, john, england, english, made, world, work, music, book
Topic 10: space, moon, orbit, launch, mission, lunar, rocket, landing, spacecraft, earth
Topic 11: open, wimbledon, tennis, deer, final, masters, won, murray, singles, rama
Topic 12: shiva, boer, british, ganesha, boers, deities,

## Report the values for k, α and β that you chose a priori and why you picked them.
We chose the following LDA hyperparameters for the Wikipedia-for-Schools dataset:
- k = 20
- α = 0.5
- β = 0.3

These values were selected to reflect the breadth and specificity of the Wikipedia content. A larger number of topics (k = 20) accounts for the wide variety of domains covered. A low α (below 1) encourages each article to concentrate on a few topics, and a low β (below 1) encourages each topic to concentrate on a few words, which gives sharper, more coherent topics.
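
The effect of the concentration parameter can be illustrated without Spark. The sketch below samples from a symmetric Dirichlet distribution with Python's standard library and reports the average weight of the largest component; the number of samples, the seed and the helper names are arbitrary choices made for this illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension k."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=20, n_samples=2000, seed=0):
    """Average weight of the dominant component over many samples."""
    rng = random.Random(seed)
    total = sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return total / n_samples

# Smaller alpha -> most of the mass falls on a few of the k = 20 components.
for alpha in (0.5, 1.1, 10.0):
    print(f"alpha = {alpha}: mean largest share = {mean_largest_share(alpha):.3f}")
```

With k = 20, a perfectly uniform mixture would give every component a share of 0.05, so the further the largest share is above that value, the more focused the sampled mixtures are.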

## Are you convinced by the results? Give labels to the topics if possible. 
Yes. After filtering out irrelevant tokens, the results are substantially better, and many of the topics align clearly with recognizable themes. Here are the labels we chose:
- Topic 1: Movies & Entertainment
- Topic 2: Cricket & South Asian Sports
- Topic 3: Welsh History & Geography
- Topic 4: Languages & Alphabets
- Topic 5: Astronomy & Space Objects
- Topic 6: Energy & Natural Resources
- Topic 7: Computing & Technology
- Topic 8: Military Aircraft & Aviation
- Topic 9: Literature & History
- Topic 10: Space Exploration
- Topic 11: Tennis & Sports Tournaments
- Topic 12: Indian Deities & Colonial History
- Topic 13: Calendar & Dates
- Topic 14: Christianity & Religion
- Topic 15: Biology & Species
- Topic 16: Spiders, Natural Phenomena & Mythology
- Topic 17: Mathematics & Physics
- Topic 18: Politics & Global History
- Topic 19: Genetics & Bioinformatics
- Topic 20: Mathematics & Abstract Concepts
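
The token filtering mentioned above can be checked in isolation. The sketch below applies the same regular expression as `clean_tokens` to a handful of made-up tokens (the sample list is purely illustrative and does not come from the dataset).

```python
import re

# Same pattern as in the notebook: only ASCII letters, at least two of them.
TOKEN_PATTERN = re.compile(r"^[a-zA-Z]{2,}$")

def clean_tokens(tokens):
    """Keep only purely alphabetic tokens of at least two letters."""
    return [t for t in tokens if TOKEN_PATTERN.match(t)]

# Made-up tokens: numbers, single letters and mixed tokens should be dropped.
sample = ["planet", "1969", "a", "apollo-11", "NASA", "km2", "orbit"]
print(clean_tokens(sample))  # ['planet', 'NASA', 'orbit']
```

Note that this filter also discards tokens containing accented or other non-ASCII letters and does not change letter case, which may or may not be desirable depending on how the tokens were preprocessed.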