# Semester 3 Coding Portfolio Topic 2 Summative:
# Natural Language Processing

In this notebook, you are asked to do original work with little guidance, based on the skills you learned in the formative part (as well as lectures and workshops).
This section is graded not just on passing automated tests, but also on quality, originality, and effort (see assessment criteria in the assignment description).

In [1]:
# TODO: Please enter your student number here
STUDENT_NUMBER = ...

# SUMMATIVE ASSESSMENT

For this summative assignment, we ask you to find a dataset from an internet source of choice. You will then create an NLP pipeline including preprocessing, NLP analysis, and classification.
Your analysis for this notebook should have two parts: an initial NLP analysis (as done in formative notebook 1), and a classification of these results (as done in formative notebook 2). Choose one method for each of these two steps.

You should choose ONE of the following:
 - Sentiment Analysis
 - LDA
 - BERTopic

You should ALSO choose ONE of the following:
 - Decision Tree / Random Forest
 - LLM-based text classification


The general assessment criteria for all summative assignments are mentioned in the assignment description on Canvas. Each notebook also has a few specific criteria we look for; make sure you fulfil them in your approach to this assignment.
In general, make sure this notebook represents a complete project: Write an explanation of what you are hoping to achieve with your analysis, document your code well, and present results in a comprehensive way.
The assessment criteria for this notebook vary slightly depending on which methods you choose to implement:

## Sentiment Analysis
 - Selected an appropriate dataset and prepared it for analysis, including cleaning and formatting the data.
 - Effectively pre-processed the text data, including steps such as tokenization, stopword removal, lemmatization, and handling special characters.
 - Selected an appropriate sentiment analysis model or algorithm for their dataset and correctly implemented the sentiment analysis model, ensuring it is properly trained and tested.
 - Provided a clear and insightful interpretation of the sentiment analysis results, explaining the significance and implications of their findings.

## LDA 
 - Selected an appropriate dataset and prepared it for analysis, including cleaning and formatting the data.
 - Correctly created a document-term matrix or equivalent representation suitable for LDA.
 - Selected appropriate parameters for the LDA model, such as the number of topics and hyperparameters.
 - Correctly implemented the LDA model, ensuring it is properly trained on the dataset.
 - Provided a clear and insightful interpretation of the topics, explaining the significance and relevance of the discovered topics.

## BERTopic
 - Selected an appropriate dataset and prepared it for analysis, including cleaning and formatting the data.
 - Correctly generated text embeddings using a suitable model for input into BERTopic.
 - Correctly implemented the BERTopic model, ensuring it is properly trained on the dataset.
 - Accurately extracted and represented topics from the BERTopic model.
 - Provided a clear and insightful interpretation of the topics, explaining the significance and relevance of the discovered topics.

## Decision Tree / Random Forest
 - Formulated a relevant and appropriate classification objective for the NLP task.
 - Pre-processed the text data appropriately, including vectorization and other necessary steps.
 - Properly trained and tested the decision tree or random forest model.
 - Accurately printed or visualized the results, or provided insightful interpretation of the findings.

## LLM-based text classification
 - Formulated a relevant and appropriate classification objective for the NLP task.
 - Correctly prepared the data for the LLM, ensuring it is suitable for model input.
 - Properly ran the LLM and tested the LLM output.
 - Accurately printed or visualized the results, or provided insightful interpretation of the findings.

Pick a dataset of your choice. Please ensure your dataset is a csv file under 100MB named sem3_topic2_nlp_summative_data.csv

In [2]:
# Do NOT modify the contents of this cell. Start your customization in the next one!
import pandas as pd

custom_data_path = "sem3_topic2_nlp_summative_data.csv"
custom_df = pd.read_csv(custom_data_path)

<table>
<tr>
<td style="vertical-align: top; padding-right: 20px;">

<h2>BACKSTORY</h2>

<p>
Religion has always been an interesting topic for me. My parents never baptised me, choosing instead to let me decide my own beliefs when I was old enough. Still, my mother is strongly Orthodox, so growing up we celebrated Christmas on the night of the 6th to the 7th of January.
</p>

<p>
Now, living in Amsterdam, I have a Christian boyfriend. This year I will be joining his family for a Catholic Christmas celebration. They are very religious people (typical Lebanese-Italian family), and as a respectful girlfriend I decided to use this summative project as an opportunity to impress my “mother-in-law” with my growing Bible knowledge.
</p>

<p>
So in this notebook I will be <b>(1) exploring different topics that emerge throughout Bible verses</b>, and <b>(2) training a model to classify each verse into its discovered topic.</b>
</p>

<p>
The image was generated with ChatGPT.
</p>

</td>

<td>
    <img src="cross-bible.png" width="750">
</td>
</tr>
</table>


**RQ: Can topic modelling reveal meaningful themes in Bible verses, and can we automatically classify each verse into its discovered topic using a machine-learning model?**

**DEVELOPING A PIPELINE**


Below I revisit Formative notebook 1 to work out the NLP pipeline.

## Preprocessing the text for NLP
Preprocessing can involve some combination of the following steps. Which steps to use depends on what you want to do.

1. *Remove unwanted or empty messages.* We start by cleaning the data, removing messages that are unlikely to contain any useful text.

2. *Text Cleaning.*
Next, we clean the text itself. We remove any irrelevant items like HTML tags, URLs, and codes when dealing with web data. We also get rid of special characters, numbers, or punctuation that might not be necessary for analysis.

3. *Case Normalization.*
Next, we normalize the case by converting all the text to lower case. This ensures that words like 'House', 'house', and 'HOUSE' are all treated as the same word, preventing the model from treating them as different entities.

4. *Tokenization.*
Then we move to tokenization. This is where we break down the text into smaller pieces, or tokens. Tokens can be words, phrases, or even sentences. In English, this might seem as simple as splitting by spaces, but it can get complicated with languages that don’t use spaces or have complex morphology.

5. *Stop Words Removal.*
After tokenization, we often remove stop words. These are common words like 'is', 'and', 'the', which appear frequently in the text but usually don’t carry significant meaning for the analysis.

6. *Lemmatization.*
Now, we refine our tokens using lemmatization. This reduces each word to its dictionary form (lemma). For example, 'running', 'runs', and 'ran' might all be reduced to 'run'.
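
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the pipeline used later in this notebook: `STOP_WORDS` and `LEMMAS` are tiny hypothetical stand-ins for NLTK's `stopwords` corpus and `WordNetLemmatizer`, and the example sentence is made up.

```python
import re

# Hypothetical, deliberately tiny stand-ins for the NLTK resources
STOP_WORDS = {"is", "and", "the", "a", "to", "of", "in"}
LEMMAS = {"running": "run", "runs": "run", "ran": "run", "houses": "house"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)     # text cleaning: strip HTML tags
    text = re.sub(r"http\S+", " ", text)     # text cleaning: strip URLs
    text = text.lower()                      # case normalization
    text = re.sub(r"[^a-z\s]", " ", text)    # drop punctuation and digits
    tokens = text.split()                    # tokenization on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [LEMMAS.get(t, t) for t in tokens]            # toy lemmatization

tokens = preprocess("<p>The HOUSE is big, and she runs to the houses!</p>")
print(tokens)
```

With real data, the lookup table would be replaced by a proper lemmatizer, which also handles words that are not listed in advance.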

In [3]:
#core
import json
import pandas as pd
import numpy as np

#text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#vectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#LDA topic modelling
from sklearn.decomposition import LatentDirichletAllocation

#BERTopic
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

#classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

#evaluation
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
)

#visuals
import matplotlib.pyplot as plt
import seaborn as sns

#NLTK 
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")


[nltk_data] Downloading package stopwords to /Users/mac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/mac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

(0) **DATA PREPROCESSING**


My original dataset is a JSON file, so I need to convert it into CSV to satisfy the assignment criteria.

In [4]:
# load JSON
with open("ASV.json", "r") as f:
    data = json.load(f)

rows = []

for book in data["books"]:
    book_name = book["name"]
    for chapter in book["chapters"]:
        chapter_num = chapter["chapter"]
        for verse in chapter["verses"]:
            verse_num = verse["verse"]
            text = verse["text"]
            rows.append({
                "book": book_name,
                "chapter": chapter_num,
                "verse": verse_num,
                "text": text
            })

df = pd.DataFrame(rows)

# IMPORTANT: quote every field so commas inside verses cannot break the columns
import csv
df.to_csv("sem3_topic2_nlp_summative_data.csv", index=False, quoting=csv.QUOTE_ALL)


On manual inspection my generated CSV looked broken, so I check `df.head()` and `df.info()` to confirm that the verses actually loaded correctly.

In [5]:
df = pd.read_csv("sem3_topic2_nlp_summative_data.csv")
df.head()
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31102 entries, 0 to 31101
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   book     31102 non-null  object
 1   chapter  31102 non-null  int64 
 2   verse    31102 non-null  int64 
 3   text     31102 non-null  object
dtypes: int64(2), object(2)
memory usage: 972.1+ KB


In [6]:
df = pd.read_csv("sem3_topic2_nlp_summative_data.csv")
print(df.head())
print(df.iloc[0]["text"])


      book  chapter  verse                                               text
0  Genesis        1      1  In the beginning God created the heavens and t...
1  Genesis        1      2  And the earth was waste and void; and darkness...
2  Genesis        1      3  And God said, Let there be light: and there wa...
3  Genesis        1      4  And God saw the light, that it was good: and G...
4  Genesis        1      5  And God called the light Day, and the darkness...
In the beginning God created the heavens and the earth. 
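
One possible reason the CSV looked broken on manual inspection (an assumption, since I did not record what exactly looked wrong) is that many verses contain commas. With every field quoted, a proper CSV parser still recovers four columns, while a naive split on commas does not. A small sketch using only the standard library:

```python
import csv
import io

# One verse taken from the dataset preview above
rows = [["Genesis", 1, 3, "And God said, Let there be light: and there was light."]]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)  # same idea as quoting=1 in to_csv
writer.writerow(["book", "chapter", "verse", "text"])
writer.writerows(rows)

data_line = buffer.getvalue().splitlines()[1]
print(data_line)

# Splitting on commas by eye (or by hand) miscounts the columns...
naive_fields = data_line.split(",")
# ...while a real CSV parser respects the quotes.
parsed_fields = next(csv.reader([data_line]))

print(len(naive_fields), "fields with a naive split")
print(len(parsed_fields), "fields with csv.reader")
```

This is why checking the file through `pd.read_csv` is more reliable than looking at the raw text.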


(1) **PREPROCESSING FOR BERTOPIC**

In [7]:
df = pd.read_csv("sem3_topic2_nlp_summative_data.csv")

# drop rows with missing or empty text
df = df.dropna(subset=['text'])                  # remove NaNs first (astype(str) would turn them into "nan")
df['text'] = df['text'].astype(str)              # make sure the column is string type
df = df[df['text'].str.strip() != ""].copy()     # remove empty strings

# cleaning
def clean_text(text):
    text = text.lower()                         # lowercase
    text = re.sub(r"http\S+", "", text)         # remove URLs (unlikely in Bible text, kept as a safeguard)
    text = re.sub(r"[^a-zA-Z\s]", "", text)     # remove punctuation and numbers
    text = re.sub(r"\s+", " ", text).strip()    # collapse multiple spaces
    return text

df['clean_text'] = df['text'].apply(clean_text)

df.head()

| | book | chapter | verse | text | clean_text |
|---|---|---|---|---|---|
| 0 | Genesis | 1 | 1 | In the beginning God created the heavens and t... | in the beginning god created the heavens and t... |
| 1 | Genesis | 1 | 2 | And the earth was waste and void; and darkness... | and the earth was waste and void and darkness ... |
| 2 | Genesis | 1 | 3 | And God said, Let there be light: and there wa... | and god said let there be light and there was ... |
| 3 | Genesis | 1 | 4 | And God saw the light, that it was good: and G... | and god saw the light that it was good and god... |
| 4 | Genesis | 1 | 5 | And God called the light Day, and the darkness... | and god called the light day and the darkness ... |
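
To see what `clean_text` keeps and what it throws away, it helps to run it on a couple of short strings. The function is repeated here so the snippet runs on its own; the second example sentence is made up for illustration and is not a verse from the dataset.

```python
import re

def clean_text(text):
    text = text.lower()                         # lowercase
    text = re.sub(r"http\S+", "", text)         # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)     # remove punctuation and numbers
    text = re.sub(r"\s+", " ", text).strip()    # collapse multiple spaces
    return text

verse = clean_text("And God said, Let there be light: and there was light.")
made_up = clean_text("Noah's ark was 300 cubits long.")

print(verse)
print(made_up)
```

Two side effects are worth knowing about: possessives are merged into one token ("noah's" becomes "noahs"), and all numbers disappear, which could matter for verses about measurements or ages.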


(1.2) **PREPARE DOCS FOR BERTOPIC**

In [8]:
# Copy the text column as a list of documents
documents = df['text'].astype(str).tolist()

print("Number of documents:", len(documents))
print("Example document:", documents[0])


Number of documents: 31086
Example document: In the beginning God created the heavens and the earth. 


(2) **BERTOPIC**

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

docs = df["clean_text"].tolist()

vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    nr_topics=24,        # fix the number of topics
    verbose=True,
)

topics, probs = topic_model.fit_transform(docs)


2025-11-25 23:16:44,706 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/972 [00:00<?, ?it/s]

2025-11-25 23:17:19,812 - BERTopic - Embedding - Completed ✓
2025-11-25 23:17:19,813 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-25 23:17:32,547 - BERTopic - Dimensionality - Completed ✓
2025-11-25 23:17:32,547 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

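The `huggingface/tokenizers` warning in the log above does not stop the run, and the message itself suggests setting the `TOKENIZERS_PARALLELISM` environment variable. A minimal sketch, assuming it is placed in the very first cell so that it runs before BERTopic or any embedding model is imported or used:

```python
import os

# Assumption: this runs before tokenizers/transformers are used in the session.
# "false" disables tokenizer parallelism, which silences the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```
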
In [10]:
topic_info = topic_model.get_topic_info()


In [11]:
topic_info = topic_model.get_topic_info()
topic_info


| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 16531 | -1_shall_unto_thou_jehovah | [shall, unto, thou, jehovah, thy, thee, god, y... | [jehovah thy god will raise up unto thee a pro... |
| 1 | 0 | 7105 | 0_jehovah_unto_said_israel | [jehovah, unto, said, israel, david, king, cam... | [then the word of jehovah came unto me saying,... |
| 2 | 1 | 3893 | 1_shall_ye_thy_thou | [shall, ye, thy, thou, god, unto, hath, christ... | [but as for you all come on now again and i sh... |
| 3 | 2 | 918 | 2_son_sons_years_reigned | [son, sons, years, reigned, thousand, children... | [josiah was eight years old when he began to r... |
| 4 | 3 | 534 | 3_husband_woman_wife_said | [husband, woman, wife, said, unto, nakedness, ... | [let the husband render unto the wife her due ... |
| 5 | 4 | 342 | 4_border_suburbs_encamped_journeyed | [border, suburbs, encamped, journeyed, land, c... | [and the border went out unto the side of ekro... |
| 6 | 5 | 315 | 5_gold_silver_fine_oil | [gold, silver, fine, oil, shekels, flour, meal... | [his oblation was one silver platter the weigh... |
| 7 | 6 | 254 | 6_cubits_tabernacle_thereof_gate | [cubits, tabernacle, thereof, gate, breadth, c... | [and the breadth of the entrance was ten cubit... |
| 8 | 7 | 206 | 7_wine_fruit_vineyard_tree | [wine, fruit, vineyard, tree, drink, new, rain... | [and no man putteth new wine into old wineskin... |
| 9 | 8 | 163 | 8_begat_obed_azariah_boaz | [begat, obed, azariah, boaz, nahshon, elishama... | [and obed begat jesse and jesse begat david, a... |


Loop through all topics and print their top words.

In [12]:
for topic_id in topic_model.get_topic_info()["Topic"]:
    print(f"\n--- Topic {topic_id} ---")
    print(topic_model.get_topic(topic_id))



--- Topic -1 ---
[('shall', 0.026538321777265575), ('unto', 0.025059630514455814), ('thou', 0.02356718384345921), ('jehovah', 0.023416968604317252), ('thy', 0.022849549704680738), ('thee', 0.020465309384086427), ('god', 0.019922177208220097), ('ye', 0.018951934544126088), ('said', 0.01610795369776584), ('man', 0.01558994385263116)]

--- Topic 0 ---
[('jehovah', 0.03631820673863034), ('unto', 0.031462851124732766), ('said', 0.02693044418700315), ('israel', 0.02534956646716096), ('david', 0.023895908665458782), ('king', 0.02263004998852045), ('came', 0.021768839051480998), ('shall', 0.019269463021900933), ('saying', 0.01795960457727351), ('house', 0.016879904004367122)]

--- Topic 1 ---
[('shall', 0.039032298521076), ('ye', 0.028775048592669657), ('thy', 0.026558419588387623), ('thou', 0.024643875721986577), ('god', 0.02344229326095673), ('unto', 0.021094584086515183), ('hath', 0.02092884878104888), ('christ', 0.019252060319127157), ('thee', 0.01821223366061896), ('man', 0.0179611442704

In [13]:
topic_model.visualize_hierarchy()


**Stage 1: Embedding and Clustering**

*Do not remove stopwords.*

BERTopic creates topics in two phases. The first phase uses sentence embeddings to cluster documents. During this step, removing stopwords is harmful because:

 - Removing stopwords destroys sentence meaning.
 - Embeddings expect natural, full sentences.
 - Context is lost if you remove words like “and”, “of”, “to”, etc.
 - BERTopic performs worse when the input text is heavily cleaned.

So the text going into the embedding step should be nearly raw, except for light cleaning like lowercasing and trimming whitespace.

**Stage 2: Topic Representation (c-TF-IDF)**

*Stopwords can be removed here.*

After the clusters are created, BERTopic uses c-TF-IDF to extract the top representative words for each topic. This step benefits from removing stopwords because it makes the topic words more meaningful and less generic.

However, you do not remove stopwords manually. BERTopic allows you to specify stopwords directly (through the `vectorizer_model` argument), so it handles them only during the representation step.
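
To make the representation step more concrete, here is a simplified sketch of the c-TF-IDF idea as I understand it from the BERTopic documentation: all documents of a topic are treated as one big document, and each word's frequency in the topic is weighted by log(1 + A / f), where A is the average number of words per topic and f is the word's frequency across all topics. The function name, the toy topics, and the tokens are made up for illustration; this is not BERTopic's actual implementation.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_topic):
    """docs_per_topic maps a topic id to a list of tokenized documents."""
    # Term counts per topic: every document in a topic is merged together
    tf = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of each word across all topics
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per topic
    avg_words = sum(total.values()) / len(tf)

    scores = {}
    for topic, counts in tf.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words) * math.log(1 + avg_words / total[word])
            for word, count in counts.items()
        }
    return scores

toy_topics = {
    0: [["king", "david", "house"], ["king", "israel"]],
    1: [["gold", "silver", "king"]],
}
scores = c_tf_idf(toy_topics)
for topic, word_scores in scores.items():
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    print(topic, ranked)
```

In the toy example "king" appears in both topics, so in topic 1 it is ranked below "gold" and "silver", which are specific to that topic. Very common words such as stopwords behave the same way, but removing them outright gives cleaner topic labels.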

**FIX IMPORT**

Check topic 0 to make sure the stop words are removed now.


When I checked for the first time I expected to see no stop words, but instead I got archaic pronouns and verb forms, which are not meaningful on their own:


[('thou', 0.014937373550020365),
 ('hast', 0.014405130728016153),
 ('thee', 0.0130547974710153),
 ('precepts', 0.012812065086448337),
 ('thy', 0.012758845182956701),
 ('art', 0.01129255999103892),
 ('didst', 0.010644017168714534),
 ('thyself', 0.010014345661687142),
 ('shalt', 0.008669805195656523),
 ('thine', 0.008116947841383953)]


So I decided to modify the stop-word list to fit my dataset better.
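
One way to do this is to extend scikit-learn's built-in English stop-word list with the archaic forms that showed up in the topic above. This is a sketch rather than the exact list I ended up with: `archaic_words` is taken from the output above, and the two sample sentences are made up just to check the behaviour.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Archaic pronouns and verb forms seen in the topic words above
archaic_words = ["thou", "thee", "thy", "thine", "thyself",
                 "hast", "art", "didst", "shalt"]

# Combine the standard English list with the dataset-specific additions
custom_stop_words = list(ENGLISH_STOP_WORDS.union(archaic_words))

vectorizer_model = CountVectorizer(stop_words=custom_stop_words)

# Quick check on two illustrative sentences
sample = ["thou hast kept thy precepts",
          "the king came unto the house of jehovah"]
vectorizer_model.fit(sample)
vocabulary = sorted(vectorizer_model.vocabulary_)
print(vocabulary)
```

The resulting `vectorizer_model` can be passed to `BERTopic(vectorizer_model=...)` in the same way as in the cell above, so the extra stop words only affect the topic words and not the embeddings. BERTopic also documents an `update_topics` method for recomputing topic words without re-clustering; whether it fits depends on the installed version.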

(3) **TOPIC ANALYSIS**

topic_model.get_topics() returns BERTopic’s internal dictionary of topics: each key is a topic ID (e.g., -1, 0, 1, …) and each value is a list of tuples capturing the top n words for that topic and their c-TF-IDF scores. So when you iterate or inspect that dictionary, you’re looking at the set of learned topics, including the special “outlier” topic -1.


-1 -> for docs that didn't fit well into any learned topic.
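
The shape of that dictionary can be illustrated without re-running the model. The excerpt below is hand-written: the words and (rounded) scores are copied from the printed topics above, and the two counts come from the topic table and the document count earlier in this notebook.

```python
# Hand-written excerpt mimicking the structure of topic_model.get_topics()
topics_dict = {
    -1: [("shall", 0.0265), ("unto", 0.0251), ("thou", 0.0236)],
    0: [("jehovah", 0.0363), ("unto", 0.0315), ("said", 0.0269)],
    1: [("shall", 0.0390), ("ye", 0.0288), ("thy", 0.0266)],
}

for topic_id, words in topics_dict.items():
    label = "outliers" if topic_id == -1 else f"topic {topic_id}"
    top_words = ", ".join(word for word, _score in words)
    print(f"{label}: {top_words}")

# Share of verses that ended up in the outlier topic
outlier_count = 16531      # Count for topic -1 in the topic table
total_documents = 31086    # number of documents passed to BERTopic
outlier_share = outlier_count / total_documents
print(f"Outlier share: {outlier_share:.1%}")
```

More than half of the verses land in topic -1 in this run, which is worth keeping in mind when interpreting the remaining topics.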

**CONCLUSION**


So after getting my topics with BERTopic I got about 394 of them. For a person who doesn't understand the Bible that seems like too many, so I decided to take extra steps to narrow them down to more 'meaningful' topics that are easier for me to interpret.

(3) **BIBLE TOPIC INTERPRETATION**