# Semester 3 Coding Portfolio Topic 2 Summative:
# Natural Language Processing

In this notebook, you are asked to do original work with little guidance, based on the skills you learned in the formative part (as well as lectures and workshops).
This section is graded not just on passing automated tests, but also on quality, originality, and effort (see assessment criteria in the assignment description).

In [1]:
# TODO: Please enter your student number here
STUDENT_NUMBER = ...

# SUMMATIVE ASSESSMENT

For this summative assignment, we ask you to find a dataset from an internet source of your choice. You will then create an NLP pipeline including preprocessing, NLP analysis, and classification.
Your analysis for this notebook should have two parts: an initial NLP analysis (as done in formative notebook 1), and a classification of these results (as done in formative notebook 2). Choose one method for each of these two steps.

You should choose ONE of the following:
 - Sentiment Analysis
 - LDA
 - BERTopic

You should ALSO choose ONE of the following:
 - Decision Tree / Random Forest
 - LLM-based text classification


The general assessment criteria for all summative assignments are mentioned in the assignment description on Canvas. Each notebook also has a few specific criteria we look for; make sure you fulfil them in your approach to this assignment.
In general, make sure this notebook represents a complete project: Write an explanation of what you are hoping to achieve with your analysis, document your code well, and present results in a comprehensive way.
The assessment criteria for this notebook vary slightly depending on which methods you choose to implement:

## Sentiment Analysis
 - Selected an appropriate dataset and prepared it for analysis, including cleaning and formatting the data.
 - Effectively pre-processed the text data, including steps such as tokenization, stopword removal, lemmatization, and handling special characters.
 - Selected an appropriate sentiment analysis model or algorithm for their dataset and correctly implemented the sentiment analysis model, ensuring it is properly trained and tested.
 - Provided a clear and insightful interpretation of the sentiment analysis results, explaining the significance and implications of their findings.

## LDA 
 - Selected an appropriate dataset and prepared it for analysis, including cleaning and formatting the data.
 - Correctly created a document-term matrix or equivalent representation suitable for LDA.
 - Selected appropriate parameters for the LDA model, such as the number of topics and hyperparameters.
 - Correctly implemented the LDA model, ensuring it is properly trained on the dataset.
 - Provided a clear and insightful interpretation of the topics, explaining the significance and relevance of the discovered topics.

## BERTopic
 - Selected an appropriate dataset and prepared it for analysis, including cleaning and formatting the data.
 - Correctly generated text embeddings using a suitable model for input into BERTopic.
 - Correctly implemented the BERTopic model, ensuring it is properly trained on the dataset.
 - Accurately extracted and represented topics from the BERTopic model.
 - Provided a clear and insightful interpretation of the topics, explaining the significance and relevance of the discovered topics.

## Decision Tree / Random Forest
 - Formulated a relevant and appropriate classification objective for the NLP task.
 - Pre-processed the text data appropriately, including vectorization and other necessary steps.
 - Properly trained and tested the decision tree or random forest model.
 - Accurately printed or visualized the results, or provided insightful interpretation of the findings.
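
The steps in this list can be sketched on a made-up toy corpus. This is only an illustration under stated assumptions: the texts, the labels, the choice of TF-IDF features, and the parameter values (`test_size`, `n_estimators`, `random_state`) are all hypothetical and are not prescribed by the assignment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; a real project would use its own texts and labels.
texts = [
    "the king went to war against the city",
    "the army camped by the river before battle",
    "the soldiers took the city and its king",
    "the captain led the army to battle",
    "praise the lord with songs of joy",
    "sing a new song and give thanks",
    "my soul sings praise and thanks",
    "give thanks with joy and song",
]
labels = ["war", "war", "war", "war", "praise", "praise", "praise", "praise"]

# Split first, so the vectorizer never sees the test texts during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorization: turn raw text into TF-IDF features.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train the random forest and evaluate it on the held-out texts.
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train_vec, y_train)
preds = clf.predict(X_test_vec)
accuracy = accuracy_score(y_test, preds)
print(f"Test accuracy on toy data: {accuracy:.2f}")
```

With only eight sentences the accuracy figure carries no meaning; the point is the order of operations: split, fit the vectorizer on the training texts only, train, then evaluate.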

## LLM-based text classification
 - Formulated a relevant and appropriate classification objective for the NLP task.
 - Correctly prepared the data for the LLM, ensuring it is suitable for model input.
 - Properly ran the LLM and tested the LLM output.
 - Accurately printed or visualized the results, or provided insightful interpretation of the findings.

Pick a dataset of your choice. Please ensure your dataset is a CSV file under 100MB named sem3_topic2_nlp_summative_data.csv.
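
The file requirements above can be checked before submission with a small helper. This is a minimal sketch: the function name `check_dataset_file` is hypothetical, and it assumes that "100MB" means 100 * 1024 * 1024 bytes, which the assignment does not specify.

```python
import os
import tempfile

EXPECTED_NAME = "sem3_topic2_nlp_summative_data.csv"
MAX_BYTES = 100 * 1024 * 1024  # assumption: 100MB interpreted as 100 * 1024**2 bytes


def check_dataset_file(path):
    """Return a list of problems with the file at `path` (empty if none)."""
    problems = []
    if os.path.basename(path) != EXPECTED_NAME:
        problems.append(f"file must be named {EXPECTED_NAME}")
    if not os.path.isfile(path):
        problems.append("file does not exist")
    elif os.path.getsize(path) >= MAX_BYTES:
        problems.append("file is not under 100MB")
    return problems


# Demo on a throwaway file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, EXPECTED_NAME)
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("book,chapter,verse,text\n")
    print(check_dataset_file(demo_path))
```

In the notebook itself, the call would simply be `check_dataset_file("sem3_topic2_nlp_summative_data.csv")`, and an empty list means both requirements are met.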

In [2]:
# Do NOT modify the contents of this cell. Start your customization in the next one!
import pandas as pd

custom_data_path = "sem3_topic2_nlp_summative_data.csv"
custom_df = pd.read_csv(custom_data_path)

<table>
<tr>
<td style="vertical-align: top; padding-right: 20px;">

<h2>BACKSTORY</h2>

<p>
Religion has always been an interesting topic for me. My parents never baptised me, choosing instead to let me decide my own beliefs when I was old enough. Still, my mother is strongly Orthodox, so growing up we celebrated Christmas on the night of the 6th to the 7th of January.
</p>

<p>
Now, living in Amsterdam, I have a Christian boyfriend. This year I will be joining his family for a Catholic Christmas celebration. They are very religious people, and as a respectful girlfriend I decided to use this summative project as an opportunity to impress my “mother-in-law” with my growing Bible knowledge.
</p>

<p>
So in this notebook I will be <b>(1) exploring different topics that emerge throughout Bible verses</b>, and <b>(2) training a model to classify each verse into its discovered topic.</b>
</p>

<p>
The image was generated with ChatGPT.
</p>

</td>

<td>
    <img src="cross-bible.png" width="850">
</td>
</tr>
</table>


**RQ: Can topic modelling reveal meaningful themes in Bible verses, and can we automatically classify each verse into its discovered topic using a machine-learning model?**

In [8]:
import json
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [9]:
# Load the JSON data (American Standard Version) and convert it into a DataFrame
with open("ASV.json", "r", encoding="utf-8") as f:
    data = json.load(f)

rows = []
for book in data["books"]:
    for chapter in book["chapters"]:
        for verse in chapter["verses"]:
            rows.append({
                "book": book["name"],
                "chapter": chapter["chapter"],
                "verse": verse["verse"],
                "text": verse["text"]
            })

df = pd.DataFrame(rows)
df.to_csv("sem3_topic2_nlp_summative_data.csv", index=False, quoting=1)  # save to CSV; quoting=1 is csv.QUOTE_ALL

print(f"Total verses: {len(df)}")
print(f"Total books: {df['book'].nunique()}")
df.head()

Total verses: 31102
Total books: 66


| | book | chapter | verse | text |
|---|---|---|---|---|
| 0 | Genesis | 1 | 1 | In the beginning God created the heavens and t... |
| 1 | Genesis | 1 | 2 | And the earth was waste and void; and darkness... |
| 2 | Genesis | 1 | 3 | And God said, Let there be light: and there wa... |
| 3 | Genesis | 1 | 4 | And God saw the light, that it was good: and G... |
| 4 | Genesis | 1 | 5 | And God called the light Day, and the darkness... |


In [None]:
#mapping each book to its genre
book_to_genre = {
    # Law/Pentateuch - first 5 books
    "Genesis": "Law", "Exodus": "Law", "Leviticus": "Law", 
    "Numbers": "Law", "Deuteronomy": "Law",
    
    # History/Old Testament
    "Joshua": "History", "Judges": "History", "Ruth": "History",
    "1 Samuel": "History", "2 Samuel": "History", 
    "1 Kings": "History", "2 Kings": "History",
    "1 Chronicles": "History", "2 Chronicles": "History",
    "Ezra": "History", "Nehemiah": "History", "Esther": "History",
    
    # Poetry & Wisdom
    "Job": "Wisdom", "Psalms": "Poetry", "Proverbs": "Wisdom",
    "Ecclesiastes": "Wisdom", "Song of Solomon": "Poetry",
    
    # Prophecy/Major Prophets
    "Isaiah": "Prophecy", "Jeremiah": "Prophecy", 
    "Lamentations": "Poetry", "Ezekiel": "Prophecy", "Daniel": "Prophecy",
    
    # Prophecy/Minor Prophets
    "Hosea": "Prophecy", "Joel": "Prophecy", "Amos": "Prophecy",
    "Obadiah": "Prophecy", "Jonah": "Prophecy", "Micah": "Prophecy",
    "Nahum": "Prophecy", "Habakkuk": "Prophecy", "Zephaniah": "Prophecy",
    "Haggai": "Prophecy", "Zechariah": "Prophecy", "Malachi": "Prophecy",
    
    # Gospels
    "Matthew": "Gospel", "Mark": "Gospel", "Luke": "Gospel", "John": "Gospel",
    
    # History/New Testament
    "Acts": "History",
    
    # Epistles/Letters
    "Romans": "Epistle", "1 Corinthians": "Epistle", "2 Corinthians": "Epistle",
    "Galatians": "Epistle", "Ephesians": "Epistle", "Philippians": "Epistle",
    "Colossians": "Epistle", "1 Thessalonians": "Epistle", "2 Thessalonians": "Epistle",
    "1 Timothy": "Epistle", "2 Timothy": "Epistle", "Titus": "Epistle", "Philemon": "Epistle",
    "Hebrews": "Epistle", "James": "Epistle", "1 Peter": "Epistle", "2 Peter": "Epistle",
    "1 John": "Epistle", "2 John": "Epistle", "3 John": "Epistle", "Jude": "Epistle",
    
    # Apocalyptic
    "Revelation": "Apocalyptic"
}

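# Add the new genre column to df
df["genre"] = df["book"].map(book_to_genre)

# Check for unmapped books: names missing from book_to_genre map to NaN
unmapped_books = df.loc[df["genre"].isna(), "book"].unique()
print(f"Unmapped books: {list(unmapped_books)}")

print("Genre distribution:")
print(df["genre"].value_counts())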