# Semester 3 Coding Portfolio Topic 2 Summative:
# Natural Language Processing

In this notebook, you are asked to do original work with little guidance, based on the skills you learned in the formative part (as well as lectures and workshops).
This section is graded not just on passing automated tests, but also on quality, originality, and effort (see assessment criteria in the assignment description).

In [None]:
# TODO: Please enter your student number here
STUDENT_NUMBER = ...

# SUMMATIVE ASSESSMENT

For this summative assignment, we ask you to find a dataset from an internet source of your choice. You will then create an NLP pipeline including preprocessing, NLP analysis, and classification.
Your analysis for this notebook should have two parts: an initial NLP analysis (as done in formative notebook 1), and a classification of these results (as done in formative notebook 2). Choose one method for each of these two steps.

You should choose ONE of the following:
 - Sentiment Analysis
 - LDA
 - BERTopic

You should ALSO choose ONE of the following:
 - Decision Tree / Random Forest
 - LLM-based text classification


The general assessment criteria for all summative assignments are mentioned in the assignment description on Canvas. Each notebook also has a few specific criteria we look for; make sure you fulfil them in your approach to this assignment.
In general, make sure this notebook represents a complete project: Write an explanation of what you are hoping to achieve with your analysis, document your code well, and present results in a comprehensive way.
The assessment criteria for this notebook vary slightly depending on which methods you choose to implement:

## Sentiment Analysis
 - Selected an appropriate dataset and prepared it for analysis, including cleaning and formatting the data.
 - Effectively pre-processed the text data, including steps such as tokenization, stopword removal, lemmatization, and handling special characters.
 - Selected an appropriate sentiment analysis model or algorithm for their dataset and correctly implemented the sentiment analysis model, ensuring it is properly trained and tested.
 - Provided a clear and insightful interpretation of the sentiment analysis results, explaining the significance and implications of their findings.

## LDA 
 - Selected an appropriate dataset and prepared it for analysis, including cleaning and formatting the data.
 - Correctly created a document-term matrix or equivalent representation suitable for LDA.
 - Selected appropriate parameters for the LDA model, such as the number of topics and hyperparameters.
 - Correctly implemented the LDA model, ensuring it is properly trained on the dataset.
 - Provided a clear and insightful interpretation of the topics, explaining the significance and relevance of the discovered topics.

## BERTopic
 - Selected an appropriate dataset and prepared it for analysis, including cleaning and formatting the data.
 - Correctly generated text embeddings using a suitable model for input into BERTopic.
 - Correctly implemented the BERTopic model, ensuring it is properly trained on the dataset.
 - Accurately extracted and represented topics from the BERTopic model.
 - Provided a clear and insightful interpretation of the topics, explaining the significance and relevance of the discovered topics.

## Decision Tree / Random Forest
 - Formulated a relevant and appropriate classification objective for the NLP task.
 - Pre-processed the text data appropriately, including vectorization and other necessary steps.
 - Properly trained and tested the decision tree or random forest model.
 - Accurately printed or visualized the results, or provided an insightful interpretation of the findings.

## LLM-based text classification
 - Formulated a relevant and appropriate classification objective for the NLP task.
 - Correctly prepared the data for the LLM, ensuring it is suitable for model input.
 - Properly ran the LLM and tested the LLM output.
 - Accurately printed or visualized the results, or provided an insightful interpretation of the findings.

Pick a dataset of your choice. Please ensure your dataset is a CSV file under 100MB named `sem3_topic2_nlp_summative_data.csv`.

In [None]:
# Do NOT modify the contents of this cell. Start your customization in the next one!
import pandas as pd

custom_data_path = "sem3_topic2_nlp_summative_data.csv"
custom_df = pd.read_csv(custom_data_path)

<table>
<tr>
<td style="vertical-align: top; padding-right: 20px;">

<h2>BACKSTORY</h2>

<p>
Religion has always been an interesting topic for me. My parents never baptised me, choosing instead to let me decide my own beliefs when I was old enough. Still, my mother is strongly Orthodox, so growing up we celebrated Christmas on the night of the 6th to the 7th of January.
</p>

<p>
Now, living in Amsterdam, I have a Christian boyfriend. This year I will be joining his family for a Catholic Christmas celebration. They are very religious people (a typical Lebanese-Italian family), and as a respectful girlfriend I decided to use this summative project as an opportunity to impress my “mother-in-law” with my growing Bible knowledge.
</p>

<p>
So in this notebook I will be <b>(1) exploring different topics that emerge throughout Bible verses</b>, and <b>(2) training a model to classify each verse into its discovered topic</b>.
</p>

<p>
The image was generated with ChatGPT.
</p>

</td>

<td>
    <img src="cross-bible.png" width="750">
</td>
</tr>
</table>


**DEVELOPING A PIPELINE**


Below I revisit formative notebook 1 to work out the NLP pipeline.

## Preprocessing the text for NLP
Preprocessing can involve some combination of the following steps. Which steps to use depends on what you want to do.

1. *Remove unwanted or empty messages.* We start by cleaning the data, removing messages that are unlikely to contain any useful text.

2. *Text Cleaning.*
Next, we clean the text itself. We remove irrelevant items such as HTML tags, URLs, and codes when dealing with web data. We also remove special characters, numbers, or punctuation that are not needed for the analysis.

3. *Case Normalization.*
Next, we normalize the case by converting all the text to lower case. This ensures that words like 'House', 'house', and 'HOUSE' are all treated as the same word, preventing the model from treating them as different entities.

4. *Tokenization.*
Then we move to tokenization. This is where we break down the text into smaller pieces, or tokens. Tokens can be words, phrases, or even sentences. In English, this might seem as simple as splitting by spaces, but it can get complicated with languages that don’t use spaces or have complex morphology.

5. *Stop Words Removal.*
After tokenization, we often remove stop words. These are common words like 'is', 'and', 'the', which appear frequently in the text but usually don’t carry significant meaning for the analysis.

6. *Lemmatization.*
Finally, we refine our tokens using lemmatization. This reduces each word to its dictionary form (its lemma). For example, 'running', 'runs', and 'ran' might all be reduced to 'run'.
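
The six steps above can be sketched end to end in plain Python. This is a minimal illustration rather than the pipeline used later in this notebook: `STOP_WORDS` and `TOY_LEMMAS` are small hand-written stand-ins (assumptions) for NLTK's English stop-word list and `WordNetLemmatizer`, and `preprocess` is a hypothetical helper name.

```python
import re

# Tiny illustrative stop-word set (assumption); NLTK ships a much fuller English list.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "was", "it", "that", "there", "be"}

# Toy lemma lookup (assumption) standing in for a real lemmatizer such as WordNetLemmatizer.
TOY_LEMMAS = {"running": "run", "runs": "run", "ran": "run", "heavens": "heaven"}


def preprocess(text):
    """Clean, lowercase, tokenize, drop stop words, and lemmatize one document."""
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)    # step 2: strip HTML tags and URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # step 2: drop digits, punctuation, special characters
    tokens = text.lower().split()                        # steps 3-4: lowercase, then split on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]  # step 5: stop-word removal
    return [TOY_LEMMAS.get(t, t) for t in tokens]        # step 6: lemmatization via lookup


docs = ["In the beginning God created the heavens and the earth.", "", "   "]
docs = [d for d in docs if d.strip()]  # step 1: drop empty documents

tokens = preprocess(docs[0])
print(tokens)  # ['beginning', 'god', 'created', 'heaven', 'earth']
```

Whitespace splitting is enough for this English example; a proper tokenizer and a part-of-speech-aware lemmatizer would likely handle contractions and irregular forms better.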

In [2]:
#core
import json
import pandas as pd
import numpy as np

#text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#vectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#LDA topic modelling
from sklearn.decomposition import LatentDirichletAllocation

#BERTopic
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

#classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

#evaluation
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
)

#visuals
import matplotlib.pyplot as plt
import seaborn as sns

#NLTK 
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")


[nltk_data] Downloading package stopwords to /Users/mac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/mac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

(0) **DATA PREPROCESSING**


My original dataset is in JSON format, so I need to convert it to CSV to satisfy the assignment criteria.

In [3]:
import csv

# Load the nested JSON file (books -> chapters -> verses)
with open("ASV.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Flatten the nested structure into one row per verse
rows = []

for book in data["books"]:
    book_name = book["name"]
    for chapter in book["chapters"]:
        chapter_num = chapter["chapter"]
        for verse in chapter["verses"]:
            verse_num = verse["verse"]
            text = verse["text"]
            rows.append({
                "book": book_name,
                "chapter": chapter_num,
                "verse": verse_num,
                "text": text
            })

df = pd.DataFrame(rows)

# IMPORTANT: quote every field so commas and quotation marks inside a verse cannot break columns
df.to_csv("sem3_topic2_nlp_summative_data.csv", index=False, quoting=csv.QUOTE_ALL)
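
The effect of quoting every field can be checked on a small scale. The sketch below uses only the standard-library `csv` module and two made-up rows (assumptions, not taken from the dataset) to show that text containing commas and double quotes should survive a write/read round trip when all fields are quoted.

```python
import csv
import io

# Hypothetical sample rows; the second text deliberately contains commas and double quotes.
rows = [
    {"book": "Sample", "chapter": 1, "verse": 1, "text": "plain text without special characters"},
    {"book": "Sample", "chapter": 1, "verse": 2, "text": 'text with a comma, and a "quoted" word'},
]

# Write to an in-memory buffer with every field quoted (csv.QUOTE_ALL has the integer value 1).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["book", "chapter", "verse", "text"], quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)

# Read the same buffer back; every value comes back as a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(restored[1]["text"] == rows[1]["text"])  # True
```

The default minimal quoting would also quote fields that contain commas or quotes, so quoting everything is mainly a defensive choice.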


In [4]:
# Reload from the CSV to confirm the export round-trips, then inspect the schema
df = pd.read_csv("sem3_topic2_nlp_summative_data.csv")
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31102 entries, 0 to 31101
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   book     31102 non-null  object
 1   chapter  31102 non-null  int64 
 2   verse    31102 non-null  int64 
 3   text     31102 non-null  object
dtypes: int64(2), object(2)
memory usage: 972.1+ KB


In [5]:
df = pd.read_csv("sem3_topic2_nlp_summative_data.csv")
print(df.head())
print(df.iloc[0]["text"])


      book  chapter  verse                                               text
0  Genesis        1      1  In the beginning God created the heavens and t...
1  Genesis        1      2  And the earth was waste and void; and darkness...
2  Genesis        1      3  And God said, Let there be light: and there wa...
3  Genesis        1      4  And God saw the light, that it was good: and G...
4  Genesis        1      5  And God called the light Day, and the darkness...
In the beginning God created the heavens and the earth. 
