# Classic NLP - Labs

This exercise groups a collection of texts into topics that share similar themes or subjects. The grouping is derived from the frequency of the words used in the texts, by applying the Latent Dirichlet Allocation (LDA) algorithm.

#### Dataset
@dataset {consolidated_climate_dataset,  
    author = {Joseph Pollack},  
    title = {Climate Guard Toxic Agent - Dataset},  
    year = {2024},  
    publisher = {Hugging Face},  
    url = { https://huggingface.co/datasets/Tonic/Climate-Guard-Toxic-Agent }  
}

<br>

Documentation:
* **Latent Dirichlet Allocation** (topic modelling) in [Gensim library](https://radimrehurek.com/gensim/models/ldamodel.html)
* **LDA visualization** in [pyLDAvis library](https://github.com/bmabey/pyLDAvis)
* **Lemmatizer** in [NLTK library](https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet)
* **Stopwords** in [NLTK library](https://www.nltk.org/howto/corpus.html)

<br>
@Ricardo Almeida

In [None]:
#!pip install gensim nltk pandas pyarrow pyLDAvis scikit-learn wordcloud

In [None]:
from gensim import corpora, matutils, models
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import numpy as np
import pandas as pd
import pyLDAvis
from pyLDAvis.gensim_models import prepare
import re
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

In [None]:
_ = nltk.download('stopwords', quiet=True)
_ = nltk.download('wordnet', quiet=True)

In [None]:
text_column = 'text'

num_passes = 50
num_iterations = 100

word_min_doc_occurences = 0.05  # float: minimum proportion of documents; int: minimum absolute number of documents
num_topics = 16
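
The `word_min_doc_occurences` value is later passed to `CountVectorizer` as `min_df`, where a float is read as a proportion of documents and an integer as an absolute document count. The sketch below mimics that rule in plain Python on a made-up toy corpus. The helper names `document_frequencies` and `filter_vocabulary` are assumptions for illustration, and the code is a simplified stand-in rather than scikit-learn's implementation.

```python
from collections import Counter

def document_frequencies(tokenized_docs):
    """Count, for each token, the number of documents that contain it."""
    dfs = Counter()
    for tokens in tokenized_docs:
        dfs.update(set(tokens))  # set(): count each token once per document
    return dfs

def filter_vocabulary(tokenized_docs, min_df):
    """Keep tokens whose document frequency reaches min_df.

    A float min_df is treated as a proportion of documents,
    an int as an absolute number of documents.
    """
    n_docs = len(tokenized_docs)
    threshold = min_df * n_docs if isinstance(min_df, float) else min_df
    dfs = document_frequencies(tokenized_docs)
    return sorted(token for token, df in dfs.items() if df >= threshold)

# Made-up toy corpus, already tokenized
toy_docs = [
    ["ice", "melt", "sea"],
    ["sea", "level", "rise"],
    ["ice", "sea", "temperature"],
    ["carbon", "tax", "policy"],
]

vocab_fraction = filter_vocabulary(toy_docs, 0.5)  # at least 50% of 4 docs, i.e. 2 docs
vocab_absolute = filter_vocabulary(toy_docs, 3)    # at least 3 docs
print(vocab_fraction)
print(vocab_absolute)
```

A higher threshold shrinks the vocabulary to words shared by many documents, which speeds up training but may remove words that distinguish smaller topics.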

### Loading dataset

Text segments dataset

In [None]:
df = pd.read_parquet("data/climate_text_sample.parquet")

In [None]:
df

In [None]:
for t in df[text_column].head(3):
    print(f"{t} \n")

In [None]:
df = df[[text_column]]

### Pre-processing text

In [None]:
df = df.copy()

### Task #1

*Task*: Perform the text preprocessing

Consider applying these transformations (not necessarily in this order!):

- Remove digits: drop tokens that are purely numeric or start with a digit
- Remove punctuation
- Remove stopwords: drop words that do not add significant meaning. *Suggestion*: you may use the `stopwords.words('english')` list from the NLTK library
- Blacklist removal: drop specific words
- Tokenization: split the text into individual words (tokens). *Suggestion*: you may use `RegexpTokenizer(r'\w+')`, which already removes punctuation
- Lemmatization: reduce words to their base, dictionary form. *Suggestion*: you may use `WordNetLemmatizer()`
- Lowercasing: convert all text to lowercase
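
The order of these steps matters. As a minimal sketch using only the standard library, the example below shows that filtering stopwords before lowercasing lets capitalized stopwords slip through. The tiny stopword list, the sample sentence, and the helper `simple_preprocess` are assumptions made for illustration; they stand in for the NLTK tools, and lemmatization is left out.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer)
STOPWORDS = {"the", "is", "a", "of", "in", "and"}
BLACKLIST = {"climate", "change"}

def simple_preprocess(text, lowercase_first=True):
    """Tokenize, then drop stopwords, blacklisted words and tokens starting with a digit."""
    if lowercase_first:
        text = text.lower()
    tokens = re.findall(r"\w+", text)  # same pattern the task suggests for RegexpTokenizer
    kept = [
        t for t in tokens
        if t not in STOPWORDS and t not in BLACKLIST and not t[0].isdigit()
    ]
    # Lowercasing here makes the output uniform, but it comes too late for the filters above
    return [t.lower() for t in kept]

sample = "The climate of 2024 is changing: Sea levels rose 3mm in a year."
good_order = simple_preprocess(sample, lowercase_first=True)
bad_order = simple_preprocess(sample, lowercase_first=False)
print(good_order)
print(bad_order)
```

With lowercasing applied last, the capitalized "The" is not matched by the lowercase stopword list and ends up in the output.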


In [None]:
df[text_column] = df[text_column].str.lower()

In [None]:
word_blacklist = ['climate', 'change']

In [None]:
# Pre-process text: tokenize, drop stopwords and numeric tokens, lemmatize, drop blacklisted words
def process_text(text):

    # Tokenize the text (split text into words, remove punctuation)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)

    # Remove stopwords and tokens that start with a digit (this also covers
    # purely numeric tokens), then lemmatize what remains
    lemmatizer = WordNetLemmatizer()
    stop_words_set = set(stopwords.words('english'))
    processed_tokens = [
        lemmatizer.lemmatize(token)  # Reduce words to their base or dictionary form
        for token in tokens
        if token not in stop_words_set and not token[0].isdigit()
    ]
    # Remove blacklisted words (after lemmatization, so that plural forms are caught too)
    processed_tokens = [token for token in processed_tokens if token not in word_blacklist]

    return processed_tokens

In [None]:
# Apply text preprocessing

df['tokens'] = df[text_column].apply(process_text)

In [None]:
df

In [None]:
# Create a document-term matrix. CountVectorizer converts the documents into a matrix of token counts.
# The identity analyzer is used because the documents are already tokenized.
vectorizer = CountVectorizer(analyzer=lambda x: x, min_df=word_min_doc_occurences)
matrix = vectorizer.fit_transform(df['tokens'])

In [None]:
# Show the token counts of the first document as a dense row
print(matrix[0].toarray())

In [None]:
# Convert to a gensim-friendly iterable
corpus = matutils.Sparse2Corpus(matrix, documents_columns=False)

# Construct a document-term vocabulary (dictionary)
id2word = dict((v, k) for k, v in vectorizer.vocabulary_.items())

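# Create a gensim Dictionary directly from the tokens.
# Note: its token ids differ from the CountVectorizer ids, so it is not used with `corpus` below.
id2word_dict = corpora.Dictionary(df['tokens'])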
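
gensim models consume a corpus as an iterable of documents, each one a list of `(term_id, count)` pairs with zero counts omitted, which is what `Sparse2Corpus` yields from the sparse matrix. A minimal plain-Python sketch of that conversion follows. The helper `rows_to_bow`, the toy vocabulary, and the counts are illustrative assumptions, not part of gensim.

```python
def rows_to_bow(dense_rows):
    """Turn dense count rows (documents x terms) into lists of (term_id, count) pairs."""
    return [
        [(term_id, count) for term_id, count in enumerate(row) if count > 0]
        for row in dense_rows
    ]

# Toy vocabulary and counts: one row per document, one column per term
token_to_id = {"ice": 0, "sea": 1, "tax": 2}
counts = [
    [2, 1, 0],  # document 0: "ice" twice, "sea" once
    [0, 0, 3],  # document 1: "tax" three times
]

bow_corpus = rows_to_bow(counts)

# Invert the vocabulary, the same way id2word is built from vectorizer.vocabulary_
id_to_token = {idx: token for token, idx in token_to_id.items()}
readable = [[(id_to_token[i], c) for i, c in doc] for doc in bow_corpus]

print(bow_corpus)
print(readable)
```

The id-to-word mapping has to come from the same vocabulary that produced the counts; otherwise the ids in the corpus point at the wrong words.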

In [None]:
matrix[0]

In [None]:
lda = models.LdaModel(corpus=corpus, num_topics=num_topics, id2word=id2word, passes=num_passes, iterations=num_iterations)

In [None]:
def convert_dict_2_gensim_dict(id_to_word_dict):    
    # Invert the mapping (now keys are tokens and values are the token IDs)
    token_to_id = {v: k for k, v in id_to_word_dict.items()}
    
    # Create a gensim Dictionary from the token to ID mapping
    gensim_dict = corpora.dictionary.Dictionary()
    gensim_dict.token2id = token_to_id
    
    # Optional
    #gensim_dict.num_docs = num_documents  # Set the number of documents, you need to define num_documents
    #gensim_dict.dfs = {token_to_id[token]: 1 for token in token_to_id}  # Assuming each token appears in 1 document

    return gensim_dict

In [None]:
pyLDAvis.enable_notebook()

prepare(topic_model=lda, corpus=corpus, dictionary=convert_dict_2_gensim_dict(id2word))

In [None]:
# Extract topics and their word distributions
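# Pass num_topics explicitly so that all topics are returned, whatever the value of num_topics
topics = lda.print_topics(num_topics=num_topics, num_words=15)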

In [None]:
top_topics = lda.top_topics(corpus)

In [None]:
# Each entry is (topic_terms, coherence); topic_terms is a list of (probability, word) pairs
for topic_terms, _ in top_topics:
    print(f"{[word for _, word in topic_terms]} \n")
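
`top_topics` returns one entry per topic, ordered by coherence score, each entry being a pair of a `(probability, word)` list and the coherence value. The sketch below unpacks a hand-written stand-in with made-up numbers (not output from this dataset) to show the shape of that structure.

```python
# Hand-written stand-in for the structure returned by lda.top_topics(corpus):
# a list of (topic_terms, coherence) pairs, where topic_terms is a list of
# (probability, word) pairs. All numbers are made up for illustration.
example_top_topics = [
    ([(0.12, "ice"), (0.09, "sea"), (0.05, "arctic")], -1.3),
    ([(0.15, "tax"), (0.07, "policy"), (0.04, "carbon")], -2.1),
]

summaries = []
for topic_terms, coherence in example_top_topics:
    words = [word for _, word in topic_terms]  # keep the words, drop the probabilities
    summaries.append((words, coherence))
    print(f"coherence={coherence:.2f}  words={words}")
```

Printing the coherence next to the words makes it easier to see which topics the model itself considers less consistent.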

### Task #2

*Task*: Visualize topics with wordclouds.

Using `WordCloud` from [wordcloud library](https://github.com/amueller/word_cloud), create a word cloud for each topic, based on the already prepared word dicts.

Then show the wordcloud with `plt.imshow(your_wordcloud)`

In [None]:
def create_wordcloud(t_id, words_dict):
    # Create a WordCloud object
    wc = WordCloud(background_color='white')
    wc.generate_from_frequencies(words_dict)

    # Display the generated word cloud
    plt.figure(figsize=(10, 4))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'\n\nTopic {t_id+1}\n')
    plt.show()

In [None]:
#Visualize the topics using word clouds, one for each topic

for topic_id, topic in enumerate(topics):
    # Process each formatted topic string
    # Extracts word and weight (probability), converts weight to float and collects in dict
    words_dict = {word: float(weight) for weight, word in 
                  [tuple(word_prob.split('*')) for word_prob in topic[1].split(' + ')]}
    words_dict = {key.strip('"'): value for key, value in words_dict.items()}

    #print(words_dict)
    #print()

    create_wordcloud(topic_id, words_dict)
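
Each entry of `topics` is a `(topic_id, formatted_string)` pair, and the loop above recovers the weights by splitting that string. The sketch below does the same parsing with a regular expression on a hand-written string in that format (the values are made up). The helper name `parse_topic_string` is hypothetical. gensim's `show_topic` also returns `(word, probability)` pairs directly, which would avoid string parsing altogether.

```python
import re

# Matches one weight*"word" term, e.g. 0.030*"ice"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic_string(formatted_topic):
    """Parse a string such as '0.030*"ice" + 0.020*"sea"' into {word: weight}."""
    return {word: float(weight) for weight, word in TERM_PATTERN.findall(formatted_topic)}

# Hand-written example in the format produced by print_topics (values are made up)
example = '0.030*"ice" + 0.020*"sea" + 0.010*"arctic"'
weights = parse_topic_string(example)
print(weights)
```

The resulting dict has the word-to-weight shape that `generate_from_frequencies` expects.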

### Task #3

*Task*: Discuss the meaningfulness of the results

* Do the resulting topics make sense? Are they useful?
* Does the (bag-of-words) approach result in a good understanding/split of the statements?
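
One limitation worth weighing in this discussion: a bag-of-words representation keeps word counts but discards word order. As a small illustration with two made-up sentences (the helper `bag_of_words` is an assumption for this sketch), statements with opposite meanings can collapse to the same representation.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace and count the tokens (word order is discarded)."""
    return Counter(text.lower().split())

# Two made-up statements with opposite meanings but identical words
statement_a = "humans cause warming not cycles"
statement_b = "cycles cause warming not humans"

same_bag = bag_of_words(statement_a) == bag_of_words(statement_b)
print(same_bag)
```

A topic model trained on such counts can place both statements in the same topic, which is reasonable for theme detection but not for telling claims apart.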
