# Step 1: Load the Reuters dataset

### You can download the Reuters dataset and the stop-word list from the NLTK library using the following code (uncomment the `nltk.download` lines on the first run):

In [2]:
import nltk
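#nltk.download('reuters')
#nltk.download('stopwords')
#nltk.download('punkt')  # tokenizer models used by word_tokenize (newer NLTK releases ask for 'punkt_tab')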

### Once you have downloaded the dataset, you can load the list of document IDs using the following code (the raw text of each document is read later with `reuters.raw`):


In [3]:
from nltk.corpus import reuters
documents = reuters.fileids()
#print(documents)
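
### The IDs returned by `reuters.fileids()` are strings such as `training/1` or `test/14826`, so the train/test split can be recovered from the prefix.
### The sketch below shows one way to do that; `split_fileids` and the sample IDs are illustrative names, not part of NLTK.

```python
def split_fileids(fileids):
    """Group Reuters file IDs by the split encoded in their prefix."""
    train = [f for f in fileids if f.startswith("training/")]
    test = [f for f in fileids if f.startswith("test/")]
    return train, test

# Hypothetical sample in the same "split/number" shape as reuters.fileids()
sample_ids = ["test/14826", "test/14828", "training/1", "training/10", "training/100"]
train_ids, test_ids = split_fileids(sample_ids)
print(len(train_ids), "training documents,", len(test_ids), "test documents")
```

### On the real corpus you would call `split_fileids(documents)` instead of passing the sample list.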

# Step 2: Preprocess the dataset

### You will need to preprocess the dataset by tokenizing the text, removing stop words, and stemming the words.
### Here is some example code to preprocess the dataset:

In [4]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [stemmer.stem(token) for token in tokens]
    return tokens

corpus = [preprocess(reuters.raw(document_id)) for document_id in documents]


# Step 2b: Find bigrams in the corpus

In [9]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(corpus)
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in stop_words)
finder.apply_freq_filter(5)
bigrams = finder.nbest(bigram_measures.pmi, 5000)

### Add bigrams to the corpus
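
### The notebook does not show how the bigrams are added, so the following is only a minimal sketch of one common approach: joining each adjacent token pair that appears in the bigram list into a single token.
### The helper name `merge_bigrams`, the `_` separator, and the toy data are assumptions made for illustration.

```python
def merge_bigrams(tokens, bigram_set, sep="_"):
    """Join adjacent token pairs found in bigram_set into single tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigram_set:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # skip the second half of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy data standing in for one preprocessed document and the PMI-ranked bigrams
toy_bigrams = {("hong", "kong"), ("wall", "street")}
toy_doc = ["trade", "hong", "kong", "wall", "street", "said"]
print(merge_bigrams(toy_doc, toy_bigrams))
```

### Applied to the notebook's variables, this could look like `corpus = [merge_bigrams(doc, set(bigrams)) for doc in corpus]`, run before the dictionary is built.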

# Step 3: Create a dictionary and bag of words

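### You will need to create a dictionary and a bag of words from the preprocessed corpus. The dictionary maps each unique token to an integer ID, and the bag-of-words corpus stores each document as a list of `(token_id, count)` pairs.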

In [4]:
from gensim.corpora.dictionary import Dictionary

dictionary = Dictionary(corpus)
bow_corpus = [dictionary.doc2bow(doc) for doc in corpus]
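
### To make the bag-of-words format concrete, here is a small pure-Python sketch that produces the same kind of `(token_id, count)` pairs.
### It is not Gensim's implementation, and the IDs Gensim assigns may differ; `build_vocab` and `to_bow` are illustrative names.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer ID to each unique token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of doc found in vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_docs = [["oil", "price", "oil"], ["price", "bank"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_bow)
```

### Word order is discarded: only how often each token occurs in a document is kept, which is all LDA needs.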


# Step 4: Train the LDA model

### You can use the Gensim library to train the LDA model on the corpus. 

In [5]:
from gensim.models.ldamodel import LdaModel

num_topics = 10
lda_model = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=10)


### We can print out the top words in each topic to get an idea of what each topic is about:



In [38]:
# Print topics and top words in each topic
for topic in lda_model.show_topics(num_topics=num_topics):
    print(topic)


(0, '0.048*"said" + 0.030*"lt" + 0.026*"compani" + 0.024*"share" + 0.021*"dlrs" + 0.015*"mln" + 0.013*"inc" + 0.012*"corp" + 0.012*"pct" + 0.011*"offer"')
(1, '0.034*"said" + 0.016*"gold" + 0.014*"produc" + 0.014*"mine" + 0.014*"export" + 0.013*"coffe" + 0.012*"price" + 0.011*"brazil" + 0.011*"sugar" + 0.011*"china"')
(2, '0.137*"vs" + 0.115*"mln" + 0.068*"net" + 0.062*"cts" + 0.057*"loss" + 0.046*"dlrs" + 0.040*"shr" + 0.030*"profit" + 0.025*"qtr" + 0.024*"lt"')
(3, '0.026*"ec" + 0.023*"franc" + 0.021*"said" + 0.016*"european" + 0.013*"french" + 0.011*"credit" + 0.010*"sugar" + 0.010*"communiti" + 0.009*"would" + 0.009*"commiss"')
(4, '0.085*"billion" + 0.070*"bank" + 0.052*"mln" + 0.039*"dlrs" + 0.028*"pct" + 0.027*"stg" + 0.019*"reserv" + 0.019*"loan" + 0.016*"money" + 0.016*"said"')
(5, '0.064*"cts" + 0.037*"lt" + 0.037*"april" + 0.036*"dividend" + 0.033*"record" + 0.022*"pay" + 0.022*"div" + 0.021*"quarter" + 0.021*"prior" + 0.019*"vs"')
(6, '0.036*"said" + 0.016*"trade" + 0.016*"

### To visualize the topics, we can use the pyLDAvis library:

### This will display an interactive visualization of the topics, where each bubble represents a topic and the size of the bubble corresponds to the prevalence of the topic in the corpus. The closer two bubbles are to each other, the more similar their topics are.
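
### The visualization cell itself is missing from this notebook, so the following is a hedged sketch of how it is typically produced, not the original code.
### It assumes a recent pyLDAvis release, where the Gensim adapter lives in `pyLDAvis.gensim_models` (older releases exposed it as `pyLDAvis.gensim`); the wrapper name `build_topic_vis` and the `html_path` parameter are hypothetical.

```python
def build_topic_vis(lda_model, bow_corpus, dictionary, html_path=None):
    """Prepare a pyLDAvis view of a trained Gensim LDA model."""
    # Imported lazily so the rest of the notebook runs without pyLDAvis installed
    import pyLDAvis
    import pyLDAvis.gensim_models as gensimvis

    vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
    if html_path is not None:
        pyLDAvis.save_html(vis, html_path)  # write a standalone HTML page
    return vis
```

### Inside a notebook, calling `pyLDAvis.enable_notebook()` once and then `pyLDAvis.display(build_topic_vis(lda_model, bow_corpus, dictionary))` should render the interactive view inline.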

### We can also visualize the distribution of topics in the corpus using a histogram:
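
### The histogram cell is also missing, so here is a minimal sketch under the assumption that each document has already been assigned a single dominant topic (as is done later in this notebook).
### It counts documents per topic and prints a text histogram; `topic_counts`, `text_histogram`, and the toy assignments are illustrative.

```python
from collections import Counter

def topic_counts(document_topics, num_topics):
    """Count how many documents were assigned to each topic ID."""
    counts = Counter(document_topics)
    return [counts.get(t, 0) for t in range(num_topics)]

def text_histogram(counts, width=40):
    """Render per-topic counts as rows of '#' scaled to the largest count."""
    peak = max(counts) if counts and max(counts) > 0 else 1
    lines = []
    for topic_id, n in enumerate(counts):
        bar = "#" * round(width * n / peak)
        lines.append(f"topic {topic_id:2d} | {bar} {n}")
    return "\n".join(lines)

toy_assignments = [0, 2, 2, 1, 2, 0]  # dominant topic per document
toy_counts = topic_counts(toy_assignments, num_topics=4)
print(text_histogram(toy_counts))
```

### With matplotlib available, the same counts can be drawn with `plt.bar(range(num_topics), counts)`.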


# Step 5: Calculate evaluation metrics 

### To calculate the evaluation metrics, you will need to assign a topic to each document.
### Here is some example code that assigns each document its single most probable topic:

In [15]:
document_topics = []
for document_bow in bow_corpus:
    document_topic = max(lda_model[document_bow], key=lambda x: x[1])[0]
    document_topics.append(document_topic)


### Once you have assigned topics to each document, you can calculate the evaluation metrics.
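
### The notebook does not say which metrics are meant, so the sketch below shows one possible choice, cluster purity, as an assumption rather than the intended method.
### Purity is the fraction of documents whose gold label matches the majority label of their assigned topic; the function name and toy labels are illustrative.

```python
from collections import Counter, defaultdict

def purity(assigned_topics, gold_labels):
    """Fraction of documents that carry the majority gold label of their topic."""
    if len(assigned_topics) != len(gold_labels):
        raise ValueError("assigned_topics and gold_labels must have equal length")
    if not assigned_topics:
        return 0.0
    labels_by_topic = defaultdict(Counter)
    for topic, label in zip(assigned_topics, gold_labels):
        labels_by_topic[topic][label] += 1
    # For each topic, count the documents that carry its most frequent label
    majority_total = sum(c.most_common(1)[0][1] for c in labels_by_topic.values())
    return majority_total / len(assigned_topics)

toy_topics = [0, 0, 0, 1, 1, 2]
toy_labels = ["earn", "earn", "acq", "grain", "grain", "acq"]
print(round(purity(toy_topics, toy_labels), 3))
```

### Reuters documents can carry several categories, so a gold label per document has to be chosen somehow; taking the first entry of `reuters.categories(document_id)` is one simplifying assumption.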

### Note that the PMI (pointwise mutual information) metric requires counting how often word pairs occur together in the text, which is what the `BigramCollocationFinder` code above does.
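
### For reference, PMI compares how often two words occur together with how often they would if they were independent: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ).
### The sketch below computes it for adjacent token pairs, estimating every probability against the total token count; this is a simplified illustration, and NLTK's own scoring may differ in detail.

```python
import math
from collections import Counter

def bigram_pmi(docs):
    """Return {(w1, w2): PMI in bits} for adjacent token pairs in docs."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))  # adjacent pairs within a document
    total = sum(unigrams.values())
    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        # log2( (pair/total) / ((c1/total) * (c2/total)) ), simplified
        scores[(w1, w2)] = math.log2(pair_count * total / (unigrams[w1] * unigrams[w2]))
    return scores

toy_docs = [["new", "york", "stock"], ["new", "york", "bank"], ["stock", "bank"]]
pmi_scores = bigram_pmi(toy_docs)
for pair, score in sorted(pmi_scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

### Pairs that always appear together, such as `("new", "york")` in the toy data, score highest, which is why a frequency filter is usually applied first to keep rare pairs from dominating.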
### That's it! You have now trained an LDA model on the Reuters dataset and calculated evaluation metrics for the topic assignments.