# Topic Exploration

This notebook is an interface to the LDA model. It uses several utility functions written specifically to explore what each topic contains.

Let's start by loading the model and these auxiliary functions:

In [2]:
import datetime  # Used to parse article dates; we should probably add the year to the Article class.
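
The year could be derived once by a small helper rather than inside each loop. The following is a minimal sketch, assuming dates are strings in the `YYYY/MM/DD` form parsed later in this notebook; `article_year` is a hypothetical helper and not part of the `Article` class:

```python
import datetime


def article_year(date_str, fmt="%Y/%m/%d"):
    """Return the year as an int, or the raw value if it cannot be parsed."""
    try:
        return datetime.datetime.strptime(date_str, fmt).year
    except (ValueError, TypeError):
        # Missing or differently formatted dates are passed through unchanged.
        return date_str
```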

In [3]:
import os
import sys 

# Jupyter Notebooks are not good at handling relative imports.
# Best solution (not great practice) is to add the project's path
# to the module loading paths of sys.

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [4]:
from utils.exploration import get_articles_in_topic, get_titles_in_topic
from utils.exploration import get_keywords_in_topic, summary
from utils.exploration import summarize_topic, topic_top_n

And now, the LDA model:

In [None]:
from gensim.models.ldamodel import LdaModel
lda = LdaModel.load("LDA_gensim_90_final.model")

## Getting the top words in a topic

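The function `topic_top_n(lda, topic_id, n=10, verbose=False)` takes the LDA model and a topic id and returns the top `n` words with their probabilities. The cell below uses it to write a Markdown report (`topics.md`) covering every topic that has at least five articles at `min_prob=0.5`, and then converts the report to PDF with pandoc: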

In [4]:
# -f keeps rm quiet when the report does not exist yet.
!rm -f topics.md

with open('topics.md', 'a') as fp:
    for t in range(90):  # topic ids 0..89
        # Number of articles assigned to topic t with probability >= 0.5
        num_articles = len(get_articles_in_topic(t, min_prob=0.5))

        # Skip topics with too few articles.
        if num_articles < 5:
            continue

        fp.write(f'\n# Topic {t}\n\n')
        fp.write(f'## Articles in topic: {num_articles}\n')

        fp.write('## Topic word probabilities:\n')
        fp.write('| Word | Probability |\n')
        fp.write('|---|---|\n')

        for word, prob in topic_top_n(lda, t, n=20):
            fp.write(f"| {word} | {prob} |\n")

        fp.write('\n## Top articles:\n')

        articles = get_articles_in_topic(t, min_prob=0.5, n=5)

        for article, _ in articles:
            try:
                year = datetime.datetime.strptime(article.date, "%Y/%m/%d").year
            except (ValueError, TypeError):
                # Date is missing or not in YYYY/MM/DD form: print it as-is.
                year = article.date

            fp.write(f"* {article.author} ({year}). {article.title}\n")

        fp.write('\n\\newpage')

!pandoc topics.md -o topics.pdf --pdf-engine=xelatex

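
For reference, gensim's `LdaModel.show_topic(topicid, topn)` already returns a list of `(word, probability)` pairs, so a helper like `topic_top_n` could be a thin wrapper around it. The sketch below is only an assumption about how such a helper might look, not the actual code in `utils.exploration`; `top_words` is a hypothetical name:

```python
def top_words(model, topic_id, n=10, verbose=False):
    """Return the n most probable (word, probability) pairs for a topic.

    Hypothetical stand-in for utils.exploration.topic_top_n. `model` is
    assumed to expose gensim's LdaModel.show_topic(topicid, topn).
    """
    pairs = model.show_topic(topic_id, topn=n)
    if verbose:
        # Print one aligned "word probability" line per pair.
        for word, prob in pairs:
            print(f"{word:<20} {prob:.4f}")
    return pairs
```

With the model loaded above, `top_words(lda, 17, n=5)` would then give the five most probable words of topic 17.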

## Getting all articles in a topic

You can get all `Article`s in a topic using `get_articles_in_topic(topic_id, min_prob=0.1, n=None)`. Pass `n` to cap the number of results.

In [None]:
top_5_articles = get_articles_in_topic(17, min_prob=0.5, n=5)
top_5_articles

Each result is an `(article, probability)` pair, so the probability is returned as well. To print only the titles:

In [None]:
for article, _ in top_5_articles:
    print(article.title)
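
To make the roles of `min_prob` and `n` concrete, here is a minimal sketch of that kind of filtering. It is an assumption for illustration, not the implementation in `utils.exploration`: `articles_in_topic` and `scored_articles` are hypothetical names, each article is assumed to come with a dict mapping topic ids to probabilities, and sorting by descending probability is also assumed.

```python
def articles_in_topic(scored_articles, topic_id, min_prob=0.1, n=None):
    """Return (article, probability) pairs for one topic.

    `scored_articles` is assumed to be an iterable of
    (article, {topic_id: probability}) pairs.
    """
    matches = []
    for article, probs in scored_articles:
        prob = probs.get(topic_id, 0.0)
        # Keep only articles that belong to the topic strongly enough.
        if prob >= min_prob:
            matches.append((article, prob))

    # Most representative articles first (an assumption of this sketch).
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches if n is None else matches[:n]
```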

## Getting titles and keywords

For this, we have the functions `get_titles_in_topic(topic_id, min_prob=0.1, n=None)` and `get_keywords_in_topic(topic_id, min_prob=0.1, n=None)`. Some examples:

In [None]:
get_titles_in_topic(17, n=5)

If an `Article` has no value for its `keyword` attribute, the default string `"NO KEYWORDS FOUND"` is placed in the list instead.

In [None]:
get_keywords_in_topic(17, n=5)
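
The fallback described above can be sketched as follows. This is an illustration under assumptions, not the code behind `get_keywords_in_topic`: `keywords_or_default` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `Article` instances.

```python
from types import SimpleNamespace

DEFAULT_KEYWORDS = "NO KEYWORDS FOUND"


def keywords_or_default(article, default=DEFAULT_KEYWORDS):
    """Return article.keyword, or the default when it is missing or empty."""
    keywords = getattr(article, "keyword", None)
    return keywords if keywords else default


# Stand-in objects for illustration; real code would use Article instances.
sample_articles = [
    SimpleNamespace(keyword="LDA, topics"),
    SimpleNamespace(keyword=None),
]
keyword_list = [keywords_or_default(a) for a in sample_articles]
```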

## Summarize an article

If you have an `Article` object, you can use `summary(lda, article, probability=None, topics=False)` to get a quick summary of it.

In [None]:
# Let's pick the last one from the list we've created:
article, prob = top_5_articles[-1]
summary(lda, article, probability=prob, topics=True)

## Summarizing an entire topic

For this, we have implemented `summarize_topic(lda, topic_id, min_prob=0.1, n=None)`.

In [None]:
for i in range(90):
    summarize_topic(lda, i, n=5)

## Using pyLDAvis

In [None]:
from utils.loaders import loadCorpusList

corpusPath = '../data/corpus'
corpusList = loadCorpusList(corpusPath)
corpusList = [a for a in corpusList if a.lang == "es"]

In [None]:
from utils.exploration import prepare_bag_of_words
corpus = [lda.id2word.doc2bow(prepare_bag_of_words(a)) for a in corpusList]
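
Each document is converted to a bag of words before it is handed to pyLDAvis. As a rough illustration of what `doc2bow` produces, the pure-Python sketch below maps tokens to `(token_id, count)` pairs and ignores tokens that are not in the vocabulary. `simple_doc2bow` and `token2id` are hypothetical names used only here, and the sketch assumes `prepare_bag_of_words` returns a list of tokens, which this notebook does not show. In practice, gensim's own method should be used.

```python
from collections import Counter


def simple_doc2bow(tokens, token2id):
    """Map tokens to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())
```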

In [None]:
# Note: pyLDAvis 3.0 and later renamed this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, lda.id2word, sort_topics=False)
pyLDAvis.display(lda_display)