# Exploration utilities

In this notebook, we implement some functions that will be useful when exploring the topics found by our LDA model (k=90).

These auxiliary functions include:
- Top words per topic (for k=0, ..., 89)
- Topics in an article (from the article ID)
- Articles per topic (option for abstracts) (for k=0, ..., 89)
- Word ranking inside topic.
- Titles of articles in a topic.
- All keywords in a topic.
- Summary of a topic (in terms of its articles)
- A function that opens the article itself, be it PDF or HTML.
- Printing everything as printable tables (in Markdown).
- Visualizations.

Topics < Categories.
Number of articles per topic per year (assuming temporal _stability_).

Move this to the utils folder.
- Save a pre-correction version and show the correction for visual inspection.

Some articles belong to ALL topics.

## Loading up the corpus

In [1]:
import json
import re
import os
import sys 

# Jupyter Notebooks are not good at handling relative imports.
# Best solution (not great practice) is to add the project's path
# to the module loading paths of sys.

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils.loaders import loadCorpusList, saveCorpus

corpusPath = '../data/corpus'

corpusList = loadCorpusList(corpusPath)
corpusList = [a for a in corpusList if a.lang == "es"]

## Loading up the model

In [2]:
from gensim.models.ldamodel import LdaModel

In [3]:
lda = LdaModel.load("LDA_gensim_90_final.model")

## A brief summary of the topics

The `lda` object itself comes with a fairly useful `print_topics` method. Before looking at the topics, we rebuild the dictionary and the bag-of-words corpus from the articles:

In [4]:
def prepare_bag_of_words(article):
    """
    A hot fix on some empty strings.
    """
    bow = article.bagOfWords
    bow = bow.split(" ")
    return [w for w in bow if len(w) > 1]

In [5]:
from gensim import corpora

In [6]:
dictionary = corpora.Dictionary([
    prepare_bag_of_words(a) for a in corpusList
])
corpus = [dictionary.doc2bow(prepare_bag_of_words(a)) for a in corpusList]

Run this if you want to get the pyLDAvis:

```python
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, lda.id2word, sort_topics=False)
pyLDAvis.display(lda_display)
```

## Top n words of a given topic

In [7]:
def topic_top_n(topic_id, n=10, verbose=True):
    '''
    This function prints the topic using the
    LDA's method, and returns a list of the n
    top words in the topic (probability-wise).
    '''
    if verbose:
        print(lda.print_topic(topic_id, topn=n))

    return [
        (lda.id2word.get(idx), f"{prob:0.3}") for idx, prob in lda.get_topic_terms(topic_id, topn=n)
    ]

## Getting a table of words per topic

In [8]:
for topic_id in range(90):
#     topic_top_n(topic_id)
    print(topic_id, topic_top_n(topic_id, verbose=False, n=5))

0 [('metro', '0.0538'), ('patrón', '0.0373'), ('tien', '0.02'), ('longitud', '0.013'), ('medir', '0.0125')]
1 [('conjuntar', '0.026'), ('difuso', '0.026'), ('perdón', '0.0202'), ('perdonar', '0.00916'), ('preguntar', '0.00789')]
2 [('teoría', '0.0153'), ('natural', '0.0109'), ('barry', '0.0109'), ('einstein', '0.00922'), ('multiculturalismo', '0.00851')]
3 [('lógico', '0.0149'), ('pensamiento', '0.00891'), ('movimiento', '0.00879'), ('entr', '0.00547'), ('teoría', '0.00453')]
4 [('objeto', '0.0157'), ('referenciar', '0.014'), ('memoria', '0.0135'), ('acto', '0.013'), ('altruista', '0.0119')]
5 [('wittgenstein', '0.038'), ('lenguaje', '0.0141'), ('lógico', '0.0075'), ('proposición', '0.00697'), ('tractatus', '0.00676')]
6 [('ser', '0.00871'), ('mecánico', '0.00769'), ('teoría', '0.00696'), ('físico', '0.00657'), ('orden', '0.00649')]
7 [('juicio', '0.017'), ('justificar', '0.00986'), ('moro', '0.00934'), ('razonar', '0.0083'), ('intención', '0.00829')]
8 [('sócrates', '0.027'), ('platón

78 [('spinoza', '0.0188'), ('husserl', '0.0188'), ('fenomenología', '0.0126'), ('conocimiento', '0.00912'), ('weil', '0.00824')]
79 [('creencia', '0.226'), ('contener', '0.0333'), ('llover', '0.0322'), ('emisión', '0.0266'), ('creer', '0.0247')]
80 [('mcdowell', '0.0416'), ('mundo', '0.0363'), ('conceptual', '0.0243'), ('hegel', '0.0121'), ('objeto', '0.0117')]
81 [('filosofía', '0.0108'), ('ciencia', '0.00942'), ('ser', '0.00853'), ('historia', '0.00721'), ('mundo', '0.00587')]
82 [('ser', '0.0103'), ('religioso', '0.0099'), ('religión', '0.00818'), ('bien', '0.00701'), ('vida', '0.0066')]
83 [('ser', '0.0153'), ('forma', '0.0132'), ('aristóteles', '0.0122'), ('artefacto', '0.0109'), ('formar', '0.0103')]
84 [('lenguaje', '0.0188'), ('teoría', '0.0105'), ('significar', '0.00901'), ('quine', '0.00858'), ('ser', '0.00807')]
85 [('object', '0.0139'), ('act', '0.012'), ('locucionario', '0.00849'), ('wha', '0.00836'), ('fro', '0.00831')]
86 [('virtud', '0.0109'), ('montesquieu', '0.0108'),

In [9]:
articles = {
    art.id: art for art in corpusList
}

## Topics in an article by article_id

In [10]:
import numpy as np

In [11]:
def topics_in_article(article, abstract=False):
    bow = lda.id2word.doc2bow(prepare_bag_of_words(article))
    return lda.get_document_topics(bow)

In [12]:
def summary(article, topics=False):
    # Print the title
    print("-"*50)
    print("\t\t TITLE \t\t")
    print(article.title)
    # Print keywords
    print("\t\t KEYWORDS \t\t")
    if hasattr(article, "keywords"):
        print(article.keywords)
    else:
        print("No keywords stored")
    print()

    # Print the abstract
    print("\t\t ABSTRACT \t\t")
    if hasattr(article, "abstract"):
        print(article.abstract)
    else:
        print("No abstract stored")
    print()

    # Print the topics
    if topics:
        print("\t\t TOPICS \t\t")
        topics_in_art = topics_in_article(article)
        for top_id, prob in topics_in_art:
            print(f"Topic {top_id} (w. probability {prob:0.3f})")

In [13]:
for a in corpusList[:20]:
    summary(a, topics=True)

--------------------------------------------------
		 TITLE 		
Ortega y el conocimiento absoluto
		 KEYWORDS 		
Mundo; conocimiento racional; razón; validez universal; conocimiento absoluto; historicismo moderno; razón pura; lógica; fundamentos

		 ABSTRACT 		
No abstract stored

		 TOPICS 		
Topic 14 (w. probability 0.282)
Topic 42 (w. probability 0.486)
Topic 68 (w. probability 0.155)
Topic 81 (w. probability 0.046)
Topic 89 (w. probability 0.023)
--------------------------------------------------
		 TITLE 		
Los pecados del ateísmo
		 KEYWORDS 		
No keywords stored

		 ABSTRACT 		
No abstract stored

		 TOPICS 		
Topic 22 (w. probability 0.011)
Topic 38 (w. probability 0.019)
Topic 45 (w. probability 0.020)
Topic 65 (w. probability 0.383)
Topic 79 (w. probability 0.013)
Topic 81 (w. probability 0.116)
Topic 82 (w. probability 0.164)
Topic 89 (w. probability 0.259)
--------------------------------------------------
		 TITLE 		
El problema de la objetividad en los juicios: el contrast

Topic 11 (w. probability 0.014)
Topic 18 (w. probability 0.023)
Topic 19 (w. probability 0.062)
Topic 24 (w. probability 0.153)
Topic 36 (w. probability 0.095)
Topic 51 (w. probability 0.034)
Topic 81 (w. probability 0.577)
Topic 88 (w. probability 0.023)


In [14]:
all_in = [a for a in corpusList if len(topics_in_article(a)) == 90]    

In [15]:
print(len(corpusList))

1330


In [16]:
print(len(all_in))

254


Hm! The same 254 articles that ended up with no topics when k > 100 are still being assigned to ALL topics.
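
One plausible explanation, which is an assumption and not something verified in this notebook, is that these articles have an empty (or entirely out-of-vocabulary) bag of words, so the model falls back to a near-uniform distribution of 1/90 ≈ 0.0111 per topic. A minimal sketch for flagging such documents follows; `is_uniform` and the toy data are hypothetical, and in the notebook the inputs would come from `topics_in_article`:

```python
import math

def is_uniform(topic_probs, n_topics=90, tol=1e-4):
    """Return True if every topic is present with probability ~1/n_topics."""
    if len(topic_probs) != n_topics:
        return False
    expected = 1.0 / n_topics
    return all(math.isclose(p, expected, abs_tol=tol) for _, p in topic_probs)

# Toy stand-ins for the output of topics_in_article(article).
uniform_doc = [(k, 1.0 / 90) for k in range(90)]
peaked_doc = [(14, 0.28), (42, 0.49), (68, 0.16)]

print(is_uniform(uniform_doc))  # True
print(is_uniform(peaked_doc))   # False
```

If the assumption holds, something like `[a for a in all_in if len(prepare_bag_of_words(a)) == 0]` should recover most of the 254 articles.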


## Number of articles per topic

I should create a function `articles(topic, min_prob=0.5)` that returns all the articles of a topic. This can be done by first precomputing and saving a json file `articles_in_topic = {topic_id: [(article_id, prob_of_said_article)]}`.

Let's compute this json file:

In [17]:
# articles_in_topic = {
#     topic_id: [get_]
# }

all_topics_in_article = {
    a.id: topics_in_article(a) for a in corpusList
}

In [18]:
all_topics_in_article["21713"]

[(14, 0.28170452),
 (42, 0.4860075),
 (68, 0.15552716),
 (81, 0.046357162),
 (89, 0.022527479)]

In [19]:
def get_articles_in_topics(min_prob=0.01):
    articles_in_topics = {topic_id: [] for topic_id in range(90)}
    for art_id, topics_and_probs in all_topics_in_article.items():
        for topic_id, prob in topics_and_probs:
            if prob >= min_prob:
                articles_in_topics[topic_id].append((art_id, float(prob)))
    
    # Sorting them by probability
    sorted_articles_in_topics = {}
    for topic_id, articles_and_probs in articles_in_topics.items():
        sorted_articles_in_topics[topic_id] = sorted(articles_and_probs, key=lambda x: x[1], reverse=True)
                
    return sorted_articles_in_topics

In [20]:
articles_in_topics = get_articles_in_topics()

In [21]:
articles_in_topics

{0: [('21857', 0.9985687732696533),
  ('29266', 0.011111111380159855),
  ('29231', 0.011111111380159855),
  ('28973', 0.011111111380159855),
  ('29118', 0.011111111380159855),
  ('28965', 0.011111111380159855),
  ('29227', 0.011111111380159855),
  ('29335', 0.011111111380159855),
  ('28949', 0.011111111380159855),
  ('29088', 0.011111111380159855),
  ('29319', 0.011111111380159855),
  ('28764', 0.011111111380159855),
  ('29175', 0.011111111380159855),
  ('29525', 0.011111111380159855),
  ('29026', 0.011111111380159855),
  ('29533', 0.011111111380159855),
  ('29071', 0.011111111380159855),
  ('29134', 0.011111111380159855),
  ('29143', 0.011111111380159855),
  ('28705', 0.011111111380159855),
  ('29051', 0.011111111380159855),
  ('29552', 0.011111111380159855),
  ('29047', 0.011111111380159855),
  ('28986', 0.011111111380159855),
  ('29010', 0.011111111380159855),
  ('40133', 0.011111111380159855),
  ('1141', 0.011111111380159855),
  ('28945', 0.011111111380159855),
  ('29207', 0.011111

In [25]:
print("Using a minimum probability of 0.01:")
for topic_id, articles_in in articles_in_topics.items():
    print(f"Topic: {topic_id}    # of articles: {len(articles_in)}")

Using a minimum probability of 0.01:
Topic: 0    # of articles: 255
Topic: 1    # of articles: 260
Topic: 2    # of articles: 267
Topic: 3    # of articles: 323
Topic: 4    # of articles: 278
Topic: 5    # of articles: 361
Topic: 6    # of articles: 298
Topic: 7    # of articles: 314
Topic: 8    # of articles: 322
Topic: 9    # of articles: 286
Topic: 10    # of articles: 269
Topic: 11    # of articles: 452
Topic: 12    # of articles: 256
Topic: 13    # of articles: 263
Topic: 14    # of articles: 407
Topic: 15    # of articles: 264
Topic: 16    # of articles: 254
Topic: 17    # of articles: 576
Topic: 18    # of articles: 331
Topic: 19    # of articles: 530
Topic: 20    # of articles: 273
Topic: 21    # of articles: 577
Topic: 22    # of articles: 274
Topic: 23    # of articles: 286
Topic: 24    # of articles: 378
Topic: 25    # of articles: 266
Topic: 26    # of articles: 481
Topic: 27    # of articles: 300
Topic: 28    # of articles: 292
Topic: 29    # of articles: 264
Topic: 30    

Now `articles_in_topics` is just what we wanted. I'll store it under `../data/articles_in_topics.json`.

In [24]:
with open("../data/articles_in_topics.json", "w") as fp:
    json.dump(articles_in_topics, fp)
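
Note that JSON object keys are always strings, so the integer topic ids become `"0"`, `"1"`, and so on after a round trip through the file, and the `(article_id, prob)` tuples come back as lists. A small self-contained illustration (the toy dictionary is made up):

```python
import json

# A toy version of articles_in_topics with integer topic ids as keys.
toy = {0: [("21857", 0.99)], 1: [("29266", 0.011)]}

restored = json.loads(json.dumps(toy))

print(list(restored.keys()))   # ['0', '1']
print(restored["0"])           # [['21857', 0.99]]

# Converting the keys back to integers, if needed.
as_int_keys = {int(k): v for k, v in restored.items()}
```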

## The functions, but for utils

###  Getting all articles in a topic

In this function, we can actually use the json file that we saved for quick querying. I'll use the notation `_func_name` when I intend to implement a function that will be stored in the utils module.

In [57]:
def _get_articles_in_topic(topic_id, min_prob=0.1, top_n=None):
    """
    This function returns a list of (article, probability) pairs
    for the articles that belong to a given topic with probability
    at least {min_prob}. If {top_n} is an integer, it will return
    at most the {top_n} most probable articles.
    """
    # Loads the precomputed articles (by id) in each topic.
    with open("../data/articles_in_topics.json") as fp:
        a_in_topics = json.load(fp)
    
    # Loads the corpus list
    corpusPath = '../data/corpus'
    corpusList = loadCorpusList(corpusPath)
    a_by_id = {
        a.id: a for a in corpusList
    }
    
    # Filters by probability. JSON keys are strings, hence str(topic_id).
    a_in_topic = a_in_topics[str(topic_id)]
    result = []
    for i, (a_id, prob) in enumerate(a_in_topic):
        # Stop once top_n articles have been collected.
        if isinstance(top_n, int) and i >= top_n:
            break
        
        if prob >= min_prob:
            result.append((a_by_id[a_id], prob))
        else:
            # because the list is sorted.
            break

    return result

In [58]:
print(_get_articles_in_topic(1))

[(<utils.Article.Article object at 0x7fe07e0da690>, 0.9987980127334595), (<utils.Article.Article object at 0x7fe07e72d090>, 0.7574461102485657), (<utils.Article.Article object at 0x7fe07e13fe50>, 0.6800773739814758)]


I thought we had implemented a `__repr__` for articles.

Great! This function is the core of what follows. Another idea that could speed things up is pre-storing `a_by_id` using pickle.
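
A minimal sketch of that caching idea, assuming the article objects can be pickled; `load_cached`, `build_index`, and the cache path are hypothetical names, not part of `utils`:

```python
import os
import pickle
import tempfile

def load_cached(cache_path, build):
    """Load a pickled object from cache_path, building and saving it on a miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fp:
            return pickle.load(fp)
    obj = build()
    with open(cache_path, "wb") as fp:
        pickle.dump(obj, fp)
    return obj

# Toy usage: a plain dict stands in for {a.id: a for a in corpusList}.
calls = []

def build_index():
    calls.append(1)  # track how many times the expensive build runs
    return {"some_id": "toy article"}

cache_path = os.path.join(tempfile.mkdtemp(), "a_by_id.pkl")
first = load_cached(cache_path, build_index)   # builds and writes the cache
second = load_cached(cache_path, build_index)  # reads the cache, no rebuild
```

The cache has to be deleted by hand whenever the corpus changes, since this sketch does no invalidation.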

### Titles and keywords for each topic

In [59]:
def _get_titles_in_topic(topic_id, min_prob=0.1, top_n=None):
    articles = _get_articles_in_topic(topic_id, min_prob=min_prob, top_n=top_n)
    titles = []
    for a, _ in articles:
        if hasattr(a, "title"):
            titles.append(a.title)
        else:
            titles.append("NO TITLE FOUND")

    return titles

In [60]:
def _get_keywords_in_topic(topic_id, min_prob=0.1, top_n=None):
    articles = _get_articles_in_topic(topic_id, min_prob=min_prob, top_n=top_n)
    keywords = []
    for a, _ in articles:
        if hasattr(a, "keywords"):
            keywords.append(a.keywords)
        else:
            keywords.append("NO KEYWORDS FOUND")

    return keywords

### A topic summarizer

In [61]:
def _summarize_topic(topic_id, min_prob=0.1, top_n=None):
    articles = _get_articles_in_topic(topic_id, min_prob=min_prob, top_n=top_n)
    for a, _ in articles:
        summary(a)

TODO: reimplement `summary` in the final version so that it can also accept the probability.
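
A possible shape for that reimplementation, sketched with a hypothetical `format_summary` that returns a string instead of printing and takes an optional `prob`; the `SimpleNamespace` object only stands in for a real article, and the probability is made up:

```python
from types import SimpleNamespace

def format_summary(article, prob=None):
    """Build a short text summary; include the topic probability if given."""
    lines = ["-" * 50, "TITLE", getattr(article, "title", "No title stored")]
    if prob is not None:
        lines.append(f"(w. probability {prob:0.3f})")
    lines += ["KEYWORDS", getattr(article, "keywords", "No keywords stored")]
    lines += ["ABSTRACT", getattr(article, "abstract", "No abstract stored")]
    return "\n".join(lines)

# Toy article without an abstract.
toy = SimpleNamespace(title="Selección natural y moralidad",
                      keywords="moralidad; selección natural")
print(format_summary(toy, prob=0.5))
```

Returning a string rather than printing would also make it easier to reuse the summary when writing tables to a file.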

In [62]:
_summarize_topic(4)

--------------------------------------------------
		 TITLE 		
Selección natural y moralidad
		 KEYWORDS 		
moralidad; selección natural; sociobiología; altruismo

		 ABSTRACT 		
No abstract stored

--------------------------------------------------
		 TITLE 		
El Sinn noemático y la referencia
		 KEYWORDS 		
Frege; Husserl; sentido; Sinn; Sinn noemático; referencia; teoría causal de la referencia

		 ABSTRACT 		
No abstract stored

--------------------------------------------------
		 TITLE 		
Sobre la facticidad de la memoria
		 KEYWORDS 		
Ciencia; experiencia inmediata; Síndrome de Korsakow; presente; percepción; teoría del rastro

		 ABSTRACT 		
No abstract stored

--------------------------------------------------
		 TITLE 		
La evolución de la moral contractual
		 KEYWORDS 		
No keywords stored

		 ABSTRACT 		
No abstract stored

--------------------------------------------------
		 TITLE 		
Filosofía moral del naturalismo
		 KEYWORDS 		
Cientificísmo; realidad; ciencias natural

### Top n words

In [51]:
def _topic_top_n(lda, topic_id, n=10, verbose=True):
    '''
    This function prints the topic using the
    LDA's method, and returns a list of the n
    top words in the topic (probability-wise).
    '''
    if verbose:
        print(lda.print_topic(topic_id, topn=n))

    return [
        (lda.id2word.get(idx), f"{prob:0.3}") for idx, prob in lda.get_topic_terms(topic_id, topn=n)
    ]

## Getting the table with all topics using pandas and TeX

In [64]:
import pandas as pd

In [167]:
def write_page(topic_id):
    """
    This function writes a page in TeX with two parts:
    a table with the top words of the topic, and some
    metadata (e.g. amount of articles, summary of top 3 articles).
    """
    
    page = "\\centering\n"
    page += "\\thispagestyle{empty}\n\\section*{Topic " + str(topic_id) + "}\\vfill\n"
    
    # Getting the word table
    words_and_probs = topic_top_n(topic_id, n=20, verbose=False)
    words_table = pd.DataFrame(words_and_probs, columns=[f"Topic {topic_id}", "Probability"])
    
    # Filtering bad characters in utf-8 (map returns a new column, so assign it back)
    words_table[f"Topic {topic_id}"] = words_table[f"Topic {topic_id}"].map(
        lambda x: bytes(x, "utf-8").decode("utf-8", "ignore")
    )
    
    page += words_table.to_latex(index=False)
    page += "\n\\vfill\n"
    articles = _get_articles_in_topic(topic_id, min_prob=0.1)
    amount = len(articles)
    page += f"Found {amount} articles in topic {topic_id}" + "\n"
    for a, prob in articles[:3]:
        page += "\\vfill\n"
        page += "\n"
        page += r"\textbf{" + a.title.replace("&", "\\&") + "}" + f" (id: {a.id})" + "\n" + f" (w. prob {prob:1.4f})" + "\n"
        try:
            keywords = a.keywords
            page += "\n\nKEYWORDS:\n"
            page += keywords + "\n"
        except (AttributeError, TypeError):
            # Articles without (string) keywords are skipped.
            pass
    
    page += "\n\\vfill"
    return page, amount
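
Only `&` is escaped in the titles above; other LaTeX special characters (such as `%`, `_`, or `#`) in a title or keyword would still break compilation. A minimal sketch of a more general helper follows; `escape_latex` is a hypothetical name and the mapping covers only the common special characters:

```python
# Hypothetical helper, not part of the notebook's utils module.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}

def escape_latex(text):
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)

print(escape_latex("Ciencia & técnica: 100% _natural_"))
# Ciencia \& técnica: 100\% \_natural\_
```

Working character by character avoids escaping the backslashes that earlier replacements introduced, which chained `str.replace` calls can do.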

In [170]:
def write_document():
    """
    This function stores the TeX document
    with all the tables.
    """
    document = r"\documentclass{article}"
    document += "\n\\usepackage[utf8]{inputenc}"
#     document += "\n\\usepackage[LGR]{fontenc}"
    document += "\n\\usepackage{booktabs}"
    document += "\n\n\n\\begin{document}\n\n"
    
    # Computing all pages:
    pages = [write_page(t) for t in range(90) if t not in [54, 67]]
    pages = sorted(pages, key=lambda x: x[1], reverse=True)
    for page, _ in pages:
        document += page
        document += "\n\\newpage\n\n\n"

    document += "\n\n\\end{document}"
    
    with open("../extras/table.tex", "w") as fp:
        fp.write(document)

    print(document)

In [171]:
write_document()

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{booktabs}


\begin{document}

\centering
\thispagestyle{empty}
\section*{Topic 81}\vfill
\begin{tabular}{ll}
\toprule
     Topic 81 & Probability \\
\midrule
    filosofía &      0.0108 \\
      ciencia &     0.00942 \\
          ser &     0.00853 \\
     historia &     0.00721 \\
        mundo &     0.00587 \\
  pensamiento &     0.00568 \\
      crítico &     0.00494 \\
    histórico &     0.00446 \\
       hombre &     0.00419 \\
        forma &     0.00415 \\
 conocimiento &     0.00403 \\
     concepto &     0.00399 \\
     problema &     0.00387 \\
   filosófico &      0.0038 \\
       lógico &     0.00372 \\
  desarrollar &     0.00361 \\
         vida &     0.00357 \\
       teoría &     0.00339 \\
         marx &     0.00328 \\
         obra &     0.00325 \\
\bottomrule
\end{tabular}

\vfill
Found 330 articles in topic 81
\vfill

\textbf{En defensa del marxismo} (id: 19107)
 (w. prob 0.9717)


KEYWORDS:
Althusser;