<a href="https://colab.research.google.com/github/anelglvz/Matematicas_Ciencia_Datos/blob/main/%C3%81lgebra/Textrank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TextRank implementation for summarization

In this notebook we implement TextRank to summarize a text by extracting its key sentences.

# Dependencies

In [None]:
!pip3 install wikipedia

In [None]:
!pip3 install 'txtai[pipeline]'

In [None]:
!pip3 install huggingface_hub==0.24.1

In [5]:
import re

import numpy as np
import scipy.linalg as splinalg

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import wikipedia

from txtai.pipeline import Translation



In [6]:
nltk.download("punkt")
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/pedrom2/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pedrom2/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/pedrom2/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [7]:
# Stemmer
stemmer = PorterStemmer()

# Stop words
cached_stopwords = stopwords.words('english')
print(cached_stopwords[:10])

# Translator
translate = Translation()

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [8]:
# List comprehension example
lista = []
for i in range(9):
  lista.append('Hola')

print(lista)

# Another way to build it
otra_lista = ['Hola' for i in range(9)]

print(otra_lista)

['Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola']
['Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola']


# Data

The data we will use is the text of Wikipedia pages. We download the text with the [```wikipedia```](https://pypi.org/project/wikipedia/) module, a wrapper around the Wikipedia API. We split this text into sentences, preprocess each sentence, stem each word, and apply TextRank to obtain the most important sentences of the whole document.

## Reading the data

We download articles from Wikipedia.

In [9]:
wiki = wikipedia.page('Expropiación del petróleo en México')
book = wiki.content
print(book)

The Mexican oil expropriation (Spanish: expropiación petrolera) was the nationalization of all petroleum reserves, facilities, and foreign oil companies in Mexico on March 18, 1938.  In accordance with Article 27 of the Constitution of 1917, President Lázaro Cárdenas declared that all mineral and oil reserves found within Mexico belong to "the nation", i.e., the federal government. The Mexican government established a state-owned petroleum company, Petróleos Mexicanos, or PEMEX.  For a short period, this measure caused an international boycott of Mexican products in the following years, especially by the United States, the United Kingdom, and the Netherlands, but with the outbreak of World War II and the alliance between Mexico and the Allies, the disputes with private companies over compensation were resolved. The anniversary, March 18, is now a Mexican civic holiday.


== Background ==

On August 16, 1935, the Petroleum Workers Union of Mexico (Sindicato de Trabajadores Petroleros de

## Processing

We split the text into sentences.

In [10]:
sentences = [x for x in sent_tokenize(book)]
print(f"# oraciones: {len(sentences)}")
for sentence in sentences[:3]:
    print(sentence)
    print()
    print("...Fin de la oración...")
    print()


# oraciones: 98
The Mexican oil expropriation (Spanish: expropiación petrolera) was the nationalization of all petroleum reserves, facilities, and foreign oil companies in Mexico on March 18, 1938.

...Fin de la oración...

In accordance with Article 27 of the Constitution of 1917, President Lázaro Cárdenas declared that all mineral and oil reserves found within Mexico belong to "the nation", i.e., the federal government.

...Fin de la oración...

The Mexican government established a state-owned petroleum company, Petróleos Mexicanos, or PEMEX.

...Fin de la oración...



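We convert to lowercase, remove stopwords, strip punctuation, and stem. Note that the stopword check runs on the original-case token, so capitalized stopwords such as "The" survive, and tokens without letters (for example "1938") become empty strings, as the output below shows.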

In [11]:
sent_low = [[stemmer.stem(re.sub('[^a-z]', "", word.lower())) for word in word_tokenize(sentence) if word not in cached_stopwords and len(word) > 2] for sentence in sentences]
sent_low[0]

['the',
 'mexican',
 'oil',
 'expropri',
 'spanish',
 'expropiacin',
 'petrolera',
 'nation',
 'petroleum',
 'reserv',
 'facil',
 'foreign',
 'oil',
 'compani',
 'mexico',
 'march',
 '']
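The two artifacts visible in the output above (`'the'` and `''`) can be avoided by lowercasing before the stopword check and dropping empty tokens. The following is a minimal sketch under stated assumptions: `preprocess` and `TOY_STOPWORDS` are hypothetical names, the tiny stopword set stands in for `stopwords.words('english')`, whitespace splitting stands in for `word_tokenize`, and the identity function stands in for the stemmer so the block runs without NLTK data.

```python
import re

# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose
TOY_STOPWORDS = {"the", "of", "in", "on", "was", "all", "and"}

def preprocess(sentence, stopwords=TOY_STOPWORDS, stem=lambda w: w):
    """Lowercase, keep letters only, then filter short tokens and stopwords."""
    tokens = []
    for raw in sentence.split():
        word = re.sub("[^a-z]", "", raw.lower())  # clean first, filter afterwards
        if len(word) > 2 and word not in stopwords:
            tokens.append(stem(word))
    return tokens

tokens = preprocess("The Mexican oil expropriation was on March 18, 1938.")
print(tokens)
```

In the notebook, `stem=stemmer.stem` and `stopwords=set(cached_stopwords)` could be passed instead of the defaults.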

# TextRank

We build the adjacency/similarity matrix A between sentences. The similarity between two sentences is the number of tokens of the first that also appear in the second, so a repeated token counts each time it occurs.

In [12]:
A = np.zeros((len(sent_low), len(sent_low)))

for i in range(len(sentences)):
    if i % 100 == 0:
        print(i, end=", ")
        if i % 1000 == 0:
            print()
    for j in range(i+1, len(sentences)):
        # Similarity between sentences is the number of words they have in common
        A[i][j] = A[j][i] = len([x for x in sent_low[i] if x in sent_low[j]])

0, 


In [13]:
[x for x in sent_low[0] if x in sent_low[1]]

['oil', 'nation', 'reserv', 'oil', 'mexico', '']

This is what a fragment of the matrix A looks like.

In [14]:
A[:5, :5]

array([[0., 6., 4., 3., 3.],
       [6., 0., 1., 1., 0.],
       [4., 1., 0., 2., 2.],
       [3., 1., 2., 0., 1.],
       [3., 0., 2., 1., 0.]])
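To make the counting rule concrete, here is a minimal sketch on a hypothetical three-sentence toy input. The names `overlap_matrix` and `toy` are illustrative and not part of the notebook; the loop mirrors the one above, including the zero diagonal and the repeated-token behaviour.

```python
import numpy as np

def overlap_matrix(token_lists):
    """Symmetric matrix whose (i, j) entry counts tokens of i found in j, for i < j."""
    n = len(token_lists)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(1 for tok in token_lists[i] if tok in token_lists[j])
            A[i, j] = A[j, i] = shared
    return A

# Hypothetical preprocessed sentences
toy = [["oil", "mexico", "oil"], ["oil", "nation"], ["nation", "mexico"]]
A_toy = overlap_matrix(toy)
print(A_toy)
```

The entry for sentences 0 and 1 is 2 rather than 1 because `"oil"` occurs twice in sentence 0. Converting each token list to a set first would count distinct shared words instead.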

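We normalize the rows of A so that each row sums to 1. Because A is symmetric, its column sums are equal to its row sums.

In [15]:
# Sentences are compared with each other, but not with themselves (zero diagonal)
suma = np.sum(A, axis=0)  # column sums; equal to the row sums because A is symmetric
A_norm = A.copy()
for i in range(len(sentences)):
  if suma[i] != 0:
    A_norm[i,:] = A[i,:]/suma[i]
A_norm[:5, :5]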

array([[0.        , 0.02439024, 0.01626016, 0.01219512, 0.01219512],
       [0.04225352, 0.        , 0.00704225, 0.00704225, 0.        ],
       [0.03636364, 0.00909091, 0.        , 0.01818182, 0.01818182],
       [0.02631579, 0.00877193, 0.01754386, 0.        , 0.00877193],
       [0.06521739, 0.        , 0.04347826, 0.02173913, 0.        ]])

In [16]:
print(A_norm[0,:].sum())

0.9999999999999999
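The same normalization can be written without an explicit loop. This is a sketch, assuming a square non-negative matrix; `normalize_rows` is a hypothetical helper name, and rows that sum to zero (sentences sharing no words with any other) are left as zeros, as in the loop above.

```python
import numpy as np

def normalize_rows(A):
    """Divide each row by its sum; all-zero rows stay all-zero."""
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    safe = np.where(row_sums == 0, 1.0, row_sums)  # avoid division by zero
    return A / safe

# Toy matrix with one isolated sentence (last row and column are zero)
A_toy = np.array([[0, 2, 1, 0],
                  [2, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 0, 0, 0]])
P_toy = normalize_rows(A_toy)
print(P_toy.sum(axis=1))
```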


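We initialize the TextRank vector with ones and iterate until it converges, that is, until we obtain a column vector $\Pi$ such that $$\Pi = A^{\top}\,\Pi$$ where $A$ denotes the normalized matrix `A_norm` (the code below multiplies by `A_norm.T`).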

In [17]:
# For nicer printing
np.set_printoptions(suppress=True)

In [18]:
# Tolerance for the convergence check; passed positionally, np.allclose uses it as the relative tolerance (rtol)
tol = 1e-7

PI_ = np.ones(A_norm.shape[1])
A_norm_a = A_norm.T.copy()

i = 0
while True:
    pi_ = A_norm_a @ PI_
    print(i, abs(PI_- pi_).sum())
    if np.allclose(PI_, pi_, tol):
        break
    i += 1
    PI_ = pi_

0 47.27840454716925
1 10.542090208594786
2 1.612491930081298
3 0.46520495436131315
4 0.15736201861660912
5 0.06604752393501309
6 0.029370904712919164
7 0.01365881111745884
8 0.006367634061440166
9 0.0029863167393567834
10 0.0014033473511069466
11 0.0006594154849675401
12 0.00030961130939013114
13 0.00014528251738793983
14 6.813716316123042e-05
15 3.194296757432283e-05
16 1.4969970471977179e-05
17 7.0137690248300855e-06
18 3.2854181728370763e-06
19 1.5387144829629579e-06
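The loop above has no iteration cap. A slightly more defensive sketch follows, with an explicit absolute tolerance on the L1 difference (the quantity printed above) and a maximum number of iterations. The name `power_iteration` and its parameters are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def power_iteration(P, atol=1e-9, max_iter=1000):
    """Iterate pi <- P.T @ pi from a vector of ones until the L1 change is below atol."""
    pi = np.ones(P.shape[0])
    for _ in range(max_iter):
        new_pi = P.T @ pi
        if np.abs(new_pi - pi).sum() < atol:
            return new_pi
        pi = new_pi
    raise RuntimeError("power iteration did not converge")

# Row-normalized version of a toy symmetric matrix with row sums 3, 3, 2
A_toy = np.array([[0., 2., 1.],
                  [2., 0., 1.],
                  [1., 1., 0.]])
P_toy = A_toy / A_toy.sum(axis=1, keepdims=True)
pi_toy = power_iteration(P_toy)
print(pi_toy)
```

On this toy input the result is proportional to the row sums of the unnormalized matrix (3, 3, 2). That is the expected behaviour for a random walk on a connected, aperiodic, undirected weighted graph, and the scores in this notebook show the same proportionality.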


In [19]:
pi_

array([2.63531801, 1.52119984, 1.17839422, 1.22124493, 0.49278304,
       2.02469551, 1.53191247, 1.3926477 , 0.16069012, 0.65347315,
       0.70703651, 0.92129001, 1.02841676, 1.58547584, 0.93200269,
       1.34979701, 1.34979701, 1.53191249, 1.77830402, 0.37494363,
       0.35351827, 0.71774919, 1.46763642, 1.74616597, 1.16768153,
       1.7890167 , 0.68561117, 0.78202524, 0.5999098 , 0.89986469,
       1.2640956 , 1.32837166, 0.95342803, 0.29995489, 1.02841676,
       0.68561117, 1.26409563, 1.1462562 , 1.49977443, 0.43921965,
       1.54262514, 0.55705907, 1.90685609, 1.17839421, 1.32837166,
       0.03213802, 0.38565629, 1.60690122, 2.78529547, 0.10712674,
       0.66418586, 1.00699143, 0.01071267, 0.01071268, 1.52119984,
       0.13926477, 2.14253498, 1.17839424, 1.59618854, 0.55705909,
       1.42478574, 1.29623367, 1.87471812, 1.47834912, 0.78202529,
       0.2571042 , 0.89986469, 0.20354082, 0.73917456, 0.74988724,
       1.63903923, 0.92129003, 0.9427154 , 0.52492108, 0.72846

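Alternatively, we can compute the left eigenvectors of our matrix A_norm. The PageRank values correspond to the stationary probability vector of the matrix A, which in turn is the left eigenvector associated with eigenvalue 1 (here $\Pi$ is written as a row vector). An eigenvector is defined only up to a scale factor, so its entries differ from the iterated vector by a constant but produce the same ranking.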

$$\Pi = \Pi A$$

In [20]:
D, vecs = splinalg.eig(A_norm, left=True, right=False)

In [21]:
D

array([ 1.        +0.j,  0.46807032+0.j,  0.36330176+0.j,  0.28687607+0.j,
        0.22964   +0.j,  0.19934271+0.j,  0.19239694+0.j,  0.17251149+0.j,
       -0.18571612+0.j,  0.15108567+0.j,  0.13043256+0.j,  0.12501392+0.j,
        0.1148962 +0.j,  0.09107226+0.j,  0.09167755+0.j,  0.07738138+0.j,
       -0.15432295+0.j,  0.07356221+0.j,  0.06757776+0.j,  0.06233724+0.j,
        0.05354638+0.j, -0.13339054+0.j, -0.12835692+0.j,  0.04999122+0.j,
        0.05162979+0.j, -0.1238126 +0.j,  0.04289975+0.j, -0.1208941 +0.j,
        0.03253336+0.j, -0.11878348+0.j,  0.02461691+0.j,  0.02145452+0.j,
       -0.11474704+0.j, -0.11010073+0.j, -0.11082583+0.j,  0.01702951+0.j,
       -0.10522701+0.j,  0.01349817+0.j, -0.10197328+0.j,  0.01149895+0.j,
       -0.09833783+0.j,  0.00967066+0.j,  0.00271149+0.j, -0.0943705 +0.j,
       -0.09184081+0.j, -0.0912115 +0.j, -0.10869565+0.j, -0.00405074+0.j,
       -0.08618278+0.j, -0.00282857+0.j, -0.00810469+0.j, -0.00892544+0.j,
       -0.01108322+0.j, -

In [22]:
vecs.shape

(98, 98)

In [23]:
vecs

array([[ 0.23427339, -0.0632663 ,  0.12907107, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.13523098, -0.12351332,  0.03516766, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.10475639,  0.05801119,  0.06806941, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  1.        ],
       [ 0.09237609,  0.02665773,  0.09291842, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.04475955, -0.14749318, -0.19657137, ...,  0.        ,
         0.        ,  0.        ]])

In [24]:
pi2_ = vecs[:, 0]
pi2_

array([0.23427339, 0.13523098, 0.10475639, 0.10856572, 0.04380722,
       0.17999053, 0.13618331, 0.12380301, 0.01428496, 0.05809218,
       0.06285384, 0.08190045, 0.09142376, 0.14094497, 0.08285278,
       0.11999369, 0.11999369, 0.13618331, 0.15808692, 0.03333158,
       0.03142692, 0.06380617, 0.13046933, 0.15522993, 0.10380406,
       0.15903925, 0.06094917, 0.06952015, 0.05333053, 0.07999579,
       0.11237504, 0.11808903, 0.08475745, 0.02666526, 0.09142376,
       0.06094917, 0.11237504, 0.1018994 , 0.13332632, 0.03904557,
       0.13713564, 0.0495212 , 0.16951489, 0.10475639, 0.11808903,
       0.00285699, 0.03428391, 0.14284963, 0.24760602, 0.00952331,
       0.05904451, 0.0895191 , 0.00095233, 0.00095233, 0.13523098,
       0.0123803 , 0.19046617, 0.10475639, 0.1418973 , 0.0495212 ,
       0.12666   , 0.11523203, 0.1666579 , 0.13142166, 0.06952015,
       0.02285594, 0.07999579, 0.01809429, 0.06571083, 0.06666316,
       0.14570662, 0.08190045, 0.08380512, 0.04666421, 0.06475
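Taking column 0 works here because the eigenvalue 1 happens to come first in `D`; eigenvalue routines are not documented to return eigenvalues in any particular order. A sketch that selects the eigenvector explicitly and rescales it to sum to 1 is shown below. It uses `np.linalg.eig` on the transpose, swapped in for the `left=True` call above (right eigenvectors of $A^{\top}$ are left eigenvectors of $A$); `stationary_from_eig` is a hypothetical helper name.

```python
import numpy as np

def stationary_from_eig(P):
    """Left eigenvector of P for the eigenvalue closest to 1, scaled to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)          # right eigenvectors of P.T
    idx = np.argmin(np.abs(vals - 1.0))      # do not rely on position 0
    v = np.real(vecs[:, idx])
    return v / v.sum()                       # fixes both scale and sign

A_toy = np.array([[0., 2., 1.],
                  [2., 0., 1.],
                  [1., 1., 0.]])
P_toy = A_toy / A_toy.sum(axis=1, keepdims=True)
pi_eig = stationary_from_eig(P_toy)
print(pi_eig)
```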

We take the indices of the k largest values in $\Pi$ and use them to retrieve the most relevant sentences.

In [42]:
k = 4
pi_.argsort()[-k:][::-1]

array([48,  0, 56,  5])

In [43]:
k = 4
pi2_.argsort()[-k:][::-1]

array([48,  0, 56,  5])
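The indexing idiom `argsort()[-k:][::-1]` can be read step by step on a small hypothetical score vector. The sketch also shows one optional design choice that is not in the notebook: sorting the selected indices so the summary follows the original document order rather than the score order.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, highest score first."""
    return np.argsort(scores)[-k:][::-1]

scores = np.array([0.2, 0.9, 0.1, 0.5, 1.3])   # hypothetical TextRank scores
best_first = top_k_indices(scores, 3)           # ranked by score
in_document_order = np.sort(best_first)         # ranked by position in the text
print(best_first, in_document_order)
```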

In [27]:
summary = [sentences[idx] for idx in pi_.argsort()[-k:][::-1]]

In [28]:
summary

['== Oil Expropriation Day, March 18, 1938 ==\nOn March 18, 1938 President Cárdenas embarked on the expropriation of all oil resources and facilities by the state, nationalizing the U.S. and Anglo-Dutch (Mexican Eagle Petroleum Company) operating companies.',
 'The Mexican oil expropriation (Spanish: expropiación petrolera) was the nationalization of all petroleum reserves, facilities, and foreign oil companies in Mexico on March 18, 1938.',
 '== Opposition ==\n\n\n=== International ===\nIn retaliation, the oil companies initiated a public relations campaign against Mexico, urging people to stop buying Mexican goods and lobbying to embargo U.S. technology to Mexico.',
 '== Background ==\n\nOn August 16, 1935, the Petroleum Workers Union of Mexico (Sindicato de Trabajadores Petroleros de la República Mexicana) was formed and one of the first actions was the writing of a lengthy draft contract transmitted to the petroleum companies demanding a 40-hour working week, a full salary paid in 

Finally, all that remains is to look at which sentences TextRank considered the most important.

In [29]:
for bullet in summary:
    print('___________')
    print(bullet)

___________
== Oil Expropriation Day, March 18, 1938 ==
On March 18, 1938 President Cárdenas embarked on the expropriation of all oil resources and facilities by the state, nationalizing the U.S. and Anglo-Dutch (Mexican Eagle Petroleum Company) operating companies.
___________
The Mexican oil expropriation (Spanish: expropiación petrolera) was the nationalization of all petroleum reserves, facilities, and foreign oil companies in Mexico on March 18, 1938.
___________
== Opposition ==


=== International ===
In retaliation, the oil companies initiated a public relations campaign against Mexico, urging people to stop buying Mexican goods and lobbying to embargo U.S. technology to Mexico.
___________
== Background ==

On August 16, 1935, the Petroleum Workers Union of Mexico (Sindicato de Trabajadores Petroleros de la República Mexicana) was formed and one of the first actions was the writing of a lengthy draft contract transmitted to the petroleum companies demanding a 40-hour working w

We can also translate the output (skipped for now).

In [None]:
!pip3 install sacremoses

In [31]:
# Approx. 34 s for the first 10 sentences
for bullet in summary:
    print()
    #print(translate(bullet, "es"))







# Function to create summaries

We can condense everything above into a function that takes a text and returns the most relevant sentences according to TextRank.

In [32]:
def summary(text, k, to_spanish = False, tol = 1e-5, d = .15, eig = False):
    # Note: the parameter d is accepted but not used in the body below
    print("Paso 1. Obteniendo oraciones")
    sentences = [x for x in sent_tokenize(text)]

    print(f"# oraciones: {len(sentences)}")

    print("Paso 2. Procesando texto")
    sent_low = [[stemmer.stem(re.sub('[^a-z]', "", word.lower())) for word in word_tokenize(sentence) if word not in cached_stopwords and len(word) > 2] for sentence in sentences]
    
    print("Paso 3. Creando matriz de similitud")
    A = np.zeros((len(sent_low), len(sent_low)))

    for i in range(len(sentences)):
        for j in range(i+1, len(sentences)):
            # Similarity between sentences is the number of words they have in common
            A[i][j] = A[j][i] = len([x for x in sent_low[i] if x in sent_low[j]])

    print("Paso 4. Normalizando matriz de similitud")
    suma = np.sum(A, axis=0)  # column sums; equal to the row sums because A is symmetric
    A_norm = A.copy()
    for i in range(len(sentences)):
      if suma[i] != 0:
        A_norm[i,:] = A[i,:]/suma[i]

    print("Paso 5. Ejecutando TextRank")
    if eig:
        vals, vecs = splinalg.eig(A_norm, left=True, right=False)
        pi_ = vecs[:, 0]
    else:
        A_norm_a = A_norm.T.copy()
        PI_ = np.ones(A_norm.shape[1])

        while True:
            pi_ = A_norm_a.dot(PI_)
            if np.allclose(PI_, pi_, tol):
                break
            PI_ = pi_

    print("\tPaso 5. Terminado")

    if not to_spanish:
        return [sentences[idx] for idx in pi_.argsort()[-k:][::-1]]

    print("Paso 6. Traduciendo")
    return [translate(sentences[idx], "es") for idx in pi_.argsort()[-k:][::-1]]

def print_bullet_points(bullet_points):
    for point in bullet_points:
        print(f"- {point}\n")


In [33]:
wiki = wikipedia.page('Automatic Summarization')
text = wiki.content
bullet_points = summary(text, 5, False, eig = False)

Paso 1. Obteniendo oraciones
# oraciones: 329
Paso 2. Procesando texto
Paso 3. Creando matriz de similitud
Paso 4. Normalizando matriz de similitud
Paso 5. Ejecutando TextRank
	Paso 5. Terminado


In [34]:
print_bullet_points(bullet_points)

- ==== Maximum entropy-based summarization ====
During the DUC 2001 and 2002 evaluation workshops, TNO developed a sentence extraction system for multi-document summarization in the news domain.

- === Document summarization ===
Like keyphrase extraction, document summarization aims to identify the essence of a text.

- ==== TextRank and LexRank ====
The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data.

- The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".

- === Submodular functions as generic tools for summarization ===
The idea of a submodular set function has recently emerged as a powerful modeling tool for various summarization problems.



In [35]:
!wget https://www.gutenberg.org/files/84/84-0.txt -O book.txt

--2024-10-24 22:00:39--  https://www.gutenberg.org/files/84/84-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 448642 (438K) [text/plain]
Saving to: ‘book.txt’


2024-10-24 22:00:40 (2.04 MB/s) - ‘book.txt’ saved [448642/448642]



In [36]:
with open("book.txt") as f:
    book_raw = f.read()

print(book_raw[0:1000])

The Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft Shelley

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Frankenstein
       or, The Modern Prometheus

Author: Mary Wollstonecraft Shelley

Release Date: October 31, 1993 [eBook #84]
[Most recently updated: December 2, 2022]

Language: English

Character set encoding: UTF-8

Produced by: Judith Boss, Christy Phillips, Lynn Hanninen and David Meltzer. HTML version by Al Haines.
Further corrections by Menno de Leeuw.

*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***




Frankenstein;

or, the Modern Prometheus

by 


In [37]:
start = book_raw.rfind("Chapter 1\n")
end = book_raw.rfind('Chapter 2\n')

In [38]:
chapter_n = book_raw[start + len("Chapter 1\n"): end]

In [39]:
chapter_n

'\n\nI am by birth a Genevese, and my family is one of the most\ndistinguished of that republic. My ancestors had been for many years\ncounsellors and syndics, and my father had filled several public\nsituations with honour and reputation. He was respected by all who\nknew him for his integrity and indefatigable attention to public\nbusiness. He passed his younger days perpetually occupied by the\naffairs of his country; a variety of circumstances had prevented his\nmarrying early, nor was it until the decline of life that he became a\nhusband and the father of a family.\n\nAs the circumstances of his marriage illustrate his character, I cannot\nrefrain from relating them. One of his most intimate friends was a\nmerchant who, from a flourishing state, fell, through numerous\nmischances, into poverty. This man, whose name was Beaufort, was of a\nproud and unbending disposition and could not bear to live in poverty\nand oblivion in the same country where he had formerly been\ndistinguish

In [40]:
bullet_points = summary(chapter_n, 5, False, eig = False)

Paso 1. Obteniendo oraciones
# oraciones: 74
Paso 2. Procesando texto
Paso 3. Creando matriz de similitud
Paso 4. Normalizando matriz de similitud
Paso 5. Ejecutando TextRank
	Paso 5. Terminado


In [41]:
print_bullet_points(bullet_points)

- Her father grew worse; her time
was more entirely occupied in attending him; her means of subsistence
decreased; and in the tenth month her father died in her arms, leaving
her an orphan and a beggar.

- One day, when my father had gone by himself to Milan, my mother,
accompanied by me, visited this abode.

- When my father returned from Milan, he found playing with me in the hall of
our villa a child fairer than pictured cherub—a creature who seemed
to shed radiance from her looks and whose form and motions were lighter
than the chamois of the hills.

- The father of their
charge was one of those Italians nursed in the memory of the antique glory
of Italy—one among the _schiavi ognor frementi,_ who exerted
himself to obtain the liberty of his country.

- He passed his younger days perpetually occupied by the
affairs of his country; a variety of circumstances had prevented his
marrying early, nor was it until the decline of life that he became a
husband and the father of a family.



# Exercises

## Similarity matrix between sentences

The similarity between sentences was computed as the number of words that appear in both. **Replace it with cosine similarity** and compare the results.

A very good first approach could be to use Latent Semantic Analysis and compute the cosine similarity between all documents.

If you have a DataFrame with the columns ```[id_documento_1, id_documento_2, similitud]```, the function [```pandas.DataFrame.pivot```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html) can help build the similarity matrix; it takes "index", "columns" and "values" as arguments.
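As a possible starting point for this exercise, here is a minimal sketch of cosine similarity over raw bag-of-words counts. It does not use Latent Semantic Analysis, and `cosine_similarity_matrix` and `toy` are hypothetical names. The diagonal is set to zero so that, as in the notebook, sentences are not compared with themselves.

```python
import numpy as np

def cosine_similarity_matrix(token_lists):
    """Cosine similarity between bag-of-words count vectors, with a zero diagonal."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    X = np.zeros((len(token_lists), len(vocab)))
    for row, toks in enumerate(token_lists):
        for tok in toks:
            X[row, index[tok]] += 1              # term counts per sentence
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.where(norms == 0, 1.0, norms)  # empty sentences stay zero
    S = X_unit @ X_unit.T
    np.fill_diagonal(S, 0.0)
    return S

toy = [["oil", "mexico", "oil"], ["oil", "nation"], ["nation", "mexico"]]
S_toy = cosine_similarity_matrix(toy)
print(S_toy.round(3))
```

Unlike the raw overlap count, these values are bounded by 1, so long sentences do not get a higher similarity just for containing more words.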




## Sentences vs. words

In this notebook we used sentences to obtain the summary; had we used words instead, TextRank would give us the keywords of the text.

Implement TextRank with words. For the similarity (or adjacency) matrix, you can link consecutive words, or define a window of k consecutive words in each sentence (similar to skip-gram) and link all the words inside it. In this case, the matrix A would have the dimension of the vocabulary (the list of unique words) and would contain a 1 wherever two words are linked.

Another alternative would be to use a word embedding (e.g. word2vec) and compute the cosine similarity between the word vectors to fill A.

After that, everything else stays the same.
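The window-based linking described above could be sketched as follows. This is one possible reading of the exercise, not a reference solution: `cooccurrence_matrix` is a hypothetical name, and `window` is taken to mean the number of following words each word is linked to, so `window=1` links only consecutive words.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=1):
    """Binary symmetric matrix linking each word to the next `window` words."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    A = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        for other in tokens[pos + 1 : pos + 1 + window]:
            if other != word:                    # no self-links
                A[index[word], index[other]] = 1
                A[index[other], index[word]] = 1
    return vocab, A

# Hypothetical preprocessed tokens of one sentence
tokens = ["oil", "company", "strike", "oil", "worker"]
vocab, A_words = cooccurrence_matrix(tokens, window=1)
print(vocab)
print(A_words)
```

The resulting matrix can then be normalized and iterated exactly like the sentence matrix; the highest-scoring rows correspond to entries of `vocab`.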

## Language *

This example is built for English text because of the stopwords and the stemmer (PorterStemmer) it uses. Make the changes needed for it to accept Spanish text.

That is, change the stopwords (nltk ships Spanish stopwords) and the stemmer (hint: ```nltk.stem``` has more stemmers, and one of them implements an algorithm for Spanish).

## Summary of a topic *

Here we applied TextRank to a single document. We could instead take a corpus of documents on the same topic (e.g. news about the AIFA) and apply it to obtain the key points of the whole corpus.

The current implementation needs no changes; just concatenate the whole corpus into a single string.

Exercise: build a corpus with 4 articles on a topic of interest, concatenate them, and pass the result as the argument to the ```summary``` function.

## Improving the ```summary``` function

We can split the function's code into modular pieces, which gives some freedom at run time. For example, we could have several functions that compute the matrix A in different ways, and have ```summary``` run one of them depending on a parameter.

Exercise: create a function for each step of ```summary```.
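One way to organize the "several functions, chosen by a parameter" idea is a dictionary of similarity functions. The sketch below is only an illustration of that design: every name in it is hypothetical, and Jaccard similarity is an alternative introduced here for contrast, not something the notebook uses.

```python
def overlap_similarity(a, b):
    """Number of tokens of a that also appear in b (the rule used in this notebook)."""
    return float(sum(1 for tok in a if tok in b))

def jaccard_similarity(a, b):
    """Shared distinct tokens divided by total distinct tokens (illustrative alternative)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

SIMILARITIES = {"overlap": overlap_similarity, "jaccard": jaccard_similarity}

def build_matrix(token_lists, method="overlap"):
    """Symmetric similarity matrix (list of lists) using the selected method."""
    sim = SIMILARITIES[method]
    n = len(token_lists)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = sim(token_lists[i], token_lists[j])
    return A

toy = [["oil", "mexico", "oil"], ["oil", "nation"], ["nation", "mexico"]]
A_overlap = build_matrix(toy, method="overlap")
A_jaccard = build_matrix(toy, method="jaccard")
print(A_overlap[0][1], A_jaccard[0][1])
```

A `summary`-style function could then accept a `method` argument and pass it through, leaving the normalization and ranking steps unchanged.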

# On computing the PageRank values

https://nlp.stanford.edu/IR-book/html/htmledition/the-pagerank-computation-1.html

https://nlp.stanford.edu/IR-book/html/htmledition/markov-chains-1.html