<a href="https://colab.research.google.com/github/anelglvz/Matematicas_Ciencia_Datos/blob/main/%C3%81lgebra/Textrank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Implementing TextRank for extractive summarization

In this notebook we implement TextRank to produce a summary made up of the key sentences of a text.

# Dependencies

In [None]:
%%capture
!pip install wikipedia git+https://github.com/neuml/txtai#egg=txtai[pipeline]

In [None]:
# It MAY be necessary to use an older version of Pillow
!pip install Pillow==9.0.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Pillow==9.0.0
  Downloading Pillow-9.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
Installing collected packages: Pillow
  Attempting uninstall: Pillow
    Found existing installation: Pillow 8.4.0
    Uninstalling Pillow-8.4.0:
      Successfully uninstalled Pillow-8.4.0
Successfully installed Pillow-9.0.0


In [None]:
import re

import pandas as pd
import numpy as np
import scipy.linalg as splinalg

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

import wikipedia

from txtai.pipeline import Translation

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.


In [None]:
nltk.download("punkt")
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Stemmer
stemmer = PorterStemmer()

# Stop words
cached_stopwords = stopwords.words('english')
print(cached_stopwords[:10])

# Translator
translate = Translation()

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


# Data

The data we will use is the text of Wikipedia pages. We download the text with the [```wikipedia```](https://pypi.org/project/wikipedia/) module, a wrapper around the Wikipedia API. We then split the text into sentences, preprocess each sentence, stem each word, and apply TextRank to obtain the most important sentences of the whole document.

## Reading the data

We download an article from Wikipedia.

In [None]:
wiki = wikipedia.page('Expropiación del petróleo en México')
book = wiki.content
print(book)

The Mexican oil expropriation (Spanish: expropiación petrolera) was the nationalization of all petroleum reserves, facilities, and foreign oil companies in Mexico on March 18, 1938.  In accordance with Article 27 of the Constitution of 1917, President Lázaro Cárdenas declared that all mineral and oil reserves found within Mexico belong to "the nation", i.e., the federal government. The Mexican government established a state-owned petroleum company, Petróleos Mexicanos, or PEMEX.  For a short period, this measure caused an international boycott of Mexican products in the following years, especially by the United States, the United Kingdom, and the Netherlands, but with the outbreak of World War II and the alliance between Mexico and the Allies, the disputes with private companies over compensation were resolved. The anniversary, March 18, is now a Mexican civic holiday.


== Background ==

On August 16, 1935, the Petroleum Workers Union of Mexico (Sindicato de Trabajadores Petroleros de

## Processing

We split the text into sentences.

In [None]:
sentences = sent_tokenize(book)
print(f"# oraciones: {len(sentences)}")
for sentence in sentences[:3]:
    print(sentence)
    print()
    print("...Fin de la oración...")
    print()


# oraciones: 97
The Mexican oil expropriation (Spanish: expropiación petrolera) was the nationalization of all petroleum reserves, facilities, and foreign oil companies in Mexico on March 18, 1938.

...Fin de la oración...

In accordance with Article 27 of the Constitution of 1917, President Lázaro Cárdenas declared that all mineral and oil reserves found within Mexico belong to "the nation", i.e., the federal government.

...Fin de la oración...

The Mexican government established a state-owned petroleum company, Petróleos Mexicanos, or PEMEX.

...Fin de la oración...



In [None]:
# List comprehension example
lista = []
for i in range(9):
  lista.append('Hola')

print(lista)

# Another way to build the same list
otra_lista = ['Hola' for i in range(9)]

print(otra_lista)

['Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola']
['Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola', 'Hola']


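We convert to lowercase, remove stopwords, strip punctuation, and stem each word. Note that the stopword check runs on the original-case token, so a capitalized stopword such as "The" survives as 'the', and purely numeric tokens such as "1938" are reduced to empty strings, as the output shows.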

In [None]:
sent_low = [[stemmer.stem(re.sub('[^a-z]', "", word.lower())) for word in word_tokenize(sentence) if word not in cached_stopwords and len(word) > 2] for sentence in sentences]
sent_low[0]

['the',
 'mexican',
 'oil',
 'expropri',
 'spanish',
 'expropiacin',
 'petrolera',
 'nation',
 'petroleum',
 'reserv',
 'facil',
 'foreign',
 'oil',
 'compani',
 'mexico',
 'march',
 '']
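
The output above still contains 'the' and an empty string. A minimal sketch of a stricter cleaning step follows; it lowercases and strips non-letters before filtering, so stopwords and empty tokens are dropped. The function name `clean_tokens` and its parameters are hypothetical and are not part of the notebook, and the stemmer is passed in as a plain callable so the sketch runs without NLTK data.

```python
import re


def clean_tokens(tokens, stopwords, stem=lambda word: word, min_len=3):
    """Lowercase, strip non-letters, drop stopwords/short tokens, then stem.

    Hypothetical helper: `stem` is any callable mapping a word to its stem
    (identity by default).
    """
    cleaned = []
    for token in tokens:
        # Normalize first, so the stopword test sees the lowercase form
        word = re.sub(r"[^a-z]", "", token.lower())
        if len(word) < min_len or word in stopwords:
            continue  # skips '', short tokens, and stopwords
        cleaned.append(stem(word))
    return cleaned


tokens = ["The", "Mexican", "oil", "expropriation", "on", "March", "18", ",", "1938", "."]
print(clean_tokens(tokens, stopwords={"the", "on"}))
```

In the notebook this could be called as `clean_tokens(word_tokenize(sentence), set(cached_stopwords), stemmer.stem)`. Doing so would change the token lists and therefore every downstream number, so the recorded outputs in this notebook would no longer match.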

# TextRank

We build the adjacency/similarity matrix A between sentences, taking the number of words that appear in both sentences as the similarity between them.

In [None]:
A = np.zeros((len(sent_low), len(sent_low)))

for i in range(len(sentences)):
    if i % 100 == 0:
        print(i, end=", ")
        if i % 1000 == 0:
            print()
    for j in range(i+1, len(sentences)):
        # The similarity between two sentences is the number of words they have in common
        A[i][j] = A[j][i] = len([x for x in sent_low[i] if x in sent_low[j]])

0, 


In [None]:
[x for x in sent_low[0] if x in sent_low[1]]

['oil', 'nation', 'reserv', 'oil', 'mexico', '']
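
As the output above shows, the list-based overlap counts 'oil' twice and also counts the empty token. A possible variant, sketched here under the assumption that each distinct word should count once, uses set intersection. The name `overlap_matrix` is hypothetical, and this is not the similarity used in the rest of the notebook.

```python
import numpy as np


def overlap_matrix(token_lists):
    """Symmetric matrix whose (i, j) entry is the number of distinct shared tokens."""
    token_sets = [set(tokens) for tokens in token_lists]
    n = len(token_sets)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # The diagonal stays at 0: a sentence is not compared with itself
            A[i, j] = A[j, i] = len(token_sets[i] & token_sets[j])
    return A


demo = [["oil", "oil", "nation"], ["oil", "nation", "mexico"], ["war"]]
print(overlap_matrix(demo))
```

Converting each sentence to a set once also avoids the repeated linear `in` lookups on lists, which matters for longer texts.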

This is what a fragment of the matrix A looks like.

In [None]:
A[:5, :5]

array([[0., 6., 4., 3., 3.],
       [6., 0., 1., 1., 0.],
       [4., 1., 0., 2., 2.],
       [3., 1., 2., 0., 1.],
       [3., 0., 2., 1., 0.]])

We normalize the rows of A so that each row sums to 1. Because A is symmetric, its row sums and column sums coincide.

In [None]:
# We compare each sentence with every other one, but not with itself
# Row sums (equal to the column sums because A is symmetric)
suma = np.sum(A, axis=1)
A_norm = A.copy()
for i in range(len(sentences)):
  if suma[i] != 0:
    A_norm[i,:] = A[i,:]/suma[i]
A_norm[:5, :5]

array([[0.        , 0.0244898 , 0.01632653, 0.0122449 , 0.0122449 ],
       [0.04225352, 0.        , 0.00704225, 0.00704225, 0.        ],
       [0.03669725, 0.00917431, 0.        , 0.01834862, 0.01834862],
       [0.02654867, 0.00884956, 0.01769912, 0.        , 0.00884956],
       [0.06521739, 0.        , 0.04347826, 0.02173913, 0.        ]])

In [None]:
A_norm[0,:].sum()

1.0
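
The same row normalization can be written without an explicit loop. This is a sketch only: `row_normalize` is a hypothetical name, and rows that sum to zero (sentences sharing no words with any other) are left as zeros, as in the loop above.

```python
import numpy as np


def row_normalize(A):
    """Divide every row by its sum; all-zero rows are left untouched."""
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    # `where` skips the division for zero rows; `out` supplies their zeros
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums != 0)


A_demo = np.array([[0, 6, 4], [6, 0, 1], [0, 0, 0]])
print(row_normalize(A_demo).sum(axis=1))
```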

We initialize the TextRank vector with ones and iterate until it converges, that is, until we obtain $\Pi$ (as a column vector) such that $$\Pi = A^{\top}~\Pi$$ where $A$ denotes the row-normalized matrix `A_norm`.

In [None]:
# Nicer printing (no scientific notation)
np.set_printoptions(suppress=True)

In [None]:
# Tolerance for the difference when comparing successive iterates
tol = 1e-7

PI_ = np.ones(A_norm.shape[1])
A_norm_a = A_norm.T.copy()

i = 0
while True:
    pi_ = A_norm_a @ PI_
    print(i, abs(PI_- pi_).sum())
    if np.allclose(PI_, pi_, rtol=tol):
        break
    i += 1
    PI_ = pi_

0 47.841945948307036
1 10.423210816947975
2 1.6267680404269091
3 0.47305104375913454
4 0.1697209666478503
5 0.07167945054092646
6 0.03236976952734723
7 0.015096674774192107
8 0.007051656689697528
9 0.003309432357099292
10 0.0015530052346448588
11 0.0007284412298238906
12 0.00034149263573162933
13 0.0001600269319367159
14 7.496401704629001e-05
15 3.510705932405385e-05
16 1.643770696345935e-05
17 7.695100166790064e-06
18 3.6018838120589064e-06
19 1.6857787140337616e-06
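
The loop above runs until the iterates stop changing, so it never ends if they do not converge. A hedged sketch of the same power iteration with an iteration cap follows. `power_iteration` and `max_iter` are hypothetical names, and the stopping rule here is the summed absolute difference printed above rather than `np.allclose`.

```python
import numpy as np


def power_iteration(P, tol=1e-7, max_iter=1000):
    """Iterate pi <- P^T pi from a vector of ones until the change is below tol.

    P is assumed to be row-normalized. Returns (pi, iterations used).
    """
    pi = np.ones(P.shape[0])
    for iteration in range(1, max_iter + 1):
        new_pi = P.T @ pi
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi, iteration
        pi = new_pi
    raise RuntimeError(f"no convergence after {max_iter} iterations")


# Small symmetric similarity matrix, row-normalized by hand
A_demo = np.array([[0.0, 2.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
P_demo = A_demo / A_demo.sum(axis=1, keepdims=True)
scores, used = power_iteration(P_demo)
print(scores, used)
```

For a symmetric similarity matrix the converged scores are proportional to the row sums of A, which gives a quick sanity check on small examples.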


In [None]:
pi_

array([2.60284811, 1.50858953, 1.1580018 , 1.20049729, 0.48869801,
       1.98666363, 2.7622061 , 0.15935804, 0.64805605, 0.70117538,
       0.91365278, 1.01989148, 1.56170883, 0.92427665, 1.33860758,
       1.33860758, 1.51921338, 1.76356239, 0.37183546, 0.3505877 ,
       0.71179925, 1.45547013, 1.73169076, 1.15800178, 1.77418626,
       0.67992765, 0.77554247, 0.59493672, 0.89240507, 1.25361661,
       1.31735984, 0.94552439, 0.29746835, 1.01989148, 0.67992765,
       1.25361665, 1.13675407, 1.48734172, 0.43557864, 1.52983722,
       0.55244121, 1.8910488 , 1.16862567, 1.31735984, 0.03187161,
       0.38245931, 1.59358048, 2.76220618, 0.10623869, 0.65867995,
       0.99864376, 0.01062387, 0.01062387, 1.50858954, 0.13811031,
       2.12477399, 1.16862569, 1.5829566 , 0.55244123, 1.41297468,
       1.28548827, 1.85917725, 1.46609404, 0.77554252, 0.25497288,
       0.8817812 , 0.20185353, 0.73304702, 0.74367089, 1.62545207,
       0.91365281, 0.93490056, 0.52056963, 0.72242315, 0.33996

Alternatively, we can compute the left eigenvectors of our matrix A_norm. The PageRank values correspond to the steady-state probability vector of the matrix A, which in turn is the left eigenvector associated with eigenvalue 1 (with $\Pi$ written as a row vector).

$$\Pi = \Pi A$$

In [None]:
D, vecs = splinalg.eig(A_norm, left=True, right=False)

In [None]:
D

array([ 1.        +0.j,  0.46786796+0.j,  0.35654311+0.j,  0.28575201+0.j,
        0.22966652+0.j,  0.1989689 +0.j,  0.17567107+0.j,  0.1713006 +0.j,
        0.15091118+0.j,  0.12943372+0.j,  0.12343474+0.j,  0.11488601+0.j,
       -0.18573025+0.j, -0.15428929+0.j,  0.09142279+0.j,  0.09007774+0.j,
        0.07700577+0.j,  0.07151512+0.j,  0.06683931+0.j,  0.06119591+0.j,
        0.05261238+0.j,  0.05151214+0.j, -0.13172345+0.j, -0.12847411+0.j,
        0.04863509+0.j, -0.1237895 +0.j, -0.12109938+0.j,  0.03836986+0.j,
        0.03216803+0.j, -0.11632875+0.j, -0.11474151+0.j,  0.02365264+0.j,
       -0.11005911+0.j, -0.11136241+0.j,  0.01968514+0.j, -0.10309623+0.j,
        0.01628221+0.j, -0.10000285+0.j, -0.09827466+0.j, -0.09419776+0.j,
        0.01198295+0.j,  0.01113738+0.j,  0.00870103+0.j, -0.09216607+0.j,
       -0.10869565+0.j, -0.08917937+0.j, -0.08512432+0.j,  0.00200263+0.j,
       -0.00434442+0.j, -0.00253741+0.j, -0.00811984+0.j, -0.00980335+0.j,
       -0.01224474+0.j, -

In [None]:
vecs.shape

(97, 97)

In [None]:
vecs

array([[ 0.23040363,  0.0631518 ,  0.1197814 , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.13354006,  0.12178215,  0.04340418, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.1025061 , -0.05603217,  0.0668489 , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  1.        ],
       [ 0.09122103, -0.0270364 ,  0.09135856, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.04419988,  0.14596026, -0.19929998, ...,  0.        ,
         0.        ,  0.        ]])

In [None]:
pi2_ = vecs[:, 0]
pi2_

array([0.23040363, 0.13354006, 0.1025061 , 0.1062678 , 0.04325946,
       0.1758591 , 0.24450997, 0.01410634, 0.0573658 , 0.06206792,
       0.08087638, 0.09028061, 0.13824218, 0.0818168 , 0.1184933 ,
       0.1184933 , 0.13448049, 0.15611021, 0.0329148 , 0.03103396,
       0.06300834, 0.12883795, 0.15328895, 0.1025061 , 0.15705064,
       0.06018707, 0.06865088, 0.05266369, 0.07899553, 0.11096991,
       0.11661245, 0.08369764, 0.02633184, 0.09028061, 0.06018707,
       0.11096991, 0.10062526, 0.13165922, 0.03855734, 0.13542091,
       0.04890199, 0.16739529, 0.10344653, 0.11661245, 0.00282127,
       0.03385523, 0.14106345, 0.24450997, 0.00940423, 0.05830622,
       0.08839976, 0.00094042, 0.00094042, 0.13354006, 0.0122255 ,
       0.1880846 , 0.10344653, 0.14012302, 0.04890199, 0.12507626,
       0.11379118, 0.16457402, 0.12977837, 0.06865088, 0.02257015,
       0.07805511, 0.01786804, 0.06488919, 0.06582961, 0.14388472,
       0.08087638, 0.08275722, 0.04608073, 0.06394876, 0.03009
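
Taking column 0 works here because the eigenvalue 1 happens to come first, but the order of the eigenvalues is not guaranteed. The sketch below selects the eigenvalue closest to 1 explicitly and rescales the vector to sum to 1. It uses `numpy.linalg.eig` on the transpose, which is equivalent to asking for left eigenvectors; `stationary_vector` is a hypothetical name.

```python
import numpy as np


def stationary_vector(P):
    """Left eigenvector of P for the eigenvalue closest to 1, scaled to sum to 1."""
    # Right eigenvectors of P^T are the left eigenvectors of P
    vals, vecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(vals - 1))
    v = np.real(vecs[:, idx])
    # Dividing by the sum also removes the arbitrary sign of the eigenvector
    return v / v.sum()


A_demo = np.array([[0.0, 2.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
P_demo = A_demo / A_demo.sum(axis=1, keepdims=True)
print(stationary_vector(P_demo))
```

This also explains why `pi_` and `pi2_` above hold different numbers yet give the same ranking: eigenvectors are only defined up to a scale factor, and the two methods scale the vector differently.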

We take the indices of the k largest values in $\Pi$ and use them to retrieve the most relevant sentences.

In [None]:
k = 4
pi_.argsort()[-k:][::-1]

array([47,  6,  0, 55])

In [None]:
k = 4
pi2_.argsort()[-k:][::-1]

array([47,  6,  0, 55])
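
The indices above are ordered by score. For a summary that reads in the order of the source text, one option is to sort the selected indices afterwards. A small sketch, with the hypothetical helper `top_k_indices`:

```python
import numpy as np


def top_k_indices(scores, k, document_order=False):
    """Indices of the k largest scores, by descending score or by position."""
    top = np.argsort(scores)[-k:][::-1]
    return np.sort(top) if document_order else top


demo_scores = [0.1, 0.9, 0.5, 0.7]
print(top_k_indices(demo_scores, 3))
print(top_k_indices(demo_scores, 3, document_order=True))
```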

In [None]:
summary = [sentences[idx] for idx in pi_.argsort()[-k:][::-1]]

In [None]:
summary

['== Oil Expropriation Day, March 18, 1938 ==\nOn March 18, 1938 President Cárdenas embarked on the expropriation of all oil resources and facilities by the state, nationalizing the U.S. and Anglo-Dutch (Mexican Eagle Petroleum Company) operating companies.',
 'The foreign oil companies refused to sign the agreement, and counter offered with a payment of 14 million pesos toward wages and benefits.On November 3, 1937, the union demanded that the companies sign the collective agreement and on May 17, the union summoned a strike in case their demands were not met.',
 'The Mexican oil expropriation (Spanish: expropiación petrolera) was the nationalization of all petroleum reserves, facilities, and foreign oil companies in Mexico on March 18, 1938.',
 '== Opposition ==\n\n\n=== International ===\nIn retaliation, the oil companies initiated a public relations campaign against Mexico, urging people to stop buying Mexican goods and lobbying to embargo U.S. technology to Mexico.']

Finally, all that remains is to see which sentences TextRank considered the most important.

In [None]:
for bullet in summary:
    print('___________')
    print(bullet)

___________
== Oil Expropriation Day, March 18, 1938 ==
On March 18, 1938 President Cárdenas embarked on the expropriation of all oil resources and facilities by the state, nationalizing the U.S. and Anglo-Dutch (Mexican Eagle Petroleum Company) operating companies.
___________
The foreign oil companies refused to sign the agreement, and counter offered with a payment of 14 million pesos toward wages and benefits.On November 3, 1937, the union demanded that the companies sign the collective agreement and on May 17, the union summoned a strike in case their demands were not met.
___________
The Mexican oil expropriation (Spanish: expropiación petrolera) was the nationalization of all petroleum reserves, facilities, and foreign oil companies in Mexico on March 18, 1938.
___________
== Opposition ==


=== International ===
In retaliation, the oil companies initiated a public relations campaign against Mexico, urging people to stop buying Mexican goods and lobbying to embargo U.S. technolo

We can translate the output.

In [None]:
!pip install sacremoses

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Approx. 34 s for the first 10 sentences
for bullet in summary:
    print()
    print(translate(bullet, "es"))


== Día de Expropiación de Petróleo, 18 de marzo de 1938 ==El 18 de marzo de 1938 el presidente Cárdenas se embarcó en la expropiación de todos los recursos e instalaciones petroleras por el estado, nacionalizando las compañías operadoras estadounidenses y angloholandesas (Mexican Eagle Petroleum Company).

El 3 de noviembre de 1937, el sindicato exigió a las empresas que firmaran el convenio colectivo y el 17 de mayo, el sindicato convocó una huelga en caso de que no se cumplieran sus demandas.

La expropiación petrolera mexicana fue la nacionalización de todas las reservas de petróleo, instalaciones y compañías petroleras extranjeras en México el 18 de marzo de 1938.

== Oposición ===== Internacional ===En represalia, las compañías petroleras iniciaron una campaña de relaciones públicas contra México, instando a la gente a dejar de comprar bienes mexicanos y a presionar para que la tecnología estadounidense sea objeto de embargo a México.


# A function for creating summaries

We can condense everything above into a function that takes a text and returns the most relevant sentences according to TextRank.

In [None]:
def summary(text, k, to_spanish = True, tol = 1e-5, d = .15, eig = False):
    # NOTE: d (damping factor) is accepted but not used by this implementation.
    print("Paso 1. Obteniendo oraciones")
    sentences = sent_tokenize(text)

    print(f"# oraciones: {len(sentences)}")

    print("Paso 2. Procesando texto")
    sent_low = [[stemmer.stem(re.sub('[^a-z]', "", word.lower())) for word in word_tokenize(sentence) if word not in cached_stopwords and len(word) > 2] for sentence in sentences]

    print("Paso 3. Creando matriz de similitud")
    A = np.zeros((len(sent_low), len(sent_low)))

    for i in range(len(sentences)):
        for j in range(i+1, len(sentences)):
            # The similarity between two sentences is the number of words they have in common
            A[i][j] = A[j][i] = len([x for x in sent_low[i] if x in sent_low[j]])

    print("Paso 4. Normalizando matriz de similitud")
    # Row sums (equal to the column sums because A is symmetric)
    suma = np.sum(A, axis=1)
    A_norm = A.copy()
    for i in range(len(sentences)):
      if suma[i] != 0:
        A_norm[i,:] = A[i,:]/suma[i]

    print("Paso 5. Ejecutando TextRank")
    if eig:
        vals, vecs = splinalg.eig(A_norm, left=True, right=False)
        # The eigenvalue order is not guaranteed: pick the one closest to 1,
        # then drop the zero imaginary part and the arbitrary sign.
        pi_ = np.abs(vecs[:, np.argmin(np.abs(vals - 1))].real)
    else:
        A_norm_a = A_norm.T.copy()
        PI_ = np.ones(A_norm.shape[1])

        while True:
            pi_ = A_norm_a.dot(PI_)
            # tol is used as the relative tolerance
            if np.allclose(PI_, pi_, rtol=tol):
                break
            PI_ = pi_

    print("\tPaso 5. Terminado")

    if not to_spanish:
        return [sentences[idx] for idx in pi_.argsort()[-k:][::-1]]

    print("Paso 6. Traduciendo")
    return [translate(sentences[idx], "es") for idx in pi_.argsort()[-k:][::-1]]

def print_bullet_points(bullet_points):
    for point in bullet_points:
        print(f"- {point}\n")


In [None]:
wiki = wikipedia.page('Automatic summarization')
text = wiki.content
bullet_points = summary(text, 5, False, eig = True)

Paso 1. Obteniendo oraciones
# oraciones: 317
Paso 2. Procesando texto
Paso 3. Creando matriz de similitud
Paso 4. Normalizando matriz de similitud
Paso 5. Ejecutando TextRank
	Paso 5. Terminado


In [None]:
print_bullet_points(bullet_points)

- ==== Maximum entropy-based summarization ====
During the DUC 2001 and 2002 evaluation workshops, TNO developed a sentence extraction system for multi-document summarization in the news domain.

- === Document summarization ===
Like keyphrase extraction, document summarization aims to identify the essence of a text.

- The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".

- ==== TextRank and LexRank ====
The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data.

- Similar results were achieved with the use of determinantal point processes (which are a special case of submodular functions) for DUC-04.A new method for multi-lingual multi-document summarization that avoids redundancy generates ideograms

In [None]:
!wget https://www.gutenberg.org/files/84/84-0.txt -O book.txt

--2023-04-28 21:25:29--  https://www.gutenberg.org/files/84/84-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 448642 (438K) [text/plain]
Saving to: ‘book.txt’


2023-04-28 21:25:31 (875 KB/s) - ‘book.txt’ saved [448642/448642]



In [None]:
with open("book.txt") as f:
    book_raw = f.read()

print(book_raw[0:1000])

﻿The Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft Shelley

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Frankenstein
       or, The Modern Prometheus

Author: Mary Wollstonecraft Shelley

Release Date: October 31, 1993 [eBook #84]
[Most recently updated: December 2, 2022]

Language: English

Character set encoding: UTF-8

Produced by: Judith Boss, Christy Phillips, Lynn Hanninen and David Meltzer. HTML version by Al Haines.
Further corrections by Menno de Leeuw.

*** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***




Frankenstein;

or, the Modern Prometheus

by 

In [None]:
start = book_raw.rfind("Chapter 1\n")
end = book_raw.rfind('Chapter 2\n')

In [None]:
chapter_n = book_raw[start + len("Chapter 1\n"): end]

In [None]:
chapter_n

'\n\nI am by birth a Genevese, and my family is one of the most\ndistinguished of that republic. My ancestors had been for many years\ncounsellors and syndics, and my father had filled several public\nsituations with honour and reputation. He was respected by all who\nknew him for his integrity and indefatigable attention to public\nbusiness. He passed his younger days perpetually occupied by the\naffairs of his country; a variety of circumstances had prevented his\nmarrying early, nor was it until the decline of life that he became a\nhusband and the father of a family.\n\nAs the circumstances of his marriage illustrate his character, I cannot\nrefrain from relating them. One of his most intimate friends was a\nmerchant who, from a flourishing state, fell, through numerous\nmischances, into poverty. This man, whose name was Beaufort, was of a\nproud and unbending disposition and could not bear to live in poverty\nand oblivion in the same country where he had formerly been\ndistinguish

In [None]:
bullet_points = summary(chapter_n, 5, False, eig = False)

Paso 1. Obteniendo oraciones
# oraciones: 74
Paso 2. Procesando texto
Paso 3. Creando matriz de similitud
Paso 4. Normalizando matriz de similitud
Paso 5. Ejecutando TextRank
	Paso 5. Terminado


In [None]:
print_bullet_points(bullet_points)

- Her father grew worse; her time
was more entirely occupied in attending him; her means of subsistence
decreased; and in the tenth month her father died in her arms, leaving
her an orphan and a beggar.

- One day, when my father had gone by himself to Milan, my mother,
accompanied by me, visited this abode.

- When my father returned from Milan, he found playing with me in the hall of
our villa a child fairer than pictured cherub—a creature who seemed
to shed radiance from her looks and whose form and motions were lighter
than the chamois of the hills.

- The father of their
charge was one of those Italians nursed in the memory of the antique glory
of Italy—one among the _schiavi ognor frementi,_ who exerted
himself to obtain the liberty of his country.

- He passed his younger days perpetually occupied by the
affairs of his country; a variety of circumstances had prevented his
marrying early, nor was it until the decline of life that he became a
husband and the father of a family.



# Exercises

## Sentence similarity matrix

For the similarity between sentences we used the number of words that appear in both. **Replace it with cosine similarity** and compare the results.

A very good first approach could be to use Latent Semantic Analysis and compute the cosine similarity between all documents.

If you have a DataFrame with the columns ```[id_documento_1, id_documento_2, similitud]```, the function [```pandas.DataFrame.pivot```](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html) can help build the similarity matrix; it takes "index", "columns" and "values" as arguments.
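
As a starting point for this exercise, here is a minimal sketch of cosine similarity over plain bag-of-words counts. It does not use Latent Semantic Analysis; that step is left to the exercise. The function name `cosine_similarity_matrix` is hypothetical, and the diagonal is zeroed to match the convention of the matrix A in this notebook.

```python
from collections import Counter

import numpy as np


def cosine_similarity_matrix(token_lists):
    """Cosine similarity between sentences represented as word-count vectors."""
    vocab = sorted({token for tokens in token_lists for token in tokens})
    index = {token: i for i, token in enumerate(vocab)}

    # One row per sentence, one column per vocabulary word
    X = np.zeros((len(token_lists), len(vocab)))
    for row, tokens in enumerate(token_lists):
        for token, count in Counter(tokens).items():
            X[row, index[token]] = count

    # Scale each row to unit length; empty sentences stay as zero rows
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = np.divide(X, norms, out=np.zeros_like(X), where=norms != 0)

    S = X_unit @ X_unit.T
    np.fill_diagonal(S, 0.0)  # a sentence is not compared with itself
    return S


demo = [["oil", "nation"], ["nation", "oil", "oil"], ["war"]]
print(cosine_similarity_matrix(demo))
```

The resulting matrix is symmetric and non-negative, so it can replace A in the normalization and iteration steps without further changes.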




## Sentences vs. words

In this notebook we ranked sentences to obtain the summary; had we ranked words instead, TextRank would give us the keywords of the text.

Implement TextRank with words. For the similarity (or adjacency) matrix, you can link consecutive words, or define a window of k consecutive words in each sentence (similar to skip-gram) and link all the words inside it. In this case, the matrix A would have the dimension of the vocabulary (the list of unique words) and would hold a 1 wherever two words are linked.

Another alternative would be to use a word embedding (e.g. word2vec) and compute the cosine similarity between the word vectors to fill A.

After that, everything else stays the same.
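
The window-based adjacency matrix described above could be sketched as follows. The name `cooccurrence_matrix` and the convention that `window` counts how many following words are linked to each word (so `window=1` links only consecutive words) are assumptions made for illustration.

```python
import numpy as np


def cooccurrence_matrix(token_lists, window=1):
    """Binary word adjacency: 1 if two words occur within `window` positions."""
    vocab = sorted({token for tokens in token_lists for token in tokens})
    index = {token: i for i, token in enumerate(vocab)}
    A = np.zeros((len(vocab), len(vocab)))
    for tokens in token_lists:
        for pos, word in enumerate(tokens):
            # Link the word to the next `window` words in the same sentence
            for other in tokens[pos + 1 : pos + 1 + window]:
                if other != word:
                    A[index[word], index[other]] = 1
                    A[index[other], index[word]] = 1
    return vocab, A


vocab, A_words = cooccurrence_matrix([["oil", "company", "strike"]], window=1)
print(vocab)
print(A_words)
```

The returned `vocab` list maps row and column positions back to words, which is needed to read off the keywords once the scores are computed.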

## Language *

This example is built for English text because of the stopwords and the stemmer (PorterStemmer) it uses. Make the changes needed for it to accept Spanish text.

That is, change the stopwords (nltk ships Spanish stopwords) and the stemmer (hint: ```nltk.stem``` has more stemmers, and one of them has an algorithm for Spanish).

## Summary of a topic *

Here we applied TextRank to a single document. We could instead take a corpus of documents on the same topic (e.g. news about the AIFA) and apply it to obtain the key points of the whole corpus.

Nothing in the current implementation needs to change; just concatenate the whole corpus into a single string.

Exercise: Build a corpus with 4 articles on a topic of interest, concatenate them, and pass the result as the argument to the ```summary``` function.

## Improving the ```summary``` function

We can split the function's code into modular pieces, which allows some freedom at run time. For example, we could have several functions that compute the matrix A in different ways, and have ```summary``` call one of them according to a parameter.

Exercise: Create a function for each step of ```summary```.
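
One possible shape for such a modular version is sketched below. It is an outline under assumptions, not a reference solution: all function names are hypothetical, the input is taken to be already tokenized sentences, and the similarity measure is passed in as a parameter.

```python
import numpy as np


def overlap_similarity(tokens_a, tokens_b):
    """Number of distinct tokens shared by two sentences."""
    return len(set(tokens_a) & set(tokens_b))


def similarity_matrix(token_lists, similarity):
    """Symmetric matrix built from any pairwise similarity function."""
    n = len(token_lists)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = similarity(token_lists[i], token_lists[j])
    return A


def textrank_scores(A, tol=1e-7, max_iter=1000):
    """Row-normalize A, then run power iteration from a vector of ones."""
    row_sums = A.sum(axis=1, keepdims=True)
    P = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums != 0)
    scores = np.ones(len(A))
    for _ in range(max_iter):
        previous, scores = scores, P.T @ scores
        if np.abs(scores - previous).sum() < tol:
            break
    return scores


def rank_sentences(token_lists, k, similarity=overlap_similarity):
    """Indices of the k highest-scoring sentences, best first."""
    A = similarity_matrix(token_lists, similarity)
    scores = textrank_scores(A)
    return [int(i) for i in np.argsort(scores)[-k:][::-1]]


demo = [["oil", "nation", "mexico"], ["oil", "nation"],
        ["oil", "mexico", "strike"], ["weather"]]
print(rank_sentences(demo, k=2))
```

Swapping in another similarity measure then only requires passing a different function as `similarity`, without touching the ranking code.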

# On computing PageRank values

https://nlp.stanford.edu/IR-book/html/htmledition/the-pagerank-computation-1.html

https://nlp.stanford.edu/IR-book/html/htmledition/markov-chains-1.html