# Text Summarization for World History Encyclopedia (a website)
> Practical approaches to summarizing articles from the World History Encyclopedia website.

- toc: true 
- badges: true
- comments: true
- categories: [jupyter,machine learning,tf-idf,text summarization,rouge,LSA,TextRank]
- image: images/World_history_encyclopedia.png

In [1]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4


In [2]:
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import tokenize
import matplotlib.pyplot as plt
import html
import re
import random
import rouge_score

# Tip from Practical NLP textbook

Categories of common text summarization tasks:
 - Extractive vs. abstractive summarization
 - Query-focused vs. query-independent summarization
 - Single-document vs. multi-document summarization

Most common case:
 - Single-document, query-independent, extractive summarization.

Limitations:
 - Off-the-shelf summarizers may need to be customized for your own use case.
 - ROUGE, as an evaluation metric for summarization, may also need to be customized for the research problem.
 - Summarization is sensitive to the size of the input text. A better approach may be to run the summarizer separately on different parts of the text.
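
The last limitation, summarizing long texts part by part, can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the textbook: `split_sentences`, `summarize_in_chunks`, and `first_sentences` are hypothetical names, the sentence splitter is a naive regex, and the stand-in summarizer simply keeps the first sentences of each chunk. Any function that takes a text and a sentence count and returns a list of sentences could be passed in its place.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def summarize_in_chunks(text, summarizer, chunk_size=20, per_chunk=2):
    """Summarize each block of `chunk_size` sentences separately, then concatenate."""
    sentences = split_sentences(text)
    summary = []
    for start in range(0, len(sentences), chunk_size):
        chunk = ' '.join(sentences[start:start + chunk_size])
        summary.extend(summarizer(chunk, per_chunk))
    return summary

def first_sentences(text, n):
    # Stand-in summarizer (hypothetical): keep the first n sentences of the chunk.
    return split_sentences(text)[:n]

toy_text = "A one. A two. A three. B one. B two. B three."
print(summarize_in_chunks(toy_text, first_sentences, chunk_size=3, per_chunk=1))
```

The chunk size and the number of sentences kept per chunk are tuning choices; they control how evenly the summary covers the beginning, middle, and end of a long article.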

<font size="6">Incentive:</font>



Being a well-rounded person is not easy, but general knowledge from outside our own profession, especially knowledge of world history, can give us another perspective on the world and more empathy toward other people and environments. Probably the most effective way to become well-rounded is to read books from a variety of genres. However, many people today are occupied with personal hardships and relationships and lack the time to read every book from beginning to end. This notebook offers some help to anyone who wants to learn history from the website **World History Encyclopedia**: we further summarize its already condensed history articles. A summary can save you the time of reading the whole article or, if you do have time to read it, help you grasp the article more quickly by reading the summary first.

# Text Summarization

In this notebook, we focus on extractive methods in text summarization.


## Data pre-processing 


In [3]:
import requests
from bs4 import BeautifulSoup
import os.path
from dateutil import parser
import pandas as pd
import numpy as np
import re
import os

In [4]:
BASE_DIR = '/content'

In [5]:
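def download_article(url):
    # Name the local file after the last path segment of the URL
    filename = url.split('/')[-2] + ".html"
    # Create the cache directory under BASE_DIR, which is where the file is written
    article_dir = f"{BASE_DIR}/world_history_encyclopedia"
    os.makedirs(article_dir, exist_ok=True)
    filename = f"{article_dir}/{filename}"
    # Download only if the article is not already cached
    if not os.path.isfile(filename):
        r = requests.get(url)
        with open(filename, "w+") as f:
            f.write(r.text)
    return filename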

In [6]:
def clean_article(soup_source):
    r = re.compile("(Sign up|news letter|\n)")
    texts = ""
    for t in soup_source.select('p'):
        if 'World History Encyclopedia' in t.text:
            break
        else:
            if not r.match(t.text):
                texts += t.text + ' '
    return texts

In [7]:
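def parse_article(article_file):
    # Read the cached HTML (named html_text to avoid shadowing the imported html module)
    with open(article_file, "r") as f:
        html_text = f.read()
    r = {}
    soup = BeautifulSoup(html_text, 'html.parser')
    r['headline'] = soup.h1.text
    # The first <p> element serves later as the reference summary
    r['first_paragraph'] = soup.p.text
    r['text'] = clean_article(soup)
    return r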

In [8]:
import reprlib
r = reprlib.Repr()
r.maxstring = 800

url1 = "https://www.worldhistory.org/article/1596/mulan-the-legend-through-history/"
article_name1 = download_article(url1)
article1 = parse_article(article_name1)
print (r.repr(article1['text']))

"Mulan (“magnolia”) is a legendary character in Chinese literature who is best known in the modern day from the Disney filmed adaptations (1998, 2020). Her story, however, about a young girl who takes her father's place in the army to help save her country, is hundreds of years old The tale most likely originated in the Northern Wei Period (386-535 CE) of China before it was developed by succeed... Ja Quan) which combines tai chi with dance, Kung Fu, and other arts to create a unique form of self-defense and personal improvement. The discipline is intended to encourage the confidence, strength, and grace of Mulan in modern-day practitioners and is only one of the many examples of how the legend of Mulan continues to inspire people today, especially women, just as it has done in the past. "


## Summarizing text using topic representation

Topic representation methods distinguish important sentences by identifying topics of sentences through important words.

### Baseline: Identifying important sentences with sum of TF-IDF values

In this simple technique, we sum the TF-IDF values of the words in each sentence and use that sum to decide whether to include the sentence in the summary.
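
To make the scoring concrete, here is a small pure-Python sketch of the idea on three toy sentences. It is an illustration under assumptions, not the notebook's implementation: `tfidf_sentence_scores` is a hypothetical helper, the toy sentences are made up, and the weighting is a simplified `count * (log(n / df) + 1)` without normalization. scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of the TF-IDF weights of its words."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Simplified weighting (assumption): raw count * (log(n / df) + 1).
        scores.append(sum(c * (math.log(n / df[w]) + 1) for w, c in tf.items()))
    return scores

toy = ["mulan joins the army", "the army wins", "the end"]
scores = tfidf_sentence_scores(toy)

# Pick the two highest-scoring sentences, then restore article order.
top = sorted(range(len(toy)), key=lambda i: scores[i], reverse=True)[:2]
print([toy[i] for i in sorted(top)])
```

Because the score is a plain sum, longer sentences with many rare words tend to rank higher, which is worth keeping in mind when reading the summaries produced this way.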

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import tokenize
import nltk

nltk.download('punkt')

sentences = tokenize.sent_tokenize(article1['text'])
tfidfVectorizer = TfidfVectorizer()
words_tfidf = tfidfVectorizer.fit_transform(sentences)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [10]:
# Specify the number of summary sentences
num_summary_sentence = 10

# Sum the TF-IDF values per sentence and flatten the (n, 1) result to a 1-D array
sent_sum = np.asarray(words_tfidf.sum(axis=1)).ravel()
# Sentence indices sorted in descending order of their TF-IDF sum
important_sent = np.argsort(sent_sum)[::-1]

# Print the most important sentences in the order they appear in the article
top_sent = set(important_sent[:num_summary_sentence].tolist())
for i in range(len(sentences)):
    if i in top_sent:
        print(sentences[i])

Her story, however, about a young girl who takes her father's place in the army to help save her country, is hundreds of years old The tale most likely originated in the Northern Wei Period (386-535 CE) of China before it was developed by succeeding authors.
The play The Female Mulan (16th century CE) modifies earlier themes, moves the action back to the time of the Northern Wei, and introduces the happy ending of the marriage motif while succeeding versions conclude with Mulan killing herself to avoid the shame of having to become the emperor's concubine until the story returned the conclusion of the joyful family reunion and marriage.
The modern and ancient versions follow basically the same plot of a young girl who takes her aged father's place in the army when he is called to serve, performs her duties admirably, saves her country, and returns home to her family where she is received with honor.
In the Disney films, she is revealed as a woman but perseveres against the prohibition 

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import tokenize

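def tfidf_summary(text, num_summary_sentence):
    summary_sentence = []
    sentences = tokenize.sent_tokenize(text)
    tfidfVectorizer = TfidfVectorizer()
    words_tfidf = tfidfVectorizer.fit_transform(sentences)
    # Sum the TF-IDF values per sentence and flatten to a 1-D array
    sentence_sum = np.asarray(words_tfidf.sum(axis=1)).ravel()
    # Sentence indices in descending order of their TF-IDF sum
    important_sentences = np.argsort(sentence_sum)[::-1]
    top_sentences = set(important_sentences[:num_summary_sentence].tolist())
    # Keep the selected sentences in the order they appear in the text
    for i in range(len(sentences)):
        if i in top_sentences:
            summary_sentence.append(sentences[i])
    return summary_sentence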

In [12]:
print("Tf-IDF method:")
tfidf_summary(article1['text'], 10)

Tf-IDF method:


["Her story, however, about a young girl who takes her father's place in the army to help save her country, is hundreds of years old The tale most likely originated in the Northern Wei Period (386-535 CE) of China before it was developed by succeeding authors.",
 "The play The Female Mulan (16th century CE) modifies earlier themes, moves the action back to the time of the Northern Wei, and introduces the happy ending of the marriage motif while succeeding versions conclude with Mulan killing herself to avoid the shame of having to become the emperor's concubine until the story returned the conclusion of the joyful family reunion and marriage.",
 "The modern and ancient versions follow basically the same plot of a young girl who takes her aged father's place in the army when he is called to serve, performs her duties admirably, saves her country, and returns home to her family where she is received with honor.",
 "In the Disney films, she is revealed as a woman but perseveres against th

### LSA algorithm


LSA (latent semantic analysis) essentially applies singular value decomposition (SVD), a technique from linear algebra, to reduce the original term-by-sentence frequency matrix to a few matrices that capture the essence of the article.
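
The decomposition can be illustrated on a tiny, made-up term-by-sentence matrix. This is a sketch under assumptions rather than a description of any library's internals: the counts are invented, `k = 2` topics is an arbitrary choice, and the scoring rule (length of each sentence's vector in the reduced space, weighted by the singular values) is one common option among several.

```python
import numpy as np

# Toy term-by-sentence count matrix (rows: terms, columns: sentences); values are made up.
terms = ["mulan", "army", "father", "film", "disney"]
A = np.array([
    [1, 1, 0, 1],   # mulan
    [1, 1, 0, 0],   # army
    [1, 0, 0, 0],   # father
    [0, 0, 1, 1],   # film
    [0, 0, 1, 1],   # disney
], dtype=float)

# SVD factors A into U (terms x topics), singular values s, and Vt (topics x sentences).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest latent topics and score each sentence by the length
# of its topic vector, with each topic weighted by its singular value.
k = 2
sentence_scores = np.sqrt(((s[:k, None] * Vt[:k, :]) ** 2).sum(axis=0))
ranking = np.argsort(-sentence_scores)

print("singular values:", np.round(s, 2))
print("sentences ranked by score:", ranking)
```

The singular values indicate how much of the matrix each latent topic explains, so truncating to the first few topics keeps the dominant themes and discards the rest.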

In [13]:
!pip install sumy

Collecting sumy
  Downloading sumy-0.9.0-py2.py3-none-any.whl (87 kB)
Collecting breadability>=0.1.20
  Downloading breadability-0.1.20.tar.gz (32 kB)
Collecting pycountry>=18.2.23
  Downloading pycountry-22.1.10.tar.gz (10.1 MB)
Building wheels for collected packages: breadability, pycountry
  Building wheel 

In [14]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

from sumy.summarizers.lsa import LsaSummarizer

LANGUAGE = "english"
stemmer = Stemmer(LANGUAGE)

parser = PlaintextParser.from_string(article1['text'], Tokenizer(LANGUAGE))

summarizer = LsaSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

for sentence in summarizer(parser.document, num_summary_sentence):
    print (str(sentence))


Her story, however, about a young girl who takes her father's place in the army to help save her country, is hundreds of years old The tale most likely originated in the Northern Wei Period (386-535 CE) of China before it was developed by succeeding authors.
Scholarly consensus is that Mulan is a fictional character, probably developed in Northern China in response to the greater independence women enjoyed there, whose legend was then revised in succeeding eras to reflect the values and challenges of the times.
The original work, The Poem of Mulan, dates to the 6th century CE and reflects the influences of Mongolian-Turkic peoples on the region with a focus on filial piety the central virtue and moral of the tale.
By the time the character reached the modern era, through the film Mulan Joins the Army (1939), she was a staunch nationalist, driving out foreign invaders, and her earlier virtue of filial piety had been replaced by unwavering love for her country.
Since the appearance of th

In [15]:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer

def lsa_summary(text, num_summary_sentence):
    summary_sentence = []
    LANGUAGE = "english"
    stemmer = Stemmer(LANGUAGE)
    parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
    summarizer = LsaSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, num_summary_sentence):
        summary_sentence.append(str(sentence))
    return summary_sentence

In [19]:
print("LSA Method:")
lsa_summary(article1['text'], 10)

LSA Method:


["Her story, however, about a young girl who takes her father's place in the army to help save her country, is hundreds of years old The tale most likely originated in the Northern Wei Period (386-535 CE) of China before it was developed by succeeding authors.",
 'Scholarly consensus is that Mulan is a fictional character, probably developed in Northern China in response to the greater independence women enjoyed there, whose legend was then revised in succeeding eras to reflect the values and challenges of the times.',
 'The original work, The Poem of Mulan, dates to the 6th century CE and reflects the influences of Mongolian-Turkic peoples on the region with a focus on filial piety the central virtue and moral of the tale.',
 'By the time the character reached the modern era, through the film Mulan Joins the Army (1939), she was a staunch nationalist, driving out foreign invaders, and her earlier virtue of filial piety had been replaced by unwavering love for her country.',
 "Since th

## Summarizing text using an indicator representation

Indicator representation techniques build intermediate features that capture the relationships between sentences, instead of using only the words within each sentence.

### TextRank algorithm

TextRank is inspired by PageRank, Google's graph-based ranking algorithm. For natural language text, the authors build a graph from the text, for example with sentences as nodes and sentence similarity as edge weights, and then rank the nodes. The original paper can be found [here](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf).
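
A minimal pure-Python sketch of graph-based sentence ranking is shown below. It is a simplified illustration, not the `sumy` implementation: `similarity` and `textrank` are hypothetical helpers, tokenization is a plain whitespace split, the similarity is shared-word count divided by the sum of the log sentence lengths, and the damping factor of 0.85 and fixed 50 iterations are assumed defaults.

```python
import math

def similarity(a, b):
    """Shared-word count normalized by the log lengths of both sentences."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    denom = math.log(len(ta)) + math.log(len(tb))
    if denom <= 0:
        return 0.0
    return len(set(ta) & set(tb)) / denom

def textrank(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence-similarity graph (power iteration)."""
    n = len(sentences)
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours in proportion to edge weight.
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

toy = [
    "mulan takes her father's place in the army",
    "the army fights and mulan saves her country",
    "the disney film retells the story",
]
for sentence, score in zip(toy, textrank(toy)):
    print(round(score, 3), sentence)
```

In this toy graph the first two sentences share several words and reinforce each other, so they end up with higher scores than the third, which is only weakly connected.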

In [20]:
from sumy.summarizers.text_rank import TextRankSummarizer

parser = PlaintextParser.from_string(article1['text'], Tokenizer(LANGUAGE))
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

for sentence in summarizer(parser.document, num_summary_sentence):
    print (str(sentence))

The play The Female Mulan (16th century CE) modifies earlier themes, moves the action back to the time of the Northern Wei, and introduces the happy ending of the marriage motif while succeeding versions conclude with Mulan killing herself to avoid the shame of having to become the emperor's concubine until the story returned the conclusion of the joyful family reunion and marriage.
The story, as it is best known today through the recent films, places Mulan in an unidentified era of Imperial China (221 BCE - 1912 CE), but the original poem is set during the Northern Wei Period.
The original poem takes place during the chaotic era between the fall of the Han Dynasty (202 BCE - 220 CE) and the rise of the Sui Dynasty (589-618 CE) during which China first split into the Period of the Three Kingdoms (220-280 CE) and was then ruled by succeeding short-lived dynasties, one of which was the Wei, which established itself during the period of the Northern and Southern Dynasties (386-589 CE).
Th

In [21]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.text_rank import TextRankSummarizer

def textrank_summary(text, num_summary_sentence):
    summary_sentence = []
    LANGUAGE = "english"
    stemmer = Stemmer(LANGUAGE)
    parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
    summarizer = TextRankSummarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, num_summary_sentence):
        summary_sentence.append(str(sentence))
    return summary_sentence

In [22]:
textrank_summary(article1['text'], 10)

["The play The Female Mulan (16th century CE) modifies earlier themes, moves the action back to the time of the Northern Wei, and introduces the happy ending of the marriage motif while succeeding versions conclude with Mulan killing herself to avoid the shame of having to become the emperor's concubine until the story returned the conclusion of the joyful family reunion and marriage.",
 'The story, as it is best known today through the recent films, places Mulan in an unidentified era of Imperial China (221 BCE - 1912 CE), but the original poem is set during the Northern Wei Period.',
 'The original poem takes place during the chaotic era between the fall of the Han Dynasty (202 BCE - 220 CE) and the rise of the Sui Dynasty (589-618 CE) during which China first split into the Period of the Three Kingdoms (220-280 CE) and was then ruled by succeeding short-lived dynasties, one of which was the Wei, which established itself during the period of the Northern and Southern Dynasties (386-5

## Measuring the performance of text summarization methods (ROUGE score)

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a popular family of metrics for evaluating the output of summarization and machine translation systems. ROUGE only measures lexical overlap (matching n-grams) rather than semantic similarity between words, but it is still a handy tool for comparing generated summaries or translations against a reference.
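
The idea behind ROUGE-N can be sketched in a few lines of plain Python. This is a simplified illustration, not the `rouge_score` package: `ngrams` and `rouge_n` are hypothetical helpers, tokenization is a lowercase whitespace split, and no stemming is applied. The two example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=2):
    """Return (precision, recall, f-measure) of n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

reference = "mulan takes her father's place in the army"
candidate = "mulan takes her place in the army today"
p, r, f = rouge_n(reference, candidate, n=2)
print(f"Precision: {p:.2f} Recall: {r:.2f} fmeasure: {f:.2f}")
```

Precision divides the overlap by the number of n-grams in the generated summary, and recall divides it by the number in the reference. A summary that is much longer than its reference can therefore reach high recall while its precision stays low.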

In [23]:
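def print_rouge_score(scores):
    # Print precision, recall, and F-measure for each ROUGE variant
    # (parameter named `scores` to avoid shadowing the imported rouge_score module)
    for k, v in scores.items():
        print(k, 'Precision:', "{:.2f}".format(v.precision), 'Recall:', "{:.2f}".format(v.recall), 'fmeasure:', "{:.2f}".format(v.fmeasure))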

Since we don't have human-written summaries of these World History Encyclopedia articles, we use the first paragraph of each article as the reference summary (the gist) in the evaluation below.

# Example articles

In [83]:
class TextSummarization():

    def __init__(self, url, alg, num_sentences):
        self.article = parse_article(download_article(url))
        self.summary_alg = alg
        # Use the requested number of sentences, not the global num_summary_sentence
        self.num_sentences = num_sentences
        # Join with spaces so adjacent sentences do not run together
        self.raw_summary = ' '.join(alg(self.article['text'], num_sentences))
        # Reference summary: the article's first paragraph
        self.standard = self.article['first_paragraph']

    def rouge_score(self, n_gram):
        # ROUGE can be computed for different n-gram lengths (e.g. 1 or 2)
        scorer = rouge_scorer.RougeScorer([f'rouge{n_gram}'], use_stemmer=True)
        # score(target, prediction): the reference first, the generated summary second
        scores = scorer.score(self.standard, self.raw_summary)
        print(self.summary_alg.__name__ + ":")
        print_rouge_score(scores)

    def print_summary(self):
        print()
        print(*self.summary_alg(self.article['text'], self.num_sentences), sep='\n')
        print()


## Mulan: The Legend Through History

In [84]:
url = 'https://www.worldhistory.org/article/1596/mulan-the-legend-through-history/'
tfidf_summa = TextSummarization(url, tfidf_summary, 10)
tfidf_summa.rouge_score(2)
tfidf_summa.print_summary()
lsa_summa = TextSummarization(url, lsa_summary, 10)
lsa_summa.rouge_score(2)
lsa_summa.print_summary()
textrank_summa = TextSummarization(url, textrank_summary, 10)
textrank_summa.rouge_score(2)
textrank_summa.print_summary()

tfidf_summary:
rouge2 Precision: 0.05 Recall: 0.59 fmeasure: 0.09

Her story, however, about a young girl who takes her father's place in the army to help save her country, is hundreds of years old The tale most likely originated in the Northern Wei Period (386-535 CE) of China before it was developed by succeeding authors.
The play The Female Mulan (16th century CE) modifies earlier themes, moves the action back to the time of the Northern Wei, and introduces the happy ending of the marriage motif while succeeding versions conclude with Mulan killing herself to avoid the shame of having to become the emperor's concubine until the story returned the conclusion of the joyful family reunion and marriage.
The modern and ancient versions follow basically the same plot of a young girl who takes her aged father's place in the army when he is called to serve, performs her duties admirably, saves her country, and returns home to her family where she is received with honor.
In the Disney films,

## Genghis Khan

In [85]:
url = 'https://www.worldhistory.org/Genghis_Khan/'
tfidf_summa = TextSummarization(url, tfidf_summary, 10)
tfidf_summa.rouge_score(2)
tfidf_summa.print_summary()
lsa_summa = TextSummarization(url, lsa_summary, 10)
lsa_summa.rouge_score(2)
lsa_summa.print_summary()
textrank_summa = TextSummarization(url, textrank_summary, 10)
textrank_summa.rouge_score(2)
textrank_summa.print_summary()

tfidf_summary:
rouge2 Precision: 0.02 Recall: 0.17 fmeasure: 0.03

Genghis Khan had a fearsome reputation but he was an able administrator who introduced writing to the Mongols, created their first law code, promoted trade and granted religious freedom by permitting all religions to be freely practised anywhere in the Mongol world.
Temujin's mother was called Hoelun and his father, Yisugei, who was a tribal leader, and he arranged for his son to marry Borte (aka Bortei), the daughter of another influential Mongol leader, Dei-secen, but before this plan could come to fruition, Temujin's father was poisoned by a rival.
Things then got even worse when the young Temujin was captured by a rival clan leader, perhaps following an incident where Temujin may have killed one of his older half-brothers, Bekter, who likely represented a rival branch of the family that had taken on the legacy of Yisugei.
Courageous himself in battle, Temujin would often reward bravery shown by the defeated, famousl

## The Iberian Conquest of the Americas

In [86]:
url = 'https://www.worldhistory.org/article/1920/the-iberian-conquest-of-the-americas/'
tfidf_summa = TextSummarization(url, tfidf_summary, 10)
tfidf_summa.rouge_score(2)
tfidf_summa.print_summary()
lsa_summa = TextSummarization(url, lsa_summary, 10)
lsa_summa.rouge_score(2)
lsa_summa.print_summary()
textrank_summa = TextSummarization(url, textrank_summary, 10)
textrank_summa.rouge_score(2)
textrank_summa.print_summary()

tfidf_summary:
rouge2 Precision: 0.21 Recall: 1.00 fmeasure: 0.35

European explorers began to probe the Western Hemisphere in the early 1500s, and they found to their utter amazement not only a huge landmass but also a world filled with several diverse and populous indigenous cultures.
Among their most important conquests were those of Christopher Columbus in the Caribbean (1492-1502); Hernán Cortés in Aztec Mexico (1519-1521), Francisco Pizzaro and Diego de Almagro in Inca Peru (1528-1532), and Juan de Grijalva (1518) and Hernán Cortés (1519; 1524-1525) in Mayan Yucatán and Guatemala.
Following an earlier expedition through the Yucatán led by Juan de Grijalva, Hernán Cortés began his campaign against the Aztec Empire in 1519 and with his coalition army captured the emperor Cuauhtémoc and the capital of the Aztec Empire, Tenochtitlan in 1521.
It began when Francisco Pizarro, with his Andean allies captured and strangled Emperor Atahualpa in 1532, but it did not end for another 40 year

## Christmas Through the Ages


In [90]:
url = 'https://www.worldhistory.org/article/1893/christmas-through-the-ages/'
tfidf_summa = TextSummarization(url, tfidf_summary, 10)
tfidf_summa.rouge_score(2)
tfidf_summa.print_summary()
lsa_summa = TextSummarization(url, lsa_summary, 10)
lsa_summa.rouge_score(2)
lsa_summa.print_summary()
textrank_summa = TextSummarization(url, textrank_summary, 10)
textrank_summa.rouge_score(2)
textrank_summa.print_summary()

tfidf_summary:
rouge2 Precision: 0.02 Recall: 0.11 fmeasure: 0.04

That parties everywhere could get out of hand is attested by records of watchmen being paid to ensure property was not damaged over the 12-day holiday, particularly the big parties held on the eve of the 6th of January, known as Twelfth Night.
The poor enjoyed more modest entertainment like cards and dice, carols, playing musical instruments, board games, telling folktales, and enjoying traditional party games like permitting one person to be the 'king of the feast' if they found a bean in the special bread or cake – everyone else then had to mimic the 'king' (a role-reversal game that echoed Saturnalia's similar 'Lord of Misrule').
During the Elizabethan Era (1558-1603 CE) 'holy days' continued to be the main source of public 'holidays' – a term now being used for the first time – but there were also more secular activities establishing themselves as popular traditions.
With no public roads, travelling by horse and car

# Conclusion

In our experiments, no single model was the best for every article. It is therefore worthwhile to test several summarization algorithms and choose the one that works best for a given application.