# Summarise classic books with state-of-the-art machine learning models
BART and T5 are state-of-the-art machine learning models developed by [Lewis et al. 2019 (Facebook Research)](https://arxiv.org/abs/1910.13461) and [Raffel et al. 2019 (Google Research)](https://arxiv.org/abs/1910.10683), respectively. Both have been trained to summarise text and are available through [@HuggingFace](https://twitter.com/huggingface)'s [Transformers library](https://huggingface.co/transformers/). This notebook shows how to summarise some of history's most influential books, such as the Communist Manifesto or Orwell's 1984, in a few lines of code with these two models. Each BART run below takes a few minutes; each T5-large run takes a little over twenty. You can copy the notebook, run it, change it and compare the two models yourself. Notebook by [@MoritzLaurer](https://twitter.com/MoritzLaurer).





In [None]:
## installation
# see https://twitter.com/huggingface/status/1242512382800400384
# details https://github.com/huggingface/transformers/releases/tag/v2.6.0
!pip install transformers --upgrade

In [None]:
from transformers import pipeline
import requests
import pprint
import time
pp = pprint.PrettyPrinter(indent=14)

In [None]:
## documentation for summarizer: https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline
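# summarize with BART
# note: if the short name below fails to load on a newer transformers release, try the hub ID "facebook/bart-large-cnn"
summarizer_bart = pipeline(task='summarization', model="bart-large-cnn")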
# summarize with T5
summarizer_t5 = pipeline(task='summarization', model="t5-large")  # options: 't5-small', 't5-base', 't5-large', 't5-3b', 't5-11b'
# For T5 you can choose the size of the model. Everything above t5-base is very slow, even on GPU or TPU.

## 1. Karl Marx, Friedrich Engels - Manifesto of the Communist Party

In [None]:
## download book
book_raw = requests.get("http://www.gutenberg.org/cache/epub/61/pg61.txt")
communist_manifesto = book_raw.text
# cleaning
delimiter = "[From the English edition of 1888, edited by Friedrich Engels]"
communist_manifesto_cl = communist_manifesto.split(delimiter, 1)[1]
delimiter2 = "WORKING MEN OF ALL COUNTRIES, UNITE!"
communist_manifesto_cl = communist_manifesto_cl.split(delimiter2, 1)[0] + delimiter2
#print(communist_manifesto_cl)
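
The delimiter-based cleaning above follows the same pattern for every book, so it can be wrapped in a small reusable helper. The following is a minimal sketch and not part of the original notebook: the function name `extract_between` and its `keep_start` / `keep_end` flags are hypothetical, and the sample string is made up for illustration.

```python
def extract_between(text, start, end, keep_start=False, keep_end=True):
    """Return the part of `text` between the first `start` marker and the
    first `end` marker that follows it (hypothetical helper)."""
    if start not in text:
        raise ValueError(f"start marker not found: {start!r}")
    body = text.split(start, 1)[1]
    if end not in body:
        raise ValueError(f"end marker not found: {end!r}")
    body = body.split(end, 1)[0]
    # Optionally re-attach the markers, as the notebook cells do by hand.
    return (start if keep_start else "") + body + (end if keep_end else "")


# Made-up sample text standing in for a downloaded book.
sample = "licence header [EDITION NOTE] A spectre is haunting Europe. UNITE! licence footer"
print(extract_between(sample, "[EDITION NOTE]", "UNITE!"))
```

The default flags mirror the Communist Manifesto cell (drop the opening marker, keep the closing one). Raising an error when a marker is missing makes a changed source file fail loudly instead of with an `IndexError`.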

#### 1.1 - BART model, machine-generated summary - Communist Manifesto

In [None]:
## summarize book with BART model
t0 = time.time() # timer
summary_manifesto_bart = summarizer_bart(communist_manifesto_cl, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.")

Summarization took 1.27 minutes.
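
The timing code is repeated in every summarisation cell of this notebook. One way to avoid the repetition is a small context manager. This is only a sketch: the name `timer` and its `label` parameter are hypothetical, and `time.sleep` stands in for a real summariser call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label="Summarization"):
    """Print the elapsed wall-clock time of the enclosed block in minutes."""
    t0 = time.time()
    try:
        yield
    finally:
        minutes = round((time.time() - t0) / 60, 2)
        print(f"{label} took {minutes} minutes.")


# Stand-in workload; in the notebook the body would be a call such as
# summarizer_bart(communist_manifesto_cl, min_length=150, max_length=500).
with timer():
    time.sleep(0.1)
```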


In [None]:
pp.pprint(summary_manifesto_bart[0]['summary_text'])

('Communism is already acknowledged by all European Powers as a Power. It is '
 'high time that Communists should openly publish their views, their aims, '
 'their.endencies, and meet this nursery tale of the Spectre of.Communism with '
 'a Manifesto of the party itself. Society as a.whole is more and more '
 'splitting up into two great hostile camps, directly facing. each other: '
 'Bourgeoisie and Proletariat. The modern bourgeois society that has sprouted '
 'from the ruins. of feudal society has not done away with class antagonisms. '
 'It has but established new classes, new conditions of oppression, new forms '
 'of struggle in place of the old ones. The history of all hitherto existing '
 'societies is the history. of class struggles. The discovery of America, the '
 'rounding of the Cape, opened up fresh ground for the rising bourgeoisie.')


#### 1.2 - T5 model, machine-generated summary - Communist Manifesto

In [None]:
## summarize book with T5 model
t0 = time.time() # timer
summary_manifesto_t5 = summarizer_t5(communist_manifesto_cl, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 21.18 minutes.


In [None]:
pp.pprint(summary_manifesto_t5[0]['summary_text'])

('a spectre is haunting old europe--the threat of communism . john avlon: the '
 'party in opposition has been decried as communist by its opponents . but he '
 "says it's high time for communists to openly publish their views, aims, "
 'tendencies . the Manifesto will be published in the english, french, german, '
 'italian, flemish and danish languages . "communism is already acknowledged '
 'by all european powers to a. a " a- na aa en -a n a, n, if en, ena')
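
Both summaries above draw only on the opening paragraphs of the Manifesto. That is consistent with the input being cut to the model's maximum input length before summarisation; this is an inference from the outputs, not something the notebook states. One common workaround is to split the book into chunks, summarise each chunk, and join the partial summaries. The sketch below illustrates the idea under stated assumptions: `chunk_words`, `summarize_in_chunks` and the 400-word chunk size are hypothetical choices, word counts only approximate token counts, and `first_words` is a stand-in for a real model so the example runs without downloading anything.

```python
def chunk_words(text, words_per_chunk=400):
    """Split `text` into consecutive chunks of at most `words_per_chunk` words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def summarize_in_chunks(text, summarize_fn, words_per_chunk=400):
    """Summarise each chunk with `summarize_fn` and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, words_per_chunk))


def first_words(chunk, n=5):
    """Stand-in for a real model: keep the first `n` words of the chunk."""
    return " ".join(chunk.split()[:n])


# Synthetic 1000-word text, split into chunks of 400, 400 and 200 words.
demo_text = " ".join(f"w{i}" for i in range(1000))
print(summarize_in_chunks(demo_text, first_words))
```

With the pipelines defined in this notebook, `summarize_fn` could be something like `lambda c: summarizer_bart(c, min_length=30, max_length=100)[0]['summary_text']`; the length values here are placeholders, not tuned settings.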


## 2. George Orwell - 1984

In [None]:
## download book
book_raw = requests.get("http://gutenberg.net.au/ebooks01/0100021.txt")
orwell_1984 = book_raw.text
# cleaning
delimiter = 'PART ONE'
orwell_1984_cl = delimiter + orwell_1984.split(delimiter, 1)[1]
delimiter2 = "THE END"
orwell_1984_cl = orwell_1984_cl.split(delimiter2, 1)[0] + delimiter2
#print(orwell_1984_cl)

#### 2.1 - BART model, machine-generated summary - Orwell 1984

In [None]:
## summarize book with BART model
t0 = time.time() # timer
summary_orwell_bart = summarizer_bart(orwell_1984_cl, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 2.36 minutes.


In [None]:
pp.pprint(summary_orwell_bart[0]['summary_text'])

('Winston Smith lived in Victory Mansions, a flat seven flights up from the '
 'city centre. He heard a voice on a telescreen reading out a list of figures '
 'about pig-iron. The voice came from an oblong metal plaque like a dulled '
 'mirror which formed part of the surface of the right-hand wall. Winston '
 'turned a switch and the voice sank. It was the police patrol, snooping into '
 "people'swindows, but it didn't matter. You had to live from the instinct "
 'that became instinctive, except in darkness, when every movement was '
 'scrutinized, every sound made, every movement scrutinized. The thought '
 'police watched everybody all the time, and at any rate they could plug in '
 'your wire in any time they wanted.')


#### 2.2 - T5 model, machine-generated summary - Orwell 1984



In [None]:
## summarize book with T5 model
t0 = time.time() # timer
summary_orwell_t5 = summarizer_t5(orwell_1984_cl, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 23.14 minutes.


In [None]:
pp.pprint(summary_orwell_t5[0]['summary_text'])

("'big brother is watching you' was the caption beneath a poster . a fruity "
 'voice was reading out a list of figures which had something to do with the '
 'production of pig-iron . the flat was seven flights up, and Winston, who was '
 'thirty-nine, went slowly, resting several times on the way . on each '
 'landing, opposite the lift-shaft, the poster with the enormous face gazed '
 'from the wall . Winston turned a switch and the voice sank somewhat, though '
 'the. .. en a aa ena .- a. s aen enenaaao asa')


## 3. Charles Darwin - The Origin of Species by Means of Natural Selection

In [None]:
## download book
book_raw = requests.get("http://www.gutenberg.org/cache/epub/2009/pg2009.txt")
darwin_origin_of_species = book_raw.text
# cleaning
delimiter = 'INTRODUCTION.'
darwin_origin_of_species_cl = "ORIGIN OF SPECIES." + delimiter + darwin_origin_of_species.split(delimiter, 1)[1]
delimiter2 = "GLOSSARY OF THE PRINCIPAL SCIENTIFIC TERMS USED IN THE PRESENT VOLUME."
darwin_origin_of_species_cl = darwin_origin_of_species_cl.split(delimiter2, 1)[0]
#print(darwin_origin_of_species_cl)

#### 3.1 - BART model, machine-generated summary - Darwin, Origin of Species

In [None]:
## summarize book with BART model
t0 = time.time() # timer
summary_darwin_bart = summarizer_bart(darwin_origin_of_species_cl, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 6.52 minutes.


In [None]:
pp.pprint(summary_darwin_bart[0]['summary_text'])

('The origin of species is the mystery of mysteries, as it has been called by '
 'one of our greatest philosophers. A naturalist, Sir C. Lyell and Dr. Hooker '
 'have helped him in every possible way by his large stores of knowledge and '
 'his excellentjudgment. Mr. Wallace, who is now studying the natural history '
 'of the Malay Archipelago, has arrived at almost exactly the same '
 'generalconclusions that I have on the origin of Species. I can here give '
 'only the general conclusions at which I have arrived, with a few facts in '
 'most cases, but which, I hope, will suffice. I am well aware that scarcely a '
 'single point is discussed in this volume on which facts cannot be adduced. A '
 'fair result can be obtained only by fully stating and balancing the facts '
 'and arguments on both sides of each question.')


#### 3.2 - T5 model, machine-generated summary - Darwin, Origin of Species

In [None]:
## summarize book with T5 model
t0 = time.time() # timer
summary_darwin_t5 = summarizer_t5(darwin_origin_of_species_cl, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 25.43 minutes.


In [None]:
pp.pprint(summary_darwin_t5[0]['summary_text'])

('the origin of species is a mystery, as it has been called by one of our '
 "greatest philosophers . the author's work is now (1859) nearly finished; but "
 'as it will take years to complete it, he publishes abstract . author: "i '
 'hope that I may enter on these personal details, as I give them to show that '
 'i have not been hasty in coming to a decision" he says he has been urged to '
 'publish this abstract, which must necessarily be imperfect . a na a- '
 'n--n-a-na-aa--a  aa --- aena-on')


## 4. Mary Wollstonecraft - A Vindication of the Rights of Woman

In [None]:
## download book
book_raw = requests.get("http://www.gutenberg.org/cache/epub/3420/pg3420.txt")
rights_woman = book_raw.text
# cleaning
delimiter = 'A VINDICATION OF THE RIGHTS OF WOMAN,'
rights_woman_cl = delimiter + rights_woman.split(delimiter, 1)[1]
#print(rights_woman_cl)

#### 4.1 - BART model, machine-generated summary - Rights of Woman

In [None]:
## summarize book with BART model
t0 = time.time() # timer
summary_rights_woman_bart = summarizer_bart(rights_woman_cl, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 2.36 minutes.


In [None]:
pp.pprint(summary_rights_woman_bart[0]['summary_text'])

('M. Wollstonecraft was born in 1759. She became a teacher from motives of '
 'benevolence, rather than philanthropy. Her father was so great that the '
 'place of her birth is uncertain. She left her parents at the age of '
 'nineteen, and resided with a Mrs. Dawson for two years. Her friend and '
 'colleague, Dr. Price, died of a pulmonary disease. She gave proof of the '
 'superior qualification of superior qualification for the superior role of a '
 'woman. She wrote a book called The Rights of the Woman, published in 2001. '
 'The book is published by Simon & Schuster, London, priced £16.99. For more '
 'information on the book, or to order a copy, visit: '
 'http://www.simonandschuster.com/ The-Rights-of-the-Woman.html.')


#### 4.2 - T5 model, machine-generated summary - Rights of Woman

In [None]:
## summarize book with T5 model
t0 = time.time() # timer
summary_rights_woman_t5 = summarizer_t5(rights_woman_cl, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 21.9 minutes.


In [None]:
pp.pprint(summary_rights_woman_t5[0]['summary_text'])

('a VINDICATION OF THE RIGHTS OF WOMAN, WITH STRICTURES ON POLITICAL AND MORAL '
 'SUBJECTS, BY MARY WOLLSTONECRAFT . 8 April, 2001 A BRIEF SKETCH OF THE LIFE '
 'OF m. w. wollstonecraft . ms wollstonecraft was born in suffolk, england, in '
 '1913 . she was educated at the a ., n .na - enaa na, . in a,aa- '
 '.\xaden\xad\xad')
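
To compare the speed of the two models, the run times printed in the cells above can be collected and averaged. The snippet below is a small sketch that only restates those printed numbers; the dictionary layout is an illustrative choice, and the timings depend on the hardware the notebook ran on, which is not recorded here.

```python
from statistics import mean

# Minutes per run, copied from the "Summarization took ..." outputs above.
timings = {
    "Communist Manifesto": {"bart": 1.27, "t5": 21.18},
    "1984": {"bart": 2.36, "t5": 23.14},
    "Origin of Species": {"bart": 6.52, "t5": 25.43},
    "Rights of Woman": {"bart": 2.36, "t5": 21.9},
}

bart_mean = mean(t["bart"] for t in timings.values())
t5_mean = mean(t["t5"] for t in timings.values())

print(f"BART mean: {bart_mean:.2f} min")
print(f"T5 mean:   {t5_mean:.2f} min")
print(f"T5 / BART: {t5_mean / bart_mean:.1f}x")
```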
