## Text Summarization

This module automatically summarizes a given text by extracting one or more important sentences from it. It can extract keywords in a similar way. This tutorial shows how to use the summarization module through examples: first a small one, then two larger ones, and finally a look at the summarizer's performance in terms of speed.

This summarizer is based on the “TextRank” algorithm by Mihalcea et al. The algorithm was later improved by Barrios et al., who introduced a “BM25 ranking function”.
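To give a feel for what a BM25 ranking function computes, here is a minimal sketch of the standard Okapi BM25 formula applied to sentence similarity. It is not the module's own implementation: the function name `bm25_scores`, the whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in term_freq:
                continue
            # Assumed IDF variant: the "+ 1.0" keeps the value non-negative.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are damped.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


sentences = [
    "neo is a hacker",
    "morpheus is a legendary hacker",
    "machines live off body heat",
]
tokenized = [s.split() for s in sentences]
# Similarity of the first sentence to every sentence, itself included.
bm25_first = bm25_scores(tokenized[0], tokenized)
```

In a TextRank-style summarizer, scores like these would serve as edge weights between sentences. A sentence that shares no words with the query gets a score of zero.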

In [1]:
from pprint import pprint as print
from gensim.summarization import summarize

In [2]:
text = (
    "Thomas A. Anderson is a man living two lives. By day he is an "
    "average computer programmer and by night a hacker known as "
    "Neo. Neo has always questioned his reality, but the truth is "
    "far beyond his imagination. Neo finds himself targeted by the "
    "police when he is contacted by Morpheus, a legendary computer "
    "hacker branded a terrorist by the government. Morpheus awakens "
    "Neo to the real world, a ravaged wasteland where most of "
    "humanity have been captured by a race of machines that live "
    "off of the humans' body heat and electrochemical energy and "
    "who imprison their minds within an artificial reality known as "
    "the Matrix. As a rebel against the machines, Neo must return to "
    "the Matrix and confront the agents: super-powerful computer "
    "programs devoted to snuffing out Neo and the entire human "
    "rebellion. "
)
print(text)

('Thomas A. Anderson is a man living two lives. By day he is an average '
 'computer programmer and by night a hacker known as Neo. Neo has always '
 'questioned his reality, but the truth is far beyond his imagination. Neo '
 'finds himself targeted by the police when he is contacted by Morpheus, a '
 'legendary computer hacker branded a terrorist by the government. Morpheus '
 'awakens Neo to the real world, a ravaged wasteland where most of humanity '
 "have been captured by a race of machines that live off of the humans' body "
 'heat and electrochemical energy and who imprison their minds within an '
 'artificial reality known as the Matrix. As a rebel against the machines, Neo '
 'must return to the Matrix and confront the agents: super-powerful computer '
 'programs devoted to snuffing out Neo and the entire human rebellion. ')


In [3]:
print(summarize(text))

('Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.')


In [4]:
print(summarize(text, split=True))

['Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.']


You can adjust how much text the summarizer outputs via the “ratio” parameter or the “word_count” parameter. Using the “ratio” parameter, you specify what fraction of sentences in the original text should be returned as output. Below we specify that we want 50% of the original text (the default is 20%).

In [6]:
print(summarize(text, ratio=0.5))

('By day he is an average computer programmer and by night a hacker known as '
 'Neo. Neo has always questioned his reality, but the truth is far beyond his '
 'imagination.\n'
 'Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.\n'
 'As a rebel against the machines, Neo must return to the Matrix and confront '
 'the agents: super-powerful computer programs devoted to snuffing out Neo and '
 'the entire human rebellion.')
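The effect of the “ratio” parameter can be pictured as keeping the best-ranked fraction of sentences and returning them in their original order. The sketch below is only an illustration of that idea. The helper `select_by_ratio`, the made-up scores, and the rounding rule (truncate, but keep at least one sentence) are assumptions, not the module's documented behaviour.

```python
def select_by_ratio(scored_sentences, ratio=0.2):
    """Keep the top-scoring fraction of sentences, restoring original order."""
    n = len(scored_sentences)
    # Assumption: truncate the product, but always keep at least one sentence.
    keep = max(1, int(n * ratio))
    # Sentence indices, best score first.
    ranked = sorted(range(n), key=lambda i: scored_sentences[i][1], reverse=True)
    return [scored_sentences[i][0] for i in sorted(ranked[:keep])]


# Hypothetical (sentence, score) pairs; the scores are made up.
scored = [("s0", 0.10), ("s1", 0.25), ("s2", 0.15),
          ("s3", 0.05), ("s4", 0.30), ("s5", 0.12)]
half = select_by_ratio(scored, ratio=0.5)
fifth = select_by_ratio(scored, ratio=0.2)
```

With six sentences, a ratio of 0.5 keeps three of them and a ratio of 0.2 keeps one.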


Using the “word_count” parameter, we specify the maximum number of words we want in the summary. Below we have specified that we want no more than 50 words.

In [7]:
print(summarize(text, word_count=50))

('Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.')


As mentioned earlier, this module also supports keyword extraction. Keyword extraction works in the same way as summary generation (i.e. sentence extraction), in that the algorithm tries to find words that are important or seem representative of the entire text. The keywords are not always single words; in the case of multi-word keywords, they are typically all nouns.

In [12]:
from gensim.summarization import keywords

In [13]:
print(keywords(text))

'humanity\nhuman\nneo\nhumans body\nsuper'


### Larger example

In [14]:
import requests

text = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt').text
print(text)

('The screen is filled with green, cascading code which gives way to the '
 'title, The Matrix.\r\n'
 '\r\n'
 'A phone rings and text appears on the screen: "Call trans opt: received. '
 '2-19-98 13:24:18 REC: Log>" As a conversation takes place between Trinity '
 '(Carrie-Anne Moss) and Cypher (Joe Pantoliano), two free humans, a table of '
 'random green numbers are being scanned and individual numbers selected, '
 'creating a series of digits not unlike an ordinary phone number, as if a '
 'code is being deciphered or a call is being traced.\r\n'
 '\r\n'
 'Trinity discusses some unknown person. Cypher taunts Trinity, suggesting she '
 'enjoys watching him. Trinity counters that "Morpheus (Laurence Fishburne) '
 'says he may be \'the One\'," just as the sound of a number being selected '
 'alerts Trinity that someone may be tracing their call. She ends the call.\r\n'
 '\r\n'
 "Armed policemen move down a darkened, decrepit hallway in the Heart O' the "
 'City Hotel, their flashlight 

In [15]:
print(summarize(text, ratio=0.01))

('Anderson, a software engineer for a Metacortex, the other life as Neo, a '
 'computer hacker "guilty of virtually every computer crime we have a law '
 'for." Agent Smith asks him to help them capture Morpheus, a dangerous '
 'terrorist, in exchange for amnesty.\n'
 "Morpheus explains that he's been searching for Neo his entire life and asks "
 'if Neo feels like "Alice in Wonderland, falling down the rabbit hole." He '
 'explains to Neo that they exist in the Matrix, a false reality that has been '
 'constructed for humans to hide the truth.\n'
 "Neo is introduced to Morpheus's crew including Trinity; Apoc (Julian "
 'Arahanga), a man with long, flowing black hair; Switch; Cypher (bald with a '
 'goatee); two brawny brothers, Tank (Marcus Chong) and Dozer (Anthony Ray '
 'Parker); and a young, thin man named Mouse (Matt Doran).\n'
 'Trinity brings the helicopter down to the floor that Morpheus is on and Neo '
 'opens fire on the three Agents.')


In [16]:
print(keywords(text, ratio=0.01))

'neo\nmorpheus\ntrinity\ncypher'


### Another example

In [17]:
text = requests.get('http://rare-technologies.com/the_big_lebowski_synopsis.txt').text
print(text)
print(summarize(text, ratio=0.01))
print(keywords(text, ratio=0.01))

('A tumbleweed rolls up a hillside just outside of Los Angeles as a mysterious '
 'man known as The Stranger (Sam Elliott) narrates about a fella he wants to '
 'tell us about named Jeffrey Lebowski. With not much use for his given name, '
 'however, Jeffrey goes by the name The Dude (Jeff Bridges). The Stranger '
 'describes Dude as one of the laziest men in LA, which would place him "high '
 'in the running for laziest worldwide", but nevertheless "the man for his '
 'place and time."\r\n'
 '\r\n'
 'The Dude, wearing a bathrobe and flips flops, buys a carton of cream at '
 "Ralph's with a post-dated check for 69 cents. On the TV, President George "
 'Bush Sr. is addressing the nation, saying "aggression will not stand" '
 'against Kuwait. Dude returns to his apartment where, upon entering and '
 'closing the door, he is promptly grabbed by two men who force him into the '
 'bathroom and shove his head in the toilet. They demand money owed to Jackie '
 "Treehorn, saying that The Dude'

('Dude agrees to meet with the Big Lebowski, hoping to get compensation for '
 'his rug since it "really tied the room together" and figures that his wife, '
 "Bunny, shouldn't be owing money around town.\n"
 'Walter resolves to go to Plan B; he tells Larry to watch out the window as '
 'he and Dude go back out to the car where Donny is waiting.')
'dude\ndudes\nlebowski\nbowling\nbowls\nbrandt'


### Text-content dependent running times

The running time is not only dependent on the size of the dataset. For example, summarizing “The Matrix” synopsis (about 36,000 characters) takes about 3.1 seconds, while summarizing 35,000 characters of this book takes about 8.5 seconds. So the former is more than twice as fast.

One reason for this difference in running times is the data structure that is used. The algorithm represents the data using a graph, where vertices (nodes) are sentences, and then constructs weighted edges between the vertices that represent how the sentences relate to each other. This means that every piece of text will have a different graph, thus making the running times different. The size of this data structure is quadratic in the worst case (the worst case is when each vertex has an edge to every other vertex).
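The graph structure described above can be sketched as follows. This is a simplified illustration, not the module's code: the helper `build_sentence_graph` is hypothetical, and the edge weight is a plain shared-word count standing in for the real similarity function.

```python
from itertools import combinations


def build_sentence_graph(sentences):
    """Return {(i, j): weight} for every sentence pair that shares a word."""
    token_sets = [set(s.lower().split()) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        shared = len(token_sets[i] & token_sets[j])
        if shared:  # no edge when the sentences have nothing in common
            edges[(i, j)] = shared
    return edges


dense = ["neo meets morpheus", "morpheus trains neo", "neo trusts morpheus"]
sparse = ["neo meets morpheus", "machines harvest heat", "agents guard exits"]

dense_edges = build_sentence_graph(dense)
sparse_edges = build_sentence_graph(sparse)
# Worst case for n sentences: every pair is connected, i.e. n(n-1)/2 edges.
max_edges = len(dense) * (len(dense) - 1) // 2
```

Both inputs have three sentences, but the first yields the maximum number of edges while the second yields none, so two texts of the same length can produce graphs of very different sizes.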

Another possible reason for the difference in running times is that the problems converge at different rates, meaning that the error drops slower for some datasets than for others.
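To illustrate how convergence can depend on the input, here is a sketch of a PageRank-style power iteration, the kind of iterative ranking that TextRank builds on. It is not the solver the module actually uses. The function `pagerank_iterations`, the damping factor, the tolerance, and the two toy weight matrices are assumptions chosen for the example.

```python
def pagerank_iterations(weights, damping=0.85, tol=1e-6, max_iter=1000):
    """Power iteration on a symmetric weight matrix.

    Returns (scores, number of iterations until the change drops below tol).
    """
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        new_scores = []
        for i in range(n):
            # Score flowing into node i from each neighbour j.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        # L1 distance between successive score vectors is the "error".
        error = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if error < tol:
            return scores, iteration
    return scores, max_iter


# Every sentence equally similar to every other one.
complete = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
# Each sentence similar only to its neighbours in the text.
chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]

complete_scores, complete_iters = pagerank_iterations(complete)
chain_scores, chain_iters = pagerank_iterations(chain)
```

The fully connected graph starts at its fixed point and stops almost immediately, while the chain needs many more iterations to reach the same tolerance.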

### Montemurro and Zanette’s entropy based keyword extraction algorithm

This paper describes a technique to identify words that play a significant role in the large-scale structure of a text. These typically correspond to the major themes of the text. The text is divided into blocks of ~1000 words, and the entropy of each word’s distribution amongst the blocks is calculated and compared with the expected entropy if the word were distributed randomly.
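The first half of that computation, the entropy of a word's distribution across blocks, can be sketched as below. This is a simplified illustration only: the helper `block_entropy` is hypothetical, the toy block size of 4 replaces the ~1000 words mentioned above, and the comparison against the expected entropy of a randomly distributed word is left out.

```python
import math
from collections import Counter


def block_entropy(words, target, block_size=1000):
    """Entropy (in bits) of how `target` is spread across fixed-size blocks."""
    blocks = [words[i:i + block_size] for i in range(0, len(words), block_size)]
    counts = [Counter(block)[target] for block in blocks]
    total = sum(counts)
    if total == 0:
        return 0.0
    # Share of the word's occurrences that falls into each block.
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)


# Toy text: "the" occurs once per block, "court" only in the first block.
words = (
    "the court court a "
    "the b c d "
    "the e f g "
    "the h i j"
).split()

spread_entropy = block_entropy(words, "the", block_size=4)
clustered_entropy = block_entropy(words, "court", block_size=4)
```

A word spread evenly over all blocks has the maximum entropy (here 2 bits for 4 blocks), while a word confined to one block has zero entropy. Words whose entropy falls well below what random placement would give are the ones that follow the text's large-scale structure.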

In [19]:
import requests
from gensim.summarization import mz_keywords

text = requests.get("http://www.gutenberg.org/files/49679/49679-0.txt").text
print(mz_keywords(text, scores=True, threshold=0.001))


[('i', 0.005071990145676084),
 ('the', 0.004078714811925573),
 ('lincoln', 0.003834207719481631),
 ('you', 0.00333099434510635),
 ('gutenberg', 0.003286171946544613),
 ('v', 0.0031486824001772298),
 ('a', 0.0030225302081737385),
 ('project', 0.0030137873650921583),
 ('s', 0.002804807648086567),
 ('iv', 0.0027211423370182043),
 ('he', 0.0026652557966447303),
 ('ii', 0.002522584294510855),
 ('his', 0.0021025932276434807),
 ('by', 0.002092414407555808),
 ('abraham', 0.0019871796860869762),
 ('or', 0.0019180648459331258),
 ('lincolna', 0.0019090487448340699),
 ('tm', 0.001887549850538215),
 ('iii', 0.001883132631521375),
 ('was', 0.0018691721439371342),
 ('work', 0.0017383218152950376),
 ('new', 0.0016870325205805429),
 ('co', 0.0016544975217374278),
 ('case', 0.0015991334540419223),
 ('court', 0.0014413967155396973),
 ('york', 0.001429133695025362),
 ('on', 0.0013292841806795005),
 ('it', 0.001308454011675044),
 ('had', 0.001298103630126742),
 ('to', 0.0012629182579600709),
 ('my', 0.0012

By default, the algorithm weights the entropy by the overall frequency of the word in the document. We can remove this weighting by setting weighted=False.

In [20]:
print(mz_keywords(text, scores=True, weighted=False, threshold=1.0))

[('gutenberg', 3.8130548486405993),
 ('project', 3.5738550368621964),
 ('tm', 3.5734630161654266),
 ('co', 3.188187179789421),
 ('foundation', 2.9349504275296248),
 ('dogskin', 2.767166394411781),
 ('electronic', 2.712759445340285),
 ('donations', 2.5598097474452906),
 ('foxboro', 2.552819829558231),
 ('access', 2.534996621584064),
 ('gloves', 2.534996621584064),
 ('_works_', 2.519083905903437),
 ('iv', 2.4068950059833725),
 ('v', 2.376066199199476),
 ('license', 2.32674033665853),
 ('works', 2.320294093790008),
 ('replacement', 2.297629530050557),
 ('e', 2.1840002559354215),
 ('coon', 2.1754936158294536),
 ('volunteers', 2.1754936158294536),
 ('york', 2.172102058646223),
 ('ii', 2.143421998464259),
 ('edited', 2.110161739139703),
 ('refund', 2.100145067024387),
 ('iii', 2.052633589900031),
 ('bounded', 1.9832369322912882),
 ('format', 1.9832369322912882),
 ('jewelry', 1.9832369322912882),
 ('metzker', 1.9832369322912882),
 ('millions', 1.9832369322912882),
 ('ragsdale', 1.983236932291

When this option is used, it is possible to calculate a threshold automatically from the number of blocks by passing threshold="auto".

In [21]:
print(mz_keywords(text, scores=True, weighted=False, threshold="auto"))

[('gutenberg', 3.8130548486405993),
 ('project', 3.5738550368621964),
 ('tm', 3.5734630161654266),
 ('co', 3.188187179789421),
 ('foundation', 2.9349504275296248),
 ('dogskin', 2.767166394411781),
 ('electronic', 2.712759445340285),
 ('donations', 2.5598097474452906),
 ('foxboro', 2.552819829558231),
 ('access', 2.534996621584064),
 ('gloves', 2.534996621584064),
 ('_works_', 2.519083905903437),
 ('iv', 2.4068950059833725),
 ('v', 2.376066199199476),
 ('license', 2.32674033665853),
 ('works', 2.320294093790008),
 ('replacement', 2.297629530050557),
 ('e', 2.1840002559354215),
 ('coon', 2.1754936158294536),
 ('volunteers', 2.1754936158294536),
 ('york', 2.172102058646223),
 ('ii', 2.143421998464259),
 ('edited', 2.110161739139703),
 ('refund', 2.100145067024387),
 ('iii', 2.052633589900031),
 ('bounded', 1.9832369322912882),
 ('format', 1.9832369322912882),
 ('jewelry', 1.9832369322912882),
 ('metzker', 1.9832369322912882),
 ('millions', 1.9832369322912882),
 ('ragsdale', 1.983236932291