**Text summarization**

A text summarizer produces a short summary of a longer text. spaCy together with pytextrank provides an extractive summarizer, which is demonstrated in this notebook.

The following example is based on the pytextrank code snippet from the spaCy Universe: [spacy.io/universe](https://spacy.io/universe/project/spacy-pytextrank#gatsby-noscript)
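
To give a feeling for what TextRank does before using the library, here is a minimal sketch of the underlying idea: words that occur close to each other are linked in a graph, and a PageRank-style iteration scores each word by how well connected it is. This is not pytextrank's implementation. pytextrank works on spaCy's linguistic annotations rather than on raw strings, so its results will differ. The function name `textrank_keywords`, the parameters `window`, `damping` and `iterations`, and the toy sentence are assumptions made for illustration only.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Rank words with a PageRank-style iteration over a co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: each word shares its score equally among its neighbors.
    nodes = list(neighbors)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) / len(nodes)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[node])
            for node in nodes
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input (hypothetical); stop words are deliberately not filtered here.
tokens = "bundesliga clubs qualify for the cup and bundesliga clubs win the cup".split()
ranking = textrank_keywords(tokens)
for word, score in ranking[:3]:
    print(f"{score:.4f}  {word}")
```

Because stop words are kept in this sketch, frequent function words can rank highly. A real pipeline would typically filter them out first.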

In [8]:
# resources
#
import spacy
import pytextrank
import wikipedia
#
# Load the small English model (tokenizer, tagger, parser, NER); it ships without static word vectors
sp = spacy.load('en_core_web_sm')
# core-model with German language:
#sp = spacy.load('de_core_news_sm')
#

In [9]:
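# prepare pipeline: register TextRank as the last pipeline component
# (pytextrank 2.x / spaCy 2.x API; with spaCy 3.x and pytextrank 3.x use sp.add_pipe('textrank') instead)
#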
tr = pytextrank.TextRank()
sp.add_pipe(tr.PipelineComponent, name='textrank', last=True)

In [10]:
# data
#
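# fetch the content of a Wikipedia page with the wikipedia library
#
# alternative input: the page about severe acute respiratory syndrome coronavirus 2
# https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2
#
# doc = sp(wikipedia.page('Severe_acute_respiratory_syndrome_coronavirus_2').content)
#
# used here: the page about the Fussball-Bundesliga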
doc = sp(wikipedia.page('Fussball-Bundesliga').content)

In [11]:
# summarize the text with textrank
#
for sent in doc._.textrank.summary():
    print(sent)

List of foreign Bundesliga players
List of football clubs in Germany by major honours won
List of attendance figures at domestic professional sports leagues – the Bundesliga in a worldwide context

Alemannia Aachen lost to Werder Bremen in the 2004 DFB-Pokal Final, Alemannia secured an entry in the 2004–05 UEFA Cup, because Werder qualified for the Champions League as First Bundesliga champions.
All of the Bundesliga clubs qualify for the DFB-Pokal.
Bayern Munich has won the title 29 times, the most among Bundesliga clubs.
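
The summary above is extractive: it consists of sentences taken unchanged from the page, chosen because they relate to the top-ranked phrases. The following sketch shows one simple way such a selection could work, by summing the ranks of the phrases each sentence contains. It is a simplification under stated assumptions and not pytextrank's actual scoring. The helper name `pick_summary_sentences` and its `limit` parameter are hypothetical. The sentences and phrase ranks are copied from the outputs in this notebook.

```python
def pick_summary_sentences(sentences, phrase_ranks, limit=2):
    """Return the `limit` highest-scoring sentences in their original order."""
    scored = []
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        # A sentence's score is the sum of the ranks of the phrases it contains.
        score = sum(rank for phrase, rank in phrase_ranks.items() if phrase in lowered)
        scored.append((score, index))
    # Best scores first; ties are broken in favor of the earlier sentence.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:limit]
    # Restore document order so the summary reads naturally.
    return [sentences[index] for _, index in sorted(best, key=lambda pair: pair[1])]


sentences = [
    "Seasons run from August to May.",
    "All of the Bundesliga clubs qualify for the DFB-Pokal.",
    "Bayern Munich has won the title 29 times, the most among Bundesliga clubs.",
]
# Ranks as printed by doc._.phrases in this notebook.
phrase_ranks = {"bundesliga clubs": 0.0887, "first bundesliga": 0.0725}

selected = pick_summary_sentences(sentences, phrase_ranks, limit=2)
for sentence in selected:
    print(sentence)
```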


In [12]:
# compare with the summary provided by the wikipedia library:
#
# print(wikipedia.page('Severe_acute_respiratory_syndrome_coronavirus_2').summary)
print(wikipedia.page('Fussball-Bundesliga').summary)

The Bundesliga (German: [ˈbʊndəsˌliːɡa] (listen); lit.  'Federal League'), sometimes referred to as the Fußball-Bundesliga ([ˌfuːsbal-]) or 1. Bundesliga ([ˌeːɐ̯stə-]), is a professional association football league in Germany. At the top of the German football league system, the Bundesliga is Germany's primary football competition. The Bundesliga comprises 18 teams and operates on a system of promotion and relegation with the 2. Bundesliga. Seasons run from August to May. Most games are played on Saturdays and Sundays, with a few games played on weekdays. All of the Bundesliga clubs qualify for the DFB-Pokal. The winner of the Bundesliga qualifies for the DFL-Supercup.
Fifty-six clubs have competed in the Bundesliga since its founding. Bayern Munich has won the title 29 times, the most among Bundesliga clubs. However, the Bundesliga has seen other champions, with Borussia Dortmund, Hamburger SV, Werder Bremen, Borussia Mönchengladbach, and VfB Stuttgart most prominent among them. The B
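
Some of the sentences extracted by TextRank also appear word for word in the Wikipedia summary, while others (the "List of ..." lines) do not. A small helper can make this comparison explicit. The sketch below is only an illustration: the function name `verbatim_overlap` is hypothetical, and the strings are excerpts copied from the two outputs in this notebook rather than the full texts.

```python
def verbatim_overlap(extracted, reference):
    """Return the extracted sentences that occur word for word in `reference`."""
    return [s for s in extracted if s.strip() and s.strip() in reference]


# Excerpt of the TextRank output shown earlier in this notebook.
extracted = [
    "List of foreign Bundesliga players",
    "All of the Bundesliga clubs qualify for the DFB-Pokal.",
    "Bayern Munich has won the title 29 times, the most among Bundesliga clubs.",
]
# Excerpt of the summary returned by the wikipedia library.
reference = (
    "All of the Bundesliga clubs qualify for the DFB-Pokal. "
    "The winner of the Bundesliga qualifies for the DFL-Supercup. "
    "Fifty-six clubs have competed in the Bundesliga since its founding. "
    "Bayern Munich has won the title 29 times, the most among Bundesliga clubs."
)

shared = verbatim_overlap(extracted, reference)
print(f"{len(shared)} of {len(extracted)} extracted sentences appear in the reference")
```

With the real data, `extracted` could be built from `doc._.textrank.summary()` and `reference` from the `summary` attribute of the Wikipedia page.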

In [13]:
# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print('{:.4f} {:5d}  {}'.format(p.rank, p.count, p.text))
    print(p.chunks)

0.0887     5  bundesliga clubs
[Bundesliga clubs, Bundesliga clubs, Bundesliga clubs, the Bundesliga clubs, any Bundesliga club]
0.0858     3  bundesliga teams
[Bundesliga teams, Bundesliga teams, Bundesliga teams]
0.0816     1  bundesliga club borussia dortmund
[Bundesliga club Borussia Dortmund]
0.0759     1  bundesliga matches
[Bundesliga matches]
0.0752     3  bundesliga titles
[Bundesliga titles, Bundesliga titles, not only Bundesliga titles]
0.0726     1  bundesliga sides
[Bundesliga sides]
0.0726     1  german football clubs
[German football clubs]
0.0725     8  first bundesliga
[First Bundesliga, First Bundesliga, the First Bundesliga, the First Bundesliga, the First Bundesliga, the First Bundesliga, the First Bundesliga, the First Bundesliga]
0.0716     1  foreign bundesliga players
[foreign Bundesliga players]
0.0714     1  first bundesliga champions
[First Bundesliga champions]
0.0703     2  2nd bundesliga
[2nd Bundesliga, 2nd Bundesliga]
0.0699     1  german clubs
[German c

Copyright © 2020 IUBH Internationale Hochschule