# **Text summarization**

A text summarizer produces a short summary of a longer text. <br>
[spaCy](https://spacy.io) together with [pyTextRank](https://github.com/DerwenAI/pytextrank) provides a text summarization model, which is presented in this notebook.

The following example is based on the code snippet for pytextrank in the [spaCy Universe](https://spacy.io/universe/project/spacy-pytextrank#gatsby-noscript).

#### install additional libraries

In [None]:
!pip install pytextrank==3.0.1

In [None]:
!pip install wikipedia==1.4.0

In [None]:
# download language model for spacy
!python -m spacy download en_core_web_sm

#### load resources

In [8]:
# resources
#
import spacy
import pytextrank
import wikipedia
#
# load the small English pipeline (tokenizer, tagger, parser, NER)
sp = spacy.load('en_core_web_sm')
#
# alternative: small German pipeline
#sp = spacy.load('de_core_news_sm')
#

In [None]:
# prepare pipeline
#
sp.add_pipe('textrank', last=True)
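
As background, TextRank ranks words and phrases by building a graph from the text and running PageRank on it. The sketch below is a simplified, standard-library illustration of that idea, not pytextrank's implementation: the function names `build_graph` and `pagerank`, the window size, the damping factor, and the toy token list are all assumptions chosen for this example.

```python
from collections import defaultdict


def build_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Run a fixed number of power-iteration steps over an undirected graph."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its current rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Toy input (hypothetical); a real pipeline would use lemmas of selected parts of speech.
tokens = "bundesliga clubs play bundesliga matches and win bundesliga titles".split()
ranks = pagerank(build_graph(tokens))
for word, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{score:.4f}  {word}")
```

Words that co-occur with many other words collect more rank, which is why a frequent, well-connected term such as "bundesliga" ends up at the top of this toy ranking.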

#### fetch text from wikipedia

In [13]:
# data
#
# fetch the Wikipedia article on the Bundesliga with the wikipedia library
# and run its text through the spaCy pipeline
#
# alternative: the article on coronavirus 2
# https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2
#
# doc = sp(wikipedia.page('Severe_acute_respiratory_syndrome_coronavirus_2').content)
#
doc = sp(wikipedia.page('Fussball-Bundesliga').content)
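
The `content` attribute of a `wikipedia` page is plain text in which section titles are wrapped in equals signs (for example `=== Attendances ===`), and these markers can leak into the extracted sentences. A minimal sketch of a cleanup step, using only the standard library, might look like the following. The helper name `strip_wiki_headings` and the sample text are hypothetical and not part of the original notebook.

```python
import re

# Matches a whole line such as "== History ==" or "=== Attendances ===".
HEADING_LINE = re.compile(r"^[ \t]*=+[^=\n]+=+[ \t]*$", flags=re.MULTILINE)


def strip_wiki_headings(text):
    """Remove section-heading lines and collapse the blank lines left behind."""
    without_headings = HEADING_LINE.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", without_headings).strip()


# Hypothetical sample in the style of the page content.
sample = (
    "The Bundesliga comprises 18 teams.\n\n\n"
    "=== Attendances ===\n\n"
    "The Bundesliga has the highest average attendance."
)
print(strip_wiki_headings(sample))
```

Under these assumptions, the cleaned text could be passed to the pipeline instead of the raw content, e.g. `doc = sp(strip_wiki_headings(wikipedia.page('Fussball-Bundesliga').content))`.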

#### summarize text

In [14]:
# summarize the text with textrank
#
for sent in doc._.textrank.summary():
    print(sent)

Bayern Munich has won the title 29 times, the most among Bundesliga clubs.
The Bundesliga has the lowest ticket prices and the highest average attendance among Europe's five major leagues.
= Attendances ===

Based on its per-game average, the Bundesliga is the best-attended association football league in the world; out of all sports, its average of 45,116 fans per game during the 2011–12 season was the second highest of any professional sports league worldwide, behind only the National Football League of the United States.
Having won the Champions League in 1997 and a number of Bundesliga titles, Dortmund had gambled on maintaining their success with an expensive group of largely foreign players but failed, narrowly escaping liquidation in 2006.
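
To give an intuition for how an extractive summary like the one above can be assembled from ranked phrases, here is a simplified sketch: each sentence is scored by how much of the top-ranked phrase weight it fails to cover, and the sentences with the smallest distance are returned. This is an assumption about the general approach, not a copy of pytextrank's code. The function `pick_sentences`, its parameters, and the toy sentences are hypothetical; the three ranks are taken from the phrase listing later in this notebook.

```python
from math import sqrt


def pick_sentences(sentences, phrase_ranks, limit_sentences=2):
    """Return the sentences that cover the highest-ranked phrases.

    sentences    -- list of sentence strings
    phrase_ranks -- dict mapping a phrase to its rank (higher is better)
    """
    total = sum(phrase_ranks.values())
    # Normalise the ranks so that they sum to 1.
    weights = {phrase: rank / total for phrase, rank in phrase_ranks.items()}

    scored = []
    for position, sentence in enumerate(sentences):
        lowered = sentence.lower()
        # Distance grows with the weight of every phrase the sentence misses.
        missing = [w for phrase, w in weights.items() if phrase.lower() not in lowered]
        distance = sqrt(sum(w * w for w in missing))
        scored.append((distance, position, sentence))

    # Smallest distance first; ties keep document order.
    scored.sort(key=lambda item: (item[0], item[1]))
    return [sentence for _, _, sentence in scored[:limit_sentences]]


# Hand-written toy sentences (hypothetical), not actual pipeline output.
sentences = [
    "Seasons run from August to May.",
    "Bayern Munich has won the most Bundesliga titles among Bundesliga clubs.",
    "Most Bundesliga matches are played on Saturdays and Sundays.",
]
phrase_ranks = {
    "Bundesliga clubs": 0.0935,
    "Bundesliga titles": 0.0828,
    "Bundesliga matches": 0.0794,
}
for sentence in pick_sentences(sentences, phrase_ranks):
    print(sentence)
```

Because sentences are chosen purely by phrase coverage, the result is a set of high-scoring sentences rather than a coherent paragraph, which matches the character of the summary printed above.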


In [15]:
# compare with the summary provided by the wikipedia library:
#
# print(wikipedia.page('Severe_acute_respiratory_syndrome_coronavirus_2').summary)
print(wikipedia.page('Fussball-Bundesliga').summary)

The Bundesliga (German: [ˈbʊndəsˌliːɡa] (listen); lit.  'Federal League'), sometimes referred to as the Fußball-Bundesliga ([ˌfuːsbal-]) or 1. Bundesliga ([ˌeːɐ̯stə-]), is a professional association football league in Germany. At the top of the German football league system, the Bundesliga is Germany's primary football competition. The Bundesliga comprises 18 teams and operates on a system of promotion and relegation with the 2. Bundesliga. Seasons run from August to May. Most games are played on Saturdays and Sundays, with a few games played on weekdays. All of the Bundesliga clubs qualify for the DFB-Pokal. The winner of the Bundesliga qualifies for the DFL-Supercup.
Fifty-six clubs have competed in the Bundesliga since its founding. Bayern Munich has won the title 29 times, the most among Bundesliga clubs. However, the Bundesliga has seen other champions, with Borussia Dortmund, Hamburger SV, Werder Bremen, Borussia Mönchengladbach, and VfB Stuttgart most prominent among them. The B

In [17]:
# examine the top-ranked phrases in the document
#
# display the first 10:

for p in doc._.phrases[:10]:
    # rank, number of occurrences, and phrase text
    print('{:.4f} {:5d}  {}'.format(p.rank, p.count, p.text))
    # every span in the document where the phrase occurs
    print(p.chunks)

0.0935     3  Bundesliga clubs
[Bundesliga clubs, Bundesliga clubs, Bundesliga clubs]
0.0834     1  Bundesliga club Borussia Dortmund
[Bundesliga club Borussia Dortmund]
0.0828     2  Bundesliga titles
[Bundesliga titles, Bundesliga titles]
0.0794     1  Bundesliga matches
[Bundesliga matches]
0.0784    95  Bundesliga
[Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesliga, Bundesl

Copyright © 2021 IUBH Internationale Hochschule