# **Text Summarization**

Text summarization in NLP describes methods that automatically generate summaries containing the most relevant information from source texts. There are two families of techniques: extractive and abstractive. Extractive techniques select the most important word sequences (typically whole sentences) from the document and combine them into a summary. Abstractive techniques generate new text that paraphrases the content of the original document, much like humans do when they write an abstract [[1]](#scrollTo=8Pzkt1Z_M6OH).

This notebook shows an example of unsupervised extractive text summarization with TextRank.

## Unsupervised extractive text summarization with TextRank

TextRank is a common unsupervised extractive summarization technique. It compares every sentence in the text with every other sentence by calculating a similarity score, for example, the cosine similarity, for each sentence pair. The closer the score is to 1, the more similar the two sentences are. These scores are summed up for each sentence to get a rank: a sentence that is similar to many other sentences represents the text well. The higher the rank, the more important the sentence is in the text. Finally, the sentences are sorted by rank and a summary is built from a defined number of the highest-ranked sentences [[1]](#scrollTo=8Pzkt1Z_M6OH).
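
The ranking idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not the implementation used by ``pytextrank``: sentences are represented as naive bag-of-words counts, and the helper names (``bag_of_words``, ``cosine_similarity``, ``rank_sentences``, ``summarize``) and the sample sentences are hypothetical. Note also that the original TextRank algorithm refines the scores with a PageRank-style iteration over the similarity graph, whereas this sketch only sums the pairwise similarities.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts; a deliberately naive sentence representation."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(sentences):
    """Score each sentence by the sum of its similarities to all other sentences."""
    vectors = [bag_of_words(s) for s in sentences]
    return [
        sum(cosine_similarity(vi, vj) for j, vj in enumerate(vectors) if j != i)
        for i, vi in enumerate(vectors)
    ]


def summarize(sentences, n=2):
    """Return the n highest-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


# Hypothetical sample sentences, used only for illustration
sentences = [
    "The Turing test checks whether a machine can imitate a human.",
    "A human interrogator talks to a machine and a human.",
    "The weather was pleasant yesterday.",
]
print(summarize(sentences, n=2))
```

The third sentence shares almost no words with the other two, so it receives the lowest score and is left out of the two-sentence summary.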

Unsupervised text summarization can be performed with the ``spaCy`` library and the TextRank algorithm by using the ``pytextrank`` library. For more details about the ``spaCy`` and ``pytextrank`` libraries, please refer to [[2]](https://spacy.io/) and [[3]](https://derwen.ai/docs/ptr/).

For text summarization with ``spaCy`` and ``pytextrank``, we will apply the following steps:
* Install and import libraries
* Download and install the language model
* Create a ``spaCy`` pipeline and load the language model to it
* Add ``pytextrank`` to the ``spaCy`` pipeline
* Create a ``spaCy`` document with a sample text
* Use the ``textrank.summary()`` method to create a text summary

### Install and import libraries

#### Install ``pytextrank`` library

``pytextrank`` is an implementation of TextRank for use in ``spaCy`` pipelines. It provides fast, effective phrase extraction from texts, along with extractive summarization [[4]](https://spacy.io/universe/project/spacy-pytextrank).



In [None]:
# Install the pytextrank library 
!pip install pytextrank==3.0.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytextrank==3.0.1
  Downloading pytextrank-3.0.1-py3-none-any.whl (19 kB)
Collecting icecream>=2.1
  Downloading icecream-2.1.2-py2.py3-none-any.whl (8.3 kB)
Collecting graphviz>=0.13
  Downloading graphviz-0.20-py3-none-any.whl (46 kB)
Collecting asttokens>=2.0.1
  Downloading asttokens-2.0.5-py2.py3-none-any.whl (20 kB)
Collecting colorama>=0.3.9
  Downloading colorama-0.4.5-py2.py3-none-any.whl (16 kB)
Collecting executing>=0.3.1
  Downloading executing-0.8.3-py2.py3-none-any.whl (16 kB)
Installing collected packages: executing, colorama, asttokens, icecream, graphviz, pytextrank
  Attempting uninstall: graphviz
    Found existing installation: graphviz 0.10.1
    Uninstalling graphviz-0.10.1:
      Successfully uninstalled graphviz-0.10.1
Successfully installed asttokens-2.0.5 colorama-0.4.5 executing-0.8.3 graphviz

#### Import ``spaCy`` and ``pytextrank`` libraries

We import the ``spaCy`` and ``pytextrank`` libraries.

``spaCy`` is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning [[5]](https://spacy.io/usage/spacy-101). It supports the implementation of tasks such as sentiment analysis, chatbots, text summarization, and intent and entity extraction [[1]](#scrollTo=8Pzkt1Z_M6OH). For more information about ``spaCy``, please refer to [[2]](https://spacy.io/).

We have installed the ``pytextrank`` library in the previous section. Now we will import it. For more details about ``pytextrank``, please refer to [[3]](https://derwen.ai/docs/ptr/).

In [None]:
# Import spaCy and pytextrank libraries
import spacy
import pytextrank

### Download and install language model
We download and install the ``en_core_web_sm`` English language model with the ``spacy download`` command.
For more details about ``en_core_web_sm``, please refer to [[6]](https://spacy.io/models).

In [None]:
# Download "en_core_web_sm" English language model
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Create ``spaCy`` pipeline and load language model
We use the ``spacy.load()`` function to load the language model ``en_core_web_sm``. The function returns a ``spaCy`` pipeline, which we assign to ``sp``.


In [None]:
# Create a spaCy pipeline "sp" and load the language model
sp = spacy.load('en_core_web_sm')

### Add ``pytextrank`` to ``spaCy`` pipeline

We use the ``add_pipe()`` method to add ``pytextrank`` to the ``spaCy`` pipeline ``sp``.

In [None]:
# Add pytextrank to the spaCy pipeline
sp.add_pipe('textrank', last=True)

<pytextrank.base.BaseTextRank at 0x7fc5aeb8dd50>

Now our ``spaCy`` pipeline is ready for text summarization. In the following step, we will create a ``spaCy`` document for text summarization.

### Create ``spaCy`` document with sample text

In this step, we pass a sample text to the ``spaCy`` pipeline ``sp``, which returns a ``Doc`` object that we store as ``doc``.

When the pipeline creates a ``Doc`` object, it automatically tokenizes the input text and then runs the pipeline components, for example, part-of-speech (POS) tagging and named entity recognition (NER). The following figure shows the processing pipeline that turns a given text into a ``Doc`` object [[7]](https://spacy.io/usage/processing-pipelines).

![spaCy](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [None]:
# Define a sample text
text="""Alan Mathison Turing, a British mathematician and computer scientist,\
 was one of the early pioneers of artificial intelligence. Turing (1950) describes \
 the foundation of what was later called the Turing test. The experimental setup of \
 the Turing test is as follows. A human interrogator uses a chat program to talk to \
 two conversation partners: a chatbot and another human being. Both of them try to \
 convince the interrogator that they are the human. If the interrogator is not able to \
 identify the human through intense questioning, the machine is considered to have passed \
 the Turing test. According to Turing, passing the test can lead to the conclusion that \
 the machine’s intellectual power is on a level comparable to the human brain. While the \
 Turing test has often been criticized because of its focus on functionality, the question \
 of whether the machine is conscious about its answers remains open. Several attempts have \
 been made to pass the Turing test, but it still remains an unresolved challenge."""

# Create a spaCy Doc object "doc" with the sample text
doc = sp(text)

### Create text summary

We use the ``textrank.summary()`` method of ``pytextrank`` to run an extractive summarization. For that, we set the following parameters:

* ``limit_phrases``: It defines the maximum number of top-ranked phrases to use in the distance vectors. ``pytextrank`` ranks the phrases of the given text and sorts them in descending order of rank. In this example, we set ``limit_phrases=3``. That means only the 3 top-ranked phrases of the document are used to score the sentences. Since ``limit_phrases`` is a hyperparameter, you can try different limits.

* ``limit_sentences``: It defines the total number of sentences to return. In this example, we set ``limit_sentences=3``. That means our summary will contain 3 sentences.

* ``preserve_order``: If set to ``True``, the sentences of the summary are returned in the order in which they originally occurred in the given text instead of in ranking order. In this example, we set ``preserve_order=True``.

According to the reference documentation [[8]](https://derwen.ai/docs/ptr/ref/), the ``textrank.summary()`` method runs an extractive summarization based on the vector distance, per sentence, for each of the top-ranked phrases. In outline, it performs the following steps:
* Build a vector from the ranks of the top P phrases in the sample text (P=``limit_phrases``)
* Calculate a distance to this vector for each sentence, based on which of the top P phrases the sentence contains
* Return the S sentences with the smallest distance (S=``limit_sentences``) as the text summary

For more details about the ``textrank.summary()`` method, please refer to [[8]](https://derwen.ai/docs/ptr/ref/).
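
To make the roles of the three parameters more concrete, the following plain-Python sketch shows one way a ranking of phrases can be turned into a ranking of sentences by vector distance. It is a simplified illustration under stated assumptions and not the actual ``pytextrank`` code: the phrase ranks are made-up values, phrases are matched by simple substring search, and the function names (``unit_vector``, ``sentence_distance``, ``summarize``) are hypothetical.

```python
import math


def unit_vector(phrase_ranks, limit_phrases):
    """Keep the top `limit_phrases` phrases and normalize their ranks to sum to 1."""
    top = sorted(phrase_ranks.items(), key=lambda kv: kv[1], reverse=True)
    top = top[:limit_phrases]
    total = sum(rank for _, rank in top)
    if total <= 0:
        return dict(top)
    return {phrase: rank / total for phrase, rank in top}


def sentence_distance(sentence, vector):
    """Euclidean length of the vector components whose phrase is missing from the sentence."""
    text = sentence.lower()
    return math.sqrt(sum(w ** 2 for phrase, w in vector.items() if phrase not in text))


def summarize(sentences, phrase_ranks, limit_phrases=3, limit_sentences=2,
              preserve_order=True):
    """Pick the sentences closest to the vector of top-ranked phrases."""
    vector = unit_vector(phrase_ranks, limit_phrases)
    by_distance = sorted(range(len(sentences)),
                         key=lambda i: sentence_distance(sentences[i], vector))
    chosen = by_distance[:limit_sentences]
    if preserve_order:
        chosen = sorted(chosen)  # restore the original sentence order
    return [sentences[i] for i in chosen]


# Made-up phrase ranks and sentences, used only for illustration
phrase_ranks = {"turing test": 0.12, "human": 0.09, "machine": 0.08, "chat program": 0.05}
sentences = [
    "Turing described the Turing test in 1950.",
    "The weather was pleasant.",
    "The machine passed the Turing test by imitating a human.",
    "A human interrogator uses a chat program.",
]

for sent in summarize(sentences, phrase_ranks, limit_phrases=3, limit_sentences=2):
    print(sent)
```

In this sketch, a sentence that contains all of the top-ranked phrases has distance 0 and is selected first. With ``preserve_order=False``, the selected sentences would be returned in ranking order instead of document order.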



In [None]:
# Perform text summarization
summary = list(doc._.textrank.summary(limit_phrases=3, limit_sentences=3, preserve_order=True))
for sent in summary:
  print(sent,"\n")

Alan Mathison Turing, a British mathematician and computer scientist, was one of the early pioneers of artificial intelligence. 

Turing (1950) describes  the foundation of what was later called the Turing test. 

According to Turing, passing the test can lead to the conclusion that  the machine’s intellectual power is on a level comparable to the human brain. 



# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://spacy.io
- [3] https://derwen.ai/docs/ptr
- [4] https://spacy.io/universe/project/spacy-pytextrank
- [5] https://spacy.io/usage/spacy-101
- [6] https://spacy.io/models
- [7] https://spacy.io/usage/processing-pipelines
- [8] https://derwen.ai/docs/ptr/ref

Copyright © 2022 IU International University of Applied Sciences