# **Text Summarization**

Text Summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.

To understand it better, let us look at the different types of text summarization.

Text summarization methods can be grouped into two main categories:
**extractive and abstractive methods.**


# **Extractive Text Summarization**

Extractive summarization is the traditional approach. Its main objective is to identify the most significant sentences in the text and add them to the summary. Note that the resulting summary contains exact sentences from the original text.
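
As a minimal sketch of the extractive idea, the snippet below scores each sentence by the average frequency of its words and keeps the top-scoring sentences verbatim. The scoring rule, the tiny stop-word list, the regex sentence splitter, and the names `split_sentences` and `extractive_summary` are assumptions made for illustration; this is not how the transformer-based summarizers used later in this notebook choose sentences.

```python
import re
from collections import Counter

# Small, hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it",
              "that", "are", "on", "for", "was", "their", "they"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def content_words(text):
    """Lower-cased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def extractive_summary(text, num_sentences=2):
    """Return the top-scoring sentences, unchanged and in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)

sample = ("Orangutans live in the forests of Sumatra and Borneo. "
          "They spend most of their time in trees. "
          "The weather was pleasant on the day of the survey. "
          "Forests in Sumatra are shrinking, which threatens orangutans.")
summary = extractive_summary(sample, num_sentences=2)
print(summary)
```

Every sentence in the printed summary is copied unchanged from `sample`, which is the defining property of an extractive method.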

# **Abstractive Text Summarization**

Abstractive summarization is the more advanced approach: the model identifies the important sections, interprets the context, and expresses the content in new wording. The aim is to convey the core information in the shortest text possible.
Note that here the sentences in the summary are generated by the model, not just extracted from the original text.

The rest of this notebook summarizes one article with three pre-trained transformer models, each used through the bert-extractive-summarizer package:
1. BERT
2. GPT-2
3. XLNET

# **1. Text Summarization with BERT**
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model designed to overcome limitations of RNNs and similar sequence models, such as their difficulty with long-term dependencies. It is pre-trained on large amounts of text and is bidirectional by design: the representation of each token draws on both its left and right context. The pre-trained model can then be adapted to specific NLP tasks, summarization in our case.

Step 1: Install Transformers (2.2.0) and bert-extractive-summarizer, then upgrade Transformers

In [None]:
!pip install transformers==2.2.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==2.2.0
  Downloading transformers-2.2.0-py3-none-any.whl (360 kB)
Collecting boto3 (from transformers==2.2.0)
  Downloading boto3-1.26.160-py3-none-any.whl (135 kB)
Collecting sentencepiece (from transformers==2.2.0)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting sacremoses (from transformers==2.2.0)
  Downloading sacremoses-0.0.53.tar.gz (880 kB)

In [None]:
!pip install bert-extractive-summarizer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install --upgrade transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Using cached transformers-4.30.2-py3-none-any.whl (7.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 2.2.0
    Uninstalling transformers-2.2.0:
      Successfully uninstalled transformers-2.2.0
Successfully installed transformers-4.30.2
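
The cells above first pin transformers to 2.2.0 and then upgrade it (to 4.30.2 in this run), so it can help to confirm which versions are actually installed before importing anything. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical, and the package names are simply the ones installed in this notebook.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Report the packages this notebook relies on.
for name in ("transformers", "bert-extractive-summarizer", "spacy"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```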


Step 2: Install spaCy, an open-source library for advanced natural language processing written in Python and Cython.

In [None]:
!pip install spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Step 3: Import the library

In [None]:
from summarizer import Summarizer, TransformerSummarizer

Step 4: Assign the article to be summarized to the variable ‘body’.

In [None]:
body = '''
       Scientists say they have discovered a new species of orangutans on Indonesia’s island of Sumatra.
The population differs in several ways from the two existing orangutan species found in Sumatra and the neighboring island of Borneo.
The orangutans were found inside North Sumatra’s Batang Toru forest, the science publication Current Biology reported.
Researchers named the new species the Tapanuli orangutan. They say the animals are considered a new species because of genetic, skeletal and tooth differences.
Michael Kruetzen is a geneticist with the University of Zurich who has studied the orangutans for several years. He said he was excited to be part of the unusual discovery of a new great ape in the present day. He noted that most great apes are currently considered endangered or severely endangered.
Gorillas, chimpanzees and bonobos also belong to the great ape species.
Orangutan – which means person of the forest in the Indonesian and Malay languages - is the world’s biggest tree-living mammal. The orange-haired animals can move easily among the trees because their arms are longer than their legs. They live more lonely lives than other great apes, spending a lot of time sleeping and eating fruit in the forest.
The new study said fewer than 800 of the newly-described orangutans exist. Their low numbers make the group the most endangered of all the great ape species.
They live within an area covering about 1,000 square kilometers. The population is considered highly vulnerable. That is because the environment which they depend on is greatly threatened by development.
Researchers say if steps are not taken quickly to reduce the current and future threats, the new species could become extinct “within our lifetime.”
Research into the new species began in 2013, when an orangutan protection group in Sumatra found an injured orangutan in an area far away from the other species. The adult male orangutan had been beaten by local villagers and died of his injuries. The complete skull was examined by researchers.
Among the physical differences of the new species are a notably smaller head and frizzier hair. The Tapanuli orangutans also have a different diet and are found only in higher forest areas.
There is no unified international system for recognizing new species. But to be considered, discovery claims at least require publication in a major scientific publication.
Russell Mittermeier is head of the primate specialist group at the International Union for the Conservation of Nature. He called the finding a “remarkable discovery.” He said it puts responsibility on the Indonesian government to help the species survive.
Matthew Nowak is one of the writers of the study. He told the Associated Press that there are three groups of the Tapanuli orangutans that are separated by non-protected land.He said forest land needs to connect the separated groups.
In addition, the writers of the study are recommending that plans for a hydropower center in the area be stopped by the government.
It also recommended that remaining forest in the Sumatran area where the orangutans live be protected.
I’m Bryan Lynn.

        '''

Step 5: Load the BERT summarizer model and print the summary.

In [None]:
bert_model = Summarizer()
bert_summary = ''.join(bert_model(body, min_length=60))
print(bert_summary)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Scientists say they have discovered a new species of orangutans on Indonesia’s island of Sumatra. They say the animals are considered a new species because of genetic, skeletal and tooth differences. Orangutan – which means person of the forest in the Indonesian and Malay languages - is the world’s biggest tree-living mammal. Their low numbers make the group the most endangered of all the great ape species. He said it puts responsibility on the Indonesian government to help the species survive. It also recommended that remaining forest in the Sumatran area where the orangutans live be protected.
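
The bert-extractive-summarizer package is documented as embedding each sentence with the transformer, clustering the embeddings, and keeping the sentences closest to the cluster centroids. The toy sketch below illustrates only that selection step, with hand-written 2-D vectors standing in for real sentence embeddings. The vectors, the sample sentences, the plain k-means routine, and the function names are all hypothetical; this is not the library's actual code.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points for reproducibility."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids

def pick_sentences(sentences, embeddings, k):
    """For each centroid, keep the sentence whose embedding lies closest to it."""
    centroids = kmeans(embeddings, k)
    chosen = {min(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], c))
              for c in centroids}
    return [sentences[i] for i in sorted(chosen)]  # original document order

# Made-up sentences and 2-D "embeddings" forming two well-separated groups.
sentences = [
    "Scientists found a new orangutan species.",
    "The new species lives in Sumatra.",
    "It is called the Tapanuli orangutan.",
    "Its forest is threatened by development.",
    "Fewer than 800 of the animals remain.",
    "A hydropower project adds to the threat.",
]
embeddings = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (10.0, 12.0)]

for sentence in pick_sentences(sentences, embeddings, k=2):
    print(sentence)
```

With these toy vectors, one sentence is selected from each of the two groups, which mirrors how a cluster-based extractive summarizer spreads its picks across distinct topics in the text.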




# **2. Text Summarization with GPT-2**

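GPT-2 (Generative Pre-trained Transformer 2) is a language model from OpenAI whose largest version has about 1.5 billion parameters. Its successor, GPT-3, has 175 billion parameters, which gives an idea of how far this family of models has scaled. These models can write anything from software code to stories. The cell below loads the smaller gpt2-medium checkpoint through bert-extractive-summarizer, so the resulting summary is still made up of sentences taken from the article.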

In [None]:
# Load the model and print the summary
GPT2_model = TransformerSummarizer(transformer_type="GPT2",transformer_model_key="gpt2-medium")
full = ''.join(GPT2_model(body, min_length=60))
print(full)

Scientists say they have discovered a new species of orangutans on Indonesia’s island of Sumatra. The population differs in several ways from the two existing orangutan species found in Sumatra and the neighboring island of Borneo. They say the animals are considered a new species because of genetic, skeletal and tooth differences. They live within an area covering about 1,000 square kilometers. That is because the environment which they depend on is greatly threatened by development. In addition, the writers of the study are recommending that plans for a hydropower center in the area be stopped by the government.
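
Every sentence in the summary above also appears word for word in the article, so this output is extractive even though GPT-2 is a generative model. A small check along the following lines can confirm that for any summary. The helper names and the naive regex splitter are assumptions for illustration, and the short strings are stand-ins built from a few sentences of the article rather than the full `body`.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def is_extractive(summary, source):
    """True if every sentence of `summary` occurs verbatim in `source`."""
    normalized = " ".join(source.split())  # collapse newlines and repeated spaces
    return all(" ".join(s.split()) in normalized for s in split_sentences(summary))

source = (
    "They say the animals are considered a new species because of genetic, "
    "skeletal and tooth differences.\n"
    "They live within an area covering about 1,000 square kilometers.\n"
    "That is because the environment which they depend on is greatly "
    "threatened by development."
)
copied = ("They say the animals are considered a new species because of genetic, "
          "skeletal and tooth differences. "
          "They live within an area covering about 1,000 square kilometers.")
reworded = "The animals differ genetically and occupy a small area."

print(is_extractive(copied, source))
print(is_extractive(reworded, source))
```

In the notebook itself, the same check could be applied as `is_extractive(full, body)`.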




# **3. Text Summarization with XLNet**

XLNet is particularly interesting for language generation because it is pre-trained in an autoregressive manner, similar to the GPT family of models.

In [None]:
# Load the model and print the summary it produces
model = TransformerSummarizer(transformer_type="XLNet",transformer_model_key="xlnet-base-cased")
xfull = ''.join(model(body, min_length=60))
print(xfull)

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetModel: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Scientists say they have discovered a new species of orangutans on Indonesia’s island of Sumatra. They say the animals are considered a new species because of genetic, skeletal and tooth differences. Research into the new species began in 2013, when an orangutan protection group in Sumatra found an injured orangutan in an area far away from the other species. The Tapanuli orangutans also have a different diet and are found only in higher forest areas. In addition, the writers of the study are recommending that plans for a hydropower center in the area be stopped by the government. It also recommended that remaining forest in the Sumatran area where the orangutans live be protected.
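
The three summaries overlap only partly: all of them keep the opening sentence and the sentence about genetic, skeletal and tooth differences, and then diverge. One way to quantify this is to compare the summaries as sets of sentences. The sketch below is a hedged illustration; the function names are hypothetical, and the short strings are stand-ins for `bert_summary`, `full` and `xfull`.

```python
import re

def sentence_set(text):
    """Set of sentences, split naively after '.', '!' or '?'."""
    return {s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()}

def shared_sentences(*summaries):
    """Sentences that appear in every summary."""
    return set.intersection(*(sentence_set(s) for s in summaries))

def jaccard(a, b):
    """Sentence-level Jaccard similarity: shared sentences / all distinct sentences."""
    sa, sb = sentence_set(a), sentence_set(b)
    return len(sa & sb) / len(sa | sb)

# Stand-in summaries (shortened, made-up wording).
bert_demo = "A new species was found. It differs genetically. It is the most endangered great ape."
gpt2_demo = "A new species was found. It differs genetically. It lives in a small area."
xlnet_demo = "A new species was found. It differs genetically. Research began in 2013."

print(sorted(shared_sentences(bert_demo, gpt2_demo, xlnet_demo)))
print(jaccard(bert_demo, gpt2_demo))
```

A Jaccard score of 1.0 would mean two models picked exactly the same sentences, while a score near 0 would mean they agree on almost nothing.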


