## Text Summarization using T5 model

In [1]:
!pip install transformers sentencepiece

Collecting transformers
  Downloading transformers-4.32.0-py3-none-any.whl (7.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

In [3]:
import pandas as pd
import torch
import json
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

## Download and Load the dataset

Download the dataset: [News Summary](https://www.kaggle.com/datasets/sunnysai12345/news-summary)

Once downloaded, upload `news_summary.csv` to Colab and then proceed.

In [7]:
data = pd.read_csv("news_summary.csv", encoding='iso-8859-1')
data.head()

Unnamed: 0,author,date,headlines,read_more,text,ctext
0,Chhavi Tyagi,"03 Aug 2017,Thursday",Daman & Diu revokes mandatory Rakshabandhan in...,http://www.hindustantimes.com/india-news/raksh...,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,Daisy Mowke,"03 Aug 2017,Thursday",Malaika slams user who trolled her for 'divorc...,http://www.hindustantimes.com/bollywood/malaik...,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."
2,Arshiya Chopra,"03 Aug 2017,Thursday",'Virgin' now corrected to 'Unmarried' in IGIMS...,http://www.hindustantimes.com/patna/bihar-igim...,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,Sumedha Sehra,"03 Aug 2017,Thursday",Aaj aapne pakad liya: LeT man Dujana before be...,http://indiatoday.intoday.in/story/abu-dujana-...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Aarushi Maheshwari,"03 Aug 2017,Thursday",Hotel staff to get training to spot signs of s...,http://indiatoday.intoday.in/story/sex-traffic...,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...


In [8]:
data['text'].iloc[0]

'The Administration of Union Territory Daman and Diu has revoked its order that made it compulsory for women to tie rakhis to their male colleagues on the occasion of Rakshabandhan on August 7. The administration was forced to withdraw the decision within 24 hours of issuing the circular after it received flak from employees and was slammed on social media.'

In [9]:
data['ctext'].iloc[0]

'The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media.The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.?It has been decided to celebrate the festival of Rakshabandhan on August 7. In this connection, all offices/ departments shall remain open and celebrate the festival collectively at a suitable time wherein all the lady staff shall tie rakhis to their colleagues,? the order, issued on August 1 by Gurpreet Singh, deputy secretary (personnel), had said.To ensure that no one skipped office, an attendance report was to be sent to the government the next evening.The two notifications ? one mandating the celebration of Rakshabandhan (left) and the other withdrawing the mandate (right) ? were issued by the Dama

In [10]:
data['headlines'].iloc[0]

'Daman & Diu revokes mandatory Rakshabandhan in offices order'

## Tokenization and Model Inference

In [11]:
model_name = 't5-small'
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=True`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [12]:
text = data['ctext'].iloc[3]
text

'Lashkar-e-Taiba\'s Kashmir commander Abu Dujana was killed in an encounter in a village in Pulwama district of Jammu and Kashmir earlier this week. Dujana, who had managed to give the security forces a slip several times in the past, carried a bounty of Rs 15 lakh on his head.Reports say that Dujana had come to meet his wife when he was trapped inside a house in Hakripora village. Security officials involved in the encounter tried their best to convince Dujana to surrender but he refused, reports say.According to reports, Dujana rejected call for surrender from an Army officer. The Army had commissioned a local to start a telephonic conversation with Dujana. After initiating the talk, the local villager handed over the phone to the army officer."Kya haal hai? Maine kaha, kya haal hai (How are you. I asked, how are you)?" Dujana is heard asking the officer. The officer replies: "Humara haal chhor Dujana. Surrender kyun nahi kar deta. Tu galat kar rha hai (Why don\'t you surrender? You 

In [13]:
before_summarize = text.strip().replace("\n","")
print("Before text summarization:")
before_summarize

Before text summarization:


'Lashkar-e-Taiba\'s Kashmir commander Abu Dujana was killed in an encounter in a village in Pulwama district of Jammu and Kashmir earlier this week. Dujana, who had managed to give the security forces a slip several times in the past, carried a bounty of Rs 15 lakh on his head.Reports say that Dujana had come to meet his wife when he was trapped inside a house in Hakripora village. Security officials involved in the encounter tried their best to convince Dujana to surrender but he refused, reports say.According to reports, Dujana rejected call for surrender from an Army officer. The Army had commissioned a local to start a telephonic conversation with Dujana. After initiating the talk, the local villager handed over the phone to the army officer."Kya haal hai? Maine kaha, kya haal hai (How are you. I asked, how are you)?" Dujana is heard asking the officer. The officer replies: "Humara haal chhor Dujana. Surrender kyun nahi kar deta. Tu galat kar rha hai (Why don\'t you surrender? You 

In [14]:
tokenized_text = tokenizer.encode(before_summarize, return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (735 > 512). Running this sequence through the model will result in indexing errors
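
The warning above appears because the article encodes to 735 tokens, more than the 512 the tokenizer is configured for. Separately, the original T5 checkpoints were trained with a task prefix, and for summarization that prefix is `summarize: `. The sketch below shows one way to handle both. It is a minimal sketch, not the notebook's original method: the helper names `prepare_t5_input` and `encode_for_t5` are made up for illustration, and collapsing whitespace with `split()` is a swapped-in alternative to the `replace("\n", "")` call used earlier.

```python
def prepare_t5_input(text, prefix="summarize: "):
    """Collapse runs of whitespace and prepend the T5 task prefix."""
    cleaned = " ".join(text.split())
    return prefix + cleaned


def encode_for_t5(tokenizer, text, max_length=512):
    """Tokenize with truncation so the input never exceeds max_length tokens.

    `tokenizer` is expected to behave like the T5Tokenizer loaded above.
    """
    encoded = tokenizer(
        prepare_t5_input(text),
        max_length=max_length,
        truncation=True,       # drop tokens beyond max_length instead of warning
        return_tensors="pt",
    )
    return encoded["input_ids"]


# Usage with the objects defined in the cells above:
# tokenized_text = encode_for_t5(tokenizer, text)
```

Truncation discards everything past the first 512 tokens, so the tail of a long article will not influence the summary.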


In [15]:
summary_ids = model.generate(
    tokenized_text,
    no_repeat_ngram_size=3,  # never repeat any 3-gram in the output
    min_length=40,
    max_length=150,
    early_stopping=True,  # only has an effect with beam search (num_beams > 1)
)



In [16]:
summary_ids

tensor([[    0,     3,     9,     3,  1931,  9621,   447,  3634,    28,     8,
          9102,  5502,    47,   708,    28,     3,     9,   415,     3,     5,
             8,  5502, 18606,    10,    96, 13284,  1635,     9,  4244,   138,
             3,   524,   107,   127,   970,  7066,     9,     5,  3705,    52,
          3868,     3,  3781,   202,     3,  8607,    23,     3,  4031,    20,
            17,     9,     5,  2740,  7466,   144,     3,  4031,     3,    52,
          1024,  4244,    23,     5,  2740,     3,  1258,  1024,     6,     3,
          3781,     9,  4244,    23,     6,     8,    15,   157,  4244,    23,
            58,    41,   196,   751,    31,    17, 23970,     5,   432,     9,
             9,   107,   133,   103,  2891,    19,   132,    16,    82, 15126,
            61,   121,   970,  7066,   152,     9,    47,  4792,    16,    46,
          6326,    16,     3,     9,     1]])

In [17]:
after_summarize = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

In [18]:
print("After text summarization:")
after_summarize

After text summarization:


'a telephonic conversation with the army officer was started with a local. the officer replied: "Humara haal chhor Dujana. Surrender kyun nahi kar deta. Tu galat kar rha hai. Tu kaha, kya hai, theek hai? (I won\'t surrender. Allaah would do whatever is there in my fate)" Dujanana was killed in an encounter in a'
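
Since the article is longer than the 512 tokens reported in the tokenizer warning, one common workaround is to split it into smaller pieces, summarize each piece, and join the partial summaries. The following is a rough sketch under stated assumptions: `chunk_words` and `summarize_long_text` are hypothetical helpers, word count is used only as a crude proxy for token count, and the default of 300 words is an arbitrary illustrative value rather than a tuned one.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long_text(text, summarize_fn, max_words=300):
    """Summarize each chunk with summarize_fn and join the partial summaries.

    summarize_fn is any callable that maps a string to its summary,
    for example a wrapper around model.generate and tokenizer.decode.
    """
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, max_words))
```

Splitting on word boundaries can cut a sentence in half; splitting on sentence boundaries would be gentler but needs a sentence tokenizer.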

## Using Pipeline

In [19]:
from transformers import pipeline

In [20]:
summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")

In [21]:
context = """
Elon Musk, a visionary entrepreneur, is a name synonymous with groundbreaking innovations that transcend industries. Born in South Africa in 1971, Musk embarked on a journey that would lead him to revolutionize the realms of technology, space exploration, and sustainable energy.

One of Musk's most notable contributions lies in the electric vehicle sector. He co-founded Tesla Motors in 2003, aiming to make electric cars a mainstream reality. With Tesla's sleek designs, impressive performance, and advancements in battery technology, Musk has not only disrupted the automotive industry but also accelerated the world's transition to sustainable transportation.

SpaceX, another brainchild of Musk, is at the forefront of space exploration. Musk's audacious goal to make life multiplanetary has driven SpaceX to develop reusable rockets and drastically reduce the cost of space travel. The successful landing of Falcon rockets and the launch of the Falcon Heavy have reshaped the space industry, making it more accessible and economical.
"""

In [22]:
summary = summarizer(context, max_length=150, min_length=50, do_sample=True, top_k=50, top_p=0.95)
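
The `do_sample=True, top_k=50, top_p=0.95` arguments in the call above switch generation from greedy decoding to sampling, so the summary can differ from run to run. At each step, top-k keeps only the `k` most probable tokens, and top-p (nucleus sampling) then keeps the smallest set of those whose cumulative probability reaches `p`. The toy function below illustrates that idea on a plain dictionary of probabilities. It is a simplified illustration, not the library's implementation, and `top_k_top_p_filter` is a made-up name.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return the renormalized distribution left after top-k, then top-p.

    probs maps each candidate token to its probability.
    """
    # Top-k: keep the k most probable tokens and renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1 again.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.8)
print(sorted(filtered))  # tokens that remain eligible for sampling
```

Lower values of `top_k` or `top_p` make the output more conservative; higher values allow more varied wording.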

In [23]:
summary[0]['summary_text']

"entrepreneur Elon Musk founded Tesla Motors in 2003, aiming to make electric cars a mainstream reality . with Tesla's sleek designs, impressive performance, and advancements in battery technology, Musk has disrupted the automotive industry but also accelerated the transition to sustainable transportation ."

In [24]:
print("Before:",len(context))
print("After:",len(summary[0]['summary_text']))

Before: 1042
After: 307
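
The two numbers above are character counts, so the summary is a little under a third of the original length. A small helper makes that ratio explicit. This is a minimal sketch; `compression_ratio` is an illustrative name, and measuring in characters rather than words or tokens is simply the convention the cell above already uses.

```python
def compression_ratio(original, summary):
    """Return summary length divided by original length, in characters."""
    if not original:
        raise ValueError("original text must not be empty")
    return len(summary) / len(original)


# With the notebook's variables:
# compression_ratio(context, summary[0]['summary_text'])

# Using the character counts printed above (1042 and 307):
print(f"{307 / 1042:.1%}")  # → 29.5%
```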


## Using GPU

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [6]:
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

> After you move the model and the input tensors to the GPU, as in the next cell, you can run training or inference as usual. Any tensors you create or process along the way must be on the same device as the model.

In [29]:
model = model.to(device)
tokenized_text = tokenized_text.to(device)
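
When the tokenizer is called directly (rather than through `encode`), it returns several tensors, such as `input_ids` and `attention_mask`, and each one has to live on the same device as the model. The helper below is one way to move a nested batch in a single call. It is a sketch: `move_to_device` is a hypothetical name, and it relies only on the objects exposing a `.to(device)` method, as PyTorch tensors do.

```python
def move_to_device(batch, device):
    """Recursively move every object with a .to() method onto device."""
    if hasattr(batch, "to"):
        # Tensors (and other objects with .to) know how to move themselves.
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: move_to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_to_device(value, device) for value in batch)
    # Plain Python values (strings, ints, ...) are returned unchanged.
    return batch


# Usage with the objects defined above:
# inputs = move_to_device({"input_ids": tokenized_text}, device)
# summary_ids = model.generate(inputs["input_ids"], max_length=150)
```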

In [30]:
summary_ids = model.generate(
    tokenized_text,
    no_repeat_ngram_size=3,  # never repeat any 3-gram in the output
    min_length=40,
    max_length=150,
    early_stopping=True,  # only has an effect with beam search (num_beams > 1)
)

In [31]:
summary_ids

tensor([[    0,     3,     9,     3,  1931,  9621,   447,  3634,    28,     8,
          9102,  5502,    47,   708,    28,     3,     9,   415,     3,     5,
             8,  5502, 18606,    10,    96, 13284,  1635,     9,  4244,   138,
             3,   524,   107,   127,   970,  7066,     9,     5,  3705,    52,
          3868,     3,  3781,   202,     3,  8607,    23,     3,  4031,    20,
            17,     9,     5,  2740,  7466,   144,     3,  4031,     3,    52,
          1024,  4244,    23,     5,  2740,     3,  1258,  1024,     6,     3,
          3781,     9,  4244,    23,     6,     8,    15,   157,  4244,    23,
            58,    41,   196,   751,    31,    17, 23970,     5,   432,     9,
             9,   107,   133,   103,  2891,    19,   132,    16,    82, 15126,
            61,   121,   970,  7066,   152,     9,    47,  4792,    16,    46,
          6326,    16,     3,     9,     1]], device='cuda:0')

In [32]:
after_summarize = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

In [33]:
after_summarize

'a telephonic conversation with the army officer was started with a local. the officer replied: "Humara haal chhor Dujana. Surrender kyun nahi kar deta. Tu galat kar rha hai. Tu kaha, kya hai, theek hai? (I won\'t surrender. Allaah would do whatever is there in my fate)" Dujanana was killed in an encounter in a'

## Release cache

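When working with large language models, it is good practice to clear the CUDA cache when you run into an out-of-memory error. `torch.cuda.empty_cache()` releases cached memory that PyTorch is no longer using; memory held by tensors that are still referenced is not freed, so delete those references first.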

In [34]:
torch.cuda.empty_cache()