<a href="https://colab.research.google.com/github/parjanyahk/pegasus-xsum_textSummerization/blob/main/textSummerization_pegasusxsum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PARJANYA H K** 

***Importing and Installing Necessary Dependencies***

In [1]:
# Importing PyTorch

import torch

In [2]:
# Installing transformers

# pip install transformers

In [3]:
# Installing sentencepiece, which PegasusTokenizer needs in order to be imported and used

# pip install sentencepiece
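
Before importing anything, it can help to check which of the packages above are actually available. The sketch below is one way to do that with the standard library only; `missing_packages` and `REQUIRED` are hypothetical names introduced here and are not part of the original notebook.

```python
import importlib.util

# Packages this notebook relies on (names as used in "import ...").
REQUIRED = ("torch", "transformers", "sentencepiece")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    # Suggest the install command instead of running it automatically.
    print("Install first: pip install " + " ".join(missing))
else:
    print("All dependencies are available.")
```

`find_spec` only looks the module up and does not import it, so the check stays fast even for large packages such as torch.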

***Importing and Loading The Model***

In [4]:
# Importing the two Pegasus classes we need from transformers

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# PegasusTokenizer converts text into token IDs (integers that index the model's vocabulary), which can then be passed to the DL model
# PegasusForConditionalGeneration holds the Pegasus sequence-to-sequence model and exposes generate() for producing summaries

In [5]:
# Loading tokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

# from_pretrained is a class method that downloads (and caches) the tokenizer files for the named checkpoint

In [6]:
# Loading model

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

# from_pretrained is a class method that downloads (and caches) the model weights and configuration for the named checkpoint

***Performing Abstractive Summarization***



In [7]:
# Sample text taken from Wikipedia
# https://en.wikipedia.org/wiki/Elon_Musk

sample = """
Elon Reeve Musk is an entrepreneur, investor, and business magnate. He is the founder, CEO, and Chief Engineer at SpaceX; early-stage investor, CEO, and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$270 billion as of March 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes real-time billionaires list. Musk was born to a Canadian mother and South African father, and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada at age 17 to avoid conscription. He was enrolled at Queen's University and transferred to the University of Pennsylvania two years later, where he received a bachelor's degree in economics and physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career, co-founding the web software company Zip2 with his brother Kimbal. The startup was acquired by Compaq for $307 million in 1999. The same year, Musk co-founded online bank X.com, which merged with Confinity in 2000 to form PayPal. The company was bought by eBay in 2002 for $1.5 billion.

In 2002, Musk founded SpaceX, an aerospace manufacturer and space transport services company, of which he is CEO and Chief Engineer. In 2004, he joined electric vehicle manufacturer Tesla Motors, Inc. (now Tesla, Inc.) as chairman and product architect, becoming its CEO in 2008. In 2006, he helped create SolarCity, a solar energy services company that was later acquired by Tesla and became Tesla Energy. In 2015, he co-founded OpenAI, a nonprofit research company that promotes friendly artificial intelligence. In 2016, he co-founded Neuralink, a neurotechnology company focused on developing brain–computer interfaces, and founded The Boring Company, a tunnel construction company. Musk has proposed the Hyperloop, a high-speed vactrain transportation system.

Musk has been criticized for unorthodox and unscientific stances and highly publicized controversial statements. In 2018, he was sued by the US Securities and Exchange Commission (SEC) for falsely tweeting that he had secured funding for a private takeover of Tesla. He settled with the SEC, temporarily stepping down from his chairmanship and agreeing to limitations on his Twitter usage. In 2019, he won a defamation trial brought against him by a British caver who advised in the Tham Luang cave rescue. Musk has also been criticized for spreading misinformation about the COVID-19 pandemic and for his other views on such matters as artificial intelligence, cryptocurrency, and public transport.

"""

In [8]:
# Creating tokens

tokens = tokenizer(sample, truncation=True, padding="longest", return_tensors="pt")

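# The tokenizer encodes the sample text and the result is stored in tokens
# truncation=True cuts the input down to the model's maximum input length if the text is longer than that
# padding="longest" pads every sequence in the batch to the length of the longest one (no effect here, since we pass a single text)
# return_tensors="pt" returns PyTorch tensors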
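
To make the effect of `truncation` and `padding="longest"` concrete, here is a toy sketch in plain Python. It is not the tokenizer's actual implementation: `pad_and_truncate`, the example IDs, and `pad_id=0` are assumptions made for illustration, and a real tokenizer also handles details such as the end-of-sequence token that are ignored here.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy imitation of truncation=True combined with padding="longest".

    batch is a list of token-ID lists; pad_id is an assumed padding ID.
    """
    # Truncation: drop everything beyond max_length.
    truncated = [ids[:max_length] for ids in batch]
    # Padding to "longest": fill shorter sequences up to the longest one.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[11, 12, 13, 14, 15, 16], [21, 22]]
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])       # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

With a single input, as in this notebook, nothing needs padding, which is why an attention mask for one text consists entirely of ones.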

In [9]:
tokens

# The sample text has been converted into the encoding shown below. input_ids holds the actual token IDs; attention_mask marks which positions are real tokens (1) and which are padding (0), so the model can ignore the padding

{'input_ids': tensor([[32981, 77734, 20248,   117,   142,  8406,   108,  6594,   108,   111,
           260, 73518,   107,   285,   117,   109,  4252,   108,  2792,   108,
           111,  3670,  9822,   134, 37946,   206,   616,   121, 10085,  6594,
           108,  2792,   108,   111,  4711, 18663,   113, 11997,   108,  1238,
           107,   206,  4252,   113,   139, 64404,  1555,   206,   111,  1229,
           121,  9489,   113, 45077, 15365,   111,  2207, 13901,   107,   441,
           142,  3627,  2677,  1092,   113,   279,   787,  4811, 28274,  1722,
           130,   113,  1051, 56164,  4101, 60708, 20248,   117,   109, 44106,
           465,   115,   109,   278,   992,   112,   302,   109, 15742, 74645,
           116,  7186,   111,   109, 15347,   440,   121,  1139, 52931,   467,
           107, 20248,   140,  1723,   112,   114,  3066,  1499,   111,   793,
          2636,  1802,   108,   111,  2244,   115, 37845,   108,   793,  1922,
           107,   285,  9397,  3243,  

***Summarizing***

In [10]:
summary = model.generate(**tokens)

# **tokens unpacks the encoding, so input_ids and attention_mask are passed to generate() as keyword arguments
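
The `**` unpacking can be illustrated without the model. In the sketch below, `generate` is a hypothetical stand-in that only reports the arguments it received; it is not the real `model.generate`, and the token IDs are made up.

```python
def generate(input_ids, attention_mask=None):
    """Hypothetical stand-in for model.generate: returns what it was given."""
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A plain dict shaped like the tokenizer output (toy values).
tokens = {"input_ids": [[5, 6, 7]], "attention_mask": [[1, 1, 1]]}

# Unpacking with ** ...
unpacked = generate(**tokens)
# ... is equivalent to naming each keyword argument by hand.
explicit = generate(input_ids=tokens["input_ids"], attention_mask=tokens["attention_mask"])

print(unpacked == explicit)  # True
```

Unpacking keeps the call short and keeps working if the tokenizer returns additional keys that the model accepts.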

In [11]:
{**tokens}

# tokens behaves like a dictionary, so {**tokens} unpacks it into a plain dict, identifiable by the { } in the output

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [12]:
# We passed tensors of input token IDs to generate()
# It returned a tensor of output token IDs

summary

# Decoding these IDs will give us the summarized text

tensor([[    0, 32981, 20248,   117,   156,   113,   109,   278,   131,   116,
         21315,   200,   107,     1]])

***Decoding***

In [13]:
# Decoding summary

tokenizer.decode(summary[0])

# The same tokenizer that encoded the input can decode the output
# The output is nested (batched), meaning it has the form [[ ]], so [0] selects the first sequence, which is the only one here

"Elon Musk is one of the world's richest people."

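The two details of the decoding step, indexing with `[0]` and the handling of special tokens, can be shown with a toy vocabulary. Everything in this sketch is made up for illustration: `TOY_VOCAB`, `SPECIAL_IDS`, and `toy_decode` are hypothetical and far simpler than the real tokenizer. The real `tokenizer.decode` accepts a `skip_special_tokens=True` argument that plays the same role as the flag below.

```python
# Toy vocabulary: IDs 0 and 1 play the role of padding and end-of-sequence markers.
TOY_VOCAB = {0: "<pad>", 1: "</s>", 2: "Hello", 3: "world", 4: "."}
SPECIAL_IDS = {0, 1}


def toy_decode(ids, skip_special_tokens=False):
    """Map token IDs back to text, optionally dropping special tokens."""
    if skip_special_tokens:
        ids = [i for i in ids if i not in SPECIAL_IDS]
    return " ".join(TOY_VOCAB[i] for i in ids)


# Generation returns a batch, so even a single result is nested.
batch_output = [[0, 2, 3, 4, 1]]
first = batch_output[0]  # select the only sequence in the batch

print(toy_decode(first))                            # <pad> Hello world . </s>
print(toy_decode(first, skip_special_tokens=True))  # Hello world .
```

The summary tensor in this notebook also starts with ID 0 and ends with ID 1, so passing `skip_special_tokens=True` to the real decoder is a reasonable safeguard if such markers appear in the decoded text.
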
Searching Elon Musk's Wikipedia page for the decoded string returns no match: the model wrote a new sentence instead of copying one from the source text.
Hence the name abstractive summarization.
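
The steps in this notebook, and the manual search described above, can be collected into two small helpers. This is a sketch of one possible arrangement rather than a definitive implementation: `summarize` and `is_verbatim` are hypothetical names introduced here, `summarize` expects the `model` and `tokenizer` objects loaded earlier, and `skip_special_tokens=True` is an addition to the original decode call.

```python
def summarize(text, model, tokenizer, **generate_kwargs):
    """Tokenize text, generate summary token IDs, and decode them to a string."""
    tokens = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
    summary_ids = model.generate(**tokens, **generate_kwargs)
    # [0] selects the first (and only) sequence in the batch.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


def is_verbatim(summary, source):
    """Return True if the summary appears word for word in the source.

    Whitespace and letter case are normalized before comparing.
    """
    def normalize(s):
        return " ".join(s.split()).lower()

    return normalize(summary) in normalize(source)


# Toy strings to show the check itself (no model needed).
print(is_verbatim("Musk founded SpaceX",
                  "In 2002, Musk founded SpaceX, an aerospace manufacturer."))  # True
print(is_verbatim("Elon Musk is one of the world's richest people.",
                  "Musk is the wealthiest person in the world according to Forbes."))  # False

# With the objects from this notebook, the calls would look like:
# summary_text = summarize(sample, model, tokenizer)
# print(is_verbatim(summary_text, sample))
```

A result of `False` from `is_verbatim` is what we would expect from an abstractive summary, while an extractive method would typically return sentences found in the source.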