# YouTube Video Transcript Summarization

## Installing & Importing Libraries

In [1]:
!pip install -q transformers

In [2]:
!pip install -q youtube_transcript_api

In [3]:
# importing libraries
from transformers import pipeline
from youtube_transcript_api import YouTubeTranscriptApi

## Extracting Video ID from URL

In [4]:
btc_video = "https://www.youtube.com/watch?v=IX6rUhNC8uA"

In [5]:
vid_id = btc_video.split("=")[1]

In [6]:
vid_id

'IX6rUhNC8uA'
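
Splitting on `=` works for the URL above, but it would return `IX6rUhNC8uA&t` if the link carried an extra parameter such as `&t=42s`. A minimal sketch of a more tolerant approach, assuming only standard watch URLs and `youtu.be` short links need to be handled (the helper name `extract_video_id` is hypothetical):

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the YouTube video ID from a watch URL or a youtu.be short link."""
    parsed = urlparse(url)
    # Short links carry the ID in the path: https://youtu.be/<id>
    if parsed.netloc.endswith("youtu.be"):
        return parsed.path.lstrip("/")
    # Watch URLs carry the ID in the "v" query parameter
    query = parse_qs(parsed.query)
    if "v" in query:
        return query["v"][0]
    raise ValueError(f"no video ID found in {url!r}")

vid_id = extract_video_id("https://www.youtube.com/watch?v=IX6rUhNC8uA&t=42s")
print(vid_id)  # IX6rUhNC8uA
```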

## Getting Video Transcript

In [8]:
YouTubeTranscriptApi.get_transcript(vid_id)

[{'duration': 4.88,
  'start': 0.08,
  'text': 'to put it simply balancer is an ethereum'},
 {'duration': 5.039,
  'start': 2.72,
  'text': 'based application specifically a'},
 {'duration': 4.719,
  'start': 4.96,
  'text': 'decentralized exchange that utilizes a'},
 {'duration': 3.76,
  'start': 7.759,
  'text': 'special trading algorithm called an'},
 {'duration': 4.161,
  'start': 9.679,
  'text': 'automated market maker that allows'},
 {'duration': 4.24,
  'start': 11.519,
  'text': 'traders to swap their crypto tokens in a'},
 {'duration': 3.92,
  'start': 13.84,
  'text': 'very efficient way if that sounded like'},
 {'duration': 3.52,
  'start': 15.759,
  'text': "a bunch of mumbo jumbo don't worry i'm"},
 {'duration': 3.679,
  'start': 17.76,
  'text': 'gonna explain all of this throughout the'},
 {'duration': 3.92,
  'start': 19.279,
  'text': 'rest of the video welcome to whiteboard'},
 {'duration': 3.92,
  'start': 21.439,
  'text': 'crypto the number one youtube channel'},


In [9]:
transcript = YouTubeTranscriptApi.get_transcript(vid_id)

### Edge case for transcript not having "text"

In [10]:
# to check if the video transcript has text or not
transcript[0:5]

[{'duration': 4.88,
  'start': 0.08,
  'text': 'to put it simply balancer is an ethereum'},
 {'duration': 5.039,
  'start': 2.72,
  'text': 'based application specifically a'},
 {'duration': 4.719,
  'start': 4.96,
  'text': 'decentralized exchange that utilizes a'},
 {'duration': 3.76,
  'start': 7.759,
  'text': 'special trading algorithm called an'},
 {'duration': 4.161,
  'start': 9.679,
  'text': 'automated market maker that allows'}]

### Putting all timestamped texts together in one corpus

In [11]:
# iterating throughout and adding all text together
result = ""
for i in transcript:
    result += ' ' + i['text']
#print(result)
print(len(result))

12458
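
The loop above assumes every entry has a `'text'` key. A minimal sketch that also covers the edge case from the previous section, skipping entries whose text is missing or empty; the helper name `join_transcript` and the third sample entry are hypothetical. Unlike the loop, it produces no leading space.

```python
def join_transcript(entries):
    """Concatenate the 'text' field of each transcript entry into one string."""
    # Skip entries with a missing or empty 'text' field instead of raising KeyError
    parts = [e["text"].strip() for e in entries if e.get("text")]
    return " ".join(parts)

sample = [
    {"duration": 4.88, "start": 0.08, "text": "to put it simply balancer is an ethereum"},
    {"duration": 5.039, "start": 2.72, "text": "based application specifically a"},
    {"duration": 1.0, "start": 5.0},  # hypothetical entry without text
]
print(join_transcript(sample))
```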


In [119]:
result

" to put it simply balancer is an ethereum based application specifically a decentralized exchange that utilizes a special trading algorithm called an automated market maker that allows traders to swap their crypto tokens in a very efficient way if that sounded like a bunch of mumbo jumbo don't worry i'm gonna explain all of this throughout the rest of the video welcome to whiteboard crypto the number one youtube channel for crypto education and here we explain topics of the cryptocurrency world using analogies stories and examples so that even your grandpa could understand them in this video we are going to be explaining what balancer is how it works and what their automated portfolio manager or as i call it a create your own index fund idea is let's dig in in the world of defy and cryptocurrencies we don't use the old method of making trades this old method is called an order book model where a buyer and seller write down whatever they want to trade with and then it gets added to a h

## Generating Summary

### Facebook Bart Large CNN Model

In [65]:
summarizerfb = pipeline("summarization", model="facebook/bart-large-cnn")
#sumd_text = summarizerfb(result, max_length=130, min_length=30, do_sample=False)

In [155]:
# BART accepts at most 1024 tokens per input, so the transcript is processed in
# 1000-character slices (characters, not tokens), which keeps each input under that limit here
num_iters = int(len(result)/1000)

# summarizing each slice and collecting the partial summaries
summarized_text = []
for i in range(0, num_iters + 1):
  start = i * 1000
  end = (i + 1) * 1000
  print("input text \n" + result[start:end])
  out = summarizerfb(result[start:end], max_length=130, min_length=30, do_sample=False)
  out = out[0]
  out = out['summary_text']
  print("Summarized text\n"+out)
  summarized_text.append(out)

# joining the partial summaries into the final summary
summarized_text2 = ' '.join(summarized_text)

input text 
 to put it simply balancer is an ethereum based application specifically a decentralized exchange that utilizes a special trading algorithm called an automated market maker that allows traders to swap their crypto tokens in a very efficient way if that sounded like a bunch of mumbo jumbo don't worry i'm gonna explain all of this throughout the rest of the video welcome to whiteboard crypto the number one youtube channel for crypto education and here we explain topics of the cryptocurrency world using analogies stories and examples so that even your grandpa could understand them in this video we are going to be explaining what balancer is how it works and what their automated portfolio manager or as i call it a create your own index fund idea is let's dig in in the world of defy and cryptocurrencies we don't use the old method of making trades this old method is called an order book model where a buyer and seller write down whatever they want to trade with and then it gets a

Your max_length is set to 130, but you input_length is only 95. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=47)


Summarized text
The maximum supply of balancer tokens is 100 million around 35 million of these tokens were minted at launch and the remaining 65 million can be distributed to liquidity providers. balancer token was actually one of the very first governments token in the whole wide world of defy and right now that's exactly what they're used for governance.
input text 
nd the next two videos on this channel about the balancer platform and functionalities are created with the help of a grant that they approved for whiteboard crypto so wrapping this video up i want to thank the balancer grant team for approving that grant and helping me with the script and i want to thank you for watching this video i hope you enjoyed it i really hope that maybe you've learned something and most of all i hope to see you in our next video
Summarized text
The next two videos on this channel about the balancer platform and functionalities are created with the help of a grant that they approved for whiteboar
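
Fixed 1000-character slices can cut a word in half, as the slice starting with `nd the next two videos` shows. A minimal sketch of one alternative, assuming it is acceptable to split on whitespace so that every chunk ends on a word boundary. The names `chunk_by_words` and `summarize_chunks` are hypothetical, and the summarizer is passed in as a plain callable so the chunking logic can be tried without loading a model.

```python
def chunk_by_words(text, max_chars=1000):
    """Split text into chunks of at most max_chars characters without cutting words."""
    chunks, current = [], ""
    for word in text.split():
        # Start a new chunk when adding the next word would exceed the limit
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks

def summarize_chunks(chunks, summarize):
    """Apply a summarize(str) -> str callable to each chunk and join the results."""
    return " ".join(summarize(chunk) for chunk in chunks)

demo = "balancer is an ethereum based application " * 50
pieces = chunk_by_words(demo, max_chars=200)
print(len(pieces), max(len(p) for p in pieces))
```

With the pipeline from this section, the callable could be something like `lambda c: summarizerfb(c, max_length=130, min_length=30, do_sample=False)[0]['summary_text']`. A single word longer than `max_chars` would still produce an oversized chunk.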

In [158]:
len(str(summarized_text2))

4084

In [159]:
str(summarized_text2)

"Whiteboard crypto is the number one youtube channel for crypto education. We explain topics of the cryptocurrency world using analogies stories and examples. In this video we are going to be explaining what balancer is how it works. An order book is a list of everyone else's trades this list is called an order book essentially you pick what price you want to buy or sell at you can't be guaranteed when the trade will happen if at all because you have to wait around for someone else to match with your trade. There is a newer solution that is much better and it doesn't rely on humans but instead algorithms this method is called a market maker it'll never run out of assets to sell you. The way this algorithm works is actually quite simple as you buy more of one asset the algorithm automatically charges you more and more to keep buying that asset. Even if you have a trillion dollars to trade with the pool will never run out of either asset to sell you because it'll keep charging you an inf

We'll be using the **facebook/bart-large-cnn** model as our final model for the YouTube Video Transcript Summarizer.

### T5-Base Model

In [76]:
# defining the model from Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [77]:
# Encoding & decoding with T5. The tokenizer truncates each input to 512 tokens (max_length=512).
# Note: the text is sliced in 1000-character windows, but num_iters divides by 512,
# so the loop runs more iterations than there are slices and the trailing slices are empty.

num_iters = int(len(result)/512)
summarized_text = []
for i in range(0, num_iters + 1):
  start = i * 1000
  end = (i + 1) * 1000
  print("input text \n" + result[start:end])
  inp = tokenizer("summarize: " + result[start:end], return_tensors="pt", max_length=512, truncation=True)
  out = model.generate(inp["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
  # decode() keeps special tokens such as <pad> and </s> unless skip_special_tokens=True is passed
  output = tokenizer.decode(out[0])
  summarized_text.append(output)

input text 
 to put it simply balancer is an ethereum based application specifically a decentralized exchange that utilizes a special trading algorithm called an automated market maker that allows traders to swap their crypto tokens in a very efficient way if that sounded like a bunch of mumbo jumbo don't worry i'm gonna explain all of this throughout the rest of the video welcome to whiteboard crypto the number one youtube channel for crypto education and here we explain topics of the cryptocurrency world using analogies stories and examples so that even your grandpa could understand them in this video we are going to be explaining what balancer is how it works and what their automated portfolio manager or as i call it a create your own index fund idea is let's dig in in the world of defy and cryptocurrencies we don't use the old method of making trades this old method is called an order book model where a buyer and seller write down whatever they want to trade with and then it gets a

In [80]:
len(summarized_text)

25

In [81]:
summarized_text

['<pad> balancer is an ethereum based application specifically a decentralized exchange that utilizes a special trading algorithm called an automated market maker that allows traders to swap their crypto tokens in a very efficient way. here we explain topics of the cryptocurrency world using analogies stories stories and examples so that even your grandpa could understand them.</s>',
 '<pad> the way an automated market maker works is quite complicated. first a pool of money must be created this pool of money contains two or more assets that someone supplies willingly so that oth oth can trade at the price they want. this pool of money contains two or more assets that someone supplies willingly so that oth can trade at the price they want. this pool of money contains two or more assets that someone supplies willingly so that oth can trade at the price they want. this pool of money must</s>',
 "<pad> er people can use that pool to trade with the way the trading happens is all automatic w

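Two details of the T5 run can be checked without the model. The loop count is derived from 512 while the slices are 1000 characters wide, which is why `len(summarized_text)` is 25 even though 13 slices cover the 12458-character transcript. The decoded strings also still contain the `<pad>` and `</s>` markers. Passing `skip_special_tokens=True` to `tokenizer.decode` is the usual way to avoid them; the sketch below instead cleans strings that were already decoded. Both helper names are hypothetical.

```python
import math

def num_slices(text_length, slice_size=1000):
    """Number of fixed-size slices needed to cover text_length characters."""
    return math.ceil(text_length / slice_size)

def strip_special_tokens(decoded, specials=("<pad>", "</s>")):
    """Remove special-token markers from an already decoded string."""
    for token in specials:
        decoded = decoded.replace(token, "")
    return decoded.strip()

print(num_slices(12458))  # 13
print(strip_special_tokens("<pad> balancer is an ethereum based application.</s>"))
```
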
### Distilbart Model



In [53]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [59]:
# iterating over the transcript in 1000-character slices
num_iters = int(len(result)/1000)

summarized_text = []
for i in range(0, num_iters + 1):
  start = i * 1000
  end = (i + 1) * 1000
  print("input text \n" + result[start:end])
  out = summarizer(result[start:end], min_length = 30, max_length = 100)
  out = out[0]
  out = out['summary_text']
  print("Summarized text\n"+out)
  summarized_text.append(out)

#print(summarized_text)

input text 
 to put it simply balancer is an ethereum based application specifically a decentralized exchange that utilizes a special trading algorithm called an automated market maker that allows traders to swap their crypto tokens in a very efficient way if that sounded like a bunch of mumbo jumbo don't worry i'm gonna explain all of this throughout the rest of the video welcome to whiteboard crypto the number one youtube channel for crypto education and here we explain topics of the cryptocurrency world using analogies stories and examples so that even your grandpa could understand them in this video we are going to be explaining what balancer is how it works and what their automated portfolio manager or as i call it a create your own index fund idea is let's dig in in the world of defy and cryptocurrencies we don't use the old method of making trades this old method is called an order book model where a buyer and seller write down whatever they want to trade with and then it gets a

Your max_length is set to 100, but you input_length is only 95. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=47)


Summarized text
 The maximum supply of balancer tokens is 100 million around 35 million were minted at launch and the remaining 65 million can be distributed to liquidity providers . The ultimate amount of these tokens is completely up to balancer holders by proposing suggestions and then voting on them . The balancer token was actually one of the very first governments token in the whole world of defy .
input text 
nd the next two videos on this channel about the balancer platform and functionalities are created with the help of a grant that they approved for whiteboard crypto so wrapping this video up i want to thank the balancer grant team for approving that grant and helping me with the script and i want to thank you for watching this video i hope you enjoyed it i really hope that maybe you've learned something and most of all i hope to see you in our next video
Summarized text
 The next two videos on this channel are created with the help of a grant that they approved for whiteboa
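
The `max_length` warning appears because the last slice is only 95 tokens long while `max_length` is fixed at 100. A minimal sketch of clamping the bounds per slice, assuming a target of at most half the input length (the ratio implied by the warning's own `max_length=47` suggestion); `length_bounds` is a hypothetical helper.

```python
def length_bounds(input_tokens, max_length=100, min_length=30):
    """Clamp summary length bounds so they never exceed the input length."""
    # Assumption: aim for at most half the input length, as the warning's example suggests
    capped_max = max(1, min(max_length, input_tokens // 2))
    capped_min = min(min_length, capped_max)
    return capped_min, capped_max

print(length_bounds(95))   # (30, 47)
print(length_bounds(400))  # (30, 100)
```

The token count of a slice could be obtained with something like `len(summarizer.tokenizer(text_slice)["input_ids"])`, and the two returned values passed as `min_length` and `max_length`.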

In [62]:
len(str(summarized_text))

4418

In [63]:
str(summarized_text)

'[\' balancer is an ethereum based application specifically a decentralized exchange that utilizes a special trading algorithm called an automated market maker that allows traders to swap their crypto tokens in a very efficient way . The old method is called an order book model where a buyer and seller write down whatever they want to trade with and then it gets added to a hu .\', " The way an automated market maker works is quite complicated but we\'re gonna try to simplify it for you in this video . A pool of money must be created in order to create a pool of assets that someone supplies willingly so that oth the money contains two or more assets .", " The way this algorithm works is actually quite simple as you buy more of one asset the algorithm automatically charges you more and more to keep buying that asset . Even if you have a trillion dollars to trade with the pool will never run out of either asset to sell you because it\'ll keep charging you an infinitely higher price .", \'

# Blog Post Summarization after Web Scraping

## Installing Transformers & Importing Dependencies

In [83]:
!pip install transformers



In [84]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests

## Load Summarization Pipeline

In [85]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


## Getting our Blog Post

In [90]:
URL = "https://medium.com/building-the-metaverse/what-we-talk-about-when-we-talk-about-the-metaverse-c9ef03c1a5dd"

In [93]:
r = requests.get(URL)

In [94]:
r.text

'<!doctype html><html lang="en"><head><script defer src="https://cdn.optimizely.com/js/16180790160.js"></script><title data-rh="true">What We Talk About When We Talk About the Metaverse | by Jon Radoff | Building the Metaverse | Medium</title><meta data-rh="true" charset="utf-8"/><meta data-rh="true" name="viewport" content="width=device-width,minimum-scale=1,initial-scale=1,maximum-scale=1"/><meta data-rh="true" name="theme-color" content="#000000"/><meta data-rh="true" name="twitter:app:name:iphone" content="Medium"/><meta data-rh="true" name="twitter:app:id:iphone" content="828256236"/><meta data-rh="true" property="al:ios:app_name" content="Medium"/><meta data-rh="true" property="al:ios:app_store_id" content="828256236"/><meta data-rh="true" property="al:android:package" content="com.medium.reader"/><meta data-rh="true" property="fb:app_id" content="542599432471018"/><meta data-rh="true" property="og:site_name" content="Medium"/><meta data-rh="true" property="og:type" content="arti

In [97]:
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all(['h1', 'p'])
text = [result.text for result in results]
ARTICLE = ' '.join(text)

In [98]:
ARTICLE

'What We Talk About When We Talk About the Metaverse Since this is the inaugural article on Building the Metaverse, I wanted to take a moment to think about what this word means, how it has evolved, and what the future may bring. “Metaverse” is a word that conjures different meanings to people: to some, it’s an immersive virtual-reality experience within a persistent landscape; to others, a specific technology stack; to some, it is a vision of future society. About 14 years ago, I asked one of my favorite science-fiction authors, Charlie Stross, to write an article for me about the future of games. He had to say this about the metaverse: The really interesting question is whether things will converge on a single overarching metaverse with games or business meetings happening in different places, or whether they’ll fracture and we’ll see even more divergent environments cropping up. It is now possible to say that neither of these futures has entirely come true. In Neal Stephenson’s Snow

## Chunking Text

In [99]:
# appending an <eos> tag after each '.', '?' and '!', then splitting the article on that tag
ARTICLE = ARTICLE.replace('.', '.<eos>')
ARTICLE = ARTICLE.replace('?', '?<eos>')
ARTICLE = ARTICLE.replace('!', '!<eos>')
sentences = ARTICLE.split('<eos>')

In [100]:
sentences

['What We Talk About When We Talk About the Metaverse Since this is the inaugural article on Building the Metaverse, I wanted to take a moment to think about what this word means, how it has evolved, and what the future may bring.',
 ' “Metaverse” is a word that conjures different meanings to people: to some, it’s an immersive virtual-reality experience within a persistent landscape; to others, a specific technology stack; to some, it is a vision of future society.',
 ' About 14 years ago, I asked one of my favorite science-fiction authors, Charlie Stross, to write an article for me about the future of games.',
 ' He had to say this about the metaverse: The really interesting question is whether things will converge on a single overarching metaverse with games or business meetings happening in different places, or whether they’ll fracture and we’ll see even more divergent environments cropping up.',
 ' It is now possible to say that neither of these futures has entirely come true.',
 '
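
Replacing every `.` with `.<eos>` also splits inside decimals and abbreviations. A minimal sketch of a regex-based variant, assuming a sentence ends only where the punctuation mark is followed by whitespace; `split_sentences` is a hypothetical helper and not a full sentence tokenizer.

```python
import re

def split_sentences(text):
    """Split text after '.', '?' or '!' only when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Version 1.5 is out. Is it stable? Yes!"))
# ['Version 1.5 is out.', 'Is it stable?', 'Yes!']
```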

In [101]:
max_chunk = 500
current_chunk = 0 
chunks = []
for sentence in sentences:
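    # a chunk already exists at index current_chunk
    if len(chunks) == current_chunk + 1: 
        # extend it while the word count stays within max_chunk
        if len(chunks[current_chunk]) + len(sentence.split(' ')) <= max_chunk:
            chunks[current_chunk].extend(sentence.split(' '))
        else:
            # otherwise start a new chunk with this sentence
            current_chunk += 1
            chunks.append(sentence.split(' '))
    else:
        # no chunk yet (first sentence): create the first one
        print(current_chunk)
        chunks.append(sentence.split(' '))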

for chunk_id in range(len(chunks)):
    chunks[chunk_id] = ' '.join(chunks[chunk_id])

0


In [103]:
chunks[1]

' Indeed, this has already yielded impressive results in certain settings.  In exchange, the cost for this is the high rents you’ll pay and the limits on creativity imposed by the owners.  Another approach to these monolithic silos in which creators are hemmed-in is a “metaverse for all. ” In this, the defining characteristic is decentralization.  We already see this in the realm of decentralized finance (DeFi).  To unlock the creativity of the world, we’ll need similar patterns to play out in the space of virtual worlds and games.  Technologies, interfaces and business services are emerging that allow anyone to mash-up, mix, build around and be compensated for creativity.  This sort of metaverse demands the ability to create and exchange assets among games, worlds and environments — and a decoupling of the rules, content and technical underpinnings.  Creative work will need to be freed from programming and technical impediments that cause the multiplication of processes, workflows and

In [104]:
len(chunks)

3

In [107]:
len(chunks[1].split(' '))

494
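
The chunking loop can also be written as a small reusable function. This is a sketch under one stated difference: it uses `split()` rather than `split(' ')`, so the empty strings produced by double spaces are not counted as words. The name `chunk_sentences` is hypothetical.

```python
def chunk_sentences(sentences, max_words=500):
    """Group consecutive sentences into chunks of at most max_words words."""
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # Close the current chunk when the next sentence would not fit
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_sentences = ["one two three.", "four five.", "six seven eight nine."]
print(chunk_sentences(demo_sentences, max_words=5))
# ['one two three. four five.', 'six seven eight nine.']
```

A single sentence longer than `max_words` still ends up in one oversized chunk, which matches the behavior of the loop above.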

## Summarize Text

In [112]:
res = summarizer(chunks, max_length=70, min_length=30, do_sample=False)

In [113]:
chunks

['What We Talk About When We Talk About the Metaverse Since this is the inaugural article on Building the Metaverse, I wanted to take a moment to think about what this word means, how it has evolved, and what the future may bring.  “Metaverse” is a word that conjures different meanings to people: to some, it’s an immersive virtual-reality experience within a persistent landscape; to others, a specific technology stack; to some, it is a vision of future society.  About 14 years ago, I asked one of my favorite science-fiction authors, Charlie Stross, to write an article for me about the future of games.  He had to say this about the metaverse: The really interesting question is whether things will converge on a single overarching metaverse with games or business meetings happening in different places, or whether they’ll fracture and we’ll see even more divergent environments cropping up.  It is now possible to say that neither of these futures has entirely come true.  In Neal Stephenson’

In [114]:
res

[{'summary_text': ' “Metaverse” is a word that conjures different meanings to people: to some, it’s an immersive virtual-reality experience within a persistent landscape; to others, a specific technology stack . “The metaverse is a living multiverse of worlds. The common theme is that the “player”'},
 {'summary_text': ' Building the Metaverse is a “metaverse for all’s creators’ ability to create and exchange assets among games, worlds and environments . Walled gardens will contain the expansive theme parks that you’ll enjoy visiting — but they won’t be the only destinations . These theme parks will include games, experiences'},
 {'summary_text': ' Metaverse:  2D, 3D, mobile phones, VR/AR, games, MMORPGs, social networks, digital collectibles, esports; game design, Unity, Unreal, free-to-play (f2p), blockchain, NFTs.  Business, technology and culture of all virtual worlds, realities'}]

In [115]:
''.join([summ['summary_text'] for summ in res])

' “Metaverse” is a word that conjures different meanings to people: to some, it’s an immersive virtual-reality experience within a persistent landscape; to others, a specific technology stack . “The metaverse is a living multiverse of worlds. The common theme is that the “player” Building the Metaverse is a “metaverse for all’s creators’ ability to create and exchange assets among games, worlds and environments . Walled gardens will contain the expansive theme parks that you’ll enjoy visiting — but they won’t be the only destinations . These theme parks will include games, experiences Metaverse:  2D, 3D, mobile phones, VR/AR, games, MMORPGs, social networks, digital collectibles, esports; game design, Unity, Unreal, free-to-play (f2p), blockchain, NFTs.  Business, technology and culture of all virtual worlds, realities'

## Output to Text File

In [116]:
text = ' '.join([summ['summary_text'] for summ in res])

In [117]:
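# the summary contains curly quotes and dashes, so the encoding is set explicitly
with open('blogsummary.txt', 'w', encoding='utf-8') as f:
    f.write(text)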

# Abstractive Summarization with Pegasus

## Install Dependencies

In [1]:
# installing PyTorch
!pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

In [2]:
# installing HF Transformers
!pip install transformers



In [3]:
!pip install SentencePiece



## Import & Load Model

In [4]:
# importing dependencies from transformers
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [5]:
# loading tokenizer
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")

In [6]:
# loading model
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

## Perform Abstractive Summarization

In [7]:
text = """
Kunal Nayyar (/kʊˈnɑːl ˈnaɪ.ər/, kuu-NAHL NY-ər; born 30 April 1981) is a British American actor. He portrayed Raj Koothrappali on the CBS sitcom The Big Bang Theory (2007–2019), and voiced Vijay on the Nickelodeon animated sitcom Sanjay and Craig (2013–2016). Nayyar also appeared in the films Ice Age: Continental Drift (2012), The Scribbler (2014), Dr. Cabbie (2014), Consumed (2015), Trolls (2016), and Sweetness in the Belly (2019). Forbes listed Nayyar as the world's third-highest-paid television actor in 2015 and 2018, with earnings of US$20 million and US$23.5 million, respectively.[1][2]

Early life
Nayyar was born in Hounslow, West London to a family of Indian immigrants. When he was 4 years old, his family returned to India, and he grew up in New Delhi.[3] He attended St. Columba's School, where he played badminton for the school team.[4][5] His parents live in New Delhi.[6][7]

In 1999, Nayyar moved to the United States to pursue a Bachelor of Business Administration in finance from the University of Portland, Oregon.[8] He started taking acting classes and appeared in several school plays while working on his degree.[3]

After participating in the American College Theater Festival, Nayyar decided to become an actor. He then attended Temple University in Philadelphia, Pennsylvania, where he received a Master of Fine Arts in acting.[9]

Career

Nayyar on a tour of The Big Bang Theory in 2008
After graduating, Nayyar found work doing American television ads and plays on the London stage.[10] He first gained attention in the US for his role in the West Coast production of Rajiv Joseph's 2006 play Huck & Holden, where he portrayed an Indian exchange student anxious to experience American culture before returning home.[9] In 2006, Nayyar teamed up with Arun Das to write the play Cotton Candy, which premiered in New Delhi to positive reviews.[11]

Nayyar made a guest appearance on the CBS drama NCIS in the season four episode "Suspicion", in which he played Youssef Zidan, an Iraqi terrorist.[12]

Nayyar's agent heard about a role for a scientist in an upcoming CBS pilot and encouraged him to audition for the part. This led to his casting in the sitcom The Big Bang Theory, where he played the role of an astrophysicist Raj Koothrappali.[13]

In 2011, he co-hosted the Tribute to Nerds show with co-star Simon Helberg at the comedy festival Just for Laughs.[14]

Nayyar voiced Gupta in Ice Age: Continental Drift in 2012. During the same year he completed the shooting of his first film, Dr. Cabbie, produced by Bollywood actor Salman Khan.[15]

From 5 May to 29 June 2015, Nayyar performed in an off-Broadway production, The Spoils, written by and starring actor Jesse Eisenberg. Nayyar played Kalyan, a Nepalese student and roommate of the protagonist Ben, played by Eisenberg.[16] The production transferred to London's West End in 2016.

Nayyar published a book about his career journey, titled Yes, My Accent is Real: and Some Other Things I Haven’t Told You, in September 2015.[3]

He voiced Guy Diamond in DreamWorks' animated movie Trolls, released in November 2016.

In 2020, Nayyar played a convicted serial killer named Sandeep on the Netflix UK production, Criminal: UK. He appeared in the final episode of Season 2, which was released in August 2020. In the same year, he joined the cast of upcoming thriller titled Suspicion on Apple TV+, based on the Israeli thriller TV series False Flag, alongside Uma Thurman, Elizabeth Henstridge and Elyes Gabel.[17]

In 2021, he was announced as having been picked to play the title role of A. J. Fikry in the upcoming comedy drama The Storied Life Of A. J. Fikry, alongside Lucy Hale and Christina Hendricks, an adaptation of the best-selling novel by Gabrielle Zevin.[18]

"""

In [8]:
# creating tokens
tokens = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
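The text passed to the tokenizer above was copied from a web page and still carries bracketed reference markers such as `[8]` and `[13]`. As an optional preprocessing step, they could be stripped before tokenization. The helper below is a minimal sketch of that idea; `strip_citation_markers` is a hypothetical name, not part of the notebook or of any library.

```python
import re

def strip_citation_markers(text):
    """Remove bracketed numeric reference markers such as [8] or [13]."""
    return re.sub(r"\[\d+\]", "", text)

sample = "He received a Master of Fine Arts in acting.[9] He then moved on.[10]"
cleaned = strip_citation_markers(sample)
print(cleaned)
```

Whether this changes the generated summary noticeably has not been checked here; it only keeps tokens from being spent on markers that carry no meaning.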

In [9]:
tokens

{'input_ids': tensor([[82150, 46401, 26052,   143,   191,  1052,   105,   454,   105,  1191,
           110,   105,  2558,   105,   107,   105,   551,   191,   108, 23539,
          1858,   121, 83705,  1240,  3942,   121,   105,   551,   206,  1723,
           677,   960, 17298,   158,   117,   114,  1816,   655,  5102,   107,
           285, 17720, 16117, 39464, 42559,  9241, 12502,   124,   109, 12022,
         33454,   139,  2338, 14280, 11245, 66776,  1198, 13316,   312,   111,
         25412, 32594,   124,   109, 52947,  8461, 33454, 43067,   111,  8491,
           143, 15622,  1198, 14262,   250, 46401, 26052,   163,  2893,   115,
           109,  3265,  6527,  6271,   151, 16235, 44235, 28238,   108,   139,
           520, 40250, 32961, 26470,   108,   982,   107, 13792, 20860, 26470,
           108, 80614,   252, 22924,   108, 41609,   116, 25350,   108,   111,
          7303,  1759,   115,   109, 32866, 38245,   107, 15347,  1661, 46401,
         26052,   130,   109,   278,  

In [10]:
# summarizing
summary = model.generate(**tokens)

In [11]:
# showing what **tokens unpacks to: the keyword arguments that model.generate received above
{**tokens}

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
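The `**tokens` syntax used in `model.generate(**tokens)` is ordinary Python dictionary unpacking: each key becomes a keyword argument. The sketch below illustrates this with plain Python objects only. `fake_generate` and `fake_tokens` are hypothetical stand-ins, assuming the tokenizer output behaves like a dictionary with `input_ids` and `attention_mask` keys.

```python
def fake_generate(input_ids=None, attention_mask=None):
    # hypothetical stand-in for model.generate: it only reports
    # which keyword arguments it received
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# hypothetical stand-in for the tokenizer output
fake_tokens = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]]}

# unpacking the dictionary ...
unpacked = fake_generate(**fake_tokens)

# ... is equivalent to passing each key explicitly
explicit = fake_generate(
    input_ids=fake_tokens["input_ids"],
    attention_mask=fake_tokens["attention_mask"],
)

print(unpacked == explicit)
```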

In [12]:
# summary in tokens
summary

tensor([[    0, 22682, 82150, 46401, 26052,   148,   174,  1729,   109,   278,
           131,   116,   776,   121, 42279,   121, 16097,  3069,  5102,   115,
          1680,   111,  3939,   122,  5264,   113,   787, 45766,   604,   111,
           787, 28273, 18476,   604,   108,  4802,   107,     1]])

In [13]:
# summary in words
tokenizer.decode(summary[0])

"Actor Kunal Nayyar has been named the world's third-highest-paid television actor in 2015 and 2018, with earnings of US$20 million and US$23.5 million, respectively."

# Wrapping all models in simple pipelines

## YTVideoToText

In [2]:
!pip install transformers
!pip install youtube_transcript_api


In [7]:
def YTVideoToText(video_link):
    # importing libraries
    from transformers import pipeline
    from youtube_transcript_api import YouTubeTranscriptApi

    # fetching video transcript (assumes a link of the form ...watch?v=<video_id>)
    video_id = video_link.split("=")[1]
    transcript = YouTubeTranscriptApi.get_transcript(video_id)

    # iterating over the transcript segments and adding all text together
    result = ""
    for segment in transcript:
        result += ' ' + segment['text']

    # loading summarization pipeline
    summarizerfb = pipeline("summarization", model="facebook/bart-large-cnn")

    # summarizing the transcript in chunks of 1000 characters;
    # stepping by chunk_size avoids an empty final chunk
    chunk_size = 1000
    summarized_text = []
    for start in range(0, len(result), chunk_size):
        chunk = result[start:start + chunk_size]
        out = summarizerfb(chunk, max_length=130, min_length=30, do_sample=False)
        summarized_text.append(out[0]['summary_text'])

    # printing summary
    print(' '.join(summarized_text))
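Taking the video ID with `split("=")[1]` works for plain `watch?v=...` links, but a link with extra parameters (for example `&t=30s`) would leave `&t` attached to the ID. A more tolerant approach, sketched below with the standard library only, parses the URL instead. `extract_video_id` is a hypothetical helper, and the handling of `youtu.be` short links is an assumption about link formats rather than something the notebook covers.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(video_link):
    """Hypothetical helper: return the video ID from a YouTube link, or None."""
    parsed = urlparse(video_link)
    # short links such as https://youtu.be/<id> carry the ID in the path
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/") or None
    # standard watch links carry the ID in the "v" query parameter
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None

print(extract_video_id("https://www.youtube.com/watch?v=Oz9zw7-_vhM&t=30s"))
```

Returning `None` for links without an ID lets the caller decide how to report the problem before any transcript request is made.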

In [8]:
YTVideoToText("https://www.youtube.com/watch?v=Oz9zw7-_vhM")

Your max_length is set to 130, but you input_length is only 120. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=60)


There's some super strange stuff happening online right now, and I need to tell you about it. First, look at this tweet. The first tweet ever tweetedin the history of Twitter. The tweet was purchased for $2,915,835.47. NFT stands for non fungible token. Some people think it may revolutionize our society, while at the same time accelerating the climate disaster. Johnny Harris talks about the non-fungible token, NFT. He also talks about a $20,000 Tesla that you can win by donating $10. Omaze.com/JohnnyHarris is offering a chance to win a Tesla and $20,000. The money will be donated to organizations that provide clean water, food security, and light to regions around the world. "I just love it, I love the color, I feel like an identity with this thing. And I kind of fell in love" "This jacket is not replaceable. If I went onto the website and paid $39 for a Uni Qlo orange jacket that was this exact same model" Everything in our economy is one or the other, fungible or non-fungible. A sack
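The summary above ends in the middle of a phrase, and slicing the transcript every 1000 characters can also cut words in half at chunk boundaries. One possible alternative, sketched here with the standard library's `textwrap.wrap`, splits on whitespace so that each chunk holds whole words. `chunk_on_word_boundaries` is a hypothetical helper; note that `textwrap.wrap` also collapses runs of whitespace, and by default still breaks a single word that is longer than the limit.

```python
import textwrap

def chunk_on_word_boundaries(text, max_chars=1000):
    """Split text into pieces of at most max_chars characters, breaking at whitespace."""
    # break_on_hyphens=False keeps hyphenated words such as "non-fungible" intact
    return textwrap.wrap(text, width=max_chars, break_on_hyphens=False)

demo_text = "the quick brown fox jumps over the lazy dog"
demo_chunks = chunk_on_word_boundaries(demo_text, max_chars=15)
for piece in demo_chunks:
    print(repr(piece))
```

Each resulting piece could then be passed to the summarizer in place of `result[start:end]`.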

## postSummaryWithBart

In [9]:
def postSummaryWithBart(blog_link):
    # importing libraries
    from transformers import pipeline
    from bs4 import BeautifulSoup
    import requests

    # loading summarization pipeline
    summarizer = pipeline("summarization")

    # getting our blog post
    URL = blog_link
    r = requests.get(URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all(['h1', 'p'])
    text = [result.text for result in results]
    ARTICLE = ' '.join(text)

    # marking sentence boundaries with end-of-sentence tags
    ARTICLE = ARTICLE.replace('.', '.<eos>')
    ARTICLE = ARTICLE.replace('?', '?<eos>')
    ARTICLE = ARTICLE.replace('!', '!<eos>')
    sentences = ARTICLE.split('<eos>')

    # chunking text into groups of at most max_chunk words
    max_chunk = 500
    current_chunk = 0
    chunks = []
    for sentence in sentences:
        words = sentence.split(' ')
        # checking if the current chunk has already been started
        if len(chunks) == current_chunk + 1:
            if len(chunks[current_chunk]) + len(words) <= max_chunk:
                chunks[current_chunk].extend(words)
            else:
                # current chunk is full, so start a new one
                current_chunk += 1
                chunks.append(words)
        else:
            # first sentence: start the first chunk (prints its index, 0)
            print(current_chunk)
            chunks.append(words)
    for chunk_id in range(len(chunks)):
        chunks[chunk_id] = ' '.join(chunks[chunk_id])

    # summarizing each chunk and joining the partial summaries
    res = summarizer(chunks, max_length=70, min_length=30, do_sample=False)
    text = ''.join([summ['summary_text'] for summ in res])

    # printing summary
    print(text)

In [10]:
postSummaryWithBart("https://www.gimmesomeoven.com/life/blogs-im-reading-and-loving-lately/")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


0
 7 Blogs that I Totally Enjoy Reading Every Day are a few of my favorite non-recipe blogs . They are just that — blogs I find fun! Some deal with light and fluffy stuff, some dig a little deeper into the meat of life . Erin Loechner's blog is filled with beautifully-written essays on everything Joanna Goddard's blog is full of random lists, questions, product recommendations, and regular posts on everything from style, to food, design, travel, relationships and motherhood . She has been following Elise Blaha Cripe's blog since before she even knew what a blog was . The Everygirl blog shares a wealth of content for women about what it looks like to live a meaningful, healthy and stylish life . Courtney Carver’s minimalist blog is filled with some of the best articles and tips I’ve read . The Every Girl and Carrots N Cake are among the most fun lifestyle blogs to read . What are your “just for fun” favorite blogs to spend time reading? Share your favorite blogs with us! The New Potato,
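The chunking step inside `postSummaryWithBart` groups whole sentences until a word limit is reached. The same idea can be written as a small standalone function, which makes it easier to test without downloading a model. The sketch below is one way to do it, not the notebook's exact code: `chunk_sentences` is a hypothetical name, and it splits sentences with a regular expression instead of `<eos>` tags.

```python
import re

def chunk_sentences(text, max_words=500):
    """Group sentences into chunks of roughly max_words words.

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    # split after '.', '?' or '!' when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # start a new chunk when adding this sentence would exceed the limit
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = "One two three. Four five? Six seven eight nine! Ten."
demo_result = chunk_sentences(demo, max_words=5)
print(demo_result)
```

The returned list of strings has the same shape as `chunks` in the function above, so it could be handed to a summarization pipeline in the same way.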

## abstractiveSummaryWithPegasus

In [1]:
# installing PyTorch
!pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html


In [2]:
!pip install SentencePiece
!pip install transformers



In [3]:
def abstractiveSummaryWithPegasus(words):
    # importing libraries & loading model
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer
    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
    model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

    # tokenizing the input; truncation=True cuts text beyond the model's maximum input length
    tokens = tokenizer(words, truncation=True, padding="longest", return_tensors="pt")

    # generating the summary and decoding it back to text without special tokens
    summary = model.generate(**tokens)
    actual_summ = tokenizer.decode(summary[0], skip_special_tokens=True)

    # printing summary
    print(actual_summ)

In [4]:
word_text = """
Each participant in the network can choose what they host/provide and can be home to different content. Similar to your home network, you are in control of what you share, and you don’t share everything.

This is a core tenet of decentralized identity. The same cryptographic principles underpinning cryptocurrencies like Bitcoin and Ethereum are being leveraged by applications to provide secure, cross-platform identity services. This is fundamentally different from other authentication systems such as OAuth 2.0, where a trusted party has to be reached to assess one's identity. This materializes in the form of “Login with Big Cloud provider” buttons. These cloud providers are the only ones with enough data, resources, and technical expertise.

"""

abstractiveSummaryWithPegasus(word_text)

The Big Cloud is a network of people who share data and resources.
