# Notebook: Extracting Metadata


## About

In this notebook, we will explore how to extract content from a YouTube video:

- metadata
- transcript
- core video frames

In [1]:
import rich
from pathlib import Path
import os
from IPython.display import Code


## Sample Video

In [2]:
from IPython.display import YouTubeVideo


In [3]:
video_id = 'ODluYyMZzs0'

In [4]:
video_obj = YouTubeVideo(video_id, width=800, height=400)
video_obj

In [5]:
video_url = f"https://www.youtube.com/watch?v={video_id}"
print(video_url)

https://www.youtube.com/watch?v=ODluYyMZzs0
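
Going the other way, from a watch URL back to the video ID, can be handy because some tools in this notebook take an ID and others take a full URL. Below is a minimal sketch using only the standard library. The helper name `extract_video_id` is hypothetical, and it only handles the `watch?v=` and `youtu.be/` URL shapes.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hosts this sketch recognizes; other YouTube URL shapes are not handled.
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if it cannot be found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links keep the ID in the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host in WATCH_HOSTS:
        # Watch links keep the ID in the "v" query parameter
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


print(extract_video_id("https://www.youtube.com/watch?v=ODluYyMZzs0"))
```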


![Information in a YouTube video](../images/video_metadata.png)


The YouTube description holds a lot of information, but much of the space is taken up by social media and affiliate links.

## Extracting content (text)

### Metadata

The pytubefix library makes it easy to extract information about a YouTube video.

In [6]:
from pytubefix import YouTube

In [7]:
yt = YouTube(video_url)

Basic video info:

In [8]:
print(f"""

Title: {yt.title}
Author: {yt.author}
Publish Date: {yt.publish_date}
Video Length : {yt.length}

Video Id : {yt.video_id}
Channel Id : {yt.channel_id}
Thumbnail Url : {yt.thumbnail_url}
Watch Url : {yt.watch_url}


""")



Title: Top 5 NYC Foods You MUST TRY Before You Die!
Author: Here Be Barr
Publish Date: 2023-04-16 08:00:08-07:00
Video Length : 757

Video Id : ODluYyMZzs0
Channel Id : UCmKNW9ontOlTd_p4Dg51sRQ
Thumbnail Url : https://i.ytimg.com/vi/ODluYyMZzs0/sddefault.jpg?v=643c0345
Watch Url : https://youtube.com/watch?v=ODluYyMZzs0
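
`yt.length` is reported in seconds (757 above). A small sketch for turning that into a readable duration follows. `format_duration` is a hypothetical helper, not part of pytubefix.

```python
def format_duration(total_seconds: int) -> str:
    """Format a length in seconds as M:SS, or H:MM:SS for long videos."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"


print(format_duration(757))  # 12:37
```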





Description:

In [9]:
print(yt.description)

For 24 Hours we'll eat the most iconic food that New York has to offer. From pizza, to pastrami, bagels, hot dogs and more, join us on this epic NYC Food Tour in 2023!
🎨 Buy an NYC Art Print/Postcard from Adriana's Store: https://www.etsy.com/shop/illustrationsbyadri
🛒 SHOP our NEW NYC Guides For Your Next Trip: http://www.thatch.co/@herebebarr
⭐ CHEAPEST Way To Book NYC Attractions:  https://gyg.me/J449y9gl

📝 GET Your FREE First-Timers GUIDE to NYC: https://my.ny-guide.com/freenycguide

✔️ SUBSCRIBE NOW! DON’T FORGET! The more the merrier! :)

🍕 Buy Me A Slice of Pizza: https://www.buymeacoffee.com/herebebarr
✈️ TRAVEL FOR FREE With These Rewards Credit Cards: http://bit.ly/328jVBX
👕 Buy Some Merchandise (T-Shirts/Hoodies/Coffee Mugs): https://teespring.com/stores/here-be-barrs-store

CONNECT-
✅ VISIT MY WEBSITE: http://www.ny-guide.com
✅ FOLLOW ME ON IG: http://www.instagram.com/here.be.barr
✅ FOLLOW ME ON TWITTER: http://www.twitter.com/herebebarr
✅ LIKE US ON FACEBOOK: http://www.
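
Since much of the description is links, one option is to keep only the lines without a URL before passing the text downstream. This is a rough heuristic sketch, not a pytubefix feature. `strip_link_lines` is a hypothetical helper, and the sample text is abbreviated from the output above.

```python
import re

# Matches http:// or https:// followed by any non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_link_lines(description: str) -> str:
    """Drop blank lines and any line that contains a URL."""
    kept = [
        line.strip()
        for line in description.splitlines()
        if line.strip() and not URL_PATTERN.search(line)
    ]
    return "\n".join(kept)


sample = (
    "For 24 Hours we'll eat the most iconic food that New York has to offer.\n"
    "🛒 SHOP our NEW NYC Guides For Your Next Trip: http://www.thatch.co/@herebebarr\n"
    "\n"
    "CONNECT-\n"
    "✅ VISIT MY WEBSITE: http://www.ny-guide.com\n"
)
print(strip_link_lines(sample))
```

A line-level filter like this also drops any useful sentence that happens to include a link, so it is worth checking the result for each channel's description style.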

Engagement info:

In [10]:
print(f"""

Likes: {yt.likes}
Rating: {yt.rating}
Views: {yt.views}
Video Length : {yt.length}
""")



Likes: None
Rating: None
Views: 691241
Video Length : 757



YouTube chapters

Video chapters add info and context to each portion of the video and let you easily rewatch different parts of the video.

Creators can add their own video chapters for each uploaded video or rely on automatic video chapters.

[YouTube Help: video chapters](https://support.google.com/youtube/answer/9884579?hl=en)

In [11]:
rich.print( yt.chapters)
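
One use for chapters is mapping a timestamp (for example, from a transcript entry) to the chapter it falls in. The sketch below assumes each chapter exposes a title and a start time in seconds; check the printed objects above for the actual attribute names. `ChapterInfo` and `chapter_at` are hypothetical names, and the demo chapters are made up rather than taken from this video.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChapterInfo:
    """Hypothetical stand-in for a chapter: a title and its start time."""
    title: str
    start_seconds: int


def chapter_at(chapters: List[ChapterInfo], t: float) -> Optional[ChapterInfo]:
    """Return the chapter containing time t, or None if t precedes all chapters."""
    ordered = sorted(chapters, key=lambda c: c.start_seconds)
    starts = [c.start_seconds for c in ordered]
    # Index of the last chapter whose start is <= t
    idx = bisect_right(starts, t) - 1
    return ordered[idx] if idx >= 0 else None


# Made-up chapters, for illustration only
demo_chapters = [
    ChapterInfo("Intro", 0),
    ChapterInfo("First stop", 45),
    ChapterInfo("Second stop", 210),
]
print(chapter_at(demo_chapters, 100).title)  # First stop
```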

Heatmap

Data behind the "Most replayed" graph for videos on YouTube.



In [12]:
rich.print(yt.replayed_heatmap[:5])
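
The heatmap can be used to pick out the segments viewers replay most. The sketch below assumes each entry is a dict with `start_seconds`, `duration`, and `norm_intensity` keys; confirm this against the printed output above before relying on it. `top_replayed` is a hypothetical helper and the demo values are made up.

```python
from typing import Dict, List


def top_replayed(heatmap: List[Dict[str, float]], k: int = 3) -> List[Dict[str, float]]:
    """Return the k most replayed segments, ordered by where they start."""
    # Rank by intensity (assumed key name), highest first
    ranked = sorted(heatmap, key=lambda seg: seg["norm_intensity"], reverse=True)
    # Put the winners back in playback order
    return sorted(ranked[:k], key=lambda seg: seg["start_seconds"])


# Made-up heatmap entries, for illustration only
demo_heatmap = [
    {"start_seconds": 0.0, "duration": 7.5, "norm_intensity": 1.0},
    {"start_seconds": 7.5, "duration": 7.5, "norm_intensity": 0.3},
    {"start_seconds": 15.0, "duration": 7.5, "norm_intensity": 0.6},
    {"start_seconds": 22.5, "duration": 7.5, "norm_intensity": 0.1},
]
for segment in top_replayed(demo_heatmap, k=2):
    print(segment["start_seconds"], segment["norm_intensity"])
```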


### YouTube Transcript

In [13]:
from youtube_transcript_api import YouTubeTranscriptApi 
import re

In [14]:
video_url

'https://www.youtube.com/watch?v=ODluYyMZzs0'

In [15]:
transcript_dict = YouTubeTranscriptApi.get_transcript(video_id)

In [16]:
rich.print(transcript_dict)

From the output above, the transcript looks accurate, but the entries are not full sentences.

They lack punctuation and vary in length.
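
One way to work around the short, uneven entries is to merge them into fixed time windows. The sketch below assumes each entry is a dict with `text` and `start` keys, as returned by `get_transcript`. `chunk_transcript` is a hypothetical helper, the 30-second window is an arbitrary choice, and the demo entries are made up.

```python
from typing import Dict, List, Optional


def chunk_transcript(entries: List[Dict], window_seconds: float = 30.0) -> List[Dict]:
    """Merge consecutive transcript entries into chunks of about window_seconds."""
    chunks: List[Dict] = []
    current_texts: List[str] = []
    chunk_start: Optional[float] = None
    for entry in entries:
        if chunk_start is None:
            chunk_start = entry["start"]
        # Close the current chunk once an entry starts past the window
        if current_texts and entry["start"] - chunk_start >= window_seconds:
            chunks.append({"start": chunk_start, "text": " ".join(current_texts)})
            current_texts = []
            chunk_start = entry["start"]
        current_texts.append(entry["text"].replace("\n", " ").strip())
    if current_texts:
        chunks.append({"start": chunk_start, "text": " ".join(current_texts)})
    return chunks


# Made-up entries, for illustration only
demo_entries = [
    {"text": "welcome to the tour", "start": 0.0, "duration": 3.0},
    {"text": "first we get a slice", "start": 12.5, "duration": 4.0},
    {"text": "next up is a bagel", "start": 31.0, "duration": 3.5},
    {"text": "with cream cheese", "start": 40.0, "duration": 2.0},
]
for chunk in chunk_transcript(demo_entries):
    print(chunk["start"], chunk["text"])
```

Time-based windows do not restore punctuation or sentence boundaries; they only make the pieces more uniform in length.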

### YouTube Transcript using the LlamaHub Data Loader

[LlamaHub](https://llamahub.ai/)



In [17]:
from llama_index.readers.youtube_transcript import YoutubeTranscriptReader


In [18]:

loader = YoutubeTranscriptReader()
documents = loader.load_data(
    ytlinks=[video_url]
)

In [19]:
len(documents)

1

In [20]:
rich.print(documents[0])

In [21]:
type(documents[0])

llama_index.core.schema.Document

## Extracting content (video)

Videos and frames will be saved here.

In [22]:
from moviepy.editor import VideoFileClip


In [23]:
data_folder = "../data/"
Path(data_folder).mkdir(parents=True, exist_ok=True)

In [24]:
yt = YouTube(video_url, use_po_token=False)

In [25]:
yt

<pytubefix.__main__.YouTube object: videoId=ODluYyMZzs0>

In [27]:
output_path = os.path.join(data_folder, video_id)
video_file_path = os.path.join(output_path, "video.mp4")



In [28]:
yt.streams.get_highest_resolution().download(
        output_path=output_path, filename="video.mp4"
    )

'/home/jupyter/pydata_rag_video/notebooks/../data/ODluYyMZzs0/video.mp4'
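
With the file on disk, frames can be sampled from it at regular intervals. Below is a minimal sketch, assuming moviepy 1.x (matching the `moviepy.editor` import above). `frame_timestamps` and `save_frames` are hypothetical helpers, and the 10-second interval is an arbitrary choice.

```python
from pathlib import Path
from typing import List


def frame_timestamps(duration: float, every_seconds: float = 10.0) -> List[float]:
    """Return sample times 0, every_seconds, 2*every_seconds, ... below duration."""
    if every_seconds <= 0:
        raise ValueError("every_seconds must be positive")
    count = int(duration // every_seconds) + 1
    # Skip a timestamp that lands exactly on the end of the clip
    return [i * every_seconds for i in range(count) if i * every_seconds < duration]


def save_frames(video_path: str, out_dir: str, every_seconds: float = 10.0) -> List[Path]:
    """Save one PNG per sampled timestamp and return the written paths."""
    # Imported lazily so the timestamp helper works without moviepy installed
    from moviepy.editor import VideoFileClip

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    with VideoFileClip(str(video_path)) as clip:
        for t in frame_timestamps(clip.duration, every_seconds):
            target = out / f"frame_{int(t):05d}.png"
            clip.save_frame(str(target), t=t)
            saved.append(target)
    return saved


print(len(frame_timestamps(757, 10.0)))  # 76 frames for this 757-second video
```

In the notebook, a call might look like `save_frames(video_file_path, os.path.join(output_path, "frames"))`, where the `frames` subfolder name is an assumption.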

To simplify the next section, all the code we went over is collected in a utility file:

In [29]:
Code(filename='video_utils.py', language='python')


## Notes

We explored how to:
- extract video metadata
- fetch the video transcript
- download the video


Thanks to these great libraries: pytubefix and youtube_transcript_api.

