<a href="https://colab.research.google.com/github/krishnavarathan/REST-API/blob/main/youtube_transricpt_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to fetch YouTube video transcripts using the `youtube-transcript-api` library. It specifically addresses the IP blocking often encountered when running on cloud platforms by routing requests through a proxy configured with `WebshareProxyConfig`. The process involves installing the library, setting up the proxy with custom headers, fetching transcripts for a given video ID with basic error handling, and then loading the retrieved text into a pandas Series for further analysis.


In [1]:
# Install the YouTube Transcript API library
!pip install youtube-transcript-api



**Problem statement:** YouTube has started blocking most IPs that are known to belong to cloud providers (AWS, Google Cloud Platform, Azure, etc.), which means you will most likely run into `RequestBlocked` or `IpBlocked` exceptions when deploying your code to a cloud platform.

**Solution:** To bypass this issue, I used a Webshare proxy package. The "Proxy Username" and "Proxy Password" are taken from the official Webshare site after creating an account.

**Results:** Unlimited API requests, but increased latency.
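
To put a number on the added latency, one option is to time the same call with and without the proxy. The helper below is a minimal sketch using only the standard library; the name `timed` is hypothetical and not part of `youtube-transcript-api`.

```python
import time


def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    started = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - started


# Possible usage once a client exists (kept as a comment: it needs network access):
# transcript, seconds = timed(ytt_api.fetch, 'NyOtXBBmjLY')
# print(f"fetch took {seconds:.2f}s")
```

Running it once against a proxied client and once against a direct one gives a rough comparison; a single measurement is noisy, so averaging a few runs is advisable.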

In [52]:
import pandas as pd
import numpy as np
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.proxies import WebshareProxyConfig
from requests import Session

http_client = Session()

# set custom header
http_client.headers.update({"Accept-Encoding": "gzip, deflate"})

# set path to CA_BUNDLE file - Removed this line as it was causing the error
# http_client.verify = "/path/to/certfile"

ytt_api = YouTubeTranscriptApi(
    http_client=http_client,
    proxy_config=WebshareProxyConfig(
        proxy_username="ujekszac",
        proxy_password="6ndc1xya46ow",
    ),
)
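
The cell above hard-codes the proxy username and password, which exposes them to anyone who can read the notebook. A common alternative is to read them from environment variables. The sketch below assumes two variable names, `WEBSHARE_PROXY_USERNAME` and `WEBSHARE_PROXY_PASSWORD`; both names and the helper `load_proxy_credentials` are hypothetical, not something Webshare or the library prescribes.

```python
import os


def load_proxy_credentials(env=None):
    """Return (username, password) read from environment variables.

    The variable names below are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    try:
        return env["WEBSHARE_PROXY_USERNAME"], env["WEBSHARE_PROXY_PASSWORD"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Possible usage with the proxy configuration from this notebook:
# username, password = load_proxy_credentials()
# proxy_config = WebshareProxyConfig(proxy_username=username, proxy_password=password)
```

In Colab, the variables can be set for the session with `os.environ[...] = ...` in a cell that is not committed.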

In [50]:
print(http_client.headers)

{'User-Agent': 'python-requests/2.32.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'close', 'Accept-Language': 'en-US'}


In [71]:
# `languages` lists the desired transcript languages in descending priority
# `preserve_formatting=True` keeps HTML formatting elements such as <i> (italics) and <b> (bold)
from youtube_transcript_api import NoTranscriptFound, VideoUnavailable

vedio_id = 'NyOtXBBmjLY'
try:
  res = ytt_api.fetch(vedio_id, languages=['de', 'en'], preserve_formatting=True)
  print(len(res))
except VideoUnavailable:
  print("VideoUnavailable error occurred!")
except NoTranscriptFound:
  print("NoTranscriptFound error occurred!")
except Exception as e:
  print(f"Unable to fetch the data from source: {e}")

187
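
Because blocked requests can be intermittent, it may help to retry the fetch a few times before giving up. The helper below is a generic sketch with exponential backoff; `fetch_with_retries` and its parameters are hypothetical names, and the commented usage assumes that the `RequestBlocked` and `IpBlocked` exceptions named in the problem statement can be imported from the package.

```python
import time


def fetch_with_retries(fetch, retry_on=(Exception,), attempts=3,
                       base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries.
    retry_on is a tuple of exception classes that should trigger a retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Possible usage (kept as a comment: it needs network access):
# from youtube_transcript_api import IpBlocked, RequestBlocked
# res = fetch_with_retries(
#     lambda: ytt_api.fetch(vedio_id, languages=['de', 'en']),
#     retry_on=(RequestBlocked, IpBlocked),
# )
```

Restricting `retry_on` to the blocking errors avoids retrying failures that will never succeed, such as a video with no transcript.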


In [46]:
# Print each snippet's text with a running counter
for a, i in enumerate(res, start=1):
  print(f"Stanza {a} {i.text}")

Stanza 1 What's up everybody? Adrian, founder of
Stanza 2 Scrape Creators, the web scraping guy
Stanza 3 here and wanted to show you how to get
Stanza 4 transcripts from a channel's videos on
Stanza 5 YouTube. So here, like if we want to get
Stanza 6 the videos, the transcripts of the
Stanza 7 all-in podcast. Let's say we want to get
Stanza 8 like the first 20 or 30, then I'm going
Stanza 9 to show you how to do that and show you
Stanza 10 how to do it fast. So we are going to be
Stanza 11 using the scrape creators API and I
Stanza 12 created it. It's pretty fast if I do say
Stanza 13 so myself. So we're going to use these
Stanza 14 two endpoints, channel videos. So that
Stanza 15 is /v1 YouTube channel videos. You can
Stanza 16 see the sample responses here as well as
Stanza 17 how to make the requests. And I even
Stanza 18 have this nice copy for AI button that
Stanza 19 will copy the sample request and
Stanza 20 response. And then you can plug it in a
Stanza 21 cursor or chatbt and 

In [47]:
# Raw information about the snippet
res.to_raw_data()

[{'text': "What's up everybody? Adrian, founder of",
  'start': 0.4,
  'duration': 2.88},
 {'text': 'Scrape Creators, the web scraping guy',
  'start': 1.92,
  'duration': 2.8},
 {'text': 'here and wanted to show you how to get',
  'start': 3.28,
  'duration': 4.32},
 {'text': "transcripts from a channel's videos on",
  'start': 4.72,
  'duration': 5.24},
 {'text': 'YouTube. So here, like if we want to get',
  'start': 7.6,
  'duration': 4.64},
 {'text': 'the videos, the transcripts of the',
  'start': 9.96,
  'duration': 3.96},
 {'text': "all-in podcast. Let's say we want to get",
  'start': 12.24,
  'duration': 3.439},
 {'text': "like the first 20 or 30, then I'm going",
  'start': 13.92,
  'duration': 2.8},
 {'text': 'to show you how to do that and show you',
  'start': 15.679,
  'duration': 3.36},
 {'text': 'how to do it fast. So we are going to be',
  'start': 16.72,
  'duration': 4.639},
 {'text': 'using the scrape creators API and I',
  'start': 19.039,
  'duration': 4.16},
 {'t
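
Each raw entry carries `text`, `start`, and `duration`, so the `start` value can be turned into a readable timestamp for each line. The sketch below works on plain dictionaries in the shape shown above; the helper names `format_timestamp` and `to_timestamped_lines` are hypothetical.

```python
def format_timestamp(seconds):
    """Convert seconds to an MM:SS or H:MM:SS label, truncating fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def to_timestamped_lines(raw_snippets):
    """Prefix each snippet's text with its start time."""
    return [f"[{format_timestamp(s['start'])}] {s['text']}" for s in raw_snippets]


# Two entries copied from the output above
sample = [
    {"text": "What's up everybody? Adrian, founder of", "start": 0.4, "duration": 2.88},
    {"text": "Scrape Creators, the web scraping guy", "start": 1.92, "duration": 2.8},
]
for line in to_timestamped_lines(sample):
    print(line)
# [00:00] What's up everybody? Adrian, founder of
# [00:01] Scrape Creators, the web scraping guy
```

With the real data, `to_timestamped_lines(res.to_raw_data())` would produce one line per snippet.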

In [32]:
# List all transcripts available for the video,
# then pick the first match from the language priority list

transcript_list=ytt_api.list(vedio_id)
print(transcript_list)
transcript = transcript_list.find_transcript(['de', 'en'])
transcript.language

For this video (NyOtXBBmjLY) transcripts are available in the following languages:

(MANUALLY CREATED)
None

(GENERATED)
 - en ("English (auto-generated)")[TRANSLATABLE]

(TRANSLATION LANGUAGES)
 - ar ("Arabic")
 - zh-Hant ("Chinese (Traditional)")
 - nl ("Dutch")
 - fr ("French")
 - de ("German")
 - hi ("Hindi")
 - id ("Indonesian")
 - it ("Italian")
 - ja ("Japanese")
 - ko ("Korean")
 - pt ("Portuguese")
 - ru ("Russian")
 - es ("Spanish")
 - th ("Thai")
 - uk ("Ukrainian")
 - vi ("Vietnamese")


'English (auto-generated)'
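
The result is English even though `'de'` comes first in the list. As far as I understand the library, `find_transcript` treats the list as a priority order over transcripts that actually exist for the video, and German appears here only as a translation target. The function below is an illustration of that priority rule in plain Python, not the library's own code; `pick_language` is a hypothetical name.

```python
def pick_language(available_codes, preferred_codes):
    """Return the first preferred code that is available, or None."""
    for code in preferred_codes:
        if code in available_codes:
            return code
    return None


# This video only has an auto-generated English transcript
print(pick_language(["en"], ["de", "en"]))  # en
```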

In [33]:
print(transcript)

en ("English (auto-generated)")[TRANSLATABLE]


In [34]:
# filter for manually created transcripts
# transcript = transcript_list.find_manually_created_transcript(['it', 'en'])
# transcript
#  for this vedio_id no manually created transcript is available

In [35]:
print(
    transcript.video_id, '\n',
    transcript.language, '\n',
    transcript.language_code, '\n',
    # whether it has been manually created or generated by YouTube
    transcript.is_generated, '\n',
    # whether this transcript can be translated or not
    transcript.is_translatable, '\n',
    # a list of languages the transcript can be translated to
    transcript.translation_languages, '\n',
)

NyOtXBBmjLY 
 English (auto-generated) 
 en 
 True 
 True 
 [_TranslationLanguage(language='Arabic', language_code='ar'), _TranslationLanguage(language='Chinese (Traditional)', language_code='zh-Hant'), _TranslationLanguage(language='Dutch', language_code='nl'), _TranslationLanguage(language='French', language_code='fr'), _TranslationLanguage(language='German', language_code='de'), _TranslationLanguage(language='Hindi', language_code='hi'), _TranslationLanguage(language='Indonesian', language_code='id'), _TranslationLanguage(language='Italian', language_code='it'), _TranslationLanguage(language='Japanese', language_code='ja'), _TranslationLanguage(language='Korean', language_code='ko'), _TranslationLanguage(language='Portuguese', language_code='pt'), _TranslationLanguage(language='Russian', language_code='ru'), _TranslationLanguage(language='Spanish', language_code='es'), _TranslationLanguage(language='Thai', language_code='th'), _TranslationLanguage(language='Ukrainian', language_code

In [51]:
# Fetch the selected transcript and keep only the text of each snippet
sn = transcript.fetch().snippets
text_only = [segment.text for segment in sn]
text_only

["What's up everybody? Adrian, founder of",
 'Scrape Creators, the web scraping guy',
 'here and wanted to show you how to get',
 "transcripts from a channel's videos on",
 'YouTube. So here, like if we want to get',
 'the videos, the transcripts of the',
 "all-in podcast. Let's say we want to get",
 "like the first 20 or 30, then I'm going",
 'to show you how to do that and show you',
 'how to do it fast. So we are going to be',
 'using the scrape creators API and I',
 "created it. It's pretty fast if I do say",
 "so myself. So we're going to use these",
 'two endpoints, channel videos. So that',
 'is /v1 YouTube channel videos. You can',
 'see the sample responses here as well as',
 'how to make the requests. And I even',
 'have this nice copy for AI button that',
 'will copy the sample request and',
 'response. And then you can plug it in a',
 'cursor or chatbt and it can do it for',
 'you or you can do it yourself, whatever',
 'you like. Gives you the headers and',
 'query paramete

In [76]:
# Fit the transcript text into a pandas Series
se_sr = pd.Series(text_only, name='Transcripts')
se_sr

Unnamed: 0,Transcripts
0,"What's up everybody? Adrian, founder of"
1,"Scrape Creators, the web scraping guy"
2,here and wanted to show you how to get
3,transcripts from a channel's videos on
4,"YouTube. So here, like if we want to get"
...,...
182,"library, Twitter, including search,"
183,"followers, following, Reddit, threads,"
184,"truth social, like you name it, we got"
185,it. So check that out.
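
The rows are caption-sized fragments rather than sentences, which can get in the way of text analysis. One possible preprocessing step is to join the fragments and split again at sentence-ending punctuation. The sketch below uses only the standard library; `regroup_into_sentences` is a hypothetical helper, and the split rule is a simple heuristic that will mis-split abbreviations such as "e.g. this".

```python
import re


def regroup_into_sentences(fragments):
    """Join caption fragments and split the text after '.', '!' or '?'."""
    text = " ".join(fragment.strip() for fragment in fragments)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [sentence for sentence in sentences if sentence]


# The last two rows of the Series above
tail = ["truth social, like you name it, we got", "it. So check that out."]
print(regroup_into_sentences(tail))
# ['truth social, like you name it, we got it.', 'So check that out.']
```

The result can be loaded back into pandas with `pd.Series(regroup_into_sentences(text_only), name='Sentences')`.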
