# YouTube Transcript Scraper Example

This notebook demonstrates how to use the TranscriptScraper to fetch and process YouTube transcripts.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from youtube_transcript_scraper import TranscriptScraper

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.width', 1000)

## Load the SponsorBlock Dataset

First, let's load the first batch (up to one million rows) of the SponsorBlock dataset, which contains the video IDs we want to scrape.

In [2]:
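# Load the first batch (up to 1,000,000 rows) of the SponsorBlock dataset
from pyarrow.parquet import ParquetFile
import pyarrow as pa

pf = ParquetFile(r'C:\Users\caotr\Downloads\sponsorTimes.parquet')
# An integer batch size avoids relying on float-to-int coercion of 1e6
batch1 = next(pf.iter_batches(batch_size=1_000_000))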
sponsor_df = pa.Table.from_batches([batch1]).to_pandas()
sponsor_df.head()

Unnamed: 0,videoID,startTime,endTime,votes,locked,incorrectVotes,UUID,userID,timeSubmitted,views,category,actionType,service,videoDuration,hidden,reputation,shadowHidden,hashedVideoID,userAgent,description
0,1rb3bMvDdX4,588.941,631.89777,159,0,1,28aff450-a372-11e9-b256-cb886cabe693,38e7c2af-09f4-4492-bf49-75e443962ccd,1564088876715,4642,sponsor,skip,YouTube,0.0,0,0.0,0,95e409452186a56331b7a58d518361285e18b8db50de20...,,
1,fBxtS9BpVWs,41.0,53.0,115,0,1,b2465943-1313-449c-b75c-08b14756ac0a,38e7c2af-09f4-4492-bf49-75e443962ccd,1564088876715,776,sponsor,skip,YouTube,0.0,0,0.0,0,bdd81b2b8192683242fe3608c45d5b958ddc71e9b2981a...,,
2,9P6rdqiybaw,488.5215,542.11035,-2,0,1,81024780-a367-11e9-b256-cb886cabe693,38e7c2af-09f4-4492-bf49-75e443962ccd,1564088876715,25661,sponsor,skip,YouTube,552.0,0,0.0,0,cc9cd26ee245cb89f2be13d047de8ea1a642c8f56bcb6e...,,
3,ulCdoCfw-bY,487.50198,547.4875,-2,0,1,16090680-a367-11e9-b256-cb886cabe693,38e7c2af-09f4-4492-bf49-75e443962ccd,1564088876715,26984,sponsor,skip,YouTube,0.0,0,0.0,0,177779136cde894988da5e2d3160ef38d302a8554d710d...,,
4,uqKGREZs6-w,475.52167,532.20874,302,0,1,622f9270-a2a1-11e9-b210-99c885575bb9,38e7c2af-09f4-4492-bf49-75e443962ccd,1564088876715,18060,sponsor,skip,YouTube,0.0,1,0.0,0,70f8d0e75affa202ab510bd86828080af1dddf9218be46...,,


## Initialize the TranscriptScraper

Now, we'll initialize the TranscriptScraper with our desired configuration.

In [3]:
# Initialize the scraper
scraper = TranscriptScraper(
    output_dir=r'C:\Users\caotr\Downloads\transcript_data',  # Directory to save data
    max_workers=16,           # Number of parallel workers
    rate_limit=24000,         # Requests per minute
    cooldown_time=5,          # Seconds to pause when rate limited
    unlimited_mode=True,      # Retry after rate limiting
    checkpoint_interval=500   # Save a checkpoint every 500 videos
)
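
The `rate_limit` setting above works out to 24000 / 60 = 400 requests per second. The internals of `TranscriptScraper` are not shown in this notebook, so the following is only a minimal sketch of how a requests-per-minute throttle could work. The class `MinuteRateLimiter` and its parameters are hypothetical and are not part of the library.

```python
import threading
import time
from collections import deque


class MinuteRateLimiter:
    """Hypothetical sliding-window limiter: at most `rate_limit` calls per `window` seconds."""

    def __init__(self, rate_limit, window=60.0, clock=time.monotonic, sleep=time.sleep):
        self.rate_limit = rate_limit
        self.window = window
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._calls = deque()    # timestamps of recent calls, oldest first
        self._lock = threading.Lock()

    def acquire(self):
        """Block until another request is allowed, then record it."""
        while True:
            with self._lock:
                now = self._clock()
                # Drop timestamps that have left the window
                while self._calls and now - self._calls[0] >= self.window:
                    self._calls.popleft()
                if len(self._calls) < self.rate_limit:
                    self._calls.append(now)
                    return
                # Time until the oldest recorded call leaves the window
                wait = self.window - (now - self._calls[0])
            # Sleep outside the lock so other workers are not blocked
            self._sleep(wait)
```

Each worker thread would call `acquire()` before issuing a request. A separate cooldown, such as the 5 seconds configured above, would presumably apply only after the server itself signals rate limiting.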

## Extract Unique Video IDs

Next, let's extract unique video IDs from the SponsorBlock dataset to avoid duplicate processing.

In [4]:
# Get unique video IDs
video_ids = scraper.get_unique_video_ids(sponsor_df)
print(f"Found {len(video_ids)} unique video IDs")

# Display a few example IDs
print("Sample video IDs:")
print(video_ids[:5])

Found 596667 unique video IDs
Sample video IDs:
['1rb3bMvDdX4', 'fBxtS9BpVWs', '9P6rdqiybaw', 'ulCdoCfw-bY', 'uqKGREZs6-w']
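
The sample output lists the IDs in the same order as the rows of `sponsor_df`, which suggests an order-preserving de-duplication. How `get_unique_video_ids` is actually implemented is not shown here. The sketch below illustrates one common way to do it in plain Python; `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(ids):
    """Return the distinct IDs in first-seen order (hypothetical helper)."""
    # dict keys keep insertion order (Python 3.7+), so repeated IDs collapse
    # onto their first occurrence
    return list(dict.fromkeys(ids))


# The third entry repeats the first one
sample = ['1rb3bMvDdX4', 'fBxtS9BpVWs', '1rb3bMvDdX4', '9P6rdqiybaw']
print(unique_in_order(sample))
```

With pandas, `sponsor_df['videoID'].drop_duplicates().tolist()` should give an equivalent result.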


## Process a Small Test Batch

Let's first process a small batch of videos to test the functionality.

In [5]:
# Process a small batch first (10 videos); uncomment the lines below to run
# test_batch = video_ids[:10]
# processed_ids = scraper.process_video_batch(test_batch, resume=False)

## View Statistics

Now let's check the statistics of the test run.

In [6]:
# Display stats; uncomment the line below to run
# scraper.display_stats()

## Process a Larger Batch

If the test is successful, we can process a larger batch of videos.

In [7]:
# Process a larger batch (1,000 videos); uncomment the lines below to run
# You can adjust this number or use the full list as needed
# larger_batch = video_ids[:1000]
# processed_ids = scraper.process_video_batch(larger_batch, resume=True)

## View Updated Statistics

In [8]:
# Display updated stats; uncomment the line below to run
# scraper.display_stats()

## Process the Full Dataset

Once you're confident in the scraper's functionality, you can process the full list of video IDs.

In [5]:
# Process all videos, resuming from the last checkpoint
processed_ids = scraper.process_video_batch(video_ids, resume=True)

2025-04-06 15:51:53,526 - INFO - Loaded checkpoint with 134489 processed videos
2025-04-06 15:51:53,624 - INFO - Starting processing of 462178 videos
Processing videos:   0%|          | 0/462178 [00:00<?, ?it/s]
2025-04-06 15:52:04,038 - ERROR - Error processing a video: 'json_decode_errors'
Processing videos:   0%|          | 1/462178 [00:00<96:55:29,  1.32it/s]
2025-04-06 15:52:04,060 - ERROR - Error processing a video: 'json_decode_errors'
2025-04-06 15:52:04,111 - ERROR - Error processing a video: 'json_decode_errors'
2025-04-06 15:52:04,119 - ERROR - Error processing a video: 'json_decode_errors'
2025-04-06 15:52:04,131 - ERROR - Error processing a video: 'json_decode_errors'
2025-04-06 15:52:04,133 - ERROR - Error processing a video: 'json_decode_errors'
2025-04-06 15:52:04,146 - ERROR - Error processing a video: 'json_decode_errors'
Processing videos:   0%|          | 7/462178 [00:00<13:13:12,  9.71it/s]
2025-04-06 15:52:04,219 - ERROR - Error processing a video: 'json_decode_error
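
The figures in the log are consistent with resuming from a checkpoint: 596667 unique IDs minus 134489 already processed leaves 462178 to go. The scraper's actual checkpoint format is not shown in this notebook. The sketch below assumes a JSON list of processed IDs, and the helpers `load_checkpoint`, `save_checkpoint`, and `remaining_ids` are hypothetical names.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of already-processed IDs, or an empty set if no checkpoint exists."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as f:
        return set(json.load(f))


def save_checkpoint(path, processed):
    """Write to a temporary file first, then swap it in, so an interrupted
    run cannot leave a half-written checkpoint behind."""
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w', encoding='utf-8') as f:
        json.dump(sorted(processed), f)
    os.replace(tmp_path, path)


def remaining_ids(all_ids, processed):
    """Keep the original order while skipping IDs that are already done."""
    return [vid for vid in all_ids if vid not in processed]


# Tiny demonstration with made-up IDs
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.json')
save_checkpoint(checkpoint_path, {'a', 'c'})
todo = remaining_ids(['a', 'b', 'c', 'd'], load_checkpoint(checkpoint_path))
print(todo)
```

Storing the processed IDs as a set keeps each membership check constant-time, which matters when filtering several hundred thousand IDs.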


## Convert to DataFrame

Now, let's convert the fetched transcripts to a DataFrame for analysis.

In [None]:
# Convert to DataFrame
transcript_df = scraper.convert_to_dataframe()
print(f"Transcript DataFrame shape: {transcript_df.shape}")
transcript_df.head()

## Merge with SponsorBlock Data

Next, let's merge the transcript data with the SponsorBlock data to flag the transcript segments that overlap with sponsor segments.

In [None]:
# Merge with sponsor data
merged_df = scraper.merge_with_sponsor_data(transcript_df, sponsor_df)
print(f"Merged DataFrame shape: {merged_df.shape}")
merged_df.head()
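
The overlap test behind such a merge can be stated compactly: two time ranges intersect when each one starts before the other ends. The following is a minimal sketch of that logic in plain Python, not the code of `merge_with_sponsor_data`. The helpers `overlaps` and `label_segments` are hypothetical, and the transcript segments are made up for illustration.

```python
def overlaps(seg_start, seg_end, sponsor_start, sponsor_end):
    """True if half-open ranges [seg_start, seg_end) and [sponsor_start, sponsor_end) intersect."""
    return seg_start < sponsor_end and sponsor_start < seg_end


def label_segments(segments, sponsor_ranges):
    """Return copies of the segments with 'end' and 'is_sponsor' fields added."""
    labeled = []
    for seg in segments:
        seg_end = seg['start'] + seg['duration']
        flag = any(overlaps(seg['start'], seg_end, s, e) for s, e in sponsor_ranges)
        labeled.append({**seg, 'end': seg_end, 'is_sponsor': flag})
    return labeled


# Made-up transcript segments; the sponsor range (41.0 to 53.0 seconds)
# matches the row for video fBxtS9BpVWs in the SponsorBlock sample
segments = [
    {'text': 'intro', 'start': 0.0, 'duration': 40.0},
    {'text': 'ad read', 'start': 40.0, 'duration': 5.0},
    {'text': 'main content', 'start': 53.0, 'duration': 10.0},
]
labeled = label_segments(segments, [(41.0, 53.0)])
print([seg['is_sponsor'] for seg in labeled])
```

Using half-open ranges means a segment that starts exactly where a sponsor range ends (53.0 here) is not flagged. Whether the scraper treats boundaries the same way is an assumption.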

## Analyze Sponsor Segments

Let's analyze the transcript segments that are marked as sponsors.

In [None]:
# Count how many transcript segments are in sponsor regions
sponsor_count = merged_df['is_sponsor'].sum()
total_count = len(merged_df)
sponsor_percentage = (sponsor_count / total_count) * 100

print(f"Sponsor segments: {sponsor_count} out of {total_count} ({sponsor_percentage:.2f}%)")
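
The percentage above counts segments, so a two-second caption weighs as much as a ten-second one. Weighting by duration gives the share of transcript time instead. The sketch below shows this in plain Python on made-up rows; `sponsor_share_by_duration` is a hypothetical helper, and the field names `duration` and `is_sponsor` follow the columns used elsewhere in this notebook.

```python
def sponsor_share_by_duration(rows):
    """Fraction of transcript time flagged as sponsor.

    Each row needs a numeric 'duration' and a boolean 'is_sponsor'.
    """
    total = sum(r['duration'] for r in rows)
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty or zero-length batch
    sponsored = sum(r['duration'] for r in rows if r['is_sponsor'])
    return sponsored / total


# Made-up rows: one of three segments is a sponsor read,
# but it covers only 2 of the 10 seconds
rows = [
    {'duration': 6.0, 'is_sponsor': False},
    {'duration': 2.0, 'is_sponsor': True},
    {'duration': 2.0, 'is_sponsor': False},
]
share = sponsor_share_by_duration(rows)
print(f"{share:.0%} of transcript time is sponsored")
```

Assuming `is_sponsor` is a boolean column, the pandas equivalent would be `merged_df.loc[merged_df['is_sponsor'], 'duration'].sum() / merged_df['duration'].sum()`.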

## Example: Visualize a Single Video's Transcript with Sponsor Regions

Let's visualize the transcript of a single video, highlighting the sponsor regions.

In [None]:
# Select a video that has both transcript and sponsor data
videos_with_sponsors = merged_df[merged_df['is_sponsor'] == True]['video_id'].unique()

if len(videos_with_sponsors) > 0:
    example_video_id = videos_with_sponsors[0]
    
    # Get the transcript for this video
    video_transcript = merged_df[merged_df['video_id'] == example_video_id].sort_values('start')
    
    # Plot the timeline
    plt.figure(figsize=(15, 6))
    
    # Plot all transcript segments
    plt.barh(y=0, width=video_transcript['duration'], left=video_transcript['start'], 
             color='lightblue', alpha=0.5, label='Transcript')
    
    # Highlight sponsor segments
    sponsor_segments = video_transcript[video_transcript['is_sponsor'] == True]
    plt.barh(y=0, width=sponsor_segments['duration'], left=sponsor_segments['start'], 
             color='red', alpha=0.5, label='Sponsor')
    
    plt.title(f'Transcript Timeline for Video ID: {example_video_id}')
    plt.xlabel('Time (seconds)')
    plt.yticks([])
    plt.legend()
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.show()
    
    # Show the text of sponsor segments
    print("Sample of sponsor segment texts:")
    print(sponsor_segments[['text', 'start', 'end']].head(10))
else:
    print("No videos with both transcript and sponsor data found in the processed batch.")

## Save the Results

Finally, let's save the merged DataFrame to a CSV file for further analysis.

In [None]:
# Save the merged DataFrame to CSV
merged_df.to_csv('transcript_with_sponsors.csv', index=False)
print("Data saved to 'transcript_with_sponsors.csv'")

## Conclusion

This notebook has demonstrated how to use the TranscriptScraper to fetch YouTube transcripts, merge them with SponsorBlock data, and perform basic analysis on the results.