In [2]:
%pip install python-dotenv

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
# Import libraries
import pandas as pd 
import numpy as np
from googleapiclient.discovery import build
from dotenv import dotenv_values

In [3]:
# Set up API key and YouTube object
api_key = dotenv_values().get('API_KEY')
youtube = build('youtube', 'v3', developerKey=api_key)
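
`dotenv_values().get('API_KEY')` returns `None` when the `.env` file or the key is missing, and the problem may then only surface later, on the first request. A minimal guard could fail early instead. This is a sketch: `require_key` is a hypothetical helper name, not part of python-dotenv, and plain dictionaries stand in for the mapping that `dotenv_values()` returns.

```python
from typing import Mapping, Optional


def require_key(config: Mapping[str, Optional[str]], name: str = "API_KEY") -> str:
    """Return config[name], raising a clear error if it is missing or empty."""
    value = config.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# A plain dict stands in for the result of dotenv_values() (assumption).
print(require_key({"API_KEY": "example-key"}))
```

In the notebook this would be called as `require_key(dotenv_values())` before building the `youtube` object.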

As a first example, we use a YouTube video from Kapuso Mo, Jessica Soho about the triple tanker oil spill in Manila Bay.

In [6]:
video_id = '1iU7y20ZxyE'
video_response = youtube.commentThreads().list(
    videoId=video_id, 
    part='snippet,replies', 
    maxResults=100,  # the API returns at most 100 threads per page
    order='time', 
).execute()
video_response

{'kind': 'youtube#commentThreadListResponse',
 'etag': 'XlngZp56JGEv2WNti_YtkWiIV_g',
 'nextPageToken': 'Z2V0X25ld2VzdF9maXJzdC0tQ2dnSWdBUVZGN2ZST0JJRkNKMGdHQUVTQlFpSElCZ0FFZ1VJcUNBWUFCSUZDSWdnR0FBU0JRaUpJQmdBR0FBaURnb01DT1hpeDdVR0VJRG83SVVC',
 'pageInfo': {'totalResults': 100, 'resultsPerPage': 100},
 'items': [{'kind': 'youtube#commentThread',
   'etag': 'Bg7f7Intx9SJTHt4auXEUvNFjYk',
   'id': 'UgwI3MwQ0mJe6M_MQ_N4AaABAg',
   'snippet': {'channelId': 'UCj5RwDivLksanrNvkW0FB4w',
    'videoId': '1iU7y20ZxyE',
    'topLevelComment': {'kind': 'youtube#comment',
     'etag': 'NJBDmZQgAJyaYYhya--r0KeHpEo',
     'id': 'UgwI3MwQ0mJe6M_MQ_N4AaABAg',
     'snippet': {'channelId': 'UCj5RwDivLksanrNvkW0FB4w',
      'videoId': '1iU7y20ZxyE',
      'textDisplay': 'Tikom ang bibig ng mga nasa taas sa laki ng tapal ng SMC',
      'textOriginal': 'Tikom ang bibig ng mga nasa taas sa laki ng tapal ng SMC',
      'authorDisplayName': '@ryanjohndimaisip8136',
      'authorProfileImageUrl': 'https://yt3.

If a video has more than 100 comment threads, pagination is required to retrieve the rest, because `.list()` only accepts a `maxResults` value from 1 to 100 (more on this later).
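
The pagination pattern described above can be sketched without calling the API. The `fetch_page` function and the `FAKE_PAGES` data are hypothetical stand-ins for `youtube.commentThreads().list(...).execute()`; only the `items` and `nextPageToken` keys mirror the response shown earlier.

```python
from typing import Optional

# Hypothetical stand-in for three pages of API responses.
FAKE_PAGES = {
    None: {"items": ["c1", "c2"], "nextPageToken": "p2"},
    "p2": {"items": ["c3", "c4"], "nextPageToken": "p3"},
    "p3": {"items": ["c5"]},  # last page: no nextPageToken
}


def fetch_page(page_token: Optional[str] = None) -> dict:
    """Pretend to request one page of comment threads."""
    return FAKE_PAGES[page_token]


def fetch_all(limit: int) -> list:
    """Follow nextPageToken until `limit` items are collected or pages run out."""
    items = []
    token = None
    while True:
        page = fetch_page(token)
        items.extend(page["items"])
        token = page.get("nextPageToken")
        if token is None or len(items) >= limit:
            break
    return items[:limit]


print(fetch_all(limit=3))   # ['c1', 'c2', 'c3']
print(fetch_all(limit=10))  # ['c1', 'c2', 'c3', 'c4', 'c5']
```

Note that this sketch trims the result to exactly `limit` items, which is a design choice of the example rather than something the API does.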

Now we can inspect a few fields of the first comment in the list.

In [7]:
print(f'https://www.youtube.com/watch?v={video_id}&lc={video_response["items"][0]["snippet"]["topLevelComment"]["id"]}')
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['textDisplay'])
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['publishedAt'])
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['textOriginal'])
print(video_response['items'][0]['snippet']['topLevelComment']['snippet']['likeCount'])

https://www.youtube.com/watch?v=1iU7y20ZxyE&lc=UgwI3MwQ0mJe6M_MQ_N4AaABAg
Tikom ang bibig ng mga nasa taas sa laki ng tapal ng SMC
2024-08-23T02:16:14Z
Tikom ang bibig ng mga nasa taas sa laki ng tapal ng SMC
0
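
The repeated `['snippet']['topLevelComment']['snippet']` lookups can be wrapped in a small helper. This is only a sketch: `flatten_thread` is a hypothetical name, and the sample dictionary reproduces just the fields printed above.

```python
def flatten_thread(item: dict, video_id: str) -> dict:
    """Pull the commonly used fields out of one commentThread item."""
    top = item["snippet"]["topLevelComment"]
    body = top["snippet"]
    return {
        "link": f"https://www.youtube.com/watch?v={video_id}&lc={top['id']}",
        "text": body["textOriginal"],
        "date_published": body["publishedAt"],
        "like_count": body["likeCount"],
    }


# Sample item trimmed down to the fields shown in the output above.
sample = {
    "snippet": {
        "topLevelComment": {
            "id": "UgwI3MwQ0mJe6M_MQ_N4AaABAg",
            "snippet": {
                "textOriginal": "Tikom ang bibig ng mga nasa taas sa laki ng tapal ng SMC",
                "publishedAt": "2024-08-23T02:16:14Z",
                "likeCount": 0,
            },
        }
    }
}

row = flatten_thread(sample, "1iU7y20ZxyE")
print(row["link"])
```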


In [8]:
# Initialize storage for data
comments = [] 

# Iterate through the comments 
for item in video_response['items']: 

    comment = item['snippet']['topLevelComment']['snippet'] 

    comments.append([
        comment['textDisplay'],
        f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
        pd.to_datetime(comment['publishedAt']),
        comment['textOriginal'],
        comment['likeCount'],
        np.nan, # np.nan for parent_id column later on
    ])

    total_reply_count = item['snippet']['totalReplyCount'] 

    # Iterate through replies (if any) 
    if total_reply_count > 0: 
        parent_id = item['snippet']['topLevelComment']['id']

        replies = youtube.comments().list(
            part='snippet',
            parentId=parent_id, 
            maxResults=50,
        ).execute()

        for reply in replies['items']: 
            replyBody = reply['snippet'] 
            comments.append([
                replyBody['textDisplay'],
                f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
                replyBody['publishedAt'],
                replyBody['textOriginal'],
                replyBody['likeCount'],
                replyBody['parentId'],
            ])

comments

[['Tikom ang bibig ng mga nasa taas sa laki ng tapal ng SMC',
  'https://www.youtube.com/watch?v=1iU7y20ZxyE&lc=UgwI3MwQ0mJe6M_MQ_N4AaABAg',
  Timestamp('2024-08-23 02:16:14+0000', tz='UTC'),
  'Tikom ang bibig ng mga nasa taas sa laki ng tapal ng SMC',
  0,
  nan],
 ['dapat managot ung dapat managot dyn kapabayaan inuuna pa KC ung kaperahan ung kkitain',
  'https://www.youtube.com/watch?v=1iU7y20ZxyE&lc=Ugydlk6goUWQszU82vx4AaABAg',
  Timestamp('2024-08-13 12:04:57+0000', tz='UTC'),
  'dapat managot ung dapat managot dyn kapabayaan inuuna pa KC ung kaperahan ung kkitain',
  0,
  nan],
 ['perwisyo yan',
  'https://www.youtube.com/watch?v=1iU7y20ZxyE&lc=UgxsJYInZkDtzLxZyU94AaABAg',
  Timestamp('2024-08-13 12:01:15+0000', tz='UTC'),
  'perwisyo yan',
  0,
  nan],
 ['Ang dahilan ay Ng kbubuhan',
  'https://www.youtube.com/watch?v=1iU7y20ZxyE&lc=UgwGX_twZNBeBI8ut4Z4AaABAg',
  Timestamp('2024-08-12 09:35:56+0000', tz='UTC'),
  'Ang dahilan ay Ng kbubuhan',
  0,
  nan],
 ['Ai nko alam Ng may 

In [9]:
youtube_corpus = pd.DataFrame(
    comments, columns=['snippet','link','date_published','text','like_count','reply_parent_id',]
)
youtube_corpus

Unnamed: 0,snippet,link,date_published,text,like_count,reply_parent_id
0,Tikom ang bibig ng mga nasa taas sa laki ng ta...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-23 02:16:14+00:00,Tikom ang bibig ng mga nasa taas sa laki ng ta...,0,
1,dapat managot ung dapat managot dyn kapabayaan...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-13 12:04:57+00:00,dapat managot ung dapat managot dyn kapabayaan...,0,
2,perwisyo yan,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-13 12:01:15+00:00,perwisyo yan,0,
3,Ang dahilan ay Ng kbubuhan,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-12 09:35:56+00:00,Ang dahilan ay Ng kbubuhan,0,
4,Ai nko alam Ng may bagyo mga vompanya tlgang w...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-11 11:40:17+00:00,Ai nko alam Ng may bagyo mga vompanya tlgang w...,0,
...,...,...,...,...,...,...
97,Jusmiyo marimar ung claims nmin dito sa Mindor...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-06 11:10:19+00:00,Jusmiyo marimar ung claims nmin dito sa Mindor...,0,
98,Kung sa ibang bansa nangyare ito kulong na aga...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-06 10:57:00+00:00,Kung sa ibang bansa nangyare ito kulong na aga...,0,
99,Sana yong mga kompanya na mayari ng mga barko ...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-06 10:36:04+00:00,Sana yong mga kompanya na mayari ng mga barko ...,0,
100,Sayang mahal na nga ang Lang is tinatapon Lang...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-06 09:50:45+00:00,Sayang mahal na nga ang Lang is tinatapon Lang...,0,


In [11]:
# Target number of comments; the loop stops after the first page that reaches it
no_comments = 500 

# Re-initialize the data structures 
comments = []
youtube_corpus = None 

video_id = '1iU7y20ZxyE'
video_response = youtube.commentThreads().list(
    videoId=video_id, 
    part='snippet,replies', 
    maxResults=100,  # the API returns at most 100 threads per page
    order='time', 
    moderationStatus='published',
).execute()

while len(comments) < no_comments: 
    for item in video_response['items']: 

        comment = item['snippet']['topLevelComment']['snippet'] 

        comments.append([
            comment['textDisplay'],
            f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
            pd.to_datetime(comment['publishedAt']),
            comment['textOriginal'],
            comment['likeCount'],
            np.nan, # np.nan for parent_id column later on
        ])

        total_reply_count = item['snippet']['totalReplyCount'] 

        # Iterate through replies (if any) 
        if total_reply_count > 0: 
            parent_id = item['snippet']['topLevelComment']['id']

            replies = youtube.comments().list(
                part='snippet',
                parentId=parent_id, 
                maxResults=50,
            ).execute()

            for reply in replies['items']: 
                replyBody = reply['snippet'] 
                comments.append([
                    replyBody['textDisplay'],
                    f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
                    replyBody['publishedAt'],
                    replyBody['textOriginal'],
                    replyBody['likeCount'],
                    replyBody['parentId'],
                ])

    print(str(len(comments)) + ' comments currently in list.') 

    if 'nextPageToken' in video_response: 
        # Notify user that there is another page of comments 
        print('Next comment page found. Now extracting data.')

        video_response = youtube.commentThreads().list(
            videoId=video_id, 
            part='snippet,replies', 
            maxResults=100, 
            order='time', 
            pageToken=video_response['nextPageToken'], 
            moderationStatus='published'
        ).execute() 

    else: 
        # Notify user that there is no more page of comments
        print('No more comment pages left.') 
        break

102 comments currently in list.
Next comment page found. Now extracting data.
212 comments currently in list.
Next comment page found. Now extracting data.
286 comments currently in list.
No more comment pages left.


In [12]:
youtube_corpus = pd.DataFrame(
    comments, columns=['snippet','link','date_published','text','like_count','reply_parent_id',]
)
youtube_corpus

Unnamed: 0,snippet,link,date_published,text,like_count,reply_parent_id
0,Tikom ang bibig ng mga nasa taas sa laki ng ta...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-23 02:16:14+00:00,Tikom ang bibig ng mga nasa taas sa laki ng ta...,0,
1,dapat managot ung dapat managot dyn kapabayaan...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-13 12:04:57+00:00,dapat managot ung dapat managot dyn kapabayaan...,0,
2,perwisyo yan,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-13 12:01:15+00:00,perwisyo yan,0,
3,Ang dahilan ay Ng kbubuhan,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-12 09:35:56+00:00,Ang dahilan ay Ng kbubuhan,0,
4,Ai nko alam Ng may bagyo mga vompanya tlgang w...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-11 11:40:17+00:00,Ai nko alam Ng may bagyo mga vompanya tlgang w...,0,
...,...,...,...,...,...,...
281,bakit ung mga bahay sobrang lapit sa dagat? de...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-05 17:13:48+00:00,bakit ung mga bahay sobrang lapit sa dagat? de...,8,
282,sismio naman dysn myo gamitin ang pera para sa...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-06T18:32:11Z,sismio naman dysn myo gamitin ang pera para sa...,0,UgweJfmNtRkNg7X7IaJ4AaABAg
283,Hindi nila kayang bumili ng lupa para patayuan...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-07T10:03:02Z,Hindi nila kayang bumili ng lupa para patayuan...,0,UgweJfmNtRkNg7X7IaJ4AaABAg
284,"parang may duda aq jn ,",https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-05 16:07:17+00:00,"parang may duda aq jn ,",3,
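
In the table above, top-level comments carry parsed timestamps while replies (the rows with a `reply_parent_id`) still hold raw strings such as `2024-08-06T18:32:11Z`, because only the top-level branch calls `pd.to_datetime`. One way to normalize the column afterwards is sketched below on a small hypothetical frame rather than the real corpus.

```python
import pandas as pd

# Hypothetical two-row frame mimicking the mixed column.
df = pd.DataFrame({
    "date_published": [
        pd.to_datetime("2024-08-05T17:13:48Z"),  # already a Timestamp
        "2024-08-06T18:32:11Z",                  # raw string from a reply
    ]
})

# pd.Timestamp accepts both existing Timestamps and ISO 8601 strings,
# so every value is converted the same way before building the column.
parsed = [pd.Timestamp(value) for value in df["date_published"]]
df["date_published"] = pd.to_datetime(parsed, utc=True)

print(df["date_published"].dtype)
```

Converting `publishedAt` with `pd.to_datetime` in the reply branch as well would avoid the mixed column in the first place.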


In [4]:
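def extract_youtube_comments(video_id: str, no_comments: int) -> pd.DataFrame:
    """Collect top-level comments and their replies for one video.

    Pages through the comment threads until at least `no_comments` rows are
    collected or no pages remain, so the result can exceed `no_comments`.
    Relies on the module-level `youtube` client defined earlier.
    """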
    comments = []

    video_response = youtube.commentThreads().list(
        videoId=video_id, 
        part='snippet,replies', 
        maxResults=50, 
        order='time', 
        moderationStatus='published',
    ).execute()

    while len(comments) < no_comments: 
        for item in video_response['items']: 

            comment = item['snippet']['topLevelComment']['snippet'] 

            comments.append([
                comment['textDisplay'],
                f'https://www.youtube.com/watch?v={video_id}&lc={item["snippet"]["topLevelComment"]["id"]}',
                pd.to_datetime(comment['publishedAt']),
                comment['textOriginal'],
                comment['likeCount'],
                np.nan, # np.nan for parent_id column later on
            ])

            total_reply_count = item['snippet']['totalReplyCount'] 

            # Iterate through replies (if any) 
            if total_reply_count > 0: 
                parent_id = item['snippet']['topLevelComment']['id']

                replies = youtube.comments().list(
                    part='snippet',
                    parentId=parent_id, 
                    maxResults=50,
                ).execute()

                for reply in replies['items']: 
                    replyBody = reply['snippet'] 
                    comments.append([
                        replyBody['textDisplay'],
                        f"https://www.youtube.com/watch?v={video_id}&lc={reply['id']}",
                        replyBody['publishedAt'],
                        replyBody['textOriginal'],
                        replyBody['likeCount'],
                        replyBody['parentId'],
                    ])

        print(str(len(comments)) + ' comments currently in list.') 

        if 'nextPageToken' in video_response: 
            # Notify user that there is another page of comments 
            print('Next comment page found. Now extracting data.')

            video_response = youtube.commentThreads().list(
                videoId=video_id, 
                part='snippet,replies', 
                maxResults=50, 
                order='time', 
                pageToken=video_response['nextPageToken'], 
                moderationStatus='published'
            ).execute() 

        else: 
            # Notify user that there is no more page of comments
            print('No more comment pages left.') 
            break

    return pd.DataFrame(
        comments, columns=['snippet','link','date_published','text','like_count','reply_parent_id',]
    )

In [5]:
video_links = [
    '1iU7y20ZxyE', # GMA 
    'etVZ1VMn71Y',
    '4avFPtdvb94',
    'JMoBBoD-fhY',
    'RKlGfbNZXfU', 
    'TMkdWSojbNg',
    'YHTib7Piwp0', 
    'waLpAM584Eo', 
    'NGoYiqTSMH8', 
    'zdGImljmcFk', # UNTV 
    'NpwJLXoXt2s', 
    'g3wqZkggxOg', 
    'IYeLMmI9Ksc',
    'prSpeRfzFjE', 
    'qg7vmTErqcA',  
    'GHXgXzfI24U', 
    'XxqkuhaBgE4', # ABS-CBN 
    'NokWiHAhTsY', 
    'X__RPRS3qc8', 
    'h-6C67AcRaE', 
    'Uro5rtCgWRM', # Rappler 
    'caG8Bk5t2wo', # Inquirer
    'gjBqQMfA_FA', 
    'dN4Ip02WThA', # Philstar
    'EIgzv5eRnn8', # Manila Bulletin 
    'IwMq11lT6hk', # ANC 24/7
    '6CRVc9YI-yc', # The Manila Times 
    'ZKV6Fpt3Mdg', # DW News (German news outlet)
    'eGt7L_dEvlo', 
]

oilspill_youtube_corpus = None

for video_link in video_links: 
    print(f'\nExtracting comments from https://www.youtube.com/watch?v={video_link}')
    if oilspill_youtube_corpus is None: 
        oilspill_youtube_corpus = extract_youtube_comments(video_link, 300)
    else: 
        oilspill_youtube_corpus = pd.concat([
            oilspill_youtube_corpus, extract_youtube_comments(video_link, 300)
        ])

oilspill_youtube_corpus


Extracting comments from https://www.youtube.com/watch?v=1iU7y20ZxyE
52 comments currently in list.
Next comment page found. Now extracting data.
102 comments currently in list.
Next comment page found. Now extracting data.
154 comments currently in list.
Next comment page found. Now extracting data.
212 comments currently in list.
Next comment page found. Now extracting data.
286 comments currently in list.
No more comment pages left.

Extracting comments from https://www.youtube.com/watch?v=etVZ1VMn71Y
6 comments currently in list.
No more comment pages left.

Extracting comments from https://www.youtube.com/watch?v=4avFPtdvb94
12 comments currently in list.
No more comment pages left.

Extracting comments from https://www.youtube.com/watch?v=JMoBBoD-fhY
2 comments currently in list.
No more comment pages left.

Extracting comments from https://www.youtube.com/watch?v=RKlGfbNZXfU
14 comments currently in list.
No more comment pages left.

Extracting comments from https://www.youtube

Unnamed: 0,snippet,link,date_published,text,like_count,reply_parent_id
0,Tikom ang bibig ng mga nasa taas sa laki ng ta...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-23 02:16:14+00:00,Tikom ang bibig ng mga nasa taas sa laki ng ta...,0,
1,dapat managot ung dapat managot dyn kapabayaan...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-13 12:04:57+00:00,dapat managot ung dapat managot dyn kapabayaan...,0,
2,perwisyo yan,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-13 12:01:15+00:00,perwisyo yan,0,
3,Ang dahilan ay Ng kbubuhan,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-12 09:35:56+00:00,Ang dahilan ay Ng kbubuhan,0,
4,Ai nko alam Ng may bagyo mga vompanya tlgang w...,https://www.youtube.com/watch?v=1iU7y20ZxyE&lc...,2024-08-11 11:40:17+00:00,Ai nko alam Ng may bagyo mga vompanya tlgang w...,0,
...,...,...,...,...,...,...
94,And man will destroy his own domain.,https://www.youtube.com/watch?v=eGt7L_dEvlo&lc...,2024-07-29 12:55:11+00:00,And man will destroy his own domain.,17,
95,Years ago I heard that a GMO bacteria has been...,https://www.youtube.com/watch?v=eGt7L_dEvlo&lc...,2024-07-29 12:48:46+00:00,Years ago I heard that a GMO bacteria has been...,1,
96,Crude oil is no different from cooking oil. Th...,https://www.youtube.com/watch?v=eGt7L_dEvlo&lc...,2024-07-29T21:46:19Z,Crude oil is no different from cooking oil. Th...,0,UgyQxE3-2_87y00FKJh4AaABAg
97,I feel for you brothers .. respect Australian ...,https://www.youtube.com/watch?v=eGt7L_dEvlo&lc...,2024-07-29 12:47:02+00:00,I feel for you brothers .. respect Australian ...,14,
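
The index in the output above restarts for every video (the last row is 97, not the total row count), because `pd.concat` keeps each frame's own index. A sketch of an alternative collects the frames in a list, tags each row with its video ID, and concatenates once with `ignore_index=True`. The `fake_extract` function is a hypothetical stand-in for `extract_youtube_comments`, and the `video_id` column is an addition that the original corpus does not have.

```python
import pandas as pd


def fake_extract(video_id: str, no_comments: int) -> pd.DataFrame:
    """Hypothetical stand-in that returns two dummy comments per video."""
    return pd.DataFrame({"text": [f"{video_id} a", f"{video_id} b"]})


video_ids = ["vid1", "vid2", "vid3"]  # placeholder IDs, not real videos

frames = []
for vid in video_ids:
    frame = fake_extract(vid, 300)
    frame["video_id"] = vid  # remember which video each row came from
    frames.append(frame)

# One concat at the end; ignore_index=True gives a fresh 0..n-1 index.
corpus = pd.concat(frames, ignore_index=True)
print(corpus.index.tolist())  # [0, 1, 2, 3, 4, 5]
```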


In [6]:
oilspill_youtube_corpus.to_csv('oilspill-comments.csv')
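
By default `to_csv` also writes the index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. If the index carries no information, passing `index=False` leaves it out. The round trip below is a sketch that uses an in-memory buffer and a hypothetical two-row frame instead of the real file.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the corpus.
df = pd.DataFrame({
    "text": ["perwisyo yan", "parang may duda aq jn ,"],
    "like_count": [0, 3],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # omit the index column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(list(restored.columns))  # ['text', 'like_count']
```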