## YouTube API Info
Costs:
- Search: 100 credits per request (1-50 videos)
- Video Details: 1 credit per request (1-50 videos)
- Comment Threads: 1 credit per request (1-100 comment threads)

Quota: 10,000 credits per day
- Search: max 5,000 videos per day
- Video Details: max 500,000 videos per day
- Comment Threads: max 1,000,000 comment threads per day
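
The daily maximums above follow directly from the per-request costs. A minimal sketch of that arithmetic, using only the figures listed in this cell (the `ENDPOINTS` table and `max_items_per_day` helper are illustrative names, not part of the API client):

```python
DAILY_QUOTA = 10_000  # credits per day, as listed above

# (cost per request, max items returned per request), copied from the lists above
ENDPOINTS = {
    "search": (100, 50),
    "video_details": (1, 50),
    "comment_threads": (1, 100),
}

def max_items_per_day(cost_per_request, items_per_request, quota=DAILY_QUOTA):
    """Return the most items one endpoint can return if it uses the whole quota."""
    requests_per_day = quota // cost_per_request
    return requests_per_day * items_per_request

for name, (cost, per_request) in ENDPOINTS.items():
    print(f"{name}: {max_items_per_day(cost, per_request):,} items per day")
```

Because all three endpoints draw on the same 10,000 credits, these maximums are mutually exclusive: a day that spends half its quota on searches leaves only 5,000 credits for everything else.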

In [51]:
%load_ext autoreload
%autoreload 2

import os
import pandas as pd
from pathlib import Path

from googleapiclient.discovery import build

# from config import YOUTUBE_API_KEY
from dotenv import load_dotenv
load_dotenv()
YOUTUBE_API_KEY = os.getenv('YOUTUBE_API_KEY')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Testing out the YouTube API

## Resources
- https://developers.google.com/youtube/v3/docs
- https://www.thepythoncode.com/article/using-youtube-api-in-python
- https://medium.com/daily-python/python-script-to-search-content-using-youtube-data-api-daily-python-8-1084776a6578
- https://medium.com/mcd-unison/youtube-data-api-v3-in-python-tutorial-with-examples-e829a25d2ebd

In [71]:
API_SERVICE_NAME = "youtube"
API_VERSION = "v3"

In [72]:
youtube = build(serviceName=API_SERVICE_NAME, version=API_VERSION, developerKey=YOUTUBE_API_KEY)

In [4]:
import isodate

def print_video_infos(video_response):
    items = video_response.get("items")[0]
    # get the snippet, statistics & content details from the video response
    snippet         = items["snippet"]
    statistics      = items["statistics"]
    content_details = items["contentDetails"]
    video_id        = items["id"]
    # get infos from the snippet
    channel_title = snippet["channelTitle"]
    title         = snippet["title"]
    description   = snippet["description"]
    publish_time  = snippet["publishedAt"]
    # get stats infos
    # any of these counts can be missing from the response, so fall back to "NaN"
    comment_count = statistics.get("commentCount", "NaN")
    like_count    = statistics.get("likeCount", "NaN")
    view_count    = statistics.get("viewCount", "NaN")
    # get duration from content details
    duration = content_details["duration"]
    # duration in the form of something like 'PT5H50M15S'
    # parsing it to be something like '5:50:15'
    duration_str = isodate.strftime(isodate.parse_duration(duration), "%H:%M:%S")
    
    print(f"""\
    Title: {title}
    Channel Title: {channel_title}
    Video ID: {video_id}
    Publish time: {publish_time}
    Duration: {duration_str}
    Number of comments: {comment_count}
    Number of likes: {like_count}
    Number of views: {view_count}
    """)
    
    # Description: {description}
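
The duration conversion in the cell above relies on `isodate`. If that dependency is unavailable, a standard-library sketch along these lines may be enough. It assumes the duration only contains days, hours, minutes and whole seconds; `format_duration` is a hypothetical helper, not part of any library.

```python
import re

# Matches durations such as 'PT5H50M15S', 'PT4M13S' or 'P1DT2H'.
# Assumption: only days, hours, minutes and whole seconds appear.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def format_duration(duration):
    """Convert an ISO 8601 duration string to 'H:MM:SS'."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    total_seconds = (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

print(format_duration("PT5H50M15S"))  # 5:50:15
```

Days are folded into the hour count, so a video longer than 24 hours prints as, for example, `26:00:00` rather than wrapping around.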

In [5]:
# 100 API credits per search request (each request returns up to 50 results)
# https://developers.google.com/youtube/v3/docs/search/list
def search(youtube, **kwargs):
    return youtube.search().list(
        part="snippet",
        **kwargs
    ).execute()

In [6]:
# 1 API credit per request (each request can cover up to 50 videos)
# https://developers.google.com/youtube/v3/docs/videos/list
def get_video_details(youtube, **kwargs):
    return youtube.videos().list(
        part="snippet,contentDetails,statistics",
        **kwargs,
    ).execute()

In [69]:
# 1 API credit per request (each request returns up to 100 comment threads)
# https://developers.google.com/youtube/v3/docs/commentThreads/list
def get_video_comment_threads(youtube, get_replies=False, **kwargs):
    part = "snippet"
    part += ",replies" if get_replies else ""
    return youtube.commentThreads().list(
        part=part,
        **kwargs
        ).execute()

In [8]:
# search for the query 'StopTheSteal' and retrieve up to 50 videos, ordered by view count
# https://developers.google.com/youtube/v3/docs/search/list
response = search(
    youtube,
    q="StopTheSteal",
    pageToken=None, 
    publishedAfter="2021-01-01T00:00:00Z",
    publishedBefore="2021-01-12T00:00:00Z",
    order="viewCount", # date, rating, relevance (default), title, videoCount, viewCount
    type="video", # channel, playlist, video
    maxResults=50, # 0-50
    )

In [9]:
items = response.get("items")
nextPageToken = response.get("nextPageToken")  # .get() already returns None when the key is missing
video_ids = [item["id"]["videoId"] for item in items]
video_titles = [item["snippet"]["title"] for item in items]
channel_ids = [item["snippet"]["channelId"] for item in items]
channel_titles = [item["snippet"]["channelTitle"] for item in items]
published_time = [item["snippet"]["publishedAt"] for item in items]

In [10]:
df = pd.DataFrame()
df['video_ids'] = video_ids
df['video_titles'] = video_titles
df['channel_ids'] = channel_ids
df['channel_titles'] = channel_titles
df['published_time'] = published_time
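
The two cells above build five parallel lists and then assign them column by column. One alternative, sketched here, is to build one dict per video so each row stays together. `extract_search_rows` is a hypothetical helper and `sample_items` is made-up data that only mimics the shape of the fields used above.

```python
def extract_search_rows(items):
    """Flatten search result items into one dict per video."""
    rows = []
    for item in items:
        snippet = item["snippet"]
        rows.append({
            "video_ids": item["id"]["videoId"],
            "video_titles": snippet["title"],
            "channel_ids": snippet["channelId"],
            "channel_titles": snippet["channelTitle"],
            "published_time": snippet["publishedAt"],
        })
    return rows

# Made-up item with the same nesting as the search response fields used above
sample_items = [
    {
        "id": {"videoId": "abc123"},
        "snippet": {
            "title": "Example title",
            "channelId": "channel-1",
            "channelTitle": "Example channel",
            "publishedAt": "2021-01-05T12:00:00Z",
        },
    }
]

rows = extract_search_rows(sample_items)
print(rows[0]["video_ids"])  # abc123
```

Passing `rows` to `pd.DataFrame(rows)` should give the same five columns as the cell above.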

In [98]:
def get_search_results(youtube, query, start_date, end_date, results_to_get=50):
    """
    Search for videos based on a query term.
    API Cost: 100 credits per request (each request returns up to 50 results)
    Documentation: https://developers.google.com/youtube/v3/docs/search/list
    
    Parameters:
        youtube (object): The YouTube API object.
        query (str): The search query term.
        start_date (str): The start date for the search query, as "YYYY-MM-DD".
        end_date (str): The end date for the search query, as "YYYY-MM-DD".
        results_to_get (int): The total number of results to retrieve, fetched in pages of up to 50.
    
    Returns:
        A pandas DataFrame containing the search results.
    """
    nextPageToken = None
    first_page = True    
    
    search_df = pd.DataFrame()
    while (nextPageToken is not None) or (first_page is True):
        first_page = False
        to_get = 0
        if results_to_get >= 50:
            to_get = 50
            results_to_get -= 50
        elif 0 < results_to_get < 50:
            to_get = results_to_get
            results_to_get = 0
        else: # results_to_get <= 0
            break
        
        try:
            response = youtube.search().list(
                part="snippet",
                q=query,
                pageToken=nextPageToken,  # pass the token so each request fetches the next page
                type="video",
                order="viewCount", # date, rating, relevance (default), title, videoCount, viewCount
                publishedAfter=start_date + "T00:00:00Z",
                publishedBefore=end_date + "T00:00:00Z",
                maxResults=to_get,
            ).execute()
            
        except Exception as e:
            print(e)
            break
            
        nextPageToken = response.get("nextPageToken")  # None when there are no more pages

        items = response.get("items")
        
        df = pd.DataFrame()
        df['video_ids'] = [item["id"]["videoId"] for item in items]
        df['video_titles'] = [item["snippet"]["title"] for item in items]
        df['channel_ids'] = [item["snippet"]["channelId"] for item in items]
        df['channel_titles'] = [item["snippet"]["channelTitle"] for item in items]
        df['published_time'] = [item["snippet"]["publishedAt"] for item in items]
    
        search_df = pd.concat([search_df, df], ignore_index=True)
    
    return search_df
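
Token-based pagination like this can be exercised without spending any quota by separating the loop from the API call. The sketch below is one possible structure, not the function above: `fetch_all_pages` and `fake_fetch_page` are hypothetical names, and the fake page function only imitates the `items` / `nextPageToken` shape of a list response.

```python
def fetch_all_pages(fetch_page, results_to_get, page_size=50):
    """Collect up to results_to_get items by following nextPageToken."""
    items = []
    page_token = None
    while len(items) < results_to_get:
        # Never ask for more than one page can hold or more than is still needed
        to_get = min(page_size, results_to_get - len(items))
        response = fetch_page(page_token=page_token, max_results=to_get)
        page_items = response.get("items", [])
        items.extend(page_items)
        page_token = response.get("nextPageToken")
        # Stop when the service reports no further pages or returns nothing
        if page_token is None or not page_items:
            break
    return items

def fake_fetch_page(page_token=None, max_results=50):
    """Stand-in for an API call: pages through 120 integers."""
    data = list(range(120))
    start = int(page_token or 0)
    end = start + max_results
    response = {"items": data[start:end]}
    if end < len(data):
        response["nextPageToken"] = str(end)
    return response

results = fetch_all_pages(fake_fetch_page, results_to_get=60)
print(len(results))  # 60
```

With the real client, `fetch_page` could be a small wrapper around `youtube.search().list(...).execute()` that forwards `page_token` as `pageToken` and `max_results` as `maxResults`.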

In [103]:
df = get_search_results(youtube, "StopTheSteal", "2021-01-01", "2021-01-12", 60)

In [12]:
# video_responses = []
# for video_id in video_ids:
#     # get the video details
#     video_response = get_video_details(youtube, id=video_id)
#     video_responses.append(video_response)
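
The commented-out loop above would spend one credit per video. Since a video details request can cover up to 50 videos, the IDs can be grouped first. A minimal sketch, assuming the IDs are plain strings (`chunked` and `build_id_batches` are hypothetical helpers):

```python
def chunked(values, size):
    """Yield consecutive slices of at most `size` elements."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

def build_id_batches(video_ids, batch_size=50):
    """Join video IDs into comma-separated strings, one string per request."""
    return [",".join(batch) for batch in chunked(video_ids, batch_size)]

# Made-up IDs purely for illustration
sample_ids = [f"video{n}" for n in range(120)]
batches = build_id_batches(sample_ids)
print(len(batches))  # 3
```

Each string in `batches` could then be passed as `id=` to `get_video_details`, so 120 videos would cost 3 credits instead of 120.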

In [73]:
comments = []
video_id = video_ids[1]

next_page_token = None
first_page = True
page_num = 0
max_pages = 1

while (next_page_token is not None) or (first_page is True):
    first_page = False
    if page_num >= max_pages:
        break
    
    threads = get_video_comment_threads(
        youtube,
        videoId=video_id,
        get_replies=True,
        pageToken=next_page_token,
        order="time", # time, relevance (sorting by relevance uses an algorithm which can filter out some comments)
        maxResults=100,
        )
    
    next_page_token = threads.get("nextPageToken")  # None when there are no more pages
    
    comments += threads['items']
    
    print(f"Retrieved {len(threads['items'])} comments from page {page_num}")
    page_num += 1  # the max_pages check at the top of the loop ends pagination


Retrieved 100 comments from page 0


In [77]:
flattened_comments = []
for comment in comments:
    flattened_comments += [comment["snippet"]["topLevelComment"]]
    if "replies" in comment:
        flattened_comments += comment["replies"]["comments"]

flattened_df = pd.json_normalize(flattened_comments, sep=".")

flattened_df = flattened_df.rename(
    columns={
        "snippet.videoId" : "video_id",
        "snippet.textOriginal" : "text_original",
        "snippet.authorDisplayName" : "author_display_name",
        "snippet.authorChannelId.value" : "author_channel_id",
        "snippet.likeCount" : "like_count",
        "snippet.publishedAt" : "published_at",
        "snippet.updatedAt" : "updated_at",
        "snippet.parentId" : "parent_id",
    }
)

flattened_df = flattened_df[
    ["id",
    "video_id",
    "text_original",
    "author_display_name",
    "author_channel_id",
    "like_count",
    "published_at",
    "updated_at",
    "parent_id"]
]
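
The flattening step above can be checked on a tiny hand-made input before running it on real responses. In this sketch, `flatten_threads` is a hypothetical helper that mirrors the loop above, and `sample_threads` is made-up data containing only the keys that loop reads.

```python
def flatten_threads(threads):
    """Return top-level comments and their replies as one flat list."""
    flattened = []
    for thread in threads:
        flattened.append(thread["snippet"]["topLevelComment"])
        # "replies" is only present when the thread has replies
        flattened.extend(thread.get("replies", {}).get("comments", []))
    return flattened

# Made-up threads: one with two replies, one with none
sample_threads = [
    {
        "snippet": {"topLevelComment": {"id": "A"}},
        "replies": {"comments": [{"id": "A.1"}, {"id": "A.2"}]},
    },
    {"snippet": {"topLevelComment": {"id": "B"}}},
]

flat = flatten_threads(sample_threads)
print([comment["id"] for comment in flat])  # ['A', 'A.1', 'A.2', 'B']
```

Each top-level comment is immediately followed by its replies, which matches the row order in the DataFrame output below.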

In [78]:
flattened_df

Unnamed: 0,id,video_id,text_original,author_display_name,author_channel_id,like_count,published_at,updated_at,parent_id
0,UgxBtRRdCCHG55EkHUZ4AaABAg,lfP_5L8epow,This is what it was like to be inside the Capi...,VICE News,UCZaT_X_mc0BI-djXOlfhqWQ,295,2021-01-11T23:19:38Z,2021-01-11T23:19:38Z,
1,UgxBtRRdCCHG55EkHUZ4AaABAg.9IO_1krp-WF9nLM0qjNUTG,lfP_5L8epow,Feds caught destroying evidence on Jan 6 trial...,Rob Howell,UCyE10gQA3fBXp2ITp3xhc2g,0,2023-03-17T02:04:04Z,2023-03-17T02:04:04Z,UgxBtRRdCCHG55EkHUZ4AaABAg
2,UgxBtRRdCCHG55EkHUZ4AaABAg.9IO_1krp-WF9jldUie-Mkk,lfP_5L8epow,@GhostsAmongUs 👈🏼😏 the only group that hasn't ...,NVMVNV,UCPvb48QraVqqx0J3ZpW2Wlg,0,2022-12-18T06:19:21Z,2022-12-18T06:19:21Z,UgxBtRRdCCHG55EkHUZ4AaABAg
3,UgxBtRRdCCHG55EkHUZ4AaABAg.9IO_1krp-WF9jldMLJQeDo,lfP_5L8epow,@GhostsAmongUs 👈🏼😏 name a race that hasn't act...,NVMVNV,UCPvb48QraVqqx0J3ZpW2Wlg,0,2022-12-18T06:18:12Z,2022-12-18T06:18:12Z,UgxBtRRdCCHG55EkHUZ4AaABAg
4,UgxBtRRdCCHG55EkHUZ4AaABAg.9IO_1krp-WF9jldEqqyVE-,lfP_5L8epow,@Jesse Morin 👈🏼🙄 there's a difference between...,NVMVNV,UCPvb48QraVqqx0J3ZpW2Wlg,0,2022-12-18T06:17:11Z,2022-12-18T06:17:11Z,UgxBtRRdCCHG55EkHUZ4AaABAg
...,...,...,...,...,...,...,...,...,...
110,UgwKg0s3Me18_EZM0w54AaABAg,lfP_5L8epow,That day of violence followed a whole summer o...,Mark Cole,UCOiLgQ0lrrZxIvu8GrvTP1Q,1,2022-12-24T15:38:20Z,2022-12-24T15:38:20Z,
111,Ugzhk8ttMDvJB6zWXhF4AaABAg,lfP_5L8epow,Antifa was payed to get this sarted,Mark Cole,UCOiLgQ0lrrZxIvu8GrvTP1Q,0,2022-12-24T15:35:07Z,2022-12-24T15:35:07Z,
112,Ugzuh4iISUl7Ow12GSl4AaABAg,lfP_5L8epow,Why aren't you guys questioning Nancy Pelosi's...,Gaming History Source,UCgeSfaLXS8Macm3ddzYF_wg,1,2022-12-23T17:44:32Z,2022-12-23T17:44:32Z,
113,UgysPp9EU9qo2wCEsCt4AaABAg,lfP_5L8epow,So sad so many people went to jail for Trump p...,Kym Jess,UCiSaQ4cEACXj3Wd5fymDAuA,0,2022-12-22T18:13:26Z,2022-12-22T18:13:26Z,
