# Setting up Google Account for YouTube API

Google offers an API for YouTube as part of its Google Cloud Platform (GCP). As with many other APIs, you will need an API key. If you have used other cloud platforms such as AWS, the following process will feel familiar.

Set up a free trial GCP account. Go to the Google Cloud console and search for YouTube Data API v3 (the current version as of 2023). Enable the API.

There should be a place where you can create credentials. From there, create an API key, copy it, and place it in a text file. Do not expose it publicly: anyone who copies the key can make API calls as you and push you over your quota.
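
One way to keep the key out of a notebook or repository is to read it from an environment variable, falling back to a local file. The following is a minimal sketch; the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions made for illustration, not something Google requires.

```python
import os
from pathlib import Path


def load_api_key(env_var="YOUTUBE_API_KEY", fallback_file="secret.txt"):
    """Return the API key from an environment variable, else from a local file.

    Both default names are hypothetical choices for this sketch.
    """
    key = os.environ.get(env_var)
    if key:
        return key.strip()

    path = Path(fallback_file)
    if path.is_file():
        # strip() drops the trailing newline most editors add
        return path.read_text().strip()

    raise RuntimeError(f"No API key found in ${env_var} or {fallback_file}")
```

If you use the file, add it to `.gitignore` so it is never committed.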


# Ingesting YouTube Data for IGN Game Reviews Playlist

As of 2023, IGN maintains a playlist made up mostly of game reviews; some videos are discussions rather than typical reviews. We first need to get the playlist. See the following for more information on how to use YouTube's API to get the playlist programmatically: https://developers.google.com/youtube/v3/docs/playlistItems/list

You'll see sample code that includes `os` and `google_auth_oauthlib.flow`. In particular, `flow` requires client secrets to perform actions on your account's behalf, for security reasons. However, since we are only interested in "scraping" public information, we only need an API key. See this post for more information: https://github.com/googleapis/google-api-python-client/blob/main/docs/api-keys.md

On top of requesting data from the YouTube API, we would also like to log our calls and other helpful information. Consequently, we are going to break fetching down into 2 tasks:
1. Sending requests to YouTube API
2. Logging API calls

If you ever get confused about what an object like `request = youtube.<>` is doing, look up the documentation for its `type`.
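
As a rough sketch of that lookup, the hypothetical helper `full_type_name` below returns the module and class name to search for. It is shown on a standard-library object so it runs without network access; applied to `request`, it should report a class from the `googleapiclient` package.

```python
from datetime import datetime


def full_type_name(obj):
    """Return the dotted 'module.ClassName' of obj's type (hypothetical helper)."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


# Stand-in object so the example runs offline
print(full_type_name(datetime(2023, 11, 30)))  # prints "datetime.datetime"
# help(type(obj)) shows the docstring and methods of the same class
```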

Logging error codes:
- -100 levels: client-side error
    - 0: undefined
    - 1: bad request
- -200 levels: server-side error
    - 0: undefined
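
One possible reading of the scheme above, as a sketch: the hundreds pick the error family and the remainder picks the sub-code, so a bad request would be logged as -101. The mapping from HTTP status codes and the function name `to_log_code` are assumptions for illustration; the functions in this notebook use the simpler codes 1, 0, and -1.

```python
def to_log_code(http_status):
    """Map an HTTP status to a log code under the assumed scheme."""
    if 400 <= http_status < 500:
        family = -100                         # client-side error
        sub = 1 if http_status == 400 else 0  # 1: bad request, 0: undefined
        return family - sub
    if 500 <= http_status < 600:
        return -200                           # server-side error, undefined
    return 1  # anything else is treated as success in this sketch


print(to_log_code(400), to_log_code(404), to_log_code(503))
```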
    
    
    

## Sending Requests to YouTube API

Firstly, let's create a few functions to help us get a list of playlist items. More specifically, we'll be primarily using `get_youtube` and `request_playlist_videos`:
- `get_youtube` builds a YouTube API client (a `Resource` object) from which to send requests
- `request_playlist_videos` takes in parameters for the request and returns videos from the playlist

In [2]:
playlistId = "PLraFbwCoisJBTl0oXn8UoUam5HXWUZ7ES"
part = ["contentDetails", "id", "snippet", "status"] # all possible info

import json
import warnings

import googleapiclient.discovery
import pandas as pd

# API functions

def get_api_key(file="secret.txt"):
    '''
    Returns the YouTube API key as a string found in file.
    
    Please also make sure you DO NOT publicly expose this file!
    '''
    with open(file) as f:
        # strip() removes the trailing newline that would otherwise end up in the key
        key = f.read().strip()
        
    return key

def get_youtube():
    '''
    Returns a YouTube Resource for interacting with the API
    For more information: https://googleapis.github.io/google-api-python-client/docs/epy/googleapiclient.discovery-module.html
    '''
    api_service_name = "youtube"
    api_version = "v3"
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=get_api_key())
    
    return youtube

def request_playlist_videos(youtube, part=','.join(part), playlistId=playlistId, maxResults=5, pageToken=""):
    '''
    Makes and executes an HTTP request to get maxResults videos and their part information from the playlistId.
    Returns the response dict, or None if the request failed.
    Each request will be logged.
    '''
    params = {
        "part": part,
        "playlistId": playlistId,
        "maxResults": maxResults,
        "pageToken": pageToken
    }
    
    result_code = 0
    response = None  # stays None if the request fails, so the return below is always defined
    try:
        request = youtube.playlistItems().list(**params)
        response = request.execute()
        result_code = 1
    except Exception as e:
        warnings.warn(f"Bad request or response: {e}")
        result_code = -1
    
    log_playlistItems_call(result_code, params)
    
    return response


## Logging API Calls

It would be nice if we could keep track of when and how the API calls were made.


In [4]:
from datetime import datetime, timezone

# Functions to log API calls

def get_current_time(tz=timezone.utc):
    '''
    Returns a timezone-aware datetime object giving the current date and time (UTC by default)
    '''
    now = datetime.now(tz)
    return now

def serialize_datetime(dt):
    '''
    Converts a datetime object into an ISO-formatted string and returns it.
    '''
    return dt.isoformat()

def deserialize_datetime_string(dt_string):
    '''
    Converts an ISO-formatted time string into a datetime object.
    '''
    return datetime.fromisoformat(dt_string)

def clear_call_logs(name="call_logs.csv"):
    '''
    Overwrites the log file with an empty table containing only the header row.
    '''
    log_df = pd.DataFrame({
        "part": [],
        "playlistId": [],
        "maxResults": [],
        "pageToken": [],
        "time": [],
        "code": [],
    })
    log_df.to_csv(name, header=True, index=False, mode='w')

def log_playlistItems_call(result, params, name="call_logs.csv"):
    '''
    Appends one row to the CSV log at name. Each record is one HTTP request with the parameters of that request.
    Column order: (part, playlistId, maxResults, pageToken, time, code)
    
    TODO:
    - double check for headers in CSV file
    '''
    assert isinstance(params, dict)
    # Build a new dict so the caller's params are not mutated
    row = {**params, "time": get_current_time()}
    df = pd.DataFrame(row, index=[0])
    df['code'] = result
    df.to_csv(name, header=False, index=False, mode="a")
    return df
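
To inspect the log later, the rows can be read back and the `time` column parsed. This sketch uses only the standard library and writes a small sample file first so it runs on its own; the file name `sample_call_logs.csv`, the helper `read_call_logs`, and the sample row are made up for illustration, with the column order taken from `clear_call_logs`.

```python
import csv
import os
import tempfile
from datetime import datetime

COLUMNS = ["part", "playlistId", "maxResults", "pageToken", "time", "code"]


def read_call_logs(path):
    """Return the logged calls as dicts, with 'time' parsed and 'code' as int."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["time"] = datetime.fromisoformat(row["time"])
        row["code"] = int(row["code"])
    return rows


# Build a tiny sample log (hypothetical values) in a temporary directory
sample_path = os.path.join(tempfile.mkdtemp(), "sample_call_logs.csv")
with open(sample_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    writer.writerow(["snippet", "PL_EXAMPLE", 5, "", "2023-11-30 00:58:29+00:00", 1])

logs = read_call_logs(sample_path)
failed = [row for row in logs if row["code"] < 0]  # negative codes mark failed calls
print(len(logs), len(failed))
```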

#### Testing `playlistItems.list` and Logging

In [6]:
clear_call_logs()

youtube = get_youtube()
response = request_playlist_videos(youtube)
# response

In [8]:
response['nextPageToken']

'EAAaL1BUOkNBVWlFRGc0UkVJd1JEVTVOak0wTXprNU5qWW9BVWkxdWFTend1cUNBMUFC'

In [7]:
response

{'kind': 'youtube#playlistItemListResponse',
 'etag': 'BFggDhOJ-BSl52zuixypDBGh6LY',
 'nextPageToken': 'EAAaL1BUOkNBVWlFRGc0UkVJd1JEVTVOak0wTXprNU5qWW9BVWkxdWFTend1cUNBMUFC',
 'items': [{'kind': 'youtube#playlistItem',
   'etag': 'xDS-oGO99hvdSXvE3C1sy_Le61I',
   'id': 'UExyYUZid0NvaXNKQlRsMG9YbjhVb1VhbTVIWFdVWjdFUy43NjI0NDIxM0M3RTAxRTU5',
   'snippet': {'publishedAt': '2023-11-30T00:58:29Z',
    'channelId': 'UCKy1dAqELo0zrOtPkf0eTMw',
    'title': 'SteamWorld Build Review',
    'description': "SteamWorld Build reviewed by Jon Bolding on PC, also available on PlayStation, Xbox, and Switch.\n\nSteamWorld Build is an enjoyable little city builder that doesn't give you grindy busywork or overstretch itself with bloated padding, focusing on a solid foundation of laying out a city and providing it with supplies. Its underground layer is a cool additional system that has you expanding, improving, and defending your mining operation in a way that matches the laid back pace of the surface nic
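
Each entry in `items` is a nested dictionary, so the fields of interest can be pulled out with plain key lookups. The sketch below runs on a hand-abbreviated response; the `videoId` value is a placeholder, and `summarize_items` is a hypothetical helper name.

```python
def summarize_items(response):
    """Return (title, publishedAt, videoId) for each playlist item in a response."""
    summary = []
    for item in response.get("items", []):
        snippet = item.get("snippet", {})
        details = item.get("contentDetails", {})
        summary.append((
            snippet.get("title"),
            snippet.get("publishedAt"),
            details.get("videoId"),
        ))
    return summary


# Abbreviated sample in the shape of a playlistItems.list response
sample_response = {
    "kind": "youtube#playlistItemListResponse",
    "items": [{
        "kind": "youtube#playlistItem",
        "snippet": {
            "publishedAt": "2023-11-30T00:58:29Z",
            "title": "SteamWorld Build Review",
        },
        "contentDetails": {"videoId": "VIDEO_ID_PLACEHOLDER"},  # placeholder, not a real ID
    }],
}

print(summarize_items(sample_response))
```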

## Fetching all videos from the playlist

Now that each API call is logged, it's easier to retrace the steps and catch a mistake.

Uncomment the last line below to run the function.

In [149]:
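import json
import os

def download_current_videos():
    '''
    Downloads every page of the playlist into ./results/<request_number>.json.
    Returns (video_count, request_number) so later cells can check them.
    '''
    clear_call_logs()
    os.makedirs("./results", exist_ok=True)
    
    max_results = 50  # playlistItems.list returns at most 50 items per page
    max_video_count = 2179  # approximate playlist size, only used for the progress display
    youtube = get_youtube()

    nextPageToken = ""
    video_count = 0
    request_number = 0
    while nextPageToken is not None:
        # Fall back to an empty dict if the request returned nothing
        results = request_playlist_videos(youtube, maxResults=max_results, pageToken=nextPageToken) or {}
        with open(f"./results/{request_number}.json", "w") as fp:
            json.dump(results, fp)

        # A missing nextPageToken means this was the last page
        nextPageToken = results.get("nextPageToken")
        video_count += len(results.get("items", []))
        request_number += 1

        print(f"{round(video_count / max_video_count * 100, 2)}%", end="\r")

    return video_count, request_number
        
# wrap this in a function so can comment out and not accidentally run the whole thing
# video_count, request_number = download_current_videos()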

In [148]:
# double check that we have the video counts right
# can also verify counts here: https://www.youtube.com/playlist?list=PLraFbwCoisJBTl0oXn8UoUam5HXWUZ7ES
import json

with open("./results/0.json", "r") as fp:
    r = json.load(fp)

print(r["pageInfo"]["totalResults"])
print(video_count)
print("Above 2 should match")
print("-------")
print(request_number)

2178
2178
Above 2 should match
-------
44
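
The same check can be made against every saved page rather than a running counter. This is a sketch under the assumption that each file in the results directory is one response saved as `<request_number>.json`; it builds two tiny fake pages in a temporary directory so it runs without the real data, and `count_saved_items` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def count_saved_items(results_dir):
    """Return (number of page files, total items across them)."""
    # Sort numerically so 10.json comes after 9.json
    files = sorted(Path(results_dir).glob("*.json"), key=lambda p: int(p.stem))
    total = 0
    for path in files:
        with open(path) as fp:
            page = json.load(fp)
        # A failed request may have been saved as null or without "items"
        total += len((page or {}).get("items", []))
    return len(files), total


# Fake pages standing in for ./results
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "0.json").write_text(json.dumps({"items": [{}, {}, {}]}))
(demo_dir / "1.json").write_text(json.dumps({"items": [{}, {}]}))

print(count_saved_items(demo_dir))  # prints (2, 5)
```

Counting from the files avoids relying on notebook variables that may be stale from an earlier run.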


## Appending new videos

Now that the *current* videos from the playlist have been downloaded, how will new data be appended to the current data?

There are also webhooks (push notifications): https://developers.google.com/youtube/v3/guides/push_notifications

Strategy: only save new videos. Do not let already-saved ones through.

1. Ask the playlist for 50 items. If the last (oldest) item is still new, save the results and poll again.
2. If the last item is old, binary search that page for the cutoff point.
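
The two steps could be sketched as follows. This is an illustration, not the notebook's final `poll` implementation: it assumes each page is ordered newest first, works on bare `publishedAt` strings instead of full items, and relies on all timestamps sharing the same ISO 8601 UTC format so that string comparison matches time order. The names `count_new` and `collect_new` are hypothetical.

```python
def count_new(published_at, last_seen):
    """Binary search: number of timestamps strictly newer than last_seen.

    Assumes published_at is sorted newest first.
    """
    lo, hi = 0, len(published_at)
    while lo < hi:
        mid = (lo + hi) // 2
        if published_at[mid] > last_seen:
            lo = mid + 1  # mid is still new, so the cutoff is further right
        else:
            hi = mid      # mid is old, so the cutoff is here or to the left
    return lo


def collect_new(pages, last_seen):
    """Walk pages (newest first) and keep only timestamps newer than last_seen."""
    new_items = []
    for page in pages:
        if page and page[-1] > last_seen:
            new_items.extend(page)  # step 1: whole page is new, keep polling
            continue
        new_items.extend(page[:count_new(page, last_seen)])  # step 2: find cutoff
        break  # everything after the cutoff is already saved
    return new_items


pages = [
    ["2023-11-30T00:58:29Z", "2023-11-29T12:00:00Z"],
    ["2023-11-28T09:00:00Z", "2023-11-27T09:00:00Z"],
]
print(collect_new(pages, last_seen="2023-11-28T09:00:00Z"))
```

In real use each page would come from a `playlistItems.list` response, and the comparison key would be read from each item's `snippet`.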

In [3]:
import sys
sys.path.append('..')

In [6]:
sys.path.append(f"{sys.path[0]}/..")

In [7]:
sys.path

['/home/phood/Documents/GitHub/Reviewing-Game-Reviews/Ingestion/playlist',
 '/home/phood/Documents/Anaconda/Installation/anaconda3/envs/slit/lib/python311.zip',
 '/home/phood/Documents/Anaconda/Installation/anaconda3/envs/slit/lib/python3.11',
 '/home/phood/Documents/Anaconda/Installation/anaconda3/envs/slit/lib/python3.11/lib-dynload',
 '',
 '/home/phood/Documents/Anaconda/Installation/anaconda3/envs/slit/lib/python3.11/site-packages',
 '..',
 '/home/phood/Documents/GitHub/Reviewing-Game-Reviews/Ingestion/playlist/..']

In [8]:
import api_key

In [None]:
def poll():
    '''
    Only to consider merging JSON file
    '''
    pass

In [130]:
import os

http_results = os.listdir("./results")


In [135]:
len(http_results[0:4] + http_results[5:])

44

In [136]:
44*50

2200

In [129]:
print(video_count)

2200
