# 0A. Scraping Top 50 Trending YouTube videos

The following is a slight modification of the [data scraper](https://github.com/DataSnaek/Trending-YouTube-Scraper) prepared by DataSnaek.

The scraper has been edited to scrape only the day's top trending videos for the US and GB (UK).
- Only the current day's top trending videos can be retrieved from the API

# Table of Contents
1. [Import Libraries](#import_libraries)
2. [Read in API key](#read_api_key)
3. [Initialise Variables](#initialise_variables)
    1. [Initialise Features and Headers of .csv Files](#initialise_features_and_headers)
    2. [Initialise List of Country Codes and Output Directory](#initialise_cc_and_output_dir)
4. [Scraper Functions](#scraper_functions)
5. [Write Data to .csv Files](#write_data_to_.csv_files)

### Import Libraries <a name="import_libraries"></a>

In [1]:
import requests, sys, time, os
from pathlib import Path

### Read in API key <a name="read_api_key"></a>

You will need to obtain a YouTube Data API key. Instructions for obtaining one can be found [here](https://developers.google.com/youtube/registering_an_application).

Place the key in a text file named *api_key.txt*

In [2]:
with open('api_key.txt', 'r') as f:
    # Strip the trailing newline so it does not end up inside the request URL
    api_key = f.readline().strip()
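
As a slightly more defensive variant of the cell above, the sketch below wraps the read in a function and fails early when the file contains no key. The name `load_api_key` is hypothetical and not part of the original scraper.

```python
from pathlib import Path


def load_api_key(path="api_key.txt"):
    """Return the first line of the key file with surrounding whitespace removed."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    key = lines[0].strip() if lines else ""
    if not key:
        # Failing here gives a clearer message than a rejected API request later on
        raise ValueError(f"No API key found in {path}")
    return key


# Usage (assumes api_key.txt exists in the working directory):
# api_key = load_api_key()
```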

### Initialise Variables <a name="initialise_variables"></a>

#### Initialise Features and Headers of .csv Files <a name="initialise_features_and_headers"></a>

In [1]:
# List of simple to collect features
snippet_features = ["title", "publishedAt", "channelId", "channelTitle", "categoryId"]

# Any characters to exclude, generally these are things that become problematic in CSV files
unsafe_characters = ['\n', '"']

# Used to identify columns, currently hardcoded order
header = ["video_id"] + snippet_features + ["trending_date", "tags", "view_count", "likes", "dislikes",
                                            "comment_count", "thumbnail_link", "comments_disabled",
                                            "ratings_disabled", "description"]
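
To illustrate why newlines and double quotes are treated as unsafe, here is a minimal sketch of the clean-up step. It mirrors the `prepare_feature` helper defined later in this notebook; the name `sanitise` and the sample title are made up for illustration.

```python
unsafe_characters = ['\n', '"']


def sanitise(value):
    # Hypothetical helper: drop each unsafe character, then wrap the result in double quotes
    text = str(value)
    for ch in unsafe_characters:
        text = text.replace(ch, "")
    return f'"{text}"'


raw_title = 'Line one\nLine "two", with a comma'
print(sanitise(raw_title))  # "Line oneLine two, with a comma"
```

The comma survives because the whole value is quoted, whereas an embedded newline or quote would otherwise split or prematurely close the field.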

#### Initialise List of Country Codes and Output Directory <a name="initialise_cc_and_output_dir"></a>

In [3]:
country_codes = ['US', 'GB']
output_dir = Path('.')/'scraper-output'

### Scraper Functions <a name="scraper_functions"></a>

In [4]:
def prepare_feature(feature):
    # Removes any character from the unsafe characters list and surrounds the whole item in quotes
    for ch in unsafe_characters:
        feature = str(feature).replace(ch, "")
    return f'"{feature}"'


def api_request(page_token, country_code):
    # Builds the URL and requests the JSON from it
    request_url = f"https://www.googleapis.com/youtube/v3/videos?part=id,statistics,snippet{page_token}chart=mostPopular&regionCode={country_code}&maxResults=50&key={api_key}"
    request = requests.get(request_url)
    if request.status_code == 429:
        print("Timed out: Too many requests, please try again later.\n")
        sys.exit()
    return request.json()


def get_tags(tags_list):
    # Takes a list of tags, prepares each tag and joins them into a string by the pipe character
    return prepare_feature("|".join(tags_list))


def get_videos(items):
    lines = []
    for video in items:
        comments_disabled = False
        ratings_disabled = False

        # We can assume something is wrong with the video if it has no statistics, often this means it has been deleted
        # so we can just skip it
        if "statistics" not in video:
            continue

        # A full explanation of all of these features can be found in the original GitHub project link
        video_id = prepare_feature(video['id'])

        # Snippet and statistics are sub-dicts of video, containing the most useful info
        snippet = video['snippet']
        statistics = video['statistics']

        # This list contains all of the features in snippet that are 1 deep and require no special processing
        features = [prepare_feature(snippet.get(feature, "")) for feature in snippet_features]

        # The following are special case features which require unique processing, or are not within the snippet dict
        description = snippet.get("description", "")
        thumbnail_link = snippet.get("thumbnails", dict()).get("default", dict()).get("url", "")
        # Note the format is year.day.month, e.g. 19.10.11 is 10 November 2019
        trending_date = time.strftime("%y.%d.%m")
        tags = get_tags(snippet.get("tags", ["[none]"]))
        view_count = statistics.get("viewCount", 0)

        # This may be unclear: if a video has comments or ratings disabled, the API omits the corresponding
        # counts, so a missing likeCount or commentCount means the feature is disabled. dislikeCount is read
        # separately because the API stopped returning it publicly in December 2021.
        if 'likeCount' in statistics:
            likes = statistics['likeCount']
            dislikes = statistics.get('dislikeCount', 0)
        else:
            ratings_disabled = True
            likes = 0
            dislikes = 0

        if 'commentCount' in statistics:
            comment_count = statistics['commentCount']
        else:
            comments_disabled = True
            comment_count = 0

        # Compiles all of the various bits of info into one consistently formatted line
        line = [video_id] + features + [prepare_feature(x) for x in [trending_date, tags, view_count, likes, dislikes,
                                                                     comment_count, thumbnail_link, comments_disabled,
                                                                     ratings_disabled, description]]
        lines.append(",".join(line))
    return lines


def get_pages(country_code, next_page_token="&"):
    country_data = []

    # The API paginates with opaque page tokens rather than page numbers, so we cannot jump to a given page;
    # instead we follow each response's nextPageToken until none is returned.
    while next_page_token is not None:
        # A page of data i.e. a list of videos and all needed data
        video_data_page = api_request(next_page_token, country_code)

        # Get the next page token and build a string which can be injected into the request with it, unless it's None,
        # then let the whole thing be None so that the loop ends after this cycle
        next_page_token = video_data_page.get("nextPageToken", None)
        next_page_token = f"&pageToken={next_page_token}&" if next_page_token is not None else next_page_token

        # Get all of the items as a list and let get_videos return the needed features
        items = video_data_page.get('items', [])
        country_data += get_videos(items)

    return country_data
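
The token-following loop in `get_pages` can also be written so that the page token is an ordinary query parameter instead of a pre-formatted string fragment. The sketch below is one possible arrangement, not the original scraper's code: `build_url`, `iter_items`, `fetch_json` and the fake two-page response are all hypothetical, and the fake data exists only so the loop can be run without network access or an API key.

```python
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_url(country_code, api_key, page_token=None):
    """Build the request URL; the token is added only when one exists."""
    params = {
        "part": "id,statistics,snippet",
        "chart": "mostPopular",
        "regionCode": country_code,
        "maxResults": 50,
        "key": api_key,
    }
    if page_token is not None:
        params["pageToken"] = page_token
    return f"{API_URL}?{urlencode(params)}"


def iter_items(fetch_json, country_code, api_key):
    """Yield every item from every page; fetch_json(url) must return the decoded JSON dict."""
    page_token = None
    while True:
        page = fetch_json(build_url(country_code, api_key, page_token))
        yield from page.get("items", [])
        page_token = page.get("nextPageToken")
        if page_token is None:
            break


# Offline demonstration with a fake two-page response (made-up data)
fake_pages = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "nextPageToken": "TOKEN2"},
    "TOKEN2": {"items": [{"id": "c"}]},
}


def fake_fetch(url):
    token = "TOKEN2" if "pageToken=TOKEN2" in url else None
    return fake_pages[token]


ids = [item["id"] for item in iter_items(fake_fetch, "US", "DUMMY_KEY")]
print(ids)  # ['a', 'b', 'c']
```

With the `requests` library already imported in this notebook, `fetch_json` could be something like `lambda url: requests.get(url).json()`.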

In [5]:
def write_to_file(country_code, country_data):
    print(f"Writing {country_code} data for {time.strftime('%y.%d.%m')} to file...")

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    with open(f"{output_dir}/{time.strftime('%y.%d.%m')}_{country_code}_videos.csv", "w+", encoding='utf-8') as file:
        for row in country_data:
            file.write(f"{row}\n")

            
# Driver function
def get_data():
    for country_code in country_codes:
        country_data = [",".join(header)] + get_pages(country_code)
        write_to_file(country_code, country_data)
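
The functions above build each CSV row by hand, which is why quotes and newlines are stripped from every field first. As an alternative (not what this notebook does), Python's standard `csv` module can quote fields itself and preserve those characters. The sketch below writes to an in-memory buffer with made-up rows to show the round trip.

```python
import csv
import io

# Made-up rows: the second one contains quotes and a newline
rows = [
    ["video_id", "title", "description"],
    ["abc123", 'He said "hi"', "line one\nline two"],
]

# newline="" leaves line endings untouched, as the csv module expects
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerows(rows)

# Reading the text back recovers the original values, quotes and newline included
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed == rows)  # True
```

The trade-off is that files written this way can contain multi-line fields, so downstream tools need a real CSV parser and cannot simply split on line breaks.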

### Write Data to .csv Files <a name="write_data_to_.csv_files"></a>

In [6]:
get_data()

Writing US data for 19.10.11 to file...
Writing GB data for 19.10.11 to file...
