# **YouTube Scraper (Draft)**


### Using the YouTube Data API v3 to scrape data on the most popular music videos

---

### **Set-up ⚙️**

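Install the required modules from the terminal (the package list below is inferred from the imports used in this notebook)

`pip install google-api-python-client pandas ipython`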

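Import the necessary packages

*⚠️ Note: Run this cell only once per kernel session. It changes the working directory with a relative path (`../`), so every additional run moves one more level up. If you need to run it again, restart the kernel first.*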

In [None]:
from googleapiclient.discovery import build
from IPython.display import JSON
import pandas as pd
import json
import os
os.chdir("..")                                  # move up one level, into the main project directory

# import the custom functions used below for YouTube scraping
from dees_package.youtube_functions import youtube_search, get_stats, get_comments_in_videos

Check that we are in the correct current working directory

*⚠️ Note: We should be in the main project directory*

In [None]:
print("Current working directory:", os.getcwd())
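
The relative `os.chdir` above depends on where the notebook was started. A minimal sketch of an alternative, assuming the project root can be recognised by the presence of `credentials.json`: walk upwards from a starting directory until that file is found. `find_project_root` is a hypothetical helper, not part of `dees_package`, and the demo uses a throwaway directory tree rather than the real project.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="credentials.json"):
    """Return the closest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no directory containing {marker!r} at or above {start}")


# Demo on a temporary tree: <root>/credentials.json and <root>/notebooks/
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "credentials.json").write_text("{}")
    notebooks = root / "notebooks"
    notebooks.mkdir()
    found_from_notebooks = find_project_root(notebooks)  # walks up one level
    found_from_root = find_project_root(root)            # already at the root
```

Calling `os.chdir(find_project_root(os.getcwd()))` would then give the same result no matter how many times the cell is run.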

Open the JSON file containing the credentials

*⚠️ Note: Our credentials should be stored in a file named `credentials.json` in the root of the project folder*

In [None]:
credentials_file_path = './credentials.json'

with open(credentials_file_path, 'r') as f:
    credentials = json.load(f)

Create a service object for version 3 of the YouTube Data API

*⚠️ Note: The YouTube API key should be saved under the key `youtube_api_key` in the `credentials.json` file*

In [None]:
# create a service object for the YouTube Data API v3
service_youtube = build('youtube', 'v3', developerKey=credentials['youtube_api_key'])
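
If `credentials.json` is missing the `youtube_api_key` entry, the cell above fails with a bare `KeyError`. A small sketch of a check that produces a clearer message is shown below. `extract_api_key` is a hypothetical helper introduced only for illustration; it works on the dictionary already loaded from the file.

```python
def extract_api_key(credentials, key="youtube_api_key"):
    """Return the API key stored under `key`, or raise an error that says what is wrong."""
    if not isinstance(credentials, dict):
        raise TypeError("credentials must be the dict loaded from credentials.json")
    api_key = credentials.get(key)
    if not isinstance(api_key, str) or not api_key.strip():
        raise KeyError(f"{key!r} is missing or empty in credentials.json")
    return api_key.strip()  # drop accidental whitespace around the key


print(extract_api_key({"youtube_api_key": " abc123 "}))  # abc123
```

The returned value could then be passed to `build(...)` as `developerKey=extract_api_key(credentials)`.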

---

### **Data Scraping 🔍**

Getting a list of music videos using the `.search().list()` method

In [None]:
youtube_search_data, video_id = youtube_search(service_youtube, 2000, "official music video", "video", "US", 10)

# question for Hanbin: why are we saving this as JSON?
with open('./data/yt_search_data.json', 'w') as json_file:
    json.dump(youtube_search_data, json_file, indent=4)

yt_search_df = pd.DataFrame(youtube_search_data)
yt_search_df.to_csv('./data/search.csv')    # question for Hanbin: why are we saving this as CSV?
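
The `search().list()` endpoint returns results one page at a time (at most 50 items per page) and includes a `nextPageToken` while more pages remain. How `youtube_search` handles this internally is not shown in this notebook. The sketch below is one hypothetical way to page through results; `collect_items`, `fetch_page` and the fake pages are illustrative names and data so that the example runs offline.

```python
def collect_items(fetch_page, max_items):
    """Call fetch_page(page_token) until max_items are gathered or the pages run out.

    fetch_page must return a dict shaped like a YouTube API list response:
    {"items": [...], "nextPageToken": "..."} with the token absent on the last page.
    """
    items = []
    page_token = None  # None requests the first page
    while len(items) < max_items:
        response = fetch_page(page_token)
        items.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if page_token is None:  # no further pages
            break
    return items[:max_items]


# Stand-in for the real API: three pages with two items each.
_FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c", "d"], "nextPageToken": "p3"},
    "p3": {"items": ["e", "f"]},
}


def fake_fetch(page_token):
    return _FAKE_PAGES[page_token]


print(collect_items(fake_fetch, 5))  # ['a', 'b', 'c', 'd', 'e']
```

With the real service, `fetch_page` would wrap a call along the lines of `service_youtube.search().list(part="snippet", q=..., type="video", maxResults=50, pageToken=token).execute()`.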

Getting statistics for each video using the `.videos().list()` method, with the video IDs returned by the previous step as input

In [None]:
# The API accepts at most 50 video IDs per request.
# Solution: split the IDs into separate lists of 50 IDs each.
video_stats = get_stats(service_youtube, video_id)
video_stats_df = pd.DataFrame(video_stats)
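
The 50-ID limit mentioned above can be handled by batching. A minimal sketch, assuming the video IDs are held in a plain Python list; `chunked` is a hypothetical helper and the IDs are made up.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


example_ids = [f"id{i}" for i in range(120)]  # placeholder IDs
batches = list(chunked(example_ids))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Each batch could then be passed to `get_stats` and the results concatenated, provided `get_stats` returns a list of per-video records (an assumption, since its return type is not shown here).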

Merging the search and statistics dataframes


In [None]:
# merge the music video stats and search dataframes on their shared video_id column
merged_df = pd.merge(yt_search_df, video_stats_df, on='video_id')
merged_df.to_json('./data/merged.json')
merged_df.to_csv('./data/merged.csv')

Getting comments using the `.commentThreads().list()` method

In [None]:
comments_df = get_comments_in_videos(service_youtube, video_id) # note that comments are disabled for some videos
comments_df.head(5)
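
Because comments are disabled for some videos, a request for those videos fails instead of returning data. Whether `get_comments_in_videos` already guards against this is not shown here. The sketch below illustrates one hypothetical pattern: keep going after a failure and record which IDs were skipped. `fetch_for_each` and `fake_fetch_comments` are illustrative names, and the fake fetcher stands in for the real API call.

```python
def fetch_for_each(video_ids, fetch, skip_errors=(Exception,)):
    """Apply fetch(video_id) to every ID, collecting results and recording failures.

    Returns (results, skipped): results maps video_id -> fetch output,
    skipped maps video_id -> error message for IDs whose fetch raised.
    """
    results, skipped = {}, {}
    for video_id in video_ids:
        try:
            results[video_id] = fetch(video_id)
        except skip_errors as error:
            skipped[video_id] = str(error)
    return results, skipped


# Offline stand-in: pretend that video "v2" has comments disabled.
def fake_fetch_comments(video_id):
    if video_id == "v2":
        raise PermissionError("commentsDisabled")
    return [f"comment on {video_id}"]


results, skipped = fetch_for_each(
    ["v1", "v2", "v3"], fake_fetch_comments, skip_errors=(PermissionError,)
)
print(sorted(results))  # ['v1', 'v3']
print(skipped)          # {'v2': 'commentsDisabled'}
```

With `google-api-python-client`, failed requests raise `googleapiclient.errors.HttpError`, so `skip_errors=(HttpError,)` would be the natural choice against the real API.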

Final merge of the video and comment dataframes


In [None]:
final_youtube_df = pd.merge(merged_df, comments_df, on='video_id', sort=False)
final_youtube_df.to_csv('./data/final_youtube.csv')
final_youtube_df.head(5)
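
`pd.merge` performs an inner join by default, so any video with no rows in the comments dataframe (for example, one with comments disabled) is dropped from the final result. The toy example below shows the difference between the default and `how="left"`. The dataframes, the `title` and `comment` columns and their values are made up for illustration and do not reflect the real data.

```python
import pandas as pd

# Three videos, but only v1 and v3 have comments.
videos = pd.DataFrame({"video_id": ["v1", "v2", "v3"], "title": ["A", "B", "C"]})
comments = pd.DataFrame({"video_id": ["v1", "v1", "v3"], "comment": ["x", "y", "z"]})

# Default inner join: v2 disappears because it has no matching comment rows.
inner = pd.merge(videos, comments, on="video_id")

# Left join keeps every video; indicator=True adds a _merge column showing the row's origin.
left = pd.merge(videos, comments, on="video_id", how="left", indicator=True)
videos_without_comments = left.loc[left["_merge"] == "left_only", "video_id"].tolist()

print(len(inner))               # 3
print(len(left))                # 4
print(videos_without_comments)  # ['v2']
```

Whether videos without comments should be kept depends on the analysis. If they should, adding `how='left'` to the merge above is one option.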