# **YouTube Scraper (Draft)**


### Using the YouTube Data API to scrape data on the most popular music videos

---

### **Set-up ⚙️**

Install the following modules in the terminal

`pip install blahblahblah`

Import necessary packages

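*⚠️ Note: Run this cell only once per kernel session. It changes the working directory with a relative path, so each extra run moves one level further up. Restart the kernel before re-running this code chunk.*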

In [1]:
from googleapiclient.discovery import build
from IPython.display import JSON
import pandas as pd
import json
import os
os.chdir("..")                                  # move up one level to the main project directory

from dees_package.youtube_functions import *    # imports custom functions for youtube scraping

Check that we are in the correct current working directory

*⚠️ Note: We should be in the main project directory*

In [2]:
print("Current working directory:", os.getcwd())

Current working directory: c:\Users\soong\Documents\LSE_Acads\AT_23.24\DS105A\final_project\ds105a-project-dees-nuts


Open the JSON file containing the credentials

*⚠️ Note: Our credentials should be stored in a file named `credentials.json` in the root of the project folder*

In [3]:
credentials_file_path = './credentials.json'

with open(credentials_file_path, 'r') as f:
    credentials = json.load(f)

Create a service object for version 3 of the YouTube Data API

*⚠️ Note: The YouTube API key should be saved under the key `youtube_api_key` in the `credentials.json` file*

In [4]:
# creating service object of the youtube version 3 API
service_youtube = build('youtube', 'v3', developerKey=credentials['youtube_api_key'])
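
The credential-loading step above fails with a bare `KeyError` or `FileNotFoundError` if the file or key is missing. A minimal sketch of a more explicit loader follows; `load_api_key` is a hypothetical helper, not part of `dees_package`, and it assumes only the file name and key name stated in the notes above.

```python
import json
from pathlib import Path


def load_api_key(path="./credentials.json", key="youtube_api_key"):
    """Return the API key stored under `key` in the JSON file at `path`.

    Hypothetical helper for illustration; raises a descriptive error
    when the file or the key is missing.
    """
    credentials_path = Path(path)
    if not credentials_path.is_file():
        raise FileNotFoundError(
            f"{credentials_path} not found; expected it in the project root"
        )

    with credentials_path.open("r", encoding="utf-8") as f:
        credentials = json.load(f)

    api_key = credentials.get(key)
    if not api_key:
        raise KeyError(f"'{key}' is missing or empty in {credentials_path}")
    return api_key
```

The returned value could then be passed as `developerKey` to `build('youtube', 'v3', ...)` in place of `credentials['youtube_api_key']`.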

---

### **Data Scraping 🔍**

Using the `search_youtube` function, scrape data on the most popular music videos in the US with the API's `.search().list()` method, and store the results in a new dataframe

To avoid re-running the search, we save the raw search data as a CSV file at this point

In [5]:
youtube_search_data, video_id = search_youtube(service_youtube, 2000, "official music video", "video", "US", 10)

yt_search_df = pd.DataFrame(youtube_search_data)

# save raw data to csv
yt_search_df.to_csv('./data/raw_youtube_search_data.csv')

NameError: name 'search_youtube' is not defined
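
The source of `search_youtube` lives in `dees_package` and is not shown here, so the following is only a rough sketch of how paging through `.search().list()` results might look, not the project's implementation. The function name `collect_search_results` and the returned column names are assumptions; the request parameters (`part`, `q`, `type`, `regionCode`, `maxResults`, `pageToken`) and the `nextPageToken` response field come from the YouTube Data API, which returns at most 50 search results per page.

```python
def collect_search_results(service, query, total, region_code="US", page_size=50):
    """Page through search results until `total` rows are collected.

    Hypothetical helper: `service` is the object returned by
    googleapiclient.discovery.build('youtube', 'v3', ...).
    """
    rows = []
    page_token = None  # None requests the first page

    while len(rows) < total:
        response = service.search().list(
            part="snippet",
            q=query,
            type="video",
            regionCode=region_code,
            maxResults=min(page_size, total - len(rows)),
            pageToken=page_token,
        ).execute()

        items = response.get("items", [])
        for item in items:
            rows.append(
                {
                    "video_id": item["id"]["videoId"],
                    "title": item["snippet"]["title"],
                }
            )

        # Stop when the API reports no further pages or returns an empty page.
        page_token = response.get("nextPageToken")
        if not items or not page_token:
            break

    return rows
```

A call such as `collect_search_results(service_youtube, "official music video", 2000)` would return a list of dictionaries that `pd.DataFrame` accepts directly. It may return fewer rows than requested if the API runs out of pages first.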

Using the `get_stats` function, scrape statistics for each music video with the `.videos().list()` method, passing each video's ID as an input parameter, and store the scraped data in a new dataframe

To avoid re-running this step, we save the raw statistics data as a CSV file at this point

Merge the new dataframe with the first dataframe

*Extra details: The API accepts at most 50 video IDs per request, so we split the IDs into separate lists of 50 IDs each*

In [None]:
video_stats = get_stats(service_youtube, video_id)  
video_stats_df = pd.DataFrame(video_stats)

# save raw data to csv
video_stats_df.to_csv('./data/raw_youtube_stats_data.csv')

# merge the mv stats and search dataframes
merged_df = pd.merge(yt_search_df, video_stats_df, on='video_id')
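
The 50-ID limit mentioned above can be handled by slicing the ID list into batches and issuing one `.videos().list()` request per batch. The sketch below illustrates that idea; `chunked` and `fetch_video_stats` are hypothetical names, and the actual `get_stats` in `dees_package` may be structured differently. The `id` parameter takes a comma-separated string of video IDs, and each returned item carries its `id` and a `statistics` dictionary.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of `ids`, each with at most `size` elements."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_video_stats(service, video_ids, batch_size=50):
    """Request statistics for all `video_ids`, one API call per batch.

    Hypothetical helper: `service` is the YouTube v3 service object.
    """
    rows = []
    for batch in chunked(video_ids, batch_size):
        response = service.videos().list(
            part="statistics",
            id=",".join(batch),  # comma-separated list of up to 50 IDs
        ).execute()
        for item in response.get("items", []):
            # Flatten the statistics next to the ID so the rows merge on 'video_id'.
            rows.append({"video_id": item["id"], **item["statistics"]})
    return rows
```

With 2000 video IDs this would issue 40 requests. Videos that have been deleted or made private are simply absent from the response, so the result can contain fewer rows than IDs.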

Using the `get_comments_in_videos` function, scrape the comments for each music video with the `.commentThreads().list()` method, and store the scraped data into a new dataframe

*Extra details: Comments are disabled for some videos*

In [None]:
comments_df = get_comments_in_videos(service_youtube, video_id) 
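
Because comments are disabled for some videos, a request for their comment threads fails instead of returning an empty list (the API responds with a 403 error whose reason is `commentsDisabled`). One way to keep the loop going is to catch the error per video and record which IDs were skipped. The sketch below assumes this approach; `fetch_top_level_comments` and `collect_comments` are hypothetical names, and how `get_comments_in_videos` actually handles this case is not shown in this notebook.

```python
def fetch_top_level_comments(service, video_id, max_results=100):
    """Return the text of the first page of top-level comments for one video."""
    response = service.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=max_results,
        textFormat="plainText",
    ).execute()
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    ]


def collect_comments(service, video_ids, skip_errors):
    """Fetch comments per video, skipping videos whose request raises `skip_errors`.

    `skip_errors` is a tuple of exception classes to tolerate.
    Returns (comments_by_video_id, skipped_video_ids).
    """
    comments = {}
    skipped = []
    for video_id in video_ids:
        try:
            comments[video_id] = fetch_top_level_comments(service, video_id)
        except skip_errors:
            skipped.append(video_id)
    return comments, skipped
```

With `google-api-python-client`, failed requests raise `googleapiclient.errors.HttpError`, so a call might look like `collect_comments(service_youtube, video_id, skip_errors=(HttpError,))`. Catching every `HttpError` also hides quota errors, so inspecting the skipped list afterwards is advisable.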

Do a final merge of dataframes and save it as a CSV file


In [None]:
final_youtube_df = pd.merge(merged_df, comments_df, on='video_id', sort=False)
final_youtube_df.to_csv('./data/final_youtube_video_data.csv')