# **01 YouTube Scraper**


### Using the YouTube Data API to scrape data on the most popular music videos

---

### **Set-up ⚙️**

Install dependencies from `requirements.txt`


```bash
pip install -r requirements.txt
```

Import necessary packages

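*⚠️ Note: Run this code chunk only once per kernel session. It changes the working directory with a relative path, so every extra run moves one more level up. If it has already been run, restart the kernel before running it again.*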
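
The run-once restriction comes from the relative `os.chdir` call. A possible way around it, sketched here under the assumption that `credentials.json` sits in the project root (as the set-up notes describe), is to search upwards for that file instead of stepping up a fixed number of levels. The helper names `find_project_root` and `enter_project_root` are hypothetical and not part of the project's `functions` package.

```python
import os
from pathlib import Path


def find_project_root(start, marker="credentials.json"):
    """Return the nearest directory at or above `start` that contains `marker`.

    Assumption: the marker file lives in the project root.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")


def enter_project_root(marker="credentials.json"):
    """Change into the project root; repeated calls end up in the same place."""
    root = find_project_root(Path.cwd(), marker)
    os.chdir(root)
    return root
```

Calling `enter_project_root()` from the notebook folder or from the project root itself leads to the same directory, so re-running the cell would no longer matter.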

In [None]:
from googleapiclient.discovery import build
import isodate
import pandas as pd
import json
import ast
import os
os.chdir("..")                                  # move up one level to the main project directory

from functions.youtube_functions import *       # imports custom functions for youtube scraping

Check that we are in the correct current working directory

*⚠️ Note: We should be in the main project directory*

In [None]:
print("Current working directory:", os.getcwd())

Open JSON file containing credentials

*⚠️ Note: Our credentials should be stored in a file named `credentials.json` in the root of the project folder*

In [None]:
credentials_file_path = './credentials.json'

with open(credentials_file_path, 'r') as f:
    credentials = json.load(f)

Create a service object for version 3 of the YouTube Data API

*⚠️ Note: The YouTube API key should be saved under the key `youtube_api_key` in the `credentials.json` file*
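
A missing file or a misspelled key otherwise only surfaces as a bare `FileNotFoundError` or `KeyError` when the cells run. As a minimal sketch, the two requirements above (file location and key name) could be checked together with a clearer message. `load_api_key` is a hypothetical helper, not something the project defines.

```python
import json
from pathlib import Path


def load_api_key(path="./credentials.json", key="youtube_api_key"):
    """Read the API key from the credentials file, failing with a clear message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {path.resolve()}")

    with path.open("r", encoding="utf-8") as f:
        credentials = json.load(f)

    # The file is expected to hold a JSON object such as {"youtube_api_key": "..."}
    api_key = credentials.get(key) if isinstance(credentials, dict) else None
    if not isinstance(api_key, str) or not api_key.strip():
        raise KeyError(f"{key!r} is missing or empty in {path}")
    return api_key
```

The returned string could then be passed as `developerKey` when building the service object.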

In [None]:
# creating service object of the youtube version 3 API
service_youtube = build('youtube', 'v3', developerKey=credentials['youtube_api_key'])

---

### **Data Scraping 🔍**

Using the `search_youtube` function, scrape data on the most popular music videos in the US with the `.search().list()` method, and store the scraped data in a new dataframe

In [None]:
youtube_search_data, video_id = search_youtube(service_youtube, 2000, "official music video", "video", "US", 10)

yt_search_df = pd.DataFrame(youtube_search_data)

# save raw data to csv
yt_search_df.to_csv('./data/raw_youtube_search_data.csv')

Using the `get_stats` function, scrape statistics for each music video with the `.videos().list()` method, passing the video IDs as input, and store the scraped data in a new dataframe

Merge the new dataframe with the first dataframe

*📇 Extra details: The API accepts at most 50 video IDs per request, so the IDs are split into lists of 50 each*
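
The batching idea can be sketched as follows. This is only an illustration of splitting a list into groups of 50, not the actual body of `get_stats`, which is not shown here. The helper name `chunk_ids` and the sample IDs are made up.

```python
def chunk_ids(video_ids, size=50):
    """Split a list of video IDs into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [video_ids[i:i + size] for i in range(0, len(video_ids), size)]


# Made-up sample IDs, standing in for the IDs returned by the search step
sample_ids = [f"id{i}" for i in range(120)]
batches = chunk_ids(sample_ids)

# Each batch can be joined into one comma-separated string for a single request
id_params = [",".join(batch) for batch in batches]
```

With 120 IDs this gives three batches of 50, 50 and 20 IDs, so three requests instead of 120.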

In [None]:
video_stats = get_stats(service_youtube, video_id)  
video_stats_df = pd.DataFrame(video_stats)

# save raw data to csv
video_stats_df.to_csv('./data/raw_youtube_stats_data.csv')

# merge the video stats and search dataframes on their shared video_id column
merged_df = pd.merge(yt_search_df, video_stats_df, on='video_id')

Using the `get_comments_in_videos` function, scrape the comments for each music video with the `.commentThreads().list()` method, and store the scraped data in a new dataframe

Merge the new dataframe with the previous dataframe to create the final raw dataframe

Save the final raw data as a CSV file before any data cleaning

*📇 Extra details: Comments are disabled for some videos, so comment data may be missing for them*
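
Because `pd.merge` performs an inner join by default, a video with no matching row in the comments dataframe would be dropped from the merged result. Whether that happens here depends on how `get_comments_in_videos` treats videos with disabled comments, which is not shown. The sketch below uses made-up sample data to show the difference between the default join and a left join with `indicator=True`.

```python
import pandas as pd

# Made-up sample data: video "b" has no comments row (e.g. comments disabled)
videos = pd.DataFrame({"video_id": ["a", "b", "c"], "view_count": [100, 200, 300]})
comments = pd.DataFrame({"video_id": ["a", "c"], "comments": ["great song", "love it"]})

# Default (inner) join: only videos present in both dataframes survive
inner = pd.merge(videos, comments, on="video_id")

# Left join: every video is kept, and `_merge` records where each row came from
left = pd.merge(videos, comments, on="video_id", how="left", indicator=True)
missing_ids = left.loc[left["_merge"] == "left_only", "video_id"].tolist()
```

Checking `missing_ids` before the merge is one way to see how many videos would be lost to the inner join.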

In [None]:
comments_df = get_comments_in_videos(service_youtube, video_id)

# save raw data to csv
comments_df.to_csv('./data/raw_youtube_comments_data.csv')

final_youtube_df = pd.merge(merged_df, comments_df, on='video_id', sort=False)

# save raw data to csv
final_youtube_df.to_csv('./data/raw_youtube_final_data.csv')

---

### **Data Cleaning 🧹**

Clean the raw data and remove unnecessary data
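
The cleaning cell that follows applies three value-level transformations: stripping noise from titles, turning the stored list of category URLs into a readable string, and converting ISO 8601 durations to seconds. The sketch below tries each on single made-up values using only the standard library. The helper names are hypothetical, the title helper also strips surrounding whitespace, and `duration_to_seconds` is a simplified regex stand-in for `isodate.parse_duration` that only handles days, hours, minutes and whole seconds.

```python
import ast
import re

# Same title pattern as in the cleaning cell
TITLE_NOISE = re.compile(r'\(.*\)|\s+ft\..*')
# Simplified ISO 8601 duration pattern (assumption: no weeks, months or fractions)
DURATION = re.compile(r'^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$')


def clean_title(title):
    """Remove parenthesised text and anything after ' ft.', then trim whitespace."""
    return TITLE_NOISE.sub('', title).strip()


def format_categories(raw):
    """Turn the string form of a list of category URLs into 'Name one, Name two'."""
    urls = ast.literal_eval(raw)  # the CSV stores the list as its string representation
    return ', '.join(url.split('/')[-1].replace('_', ' ') for url in urls)


def duration_to_seconds(iso):
    """Convert a duration such as 'PT4M13S' to a whole number of seconds."""
    match = DURATION.match(iso)
    if match is None:
        raise ValueError(f"Unsupported duration format: {iso!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


# Made-up sample values
title_a = clean_title("Artist - Song (Official Music Video)")
title_b = clean_title("Artist - Song ft. Someone")
# The pattern is greedy: everything from the first '(' to the last ')' is removed
title_c = clean_title("Song (Live) at Home (Remix)")
categories = format_categories(
    "['https://en.wikipedia.org/wiki/Pop_music', 'https://en.wikipedia.org/wiki/Music']"
)
seconds = duration_to_seconds("PT4M13S")
```

Note the third title: because `\(.*\)` is greedy, the text between two parenthesised groups is removed as well, which may or may not be what is wanted for a given title.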

In [None]:
final_youtube_df = pd.read_csv('./data/raw_youtube_final_data.csv', index_col=0)

# drop unnecessary columns
cleaned_df = final_youtube_df.drop(['video_id', 'channel_id', 'channel_title'], axis=1)  

# remove everything in parentheses and after ft. to get standardised video titles, then trim leftover whitespace
cleaned_df['title'] = cleaned_df['title'].str.replace(r'\(.*\)|\s+ft\..*', '', regex=True).str.strip()

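# format wikipedia categories into a joined string, using the last URL segment of each category
def format_categories(raw):
    urls = ast.literal_eval(raw)    # the CSV stores the list as its string representation
    return ', '.join(url.split('/')[-1].replace('_', ' ') for url in urls)

cleaned_df['wikipedia_categories'] = cleaned_df['wikipedia_categories'].apply(format_categories)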

# convert duration to seconds
cleaned_df['duration'] = cleaned_df['duration'].apply(lambda x: isodate.parse_duration(x).total_seconds())

# rename columns appropriately
cleaned_df = cleaned_df.rename(columns={'title': 'video_title', 'duration': 'duration_seconds'})

Save cleaned data as a CSV file

In [None]:
cleaned_df.to_csv('./data/cleaned_youtube_final_data.csv')