# ETL - Scraping YouTube Video Playlist Stats with Python

One of the most challenging aspects of working in social media marketing for entertainment is getting an accurate portrayal of reach when a new video is released as part of a major beat.

A big part of your reach will come from owned channels, while the rest may come from media partners and even superfans who rip video content for their own channels. Believe me when I say I've spent a great deal of time between YouTube and a manual spreadsheet trying to record the internet's reception of our campaign's content after the first day, second day, third day, and so on.

In this project, I'll walk you through a web scraping tool I developed to automate the recording of YouTube statistics, ready for loading into a dataframe or database.

## Importing Essential Libraries

In [195]:
#browser controller
from selenium import webdriver
import time

#webscraping and string cleaning
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import re

#dataframe management
import pandas as pd

## String Cleaning Functions

Before we fire up the browser controller, it's important to first walk through a few of these string cleaning functions that are unique to the YouTube platform.


In [None]:
def km_cleaner(value):
    # drop the trailing label and any thousands separators: "1,234 views" -> "1234"
    value = re.sub(r'\s*(subscribers?|views?)', '', value).replace(',', '').strip()
    # expand YouTube's K (thousands) and M (millions) abbreviations
    if value.endswith('K'):
        return round(float(value[:-1]) * 1000)
    elif value.endswith('M'):
        return round(float(value[:-1]) * 1000000)
    else:
        return int(value)
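
The same cleaning can also be expressed with a small lookup table of suffix multipliers. The sketch below is a hypothetical variant, not part of the original tool: the name `parse_count` and the `B` suffix are assumptions. It also tolerates thousands separators such as `1,234 views`.

```python
import re

# Multipliers for the abbreviations YouTube uses in counts.
# 'B' is an assumption added for completeness.
SUFFIXES = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}

def parse_count(text):
    """Convert strings like '59.4M subscribers' or '1,234 views' to an int."""
    match = re.search(r'([\d.,]+)\s*([KMB])?', text)
    if match is None:
        raise ValueError(f'no number found in {text!r}')
    # Remove thousands separators before converting to a number
    number = float(match.group(1).replace(',', ''))
    # A missing suffix means the number is already a plain count
    multiplier = SUFFIXES.get(match.group(2), 1)
    return round(number * multiplier)

print(parse_count('59.4M subscribers'))
print(parse_count('506K views'))
print(parse_count('1,234 views'))
print(parse_count('97'))
```

Rounding rather than truncating guards against floating-point results that land just under the intended whole number.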

The next function, `get_stats`, opens a single video page and returns the following fields:

- title
- url
- account name
- subscribers
- views
- date uploaded
- number of likes
- number of dislikes

In [172]:
def get_stats(youtube_url):
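    # load the video page in a Chrome session driven by the local chromedriver binary
    driver = webdriver.Chrome('resources/chromedriver')
    driver.get(youtube_url)
    # give the JavaScript-rendered stats time to appear before grabbing the HTML
    time.sleep(3)
    # name the parser explicitly so results don't depend on which parsers are installed
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()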
    
    # km_cleaner is the helper defined under String Cleaning Functions above
    title_value = soup.find('h1').text
    account_value = soup.find('yt-formatted-string', class_='ytd-channel-name').text
    subs_value = km_cleaner(soup.find('yt-formatted-string', id='owner-sub-count').text)
    view_count = km_cleaner(soup.find('span', class_='short-view-count').text)
    
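    def date_uploaded(value):
        # drop the bullet separator YouTube places before the date
        value = re.sub('•', '', value).strip()
        if 'ago' in value:
            # relative stamps such as "5 hours ago" are assumed to be in hours
            hours = int(re.search(r'\d+', value).group())
            uploaded = datetime.now() - timedelta(hours=hours)
            # build "Apr 27, 2020" without the platform-specific %-d directive
            return f"{uploaded.strftime('%b')} {uploaded.day}, {uploaded.year}"
        else:
            return value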
    
    date_string = soup.find('div', id='date').text
    date_value = date_uploaded(date_string)
    # the like and dislike counts use the same element type, in that order
    toggle_buttons = soup.find_all('yt-formatted-string', class_='ytd-toggle-button-renderer')
    likes_value = km_cleaner(toggle_buttons[0].text)
    dislikes_value = km_cleaner(toggle_buttons[1].text)
    
    stats = {'date_uploaded': date_value,
             'account': account_value,
             'video_title': title_value,
             'views': view_count,
             'subscribers': subs_value,
             'likes': likes_value,
             'dislikes': dislikes_value,
             'url': youtube_url
            }
    
    return stats 
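
The `date_uploaded` helper treats every relative stamp as a number of hours. A minimal sketch of a more general conversion is shown below. The function name `relative_to_date`, the set of units, and the injectable `now` parameter are illustrative assumptions rather than part of the original scraper.

```python
import re
from datetime import datetime, timedelta

# Map the unit words in stamps like "3 hours ago" to timedelta keyword arguments.
UNITS = {'minute': 'minutes', 'hour': 'hours', 'day': 'days', 'week': 'weeks'}

def relative_to_date(text, now=None):
    """Turn '5 hours ago' into a date like 'Apr 27, 2020'; pass absolute dates through."""
    now = now or datetime.now()
    match = re.search(r'(\d+)\s+(minute|hour|day|week)s?\s+ago', text)
    if match is None:
        # Already an absolute date such as "Apr 27, 2020"
        return text.replace('•', '').strip()
    delta = timedelta(**{UNITS[match.group(2)]: int(match.group(1))})
    uploaded = now - delta
    # Avoid '%-d', which strftime does not support on every platform
    return f"{uploaded.strftime('%b')} {uploaded.day}, {uploaded.year}"

reference = datetime(2020, 4, 28, 9, 0)
print(relative_to_date('•Streamed 12 hours ago', now=reference))
print(relative_to_date('•Apr 27, 2020'))
```

Passing `now` explicitly makes the conversion repeatable, which helps when checking the helper against a known upload date.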

In [177]:
def get_video_urls(youtube_playlist_url):
    #webdriver setup & html scrape
    driver = webdriver.Chrome('resources/chromedriver')
    driver.get(youtube_playlist_url)
    #wait for the playlist page to render before reading its html
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    #isolate the video stats
    video_stats = soup.find_all('yt-formatted-string')[5].text
    #strip the thousands separator and the ' videos' label to isolate the video number
    no_of_videos = re.sub(',', '', video_stats)
    no_of_videos = re.sub(' videos', '', no_of_videos)
    #number of scrolls based on 100 thumbnails per scroll
    no_of_scrolls = int(int(no_of_videos) / 100 + 1)

    #scroll page for every 100 videos
    for i in range(no_of_scrolls):
        driver.execute_script("window.scrollBy(0, 12000);")
        time.sleep(3)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()

    #extract youtube playlist urls
    matches = soup.find_all('a', class_='yt-simple-endpoint style-scope ytd-playlist-video-renderer')
    youtube_urls = []
    for url in matches:
        string = url.get('href')
        substring = re.search(r'/watch\?v=([^&]+)', string).group()
        youtube_urls.append('http://www.youtube.com' + substring)    
    return youtube_urls
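
Two pieces of this function can be checked without opening a browser: the scroll count and the URL cleanup. The sketch below isolates them under the same working assumptions (roughly 100 thumbnails load per scroll, and each `href` looks like `/watch?v=<id>&list=...`). The helper names and the sample `href` are hypothetical.

```python
import math
import re

def scrolls_needed(video_stats, per_scroll=100):
    """Estimate how many scrolls a label such as '1,234 videos' requires."""
    # Keep only the digits, which drops both the comma and the ' videos' label
    count = int(re.sub(r'[^\d]', '', video_stats))
    return max(1, math.ceil(count / per_scroll))

def clean_watch_url(href):
    """Keep only the /watch?v=<id> part of a playlist link, or None if absent."""
    match = re.search(r'/watch\?v=([^&]+)', href)
    return 'http://www.youtube.com' + match.group() if match else None

print(scrolls_needed('1,234 videos'))
print(clean_watch_url('/watch?v=CTqED7mOzrU&list=PLexample&index=1'))
```

This version uses ceiling division, so a playlist of exactly 100 videos gets one scroll where the `+ 1` formula gives two. The extra scroll is harmless, so either choice works.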

In [173]:
def create_export(url):
    all_urls = get_video_urls(url)
    all_stats = []
    for video in all_urls:
        all_stats.append(get_stats(video))
    return all_stats
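
Because each video is scraped in turn, a single page that fails to parse raises an exception and stops the whole export. A hedged sketch of a more forgiving loop follows. `collect_stats` and its `fetch` parameter are hypothetical names, with `fetch` standing in for `get_stats`, and `fake_fetch` exists only for this demonstration.

```python
def collect_stats(urls, fetch):
    """Call fetch(url) for each url, keeping results and recording failures."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as error:  # a scrape can fail in many different ways
            failures.append((url, repr(error)))
    return results, failures

# Stand-in fetcher for demonstration; in the notebook this would be get_stats
def fake_fetch(url):
    if 'bad' in url:
        raise AttributeError('element not found')
    return {'url': url, 'views': 1000}

stats, failed = collect_stats(['video-a', 'bad-video', 'video-b'], fake_fetch)
print(len(stats), 'scraped;', len(failed), 'failed')
```

The list of failures can be retried later or logged alongside the export.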

In [189]:
wwe = create_export('https://www.youtube.com/playlist?list=PL7qtZGedQPadDiw6Y-XgiQIk035CQc3pm')

In [190]:
df = pd.DataFrame(wwe)

In [193]:
df.sort_values("views", ascending=False)

| | account | video_title | date_uploaded | views | subscribers | likes | dislikes | url |
|---|---|---|---|---|---|---|---|---|
| 0 | WWE 2K | WWE 2K Battlegrounds Teaser Trailer | Apr 27, 2020 | 506000 | 493000 | 6200 | 8000 | http://www.youtube.com/watch?v=CTqED7mOzrU |
| 1 | WWE | WWE 2K Battlegrounds coming this fall | Apr 27, 2020 | 362000 | 59400000 | 14000 | 3300 | http://www.youtube.com/watch?v=iXj1yrZM9YI |
| 2 | IGN | WWE Battlegrounds - Reveal Trailer | Apr 27, 2020 | 211000 | 13300000 | 3100 | 3800 | http://www.youtube.com/watch?v=FowI9yms1NE |
| 3 | 2K United Kingdom | WWE 2K Battlegrounds Teaser Trailer | Apr 27, 2020 | 36000 | 44100 | 97 | 378 | http://www.youtube.com/watch?v=o5tk6FU1hXE |
| 6 | Bestintheworld | WWE 2K NEW Game WWE 2K Battlegrounds Teaser Tr... | Apr 27, 2020 | 25000 | 770000 | 724 | 43 | http://www.youtube.com/watch?v=X6RjvTGJ9TM |
| 7 | GameSpot Trailers | WWE 2K Battlegrounds - Official Teaser Trailer | Apr 27, 2020 | 9900 | 732000 | 340 | 139 | http://www.youtube.com/watch?v=B1uCMFUQCZA |
| 5 | GameTrailers | WWE Battlegrounds - Reveal Trailer | Apr 27, 2020 | 8700 | 851000 | 107 | 192 | http://www.youtube.com/watch?v=ip_CjSmThhI |
| 4 | JeuxActu | WWE 2K Battlegrounds Bande Annonce Officielle ... | Apr 27, 2020 | 8600 | 463000 | 101 | 69 | http://www.youtube.com/watch?v=TakTOsa0wsM |
| 8 | GameNews | WWE 2K Battlegrounds Trailer (2020) The Rock, ... | Apr 27, 2020 | 5300 | 1260000 | 116 | 68 | http://www.youtube.com/watch?v=D23q-2vrpP8 |
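
To load the results into a database rather than a dataframe, the list of dictionaries can be written to SQLite with the standard library. This is a sketch under assumptions: the table name `video_stats`, the in-memory database, and the `load_stats` helper are illustrative choices, and the sample row is copied from the first row of the output above.

```python
import sqlite3

COLUMNS = ['date_uploaded', 'account', 'video_title', 'views',
           'subscribers', 'likes', 'dislikes', 'url']

def load_stats(connection, rows, table='video_stats'):
    """Create the table if needed and insert one record per stats dict."""
    connection.execute(
        f'CREATE TABLE IF NOT EXISTS {table} ('
        'date_uploaded TEXT, account TEXT, video_title TEXT, views INTEGER, '
        'subscribers INTEGER, likes INTEGER, dislikes INTEGER, url TEXT)')
    column_list = ', '.join(COLUMNS)
    # Named placeholders (:views, :url, ...) are filled from each dict's keys
    placeholders = ', '.join(f':{name}' for name in COLUMNS)
    connection.executemany(
        f'INSERT INTO {table} ({column_list}) VALUES ({placeholders})', rows)
    connection.commit()

sample = [{'date_uploaded': 'Apr 27, 2020', 'account': 'WWE 2K',
           'video_title': 'WWE 2K Battlegrounds Teaser Trailer',
           'views': 506000, 'subscribers': 493000, 'likes': 6200,
           'dislikes': 8000, 'url': 'http://www.youtube.com/watch?v=CTqED7mOzrU'}]

conn = sqlite3.connect(':memory:')
load_stats(conn, sample)
print(conn.execute('SELECT account, views FROM video_stats').fetchall())
```

Swapping `':memory:'` for a file path would persist the table between runs, so each day's scrape can be appended for comparison.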


## Authors

**Gerard Tieng** — *Data Analyst & Social Media Marketer*
- [http://www.twitter.com/gerardtieng](http://www.twitter.com/gerardtieng)
- [http://www.linkedin.com/in/gerardtieng](http://www.linkedin.com/in/gerardtieng)
- [http://www.github.com/gtieng](http://www.github.com/gtieng)