<h1>Using Selenium to Crawl Yle.fi Subtitles</h1>

This project assumes use of Google Chrome as the browser.

On macOS Catalina, see the following workaround if chromedriver cannot be opened:
https://stackoverflow.com/questions/60362018/macos-catalinav-10-15-3-error-chromedriver-cannot-be-opened-because-the-de

This project borrows some ideas from this YouTube caption-scraping example: https://codereview.stackexchange.com/questions/166010/scraping-all-closed-captions-subtitles-of-a-youtubes-creators-video-library

The official Selenium documentation was also useful.

In [1]:
import sys
import time
import os

sys.setrecursionlimit(10000) #get_line() recurses once per poll, so raise Python's default recursion limit

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
from selenium.webdriver.support.ui import Select

Create an instance of 'scrape_subs_yle' with the required parameters: 'show_title', 'directory', 'webdriver_path' and 'home_url'.

Note that the script is not very robust when activating subtitles and playing or pausing the video, mostly because the HTML structure changes from page to page. However, it should work with both the children's and adults' yle.fi websites.
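
One way to soften the dependence on page-specific markup is to try several candidate XPaths in order and use the first one that resolves. The sketch below is only an illustration: `first_match`, the `locate` callable and the `old-subtitles` class name are hypothetical and not part of the class in this notebook. With Selenium, `locate` could be something like `lambda xp: driver.find_element(By.XPATH, xp)`.

```python
def first_match(locate, candidates):
    """Return (xpath, element) for the first candidate that `locate` resolves.

    `locate` is any callable that returns an element, or raises when the
    XPath does not match (hypothetical helper, not part of the class).
    """
    tried = []
    for xpath in candidates:
        try:
            return xpath, locate(xpath)
        except Exception:  # broad on purpose: this is only a sketch
            tried.append(xpath)
    raise LookupError(f"no candidate matched: {tried}")


# Stand-in for a page in which only one of the XPaths exists.
fake_page = {'//*[@class="playkit-subtitles"]': "subtitle element"}


def fake_locate(xpath):
    return fake_page[xpath]  # raises KeyError when the XPath is missing


matched_xpath, element = first_match(
    fake_locate,
    ['//*[@class="old-subtitles"]',      # made-up class name, does not match
     '//*[@class="playkit-subtitles"]'],
)
print(matched_xpath)
```

Passing the lookup in as a callable keeps the fallback logic testable without opening a browser.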

After you call 'self.get_lines()', make sure that subtitles in the correct language are activated and that the episode is playing.

Most errors are timeouts: if the video has a stretch of more than 120 seconds without dialogue, the wait for the next subtitle times out.

Do not worry if it times out: assign 'self.episode_lines' to a separate variable, then call 'self.get_lines()' again from the point in the episode where the error was thrown.
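
After resuming, the saved lines and the new lines have to be joined by hand. A minimal sketch of that step, assuming each run starts with the empty-string placeholder that 'get_lines()' puts at the front of the list. `merge_partial_runs` is a hypothetical helper and the sample lines are invented for illustration.

```python
def merge_partial_runs(*runs):
    """Join subtitle lists from several runs into one list.

    Drops empty strings (including the placeholder each run starts with)
    and skips any line equal to the one just kept, which removes the
    repeated line where two runs overlap.
    """
    merged = []
    for run in runs:
        for line in run:
            if line == '':
                continue
            if merged and line == merged[-1]:
                continue
            merged.append(line)
    return merged


# Invented sample data: the second run restarts on the last line of the first.
first_run = ['', 'Hei!', 'Mitä kuuluu?']             # saved before the timeout
second_run = ['', 'Mitä kuuluu?', 'Kiitos, hyvää.']  # after calling get_lines() again

merged_lines = merge_partial_runs(first_run, second_run)
print(merged_lines)
```

The merged list can then be assigned back to 'episode_lines' before calling 'write_to_txt()'.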

In [2]:
class scrape_subs_yle:
    '''
    should work for most shows on yle.fi's website.
    however, this only scrapes subtitles while you watch: it is not accelerated.
    this is due to how the javascript loads the subtitle elements.
    '''
    
    def __init__(self, show_title, directory, webdriver_path, home_url, crawl_all = False, timeout =10):
        
        #create an instance of the webdriver.
        #if webdriver_path is None, chromedriver must be on the system PATH
        if webdriver_path is None:
            self.driver = webdriver.Chrome()
        else:
            self.driver = webdriver.Chrome(webdriver_path)
        self.driver.get(home_url)
        self.timeout = timeout
        self.wait = WebDriverWait(self.driver, timeout)
        self.url = self.driver.current_url
        self.crawl_all = crawl_all
        self.directory = directory
        self.show_title = show_title
        self.episode_lines = ['']
        
        if not os.path.isdir(os.path.join(self.directory, self.show_title)):
            self.create_dir()
        else:
            self.show_folder = os.path.join(self.directory, self.show_title)
        
        #adult and children's platforms have different html structure.
        self.lapset = 'lapset' in self.driver.current_url
        
        if self.lapset:
            #xpaths are relative to the nearest defining class
            self.episode_xpath = '//*[@class="se7zml-9 blyWGO"]'
            self.xpath_subtitle = '//*[@class="sc-16grvol-1 iomMwC"]'
            self.subtitle_button = '//*[@class="oj6fcc-0 kwYJza"]/button[6]'
            self.subtitle_button_2 = '//*[@class="t36nnb-1 gcSUoD"]'
            self.play_button = '//*[@class="oj6fcc-1 czdUiX"]'
            self.mouse_nudge = ActionChains(self.driver).move_by_offset(1, 0).move_by_offset(-1, 0)
        else:
            self.player = '//*[@class="player-holder"]'
            self.episode_xpath = '//*[@class="program-content"]/header/h1'
            self.xpath_subtitle = "//*[@class='playkit-subtitles']/div/div/div"
            self.subtitle_menu = '//*[@id="player-gui"]/div[3]/div[4]/div[3]/div[2]/div[1]/button'
            self.subtitle_button = '//*[@class="playkit-control-button-container playkit-control-language"]/div/button[1]'
            self.subtitle_button_2 = '//*[@class="playkit-dropdown"]'
            self.subtitle_button_3 = '//*[@class="playkit-dropdown-menu-item"]'
            self.play_button = self.player#'//*[@id="player-gui"]/div[3]/div[4]/div[2]/div[1]/div/div/button'
            self.play_status = '//*[@id="player-gui"]/div[3]/div[4]/div[2]/div[1]/div/div/span'
            self.mouse_nudge = ActionChains(self.driver).move_by_offset(1, 0).move_by_offset(-1, 0)

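    def __enter__(self):
        '''
        Allows the class to be used in a "with" statement.
        '''
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        '''
        Closes driver window.
        '''
        self.driver.close()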


    def create_dir(self):
        '''
        makes a folder with the show title as name.
        each episode will have its own txt file here
        '''
        self.show_folder = os.path.join(self.directory, self.show_title)
        os.mkdir(self.show_folder)
        
        print(self.show_title, ' directory created')
        
        return self.show_folder

    def play_pause(self):
        time.sleep(1)
        self.mouse_nudge.perform()
        self.wait.until(EC.element_to_be_clickable((By.XPATH, self.play_button))).click()

        
    def activate_subtitles(self):
        '''
        activates subtitles by clicking the subtitle icon
        and clicking 'suomi' from the dropdown menu
        '''
        
        #wake up the menus
        self.mouse_nudge.perform()
        
        #click outer menu subtitle button
        subtitles = self.wait.until(EC.element_to_be_clickable((By.XPATH, self.subtitle_button)))
        subtitles.click()
        
        #click to activate subs from inner menu
        subtitles_2 = self.wait.until(EC.element_to_be_clickable((By.XPATH, self.subtitle_button_2)))
        subtitles_2.click()
        
        if not self.lapset:
            subtitles_3 = self.wait.until(EC.element_to_be_clickable((By.XPATH, self.subtitle_button_3)))
            subtitles_3.click()


    def get_lines(self):
        '''
        initiates the episode
        saves episode title
        '''

        #wake up the menus
        self.mouse_nudge.perform()
        self.episode_lines = ['']
        self.url = self.driver.current_url
        
        #click play. lapset website doesn't automatically play when opened.
        if self.lapset:
            self.play_pause()
        #'Toista' is the Finnish label for 'Play', shown while the video is paused
        elif self.driver.find_element(By.XPATH, self.play_status).text == 'Toista':
            self.play_pause()
            
        self.activate_subtitles()
        
        #get episode title
        self.episode_title = self.driver.find_element(By.XPATH, self.episode_xpath).text

        assert self.episode_title != '' 

        #initiate subtitles list with title first
        #self.episode_lines = [self.episode_title, '\n\n']

        print('subtitle crawling initiated for ', self.episode_title, 
              "\n\nCheck that: \n\n1.) Episode is playing and \n2.) Subtitles are enabled.")

        #call get_line to retrieve subtitles recursively.
        self.episode_lines = self.get_line() 
        
        if not os.path.isdir(self.show_folder):
            self.create_dir()

        return self.episode_lines

    def get_line(self): 
        '''
        retrieves lines of text as the episode plays.
        appends them to a list.
        returns list when episode changes or navigation away from page.
        '''

        #exceptions and sleeps are needed to work around stale elements
        try:
            line = WebDriverWait(self.driver, 120)\
            .until(EC.presence_of_element_located((By.XPATH, self.xpath_subtitle))).text 

            if line != self.episode_lines[-1]:
                self.episode_lines.append(line)

            else:
                time.sleep(1)

        except StaleElementReferenceException:
            time.sleep(1)

        #check that the title of the episode or url hasn't changed.
        #how can you make this more robust?
        if self.driver.find_element(By.XPATH, self.episode_xpath).is_displayed():
            if self.driver.find_element(By.XPATH, self.episode_xpath).text != self.episode_title or \
            self.url != self.driver.current_url:

                print(self.episode_title, ' is finished copying')
                self.play_pause()
                self.write_to_txt()
                print(self.episode_title, ' saved to ', self.show_folder)

                if self.crawl_all:
                    self.get_lines()
                else:
                    print("Continue crawling next episode? enter 'y' or 'n'")
                    def cont():
                        x = input()
                        if x == 'y':
                            self.get_lines() 
                        #elif, so that a 'y' answer does not fall through to the else branch
                        elif x == 'n':
                            print('Crawling stopped')
                            sys.exit()
                        else:
                            print("please enter only 'y' or 'n'")
                            cont()
                    cont()
            else:
                #call the function recursively until the episode is over.
                self.get_line()
        else:
            try:
                self.get_line() 
            except RecursionError:
                self.play_pause()
                print('recursion limit reached. \nSave progress using ".write_to_txt()". '
                      '\nThen call ".get_lines()" again.')

        return self.episode_lines

    def write_to_txt(self):
        '''after scraping episode, write subtitles to txt file'''
        
        file_name = os.path.join(self.show_folder, self.episode_title)
        
        with open(file_name, 'w', encoding='utf-8') as file:
            file.write(self.episode_title + '\n\n')
            #writelines() adds no separators, so join the lines explicitly
            file.write('\n'.join(self.episode_lines))
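
Because 'get_line()' calls itself once per poll, a long episode can hit the recursion limit. A loop avoids that. The sketch below is not the class's implementation: `collect_lines`, `read_line` and `is_finished` are hypothetical names, `LookupError` stands in for Selenium's `StaleElementReferenceException`, and the fake feed replaces the browser so the example runs on its own.

```python
import time


def collect_lines(read_line, is_finished, poll_interval=1.0, sleep=time.sleep):
    """Poll `read_line` in a loop until `is_finished` returns True.

    Keeps a line only when it differs from the previous one, mirroring
    the duplicate check in get_line(), but without recursion.
    """
    lines = ['']  # same empty-string placeholder the class starts with
    while not is_finished():
        try:
            line = read_line()
        except LookupError:  # stand-in for StaleElementReferenceException
            sleep(poll_interval)
            continue
        if line != lines[-1]:
            lines.append(line)
        else:
            sleep(poll_interval)
    return lines[1:]  # drop the placeholder


# Fake subtitle feed: each entry is what the page shows at one poll.
frames = ['Hei!', 'Hei!', 'Moi moi.', 'Moi moi.', 'Nähdään!']
position = {'i': 0}


def fake_read_line():
    line = frames[position['i']]
    position['i'] += 1
    return line


def fake_is_finished():
    return position['i'] >= len(frames)


result = collect_lines(fake_read_line, fake_is_finished, sleep=lambda seconds: None)
print(result)
```

With Selenium, `read_line` would wrap the `WebDriverWait(...).until(...)` call and `is_finished` would wrap the title and URL comparison.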
        


Below is an example of how to call the class, beginning with assigning the needed variables.


In [3]:
#corpus directory
directory = '/Users/paulp/Desktop/finnish_shows/subtitles'

#path to the chromedriver executable
webdriver_path = "/Users/paulp/chromedriver 4"

#chromedriver will navigate to home_url initially.
#Then you can navigate manually to the episode/season you want.
home_url='https://areena.yle.fi/1-804340?autoplay=true'

#this saves all crawled info into a subfolder in the corpus directory
show_title='Uusi_Päivä'
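
The class creates the show folder with `os.mkdir`, which fails if the corpus directory itself does not exist yet. A hedged alternative is to create the whole path in one call. `ensure_show_folder` is a hypothetical helper, and the temporary directory only stands in for the real corpus directory.

```python
import tempfile
from pathlib import Path


def ensure_show_folder(directory, show_title):
    """Create <directory>/<show_title>, including missing parents.

    Calling it again for an existing folder is a no-op.
    """
    folder = Path(directory) / show_title
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# A temporary directory stands in for the real corpus directory.
with tempfile.TemporaryDirectory() as corpus_dir:
    nested = Path(corpus_dir) / 'finnish_shows' / 'subtitles'  # does not exist yet
    folder = ensure_show_folder(nested, 'Uusi_Päivä')
    created = folder.is_dir()
    same_again = ensure_show_folder(nested, 'Uusi_Päivä') == folder
    print(folder.name, created, same_again)
```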

In [9]:
Uusi_Päivä = scrape_subs_yle(show_title, directory, webdriver_path, home_url)

In [10]:
Uusi_Päivä.get_lines()

subtitle crawling initiated for  Jakso 14: Etsii kuin neulaa heinäsuovasta
Jakso 14: Etsii kuin neulaa heinäsuovasta  is finished copying
Jakso 14: Etsii kuin neulaa heinäsuovasta  saved to  /Users/paulp/Desktop/finnish_shows/subtitles/Uusi_Päivä
Continue crawling next episode? enter 'y' or 'n'
y
subtitle crawling initiated for  Jakso 15: Ihminen vanhentuu, vaivat nuorentuu
Jakso 15: Ihminen vanhentuu, vaivat nuorentuu  is finished copying
Jakso 15: Ihminen vanhentuu, vaivat nuorentuu  saved to  /Users/paulp/Desktop/finnish_shows/subtitles/Uusi_Päivä
Continue crawling next episode? enter 'y' or 'n'
n
Crawling stopped


SystemExit: 

In [9]:
Uusi_Päivä.episode_lines

In [None]:
Uusi_Päivä.write_to_txt()