# Scraping "Invest Like The Best" Transcripts

This script downloads the text transcripts of the podcast "Invest Like The Best" from the Colossus website: https://www.joincolossus.com/episodes

Note: the exercise contained in this repo is for exploratory purposes only. ALL CREDIT for the production of this great podcast goes to the team at https://www.joincolossus.com/.

As a first step, I visited the podcast page on Apple Podcasts (https://podcasts.apple.com/us/podcast/invest-like-the-best-with-patrick-oshaughnessy/id1154105909) and saved the page's HTML, which contains the URL of each episode page. The saved HTML page is included in the repo (`apple_podcasts_html.html`).

For the purposes of this script, the path to the HTML page is `C:\code\podcasts\apple_podcasts_html.html`

I used BeautifulSoup to pull the episode URLs from the saved HTML page. To get the URL of each episode's transcript, I used Selenium to locate the "Episode Webpage" link by its XPath, click it, and note the current URL after the click.

Then, for each episode link, the script navigates to the page, copies its contents (by sending Ctrl+A and Ctrl+C keystrokes through Selenium), and saves the clipboard contents in a file called transcript_[n].
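The link-extraction step described above can be sketched without any third-party packages. This is only an illustration: the notebook itself uses BeautifulSoup, whereas the sketch below swaps in the standard-library `html.parser`. The names `EpisodeLinkCollector` and `extract_episode_links` are hypothetical, and skipping repeated links is an added assumption that the notebook does not make.

```python
from html.parser import HTMLParser

# Prefix used in the notebook to recognise Apple Podcasts episode links
APPLE_PREFIX = "https://podcasts.apple.com/us/podcast/"


class EpisodeLinkCollector(HTMLParser):
    """Collect <a href> values that start with APPLE_PREFIX, in page order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # keep only Apple Podcasts links, and keep each link once
        if href and href.startswith(APPLE_PREFIX) and href not in self._seen:
            self._seen.add(href)
            self.links.append(href)


def extract_episode_links(html):
    """Return the matching links found in an HTML string."""
    collector = EpisodeLinkCollector()
    collector.feed(html)
    return collector.links
```

With the saved page, this could be called as `extract_episode_links(open(path, encoding='utf8').read())`, where `path` points to `apple_podcasts_html.html`.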



In [1]:
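# installs
%pip install beautifulsoup4
%pip install selenium
%pip install pandas
%pip install requests
# pywin32 provides win32clipboard, which the main code block uses (Windows only)
%pip install pywin32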

Note: you may need to restart the kernel to use updated packages.


## Selenium Setup

I am using Chrome Version 105.0.5195.127 (Official Build) (64-bit). I downloaded the matching ChromeDriver release from https://chromedriver.chromium.org/downloads (specifically https://chromedriver.storage.googleapis.com/index.html?path=105.0.5195.52/).

I saved the file at `C:\code\podcasts\chromedriver.exe` (also included in repo).
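A quick way to sanity-check that the browser and driver versions belong together is to compare their major version numbers. The sketch below is a rule of thumb rather than the official compatibility rule, and the helper names `major_version` and `versions_compatible` are hypothetical.

```python
def major_version(version):
    """Return the leading integer of a dotted version string, e.g. 105."""
    return int(version.strip().split(".")[0])


def versions_compatible(chrome_version, driver_version):
    """Assumption: Chrome and ChromeDriver should share a major version."""
    return major_version(chrome_version) == major_version(driver_version)


# The two versions mentioned above
print(versions_compatible("105.0.5195.127", "105.0.5195.52"))  # True
```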

In [6]:
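from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

# point Selenium to the Chrome driver (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service('c:/code/podcasts/chromedriver.exe'))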

# test the driver
driver.get('https://podcasts.apple.com/us/podcast/invest-like-the-best-with-patrick-oshaughnessy/id1154105909')


We need to ensure that we are logged into the Colossus website. To make an account, visit `https://www.joincolossus.com/register`

The code below should take the Selenium-controlled Chrome window to an episode transcript.

In [8]:
# test that we can view a transcript
driver.get('https://www.joincolossus.com/episodes/87713997/f-investing-in-enterprise-software?tab=transcript')
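The test URL above ends in `?tab=transcript`. Assuming other episode URLs accept the same query parameter (this is inferred only from that one URL), a small helper could add it to any episode URL. The name `with_transcript_tab` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_transcript_tab(url):
    """Return `url` with the query parameter tab=transcript set."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["tab"] = "transcript"  # overwrite any existing tab value
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_transcript_tab(
    "https://www.joincolossus.com/episodes/87713997/f-investing-in-enterprise-software"
))
```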

Below is the main code block.

For each link in the Apple Podcasts page:
- click the element at the XPath of the "Episode Webpage" link (which takes us to the Colossus transcript page for that episode)
- select all and copy the text on the page (Ctrl+A, Ctrl+C)
- save the clipboard data in a file called transcript_[n]
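The last step in the list, saving each transcript to a numbered file, can be sketched on its own. This is a minimal version that assumes the text is already available as a Python string; the helper name `save_transcript` and the `out_dir` parameter are hypothetical. Writing with an explicit UTF-8 encoding avoids depending on the platform's default encoding.

```python
import tempfile
from pathlib import Path


def save_transcript(text, index, out_dir="."):
    """Write `text` to out_dir/transcript_<index> as UTF-8 and return the path."""
    out_path = Path(out_dir) / f"transcript_{index}"
    out_path.write_text(text, encoding="utf8")
    return out_path


# Usage: write one sample transcript into a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = save_transcript("\u201cHello and welcome\u201d", 1, tmp)
    print(path.name)  # transcript_1
```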
    


In [9]:
from bs4 import BeautifulSoup
import urllib.request
import re

# open the HTML page we saved manually from Apple Podcasts and parse with BeautifulSoup
with open('C:/code/podcasts/apple_podcasts_html.html', encoding='utf8') as fp:
    outer_soup = BeautifulSoup(fp, 'html.parser')

# For each podcast link (one that starts with https://podcasts.apple.com/us/podcast/), pull the transcript using Selenium
counter = 1
for apple_link_html in outer_soup.find_all('a', attrs={'href': re.compile(r'^https://podcasts\.apple\.com/us/podcast/')}):
    apple_link = apple_link_html.get('href')

    # navigate to the episode's Apple Podcasts page
    driver.get(apple_link)
    driver.implicitly_wait(5)

    # click the "Episode Webpage" link, located by its absolute xpath (skip the episode if it is missing)
    episode_link_xpath = '/html/body/div[5]/main/div[2]/div/div/section/div[1]/div[2]/div/div[2]/header/div/ul[1]/li[1]/a'
    if driver.find_elements('xpath', episode_link_xpath):
        driver.find_element('xpath', episode_link_xpath).click()

        # note: information on how to get the xpath of an element: https://stackoverflow.com/a/42194160

        # print('inner url:')
        # print(driver.current_url)

        # simply ctrl+a, ctrl+c to copy the text
        element = driver.find_element('tag name', 'body')

        # ctrl + a
        actions_select_all = ActionChains(driver)
        actions_select_all.key_down(Keys.CONTROL)
        actions_select_all.send_keys("a")
        actions_select_all.key_up(Keys.CONTROL)
        actions_select_all.perform()

        # ctrl + c
        actions_copy = ActionChains(driver)
        actions_copy.key_down(Keys.CONTROL)
        actions_copy.send_keys("c")
        actions_copy.key_up(Keys.CONTROL)
        actions_copy.perform()

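        # save clipboard data (win32clipboard comes from pywin32; Windows only)
        import win32clipboard
        win32clipboard.OpenClipboard()
        # request Unicode text explicitly so that a str is returned
        transcript = win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
        win32clipboard.CloseClipboard()

        # write as UTF-8 so non-ASCII characters in the transcript do not raise an encoding error
        file_name = 'transcript_{}'.format(counter)
        with open(file_name, 'w', encoding='utf8') as text_file:
            text_file.write(transcript)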

        counter = counter + 1
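The clipboard round-trip above depends on `win32clipboard`, so it only works on Windows. A possible alternative, which is not what this notebook does, is to read the text of the `<body>` element directly from Selenium. The sketch below assumes `driver` is the Selenium WebDriver created earlier; the helper name `page_text` is hypothetical, and the text it returns may differ slightly from what Ctrl+A and Ctrl+C would copy.

```python
def page_text(driver):
    """Return the visible text of the current page's <body> element."""
    # 'tag name' is the same locator strategy string used in the loop above
    return driver.find_element('tag name', 'body').text
```

Inside the loop, `transcript = page_text(driver)` could then stand in for the keystroke and clipboard steps.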