# This notebook scrapes full episodes from the PBS NewsHour website

The URL for each episode is highly predictable, so this is easy, although not completely straightforward.  Here are some examples from the past week:

Sunday June 24 2018  
https://www.pbs.org/newshour/show/pbs-newshour-weekend-full-episode-june-24-2018

Monday June 25 2018  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-25-2018

Tuesday June 26 2018  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-26-2018

Wednesday June 27 2018  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-27-2018

Thursday June 28 2018  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-28-2018

Friday June 29 2018  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-29-2018

Saturday June 30 2018  
https://www.pbs.org/newshour/show/pbs-newshour-weekend-full-episode-june-30-2018

#### The pattern is pretty formulaic

If it is a weekday the URL is simply...  
`https://www.pbs.org/newshour/show/pbs-newshour-full-episode-MONTH-DAY-YEAR`

And if the day falls on a weekend, the URL is slightly different...  
`https://www.pbs.org/newshour/show/pbs-newshour-`<strong>`weekend-`</strong>`full-episode-MONTH-DAY-YEAR`
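
The two patterns above can be sketched as a small helper. This is only an illustration under the assumptions stated in the comments: `episode_url`, `BASE_URL`, and `MONTHS` are names made up for this sketch, and the month names are spelled out by hand so the result does not depend on the machine's locale.

```python
from datetime import date

BASE_URL = "https://www.pbs.org/newshour/show/"
MONTHS = ("january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december")

def episode_url(day: date) -> str:
    """Build the expected full-episode URL for a given broadcast date."""
    # date.weekday() returns 0 for Monday through 6 for Sunday
    is_weekend = day.weekday() >= 5
    prefix = "pbs-newshour-weekend-full-episode" if is_weekend else "pbs-newshour-full-episode"
    # Assumption: the day of the month is written without a leading zero
    return f"{BASE_URL}{prefix}-{MONTHS[day.month - 1]}-{day.day}-{day.year}"

print(episode_url(date(2018, 6, 29)))  # a Friday
print(episode_url(date(2018, 6, 30)))  # a Saturday
```

The helper only predicts where an episode should live; it does not check that the page actually exists.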

### Let's see if this pattern actually works for older videos

I'm going to go back in time year by year until this pattern breaks.

---

Friday June 30 2017 - <strong>Exists!</strong>  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-30-2017

Thursday June 30 2016 - <strong>Exists!</strong>  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-30-2016

Tuesday June 30 2015 - <strong>Exists!</strong>  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-30-2015

Monday June 30 2014  - <strong>unavailable</strong>  

Sunday June 30 2013 - <strong>unavailable</strong>  

---

The earliest date I could find is the March 2 2016 episode  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-march-2-2016

There also appear to be pockets of missing episodes.  For instance, the episode for Wednesday February 1st 2017 doesn't exist.  
https://www.pbs.org/newshour/show/pbs-newshour-full-episode-february-1-2017
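
Checking dates by hand gets tedious, so the availability check could be automated roughly like this. It is a sketch, not a tested crawler: `page_exists` and `split_by_availability` are hypothetical helpers, and the code assumes that a missing episode is answered with an HTTP error status (such as 404) rather than a normal page that merely says the episode is unavailable.

```python
import urllib.error
import urllib.request

def page_exists(url, timeout=10):
    """Return True if the page loads, False if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # Assumption: a missing episode comes back as an HTTP error such as 404
        return False

def split_by_availability(urls, exists=page_exists):
    """Sort URLs into (found, missing) lists using the `exists` check."""
    found, missing = [], []
    for url in urls:
        (found if exists(url) else missing).append(url)
    return found, missing

# Example (performs real network requests, so it is left commented out):
# found, missing = split_by_availability([
#     "https://www.pbs.org/newshour/show/pbs-newshour-full-episode-march-2-2016",
#     "https://www.pbs.org/newshour/show/pbs-newshour-full-episode-february-1-2017",
# ])
```

The `exists` parameter is there so the check can be swapped out, for example with a cached lookup, without touching the loop.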

### Let's scrape one video and see what it looks like

In [2]:
# Suppresses warnings that occasionally show up during imports
import warnings
warnings.filterwarnings('ignore')

# Libraries for website scraping
from bs4 import BeautifulSoup
import requests
import urllib.request

# URL of the episode page to scrape
website = "https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-29-2018"

# Retrieving website data
page = urllib.request.urlopen(website)
page = BeautifulSoup(page, "html.parser")

# Display pulled page
page

<!DOCTYPE html>
<html class="no-js" lang="en-us">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>PBS NewsHour full episode June 29, 2018 | PBS NewsHour</title>
<meta content="en_US" property="og:locale"/>
<meta content="website" property="og:type"/>
<meta content="PBS NewsHour" property="og:site_name"/>
<meta content="PBS NewsHour full episode June 29, 2018" property="og:title"/>
<meta content="https://d3i6fh83elv35t.cloudfront.net/static/2018/06/Main15-1024x682.jpg" property="og:image"/>
<meta content="682" property="og:image:height"/>
<meta content="1024" property="og:image:width"/>
<meta content="https://www.pbs.org/newshour/show/pbs-newshour-full-episode-june-29-2018" property="og:url"/>
<meta content="https://www.facebook.com/newshour/" property="article:publisher"/>
<meta content="2018-06-29T19:34:59-04:00" property="article:published_time"/>
<meta content="Episode" property="article:section"/>
<meta content="1141508786

In [22]:
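from selenium import webdriver
import time
from selenium.webdriver.firefox.options import Options

def render_page(url):
    """Load url in headless Firefox and return the JavaScript-rendered HTML."""
    options = Options()
    options.add_argument("--headless")

    # `options=` replaces the deprecated `firefox_options=` keyword
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Give the page's scripts a few seconds to finish rendering
        time.sleep(3)
        return driver.page_source
    finally:
        # Always close the browser so headless Firefox processes don't pile up
        driver.quit()

pbs = render_page(website)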

In [3]:
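import re

soup = BeautifulSoup(pbs, "html.parser")

# Find the "Listen to the Broadcast" heading text
# (`string=` is the current name of the older `text=` argument)
sound = soup.find_all(string=re.compile('Listen to the Broadcast'))

# Walk up from the text: <h2> -> <header> -> the enclosing audio <div>
sounds = [s.parent.parent.parent for s in sound]
assert len(sounds) == 1
sounds = sounds[0]

sounds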

<div class="video-single__audio" id="audio">
<header class="video-single__audio-header-wrap">
<h2 class="video-single__audio-header video-single__audio-header--small">Listen to the Broadcast</h2>
<a class="video-single__audio-subscribe" href="https://www.pbs.org/newshour/podcasts">Subscribe<span> to the Full Show Podcast</span></a>
</header>
<div class="audioplayer"><audio class="js-audio" controls="" preload="none" style="width: 0px; height: 0px; visibility: hidden;">
<source src="https://d3i6fh83elv35t.cloudfront.net/static/2018/06/20180629_Fullshow.mp3"/>
</audio><div class="audioplayer-playpause" title="Play"><a href="#">Play</a></div><div class="audioplayer-time audioplayer-time-current">00:00</div><div class="audioplayer-bar"><div class="audioplayer-bar-loaded"></div><div class="audioplayer-bar-played"></div></div><div class="audioplayer-time audioplayer-time-duration">…</div><div class="audioplayer-volume"><div class="audioplayer-volume-button" title="Volume"><a href="#">Volume<
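
The part of this output that matters is the `<source>` tag inside `<audio>`, whose `src` attribute is the MP3 link. A minimal sketch of pulling it out follows. It deliberately uses the standard library's `HTMLParser` instead of BeautifulSoup so that it runs without extra packages; `AudioSourceFinder` and `find_audio_sources` are names invented for this example, and the `snippet` is a trimmed copy of the output above.

```python
from html.parser import HTMLParser

class AudioSourceFinder(HTMLParser):
    """Collect the src of every <source> tag nested inside an <audio> tag."""

    def __init__(self):
        super().__init__()
        self.in_audio = False
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "audio":
            self.in_audio = True
        elif tag == "source" and self.in_audio:
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

    def handle_endtag(self, tag):
        if tag == "audio":
            self.in_audio = False

def find_audio_sources(html):
    finder = AudioSourceFinder()
    finder.feed(html)
    finder.close()
    return finder.sources

snippet = '''<div class="audioplayer"><audio class="js-audio" controls="" preload="none">
<source src="https://d3i6fh83elv35t.cloudfront.net/static/2018/06/20180629_Fullshow.mp3"/>
</audio></div>'''

print(find_audio_sources(snippet))
```

With the `sounds` tag from the cell above already in hand, the BeautifulSoup equivalent would be `sounds.find("source")["src"]`.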

### Open MP3 from PBS

(Open for different test MP3s)

<div hidden>
Pink Floyd  
'https://archive.org/download/PinkFloyd07CarefullWithThatAxeEugene/02%20-%20Learning%20To%20Fly.mp3'

DubStep  
"http://www.bensound.org/bensound-music/bensound-dubstep.mp3"

This seemed close to what I want...

"""
from pygame import mixer

mixer.init()
r = requests.get('https://www.sample-videos.com/audio/mp3/wave.mp3', stream=False)
mixer.music.load(r.raw)
mixer.music.play()
"""
</div>

In [10]:
from pydub import AudioSegment
from pydub.playback import play

# Direct link to the episode audio, taken from the <source> tag above
mp3_url = "https://d3i6fh83elv35t.cloudfront.net/static/2018/06/20180629_Fullshow.mp3"
mp3_path = '/Users/pbezuhov/Desktop/test.mp3'

# Download the MP3 to disk
with urllib.request.urlopen(mp3_url) as page, open(mp3_path, 'wb') as output:
    output.write(page.read())

# Decode the MP3 (pydub needs ffmpeg or libav installed) and play it
song = AudioSegment.from_mp3(mp3_path)
play(song)

KeyboardInterrupt: 
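
The download step reads the whole episode into memory and writes it to a hard-coded path. One possible alternative, sketched here with a hypothetical helper named `download_to_temp`, streams the response into a temporary file instead. It assumes a temporary file is an acceptable place for the audio and that the caller deletes it when finished.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download_to_temp(url, suffix=".mp3", timeout=30):
    """Stream `url` into a temporary file and return the file's path."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # delete=False keeps the file on disk after the handle is closed
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as output:
            # copyfileobj copies in chunks, so the episode is never fully in memory
            shutil.copyfileobj(response, output)
            return Path(output.name)

# Example (performs a real download, so it is left commented out):
# mp3_path = download_to_temp(
#     "https://d3i6fh83elv35t.cloudfront.net/static/2018/06/20180629_Fullshow.mp3")
# ...
# mp3_path.unlink()  # remove the temporary file when done
```

The returned path can be passed to `AudioSegment.from_mp3` in place of the Desktop path.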

In [18]:
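import urllib.request
import speech_recognition as sr
import subprocess
import os

url = 'https://d3i6fh83elv35t.cloudfront.net/static/2018/06/20180629_Fullshow.mp3'

# Download the episode audio (the file is an MP3, so name it accordingly)
with urllib.request.urlopen(url) as mp3file, open("test.mp3", "wb") as handle:
    handle.write(mp3file.read())

# Convert the MP3 to WAV, a format sr.AudioFile can read.
# avconv (from Libav) must be on the PATH; ffmpeg accepts the same arguments.
cmdline = ['avconv',
           '-i',
           'test.mp3',
           '-vn',
           '-f',
           'wav',
           'test.wav']
subprocess.call(cmdline)

# Load the whole WAV file and send it to Google's speech recognition service
r = sr.Recognizer()
with sr.AudioFile('test.wav') as source:
    audio = r.record(source)

command = r.recognize_google(audio)
print(command)

# Clean up the temporary files
os.remove("test.mp3")
os.remove("test.wav")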

FileNotFoundError: [Errno 2] No such file or directory: 'avconv': 'avconv'
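
The error means that no `avconv` executable is on the PATH. A small sketch of one way around it, assuming either `ffmpeg` or `avconv` may be installed: look for whichever is available before running the conversion. `find_converter`, `build_convert_command`, and `mp3_to_wav` are hypothetical helpers written for this example; the command-line arguments are the same ones used in the cell above.

```python
import shutil
import subprocess

def find_converter(candidates=("ffmpeg", "avconv")):
    """Return the path of the first converter found on the PATH."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    raise RuntimeError(f"none of {candidates} is installed or on the PATH")

def build_convert_command(converter, source, target):
    """Build the conversion command: ignore video streams, write WAV output."""
    return [converter, "-i", source, "-vn", "-f", "wav", target]

def mp3_to_wav(source, target):
    """Convert `source` to WAV at `target` using whichever converter exists."""
    command = build_convert_command(find_converter(), source, target)
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(command, check=True)

# Example (needs ffmpeg or avconv installed, so it is left commented out):
# mp3_to_wav("test.mp3", "test.wav")
```

Failing early with a clear message is the main point here; a missing converter then surfaces before any audio is downloaded.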