# This notebook takes the transcripts from clips of PBS Newshour

Clips are a lot easier to work with than full episodes.  I can't find transcripts for full episodes, which makes them harder to use, whereas a clip on PBS Newshour nearly always has a transcript.  This is also a lot less data (text is far smaller than sound or video).

### Imports

In [1]:
### Removes warnings that occasionally show in imports
import warnings
warnings.filterwarnings('ignore')

### Libraries for website scraping
from bs4 import BeautifulSoup
import requests
import urllib.request  # urlopen lives in the urllib.request submodule

# Regular expressions
import re

# Progress bar
from tqdm import tqdm

### Standard imports
import pandas as pd
import numpy as np

### Plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

### Clips Timeline

The [earliest clip](https://www.pbs.org/newshour/show/robert-macneil-and-jim-lehrer-and-the-watergate-hearings) covers Nixon's Watergate scandal and is dated May 17, 1973.

- Unfortunately, this clip does not have a transcript (the transcript is blank).  I suspect this is common, so the scraper should watch out for it.
 
The early clips are sparse, with large gaps between them: the [tenth oldest clip](https://www.pbs.org/newshour/show/early-years-of-aids-deaths-fuel-fear) is from Sep 4, 1985, and the [100th oldest clip](https://www.pbs.org/newshour/show/george-shultz-working-age-89) is relatively recent, from Dec 16, 2009.

### Scraping Structure

#### Page iteration

We should first iterate over each page of clips.  The URL for each page is https://www.pbs.org/newshour/video/page/<strong>PAGE_NUMBER</strong>.  As of writing, there are 980 pages of clips.  The videos on each page appear to be stored in `<div id="all-videos"> ... </div>`, and the link to each clip is stored in an `<a href="URL"> ... </a>` block.  This is very convenient!

<u>Note</u>: One tricky part is that I must skip the full episodes shown on each page, since full episodes have no transcripts.

In [2]:
PAGE_NUM = 5 # How many pages to scrape

url_dir = "https://www.pbs.org/newshour/video/page/"

for i in range(1, PAGE_NUM+1):
    print(f"\nPage {i}:\n")
    
    # Retrieving website data
    page = urllib.request.urlopen(f"{url_dir}{i}")
    page = BeautifulSoup(page)

    # Display pulled video titles
    videos = page.find("div", {"id": "all-videos"})
    videos = videos.find_all("article", {"class": "card-sm"})
    for video in videos:
        # Skip full episodes, which have no transcripts
        is_full = video.find_all(string=re.compile('Full Episode'))
        if is_full:
            continue
        video = video.find("a", {"class": "card-sm__title"})
        print(video.span.text)


Page 1:

WATCH: White House declines to name Supreme Court candidates
Mexico votes amid violence with hopes for change
Vietnam-era cultural luminaries reunite and reflect on the power of protest
Comedian Cameron Esposito tackles sexual assault in new special “Rape Jokes”
Stiffer bonds keeping some migrant families apart longer
Stockton’s young mayor giving city’s youth more opportunities
News Wrap: Trump narrows list of potential Supreme Court nominees
‘I didn’t know we could be separated’: Migrant teen has no idea when he will see his dad
White House: Family separation executive order is ‘temporary reprieve,’ but immigration law ‘ties our hands’
In the middle of tragedy, Capital Gazette sticks to its mission
EU’s migrant compromise leaves many unanswered questions
Tapping national rage, Mexican election frontrunner promises to overturn the system
Shields and Brooks on Trump’s Supreme Court politics, Ocasio-Cortez’s primary upset
WATCH: Trump says attack on Capital Gazette newspaper ‘

### Transcript scanning

Once we arrive at a clip, we want five things: (1) the page URL, which is useful for debugging, (2) the text stored under "Story", which is essentially a summary of the clip, (3) the date and time, (4) the video title, and (5) the transcript.  Let's tackle each of these in turn.  It's important to note that some clips don't have transcripts ([example](https://www.pbs.org/newshour/show/voices-from-sandy-madeleine-conway-breezy-point)) and some clips have blank transcripts ([example]()).

<u>Note</u>: It would also be nice to get video tags (politics, opinion, technology, etc.) if they're available.

#### 1.) Page URL

We already have this info since we arrived at the proper page!

#### 2.) "Story" text

The structure seems to be...

`<div id="story">
  <div class="body-text">
    <p>
      OUR TEXT
    </p>
  </div>
</div>`

Other times it's stored as ([example](https://www.pbs.org/newshour/show/shields-and-brooks-on-trumps-supreme-court-politics-ocasio-cortezs-primary-upset))...

`<div class="vt__excerpt body-text">
  <p>
    OUR_TEXT
  </p>
</div>`
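A minimal sketch of handling both layouts, run against inline HTML rather than the live site.  The helper name `extract_story` and the sample strings are my own placeholders, and the CSS selectors simply mirror the two structures shown above; the real pages may nest things differently.

```python
from bs4 import BeautifulSoup

def extract_story(html):
    """Return the first non-empty summary paragraph from either layout, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Layout 1: <div id="story"> ... <p>
    # Layout 2: <div class="vt__excerpt body-text"> ... <p>
    for selector in ("div#story p", "div.vt__excerpt.body-text p"):
        paragraph = soup.select_one(selector)
        if paragraph is not None and paragraph.get_text(strip=True):
            return paragraph.get_text(strip=True)
    return None  # no summary found in either layout

# Hypothetical sample snippets modelled on the two structures above
layout_1 = '<div id="story"><div class="body-text"><p> OUR TEXT </p></div></div>'
layout_2 = '<div class="vt__excerpt body-text"><p>OUR_TEXT</p></div>'
no_text = '<div id="story"><div class="body-text"></div></div>'

print(extract_story(layout_1), extract_story(layout_2), extract_story(no_text))
```

Returning `None` for a story div with no text keeps "missing" and "empty" summaries in the same bucket, which should make them easy to count later.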

#### 3.) Date and time

This seems to be stored in a `<time> ... </time>` block.  I wasn't even aware this element existed in HTML; hopefully there is only one per clip page.

This looks like it might need some formatting later, but let's worry about that then and just store it as a string for now.
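For when the formatting does become necessary, here is a rough sketch of one way to parse the string.  The input format (something like `Jun 29, 2018 6:35 PM EDT`) is an assumption on my part and has not been checked against the scraped values, and `parse_clip_date` is a hypothetical helper name.

```python
from datetime import datetime

def parse_clip_date(raw):
    """Parse a date string such as 'Jun 29, 2018 6:35 PM EDT' (assumed format)."""
    parts = raw.split()
    # Drop a trailing timezone abbreviation (e.g. 'EDT'); the result is a naive datetime
    if parts and parts[-1].isalpha() and parts[-1].upper() not in ("AM", "PM"):
        parts = parts[:-1]
    cleaned = " ".join(parts)
    # Try date-with-time first, then date only
    for fmt in ("%b %d, %Y %I:%M %p", "%b %d, %Y"):
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None  # unknown format: leave for manual inspection

print(parse_clip_date("Jun 29, 2018 6:35 PM EDT"))
```

Anything that doesn't match comes back as `None` instead of raising, so unexpected formats can be listed and handled afterwards.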

#### 4.) Title

We already have this as well.  It was stored next to the clip URL.

#### 5.) Transcript

This might be the trickiest part as some don't have transcripts.  However the ones that do ([example](https://www.pbs.org/newshour/show/shields-and-brooks-on-trumps-supreme-court-politics-ocasio-cortezs-primary-upset)) are stored as...


`<ul class="video-transcript">
  <li>
    <div>
      <p><strong>SPEAKER_NAME</strong></p>
      <p>SPEAKER_TEXT</p>
      ...
      <p>SPEAKER_TEXT</p>
    </div>
  </li>
</ul>`

<u>Note</u>: Each `<li></li>` is one person talking.

Older videos ([example](https://www.pbs.org/newshour/show/troubles-for-the-holy-see-ahead-of-papal-elections)) seem to need a click on 'Transcripts' to unhide the text.  It also seems possible to append `#story` or `#transcript` to the URL to get the information needed.  The transcript is stored differently as well.

`<div id="transcript">
  <ul class="video-transcript">
    <li>PERSON</li>
  </ul>
</div>`

This [clip](https://www.pbs.org/newshour/show/troubles-for-the-holy-see-ahead-of-papal-elections) also seems to contain a mistake: one person's text carries over into the next `li`.  I think this can be remedied by checking whether a `<strong>` exists.
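The `<strong>` check could look roughly like the sketch below, which runs on a small inline sample instead of a live page.  The function name `parse_transcript` and the sample HTML are placeholders I made up for illustration; an `<li>` without a `<strong>` label is appended to the previous speaker's speech.

```python
from bs4 import BeautifulSoup

def parse_transcript(html):
    """Return [speaker, paragraphs] pairs, merging <li> items that lack a <strong> label."""
    soup = BeautifulSoup(html, "html.parser")
    speeches = []
    for item in soup.select("ul.video-transcript li"):
        paragraphs = item.find_all("p")
        if not paragraphs:
            continue  # blank <li>, nothing to record
        label = paragraphs[0].find("strong")
        if label is not None:
            # A labelled <li> starts a new speech
            speaker = label.get_text(strip=True).rstrip(":")
            speeches.append([speaker, []])
            paragraphs = paragraphs[1:]
        elif not speeches:
            # Text before any label: keep it under a None speaker
            speeches.append([None, []])
        # Unlabelled text is attached to the most recent speech
        speeches[-1][1].extend(p.get_text(strip=True) for p in paragraphs)
    return speeches

# Hypothetical sample: the third <li> has no <strong>, so it continues the second speech
sample = """
<ul class="video-transcript">
  <li><div><p><strong>JUDY WOODRUFF:</strong></p><p>Welcome to you both.</p></div></li>
  <li><div><p><strong>DAVID BROOKS:</strong></p><p>Well, the split is interesting.</p></div></li>
  <li><div><p>It carries over into this item.</p></div></li>
</ul>
"""
print(parse_transcript(sample))
```

A page with a blank transcript simply yields an empty list here, which is one way to spot those clips.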

In [8]:
def get_story(url):
    page = urllib.request.urlopen(f"{url}#story")
    page = BeautifulSoup(page)
    
    # Layout 1
    story = page.find("div", {"id": "story"})
    if story is not None:
        story = story.find("p")
        if story is not None:
            return story.text
    
    # Layout 2
    story = page.find("div", {"id": "transcript"})
    if story is not None:
        story = story.find("div", {"class": "vt__excerpt body-text"})
        if story is not None:
            story = story.find("p")
            if story is not None:
                return story.text
    
    # Some clips don't have a summary story        
    return None
    
def get_date(page):
    date = page.find("time").text
    return date

def get_comments(page):
    comments = page.find("button", {"class": "comments__btn"})
    comments = comments.find("span").text
    return comments

def get_transcript(page):
    speakers = set()
    
    transcript = page.find("div", {"id": "transcript"})
    
    if transcript is None:
        return None, None
    
    transcript = transcript.find_all("li")
    
    speeches = list()
    speaker = None  # Carried over when an <li> has no <strong> label
    
    for speech in transcript:
        paragraphs = speech.find_all("p")
        text = list()
        
        # Skip blank <li> items (avoids an IndexError on paragraphs[0])
        if not paragraphs:
            continue
        
        label = paragraphs[0].find("strong")
        if label is not None:
            # Drop the colon and any title after the name
            name = label.text.replace(":", "").split(",")[0]
            if not name.isupper():
                continue
            speaker = name
            speakers.add(speaker)
            paragraphs = paragraphs[1:]
            
        for paragraph in paragraphs:
            text.append(paragraph.text)
            
        speeches.append([speaker, text])
        
    return speakers, speeches

def get_page_info(url):
    page = urllib.request.urlopen(f"{url}#transcript")
    page = BeautifulSoup(page)
    
    story = get_story(url)
    date  = get_date(page)
    speakers, transcript = get_transcript(page)
    comments = get_comments(page)
    
    return story, date, transcript, speakers, comments

### Edge cases:

- Speaker names need to be all caps, with any title stripped (Ex: "from the Washington Post") ([link](https://www.pbs.org/newshour/show/brooks-and-marcus-on-gop-backlash-to-trump-and-cruz-clinton-sanders-practicality-debate))
- Some clips don't have a transcript ([link](http://localhost:7000/notebooks/git/misc/in_progress/speech_pbs_scraping/transcripts.ipynb))
- This page has many clips with broken links ([link](https://www.pbs.org/newshour/video/page/650))
- Some clips have a story, but no transcript ([link](https://www.pbs.org/newshour/show/danica-patrick-inspires-families-to-the-racetrack))
- This clip has a div for story, but no text ([link](https://www.pbs.org/newshour/show/news-wrap-russia-expels-diplomats-in-reprisal-against-u-s-and-others#story))
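The speaker rule in the first bullet can be sketched in isolation as follows.  `clean_speaker` is a hypothetical helper name and the labels are made-up examples; the logic is the same strip-then-check-caps idea used in `get_transcript`.

```python
def clean_speaker(label):
    """Return the bare speaker name, or None if the label is not an all-caps name."""
    # Remove the trailing colon and anything after the first comma (the title)
    name = label.replace(":", "").split(",")[0].strip()
    return name if name.isupper() else None

# Made-up labels for illustration
print(clean_speaker("RUTH MARCUS, Washington Post:"))
print(clean_speaker("Editor's note:"))
```

Labels that fail the all-caps check come back as `None`, so the caller can decide whether to skip the item or attach its text to the previous speaker.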

In [None]:
pbs_df = pd.DataFrame(columns = ["URL", "Story", "Date", "Title", "Transcript", "Speakers", "Number of Comments"])

START    = 1     # Start at this page
PAGE_NUM = 100   # How many pages to scrape
url_dir = "https://www.pbs.org/newshour/video/page/"

for i in tqdm(range(START, START+PAGE_NUM)):
    
    # Retrieving website data
    page = urllib.request.urlopen(f"{url_dir}{i}")
    page = BeautifulSoup(page)

    # Collect clip links from the page
    videos = page.find("div", {"id": "all-videos"})
    videos = videos.find_all("article", {"class": "card-sm"})
    for video in videos:
        # Skip full episodes, which have no transcripts
        is_full = video.find_all(string=re.compile('Full Episode'))
        if is_full:
            continue
        video = video.find("a", {"class": "card-sm__title"})
        
        url   = video["href"]
        title = video.span.text
                
        if url == "https://www.pbs.org/newshour/show/0":
            continue
        
        if url.replace("https://www.pbs.org/newshour/", "").split("/")[0] in ["politics", "nation", "world"]:
            continue
        
        try:
            story, date, transcript, speakers, comments = get_page_info(url)

            # Store info in df
            row = [url, story, date, title, transcript, speakers, comments]
            pbs_df.loc[len(pbs_df)] = row
        except Exception:
            # Log the failing URL and keep going; Ctrl-C still interrupts the loop
            print("FAILED")
            print(url)
            
pbs_df.head()




  0%|          | 0/100 [00:00<?, ?it/s]
  1%|          | 1/100 [00:40<1:06:29, 40.30s/it]
  2%|▏         | 2/100 [01:23<1:08:35, 42.00s/it]
  3%|▎         | 3/100 [02:10<1:10:35, 43.66s/it]
  4%|▍         | 4/100 [02:59<1:11:38, 44.77s/it]

In [88]:
pbs_df.shape
pbs_df.isnull().sum()

URL                   0
Story                 0
Date                  0
Title                 0
Transcript            0
Speakers              0
Number of Comments    0
dtype: int64

In [85]:
pbs_df[pbs_df.Story.isnull()].URL

Series([], Name: URL, dtype: object)

In [25]:
pbs_df.loc[2].Transcript

[['HARI SREENIVASAN',
  [' And now to the analysis of Brooks and Marcus. Judy spoke with them earlier today.']],
 ['JUDY WOODRUFF',
  [' And that is New York Times columnist David Brooks and Washington Post columnist Ruth Marcus. Mark Shields is away.',
   'And welcome to you both.',
   'So, as we just heard, a new front has opened up in this battle inside the Republican Party.',
   'David, you have this iconic magazine of the conservative movement, “The National Review,” going after Donald Trump, saying he’s a menace to conservatism. He’s coming back. Where is this headed?']],
 ['DAVID BROOKS',
  [' Well, the split is interesting.',
   'It’s sort of between people who are more ideologically- or philosophically-minded and those who are more rogue- and chaos-minded. And so the rogue side is Sarah Palin going with Trump. The people who are more ideologically conservative, whether it’s “National Review” or the Wall Street Journal editorial page, are suspicious of Trump because he’s ideolo

In [17]:
pbs_df.loc[2].URL

'https://www.pbs.org/newshour/show/brooks-and-marcus-on-gop-backlash-to-trump-and-cruz-clinton-sanders-practicality-debate'