# This notebook takes the transcripts from clips of PBS Newshour

Clips are a lot easier to work with than full episodes, and they give better results.  I can't find transcripts for full episodes, which makes them harder to use, whereas a clip on PBS NewsHour usually has a transcript.  Text is also far less data than sound or video.

### Imports

In [1]:
### Removes warnings that occasionally show in imports
import warnings
warnings.filterwarnings('ignore')

### Libraries for website scraping
from bs4 import BeautifulSoup
import requests
import urllib.request

# Regular expressions
import re

# Progress bar
from tqdm import tqdm

### Standard imports
import pandas as pd
import numpy as np

### Plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

### Clips Timeline

The [earliest clip](https://www.pbs.org/newshour/show/robert-macneil-and-jim-lehrer-and-the-watergate-hearings) is from Nixon's Watergate scandal, dated May 17, 1973.

- Unfortunately this clip does not have a transcript (the transcript is blank).  I suspect this is common, so the scraper should watch out for it.
 
The early clips are sparse, with large gaps between them: the [tenth oldest clip](https://www.pbs.org/newshour/show/early-years-of-aids-deaths-fuel-fear) is from Sep 4, 1985, and the [100th oldest clip](https://www.pbs.org/newshour/show/george-shultz-working-age-89) is relatively recent, from Dec 16, 2009.

### Scraping Structure

#### Page iteration

We should first iterate over each page of clips.  The URL for each page is https://www.pbs.org/newshour/video/page/<strong>PAGE_NUMBER</strong>.  As of writing, there are 980 pages of clips.  The videos on each page appear to be stored in `<div id="all-videos"> ... </div>`, and the link to each clip is stored in an `<a href="URL"> ... </a>` block.  This is very convenient!

<u>Note</u>: One tricky part is that full episodes shown on the page must be skipped, since full episodes contain no transcripts.

In [2]:
PAGE_NUM = 5 # How many pages to scrape

url_dir = "https://www.pbs.org/newshour/video/page/"

for i in range(1, PAGE_NUM+1):
    print(f"\nPage {i}:\n")
    
    # Retrieving website data
    page = urllib.request.urlopen(f"{url_dir}{i}")
    page = BeautifulSoup(page)

    # Display pulled video titles
    videos = page.find("div", {"id": "all-videos"})
    videos = videos.findAll("article", {"class": "card-sm"})
    for video in videos:
        is_full = video.find_all(text=re.compile('Full Episode'))
        if is_full:  # Skip full episodes, which have no transcript
            continue
        video = video.find("a", {"class": "card-sm__title"})
        print(video.span.text)


Page 1:

News Wrap: Trump interviews Supreme Court candidates
Elected in a landslide, can Mexico’s López Obrador deliver on dramatic promises?
Will U.S.-Mexico policy tensions change under López Obrador?
Yemen’s spiraling hunger crisis is a man-made disaster
#LivingWhileBlack: How does racial bias lead to unnecessary calls to police?
Amy Walter and Susan Page on Supreme Court stakes, ‘Abolish ICE’ politics
The high-wire act of being vice president
WATCH: White House declines to name Supreme Court candidates
Mexico votes amid violence with hopes for change
Vietnam-era cultural luminaries reunite and reflect on the power of protest
Comedian Cameron Esposito tackles sexual assault in new special “Rape Jokes”
Stiffer bonds keeping some migrant families apart longer
Stockton’s young mayor giving city’s youth more opportunities
News Wrap: Trump narrows list of potential Supreme Court nominees
‘I didn’t know we could be separated’: Migrant teen has no idea when he will see his dad
White Hous

### Transcript scanning

Once we arrive at a clip, we want a few things: (1) the page URL, which will be nice for debugging, (2) the text stored under "Story", which is essentially the summary of the clip, (3) the date and time, (4) the video title, and (5) the transcript.  Let's tackle each of these in turn.  It's important to note that some clips don't have transcripts ([example](https://www.pbs.org/newshour/show/voices-from-sandy-madeleine-conway-breezy-point)) and some clips have blank transcripts ([example]()).

<u>Note</u>: It would also be nice to get video tags if they're available, such as politics, opinion, technology, etc.

#### 1.) Page URL

We already have this info since we arrived at the proper page!

#### 2.) "Story" text

The structure seems to be...

`<div id="story">
  <div class="body-text">
    <p>
      OUR TEXT
    </p>
  </div>
</div>`

Other times it's stored as ([example](https://www.pbs.org/newshour/show/shields-and-brooks-on-trumps-supreme-court-politics-ocasio-cortezs-primary-upset))...

`<div class="vt__excerpt body-text">
  <p>
    OUR_TEXT
  </p>
</div>`

#### 3.) Date and time

This seems to be stored in a `<time> ... </time>` block.  I wasn't even aware this element existed in HTML; hopefully there is only one on each clip page.

This looks like it might need some formatting later, but let's worry about that then and just store it as a string for now.
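
When the formatting step does come, something like the following might work.  This is only a sketch: it assumes every date looks like the ones seen in the scraped output (for example `Jun 15, 2016 7:16 PM EDT`, sometimes wrapped in newlines and spaces), and the helper name `parse_pbs_date` is made up for illustration.  The timezone abbreviation is dropped rather than parsed, so the result is a naive datetime in whatever zone the page displayed.

```python
from datetime import datetime

def parse_pbs_date(raw):
    """Parse a string like 'Jun 15, 2016 7:16 PM EDT' into a naive datetime."""
    # Collapse the newlines and padding that can surround the <time> text
    cleaned = " ".join(raw.split())
    # Split off the trailing timezone abbreviation (assumed to be the last
    # word); strptime's %Z handling of abbreviations is platform dependent
    stamp, _, _tz = cleaned.rpartition(" ")
    return datetime.strptime(stamp, "%b %d, %Y %I:%M %p")

print(parse_pbs_date("\n    May 22, 2018 12:19 PM EDT\n  "))
```

If a page ever uses a different date layout, `strptime` raises a `ValueError`, which at least makes the odd cases easy to spot.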

#### 4.) Title

We already have this as well.  It was stored next to the clip URL.

#### 5.) Transcript

This might be the trickiest part as some don't have transcripts.  However the ones that do ([example](https://www.pbs.org/newshour/show/shields-and-brooks-on-trumps-supreme-court-politics-ocasio-cortezs-primary-upset)) are stored as...


`
    <ul class="video-transcript">
    <li>
        <div>
          <p><strong>SPEAKER_NAME</strong></p>
          <p>SPEAKER_TEXT</p>
          ...
          <p>SPEAKER_TEXT</p>
        </div>
    </li>
  </ul>
`

<u>Note</u>: Each `<li></li>` is an individual person talking

Older videos ([example](https://www.pbs.org/newshour/show/troubles-for-the-holy-see-ahead-of-papal-elections)) seem to need a click on 'Transcripts' to unhide the text.  It also seems possible to add `#story` and `#transcript` to the URL to get the information needed.  The transcript is stored differently as well.

`
    <div id="transcript">
    <ul class="video-transcript">
      <li>PERSON</li>
    </ul>
  </div>`

This [clip](https://www.pbs.org/newshour/show/troubles-for-the-holy-see-ahead-of-papal-elections) also seems to contain a mistake: one person's text carries over into the next `li`.  I think this can be remedied by checking whether a `<strong>` exists.
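
One possible way to handle the carry-over, sketched on already-extracted data rather than on live HTML: each `li` is assumed to have been reduced to a `(speaker, paragraphs)` pair, with `speaker` set to `None` when no `<strong>` was found.  The function name `merge_carryover` and the sample lines are invented for illustration.

```python
def merge_carryover(items):
    """Fold speaker-less items into the preceding speaker's speech.

    `items` is a list of (speaker, paragraphs) pairs, where speaker is None
    when the <li> had no <strong> tag.
    """
    speeches = []
    for speaker, paragraphs in items:
        if speaker is None and speeches:
            # No <strong>: treat the text as a continuation of the last speech
            speeches[-1][1].extend(paragraphs)
        else:
            speeches.append([speaker, list(paragraphs)])
    return speeches

# Made-up sample: the second item has no speaker tag
items = [
    ("JUDY WOODRUFF", ["Good evening."]),
    (None, ["More from the same speaker."]),
    ("GWEN IFILL", ["Thank you."]),
]
print(merge_carryover(items))
```

A speaker-less item at the very start has nothing to attach to, so it is kept with `None` as its speaker.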

In [2]:
def get_story(url):
    page = urllib.request.urlopen(f"{url}#story")
    page = BeautifulSoup(page)
    
    # Layout 1
    story = page.find("div", {"id": "story"})
    if story is not None:
        story = story.find("p")
        if story is not None:
            return story.text
    
    # Layout 2
    story = page.find("div", {"id": "transcript"})
    if story is not None:
        story = story.find("div", {"class": "vt__excerpt body-text"})
        if story is not None:
            story = story.find("p")
            if story is not None:
                return story.text
    
    # Some clips don't have a summary story        
    return None
    
def get_date(page):
    # Guard against pages without a <time> element
    date = page.find("time")
    if date is None:
        return None
    return date.text

def get_comments(page):
    comments = page.find("button", {"class": "comments__btn"})
    if comments is None:
        return None
    comments = comments.find("span")
    if comments is None:
        return None
    return comments.text

def get_transcript(page):
    speakers = set()
    
    transcript = page.find("div", {"id": "transcript"})
    
    if transcript is None:
        return None, None
    
    transcript = transcript.find_all("li")
    
    speeches = list()
    speaker = None  # Stays None until the first <strong> speaker tag is seen
    
    for speech in transcript:
        paragraphs = speech.find_all("p")
        text = list()
        
        # Skip empty list items so paragraphs[0] cannot raise an IndexError
        if not paragraphs:
            continue
        
        # A <strong> in the first paragraph marks a new speaker; without one,
        # the text is attributed to the previous speaker
        strong = paragraphs[0].find("strong")
        if strong is not None:
            speaker = strong.text.replace(":", "").split(",")[0]
            
            # This allows for some mistakes, see edge case (1) below
            # if not speaker.isupper():
            #     continue

            speakers.add(speaker)
            paragraphs = paragraphs[1:]
            
        for paragraph in paragraphs:
            text.append(paragraph.text)
            
        speeches.append([speaker, text])
        
    return speakers, speeches

def get_page_info(url):
    page = urllib.request.urlopen(f"{url}#transcript")
    page = BeautifulSoup(page)
    
    story = get_story(url)
    date  = get_date(page)
    speakers, transcript = get_transcript(page)
    comments = get_comments(page)
    
    return story, date, transcript, speakers, comments

### Edge cases:

- 1.) Speakers need to be all caps and no titles (Ex: "from the Washington Post") ([link](https://www.pbs.org/newshour/show/brooks-and-marcus-on-gop-backlash-to-trump-and-cruz-clinton-sanders-practicality-debate))
- 2.) Some clips don't have a transcript ([link](http://localhost:7000/notebooks/git/misc/in_progress/speech_pbs_scraping/transcripts.ipynb))
- 3.) This page has many clips with broken links ([link](https://www.pbs.org/newshour/video/page/650))
- 4.) Some clips have a story, but no transcript ([link](https://www.pbs.org/newshour/show/danica-patrick-inspires-families-to-the-racetrack))
- 5.) This clip has a div for story, but no text ([link](https://www.pbs.org/newshour/show/news-wrap-russia-expels-diplomats-in-reprisal-against-u-s-and-others#story))
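
Edge case (1) could be handled with a small filter along these lines.  This is a sketch, not something the scraper currently does: `clean_speaker` is a hypothetical helper, the first sample string is invented, and the all-caps rule would also reject any genuine name that contains a lowercase letter.

```python
def clean_speaker(raw):
    """Return the bare speaker name, or None if it does not look like one."""
    # Drop colons and any title that follows the first comma
    name = raw.replace(":", "").split(",")[0].strip()
    # Speaker names in the transcripts are assumed to be written in capitals
    return name if name and name.isupper() else None

samples = [
    "RUTH MARCUS, from the Washington Post:",  # invented example with a title
    "Editor’s note:",
    "Still to come on the “NewsHour”:",
    "GWEN IFILL:",
]
for raw in samples:
    print(repr(raw), "->", clean_speaker(raw))
```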


### FIXME Check this:
- ([link](https://www.pbs.org/newshour/show/gun-owning-group-in-oregon-advocates-for-firearm-safety))
- ([link](https://www.pbs.org/newshour/show/what-happens-when-i-try-to-talk-race-with-white-people))

It'd be nice if this used multiprocessing to speed this up!
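
A rough sketch of what that could look like.  It uses a thread pool from `concurrent.futures` instead of multiprocessing, on the assumption that the time is spent waiting on the network rather than on the CPU.  `fetch_all` and `fake_fetch` are made-up names; in the notebook the `fetch` argument would be `get_page_info` and the URLs would be the clip links gathered from the listing pages.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for each URL on a thread pool.

    Returns (results, failures): results maps url -> return value,
    failures maps url -> the exception that was raised.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as err:
                # Keep going when a single clip fails, but remember why
                failures[url] = err
    return results, failures

# Stand-in for a real page fetch so the sketch runs without network access
def fake_fetch(url):
    if url.endswith("/0"):
        raise ValueError("broken link")
    return len(url)

results, failures = fetch_all(["https://example.org/a", "https://example.org/0"], fake_fetch)
print(results, failures)
```

Keeping `max_workers` small is probably wise so the site is not hit with too many requests at once.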

In [3]:
pbs_df = pd.DataFrame(columns = ["URL", "Story", "Date", "Title", "Transcript", "Speakers", "Number of Comments"])

START    = 297    # Start at this page
PAGE_NUM = 100    # How many pages to scrape
url_dir = "https://www.pbs.org/newshour/video/page/"

for i in tqdm(range(START, START+PAGE_NUM)):
    
    # Retrieving website data
    page = urllib.request.urlopen(f"{url_dir}{i}")
    page = BeautifulSoup(page)

    # Display pulled video titles
    videos = page.find("div", {"id": "all-videos"})
    videos = videos.findAll("article", {"class": "card-sm"})
    for video in videos:
        is_full = video.find_all(text=re.compile('Full Episode'))
        if is_full:  # Skip full episodes, which have no transcript
            continue
        video = video.find("a", {"class": "card-sm__title"})
        
        url   = video["href"]
        title = video.span.text
                
        if url == "https://www.pbs.org/newshour/show/0":
            continue
        
        if url.replace("https://www.pbs.org/newshour/", "").split("/")[0] in ["politics", "nation", "world"]:
            continue
        
        try:
            story, date, transcript, speakers, comments = get_page_info(url)

            # Store info in df
            row = [url, story, date, title, transcript, speakers, comments]
            pbs_df.loc[len(pbs_df)] = row
        except Exception:
            # A bare except would also swallow KeyboardInterrupt
            print("FAILED")
            print(url)
            
pbs_df.head()

100%|██████████| 100/100 [1:25:17<00:00, 51.17s/it]


Unnamed: 0,URL,Story,Date,Title,Transcript,Speakers,Number of Comments
0,https://www.pbs.org/newshour/show/an-orlando-m...,"Rubana Khan of Orlando, in heartfelt verse, se...","Jun 15, 2016 7:16 PM EDT",An Orlando Muslim’s heartfelt words on nightcl...,"[[GWEN IFILL, [ We close tonight with a person...","{GWEN IFILL, RUBANA KHAN, JUDY WOODRUFF}",0
1,https://www.pbs.org/newshour/show/did-killer-o...,The investigation into the Orlando mass shooti...,"Jun 14, 2016 7:36 PM EDT",New details of Orlando gunman’s life raise mor...,"[[JUDY WOODRUFF, [ The search for answers in O...","{WILLIAM BRANGHAM, QUESTION, ANGEL COLON, KEVI...",0
2,https://www.pbs.org/newshour/show/news-wrap-tr...,"In our news wrap Tuesday, presumptive GOP nomi...","Jun 14, 2016 7:35 PM EDT",News Wrap: Trump’s ‘Muslim ban’ draws sharp re...,"[[JUDY WOODRUFF, [ Good evening. I’m Judy Wood...","{RIZWAN JAKA, Still to come on the “NewsHour”,...",0
3,https://www.pbs.org/newshour/show/its-the-weap...,The AR-15 is the most popular rifle in America...,"Jun 14, 2016 7:34 PM EDT",It’s the weapon of choice for U.S. mass murder...,"[[Editor’s note, [ This story focused on the A...","{PHILIP CARTER, GWEN IFILL, Editor’s note, JOH...",0
4,https://www.pbs.org/newshour/show/inside-russi...,"For nearly a year, Russian hackers have been p...","Jun 14, 2016 7:33 PM EDT",Inside Russian hacking of Democrats’ oppositio...,"[[GWEN IFILL, [ But, first: The Democratic Nat...","{GWEN IFILL, We get some insight on how this h...",0


In [4]:
print(f"Shape of PBS data: {pbs_df.shape}")
pbs_df.isnull().sum()

Shape of PBS data: (1807, 7)


URL                     0
Story                 199
Date                    0
Title                   0
Transcript            147
Speakers              147
Number of Comments      0
dtype: int64

In [10]:
pbs_df["Number of Comments"].value_counts()

0    1807
Name: Number of Comments, dtype: int64

In [9]:
pbs_df.iloc[261]

URL                   https://www.pbs.org/newshour/show/growing-wild...
Story                 At least 1,600 homes and buildings have burned...
Date                                            May 7, 2016 4:38 PM EDT
Title                 Growing wildfire in Canadian oil town displace...
Transcript                                                           []
Speakers                                                             {}
Number of Comments                                                    0
Name: 261, dtype: object

In [6]:
for i, t in enumerate(list(pbs_df.Transcript)):
    try:
        if t is not None and len(t) < 1:
            print(i, t)
    except:
        print(i)
        break

16 []
20 []
261 []
314 []
408 []
411 []
463 []
464 []
492 []
496 []
610 []
659 []
661 []
701 []
748 []
839 []
841 []
886 []
887 []
1029 []
1104 []
1105 []
1129 []
1149 []
1150 []
1151 []
1152 []
1183 []
1320 []
1440 []
1442 []
1443 []
1593 []
1767 []


In [21]:
pbs_df.iloc[261]

URL                   https://www.pbs.org/newshour/science/watch-liv...
Story                                                              None
Date                  \n                May 22, 2018 12:19 PM EDT\n ...
Title                 WATCH LIVE: Explosive eruption at Kilauea volc...
Transcript                                                         None
Speakers                                                           None
Number of Comments                                                    0
Name: 261, dtype: object

In [11]:
pbs_df.to_csv("/Users/pbezuhov/Desktop/PBS_297-397.csv", index=False)

In [53]:
import ast

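def string_literal(x):
    # Missing values come back from read_csv as NaN; leave them untouched
    if pd.isna(x):
        return np.nan
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError):
        # Plain strings (e.g. the Story column) are not Python literals
        return x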

test = pd.read_csv("/Users/pbezuhov/Desktop/PBS_1-100.csv")
for col in ["Transcript", "Story", "Speakers"]:
    test[col] = test[col].map(string_literal)

In [58]:
test.shape

(1466, 7)

In [60]:
test.iloc[1000].Transcript

[['HARI SREENIVASAN',
  [' The Centers for Disease Control and Prevention has declared this year’s flu outbreak to be the worst on record since swine flu in 2009. And it only seems to be getting worse. Last week alone one in 15 doctors appointments in the United States was for the flu this week. Seven children died of the flu bringing the total for the season to thirty seven children. The CDC predicts there are still several weeks to go in the flu season. For some perspective I’m joined by Stephen Ferrara assistant professor of nursing at Columbia University Medical Center here in New York. What’s different about this year’s flu.']],
 ['STEPHEN FERRARA',
  [' What we’re seeing this year is a lot of the strain of flu is H3N2 and that is a strain of influenza and this strain we know it tends to be more severe.']],
 ['HARI SREENIVASAN',
  [' Earlier this year we heard that there was a low level of efficacy based on something that was happening in Australia. Clear that up for us.']],
 ['ST

In [38]:


t = test.iloc[0].Transcript
t = ast.literal_eval(t)