# This notebook takes the transcripts from clips of PBS Newshour

Working with clips is a lot easier and works better than working with full episodes.  I can't find transcripts for full episodes, which makes them harder to use, whereas a clip on PBS Newshour generally comes with a transcript.  Text is also far less data than sound or video.

### Imports

In [1]:
import re
import json

### Removes warnings that occasionally show in imports
import warnings
warnings.filterwarnings('ignore')

### Libraries for website scraping
from bs4 import BeautifulSoup
import urllib.request  # `import urllib` alone does not guarantee the request submodule is loaded

### Clips Timeline

The [earliest clip](https://www.pbs.org/newshour/show/robert-macneil-and-jim-lehrer-and-the-watergate-hearings) covers Nixon's Watergate scandal and is dated May 17, 1973.

- Unfortunately this clip does not have a transcript (the transcript is blank).  I have a feeling this might be common so our scraping should watch out for this.
 
The early clips are sparse, with large gaps between them: the [tenth oldest clip](https://www.pbs.org/newshour/show/early-years-of-aids-deaths-fuel-fear) is from Sep 4, 1985, and the [100th oldest clip](https://www.pbs.org/newshour/show/george-shultz-working-age-89) is relatively recent, from Dec 16, 2009.

### Scraping Structure

#### Page iteration

We should first iterate over each page of clips.  The url for each page is https://www.pbs.org/newshour/video/page/<strong>PAGE_NUMBER</strong>.  As of typing, there are 980 pages of clips.  The videos for each page appear to be stored in `<div id="all-videos"> ... </div>`.  Of course the link to each clip is stored in a `<a href="URL"> ... </a>` block.  This is very convenient!

<u>Note</u>: One tricky part is I must skip full episodes shown on the page.  Full episodes contain no transcripts.

In [2]:
PAGE_NUM = 5  # How many pages to scrape

url_dir = "https://www.pbs.org/newshour/video/page/"

for i in range(1, PAGE_NUM + 1):
    print(f"\nPage {i}:\n")

    # Retrieve and parse the listing page (explicit parser avoids a bs4 warning)
    page = urllib.request.urlopen(f"{url_dir}{i}")
    page = BeautifulSoup(page, "html.parser")

    # Each clip sits in an <article class="card-sm"> inside <div id="all-videos">
    videos = page.find("div", {"id": "all-videos"})
    videos = videos.find_all("article", {"class": "card-sm"})
    for video in videos:
        # Skip full episodes, which have no transcripts
        is_full = video.find_all(string=re.compile("Full Episode"))
        if is_full:
            continue
        # Display the title of each remaining clip
        video = video.find("a", {"class": "card-sm__title"})
        print(video.span.text)


Page 1:

U.S. launched second drone airstrike against ISIS-K
Ida: Louisiana braces for impact as evacuations continue
‘Anti-LGBT ideology zones’ are being enacted in Polish towns
U.S. launches air strike in response to Kabul airport attack, ahead of full withdrawal
The challenge of retrofitting millions of aging homes to battle global warming
Gulf Coast preps for Ida landfall, possibly a Category 4 storm
As Afghans bury those killed in Kabul attack, sense of abandonment and anger at U.S. rises
Comparing strategies and challenges of evacuating Afghanistan with Vietnam exit
News Wrap: Florida judge reverses DeSantis order banning school mask mandates
As 1.2 million households face eviction, only 11% of federal rental assistance distributed
Brooks and Capehart on Kabul attack, Jan. 6 investigation, voting rights
Paralympic athletes to watch in Tokyo
In ‘Flag Day,’ Sean and Dylan Penn aim to break cinema’s ‘three thought rule’
WATCH: New Orleans officials give news briefing on Tropical St

### Transcript scanning

Once we arrive at a clip, we want a few things: (1) the page URL, which will be nice for debugging, (2) the text stored under "Story", which is essentially a summary of the clip, (3) the date and time, (4) the video title, and (5) the transcript.  Let's tackle each of these in turn.  It's important to note that some clips don't have transcripts ([example](https://www.pbs.org/newshour/show/voices-from-sandy-madeleine-conway-breezy-point)) and some clips have blank transcripts ([example]()).

<u>Note</u>: It would also be nice to get video tags if they're available, such as politics, opinion, technology, etc.

#### 1.) Page URL

We already have this info since we arrived at the proper page!

#### 2.) "Story" text

The structure seems to be...

`<div id="story">
  <div class="body-text">
    <p>
      OUR TEXT
    </p>
  </div>
</div>`

Other times it's stored as ([example](https://www.pbs.org/newshour/show/shields-and-brooks-on-trumps-supreme-court-politics-ocasio-cortezs-primary-upset))...

`<div class="vt__excerpt body-text">
  <p>
    OUR_TEXT
  </p>
</div>`
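
A minimal sketch of how both layouts could be handled in one place, assuming the two structures above are the only ones in use. The function name `extract_story` is hypothetical and is not taken from `download.py`. It returns `None` when the story block is missing or contains no text.

```python
from bs4 import BeautifulSoup


def extract_story(soup):
    """Return the story summary text, or None if it is missing or empty."""
    # First layout: <div id="story"><div class="body-text">...</div></div>
    story = soup.find("div", id="story")
    container = story.find("div", class_="body-text") if story else None
    if container is None:
        # Second layout: <div class="vt__excerpt body-text">...</div>
        container = soup.find("div", class_="vt__excerpt")
    if container is None:
        return None
    # Join all paragraphs into one string; an empty result counts as missing
    paragraphs = (p.get_text(" ", strip=True) for p in container.find_all("p"))
    text = " ".join(t for t in paragraphs if t)
    return text or None
```

Returning `None` for both "no block" and "empty block" keeps the caller's check to a single `if story is None`.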

#### 3.) Date and time

This seems to be stored in a `<time> ... </time>` block.  I wasn't even aware this element existed in HTML; hopefully there is only one per clip page.

This looks like it might need some formatting later, but let's worry about that then and just store it as a string for now.
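
As a rough sketch, the raw string could be pulled out like this. It assumes the first `<time>` element on the page is the clip's date, which I haven't verified across all pages; `extract_date` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def extract_date(soup):
    """Return the text of the first <time> element as a string, or None."""
    tag = soup.find("time")
    if tag is None:
        return None
    # Keep the value as an unformatted string; parsing is left for later
    return tag.get_text(strip=True) or None
```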

#### 4.) Title

We already have this as well.  It was stored next to the clip URL.

#### 5.) Transcript

This might be the trickiest part as some don't have transcripts.  However the ones that do ([example](https://www.pbs.org/newshour/show/shields-and-brooks-on-trumps-supreme-court-politics-ocasio-cortezs-primary-upset)) are stored as...


`<ul class="video-transcript">
  <li>
    <div>
      <p><strong>SPEAKER_NAME</strong></p>
      <p>SPEAKER_TEXT</p>
      ...
      <p>SPEAKER_TEXT</p>
    </div>
  </li>
</ul>`

<u>Note</u>: Each `<li></li>` is an individual person talking.
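
The structure above could be walked roughly as follows. This is only a sketch under the assumption that the speaker name is the only `<strong>` inside each `<li>`; a paragraph with bold text in the speech itself would be skipped by mistake. `parse_transcript` is a hypothetical name, and an empty list stands in for "no transcript".

```python
from bs4 import BeautifulSoup


def parse_transcript(soup):
    """Return [[speaker, [paragraph, ...]], ...], or [] if there is no transcript."""
    ul = soup.find("ul", class_="video-transcript")
    if ul is None:
        return []
    turns = []
    for li in ul.find_all("li"):
        strong = li.find("strong")
        speaker = strong.get_text(strip=True) if strong else None
        lines = []
        for p in li.find_all("p"):
            if p.find("strong") is not None:
                continue  # this paragraph holds the speaker name, not speech
            text = p.get_text(" ", strip=True)
            if text:
                lines.append(text)
        # Drop completely empty <li> blocks (blank transcripts)
        if speaker or lines:
            turns.append([speaker, lines])
    return turns
```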

Older videos ([example](https://www.pbs.org/newshour/show/troubles-for-the-holy-see-ahead-of-papal-elections)) seem to need a click on 'Transcripts' to unhide the text.  It also seems possible to add `#story` or `#transcript` to the URL to get the information needed.  The transcript is stored differently as well.

`<div id="transcript">
  <ul class="video-transcript">
    <li>PERSON</li>
  </ul>
</div>`

This [clip](https://www.pbs.org/newshour/show/troubles-for-the-holy-see-ahead-of-papal-elections) also seems to contain a mistake: one person's text carries over into the next `li`.  I think this can be remedied by checking whether a `<strong>` exists; an `li` without one would be a continuation of the previous speaker.
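
One possible way to apply the `<strong>` check, sketched under the assumption that each new speaker's `<li>` contains exactly one `<strong>` with the name. `merge_carried_over` is a hypothetical helper, and I have not checked it against every older page.

```python
from bs4 import BeautifulSoup


def merge_carried_over(items):
    """Group <li> tags into [speaker, [text, ...]] turns.

    An <li> without a <strong> is treated as a continuation of the previous turn.
    Note: this removes the <strong> tags from the parsed tree as it goes.
    """
    turns = []
    for li in items:
        strong = li.find("strong")
        if strong is not None:
            speaker = strong.get_text(strip=True)
            strong.extract()  # drop the name so only the speech remains
            text = li.get_text(" ", strip=True)
            turns.append([speaker, [text] if text else []])
            continue
        text = li.get_text(" ", strip=True)
        if not text:
            continue
        if turns:
            turns[-1][1].append(text)  # carry-over: attach to the previous speaker
        else:
            turns.append([None, [text]])  # text before any named speaker
    return turns
```

It would be called with something like `merge_carried_over(soup.find("div", id="transcript").find_all("li"))`.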

### Edge cases:

- 1.) Speakers need to be all caps and no titles (Ex: "from the Washington Post") ([link](https://www.pbs.org/newshour/show/brooks-and-marcus-on-gop-backlash-to-trump-and-cruz-clinton-sanders-practicality-debate))
- 2.) Some clips don't have a transcript ([link](http://localhost:7000/notebooks/git/misc/in_progress/speech_pbs_scraping/transcripts.ipynb))
- 3.) This page has many clips with broken links ([link](https://www.pbs.org/newshour/video/page/650))
- 4.) Some clips have a story, but no transcript ([link](https://www.pbs.org/newshour/show/danica-patrick-inspires-families-to-the-racetrack))
- 5.) This clip has a div for story, but no text ([link](https://www.pbs.org/newshour/show/news-wrap-russia-expels-diplomats-in-reprisal-against-u-s-and-others#story))
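
For the broken links in particular, one option is to wrap the request so a failed page is skipped rather than crashing the whole run. This is a sketch only: `fetch_html` and its `opener` parameter are hypothetical names (the `opener` argument exists so the function can be exercised without network access), and the 10 second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError, HTTPError (e.g. 404), and socket timeouts are all OSError subclasses
        return None
```

A caller can then `continue` to the next clip whenever `fetch_html(...)` returns `None`, and log the URL for later inspection.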


### FIXME Check this:
- no comments allowed ([link](https://www.pbs.org/newshour/show/gun-owning-group-in-oregon-advocates-for-firearm-safety))
- no comments allowed ([link](https://www.pbs.org/newshour/show/what-happens-when-i-try-to-talk-race-with-white-people))
- comment testing here ([link](https://www.pbs.org/newshour/show/report-top-facebook-apps-lack-privacy-protection)).  The number shown next to the comments doesn't appear to reflect the current number of comments.

### Broken links
- [link](https://www.pbs.org/newshour/show/improving-achievement-with-focus-on-scholarly-expectations), [link](https://www.pbs.org/newshour/show/greece-russia-dominate-g-7-summit-2-day-talks), [link](https://www.pbs.org/newshour/show/teen-flees-somalia-plans-to-return-as-doctor), [link](https://www.pbs.org/newshour/show/sec-sebelius-defends-the-ipab), [link](https://www.pbs.org/newshour/show/plumpy-nut-the-peanut-paste-that-could-save-millions), [link](https://www.pbs.org/newshour/show/the-doubleheader-switcheroos-occupy-opportunity-world-series-picks), [link](https://www.pbs.org/newshour/show/inside-softball-press-politicians-play-for-good-cause), [link](https://www.pbs.org/newshour/show/introduction-to-cocorahs)

### Check downloaded data

Using the `download.py` script, we were able to obtain all the transcripts with the methods mentioned here.  Let's inspect one of the news stories and see what it looks like.

In [3]:
fp = "/Users/pbezuhov/Desktop/pbs/1/african-nations-struggle-with-vaccine-access-public-mistrust-and-disinformation.json"
with open(fp, "r") as f:
    data = json.load(f)

import pandas as pd
pd.DataFrame([data])

| | url | story | date | title | transcript | speakers |
|---|---|---|---|---|---|---|
| 0 | https://www.pbs.org/newshour/show/african-nati... | Record numbers of COVID-19 cases are being rep... | Aug 26, 2021 6:35 PM EDT | African nations struggle with vaccine access, ... | [[Amna Nawaz, [The Delta variant is ravaging t... | [Asumpta Bahenda, Strive Masiyiwa, Patrick Mak... |


In [4]:
data["story"]

'Record numbers of COVID-19 cases are being reported across Africa as the delta variant pushes hospitals to a breaking point. ICU beds and oxygen are in desperately short supply, vaccines are increasingly scarce and according to the World Health Organization, there’s little hope even 10% of Africans will get a shot before 2021 ends. Special correspondent Isabel Nakirya reports from Kampala, Uganda.'

In [5]:
data["transcript"][:5]  # First 5 speaker turns of the transcript

[['Amna Nawaz',
  ['The Delta variant is ravaging the continent of Africa. ICU beds and oxygen are in desperately short supply. Vaccines are increasingly scarce, with less than 10 percent of people expected to be vaccinated by the end of the year.',
   'Special correspondent Isabel Nakirya in Kampala, Uganda.']],
 ['Isabel Nakirya',
  ['Asumpta Bahenda has been trying to wash away her near death experience for two months now. She suffered from a severe case of COVID-19 in June.']],
 ['Asumpta Bahenda', ['I started feeling like I was going to die.']],
 ['Isabel Nakirya',
  ['An ambulance evacuated her more than 170 miles from Western Uganda to the capital, Kampala, but finding a bed in a hospital was almost impossible.',
   'With damaged lungs, Asumpta needed immediate admission to an ICU. When she finally found a bed in a private hospital, oxygen was in short supply.']],
 ['Asumpta Bahenda',
  ["There's a moment where they were rationing oxygen. They come and remove the oxygen from you

### Next steps: Processing

I like keeping a copy of the raw data before processing it.

There's a lot to be done, such as formatting times and speakers, removing any errors in the transcripts, and sorting the dataframe.  However, I'll do that in a new notebook.
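
As a starting point for the time formatting, here is a small sketch that parses a date string of the form seen in the downloaded data (`"Aug 26, 2021 6:35 PM EDT"`). It assumes every date ends with a time zone abbreviation and that month names are in English; `parse_pbs_date` is a hypothetical helper name. The abbreviation is returned separately rather than converted, since `strptime`'s `%Z` handling of names like `EDT` varies by platform.

```python
from datetime import datetime


def parse_pbs_date(raw):
    """Split 'Aug 26, 2021 6:35 PM EDT' into (naive datetime, tz abbreviation)."""
    # Everything before the last space is the timestamp; the rest is the zone
    stamp, _, tz_abbrev = raw.strip().rpartition(" ")
    # %b depends on the locale; this assumes English month abbreviations
    parsed = datetime.strptime(stamp, "%b %d, %Y %I:%M %p")
    return parsed, tz_abbrev
```

Any string that doesn't match the assumed format raises `ValueError`, which makes malformed dates easy to spot during processing.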