# Speech Understanding 
# Lecture 12: Automatic News Announcer


### Mark Hasegawa-Johnson, KCGI, January 14, 2023

1. <a href="#section_1">How HTML works</a>
1. <a href="#section_2">Using BeautifulSoup to extract the content you want</a>
1. <a href="#section_3">Automatic news announcer</a>
1. <a href="#homework">Homework</a>


<a id='section_1'></a>

## 1. How HTML works

Web pages are written using the **hypertext markup language (HTML)**.  You can write HTML using special tools, but you can also write it using any plaintext editor.

For example, consider the following text:

```
<html>
    <head>
        <title>Example Web Page</title>
        <style>
            .bluetext { color: blue; }
            .leftmargin { margin-left: 10px; }
        </style>
    </head>
    <body>
        <h1>Test web page</h1>
        <p>
        Web pages are written using the hypertext markup language (HTML).  You can write HTML using special tools, but you can also write it using any plaintext editor.   The markup in an HTML file is done using tags.  Each tag either opens an envelope, or closes an envelope.  The text in between the opening tag and the closing tag is called the content of the envelope.  Envelopes can be nested, one inside another, as this <b>p</b> tag is nested inside the <b>body</b> tag.
        </p>
        <p>
        "Hypertext" is text that includes links.  For example, here are some links:
        </p>
        <p class="leftmargin">
            <a class="bluetext" href="https://wikipedia.org">Wikipedia</a>
        </p>
        <p class="leftmargin">
            <a class="bluetext" href="https://www.npr.org">NPR</a>
        </p>
    </body>
</html>
```

Create a plaintext file, call it something like "testpage.html", and copy and paste the text above into it.  Then open the file in your web browser to see how the browser renders it.

### Tags, Envelopes, and Nesting

The formatting commands in HTML come in the form of tags.  There are two types of tags:
* An opening tag, such as `<p>`, opens an envelope (in this case, a paragraph).
* A closing tag, such as `</p>`, closes the envelope.

Every envelope has **content**.  Some envelopes also have **attributes**.
* The **content** of the envelope is the text between the opening-tag and closing-tag.
* The **attributes** of the envelope are defined inside the tag.  For example, the text `<a class="bluetext" href="https://www.npr.org">` means that the `<a>` tag has 2 attributes: `class="bluetext"` and `href="https://www.npr.org"`.
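
To make the difference between a tag, its attributes, and its content concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to take one envelope apart.  The class name `TagInspector` is made up for this illustration; it is not part of any library.

```python
from html.parser import HTMLParser


class TagInspector(HTMLParser):
    """Record opening tags (with attributes), text content, and closing tags."""

    def __init__(self):
        super().__init__()
        self.events = []  # one tuple per event, in document order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("open", tag, dict(attrs)))

    def handle_data(self, data):
        # ignore whitespace-only text between tags
        if data.strip():
            self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


inspector = TagInspector()
inspector.feed('<a class="bluetext" href="https://www.npr.org">NPR</a>')
for event in inspector.events:
    print(event)
```

The opening tag carries the two attributes, the string `NPR` is the content, and the closing tag carries nothing except its name.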

Envelopes can be nested.  For example, in the example file above, the envelopes are nested like this:


```
<html>
  ├─ <head>
  │   ├─ <title>
  │   └─ <style>
  └─ <body>
      ├─ <h1>
      ├─ <p>
      │  ├─ <b>
      │  └─ <b>
      ├─ <p>
      ├─ <p>
      │  └─ <a>
      └─ <p>
         └─ <a>
```
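
One way to check a nesting diagram like this is to let Python print it.  The sketch below uses the standard library's `html.parser` module on a shortened copy of the example page.  The class name `OutlinePrinter` is made up for illustration, and the code assumes that every opening tag has a matching closing tag (which is true of the example page, but not of every real web page).

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Collect one line per opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many envelopes are currently open
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1  # everything until the closing tag is nested inside

    def handle_endtag(self, tag):
        self.depth -= 1


page = """
<html>
  <head><title>Example Web Page</title></head>
  <body>
    <h1>Test web page</h1>
    <p>This <b>p</b> tag is nested inside the <b>body</b> tag.</p>
    <p><a href="https://wikipedia.org">Wikipedia</a></p>
  </body>
</html>
"""

printer = OutlinePrinter()
printer.feed(page)
print("\n".join(printer.lines))
```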

### Types of HTML tags

There are many different tags.  A complete listing is here: https://html.spec.whatwg.org/multipage/

The tags used in the example above include:

| Tag name | Description |
| :- | :- |
| \<html> | A file marked up using HTML |
| \<head> | Header: information that's not visible in the web page |
| \<title> | Title of the page |
| \<style> | Formatting class definitions |
| \<body> | Body: the part that's visible in the web page |
| \<h1> | A top-level heading (\<h2> through \<h6> are lower-level headings) |
| \<p> | A paragraph |
| \<b> | Boldface text |
| \<a> | A hypertext link |
 


#### Real web pages

A real web page is just like the one above, but more complicated.  To see a useful example, go to <a href="https://www.npr.org/">https://www.npr.org/</a>.  In your browser menu, find the option that says **View Page Source** (in Firefox, that's inside the **Tools** menu), and click on it.

Notice that the top of the file is a very long header, including `<script>` and `<style>` tags that define the scripts and formatting classes used later in the page.

After the very long header you will find a body, with lists formatted using `<ul>` and `<li>` tags, and with news content in plaintext between the tags.

<a id='section_2'></a>

## 2. Using BeautifulSoup to extract the content you want

<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup</a> is a Python package that makes it relatively easy to extract content from web pages.

In [7]:
import bs4

with open("testpage.html") as f:
    example_soup = bs4.BeautifulSoup(f, "html.parser")

ptags = example_soup.find_all("p")  # find_all is the current name; findAll is a legacy alias
print("There are", len(ptags), "paragraphs in the document.\n")
print("The first one is:\n")
print(ptags[0], "\n")
print("The third one is:\n")
print(ptags[2], "\n")

print("The children of the third paragraph are:\n")
print(ptags[2].contents)

There are 4 paragraphs in the document.

The first one is:

<p>
        Web pages are written using the hypertext markup language (HTML).  You can write HTML using special tools, but you can also write it using any plaintext editor.   The markup in an HTML file is done using tags.  Each tag either opens an envelope, or closes an envelope.  The text in between the opening tag and the closing tag is called the content of the envelope.  Envelopes can be nested, one inside another, as this <b>p</b> tag is nested inside the <b>body</b> tag.
        </p> 

The third one is:

<p class="leftmargin">
<a class="bluetext" href="https://wikipedia.org">Wikipedia</a>
</p> 

The children of the third paragraph are:

['\n', <a class="bluetext" href="https://wikipedia.org">Wikipedia</a>, '\n']


In [8]:
atag = ptags[2].find("a")

print("The href attribute of the hyperlink in paragraph 3 is:",atag['href'])

print("The text content of that hyperlink is:", atag.text)


The href attribute of the hyperlink in paragraph 3 is: https://wikipedia.org
The text content of that hyperlink is: Wikipedia
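
The same ideas extend from one hyperlink to all of them.  The sketch below is self-contained: instead of reading `testpage.html`, it parses a short string copied from the example page.  It shows two things: how to keep only the children that are tags (skipping the `'\n'` strings seen above), and how to collect the text and `href` of every link.  The variable names are only for illustration.

```python
import bs4

html = """
<p class="leftmargin">
    <a class="bluetext" href="https://wikipedia.org">Wikipedia</a>
</p>
<p class="leftmargin">
    <a class="bluetext" href="https://www.npr.org">NPR</a>
</p>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Keep only the children that are tags, skipping whitespace-only strings
first_p = soup.find("p")
tag_children = [child for child in first_p.contents if isinstance(child, bs4.Tag)]
print(tag_children)

# Collect a (link text, href) pair for every hyperlink in the document
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
print(links)
```

Using `a.get("href")` instead of `a["href"]` returns `None` for a link that has no `href` attribute, rather than raising an error.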


#### Using BeautifulSoup to explore a real-world web page

Now let's use BeautifulSoup to explore the NPR web page.

In [9]:
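import bs4, requests
webpage = requests.get("https://npr.org", timeout=10)  # give up if the server does not answer
webpage.raise_for_status()  # stop here if the download failed
npr_soup = bs4.BeautifulSoup(webpage.text, "html.parser")

ptags = npr_soup.find_all("p")  # find_all is the current name; findAll is a legacy alias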
print("There are", len(ptags), "paragraphs in the document.\n")
print("The first one is:\n")
print(ptags[0], "\n")
print("The third one is:\n")
print(ptags[2], "\n")


There are 69 paragraphs in the document.

The first one is:

<p>
                Walking five minutes every half-hour can reduce the risk of high blood pressure, diabetes and heart disease.
                <b aria-label="Image credit" class="credit">
                    
                    EschCollection/Getty Images
                    
                </b>
<b class="hide-caption"><b>hide caption</b></b>
</p> 

The third one is:

<p>
                "I like the rug. It brings the room together." Ellie (Bella Ramsey) and Tess (Anna Torv) in HBO's <em>The Last of Us.</em>
<b aria-label="Image credit" class="credit">
                    
                    Liane Hentscher/HBO
                    
                </b>
<b class="hide-caption"><b>hide caption</b></b>
</p> 



The news items in the NPR web page are stored in `<div>` envelopes with a special class: they are called `<div class="story-text">`.  Let's list those.

In [11]:
div_tags = npr_soup.find_all('div', 'story-text')

print("There are", len(div_tags), "story text sections.\n")
print("The first one is:\n")
print(div_tags[0],"\n")
print("The third one is:\n")
print(div_tags[2],"\n")

There are 37 story text sections.

The first one is:

<div class="story-text">
<div class="slug-wrap">
<h2 class="slug">
<a data-metrics='{"action":"Click Slug"}' data-metrics-ga4='{"action":"homepage_curation_click","clickPosition":1,"clickType":"section slug","clickUrl":"https://www.npr.org/sections/health-shots/"}' href="https://www.npr.org/sections/health-shots/">
                        Shots - Health News
                        </a>
</h2>
</div>
<a data-metrics='{"action" : "Click Story 1"}' data-metrics-ga4='{"action":"homepage_curation_click","clickPosition":1,"clickType":"curated story","clickUrl":"https://www.npr.org/sections/health-shots/2023/01/12/1148503294/sitting-all-day-can-be-deadly-5-minute-walks-can-offset-harms"}' href="https://www.npr.org/sections/health-shots/2023/01/12/1148503294/sitting-all-day-can-be-deadly-5-minute-walks-can-offset-harms">
<h3 class="title">Sitting all day can be deadly. 5-minute walks can offset harms</h3>
</a>
<a data-metrics='{"action" : "

If you look through those `story-text` sections, you can see that there are only two parts that might sound good if spoken out loud:

* Each of them has a title, called `<h3 class="title">`
* One of them also has a teaser, called `<p class="teaser">`.
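
Because the teaser is optional, it helps to know what Beautiful Soup does when a tag is missing: `find` returns `None`.  The sketch below demonstrates this on hypothetical markup that only imitates the structure described above (it is not copied from NPR).  It also spells out the class filter with the keyword `class_`; passing the class name as the second positional argument, as in `find_all('div', 'story-text')`, means the same thing.

```python
import bs4

# Hypothetical markup that imitates the structure described above
html = """
<div class="story-text">
  <h3 class="title">First headline</h3>
  <p class="teaser">A short summary of the first story.</p>
</div>
<div class="story-text">
  <h3 class="title">Second headline</h3>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

teasers = []
for div_tag in soup.find_all("div", class_="story-text"):
    teaser = div_tag.find("p", class_="teaser")  # None when there is no teaser
    teasers.append(None if teaser is None else teaser.get_text(strip=True))
print(teasers)
```

Any code that collects stories therefore has to check for `None` before asking for the teaser's text.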

Let's write a function that extracts a list of story-texts from the NPR web page, and returns the title and (if it exists) the teaser for each of them.

In [12]:
def get_stories(soup):
    stories = []
    for div_tag in soup.find_all('div', 'story-text'):
        title = div_tag.find('h3', 'title')
        teaser = div_tag.find('p', 'teaser')
        if title is None:
            continue  # skip any section that has no headline
        story = title.text + ". "
        if teaser is not None:  # the teaser is optional
            story += teaser.text
        stories.append(story)
    return stories

In [13]:
stories = get_stories(npr_soup)
print("There are", len(stories), "stories.\n")
for n in range(5):
    print("Story number %d:"%(n))
    print(stories[n], "\n")


There are 37 stories.

Story number 0:
Sitting all day can be deadly. 5-minute walks can offset harms. A new study finds that taking regular, short bouts of movement during the day can reduce the risk of developing conditions like diabetes and heart disease. 

Story number 1:
At the end of humanity, 'The Last of Us' locates what makes us human.  

Story number 2:
Exxon climate predictions were accurate decades ago. The company still sowed doubt.  

Story number 3:
Look to the night sky in 2023.  

Story number 4:
At least 6 people were killed as a giant storm system hits the Southern U.S..  
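
Notice the doubled period in story number 4: the title already ends with a period, and a second one is appended.  If that matters for your synthesizer, one possible fix is sketched below.  The helper name `join_title_and_teaser` is made up for this example; it adds a period only when the title does not already end in punctuation, and it also collapses extra whitespace.

```python
def join_title_and_teaser(title, teaser=None):
    """Build one speakable string from a title and an optional teaser."""
    text = " ".join(title.split())  # collapse runs of whitespace
    if text and text[-1] not in ".!?":
        text += "."  # add a period only if the title lacks end punctuation
    if teaser:
        text += " " + " ".join(teaser.split())
    return text


print(join_title_and_teaser("Look to the night sky in 2023"))
print(join_title_and_teaser("A giant storm system hits the Southern U.S."))
```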



<a id='section_3'></a>

## 3. Automatic news announcer

#### Making speech_package available everywhere on your machine

Last week we created the `speech_package`.  Unfortunately, in order to make it available everywhere on your machine, there is one step we left out!

Please use a terminal to navigate to the parent directory of your `speech_package`, and type the following:

```
python setup.py install
```

(Last week, we missed the word `install`).
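
To check whether the installation worked, you can ask Python whether it can locate the package.  The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `is_installed` is made up for this example.  Run it from a directory other than the one that contains `speech_package`, because Python can always find a package that sits in the current directory, installed or not.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


print("speech_package installed:", is_installed("speech_package"))
```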


#### Using speech_package to automatically read the news

Now that we've installed your `speech_package`, it should be available everywhere on your PC, including the directory where you're doing this week's work.  Let's use it to make an automatic news announcer.



In [15]:
import speech_package

def read_nth_story(stories, n, filename):
    speech_package.synthesize(stories[n],"en",filename)


In [16]:
import librosa, IPython

read_nth_story(stories, 10, 'test.mp3')
x, fs = librosa.load('test.mp3')

IPython.display.Audio(data=x, rate=fs)
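
A news announcer usually reads more than one story.  The sketch below shows one possible way to synthesize the first few stories, each to its own file.  The function `read_first_stories` and the filename pattern are assumptions made for this example.  So that the sketch runs even without `speech_package`, the synthesizer is passed in as a parameter and a stand-in is used here; in the notebook you would pass `speech_package.synthesize` instead.

```python
def read_first_stories(stories, count, synthesize, prefix="story"):
    """Synthesize the first `count` stories and return the filenames written."""
    filenames = []
    for n, story in enumerate(stories[:count]):  # slicing is safe if count is too large
        filename = "%s_%d.mp3" % (prefix, n)
        synthesize(story, "en", filename)
        filenames.append(filename)
    return filenames


# Stand-in synthesizer: prints what it would do instead of writing audio
def fake_synthesize(text, lang, filename):
    print("Would write %r to %s" % (text, filename))


written = read_first_stories(["First story.", "Second story."], 5, fake_synthesize)
print(written)
```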




<a id='homework'></a>

## Homework for Week 12

Create a plaintext file called `week12.py`.  Copy into it the following template code:

```
import bs4, speech_package

def extract_stories_from_NPR_text(text):
    '''
    Input: 
    text (string): the text of a webpage
    
    Output:
    stories (list of strings): a list of the news stories in the web page
    '''
    raise RuntimeError('You need to write this part!')
    return stories
    
def read_nth_story(stories, n, filename):
    '''
    Input:
    stories (list of strings): a list of the news stories from a web page
    n (int): the index of the story you want me to read
    filename (str): filename in which to store the synthesized audio

    Output: None
    '''
    raise RuntimeError('You need to write this part!')
```

Notice that `extract_stories_from_NPR_text` is almost the same as `get_stories`, but it starts from the webpage text, not from the soup.  So you need to run `soup = bs4.BeautifulSoup(webpage_text, "html.parser")` first, and then you can re-use the rest of the code from `get_stories`.

Once you've created the file,

1. Replace the `raise RuntimeError` lines with code that works.
1. Try it in the following code block.
1. Once it works here, upload it to Gradescope.

In [4]:
import week12, requests, IPython, importlib, librosa
importlib.reload(week12)

webpage = requests.get("https://npr.org")
stories = week12.extract_stories_from_NPR_text(webpage.text)
week12.read_nth_story(stories, 0, 'test.mp3')

x, fs = librosa.load('test.mp3')
IPython.display.Audio(data=x, rate=fs)


