The Beautiful Soup code from last time is at the bottom of this notebook so that it's available for re-use and also as a reminder that we'll need two functions, **`parse`** and **`to_csv`**, at minimum. We may *want* a function that scrapes the TED website directly, using the CSV from the Google Doc. 

What follows is an attempt to build the **`BeautifulSoup`** code from scratch because TED has completely rewritten the HTML for the transcript pages. The most important change, though only a minor annoyance, is that they have stopped embedding the relative times for the talks. All that remains is the total time, which is now placed in a `<meta>` tag in the head of the document. 

In [15]:
import re

In [1]:
# It looks like, for now, all we need is BS4. 
from bs4 import BeautifulSoup, Comment

# NB: no need to read() the file: BS4 accepts an open file handle directly
with open("transcript.0.html") as f:
    thesoup = BeautifulSoup(f, "html5lib")

# Talk metadata is in <meta> tags in the <head>. 
# This finds all <meta> tags
metas = thesoup.find_all("meta")

# Let's see what this object is...
print(type(metas))

<class 'bs4.element.ResultSet'>


In [None]:
# ... and what's inside of it:
print([meta for meta in metas])

Some early work focused on parsing the data as the list seen above:

    print(metas[0])
    >>> <meta charset="utf-8"/>
    
    print(type(metas), type(metas[0]))
    >>> <class 'bs4.element.ResultSet'> <class 'bs4.element.Tag'>

I even tried a list comprehension based on some BS4 functionality:

    metalist = [meta.attrs for meta in metas]

The problem was getting the value of one attribute based on the value of another attribute, which seemed impossible until I found a helpful thread on [SO][]. That resulted in the code below, which produces very clean output. The `tag.get("name", None)` syntax is the BS4 `Tag.get()` method: it returns the value of the named attribute, or `None` (the second argument) if the `<meta>` tag has no such attribute. 

[SO]: https://stackoverflow.com/questions/36768068/get-meta-tag-content-property-with-beautifulsoup-and-python

In [None]:
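# Default to None so the print below cannot hit an undefined name
speaker = length = published = views = description = None
for tag in thesoup.find_all("meta"):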
    if tag.get("name", None) == "author":
        speaker = tag.get("content", None)
    if tag.get("itemprop", None) == "duration":
        length = tag.get("content", None)
    if tag.get("itemprop", None) == "uploadDate":
        published = tag.get("content", None)
    if tag.get("itemprop", None) == "interactionCount":
        views = tag.get("content", None)
    if tag.get("itemprop", None) == "description":
        description = tag.get("content", None)

print(speaker, length, published, views, description)
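
The "one attribute based on another" lookup can also be written as a direct search rather than a loop over every `<meta>` tag. What follows is only a minimal sketch: the `meta_content` helper is a hypothetical name, the sample HTML and its values are invented for illustration, and `html.parser` is used so that the snippet runs without `html5lib`.

```python
from bs4 import BeautifulSoup

# Invented miniature of a page <head>; the values are placeholders, not TED data.
sample = """
<html><head>
  <meta charset="utf-8"/>
  <meta name="author" content="Example Speaker"/>
  <meta itemprop="duration" content="PT10M"/>
</head><body></body></html>
"""

def meta_content(soup, attr, value):
    """Return the content of the first <meta> whose `attr` equals `value`, else None."""
    # attrs= is needed because "name" would otherwise clash with find()'s own
    # first parameter, which is the tag name.
    tag = soup.find("meta", attrs={attr: value})
    return tag.get("content") if tag is not None else None

soup = BeautifulSoup(sample, "html.parser")
speaker = meta_content(soup, "name", "author")
length = meta_content(soup, "itemprop", "duration")
views = meta_content(soup, "itemprop", "interactionCount")  # absent from the sample
print(speaker, length, views)
```

A missing tag simply yields `None`, so no variable is ever left undefined.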

## Solution 1: Between 2 Comments

The first block of code below returns the whole transcript, which is indeed conveniently housed in `<p>` tags, but so is some extraneous information in the footer. Fortunately, the paragraphs we need occur between two comments:

```html
    <!-- Transcript text -->
        <p> All the text we want.</p>
    <!-- /Transcript text -->
```

Parsing between two comments seems to be possible, according to these two discussions: ["Extracting Text Between HTML Comments with BeautifulSoup"][SO1] and ["How do I parse just html between two comments using Python 3 and Beautiful Soup"][SO2]. In the first, however, the solution is to grab only the line after a comment, and the second focuses on using other tags. Because of that, I am trying Solution 2 below, but I still think there might be a way to build a better function that grabs all the paragraphs between the two comments.

[SO1]: https://stackoverflow.com/questions/34673851/extracting-text-between-html-comments-with-beautifulsoup
[SO2]: https://stackoverflow.com/questions/48794294/how-do-i-parse-just-html-between-two-comments-using-python-3-and-beautiful-soup

In [None]:
# Returns transcript, but also some footer information which is in paragraph tags
text = thesoup.find_all("p")
print(text)

In [None]:
# Comment strings keep their surrounding spaces, so strip() before comparing
for comment in thesoup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment.strip() == "Transcript text":
        print(comment.next_element.strip())
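
One possible shape for a function that grabs every paragraph between the two comments is sketched below. It is not a tested solution for the real TED pages: `paragraphs_between` is a hypothetical helper, the sample markup is invented, and the sketch assumes the closing comment reads `/Transcript text` as in the snippet above. The idea is to find the opening comment, then walk `next_elements` in document order until the closing comment turns up.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Invented stand-in for the transcript region of a page.
sample = """
<div>
  <!-- Transcript text -->
  <div class="wrapper"><p>First paragraph.</p></div>
  <div class="wrapper"><p>Second paragraph.</p></div>
  <!-- /Transcript text -->
  <p>Footer text we do not want.</p>
</div>
"""

def paragraphs_between(soup, start="Transcript text", end="/Transcript text"):
    """Collect the text of every <p> found between two marker comments."""
    # Comments keep their padding (" Transcript text "), hence the strip().
    opening = soup.find(
        string=lambda s: isinstance(s, Comment) and s.strip() == start
    )
    if opening is None:
        return []
    paragraphs = []
    # next_elements yields everything after the comment in document order,
    # including nested tags, so <p> tags inside wrapper <div>s are visited too.
    for element in opening.next_elements:
        if isinstance(element, Comment) and element.strip() == end:
            break
        if isinstance(element, Tag) and element.name == "p":
            paragraphs.append(element.get_text(" ", strip=True))
    return paragraphs

soup = BeautifulSoup(sample, "html.parser")
found = paragraphs_between(soup)
print(found)
```

Because the loop stops at the closing comment, the footer paragraph is never reached.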

## Solution 2: Use the Div Class

All the transcript paragraphs appear to be formatted like this:

```html
    <!-- Transcript text -->
        <div class="Grid Grid--with-gutter d:f@md p-b:4">
			<div class="Grid__cell d:f h:full m-b:.5 m-b:0@md w:12"></div>
                <div class="Grid__cell flx-s:1 p-r:4">
					<p> By raising your hand, how many of you know at least 
                        ...
```

It's not semantic, and I suspect that keying on **`Grid__cell flx-s:1 p-r:4`** is only a temporary solution, but if it works for now, I'll take it.
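
One reason the class string may prove fragile: passing the full string matches only when the `class` attribute is exactly that string, in that order. A slightly more tolerant variant, sketched here under assumptions, checks that the wanted class names are a subset of the tag's classes. The names `WANTED` and `is_transcript_cell` are hypothetical, and the third `<div>` in the sample is invented to show the difference.

```python
from bs4 import BeautifulSoup

# The first two <div>s mirror the markup above; the third is an invented
# variant with the same classes in a different order.
sample = """
<div class="Grid__cell d:f h:full m-b:.5 m-b:0@md w:12"></div>
<div class="Grid__cell flx-s:1 p-r:4"><p>Kept: exact order.</p></div>
<div class="p-r:4 Grid__cell flx-s:1"><p>Kept: same classes, new order.</p></div>
"""

WANTED = {"Grid__cell", "flx-s:1", "p-r:4"}

def is_transcript_cell(tag):
    """True for a <div> that carries all of the wanted class names, in any order."""
    # BS4 stores the class attribute as a list of individual class names.
    return tag.name == "div" and WANTED <= set(tag.get("class", []))

soup = BeautifulSoup(sample, "html.parser")
cells = soup.find_all(is_transcript_cell)
exact = soup.find_all("div", {"class": "Grid__cell flx-s:1 p-r:4"})
print(len(cells), len(exact))
```

This still depends on TED keeping those class names, so it is no less temporary than the exact match; it only survives reordering or added classes.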

In [2]:
# This produces a BS4 ResultSet that is 12 <div>s long
text = thesoup.find_all("div", {"class": "Grid__cell flx-s:1 p-r:4"})
print(text)

[<div class="Grid__cell flx-s:1 p-r:4">
									<p>
											By raising your hand,
											how many of you know
at least one person on the screen?
											Wow, it's almost a full house.
											It's true, they are very famous
in their fields.
											And do you know what
all of them have in common?
											They all died of pancreatic cancer.
											However, although it's very,
very sad this news,
											it's also thanks to their personal stories
											that we have raised awareness
of how lethal this disease can be.
									</p>
								</div>, <div class="Grid__cell flx-s:1 p-r:4">
									<p>
											It's become the third cause
of cancer deaths,
											and only eight percent of the patients
will survive beyond five years.
											That's a very tiny number,
											especially if you compare it
with breast cancer,
											where the survival rate
is almost 90 percent.
											So it doesn't really come as a surprise
											that being

In [8]:
for div in thesoup.find_all("div", {"class": "Grid__cell flx-s:1 p-r:4"}):
    print(div.text)


									
											By raising your hand,
											how many of you know
at least one person on the screen?
											Wow, it's almost a full house.
											It's true, they are very famous
in their fields.
											And do you know what
all of them have in common?
											They all died of pancreatic cancer.
											However, although it's very,
very sad this news,
											it's also thanks to their personal stories
											that we have raised awareness
of how lethal this disease can be.
									
								

									
											It's become the third cause
of cancer deaths,
											and only eight percent of the patients
will survive beyond five years.
											That's a very tiny number,
											especially if you compare it
with breast cancer,
											where the survival rate
is almost 90 percent.
											So it doesn't really come as a surprise
											that being diagnosed
with pancreatic cancer
											means facing an almost
certain death sentence.
		

In [32]:
strung = ' '.join([div.text for div in 
            thesoup.find_all("div", {"class": "Grid__cell flx-s:1 p-r:4"})])
text   = re.sub(r"[\t]", "", strung).replace("\n", " ")
print(text)

  By raising your hand, how many of you know at least one person on the screen? Wow, it's almost a full house. It's true, they are very famous in their fields. And do you know what all of them have in common? They all died of pancreatic cancer. However, although it's very, very sad this news, it's also thanks to their personal stories that we have raised awareness of how lethal this disease can be.     It's become the third cause of cancer deaths, and only eight percent of the patients will survive beyond five years. That's a very tiny number, especially if you compare it with breast cancer, where the survival rate is almost 90 percent. So it doesn't really come as a surprise that being diagnosed with pancreatic cancer means facing an almost certain death sentence. What's shocking, though, is that in the last 40 years, this number hasn't changed a bit, while much more progress has been made with other types of tumors. So how can we make pancreatic cancer treatment more effective? As a 
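
The output above still carries runs of spaces where the newlines and indentation used to be. If single spaces are preferred, one option (a sketch, with `collapse_whitespace` as a hypothetical helper name) is to split on any whitespace and rejoin, which handles tabs, newlines, and repeated spaces in one pass:

```python
def collapse_whitespace(raw):
    """Replace every run of whitespace with one space and trim both ends."""
    # str.split() with no argument splits on any whitespace run and drops empties.
    return " ".join(raw.split())

# A fragment shaped like the div.text output shown earlier.
raw = "\n\t\t\tBy raising your hand,\n\t\t\thow many of you know\nat least one person on the screen?\n\t\t\n"
clean = collapse_whitespace(raw)
print(clean)
```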

## Assembling the Functions

Now that we have a working solution for getting the data out of the HTML files, we can reassemble the functions we need. 

In [41]:
import re
import csv
import os
from bs4 import BeautifulSoup

def parse(thesoup):
    # Default every field to None so a missing <meta> tag cannot raise UnboundLocalError
    speaker = length = published = views = description = None
    for tag in thesoup.find_all("meta"):
        if tag.get("name", None) == "author":
            speaker = tag.get("content", None)
        if tag.get("itemprop", None) == "duration":
            length = tag.get("content", None)
        if tag.get("itemprop", None) == "uploadDate":
            published = tag.get("content", None)
        if tag.get("itemprop", None) == "interactionCount":
            views = tag.get("content", None)
        if tag.get("itemprop", None) == "description":
            description = tag.get("content", None)
    # Join the text of every transcript <div>, then drop tabs and newlines
    strung = ' '.join([div.text for div in 
            thesoup.find_all("div", {"class": "Grid__cell flx-s:1 p-r:4"})])
    text   = re.sub(r"[\t]", "", strung).replace("\n", " ")
    return speaker, length, published, views, description, text

def to_csv(pth, out):
    # open file to write to; newline="" is what the csv module expects.
    with open(out, "w", newline="") as outfile:
        # create csv.writer. 
        wr = csv.writer(outfile)
        # write our headers.
        wr.writerow(["author", "length", "published", "views", "description", "text"])
        # get all our html files, skipping anything else in the directory.
        for html in os.listdir(pth):
            if not html.endswith(".html"):
                continue
            with open(os.path.join(pth, html)) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "html5lib")))
                
# to_csv("./test","test.csv")

In [42]:
to_csv("./test","test.csv")
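
A quick way to check the result is to read the CSV back and count the words in each transcript. The sketch below is self-contained, so it writes a tiny stand-in file with one invented row first. In the notebook, `summarize_csv` (a hypothetical helper) would be pointed at `test.csv` instead.

```python
import csv
import os
import tempfile

def summarize_csv(path):
    """Return (author, word count of text) for each row of a CSV made by to_csv."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [(row["author"], len(row["text"].split())) for row in rows]

# Stand-in file using the same headers as to_csv; the row itself is made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        wr = csv.writer(f)
        wr.writerow(["author", "length", "published", "views", "description", "text"])
        wr.writerow(["Example Speaker", "PT10M", "2018-01-01", "0",
                     "A made-up row.", "one two three"])
    summary = summarize_csv(demo_path)

print(summary)
```

A row whose word count is zero would point to a page where the class-based search found no transcript `<div>`s.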