## Parsing the Transcripts

In this notebook, we are going to parse the downloaded transcripts found in the directory of the same name. This script was originally written to use the `requests` library and to fetch the transcripts from the website itself. When that failed, I used `wget` and a list of URLs to download the html files, which are here ordered by their place in the list. 

In order to facilitate a merge, I need to elicit a **key** that these files share with other CSVs. I think the easiest is going to be to focus on the **`public_URL`**, and it looks like the following HTML will give me what I need:

    <link rel="canonical" href="https://www.ted.com/talks/al_gore_on_averting_climate_crisis/transcript" />

I can use BeautifulSoup to locate the `link` tag and then the `rel="canonical"` attribute, and then I'll need to chomp `/transcript` off the string.

Sounds doable, yeah?

## Sandbox

There's a lot going on in one line of code below, which I built up over a series of trials. To be clear, I wanted to get it down to one-line to make the `parse` function more compact. Here's a description of what happens in the series of things that happen below.

Once `thesoup` object is created, `find` finds the first instance of the `<link>` tag where the `"rel"` attribute has the value "`canonical`". The `['href']` acts, as I understand it, like an index to a list, pulling out that href value, which in the case of `transcript.0` is:

    https://www.ted.com/talks/al_gore_on_averting_climate_crisis/transcript

We don't want the `/transcript` part of this string, since we will be merging on the main URL. Some googling and stackoverflowing revealed that the `split` method is faster than any regex we could use. Fortunately, all we needed was to peel off the part after the last slant, so this use of `rsplit('/', 1)` (reverse split) goes one slant back from the end and creates a list with two items. We want the first item in that list, and the zeroth index, `[0]`, gives us that.

In [None]:
# Walking through the code to get the link tag above

# Load HTML into a BS4 object
thesoup = BeautifulSoup(open("./transcripts/transcript.0"), "html5lib")

public_html = thesoup.find("link", {'rel': 'canonical'})['href'].rsplit('/', 1)[0]
print(public_html)

In [1]:
# =-=-=-=-=-=-=-=-=-=-=
# Import libraries
# =-=-=-=-=-=-=-=-=-=-= 
import pandas, re, csv, os
from bs4 import BeautifulSoup

In [2]:
# =-=-=-=-=-=-=-=-=-=-=
# Define functions
# =-=-=-=-=-=-=-=-=-=-= 

def parse(thesoup):
    public_URL = thesoup.find("link", {'rel': 'canonical'})['href'].rsplit('/', 1)[0]
    for tag in thesoup.find_all("meta"):
        if tag.get("name", None) == "author":
            speaker = tag.get("content", None)
        if tag.get("itemprop", None) == "duration":
            duration = tag.get("content", None)
        if tag.get("itemprop", None) == "uploadDate":
            uploaded = tag.get("content", None)
        if tag.get("itemprop", None) == "interactionCount":
            views = tag.get("content", None)
        if tag.get("itemprop", None) == "description":
            description = tag.get("content", None)
    strung = ''.join([div.text for div in 
            thesoup.findAll("div", {"class": "Grid__cell flx-s:1 p-r:4"})])
    text   = re.sub(r"[\t]", "", strung).replace("\n", " ")
    return public_URL, speaker, duration, uploaded, views, description, text

def to_csv(pth, out):
    # open file to write to.
    with open(out, "w") as out:
        # create csv.writer. 
        wr = csv.writer(out)
        # write our headers.
        wr.writerow(["public_URL", "speaker", "duration", "uploaded", "views", 
                     "xss_description", "text"])
        # get all our html files.
        for html in os.listdir(pth):
            with open(os.path.join(pth, html)) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "html5lib")))

In [3]:
# =-=-=-=-=-=-=-=-=-=-=
# Write the CSV
# =-=-=-=-=-=-=-=-=-=-= 

to_csv("./transcripts","transcripts.csv")

TypeError: 'NoneType' object is not subscriptable

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# LOAD the CSV into a Pandas dataframe to check our work
# =-=-=-=-=-=-=-=-=-=-= 

# Let python create the column names list:
with open('./test_transcripts.csv') as f:
    colnames = f.readline().strip().split(",")

# Now will import the csv as a dataframe with the column names specified
TEDtalks = pandas.read_csv('./test_transcripts.csv', names=colnames)

# Check for success:
TEDtalks.head()

In [None]:
URLs = TEDtalks.public_URL.tolist()

In [None]:
print(URLs)