## Parsing the Descriptions

In this notebook, we are going to parse the main pages downloaded and saved in the `descriptions` directory.

To make a merge possible, I need a **key** that these files share with the other CSVs. The easiest candidate looks like **`public_URL`**, and the following HTML should give me what I need:

    <link rel="canonical" href="https://www.ted.com/talks/al_gore_on_averting_climate_crisis/transcript" />

I can use BeautifulSoup to locate the `link` tag whose `rel` attribute is `canonical`, read its `href`, and then chomp `/transcript` off the end of the string.

Sounds doable, yeah?
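
The chomping step can be sketched on its own, before any HTML parsing. This is only a minimal sketch: the helper name `strip_transcript_suffix` is hypothetical (it is not part of the notebook), and it assumes the canonical URL always ends in `/transcript` when the suffix is present at all.

```python
def strip_transcript_suffix(url, suffix="/transcript"):
    """Return url without the trailing suffix, or unchanged if it is absent."""
    if url.endswith(suffix):
        return url[: -len(suffix)]
    return url


# The canonical URL from the HTML snippet above.
canonical = "https://www.ted.com/talks/al_gore_on_averting_climate_crisis/transcript"
public_URL = strip_transcript_suffix(canonical)
print(public_URL)
# → https://www.ted.com/talks/al_gore_on_averting_climate_crisis
```

Checking with `endswith` first means a URL that lacks the suffix passes through untouched instead of losing its last characters.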

## Sandbox

In [14]:
# Walking through the code to get the link tag above

# Load HTML into a BS4 object (the with-block closes the file afterwards)
with open("./transcripts/transcript.0") as f:
    thesoup = BeautifulSoup(f, "html5lib")

public_URL = thesoup.find_all("link", {'rel': 'canonical'})
print(public_URL)

[<link href="https://www.ted.com/talks/al_gore_on_averting_climate_crisis/transcript" rel="canonical"/>]


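I'm a little concerned that the code above returns a list: `find_all` always returns a list-like `ResultSet`, even when there is only one match. Appending BS's `.text` doesn't work either, because `.text` belongs to an individual tag rather than to the list, and a `link` tag has no text anyway: the URL lives in its `href` attribute. For now, we'll see what happens with a test run.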
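
One possible way around the list, as a minimal sketch rather than what the notebook does next: `find` returns a single tag (or `None`), and the URL can then be read from the `href` attribute. The inline HTML string and the `html.parser` backend are stand-ins chosen so the example runs without the saved files or `html5lib`.

```python
from bs4 import BeautifulSoup

# Stand-in for one saved page; only the canonical link matters here.
html = (
    '<html><head><link rel="canonical" '
    'href="https://www.ted.com/talks/al_gore_on_averting_climate_crisis/transcript" />'
    "</head></html>"
)
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("link", {"rel": "canonical"})  # list-like ResultSet
first = soup.find("link", {"rel": "canonical"})        # a single Tag, or None

# The URL is stored in the href attribute; the tag itself contains no text.
canonical = first["href"] if first is not None else None
print(canonical)
```

Indexing the list with `matches[0]` would give the same tag as `find`, but `find` avoids an `IndexError` when a page has no canonical link.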

In [2]:
# =-=-=-=-=-=-=-=-=-=-=
# Import libraries
# =-=-=-=-=-=-=-=-=-=-= 
import pandas, re, csv, os
from bs4 import BeautifulSoup

In [34]:
# =-=-=-=-=-=-=-=-=-=-=
# Define functions
# =-=-=-=-=-=-=-=-=-=-= 

def parse(thesoup):
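    # NOTE: find_all returns a list of matching tags, not the URL string.
    public_URL = thesoup.find_all("link", {'rel': 'canonical'})
    # Default to None so a page missing a meta tag doesn't raise UnboundLocalError.
    speaker = duration = uploaded = views = description = None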
    for tag in thesoup.find_all("meta"):
        if tag.get("name", None) == "author":
            speaker = tag.get("content", None)
        if tag.get("itemprop", None) == "duration":
            duration = tag.get("content", None)
        if tag.get("itemprop", None) == "uploadDate":
            uploaded = tag.get("content", None)
        if tag.get("itemprop", None) == "interactionCount":
            views = tag.get("content", None)
        if tag.get("itemprop", None) == "description":
            description = tag.get("content", None)
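    # Join the text of every transcript cell (find_all is the current name for findAll).
    strung = ''.join([div.text for div in 
            thesoup.find_all("div", {"class": "Grid__cell flx-s:1 p-r:4"})])
    # Drop tabs and turn newlines into spaces.
    text   = re.sub(r"[\t]", "", strung).replace("\n", " ")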
    return public_URL, speaker, duration, uploaded, views, description, text

def to_csv(pth, out):
    # open file to write to; newline="" is what the csv module expects.
    with open(out, "w", newline="") as outfile:
        # create csv.writer. 
        wr = csv.writer(outfile)
        # write our headers.
        wr.writerow(["public_URL", "speaker", "duration", "uploaded", "views", 
                     "description", "text"])
        # get all our html files.
        for html in os.listdir(pth):
            with open(os.path.join(pth, html)) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "html5lib")))

In [30]:
# =-=-=-=-=-=-=-=-=-=-=
# Write the CSV
# =-=-=-=-=-=-=-=-=-=-= 

to_csv("./test","test_transcripts.csv")

In [31]:
# =-=-=-=-=-=-=-=-=-=-=
# LOAD the CSV into a Pandas dataframe to check our work
# =-=-=-=-=-=-=-=-=-=-= 

# Let python create the column names list:
with open('./test_transcripts.csv') as f:
    colnames = f.readline().strip().split(",")

# Now import the csv as a dataframe with the column names specified
TEDtalks = pandas.read_csv('./test_transcripts.csv', names=colnames)

# Check for success:
TEDtalks.head()

|   | public_URL | speaker | duration | uploaded | views | description | text |
|---|---|---|---|---|---|---|---|
| 0 | public_URL | speaker | duration | uploaded | views | description | text |
| 1 | `[<link href="https://www.ted.com/talks/erica_s...` | Erica Stone | PT9M44S | 2018-03-28T15:00:07+00:00 | 724708 | In the US, your taxes fund academic research a... | Do you ever find yourself referencing a stud... |
| 2 | `[<link href="https://www.ted.com/talks/mennat_...` | Mennat El Ghalid | PT4M36S | 2018-03-27T19:56:43+00:00 | 811626 | Each year, the world loses enough food to feed... | "Will the blight end the chestnut? The farme... |
| 3 | `[<link href="https://www.ted.com/talks/ndidi_n...` | Ndidi Nwuneli | PT13M12S | 2018-03-29T14:53:59+00:00 | 662380 | Ndidi Nwuneli has advice for Africans who beli... | I was born to two amazing professors who wer... |
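
Row 0 in the output above is the header line read back in as data. That is what `read_csv` does when it is given `names=` for a file that already has a header: it assumes the file has none. A minimal sketch of the difference, using a tiny in-memory CSV as a stand-in for `test_transcripts.csv` (the URL and speaker in it are made up for illustration):

```python
import io

import pandas

# Hypothetical two-column stand-in: one header line, one data row.
csv_text = (
    "public_URL,speaker\n"
    "https://www.ted.com/talks/example_talk,Example Speaker\n"
)

# With names= and no header argument, pandas treats the header line as data.
with_names = pandas.read_csv(io.StringIO(csv_text), names=["public_URL", "speaker"])

# By default, pandas takes the column names from the first line of the file.
default = pandas.read_csv(io.StringIO(csv_text))

print(len(with_names), len(default))
# → 2 1
```

If the explicit `names=` list is still wanted, adding `header=0` should tell pandas to replace the existing header line rather than keep it as a row.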


In [32]:
descriptions = TEDtalks.description.tolist()

In [33]:
print(descriptions)

['description', 'In the US, your taxes fund academic research at public universities. Why then do you need to pay expensive, for-profit journals for the results of that research? Erica Stone advocates for a new, open-access relationship between the public and scholars, making the case that academics should publish in more accessible media. "A functioning democracy requires that the public be well-educated and well-informed," Stone says. "Instead of research happening behind paywalls and bureaucracy, wouldn\'t it be better if it was unfolding right in front of us?"', 'Each year, the world loses enough food to feed half a billion people to fungi, the most destructive pathogens of plants. Mycologist and TED Fellow Mennat El Ghalid explains how a breakthrough in our understanding of the molecular signals fungi use to attack plants could disrupt this interaction — and save our crops.', 'Ndidi Nwuneli has advice for Africans who believe in God — and Africans who don\'t. To the religious, she