## Parsing the Descriptions

In this notebook, we are going to parse the main pages downloaded and saved in the `descriptions` directory.

As with the transcripts, we will use `public_URL` as the key for merging the CSVs. The `pandas` dataframe `merge` method is the likely tool for this, though there may well be something usable in the `csv` library too.

Each downloaded page carries its public URL in a canonical `link` tag:

    <link rel="canonical" href="https://www.ted.com/talks/al_gore_on_averting_climate_crisis" />

I can re-use the BeautifulSoup code I wrote for the transcripts to locate the `link` tag with the `rel="canonical"` attribute, but this time I won't need to delete the trailing `/transcript`.
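
The merge on `public_URL` could also be done with the standard library alone. The following is a minimal sketch, not this notebook's actual merge step: the helper name `merge_on_key` and the sample rows are hypothetical, and it assumes both CSVs have a `public_URL` column with at most one row per URL.

```python
import csv
import io


def merge_on_key(left_file, right_file, key="public_URL"):
    """Inner-join two CSV files on a shared key column (hypothetical helper)."""
    # Index the right-hand rows by key so each lookup is a dict access.
    right_rows = {row[key]: row for row in csv.DictReader(right_file)}
    merged = []
    for row in csv.DictReader(left_file):
        match = right_rows.get(row[key])
        if match is not None:
            # Columns from the right file are added to the left row.
            merged.append({**row, **match})
    return merged


# Made-up sample data standing in for the two CSVs.
transcripts = io.StringIO(
    "public_URL,text\n"
    "https://www.ted.com/talks/al_gore_on_averting_climate_crisis,sample transcript\n"
)
descriptions = io.StringIO(
    "public_URL,speaker\n"
    "https://www.ted.com/talks/al_gore_on_averting_climate_crisis,sample speaker\n"
)
rows = merge_on_key(transcripts, descriptions)
print(rows[0]["speaker"], "|", rows[0]["text"])
```

With `pandas`, the same join would go through `DataFrame.merge` with `on="public_URL"`.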

In [1]:
# =-=-=-=-=-=-=-=-=-=-=
# Import libraries
# =-=-=-=-=-=-=-=-=-=-= 
import pandas, re, csv, os, json
from bs4 import BeautifulSoup

## Sandbox

In [4]:
# Walking through the code to get the link tag above

# =-=-=-=-=-=-=-=-=-=-=
#  LOAD the file into a BS4 object
# =-=-=-=-=-=-=-=-=-=-= 

# Use a context manager so the file handle is closed after parsing.
with open("./descriptions/al_gore_on_averting_climate_crisis") as f:
    thesoup = BeautifulSoup(f, "html5lib")

# =-=-=-=-=-=-=-=-=-=-=
# public_URL
# =-=-=-=-=-=-=-=-=-=-= 
public_URL = thesoup.find("link", {'rel': 'canonical'})['href']
print(public_URL)

https://www.ted.com/talks/al_gore_on_averting_climate_crisis


In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Read the HTML & get the section we want
# =-=-=-=-=-=-=-=-=-=-= 
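# NOTE: `text` must already hold the raw HTML of one description page.
soup = BeautifulSoup(text, "html5lib")
# NOTE: lstrip/rstrip remove any leading/trailing characters found in the
# given set, not an exact prefix/suffix, so they can eat into the JSON itself.
my_list = [i.string.lstrip('q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t "__INITIAL_DATA__":')
           .rstrip('})')
           for i in soup.select('script') 
           if i.string and i.string.startswith('q')]

# Re-add the opening '{"', which lstrip removes along with the prefix.
pre_json = '{"' + "".join(my_list)
my_json = json.loads(pre_json)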

out_slug = my_json['slug']
out_vcount = my_json['viewed_count']
out_event = my_json['event']

talks_listed = str(my_json['talks']).split(",")

properties = "filmed,published" # No spaces between terms!
regex_list = [".*("+i+").*" for i in properties.split(",")]

matches = []
for e in regex_list:
    filtered = filter(re.compile(e).match, talks_listed)
    indexed = "".join(filtered).split(":")[1]
    matches.append(indexed)

out_filmed =  matches[0]
out_published = matches[1]

print(out_slug, out_vcount, out_event, out_filmed, out_published)

I'm a little concerned that the code above appears to return a list rather than a string, and BeautifulSoup's `.text` attribute doesn't appear to work if I append it to the result. For now, we'll see what happens with a test run.
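
One caveat for the extraction cell above: `str.lstrip` and `str.rstrip` treat their argument as a set of characters rather than an exact prefix or suffix, so they can strip more than intended. A regular expression that captures the object literal is one alternative. The sketch below is only that: the sample script string is invented to mimic the `q("talkPage.init", {...})` shape seen above, the helper name `extract_initial_data` is hypothetical, and the real pages may be laid out differently.

```python
import json
import re

# Invented stand-in for the contents of the page's <script> tag.
sample_script = (
    'q("talkPage.init", {\n'
    '\t"el": "[data-talk-page]",\n'
    '\t "__INITIAL_DATA__": {"slug": "sample_slug", "viewed_count": 123, "event": "Sample Event"}\n'
    '})'
)

# Capture the object that follows the "__INITIAL_DATA__": key,
# stopping before the closing "})" of the q(...) call.
PATTERN = re.compile(r'"__INITIAL_DATA__":\s*(\{.*\})\s*\}\)\s*$', re.DOTALL)


def extract_initial_data(script_text):
    """Return the __INITIAL_DATA__ object as a dict, or None if not found."""
    found = PATTERN.search(script_text)
    if found is None:
        return None
    return json.loads(found.group(1))


data = extract_initial_data(sample_script)
print(data["slug"], data["viewed_count"], data["event"])

# For comparison: lstrip removes every leading character that is in the set
# {"t", "a", "l", "k"}, not the literal prefix "talk".
stripped = "talks: 1".lstrip("talk")
print(stripped)
```

Because the values come back as a regular dict, fields such as `slug` or `viewed_count` can be read directly without any further string splitting.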

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Define functions
# =-=-=-=-=-=-=-=-=-=-= 

def parse(thesoup):
    # Use find(), not find_all(): we want the href string, not a list of tags.
    public_URL = thesoup.find("link", {'rel': 'canonical'})['href']
    # Default to None so a missing meta tag does not raise UnboundLocalError.
    speaker = duration = uploaded = views = description = None
    for tag in thesoup.find_all("meta"):
        if tag.get("name", None) == "author":
            speaker = tag.get("content", None)
        if tag.get("itemprop", None) == "duration":
            duration = tag.get("content", None)
        if tag.get("itemprop", None) == "uploadDate":
            uploaded = tag.get("content", None)
        if tag.get("itemprop", None) == "interactionCount":
            views = tag.get("content", None)
        if tag.get("itemprop", None) == "description":
            description = tag.get("content", None)
    # Join the text of the matching divs, then drop tabs and flatten newlines.
    strung = ''.join([div.text for div in 
            thesoup.find_all("div", {"class": "Grid__cell flx-s:1 p-r:4"})])
    text   = re.sub(r"[\t]", "", strung).replace("\n", " ")
    return public_URL, speaker, duration, uploaded, views, description, text

def to_csv(pth, out):
    # open file to write to.
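    # newline="" lets the csv module control line endings itself.
    with open(out, "w", newline="") as outfile:
        # create csv.writer. 
        wr = csv.writer(outfile)
        # write our headers (same order as the tuple returned by parse).
        wr.writerow(["public_URL", "speaker", "duration", "uploaded", "views", 
                     "description", "text"])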
        # get all our html files.
        for html in os.listdir(pth):
            with open(os.path.join(pth, html)) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "html5lib")))

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Write the CSV
# =-=-=-=-=-=-=-=-=-=-= 

to_csv("./test","test_transcripts.csv")

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# LOAD the CSV into a Pandas dataframe to check our work
# =-=-=-=-=-=-=-=-=-=-= 

# Let python create the column names list:
with open('./test_transcripts.csv') as f:
    colnames = f.readline().strip().split(",")

# Now import the csv as a dataframe with the column names specified;
# header=0 marks the first row as the header so it is not read as data.
TEDtalks = pandas.read_csv('./test_transcripts.csv', names=colnames, header=0)

# Check for success:
TEDtalks.head()

In [None]:
descriptions = TEDtalks.description.tolist()

In [None]:
print(descriptions)