## Parsing the Descriptions

In this notebook, we are going to parse the main pages downloaded and saved in the `descriptions` directory.

As with the transcripts, we are going to use the `public_URL` as the key to merge the CSVs. We will probably do this with the `pandas` dataframe merge functionality, though there may well be something usable in the `csv` library.
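
To make the merge idea concrete, here is a minimal sketch. The two small dataframes and their URLs are hypothetical stand-ins for the transcripts and descriptions CSVs, which are not loaded here; only the `public_URL` column name comes from this notebook.

```python
import pandas

# Hypothetical stand-ins for the two CSVs; in practice these would be
# loaded with pandas.read_csv from the transcripts and descriptions files.
transcripts = pandas.DataFrame({
    "public_URL": ["https://www.ted.com/talks/talk_a",
                   "https://www.ted.com/talks/talk_b"],
    "text": ["Transcript A", "Transcript B"],
})
descriptions = pandas.DataFrame({
    "public_URL": ["https://www.ted.com/talks/talk_a",
                   "https://www.ted.com/talks/talk_c"],
    "description": ["Description A", "Description C"],
})

# An outer merge keeps every URL from both frames; indicator=True adds a
# _merge column saying whether a row was found in the left, right, or both.
merged = transcripts.merge(descriptions, on="public_URL",
                           how="outer", indicator=True)

# Rows that did not find a partner in the other file.
unmatched = merged[merged["_merge"] != "both"]
```

An outer merge with the indicator column is only one option; its advantage is that talks missing from either file show up in `unmatched` instead of silently disappearing.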

Each page carries its public URL in a `link` tag:

    <link rel="canonical" href="https://www.ted.com/talks/al_gore_on_averting_climate_crisis" />

I can re-use the BeautifulSoup code I wrote for the transcripts to locate the `link` tag with the `rel="canonical"` attribute, but this time I won't need to delete a trailing `/transcript`.
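
The transcript code is not reproduced in this notebook, so the following is only a sketch of what the lookup might look like. The function names `canonical_url` and `strip_transcript` are hypothetical, and the built-in `html.parser` is used here in place of `html5lib` simply to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

def canonical_url(html):
    """Return the href of the page's canonical link tag."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("link", {"rel": "canonical"})["href"]

def strip_transcript(url):
    """Drop a trailing /transcript so both CSVs share the same key."""
    suffix = "/transcript"
    return url[:-len(suffix)] if url.endswith(suffix) else url

page = ('<html><head><link rel="canonical" '
        'href="https://www.ted.com/talks/al_gore_on_averting_climate_crisis" />'
        '</head></html>')
url = canonical_url(page)
```

For the description pages `strip_transcript` leaves the URL unchanged; it is shown only to illustrate the step that the transcript pages needed.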

The first thing I'm going to do is to check the extant code against the new files with a small test directory of 3 files.

In [1]:
# =-=-=-=-=-=-=-=-=-=-=
# Import libraries
# =-=-=-=-=-=-=-=-=-=-= 
import pandas, re, csv, os
from bs4 import BeautifulSoup

In [7]:
import json

In [24]:
# =-=-=-=-=-=-=-=-=-=-=
# Define functions
# =-=-=-=-=-=-=-=-=-=-= 

def get_description(thesoup):
    # The canonical link tag holds the talk's public URL.
    public_URL = thesoup.find("link", {'rel': 'canonical'})['href']
    # The data we want sits inside a script tag as a JavaScript call, which
    # BS4 cannot parse. For each script whose text starts with 'q', trim the
    # call wrapper so that what remains can be loaded as JSON. Note that
    # lstrip and rstrip remove any run of the listed characters, not an exact
    # prefix or suffix, so lstrip also eats the opening '{"' of the data.
    my_list = [i.string.lstrip('q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t "__INITIAL_DATA__":')
           .rstrip('})')
           for i in thesoup.select('script') 
           if i.string and i.string.startswith('q')]
    # Put back the '{"' that lstrip removed, then parse the string as JSON.
    the_json = json.loads('{"' + "".join(my_list))

    talk_id = the_json['current_talk']
    description = the_json['description']
    views = the_json['viewed_count']
    event = the_json['event']
    return public_URL, talk_id, description, views, event

def to_csv(the_path, out):
    # Open the output file; newline="" lets the csv module control line endings.
    with open(out, "w", newline="") as out_file:
        # Create the csv.writer.
        wr = csv.writer(out_file)
        # write our headers.
        wr.writerow(["public_URL", "talk_id", "description", "views", "event"])
        # get all our html files.
        for the_file in os.listdir(the_path):
            with open(os.path.join(the_path, the_file)) as f:
                # parse the file and write the data to a row.
                wr.writerow(get_description(BeautifulSoup(f, "html5lib")))
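
Because `lstrip` and `rstrip` work on sets of characters, the trimming in `get_description` depends on which characters happen to follow the wrapper. A less fragile alternative, offered only as a sketch, is to locate the `"__INITIAL_DATA__":` marker and let the `json` module decode the single value that follows it. The sample script text below is hypothetical, modelled on the prefix used in `get_description`, and `extract_initial_data` is an illustrative name.

```python
import json

# Hypothetical script contents, shaped like the wrapper that
# get_description trims; the real pages hold far more data.
script_text = (
    'q("talkPage.init", {\n'
    '\t"el": "[data-talk-page]",\n'
    '\t "__INITIAL_DATA__":'
    '{"current_talk": "301", "description": "A talk", '
    '"viewed_count": 2598301, "event": "EG 2007"}\n'
    '})'
)

def extract_initial_data(text, marker='"__INITIAL_DATA__":'):
    """Return the first JSON value that follows `marker` in `text`."""
    start = text.index(marker) + len(marker)
    # raw_decode parses one JSON value and ignores whatever trails it,
    # so the closing '})' of the JavaScript call does not matter.
    data, _end = json.JSONDecoder().raw_decode(text[start:].lstrip())
    return data

initial_data = extract_initial_data(script_text)
```

With this approach there is no need to add `'{"'` back by hand, since nothing is stripped from the JSON itself.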

In [25]:
# =-=-=-=-=-=-=-=-=-=-=
# Write the CSV
# =-=-=-=-=-=-=-=-=-=-= 

to_csv("./test_descriptions","test_descriptions.csv")

In [27]:
# =-=-=-=-=-=-=-=-=-=-=
# LOAD the CSV into a Pandas dataframe to check our work
# =-=-=-=-=-=-=-=-=-=-= 

# The first row of the CSV already holds the column names, so read_csv
# picks them up as the header by default. Passing names= as well would
# load that header row a second time, as a row of data.
TEDtalks = pandas.read_csv('./test_descriptions.csv')

# Check for success:
TEDtalks.head()

| | public_URL | talk_id | description | views | event |
|---|---|---|---|---|---|
| 0 | https://www.ted.com/talks/a_j_jacobs_year_of_l... | 301 | Author, philosopher, prankster and journalist ... | 2598301 | EG 2007 |
| 1 | https://www.ted.com/talks/a_choir_as_big_as_th... | 832 | 185 voices from 12 countries join a choir that... | 402529 | Eric Whitacre's Virtual Choir |
| 2 | https://www.ted.com/talks/9_11_healing_the_mot... | 1136 | Phyllis Rodriguez and Aicha el-Wafi have a pow... | 879798 | TEDWomen 2010 |


In [28]:
to_csv("./descriptions","descriptions.csv")

TypeError: 'NoneType' object is not subscriptable
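
The traceback is not shown, so the failing line is not certain. One plausible cause is that `find` returned `None` for a file with no canonical `link` tag (for example a stray non-talk file in the directory), which makes the `['href']` lookup fail with exactly this error. A minimal sketch of a guard, with the hypothetical name `safe_canonical_url` and made-up sample pages:

```python
from bs4 import BeautifulSoup

def safe_canonical_url(soup):
    """Return the canonical URL, or None when the page has no such tag."""
    tag = soup.find("link", {"rel": "canonical"})
    return tag["href"] if tag is not None else None

# Hypothetical sample pages: one with the tag, one without.
good = BeautifulSoup(
    '<html><head><link rel="canonical" '
    'href="https://www.ted.com/talks/example_talk" /></head></html>',
    "html.parser",
)
bad = BeautifulSoup(
    "<html><head></head><body>Not a talk page</body></html>",
    "html.parser",
)

good_url = safe_canonical_url(good)
bad_url = safe_canonical_url(bad)
```

Calling a check like this on each file in `./descriptions` and printing the names of those that return `None` would show which files need a closer look before `to_csv` is run again.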