## Parsing the Descriptions

In this notebook, we are going to parse the main pages downloaded and saved in the `descriptions` directory.

Like the transcripts, we are going to use the `public_URL` as the key to merge the CSVs. The `pandas` dataframe `merge` functionality is the likely tool for this, though the `csv` library might also serve. Each main page records its public URL in a canonical `link` tag:

    <link rel="canonical" href="https://www.ted.com/talks/al_gore_on_averting_climate_crisis" />

I can re-use the BeautifulSoup code I wrote for the transcripts to locate the `link` tag with the `rel="canonical"` attribute, but this time I won't need to delete a trailing `/transcript`.
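
As far as I know, the `csv` module has no merge function of its own, but `csv.DictReader` plus a dictionary keyed on `public_URL` can do a simple inner join. This is a minimal sketch under assumptions: `merge_on_key` is a hypothetical helper, and the two in-memory CSVs (with their `transcript` and `views` columns and placeholder values) are stand-ins for the real files.

```python
import csv
import io


def merge_on_key(left_rows, right_rows, key="public_URL"):
    """Inner-join two lists of dict rows on `key`; right-hand values win on clashes."""
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# Hypothetical stand-ins for the transcripts and descriptions CSVs.
transcripts_csv = io.StringIO(
    "public_URL,transcript\n"
    "https://www.ted.com/talks/al_gore_on_averting_climate_crisis,placeholder text\n"
)
descriptions_csv = io.StringIO(
    "public_URL,views\n"
    "https://www.ted.com/talks/al_gore_on_averting_climate_crisis,123\n"
)

merged = merge_on_key(
    list(csv.DictReader(transcripts_csv)),
    list(csv.DictReader(descriptions_csv)),
)
```

With two dataframes already loaded, the `pandas` version would be something like `transcripts.merge(descriptions, on="public_URL")`, which is probably the shorter route.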

The first thing I'm going to do is to check the extant code against the new files with a small test directory of 3 files.

In [1]:
# =-=-=-=-=-=-=-=-=-=-=
# Import libraries
# =-=-=-=-=-=-=-=-=-=-= 
import pandas, re, csv, os, json
from bs4 import BeautifulSoup

In [3]:
# =-=-=-=-=-=-=-=-=-=-=
# Define functions
# =-=-=-=-=-=-=-=-=-=-= 

def get_description(thesoup):
    # find() returns None if there is no canonical link, and None['href'] is a TypeError.
    link = thesoup.find("link", {'rel': 'canonical'})
    public_URL = link['href'] if link is not None else None
    # The data we want sits inside a script tag as a JavaScript call, which BS4 cannot
    # parse. This list comprehension strips the wrapper and keeps the JSON payload.
    # Caution: lstrip()/rstrip() remove any run of the listed characters, not an exact prefix.
    my_list = [i.string.lstrip('q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t "__INITIAL_DATA__":')
           .rstrip('})')
           for i in thesoup.select('script') 
           if i.string and i.string.startswith('q')]
    # lstrip() also strips the payload's opening '{"', so re-add it before parsing to a dict
    the_json = json.loads('{"' + "".join(my_list))

    talk_id = the_json['current_talk']
    description = the_json['description']
    views = the_json['viewed_count']
    event = the_json['event']
    return public_URL, talk_id, description, views, event
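
Because `lstrip()` treats its argument as a set of characters, the stripping above depends on where the first unlisted character happens to fall. A possibly sturdier approach is to find the `"__INITIAL_DATA__":` marker and let the JSON decoder read exactly one object from that point. This is a minimal sketch, assuming the payload directly follows that marker; `MARKER`, `extract_initial_data`, and the `sample` string are hypothetical names and data for illustration.

```python
import json

# lstrip() removes any leading run of the given characters, not a literal prefix.
stripped = "apple pie".lstrip("ap")  # -> "le pie"

MARKER = '"__INITIAL_DATA__":'


def extract_initial_data(script_text):
    """Return the object that follows MARKER, or None if the marker is absent."""
    start = script_text.find(MARKER)
    if start == -1:
        return None
    start += len(MARKER)
    # raw_decode() does not skip leading whitespace, so do that by hand.
    while start < len(script_text) and script_text[start].isspace():
        start += 1
    # raw_decode() parses one JSON value and ignores whatever comes after it.
    data, _end = json.JSONDecoder().raw_decode(script_text, start)
    return data


# Hypothetical script contents shaped like the wrapper stripped above.
sample = (
    'q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t '
    '"__INITIAL_DATA__": {"current_talk": 1, "description": "placeholder"}\n})'
)
data = extract_initial_data(sample)
```

Since `raw_decode()` stops at the end of the first complete value, the trailing `})` never needs stripping.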

In [1]:
def to_csv_test(the_path, out):
    # open file to write to (newline="" is what the csv module expects).
    with open(out, "w", newline="") as out_file:
        # create csv.writer.
        wr = csv.writer(out_file)
        # write our headers.
        wr.writerow(["public_URL", "talk_id", "description", "views", "event"])
        # get all our html files.
        for the_file in os.listdir(the_path):
            print(the_file)
            with open(os.path.join(the_path, the_file)) as f:
                # parse the file and write the data to a row.
                # NOTE: `finally` runs even after a success, and by then f has been read to
                # the end, so the second parse sees an empty document. try/else is the better fit.
                try:
                    wr.writerow(get_description(BeautifulSoup(f, "html5lib")))
                except ValueError:
                    print('Error with ' + the_file)
                finally:
                    wr.writerow(get_description(BeautifulSoup(f, "html5lib")))
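
Here is what the try/else version might look like: parse inside `try`, report the file in `except`, and write the row only in `else`, so one bad file does not stop the run or write a second row. This is a minimal sketch; `write_rows` and `fake_parse` are hypothetical names, and `fake_parse` stands in for `get_description` so the example runs without any HTML files.

```python
import csv
import io


def write_rows(sources, parse, out_handle, header):
    """Write one CSV row per source and return the names that failed to parse."""
    writer = csv.writer(out_handle)
    writer.writerow(header)
    failed = []
    for name, text in sources:
        try:
            row = parse(text)
        except ValueError as err:
            # Record the failure and move on to the next source.
            failed.append(name)
            print("Error with {}: {}".format(name, err))
        else:
            # Only reached when parse() succeeded.
            writer.writerow(row)
    return failed


def fake_parse(text):
    # Hypothetical stand-in for get_description: rejects empty input.
    if not text:
        raise ValueError("empty document")
    return [text.upper()]


buffer = io.StringIO()
failed = write_rows(
    [("a.html", "ok"), ("b.html", "")], fake_parse, buffer, ["value"]
)
```

Returning the list of failed names gives a short list of files to inspect by hand afterwards.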

In [3]:
def to_csv(the_path, out):
    # open file to write to (newline="" is what the csv module expects)
    with open(out, "w", newline="") as out_file:
        # create csv.writer
        wr = csv.writer(out_file)
        # write our headers.
        wr.writerow(["public_URL", "talk_id", "description", "views", "event"])
        # get all our html files
        for the_file in os.listdir(the_path):
            with open(os.path.join(the_path, the_file)) as f:
                # parse the file and write the data to a row.
                wr.writerow(get_description(BeautifulSoup(f, "html5lib")))

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Write the CSV
# =-=-=-=-=-=-=-=-=-=-= 

to_csv("./test_descriptions","test_descriptions.csv")

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# LOAD the CSV into a Pandas dataframe to check our work
# =-=-=-=-=-=-=-=-=-=-= 

# read_csv takes the column names from the header row by default, so there is no need
# to build the list by hand. (Passing names= without header=0 would load the header
# row as a row of data.)
TEDtalks = pandas.read_csv('./test_descriptions.csv')

# Check for success:
TEDtalks.head()

In [4]:
to_csv_test("./descriptions","descriptions.csv")

ValueError: Unterminated string starting at: line 1 column 2 (char 1)

In [7]:
# Early try/except workaround: log the traceback instead of crashing (still stops at the first bad file).

import logging

try:
    to_csv("./descriptions","descriptions.csv")
except Exception:
    logging.exception('Caught an error')

ERROR:root:Caught an error
Traceback (most recent call last):
  File "<ipython-input-7-0e20793f003b>", line 6, in <module>
    to_csv("./descriptions","descriptions.csv")
  File "<ipython-input-3-f7b42c7689c8>", line 12, in to_csv
    wr.writerow(get_description(BeautifulSoup(f, "html5lib")))
  File "<ipython-input-2-e433924df7a0>", line 17, in get_description
    the_json = json.loads('{"' + "".join(my_list))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/__init__.py", line 318, in loads
    return _default_decoder.decode(s)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py", line 343, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/json/decoder.py", line 359, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting ':' delimiter: line 1 column 21 (char 20)
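
To see what the parser is choking on, it may help to print the text around the failing position. This is a minimal sketch, assuming Python 3.5 or later, where `json.JSONDecodeError` carries `msg` and `pos`; the traceback above comes from 3.4, which raises a plain `ValueError` without those attributes. `json_error_context` is a hypothetical helper.

```python
import json


def json_error_context(text, width=30):
    """Return the decoder's message plus the text around the failure, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        lo = max(err.pos - width, 0)
        hi = err.pos + width
        return "{} | {}".format(err.msg, text[lo:hi])
    return None


# A deliberately broken string: the colon after the key is missing.
context = json_error_context('{"a" 1}')
```

Calling this on the string passed to `json.loads` in `get_description` should show whether the stripping left a broken key at the start of the payload.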
