## Getting the data

**Task**: Get the metadata for each of the talks, which will tell us when the talk actually occurred and the particular TED event at which it occurred. Bonus: get the view count.

I consulted my own [notes][] (from May 2016) on how I previously downloaded the transcripts, and, as it turns out, the Google Doc is still available and up to date. This time I only copied the URLs and pasted them into a text file.  

[notes]: http://johnlaudun.org/20160518-wgetting-ted-talk-transcripts/

I tested the `wget` command twice (`-w 2` waits two seconds between requests; `-i` reads the URLs from a file):

    wget -w 2 -i tedtalks_test.txt

The first time I used the top three entries -- what now feels like the triumvirate of Gore, Pogue, and Carter -- but then discovered that all three are from the same TED event in 2006. I then re-tested the script with URLs for talks by Dawkins and Gladwell:

    https://www.ted.com/talks/richard_dawkins_on_our_queer_universe
    https://www.ted.com/talks/malcolm_gladwell_on_spaghetti_sauce

The command worked both times. I appended `.html` to the file names and opened the files in a text editor. I eventually located the information we want in a massive `<script>` block near the end of each file. In one location is this:

    "recorded_at":"2005-07-07T00:00:00.000+00:00"

But a more interesting clump of data occurs just before this:

    "canonical":"https://www.ted.com/talks/richard_dawkins_on_our_queer_universe",
    "external":null,"name":"Richard Dawkins: Why the universe seems so strange",
    "title":"Why the universe seems so strange",
    "speaker":"Richard Dawkins",
    "thumb":"https://pi.tedcdn.com/r/pe.tedcdn.com/images/ted/160_480x360.jpg?quality=89&w=600",
    "slug":"richard_dawkins_on_our_queer_universe",
    "event":"TEDGlobal 2005",
    "filmed":1120694400,
    "published":1158019860,

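Those last two are, of course, UNIX timestamps. Read as UTC, which matches the `recorded_at` value above, they translate as follows:

* `"filmed":1120694400` = Thursday 7th July 2005
* `"published":1158019860` = Tuesday 12th September 2006

(A converter set to a US time zone shows each one day earlier: Wednesday 6th July 2005 and Monday 11th September 2006.)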
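
Which calendar day a timestamp lands on depends on the time zone used for the conversion. Here is a minimal sketch, assuming we want UTC so that the result agrees with the `recorded_at` field. The helper name `unix_to_utc` is mine, not something from the TED page.

```python
from datetime import datetime, timezone


def unix_to_utc(ts):
    """Return a timezone-aware UTC datetime for a UNIX timestamp (seconds)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# The two values found in the Dawkins page
for label, ts in [("filmed", 1120694400), ("published", 1158019860)]:
    moment = unix_to_utc(ts)
    print(label, moment.strftime("%Y-%m-%d %H:%M"), "UTC")
```

Passing `tz=timezone.utc` explicitly keeps the result the same no matter which time zone the machine running the code is set to.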

Elsewhere: `"viewed_count":2952227`.

## Downloaded Files

The results of `wget -w 2 -i tedtalks_URLs.txt`:

    FINISHED --2018-01-17 17:55:25--
    Total wall clock time: 1h 54m 14s
    Downloaded: 2628 files, 172M in 2m 27s (1.17 MB/s)

## Experiments in Parsing

In [1]:
from bs4 import BeautifulSoup
import json
import re 


# LOAD the test file
prefix = "/Users/john/Code/tedarchives/"
text = open( prefix+'test_html/richard_dawkins_on_our_queer_universe.html',
            'r').read()

# We read the HTML into BS and then select only the script section we want
soup = BeautifulSoup(text, "html5lib")
my_list = [i.string.lstrip('q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t "__INITIAL_DATA__":')
           .rstrip('})')
           for i in soup.select('script') 
           if i.string and i.string.startswith('q')]

# `my_list` is a list with only one item in it, but it is everything we want.
# Sadly the JSON module wants a string, and the stripping above also takes out
# the opening curly brace and quote mark (lstrip removes a set of characters,
# not a literal prefix), which JSON is very fussy about.

# Add the opening brace and quote, stringify, parse into JSON (which is a Python dictionary).
# For those counting: we've gone string to list to string to dictionary.

pre_json = '{"' + "".join(my_list)
my_json = json.loads(pre_json)

# for key in my_json.keys():
#     print(key) 
# viewed_count, current_talk, talks, media, speakers, description
# url, name, slug, language, comments, threadId, event

In [19]:
my_json['slug']

'richard_dawkins_on_our_queer_universe'
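
The `lstrip` call above removes any leading characters that appear in its argument, not the literal prefix, which is why the opening `{"` has to be glued back on. Below is a sketch of a less fragile alternative. It assumes the script text contains the marker `"__INITIAL_DATA__":` followed by a single JSON object, as in the files I looked at. The function name `extract_initial_data` and the `sample` string are made up for illustration.

```python
import json

MARKER = '"__INITIAL_DATA__":'


def extract_initial_data(script_text):
    """Parse the JSON object that follows MARKER in a <script> string."""
    start = script_text.index(MARKER) + len(MARKER)
    # raw_decode does not skip leading whitespace, so do that by hand
    while script_text[start].isspace():
        start += 1
    # raw_decode parses one JSON value and ignores whatever trails it
    data, _end = json.JSONDecoder().raw_decode(script_text, start)
    return data


# Hypothetical stand-in for the real <script> contents
sample = ('q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t '
          '"__INITIAL_DATA__": {"slug": "example_talk", "viewed_count": 123}\n})')

data = extract_initial_data(sample)
print(data["slug"], data["viewed_count"])
```

Because `raw_decode` stops at the end of the first complete JSON value, the trailing `})` does not need to be stripped either.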

In [3]:
# `talks` is a giant list; once stringified it uses Python's single quote
# marks, which is not valid JSON. So we're going to string it and split it
# into a list (of 682 items!) and then grab items out of the list using regex.

talks_listed = str(my_json['talks']).split(",")

properties = "filmed,published" # No spaces between terms!
regex_list = [".*("+i+").*" for i in properties.split(",")]

matches = []
for e in regex_list:
    filtered = filter(re.compile(e).match, talks_listed)
    indexed = "".join(filtered).split(":")[1]
    matches.append(indexed)

In [4]:
print(matches)

[' 1120694400', ' 1158019860']
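
The string-and-regex route works, but since `json.loads` has already produced Python objects, the values may be reachable directly. This sketch assumes that `talks` is a list of dictionaries, each with `slug`, `filmed`, and `published` keys. I have not confirmed that structure for every file, so the `sample` dictionary below is a hypothetical miniature built from the values seen above.

```python
def talk_dates(data):
    """Return (filmed, published) for the talk whose slug matches the page."""
    for talk in data["talks"]:
        if talk.get("slug") == data["slug"]:
            return talk["filmed"], talk["published"]
    raise KeyError("no entry in 'talks' matches slug " + repr(data["slug"]))


# Hypothetical miniature of the parsed page data
sample = {
    "slug": "richard_dawkins_on_our_queer_universe",
    "talks": [
        {
            "slug": "richard_dawkins_on_our_queer_universe",
            "event": "TEDGlobal 2005",
            "filmed": 1120694400,
            "published": 1158019860,
        }
    ],
}

print(talk_dates(sample))
```

This returns integers rather than strings with a leading space, so no further cleaning is needed before converting to dates.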


In [23]:
from datetime import datetime

datetime.utcfromtimestamp(1120694400).strftime('%Y-%m-%d')

'2005-07-07'

**2018-01-20**. At long last, results. I think the part that addresses what's inside the talks will need to go inside a function, but I think the rest can be in a `for` loop that works through all the texts. Now to compile these results into an executable script and try it on the `test_html` directory. From there, I will need to work out how to capture the results (it's not clear to me which data structure to use, unless they go straight into a dataframe) and then how to sync that dataframe with the main one.
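
One possible shape for capturing the results, sketched in plain Python rather than a dataframe: keep one record per talk, keyed by slug, and join on that key. The names `TalkMeta`, `index_by_slug`, and `attach_metadata` are hypothetical, and so is the assumption that the main dataset has a `slug` field to match on.

```python
from collections import namedtuple

# Same five fields, in the same order, as the tuple printed above
TalkMeta = namedtuple(
    "TalkMeta", ["slug", "view_count", "event", "filmed", "published"]
)


def index_by_slug(rows):
    """Turn an iterable of 5-tuples into a dict keyed by slug."""
    return {row[0]: TalkMeta(*row) for row in rows}


def attach_metadata(main_records, meta_by_slug):
    """Add metadata fields to each main record that has a matching slug."""
    merged = []
    for record in main_records:
        meta = meta_by_slug.get(record["slug"])
        extra = meta._asdict() if meta else {}
        merged.append({**record, **extra})
    return merged


rows = [("richard_dawkins_on_our_queer_universe", 2952227,
         "TEDGlobal 2005", "2005-07-07", "2006-09-12")]
main = [{"slug": "richard_dawkins_on_our_queer_universe", "text": "..."},
        {"slug": "some_other_talk", "text": "..."}]

for record in attach_metadata(main, index_by_slug(rows)):
    print(record.get("slug"), record.get("event"))
```

Records without a match are passed through unchanged, which makes it easy to spot talks whose metadata is still missing. The same join on `slug` carries over to a dataframe merge if that turns out to be the better home for the data.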

In [11]:
from bs4 import BeautifulSoup
import json
import re 


# =-=-=-=-=-=-=-=-=-=-=
#  LOAD the file
# =-=-=-=-=-=-=-=-=-=-= 
prefix = "/Users/john/Code/tedarchives/"
text = open( prefix+'test_html/richard_dawkins_on_our_queer_universe.html',
            'r').read()

# =-=-=-=-=-=-=-=-=-=-=
# Read the HTML & get the section we want
# =-=-=-=-=-=-=-=-=-=-= 
soup = BeautifulSoup(text, "html5lib")
my_list = [i.string.lstrip('q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t "__INITIAL_DATA__":')
           .rstrip('})')
           for i in soup.select('script') 
           if i.string and i.string.startswith('q')]

pre_json = '{"' + "".join(my_list)
my_json = json.loads(pre_json)

out_slug = my_json['slug']
out_vcount = my_json['viewed_count']
out_event = my_json['event']

talks_listed = str(my_json['talks']).split(",")

properties = "filmed,published" # No spaces between terms!
regex_list = [".*("+i+").*" for i in properties.split(",")]

matches = []
for e in regex_list:
    filtered = filter(re.compile(e).match, talks_listed)
    indexed = "".join(filtered).split(":")[1]
    matches.append(indexed)

out_filmed =  matches[0]
out_published = matches[1]

print(out_slug, out_vcount, out_event, out_filmed, out_published)

richard_dawkins_on_our_queer_universe 2952227 TEDGlobal 2005  1120694400  1158019860


In [26]:
def get_metadata(the_file):
    
    # Load the modules we need
    from bs4 import BeautifulSoup
    import json
    import re
    from datetime import datetime
    
    # Read the file, load it into BS, then grab section we want
    text = the_file.read()
    soup = BeautifulSoup(text, "html5lib")
    my_list = [i.string.lstrip('q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t "__INITIAL_DATA__":')
               .rstrip('})')
               for i in soup.select('script') 
               if i.string and i.string.startswith('q')]
    
    # Read first layer of JSON and get out those elements we want
    pre_json = '{"' + "".join(my_list)
    my_json = json.loads(pre_json)
    slug = my_json['slug']
    vcount = my_json['viewed_count']
    event = my_json['event']
    
    # Read second layer of JSON and get out listed elements:
    properties = "filmed,published" # No spaces between terms!
    talks_listed = str(my_json['talks']).split(",")
    regex_list = [".*("+i+").*" for i in properties.split(",")]
    matches = []
    for e in regex_list:
        filtered = filter(re.compile(e).match, talks_listed)
        indexed = "".join(filtered).split(":")[1]
        matches.append(indexed)
    filmed = datetime.utcfromtimestamp(float(matches[0])).strftime('%Y-%m-%d')
    published = datetime.utcfromtimestamp(float(matches[1])).strftime('%Y-%m-%d')
    return slug, vcount, event, filmed, published

def to_csv(pth, out):
    # LOAD required modules
    import csv
    import os
    # OPEN file to which to write (newline="" is what the csv module expects):
    with open(out, "w", newline="") as out_file:
        # create csv.writer.
        wr = csv.writer(out_file)
        # write our headers.
        wr.writerow(["slug", "view_count", "event", "filmed", "published"])
        # get all our html files.
        for html in os.listdir(pth):
            with open(os.path.join(pth, html)) as f:
                print(html)
                # parse the file with get_metadata() and write the data to a row.
                wr.writerow(get_metadata(f))
                
# to_csv("./talks","talks.csv") # This is to the test directory!

In [27]:
with open("/Users/john/Code/tedarchives/test_html/richard_dawkins_on_our_queer_universe.html") as my_file:
    print(get_metadata(my_file))

('richard_dawkins_on_our_queer_universe', 2952227, 'TEDGlobal 2005', '2005-07-07', '2006-09-12')
