## Getting the data

**Task**: Get the metadata for each talk, which will tell us when the talk actually occurred and at which TED event. Bonus: get the view count.

I consulted my own [notes][] (from May 2016) on how I had previously downloaded the transcripts and, as it turns out, the Google Doc is still available and up to date. This time I copied only the URLs and pasted them into a text file.

[notes]: http://johnlaudun.org/20160518-wgetting-ted-talk-transcripts/

I tested the `wget` command twice:

    wget -w 2 -i tedtalks_test.txt

The first time I used the top three entries -- what now feels like the triumvirate of Gore, Pogue, and Carter -- but then discovered that all three are from the same TED event in 2006. I then re-tested the script with URLs for talks by Dawkins and Gladwell:

    https://www.ted.com/talks/richard_dawkins_on_our_queer_universe
    https://www.ted.com/talks/malcolm_gladwell_on_spaghetti_sauce

The script worked both times. I appended `.html` to the file names and opened them in a text editor. I eventually located the information we want in a massive `<script>` block near the end of each file. In one place is this:

    "recorded_at":"2005-07-07T00:00:00.000+00:00"

But a more interesting clump of data occurs just before this:

    "canonical":"https://www.ted.com/talks/richard_dawkins_on_our_queer_universe",
    "external":null,"name":"Richard Dawkins: Why the universe seems so strange",
    "title":"Why the universe seems so strange",
    "speaker":"Richard Dawkins",
    "thumb":"https://pi.tedcdn.com/r/pe.tedcdn.com/images/ted/160_480x360.jpg?quality=89&w=600",
    "slug":"richard_dawkins_on_our_queer_universe",
    "event":"TEDGlobal 2005",
    "filmed":1120694400,
    "published":1158019860,

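Those last two are, of course, UNIX timestamps. The dates I got appear to be local-time renderings; in UTC each falls a day later, which agrees with the `recorded_at` value:

* `"filmed":1120694400` = Wednesday 6th July 2005 (Thursday 7th July 2005, 00:00 UTC)
* `"published":1158019860` = Monday 11th September 2006 (Tuesday 12th September 2006, 00:11 UTC)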
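
To check conversions like these without depending on a converter's time zone, here is a minimal sketch using only the standard library. The helper name `epoch_to_utc` is my own invention; it pins the conversion to UTC, where the two stamps land on 7 July 2005 and 12 September 2006.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert a UNIX timestamp (seconds since 1970-01-01 UTC) to an aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The two values found in the Dawkins page.
for label, stamp in [("filmed", 1120694400), ("published", 1158019860)]:
    moment = epoch_to_utc(stamp)
    print(label, moment.isoformat())
```

Passing an explicit `tz` matters: without it, `fromtimestamp` uses the machine's local zone, which is one way to end up a day off.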

Elsewhere: `"viewed_count":2952227`.

## Downloaded Files

The results of `wget -w 2 -i tedtalks_URLs.txt`:

    FINISHED --2018-01-17 17:55:25--
    Total wall clock time: 1h 54m 14s
    Downloaded: 2628 files, 172M in 2m 27s (1.17 MB/s)
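
Most of the gap between the wall-clock time and the transfer time is the `-w 2` pause. A rough back-of-the-envelope check, assuming one two-second wait per file (a slight overcount, since `wget` waits between retrievals) and lumping whatever remains into a single "other" figure whose causes I am only guessing at (connection setup, server response time):

```python
# Figures copied from the wget summary.
files = 2628
wait_per_file = 2                       # seconds, from `-w 2`
wall_clock = 1 * 3600 + 54 * 60 + 14    # 1h 54m 14s
transfer = 2 * 60 + 27                  # 2m 27s

waiting = files * wait_per_file         # upper bound on time spent pausing
other = wall_clock - waiting - transfer # everything not accounted for


def hms(seconds):
    """Format a number of seconds as hours, minutes, and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


print("waiting:", hms(waiting))
print("other:  ", hms(other))
```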

## Experiments in Parsing

In [1]:

    from bs4 import BeautifulSoup
    import json

    # Load the test file.
    prefix = "/Users/jjl5766/Dropbox/research/TEDTalks/tedarchives/"
    path = prefix + 'test_html/richard_dawkins_on_our_queer_universe.html'
    with open(path, 'r') as f:
        text = f.read()

    soup = BeautifulSoup(text, "html5lib")

    # Keep the <script> whose body starts with `q`, then trim the wrapper.
    # NB: lstrip()/rstrip() remove any run of the listed characters, not a
    # literal prefix or suffix, so the JSON's opening `{"` is removed as well.
    my_list = [i.string.lstrip('q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t "__INITIAL_DATA__":')
               .rstrip('})')
               for i in soup.select('script')
               if i.string and i.string.startswith('q')]

    # `my_list` is a list with only one item in it, but it is everything we want.
    # This could probably be a string, but I don't have that fu, so I now convert it
    # back to a string because that's what the JSON module wants.

    # Put back the `{"` that lstrip() removed.
    pre_json = '{"' + "".join(my_list)
    # translation_table = dict.fromkeys(map(ord, '[]'), None)
    # delisted = pre_json.translate(translation_table)
    # print(delisted)
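
Trimming with `lstrip` and `rstrip` depends on exactly which characters happen to surround the JSON. One alternative, sketched here and not tested against the real pages, is to find the `"__INITIAL_DATA__":` marker and let `json.JSONDecoder.raw_decode` parse a single value and stop. The function name `extract_initial_data` and the shortened `sample` string are hypothetical stand-ins; I am assuming the JSON object follows the marker directly, as the `lstrip` argument suggests.

```python
import json


def extract_initial_data(script_text, marker='"__INITIAL_DATA__":'):
    """Return the JSON value that follows `marker` in a script body."""
    start = script_text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so step past it first.
    while script_text[start].isspace():
        start += 1
    # Parses one JSON value starting at `start`; trailing text is ignored.
    data, _end = json.JSONDecoder().raw_decode(script_text, start)
    return data


# Hypothetical, heavily shortened stand-in for the real <script> body.
sample = (
    'q("talkPage.init", {\n\t"el": "[data-talk-page]",\n\t "__INITIAL_DATA__":'
    '{"event":"TEDGlobal 2005","viewed_count":2952227,'
    '"talks":[{"filmed":1120694400,"published":1158019860}]}\n})'
)

data = extract_initial_data(sample)
print(data["event"], data["viewed_count"])
```

Because `raw_decode` stops at the end of the first complete value, the closing `})` of the wrapper never needs to be stripped.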

In [2]:

    my_json = json.loads(pre_json) # my_json is a Python dictionary
    # my_json = json.loads(delisted) # my_json is a Python dictionary

    # For those counting: we've gone string to list to string to dictionary.

    for key in my_json.keys():
        print(key)

    event
    url
    viewed_count
    language
    description
    talks
    media
    speakers
    threadId
    comments
    current_talk
    name
    slug


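In [9]:

    # Flatten `talks` to its string form and split on commas,
    # giving a list of text fragments such as " 'filmed': 1120694400".
    talks_listed = str(my_json['talks']).split(",")
    len(talks_listed)

    682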

In [43]:

    import re

    re_list = [
        ".*(filmed).*",
        ".*(published).*" ]

    # For each pattern, keep the fragments of `talks_listed` that match it.
    matches = []
    for e in re_list:
        result = filter(re.compile(e).match, talks_listed)
        matches.append(result)

    for match in matches:
        print(list(match))

    [" 'filmed': 1120694400"]
    [" 'published': 1158019860"]


In [60]:

    # And here's an attempt to build a dictionary.

    import re

    properties = "filmed,published" # No spaces between terms!
    regex_list = [".*("+i+").*" for i in properties.split(",")]

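In [72]:

    # Split each matching "key: value" fragment and collect the pairs.
    # Values stay as strings here.
    talk_dates = {}
    for prop in regex_list:
        for fragment in filter(re.compile(prop).match, talks_listed):
            key, _, value = fragment.partition(":")
            talk_dates[key.strip(" '")] = value.strip(" '")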



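In [54]:

    print(regex_list)

    ['.*(filmed).*', '.*( published).*']

(This output predates cell 60. The space in `( published)` presumably came from a space after the comma in `properties`, hence the "No spaces" warning.)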
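
Since `json.loads` has already turned the page data into nested dictionaries and lists, an alternative to stringifying and regex-matching is to walk that structure directly. The sketch below is one way to do it, not what the notebook does: `find_values` is a hypothetical helper, and `my_json_sample` is a small stand-in whose shape I am assuming from the keys and fragments printed earlier.

```python
def find_values(node, wanted):
    """Walk nested dicts and lists, collecting values stored under wanted keys.

    When a key occurs more than once, the first occurrence is kept.
    """
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted:
                found.setdefault(key, value)
            for k, v in find_values(value, wanted).items():
                found.setdefault(k, v)
    elif isinstance(node, list):
        for item in node:
            for k, v in find_values(item, wanted).items():
                found.setdefault(k, v)
    return found


# Hypothetical stand-in shaped like the parsed page data.
my_json_sample = {
    "event": "TEDGlobal 2005",
    "viewed_count": 2952227,
    "talks": [{"title": "Why the universe seems so strange",
               "filmed": 1120694400, "published": 1158019860}],
}

print(find_values(my_json_sample, {"filmed", "published"}))
```

The values come back as integers, so they can go straight into a timestamp conversion without any string cleanup.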


## Older Attempts

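In [4]:

    import ast

    # str() followed by literal_eval() only rebuilds an equal Python object,
    # so this round trip gains nothing.
    data_dict = ast.literal_eval(str(my_json['talks']))

    len(data_dict)

    # Works only if `talks` is a dictionary; a list has no .keys().
    for key in data_dict.keys():
        print(key)

In [None]:

    # str() yields Python's single-quoted repr, which is not valid JSON,
    # so json.loads() raises on it.
    json_talks = json.loads(str(my_json['talks']))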
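
The reason that last attempt cannot work is easy to demonstrate in isolation. In this small sketch the `talks` value is a hypothetical stand-in; the point is only that `str()` and `json.dumps()` produce different text, and just the latter is JSON.

```python
import json

# Hypothetical stand-in for my_json['talks'].
talks = [{"filmed": 1120694400, "published": 1158019860}]

as_repr = str(talks)         # Python repr: keys in single quotes
as_json = json.dumps(talks)  # JSON text: keys in double quotes

try:
    json.loads(as_repr)
    repr_parses = True
except json.JSONDecodeError:
    repr_parses = False

round_tripped = json.loads(as_json)

print("str() output parses as JSON:", repr_parses)
print("dumps() output round-trips: ", round_tripped == talks)
```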