In this notebook, we load the CSV exported from the Google Doc into a `pandas` dataframe, use each talk's URL to fetch and scrape the HTML for the data we want, and then add that data to the dataframe. At the end, we save the dataframe back to a CSV.

This [post][] made it easier to understand what the `requests` library returns.

[post]: http://www.compjour.org/tutorials/intro-to-python-requests-and-json/
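
As a rough illustration of what a `requests` response carries, the sketch below pulls out the pieces this notebook relies on: the status code, the content type, and the decoded body. The helper names `describe_response` and `fetch_summary` are hypothetical and not part of `requests`; nothing is fetched until `fetch_summary` is called.

```python
import requests


def describe_response(response):
    """Summarize a requests.Response (hypothetical helper, not part of requests)."""
    return {
        "status": response.status_code,  # e.g. 200 or 404
        "ok": response.ok,               # True when the status code is below 400
        "content_type": response.headers.get("Content-Type"),
        "n_chars": len(response.text),   # body decoded to a str
    }


def fetch_summary(url, timeout=10.0):
    """Fetch url and summarize the response; return None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None
    return describe_response(response)
```

Checking `status` or `ok` before parsing is one way to avoid feeding an error page to the scraper.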

In [47]:
# =-=-=-=-=-=-=-=-=-=-=
# Import libraries
# =-=-=-=-=-=-=-=-=-=-= 
import csv
import os
import re

import pandas
import requests
from bs4 import BeautifulSoup

In [53]:
# =-=-=-=-=-=-=-=-=-=-=
# Define functions
# =-=-=-=-=-=-=-=-=-=-= 

def parse(thesoup):
    # Default every field to None so that a page lacking one of the
    # <meta> tags does not raise UnboundLocalError at the return.
    speaker = length = published = views = description = None
    for tag in thesoup.find_all("meta"):
        if tag.get("name") == "author":
            speaker = tag.get("content")
        if tag.get("itemprop") == "duration":
            length = tag.get("content")
        if tag.get("itemprop") == "uploadDate":
            published = tag.get("content")
        if tag.get("itemprop") == "interactionCount":
            views = tag.get("content")
        if tag.get("itemprop") == "description":
            description = tag.get("content")
    # Join the text of every transcript <div>, then strip tabs and
    # replace newlines with spaces.
    strung = "".join(div.text for div in
                     thesoup.find_all("div", {"class": "Grid__cell flx-s:1 p-r:4"}))
    text = re.sub(r"[\t]", "", strung).replace("\n", " ")
    return speaker, length, published, views, description, text

def to_csv(csv_file, out_file):
    # Read the input CSV; its first line supplies the column names.
    df = pandas.read_csv(csv_file)
    # Open the file to write to; newline="" lets csv.writer control line endings.
    with open(out_file, "w", newline="") as out:
        # Create the csv.writer.
        wr = csv.writer(out)
        # Write our headers.
        wr.writerow(["author", "length", "published", "views", "description", "text"])
        # Here's where all the lifting occurs
        for row in df.itertuples():
            # Append "/transcript" to get the transcript page
            the_url = row.public_url + "/transcript"
            # Use the requests library to get the data
            the_html = requests.get(the_url)
            # We're skipping the JSON niftiness of the requests response
            # and just converting the response to text that we can parse
            the_soup = BeautifulSoup(the_html.text, "html5lib")
            # parse() returns a 6-part tuple; write it out as one row.
            wr.writerow(parse(the_soup))
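
To see how the `<meta>` lookup behaves on a page that is missing some tags, here is a minimal, table-driven sketch run against a tiny inline document. The HTML snippet and its values are made up for illustration, `extract_meta` is a hypothetical name, and the built-in `"html.parser"` is used instead of `"html5lib"` only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up page fragment; "interactionCount" and "description" are deliberately absent.
SAMPLE_HTML = """
<html><head>
<meta name="author" content="Jane Doe">
<meta itemprop="duration" content="PT16M17S">
<meta itemprop="uploadDate" content="2006-06-27">
</head><body></body></html>
"""

# (attribute, value) pairs to look for, mapped to the output field name.
WANTED = {
    ("name", "author"): "speaker",
    ("itemprop", "duration"): "length",
    ("itemprop", "uploadDate"): "published",
    ("itemprop", "interactionCount"): "views",
    ("itemprop", "description"): "description",
}


def extract_meta(soup):
    """Return a dict of the wanted <meta> contents; missing tags stay None."""
    found = {field: None for field in WANTED.values()}
    for tag in soup.find_all("meta"):
        for (attr, value), field in WANTED.items():
            if tag.get(attr) == value:
                found[field] = tag.get("content")
    return found


meta = extract_meta(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(meta)
```

Fields whose tag is absent come back as `None` rather than raising an error, which matters when one page in a long list has a different layout.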

In [4]:
# =-=-=-=-=-=-=-=-=-=-=
# LOAD the CSV
# =-=-=-=-=-=-=-=-=-=-= 

# Let Python build the list of column names from the first line:
with open('./TED_to_add_test.csv') as f:
    colnames = f.readline().strip().split(",")
print(colnames)

# Now import the CSV as a dataframe with the column names specified.
# Because names= is given, the header line is also read in as data row 0.
TEDtalks = pandas.read_csv('./TED_to_add_test.csv', names=colnames)

# Check for success:
TEDtalks.head()

['Talk ID', 'public_url', 'speaker_name', 'headline', 'description', 'event', 'duration', 'language', 'published', 'tags']


| | Talk ID | public_url | speaker_name | headline | description | event | duration | language | published | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Talk ID | public_url | speaker_name | headline | description | event | duration | language | published | tags |
| 1 | 1 | https://www.ted.com/talks/al_gore_on_averting_... | Al Gore | Averting the climate crisis | With the same humor and humanity he exuded in ... | TED2006 | 0:16:17 | en | 6/27/06 | alternative energy,cars,global issues,climate ... |
| 2 | 7 | https://www.ted.com/talks/david_pogue_says_sim... | David Pogue | Simplicity sells | New York Times columnist David Pogue takes aim... | TED2006 | 0:21:26 | en | 6/27/06 | simplicity,entertainment,interface design,soft... |
| 3 | 53 | https://www.ted.com/talks/majora_carter_s_tale... | Majora Carter | Greening the ghetto | In an emotionally charged talk, MacArthur-winn... | TED2006 | 0:18:36 | en | 6/27/06 | MacArthur grant,cities,green,activism,politics... |
| 4 | 66 | https://www.ted.com/talks/ken_robinson_says_sc... | Ken Robinson | Do schools kill creativity? | Sir Ken Robinson makes an entertaining and pro... | TED2006 | 0:19:24 | en | 6/27/06 | children,teaching,creativity,parenting,culture... |
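
The header line appears as row 0 in the output above because `names=` is passed to `read_csv`, and in that case pandas treats the first line as data unless `header=0` is also given. The sketch below contrasts the two behaviours on a small in-memory CSV; the sample rows and `example.com` URLs are made up for illustration.

```python
import io

import pandas

# Two-row stand-in for the exported CSV (made-up values, for illustration only).
SAMPLE_CSV = (
    "Talk ID,public_url,speaker_name\n"
    "1,https://example.com/talks/a,Speaker A\n"
    "7,https://example.com/talks/b,Speaker B\n"
)

# names= without header=0: the header line is kept as data row 0.
with_names = pandas.read_csv(
    io.StringIO(SAMPLE_CSV),
    names=["Talk ID", "public_url", "speaker_name"],
)

# Default behaviour: the first line is consumed as the column names.
default = pandas.read_csv(io.StringIO(SAMPLE_CSV))

print(len(with_names), len(default))

# With the default read there is no header row to skip before iterating.
urls = [row.public_url for row in default.itertuples()]
```

With the default read, the `next(rows)` step used to skip the header row is no longer needed.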


In [31]:
# =-=-=-=-=-=-=-=-=-=-=
# Grab the URL and then the HTML
# =-=-=-=-=-=-=-=-=-=-= 

# Get an iterator of named tuples, one per dataframe row
rows = TEDtalks.itertuples()

# Skip the first row, which holds the header line read in as data
next(rows)

# Here's where all the lifting occurs
for row in rows:
    # Append "/transcript" to get the transcript page
    the_url = row.public_url + "/transcript"
    # Show which page is being fetched
    print(the_url)
    # Use the requests library to get the data
    the_html = requests.get(the_url)
    # We're skipping the JSON niftiness of the requests response
    # and just converting the response to text that we can parse
    the_soup = BeautifulSoup(the_html.text, "html5lib")
    # This is our parse function above that returns a 6-part tuple
    parse(the_soup)

https://www.ted.com/talks/al_gore_on_averting_climate_crisis/transcript
https://www.ted.com/talks/david_pogue_says_simplicity_sells/transcript
https://www.ted.com/talks/majora_carter_s_tale_of_urban_renewal/transcript
https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity/transcript


## Write the data to CSV / DF

In [56]:
to_csv("./TEDtalks_to_add.csv","added.csv")
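
The call above writes only the scraped fields to a separate file. One possible way to put them alongside the original columns in a single dataframe is sketched below. This is an assumption about the intended end result, not the notebook's own method: the sample rows are made up, and `scraped_description` and `scraped_published` are hypothetical column names chosen to avoid clashing with the existing `description` and `published` columns.

```python
import io

import pandas

# Stand-in for the loaded talks dataframe (made-up values).
talks = pandas.read_csv(io.StringIO(
    "Talk ID,public_url\n"
    "1,https://example.com/talks/a\n"
    "7,https://example.com/talks/b\n"
))

# Stand-in for the 6-part tuples a parse step would return, one per talk.
scraped = [
    ("Speaker A", "PT16M17S", "2006-06-27", "100", "First description", "First transcript"),
    ("Speaker B", "PT21M26S", "2006-06-27", "200", "Second description", "Second transcript"),
]
new_cols = ["author", "length", "scraped_published", "views",
            "scraped_description", "text"]

# Reuse the talks index so the rows line up one-to-one when concatenated.
scraped_df = pandas.DataFrame(scraped, columns=new_cols, index=talks.index)
combined = pandas.concat([talks, scraped_df], axis=1)

# Write to an in-memory buffer here; pass a file path to save to disk instead.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
```

Passing `index=False` keeps the dataframe index out of the saved file, so reloading it does not add an extra unnamed column.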