In [2]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import urllib2 # module to read in HTML
import bs4 # BeautifulSoup: module to parse HTML and XML
import json # module to encode and decode JSON
import datetime as dt # module for manipulating dates and times
import pandas as pd
import numpy as np

## Recall from lab last week 09/19/2014

Previously discussed: 

* More pandas, matplotlib for exploratory data analysis
* Brief introduction to numpy and scipy
* Working on the command line
* Overview of git and Github

## Today, we will discuss the following:

* urllib2 - reads in HTML
* BeautifulSoup - use to parse HTML and XML code
    * Reddit
* JSON examples
    * World Cup

<a href=https://raw.githubusercontent.com/cs109/2014/master/labs/Lab4_Notes.ipynb download=Lab4_Notes.ipynb> Download this notebook from Github </a>

# urllib2

[urllib2](https://docs.python.org/2/library/urllib2.html) is a useful module for getting information about, and retrieving data from, the web. The function `urlopen()` opens a URL, much like opening a file, and returns a file-like object that supports some of the same methods as a file object. For example, `read()` reads the entire HTML of the webpage into a single string, `readlines()` reads the text in line by line, and `close()` closes the URL connection.


In [3]:
x = urllib2.urlopen("http://www.google.com")
htmlSource = x.read()
x.close()

In [4]:
type(htmlSource)

str
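The difference between `read()` and `readlines()` can be seen without touching the network. The sketch below is only an illustration under a few assumptions: it targets Python 3, where `urlopen()` lives in `urllib.request` instead of `urllib2` and returns bytes rather than `str`, and it opens a `data:` URL that embeds a tiny made-up HTML document. The names `html`, `whole_page` and `lines` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen

# A made-up five-line document, percent-encoded into a data: URL
# so that the example runs without a network connection.
html = "<html>\n<body>\n<p>hello</p>\n</body>\n</html>\n"
url = "data:text/html," + quote(html)

# read() returns the whole response body at once.
with urlopen(url) as response:
    whole_page = response.read()

# readlines() returns the same content as a list of lines.
with urlopen(url) as response:
    lines = response.readlines()

print(type(whole_page))
print(len(lines))
```

The `with` statement closes the connection when the block ends, which plays the same role as the explicit `x.close()` call above.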

In [5]:
print htmlSource

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="/images/google_favicon_128.png" itemprop="image"><title>Google</title><script>(function(){window.google={kEI:'wl1nVYLQBc-koQT2uIHYCA',kEXPI:'18168,3700332,4014789,4020726,4026111,4029570,4029815,4030124,4032064,4032500,4032522,4032643,4032645,4032678,4033183,4033307,4033344,4034425,4034617,4034884,4035816,4036346,4036486,4036531,4036539,4036665,4037452,4037457,4037500,4037650,8300096,8300175,8500394,8500851,8501248,8501295,8501351,8501407,8501489,8501498,8501594,10200083,10201087,10201180,10201192,10201218',authuser:0,kSID:'c9c918f0_10'};google.kHL='en';})();(function(){google.lc=[];google.li=0;google.getEI=function(a){for(var b;a&&(!a.getAttribute||!(b=a.get

# BeautifulSoup

Once you have the HTML source code, you have to parse it and clean it up.

[BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a really useful Python module for parsing HTML and XML files.  Let's try a few examples. 

For this section, we will be working with the HTML code from [Reddit](http://www.reddit.com). 

In [6]:
x = urllib2.urlopen("http://www.reddit.com") # open the URL
htmlSource = x.read()
x.close()
print htmlSource

<!doctype html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>reddit: the front page of the internet</title><meta name="keywords" content=" reddit, reddit.com, vote, comment, submit " /><meta name="description" content="reddit: the front page of the internet" /><meta name="referrer" content="always"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.reddit.com/" /><meta name="viewport" content="width=1024"><link rel='icon' href="//www.redditstatic.com/icon.png" sizes="256x256" type="image/png" /><link rel='shortcut icon' href="//www.redditstatic.com/favicon.ico" type="image/x-icon" /><link rel='apple-touch-icon-precomposed' href="//www.redditstatic.com/icon-touch.png" /><link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.reddit.com/.rss" /><link rel="stylesheet" type="text/css" href="//www.redditstatic.com/reddit.KNkFjKkn_1Y.css" 

### prettify()

Beautiful Soup gives us a `BeautifulSoup` object, which represents the document as a nested data structure. We can use the `prettify()` method to show the different levels of the HTML code. 

In [7]:
soup = bs4.BeautifulSoup(htmlSource)
print soup.prettify()

<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   reddit: the front page of the internet
  </title>
  <meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>
  <meta content="reddit: the front page of the internet" name="description"/>
  <meta content="always" name="referrer"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="https://m.reddit.com/" media="only screen and (max-width: 640px)" rel="alternate"/>
  <meta content="width=1024" name="viewport"/>
  <link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/>
  <link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/>
  <link href="http://www.reddit.com/.rss" rel="alternate" title="RSS" type="application/rss+xml"/>
  <link href="//www.redditstatic.com/reddit.KNkFjKkn_1

### Navigating the tree using tags

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`:

In [8]:
print soup.head.prettify()

<head>
 <title>
  reddit: the front page of the internet
 </title>
 <meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>
 <meta content="reddit: the front page of the internet" name="description"/>
 <meta content="always" name="referrer"/>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <link href="https://m.reddit.com/" media="only screen and (max-width: 640px)" rel="alternate"/>
 <meta content="width=1024" name="viewport"/>
 <link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/>
 <link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
 <link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/>
 <link href="http://www.reddit.com/.rss" rel="alternate" title="RSS" type="application/rss+xml"/>
 <link href="//www.redditstatic.com/reddit.KNkFjKkn_1Y.css" media="all" rel="stylesheet" type="text/css"/>
 <link href="//www.redditstatic.com/thebutton

### .contents and .children

A tag’s children are available in a list called `.contents`:

In [9]:
soup.head.contents

[<title>reddit: the front page of the internet</title>,
 <meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>,
 <meta content="reddit: the front page of the internet" name="description"/>,
 <meta content="always" name="referrer"/>,
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>,
 <link href="https://m.reddit.com/" media="only screen and (max-width: 640px)" rel="alternate"/>,
 <meta content="width=1024" name="viewport"/>,
 <link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/>,
 <link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>,
 <link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/>,
 <link href="http://www.reddit.com/.rss" rel="alternate" title="RSS" type="application/rss+xml"/>,
 <link href="//www.redditstatic.com/reddit.KNkFjKkn_1Y.css" media="all" rel="stylesheet" type="text/css"/>,
 <link href="//www.redditstatic.com/thebutton

In [10]:
len(soup.head.contents)

24

In [11]:
# Extract first three elements from the list of contents
soup.head.contents[0:3]

[<title>reddit: the front page of the internet</title>,
 <meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>,
 <meta content="reddit: the front page of the internet" name="description"/>]

Instead of getting them as a list, you can iterate over a tag’s children using the `.children` iterator:

In [12]:
soup.head.children

<listiterator at 0x1079a2990>

In [13]:
for child in soup.head.children:
    print(child)

<title>reddit: the front page of the internet</title>
<meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>
<meta content="reddit: the front page of the internet" name="description"/>
<meta content="always" name="referrer"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="https://m.reddit.com/" media="only screen and (max-width: 640px)" rel="alternate"/>
<meta content="width=1024" name="viewport"/>
<link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/>
<link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/>
<link href="http://www.reddit.com/.rss" rel="alternate" title="RSS" type="application/rss+xml"/>
<link href="//www.redditstatic.com/reddit.KNkFjKkn_1Y.css" media="all" rel="stylesheet" type="text/css"/>
<link href="//www.redditstatic.com/thebutton.i-G5y6pAHNE.css" media="

In [14]:
# print the title of reddit
soup.head.title

<title>reddit: the front page of the internet</title>

In [15]:
# print the string in the title
soup.head.title.string

u'reddit: the front page of the internet'
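The Reddit page changes constantly, so the same attributes are easier to check on a small fixed document. The following is a minimal sketch that assumes a hypothetical miniature page (`html`) and names the parser explicitly as `"html.parser"`; it is written so that it also runs on Python 3.

```python
import bs4

# Hypothetical miniature page standing in for the Reddit source.
html = ("<html><head><title>Demo page</title>"
        "<meta charset='utf-8'></head>"
        "<body><p>Hello</p></body></html>")
demo_soup = bs4.BeautifulSoup(html, "html.parser")
head = demo_soup.head

# .contents is a list, so it supports len() and indexing.
n_children = len(head.contents)
first_child = head.contents[0]

# .children yields the same direct children one at a time.
child_names = [child.name for child in head.children]

# .string returns the text of a tag that holds a single string.
title_text = head.title.string

print(n_children, child_names, title_text)
```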

### .descendants

The `.descendants` attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

In [16]:
for child in soup.head.descendants:
    print child

<title>reddit: the front page of the internet</title>
reddit: the front page of the internet
<meta content=" reddit, reddit.com, vote, comment, submit " name="keywords"/>
<meta content="reddit: the front page of the internet" name="description"/>
<meta content="always" name="referrer"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="https://m.reddit.com/" media="only screen and (max-width: 640px)" rel="alternate"/>
<meta content="width=1024" name="viewport"/>
<link href="//www.redditstatic.com/icon.png" rel="icon" sizes="256x256" type="image/png"/>
<link href="//www.redditstatic.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="//www.redditstatic.com/icon-touch.png" rel="apple-touch-icon-precomposed"/>
<link href="http://www.reddit.com/.rss" rel="alternate" title="RSS" type="application/rss+xml"/>
<link href="//www.redditstatic.com/reddit.KNkFjKkn_1Y.css" media="all" rel="stylesheet" type="text/css"/>
<link href="//www.redditstatic

### .strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the `.strings` generator:

In [17]:
soup.strings

<generator object _all_strings at 0x109d34b90>

In [18]:
for string in soup.strings:
    print(repr(string))

u'reddit: the front page of the internet'
u'r.setup({"ajax_domain": "www.reddit.com", "server_time": 1432837495.0, "post_site": "", "clicktracker_url": "//pixel.redditmedia.com/click", "logged": false, "stats_domain": "https://stats.redditmedia.com", "cur_domain": "reddit.com", "is_sponsor": false, "https_forced": false, "user_id": false, "eventtracker_url": "//pixel.redditmedia.com/pixel/of_delight.png", "is_fake": true, "renderstyle": "html", "over_18": false, "vote_hash": "BEEBru/5LqzLGFj0SoXqO/afl0cXr2mWCipyXxZVf0a4Hh+zeGW9rFizYtcbFuQxk4AXGc/JfTQ8T/ZeWc3nHaU9gqWSKL1wEEjTr8A35tXSst3RgFlormByNs9Q0dcwxfaXWvnUg+6zUNev8YwVoLoB1TfemYoiWW29POqwlYpLtJs3MbxmkyI=", "adtracker_url": "//pixel.redditmedia.com/pixel/of_doom.png", "uitracker_url": "//pixel.redditmedia.com/pixel/of_discovery.png", "modhash": false, "store_visits": false, "anon_eventtracker_url": "//pixel.redditmedia.com/pixel/of_diversity.png", "new_window": false, "send_logs": true, "gold": false, "pageInfo": {"actionName": "hot.

### .stripped_strings

These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:

In [19]:
for string in soup.stripped_strings:
    print(repr(string))

u'reddit: the front page of the internet'
u'r.setup({"ajax_domain": "www.reddit.com", "server_time": 1432837495.0, "post_site": "", "clicktracker_url": "//pixel.redditmedia.com/click", "logged": false, "stats_domain": "https://stats.redditmedia.com", "cur_domain": "reddit.com", "is_sponsor": false, "https_forced": false, "user_id": false, "eventtracker_url": "//pixel.redditmedia.com/pixel/of_delight.png", "is_fake": true, "renderstyle": "html", "over_18": false, "vote_hash": "BEEBru/5LqzLGFj0SoXqO/afl0cXr2mWCipyXxZVf0a4Hh+zeGW9rFizYtcbFuQxk4AXGc/JfTQ8T/ZeWc3nHaU9gqWSKL1wEEjTr8A35tXSst3RgFlormByNs9Q0dcwxfaXWvnUg+6zUNev8YwVoLoB1TfemYoiWW29POqwlYpLtJs3MbxmkyI=", "adtracker_url": "//pixel.redditmedia.com/pixel/of_doom.png", "uitracker_url": "//pixel.redditmedia.com/pixel/of_discovery.png", "modhash": false, "store_visits": false, "anon_eventtracker_url": "//pixel.redditmedia.com/pixel/of_diversity.png", "new_window": false, "send_logs": true, "gold": false, "pageInfo": {"actionName": "hot.
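In the Reddit output the two generators look almost identical, because the first strings happen to contain no surrounding whitespace. A small sketch on a hypothetical snippet with deliberate padding makes the difference visible; the snippet and variable names are assumptions for illustration.

```python
import bs4

# Hypothetical snippet with indentation and padded text.
html = "<div>\n  <p> first </p>\n  <p>second</p>\n</div>"
demo_soup = bs4.BeautifulSoup(html, "html.parser")

# .strings keeps whitespace-only strings and padding as they are.
raw_strings = list(demo_soup.strings)

# .stripped_strings drops whitespace-only strings and trims the rest.
clean_strings = list(demo_soup.stripped_strings)

print([repr(s) for s in raw_strings])
print(clean_strings)
```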

### .parent

You can access an element’s parent with the `.parent` attribute. In the Reddit page, for example, the `<title>` tag is the parent of the string it contains, and the `<head>` tag is in turn the parent of the `<title>` tag:

In [20]:
soup.title

<title>reddit: the front page of the internet</title>

In [21]:
soup.title.string

u'reddit: the front page of the internet'

In [22]:
soup.title.string.parent

<title>reddit: the front page of the internet</title>

# Searching the Tree

Now, let's consider examples of different filters you can use to search this nested tree of HTML. These filters show up again and again, throughout the search API. You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.

#### Use `find_all()` to find all tags

One common task is extracting all the URLs found within a page's `<a>` tags:

In [23]:
# search for all <a> tags; returns a list
soup.find_all('a')

[<a href="#content" id="jumpToContent" tabindex="1">jump to content</a>,
 <a class="choice" href="http://www.reddit.com/r/announcements/">announcements</a>,
 <a class="choice" href="http://www.reddit.com/r/Art/">Art</a>,
 <a class="choice" href="http://www.reddit.com/r/AskReddit/">AskReddit</a>,
 <a class="choice" href="http://www.reddit.com/r/askscience/">askscience</a>,
 <a class="choice" href="http://www.reddit.com/r/aww/">aww</a>,
 <a class="choice" href="http://www.reddit.com/r/blog/">blog</a>,
 <a class="choice" href="http://www.reddit.com/r/books/">books</a>,
 <a class="choice" href="http://www.reddit.com/r/creepy/">creepy</a>,
 <a class="choice" href="http://www.reddit.com/r/dataisbeautiful/">dataisbeautiful</a>,
 <a class="choice" href="http://www.reddit.com/r/DIY/">DIY</a>,
 <a class="choice" href="http://www.reddit.com/r/Documentaries/">Documentaries</a>,
 <a class="choice" href="http://www.reddit.com/r/EarthPorn/">EarthPorn</a>,
 <a class="choice" href="http://www.reddit.co

In [24]:
# your turn
# search for all the paragraph tags
soup.find_all('p')

[<p>use the following search parameters to narrow your results:</p>,
 <p>e.g. <code>subreddit:aww site:imgur.com dog</code></p>,
 <p><a href="http://www.reddit.com/wiki/search">see the search faq for details.</a></p>,
 <p><a href="http://www.reddit.com/wiki/search" id="search_showmore">advanced search: by author, subreddit...</a></p>,
 <p>49%</p>,
 <p><span class="gold-branding">reddit gold</span> gives you extra features and helps keep our servers running. We believe the more reddit can be user-supported, the freer we will be to make reddit the best it can be.</p>,
 <p class="buy-gold">Buy gold for yourself to gain access to <a href="/gold/about" target="_blank">extra features</a> and <a href="/r/goldbenefits" target="_blank">special benefits</a>. A month of gold pays for  <b>231.26 minutes</b> of reddit server time!</p>,
 <p class="give-gold">Give gold to thank exemplary people and encourage them to post more.</p>,
 <p class="aside">This daily goal updates every 10 minutes and is res

In [25]:
# your turn
# search for all the table tags
soup.find_all('table')

[]

Other arguments to the `.find_all()` function include `limit` and `text`. What do those do? 

In [26]:
# your turn 
# search for all the <a> tags and use the limit argument 
soup.find_all('a', limit=3)

[<a href="#content" id="jumpToContent" tabindex="1">jump to content</a>,
 <a class="choice" href="http://www.reddit.com/r/announcements/">announcements</a>,
 <a class="choice" href="http://www.reddit.com/r/Art/">Art</a>]

In [27]:
# your turn 
# What does using the text argument do? 
soup.find_all('a', limit=10, text='Art')

[<a class="choice" href="http://www.reddit.com/r/Art/">Art</a>,
 <a class="choice" href="http://www.reddit.com/r/Art/">Art</a>]

#### Use `.get()` to extract an attribute

In [28]:
soup.find_all('a')[1].get('href')

'http://www.reddit.com/r/announcements/'

In [29]:
soup.find_all('a')[1].get('class')

['choice']
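The filters mentioned above (tag name, attributes, string content, and `limit`) can be combined in one place. The sketch below assumes a hypothetical fragment modeled on the Reddit links; it uses `string=`, the name Beautiful Soup 4.4.0 and later give to the `text` argument, and `class_` because `class` is a reserved word in Python.

```python
import bs4

# Hypothetical fragment modeled on the Reddit navigation links.
html = """
<div>
  <a class="choice" href="http://www.reddit.com/r/Art/">Art</a>
  <a class="choice" href="http://www.reddit.com/r/books/">books</a>
  <a id="jumpToContent" href="#content">jump to content</a>
  <a id="bare">anchor without href</a>
</div>
"""
demo_soup = bs4.BeautifulSoup(html, "html.parser")

by_class = demo_soup.find_all("a", class_="choice")   # filter on an attribute value
with_href = demo_soup.find_all("a", href=True)        # attribute must be present
first_two = demo_soup.find_all("a", limit=2)          # stop after two matches
art_only = demo_soup.find_all("a", string="Art")      # filter on the tag's string

# .get() returns None when the attribute is missing.
hrefs = [a.get("href") for a in demo_soup.find_all("a")]
print(hrefs)
```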

#### Looping through tags

In [30]:
# your turn
# write a for loop printing all the links from reddit
for anchor_tag in soup.find_all('a'):
    print anchor_tag.get('href')

#content
http://www.reddit.com/r/announcements/
http://www.reddit.com/r/Art/
http://www.reddit.com/r/AskReddit/
http://www.reddit.com/r/askscience/
http://www.reddit.com/r/aww/
http://www.reddit.com/r/blog/
http://www.reddit.com/r/books/
http://www.reddit.com/r/creepy/
http://www.reddit.com/r/dataisbeautiful/
http://www.reddit.com/r/DIY/
http://www.reddit.com/r/Documentaries/
http://www.reddit.com/r/EarthPorn/
http://www.reddit.com/r/explainlikeimfive/
http://www.reddit.com/r/Fitness/
http://www.reddit.com/r/food/
http://www.reddit.com/r/funny/
http://www.reddit.com/r/Futurology/
http://www.reddit.com/r/gadgets/
http://www.reddit.com/r/gaming/
http://www.reddit.com/r/GetMotivated/
http://www.reddit.com/r/gifs/
http://www.reddit.com/r/history/
http://www.reddit.com/r/IAmA/
http://www.reddit.com/r/InternetIsBeautiful/
http://www.reddit.com/r/Jokes/
http://www.reddit.com/r/LifeProTips/
http://www.reddit.com/r/listentothis/
http://www.reddit.com/r/mildlyinteresting/
http://www.reddit.com/r

In [31]:
# your turn
# collect the links with a list comprehension this time
# show the first 5 elements
hrefs = [anchor_tag.get('href') for anchor_tag in soup.find_all('a')]
for href in hrefs[:5]:
    print href

#content
http://www.reddit.com/r/announcements/
http://www.reddit.com/r/Art/
http://www.reddit.com/r/AskReddit/
http://www.reddit.com/r/askscience/


In [43]:
# your turn
# split the first url by "/"
anchor = soup.find_all('a')[1]
link = anchor.get('href')
link.split('/')

['http:', '', 'www.reddit.com', 'r', 'announcements', '']
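Splitting on `"/"` works, but it leaves empty strings in the list. As an alternative sketch, the standard library's `urlparse` separates a URL into named components first; the import is written to work on both Python 2 and Python 3, and the variable names are illustrative.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

link = "http://www.reddit.com/r/announcements/"
parts = urlparse(link)

# Keep only the non-empty pieces of the path.
segments = [s for s in parts.path.split("/") if s]

print(parts.scheme)
print(parts.netloc)
print(segments)
```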

Another common task is extracting all the text from a page:

In [71]:
print(soup.get_text())

reddit: the front page of the internetr.setup({"ajax_domain": "www.reddit.com", "server_time": 1432771660.0, "post_site": "", "clicktracker_url": "//pixel.redditmedia.com/click", "logged": false, "stats_domain": "https://stats.redditmedia.com", "cur_domain": "reddit.com", "is_sponsor": false, "https_forced": false, "user_id": false, "eventtracker_url": "//pixel.redditmedia.com/pixel/of_delight.png", "is_fake": true, "renderstyle": "html", "over_18": false, "vote_hash": "yHIFW1bD5tONvTe6VjAdh+tZUC5jaLJAanhJgR1QbjeBP/IyfgkhOJuy5QwWTKMq9c6TM3Ncpr+H6hgFSRm4s72ZTVyhnyaK9ELHX3IXQht590fWuijzVLABCM44hRGGb9o66zWoYunSEvnm6dQckEL5vnuwnFmNzz3B7u5bbn75Q662an0tr/U=", "adtracker_url": "//pixel.redditmedia.com/pixel/of_doom.png", "uitracker_url": "//pixel.redditmedia.com/pixel/of_discovery.png", "modhash": false, "store_visits": false, "anon_eventtracker_url": "//pixel.redditmedia.com/pixel/of_diversity.png", "new_window": false, "send_logs": true, "gold": false, "pageInfo": {"actionName": "hot.GET_li

# JSON

#### Working with Web APIs
Web APIs are a more convenient way for programs to interact with websites. Many websites now have an API that gives access to their data in JSON format.


In [44]:
a = {'a': 1, 'b':2}
s = json.dumps(a)
a2 = json.loads(s)

In [45]:
a # a dictionary

{'a': 1, 'b': 2}

In [46]:
s # s is a string containing a in JSON encoding

'{"a": 1, "b": 2}'

In [47]:
a2 # after reading back, the keys are unicode strings

{u'a': 1, u'b': 2}
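JSON also carries nested lists and dictionaries, and it has its own spellings for booleans and missing values. The sketch below round-trips a hypothetical record (the `completed` and `winner` fields are made up for illustration) to show how `True` and `None` are written as `true` and `null` in the encoded string.

```python
import json

# Hypothetical nested record; field values are for illustration only.
record = {
    "country": "Brazil",
    "goals": 3,
    "completed": True,
    "winner": None,
    "events": [{"player": "Neymar Jr", "time": "29"}],
}

# sort_keys makes the encoded string reproducible.
encoded = json.dumps(record, sort_keys=True)
decoded = json.loads(encoded)

print(encoded)
print(decoded["events"][0]["player"])
```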

## World Cup in JSON!

The [2014 FIFA World Cup](http://en.wikipedia.org/wiki/2014_FIFA_World_Cup) was held this summer in Brazil at several different venues.  There was an [API created for the World Cup](http://worldcup.sfg.io) that scraped current match results and output match data as JSON. Possible output includes events such as goals, substitutions, and cards. The [actual matches are listed here](http://worldcup.sfg.io/matches) in JSON. 

* Example from [Fernando Masanori](https://gist.github.com/fmasanori/1288160dad16cc473a53)

In [48]:
url = "http://worldcup.sfg.io/matches"
data = urllib2.urlopen(url).read()
wc = json.loads(data.decode('utf-8'))

In [63]:
type(data)

str

In [62]:
type(wc)

list

In [70]:
wc

[{u'away_team': {u'code': u'CRO', u'country': u'Croatia', u'goals': 1},
  u'away_team_events': [{u'id': 677,
    u'player': u'Brozovi\u0106',
    u'time': u'61',
    u'type_of_event': u'substitution-in'},
   {u'id': 674,
    u'player': u'Corluka',
    u'time': u'66',
    u'type_of_event': u'yellow-card'},
   {u'id': 675,
    u'player': u'Lovren',
    u'time': u'69',
    u'type_of_event': u'yellow-card'},
   {u'id': 676,
    u'player': u'Rebi\u0106',
    u'time': u'78',
    u'type_of_event': u'substitution-in'}],
  u'datetime': u'2014-06-12T17:00:00.000-03:00',
  u'home_team': {u'code': u'BRA', u'country': u'Brazil', u'goals': 3},
  u'home_team_events': [{u'id': 662,
    u'player': u'Marcelo',
    u'time': u'11',
    u'type_of_event': u'goal-own'},
   {u'id': 665,
    u'player': u'Neymar Jr',
    u'time': u'27',
    u'type_of_event': u'yellow-card'},
   {u'id': 666,
    u'player': u'Neymar Jr',
    u'time': u'29',
    u'type_of_event': u'goal'},
   {u'id': 664,
    u'player': u'Paulinho

In [69]:
"Number of matches in 2014 World Cup: %i" % len(wc)

'Number of matches in 2014 World Cup: 64'

In [78]:
gameIndex = 60
wc[gameIndex]['home_team']['country']

u'Brazil'

In [74]:
wc[gameIndex]['status']

u'completed'

In [76]:
wc[gameIndex]['match_number']

61

In [79]:
wc[gameIndex]['away_team']

{u'code': u'GER', u'country': u'Germany', u'goals': 7}

In [80]:
wc[gameIndex]['away_team_events']

[{u'id': 1354,
  u'player': u'M\xdcller',
  u'time': u'11',
  u'type_of_event': u'goal'},
 {u'id': 1355, u'player': u'Klose', u'time': u'23', u'type_of_event': u'goal'},
 {u'id': 1356, u'player': u'Kroos', u'time': u'24', u'type_of_event': u'goal'},
 {u'id': 1357, u'player': u'Kroos', u'time': u'26', u'type_of_event': u'goal'},
 {u'id': 1358,
  u'player': u'Khedira',
  u'time': u'29',
  u'type_of_event': u'goal'},
 {u'id': 1363,
  u'player': u'Hummels',
  u'time': u'46',
  u'type_of_event': u'substitution-out halftime'},
 {u'id': 1364,
  u'player': u'Mertesacker',
  u'time': u'46',
  u'type_of_event': u'substitution-in halftime'},
 {u'id': 1365,
  u'player': u'Klose',
  u'time': u'58',
  u'type_of_event': u'substitution-out'},
 {u'id': 1366,
  u'player': u'Sch\xdcrrle',
  u'time': u'58',
  u'type_of_event': u'substitution-in'},
 {u'id': 1370,
  u'player': u'Sch\xdcrrle',
  u'time': u'69',
  u'type_of_event': u'goal'},
 {u'id': 1372,
  u'player': u'Draxler',
  u'time': u'76',
  u'type_o

In [42]:
wc[gameIndex]['home_team']

{u'code': u'BRA', u'country': u'Brazil', u'goals': 1}

The [Brazil v Germany (2014 FIFA World Cup)](http://en.wikipedia.org/wiki/Brazil_v_Germany_(2014_FIFA_World_Cup)) match on July 8, 2014 is the match in which Germany became the highest-scoring team in World Cup tournament history.  Germany led 5–0 at half time, with 4 goals scored in a span of 6 minutes, and subsequently brought the score up to 7–0 in the second half. Brazil scored a goal in the last minute, ending the match 7–1. 
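The half-time score and the "4 goals in 6 minutes" figure can be checked against the `away_team_events` output shown earlier. The sketch below hard-codes the first six events from that output instead of calling the API; treating only `goal` and `goal-penalty` as scoring events is an assumption about the API's event types.

```python
# First six Germany events, copied from the away_team_events output above.
away_team_events = [
    {"player": u"M\xdcller", "time": "11", "type_of_event": "goal"},
    {"player": u"Klose", "time": "23", "type_of_event": "goal"},
    {"player": u"Kroos", "time": "24", "type_of_event": "goal"},
    {"player": u"Kroos", "time": "26", "type_of_event": "goal"},
    {"player": u"Khedira", "time": "29", "type_of_event": "goal"},
    {"player": u"Hummels", "time": "46",
     "type_of_event": "substitution-out halftime"},
]

# Assumption: these two event types count as goals for the team.
SCORING_EVENTS = {"goal", "goal-penalty"}

# The API stores minutes as strings, so convert them to integers.
goal_minutes = [int(event["time"]) for event in away_team_events
                if event["type_of_event"] in SCORING_EVENTS]

first_half_goals = [m for m in goal_minutes if m <= 45]
burst = [m for m in goal_minutes if 23 <= m <= 29]

print(len(first_half_goals))
print(len(burst), max(burst) - min(burst))
```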

Print the team names and goals scored for each match:

In [81]:
for elem in wc:
    print elem['home_team']['country'], elem['home_team']['goals'], elem['away_team']['country'], elem['away_team']['goals']

Brazil 3 Croatia 1
Mexico 1 Cameroon 0
Spain 1 Netherlands 5
Chile 3 Australia 1
Colombia 3 Greece 0
Ivory Coast 2 Japan 1
Uruguay 1 Costa Rica 3
England 1 Italy 2
Switzerland 2 Ecuador 1
France 3 Honduras 0
Argentina 2 Bosnia and Herzegovina 1
Iran 0 Nigeria 0
Germany 4 Portugal 0
Ghana 1 USA 2
Belgium 2 Algeria 1
Russia 1 Korea Republic 1
Brazil 0 Mexico 0
Cameroon 0 Croatia 4
Spain 0 Chile 2
Australia 2 Netherlands 3
Colombia 2 Ivory Coast 1
Japan 0 Greece 0
Uruguay 2 England 1
Italy 0 Costa Rica 1
Switzerland 2 France 5
Honduras 1 Ecuador 2
Argentina 1 Iran 0
Nigeria 1 Bosnia and Herzegovina 0
Germany 2 Ghana 2
USA 2 Portugal 2
Belgium 1 Russia 0
Korea Republic 2 Algeria 4
Cameroon 1 Brazil 4
Croatia 1 Mexico 3
Australia 0 Spain 3
Netherlands 2 Chile 0
Japan 1 Colombia 4
Greece 2 Ivory Coast 1
Italy 0 Uruguay 1
Costa Rica 0 England 0
Honduras 0 Switzerland 3
Ecuador 0 France 0
Nigeria 2 Argentina 3
Bosnia and Herzegovina 3 Iran 1
USA 0 Germany 1
Portugal 2 Ghana 1
Korea Republic 0 

### Create a pandas DataFrame from JSON

In [82]:
data = pd.DataFrame(wc, columns = ['match_number', 'location', 'datetime', 'home_team', 'away_team', 'winner', 'home_team_events', 'away_team_events'])
data.head()

| | match_number | location | datetime | home_team | away_team | winner | home_team_events | away_team_events |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Arena de Sao Paulo | 2014-06-12T17:00:00.000-03:00 | {u'country': u'Brazil', u'code': u'BRA', u'goa... | {u'country': u'Croatia', u'code': u'CRO', u'go... | Brazil | [{u'type_of_event': u'goal-own', u'player': u'... | [{u'type_of_event': u'substitution-in', u'play... |
| 1 | 2 | Estadio das Dunas | 2014-06-13T13:00:00.000-03:00 | {u'country': u'Mexico', u'code': u'MEX', u'goa... | {u'country': u'Cameroon', u'code': u'CMR', u'g... | Mexico | [{u'type_of_event': u'yellow-card', u'player':... | [{u'type_of_event': u'substitution-in halftime... |
| 2 | 3 | Arena Fonte Nova | 2014-06-13T16:00:00.000-03:00 | {u'country': u'Spain', u'code': u'ESP', u'goal... | {u'country': u'Netherlands', u'code': u'NED', ... | Netherlands | [{u'type_of_event': u'goal-penalty', u'player'... | [{u'type_of_event': u'yellow-card', u'player':... |
| 3 | 4 | Arena Pantanal | 2014-06-13T19:00:00.000-03:00 | {u'country': u'Chile', u'code': u'CHI', u'goal... | {u'country': u'Australia', u'code': u'AUS', u'... | Chile | [{u'type_of_event': u'goal', u'player': u'Alex... | [{u'type_of_event': u'goal', u'player': u'Cahi... |
| 4 | 5 | Estadio Mineirao | 2014-06-14T13:00:00.000-03:00 | {u'country': u'Colombia', u'code': u'COL', u'g... | {u'country': u'Greece', u'code': u'GRE', u'goa... | Colombia | [{u'type_of_event': u'goal', u'player': u'P. A... | [{u'type_of_event': u'yellow-card', u'player':... |
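The `home_team` and `away_team` columns still hold whole dictionaries, which makes them awkward to filter or sort. One possible approach, sketched here with only the standard library, is to flatten each match into a plain row before building the table. The helper `flatten_match` and the column names it produces are hypothetical; the two sample matches are taken from the results printed above.

```python
# Two matches in the same nested shape as the API output above.
matches = [
    {"match_number": 1,
     "home_team": {"country": "Brazil", "code": "BRA", "goals": 3},
     "away_team": {"country": "Croatia", "code": "CRO", "goals": 1}},
    {"match_number": 2,
     "home_team": {"country": "Mexico", "code": "MEX", "goals": 1},
     "away_team": {"country": "Cameroon", "code": "CMR", "goals": 0}},
]


def flatten_match(match):
    """Return one flat dict per match (hypothetical helper)."""
    row = {"match_number": match["match_number"]}
    for side in ("home_team", "away_team"):
        prefix = side.split("_")[0]  # "home" or "away"
        row[prefix + "_country"] = match[side]["country"]
        row[prefix + "_goals"] = match[side]["goals"]
    return row


rows = [flatten_match(m) for m in matches]
print(rows[0])
```

A list of flat dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to get one column per field.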


In [96]:
foo = data[:5]['home_team']


#### Convert format of a column

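Here we use pandas [DatetimeIndex](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html) to convert the `datetime` column into two separate columns: a date and a time for each match. Note that in the output below the times are shown in UTC, so the 17:00 kickoff at UTC-03:00 appears as 20:00.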

In [45]:
data['gameDate'] = pd.DatetimeIndex(data.datetime).date
data['gameTime'] = pd.DatetimeIndex(data.datetime).time

In [46]:
data.head()

| | match_number | location | datetime | home_team | away_team | winner | home_team_events | away_team_events | gameDate | gameTime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Arena de Sao Paulo | 2014-06-12T17:00:00.000-03:00 | {u'country': u'Brazil', u'code': u'BRA', u'goa... | {u'country': u'Croatia', u'code': u'CRO', u'go... | Brazil | [{u'type_of_event': u'goal-own', u'player': u'... | [{u'type_of_event': u'substitution-in', u'play... | 2014-06-12 | 20:00:00 |
| 1 | 2 | Estadio das Dunas | 2014-06-13T13:00:00.000-03:00 | {u'country': u'Mexico', u'code': u'MEX', u'goa... | {u'country': u'Cameroon', u'code': u'CMR', u'g... | Mexico | [{u'type_of_event': u'yellow-card', u'player':... | [{u'type_of_event': u'substitution-in halftime... | 2014-06-13 | 16:00:00 |
| 2 | 3 | Arena Fonte Nova | 2014-06-13T16:00:00.000-03:00 | {u'country': u'Spain', u'code': u'ESP', u'goal... | {u'country': u'Netherlands', u'code': u'NED', ... | Netherlands | [{u'type_of_event': u'goal-penalty', u'player'... | [{u'type_of_event': u'yellow-card', u'player':... | 2014-06-13 | 19:00:00 |
| 3 | 4 | Arena Pantanal | 2014-06-13T19:00:00.000-03:00 | {u'country': u'Chile', u'code': u'CHI', u'goal... | {u'country': u'Australia', u'code': u'AUS', u'... | Chile | [{u'type_of_event': u'goal', u'player': u'Alex... | [{u'type_of_event': u'goal', u'player': u'Cahi... | 2014-06-13 | 22:00:00 |
| 4 | 5 | Estadio Mineirao | 2014-06-14T13:00:00.000-03:00 | {u'country': u'Colombia', u'code': u'COL', u'g... | {u'country': u'Greece', u'code': u'GRE', u'goa... | Colombia | [{u'type_of_event': u'goal', u'player': u'P. A... | [{u'type_of_event': u'yellow-card', u'player':... | 2014-06-14 | 16:00:00 |
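The `gameTime` of 20:00:00 for a 17:00 kickoff is a time zone effect: the timestamp carries a UTC-03:00 offset, and the converted value is expressed in UTC. The sketch below reproduces that for a single timestamp with the standard `datetime` module. It assumes Python 3.7 or later, where `%z` accepts an offset written with a colon; the variable names are illustrative.

```python
from datetime import datetime, timezone

stamp = "2014-06-12T17:00:00.000-03:00"

# Parse the ISO 8601 string, including milliseconds and the UTC offset.
local_kickoff = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")

# Express the same instant in UTC.
utc_kickoff = local_kickoff.astimezone(timezone.utc)

print(local_kickoff.time())
print(utc_kickoff.date(), utc_kickoff.time())
```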


In [118]:
data['home_team'][0]

{u'code': u'BRA', u'country': u'Brazil', u'goals': 3}

In [121]:
# iterating over a DataFrame yields its column names, not its rows
for match in data:
    print match

match_number
location
datetime
home_team
away_team
winner
home_team_events
away_team_events
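Looping over a DataFrame directly gives the column names, as the output above shows. To visit the rows instead, one option is `iterrows()`, which yields an index and a row for each match. This is a minimal sketch on a hypothetical two-row frame (`demo`), not on the full World Cup data.

```python
import pandas as pd

# Hypothetical two-row frame with values taken from the table above.
demo = pd.DataFrame({"match_number": [1, 2],
                     "winner": ["Brazil", "Mexico"]})

# Plain iteration walks over the column labels.
column_names = [name for name in demo]

# iterrows() walks over (index, row) pairs instead.
winners = [(int(row["match_number"]), row["winner"])
           for _, row in demo.iterrows()]

print(column_names)
print(winners)
```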
