## JSON files

In [1]:
# We'll load a couple of libraries that we'll be using in this Notebook.
# 'requests' is an HTTP library - it allows us to get data from a URL address
import requests
import json
import pandas as pd

## Using the JSON library
The Python `json` library provides a function, `json.loads()`, that parses a string containing JSON data (here, data read from a web address) and converts it to a Python object, typically a `dict`.
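
As a minimal sketch of what `json.loads()` does, the cell below parses a small hand-written JSON string. The string and its values are made up for illustration and are not fetched from the web service.

```python
import json

# A hand-written JSON string, loosely modelled on one postcode-sector result.
# The values are illustrative only.
json_text = '{"name": "MK7 6", "score": 1, "match": true, "type": ["PostcodeSector"]}'

parsed = json.loads(json_text)

# JSON objects become dicts, arrays become lists,
# and true/false/null become True/False/None.
print(type(parsed))
print(parsed['name'], parsed['match'], parsed['type'][0])
```

Once parsed, the data can be indexed like any other Python `dict` or `list`.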

In [2]:
# We start by getting the data from the web address - we can inspect its content directly.
# This is data from the Ordnance Survey datasets.
url = 'http://data.ordnancesurvey.co.uk/datasets/os-linked-data/apis/reconciliation?query=MK7&type=http%3A%2F%2Fdata.ordnancesurvey.co.uk%2Fontology%2Fpostcode%2FPostcodeSector&type_strict=any&limit=10'
resp = requests.get(url)
resp.content

b'{"result":[{"id":"http:\\/\\/data.ordnancesurvey.co.uk\\/id\\/postcodesector\\/MK76","name":"MK7 6","score":1,"match":true,"type":["http:\\/\\/data.ordnancesurvey.co.uk\\/ontology\\/postcode\\/PostcodeSector"]},{"id":"http:\\/\\/data.ordnancesurvey.co.uk\\/id\\/postcodesector\\/MK77","name":"MK7 7","score":1,"match":false,"type":["http:\\/\\/data.ordnancesurvey.co.uk\\/ontology\\/postcode\\/PostcodeSector"]},{"id":"http:\\/\\/data.ordnancesurvey.co.uk\\/id\\/postcodesector\\/MK78","name":"MK7 8","score":1,"match":false,"type":["http:\\/\\/data.ordnancesurvey.co.uk\\/ontology\\/postcode\\/PostcodeSector"]}]}'

Sometimes we need to decode the data that is returned.

If the response is preceded by a `b'`, as above, then we need to convert from *bytes* to a string representation by *decoding* the response.
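
The decoding step can be sketched in isolation as follows. The `raw` value is a hypothetical stand-in for `resp.content`, not real output from the service.

```python
import json

# Hypothetical stand-in for resp.content: a bytes object holding JSON text.
raw = b'{"result": [{"name": "MK7 6", "match": true}]}'

# Decode the bytes to a str, then parse the str as JSON.
text = raw.decode('utf-8')
data_from_str = json.loads(text)

# From Python 3.6 onwards, json.loads() also accepts UTF-8 bytes directly.
data_from_bytes = json.loads(raw)

print(type(raw), type(text))
print(data_from_str == data_from_bytes)
```

When working with `requests`, `resp.text` gives the decoded string and `resp.json()` decodes and parses the body in one step.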

In [3]:
data = json.loads(resp.content.decode('UTF8'))
data

{'result': [{'id': 'http://data.ordnancesurvey.co.uk/id/postcodesector/MK76',
   'match': True,
   'name': 'MK7 6',
   'score': 1,
   'type': ['http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeSector']},
  {'id': 'http://data.ordnancesurvey.co.uk/id/postcodesector/MK77',
   'match': False,
   'name': 'MK7 7',
   'score': 1,
   'type': ['http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeSector']},
  {'id': 'http://data.ordnancesurvey.co.uk/id/postcodesector/MK78',
   'match': False,
   'name': 'MK7 8',
   'score': 1,
   'type': ['http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeSector']}]}

The `json` library is quite small; additional documentation can be found at http://docs.python-guide.org/en/latest/scenarios/json/

## *pandas* can handle JSON too

The *pandas* `io.json` library has the `read_json()` function - it too reads from a web address.

In [4]:
# The pandas function seems to handle the decoding seamlessly.
# Can you see how the JSON data structure is mapped on to the shape of the resulting DataFrame?
pd.io.json.read_json(url)

| | result |
|---|---|
| 0 | {'type': ['http://data.ordnancesurvey.co.uk/on... |
| 1 | {'type': ['http://data.ordnancesurvey.co.uk/on... |
| 2 | {'type': ['http://data.ordnancesurvey.co.uk/on... |


We can then convert the resulting DataFrame (structured as a dict in each row, in case you hadn't spotted it) into a more regular DataFrame format.

In [6]:
# Here's one way of converting that DataFrame,
#     with a dict generated from the JSON in each row, to a DataFrame.
d = pd.io.json.read_json(url)
# Create a list to hold the DataFrame generated from each row.
dfl = []
for row in d['result']:
    # Convert the dict in each row to a DataFrame.
    dfl.append(pd.DataFrame.from_dict(row))
# Concatenate all the DataFrames:
pd.concat(dfl)

| | id | match | name | score | type |
|---|---|---|---|---|---|
| 0 | http://data.ordnancesurvey.co.uk/id/postcodese... | True | MK7 6 | 1 | http://data.ordnancesurvey.co.uk/ontology/post... |
| 0 | http://data.ordnancesurvey.co.uk/id/postcodese... | False | MK7 7 | 1 | http://data.ordnancesurvey.co.uk/ontology/post... |
| 0 | http://data.ordnancesurvey.co.uk/id/postcodese... | False | MK7 8 | 1 | http://data.ordnancesurvey.co.uk/ontology/post... |
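
As a possible alternative to building one DataFrame per row, a list of dicts can be passed straight to the `pd.DataFrame` constructor. The sketch below uses abbreviated, hand-written records shaped like the items in the `result` list, rather than the live data.

```python
import pandas as pd

# Abbreviated, hand-written records shaped like the items in the 'result' list.
records = [
    {'name': 'MK7 6', 'score': 1, 'match': True},
    {'name': 'MK7 7', 'score': 1, 'match': False},
    {'name': 'MK7 8', 'score': 1, 'match': False},
]

# Each dict becomes one row; the dict keys become the column names.
df = pd.DataFrame(records)
print(df)
```

Note that the loop-and-concatenate approach above gives every row the index label 0, whereas this version numbers the rows 0, 1, 2. Passing `ignore_index=True` to `pd.concat()` would renumber the concatenated rows in the same way.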


Alternatively, we can use the output of the `json` library's `json.loads()` function with the *pandas* `json_normalize()` function.

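In [10]:
# There is a handy pandas function that can help us normalise this data.
# (In recent versions of pandas it is available directly as pd.json_normalize().)
# Note: 'data' must still hold the Ordnance Survey response parsed above;
#   if 'data' has since been reassigned, data['result'] raises a KeyError.
pd.json_normalize(data['result'])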
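
To illustrate how `json_normalize()` treats nested values, here is a small sketch using hand-written records. The `pid` values and subtitles are invented, and the top-level `pd.json_normalize()` name assumes pandas 1.0 or later.

```python
import pandas as pd

# Hand-written records with one level of nesting (illustrative values only).
records = [
    {'pid': 'a1', 'display_title': {'title': 'Click', 'subtitle': 'Episode one'}},
    {'pid': 'b2', 'display_title': {'title': 'Click', 'subtitle': 'Episode two'}},
]

# Nested dict keys are flattened into dotted column names,
# such as 'display_title.title'.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```

Each record still produces one row; only the nested dicts are spread across extra columns.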

In [9]:
# Let's have a look at the JSON from a BBC programme episode page 
#   such as http://www.bbc.co.uk/programmes/p01ztvcp.
# NOTE: The ability to access this data directly may be removed at some future date.
# The BBC is moving to the NITRO API (http://developer.bbc.co.uk/content/nitro-quickstart);
#    when that happens, access to the content will require registering as a developer.

url = 'http://www.bbc.co.uk/programmes/p01ztvcp.json'
resp = requests.get(url)
data = json.loads(resp.content.decode('utf8'))
data

{'programme': {'categories': [{'broader': {'category': {'broader': {},
      'has_topic_page': False,
      'id': 'C00045',
      'key': 'factual',
      'sameAs': None,
      'title': 'Factual',
      'type': 'genre'}},
    'has_topic_page': False,
    'id': 'C00064',
    'key': 'scienceandnature',
    'narrower': [],
    'sameAs': None,
    'title': 'Science & Nature',
    'type': 'genre'},
   {'broader': {},
    'has_topic_page': False,
    'id': 'PT009',
    'key': 'magazinesandreviews',
    'narrower': [],
    'sameAs': None,
    'title': 'Magazines & Reviews',
    'type': 'format'}],
  'display_title': {'subtitle': 'Privacy or Freedom of Speech?',
   'title': 'Click'},
  'expected_child_count': None,
  'first_broadcast_date': '2014-06-03T19:32:30+01:00',
  'image': {'pid': 'p02090my'},
  'links': [{'title': 'BBC News: Google: Who would want the right to be forgotten?',
    'type': 'related_site',
    'url': 'http://www.bbc.co.uk/news/magazine-27396981'},
   {'title': 'Professor L

In [11]:
# As seen before, the read_json() function can read and parse this data directly.
# But, the above is a nested structure.
# Can you predict what sort of shape the resulting DataFrame is likely to take?
tmp = pd.io.json.read_json('http://www.bbc.co.uk/programmes/p01ztvcp.json')
tmp

| | programme |
|---|---|
| categories | [{'sameAs': None, 'narrower': [], 'key': 'scie... |
| display_title | {'subtitle': 'Privacy or Freedom of Speech?', ... |
| expected_child_count | |
| first_broadcast_date | 2014-06-03T19:32:30+01:00 |
| image | {'pid': 'p02090my'} |
| links | [{'url': 'http://www.bbc.co.uk/news/magazine-2... |
| long_synopsis | The case of Mario Costeja Gonzalez, a Spaniard... |
| media_type | audio |
| medium_synopsis | The Court of Justice of the European Union rul... |
| ownership | {'service': {'key': 'worldserviceradio', 'id':... |


Not every JSON structure will map easily to a tabular form. The structure above has several levels of nesting, but there are a few items at the top level that you can pull out.

For deeply nested data structures that do not have a natural tabular representation, it probably makes sense to parse them using 'standard' Python methods - handling each row individually, or grouping commonly structured rows together. 
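
A minimal sketch of the 'standard' Python approach is given below. It uses an abbreviated, hand-written dict shaped like part of the BBC programme data, so that it runs without fetching anything.

```python
# An abbreviated, hand-written dict shaped like part of the BBC programme data.
programme_data = {
    'programme': {
        'display_title': {'title': 'Click',
                          'subtitle': 'Privacy or Freedom of Speech?'},
        'categories': [
            {'title': 'Science & Nature', 'type': 'genre'},
            {'title': 'Magazines & Reviews', 'type': 'format'},
        ],
    }
}

programme = programme_data['programme']

# Index step by step into nested dicts.
subtitle = programme['display_title']['subtitle']

# Loop over a nested list, keeping only the items of interest.
genres = [c['title'] for c in programme['categories'] if c['type'] == 'genre']

# .get() returns a default instead of raising a KeyError when a key is absent.
links = programme.get('links', [])

print(subtitle)
print(genres)
print(links)
```

Using `.get()` with a default is one way to cope with records in which some keys are missing.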

In [12]:
# For example, we can select particular rows by index value.
# (.loc is used here because the older .ix indexer has been removed from pandas.)
rows = ['pid', 'title', 'short_synopsis', 'long_synopsis', 'first_broadcast_date']
pd.io.json.read_json('http://www.bbc.co.uk/programmes/p01ztvcp.json').loc[rows]

| | programme |
|---|---|
| pid | p01ztvcp |
| title | Privacy or Freedom of Speech? |
| short_synopsis | The Court of Justice of the European Union rul... |
| long_synopsis | The case of Mario Costeja Gonzalez, a Spaniard... |
| first_broadcast_date | 2014-06-03T19:32:30+01:00 |


There is some degree of control over the way in which the *pandas* `read_json()` function can parse imported data.

See the [pandas.io.json.read_json](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.json.read_json.html) documentation for more information (particularly the `orient` parameter). The documentation also describes parameters for enabling automatic data conversions, where possible.
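
As a rough illustration of the `orient` parameter, the sketch below reads the same two made-up records from two different JSON layouts. The strings and row labels are hand-written for this example, and `StringIO` is used so that `read_json()` can read from a string rather than a URL.

```python
from io import StringIO

import pandas as pd

# The same two made-up records, serialised in two different layouts.
records_json = '[{"name": "MK7 6", "score": 1}, {"name": "MK7 7", "score": 1}]'
index_json = ('{"first": {"name": "MK7 6", "score": 1}, '
              '"second": {"name": "MK7 7", "score": 1}}')

# orient='records': the JSON is a list of row objects.
df_records = pd.read_json(StringIO(records_json), orient='records')

# orient='index': the JSON is an object keyed by row label,
# with each value holding one row.
df_index = pd.read_json(StringIO(index_json), orient='index')

print(df_records)
print(df_index)
```

Both calls give a DataFrame with `name` and `score` columns; they differ only in how the row labels are obtained.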

### Writing DataFrames to JSON files

We can write a DataFrame to a JSON file using the `to_json()` method.

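In [16]:
# Write the DataFrame to a file (the data/ directory must already exist),
#   then preview the start of the file using the shell's head command.
tmp.to_json('data/tmp.json')
!head data/tmp.json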

{"programme":{"categories":[{"sameAs":null,"narrower":[],"key":"scienceandnature","has_topic_page":false,"id":"C00064","title":"Science & Nature","broader":{"category":{"sameAs":null,"id":"C00045","key":"factual","has_topic_page":false,"title":"Factual","broader":{},"type":"genre"}},"type":"genre"},{"sameAs":null,"narrower":[],"key":"magazinesandreviews","has_topic_page":false,"id":"PT009","title":"Magazines & Reviews","broader":{},"type":"format"}],"display_title":{"subtitle":"Privacy or Freedom of Speech?","title":"Click"},"expected_child_count":null,"first_broadcast_date":"2014-06-03T19:32:30+01:00","image":{"pid":"p02090my"},"links":[{"url":"http:\/\/www.bbc.co.uk\/news\/magazine-27396981","title":"BBC News: Google: Who would want the right to be forgotten?","type":"related_site"},{"url":"http:\/\/www.oii.ox.ac.uk\/people\/?id=327","title":"Professor Luciano Floridi","type":"related_site"},{"url":"http:\/\/haxpo.nl\/hitb2014ams-conference\/","title":"Hack In The Box Amsterdam 2014 

See the *pandas* documentation for more details: [pandas.DataFrame.to_json](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html).
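
The following sketch shows `to_json()` on a small made-up DataFrame, writing to a temporary directory rather than to `data/`. The DataFrame contents and the output path are assumptions made for illustration.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'name': ['MK7 6', 'MK7 7'], 'match': [True, False]})

# With no path argument, to_json() returns the JSON text as a string.
as_columns = df.to_json()                  # default layout: column -> {index -> value}
as_records = df.to_json(orient='records')  # a list of row objects

# With a path argument, the JSON text is written to that file instead.
out_path = os.path.join(tempfile.mkdtemp(), 'tmp.json')
df.to_json(out_path, orient='records')

# Read the file back with the json library to check what was written.
with open(out_path) as f:
    round_trip = json.load(f)

print(as_columns)
print(round_trip)
```

The `orient` argument controls the layout of the JSON that is written, so it is worth choosing one that suits whatever will read the file later.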

## Summary
In this Notebook you have seen how to:
1.  parse JSON data into a `dict` using the `json.loads()` function
2.  read JSON data directly into a DataFrame using *pandas*
3.  use *pandas* to reshape and select from the parsed JSON data
4.  write data in a DataFrame to a JSON file.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to look at `02.2.3 Data file formats - other`. 