## JSON files


JSON, or *JavaScript Object Notation*, is a document format that is often used to transport data around the web.

The core Python `json` library provides low-level routines for working with JSON data, but many libraries also provide built-in support for converting between JSON and Python data structures.

In [None]:
# We'll load a couple of libraries that we'll be using in this Notebook.
# 'requests' is an HTTP library - it allows us to get data from a URL address
import requests
import json
import pandas as pd

## Using the JSON library

The Python `json` library provides a function, `json.loads()`, that parses a string containing JSON data (for example, text read from a file or retrieved from a web address) and converts it to the corresponding Python object, typically a `dict` or a `list`.

Let's start by getting the data from the web address - we can inspect its content directly.

The following dataset is grabbed from the Ordnance Survey:

In [None]:
url = 'http://data.ordnancesurvey.co.uk/datasets/os-linked-data/apis/reconciliation?query=MK7&type=http%3A%2F%2Fdata.ordnancesurvey.co.uk%2Fontology%2Fpostcode%2FPostcodeSector&type_strict=any&limit=10'

# Make a query to the URL to retrieve some JSON data
resp = requests.get(url)
resp.content

Sometimes we need to decode the data that is returned.

If the displayed response starts with `b'`, as above, then it is a *bytes* object, and we need to convert it to a string representation by *decoding* it.

In [None]:
data = json.loads(resp.content.decode('UTF8'))
data
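
The decoding step can also be tried offline. The following is a minimal sketch that uses a hand-written bytestring (an assumption standing in for `resp.content`, not real Ordnance Survey output) rather than a live response:

```python
import json

# Hypothetical bytestring standing in for resp.content (made-up data)
raw = b'{"result": [{"name": "MK7 6", "score": 1, "match": true}]}'

# Decode the bytes to a str, then parse the JSON text
text = raw.decode('utf-8')
parsed = json.loads(text)

# JSON objects become dicts, arrays become lists, and true becomes True
print(type(parsed))
print(parsed['result'][0]['name'], parsed['result'][0]['match'])
```

From Python 3.6 onwards, `json.loads()` also accepts `bytes` directly, so `json.loads(raw)` gives the same result here.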

The `json` library is quite small; additional documentation can be found at http://docs.python-guide.org/en/latest/scenarios/json/

## Parsing JSON directly using `requests`
As well as parsing the JSON data into a Python `dict` using the `json` package, we can get the parsed representation directly from the `requests` response object via its `.json()` method:

In [None]:
jdata = resp.json()
jdata

This is typically more convenient than worrying about whether something is a bytestring or not.

Inspecting the JSON, we see that in this case it returns a `dict` with a single `result` element that contains a `list` of other `dict`s:

In [None]:
jdata['result']

We can tunnel into the data just as we would index into any Python `dict` or `list`. For example, we can pull out the second listed item (remember, Python indexing starts at `0`):

In [None]:
jdata['result'][1]

Within that, we could pull out the `name`:

In [None]:
jdata['result'][1]['name']

And so on...
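
When the same lookup is needed for every item, a loop or comprehension is often tidier than indexing one item at a time. The sketch below uses a small hand-made `dict` (the name `sample` and its values are illustrative assumptions that only mimic the shape of `jdata`):

```python
# Hypothetical parsed response with the same overall shape as jdata
sample = {
    'result': [
        {'id': 'http://example.org/id/A', 'name': 'MK7 6', 'score': 1},
        {'id': 'http://example.org/id/B', 'name': 'MK7 7', 'score': 1},
        {'id': 'http://example.org/id/C'},  # deliberately missing 'name'
    ]
}

# Collect every name; .get() returns None rather than raising KeyError
names = [item.get('name') for item in sample['result']]
print(names)

# Build a lookup table keyed by id for direct access to a single item
by_id = {item['id']: item for item in sample['result']}
print(by_id['http://example.org/id/B']['name'])
```

Using `.get()` is a design choice for data where some keys may be absent; plain `item['name']` would be appropriate if a missing key should be treated as an error.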

## *pandas* can handle JSON too

The *pandas* `io.json` module provides the `read_json()` function (also available at the top level as `pd.read_json()`), which can read JSON directly from a web address.

In [None]:
df = pd.io.json.read_json(url)
df

Here, each top-level key of the response (in this case, the single `result` key) is used to define a column, with each row containing one of the `list` items associated with `result`.

The *pandas* `read_json()` function offers some control over how the imported data is parsed, in much the same way that the *pandas* `DataFrame.from_dict()` function controls how dataframes are created from Python dictionaries.

See the [`pandas.read_json()`](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_json.html) documentation for more information. The documentation also describes parameters for enabling automatic data conversions, where possible.
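
One such control is the `orient` parameter, which tells `read_json()` how the JSON is laid out. The following sketch uses two small, made-up JSON strings (wrapped in `StringIO` so that they are read as file-like objects) to compare the `'records'` and `'index'` layouts:

```python
from io import StringIO
import pandas as pd

# Illustrative JSON strings (made-up values), one per layout
records_json = '[{"name": "MK7 6", "score": 1}, {"name": "MK7 7", "score": 1}]'
index_json = ('{"row1": {"name": "MK7 6", "score": 1}, '
              '"row2": {"name": "MK7 7", "score": 1}}')

# orient='records': the JSON is a list of row dicts
df_records = pd.read_json(StringIO(records_json), orient='records')

# orient='index': the outer keys become the row index
df_index = pd.read_json(StringIO(index_json), orient='index')

print(df_records)
print(df_index)
```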

### Looking inside the JSON data in a dataframe context

One of the useful behaviours of a *pandas* Series is that a Series can be defined from a `dict`:

In [None]:
pd.Series( {'id': 'http://data.ordnancesurvey.co.uk/id/postcodesector/MK76',
  'name': 'MK7 6',
  'score': 1,
  'match': True,
  'type': ['http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeSector']} )

Recalling that we can apply a function to a series, what happens if we apply the `pd.Series()` function to each of the `dict`s in the `df['result']` Series?

In [None]:
df['result'].apply(pd.Series)

The `dict`s are unpacked across several columns, although we note that the `type` column contains lists that have not been unpacked further.

In this way, we can convert the original DataFrame (which, in case you hadn't spotted it, holds a `dict` in each row of its `result` column) into a more regular DataFrame format.
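
The same unpacking can be tried without a network connection. In this sketch the first record is the one shown earlier, the second is a made-up sibling added for illustration, and the use of `explode()` to unpack the list-valued `type` column is one possible follow-up step rather than the only approach:

```python
import pandas as pd

type_uri = 'http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeSector'

# First record is from the example above; the second is made up
demo = pd.DataFrame({'result': [
    {'name': 'MK7 6', 'score': 1, 'type': [type_uri]},
    {'name': 'MK7 7', 'score': 1, 'type': [type_uri]},
]})

# Unpack each dict so that every key becomes a column
unpacked = demo['result'].apply(pd.Series)

# explode() gives each element of a list-valued cell its own row
exploded = unpacked.explode('type')
print(exploded)
```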

Alternatively, we can use the *pandas* `json_normalize()` function to perform a similar operation:

In [None]:
from pandas import json_normalize

json_normalize( df['result'] )

The *pandas* `json_normalize()` function ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html)) provides a range of arguments that allow you to customise how a particular JSON parsed data structure is treated.

Whilst it is beyond the scope of this module to spend too much time looking at how to use the `json_normalize()` function, we can get a feel for some of the operations it supports by considering some JSON data retrieved from the BBC programme episodes API.

In [None]:
bbc_url = 'http://www.bbc.co.uk/programmes/p01ztvcp.json'
bbc_resp = requests.get(bbc_url)

programme = bbc_resp.json()
programme

What happens if we pass the `programme['programme']` object into `json_normalize()`?

In [None]:
df_programme = json_normalize(programme['programme'])
df_programme

In this case, we create a single row with some columns specified by keys of the `programme['programme']` dictionary:

In [None]:
programme['programme'].keys(), df_programme.columns

But we also notice that nested ("grandchild") dictionaries have been unpacked into columns of their own.
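
To see the flattening in isolation, here is a minimal sketch using a small, made-up record (the keys and values are assumptions, only loosely modelled on a nested programme description). It also tries the `sep=` and `max_level=` arguments of `json_normalize()`:

```python
import pandas as pd

# Made-up record with two levels of nesting under 'ownership'
record = {
    'pid': 'p0000000',
    'title': 'Example programme',
    'ownership': {
        'service': {'id': 'example_service', 'title': 'Example Service'},
    },
}

# By default, nested keys are joined with '.' to form column names
flat = pd.json_normalize(record)
print(list(flat.columns))

# sep= changes the character used to join nested keys
flat_us = pd.json_normalize(record, sep='_')
print(list(flat_us.columns))

# max_level= limits how many levels of nesting are flattened
shallow = pd.json_normalize(record, max_level=1)
print(list(shallow.columns))
```

With `max_level=1`, the innermost dictionary is left intact as a single cell value rather than being spread across columns.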

You might also notice that some of the columns themselves contain lists of further `dict`s. We can reference these and unpack them directly by navigating to them using the `record_path=` argument:

In [None]:
related = json_normalize(programme['programme'], record_path='links')
related
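
When unpacking a nested list, it is often useful to keep some of the parent's fields alongside each child row. The sketch below, which uses made-up data rather than the BBC response, shows one way of doing this with the `meta=` and `record_prefix=` arguments:

```python
import pandas as pd

# Made-up data: top-level fields plus a list of child records
prog = {
    'pid': 'p0000000',
    'title': 'Example programme',
    'links': [
        {'type': 'related', 'title': 'Link A', 'url': 'http://example.org/a'},
        {'type': 'related', 'title': 'Link B', 'url': 'http://example.org/b'},
    ],
}

# One row per item in 'links'; meta= copies parent fields onto every row.
# record_prefix= avoids a clash between the parent and child 'title' keys.
links = pd.json_normalize(
    prog,
    record_path='links',
    meta=['pid', 'title'],
    record_prefix='link_',
)
print(links)
```

Without the prefix, the parent `title` and the child `title` would compete for the same column name and *pandas* would raise an error.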

Not every JSON structure will map easily to a tabular form.  The structure above has several levels of nesting, but there are a few items at the top level that you can pull out.

For deeply nested data structures that do not have a natural tabular representation, it probably makes sense to parse them in several steps, for example creating a "top level" data frame and then generating additional dataframes, perhaps referenced from a `dict`, containing unpacked levels or lists of data. 
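
As a rough sketch of that multi-step approach, assuming a made-up document in which some top-level keys hold lists of records (the names `doc`, `df_top` and `children` are illustrative only):

```python
import pandas as pd

# Made-up document: two scalar fields and two lists of child records
doc = {
    'pid': 'p0000000',
    'title': 'Example programme',
    'links': [{'title': 'Link A'}, {'title': 'Link B'}],
    'categories': [{'key': 'factual'}, {'key': 'science'}, {'key': 'nature'}],
}

# Step 1: separate the list-valued keys from everything else
list_keys = [key for key, value in doc.items() if isinstance(value, list)]
top = {key: value for key, value in doc.items() if key not in list_keys}

# Step 2: build a single-row "top level" dataframe
df_top = pd.json_normalize(top)

# Step 3: build one child dataframe per list, held in a dict and
# tagged with the parent pid so rows can be joined back later
children = {}
for key in list_keys:
    child = pd.json_normalize(doc[key])
    child['pid'] = doc['pid']
    children[key] = child

print(df_top)
print(children['categories'])
```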

### Writing `DataFrames` to JSON files

We can write a `DataFrame` to a JSON file using the `.to_json()` method.

In [None]:
related.to_json('data/tmp.json')

!head data/tmp.json

And then read it back in again:

In [None]:
pd.read_json('data/tmp.json')

See the *pandas* documentation for more details: [pandas.DataFrame.to_json](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html).
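
The layout of the JSON that is written can be chosen using the `orient` argument. The following sketch uses a small, made-up dataframe and returns the JSON as a string (by omitting the file path) so that the layouts can be compared directly:

```python
import json
from io import StringIO
import pandas as pd

# Small made-up dataframe
demo = pd.DataFrame({'name': ['MK7 6', 'MK7 7'], 'score': [1, 1]})

# With no path argument, to_json() returns the JSON as a string
as_columns = demo.to_json()                  # default orient='columns'
as_records = demo.to_json(orient='records')  # a list of row dicts

print(as_columns)
print(as_records)

# Read the records layout back in, using the matching orient
restored = pd.read_json(StringIO(as_records), orient='records')
print(restored)
```

The `'records'` layout is often the easier one to share with other tools, since each row is a self-contained JSON object, although it does not preserve the row index.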

## Summary
In this Notebook you have seen how to:
1.  read JSON data using the `json.loads()` function
2.  parse JSON data directly from a `requests` response using its `.json()` method
3.  use *pandas* to parse and manipulate JSON data
4.  write data in a DataFrame to a JSON file.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to look at `02.2.3 Data file formats - other`. 