# Scraping JSON (from dynamically generated webpages)

Sometimes webpages can't be scraped using libraries like `scraperwiki` because the HTML is **dynamically generated**. This means some JavaScript adds extra information to the webpage once it's loaded - but all that `scraperwiki` sees is the HTML *before* that generation happens.

The good news is that webpages like this are often easier to scrape - they are typically loading a data file, and if we can find *that* then we may have all the data in a convenient structured format.

In fact, you probably don't need a scraper at all in that situation: see the blog post [How to: find the data behind an interactive chart or map using the inspector](https://onlinejournalismblog.com/2017/05/10/how-to-find-data-behind-chart-map-using-inspector/) for more details on this.

However, as the data will probably be in JSON format, you can use Python to fetch that and convert it to CSV.

We're going to do that for the data behind [this BBC page on the 2021 local elections](https://www.bbc.co.uk/news/topics/c481drqqzv7t/england-local-elections-2021). You can find that data by opening the Inspector in Chrome or Firefox, switching to the Network tab, and then reloading the page. This will show the files being loaded - when sorted from largest to smallest you can see the biggest file is called 'map' and contains the data behind the election results map. Right-click it and select 'open in new tab' to see the URL, which you'll need to grab the data.

## Google it.

Rather than show you how to grab that data from scratch I'm going to show you an approach which is just as common in coding: Googling it.

Search for "loading json from a url python".

[Here's the first result](https://www.geeksforgeeks.org/how-to-read-a-json-response-from-a-link-in-python/). We're going to copy that code and run it here (we'll use two separate code blocks so the library imports are separate from the rest of the code):

In [6]:
# import urllib library
from urllib.request import urlopen
  
# import json
import json

In [7]:
# store the URL in url as 
# parameter for urlopen
url = "https://api.github.com"
  
# store the response of URL
response = urlopen(url)
  
# storing the JSON response 
# from url in data
data_json = json.loads(response.read())
  
# print the json response
print(data_json)

{'current_user_url': 'https://api.github.com/user', 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}', 'authorizations_url': 'https://api.github.com/authorizations', 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}', 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}', 'emails_url': 'https://api.github.com/user/emails', 'emojis_url': 'https://api.github.com/emojis', 'events_url': 'https://api.github.com/events', 'feeds_url': 'https://api.github.com/feeds', 'followers_url': 'https://api.github.com/user/followers', 'following_url': 'https://api.github.com/user/following{/target}', 'gists_url': 'https://api.github.com/gists{/gist_id}', 'hub_url': 'https://api.github.com/hub', 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}', 'issues_url': 'https://api.github.com/issues', 'keys_url': 'https://api.git
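
To see what `json.loads()` is doing in that last step, here is a minimal sketch that skips the network request. The `raw_bytes` value is a hand-written stand-in (an assumption for illustration) for what `response.read()` returns, using two of the keys from the output above.

```python
import json

# Hypothetical stand-in for response.read(): the body arrives as raw bytes, not as a dictionary
raw_bytes = b'{"current_user_url": "https://api.github.com/user", "emojis_url": "https://api.github.com/emojis"}'

# json.loads() accepts text or bytes and returns ordinary Python objects
data_json = json.loads(raw_bytes)

# The result is a dictionary, so we can look up values by key
print(type(data_json))
print(data_json["emojis_url"])
```

Once the JSON has been parsed like this, everything else is ordinary Python: dictionaries, lists, strings and numbers.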

## Adapt it.

Now let's adapt it for our URL. (We don't need to re-import the libraries.)

In [7]:
# store the URL in url as 
# parameter for urlopen
url = "https://static.files.bbci.co.uk/elections/data/news/election/2021/england/councils/map"
  
# store the response of URL
response = urlopen(url)
  
# storing the JSON response 
# from url in data
data_json = json.loads(response.read())
  
# print the json response
print(data_json)

{'partyColours': {'CON': '#0575C9', 'GRN': '#5FB25F', 'ICHC': '#D32F6C', 'IND': '#FF66A1', 'LAB': '#E91D0E', 'LD': '#EFAC18', 'LIB': '#C7941A', 'MK': '#78721d', 'RA': '#4dadab', 'REF': '#0AD1E0', 'UKIP': '#712F87', 'YP': '#00B8FD'}, 'map': {'E06000001': {'wp': 'NOC', 'wpp': 'NOC', 'flash': 'NOC NO CHANGE', 'url': '/news/topics/cv8k1ezy9m9t/hartlepool-borough-council', 'name': 'Hartlepool', 'yearLast': 2019}, 'E06000006': {'wp': 'LAB', 'wpp': 'LAB', 'flash': 'LAB HOLD', 'url': '/news/topics/cv8k1ezvmg8t/halton-borough-council', 'name': 'Halton', 'yearLast': 2019}, 'E06000007': {'wp': 'LAB', 'wpp': 'LAB', 'flash': 'LAB HOLD', 'url': '/news/topics/c2jq0m4ezm0t/warrington-borough-council', 'name': 'Warrington', 'yearLast': 2016}, 'E06000008': {'wp': 'LAB', 'wpp': 'LAB', 'flash': 'LAB HOLD', 'url': '/news/topics/cmex9y5r55pt/blackburn-with-darwen-borough-council', 'name': 'Blackburn with Darwen', 'yearLast': 2019}, 'E06000010': {'wp': 'LAB', 'wpp': 'LAB', 'flash': 'LAB HOLD', 'url': '/news/
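
That output is a dictionary with two top-level keys, `partyColours` and `map`, and the results for each council sit inside `map`. Here is a small sketch of how you might drill into it, assuming the structure shown above. The `sample` variable is a hand-copied fragment of that output rather than the live data, so the block runs without fetching anything.

```python
# Hand-copied fragment of the BBC output above (two councils only), standing in for data_json
sample = {
    "partyColours": {"CON": "#0575C9", "LAB": "#E91D0E"},
    "map": {
        "E06000001": {"wp": "NOC", "wpp": "NOC", "flash": "NOC NO CHANGE",
                      "url": "/news/topics/cv8k1ezy9m9t/hartlepool-borough-council",
                      "name": "Hartlepool", "yearLast": 2019},
        "E06000006": {"wp": "LAB", "wpp": "LAB", "flash": "LAB HOLD",
                      "url": "/news/topics/cv8k1ezvmg8t/halton-borough-council",
                      "name": "Halton", "yearLast": 2019},
    },
}

# The top-level keys tell us how the file is organised
print(list(sample.keys()))

# Each entry in 'map' is keyed by a code such as E06000001 and holds a dictionary of details
for code, council in sample["map"].items():
    print(code, council["name"], council["flash"])
```

With the real data you would replace `sample` with `data_json`.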

## Converting to a CSV

We can use the same process for converting to a CSV, searching for 'python export json to csv'.

The top result I get is [this StackOverflow question](https://stackoverflow.com/questions/1871524/how-can-i-convert-json-to-csv). StackOverflow pages take a bit of getting used to - which is a good reason to practise with this one.

At the top of the page is the question - you'll have to scroll down to get the answer. In fact, there'll generally be more than one. 

What might appear to be the first 'answers' are actually *comments*: 

> "A simple approach to this is using jq, as described [here](https://stackoverflow.com/questions/32960857/how-to-convert-arbitrary-simple-json-to-csv-using-jq)"

Those can be useful, but look first instead to the first *answer* - this will be in the same size type as the question and to the left it will show how many votes this answer has had (by default it's sorted so the answer with the most votes is top). That answer says:

> "With the `pandas` library, this is as easy as using two commands!"

We can use those now.

Firstly, note that the `pandas` library has been mentioned so we need to import that.

In [1]:
import pandas

Then the rest of the code. Note that we need to fill the parentheses for `.read_json()` with the URL containing the JSON (they haven't said this because they've assumed the reader knows - this can be quite common on StackOverflow so it's worth using other sources and reading around the solution or other solutions if you're having problems):

In [2]:
df = pd.read_json(url)
print(df)

NameError: name 'pd' is not defined

OK, we have an error here: `name 'pd' is not defined`

This means that the `pd` in our code hasn't been defined - there's no variable or function or library with that name. (Yes, I've done this on purpose so you can see what to do in this scenario!)

You might remember that `pd` is a name often given to `pandas` when imported. 

That's the problem here: we've just imported `pandas`, not imported pandas `as pd`. So let's fix that:

In [3]:
import pandas as pd

In [8]:
df = pd.read_json(url)
print(df)

          partyColours                                                map
CON            #0575C9                                                NaN
GRN            #5FB25F                                                NaN
ICHC           #D32F6C                                                NaN
IND            #FF66A1                                                NaN
LAB            #E91D0E                                                NaN
...                ...                                                ...
E10000029          NaN  {'wp': 'CON', 'wpp': 'CON', 'flash': 'CON HOLD...
E10000030          NaN  {'wp': 'CON', 'wpp': 'CON', 'flash': 'CON HOLD...
E10000031          NaN  {'wp': 'CON', 'wpp': 'CON', 'flash': 'CON HOLD...
E10000032          NaN  {'wp': 'CON', 'wpp': 'CON', 'flash': 'CON HOLD...
E10000034          NaN  {'wp': 'CON', 'wpp': 'CON', 'flash': 'CON HOLD...

[155 rows x 2 columns]
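
Notice that this dataframe is awkward: `read_json()` has treated the two top-level keys as columns, so each council's details are still packed into a single dictionary in the `map` column. One possible way round that (a suggestion, not part of the StackOverflow answer) is to build the dataframe from the `map` part only. The sketch below uses a small hand-copied `sample_map`, trimmed to a few fields, so that it runs on its own. With the live data you would pass `data_json['map']` instead.

```python
import pandas as pd

# Hand-copied, trimmed fragment of the 'map' part of the JSON (stand-in for data_json['map'])
sample_map = {
    "E06000001": {"wp": "NOC", "wpp": "NOC", "flash": "NOC NO CHANGE", "name": "Hartlepool", "yearLast": 2019},
    "E06000006": {"wp": "LAB", "wpp": "LAB", "flash": "LAB HOLD", "name": "Halton", "yearLast": 2019},
}

# orient="index" makes each outer key a row and each inner key a column
councils = pd.DataFrame.from_dict(sample_map, orient="index")
print(councils)
```

This gives one row per council and one column per field, which is usually a more useful shape for a spreadsheet. The export step works the same way on either dataframe.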


And then export it - again, remember that we need to specify the name of the CSV in parentheses here. 

In [10]:
df.to_csv("mapdata.csv")
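
If `pandas` isn't available, Python's built-in `csv` module can write a similar file. This is an alternative sketch rather than the approach used above: `sample_map` is a hand-copied fragment of the `map` part of the JSON, the `code` column name is my own choice, and the `StringIO` buffer stands in for a real file so the block can run anywhere.

```python
import csv
import io

# Hand-copied, trimmed fragment of the 'map' part of the JSON (stand-in for data_json['map'])
sample_map = {
    "E06000001": {"wp": "NOC", "wpp": "NOC", "flash": "NOC NO CHANGE", "name": "Hartlepool", "yearLast": 2019},
    "E06000006": {"wp": "LAB", "wpp": "LAB", "flash": "LAB HOLD", "name": "Halton", "yearLast": 2019},
}

# The columns we want, in order; 'code' is a made-up name for the dictionary key
fieldnames = ["code", "name", "wp", "flash", "yearLast"]

# In-memory stand-in for open("mapdata.csv", "w", newline="")
buffer = io.StringIO()

# extrasaction="ignore" skips any fields (here 'wpp') that are not listed in fieldnames
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
for code, council in sample_map.items():
    # Combine the key and the council's details into one row
    writer.writerow({"code": code, **council})

print(buffer.getvalue())
```

To write an actual file, swap the buffer for a file opened with `open("mapdata.csv", "w", newline="")`.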

Now you can go to the 'Files' view on the left-hand side of the Colab notebook and download that file to your computer by hovering over the file and clicking on the three dots to the right.

Remember to do that before you close the notebook as any files will be deleted when you do.