Here I will use Python to access EU industry data from the European Union Open Data Portal (EU ODP) through an API and store the data in a pandas dataframe.

The dataset (see http://ec.europa.eu/eurostat/web/products-datasets/-/sts_inpr_m) contains monthly industry production data going back to 1953 and can also be downloaded manually from the website. Here, however, the purpose is to obtain the data automatically via a Python script and store it directly in a pandas dataframe. The data access works through an API (application programming interface).

The documentation on the EU ODP API (see https://data.europa.eu/euodp/en/developerscorner) tells us the URL of the API:
http://data.europa.eu/euodp/data/api

Let's start with importing the necessary packages to Python:

In [2]:
import requests
import pandas as pd

With the requests package, I can make a connection to the API and perform queries in an easy way.

Pandas is a Python package that stores data efficiently in a tabular format and brings many methods to analyze, visualize, manipulate, restructure, or export these data, so I will make use of it a lot in the upcoming projects.

Let us now connect to the API using the get method and print out the response with the text attribute:

In [3]:
url = 'http://data.europa.eu/euodp/data/api'
response = requests.get(url)
print(response.text)

{"version": 1}


The response is in the JSON (JavaScript Object Notation) format, which delivers results as 'key: value' pairs, separated by commas. The whole result is enclosed by curly brackets and readable by humans.

Since I didn't specify any query, the result is just a simple statement of something having the version of 1, presumably the API.
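
To make the 'key: value' idea concrete, here is a minimal sketch that parses the same response text with Python's built-in json module, without any network access. The variable names are only for illustration:

```python
import json

# The exact text returned by the bare API URL above
raw_text = '{"version": 1}'

# json.loads turns a JSON string into the matching Python object,
# here a dictionary with a single key
parsed = json.loads(raw_text)

print(type(parsed).__name__)
print(parsed["version"])
```

The json() method of a requests response, which I use further down, performs the same conversion in one step.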

How do I perform a query? The EU ODP API website tells me that the API is an implementation of the more general CKAN API and refers me to its documentation (see http://docs.ckan.org/en/latest/api/index.html). There I learn that a query can be specified by adding '/action/' plus some instruction string to the URL. For example, I can request a list of all available datasets with the dataset_list command:

In [4]:
query = '/action/dataset_list'
response = requests.get(url+query)
print(response.text)

{"help": "Return a list of the names of the site's datasets (packages).\n\n    :param limit: if given, the list of datasets will be broken into pages of\n        at most ``limit`` datasets per page and only one page will be returned\n        at a time (optional)\n    :type limit: int\n    :param offset: when ``limit`` is given, the offset to start returning packages from\n    :type offset: int\n\n    :rtype: list of strings\n\n    ", "success": true, "result": ["0026aa70-cc6d-4f6f-8c2f-554a2f9b17f2", "00a87831-3a64-4a08-a681-3929aeca1876", "00nlSr3zHd3S6PiCskoXg", "00yyo0vinq079ZH4FcOqw", "01009127-5ddf-4f69-8a6b-30e6218f17bb", "014be465-c941-4ad0-9817-b4de72e19773", "014HVGAGnu32p17RrVj5KQ", "01Al806on2wfDK73I3Zt4Q", "01d65c42-ec7b-4716-ad01-997db0776f1e", "01gR6AIEivlA5S11A3MCA", "01UdtrDlyqeo2JxuPaEw", "01VNHNMrYRAezdyznUwcGA", "02008597-88e9-43d5-bea8-d4371639e13f", "02136dfd-a71f-40f2-bf6e-c65cab45acbe", "0240fbc2-24b2-4cd6-ae92-50f235c78091", "02764dtQ8W5U2bhw10VBug", "027BWD1UDQ

Whoa, that is indeed a lot of datasets! The key-value structure can be nicely seen, but the names are mostly cryptic. How do I find out if the desired dataset exists? From the dataset webpage, I know that the name of the dataset is "sts_inpr_m". However, it cannot be found in that list.
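
As a side note, the help text in the response mentions the optional "limit" and "offset" parameters for breaking the list into pages. Below is a minimal sketch of how such a paged query URL could be built with the standard library. The helper name dataset_list_url is made up for this example, and I have not verified how the EU ODP instance reacts to these parameters:

```python
from urllib.parse import urlencode

API_URL = 'http://data.europa.eu/euodp/data/api'

def dataset_list_url(limit, offset=0):
    """Build a dataset_list query URL for one page of results.

    'limit' and 'offset' are the parameter names given in the API's own
    help text; the helper itself is only an illustration.
    """
    params = urlencode({'limit': limit, 'offset': offset})
    return API_URL + '/action/dataset_list?' + params

# URL for the third page when using 100 names per page
page_url = dataset_list_url(limit=100, offset=200)
print(page_url)
```

Such a URL could then be passed to requests.get() in the same way as before.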

Let's try a search query instead! This is done with the dataset_search command, followed by a question mark, the characters 'fq=' and the search string. "fq" stands for filter query and lets me perform complex queries (see https://wiki.apache.org/solr/CommonQueryParameters), though here I just want to find datasets with the matching string value. In addition, I convert the returned JSON into a Python dictionary by applying the json() method on the response:

In [5]:
query = '/action/dataset_search?fq=sts_inpr_m'
response = requests.get(url+query)
response_dict = response.json()
print(response_dict)

{'help': '\n    Searches for packages satisfying a given search criteria.\n\n    This action accepts solr search query parameters (details below), and\n    returns a dictionary of results, including dictized datasets that match\n    the search criteria, a search count and also facet information.\n\n    **Solr Parameters:**\n\n    For more in depth treatment of each paramter, please read the `Solr\n    Documentation <http://wiki.apache.org/solr/CommonQueryParameters>`_.\n\n    This action accepts a *subset* of solr\'s search query parameters:\n\n\n    :param q: the solr query.  Optional.  Default: `"*:*"`\n    :type q: string\n    :param fq: any filter queries to apply.  Note: `+site_id:{ckan_site_id}`\n        is added to this string prior to the query being executed.\n    :type fq: string\n    :param sort: sorting of the search results.  Optional.  Default:\n        \'relevance asc, metadata_modified desc\'.  As per the solr\n        documentation, this is a comma-separated string of 

Ah, now I am getting something! It is still quite a complex dictionary though, with several sub-layers (dictionaries of dictionaries). There is only one matching dataset, whose metadata are hidden in "response_dict['result']['results'][0]". Where does the string 'sts_inpr_m' appear? Let's iterate over all keys and search for a match:

In [6]:
found_dataset_dict = response_dict['result']['results'][0]
for key, value in found_dataset_dict.items():
    if value=='sts_inpr_m':
        print(key)

identifier
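
Note that the loop above only compares the top-level values, so a match buried in a nested dictionary or list would go unnoticed. Here is a minimal sketch of a recursive variant. The function find_paths and the small sample dictionary are made up for illustration and only mimic the shape of the real metadata:

```python
def find_paths(obj, target, path=()):
    """Yield the key/index path to every value equal to target."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_paths(value, target, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_paths(value, target, path + (index,))
    elif obj == target:
        yield path

# Hypothetical stand-in for found_dataset_dict (not the real metadata)
sample = {
    'identifier': 'sts_inpr_m',
    'name': 'dZzomwHlfy7S3KYLnVaSLg',
    'resources': [
        {'url': 'http://example.org/a'},
        {'description': 'sts_inpr_m'},
    ],
}

paths = list(find_paths(sample, 'sts_inpr_m'))
print(paths)
```

Each path is a tuple of keys and list positions that leads from the outer dictionary to the matching value.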


Okay, so 'sts_inpr_m' is not the name of the dataset, but its "identifier"! But - what IS its name after all?

In [7]:
print(found_dataset_dict['name'])

dZzomwHlfy7S3KYLnVaSLg


Yikes! No wonder I didn't find it... A more useful, human-readable name is the title:

In [8]:
print(found_dataset_dict['title'])

Production in industry - monthly data


This is much better indeed. Now that I can access the metadata, how do I get the actual data? Surprisingly, I cannot use the API to retrieve specific parts of the data in JSON format but have to download the whole dataset as a file. The available download options are listed in a convoluted way in the metadata, more specifically under the 'resources' key, which holds a list of dictionaries. The closest thing to what I want is the data as tab-separated values (TSV), found at index 1 of that list:

In [9]:
print(found_dataset_dict['resources'][1])

{'mimetype': None, 'cache_url': None, 'hash': '', 'description': 'Download dataset in TSV format (unzipped)', 'name': None, 'format': 'text/tab-separated-values', 'url': 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz&unzip=true', 'created': '2017-10-05T08:03:02.358231', 'state': 'active', 'webstore_url': None, 'revision_timestamp': '2017-10-05T06:03:02.291532', 'tracking_summary': {'total': 2, 'recent': 0}, 'mimetype_inner': None, 'download_total_resource': 0, 'url_type': None, 'position': 4, 'resource_group_id': '11d6cb62-8ab1-493e-910c-7481361d2fdc', 'revision_id': '859b84a5-a0d7-4c6a-b5e1-2413d2efcad9', 'id': '28c5b28d-dd99-450f-95f8-9505f2a384d9', 'resource_type': 'http://www.w3.org/TR/vocab-dcat#Download', 'size': None}
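
Relying on the list position 1 is a bit fragile, since the order of the resources might change. A minimal sketch of an alternative is to pick the resource by its 'format' value instead. The helper find_resource_url and the first sample entry are hypothetical; only the TSV entry mirrors the output above:

```python
def find_resource_url(resources, format_name):
    """Return the URL of the first resource with the given format, or None."""
    for resource in resources:
        if resource.get('format') == format_name:
            return resource.get('url')
    return None

# Shortened stand-in for found_dataset_dict['resources'];
# the first entry is invented for this example
sample_resources = [
    {'format': 'hypothetical/other', 'url': 'http://example.org/other'},
    {'format': 'text/tab-separated-values',
     'url': 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/'
            'BulkDownloadListing?file=data/sts_inpr_m.tsv.gz&unzip=true'},
]

tsv_url = find_resource_url(sample_resources, 'text/tab-separated-values')
print(tsv_url)
```

If no resource has the requested format, the helper returns None, which should be checked before trying to download anything.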


The URL can be found in the 'url' key. With this information, I can load the data from the online TSV file directly into a pandas dataframe, using the read_csv function. For this to work properly, however, I need to specify that the separator can be a tab OR a comma (in the form of a regular expression, where "|" stands for OR), because the index columns are separated by commas and the data columns by tabs. Regular-expression separators are only supported by the Python parsing engine, so I set "engine='python'" explicitly, which also prevents a warning. Finally, because the first five columns of the file belong to the index, I pass their column indices as a list to the "index_col" parameter. Then the dataframe is created with a correct MultiIndex:

In [10]:
dataset_tsv_url = found_dataset_dict['resources'][1]['url']
print(dataset_tsv_url)
df = pd.read_csv(dataset_tsv_url, sep='\t|,', index_col=[0,1,2,3,4], engine='python')
print(df.head(3))  # print the first three rows of the dataframe
df.info()          # print the structure and size of the dataframe (info() prints by itself)

http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz&unzip=true
                                     2017M08  2017M07  2017M06  2017M05   \
indic_bt nace_r2 s_adj unit geo\time                                       
PROD     B       CA    I10  AT             :   108.8 p   108.8    110.8    
                            BA         122.1    115.3    116.8    109.5    
                            BE             :   102.0 p  121.2 p  111.2 p   

                                     2017M04  2017M03  2017M02  2017M01   \
indic_bt nace_r2 s_adj unit geo\time                                       
PROD     B       CA    I10  AT         114.1    113.0     92.3    100.5    
                            BA         106.9    123.3    104.6     91.1    
                            BE        110.0 p  119.6 p   97.2 p   81.9 p   

                                     2016M12  2016M11    ...   1953M10   \
indic_bt nace_r2 s_adj unit geo\time            

The data columns give the monthly industry indicator values from August 2017 all the way back to January 1953, albeit with many missing entries (denoted by the colons).
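
To illustrate both the mixed separators and the cell contents, here is a minimal sketch that takes apart a single data line with the standard library only. The sample line is my reconstruction of the raw file layout based on the printed output above, not a verbatim copy of the file, and parse_value is a made-up helper that assumes a flag such as "p" is always separated from the number by a space:

```python
import re

# Reconstructed example line: five comma-separated index fields,
# followed by tab-separated data cells
sample_row = 'PROD,B,CA,I10,AT\t: \t108.8 p\t108.8 \t110.8 '

def parse_value(cell):
    """Convert one data cell to a float, or None for a missing entry."""
    tokens = cell.split()
    if not tokens or tokens[0].startswith(':'):
        return None          # ':' marks a missing value
    return float(tokens[0])  # ignore a trailing flag such as 'p'

# Same regular expression as in the read_csv call: tab OR comma
fields = re.split(r'\t|,', sample_row)
index_fields = tuple(fields[:5])
values = [parse_value(cell) for cell in fields[5:]]

print(index_fields)
print(values)
```

The same kind of conversion would be needed before doing any arithmetic on the dataframe, because all data columns are read in as text.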

There is an additional problem here that is less obvious. The output of "df.info()" shows that there are only 3016 rows. However, a manual download of the zipped file (without the "&unzip=true" in the URL) and inspection in Notepad++ revealed that there should be 19198 rows. What happened to all the other rows? Pandas appears to be working correctly, so the problem seems to lie on the side of the EU ODP, presumably in the server-side unzipping.

A workaround is getting rid of the "&unzip=true" part of the URL and letting pandas do the unzipping by adding "compression='gzip'" to the "read_csv()" call:

In [14]:
print(dataset_tsv_url[:-11])  # trim the trailing '&unzip=true' (11 characters)
df = pd.read_csv(dataset_tsv_url[:-11], compression='gzip', sep='\t|,', index_col=[0,1,2,3,4], engine='python')
df.info()  # info() prints its summary by itself

http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 19197 entries, (PROD, B, CA, I10, AT) to (PROD, MIG_NRG_X_E, SCA, PCH_PRE, UK)
Columns: 776 entries, 2017M08  to 1953M01
dtypes: object(776)
memory usage: 113.8+ MB


This looks much better!
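
Trimming a fixed number of characters from the end of the URL only works as long as "&unzip=true" really is the last part of it. A minimal sketch of a more robust variant removes the parameter by name using the standard library. The helper name drop_query_param is made up for this example:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_query_param(url, name):
    """Return url without the query parameter called name."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key != name]
    # safe='/' keeps the slash in 'data/sts_inpr_m.tsv.gz' unescaped
    return urlunsplit(parts._replace(query=urlencode(kept, safe='/')))

original_url = ('http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/'
                'BulkDownloadListing?file=data/sts_inpr_m.tsv.gz&unzip=true')
gz_url = drop_query_param(original_url, 'unzip')
print(gz_url)
```

A URL without an "unzip" parameter passes through unchanged, so the helper can be applied unconditionally.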

Since the purpose of this project was just to import the data, I am finished here!