Openly shared data is often published in distinct chunks, such as annual summaries of data points with each year's summary published in a separate file. To support trend analysis on data published that way, the distinct chunks can be combined.

This notebook shows an example of retrieving multiple data files through an Application Programming Interface (API) and combining the data in those files into a single pandas DataFrame. The example also illustrates some of the common issues encountered when using open data.

The data used in the example are available through the U.S. Department of Education's Open Data Platform, at https://data.ed.gov. That system organizes datasets in "data profiles" that describe sets of related data files.
The specific dataset used in the example is the collection of data files in the data profile with title "IDEA Section 618 Data Products: Static Tables Part B Maintenance of Effort Reduction Table 5".

Searching on that title on the Open Data Platform leads to a page which should look like the [IDEA Data Profile](IDEA-data-profile.png) image. Selecting the "Resources" link on that page presents a list of data files, each corresponding to a single school year. This example notebook shows how to retrieve all of the data files in that list and combine them in a way that enables trend analysis over all the school years for which data are available.

The Open Data Platform has an API, described in a link at the bottom right of the web application's footer. At the time of this notebook's creation, the API is documented at [CKAN version 2.9 API](http://docs.ckan.org/en/2.9/api/).

This example uses only Python modules typically included in a Python distribution. This variation may be helpful in corporate environments that restrict installation of additional Python modules.

The CKAN API documentation notes that responses to API calls are formatted as JSON documents. The json module will help display those responses.

In [6]:
import json

Make an API call to retrieve a list of data files.

The CKAN API uses the term "package" for what the Open Data Platform calls data profiles. The API function named "package_show" returns all the metadata describing a package. That API call needs just one parameter: a unique identifier for the package. The Open Data Platform lists the unique identifier for the package in this example as "cf9bff75-1577-4ca7-a957-1f0c269aeff4". (The "Show more" link may need to be selected to show the unique identifier.)
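
As a rough illustration of what that call looks like on the wire, the sketch below builds the GET URL for "package_show" using only the standard library. The base URL and identifier are the ones used in this notebook; the helper name `build_action_url` is made up for this illustration and is not part of CKAN or urllib3.

```python
from urllib.parse import urlencode

# Base URL and package identifier as used in this notebook.
CKAN_BASE_URL = "https://data.ed.gov/api/3/action"
IDENTIFIER = "cf9bff75-1577-4ca7-a957-1f0c269aeff4"


def build_action_url(action, **params):
    """Return the GET URL for a CKAN action (hypothetical helper)."""
    url = f"{CKAN_BASE_URL}/{action}"
    query = urlencode(params)
    # Append the query string only when there are parameters to send.
    return f"{url}?{query}" if query else url


package_show_url = build_action_url("package_show", id=IDENTIFIER)
print(package_show_url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON before writing any further code.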

The following Python code performs the package_show API function call for the specific data profile of interest, and displays the result in a readable format. The "urllib3" module is used to perform the API call. It is a third-party package rather than part of the Python standard library, but it is bundled with many Python distributions.

In [19]:
import urllib3 as urll

# Create an object for managing HTTP requests, called a "pool manager" in the
# urllib3 module.
hp = urll.PoolManager()

# Define the parameters for the API connection.
# The required parameter for the connection is the base URL for the API functions.
# For this example, use the U.S. Department of Education's data portal at data.ed.gov.
# Because no CKAN API wrapper module is used here, the base URL itself includes
# the API version and action path ("/api/3/action").
CKAN_BASE_URL = "https://data.ed.gov/api/3/action"

# Use the API call to retrieve the descriptive metadata for a "package", using the
# unique identifier that the data portal assigns to it.
IDENTIFIER = "cf9bff75-1577-4ca7-a957-1f0c269aeff4"

# Construct a dictionary object containing the parameters to use for the API call.
# For this example, the only parameter needed is the ID.
params = {'id': IDENTIFIER}

# Use the pool manager object to submit an HTTP GET request, naming the specific
# API function to perform, and passing the parameter dictionary.
response = hp.request("GET", f'{CKAN_BASE_URL}/package_show', fields = params)

# The response has a status field indicating whether the HTTP GET request
# succeeded or failed. That field is set to the HTTP response code, so a
# value of 200 means success, 404 means the URL (i.e., API function in this case)
# was not found, etc.
api_result = None
if response.status == 200:
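    # Print the response to see what the API returns.
    # Knowing it's a JSON object, decode the response body and format it for
    # easier reading with the json.dumps function. Decoding response.data with
    # json.loads works whether or not the installed urllib3 provides response.json().
    api_result = json.loads(response.data.decode('utf-8'))
    print(json.dumps(api_result,indent=2))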
else:
    print(f'package_show response code: {response.status}')

{
  "help": "https://data.ed.gov/api/3/action/help_show?name=package_show",
  "success": true,
  "result": {
    "access_level": "public",
    "amended_by_user": "true",
    "author": "Office of Special Education Programs - Research to Practice Division",
    "author_email": "OSEPideadata@ed.gov",
    "bureau_code": "018:20",
    "creator_user_id": "d6645315-1307-4d3a-9295-2a4f21b3a7ac",
    "data_dictionary_pkg": "",
    "data_dictionary_pkg_format": "",
    "data_quality": false,
    "end_date": "",
    "id": "cf9bff75-1577-4ca7-a957-1f0c269aeff4",
    "indraft": "false",
    "isopen": true,
    "level_of_data": [
      "national"
    ],
    "license_id": "cc-zero",
    "license_title": "Creative Commons CCZero",
    "license_url": "http://www.opendefinition.org/licenses/cc-zero",
    "maintainer": "",
    "maintainer_email": "odp@ed.gov",
    "metadata_created": "2021-09-24T14:30:08.358273",
    "metadata_modified": "2023-03-15T13:31:15.083103",
    "name": "idea-section-618-data-pr

The response from the HTTP GET request contains its own "success" field, indicating whether the API call succeeded. In this context, a true value for "success" means the API did find a package with the provided unique identifier.
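
A minimal sketch of that check follows, written as a small helper that either returns the "result" payload or raises. The helper name `unwrap_ckan_response` and the two sample dictionaries are invented for illustration. The sketch also assumes that a failed call carries an "error" object, as the CKAN API documentation describes.

```python
def unwrap_ckan_response(api_response):
    """Return the 'result' payload, or raise if the API reported failure.

    Hypothetical helper, not part of CKAN or urllib3.
    """
    if not isinstance(api_response, dict):
        raise ValueError("API response is not a JSON object")
    if not api_response.get("success", False):
        # Assumption: failed calls include an "error" object; default to {} if absent.
        error = api_response.get("error", {})
        raise RuntimeError(f"CKAN API call failed: {error}")
    return api_response.get("result")


# Made-up responses that only mimic the general shape of a CKAN reply.
ok = {"success": True, "result": {"id": "example-id", "resources": []}}
failed = {"success": False, "error": {"message": "Not found"}}

print(unwrap_ckan_response(ok)["id"])
try:
    unwrap_ckan_response(failed)
except RuntimeError as exc:
    print(exc)
```

Raising an exception, rather than silently returning None, makes a failed lookup visible at the point where it happens instead of several cells later.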

Extract the API result from the HTTP GET response.

In [23]:
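result = None
# api_result stays None when the HTTP request fails, so check it before use.
if api_result is not None and api_result.get('success',False):
    result = api_result.get('result',None)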

The rest of the code for combining the data files is the same as in the notebook that uses the CKAN API wrapper module. A shortened version that produces the final combined dataframe follows.

In [24]:
import pandas as pd

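def prepare_datafile(url):
    # Read the spreadsheet, assigning consistent column names across all files.
    df = pd.read_excel(url,names=['State','Number LEAs','Number receiving CEIS',
                                  'Number receiving CEIS and special education services'])
    # The school year is read from the first row, second column of each file.
    school_year = df.iloc[0].iloc[1]
    df['School year'] = school_year

    # Drop the first eight rows (index labels 0 through 7) before converting
    # the remaining rows.
    df.drop(range(8),inplace=True)

    # Convert the count columns to numbers; entries that cannot be parsed become NaN.
    df['Number LEAs'] = pd.to_numeric(df['Number LEAs'],errors='coerce')
    df['Number receiving CEIS'] = pd.to_numeric(df['Number receiving CEIS'],errors='coerce')
    df['Number receiving CEIS and special education services'] = pd.to_numeric(df['Number receiving CEIS and special education services'],errors='coerce')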

    return df
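
To make the effect of `errors='coerce'` concrete, here is a small sketch on made-up data. The state rows and the placeholder strings "x" and "-" are invented for illustration and are not taken from the actual spreadsheets. It assumes pandas is installed.

```python
import pandas as pd

# Made-up rows imitating a sheet whose first row carries the school year
# and whose data rows contain some non-numeric placeholders.
raw = pd.DataFrame({
    "State": ["School year", "Alabama", "Alaska", "Arizona"],
    "Number LEAs": ["2013-14", "137", "x", "-"],
})

# Read the school year from the first row, second column.
school_year = raw.iloc[0].iloc[1]

# Drop the leading non-data row (index label 0), then coerce to numbers.
data = raw.drop(range(1)).copy()
data["Number LEAs"] = pd.to_numeric(data["Number LEAs"], errors="coerce")
data["School year"] = school_year

print(data)
```

Values that cannot be parsed become NaN, which pandas skips by default in aggregations such as `sum`, so placeholder text does not break later totals.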

Loop over all the data files, prepare each one with the above function, and combine the results.

In [26]:
cdf = None
datafile_list = result.get('resources',None)
for datafile in datafile_list:
    url = datafile.get('url',None)
    if url is not None:
        # Ensure the url actually looks like an Excel spreadsheet.
        if url.endswith('.xlsx'):
            df = prepare_datafile(url)
            if cdf is None:
                cdf = df
            else:
                cdf = pd.concat([cdf,df],ignore_index=True)
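
The `url.endswith('.xlsx')` test above would miss a URL that has a query string after the file name or an uppercase extension. A slightly more tolerant filter can be sketched with the standard library, as below. The helper name `excel_urls` and the example.org URLs are invented for illustration. Whether the Open Data Platform ever returns such URLs is not something this notebook establishes.

```python
from urllib.parse import urlparse


def excel_urls(resources):
    """Return resource URLs whose path ends in .xlsx (hypothetical helper)."""
    urls = []
    for resource in resources or []:
        url = resource.get("url")
        if not url:
            continue
        # Inspect only the path, so a query string or fragment does not hide
        # the extension; lower() makes the comparison case-insensitive.
        if urlparse(url).path.lower().endswith(".xlsx"):
            urls.append(url)
    return urls


# Made-up resource entries shaped like the "resources" list in the metadata.
sample = [
    {"url": "https://example.org/files/table5-2013-14.xlsx"},
    {"url": "https://example.org/files/table5-2014-15.XLSX?download=1"},
    {"url": "https://example.org/files/readme.pdf"},
    {"name": "no url here"},
]
print(excel_urls(sample))
```

Passing `None` returns an empty list, so the helper also tolerates metadata that lacks a "resources" entry.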

To illustrate the type of analysis that can be done on the combined dataset, aggregate the child counts by school year.

In [27]:
gdf = cdf.groupby(['School year']).agg({'Number receiving CEIS':'sum',
                                        'Number receiving CEIS and special education services':'sum'})
gdf

School year,Number receiving CEIS,Number receiving CEIS and special education services
2013-14,1577168.0,239170.0
2014-15,1470812.0,277636.0
2015-16,1412122.0,230986.0
2016-17,1534814.0,248942.0
2017-18,1311272.0,230268.0
2018-19,859082.0,131960.0
2019-20,803532.0,181508.0
2020-21,629262.0,160156.0
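
The combine-then-aggregate pattern can be tried without any downloads. The sketch below uses two tiny made-up frames; the state names, counts, and the `Count` column are invented for illustration. It collects the frames in a list and concatenates them once, which gives the same result as concatenating inside the loop. It assumes pandas is installed.

```python
import pandas as pd

# Two made-up per-year frames standing in for prepared data files.
frames = [
    pd.DataFrame({"State": ["A", "B"], "Count": [10.0, None], "School year": "2013-14"}),
    pd.DataFrame({"State": ["A", "B"], "Count": [7.0, 5.0], "School year": "2014-15"}),
]

# Concatenate once, renumbering the rows, then total the counts per school year.
combined = pd.concat(frames, ignore_index=True)
totals = combined.groupby("School year")["Count"].sum()

print(totals)
```

The missing value in the first frame is skipped by `sum`, so that year's total reflects only the rows with data. When reading real totals, it is worth checking whether each file also contains summary rows that would be counted along with the individual rows.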
