Openly-shared data is often published in distinct chunks, such as annual summaries of data points with each year's summary published in a separate file. In order to facilitate trend analysis on data published in that fashion, the distinct chunks can be combined. 
This notebook shows an example of retrieving multiple data files, using an Application Programming Interface (API), and combining the data in those files in a single Pandas dataframe. The example shows some of the common issues encountered when using open data.
The data used in the example are available through the U.S. Department of Education's Open Data Platform, at https://data.ed.gov. That system organizes datasets in "data profiles" that describe sets of related data files.
The specific dataset used in the example is the collection of data files in the data profile with title "IDEA Section 618 Data Products: Static Tables Part B Maintenance of Effort Reduction Table 5".
Searching on that title on the Open Data Platform leads to a page which should look like the [IDEA Data Profile](IDEA-data-profile.png) image. Selecting the "Resources" link on that page presents a list of data files, each corresponding to a single school year. This example notebook shows how to retrieve all of the data files in that list and combine them in a way that enables trend analysis over all the school years for which data are available.
The Open Data Platform has an API, described in a link at the bottom right of the web application's footer. At the time of this notebook's creation, the API is documented at [CKAN version 2.9 API](http://docs.ckan.org/en/2.9/api/).
This example uses a Python module specific to the CKAN API. That Python module implements a "wrapper" around the lower-level networking modules, making it somewhat easier to use the API by abstracting some of the conventions used in the CKAN API. The Python code to use the API thus must import that CKAN API module. **NOTE:** The CKAN API module is typically not included by default in Python library sets, so it may need to be installed in order to run this notebook.
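Before importing, it can help to confirm that the module is actually available. The following is a minimal sketch that uses only the standard library; the helper name `module_available` is hypothetical, and the suggested `pip install ckanapi` command assumes the package is installed from PyPI under that name.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with the given name can be found for import."""
    return importlib.util.find_spec(name) is not None


# Report whether the CKAN API wrapper is present in this environment.
if module_available("ckanapi"):
    print("ckanapi is installed.")
else:
    print("ckanapi is not installed; try: pip install ckanapi")
```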

In [1]:
# Import the CKAN API wrapper module.
import ckanapi

The CKAN API documentation notes that responses to API calls are formatted as JSON documents. The json module will help display those responses.

In [2]:
import json
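
As a small illustration of why the json module helps with display, the sketch below formats a made-up dictionary both compactly and with indentation. The dictionary contents are placeholders, not a real API response.

```python
import json

# A made-up dictionary standing in for an API response (placeholder values only).
sample = {"id": "example-id", "resources": [{"url": "https://example.org/file.xlsx"}]}

# Without indent, the whole document is written on a single line.
compact = json.dumps(sample)

# With indent=2, each nesting level is placed on its own indented line.
readable = json.dumps(sample, indent=2)

print(compact)
print(readable)
```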

Make an API call to retrieve a list of data files.
The CKAN API uses the term "package" for what the Open Data Platform calls a data profile. The API function named "package_show" returns all the metadata describing a package. That API call needs just one parameter: a unique identifier for the package. The Open Data Platform lists the unique identifier for the package in this example as "cf9bff75-1577-4ca7-a957-1f0c269aeff4". (The "Show more" link may need to be selected to show the unique identifier.)
The following Python code performs the package_show API function call for the specific data profile of interest, and displays the result in a readable format.

In [3]:
# Define the parameters for the API connection.
# The required parameter for the connection is the base URL for the API functions.
# For this example, use the U.S. Department of Education's data portal at data.ed.gov.
# The CKAN API wrapper module appends any other information to the URL specific to the API version.
CKAN_BASE_URL = "https://data.ed.gov"

# Create a connection object for using the API.
remote = ckanapi.RemoteCKAN(CKAN_BASE_URL)

# Use the API call to retrieve the descriptive metadata for the "package" (data profile)
# that contains all the data files, using its unique identifier from the data portal.
IDENTIFIER = "cf9bff75-1577-4ca7-a957-1f0c269aeff4"

# Construct a dictionary object containing the parameters to use for the API call.
# For this example, the only parameter needed is the ID.
params = {'id': IDENTIFIER}

# Use the connection object to perform the API call, passing the name of the function to invoke and the parameter
# dictionary.
result = remote.call_action(action = 'package_show', data_dict = params)

# Print the response to see what the API returns. 
# Knowing it's a JSON object, format it for easier reading with the json.dumps function.
print(json.dumps(result,indent=2))

SSLError: HTTPSConnectionPool(host='data.ed.gov', port=443): Max retries exceeded with url: /api/action/package_show (Caused by SSLError(SSLError(1, '[SSL: UNSAFE_LEGACY_RENEGOTIATION_DISABLED] unsafe legacy renegotiation disabled (_ssl.c:1129)')))
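
The path in the message above, /api/action/package_show, shows what the wrapper appended to the base URL. As a rough sketch of that convention, the code below builds the equivalent address by hand. It assumes the portal also accepts the identifier as a query-string parameter, as the CKAN API documentation describes for GET requests; the helper `action_url` is hypothetical and is not part of the ckanapi module. No network request is made.

```python
from urllib.parse import urlencode

CKAN_BASE_URL = "https://data.ed.gov"
IDENTIFIER = "cf9bff75-1577-4ca7-a957-1f0c269aeff4"


def action_url(base_url, action, params):
    """Build a GET-style URL for a CKAN action (hypothetical helper, not part of ckanapi)."""
    return f"{base_url.rstrip('/')}/api/action/{action}?{urlencode(params)}"


# Only construct and display the address; this does not contact the server.
url = action_url(CKAN_BASE_URL, "package_show", {"id": IDENTIFIER})
print(url)
```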

The "resources" key contains a list of descriptive information about the data files associated with the package, similar to the partial set of key/value pairs below.
~~~
"resources": [
    {
      "access_url": "",
...
      "description": "Table 5 Number of children who received CEIS anytime in the past two years and who received special education and related services 2020-2021",
      "ed_source": "",
...
      "package_id": "cf9bff75-1577-4ca7-a957-1f0c269aeff4",
...
      "url": "https://data.ed.gov/dataset/cf9bff75-1577-4ca7-a957-1f0c269aeff4/resource/1f5632da-5044-4de6-a81c-51485434ef0c/download/2021-bmaintenancedistrict-5.xlsx",
...
    },
~~~

The "url" key in each resources entry holds the web address where the data file is stored. Examine the URLs.

In [6]:
datafile_list = result.get('resources', None)
if datafile_list is not None:
    for datafile in datafile_list:
        url = datafile.get('url', None)
        if url is not None:
            print(url)

https://data.ed.gov/dataset/cf9bff75-1577-4ca7-a957-1f0c269aeff4/resource/1f5632da-5044-4de6-a81c-51485434ef0c/download/2021-bmaintenancedistrict-5.xlsx
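
The file type of a URL like the one above can also be checked in code rather than by eye. This is a minimal sketch using only the standard library; the helper name `file_extension` is hypothetical, and it assumes the file name is the last segment of the URL path.

```python
import posixpath
from urllib.parse import urlparse


def file_extension(url):
    """Return the lower-cased extension of the last path segment of a URL."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower()


sample_url = (
    "https://data.ed.gov/dataset/cf9bff75-1577-4ca7-a957-1f0c269aeff4/resource/"
    "1f5632da-5044-4de6-a81c-51485434ef0c/download/2021-bmaintenancedistrict-5.xlsx"
)

# Excel workbooks in the newer format use the .xlsx extension.
print(file_extension(sample_url))
```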


The URLs all end in a file name with the Excel workbook file extension. Try loading them into a pandas dataframe using the read_excel function and examining the first few lines in each. Note, the URL can be used directly by read_excel.

In [1]:
import pandas as pd

for datafile in datafile_list:
    url = datafile.get('url', None)
    if url is not None:
        df = pd.read_excel(url)
        # Inside a loop, print explicitly; a bare df.head() displays nothing.
        print(df.head())

The URL does load properly, but with the read_excel defaults the resulting dataframe is not yet useful for analysis. A few manipulations on the dataframe will get it ready for combination with similar datasets.
First, note the school year listed in the first row suggests the table values are for one school year. To calculate trends across school years, each row in the dataframe needs the school year added as another column. Extract the school year for later use.

In [14]:
# The school year is in row offset 0, column offset 1.
school_year = df.iloc[0].iloc[1]
print(school_year)

2020-21
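
For trend analysis it may be convenient to have a numeric form of a label like the one above, for example to sort school years. The sketch below is one possible approach, assuming labels follow the "2020-21" pattern (or the longer "2020-2021" form); the helper `school_year_start` is hypothetical and is not part of the original notebook.

```python
import re


def school_year_start(label):
    """Return the starting calendar year of a label such as '2020-21', or None if it does not match."""
    match = re.fullmatch(r"\s*(\d{4})-(\d{2}|\d{4})\s*", str(label))
    return int(match.group(1)) if match else None


print(school_year_start("2020-21"))
```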


There is nothing else needed from the first 8 rows, so discard them from the dataframe.

In [18]:
df.drop(index=range(8),inplace=True)
df.head(10)

Unnamed: 0,Table Identifier,bmaintenancedistrict_5,Unnamed: 2,Unnamed: 3
8,Alabama,143,+,+
9,Alaska,54,248,543
10,American Samoa,1,+,+
11,Arizona,638,890,3730
12,Arkansas,263,100,388
13,Bureau of Indian Education,174,1126,99
14,California,1482,32834,29569
15,Colorado,68,0,29
16,Connecticut,162,266,33
17,Delaware,43,8905,1950


Now add the extracted school_year as an additional column.

In [19]:
df['school_year']=school_year
df.head()

Unnamed: 0,Table Identifier,bmaintenancedistrict_5,Unnamed: 2,Unnamed: 3,school_year
8,Alabama,143,+,+,2020-21
9,Alaska,54,248,543,2020-21
10,American Samoa,1,+,+,2020-21
11,Arizona,638,890,3730,2020-21
12,Arkansas,263,100,388,2020-21


Now replace the column labels that were automatically retrieved from the first row of the spreadsheet with more accurate ones. The more accurate column labels are on row 8 of the spreadsheet.
Looking ahead, note that each school year file uses slightly different labels for the tables of values. To compensate for that, force a consistent label for the columns. Combining the dataframes created from each separate file will be straightforward if the column labels match.

In [25]:
df.columns=['state','number LEAs','number receiving CEIS','number receiving CEIS and special education','school_year']
df.head()

Unnamed: 0,state,number LEAs,number receiving CEIS,number receiving CEIS and special education,school_year
8,Alabama,143.0,,+,2020-21
9,Alaska,54.0,248.0,543,2020-21
10,American Samoa,1.0,,+,2020-21
11,Arizona,638.0,890.0,3730,2020-21
12,Arkansas,263.0,100.0,388,2020-21
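
To illustrate why forcing consistent labels makes the later combination straightforward, the sketch below relabels two small stand-in dataframes and stacks them with pd.concat. All values and the original column labels in the stand-ins are made up for illustration; only the final labels come from the cell above.

```python
import pandas as pd

COLUMN_LABELS = ['state', 'number LEAs', 'number receiving CEIS',
                 'number receiving CEIS and special education', 'school_year']

# Two stand-in frames with differing labels, as if read from two school-year files.
# The values are placeholders, not real data.
df_a = pd.DataFrame([['State A', 1, 2, 3, '2020-21']],
                    columns=['State', 'LEAs', 'CEIS', 'CEIS and SpEd', 'school_year'])
df_b = pd.DataFrame([['State A', 4, 5, 6, '2019-20']],
                    columns=['State name', '# LEAs', '# CEIS', '# both', 'school_year'])

# Force the same labels on every frame so the columns line up.
for frame in (df_a, df_b):
    frame.columns = COLUMN_LABELS

# With matching labels, concat stacks the rows into one table.
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
```

Without the relabeling step, pd.concat would treat the differently named columns as separate columns and fill the gaps with NaN.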


Note that the data files use a convention of storing a "+" in each cell for which data were not reported. Translating those "+" values into "not a number" (NaN) values in the dataframe will make analysis easier.

In [26]:
# Use the to_numeric function with a parameter of errors='coerce' to compensate
# for any non-numeric strings encoded in the data.
df['number LEAs'] = pd.to_numeric(df['number LEAs'], errors='coerce')
df['number receiving CEIS'] = pd.to_numeric(df['number receiving CEIS'], errors='coerce')
df['number receiving CEIS and special education'] = pd.to_numeric(df['number receiving CEIS and special education'], errors='coerce')
df.head()

Unnamed: 0,state,number LEAs,number receiving CEIS,number receiving CEIS and special education,school_year
8,Alabama,143.0,,,2020-21
9,Alaska,54.0,248.0,543.0,2020-21
10,American Samoa,1.0,,,2020-21
11,Arizona,638.0,890.0,3730.0,2020-21
12,Arkansas,263.0,100.0,388.0,2020-21
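
As a brief sketch of why NaN values make analysis easier, the example below converts a small stand-in column and then totals it. Pandas aggregations such as sum skip NaN by default, so unreported cells do not cause errors. The three rows are a small sample chosen for illustration.

```python
import pandas as pd

# A small stand-in frame; "+" marks a cell for which data were not reported.
sample = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],
    'number receiving CEIS': ['+', 248, 890],
})

# Coerce non-numeric strings such as "+" to NaN.
sample['number receiving CEIS'] = pd.to_numeric(sample['number receiving CEIS'], errors='coerce')

# sum() skips NaN by default; isna() counts the unreported cells.
total = sample['number receiving CEIS'].sum()
missing = int(sample['number receiving CEIS'].isna().sum())
print(total, missing)
```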
