In [1]:
from collections import defaultdict
import json
import logging
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Import the CKAN API wrapper module.
import ckanapi



Make an API call to retrieve a list of data files.
In this example, query a CKAN data portal to retrieve a specific "package" by its unique identifier.
The identifier used in this example, "cf9bff75-1577-4ca7-a957-1f0c269aeff4", belongs to the package named "IDEA Section 618 Data Products: Static Tables Part B Maintenance of Effort Reduction Table 5", which the U.S. Department of Education shares publicly as open data.

In [5]:
# Define the parameters for the API connection.
# The required parameter for the connection is the base URL for the API functions.
# For this example, use the U.S. Department of Education's data portal at data.ed.gov.
CKAN_BASE_URL = "https://data.ed.gov"

remote = ckanapi.RemoteCKAN(CKAN_BASE_URL)

# Use the API call to retrieve a collection of related data files, using the unique identifier
# for the "package" in the data portal that contains them all.
IDENTIFIER = "cf9bff75-1577-4ca7-a957-1f0c269aeff4"

# Construct a dictionary object containing the parameters to use for the API call.
# For this example, the only parameter needed is the ID.
params = {'id': IDENTIFIER}
result = remote.call_action(action='package_show', data_dict=params)

# Print the response to see what the API returns.
# The result is a dictionary parsed from the JSON response, so format it
# for easier reading with the json.dumps function.
print(json.dumps(result, indent=2))

{
  "access_level": "public",
  "amended_by_user": "true",
  "author": "Office of Special Education Programs - Research to Practice Division",
  "author_email": "OSEPideadata@ed.gov",
  "bureau_code": "018:20",
  "creator_user_id": "d6645315-1307-4d3a-9295-2a4f21b3a7ac",
  "data_dictionary_pkg": "",
  "data_dictionary_pkg_format": "",
  "data_quality": false,
  "end_date": "",
  "id": "cf9bff75-1577-4ca7-a957-1f0c269aeff4",
  "indraft": "false",
  "isopen": true,
  "level_of_data": [
    "national"
  ],
  "license_id": "cc-zero",
  "license_title": "Creative Commons CCZero",
  "license_url": "http://www.opendefinition.org/licenses/cc-zero",
  "maintainer": "",
  "maintainer_email": "odp@ed.gov",
  "metadata_created": "2021-09-24T14:30:08.358273",
  "metadata_modified": "2023-03-15T13:31:15.083103",
  "name": "idea-section-618-data-products-static-tables-part-b-moe-table5",
  "notes": "IDEA Section 618 Data Products: Static Tables\r\n##Part B Maintenance of Effort Reduction and Coordina

The "resources" key contains a list of descriptive information about the data files associated with the package, similar to the partial set of key/value pairs below.
~~~
"resources": [
    {
      "access_url": "",
...
      "description": "Table 5 Number of children who received CEIS anytime in the past two years and who received special education and related services 2020-2021",
      "ed_source": "",
...
      "package_id": "cf9bff75-1577-4ca7-a957-1f0c269aeff4",
...
      "url": "https://data.ed.gov/dataset/cf9bff75-1577-4ca7-a957-1f0c269aeff4/resource/1f5632da-5044-4de6-a81c-51485434ef0c/download/2021-bmaintenancedistrict-5.xlsx",
...
    },
~~~

The "url" key in each "resources" entry holds the web address where the data file is stored. Examine the first of the URLs.

In [6]:
# Guard against both a missing "resources" key and an empty list.
datafile_list = result.get('resources')
if datafile_list:
    url = datafile_list[0].get('url')
    if url is not None:
        print(url)

https://data.ed.gov/dataset/cf9bff75-1577-4ca7-a957-1f0c269aeff4/resource/1f5632da-5044-4de6-a81c-51485434ef0c/download/2021-bmaintenancedistrict-5.xlsx
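
The cell above looks only at the first resource. As a minimal sketch of how every Excel download link in a package could be collected, the helper below walks the "resources" list and keeps the URLs whose path ends in ".xlsx". The function name `excel_urls` and the abbreviated `sample_package` dictionary are hypothetical and stand in for the real `result` object; the example.org addresses are placeholders, not portal URLs.

```python
from urllib.parse import urlparse

def excel_urls(package):
    """Return the URLs of resources whose file name ends in .xlsx."""
    urls = []
    for resource in package.get("resources", []):
        url = resource.get("url") or ""
        # Compare on the URL path only, so a query string cannot hide the extension.
        if urlparse(url).path.lower().endswith(".xlsx"):
            urls.append(url)
    return urls

# Hypothetical, abbreviated stand-in for the dictionary returned by package_show.
sample_package = {
    "resources": [
        {"url": "https://example.org/download/table-5.xlsx"},
        {"url": "https://example.org/download/readme.pdf"},
        {"url": ""},
    ]
}

print(excel_urls(sample_package))
```

With the real response, `excel_urls(result)` would be the equivalent call, assuming each resource carries a "url" key as in the excerpt above.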


The URL ends in a file name with the Excel workbook file extension (.xlsx). Try loading it into a pandas dataframe using the read_excel function. The read_excel function accepts a URL directly, so there is no need to download the file first.

In [9]:
df = pd.read_excel(url)
df.head(10)

Unnamed: 0,Table Identifier,bmaintenancedistrict_5,Unnamed: 2,Unnamed: 3
0,School Year,2020-21,,
1,Collection,Part B Maintenance of Effort Reduction and Coo...,,
2,Developed,2022-11-01 00:00:00,,
3,Revised,,,
4,,,,
5,Number of children who received CEIS anytime i...,,,
6,,,,
7,State,Number of reported LEAs1,Number of children who received CEIS during SY...,Number of children who received CEIS anytime d...
8,Alabama,143,+,+
9,Alaska,54,248,543


The file loads without error, but the read_excel defaults do not produce a dataframe that is useful for analysis. Do a few manipulations on the dataframe to get it ready for combination with similar datasets.
First, note that the school year listed in the first row indicates the table values are for a single school year. To calculate trends across school years, each row in the dataframe needs the school year added as another column. Extract the school year for later use.

In [14]:
# The school year is in row offset 0, column offset 1.
school_year = df.iloc[0, 1]
print(school_year)

2020-21


Nothing else is needed from the first 8 rows (offsets 0 through 7), so discard them from the dataframe.

In [18]:
df.drop(index=range(8),inplace=True)
df.head(10)

Unnamed: 0,Table Identifier,bmaintenancedistrict_5,Unnamed: 2,Unnamed: 3
8,Alabama,143,+,+
9,Alaska,54,248,543
10,American Samoa,1,+,+
11,Arizona,638,890,3730
12,Arkansas,263,100,388
13,Bureau of Indian Education,174,1126,99
14,California,1482,32834,29569
15,Colorado,68,0,29
16,Connecticut,162,266,33
17,Delaware,43,8905,1950


Now add the extracted school_year as an additional column.

In [19]:
df['school_year']=school_year
df.head()

Unnamed: 0,Table Identifier,bmaintenancedistrict_5,Unnamed: 2,Unnamed: 3,school_year
8,Alabama,143,+,+,2020-21
9,Alaska,54,248,543,2020-21
10,American Samoa,1,+,+,2020-21
11,Arizona,638,890,3730,2020-21
12,Arkansas,263,100,388,2020-21


Now replace the column labels that pandas took automatically from the first row of the spreadsheet with more accurate ones. The more accurate column labels are on row 8 of the spreadsheet.
Looking ahead, note that each school year's file uses slightly different labels for the tables of values. To compensate for that, force a consistent set of column labels. Combining the dataframes created from each separate file will be straightforward if the column labels match.

In [25]:
df.columns=['state','number LEAs','number receiving CEIS','number receiving CEIS and special education','school_year']
df.head()

Unnamed: 0,state,number LEAs,number receiving CEIS,number receiving CEIS and special education,school_year
8,Alabama,143.0,,+,2020-21
9,Alaska,54.0,248.0,543,2020-21
10,American Samoa,1.0,,+,2020-21
11,Arizona,638.0,890.0,3730,2020-21
12,Arkansas,263.0,100.0,388,2020-21
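
As a rough illustration of why matching labels matter, the sketch below stacks two small dataframes with pd.concat. The frames, the placeholder state names, and the counts are made up for illustration and do not come from the portal; the only assumption is that each school year's dataframe ends up with the same column labels.

```python
import pandas as pd

# Hypothetical values for two school years, for illustration only.
year_a = pd.DataFrame(
    {"state": ["State A", "State B"], "number LEAs": [10, 20], "school_year": ["2020-21", "2020-21"]}
)
year_b = pd.DataFrame(
    {"state": ["State A", "State B"], "number LEAs": [11, 19], "school_year": ["2019-20", "2019-20"]}
)

# Because the column labels match, the rows stack into the same columns.
# ignore_index=True renumbers the rows instead of repeating each frame's index.
combined = pd.concat([year_a, year_b], ignore_index=True)
print(combined)
```

If the labels differed between frames, pd.concat would keep both sets of columns and fill the gaps with NaN, which is the situation the forced labels avoid.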


Note that the data files store a "+" in each cell for which data was not reported. Translating those "+" values into "not a number" (NaN) values in the dataframe will make analysis easier.
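
Because errors='coerce' turns every value that cannot be parsed into NaN, not only "+", it can be worth checking first which non-numeric markers actually occur in a column. The helper below is a minimal sketch of such a check; the name `non_numeric_values` and the sample series are hypothetical.

```python
import pandas as pd

def non_numeric_values(series):
    """Return the distinct values that pd.to_numeric cannot parse."""
    converted = pd.to_numeric(series, errors="coerce")
    # Keep values that were present originally but became NaN after coercion.
    failed = series[converted.isna() & series.notna()]
    return sorted(failed.astype(str).unique())

# Hypothetical sample mixing numbers, the "+" marker, a missing value,
# and another stray string.
sample = pd.Series([143, "+", 248, None, "n/a"])
print(non_numeric_values(sample))
```

Applied to a column such as `df['number receiving CEIS']`, an unexpected marker in the result would suggest the file uses more than one convention for unreported data.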

In [26]:
# Use the to_numeric function with a parameter of errors='coerce' to compensate
# for any non-numeric strings encoded in the data.
df['number LEAs'] = pd.to_numeric(df['number LEAs'], errors='coerce')
df['number receiving CEIS'] = pd.to_numeric(df['number receiving CEIS'], errors='coerce')
df['number receiving CEIS and special education'] = pd.to_numeric(df['number receiving CEIS and special education'], errors='coerce')
df.head()

Unnamed: 0,state,number LEAs,number receiving CEIS,number receiving CEIS and special education,school_year
8,Alabama,143.0,,,2020-21
9,Alaska,54.0,248.0,543.0,2020-21
10,American Samoa,1.0,,,2020-21
11,Arizona,638.0,890.0,3730.0,2020-21
12,Arkansas,263.0,100.0,388.0,2020-21
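
The steps above can be gathered into one function so the same cleanup can be applied to each school year's file. This is a sketch under stated assumptions rather than a definitive implementation: the name `tidy_table` is hypothetical, and it assumes every file has the school year at row offset 0, column offset 1, exactly 8 leading rows to discard, and four data columns in the same order. The small `raw` frame imitates that layout with two rows taken from the output above.

```python
import pandas as pd

COLUMNS = [
    "state",
    "number LEAs",
    "number receiving CEIS",
    "number receiving CEIS and special education",
]

def tidy_table(raw, header_rows=8):
    """Apply the cleanup steps to a dataframe as loaded by read_excel."""
    # The school year sits at row offset 0, column offset 1 (assumed layout).
    school_year = raw.iloc[0, 1]
    # Discard the leading descriptive rows and renumber from zero.
    tidy = raw.iloc[header_rows:].reset_index(drop=True)
    tidy.columns = COLUMNS
    # Turn "+" (and any other non-numeric marker) into NaN.
    for col in COLUMNS[1:]:
        tidy[col] = pd.to_numeric(tidy[col], errors="coerce")
    tidy["school_year"] = school_year
    return tidy

# Small stand-in for the loaded spreadsheet: 8 leading rows, then data rows.
raw = pd.DataFrame(
    [["School Year", "2020-21", None, None]]
    + [[None, None, None, None]] * 7
    + [["Alabama", 143, "+", "+"], ["Alaska", 54, 248, 543]]
)
tidy = tidy_table(raw)
print(tidy)
```

Unlike the cells above, this version renumbers the rows from zero, which avoids carrying the original offsets 8, 9, ... into a combined dataframe.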
