# Broadband Access

**Notebook: 2-api-data**

## Abstract

**Purpose:** 
This notebook, intended for a technical audience, acquires custom data through calls to the Census Bureau’s API, with the goal of comparing broadband Internet access across lower-level geographies and demographics.

**Acknowledgments:** 
Data source: U.S. Census Bureau, 2019 American Community Survey 1-Year Estimates (https://data.census.gov/cedsci/table?q=broadband&g=0400000US51.050000&y=2019&d=ACS%201-Year%20Estimates%20Data%20Profiles&tid=ACSDP1Y2019.DP02&hidePreview=false)


## Scope (notebook)

 - obtain API key
 - acquire custom API datasets
 - inspect and subset data
 - visualize statistics and data relationships

## Setup

In [1]:
# importing packages
import os
import numpy as np
import pandas as pd

In [2]:
# verifying the current working directory
os.getcwd()

'C:\\Users\\jamel\\myprojects\\acs-api\\notebooks'

In [3]:
# moving to the project's main directory
os.chdir('..')

# verifying the current working directory
os.getcwd()

'C:\\Users\\jamel\\myprojects\\acs-api'

In [4]:
# importing package to manage private key
from dotenv import load_dotenv

Within the "helpers" subdirectory is the `helpers_func` package, which includes the custom `save_pickle`/`read_pickle` serialization/de-serialization modules.
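The helper source is not shown in this notebook, so the following is only a minimal sketch of what such a pair of functions might look like. The names `save_pickle_sketch` and `read_pickle_sketch`, their signatures, and the "saved" message are assumptions made for illustration; the real `save_pickle` appears to take a bare object name rather than a full path.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle_sketch(obj, path):
    """Serialize `obj` to `path`, creating parent folders if needed (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    print(f"Object saved to {path.as_posix()}")


def read_pickle_sketch(path):
    """Restore and return the object stored at `path` (hypothetical helper)."""
    path = Path(path)
    with path.open("rb") as f:
        obj = pickle.load(f)
    print(f"Object restored from {path.as_posix()}")
    return obj


# round trip inside a throwaway directory, so no project files are touched
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pickles" / "broadband-vars.pkl"
    save_pickle_sketch(["DP02_0153E", "DP02_0153M"], target)
    restored = read_pickle_sketch(target)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.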

In [5]:
# importing the "helpers" folder and its contained modules as a package
from helpers import *

In [6]:
# loading IPython's `autoreload` extension, to pick up changes to external modules
%load_ext autoreload

# turning on `autoreload`
%autoreload 2

### (de)Serialization

We will use the `pickle` serialization format to restore previously saved objects.

In [7]:
# printing object-restoration message
print("Result of restoration attempt:\n")

# restoring the data dictionary and broadband data list from the serialized file
metadata_dict = read_pickle("pickles/metadata-dict.pkl")
broadband_vars = read_pickle("pickles/broadband-vars.pkl")

Result of restoration attempt:

Object restored from pickles/metadata-dict.pkl
Object restored from pickles/broadband-vars.pkl


In [8]:
# verifying list restoration
broadband_vars

['DP02_0153E', 'DP02_0153M', 'DP02_0153PE', 'DP02_0153PM']

In [9]:
# verifying data dictionary restoration
metadata_dict

{'NAME': ['Geographic Area Name'],
 'DP02_0001E': ['Estimate!!HOUSEHOLDS BY TYPE!!Total households'],
 'DP02_0001M': ['Margin of Error!!HOUSEHOLDS BY TYPE!!Total households'],
 'DP02_0001PE': ['Percent!!HOUSEHOLDS BY TYPE!!Total households'],
 'DP02_0001PM': ['Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households'],
 'DP02_0002E': ['Estimate!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple family'],
 'DP02_0002M': ['Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple family'],
 'DP02_0002PE': ['Percent!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple family'],
 'DP02_0002PM': ['Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple family'],
 'DP02_0003E': ['Estimate!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple family!!With own children of the householder under 18 years'],
 'DP02_0003M': ['Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple family!!With own children of the householder under 18 years']

In [10]:
# # -----REFERENCE-----
# # serializing the data dictionary
# save_pickle(metadata_dict, "metadata-dict")

# # serializing the list of broadband variables
# save_pickle(broadband_vars, "broadband-vars")

## API

Here we test our API call without an authentication key and preview the first five rows of the response. Note that the first row returned holds the column headers.

In [11]:
# previewing data returned from an api call in pandas
pd.read_json("https://api.census.gov/data/2019/acs/acs1/profile?get=GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM&for=county:*&in=state:51").head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM,state,county
1,0500000US51095,"James City County, Virginia",27223,1677,90.0,3.5,51,095
2,0500000US51087,"Henrico County, Virginia",112672,2952,86.3,1.8,51,087
3,0500000US51177,"Spotsylvania County, Virginia",42465,1419,91.5,2.1,51,177
4,0500000US51121,"Montgomery County, Virginia",33056,1661,91.2,2.6,51,121


#### Environment Variable

Load environment variables from a local file, with the `dotenv` library.

In [12]:
# load saved key `CensusDataKey` from local file ".env"
print("Environment variables loaded.")
load_dotenv()

Environment variables loaded.


True

In [13]:
# retrieving the stored API key from the environment
key = os.environ.get("CensusDataKey")
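`os.environ.get` returns `None` when a variable is missing, and an f-string would then silently produce `&key=None` in the query. A small guard can surface that problem early. This is a sketch only: `get_required_env` and the `DEMO_CENSUS_KEY` variable are hypothetical names used for illustration, not part of the notebook's helpers.

```python
import os


def get_required_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the .env file."
        )
    return value


# demonstration with a placeholder variable (not a real Census key)
os.environ["DEMO_CENSUS_KEY"] = "not-a-real-key"
demo_key = get_required_env("DEMO_CENSUS_KEY")
```

In the notebook itself, the equivalent call would be `get_required_env("CensusDataKey")`.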

This time, we combine the constructed API query with our authentication key to build an authorized query URL.

In [14]:
# instantiating the constructed api call
api_url = "https://api.census.gov/data/2019/acs/acs1/profile?get=GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM&for=county:*&in=state:51"

# appending the private key to the api URL
keyed_query = api_url + f"&key={key}"
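As an alternative to hand-editing one long string, the query can be assembled from its parts. The sketch below uses only the standard library; `build_query` and its parameter names are hypothetical, and the `safe` argument simply keeps the commas, colon, and asterisk unescaped so the result matches the URL used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data/2019/acs/acs1/profile"


def build_query(variables, for_clause, in_clause, key=None):
    """Assemble a Census API query URL from a variable list and geography clauses."""
    params = {"get": ",".join(variables), "for": for_clause, "in": in_clause}
    if key:
        # the key is appended last, mirroring the manual `&key=` step
        params["key"] = key
    return f"{BASE_URL}?{urlencode(params, safe=',:*')}"


demo_url = build_query(
    ["GEO_ID", "NAME", "DP02_0153E", "DP02_0153M", "DP02_0153PE", "DP02_0153PM"],
    for_clause="county:*",
    in_clause="state:51",
)
```

Adding variables later then becomes a matter of extending the list rather than editing the URL by hand.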

In [15]:
# previewing the queried data
pd.read_json(keyed_query)

Unnamed: 0,0,1,2,3,4,5,6,7
0,GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM,state,county
1,0500000US51095,"James City County, Virginia",27223,1677,90.0,3.5,51,095
2,0500000US51087,"Henrico County, Virginia",112672,2952,86.3,1.8,51,087
3,0500000US51177,"Spotsylvania County, Virginia",42465,1419,91.5,2.1,51,177
4,0500000US51121,"Montgomery County, Virginia",33056,1661,91.2,2.6,51,121
5,0500000US51107,"Loudoun County, Virginia",127625,2436,95.4,1.2,51,107
6,0500000US51153,"Prince William County, Virginia",137524,2159,95.8,0.8,51,153
7,0500000US51061,"Fauquier County, Virginia",21241,1293,87.5,3.0,51,061
8,0500000US51510,"Alexandria city, Virginia",65428,1931,92.1,1.6,51,510
9,0500000US51650,"Hampton city, Virginia",47767,1933,85.9,2.5,51,650


**Create a Function**

We want to return the data as a dataframe with ACS variables as column headers.

In [16]:
def api_json_to_df(keyed_query):
    ''' Return the JSON result of an authenticated API query
        as a dataframe, with the first row promoted to column headers.
    
    - INPUT:    keyed_query = formatted query string with key appended 
                <keyed_query = api_url + f"&key={key}">
    - OUTPUT:   a dataframe with column headers
                called with <df = api_json_to_df(keyed_query)>
    '''
    
    # instantiating the dataframe
    df = pd.read_json(keyed_query)

    # extracting column headers from first row values
    headers = df.iloc[0]

    # creating a new dataframe with extracted column headers
    ex_df  = pd.DataFrame(df.values[1:], columns=headers)

    # return the new dataframe
    return ex_df
 

In [17]:
# executing the function to acquire and instantiate data as a dataframe
api_df = api_json_to_df(keyed_query)

In [18]:
# verifying column headers
api_df.columns

Index(['GEO_ID', 'NAME', 'DP02_0153E', 'DP02_0153M', 'DP02_0153PE',
       'DP02_0153PM', 'state', 'county'],
      dtype='object', name=0)

In [19]:
# viewing the dataframe
api_df

Unnamed: 0,GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM,state,county
0,0500000US51095,"James City County, Virginia",27223,1677,90.0,3.5,51,095
1,0500000US51087,"Henrico County, Virginia",112672,2952,86.3,1.8,51,087
2,0500000US51177,"Spotsylvania County, Virginia",42465,1419,91.5,2.1,51,177
3,0500000US51121,"Montgomery County, Virginia",33056,1661,91.2,2.6,51,121
4,0500000US51107,"Loudoun County, Virginia",127625,2436,95.4,1.2,51,107
5,0500000US51153,"Prince William County, Virginia",137524,2159,95.8,0.8,51,153
6,0500000US51061,"Fauquier County, Virginia",21241,1293,87.5,3.0,51,061
7,0500000US51510,"Alexandria city, Virginia",65428,1931,92.1,1.6,51,510
8,0500000US51650,"Hampton city, Virginia",47767,1933,85.9,2.5,51,650
9,0500000US51013,"Arlington County, Virginia",103460,2365,92.5,1.4,51,013


Our next step is to plan the data we want to analyze, identify the relevant variables, and request the data from the Census Data API.

# Obtain

Let's include estimates and margin-of-error data for an additional subcategory: households of single mothers with children under the age of 18.

We will update our URL with the relevant variables found in our data dictionary: DP02_0011E, DP02_0011M, DP02_0011PE, and DP02_0011PM.
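Variable codes like these can be located by searching the labels in the data dictionary. The sketch below assumes the dictionary keeps the shape shown earlier (code mapped to a list of label strings); `find_variables` is a hypothetical helper, and the sample entries are copied from the `metadata_dict` output above rather than from the full dictionary.

```python
def find_variables(metadata, keyword):
    """Return the variable codes whose label contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [
        code
        for code, labels in metadata.items()
        if any(keyword in label.lower() for label in labels)
    ]


# a few entries copied from the data dictionary preview
sample_metadata = {
    "NAME": ["Geographic Area Name"],
    "DP02_0001E": ["Estimate!!HOUSEHOLDS BY TYPE!!Total households"],
    "DP02_0002E": ["Estimate!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple family"],
    "DP02_0002M": ["Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple family"],
}

matches = find_variables(sample_metadata, "married-couple")
```

Run against the full `metadata_dict`, the same call with a different keyword would list the candidate codes for any subcategory of interest.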

In [None]:
# updating the api url
api_url = "https://api.census.gov/data/2019/acs/acs1/profile?get=GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM,DP02_0011E,DP02_0011M,DP02_0011PE,DP02_0011PM&for=county:*&in=state:51"

We also need to 'recalculate' the value of `keyed_query` given the changed URL.

*Refactor note: We can avoid repeating this step by adding it to our function.*
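One possible shape for that refactor is sketched below. The function name `api_to_df` and the `reader` parameter are assumptions introduced here; `reader` defaults to `pd.read_json` and exists only so the function can be exercised offline with a stand-in, as the demonstration does with a placeholder URL.

```python
import pandas as pd


def api_to_df(api_url, key, reader=pd.read_json):
    """Append the key to `api_url`, fetch the result, and promote the first row to headers."""
    keyed_query = f"{api_url}&key={key}"
    raw = reader(keyed_query)
    # converting the header row to a plain list avoids a leftover name on the column index
    return pd.DataFrame(raw.values[1:], columns=raw.iloc[0].tolist())


# offline demonstration: a stand-in reader records the URL instead of calling the API
seen_urls = []


def fake_reader(url):
    seen_urls.append(url)
    return pd.DataFrame(
        [["NAME", "DP02_0153E"], ["Henrico County, Virginia", "112672"]]
    )


demo_df = api_to_df(
    "https://example.invalid/query?get=NAME,DP02_0153E",
    "not-a-real-key",
    reader=fake_reader,
)
```

With this version, changing the variable list only requires updating `api_url`; the key is appended inside the function on every call.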

In [33]:
# updating the keyed_query
keyed_query = api_url + f"&key={key}"

Now we run our function as we did previously and view the returned dataframe.

In [34]:
# executing the query
api_df = api_json_to_df(keyed_query)

# viewing the dataframe
api_df

Unnamed: 0,GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM,DP02_0011E,DP02_0011M,DP02_0011PE,DP02_0011PM,state,county
0,0500000US51095,"James City County, Virginia",27223,1677,90.0,3.5,1413,609,4.7,2.0,51,095
1,0500000US51087,"Henrico County, Virginia",112672,2952,86.3,1.8,7829,1331,6.0,1.0,51,087
2,0500000US51177,"Spotsylvania County, Virginia",42465,1419,91.5,2.1,2339,907,5.0,2.0,51,177
3,0500000US51121,"Montgomery County, Virginia",33056,1661,91.2,2.6,1052,482,2.9,1.4,51,121
4,0500000US51107,"Loudoun County, Virginia",127625,2436,95.4,1.2,5180,1406,3.9,1.1,51,107
5,0500000US51153,"Prince William County, Virginia",137524,2159,95.8,0.8,7011,1707,4.9,1.2,51,153
6,0500000US51061,"Fauquier County, Virginia",21241,1293,87.5,3.0,649,350,2.7,1.4,51,061
7,0500000US51510,"Alexandria city, Virginia",65428,1931,92.1,1.6,2882,1022,4.1,1.4,51,510
8,0500000US51650,"Hampton city, Virginia",47767,1933,85.9,2.5,4039,1124,7.3,2.0,51,650
9,0500000US51013,"Arlington County, Virginia",103460,2365,92.5,1.4,2913,831,2.6,0.7,51,013


# Scrub 

Time for some data cleansing.

In [35]:
api_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 12 columns):
GEO_ID         30 non-null object
NAME           30 non-null object
DP02_0153E     30 non-null object
DP02_0153M     30 non-null object
DP02_0153PE    30 non-null object
DP02_0153PM    30 non-null object
DP02_0011E     30 non-null object
DP02_0011M     30 non-null object
DP02_0011PE    30 non-null object
DP02_0011PM    30 non-null object
state          30 non-null object
county         30 non-null object
dtypes: object(12)
memory usage: 2.9+ KB


All data columns are string objects. We should convert the "E" and "M" columns (estimates and margins of error) to integers. The "PE" and "PM" columns represent percentages, so we can convert them to floats.

This particular dataset is small enough to see by inspection that none of the columns we need to convert contain empty values or values that cannot be converted to numeric data.
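The conversion described above could be sketched as follows. `convert_acs_types` is a hypothetical helper, and selecting columns by the `DP` prefix plus suffix is an assumption about the naming pattern seen in this dataset; the two sample rows are copied from the API output shown earlier.

```python
import pandas as pd

# two rows copied from the API output; every value arrives as a string
sample = pd.DataFrame(
    {
        "NAME": ["James City County, Virginia", "Henrico County, Virginia"],
        "DP02_0153E": ["27223", "112672"],
        "DP02_0153M": ["1677", "2952"],
        "DP02_0153PE": ["90.0", "86.3"],
        "DP02_0153PM": ["3.5", "1.8"],
        "state": ["51", "51"],
        "county": ["095", "087"],
    }
)


def convert_acs_types(df):
    """Return a copy with E/M columns as integers and PE/PM columns as floats."""
    out = df.copy()
    for col in out.columns:
        if not col.startswith("DP"):
            continue  # leave NAME, state, county, etc. as strings
        if col.endswith(("PE", "PM")):
            out[col] = pd.to_numeric(out[col]).astype("float64")
        elif col.endswith(("E", "M")):
            out[col] = pd.to_numeric(out[col]).astype("int64")
    return out


converted = convert_acs_types(sample)
```

The identifier columns (`state`, `county`) are deliberately left as strings so that leading zeros in the county codes are preserved. `pd.to_numeric` raises on unparseable values by default, which would flag any entry the visual check missed.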

# Explore

## Notebook Summary