<h1> Attempting to Use the US Census Bureau API to Streamline the CSV Collection Process</h1>

<h2><b>IMPORTANT:</b><br><span style="color:red;">This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.</span></h2>

In [141]:
import pandas as pd
import numpy as np
import requests
import csv
import warnings
warnings.simplefilter('ignore') #Turn off warnings

<h2>Testing the API With a Set URL</h2>

In [518]:
base_url = "https://api.census.gov/data/2021/acs/acs5/profile?get="
api_key = "672276f2a0ad053d60f8bb0848cad8a290a29427"
variables = "DP02_0001E,DP02_0001M,DP02_0001PE,DP02_0001PM" # List of Variables we want to collect
# Can't request every variable at once for some reason.
zcta_url = "&for=zip%20code%20tabulation%20area:" # need to include a 0 at the end when using the ZCTA range
MA_filter = "&in=state:25*" # filtering for ZCTAs in MA

In [519]:
test_url = f"{base_url}{variables}{zcta_url}01001&key={api_key}"

In [520]:
response = requests.get(test_url)
data = response.json()
header = data[0]
rows = data[1:]
header, rows
# Getting correct values from MA ZCTA501001 Total Households Estimate and Margin of Error.

(['DP02_0001E',
  'DP02_0001M',
  'DP02_0001PE',
  'DP02_0001PM',
  'zip code tabulation area'],
 [['6791', '345', '6791', '-888888888', '01001']])

I was able to get the correct values for the first row, but am struggling to find a way to get the values for all of the variables in the table.
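The response shown above is a list of lists: the header comes first, then one list per geography. A minimal sketch of pairing each row with the header so values can be looked up by variable name; the helper name `rows_to_records` is hypothetical and the sample payload is copied from the output above.

```python
# Sample payload copied from the response above (header row first, then data rows).
data = [
    ["DP02_0001E", "DP02_0001M", "DP02_0001PE", "DP02_0001PM", "zip code tabulation area"],
    ["6791", "345", "6791", "-888888888", "01001"],
]

def rows_to_records(payload):
    """Pair every data row with the header row (hypothetical helper)."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]

records = rows_to_records(data)
print(records[0]["DP02_0001E"])  # value for Total Households Estimate
```

With pandas, `pd.DataFrame(rows, columns=header)` gives the same pairing as a table.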

<h2> Trying to use All Available Variables Instead of Just DP02_0001E and DP02_0001M </h2>

In [411]:
variable_table_url = 'https://api.census.gov/data/2021/acs/acs5/profile/groups/DP02.html'
v_table = pd.read_html(variable_table_url) # reading all available variables in the DP02 group from the ACS5 API
variable_df = pd.DataFrame(v_table[0])
# Assign the result back instead of using inplace=True on a column selection.
variable_df['Label'] = variable_df['Label'].replace({"!!": " ", ":": ""}, regex=True)
variable_df.head()

Unnamed: 0,Name,Label,Concept,Required,Attributes,Limit,Predicate Type,Group,Unnamed: 8
0,DP02_0001E,Estimate HOUSEHOLDS BY TYPE Total households,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,int,DP02,
1,DP02_0001EA,Annotation of Estimate HOUSEHOLDS BY TYPE Tota...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,string,DP02,
2,DP02_0001M,Margin of Error HOUSEHOLDS BY TYPE Total house...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,int,DP02,
3,DP02_0001MA,Annotation of Margin of Error HOUSEHOLDS BY TY...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,string,DP02,
4,DP02_0001PE,Percent HOUSEHOLDS BY TYPE Total households,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,int,DP02,


Created a DF from the ACS5 DP02 variable table served by the API. It contains all the variable names, their corresponding labels, and other information.

In [307]:
variable_table = variable_df['Name'][0:1232]
variable_table = variable_table[~variable_table.str.endswith('A')] # want to remove values ending in A
variable_table.reset_index(drop = True, inplace = True)
variable_table

0       DP02_0001E
1       DP02_0001M
2      DP02_0001PE
3      DP02_0001PM
4       DP02_0002E
          ...     
611    DP02_0153PM
612     DP02_0154E
613     DP02_0154M
614    DP02_0154PE
615    DP02_0154PM
Name: Name, Length: 616, dtype: object

Used the DF to then create a Series of just the variable names, and filtered out all Annotation variables.
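The `endswith('A')` filter works because annotation variables repeat a data variable's name with a trailing `A`. A slightly more explicit sketch that parses the suffix with a regular expression follows. The pattern and the helper `is_annotation` are assumptions for illustration, and the `PEA` name is inferred from the naming pattern rather than shown in the output above.

```python
import re

# Assumed naming scheme: DP02_NNNN followed by E, M, PE or PM,
# with an optional trailing "A" marking the annotation variant.
SUFFIX = re.compile(r"^(DP02_\d{4})(PE|PM|E|M)(A?)$")

def is_annotation(name):
    """True when the name parses as an annotation variable (hypothetical helper)."""
    match = SUFFIX.match(name)
    return bool(match and match.group(3))

names = ["DP02_0001E", "DP02_0001EA", "DP02_0001M",
         "DP02_0001MA", "DP02_0001PE", "DP02_0001PEA"]
kept = [name for name in names if not is_annotation(name)]
print(kept)
```

Parsing the suffix also leaves anything that does not match the pattern untouched, instead of dropping every name that happens to end in `A`.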

In [313]:
all_variables = ','.join(variable_table.values)

Created a comma-separated string of all variables in order to pass it into the query.

In [354]:
test_all_vars_url = f"{base_url}{all_variables}{zcta_url}01001&key={api_key}"
response = requests.get(test_all_vars_url)

In [355]:
response

<Response [400]>

Getting <Response [400]>, meaning there is an error (we want <Response [200]>). I believe it's a problem with how many variables I am passing in, because when I use a subset of the variables I don't run into the same error.

Going to try and split up the sections of the table to create smaller tables which I will then merge into the complete table and get the CSV we want.
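A rough sketch of the split-and-merge idea, assuming the API caps a request at about 50 variables, so batches of 48 keep whole rows of four. The helpers `chunk` and `merge_single_rows` are hypothetical, the variable names are generated to mimic the 616 names in the Series, and the sample payloads reuse values from this notebook's outputs.

```python
def chunk(names, size=48):
    """Split names into consecutive batches of at most `size` items."""
    return [names[i:i + size] for i in range(0, len(names), size)]

def merge_single_rows(payloads):
    """Join [header, row] payloads for one ZCTA, keeping the geography column once."""
    header, row = [], []
    for head, values in payloads:
        header.extend(head[:-1])   # drop the trailing geography column
        row.extend(values[:-1])
    header.append(payloads[0][0][-1])
    row.append(payloads[0][1][-1])
    return header, row

# Generated stand-in for the 616 non-annotation variable names.
names = [f"DP02_{n:04d}{s}" for n in range(1, 155) for s in ("E", "M", "PE", "PM")]
batches = chunk(names)
print(len(batches), len(batches[-1]))

# Hypothetical two-batch result for ZCTA 01001.
payloads = [
    [["DP02_0001E", "DP02_0001M", "zip code tabulation area"], ["6791", "345", "01001"]],
    [["DP02_0002E", "DP02_0002M", "zip code tabulation area"], ["2959", "265", "01001"]],
]
header, row = merge_single_rows(payloads)
```

Each batch would be joined with `','.join(batch)` and requested separately before merging.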

In [392]:
# VARIABLE NAME - END INDEX (slices are end-exclusive, so each section starts where the previous one ends)
test = ','.join(variable_table[0:48]) # need to get 48 so we can reshape.
# HOUSEHOLDS - 68
households = ','.join(variable_table[0:68])
# RELATIONSHIP - 96
relationship = ','.join(variable_table[68:96])
# MARITAL STATUS - 144
marital_status = ','.join(variable_table[96:144])
# FERTILITY - 172
fertility = ','.join(variable_table[144:172])
# GRANDPARENTS - 208
grandparents = ','.join(variable_table[172:208])
# SCHOOL ENROLLMENT - 232
school_enroll = ','.join(variable_table[208:232])
# EDUCATIONAL ATTAINMENT - 272
# VETERAN STATUS - 280
# DISABILITY - 312
# RESIDENCE 1Y AGO - 348
# POB - 376
# US CITIZEN - 388
# YEAR OF ENTRY - 416
# WORLD REGION OF BIRTH - 444
# LANGUAGE - 492
# ANCESTRY - 604
# COMPUTERS - 616
test

'DP02_0001E,DP02_0001M,DP02_0001PE,DP02_0001PM,DP02_0002E,DP02_0002M,DP02_0002PE,DP02_0002PM,DP02_0003E,DP02_0003M,DP02_0003PE,DP02_0003PM,DP02_0004E,DP02_0004M,DP02_0004PE,DP02_0004PM,DP02_0005E,DP02_0005M,DP02_0005PE,DP02_0005PM,DP02_0006E,DP02_0006M,DP02_0006PE,DP02_0006PM,DP02_0007E,DP02_0007M,DP02_0007PE,DP02_0007PM,DP02_0008E,DP02_0008M,DP02_0008PE,DP02_0008PM,DP02_0009E,DP02_0009M,DP02_0009PE,DP02_0009PM,DP02_0010E,DP02_0010M,DP02_0010PE,DP02_0010PM,DP02_0011E,DP02_0011M,DP02_0011PE,DP02_0011PM,DP02_0012E,DP02_0012M,DP02_0012PE,DP02_0012PM'

In [393]:
test_all_vars_url = f"{base_url}{test}{zcta_url}01001&key={api_key}"
response = requests.get(test_all_vars_url)
# Limited to 50 Variables...

In [394]:
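payload = response.json() # parse once instead of calling .json() twice
labels = payload[0][:-1] # variable names, without the trailing geography column
values = payload[1][:-1] # the single data row, without the ZCTA code
test_dict = dict(zip(labels, values))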
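A name-keyed mapping like `test_dict` can also be grouped by variable suffix, which avoids depending on the values arriving in blocks of four. A minimal sketch, assuming every name is a 9-character row id (such as `DP02_0001`) followed by `E`, `M`, `PE` or `PM`; the helper `pivot` is hypothetical and the sample values come from this notebook's outputs.

```python
from collections import defaultdict

COLUMNS = {
    "E": "Estimate",
    "M": "Margin of Error",
    "PE": "Percent",
    "PM": "Percent Margin of Error",
}

def pivot(values_by_name):
    """Group values into one dict per table row, keyed by the row id (hypothetical helper)."""
    table = defaultdict(dict)
    for name, value in values_by_name.items():
        row_id, suffix = name[:9], name[9:]   # e.g. "DP02_0001", "PE"
        table[row_id][COLUMNS[suffix]] = value
    return dict(table)

sample = {
    "DP02_0001E": "6791", "DP02_0001M": "345",
    "DP02_0001PE": "6791", "DP02_0001PM": "-888888888",
    "DP02_0002E": "2959", "DP02_0002M": "265",
    "DP02_0002PE": "43.6", "DP02_0002PM": "4.1",
}
table = pivot(sample)
print(table["DP02_0002"])
```

`pd.DataFrame.from_dict(table, orient="index")` would turn the result into a table with one row per row id.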

In [459]:
test_df = variable_df[~variable_df["Name"].str.endswith("A")]
test_df.reset_index(drop = True, inplace = True)
test_labels = test_df["Label"][0:192]
test_dict2 = {test_labels[i]: values[i] for i in range(len(values))}

test_df = pd.DataFrame(np.array(values).reshape(-1,4)).rename(columns = {0: "Estimate", 
                                                               1: "Margin of Error", 
                                                               2: "Percent", 
                                                               3: "Percent Margin of Error"})

In [460]:
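test_labels = test_labels[0::4] # keep every 4th label (the Estimate variant of each table row)
# test_labels keeps its original integer index, so test_labels[i] is a label-based lookup;
# stepping by 4 up to len(test_labels) picks the 12 rows that match the 48 requested values.
new_index = [test_labels[i][test_labels[i].find("BY TYPE")+8:] for i in range(0, len(test_labels), 4)]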

In [464]:
test_df.index = new_index
test_df

Unnamed: 0,Estimate,Margin of Error,Percent,Percent Margin of Error
Total households,6791,345,6791.0,-888888888.0
Total households Married-couple household,2959,265,43.6,4.1
Total households Married-couple household With children of the householder under 18 years,874,173,12.9,2.6
Total households Cohabiting couple household,726,232,10.7,3.3
Total households Cohabiting couple household With children of the householder under 18 years,196,129,2.9,1.9
"Total households Male householder, no spouse/partner present",1156,222,17.0,3.2
"Total households Male householder, no spouse/partner present With children of the householder under 18 years",48,55,0.7,0.8
"Total households Male householder, no spouse/partner present Householder living alone",843,207,12.4,3.0
"Total households Male householder, no spouse/partner present Householder living alone 65 years and over",367,146,5.4,2.1
"Total households Female householder, no spouse/partner present",1950,314,28.7,4.0


Realized I was overcomplicating this: group(DP02) requests every variable in the table in a single query.
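A small sketch of building the `group(DP02)` query from named parameters rather than concatenated string pieces. It uses only the standard library; `group_url` is a hypothetical helper and `YOUR_KEY` is a placeholder, not a real key.

```python
from urllib.parse import quote, urlencode

BASE = "https://api.census.gov/data/2021/acs/acs5/profile"

def group_url(group, zcta, key):
    """Build a group query for one ZCTA (hypothetical helper)."""
    params = {
        "get": f"group({group})",
        "for": f"zip code tabulation area:{zcta}",
        "key": key,
    }
    # quote_via=quote encodes spaces as %20; parentheses, commas and colons stay readable.
    return f"{BASE}?{urlencode(params, quote_via=quote, safe='(),:')}"

url = group_url("DP02", "01001", "YOUR_KEY")
print(url)
```

Passing the ZCTA as a zero-padded string keeps the leading `0` out of the URL template.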

<h1> Properly Collecting Entire CSVs Using Census Bureau API Grouping </h1>

In [579]:
labels = variable_df["Label"][:1232:8].reset_index(drop=True).str.title()
test_df.index = labels
test_df

Label,Estimate,Margin of Error,Percent,Percent Margin of Error
Estimate Households By Type Total Households,6791,345,6791,(X)
Estimate Households By Type Total Households Married-Couple Household,2959,265,43.6,4.1
Estimate Households By Type Total Households Married-Couple Household With Children Of The Householder Under 18 Years,874,173,12.9,2.6
Estimate Households By Type Total Households Cohabiting Couple Household,726,232,10.7,3.3
Estimate Households By Type Total Households Cohabiting Couple Household With Children Of The Householder Under 18 Years,196,129,2.9,1.9
...,...,...,...,...
Estimate Ancestry Total Population Welsh,52,74,0.3,0.5
Estimate Ancestry Total Population West Indian (Excluding Hispanic Origin Groups),162,135,1.0,0.8
Estimate Computers And Internet Use Total Households,6791,345,6791,(X)
Estimate Computers And Internet Use Total Households With A Computer,6102,316,89.9,3.6


In [594]:
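def get_df(url):
    # Fetch one ZCTA's DP02 table; returns a DataFrame, or 0 if the request fails.
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        # data[1] is the single data row. [:-4:2] stops before the trailing non-table fields
        # and keeps every other value, which skips the annotation columns given the
        # E, EA, M, MA, ... ordering seen in variable_df.
        valid_data = np.array(data[1][:-4:2]).reshape(-1,4)
        test_df = pd.DataFrame(valid_data).rename(columns = {0: "Estimate", 
                                                             1: "Margin of Error", 
                                                             2: "Percent", 
                                                             3: "Percent Margin of Error"})

        # Index rows with the global `labels` (one label per table row).
        test_df.index = labels
        # Replace the "-888888888" sentinel; the remaining annotation values still need handling.
        test_df = test_df.replace("-888888888", "(X)")
    
        return test_df
    else:
        return 0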
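The comment in `get_df` notes that the other annotation sentinels still need replacing. A hedged sketch of a lookup table for that follows. Only the `-888888888` to `(X)` pair is confirmed by this notebook; the remaining pairs are recalled from the Census Bureau's notes on ACS estimate and annotation values, so treat them as assumptions and verify them before relying on them. `annotate` is a hypothetical helper.

```python
# Sentinel -> annotation symbol. Only the first pair is confirmed in this notebook;
# the rest are assumptions to check against the Census Bureau's documentation.
ANNOTATIONS = {
    "-888888888": "(X)",
    "-666666666": "-",
    "-999999999": "N",
    "-222222222": "**",
    "-333333333": "***",
    "-555555555": "*****",
}

def annotate(value):
    """Return the annotation symbol for a sentinel, or the value unchanged."""
    return ANNOTATIONS.get(value, value)

row = ["6791", "345", "6791", "-888888888"]
print([annotate(v) for v in row])
```

Inside `get_df`, passing the same dict to `test_df.replace(ANNOTATIONS)` would apply every pair in one call.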

In [None]:
base_url = "https://api.census.gov/data/2021/acs/acs5/profile?get="
api_key = "&key=672276f2a0ad053d60f8bb0848cad8a290a29427"
group = "group(DP02)"
zcta_url = "&for=zip%20code%20tabulation%20area:0"

ZCTA_Range = range(1001,2792) # all ZCTA values for MA.
labels = list(variable_df["Label"][:1232:8].reset_index(drop=True).str.title())

for zcta in ZCTA_Range:
    url = f"{base_url}{group}{zcta_url}{zcta}{api_key}"
    new_df = get_df(url)
    # get_df returns 0 when the request fails (for example, 1321 below), so only save real DataFrames.
    if isinstance(new_df, pd.DataFrame):
        new_df.to_csv(f"felipe_csv/acs5_ZCTA50{zcta}.csv")
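The loop relies on a literal `0` baked into both the URL and the file name, which only works while every code has four digits. A minimal sketch of zero-padding the number instead; `zcta_code` and `csv_name` are hypothetical helpers that reproduce the file names used above.

```python
def zcta_code(number):
    """Zero-pad a ZCTA number to five digits, e.g. 1001 -> '01001'."""
    return f"{number:05d}"

def csv_name(number):
    """File name in the same pattern as the loop above."""
    return f"acs5_ZCTA5{zcta_code(number)}.csv"

print(zcta_code(1001), csv_name(1001))
```

Note that `range(1001, 2792)` stops at 2791, since the end of a range is exclusive.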

In [599]:
new_df = get_df(f"{base_url}{group}{zcta_url}1321{api_key}")
new_df

0

In [604]:
not isinstance(new_df, pd.DataFrame)

True