<h1>Creating an Automatic Census Data Scraper by State</h1>
<h2><b>IMPORTANT:</b><br><span style="color:red;">This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.</span></h2>

<h2>Setting up the Census Bureau API</h2>

In [16]:
import pandas as pd
import numpy as np
import requests
import csv
import warnings
warnings.simplefilter('ignore') #Turn off warnings

In [17]:
base_url = "https://api.census.gov/data/2021/acs/acs5/profile?get="
api_key = "672276f2a0ad053d60f8bb0848cad8a290a29427"
zcta_url = "&for=zip%20code%20tabulation%20area:" # need to include a 0 at the end when using the ZCTA range
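The request URLs in this notebook are assembled by concatenating `base_url`, a variable list, `zcta_url`, a ZCTA and the key. A minimal sketch of the same assembly wrapped in a helper is shown here. `build_zcta_url` and the environment variable name `CENSUS_API_KEY` are hypothetical, not part of the notebook or of the Census API. Reading the key from the environment is one way to keep it out of a shared notebook.

```python
import os
from urllib.parse import quote

BASE_URL = "https://api.census.gov/data/2021/acs/acs5/profile"

def build_zcta_url(variables, zcta, api_key):
    """Return the request URL for a list of variables and a single ZCTA."""
    get_part = ",".join(variables)
    # Encode the spaces as %20 but keep the colon that separates name and value
    geo_part = quote(f"zip code tabulation area:{zcta}", safe=":")
    return f"{BASE_URL}?get={get_part}&for={geo_part}&key={api_key}"

# CENSUS_API_KEY is a hypothetical variable name; the fallback is a placeholder
api_key = os.environ.get("CENSUS_API_KEY", "YOUR_KEY_HERE")
example_url = build_zcta_url(["DP02_0001E", "DP02_0001M"], "01001", api_key)
print(example_url)
```

Passing the ZCTA as a string preserves its leading zero, which matters for states such as Massachusetts.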

In [18]:
variable_table_url = "https://api.census.gov/data/2021/acs/acs5/profile/groups/DP02.html"
v_table = pd.read_html(variable_table_url) # read every variable listed for the DP02 group of the ACS5 profile
variable_df = pd.DataFrame(v_table[0])
# Assign the cleaned labels back; replace(inplace=True) on a column selection may not update variable_df in newer pandas versions
variable_df['Label'] = variable_df['Label'].replace({"!!": " ", ":": ""}, regex=True)
variable_df

Unnamed: 0,Name,Label,Concept,Required,Attributes,Limit,Predicate Type,Group,Unnamed: 8
0,DP02_0001E,Estimate HOUSEHOLDS BY TYPE Total households,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,int,DP02,
1,DP02_0001EA,Annotation of Estimate HOUSEHOLDS BY TYPE Tota...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,string,DP02,
2,DP02_0001M,Margin of Error HOUSEHOLDS BY TYPE Total house...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,int,DP02,
3,DP02_0001MA,Annotation of Margin of Error HOUSEHOLDS BY TY...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,string,DP02,
4,DP02_0001PE,Percent HOUSEHOLDS BY TYPE Total households,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,int,DP02,
...,...,...,...,...,...,...,...,...,...
1228,DP02_0154PE,Percent COMPUTERS AND INTERNET USE Total house...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,float,DP02,
1229,DP02_0154PEA,Annotation of Percent COMPUTERS AND INTERNET U...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,string,DP02,
1230,DP02_0154PM,Percent Margin of Error COMPUTERS AND INTERNET...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,float,DP02,
1231,DP02_0154PMA,Annotation of Percent Margin of Error COMPUTER...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,predicate-only,,0,string,DP02,


In [19]:
variable_table = variable_df['Name'][0:1232]
variable_table = variable_table[~variable_table.str.endswith('A')] # want to remove values ending in A
variable_table.reset_index(drop = True, inplace = True)
variable_table

0       DP02_0001E
1       DP02_0001M
2      DP02_0001PE
3      DP02_0001PM
4       DP02_0002E
          ...     
611    DP02_0153PM
612     DP02_0154E
613     DP02_0154M
614    DP02_0154PE
615    DP02_0154PM
Name: Name, Length: 616, dtype: object

In [20]:
test = ','.join(variable_table[0:48])
test_all_vars_url = f"{base_url}{test}{zcta_url}01001&key={api_key}"
response = requests.get(test_all_vars_url)

In [21]:
labels = response.json()[0][:-1]
values = response.json()[1:][0][:-1]
test_dict = {labels[i]: values[i] for i in range(len(values))}

test_df = variable_df[~variable_df["Name"].str.endswith("A")]
test_df.reset_index(drop = True, inplace = True)
test_labels = test_df["Label"][0:192]
test_dict2 = {test_labels[i]: values[i] for i in range(len(values))}

test_df = pd.DataFrame(np.array(values).reshape(-1,4)).rename(columns = {0: "Estimate", 
                                                               1: "Margin of Error", 
                                                               2: "Percent", 
                                                               3: "Percent Margin of Error"})
test_df

Unnamed: 0,Estimate,Margin of Error,Percent,Percent Margin of Error
0,6791,345,6791.0,-888888888.0
1,2959,265,43.6,4.1
2,874,173,12.9,2.6
3,726,232,10.7,3.3
4,196,129,2.9,1.9
5,1156,222,17.0,3.2
6,48,55,0.7,0.8
7,843,207,12.4,3.0
8,367,146,5.4,2.1
9,1950,314,28.7,4.0
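The reshape step relies on the response listing four measures per variable in a fixed order: estimate, margin of error, percent, percent margin of error. A small sketch of that grouping is shown here, using the first eight values from the table above written as strings. `toy_df` is a hypothetical name, and the conversion to numbers is an optional extra step that the notebook does not perform.

```python
import numpy as np
import pandas as pd

# First eight values from the table above, written as strings
values = ["6791", "345", "6791", "-888888888", "2959", "265", "43.6", "4.1"]
columns = ["Estimate", "Margin of Error", "Percent", "Percent Margin of Error"]

# Each consecutive group of four values becomes one row
toy_df = pd.DataFrame(np.array(values).reshape(-1, 4), columns=columns)

# Convert the strings to numbers so the columns can be compared or summed
toy_df = toy_df.apply(pd.to_numeric)
print(toy_df)
```

The value -888888888 in the first row is not a real margin of error. It is a placeholder that the notebook later replaces with "(X)".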


In [22]:
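# Every 4th label belongs to an Estimate; the 48 requested values give 12 rows
test_labels = test_labels[0:48:4]
# Keep only the text that follows "BY TYPE " in each label
new_index = [label[label.find("BY TYPE")+8:] for label in test_labels]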

test_df.index = new_index
test_df

Unnamed: 0,Estimate,Margin of Error,Percent,Percent Margin of Error
Total households,6791,345,6791.0,-888888888.0
Total households Married-couple household,2959,265,43.6,4.1
Total households Married-couple household With children of the householder under 18 years,874,173,12.9,2.6
Total households Cohabiting couple household,726,232,10.7,3.3
Total households Cohabiting couple household With children of the householder under 18 years,196,129,2.9,1.9
"Total households Male householder, no spouse/partner present",1156,222,17.0,3.2
"Total households Male householder, no spouse/partner present With children of the householder under 18 years",48,55,0.7,0.8
"Total households Male householder, no spouse/partner present Householder living alone",843,207,12.4,3.0
"Total households Male householder, no spouse/partner present Householder living alone 65 years and over",367,146,5.4,2.1
"Total households Female householder, no spouse/partner present",1950,314,28.7,4.0


In [23]:
social_labels = variable_df["Label"][:1232:8].reset_index(drop=True).str.title()

# Getting labels for the Demographic Table
variable_table_url = "https://api.census.gov/data/2021/acs/acs5/profile/groups/DP05.html"
v_table = pd.read_html(variable_table_url) # read every variable listed for the DP05 group of the ACS5 profile
variable_df = pd.DataFrame(v_table[0])
# Assign the cleaned labels back; replace(inplace=True) on a column selection may not update variable_df in newer pandas versions
variable_df['Label'] = variable_df['Label'].replace({"!!": " ", ":": ""}, regex=True)

demographic_labels = variable_df["Label"][:712:8].reset_index(drop=True).str.title()

In [24]:
def get_social_df_url(url):
    """Request one DP02 table; return it as a DataFrame, or 0 if the request fails."""
    response = requests.get(url)
    if response.status_code != 200:
        print(response.status_code)
        return 0

    data = response.json()
    # Drop the last four fields and keep every other value (assumed to skip the annotation columns)
    valid_data = np.array(data[1:][0][:-4:2]).reshape(-1,4)
    socs_df = pd.DataFrame(valid_data).rename(columns = {0: "Estimate", 
                                                         1: "Margin of Error", 
                                                         2: "Percent", 
                                                         3: "Percent Margin of Error"})

    # Need to add proper indices. 
    socs_df.index = social_labels
    # Need to replace "-888888888", need to do for rest of Annotation values.
    # Clean the table that was just downloaded (socs_df), not the earlier test_df
    socs_df = socs_df.replace("-888888888", "(X)")

    return socs_df
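The comment in `get_social_df_url` notes that the remaining annotation values still need handling. A hedged sketch of one way to do that follows. The mapping below is an assumption based on the Census Bureau's published notes on ACS estimate and annotation values and should be checked against that documentation before use. `ANNOTATION_SYMBOLS` and `decode_annotation` are hypothetical names.

```python
# Sentinel values and the symbols they stand for. This mapping is an assumption
# to verify against the Census Bureau's notes on ACS annotation values.
ANNOTATION_SYMBOLS = {
    "-888888888": "(X)",
    "-666666666": "-",
    "-999999999": "N",
    "-222222222": "**",
    "-333333333": "***",
    "-555555555": "*****",
}

def decode_annotation(raw):
    """Return the symbol for a sentinel value, or the value unchanged."""
    text = str(raw)
    # The rendered table above shows the sentinel as -888888888.0, so strip ".0"
    if text.endswith(".0"):
        text = text[:-2]
    return ANNOTATION_SYMBOLS.get(text, raw)

row = ["6791", "345", "6791.0", "-888888888.0"]
print([decode_annotation(v) for v in row])
```

For values that match a key exactly, the same dictionary could also be passed to `DataFrame.replace`.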

<h2>Pulling ZCTA Per State to Allow Scraping by State</h2>

In [25]:
zcta_mapping_df = pd.read_excel("./ani_csv/ZIPCodetoZCTACrosswalk2021UDS.xlsx")
zcta_mapping_df

Unnamed: 0,ZIP_CODE,PO_NAME,STATE,ZIP_TYPE,ZCTA,zip_join_type
0,501,Holtsville,NY,Post Office or large volume customer,11742.0,Spatial join to ZCTA
1,544,Holtsville,NY,Post Office or large volume customer,11742.0,Spatial join to ZCTA
2,601,Adjuntas,PR,Zip Code Area,601.0,Zip matches ZCTA
3,602,Aguada,PR,Zip Code Area,602.0,Zip matches ZCTA
4,603,Aguadilla,PR,Zip Code Area,603.0,Zip matches ZCTA
...,...,...,...,...,...,...
41086,99926,Metlakatla,AK,Zip Code Area,99926.0,Zip matches ZCTA
41087,99927,Point Baker,AK,Zip Code Area,99927.0,Zip matches ZCTA
41088,99928,Ward Cove,AK,Post Office or large volume customer,99901.0,Spatial join to ZCTA
41089,99929,Wrangell,AK,Zip Code Area,99929.0,Zip matches ZCTA


In [26]:
#Clean it so it is only ZCTA and State

zcta_to_state = zcta_mapping_df[["STATE", "ZCTA"]]
zcta_to_state = zcta_to_state[~zcta_to_state["ZCTA"].isna()].copy() # copy so later column assignments modify an independent frame
zcta_to_state["ZCTA"] = zcta_to_state["ZCTA"].astype(int)
zcta_to_state

Unnamed: 0,STATE,ZCTA
0,NY,11742
1,NY,11742
2,PR,601
3,PR,602
4,PR,603
...,...,...
41086,AK,99926
41087,AK,99927
41088,AK,99901
41089,AK,99929


In [33]:
clean_zip = lambda zcta: str(zcta).zfill(5) # left-pad with zeros to five digits, e.g. 601 -> "00601"
zcta_to_state['ZCTA'] = zcta_to_state['ZCTA'].map(clean_zip)
zcta_to_state

Unnamed: 0,STATE,ZCTA
0,NY,11742
1,NY,11742
2,PR,00601
3,PR,00602
4,PR,00603
...,...,...
41086,AK,99926
41087,AK,99927
41088,AK,99901
41089,AK,99929
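Several ZIP codes can map to the same ZCTA (rows 0 and 1 above both map to 11742), so the per-state list contains repeats and the scraper requests some ZCTAs more than once. A minimal sketch of removing the repeats while keeping the original order follows. `unique_zctas` is a hypothetical helper, and the sample pairs are copied from the first rows of the table above.

```python
# (STATE, ZCTA) pairs copied from the first rows of the table above
rows = [("NY", "11742"), ("NY", "11742"),
        ("PR", "00601"), ("PR", "00602"), ("PR", "00603")]

def unique_zctas(pairs, state):
    """Return the ZCTAs of one state, each once, in first-seen order."""
    # dict.fromkeys keeps insertion order and drops repeated keys
    return list(dict.fromkeys(zcta for st, zcta in pairs if st == state))

print(unique_zctas(rows, "NY"))  # ['11742']
print(unique_zctas(rows, "PR"))  # ['00601', '00602', '00603']
```

On the DataFrame itself, `zcta_to_state.drop_duplicates()` would have a similar effect.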


<h2>Preparing State-Specific Scraping Function</h2>

In [28]:
import os

In [34]:
def get_social_df_state(abbrev):
    base_url = "https://api.census.gov/data/2021/acs/acs5/profile?get="
    api_key = "&key=672276f2a0ad053d60f8bb0848cad8a290a29427"
    group = "group(DP02)"
    zcta_url = "&for=zip%20code%20tabulation%20area:"

    ZCTA_Range = list(zcta_to_state[zcta_to_state["STATE"] == abbrev]["ZCTA"])
    
    # Create ./census_data/social_chars/<abbrev>, including any missing parent folders
    os.makedirs(f"./census_data/social_chars/{abbrev}", exist_ok=True)

    for zcta in ZCTA_Range:
        url = f"{base_url}{group}{zcta_url}{zcta}{api_key}"
        new_social_df = get_social_df_url(url)
        if not isinstance(new_social_df, pd.DataFrame):
            print(f"Failed at zcta {zcta}")
            continue
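        else:
            try:
                new_social_df.to_csv(f"census_data/social_chars/{abbrev}/{zcta}.csv")
            except OSError as err:
                # Report a failed write instead of logging it as a success
                print(f"Could not write zcta {zcta}: {err}")
                continue
            print(f"Succeeded at zcta {zcta}")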
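A full-state run makes one request per ZCTA and takes several minutes, so an interrupted run is costly to repeat from the start. One possible way to make it resumable is to skip ZCTAs whose CSV file already exists. This is a sketch only: `pending_zctas` is a hypothetical helper, and it assumes the `<zcta>.csv` naming used by `get_social_df_state`.

```python
import os
import tempfile

def pending_zctas(zctas, out_dir):
    """Return the ZCTAs that do not yet have a CSV file in out_dir."""
    return [z for z in zctas
            if not os.path.exists(os.path.join(out_dir, f"{z}.csv"))]

# Demonstration in a temporary folder: 01001 is already saved, 01002 is not
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "01001.csv"), "w").close()
    remaining = pending_zctas(["01001", "01002"], tmp)
    print(remaining)  # ['01002']
```

Inside the scraping function, the loop could iterate over the returned list instead of the full `ZCTA_Range`.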

In [37]:
get_social_df_state("MA")

Succeeded at zcta 01001
Succeeded at zcta 01002
Succeeded at zcta 01003
Succeeded at zcta 01002
Succeeded at zcta 01005
Succeeded at zcta 01007
Succeeded at zcta 01008
Succeeded at zcta 01009
Succeeded at zcta 01010
Succeeded at zcta 01011
Succeeded at zcta 01012
Succeeded at zcta 01013
Succeeded at zcta 01013
Succeeded at zcta 01020
Succeeded at zcta 01020
Succeeded at zcta 01022
Succeeded at zcta 01026
Succeeded at zcta 01027
Succeeded at zcta 01028
Succeeded at zcta 01029
Succeeded at zcta 01030
Succeeded at zcta 01031
Succeeded at zcta 01032
Succeeded at zcta 01033
Succeeded at zcta 01034
Succeeded at zcta 01035
Succeeded at zcta 01036
Succeeded at zcta 01037
Succeeded at zcta 01038
Succeeded at zcta 01039
Succeeded at zcta 01040
Succeeded at zcta 01040
Succeeded at zcta 01050
Succeeded at zcta 01053
Succeeded at zcta 01054
Succeeded at zcta 01056
Succeeded at zcta 01057
Succeeded at zcta 01002
Succeeded at zcta 01060
Succeeded at zcta 01060
Succeeded at zcta 01062
...

Succeeded at zcta 01921
Succeeded at zcta 01922
Succeeded at zcta 01923
Succeeded at zcta 01929
Succeeded at zcta 01930
Succeeded at zcta 01930
Succeeded at zcta 01982
Succeeded at zcta 01937
Succeeded at zcta 01938
Succeeded at zcta 01940
Succeeded at zcta 01944
Succeeded at zcta 01945
Succeeded at zcta 01949
Succeeded at zcta 01950
Succeeded at zcta 01951
Succeeded at zcta 01952
Succeeded at zcta 01960
Succeeded at zcta 01960
Succeeded at zcta 01915
Succeeded at zcta 01966
Succeeded at zcta 01969
Succeeded at zcta 01970
Succeeded at zcta 01970
Succeeded at zcta 01982
Succeeded at zcta 01983
Succeeded at zcta 01984
Succeeded at zcta 01985
Succeeded at zcta 02043
Succeeded at zcta 02019
Succeeded at zcta 02050
Succeeded at zcta 02021
Succeeded at zcta 02025
Succeeded at zcta 02026
Succeeded at zcta 02026
Succeeded at zcta 02030
Succeeded at zcta 02032
Succeeded at zcta 02035
Succeeded at zcta 02038
Succeeded at zcta 02066
Succeeded at zcta 02050
Succeeded at zcta 02043
...

In [None]:
len(list(zcta_to_state[zcta_to_state["STATE"] == "MA"]["ZCTA"]))
#It took about 10 minutes (roughly 600 seconds) to scrape 681 data sets, or about 0.88 seconds each
#Texas has 1939 ZCTAs, so at the same rate it will take an estimated 1,700 seconds (about 28 minutes)

In [36]:
get_social_df_state("TX")

Succeeded at zcta 78704
Succeeded at zcta 78744
Succeeded at zcta 73949
Succeeded at zcta 75001
Succeeded at zcta 75002
Succeeded at zcta 75006
Succeeded at zcta 75007
Succeeded at zcta 75009
Succeeded at zcta 75010
Succeeded at zcta 75006
Succeeded at zcta 75013
Succeeded at zcta 75039
Succeeded at zcta 75061
Succeeded at zcta 75038
Succeeded at zcta 75061
Succeeded at zcta 75019
Succeeded at zcta 75020
Succeeded at zcta 75021
Succeeded at zcta 75022
Succeeded at zcta 75023
Succeeded at zcta 75024
Succeeded at zcta 75025
Succeeded at zcta 75023
Succeeded at zcta 75028
Succeeded at zcta 75028
Succeeded at zcta 75067
Succeeded at zcta 75088
Succeeded at zcta 75032
Succeeded at zcta 75034
Succeeded at zcta 75034
Succeeded at zcta 75035
Succeeded at zcta 75034
Succeeded at zcta 75038
Succeeded at zcta 75039
Succeeded at zcta 75040
Succeeded at zcta 75041
Succeeded at zcta 75042
Succeeded at zcta 75043
Succeeded at zcta 75044
Succeeded at zcta 75044
Succeeded at zcta 75040
...

Succeeded at zcta 75501
Succeeded at zcta 75550
Succeeded at zcta 75551
Succeeded at zcta 75554
Succeeded at zcta 75555
Succeeded at zcta 75556
Succeeded at zcta 75558
Succeeded at zcta 75559
Succeeded at zcta 75560
Succeeded at zcta 75561
Succeeded at zcta 75562
Succeeded at zcta 75563
Succeeded at zcta 75657
Succeeded at zcta 75565
Succeeded at zcta 75566
Succeeded at zcta 75567
Succeeded at zcta 75568
Succeeded at zcta 75569
Succeeded at zcta 75570
Succeeded at zcta 75571
Succeeded at zcta 75572
Succeeded at zcta 75573
Succeeded at zcta 75574
Succeeded at zcta 75501
Succeeded at zcta 75601
Succeeded at zcta 75602
Succeeded at zcta 75603
Succeeded at zcta 75604
Succeeded at zcta 75605
Succeeded at zcta 75601
Succeeded at zcta 75602
Succeeded at zcta 75605
Succeeded at zcta 75605
Succeeded at zcta 75630
Succeeded at zcta 75631
Succeeded at zcta 75633
Succeeded at zcta 75638
Succeeded at zcta 75633
Succeeded at zcta 75638
Succeeded at zcta 75639
Succeeded at zcta 75640
...

Succeeded at zcta 76201
Succeeded at zcta 76201
Succeeded at zcta 76201
Succeeded at zcta 76201
Succeeded at zcta 76205
Succeeded at zcta 76205
Succeeded at zcta 76207
Succeeded at zcta 76208
Succeeded at zcta 76209
Succeeded at zcta 76210
Succeeded at zcta 76225
Succeeded at zcta 76226
Succeeded at zcta 76227
Succeeded at zcta 76228
Succeeded at zcta 76230
Succeeded at zcta 76233
Succeeded at zcta 76234
Succeeded at zcta 76238
Succeeded at zcta 76239
Succeeded at zcta 76240
Succeeded at zcta 76240
Succeeded at zcta 76244
Succeeded at zcta 76245
Succeeded at zcta 76234
Succeeded at zcta 76247
Succeeded at zcta 76248
Succeeded at zcta 76249
Succeeded at zcta 76250
Succeeded at zcta 76251
Succeeded at zcta 76252
Succeeded at zcta 76253
Succeeded at zcta 76255
Succeeded at zcta 76258
Succeeded at zcta 76259
Succeeded at zcta 76261
Succeeded at zcta 76262
Succeeded at zcta 76263
Succeeded at zcta 76264
Succeeded at zcta 76265
Succeeded at zcta 76266
Succeeded at zcta 76234
...

Succeeded at zcta 77002
Succeeded at zcta 77003
Succeeded at zcta 77004
Succeeded at zcta 77005
Succeeded at zcta 77006
Succeeded at zcta 77007
Succeeded at zcta 77008
Succeeded at zcta 77009
Succeeded at zcta 77010
Succeeded at zcta 77011
Succeeded at zcta 77012
Succeeded at zcta 77013
Succeeded at zcta 77014
Succeeded at zcta 77015
Succeeded at zcta 77016
Succeeded at zcta 77017
Succeeded at zcta 77018
Succeeded at zcta 77019
Succeeded at zcta 77020
Succeeded at zcta 77021
Succeeded at zcta 77022
Succeeded at zcta 77023
Succeeded at zcta 77024
Succeeded at zcta 77025
Succeeded at zcta 77026
Succeeded at zcta 77027
Succeeded at zcta 77028
Succeeded at zcta 77029
Succeeded at zcta 77030
Succeeded at zcta 77031
Succeeded at zcta 77032
Succeeded at zcta 77033
Succeeded at zcta 77034
Succeeded at zcta 77035
Succeeded at zcta 77036
Succeeded at zcta 77037
Succeeded at zcta 77038
Succeeded at zcta 77039
Succeeded at zcta 77040
Succeeded at zcta 77041
Succeeded at zcta 77042
...

Succeeded at zcta 77511
Succeeded at zcta 77511
Succeeded at zcta 77514
Succeeded at zcta 77515
Succeeded at zcta 77515
Succeeded at zcta 77517
Succeeded at zcta 77518
Succeeded at zcta 77519
Succeeded at zcta 77520
Succeeded at zcta 77521
Succeeded at zcta 77521
Succeeded at zcta 77523
Succeeded at zcta 77530
Succeeded at zcta 77531
Succeeded at zcta 77532
Succeeded at zcta 77533
Succeeded at zcta 77534
Succeeded at zcta 77535
Succeeded at zcta 77536
Succeeded at zcta 77538
Succeeded at zcta 77539
Succeeded at zcta 77541
Succeeded at zcta 77541
Succeeded at zcta 77545
Succeeded at zcta 77546
Succeeded at zcta 77547
Succeeded at zcta 77546
Succeeded at zcta 77550
Succeeded at zcta 77551
Succeeded at zcta 77551
Succeeded at zcta 77550
Succeeded at zcta 77554
Succeeded at zcta 77550
Succeeded at zcta 77560
Succeeded at zcta 77561
Succeeded at zcta 77562
Succeeded at zcta 77563
Succeeded at zcta 77564
Succeeded at zcta 77565
Succeeded at zcta 77566
Succeeded at zcta 77568
...

Succeeded at zcta 78242
Succeeded at zcta 78243
Succeeded at zcta 78244
Succeeded at zcta 78245
Succeeded at zcta 78216
Succeeded at zcta 78247
Succeeded at zcta 78248
Succeeded at zcta 78249
Succeeded at zcta 78250
Succeeded at zcta 78251
Succeeded at zcta 78252
Succeeded at zcta 78253
Succeeded at zcta 78254
Succeeded at zcta 78255
Succeeded at zcta 78256
Succeeded at zcta 78257
Succeeded at zcta 78258
Succeeded at zcta 78259
Succeeded at zcta 78260
Succeeded at zcta 78261
Succeeded at zcta 78263
Succeeded at zcta 78264
Succeeded at zcta 78217
Succeeded at zcta 78266
Succeeded at zcta 78238
Succeeded at zcta 78249
Succeeded at zcta 78232
Succeeded at zcta 78230
Succeeded at zcta 78216
Succeeded at zcta 78230
Succeeded at zcta 78204
Succeeded at zcta 78217
Succeeded at zcta 78217
Succeeded at zcta 78240
Succeeded at zcta 78217
Succeeded at zcta 78205
Succeeded at zcta 78205
Succeeded at zcta 78205
Succeeded at zcta 78205
Succeeded at zcta 78205
Succeeded at zcta 78205
...

Succeeded at zcta 78830
Succeeded at zcta 78832
Succeeded at zcta 78833
Succeeded at zcta 78834
Succeeded at zcta 78836
Succeeded at zcta 78837
Succeeded at zcta 78838
Succeeded at zcta 78839
Succeeded at zcta 78840
Succeeded at zcta 78840
Succeeded at zcta 78840
Succeeded at zcta 78843
Succeeded at zcta 78840
Succeeded at zcta 78850
Succeeded at zcta 78851
Succeeded at zcta 78852
Succeeded at zcta 78852
Succeeded at zcta 78860
Succeeded at zcta 78861
Succeeded at zcta 78870
Succeeded at zcta 78871
Succeeded at zcta 78872
Succeeded at zcta 78873
Succeeded at zcta 78877
Succeeded at zcta 78879
Succeeded at zcta 78880
Succeeded at zcta 78881
Succeeded at zcta 78883
Succeeded at zcta 78884
Succeeded at zcta 78885
Succeeded at zcta 78886
Succeeded at zcta 78931
Succeeded at zcta 78932
Succeeded at zcta 78933
Succeeded at zcta 78934
Succeeded at zcta 78935
Succeeded at zcta 78938
Succeeded at zcta 78940
Succeeded at zcta 78941
Succeeded at zcta 78942
Succeeded at zcta 78943
...

Succeeded at zcta 79720
Succeeded at zcta 79730
Succeeded at zcta 79731
Succeeded at zcta 79733
Succeeded at zcta 79734
Succeeded at zcta 79735
Succeeded at zcta 79738
Succeeded at zcta 79739
Succeeded at zcta 79743
Succeeded at zcta 79741
Succeeded at zcta 79742
Succeeded at zcta 79743
Succeeded at zcta 79744
Succeeded at zcta 79745
Succeeded at zcta 79748
Succeeded at zcta 79749
Succeeded at zcta 79752
Succeeded at zcta 79754
Succeeded at zcta 79755
Succeeded at zcta 79756
Succeeded at zcta 79758
Succeeded at zcta 79759
Succeeded at zcta 79761
Succeeded at zcta 79761
Succeeded at zcta 79762
Succeeded at zcta 79763
Succeeded at zcta 79764
Succeeded at zcta 79765
Succeeded at zcta 79766
Succeeded at zcta 79762
Succeeded at zcta 79764
Succeeded at zcta 79770
Succeeded at zcta 79772
Succeeded at zcta 79759
Succeeded at zcta 79777
Succeeded at zcta 79778
Succeeded at zcta 79780
Succeeded at zcta 79781
Succeeded at zcta 79782
Succeeded at zcta 79783
Succeeded at zcta 79785
...