<a href="https://colab.research.google.com/github/npr99/PlanningMethods/blob/master/Explore_ACS_Variable_Metadata_2021_06_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to explore ACS variable metadata
The [American Community Survey developers section for the application programming interface (API)](https://www.census.gov/data/developers/data-sets/acs-5year.html) provides a comprehensive list of all ACS variables. The problem is that each ACS survey provides more than 25,000 unique estimates, so finding a variable that relates to a specific topic is not easy. This notebook helps solve that problem.

The ACS metadata for each variable provides the following:
1. Unique variable name
2. Label
3. Group
4. Limit
5. Concept
6. Attributes
7. Predicate Type
8. Required
9. Predicate Only

The Census API guide provides more details on what each column means:
https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf

For ACS data the most important columns are the name, label, group, and concept.
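
The name and group columns are closely related. As a minimal sketch, the helper below splits a detailed-table variable name into its group, sequence number, and suffix. The pattern is an assumption inferred from the names shown in this notebook (for example `B01001A_003E` and its attributes `B01001A_003EA`, `B01001A_003M`, `B01001A_003MA`), and `split_variable_name` is a hypothetical helper, not part of the Census API.

```python
import re

# Assumed pattern: <group>_<three-digit sequence><suffix>, e.g. B01001A_003E.
# The suffixes E, EA, M and MA are the ones that appear in the attributes column.
VARIABLE_NAME = re.compile(
    r"^(?P<group>[A-Z0-9]+)_(?P<sequence>\d{3})(?P<suffix>EA|MA|E|M)$"
)

def split_variable_name(name):
    """Return (group, sequence, suffix), or None for names such as GEO_ID."""
    match = VARIABLE_NAME.match(name)
    if match is None:
        return None
    return match.group("group"), int(match.group("sequence")), match.group("suffix")

print(split_variable_name("B01001A_003E"))
print(split_variable_name("B01001A_003MA"))
print(split_variable_name("GEO_ID"))
```

Names that do not follow the pattern, such as the geography variables, return `None` rather than raising an error.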

In [1]:
# Python packages required to read in Census API data
import requests # Required for the Census API
import pandas as pd # For reading, writing and wrangling data

In [2]:
# https://stackoverflow.com/questions/38845474/how-to-convert-json-into-dataframe
acs_variables_json = pd.read_json('https://api.census.gov/data/2012/acs/acs5/variables.json')

In [3]:
# explore variable dictionary from json
acs_variables_json.variables.head()

AIANHH         {'label': 'American Indian Area/Alaska Native ...
AIHHTLI        {'label': 'American Indian Area (Off-Reservati...
AITSCE         {'label': 'American Indian Tribal Subdivision ...
ANRC           {'label': 'Alaska Native Regional Corporation'...
B00001_001E    {'label': 'Estimate!!Total', 'concept': 'UNWEI...
Name: variables, dtype: object

In [4]:
# Expand each variable's metadata dictionary into separate columns
acs_variables_df = acs_variables_json.variables.apply(pd.Series)
acs_variables_df.head()

Unnamed: 0,label,group,limit,concept,predicateType,attributes,required,values,predicateOnly
AIANHH,American Indian Area/Alaska Native Area/Hawaii...,,0,,,,,,
AIHHTLI,American Indian Area (Off-Reservation Trust La...,,0,,,,,,
AITSCE,American Indian Tribal Subdivision (Census),,0,,,,,,
ANRC,Alaska Native Regional Corporation,,0,,,,,,
B00001_001E,Estimate!!Total,B00001,0,UNWEIGHTED SAMPLE COUNT OF THE POPULATION,int,B00001_001EA,,,
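
The same reshaping can also be sketched with `pd.DataFrame.from_dict`, which builds the table in one step instead of applying `pd.Series` row by row. The `sample` dictionary below is a hand-written, simplified stand-in for the nested `variables` structure, with values copied from the output above; it is not the full API response.

```python
import pandas as pd

# Simplified stand-in for the nested "variables" dictionary (assumed shape).
sample = {
    "ANRC": {"label": "Alaska Native Regional Corporation", "limit": 0},
    "B00001_001E": {
        "label": "Estimate!!Total",
        "concept": "UNWEIGHTED SAMPLE COUNT OF THE POPULATION",
        "group": "B00001",
        "limit": 0,
        "predicateType": "int",
        "attributes": "B00001_001EA",
    },
}

# orient="index" turns each outer key into a row; missing keys become NaN.
sample_df = (
    pd.DataFrame.from_dict(sample, orient="index")
    .rename_axis("name")   # name the index so reset_index creates a "name" column
    .reset_index()
)
print(sample_df[["name", "label", "group", "concept"]])
```

Naming the index before resetting it also removes the need for a separate rename step.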


In [5]:
# The variable name is stored in the index - reset the index to move it into a column
acs_variables = acs_variables_df.reset_index()
# Rename the new "index" column to "name"
acs_variables = acs_variables.rename(columns={"index": "name"})
acs_variables.head()

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,values,predicateOnly
0,AIANHH,American Indian Area/Alaska Native Area/Hawaii...,,0,,,,,,
1,AIHHTLI,American Indian Area (Off-Reservation Trust La...,,0,,,,,,
2,AITSCE,American Indian Tribal Subdivision (Census),,0,,,,,,
3,ANRC,Alaska Native Regional Corporation,,0,,,,,,
4,B00001_001E,Estimate!!Total,B00001,0,UNWEIGHTED SAMPLE COUNT OF THE POPULATION,int,B00001_001EA,,,


In [6]:
pd.set_option('display.max_colwidth', 20) # concept names are long - limit the pandas column display width
acs_variables[["name","concept","group","predicateType","required","predicateOnly"]].describe()

Unnamed: 0,name,concept,group,predicateType,required,predicateOnly
count,22567,22531,22567,22537,1,3
unique,22567,1030,1030,6,1,1
top,B17010D_035E,DETAILED OCCUPAT...,B24123,int,default displayed,True
freq,1,526,526,22350,1,3


In [7]:
acs_variables.loc[acs_variables['required'] == 'default displayed']

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,values,predicateOnly
22539,GEOCOMP,Geographic Compo...,,0,,string,,default displayed,{'item': {'R1': ...,


In [8]:
# Look at variables by predicate type - relates to the variable type (integer, float, string)
acs_variables.groupby(['predicateType']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,predicateType,count
0,fips-for,1
1,fips-in,1
2,float,177
3,int,22350
4,string,7
5,ucgid,1


In [9]:
# Look at variables that are not floats, not integers, and not missing
pd.set_option('display.max_colwidth', 60) # concept names are long - widen the pandas column display
acs_variables.loc[~acs_variables['predicateType'].isin(["float","int"]) &
                  acs_variables['predicateType'].notna(), ['name','label','predicateType']]

Unnamed: 0,name,label,predicateType
22539,GEOCOMP,Geographic Component code,string
22540,GEOVARIANT,Geographic variant,string
22541,GEO_ID,Geography,string
22542,LSAD_NAME,Legal/Statistical Area Description name,string
22557,SUMLEVEL,Summary Level code,string
22561,US,United States,string
22563,ZCTA5,Zip Code Tabulation Area (Five-Digit),string
22564,for,Census API FIPS 'for' clause,fips-for
22565,in,Census API FIPS 'in' clause,fips-in
22566,ucgid,Uniform Census Geography Identifier clause,ucgid


In [10]:
# Look at the top 25 observations
acs_variables.head(25)

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,values,predicateOnly
0,AIANHH,American Indian Area/Alaska Native Area/Hawaiian Home Land,,0,,,,,,
1,AIHHTLI,American Indian Area (Off-Reservation Trust Land Only)/H...,,0,,,,,,
2,AITSCE,American Indian Tribal Subdivision (Census),,0,,,,,,
3,ANRC,Alaska Native Regional Corporation,,0,,,,,,
4,B00001_001E,Estimate!!Total,B00001,0,UNWEIGHTED SAMPLE COUNT OF THE POPULATION,int,B00001_001EA,,,
5,B00002_001E,Estimate!!Total,B00002,0,UNWEIGHTED SAMPLE HOUSING UNITS,int,B00002_001EA,,,
6,B01001A_001E,Estimate!!Total,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_001EA,B01001A_001M,B01001A_001MA",,,
7,B01001A_002E,Estimate!!Total!!Male,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_002EA,B01001A_002M,B01001A_002MA",,,
8,B01001A_003E,Estimate!!Total!!Male!!Under 5 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_003EA,B01001A_003M,B01001A_003MA",,,
9,B01001A_004E,Estimate!!Total!!Male!!5 to 9 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_004EA,B01001A_004M,B01001A_004MA",,,
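
The labels above encode a hierarchy, with levels separated by `!!`. A minimal sketch of splitting that hierarchy into columns follows; the three rows are copied from the output above, and the `level_` column names and the `depth` series are hypothetical names chosen for illustration.

```python
import pandas as pd

# Three rows copied from the B01001A group shown above.
labels = pd.DataFrame({
    "name": ["B01001A_001E", "B01001A_002E", "B01001A_003E"],
    "label": [
        "Estimate!!Total",
        "Estimate!!Total!!Male",
        "Estimate!!Total!!Male!!Under 5 years",
    ],
})

# Split each label on "!!" into one column per level of the hierarchy.
levels = labels["label"].str.split("!!", expand=True).add_prefix("level_")

# Number of levels in each label (separators plus one).
depth = labels["label"].str.count("!!") + 1

print(pd.concat([labels["name"], levels], axis=1))
print(depth.tolist())
```

Shorter labels are padded with missing values, so the deepest label determines the number of columns.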


# Search the groups and concepts

In [50]:
# Look at one group of variables
# Note that this search runs against the full list of ACS variables
acs_variables.loc[acs_variables['group'] == 'B25118']

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,values,predicateOnly,search var count
19583,B25118_001E,Estimate!!Total,B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_001EA,B25118_001M,B25118_001MA",,,,1
19584,B25118_002E,Estimate!!Total!!Owner occupied,B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_002EA,B25118_002M,B25118_002MA",,,,1
19585,B25118_003E,"Estimate!!Total!!Owner occupied!!Less than $5,000",B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_003EA,B25118_003M,B25118_003MA",,,,1
19586,B25118_004E,"Estimate!!Total!!Owner occupied!!$5,000 to $9,999",B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_004EA,B25118_004M,B25118_004MA",,,,1
19587,B25118_005E,"Estimate!!Total!!Owner occupied!!$10,000 to $14,999",B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_005EA,B25118_005M,B25118_005MA",,,,1
19588,B25118_006E,"Estimate!!Total!!Owner occupied!!$15,000 to $19,999",B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_006EA,B25118_006M,B25118_006MA",,,,1
19589,B25118_007E,"Estimate!!Total!!Owner occupied!!$20,000 to $24,999",B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_007EA,B25118_007M,B25118_007MA",,,,1
19590,B25118_008E,"Estimate!!Total!!Owner occupied!!$25,000 to $34,999",B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_008EA,B25118_008M,B25118_008MA",,,,1
19591,B25118_009E,"Estimate!!Total!!Owner occupied!!$35,000 to $49,999",B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_009EA,B25118_009M,B25118_009MA",,,,1
19592,B25118_010E,"Estimate!!Total!!Owner occupied!!$50,000 to $74,999",B25118,0,TENURE BY HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),int,"B25118_010EA,B25118_010M,B25118_010MA",,,,1


In [47]:
# Count how many of the search terms appear in each concept
acs_variables["search var count"] = 0
acs_variables.loc[acs_variables['concept'].str.contains('SIZE',case=False,na=False),'search var count'] += 1
acs_variables.loc[acs_variables['concept'].str.contains('INCOME',case=False,na=False),'search var count'] += 1
acs_variables["search var count"].describe()

count    22567.000000
mean         0.140338
std          0.349380
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          2.000000
Name: search var count, dtype: float64

In [48]:
# Keep observations whose concept matched both search terms
variables_to_explore = acs_variables.loc[acs_variables['search var count'] >= 2]
# Drop GEO_ID if it appears in the results - GEO_ID includes a list of all variables
variables_to_explore = variables_to_explore.loc[variables_to_explore['name']!='GEO_ID']
pd.set_option('display.max_colwidth', 200) # concept names are long - widen the pandas column display
variables_to_explore.groupby(['group','concept']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,group,concept,count
0,B19019,MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) BY HOUSEHOLD SIZE,8
1,B19119,MEDIAN FAMILY INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) BY FAMILY SIZE,7


In [40]:
# Add 1 for each concept phrase of interest
acs_variables["search var count"] = 0
acs_variables.loc[acs_variables['concept'].str.contains('HOUSEHOLD INCOME IN THE PAST 12 MONTHS',case=False,na=False),'search var count'] += 1
acs_variables.loc[acs_variables['concept'].str.contains('FAMILY INCOME IN THE PAST 12 MONTHS',case=False,na=False),'search var count'] += 1

# Subtract 0.3 for each unwanted term so a single-phrase match falls below a cutoff of 1
acs_variables.loc[acs_variables['concept'].str.contains('poverty',case=False,na=False),'search var count'] -= 0.3
acs_variables.loc[acs_variables['concept'].str.contains('median',case=False,na=False),'search var count'] -= 0.3
acs_variables.loc[acs_variables['concept'].str.contains('AGGREGATE',case=False,na=False),'search var count'] -= 0.3
acs_variables["search var count"].describe()

count    22567.000000
mean         0.019391
std          0.320990
min         -0.300000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.100000
Name: search var count, dtype: float64

In [41]:
# Keep observations that reach the desired search var count
variables_to_explore = acs_variables.loc[acs_variables['search var count'] >= 1]
# Drop GEO_ID if it appears in the results - GEO_ID includes a list of all variables
variables_to_explore = variables_to_explore.loc[variables_to_explore['name']!='GEO_ID']
variables_to_explore.head(50)

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,values,predicateOnly,search var count
11157,B19001A_001E,Estimate!!Total,B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_001EA,B19001A_001M,B19001A_001MA",,,,1.0
11158,B19001A_002E,"Estimate!!Total!!Less than $10,000",B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_002EA,B19001A_002M,B19001A_002MA",,,,1.0
11159,B19001A_003E,"Estimate!!Total!!$10,000 to $14,999",B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_003EA,B19001A_003M,B19001A_003MA",,,,1.0
11160,B19001A_004E,"Estimate!!Total!!$15,000 to $19,999",B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_004EA,B19001A_004M,B19001A_004MA",,,,1.0
11161,B19001A_005E,"Estimate!!Total!!$20,000 to $24,999",B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_005EA,B19001A_005M,B19001A_005MA",,,,1.0
11162,B19001A_006E,"Estimate!!Total!!$25,000 to $29,999",B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_006EA,B19001A_006M,B19001A_006MA",,,,1.0
11163,B19001A_007E,"Estimate!!Total!!$30,000 to $34,999",B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_007EA,B19001A_007M,B19001A_007MA",,,,1.0
11164,B19001A_008E,"Estimate!!Total!!$35,000 to $39,999",B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_008EA,B19001A_008M,B19001A_008MA",,,,1.0
11165,B19001A_009E,"Estimate!!Total!!$40,000 to $44,999",B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_009EA,B19001A_009M,B19001A_009MA",,,,1.0
11166,B19001A_010E,"Estimate!!Total!!$45,000 to $49,999",B19001A,0,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),int,"B19001A_010EA,B19001A_010M,B19001A_010MA",,,,1.0


In [42]:
variables_to_explore['name'].describe()

count            1799
unique           1799
top       B25074_007E
freq                1
Name: name, dtype: object
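
The two searches above follow the same pattern: add 1 for every wanted term found in the concept and subtract a small penalty for every unwanted term. A minimal sketch of that pattern as a reusable function is shown below. `score_concepts` and its parameters are hypothetical names, the sample concepts are copied from the outputs in this notebook, and the terms are matched as plain text (`regex=False`) rather than as regular expressions, which differs slightly from the cells above.

```python
import pandas as pd

def score_concepts(df, include, exclude=(), penalty=0.3):
    """Score each row: +1 per include term found, -penalty per exclude term found."""
    concept = df["concept"]
    score = pd.Series(0.0, index=df.index)
    for term in include:
        found = concept.str.contains(term, case=False, na=False, regex=False)
        score += found.astype(float)
    for term in exclude:
        found = concept.str.contains(term, case=False, na=False, regex=False)
        score -= penalty * found.astype(float)
    return score

# Small sample with concepts copied from this notebook; None mimics a missing concept.
sample = pd.DataFrame({
    "group": ["B19001", "B19019", "B01001A", None],
    "concept": [
        "HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS)",
        "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) BY HOUSEHOLD SIZE",
        "SEX BY AGE (WHITE ALONE)",
        None,
    ],
})

scores = score_concepts(
    sample,
    include=["HOUSEHOLD INCOME IN THE PAST 12 MONTHS", "FAMILY INCOME IN THE PAST 12 MONTHS"],
    exclude=["poverty", "median", "AGGREGATE"],
)
print(sample.loc[scores >= 1, "group"].tolist())
```

With a cutoff of 1, the median income table is dropped because its single match is reduced by the penalty.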

## Look at all of the groups and concepts for the selected variables
The group and concept provide a description of the variables.

In [43]:
pd.set_option('display.max_colwidth', 200) # concept names are long - widen the pandas column display
variables_to_explore.groupby(['group','concept']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,group,concept,count
0,B19001,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS),17
1,B19001A,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER),17
2,B19001B,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (BLACK OR AFRICAN AMERICAN ALONE HOUSEHOLDER),17
3,B19001C,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (AMERICAN INDIAN AND ALASKA NATIVE ALONE HOUSEHOLDER),17
4,B19001D,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (ASIAN ALONE HOUSEHOLDER),17
5,B19001E,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (NATIVE HAWAIIAN AND OTHER PACIFIC ISLANDER ALONE HOUSEHOLDER),17
6,B19001F,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (SOME OTHER RACE ALONE HOUSEHOLDER),17
7,B19001G,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (TWO OR MORE RACES HOUSEHOLDER),17
8,B19001H,"HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE, NOT HISPANIC OR LATINO HOUSEHOLDER)",17
9,B19001I,HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2012 INFLATION-ADJUSTED DOLLARS) (HISPANIC OR LATINO HOUSEHOLDER),17


In [None]:
# Save results to a csv file
variables_to_explore.to_csv("variables_to_explore.csv")
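
By default `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below illustrates the difference on a one-row sample copied from the results above, writing to a temporary directory so that no real file is overwritten; passing `index=False` is an optional variation, not something the cell above requires.

```python
import os
import tempfile

import pandas as pd

# One row copied from the search results above, keeping its original index value.
sample = pd.DataFrame(
    {"name": ["B19001A_001E"], "group": ["B19001A"]},
    index=[11157],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "variables_to_explore.csv")

    sample.to_csv(path)                # default: the index is written as an unnamed column
    with_index = pd.read_csv(path)

    sample.to_csv(path, index=False)   # skip the index when it carries no meaning
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Keeping the index can still be useful here, because it records each variable's position in the full metadata table.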