<a href="https://colab.research.google.com/github/npr99/PlanningMethods/blob/master/Explore_ACS_Variable_Metadata_2021_06_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to explore ACS variable metadata
The [American Community Survey developers section for the application programming interface (API)](https://www.census.gov/data/developers/data-sets/acs-5year.html) provides a comprehensive list of all ACS variables. The problem is that each ACS survey provides more than 25,000 unique estimates, so finding a variable that relates to a specific topic is not easy. This notebook helps solve this problem.

The ACS metadata for each variable provides the following:
1. Unique variable name
2. Label
3. Group
4. Limit
5. Concept
6. Attributes
7. Predicate Type
8. Required
9. Predicate Only

The Census API guide provides more details on what each column means:
https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf

For ACS data the most important columns are the name, label, group, and concept.
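
The variable name itself carries some of this metadata. In the listings in this notebook, a name such as `B01001A_001E` belongs to group `B01001A`, and its `attributes` entry lists companions ending in `EA`, `M`, and `MA`. Below is a minimal sketch of splitting a name into those parts, assuming the group, line number, and suffix pattern holds for the detailed tables. The helper `split_variable_name` and the regular expression are illustrative and not part of the Census API.

```python
import re

# Assumed pattern: group ID, underscore, 3-digit line number, suffix.
# E = estimate, M = margin of error, EA/MA = their annotations.
NAME_PATTERN = re.compile(
    r"^(?P<group>[A-Z0-9]+)_(?P<line>\d{3})(?P<suffix>EA|MA|E|M)$"
)

def split_variable_name(name):
    """Return (group, line, suffix), or None for names such as GEO_ID."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return match.group("group"), match.group("line"), match.group("suffix")

print(split_variable_name("B01001A_001E"))   # ('B01001A', '001', 'E')
print(split_variable_name("B01001A_001MA"))  # ('B01001A', '001', 'MA')
print(split_variable_name("GEO_ID"))         # None
```

Geography variables such as `AIANHH` or `GEO_ID` do not follow the pattern, so the helper returns `None` for them rather than guessing.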

In [None]:
# Python packages required to read in Census API data
import requests  # Required for the Census API
import pandas as pd  # For reading, writing, and wrangling data

In [None]:
# Read the 2019 ACS 1-year variable metadata from the Census API
# https://stackoverflow.com/questions/38845474/how-to-convert-json-into-dataframe
acs_variables_json = pd.read_json('https://api.census.gov/data/2019/acs/acs1/variables.json')

In [None]:
# explore variable dictionary from json
acs_variables_json.variables.head()

AIANHH          {'label': 'Geography', 'group': 'N/A', 'limit'...
ANRC            {'label': 'Geography', 'group': 'N/A', 'limit'...
B01001A_001E    {'label': 'Estimate!!Total:', 'concept': 'SEX ...
B01001A_002E    {'label': 'Estimate!!Total:!!Male:', 'concept'...
B01001A_003E    {'label': 'Estimate!!Total:!!Male:!!Under 5 ye...
Name: variables, dtype: object

In [None]:
# Expand each variable's metadata dictionary into separate columns
acs_variables_df = acs_variables_json.variables.apply(pd.Series)
acs_variables_df.head()

Unnamed: 0,label,group,limit,concept,predicateType,attributes,required,predicateOnly
AIANHH,Geography,,0,,,,,
ANRC,Geography,,0,,,,,
B01001A_001E,Estimate!!Total:,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_001EA,B01001A_001M,B01001A_001MA",,
B01001A_002E,Estimate!!Total:!!Male:,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_002EA,B01001A_002M,B01001A_002MA",,
B01001A_003E,Estimate!!Total:!!Male:!!Under 5 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_003EA,B01001A_003M,B01001A_003MA",,


In [None]:
# The variable name is in the index - reset the index to move it into a column
acs_variables = acs_variables_df.reset_index()
# Rename the new "index" column to "name"
acs_variables = acs_variables.rename(columns={"index": "name"})
acs_variables.head()

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,predicateOnly
0,AIANHH,Geography,,0,,,,,
1,ANRC,Geography,,0,,,,,
2,B01001A_001E,Estimate!!Total:,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_001EA,B01001A_001M,B01001A_001MA",,
3,B01001A_002E,Estimate!!Total:!!Male:,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_002EA,B01001A_002M,B01001A_002MA",,
4,B01001A_003E,Estimate!!Total:!!Male:!!Under 5 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_003EA,B01001A_003M,B01001A_003MA",,
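
The read, expand, and reset-index steps above could also be folded into one helper. The following is a sketch only: `variables_to_frame` is a hypothetical name, the sample payload is a hand-copied excerpt of the rows shown above, and `pd.DataFrame.from_dict(..., orient="index")` is swapped in for `apply(pd.Series)`, which may be slow on tens of thousands of rows.

```python
import pandas as pd

def variables_to_frame(payload):
    """Build one row per variable from a parsed variables.json payload."""
    frame = pd.DataFrame.from_dict(payload["variables"], orient="index")
    # Move the variable names out of the index into a "name" column
    return frame.rename_axis("name").reset_index()

# Small stand-in for the real payload, copied from the output above
sample_payload = {
    "variables": {
        "AIANHH": {"label": "Geography", "group": "N/A", "limit": 0},
        "B01001A_001E": {
            "label": "Estimate!!Total:",
            "concept": "SEX BY AGE (WHITE ALONE)",
            "predicateType": "int",
            "group": "B01001A",
            "limit": 0,
            "attributes": "B01001A_001EA,B01001A_001M,B01001A_001MA",
        },
    }
}

sample_frame = variables_to_frame(sample_payload)
print(sample_frame[["name", "label", "group", "concept"]])
```

With the real data, the payload could come from `requests.get(url).json()` using the same `variables.json` URL as above.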


In [None]:
pd.set_option('display.max_colwidth', 20)  # concept names are long, so limit the displayed column width
acs_variables[["name","concept","group","predicateType","required","predicateOnly"]].describe()

Unnamed: 0,name,concept,group,predicateType,required,predicateOnly
count,35555,35531,35555,35533,1,3
unique,35555,1189,1378,6,1,1
top,B24010H_068E,DETAILED OCCUPAT...,B24124,int,default displayed,True
freq,1,566,566,35320,1,3


In [None]:
acs_variables.loc[acs_variables['required'] == 'default displayed']

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,predicateOnly
35536,GEOCOMP,GEO_ID Component,,0,,string,,default displayed,


In [None]:
# Look at variables by predicate type - relates to the variable type (integer, float, string)
acs_variables.groupby(['predicateType']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,predicateType,count
0,fips-for,1
1,fips-in,1
2,float,200
3,int,35320
4,string,10
5,ucgid,1
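
The earlier `describe()` output suggests that some rows have no `predicateType` at all (35,533 non-missing values out of 35,555 rows), and `groupby` drops those rows by default. A hedged alternative, shown on a small made-up frame rather than the real metadata, is `value_counts(dropna=False)`, which keeps the missing category visible:

```python
import pandas as pd

# Illustrative sample only; the real frame is acs_variables
sample = pd.DataFrame({
    "name": ["AIANHH", "ANRC", "B01001A_001E", "B01001A_002E", "B25035_001E"],
    "predicateType": [None, None, "int", "int", "string"],
})

type_counts = (
    sample["predicateType"]
    .value_counts(dropna=False)      # keep missing values as their own row
    .rename_axis("predicateType")
    .reset_index(name="count")
)
print(type_counts)
```

The counts then add up to the number of rows in the frame, which is a quick check that no variable was left out of the summary.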


In [None]:
# Look at variables that are not floats, not integers, and not missing
pd.set_option('display.max_colwidth', 60)  # labels are long, so widen the displayed columns
non_numeric = ~acs_variables['predicateType'].isin(["float", "int"])
not_missing = acs_variables['predicateType'].notna()
acs_variables.loc[non_numeric & not_missing, ['name', 'label', 'predicateType']]

Unnamed: 0,name,label,predicateType
24408,B25035_001E,Estimate!!Median year structure built,string
24432,B25037_001E,Estimate!!Median year structure built --!!Total:,string
24433,B25037_002E,Estimate!!Median year structure built --!!Owner occupied,string
24434,B25037_003E,Estimate!!Median year structure built --!!Renter occupied,string
24450,B25039_001E,Estimate!!Median year householder moved into unit --!!To...,string
24451,B25039_002E,Estimate!!Median year householder moved into unit --!!To...,string
24452,B25039_003E,Estimate!!Median year householder moved into unit --!!To...,string
35536,GEOCOMP,GEO_ID Component,string
35537,GEO_ID,Geography,string
35550,SUMLEVEL,Summary Level code,string


In [None]:
# Look at the top 25 observations
acs_variables.head(25)

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,predicateOnly
0,AIANHH,Geography,,0,,,,,
1,ANRC,Geography,,0,,,,,
2,B01001A_001E,Estimate!!Total:,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_001EA,B01001A_001M,B01001A_001MA",,
3,B01001A_002E,Estimate!!Total:!!Male:,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_002EA,B01001A_002M,B01001A_002MA",,
4,B01001A_003E,Estimate!!Total:!!Male:!!Under 5 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_003EA,B01001A_003M,B01001A_003MA",,
5,B01001A_004E,Estimate!!Total:!!Male:!!5 to 9 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_004EA,B01001A_004M,B01001A_004MA",,
6,B01001A_005E,Estimate!!Total:!!Male:!!10 to 14 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_005EA,B01001A_005M,B01001A_005MA",,
7,B01001A_006E,Estimate!!Total:!!Male:!!15 to 17 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_006EA,B01001A_006M,B01001A_006MA",,
8,B01001A_007E,Estimate!!Total:!!Male:!!18 and 19 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_007EA,B01001A_007M,B01001A_007MA",,
9,B01001A_008E,Estimate!!Total:!!Male:!!20 to 24 years,B01001A,0,SEX BY AGE (WHITE ALONE),int,"B01001A_008EA,B01001A_008M,B01001A_008MA",,
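
Each label in the listing above packs a hierarchy into one string, with levels separated by `!!` and parent levels ending in a colon. Splitting the label can make a long label easier to read before searching it. A minimal sketch, where `split_label` is a hypothetical helper:

```python
def split_label(label):
    """Split an ACS label on '!!' and drop the trailing colons."""
    return [level.rstrip(":") for level in label.split("!!")]

levels = split_label("Estimate!!Total:!!Male:!!Under 5 years")
print(levels)      # ['Estimate', 'Total', 'Male', 'Under 5 years']
print(levels[-1])  # most specific level: 'Under 5 years'
```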


# Search the labels

In [None]:
# Look at one group of variables
# Note that this filter runs against the full list of ACS variables
acs_variables.loc[acs_variables['group'] == 'B24010']

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,predicateOnly,search var count
16777,B24010_001E,Estimate!!Total:,B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_001EA,B24010_001M,B24010_001MA",,,0
16778,B24010_002E,Estimate!!Total:!!Male:,B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_002EA,B24010_002M,B24010_002MA",,,0
16779,B24010_003E,"Estimate!!Total:!!Male:!!Management, business, science, and arts occupations:",B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_003EA,B24010_003M,B24010_003MA",,,0
16780,B24010_004E,"Estimate!!Total:!!Male:!!Management, business, science, and arts occupations:!!Management, business, and financial occupations:",B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_004EA,B24010_004M,B24010_004MA",,,0
16781,B24010_005E,"Estimate!!Total:!!Male:!!Management, business, science, and arts occupations:!!Management, business, and financial occupations:!!Management occupations:",B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_005EA,B24010_005M,B24010_005MA",,,0
...,...,...,...,...,...,...,...,...,...,...
17075,B24010_299E,"Estimate!!Total:!!Female:!!Production, transportation, and material moving occupations:!!Transportation occupations:!!Motor vehicle operators except bus and truck drivers",B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_299EA,B24010_299M,B24010_299MA",,,0
17076,B24010_300E,"Estimate!!Total:!!Female:!!Production, transportation, and material moving occupations:!!Transportation occupations:!!Other transportation workers",B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_300EA,B24010_300M,B24010_300MA",,,0
17077,B24010_301E,"Estimate!!Total:!!Female:!!Production, transportation, and material moving occupations:!!Material moving occupations:",B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_301EA,B24010_301M,B24010_301MA",,,0
17078,B24010_302E,"Estimate!!Total:!!Female:!!Production, transportation, and material moving occupations:!!Material moving occupations:!!Laborers and material movers, hand",B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_302EA,B24010_302M,B24010_302MA",,,0


In [None]:
# Count how many of the search terms appear in each label (case-insensitive)
acs_variables["search var count"] = 0
acs_variables.loc[acs_variables['label'].str.contains('nurse',case=False,na=False),'search var count'] += 1
acs_variables.loc[acs_variables['label'].str.contains('registered',case=False,na=False),'search var count'] += 1
acs_variables["search var count"].describe()

count    35555.000000
mean         0.003291
std          0.063334
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          2.000000
Name: search var count, dtype: float64
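
The same counting idea can be wrapped in a function so that the list of search terms is easy to change. This is a sketch under stated assumptions: `count_term_matches` is a hypothetical helper, the sample labels are illustrative (the third one is made up), and `re.escape` is added so that each term is matched as literal text rather than as a regular expression, which differs slightly from the cell above.

```python
import re
import pandas as pd

def count_term_matches(labels, terms):
    """Count, per label, how many of the search terms it contains."""
    counts = pd.Series(0, index=labels.index)
    for term in terms:
        # re.escape treats the term as literal text, not a regular expression
        hits = labels.str.contains(re.escape(term), case=False, na=False)
        counts = counts + hits.astype(int)
    return counts

# Illustrative labels; the third is made up for the example
sample_labels = pd.Series([
    "Estimate!!Total:!!Registered nurses",
    "Estimate!!Total:!!Male:",
    "Estimate!!Total:!!Nurse anesthetists",
    None,
])
sample_counts = count_term_matches(sample_labels, ["nurse", "registered"])
print(sample_counts.tolist())  # [2, 0, 1, 0]
```

On the full metadata, the result could be assigned with `acs_variables["search var count"] = count_term_matches(acs_variables["label"], ["nurse", "registered"])`.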

In [None]:
# save observations with desired search var count
variables_to_explore = acs_variables.loc[acs_variables['search var count'] == 2]
# drop if GEO_ID is included in list - GEO_ID includes list of all variables
variables_to_explore = variables_to_explore.loc[variables_to_explore['name']!='GEO_ID']
variables_to_explore.head(50)

Unnamed: 0,name,label,group,limit,concept,predicateType,attributes,required,predicateOnly,search var count
16836,B24010_060E,"Estimate!!Total:!!Male:!!Management, business, science, and arts occupations:!!Healthcare practitioners and technical occupations:!!Health diagnosing and treating practitioners and other technical...",B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_060EA,B24010_060M,B24010_060MA",,,2
16987,B24010_211E,"Estimate!!Total:!!Female:!!Management, business, science, and arts occupations:!!Healthcare practitioners and technical occupations:!!Health diagnosing and treating practitioners and other technic...",B24010,0,SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24010_211EA,B24010_211M,B24010_211MA",,,2
17248,B24020_060E,"Estimate!!Total:!!Male:!!Management, business, science, and arts occupations:!!Healthcare practitioners and technical occupations:!!Health diagnosing and treating practitioners and other technical...",B24020,0,"SEX BY OCCUPATION FOR THE FULL-TIME, YEAR-ROUND CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER",int,"B24020_060EA,B24020_060M,B24020_060MA",,,2
17399,B24020_211E,"Estimate!!Total:!!Female:!!Management, business, science, and arts occupations:!!Healthcare practitioners and technical occupations:!!Health diagnosing and treating practitioners and other technic...",B24020,0,"SEX BY OCCUPATION FOR THE FULL-TIME, YEAR-ROUND CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER",int,"B24020_211EA,B24020_211M,B24020_211MA",,,2
19035,B24114_219E,Estimate!!Total:!!Registered nurses,B24114,0,DETAILED OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER,int,"B24114_219EA,B24114_219M,B24114_219MA",,,2
19601,B24115_219E,Estimate!!Total:!!Registered nurses,B24115,0,DETAILED OCCUPATION FOR THE CIVILIAN EMPLOYED MALE POPULATION 16 YEARS AND OVER,int,"B24115_219EA,B24115_219M,B24115_219MA",,,2
20167,B24116_219E,Estimate!!Total:!!Registered nurses,B24116,0,DETAILED OCCUPATION FOR THE CIVILIAN EMPLOYED FEMALE POPULATION 16 YEARS AND OVER,int,"B24116_219EA,B24116_219M,B24116_219MA",,,2
20733,B24121_219E,Estimate!!Total:!!Registered nurses,B24121,0,"DETAILED OCCUPATION BY MEDIAN EARNINGS IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) FOR THE FULL-TIME, YEAR-ROUND CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER",int,"B24121_219EA,B24121_219M,B24121_219MA",,,2
21299,B24122_219E,Estimate!!Total:!!Registered nurses,B24122,0,"DETAILED OCCUPATION BY MEDIAN EARNINGS IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) FOR THE FULL-TIME, YEAR-ROUND CIVILIAN EMPLOYED MALE POPULATION 16 YEARS AND OVER",int,"B24122_219EA,B24122_219M,B24122_219MA",,,2
21865,B24123_219E,Estimate!!Total:!!Registered nurses,B24123,0,"DETAILED OCCUPATION BY MEDIAN EARNINGS IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) FOR THE FULL-TIME, YEAR-ROUND CIVILIAN EMPLOYED FEMALE POPULATION 16 YEARS AND OVER",int,"B24123_219EA,B24123_219M,B24123_219MA",,,2


In [None]:
variables_to_explore['name'].describe()

count              13
unique             13
top       B24114_219E
freq                1
Name: name, dtype: object

## Look at all of the groups and concepts for the selected variables
The group and concept provide a description of the variables.

In [None]:
pd.set_option('display.max_colwidth', 200)  # labels and concept names are long, so widen the displayed columns
variables_to_explore.groupby(['group','label']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,group,label,count
0,B24010,"Estimate!!Total:!!Female:!!Management, business, science, and arts occupations:!!Healthcare practitioners and technical occupations:!!Health diagnosing and treating practitioners and other technic...",1
1,B24010,"Estimate!!Total:!!Male:!!Management, business, science, and arts occupations:!!Healthcare practitioners and technical occupations:!!Health diagnosing and treating practitioners and other technical...",1
2,B24020,"Estimate!!Total:!!Female:!!Management, business, science, and arts occupations:!!Healthcare practitioners and technical occupations:!!Health diagnosing and treating practitioners and other technic...",1
3,B24020,"Estimate!!Total:!!Male:!!Management, business, science, and arts occupations:!!Healthcare practitioners and technical occupations:!!Health diagnosing and treating practitioners and other technical...",1
4,B24114,Estimate!!Total:!!Registered nurses,1
5,B24115,Estimate!!Total:!!Registered nurses,1
6,B24116,Estimate!!Total:!!Registered nurses,1
7,B24121,Estimate!!Total:!!Registered nurses,1
8,B24122,Estimate!!Total:!!Registered nurses,1
9,B24123,Estimate!!Total:!!Registered nurses,1
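
The table above groups by label. To see the concept that goes with each group instead, one option is to group on `group` and `concept`. The sketch below uses a small excerpt of the search results above; the frame `sample_results` is a stand-in for `variables_to_explore`.

```python
import pandas as pd

sample_results = pd.DataFrame({
    "name": ["B24010_060E", "B24010_211E", "B24114_219E", "B24115_219E"],
    "group": ["B24010", "B24010", "B24114", "B24115"],
    "concept": [
        "SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER",
        "SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER",
        "DETAILED OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER",
        "DETAILED OCCUPATION FOR THE CIVILIAN EMPLOYED MALE POPULATION 16 YEARS AND OVER",
    ],
})

# One row per (group, concept) pair with the number of matching variables
concept_counts = (
    sample_results.groupby(["group", "concept"]).size().reset_index(name="count")
)
print(concept_counts)
```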


In [None]:
# Save results to a csv file
variables_to_explore.to_csv("variables_to_explore.csv")
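
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, `index=False` leaves it out. A small sketch that writes to an in-memory buffer instead of a file, using a one-row stand-in for `variables_to_explore`:

```python
import io
import pandas as pd

sample = pd.DataFrame(
    {"name": ["B24114_219E"], "group": ["B24114"]}, index=[19035]
)

with_index = io.StringIO()
sample.to_csv(with_index)                  # default: the index is written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)  # leave the index out

print(with_index.getvalue().splitlines()[0])     # ,name,group
print(without_index.getvalue().splitlines()[0])  # name,group
```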