In [2]:
import requests
import zipfile
import io

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Source

United States Census Bureau. *Household Pulse Survey: Measuring Emergent Social and Economic Matters Facing U.S. Households*. https://www.census.gov/programs-surveys/household-pulse-survey.html

Fields JF, Hunter-Childs J, Tersine A, Sisson J, Parker E, Velkoff V, Logan C, and Shin H (2020). *Design and Operation of the 2020 Household Pulse Survey*. [https://www2.census.gov/programs-surveys/demo/technical-documentation/hhp/2020_HPS_Background.pdf]. U.S. Census Bureau.

The first citation is for the main Household Pulse Survey website. We are working with public use files at https://www.census.gov/programs-surveys/household-pulse-survey/data/datasets.html. The second citation is for a reference paper for understanding the larger context around the data we are working with.

I originally found an interesting COVID-19-related mental health dataset at https://catalog.data.gov/dataset/mental-health-care-in-the-last-4-weeks. That page points to a related Centers for Disease Control and Prevention (CDC) page at https://www.cdc.gov/nchs/covid19/pulse/mental-health-care.htm. Both of these are aggregated, but we will want disaggregated data for unsupervised learning. A link at the bottom of the CDC page gets us back to the Census page and the original source(s), including disaggregated data.

## Survey and Survey Data Summary

I started this project by looking at candidate datasets. When I came across the mental health dataset at https://data.gov, I figured that data would be relatively straightforward to work with. Little did I know just how much complexity goes into surveys, survey data, and Census data. I will summarize what I learned about how the specific survey we look at in this project works, as well as how similar surveys work.

Note: much of my summary in this section comes from the main Household Pulse Survey link cited above as well as this sub-page: https://www.census.gov/data/experimental-data-products/household-pulse-survey.html

The United States Census Bureau is a primary hub for data about the American citizenry and economy. Before digging into this project, I was aware of the United States Census that happens every ten years. Beyond the official United States Census, I actually did not know what else the Census Bureau gathered, though I was aware that it was involved in other data work.

The Census Bureau started the Household Pulse Survey (HPS) in 2020 in partnership with other federal agencies to measure the impact of the COVID-19 pandemic. The HPS gathers data about American households in a much faster timeframe than other surveys available at the time. The HPS originally surveyed respondents weekly. This then changed to bi-weekly and eventually monthly. Each of these collections is a cycle, and cycles get bundled into phases that typically run for around nine weeks with multiple collections during phases. From my research, this is a very fast gathering and reporting turnaround compared to many other common surveys that I looked into. One result of that speed is that representation among survey respondents becomes more difficult. We will speak about that more in the weighting section below.

The HPS allows analysis at the national level, state level, and scoped to the 15 largest metropolitan statistical areas (MSA). You can find more information about MSAs at https://www.census.gov/programs-surveys/metro-micro.html, but my colloquial summary for them is areas of high population density and certain characteristics that make them population and economic hubs in their respective regions.

The HPS is a 20-minute survey that asks about demographic characteristics as well as a number of relevant topics such as:
- Childcare arrangements
- Food sufficiency
- Housing security
- Household spending
- Physical and mental health
- Health insurance coverage
- Social isolation
- (You can see a more complete list of topics at https://www.census.gov/data/experimental-data-products/household-pulse-survey.html in the section titled **What information does the Household Pulse Survey collect?**)

The scope of what the HPS is looking to study is robust, and the topics are relevant to areas where people were struggling during the COVID-19 pandemic and are continuing to struggle now.

This project will focus on earlier surveys when the HPS started in 2020. The HPS continues today and has expanded in scope over time.

We will talk more about this below, but the general approach in this project will be to explore clustering, anomaly/outlier detection, and feature reduction. Each of these prefers or requires disaggregated data. This means we will work with the HPS Public Use Files (PUF) instead of data tables. PUFs have respondent-level data. The Census Bureau also provides data tables with re-weighting and aggregation already done so that researchers can focus on trends and findings instead of working on the disaggregated data, so those are available if of interest.

This is a quick summary of my current understanding of surveys, Census Bureau data, and the HPS. It took a lot of research to get to this level of understanding. I call this out because there is one more aspect of the HPS data that has been even more difficult to make sense of and to determine how to incorporate into this project: weighting.

## Survey Weights

Hold on to your hats. This section is going all over the place.

Determining what to do with the weights that come with the data has been the most difficult part of understanding the HPS data. Maybe the weightings are obvious to others, but I have spent more time trying to figure out what to do with these than any other research on the HPS and its data. It has been hard to find solid reference documentation on the right ways to use HPS weightings, and what I have found that is more concrete is for other Census Bureau data.

Okay. With that out of the way, here is my current understanding.

The three types of weights for the HPS are:
- Household weights: adjusts for household-level response representation that does not match known or expected representation for respondents based on the American Community Survey (ACS)
- Person weights: multiplies the household weight by the number of adults in the household to get per-person response values
- Replicate weights: provides 160 weights for use in estimating point estimates and associated variances and standard errors

There are references at the bottom of the notebook that describe Census Bureau weights. I find https://www2.census.gov/programs-surveys/demo/technical-documentation/hhp/2020_HPS_NR_Bias_Report-final.pdf to be the most useful, section 1.2 in particular. To get a sense of the complexity of these weights, see https://www.census.gov/content/dam/Census/library/publications/2010/acs/Chapter_11_RevisedDec2010.pdf. A number of the references I found are not for the HPS directly but help to understand how the Census Bureau uses weights for other surveys. Often that was the best I could find due to limited HPS-specific reference data for weighting.

The HPS uses demographic information from other Census Bureau sources to help adjust the data from the HPS, meaning we know from other sources what the demographic breakdown should look like, and we can adjust the HPS data so that it aligns with the representation that we see in other more reliable and established survey data. So, if we know that X% of the population in a response area are a certain race, we can adjust the HPS data to match that percentage even though we did not get that percentage based on actual responses. The reference papers mention that this process will still lead to inaccuracies, but it is preferable when trying to extrapolate to larger population-level findings. Remember also that the HPS was deployed to get faster feedback on Americans during the COVID-19 pandemic, so the trade-offs were acceptable.
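
To make the adjustment idea concrete for myself, here is a minimal sketch of a single-variable adjustment on toy data. The group labels, the population shares, and the column names are all made up for illustration, and this is only my simplified reading of the general idea, not the Census Bureau's actual procedure, which adjusts on several characteristics at once.

```python
import pandas as pd

# Toy respondents: group "a" is overrepresented relative to the population.
sample = pd.DataFrame({"group": ["a", "a", "a", "b"]})

# Hypothetical population shares taken from a more established source.
population_share = pd.Series({"a": 0.5, "b": 0.5})

# Share of each group among the actual respondents (a: 0.75, b: 0.25).
sample_share = sample["group"].value_counts(normalize=True)

# Each row gets weight = population share / sample share for its group.
sample["weight"] = sample["group"].map(population_share / sample_share)

# After weighting, each group's share of total weight matches the population.
weighted_share = sample.groupby("group")["weight"].sum() / sample["weight"].sum()
print(weighted_share)
```

Notice that the number of rows stays at four. Only the amount each row counts for changes, which matters for the clustering discussion later in this section.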

Another confusing aspect is that some HPS data has individual household and person weights as single columns in the primary PUF while others only have person weights. Some have replicates for household and person weights separated out into 80 replicates each in different files, while some have them together -- I do not actually think having them in the same or separate files on its own is confusing, but, with the variety of presentation of weights, it adds one more piece to have to interpret. Some university resources I found have researchers dividing by weights while Census Bureau docs hint at multiplying by weights, the latter being what I would expect.

After reading through the supplementary resources, I am still not sure if the Census Bureau has one standard way to use weightings across all survey data or if there are nuances for each survey's use of weights. My sense is that the process for adjusting to individual-scale population data is the same across surveys, but the household piece does not show up as often. I wish the Census Bureau had released something concrete and straightforward for those of us not versed in using their weights.

Looking in more detail at the data, each row in the primary PUF is one household response. One adult per household responds on behalf of that household. We can use the household weights to adjust for representation purposes. When looking at population-level statistics, we need to adjust the household responses to represent individual people. We do this by multiplying the household weight by the number of adults in the household, which gives the person weight described above. Any adults who are not the respondent assume the respondent's demographic and other characteristics. This is known not to be fully accurate, but it is a best effort in trying to turn the data around quickly.
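
Here is a small sketch of how I understand the relationship between household and person weights, following the definition in the weight list above (household weight times number of adults). The column names and values are toy placeholders, not the PUF's actual column names. It also shows weights being used multiplicatively in a weighted average, which is what I would expect based on the Census Bureau docs.

```python
import numpy as np
import pandas as pd

# Toy household-level responses; names and values are illustrative only.
households = pd.DataFrame({
    "household_weight": [100.0, 250.0, 50.0],
    "num_adults": [2, 1, 3],
    "anxious": [1, 0, 1],  # toy yes/no response from the responding adult
})

# Person weight as described above: household weight times number of adults.
households["person_weight"] = (
    households["household_weight"] * households["num_adults"]
)

# The same response summarized three ways.
unweighted = households["anxious"].mean()
household_level = np.average(
    households["anxious"], weights=households["household_weight"]
)
person_level = np.average(
    households["anxious"], weights=households["person_weight"]
)

print(unweighted, household_level, person_level)
```

The three numbers differ (about 0.667, 0.375, and 0.583 for this toy data), which is a reminder that the choice of weight is really a choice about which population the estimate describes.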

One item to bring up now is the distinction between clustering, dimensionality reduction, anomaly detection, and other unsupervised learning use cases compared to finding point estimates and associated variances, standard errors, and other measures for capturing uncertainty in those estimates. When working on point estimates, we use the replicate weights to simulate a much wider range of respondents and narrow in on more accurate calculations of uncertainty -- wider confidence intervals, as one example. This project will not focus on point estimates, so we can leave the replicate weights out.
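
For completeness, here is a rough sketch of how replicate weights are typically used for a variance estimate: recompute the estimate once per replicate weight and measure how much the replicate estimates spread around the full-sample estimate. The function name, the random toy data, and especially the `factor=4.0` scaling are my assumptions, based on my reading of the replicate weight guidance listed in the references. Check the technical documentation for the survey you are using before relying on that factor.

```python
import numpy as np

def replicate_variance(y, full_weight, replicate_weights, factor=4.0):
    """Weighted mean of y plus a replicate-based variance estimate.

    replicate_weights has shape (n_rows, n_replicates). The scaling
    factor is an assumption and should be checked against the survey docs.
    """
    y = np.asarray(y, dtype=float)
    rep = np.asarray(replicate_weights, dtype=float)

    # Estimate using the main (full-sample) weight.
    theta_0 = np.average(y, weights=full_weight)

    # The same estimate recomputed once per replicate weight column.
    theta_r = (rep * y[:, None]).sum(axis=0) / rep.sum(axis=0)

    # Spread of the replicate estimates around the full-sample estimate.
    variance = factor / rep.shape[1] * np.sum((theta_r - theta_0) ** 2)
    return theta_0, variance

# Toy data: 200 rows, 80 made-up replicate weights per row.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
full_weight = rng.uniform(50, 150, size=200)
replicate_weights = full_weight[:, None] * rng.uniform(0.5, 1.5, size=(200, 80))

estimate, variance = replicate_variance(y, full_weight, replicate_weights)
standard_error = np.sqrt(variance)
print(estimate, standard_error)
```

If every replicate weight equals the full-sample weight, the variance comes out as zero, which is a quick sanity check on the calculation.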

Here is where I start to struggle with the best way forward for weights. The decision centers on whether incorporating household or person weights will help with unsupervised learning. When we use weights, we adjust the values in each row, but we do not change the number of rows. When working with clustering, for example, adjusted feature values may change which cluster a row gets put in, but it does not increase the number of rows for underrepresented groups, so I am dubious about it revealing more accurate clusters, and I worry that it may actually make the situation worse by making it look like dissimilar households are actually part of the same cluster. If we were to take a strategy of replicating rows based on representation, then I could see new clusters starting to pop up around more groups. I will talk shortly about why I do not think we should pursue row resampling for the current project either.
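
A tiny sketch of the two options helps me see the difference. The feature matrix and the integer weights below are toy values, and `pairwise_distances` is a helper I define here for illustration. Scaling feature values by a weight changes the distances between rows but keeps the row count, while replicating rows keeps the original points but changes how many rows sit at each one.

```python
import numpy as np

# Toy feature matrix (3 rows, 2 features) and toy integer weights.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
weights = np.array([1, 3, 2])

def pairwise_distances(A):
    """Euclidean distance between every pair of rows in A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Option 1: multiply each row's feature values by its weight.
scaled = X * weights[:, None]

# Option 2: repeat each row according to its weight.
replicated = np.repeat(X, weights, axis=0)

print(X.shape, scaled.shape, replicated.shape)
print(pairwise_distances(X)[0, 1], pairwise_distances(scaled)[0, 1])
```

Option 1 moves rows around in feature space, which is the part that worries me for clustering. Option 2 leaves the geometry alone and changes density instead.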

One point to highlight from the last paragraph is that clustering, as one example, works against individual detailed rows, so it is doing something fundamentally different than point estimates. The point estimate tells us a summary piece of information about the population we are looking at. Clustering does provide us information in terms of optimal numbers of clusters, but I see it more as a step to understand groups in the data that we will later use as parts of larger analyses. One way we may use the trained clustering model is to pass in new HPS data to determine which group each row belongs to, possibly using clusters as a base unit for responses such as increased pandemic support.

Adjusting row values will change distances between rows for clustering, but it is not clear to me that that helps the situation for unsupervised learning. One worry is that by trying to incorporate weights when it is unclear if we should, we introduce a new problem on top of the original issue of underrepresentation. The burden should be on having a strong justification for adjusting with weights rather than the opposite. We know anyone working with the HPS data will need to determine how to understand the weights. We do not want to add an extra layer of confusion of adjusting with weights in arbitrary or incorrect ways that do not make a ton of sense and force others to have to untangle the reasoning there. 

Also, I still am not very clear on the correct way to apply the weights outside of some cases of using them for point estimation. It may be that I am leaning away from using the weights because I do not yet understand them enough, but, acknowledging that there is still confusion for me around them, I feel that the responsible decision is to leave the weights out for now. Future iterations of the project can incorporate them if and once ready.

If we leave the data at its original scale, one hypothesis I have is that we would be able to see underrepresented groups as outliers from clusters around overrepresented groups. Changing feature values may actually end up muddying the situation enough that it becomes harder to see the distinction between the groups in the base data, or at the least not improve it.

If we do not adjust the data with weights, then it means we will cluster or look for anomalies at the household scale, unadjusted for representation. I am comfortable with this.

In summary, and in being honest, the proper way to incorporate weights still feels murky to me. This is a key area to get domain expertise help from someone who is fluent with working with Census Bureau weighting. But, for now, this is my best understanding of the weights and my reasoning for not incorporating them in the current project. So, we will move forward without adjusting based on household or person weights, and we already talked about why we will not use replicate weights. But this is one of the main areas to work on further or get help from others in future iterations.

## Project

- Why unsupervised is useful here
- What unsupervised allows us to do that supervised does not
- Emphasize the speed and messiness of this data coming in compared to other surveys
- Talk about how we will account for representation issues through possibly anomaly detection or another technical approach
- Can use unsupervised output to then return to demographic data to try and provide more useful info
- DBSCAN or other density-based clustering comes to mind as particularly useful because it may be able to set aside some version of outlier collections as clusters that are underrepresented in the data (hypothesis to test)
- Similar for hierarchical clustering with intelligent cutoffs
- Is there a use case for matrix factorization
- How much are we relatively worried about outliers vs missing values etc
- Is there a use case for winsorizing
- Approach EDA as hypothesis exploration as part of it, not just poking around with common steps
- Maybe try isolation forests as a new algorithm from outside of class
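
As a first look at the density-based clustering hypothesis from the list above, here is a minimal sketch on synthetic data. It assumes scikit-learn is installed, and the `eps` and `min_samples` values are arbitrary choices for this toy example, not tuned settings for the HPS data. DBSCAN labels points that do not belong to any dense region as `-1`, which is the behavior I would want to test against underrepresented groups.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# One dense blob standing in for an overrepresented group.
dense = rng.normal(loc=0.0, scale=0.2, size=(100, 2))

# A few scattered points standing in for sparse, underrepresented rows.
sparse = np.array([[3.0, 3.0], [3.2, 2.9], [-3.0, 3.0]])

X = np.vstack([dense, sparse])

# Points with fewer than min_samples neighbors within eps, and not
# reachable from a dense region, get the noise label -1.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise_mask = labels == -1

print(f"clusters found: {len(set(labels) - {-1})}")
print(f"rows labeled as noise: {noise_mask.sum()}")
```

Whether noise points in the real data line up with underrepresented groups is exactly the hypothesis to test, so I would compare the noise rows back against the demographic columns.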

## Let's Get Some Data

We are going to focus on data from phase one of the HPS, running from the April 23 - May 5 cycle through the July 16 - July 21 cycle in 2020. The Census Bureau releases the data for each cycle as a compressed zip archive containing three files:
1. The PUF CSV data file
2. Replicate weights file
3. Data dictionary

We need to pull each of these zips, extract them, and combine the 12 PUFs. We do a little bit of column checking and dropping to make sure that the files line up correctly.

First is a code block that downloads the zip archives from the HPS website for when you need to grab the original archives.

In [11]:
# Sample URL
# 'https://www2.census.gov/programs-surveys/demo/datasets/hhp/2020/wk1/HPS_Week01_PUF_CSV.zip'

# Uncomment if you want to pull the source files from the Census Bureau website
# for i in range(1, 13):
#     r = requests.get(f'https://www2.census.gov/programs-surveys/demo/datasets/hhp/2020/wk{i}/HPS_Week{i:02d}_PUF_CSV.zip')
#     if r.ok:
#         with zipfile.ZipFile(io.BytesIO(r.content), 'r') as zipz:
#             zipz.extractall('./data/')
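
Before running the download, it can help to check the URLs without touching the network. This is a small sketch that builds the 12 weekly URLs from the sample URL pattern above. The `week_url` helper name is mine, and it assumes that all 12 weeks follow the same pattern as week 1.

```python
def week_url(week: int) -> str:
    """Build the PUF zip URL for a given week, zero-padding the file name."""
    return (
        "https://www2.census.gov/programs-surveys/demo/datasets/hhp/2020/"
        f"wk{week}/HPS_Week{week:02d}_PUF_CSV.zip"
    )

# URLs for weeks 1 through 12.
urls = [week_url(week) for week in range(1, 13)]

for url in urls:
    print(url)
```

The `:02d` format spec pads single-digit weeks with a leading zero, so week 1 becomes `Week01` while the folder name stays `wk1`.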

In [4]:
df_01 = pd.read_csv('data/pulse2020_puf_01.csv')
df_02 = pd.read_csv('data/pulse2020_puf_02.csv')
df_03 = pd.read_csv('data/pulse2020_puf_03.csv')
df_04 = pd.read_csv('data/pulse2020_puf_04.csv')
df_05 = pd.read_csv('data/pulse2020_puf_05.csv')
df_06 = pd.read_csv('data/pulse2020_puf_06.csv')
df_07 = pd.read_csv('data/pulse2020_puf_07.csv')
df_08 = pd.read_csv('data/pulse2020_puf_08.csv')
df_09 = pd.read_csv('data/pulse2020_puf_09.csv')
df_10 = pd.read_csv('data/pulse2020_puf_10.csv')
df_11 = pd.read_csv('data/pulse2020_puf_11.csv')
df_12 = pd.read_csv('data/pulse2020_puf_12.csv')

Next we check out the columns. The first block shows that we have different numbers of columns across the 12 datasets. We also want to make sure that files with the same number of columns have the same column names. The second block tells us where we have different columns that we need to address. The third block combines all 12 datasets, dropping columns not present across all 12 as needed.

In [15]:
print(f'01: {len(df_01.columns)}')
print(f'02: {len(df_02.columns)}')
print(f'03: {len(df_03.columns)}')
print(f'04: {len(df_04.columns)}')
print(f'05: {len(df_05.columns)}')
print(f'06: {len(df_06.columns)}')
print(f'07: {len(df_07.columns)}')
print(f'08: {len(df_08.columns)}')
print(f'09: {len(df_09.columns)}')
print(f'10: {len(df_10.columns)}')
print(f'11: {len(df_11.columns)}')
print(f'12: {len(df_12.columns)}')

01: 82
02: 82
03: 82
04: 82
05: 82
06: 84
07: 105
08: 105
09: 105
10: 105
11: 105
12: 105


In [70]:
# Takes in two lists of columns and returns the columns in each list that are
# not present in the other, returning an empty set for both when they have
# the same columns
def what_differences(cols1, cols2):

    # We need to check both directions separately since we are using difference()
    diff1 = set(cols1).difference(set(cols2))
    diff2 = set(cols2).difference(set(cols1))
    
    return diff1, diff2
    
print(f'01 02: {what_differences(df_01.columns, df_02.columns)}')
print(f'01 03: {what_differences(df_01.columns, df_03.columns)}')
print(f'01 04: {what_differences(df_01.columns, df_04.columns)}')
print(f'01 05: {what_differences(df_01.columns, df_05.columns)}')
print(f'01 06: {what_differences(df_01.columns, df_06.columns)}')
print(f'06 07: {what_differences(df_06.columns, df_07.columns)}')
print(f'07 08: {what_differences(df_07.columns, df_08.columns)}')
print(f'07 09: {what_differences(df_07.columns, df_09.columns)}')
print(f'07 10: {what_differences(df_07.columns, df_10.columns)}')
print(f'07 11: {what_differences(df_07.columns, df_11.columns)}')
print(f'07 12: {what_differences(df_07.columns, df_12.columns)}')

01 02: (set(), set())
01 03: (set(), set())
01 04: (set(), set())
01 05: (set(), set())
01 06: (set(), {'TSTDY_HRS', 'CHILDFOOD'})
06 07: (set(), {'EIPSPND2', 'EIPSPND6', 'SPNDSRC1', 'EIPSPND4', 'EIPSPND11', 'SPNDSRC2', 'EIPSPND10', 'SPNDSRC3', 'EIPSPND8', 'SPNDSRC6', 'EIPSPND13', 'EIPSPND12', 'EIP', 'SPNDSRC4', 'EIPSPND9', 'EIPSPND1', 'SPNDSRC5', 'EIPSPND5', 'EIPSPND3', 'EIPSPND7', 'SPNDSRC7'})
07 08: (set(), set())
07 09: (set(), set())
07 10: (set(), set())
07 11: (set(), set())
07 12: (set(), set())


In [72]:
# Columns introduced in week 6 that are not present in weeks 1-5
cols_added_wk06 = ['TSTDY_HRS', 'CHILDFOOD']

# Additional columns introduced in week 7
cols_added_wk07 = [
    'EIPSPND2', 'EIPSPND6', 'SPNDSRC1', 'EIPSPND4', 'EIPSPND11', 'SPNDSRC2'
    , 'EIPSPND10', 'SPNDSRC3', 'EIPSPND8', 'SPNDSRC6', 'EIPSPND13', 'EIPSPND12'
    , 'EIP', 'SPNDSRC4', 'EIPSPND9', 'EIPSPND1', 'SPNDSRC5', 'EIPSPND5'
    , 'EIPSPND3', 'EIPSPND7', 'SPNDSRC7'
]

# Drop the later additions so every frame has the same columns as week 1
df_combined = pd.concat(
    [df_01, df_02, df_03, df_04, df_05]
    + [df_06.drop(columns=cols_added_wk06)]
    + [
        df.drop(columns=cols_added_wk06 + cols_added_wk07)
        for df in [df_07, df_08, df_09, df_10, df_11, df_12]
    ]
)

# Both sets should be empty if the combined columns match week 1
print(what_differences(df_combined.columns, df_01.columns))
print(df_combined.shape)

(set(), set())
(1088314, 82)


We now have one unified dataset with 1,088,314 rows and 82 columns. We will now pivot to the dictionary for the first cycle -- the first cycle because we removed any columns not in the first cycle -- to determine what data we have available and what format it is in.
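
As an aside, here is a sketch of a way to get the same "keep only the shared columns" result without listing the columns to drop by hand. It uses toy frames rather than the PUFs, and relies on the `join="inner"` option of `pd.concat`, which keeps only the columns present in every frame.

```python
import pandas as pd

# Toy frames standing in for cycles whose column sets grow over time.
frames = [
    pd.DataFrame({"A": [1], "B": [2]}),
    pd.DataFrame({"A": [3], "B": [4], "C": [5]}),
    pd.DataFrame({"A": [6], "B": [7], "C": [8], "D": [9]}),
]

# join="inner" keeps only columns shared by all frames;
# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(frames, join="inner", ignore_index=True)

print(combined.shape)
print(list(combined.columns))
```

The explicit drop lists used above have the advantage of documenting exactly which columns each cycle added, so I see this as a cross-check rather than a replacement.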

## Feature Summary

- Which phase(s) are we using and why?
- Are we looking to limit the feature space?
- What type of preprocessing do we expect to do
- Talk about what data comes with the PUFs
- PUFs vs data tables
- Mostly categorical variables that have already been pivoted it looks like
- Maybe use PCA first and then feed into other unsupervised algorithms to test for high correlation in responses
- Because of this complexity, we will want to avoid central tendency imputation or other imputation methods since we do not know if it is accurate to use those across the dataset as a whole, and we are not sure how to subset yet because that is part of what we are determining with this project
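
To sketch the PCA idea from the list above, here is a small example on made-up categorical responses. The question names and values are placeholders, and I use a plain NumPy SVD on the centered one-hot indicators so the example does not depend on any particular PCA implementation. When two questions always get the same answer, they add no new direction of variation, which shows up as fewer non-zero components.

```python
import numpy as np
import pandas as pd

# Toy categorical responses; q2 duplicates q1, so they are fully correlated.
responses = pd.DataFrame({
    "q1": [1, 2, 1, 2, 1, 2],
    "q2": [1, 2, 1, 2, 1, 2],
    "q3": [1, 1, 2, 2, 3, 3],
})

# One-hot encode each answer option, then center each indicator column.
encoded = pd.get_dummies(responses.astype("category")).to_numpy(dtype=float)
centered = encoded - encoded.mean(axis=0)

# Squared singular values are proportional to variance per component.
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)

print(np.round(explained, 3))
```

Seven indicator columns collapse to three components with non-zero variance here, one for the q1/q2 pair and two for q3. On the real data I would look for a similar drop-off to judge how much the responses overlap.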

## References

Household Pulse Survey technical documentation: 
- https://www.census.gov/programs-surveys/household-pulse-survey/technical-documentation.html

Household Pulse Survey page with example usages:
- https://www.census.gov/programs-surveys/household-pulse-survey/library/working-papers.html

Info about survey weights:
- https://www.pewresearch.org/methods/2018/01/26/how-different-weighting-methods-work/
- https://pages.nyu.edu/jackson/design.of.social.research/Readings/Johnson%20-%20Introduction%20to%20survey%20weights%20%28PRI%20version%29.pdf
- https://analythical.com/blog/weighting-data-explained

Replicate weights for the Census Current Population Survey (CPS):
- https://cps.ipums.org/cps/repwt.shtml

Decision tree for using Census Bureau Survey of Income and Program Participation (SIPP) weights:
- https://www2.census.gov/programs-surveys/sipp/Select_approp_wgt_2014SIPPpanel.pdf
 
R example of working with HPS data and replicate weights:
- https://www.adambibler.com/post/exploring-census-household-pulse-survey-part-1/

Articles about using HPS data and weights:
- https://www.jchs.harvard.edu/blog/using-the-census-bureaus-household-pulse-survey-to-assess-the-economic-impacts-of-covid-19-on-americas-households
- https://www.bgsu.edu/content/dam/BGSU/college-of-arts-and-sciences/center-for-family-and-demographic-research/documents/Workshops/2020-Household-Pulse-Survey.pdf

Supplementary PDFs for HPS and Census data:
- https://www2.census.gov/programs-surveys/demo/technical-documentation/hhp/2020_HPS_Background.pdf
- https://www2.census.gov/programs-surveys/demo/technical-documentation/hhp/2020_HPS_NR_Bias_Report-final.pdf
- https://www.census.gov/content/dam/Census/library/publications/2010/acs/Chapter_11_RevisedDec2010.pdf
- https://www.census.gov/content/dam/Census/programs-surveys/ahs/tech-documentation/2015/Quick_Guide_to_Estimating_Variance_Using_Replicate_Weights_2009_to_Current.pdf
- https://www2.census.gov/programs-surveys/cps/datasets/2018/supp/PERSON-level_Use_of_the_Public_Use_Replicate_Weight_File.doc
- https://www2.census.gov/programs-surveys/cps/datasets/2021/march/Guidance_on_Using_Replicate_Weights_2020-2021.pdf

Video presentations that include chunks about the HPS or other Census data: 
- https://www.youtube.com/watch?v=ltyT34S3C90
- https://www.youtube.com/watch?v=lLk6esuBI6M
- https://www.youtube.com/watch?v=aJDQmsmCv7A
- https://www.youtube.com/watch?v=zfXohiWjzzY
- https://www.youtube.com/watch?v=5Re4D1Ht74k