In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Source

United States Census Bureau. *Household Pulse Survey: Measuring Emergent Social and Economic Matters Facing U.S. Households*. https://www.census.gov/programs-surveys/household-pulse-survey.html

Fields JF, Hunter-Childs J, Tersine A, Sisson J, Parker E, Velkoff V, Logan C, and Shin H (2020). *Design and Operation of the 2020 Household Pulse Survey*. U.S. Census Bureau. https://www2.census.gov/programs-surveys/demo/technical-documentation/hhp/2020_HPS_Background.pdf

The first citation is for the main Household Pulse Survey website. We are working with public use files at https://www.census.gov/programs-surveys/household-pulse-survey/data/datasets.html. The second citation is for a reference paper for understanding the larger context around the data we are working with.

I originally found an interesting COVID-19-related mental health dataset at https://catalog.data.gov/dataset/mental-health-care-in-the-last-4-weeks. That page points to a related Centers for Disease Control and Prevention (CDC) page at https://www.cdc.gov/nchs/covid19/pulse/mental-health-care.htm. Both of these datasets are aggregated, but we will want disaggregated data for unsupervised learning. A link at the bottom of the CDC page leads back to the Census page and the original source(s), including disaggregated data.

## Survey and Survey Data Summary

I started this project by looking at candidate datasets. When I came across the mental health dataset at https://data.gov, I figured that data would be relatively straightforward to work with. Little did I know just how much complexity goes into surveys, survey data, and Census data. I will summarize what I learned about how the specific survey in this project works, as well as how similar surveys work.

Note: much of my summary in this section comes from the main Household Pulse Survey link cited above as well as this sub-page: https://www.census.gov/data/experimental-data-products/household-pulse-survey.html

The United States Census Bureau is a primary hub for data about the American citizenry and economy. Before digging into this project, I was aware of the United States Census that happens every ten years. Beyond that official census, I did not know what else the Census Bureau gathered, though I was aware that it was involved in other data work.

The Census Bureau started the Household Pulse Survey (HPS) in 2020, in partnership with other federal agencies, to measure the impact of the COVID-19 pandemic. The HPS gathers data about American households on a much faster timeline than other surveys available at the time. It originally surveyed respondents weekly, then moved to a biweekly and eventually a monthly schedule. The HPS gathers and releases data in phases that typically run for around nine weeks, with multiple collections per phase. From my research, this is a very fast gathering and reporting cycle compared to many other common surveys that I looked into. One consequence of that speed is that representation among survey respondents becomes more difficult to achieve. We will speak about that more in the weighting section below.

The HPS allows analysis at the national level, at the state level, and for the 15 largest metropolitan statistical areas (MSAs). You can find more information about MSAs at https://www.census.gov/programs-surveys/metro-micro.html. My colloquial summary is that they are areas with high population density and certain characteristics that make them population and economic hubs in their respective regions.

The HPS is a 20-minute survey that asks about demographic characteristics as well as a number of relevant topics such as:
- Childcare arrangements
- Food sufficiency
- Housing security
- Household spending
- Physical and mental health
- Health insurance coverage
- Social isolation

You can see a more complete list of topics at https://www.census.gov/data/experimental-data-products/household-pulse-survey.html in the section titled **What information does the Household Pulse Survey collect?**

The scope of what the HPS studies is broad, and the topics cover areas where people struggled during the COVID-19 pandemic and continue to struggle now.

This project will focus on the earlier surveys from when the HPS started in 2020. The HPS continues today and has expanded in scope over time.

We will talk more about this below, but the general approach in this project will be to explore clustering, anomaly/outlier detection, and feature reduction. Each of these prefers or requires disaggregated data, so we will work with the HPS Public Use Files (PUFs), which contain respondent-level data, instead of the data tables. The Census Bureau also provides data tables with re-weighting and aggregation already done so that researchers can focus on trends and findings rather than on processing the disaggregated data. Those are available if of interest.

This is a quick summary of my current understanding of surveys, Census Bureau data, and the HPS. It took a lot of research to get to this level of understanding. I call this out because there is one more aspect of the HPS data that has been even more difficult to make sense of and to determine how to incorporate into this project: weighting.

## Survey Weights

- Lots of work needed to understand specific weightings
- Person vs household vs replicate weights
- Examples of when to use weights vs when to not
- Talk about how adjusting with weights might actually reduce performance for this project
- This is one of the main areas to work on further or get help from domain experts for future iterations
- While unsure here, we can still move forward in a way that defaults to leaving things as they are (unweighted) until sure that you need to adjust based on weights
- Because of this complexity, we will want to avoid central tendency imputation or other imputation methods since we do not know if it is accurate to use those across the dataset as a whole, and we are not sure how to subset yet because that is part of what we are determining with this project
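
To make the weighted vs unweighted question more concrete, here is a minimal sketch on synthetic data. The column names (`score`, `PWEIGHT`, `PWEIGHT1` through `PWEIGHT80`) are my assumptions about how a person weight and its replicate weights might be labeled, and the `factor=4.0` default reflects my current understanding of Census guidance for replicate-weight variance. Both should be checked against the HPS technical documentation before use on real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, n_reps = 500, 80

# Synthetic respondent-level data; column names are assumptions, not real PUF data.
df = pd.DataFrame({
    "score": rng.integers(0, 4, size=n),        # e.g. a 0-3 survey response
    "PWEIGHT": rng.uniform(200, 5000, size=n),  # hypothetical person weight
})

# Synthetic replicate weights: the base weight times a random perturbation.
rep_cols = [f"PWEIGHT{i}" for i in range(1, n_reps + 1)]
reps = df["PWEIGHT"].to_numpy()[:, None] * rng.uniform(0.5, 1.5, size=(n, n_reps))
df = pd.concat([df, pd.DataFrame(reps, columns=rep_cols)], axis=1)


def weighted_mean(values, weights):
    """Mean of `values` where each respondent counts in proportion to its weight."""
    return float(np.average(values, weights=weights))


def replicate_se(data, value_col, weight_col, replicate_cols, factor=4.0):
    """Standard error from replicate weights.

    Assumed formula: variance = factor / R * sum((theta_r - theta)^2),
    where theta is the full-sample estimate and theta_r uses replicate r.
    """
    full = weighted_mean(data[value_col], data[weight_col])
    rep_est = np.array(
        [weighted_mean(data[value_col], data[c]) for c in replicate_cols]
    )
    variance = factor / len(replicate_cols) * np.sum((rep_est - full) ** 2)
    return float(np.sqrt(variance))


unweighted = float(df["score"].mean())
weighted = weighted_mean(df["score"], df["PWEIGHT"])
se = replicate_se(df, "score", "PWEIGHT", rep_cols)

print(f"unweighted mean: {unweighted:.3f}")
print(f"weighted mean:   {weighted:.3f} (replicate SE: {se:.3f})")
```

Comparing the two means on the real data would be one way to see how much the weights matter for a given feature before deciding whether to carry them into the unsupervised steps.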

## Data Summary

- Which phase(s) are we using and why?
- Are we looking to limit the feature space?
- What type of preprocessing do we expect to do?
- Talk about what data comes with the PUFs
- PUFs vs data tables
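
As one example of the kind of preprocessing I expect, survey files often encode non-response with sentinel values rather than blanks. The sketch below assumes that `-88` and `-99` are the missing-value codes and that `ANXIOUS`, `WORRY`, and `DOWN` are column names. These are assumptions to confirm against the PUF data dictionary, and the data here is made up. The sketch only marks values as missing and does not impute anything.

```python
import numpy as np
import pandas as pd

# Assumed sentinel codes for missing responses; confirm in the data dictionary.
MISSING_CODES = [-88, -99]

# Made-up responses using hypothetical column names.
raw = pd.DataFrame({
    "ANXIOUS": [1, 4, -99, 2, 3],
    "WORRY": [2, -88, 1, 4, -99],
    "DOWN": [1, 1, 2, -88, 3],
})

# Convert sentinel codes to NaN so they are not treated as real answers.
clean = raw.replace(MISSING_CODES, np.nan)

# Share of missing values per column, useful for deciding which features to keep.
missing_share = clean.isna().mean()
print(missing_share)
```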

## Project

- Why unsupervised is useful here
- What unsupervised allows us to do that supervised does not
- Emphasize the speed and messiness of this data coming in compared to other surveys
- Talk about how we will account for representation issues through possibly anomaly detection or another technical approach
- Can use unsupervised output to then return to demographic data to try and provide more useful info
- DBSCAN or other density-based clustering comes to mind as particularly useful because it may be able to set aside some version of outlier collections as clusters that are underrepresented in the data (hypothesis to test)
- Similar for hierarchical clustering with intelligent cutoffs
- Is there a use case for matrix factorization?
- How much are we relatively worried about outliers vs missing values etc.?
- Is there a use case for winsorizing?
- Approach EDA as hypothesis exploration as part of it, not just poking around with common steps
- Maybe try isolation forests as a new algorithm from outside of class
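
To illustrate the density-based clustering and isolation forest ideas above, here is a minimal sketch on synthetic two-dimensional data using scikit-learn. The data, the `eps`, `min_samples`, and `contamination` values are all placeholder assumptions chosen for illustration. They are not tuned for the HPS data and would need to be revisited there.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two dense synthetic groups plus a handful of scattered points that stand in
# for respondents who are sparsely represented in the data.
dense_a = rng.normal(loc=[0, 0], scale=0.3, size=(200, 2))
dense_b = rng.normal(loc=[5, 5], scale=0.3, size=(200, 2))
sparse = rng.uniform(low=-3, high=8, size=(15, 2))
X = StandardScaler().fit_transform(np.vstack([dense_a, dense_b, sparse]))

# DBSCAN labels points in low-density regions as -1 (noise) instead of
# forcing them into a cluster.
db_labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

# IsolationForest returns -1 for points it scores as anomalous, 1 otherwise.
iso_labels = IsolationForest(contamination=0.05, random_state=42).fit_predict(X)

n_clusters = len(set(db_labels) - {-1})
n_noise = int((db_labels == -1).sum())
n_flagged = int((iso_labels == -1).sum())
both = int(((db_labels == -1) & (iso_labels == -1)).sum())

print(f"DBSCAN clusters: {n_clusters}, noise points: {n_noise}")
print(f"IsolationForest flagged: {n_flagged}, flagged by both: {both}")
```

Points flagged by both methods could be a starting place for going back to the demographic columns and checking whether they correspond to underrepresented groups, which is the hypothesis to test.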

## References

Household Pulse Survey technical documentation: 
- https://www.census.gov/programs-surveys/household-pulse-survey/technical-documentation.html

Household Pulse Survey page with example usages:
- https://www.census.gov/programs-surveys/household-pulse-survey/library/working-papers.html

Info about survey weights:
- https://www.pewresearch.org/methods/2018/01/26/how-different-weighting-methods-work/
- https://pages.nyu.edu/jackson/design.of.social.research/Readings/Johnson%20-%20Introduction%20to%20survey%20weights%20%28PRI%20version%29.pdf
- https://analythical.com/blog/weighting-data-explained

Replicate weights for the Census Current Population Survey (CPS):
- https://cps.ipums.org/cps/repwt.shtml

R example of working with HPS data and replicate weights:
- https://www.adambibler.com/post/exploring-census-household-pulse-survey-part-1/

Articles about using HPS data and weights:
- https://www.jchs.harvard.edu/blog/using-the-census-bureaus-household-pulse-survey-to-assess-the-economic-impacts-of-covid-19-on-americas-households
- https://www.bgsu.edu/content/dam/BGSU/college-of-arts-and-sciences/center-for-family-and-demographic-research/documents/Workshops/2020-Household-Pulse-Survey.pdf

Supplementary PDFs for HPS and Census data:
- https://www2.census.gov/programs-surveys/demo/technical-documentation/hhp/2020_HPS_Background.pdf
- https://www2.census.gov/programs-surveys/demo/technical-documentation/hhp/2020_HPS_NR_Bias_Report-final.pdf
- https://www.census.gov/content/dam/Census/library/publications/2010/acs/Chapter_11_RevisedDec2010.pdf
- https://www.census.gov/content/dam/Census/programs-surveys/ahs/tech-documentation/2015/Quick_Guide_to_Estimating_Variance_Using_Replicate_Weights_2009_to_Current.pdf
- https://www2.census.gov/programs-surveys/cps/datasets/2018/supp/PERSON-level_Use_of_the_Public_Use_Replicate_Weight_File.doc
- https://www2.census.gov/programs-surveys/cps/datasets/2021/march/Guidance_on_Using_Replicate_Weights_2020-2021.pdf

Video presentations that include chunks about the HPS or other Census data: 
- https://www.youtube.com/watch?v=ltyT34S3C90
- https://www.youtube.com/watch?v=lLk6esuBI6M
- https://www.youtube.com/watch?v=aJDQmsmCv7A
- https://www.youtube.com/watch?v=zfXohiWjzzY
- https://www.youtube.com/watch?v=5Re4D1Ht74k