# DSCI 521: Data Analysis and Interpretation

# Term Project

## Team Members

- Team member 1
    - Name: Andrew Appleton
    - Email: aa4498@drexel.edu    
    - Background and skills: 10 years experience in Health Sciences, Limited Python experience
- Team member 2
    - Name:  Layla Bouzoubaa
    - Email: lb3338@drexel.edu
    - Background and skills: MS in Public Health with a concentration in Biostatistics. Advanced R user, novice Pythonista, strong familiarity with the public health data science pipeline and research.
- Team member 3
    - Name:  Alicia Brandemarte
    - Email: amb847@drexel.edu
    - Background and skills: Healthcare Informatics, R, SQL, Business Intelligence.  Limited Python skills, seeking to expand Python Skills.
- Team member 4
    - Name: Stephan Dupoux
    - Email: sgd45@drexel.edu    
    - Background and skills: 5 years of machine learning experience and 5 years of Python programming, including building ETL pipelines and web-based applications.
- Team member 5
    - Name: Mary Lucas
    - Email: mml367@drexel.edu
    - Background and skills: Extensive healthcare clinical experience, some experience in SQL, Python, and R programming and machine learning. 
- Team member 6
    - Name: Shane Nelson
    - Email: sn888@drexel.edu
    - Background and skills: 2 years experience working at a clinical research organization. Somewhat limited Python experience however continuously seeking to improve skills.


# Introduction

The COVID-19 pandemic is an ongoing global crisis that has had a major impact on individuals and communities over the past two years. It has resulted in significant disruptions to our way of life and to the services and resources we depend on day to day.

The pandemic and its effects in terms of loss to life and livelihood have also made it more evident that there are disparities in access to care and resources across the globe based on geography, race, and other factors.
  
All code can be found on our [GitHub repository](https://github.com/labouz/covid_vulnerability/). The data is available for download from the links in the Data section.

### Motivating Questions to Address
The main questions that we hoped to address in this work were: 

  * the impact COVID-19 has had on hospital capacity in different regions of the country over time
  * the community burden of the COVID-19 pandemic
  * health disparities in the COVID-19 impact across different regions of the country



### Design of the Study

This study uses a descriptive and analytic design applied to time series data to understand and predict the hospitalization burden of the COVID-19 pandemic in different states. We implemented time series forecasting and modelling of hospital burden across the country.

The data was accessed from open data repositories of the federal government (healthdata.gov, data.gov, and census.gov), as well as supplemental sources such as the Kaiser Family Foundation. Analysis was conducted using standard time series forecasting, regression, and classification methods.

No costs were incurred and no human subjects were involved in the study.

## Data

The main data we used was obtained from the US Department of Health and Human Services (HHS) at healthdata.gov and is publicly available. The datasets are all non-identifiable and contain no protected health information.

- [COVID-19 Reported Patient Impact and Hospital Capacity by State Timeseries](https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/g62h-syeh)
- [COVID-19 Community Vulnerability Crosswalk - Rank Ordered by Score](https://catalog.data.gov/dataset/covid-19-community-vulnerability-crosswalk-rank-ordered-by-score)

Supplementary data on population metrics, socioeconomic status, and hospital counts in different states was obtained from other sources and will be referenced in the relevant sections of this report.


### Structure
The hospital impact and capacity timeseries data used is provided in various formats including CSV, TSV, GEOJSON, XML, and more.  It is updated on a daily basis and can be downloaded via the Socrata API or by direct export from the healthdata.gov link provided. 

As of the last download of the final dataset for this project, the file contained 39,701 rows and 117 columns, with each row representing a daily state-aggregated report. Because of API limitations we opted to download a CSV file of the data for our work.
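A minimal sketch of loading the exported CSV with pandas is shown here. The `date` and `state` column names are assumptions about the export and are not confirmed in this report, and the in-memory sample stands in for the real file; adjust both to match the actual download.

```python
import io

import pandas as pd


def load_timeseries(source):
    """Read the state-level time series export and sort it by state and date.

    `source` can be a file path or a file-like object. The `date` and `state`
    column names are assumed here, not taken from the project code.
    """
    df = pd.read_csv(source, parse_dates=["date"])
    return df.sort_values(["state", "date"]).reset_index(drop=True)


# Tiny in-memory stand-in for the downloaded file (values are made up).
sample_csv = io.StringIO(
    "state,date,adult_icu_bed_utilization\n"
    "PA,2020-03-02,0.61\n"
    "NY,2020-03-01,0.72\n"
    "PA,2020-03-01,0.58\n"
)
sample = load_timeseries(sample_csv)
print(sample.shape)  # (rows, columns) of the loaded frame
```

Parsing the date column up front makes later date-based filtering and splitting simpler.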

## Preamble 

To make the individual files more manageable, we have divided our work across two notebooks. This notebook contains the written report and embedded images of visualizations generated from our analysis and modelling.

The code needed to reproduce our work is contained in a separate notebook, project_code.ipynb.


In [7]:
# code to allow for visualizations
import warnings
warnings.filterwarnings('ignore')
import IPython
from IPython.display import Image

#### Exploratory Data Analysis


In [5]:
url = 'https://marymlucas.github.io/adult-icu_utilization/'
# quote the attribute values so the URL and percent width are parsed reliably
iframe = f'<iframe src="{url}" width="100%" height="600"></iframe>'
IPython.display.HTML(iframe)

### Methods


### Data Preprocessing
We undertook various steps to prepare our data. These are outlined below.

#### Feature Selection
After an initial data exploration phase, we selected 35 of the 117 variables to retain for our analysis. This selection was based on an examination of variable collinearity and on domain knowledge. We noticed that many of the highly correlated variables were derived from each other. For example, there were features defined as "numerator" and "denominator" which were not relevant to our analysis. Our goal was to analyze the impact on hospital capacity and state-wide burden, so many features describing COVID-19 diagnosis (suspected and confirmed) were dropped because they were not critical to our analysis. In addition, many features focused on pediatric measures that were not applicable given that our ultimate target variable was adult ICU bed utilization.
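One way to surface collinear features is to list column pairs whose absolute Pearson correlation exceeds a cutoff. The sketch below is illustrative only: the 0.9 threshold, the helper name `highly_correlated_pairs`, and the demo columns are assumptions, not the procedure used in the project code.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |r|) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is listed once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs >= threshold].sort_values(ascending=False)
    return [(a, b, round(float(r), 3)) for (a, b), r in pairs.items()]


# Made-up demo data: the second column is nearly a copy of the first.
demo = pd.DataFrame({
    "icu_beds_used": [10, 20, 30, 40, 50],
    "icu_beds_used_numerator": [11, 19, 31, 41, 49],
    "staff_shortage": [3, 1, 4, 1, 5],
})
pairs = highly_correlated_pairs(demo)
print(pairs)
```

Pairs flagged this way still need a domain-knowledge review to decide which member of each pair to keep.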

#### Determining Under-Reporting
To make our analysis meaningful, we decided to remove states that had high levels of under-reporting. For each metric in each row (one day of data reported by a state), there is a corresponding "coverage" variable (denoted as *metric_coverage*). This coverage feature reports how many hospitals in the state submitted information about that metric on that day. For example, for our main feature of interest (adult_icu_bed_utilization), the *adult_icu_bed_utilization_coverage* feature indicates how many hospitals in the state reported their adult ICU bed utilization that day.

By comparing the coverage with the number of hospitals in each state we were able to derive a simple measure of the level of reporting for each state.
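As a sketch of this comparison, the coverage count can be divided by the state's hospital count to give a daily reporting ratio. The `total_hospitals` and `reporting_ratio` column names, the helper name, and all values below are hypothetical.

```python
import pandas as pd


def add_reporting_ratio(daily, hospitals,
                        coverage_col="adult_icu_bed_utilization_coverage"):
    """Attach each state's hospital count and compute coverage / hospital count."""
    merged = daily.merge(hospitals, on="state", how="left")
    merged["reporting_ratio"] = merged[coverage_col] / merged["total_hospitals"]
    return merged


# Made-up daily coverage counts and hospital totals.
daily = pd.DataFrame({
    "state": ["PA", "PA", "RI"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01"]),
    "adult_icu_bed_utilization_coverage": [180, 150, 6],
})
hospitals = pd.DataFrame({"state": ["PA", "RI"], "total_hospitals": [200, 12]})

ratios = add_reporting_ratio(daily, hospitals)
print(ratios[["state", "date", "reporting_ratio"]])
```

A ratio above 1 is possible when the reference hospital count omits some facility types.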

The most current list of hospital numbers was obtained from the Kaiser Family Foundation website (https://www.kff.org/other/state-indicator/total-hospitals) which reports data from the American Hospital Association (1999 - 2019) Annual Survey. 

It's important to note the disclaimer that "Data are for community hospitals, which represent 85% of all hospitals" where community hospitals are defined as "all nonfederal, short-term general, and specialty hospitals whose facilities and services are available to the public."  Because of this, we expected that for states with high reporting we may have instances of greater than 100% reporting, which we decided was acceptable for our analysis.

In selecting a threshold, we considered the HHS guidelines for hospitals on how to report their numbers (https://www.hhs.gov/sites/default/files/covid-19-faqs-hospitals-hospital-laboratory-acute-care-facility-data-reporting.pdf). While this is a recent update, we noted that the changes from previous requirements did not affect the requirement to report or the frequency of reporting, only which fields are reported (https://healthdata.gov/stories/s/COVID-19-Reporting-and-FAQS/kjst-g9cm/). Based on this and on exploration of the data, we defined under-reporting as "greater than 25% of the hospitals not reporting adult_icu_bed_utilization." After running our code, we removed the states of Rhode Island and New York, as they had the largest number of under-reporting days over the two-year period in our analysis. In addition, a boxplot of total days reported per state showed that the US territories of the Virgin Islands and American Samoa were outliers with very few reporting days, so we dropped the territories from our analysis.
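The under-reporting rule can be sketched as a per-state count of days on which more than 25% of hospitals did not report. This assumes a hypothetical `reporting_ratio` column holding the coverage count divided by the state's hospital count; the helper name and values are illustrative.

```python
import pandas as pd


def underreporting_days(df, ratio_col="reporting_ratio", max_missing=0.25):
    """Count, per state, the days on which more than `max_missing` of hospitals did not report."""
    flagged = (1 - df[ratio_col]) > max_missing
    return flagged.groupby(df["state"]).sum().sort_values(ascending=False)


# Made-up reporting ratios for two states over three days each.
reports = pd.DataFrame({
    "state": ["PA", "PA", "PA", "RI", "RI", "RI"],
    "reporting_ratio": [0.95, 0.80, 0.70, 0.50, 0.60, 0.90],
})
days = underreporting_days(reports)
print(days)
```

Sorting the counts in descending order puts the states with the most under-reporting days first, which is how candidates for removal would be identified.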

#### Calculating Burden
We then calculated a burden metric for each state on each day. To calculate burden, we used the criteria set by the US Department of Health and Human Services. According to French et al. (2021), “HHS has studied the relationship between hospital bed use and hospital strain and has identified occupancy >80% as an indicator of a strained condition. This analysis uses a continuous measure of ICU bed occupancy as a proxy for hospital strain, such that greater amounts of ICU bed use indicate larger amounts of hospital strain.”

We determined from our survey of the relevant literature that the 80% threshold is a good, validated metric for hospital strain that we could use as an indicator in our analysis.

The burden metric is a binary 1/0 value that is 1 if the adult ICU bed utilization is greater than 0.8, and 0 otherwise.
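A minimal sketch of this rule in pandas follows. The helper name `add_burden` and the demo values are illustrative; the column name `adult_icu_bed_utilization` is the feature named earlier in this report.

```python
import pandas as pd

BURDEN_THRESHOLD = 0.8  # occupancy above 80% indicates strain (French et al., 2021)


def add_burden(df, util_col="adult_icu_bed_utilization"):
    """Add a binary `burden` column: 1 if utilization > 0.8, else 0."""
    out = df.copy()
    # Missing utilization compares as False here and therefore maps to 0.
    out["burden"] = (out[util_col] > BURDEN_THRESHOLD).astype(int)
    return out


# Made-up utilization values around the threshold.
demo = pd.DataFrame({
    "state": ["PA", "PA", "NJ"],
    "adult_icu_bed_utilization": [0.79, 0.80, 0.86],
})
flagged = add_burden(demo)
print(flagged["burden"].tolist())  # [0, 0, 1]
```

Because the comparison is strict, a utilization of exactly 0.8 is not counted as burdened. Rows with missing utilization default to 0 in this sketch and may deserve separate handling.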

### Analysis and Modelling


### Time Series Analysis - Simple Moving Average
.....

### Regression and Classification for Predicting Burden
-----

### Time Series Forecasting for Predicting Trends in Utilization
The goal here was to use standard forecasting methods to predict trends in ICU bed utilization during the pandemic. We chose to focus on Pennsylvania as this is our state of residence and where Drexel University is located. 

The data for PA was extracted and split into train and test subsets based on date, with the first 18 months (March 1, 2020 - September 1, 2021) used as the training set and the last 6 months (September 1, 2021 - March 1, 2022) as the test set. We then ran three different forecasting models on the data. For simplicity, no adjustments (e.g. for stationarity) were made to either the training or the testing data.
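A date-based split like this can be sketched as follows. The helper name and the constant placeholder series are illustrative, and assigning the cutoff day (September 1, 2021) to the test set is an assumption, since the report lists that date in both ranges.

```python
import pandas as pd


def split_by_date(series, cutoff):
    """Split a date-indexed series into train (before cutoff) and test (on or after cutoff)."""
    cutoff = pd.Timestamp(cutoff)
    train = series[series.index < cutoff]
    test = series[series.index >= cutoff]
    return train, test


# Placeholder daily series covering the study window (constant made-up values).
dates = pd.date_range("2020-03-01", "2022-03-01", freq="D")
pa_util = pd.Series(0.7, index=dates, name="adult_icu_bed_utilization")

train, test = split_by_date(pa_util, "2021-09-01")
print(len(train), len(test))
```

Splitting on a date rather than at random keeps the test period strictly after the training period, which is what a forecasting evaluation requires.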

The three models tested are described below:
- Autoregressive Moving Average (ARMA)
    - does not capture seasonal patterns
    - assumes that the time series is stationary, meaning that its statistical properties (such as mean and variance) do not change over time
- Autoregressive Integrated Moving Average (ARIMA)
    - does not require stationarity; it differences the series to remove trends
    - assumes the data exhibits little to no seasonality
- Seasonal ARIMA (SARIMA)
    - is a variant of ARIMA
    - can work with non-stationary data
    - can capture seasonality through additional seasonal terms
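
For reference, the three model families can be written compactly in standard backshift notation. The symbols are the usual textbook ones and are not taken from the project code: $y_t$ is the series, $B$ is the backshift operator ($B y_t = y_{t-1}$), $\varepsilon_t$ is white noise, $c$ is a constant, $d$ and $D$ are the non-seasonal and seasonal differencing orders, and $s$ is the seasonal period.

$$
\begin{aligned}
\text{ARMA}(p,q):\quad & \phi(B)\, y_t = c + \theta(B)\,\varepsilon_t \\
\text{ARIMA}(p,d,q):\quad & \phi(B)\,(1-B)^d\, y_t = c + \theta(B)\,\varepsilon_t \\
\text{SARIMA}(p,d,q)(P,D,Q)_s:\quad & \phi(B)\,\Phi(B^s)\,(1-B)^d\,(1-B^s)^D\, y_t = c + \theta(B)\,\Theta(B^s)\,\varepsilon_t
\end{aligned}
$$

Here $\phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^i$ and $\theta(B) = 1 + \sum_{j=1}^{q} \theta_j B^j$, while $\Phi$ and $\Theta$ are the analogous seasonal polynomials in $B^s$. Setting $d = 0$ in ARIMA recovers ARMA, and dropping the seasonal terms from SARIMA recovers ARIMA.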

The performance of each model was evaluated by calculating the root mean square error, and we also visualized the predicted trends compared to the actual trends.
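The root mean square error is the square root of the mean squared difference between actual and predicted values. A minimal sketch with NumPy follows; the sample numbers are made up and are not results from this project.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up utilization values and forecasts for four days.
actual = [0.70, 0.75, 0.80, 0.85]
forecast = [0.72, 0.73, 0.84, 0.85]
score = rmse(actual, forecast)
print(round(score, 4))
```

Because RMSE is in the same units as the target, a value here reads directly as a typical error in utilization fraction.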


## Results


### Time Series Analysis - Simple Moving Average

...


### Regression and Classification Modeling

...

### Time Series Forecasting - ARMA, ARIMA, SARIMA

![Train/Test](images/TS-train-test.png)


![ARMA](images/ARMA_PA.png)


![ARIMA](images/ARIMA_PA.png)


![SARIMA](images/SARIMA_PA.png)



### Disparity Analysis

For the disparity analysis, we were limited in which measures of disparity we could meaningfully use, given the varying granularity of our datasets. For example, most income and race data is reported at the county level (for example, by the Census Bureau), but our utilization time series data is at the state level. Because the counties in each state have different socioeconomic levels, it is difficult to aggregate these measures and analyse disparity at the state level in a meaningful and actionable way.

We were able to include the social deprivation index for each state in our data. This variable was used in the regression and classification modelling and proved to have a predictive effect on hospital burden.

We decided that one way to compare states as a way to assess disparity was to create visualizations that showed the population density, social deprivation index, COVID vulnerability measure (based on data that was last updated 11/2021), and the multidimensional deprivation index.  


![Multidimensional Deprivation Index](images/mdi.png)



![COVID Vulnerability](images/covid_vulnerability.png)



![Social Deprivation Index](images/sdi.png)



![Population Density](images/popn_density.png)



## Reflections on Strengths and Limitations

### Strengths:

**THIS SECTION NEEDS REFINING**

- A large amount of freely available data means this study can be extended and expanded on.
- The analysis is timely and of great relevance to various stakeholders.
- Tools such as online dashboards can be visualized and interpreted at a quick glance, which makes the results accessible to laypeople as well as policy makers.

### Limitations:

**THIS SECTION NEEDS UPDATING**

We had initially intended to explore our data primarily through the lens of health disparities, but we made the decision to defer this aspect to future work. The reasons behind this decision were threefold:


1. The data available to assess COVID vulnerability, as determined by the US Department of Health and Human Services, was not updated, and there was no data dictionary or robust description of how the assessment of vulnerability was done. We reached out to the data curators but did not receive a response.

2. Disparities are difficult to measure without a clear measure for ranking the different groups. Because the datasets we chose for time series forecasting only provide information at the state level, they were too broad for us to map socioeconomic data onto. In future, it would be interesting to analyse time series information at the county level and then use Census data to, for example, map socioeconomic indicators for the different counties. This would allow a more useful calculation and analysis of health disparities between groups.

3. There were time limitations that did not allow us enough latitude to explore the COVID vulnerability data and to continue reaching out to the data curators for additional details or updated data.  This would be interesting to pursue going forward and would add to our examination of disparities.



## Summary

## References

<div class="csl-entry">French, G., Hulse, M., Nguyen, D., Sobotka, K., Webster, K., Corman, J., Aboagye-Nyame, B., Dion, M., Johnson, M., Zalinger, B., &#38; Ewing, M. (2021). Impact of Hospital Strain on Excess Deaths During the COVID-19 Pandemic — United States, July 2020–July 2021. <i>MMWR. Morbidity and Mortality Weekly Report</i>, <i>70</i>(46), 1613–1616. https://doi.org/10.15585/MMWR.MM7046A5</div>