
# DSCI 521: Data Analysis and Interpretation

# Term Project

## Team Members

- Team member 1
    - Name: Andrew Appleton
    - Email: aa4498@drexel.edu    
    - Background and skills: 10 years of experience in health sciences; limited Python experience.
- Team member 2
    - Name: Layla Bouzoubaa
    - Email: lb3338@drexel.edu
    - Background and skills: MS in Public Health with a concentration in Biostatistics. Advanced R user, novice Pythonista, strong familiarity with the public health data science pipeline and research.
- Team member 3
    - Name: Alicia Brandemarte
    - Email: amb847@drexel.edu
    - Background and skills: Healthcare informatics, R, SQL, business intelligence. Limited Python skills; seeking to expand them.
- Team member 4
    - Name: Stephan Dupoux
    - Email: sgd45@drexel.edu    
    - Background and skills: 5 years of experience in machine learning; 5 years of Python programming, including building ETL pipelines and web-based applications.
- Team member 5
    - Name: Mary Lucas
    - Email: mml367@drexel.edu
    - Background and skills: Extensive healthcare clinical experience, some experience in SQL, Python, and R programming and machine learning. 
- Team member 6
    - Name: Shane Nelson
    - Email: sn888@drexel.edu
    - Background and skills: 2 years of experience working at a clinical research organization. Somewhat limited Python experience, but continuously seeking to improve skills.


# Introduction

The COVID-19 pandemic is an ongoing global crisis that has had a major impact on individuals and communities over the past two years. It has significantly disrupted our way of life and the services and resources we depend on day to day.

The pandemic and its effects, in terms of loss of life and livelihood, have also made it more evident that there are disparities in access to care and resources across the globe based on geography, race, and other factors.
  
All code can be found in our [GitHub repository](https://github.com/labouz/covid_vulnerability/). The data is available for download from the links in the Data section.

### Motivating Questions to Address
The main questions that we hoped to address in this work were: 

  * the impact COVID-19 has had on hospital capacity in different regions of the country over time
  * the community burden of the COVID-19 pandemic
  * health disparities in the COVID-19 impact across different regions of the country



### Design of the Study

....

## Data

We analysed the hospitalization burden due to the COVID-19 pandemic in different states. The main dataset we used was obtained from the US Department of Health and Human Services (HHS) at healthdata.gov and is publicly available. The datasets are non-identifiable and contain no protected health information.

- [COVID-19 Reported Patient Impact and Hospital Capacity by State Timeseries](https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/g62h-syeh)
- [COVID-19 Community Vulnerability Crosswalk - Rank Ordered by Score](https://catalog.data.gov/dataset/covid-19-community-vulnerability-crosswalk-rank-ordered-by-score)

Supplementary data on the different states was obtained from other sources and will be mentioned in the relevant sections of this report.


### Structure
The hospital capacity timeseries data used is provided in various formats including CSV, TSV, GEOJSON, XML, and more.  It is updated on a daily basis and can be downloaded via the Socrata API or by direct export from the healthdata.gov link provided. 

As of the last download of the final dataset for this project, the file contained 39,701 rows and 117 columns, with each row representing a daily state-aggregated report.
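As a rough illustration of the download step, the sketch below builds a CSV export URL and parses the result into one dictionary per daily state report. It assumes that healthdata.gov follows the usual Socrata convention of serving a dataset at `/resource/<dataset-id>.csv` and that the `$limit` query parameter raises the default row cap; the helper names (`socrata_csv_url`, `parse_rows`, `load_hospital_timeseries`) are hypothetical and are not taken from the project code.

```python
import csv
import io
import urllib.request

DOMAIN = "healthdata.gov"
DATASET_ID = "g62h-syeh"  # ID taken from the dataset link in the Data section


def socrata_csv_url(domain, dataset_id, limit=50000):
    """Build a CSV export URL (assumed Socrata /resource/<id>.csv convention)."""
    return f"https://{domain}/resource/{dataset_id}.csv?$limit={limit}"


def parse_rows(csv_text):
    """Turn CSV text into a list of dicts, one per daily state report."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def load_hospital_timeseries(limit=50000):
    """Download and parse the state timeseries (requires network access)."""
    url = socrata_csv_url(DOMAIN, DATASET_ID, limit)
    with urllib.request.urlopen(url) as response:
        return parse_rows(response.read().decode("utf-8"))
```

Calling `load_hospital_timeseries()` would return every column as a string, so numeric fields need converting before analysis. In a pandas workflow the same URL could instead be passed to `pandas.read_csv`.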

In [None]:
# code to allow for visualizations
import IPython
from IPython.display import Image


#### Exploratory Data Analysis


In [None]:
# embed the interactive adult ICU utilization page in the notebook
from IPython.display import IFrame

url = 'https://marymlucas.github.io/adult-icu_utilization/'
IFrame(src=url, width='100%', height=600)

### Methods


#### Data Preprocessing

##### Feature Selection
After an initial data exploration phase, we selected xx variables out of the 117 to retain for our analysis.  This selection was made based on .....


##### Calculating Burden
To calculate burden, we used the criteria set by the US Department of Health and Human Services. According to French et al. (2021), “HHS has studied the relationship between hospital bed use and hospital strain and has identified occupancy >80% as an indicator of a strained condition. This analysis uses a continuous measure of ICU bed occupancy as a proxy for hospital strain, such that greater amounts of ICU bed use indicate larger amounts of hospital strain.”

From our survey of the relevant literature, we determined that the 80% threshold is a validated metric for hospital strain that we could use as an indicator in our analysis.
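The 80% criterion could be applied along the following lines. This is a minimal sketch rather than the project's actual implementation: it assumes that utilization is stored as a fraction between 0 and 1, that missing reports are represented as `None`, and the function names and sample values are hypothetical.

```python
STRAIN_THRESHOLD = 0.80  # HHS occupancy level cited by French et al. (2021)


def is_strained(utilization, threshold=STRAIN_THRESHOLD):
    """Return True when ICU occupancy exceeds the strain threshold."""
    return utilization is not None and utilization > threshold


def strained_share(daily_utilization, threshold=STRAIN_THRESHOLD):
    """Fraction of reported days on which ICU occupancy exceeded the threshold."""
    reported = [u for u in daily_utilization if u is not None]
    if not reported:
        return None  # no reports, so the share is undefined
    return sum(is_strained(u, threshold) for u in reported) / len(reported)


# Hypothetical daily adult_icu_bed_utilization values for one state
example = [0.72, 0.81, 0.85, None, 0.79]
print(strained_share(example))  # 2 of the 4 reported days exceed 0.80 -> 0.5
```

Keeping the threshold as a parameter makes it easy to check how sensitive the strain classification is to the exact cutoff.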


##### Determining Under-Reporting
To make our analysis meaningful, we decided to remove states that had high levels of under-reporting. For each variable in each row (a day of data reported by a state), there is a corresponding "coverage" variable (denoted as metric-name_coverage). This coverage feature reports how many hospitals in the state submitted information about the related variable. For example, in the case of our main feature of interest (adult_icu_bed_utilization), the adult_icu_bed_utilization_coverage feature indicates how many hospitals in the state reported their adult ICU bed utilization that day.

By comparing the coverage with the number of hospitals in each state we were able to derive a simple measure of the level of reporting for each state.
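One way to derive such a measure is to divide each day's coverage by the state's hospital count and average the result per state. The sketch below assumes rows are dictionaries with a `state` key and a string-valued coverage field, as a CSV reader would produce; the function name, state codes, and counts are hypothetical and for illustration only.

```python
from collections import defaultdict

COVERAGE_FIELD = "adult_icu_bed_utilization_coverage"


def mean_reporting_ratio(rows, hospital_counts, coverage_field=COVERAGE_FIELD):
    """Average, per state, of daily coverage divided by the state's hospital count."""
    ratios = defaultdict(list)
    for row in rows:
        total = hospital_counts.get(row["state"])
        if total:  # skip states with a missing or zero hospital count
            ratios[row["state"]].append(float(row[coverage_field]) / total)
    return {state: sum(vals) / len(vals) for state, vals in ratios.items()}


# Hypothetical rows and hospital counts, not real data
rows = [
    {"state": "AA", COVERAGE_FIELD: "40"},
    {"state": "AA", COVERAGE_FIELD: "60"},
    {"state": "BB", COVERAGE_FIELD: "10"},
]
hospital_counts = {"AA": 100, "BB": 50}
print(mean_reporting_ratio(rows, hospital_counts))
```

Note that the ratio can exceed 1 if the hospital count used as the denominator omits some of the hospitals that report.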

The most current list of hospital numbers was obtained from the Kaiser Family Foundation website (https://www.kff.org/other/state-indicator/total-hospitals) which reports data from the American Hospital Association (1999 - 2019) Annual Survey. 

It's important to note the disclaimer that "Data are for community hospitals, which represent 85% of all hospitals" where community hospitals are defined as "all nonfederal, short-term general, and specialty hospitals whose facilities and services are available to the public."

In selecting a threshold, we considered the HHS guidelines for hospitals on how to report their numbers (https://www.hhs.gov/sites/default/files/covid-19-faqs-hospitals-hospital-laboratory-acute-care-facility-data-reporting.pdf). While this is a recent update, we noted that the changes from previous requirements did not affect the requirement to report, only which fields are reported (https://healthdata.gov/stories/s/COVID-19-Reporting-and-FAQS/kjst-g9cm/). Based on this, we defined under-reporting as "states where greater than 25% of the hospitals reported adult_icu_bed_utilization."
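A cutoff of this kind could be applied as sketched below. The direction of the comparison is an assumption: the sketch treats a state as under-reporting when its mean share of reporting hospitals falls below 25%, which should be checked against the definition the team actually implemented. The function names and sample ratios are hypothetical.

```python
REPORTING_CUTOFF = 0.25  # the 25% figure from the definition above


def under_reporting_states(mean_ratios, cutoff=REPORTING_CUTOFF):
    """States whose mean reporting ratio falls below the cutoff (assumed direction)."""
    return sorted(state for state, ratio in mean_ratios.items() if ratio < cutoff)


def drop_states(rows, excluded):
    """Remove every daily report that belongs to an excluded state."""
    excluded = set(excluded)
    return [row for row in rows if row["state"] not in excluded]


# Hypothetical mean reporting ratios and daily rows, not real data
mean_ratios = {"AA": 0.5, "BB": 0.2}
daily_rows = [{"state": "AA"}, {"state": "BB"}, {"state": "AA"}]

flagged = under_reporting_states(mean_ratios)
print(flagged)                                # ['BB']
print(len(drop_states(daily_rows, flagged)))  # 2
```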


#### Data Analysis
We employed time series analysis and forecasting methods ...



## Results


## Reflections on Strengths and Limitations

### Strengths:

### Limitations:
We had initially intended to explore our data primarily through the lens of health disparities, but we decided to defer this aspect to future work. The reasons behind this decision were threefold:

1. The data available to assess COVID vulnerability, as determined by the US Department of Health and Human Services, had not been updated, and there was no data dictionary or robust description of how the assessment of vulnerability was done. We reached out to the data curators but have not received a response.

2. Disparities are difficult to measure without a clear measure for ranking the different groups. Because the datasets we chose for time series forecasting only provide information at the state level, they were too coarse for us to map socioeconomic data onto. In future, it would be interesting to analyse time series information at the county level and then use Census data to, for example, map socioeconomic indicators for the different counties. This would allow a more useful calculation and analysis of health disparities between groups.

3. There were time limitations that did not allow us enough latitude to explore the COVID vulnerability data and to continue reaching out to the data curators for additional details or updated data.  This would be interesting to pursue going forward and would add to our examination of disparities.



## Summary

## References

French, G., Hulse, M., Nguyen, D., Sobotka, K., Webster, K., Corman, J., Aboagye-Nyame, B., Dion, M., Johnson, M., Zalinger, B., & Ewing, M. (2021). Impact of Hospital Strain on Excess Deaths During the COVID-19 Pandemic — United States, July 2020–July 2021. *MMWR. Morbidity and Mortality Weekly Report*, *70*(46), 1613–1616. https://doi.org/10.15585/MMWR.MM7046A5