### 1. Abstract
From 2020 through 2024, the COVID-19 pandemic placed unprecedented strain on the United States' healthcare system, with hospitals facing surges in patient volume and widespread staffing shortages. This paper explores the feasibility of using publicly available hospital capacity data to predict staffing shortages at the state level. Using the U.S. Department of Health and Human Services' COVID-19 Reported Patient Impact and Hospital Capacity dataset, we evaluate whether critical staffing shortages can be predicted using supervised machine learning models.

After data cleaning and preprocessing, which included handling missing values and engineering time-based features, we formulated the task as a binary classification problem: predicting whether a state reported a staffing shortage on a given day. We trained multiple models, including logistic regression for interpretability and tree-based ensemble methods for improved performance and robustness.

Our models achieved meaningful predictive performance, with the tree-based models outperforming logistic regression, particularly in identifying true staffing shortage events with fewer false positives. Key predictive features included the number of currently hospitalized COVID-19 patients, ICU occupancy rates, and whether a staffing shortage had been recently reported in the same state. In particular, recent trends in hospital strain often served as early warning signs of upcoming shortages.

These results suggest that it is indeed possible to make actionable, data-driven predictions about staffing shortages in near real time. Such models could be incorporated into decision support tools, enabling more proactive staffing and resource allocation strategies. Despite challenges such as data gaps and inconsistent reporting across states, our findings show that machine learning can effectively support public health planning and crisis response.

Future work could explore incorporating additional real-time data sources, like local case rates or mobility data, and adapting the models to individual hospital systems for finer-grained predictions.

### 2. Introduction

The COVID-19 pandemic placed immense pressure on healthcare systems across the world, with hospitals in the United States experiencing significant strain in terms of patient capacity, ICU utilization, and staff availability. Among the most critical challenges was the issue of hospital staffing shortages—instances where medical facilities reported insufficient staff to meet operational demands. These shortages not only affected the quality and timeliness of care but also placed additional burdens on the re...

This paper investigates whether it is possible to predict hospital staffing shortages using publicly available data, specifically focusing on COVID-19-related hospital reports in the United States. The primary dataset, sourced from [healthdata.gov](https://healthdata.gov/), contains detailed time series data on hospital capacity, staffing, and COVID-19 metrics for all 50 states, the District of Columbia, and several U.S. territories. It spans from early 2020 through April 2024, capturing the full arc of...

The goal of this research is to use supervised machine learning techniques to develop models that can accurately predict whether a state will report a critical staffing shortage on any given day. By identifying the most important predictive features—such as COVID-19 patient counts, ICU occupancy, and time-based trends—we aim to uncover patterns that may inform future public health planning and policy. Ultimately, the hope is that early warnings based on data-driven predictions could allow hospital admin...

In approaching this task, we encountered several challenges. The dataset is large and complex, with over 130 features and a wide variety of variable types, including numeric, categorical, and time-based columns. Many variables are sparsely reported or inconsistently documented across states and time periods, which required thoughtful data cleaning and imputation strategies. Furthermore, the target variable—whether a critical staffing shortage was reported on a given day—is relatively imbalanced, with “n...

The significance of predicting staffing shortages extends beyond academic interest. During the height of the pandemic, hospitals facing personnel gaps had to delay elective procedures, transfer patients to other facilities, or reduce the level of care provided in ICUs. Staffing constraints also exacerbated burnout among frontline workers, creating a feedback loop that intensified shortages over time. As such, being able to anticipate these gaps even a few days in advance could prove extremely valuable, ...

To carry out our study, we first conducted a thorough exploration and cleaning of the dataset. This included removing irrelevant or redundant features, converting time fields into usable formats, and visualizing missingness patterns across key variables. We also transformed the outcome variable into a binary indicator suitable for classification tasks and developed a pre-analysis plan to guide our modeling efforts. This plan outlined our hypotheses, selected features, evaluation metrics, and intended ma...

We chose a mix of interpretable and high-performing models for this task: logistic regression for its transparency and baseline value, and tree-based models like Random Forest and XGBoost for their ability to capture non-linear relationships and handle large feature sets. Evaluation was performed using a range of classification metrics, including accuracy, precision, recall, F1 score, and ROC AUC. We also implemented cross-validation to assess how well our models generalized to unseen data and reduce t...

Another important focus of our project was feature importance—understanding which inputs contributed most strongly to the predictions. Beyond raw model performance, we wanted to interpret which hospital characteristics, staffing trends, or COVID-19 variables were most informative. For example, we hypothesized that ICU occupancy and recent COVID-19 admission surges would be strong predictors of reported staffing shortages, reflecting the direct relationship between patient volume and staffing stress.

Finally, we considered the temporal nature of the data. While most machine learning models assume independent and identically distributed (i.i.d.) samples, real-world data like ours involves sequences over time. To account for this, we explored time-based validation approaches, such as training on early periods and testing on later periods, to check for temporal drift or performance degradation. This is particularly important in public health settings, where early-phase dynamics of a crisis may differ s...
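A time-ordered split of the kind described above can be sketched in a few lines. This is only an illustration under assumptions: the record schema (`state`, `date`, `shortage`), the helper name `time_ordered_split`, and the cutoff date are hypothetical and are not taken from our pipeline.

```python
from datetime import date


def time_ordered_split(rows, cutoff):
    """Split state-day records into an early training set and a later test set.

    `rows` is an iterable of dicts with a `date` key (hypothetical schema).
    Records dated before `cutoff` go to training, the rest to testing.
    """
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test


# Toy records; the values are made up for illustration only.
records = [
    {"state": "VA", "date": date(2020, 11, 1), "shortage": 0},
    {"state": "VA", "date": date(2021, 1, 15), "shortage": 1},
    {"state": "TX", "date": date(2021, 6, 1), "shortage": 0},
    {"state": "TX", "date": date(2022, 1, 10), "shortage": 1},
]

train_rows, test_rows = time_ordered_split(records, cutoff=date(2021, 7, 1))
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Because every training record precedes every test record, the test score reflects performance on a later phase of the pandemic rather than on randomly held-out days.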

In the sections that follow, we describe the dataset and our data cleaning steps in more detail, walk through the methods outlined in our pre-analysis plan, and present the results of our modeling and evaluation process. We conclude by discussing the implications of our findings and directions for future work, including how similar predictive models could be deployed in real-time to help hospital systems remain resilient in the face of ongoing or future public health emergencies.


### 3. Data
Our data come from the COVID-19 Reported Patient Impact and Hospital Capacity dataset, which includes state-level hospital data on COVID-19 cases, staffing shortages, and hospital capacity. The dataset spans January 2020 through April 2024 and contains 135 variables; each row represents a daily report for a specific state or territory in the United States.

The dataset includes several critical variables, such as the number of hospitals reporting critical staffing shortages, the number of COVID-19 hospital admissions, and the utilization of inpatient beds (both overall and specifically for COVID-19 patients). We also identified variables related to staffing availability and pediatric care, among others. The dataset covers all 50 U.S. states, the District of Columbia (DC), and three territories: Puerto Rico (PR), the U.S. Virgin Islands (VI), and American Samoa (AS). The "state" variable identifies the state or territory, while the "date" variable marks the day of data collection. The date range spans from January 1, 2020, to April 27, 2024.

We began by converting the "date" column to a datetime format and creating a timestamp in seconds since the Unix epoch for time-based analysis. We also explored key variables, such as the number of hospitals reporting critical staffing shortages. After noticing that the variables differ widely in scale, we applied the inverse hyperbolic sine (arcsinh) transformation, which compresses large values while remaining defined at zero, to make the variables easier to compare. To handle missing data, we identified and dropped rows missing critical values, such as inpatient bed usage or hospital-onset COVID-19 data, while acknowledging the significant gaps in variables related to pediatric care and certain therapeutic supplies. We also flagged categorical variables, such as "state," for potential additional cleaning.
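The cleaning steps above might look roughly like the following sketch. It runs on a toy frame rather than the real file, and the column names other than "state" and "date" are assumptions chosen to resemble the dataset's fields, as is the choice of which column to treat as critical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HHS file; column names are assumptions.
raw = pd.DataFrame(
    {
        "state": ["VA", "VA", "TX", "TX"],
        "date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"],
        "critical_staffing_shortage_today_yes": [0, 3, 12, 150],
        "inpatient_beds_used": [1000.0, np.nan, 52000.0, 51000.0],
    }
)
df = raw.copy()

# 1. Parse dates and derive seconds since the Unix epoch.
df["date"] = pd.to_datetime(df["date"])
df["timestamp"] = (df["date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# 2. Compress heavy-tailed counts with arcsinh (defined at zero, unlike log).
count_cols = ["critical_staffing_shortage_today_yes", "inpatient_beds_used"]
for col in count_cols:
    df[f"{col}_asinh"] = np.arcsinh(df[col])

# 3. Drop rows missing a variable treated as critical.
df = df.dropna(subset=["inpatient_beds_used"]).reset_index(drop=True)

print(df[["state", "date", "timestamp"]])
```

The arcsinh transform behaves like a logarithm for large counts but maps zero to zero, which matters here because many state-days report no shortages at all.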







### 4. Methods
We plan to perform supervised learning with regression models to predict hospital strain and COVID-19 onset under two scenarios:

1. Predicting the percentage of hospitals with critical staff shortages based on the number of inpatient COVID-19 patients and the number of inpatient beds used.

2. Predicting the onset of COVID-19 in hospitals based on inpatient bed utilization and critical staff shortages.

We will begin with linear regression to model the relationship between COVID-19 hospitalizations and hospital bed utilization. This will allow us to interpret how increases in COVID-19 cases and staffing shortages are associated with higher usage of hospital resources. Linear regression will also provide us with interpretable coefficients and diagnostic tools, such as residual analysis and R², to evaluate the model's fit.
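As a rough sketch of this first step, the example below fits an ordinary least squares model with scikit-learn and reports the coefficients, R², RMSE, and residuals. The data are synthetic, and the variable names and the linear relationship used to generate them are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: COVID-19 inpatients and hospitals reporting shortages.
n = 200
covid_inpatients = rng.uniform(0, 5000, size=n)
shortage_hospitals = rng.uniform(0, 50, size=n)
# Assumed linear relationship plus noise; the coefficients are arbitrary.
beds_used = (
    10000
    + 1.2 * covid_inpatients
    + 40 * shortage_hospitals
    + rng.normal(0, 300, size=n)
)

X = np.column_stack([covid_inpatients, shortage_hospitals])
model = LinearRegression().fit(X, beds_used)

pred = model.predict(X)
residuals = beds_used - pred  # inspect these for structure the line misses
r2 = r2_score(beds_used, pred)
rmse = float(np.sqrt(mean_squared_error(beds_used, pred)))

print("coefficients:", model.coef_)
print("R^2:", round(r2, 3), "RMSE:", round(rmse, 1))
```

Each coefficient reads as the expected change in beds used per one-unit increase in that predictor, holding the other fixed, which is the interpretability we are after at this stage.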

For classification tasks, we will use logistic regression to categorize state-day observations as “high strain” or “low strain” based on thresholds for hospital utilization or staff shortages. This approach will provide interpretable odds ratios and is appropriate for binary outcomes.
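One way this could be set up is sketched below with scikit-learn. The 0.8 utilization threshold, the feature names, and the synthetic data are all hypothetical; the point is only to show how odds ratios fall out of the fitted coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

n = 500
bed_utilization = rng.uniform(0.4, 1.0, size=n)  # share of inpatient beds in use
noise_feature = rng.normal(size=n)               # unrelated to the label by construction

# Hypothetical labelling rule: "high strain" when noisy utilization exceeds 0.8.
high_strain = (bed_utilization + rng.normal(0, 0.05, size=n) > 0.8).astype(int)

# Standardize so each coefficient refers to a one-standard-deviation change.
X = StandardScaler().fit_transform(np.column_stack([bed_utilization, noise_feature]))
clf = LogisticRegression().fit(X, high_strain)

# exp(coefficient) is the multiplicative change in the odds of "high strain".
odds_ratios = np.exp(clf.coef_[0])
print(dict(zip(["bed_utilization", "noise_feature"], odds_ratios.round(2))))
```

An odds ratio well above 1 for utilization and close to 1 for the unrelated feature is what we would expect here. Note that scikit-learn applies L2 regularization by default, so the reported ratios are shrunk toward 1.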

To capture non-linear patterns, interaction effects, and threshold behavior (e.g., a certain level of COVID-19 admissions triggering capacity overload), we will use decision trees. Building on decision trees, we will apply random forests for improved generalization and stability. Feature importance scores from the random forest model will help identify the drivers of hospital strain.
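The sketch below illustrates the idea on synthetic data in which the label follows a joint threshold rule, the kind of pattern trees handle naturally. The feature names and thresholds are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 600
feature_names = ["covid_admissions", "icu_occupancy", "day_of_week"]  # hypothetical
covid_admissions = rng.uniform(0, 1000, size=n)
icu_occupancy = rng.uniform(0.3, 1.0, size=n)
day_of_week = rng.integers(0, 7, size=n)

# Assumed threshold rule: strain only when both admissions and ICU occupancy are high.
y = ((covid_admissions > 600) & (icu_occupancy > 0.7)).astype(int)

X = np.column_stack([covid_admissions, icu_occupancy, day_of_week])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(feature_names, forest.feature_importances_))
test_accuracy = forest.score(X_test, y_test)

print(importances)
print("held-out accuracy:", round(test_accuracy, 3))
```

Impurity-based importances can favor continuous or high-cardinality features, so permutation importance may be a useful cross-check when interpreting the real model.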

We may also use Principal Component Analysis (PCA) to reduce dimensionality, as many of our numerical features are correlated (e.g., different types of inpatient bed usage). PCA will simplify the feature space while preserving most of the variance.
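A minimal sketch of this step follows, assuming scikit-learn and a synthetic matrix in which three columns share a common driver (standing in for correlated bed-usage measures) and one column is independent. The 95% variance target is an illustrative choice rather than a decision we have made.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 300
base = rng.normal(size=n)  # shared driver behind the bed-usage columns

# Three strongly correlated bed-usage features plus one independent feature.
X = np.column_stack(
    [
        base + rng.normal(0, 0.1, size=n),
        2 * base + rng.normal(0, 0.1, size=n),
        0.5 * base + rng.normal(0, 0.1, size=n),
        rng.normal(size=n),
    ]
)

# Standardize first so no feature dominates purely because of its units.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The trade-off is interpretability: each component is a weighted blend of the original variables, so coefficients or importances on components no longer map onto a single hospital metric.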

All models will be evaluated using cross-validation to ensure generalizability and avoid overfitting. We will compare models based on R², RMSE, and MSE for regression tasks, and accuracy, precision, recall, and F1 score for classification tasks.
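For the classification metrics, the evaluation loop could be sketched as follows with scikit-learn's `cross_validate`. The data are synthetic, and the five-fold stratified setup is an assumption rather than a fixed part of our plan.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(4)
n = 400
X = rng.normal(size=(n, 3))
# Synthetic binary label driven mostly by the first feature.
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 0.5).astype(int)

metrics = ["accuracy", "precision", "recall", "f1"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(LogisticRegression(), X, y, cv=cv, scoring=metrics)

# Average each metric over the five held-out folds.
summary = {name: float(np.mean(scores[f"test_{name}"])) for name in metrics}
print(summary)
```

Shuffled folds ignore the time ordering of state-day rows, so for this dataset a time-aware splitter may give a more honest estimate of out-of-sample performance.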

Several challenges are anticipated during the analysis:
1. Data Quality: Incomplete, inconsistent, or erroneous data may affect model performance. In particular, missing data may be an issue if certain states do not report consistently. We will use imputation techniques (e.g., mean, median, or regression-based imputation) and may exclude unreliable data points.

2. Overfitting: The large dataset and numerous variables increase the risk of overfitting. We will address this by using cross-validation and regularization, and by splitting the data into training and test sets. PCA may also be used to reduce the number of features.

3. Non-Stationarity: Hospital utilization rates may fluctuate independently of COVID-19 cases, and trends or seasonality could affect the data. To address this, we will use coverage metrics to adjust for population size and calculate the number of patients per hospital. We may also apply differencing or detrending to bring the series closer to stationarity.
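Two of the mitigations listed above, imputation and differencing, might be sketched with pandas as follows. The toy panel and its column names are hypothetical, and per-state median imputation is just one of the options mentioned.

```python
import numpy as np
import pandas as pd

# Toy state-day panel; column names are hypothetical.
dates = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"] * 2)
df = pd.DataFrame(
    {
        "state": ["VA"] * 4 + ["TX"] * 4,
        "date": dates,
        "covid_inpatients": [100.0, np.nan, 140.0, 160.0, 900.0, 950.0, np.nan, 1100.0],
        "hospitals_reporting": [50, 50, 50, 50, 300, 300, 300, 300],
    }
).sort_values(["state", "date"]).reset_index(drop=True)

# Median imputation within each state, so large states do not fill gaps in small ones.
df["covid_inpatients"] = df.groupby("state")["covid_inpatients"].transform(
    lambda s: s.fillna(s.median())
)

# Scale by the number of reporting hospitals to make states comparable.
df["patients_per_hospital"] = df["covid_inpatients"] / df["hospitals_reporting"]

# First difference within each state removes a slowly varying level.
df["patients_per_hospital_diff"] = df.groupby("state")["patients_per_hospital"].diff()

print(df)
```

A median computed over the full series uses future values, so in a forecasting setup the imputation statistics would need to come from the training period only.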

We will present our results through tables and visualizations:
- Regression Models: Evaluated using R², RMSE, and residual plots.
- Classification Results: Presented using confusion matrices and F1 scores.
- Visualizations: Bar charts for feature importances, line plots for hospital utilization trends, scatter plots, and heatmaps for correlations. If time permits, we may explore how model predictions vary by region or change over time.
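To make the classification summaries concrete, the sketch below computes confusion-matrix counts and the derived precision, recall, and F1 score by hand. The labels are made up, and the helper names `confusion_counts` and `precision_recall_f1` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 1 = shortage reported, 0 = no shortage.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print("TP, FP, FN, TN:", (tp, fp, fn, tn))  # (3, 1, 1, 3)
print("precision, recall, F1:", precision, recall, f1)  # 0.75 0.75 0.75
```

Because shortage days are the minority class, precision and recall on the positive class are more informative than raw accuracy.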

### 5. Results


### 6. Conclusion

### 7. References/Bibliography

