# COVID Related Death Predictions

---
## Abstract
---

This project uses data released by the Mexican government covering over 1,000,000 people (with unique identifiers removed). For each person, the data records their COVID test result, whether they have each of a series of risk factors, the area they live in, the different types of care provided, and whether they died. This project considers only the people who tested positive for COVID and tries to predict their outcome (live or die) based on their underlying health factors. Many variations of Logistic Regression and Decision Tree classification algorithms are considered, and in the end one model from each algorithm is kept. For Logistic Regression, all features are used in an L2-regularized model, with the presence of pneumonia and age being the two largest contributors; it achieves a test accuracy, precision, and recall of 90.6%, 60.3%, and 35.5%, respectively. The final Decision Tree model initially considers all features, but uses only the presence of pneumonia and age for making final classifications. This model achieves a test accuracy, precision, and recall of 90.3%, 54.6%, and 52.9%, respectively. The intended use for the Logistic Regression model is a tool where individuals can enter their health characteristics and receive a prediction about whether they would die from COVID if they were to get infected. Since this model uses all features, it can show someone how much each individual factor contributes to their prediction. The intended use for the Decision Tree model is a fast-paced setting where a quick classification needs to be made, such as a hospital, where someone may only have time to make a classification based on two factors. However, both models suffer from low recall and precision, and this must be considered before using them.

---
## Introduction
---

The Novel Coronavirus, known as COVID19, has been sweeping across the world throughout 2020. As a response, researchers have been scrambling to try and identify which groups of the population are most at risk. They have done a great job of identifying major risk factors that impact the toll COVID19 may take on a particular person. The major risk factors we have heard about in the news have been Age, Obesity, and Heart Disease, but there may be more. We have also seen that a person's chances for success in a battle with COVID19 often are dictated by the health care available to them. But how much of an impact do all of these factors individually have in dictating your standing against COVID19? And what if more than one of these factors apply to you? This project tries to answer these questions. <br>

---
## Motivations
---

1. Many people don't understand the severity of COVID19 and therefore aren't taking proper precautions. In countries like the United States, this has been a great hindrance in controlling the spread of COVID19. People are used to hearing things like "If you are Obese then you are at high risk if you get COVID" or "If you have Heart Disease then you are at high risk if you get COVID". For many people, though, it is hard to understand the severity from such a standalone statement. Hearing something like "Out of X people with a similar set of risk factors to yours, Y of them are expected to die", accompanied by visualizations of historic data, may generate more interest and belief.<br>
2. Being able to estimate the probability of death for a given patient may be very useful in a hospital that is overwhelmed with patients. It may help in deciding where to send certain resources, and in anticipating which patients are likely to need resources throughout their treatment.

---
## Data
---

This data set came from a Mexican Government [website](https://www.gob.mx/salud/documentos/datos-abiertos-152127). I found a link to this dataset in a [Kaggle](https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset) project that somebody has done. I thought that finding COVID data would be easier than this, but it turns out that HIPAA laws do not allow data on the individual patient level to be made available to the public even if their names are left out. Therefore, only summary level data sets (like by county, or by age group) are able to be viewed by the public. This data set from the Mexican Government is at the individual patient level and doesn't provide any patient identification information.<br>
The raw data was in Spanish, so it was translated. Also, all qualitative features have accompanying value mappings. These have been handled in the data cleaning procedures (see link at bottom of Data section).

### Binary Qualitative Features

<b>Yes or No Features: </b>The presence of Pneumonia, Pregnancy, Diabetes, Chronic Obstructive Pulmonary Disease (COPD), Asthma, Immunosuppression, Hypertension, Cardiovascular Disease, Obesity, Kidney Failure, Other Diseases (umbrella feature), and whether the person is a smoker. Lastly, the <b>target variable "Died"</b> is a yes or no feature.

<b> Non Yes or No Features: </b>The Sex of the person (either Male or Female). For interpretability, I have turned this into two separate dummy variables.

### Non-Binary Qualitative Features

The Entity (state) the person resides in. This feature has been grouped into three subgroups as a function of the death rates and then encoded into three dummy variables.
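
The grouping and encoding step described above could look roughly like the sketch below. The exact grouping rule is not stated here, so splitting entities into death-rate terciles is an assumption, as are the toy data and the column names `entity`, `died`, and `residence`.

```python
import pandas as pd

# Hypothetical toy data; the real column names and values are assumptions.
df = pd.DataFrame({
    "entity": ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"],
    "died":   [0,   0,   0,   1,   1,   1,   0,   0,   1,   0,   1,   1],
})

# Death rate for each entity (state).
rate_by_entity = df.groupby("entity")["died"].mean()

# Assumed rule: rank entities by death rate and cut into three equal groups
# (1 = lowest death rates, 3 = highest). Ranking first avoids duplicate bin edges.
groups = pd.qcut(
    rate_by_entity.rank(method="first"), q=3, labels=[1, 2, 3]
).astype(int)

# Map each person to their entity's group, then one-hot encode the group.
df["residence"] = df["entity"].map(groups)
dummies = pd.get_dummies(df["residence"], prefix="Residence")
df = pd.concat([df, dummies], axis=1)

print(df[["entity", "residence"]].drop_duplicates())
```

Keeping all three dummy columns mirrors the feature list used later (Residence_1 to Residence_3), at the cost of the columns always summing to one.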

### Continuous Feature

The person's age.

#### Data cleaning procedures can be found <a href=https://github.com/harperd17/COVID-Death-Rates/blob/master/Notebooks/Data_Cleanup.ipynb>here. </a>

---
## Related Work
---

With COVID sweeping across the world, many studies are being done on it. The CDC in the United States is constantly posting updated data on COVID cases and deaths in the United States. All of the data available through the CDC is only at an aggregated level, meaning they only report numbers by some grouping such as state. They are very good about reporting all sorts of risk factors with the accompanying proportions of deaths and cases across these risk factors. Common risk factors presented include cardiovascular disease and age, among others. <br>
I'm also sure that hospitals have been utilizing their data collected from COVID cases so far in attempts to better manage the patient flow while experiencing higher than normal volume due to COVID. <br>
Lastly, I found out about this data set through related work in a Kaggle project found [here](https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset) where the author uses this dataset to predict whether someone will require intensive care (ICU) based on their underlying risk factors. To preserve originality, I decided to predict death rather than ICU admission.<br>

---
## Proposed Method
---

Among different machine learning classification techniques, Logistic Regression and Decision Tree Analysis are explored in this project. Logistic Regression is a strong and very often used analysis tool for predicting binary outcomes, especially in the medical field. Since a motivation for this project is hospital use, Logistic Regression is a technique that many workers would likely recognize and feel more comfortable with, making it an obvious choice for exploration. <br>
Decision Tree analysis seems like a good analysis method to explore since one of the motivations for this project is easy interpretation for people that may not understand the impact COVID may have on them. <br>
These models will be developed using training data that consists of 95% of the data available. After performing cross validation on the proposed models, the final chosen Logistic Regression and Decision Tree models will be tested using the remaining 5% of the data that was set aside as test data.
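
As a rough illustration of this evaluation setup, the sketch below holds out 5% of the data and estimates test metrics with 10-fold cross validation on the remaining 95%. The synthetic data, the stratified split, and the fixed `random_state` are assumptions made for the example, not details taken from the project code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in for the cleaned data (assumption): 3 features, binary target.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
logits = 1.5 * X[:, 0] + 0.5 * X[:, 1] - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Hold out 5% as test data; stratifying on the target is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, stratify=y, random_state=0
)

# Estimate test metrics with 10-fold cross validation on the training data only.
cv_scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    cv=10,
    scoring=["accuracy", "precision", "recall"],
)
estimated = {
    metric: cv_scores[f"test_{metric}"].mean()
    for metric in ["accuracy", "precision", "recall"]
}
print(estimated)
```

The held-out 5% is touched only once, after model selection, so the final test metrics are not influenced by the choices made during cross validation.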

#### Variables to Consider

For completeness, I would like to initially consider all of the features listed in the Data section above. Below you can see each variable and the associated Death Rate for that specific group of people.

![Binary%20Variable%20Death%20Rates-2.png](attachment:Binary%20Variable%20Death%20Rates-2.png)

![Age%20Rates.png](attachment:Age%20Rates.png)

<b>Figure 1:</b> The above plots show that, by far, having pneumonia is associated with the highest increase in death rate. Interestingly, being pregnant is associated with a lower death rate, and people with asthma appear to have a slightly lower death rate than those without asthma. Being a smoker seems to have very little impact on the death rate, while being male is associated with a higher death rate. There are also different death rates depending on which entity (state) someone lives in. I have a hunch that this may be due to differences in the quality of health care across the country. Lastly, age seems like a good predictor, and we can see how rapidly the death rate increases after the age of 50. It is also interesting that the death rate drops after the age of 100.<b> Source code for visualizations can be found <a href=https://github.com/harperd17/COVID-Death-Rates/blob/master/Notebooks/Data_Visualizations.ipynb>here. </a> </b>

---
## Experiments
---
#### All model experimentation code can be found <a href=https://github.com/harperd17/COVID-Death-Rates/blob/master/Notebooks/Modelling_code.ipynb>here. </a>

### Logistic Regression Model

<b>Feature Selection: </b><br>
<b>Using Forward Stepwise Selection</b>, I found that out of the 18 possible features, only 4 are necessary to achieve a model with classification accuracy comparable to the maximum estimated test accuracy. This model has an estimated test accuracy, precision, and recall of 90.76%, 59.48%, and 38.19% respectively. The estimated test metrics were calculated using 10-fold cross validation on the training data.

![Forward%20Selection%20Metrics-2.png](attachment:Forward%20Selection%20Metrics-2.png)

<b>Figure 2: </b>This plot shows the estimated test precision, recall, and accuracy obtained for logistic regression models of different complexities, measured by the number of features chosen through forward stepwise feature selection. Estimated test metrics were obtained by performing 10-fold cross validation for each model. After the 4 best features have been added, the estimated test accuracy no longer increases, suggesting that a model with 4 features is the simplest model that reaches the maximum estimated test accuracy. It is interesting to note that the precision and recall both hit their maximums with only 2 features.

The model selected using forward stepwise feature selection is shown below:

| Feature        | Coefficient |
|----------------|:-----------:|
| Pneumonia      | 2.492       |
| Age            | 0.061       |
| Kidney Failure | 1.132       |
| Sex_1 (Female) | -0.427      |
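
For readers who want to reproduce this kind of search, here is a minimal sketch of forward stepwise selection. It uses scikit-learn's `SequentialFeatureSelector`, which is an assumption about tooling and may differ from what the modelling notebook actually does; the synthetic data and the choice of 2 features are for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 6 features, only the first two matter.
rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 6))
logits = 3.0 * X[:, 0] + 1.5 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Greedily add the feature that most improves 10-fold cross-validated accuracy.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    scoring="accuracy",
    cv=10,
)
selector.fit(X, y)
selected = np.flatnonzero(selector.get_support())
print("Selected feature indices:", selected)

# Refit a logistic regression on the selected features to read off coefficients.
final_model = LogisticRegression(max_iter=1000).fit(X[:, selected], y)
print("Coefficients:", final_model.coef_.ravel())
```

Setting `direction="backward"` in the same class gives a backward stepwise search, starting from all features and removing one at a time.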

<b>Using Backward Stepwise Selection</b>, I selected only the best 3 features to keep. This model has an estimated test accuracy, precision, and recall of 89.76%, 52.06%, and 33.04% respectively. The estimated test metrics were calculated using 10-fold cross validation on the training data. The model with only 3 features was not the best in terms of the test metrics, but it seems to be the best once complexity is taken into account. The best model in the chart below by metrics alone would be the one with 18 features, but since simplicity is the goal for backward selection, I will choose the model with only 3 features because recall is relatively high at this point. I would like to note that when inspecting why there was such a drop when moving from 18 features to 17, I found that the sklearn backward selection routine dropped the Age feature, which was shown to be a significant feature in forward selection.

![Backward%20Selection%20Metrics-2.png](attachment:Backward%20Selection%20Metrics-2.png)

<b>Figure 3: </b>This plot shows the estimated test metrics obtained for logistic regression models of different complexities, measured by the number of features kept by backward stepwise feature selection. Estimated test metrics were obtained by performing 10-fold cross validation for each model. Once model complexity is taken into account, the best balance of estimated test accuracy, precision, and recall is found with a model containing 3 features.

The model selected using backward stepwise feature selection is shown below:

| Feature       | Coefficient |
|---------------|:-----------:|
|Pneumonia      | 2.896       |
|Pregnant       | -1.587      |
|Hypertension   | 0.992       |

<b>L2 Regularization: </b>When training a logistic regression model with an L2 penalty, I first standardized the Age feature. Then, I estimated the test metrics as a function of C, which in sklearn is the inverse of the regularization strength (a smaller C means stronger regularization). The maximum accuracy from 10-fold cross validation is found when C = 0.1 or higher. However, using a C as low as 0.001 still yields comparable estimated test precision, recall, and accuracy of 60.07%, 35.56%, and 90.74% respectively. The chart below shows the accuracy of the model as a function of C (ranging from 10 raised to the power of -11 to 10 raised to the power of 9).<br>
Another thing to note is the impact of C on the coefficients of the features. The largest coefficients in terms of absolute value are for the features Pneumonia, Sex_1 (Female), Sex_2 (Male), and Residence_1 (Grouping 1). <br>
Looking at the bottom figure below, we can see that at C=0.001, these coefficients are smaller in magnitude. Also, the coefficient for Sex_2 is positive instead of negative. This is good because it matches our interpretation from Figure 1, which shows Males have an increased death rate compared to Females.<br>
Two other features I would like to note are Residence_3 (group 3) and Cardiovascular Disease. With essentially no regularization (a very large C), the coefficient for Residence_3 is negative. This is odd because the coefficients for Residence_1 and Residence_2 are also negative, which would give an interpretation that simply living anywhere lowers your chances of death from COVID. This interpretation makes no sense. It is likely that collinearity is causing these three coefficients to all be negative: the three residence dummy variables always sum to one, so they are perfectly collinear with the intercept. With C=0.001, the coefficient of Residence_3 becomes positive (increasing the chances of death), which matches what we would expect to see from Figure 1, showing that the Residence_3 grouping has the highest death rate.<br>
Also, with essentially no regularization, the coefficient for Cardiovascular Disease is slightly negative. This doesn't make sense because it would imply that having Cardiovascular Disease decreases your chances of dying from COVID. When applying C=0.001, the coefficient shrinks to about one sixth of its previous magnitude. However, it is still negative, so this will need to be noted as a limitation of the model.
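
The sweep over C can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn's `LogisticRegression` (whose default penalty is L2 and whose `C` is the inverse regularization strength); the feature layout, the three C values, and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): one continuous "age" and three binary flags.
rng = np.random.default_rng(0)
n = 1500
age = rng.normal(45, 15, size=n)
flags = rng.integers(0, 2, size=(n, 3))
logits = 0.06 * (age - 45) + 2.0 * flags[:, 0] - 2.5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Standardize only the continuous feature, as described in the text.
age_std = StandardScaler().fit_transform(age.reshape(-1, 1))
X = np.hstack([age_std, flags])

results = {}
for C in [1e-3, 1e-1, 1e1]:
    # Smaller C means a stronger L2 penalty (L2 is the default penalty).
    model = LogisticRegression(C=C, max_iter=1000)
    accuracy = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    coef_norm = np.linalg.norm(model.fit(X, y).coef_)
    results[C] = (accuracy, coef_norm)

for C, (accuracy, coef_norm) in results.items():
    print(f"C={C:g}: accuracy={accuracy:.3f}, coefficient norm={coef_norm:.3f}")
```

Tracking the coefficient norm alongside accuracy makes the trade-off visible: the penalty shrinks the coefficients as C decreases, while accuracy may stay nearly flat over a wide range of C.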

![Regularization%20Metrics-3.png](attachment:Regularization%20Metrics-3.png)

![Regularization%20Coefficients.png](attachment:Regularization%20Coefficients.png)

<b>Figure 4:</b> The <b>top figure</b> above shows how changing the value of C, the inverse of the L2 regularization strength used when fitting logistic regression, impacts the estimated test metrics of the logistic regression model containing all available features. We can see that using a C below 0.001 greatly lowers the estimated test precision, recall, and accuracy. Increasing the value of C anywhere above 0.001 has very minimal impact on the estimated test accuracy. However, it is also important to note in the <b>bottom figure</b> how changing the value of C impacts the coefficients.

The model that uses C = 0.001 contains:

| Feature           | Coefficient |
|-------------------|:-----------:|
| Pneumonia         | 2.176       |
| Age               | 0.941       |
| Pregnant          | -0.003      |
| Diabetes          | 0.361       |
| COPD Diagnosis    | 0.108       |
| Asthma            | -0.043      |
| Immunosuppression | 0.179       |
| Hypertension      | 0.238       |
| Other Diseases    | 0.346       |
| Cardiovascular Disease  | -0.001 |
| Obesity           | 0.251       |
| Kidney Failure    | 0.582       |
| Smoker            | -0.068      |
| Sex_1 (Female)    | -0.229      |
| Sex_2 (Male)      | 0.229       |
| Residence_1       | -0.141      |
| Residence_2       | -0.012      |
| Residence_3       | 0.152       |
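
One way to read these coefficients is as odds ratios: in a logistic regression, exponentiating a coefficient gives the multiplicative change in the odds of death for a one-unit increase in that feature, holding the others fixed. The short sketch below applies this to a few coefficients copied from the table above; the dictionary name is hypothetical. Because Age was standardized, its odds ratio is per one standard deviation of age rather than per year.

```python
import math

# Selected coefficients copied from the C = 0.001 model table above.
coefficients = {
    "Pneumonia": 2.176,
    "Age": 0.941,  # per one standard deviation, since Age was standardized
    "Kidney Failure": 0.582,
    "Diabetes": 0.361,
    "Sex_1 (Female)": -0.229,
    "Residence_1": -0.141,
}

# exp(coefficient) is the odds ratio associated with each feature.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda item: -item[1]):
    print(f"{name:>16}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 indicates higher odds of death when the feature is present (or larger), and a value below 1 indicates lower odds.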

### Decision Tree Model

When trying out the fit of a tree as a function of the number of max leaf nodes, I came up with the following estimated test metrics. The best decision tree in terms of precision, recall, and accuracy (54.46%, 53.04%, and 90.39% respectively) as well as simplicity falls at only 3 leaf nodes.

![Tree%20Leaf%20Metrics-3.png](attachment:Tree%20Leaf%20Metrics-3.png)

<b>Figure 5: </b>The plot above shows how the estimated test metrics from 10-fold cross validation change as a function of the maximum number of leaf nodes the Decision Tree building algorithm was limited to. We can see that around 6 leaf nodes, the estimated test accuracy flattens out at 90.6%. Increasing the leaf nodes further only increases the estimated test accuracy by about 0.2%. However, decreasing the complexity of the tree down to only 3 leaf nodes actually increases the recall by a good amount. I think that this increase in estimated test recall makes up for the slight decreases in precision and accuracy.
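
A sweep like the one in Figure 5 can be sketched with scikit-learn's `max_leaf_nodes` parameter. The synthetic data, the range of 2 to 8 leaf nodes, and the variable names below are assumptions for illustration and are not taken from the modelling notebook.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): a binary flag, a standardized age, a noise flag.
rng = np.random.default_rng(0)
n = 1500
pneumonia = rng.integers(0, 2, size=n)
age = rng.normal(0, 1, size=n)
other = rng.integers(0, 2, size=n)
logits = 2.5 * pneumonia + 1.0 * age - 3.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
X = np.column_stack([pneumonia, age, other])

# Estimate test metrics for trees limited to k leaf nodes.
metrics_by_leaves = {}
for k in range(2, 9):
    tree = DecisionTreeClassifier(max_leaf_nodes=k, random_state=0)
    scores = cross_validate(
        tree, X, y, cv=10, scoring=["accuracy", "precision", "recall"]
    )
    metrics_by_leaves[k] = {
        metric: scores[f"test_{metric}"].mean()
        for metric in ["accuracy", "precision", "recall"]
    }

# Fit the small tree on all of the (synthetic) training data for inspection.
small_tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X, y)
print("Leaves in small tree:", small_tree.get_n_leaves())
```

Very small trees may predict no positives in some folds, in which case scikit-learn reports a precision of 0 for that fold and emits a warning.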

The decision tree created using 3 leaf nodes is shown below:

![image-5.png](attachment:image-5.png)

<b>Figure 6: </b>The decision tree above is the result of fitting a tree containing 3 leaf nodes. The tree shows that the only two features needed to classify a person as dying or not are Pneumonia and Age (standardized here). This model seems extremely simple, and it appears that many people with other risk factors would fall through the cracks.

#### Cost Complexity Pruning

Cost complexity pruning was applied to the tree fitting algorithm by adding an alpha term to the cost function. The alpha value ranges from 0 to 0.007. The trade off between the degree of "pruning" and the impurity and accuracy can be seen in the plot below.

![Tree_Pruning_Impurities.png](attachment:Tree_Pruning_Impurities.png)

![Pruning_Tree_Sizes.png](attachment:Pruning_Tree_Sizes.png)

![Tree%20Prune%20Metrics.png](attachment:Tree%20Prune%20Metrics.png)

<b>Figure 7: </b>The <b>top plot</b> above shows that as alpha increases, so does the level of total impurity. However, as alpha increases above 0.001, the rate of increase in impurity declines substantially. The <b>middle set of two plots </b> shows how the complexity of the tree in terms of nodes and depth changes as alpha increases. Again, after alpha increases above 0.001, the rate of decrease in complexity falls greatly, and the number of nodes and the depth of the tree flatten out. Lastly, the <b>bottom set of three plots</b> shows how the estimated test metrics change as the tree is pruned. The maximum precision, recall, and accuracy are found when pruning with alpha between 0.001 and 0.006. The estimated test metrics are calculated through 10-fold cross validation.
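
The relationship between alpha, impurity, and tree size can be explored with a sketch like the one below. It assumes scikit-learn's `cost_complexity_pruning_path` and `ccp_alpha`, which may or may not match the notebook's implementation; the synthetic data and the choice to keep every fifth alpha are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
logits = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Effective alphas and the total leaf impurity of each pruned subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Clip tiny negative values caused by floating point error, then thin the grid.
alphas = np.clip(path.ccp_alphas, 0.0, None)[::5]

# Refit a tree for each alpha and record how large it is.
node_counts = []
depths = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    node_counts.append(tree.tree_.node_count)
    depths.append(tree.get_depth())

for alpha, nodes, depth in zip(alphas, node_counts, depths):
    print(f"alpha={alpha:.5f}: nodes={nodes}, depth={depth}")
```

Each alpha could then be scored with 10-fold cross validation in the same way as the other models to pick a pruning level.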

Since the alpha at which the complexity of the tree flattens is 0.001, and the maximum estimated test metrics are achieved here, I have chosen this alpha to create a pruned tree. The tree can be seen below and has an estimated precision, recall, and accuracy of 54.46%, 53.04%, and 90.39% respectively.

![image-2.png](attachment:image-2.png)

<b>Figure 8: </b>This decision tree shows a great amount of simplicity, only having 4 leaf nodes. The only features necessary in classifying someone as dying or not dying from COVID are Pneumonia and Age.

---
## Results and Discussion
---

The comparison between the models can be seen in the table below. For the first motivation of being able to give an average person a prediction of their outcome if they were to get COVID, the L2 Regularization model would be best. The main reason is that this model contains a complete listing of all the risk factors. This comes at the cost of a slightly lower recall, but given the first context listed in the motivation section, I think this is worth the trade off. In addition, the regularization appears to have mitigated some of the collinearity between risk factors. Lastly, for each feature, the sign of the coefficient roughly matches the direction of the visual relationship between the presence of that risk factor and the death rates in the charts from Figure 1, with the exception of Cardiovascular Disease. One drawback is that the interpretation of a Logistic Regression model is not nearly as simple for the average person as that of a decision tree. However, a decision tree complex enough to contain all possible risk factors may also be very difficult to interpret and hard to follow. Therefore, the best option would be to use the L2 regularization logistic regression model and create some sort of tool that allows a user to input their risk factors and get an expected output returned to them.<br>
Next, in terms of correctly identifying deaths, the best model was actually the decision tree. Both decision trees considered have the exact same estimated test metrics. Therefore, the minimum leaf node model, which yielded three leaves, will be chosen for the decision tree model due to its greater simplicity. The accuracy is slightly below that of the logistic regression models. However, at the cost of a slight decrease in precision, the recall is substantially higher than that of any logistic regression model found in this project. Since recall can be simply described as the proportion of people who died from COVID that were correctly predicted to die, this means that on average, 53.04% of deaths from COVID may be predicted using this model. The figure of 53.04% is quite low, but slightly better than a coin toss.<br>
Since precision can be simply described as the proportion of people predicted to die from COVID that actually did die from COVID, if the decision tree predicts that you will die from COVID, on average, there is only a 54.46% chance that you actually will. In the context of a hospital, this means that for every 2 beds reserved for high death probability patients, roughly only one would have actually been necessary. Again, this figure is quite low, but slightly better than a coin toss.
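
To make the definitions of accuracy, precision, and recall concrete, the sketch below computes them from confusion matrix counts. The counts are hypothetical and chosen only to be in a similar range to the reported metrics; they are not the project's results. The function name `classification_metrics` is also an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute basic metrics from confusion matrix counts (positive = died)."""
    total = tp + fp + fn + tn
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total,
        # Of the people predicted to die, the share who actually died.
        "precision": tp / (tp + fp),
        # Of the people who actually died, the share who were predicted to die.
        "recall": tp / (tp + fn),
        # Accuracy of a trivial model that predicts "survived" for everyone.
        "all_negative_accuracy": (tn + fp) / total,
    }


# Hypothetical counts for 1,000 patients (not taken from the project).
metrics = classification_metrics(tp=50, fp=40, fn=45, tn=865)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Comparing accuracy against the all-negative baseline is a quick check on how much a model adds when one class dominates the data.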

|Model                |Est. Accuracy  |Est. Precision|Est. Recall      |
|---------------------|:-------------:|--------------|----------------:|
|Logistic Regression  |               |              |                 |
|   Fwd. Stepwise     |90.76%         |59.48%        |38.19%           |
|   Bwd. Stepwise     |89.76%         |52.06%        |33.04%           |
|   L2 Regularization |90.74%         |60.07%        |35.56%           |
|Decision Tree        |               |              |                 |
|   Min. Leaf Nodes   |90.39%         |54.46%        |53.04%           |
|   Cost Comp. Prune  |90.39%         |54.46%        |53.04%           |

<b>Figure 9: </b>The table above shows the summary of the estimated test metrics for all considered classification models.

When testing both models on the test data, the results are as follows in the confusion matrices below.

![image-2.png](attachment:image-2.png)

![image-2.png](attachment:image-2.png)

<b>Figure 10: </b> The <b>top confusion matrix</b> shows that the regularization model with C = 0.001 yielded 90.63%, 60.25%, and 35.46% accuracy, precision, and recall, respectively. These are all very close to the figures estimated from 10-fold cross validation. The <b>bottom confusion matrix</b> shows that the decision tree with only three leaf nodes yielded 90.28%, 54.57%, and 52.91% accuracy, precision, and recall, respectively. These are all very similar to the estimated figures from 10-fold cross validation as well.

When testing the trained logistic regression and decision tree models, as shown in the two confusion matrices above in Figure 10, the models performed very similarly to how they did under 10-fold cross validation. This is a good sign that the models aren't overfitted, and that the bias and variance of the models are reasonably balanced.

---
## Conclusion and Summary
---

In conclusion, there are two models that can predict whether or not someone will die from COVID. Each one has its place for different applications. The L2 Regularization model is better for giving an interested person an idea of how they may stand against COVID. This could be somebody who doesn't have COVID but wants to know how it may impact them and their friends, or it could be somebody who has COVID and wants to find out their chances of fighting it off. This model includes all possible risk factors that have been recorded in this data set, and also has the potential to show someone which of their risk factors are putting them most in danger. The strongest feature increasing COVID risk in this L2 model is the Pneumonia feature, followed by Age, Kidney Failure, and Diabetes. The features that decrease COVID risk the most are being a Female and living in a state grouped in the first grouping (as represented by Residence_1). A possible reason for this is that states in this grouping may have better health care availability, or may have a higher annual household income, allowing for people to afford better care. However, this is only a speculation and warrants a separate study.<br>
The Decision Tree model seems like a good model for applying to hospitals due to the higher test recall. Also, it may be better due to its simplicity. In a fast-paced setting, someone only needs to ask two questions (whether the person has Pneumonia, and how old they are) in order to make a classification. <br>
Lastly, I would like to note the accuracy of these models. The models performed similarly when tested as they did under cross validation. However, the precision, recall, and accuracy figures aren't very high. First, the accuracy initially seems high, being just over 90%. However, since roughly 90% of the people in the data survived, 90% doesn't seem quite as good, as the model could have assigned every observation a prediction of 0 and achieved a similar accuracy. Also, between the two models, the precision is between 54% and 60%. This means that out of all people these models classify as likely to die if infected with COVID, only 54% to 60% are expected to actually die. These figures of course are over 50%, but not by much. Therefore, if someone tries doing a self prediction using the L2 Logistic Regression model and gets a positive result (meaning death), then they can't take this prediction 100% seriously. And in a hospital setting, you shouldn't plan on every person predicted to die actually dying. Instead, a plan may be to have resources to fill 60% of the predictions, and keep all predictions on a watch list to get more frequent check ups. Lastly, the recalls are low, especially for the L2 Logistic Regression model (recall of ~35%). This implies that out of all people that die from COVID, only 35% would have been foreseen by the L2 Logistic Regression model. For the Decision Tree model, we would expect ~53% of the actual COVID deaths to be foreseen by the model. This is much better than the 35% from the L2 Logistic Regression model, which is another reason the decision tree is better for a hospital, but it is still low.<br>
Overall, these two models are good starting points, but certainly not good enough to make confident decisions from. In a hospital setting, these models may be good to use in conjunction with other metrics, such as heart rate, internal temperature, etc.

---
## Limitations and Later Work
---

One large limitation to this project is that the data is for COVID cases in Mexico. This means it shouldn't be generalized to the United States since there are large differences in the health care systems. This can be seen when comparing the country-wide death rates. The United States is currently sitting at a ~2.8% death rate from COVID in terms of counted cases. However, Mexico is closer to ~10%. This is a large difference that is likely attributed to differences in available health care and average household income (your ability to pay for treatment). However, I am not very familiar with Mexico's health care system so this is only an inference. <br>
In addition, from the visualization notebook, we saw that of all people sent to the Intensive Care Unit, ~50% end up dying. Also, of people that require intubation, ~80% end up dying. These were not included in this analysis because when someone initially gets COVID, they don't know whether or not they will need intubation or intensive care. They only know what their underlying risk factors are. However, what could be very interesting to look into in the future is predicting a death rate for people given their underlying risk factors, given that they are already either in intensive care or are receiving intubation. This could be done by performing similar model experimentation, but starting out with a data set that consists of a subset that only includes people who either required intubation or required intensive care.<br>
Also, it seems like the Pneumonia feature has "hogged" a lot of attention from the models. It would be interesting to do another study that ignores this feature. <br>
Lastly, it would be interesting to do a separate study to see how the quality of health care and annual household income vary by state, and see if there is a link to the death rates of these states.

---
## References and Contributions
---

<b>Author and Data Scientist: </b>David Harper

<b>Language: </b>Python

<b>IDE: </b> Jupyter Notebook via Anaconda

<b>Libraries: </b>pandas, json, numpy, matplotlib, sklearn, and seaborn.