# <font color='blue'>Capstone A</font>
## <font color='blue'>Using Hospital Bed Capacity Prediction During COVID-19 to Determine Feature Importance</font>

<b>Abstract.</b>  The COVID-19 pandemic has prompted many types of models and feature selection methods in the field of machine learning. Because conditions changed rapidly and new regulations were introduced throughout the pandemic, modeling and feature selection have become increasingly complicated. The purpose of this study is to investigate the key features behind hospital bed capacity predictions, to help the public understand them and to highlight preventive measures. The study explores feature selection by building multiple models (one simple linear model, one more complex model, and an average of the two) to predict inpatient hospitalization rates.<br><br>

<b>Authors:</b>
* Helen Barrera, SMU MSDS Student
* Justin Ehly, SMU MSDS Student
* Blake Freeman, SMU MSDS Student
* Brad Blanchard, SMU Faculty
* Chris Papesh, UNLV Faculty

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sbs
import copy

# set the working directory (machine-specific path for Justin; change as needed)
os.chdir(r'C:\Users\justi\github\covid_Capstone\data')


In [2]:
df = pd.read_csv('OxCGRT_latest.csv', low_memory=False)

# create dataframe of only USA State level data
covid = copy.deepcopy(df.loc[(df.RegionName.notna()) & (df.CountryCode == 'USA')]) 
covid.reset_index(drop=True, inplace=True)

In [3]:
covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33405 entries, 0 to 33404
Data columns (total 51 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   CountryName                            33405 non-null  object 
 1   CountryCode                            33405 non-null  object 
 2   RegionName                             33405 non-null  object 
 3   RegionCode                             33405 non-null  object 
 4   Jurisdiction                           33405 non-null  object 
 5   Date                                   33405 non-null  int64  
 6   C1_School closing                      32645 non-null  float64
 7   C1_Flag                                27360 non-null  float64
 8   C2_Workplace closing                   32582 non-null  float64
 9   C2_Flag                                24546 non-null  float64
 10  C3_Cancel public events                32603 non-null  float64
 11  C3

In [4]:
# fix date column
from datetime import datetime
covid.Date = pd.to_datetime(covid.Date, format='%Y%m%d')

# create the state column for the data merge (this was determined at a later date from the initial pull and added back into the main pipeline)
covid['state'] = covid.RegionCode.str.slice(-2)

# drop 'CountryName', 'CountryCode', 'RegionName', 'RegionCode','Jurisdiction' because they will not be needed moving forward since we are working at the state level and only in the USA
# drop the wildcard since it is blank
covid = covid.drop(columns =['CountryName', 'CountryCode', 'RegionName', 'RegionCode',
       'Jurisdiction', 'M1_Wildcard'])

In [5]:
# get date range
print(covid.Date.min(), covid.Date.max())

2020-01-01 00:00:00 2021-10-16 00:00:00


In [6]:

# reduce covid df to friday 01-24-20 thru thursday 05-27-21 to match the other data set from the USGovt website
covid.reset_index(inplace=True, drop=True)
start_date = pd.to_datetime('20200124')
end_date = pd.to_datetime('20210527')
# boolean mask keeps only the rows inside the date window
in_range = (covid.Date >= start_date) & (covid.Date <= end_date)
covid = covid.loc[in_range]
covid.reset_index(inplace=True, drop=True)
min(covid.Date), max(covid.Date)

(Timestamp('2020-01-24 00:00:00'), Timestamp('2021-05-27 00:00:00'))

In [7]:
print('Total weeks in the dataset: %d' % (covid.shape[0]/7/51))

Total weeks in the dataset: 70


In [8]:
desc = pd.DataFrame(covid.describe())
desc.T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| C1_School closing | 24982.0 | 1.976863 | 0.9472418 | 0.0 | 2.0 | 2.0 | 3.0 | 3.0 |
| C1_Flag | 22351.0 | 0.41761 | 0.4931762 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| C2_Workplace closing | 24950.0 | 1.397836 | 0.7826192 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| C2_Flag | 22077.0 | 0.7759659 | 0.416954 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| C3_Cancel public events | 24969.0 | 1.296568 | 0.6383088 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 |
| C3_Flag | 22487.0 | 0.7876996 | 0.4089455 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| C4_Restrictions on gatherings | 24990.0 | 2.617967 | 1.579179 | 0.0 | 2.0 | 3.0 | 4.0 | 4.0 |
| C4_Flag | 19050.0 | 0.7234646 | 0.4472964 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| C5_Close public transport | 24971.0 | 0.4719074 | 0.6151551 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| C5_Flag | 10171.0 | 0.3179628 | 0.4657078 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


---
### Missing Values
---
codebook https://github.com/OxCGRT/covid-policy-tracker/blob/master/documentation/codebook.md

- <b>All C1 - H8</b> 
    - features should be changed to categorical, but since we need to merge the daily data into weekly data we will leave these as floats.
    - NaN means no data was available; it should be set to 99 and then labeled 'no_data', because the absence of data may itself be useful
        - for binary flags, we will set a 99 to 0.5 so that no weight is given to either side
        - for non-binary indicators, 0 = no measure and blank = no data, so we set a 99 to the mean of the week; if the whole week is 99, the result is set to 0
    <br><br>
- <b>Changes to be aware of:</b>
    - 27 September 2021: v3.4 note about removal of E3, E4 and H4
    - 28 June 2021: v3.3 presenting the imputed vaccine indicators (V2 summary and V3 summary) into a separate table
    - 21 June 2021: v3.02 edits to vaccine policy indicators table, fixing age ranges
    - 12 June 2021: v3.01 added section for vaccine policies
    - 5 May 2021: v2.10 added 'or all businesses open with alterations resulting in significant differences compared to non-Covid-19 operations' to C2 level 1
    - 18 March 2021: v2.9 added H8 'Protection of elderly people' indicator
    - 05 March 2021: v2.8 added 'non elderly' to definition of 'Clinically vulnerable groups' for H7
    - 14 January 2021: v2.7 changed 'country' to 'country/territory' and removed 'private' from C4 definition, replaced E1 flag 'formal sector workers only' with 'formal sector workers only or informal sector workers only', and 'informal workers too' with 'all workers'
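
The missing-value rules above can be sketched on a toy frame. This is only an illustration under stated assumptions: the helper name `impute_policy_columns` is hypothetical, the values are made up, the intermediate 99 code is skipped (NaN is filled directly), and weeks are counted in 7-day blocks from the first date in the frame.

```python
import numpy as np
import pandas as pd

# Toy frame: one state, two 7-day weeks. Column names follow the OxCGRT file;
# the values are made up for illustration.
toy = pd.DataFrame({
    "state": ["TX"] * 14,
    "Date": pd.date_range("2020-01-24", periods=14, freq="D"),
    "C1_School closing": [2, 2, np.nan, 3, 3, 3, 3] + [np.nan] * 7,
    "C1_Flag": [1, 1, np.nan, 0, 0, 0, 0] + [np.nan] * 7,
})


def impute_policy_columns(df, flag_cols, level_cols):
    """Hypothetical helper applying the missing-value rules listed above."""
    out = df.copy()
    # Binary flags: 0.5 sits halfway between 0 and 1, so neither side gains weight.
    out[flag_cols] = out[flag_cols].fillna(0.5)
    # Week number per row, counted in 7-day blocks from the first date.
    week = ((out["Date"] - out["Date"].min()).dt.days // 7).rename("week")
    for col in level_cols:
        # Mean of the observed values in the same state and week (NaN if none).
        weekly_mean = out.groupby(["state", week])[col].transform("mean")
        # Use the weekly mean; fall back to 0 when the whole week is missing.
        out[col] = out[col].fillna(weekly_mean).fillna(0)
    return out


imputed = impute_policy_columns(toy, ["C1_Flag"], ["C1_School closing"])
print(imputed)
```

Filling with the weekly mean keeps the later daily-to-weekly average unchanged for weeks that are only partly reported, which is one reason to prefer it over a constant.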

In [20]:
covid.shape[0]

24990

In [25]:
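# step through the frame in 7-row blocks
# (assumption: rows are sorted by state and then by date, so each block is one state-week)
week_starts = range(0, covid.shape[0], 7)
for start in week_starts:
    week = covid.iloc[start:start + 7]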

---
### Missing Values p2
---
- codebook https://github.com/OxCGRT/covid-policy-tracker/blob/master/documentation/codebook.md
- Indices methodology: https://github.com/OxCGRT/covid-policy-tracker/blob/master/documentation/index_methodology.md
<br>
- Oxford researchers were very conservative in the computation of indices (more can be read at the [indices methodology link](https://github.com/OxCGRT/covid-policy-tracker/blob/master/documentation/index_methodology.md) above)
- We can be more liberal for our research: where data is unreported for certain dates but exists for earlier dates, we will assume policies did not change and carry the last reported value forward
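
A minimal sketch of the carry-forward assumption, using made-up values. One caveat it illustrates: on a frame that stacks all states, a plain column-wise forward fill would carry the last value of one state into the first rows of the next, so the fill here is done within each state. Grouping by `state` is an assumption of this sketch, not necessarily what the pipeline does.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up index values; the column names mirror the OxCGRT file.
toy = pd.DataFrame({
    "state": ["AK", "AK", "AK", "AL", "AL", "AL"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "StringencyIndex": [10.0, np.nan, np.nan, np.nan, 25.0, np.nan],
})

toy = toy.sort_values(["state", "Date"])
# Forward-fill inside each state so a value never leaks across a state boundary.
toy["StringencyIndex"] = toy.groupby("state")["StringencyIndex"].ffill()
print(toy)
```

The first `AL` row stays missing because that state has no earlier report to carry forward.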


In [None]:
from sklearn.impute import SimpleImputer
# replace NaN in the index columns with the previous row's value (forward fill)
covid.iloc[:, 40:50] = covid.iloc[:, 40:50].ffill(axis=0)


In [None]:
covid.isna().sum()

Date                                         0
C1_School closing                            8
C1_Flag                                   2639
C2_Workplace closing                        40
C2_Flag                                   2913
C3_Cancel public events                     21
C3_Flag                                   2503
C4_Restrictions on gatherings                0
C4_Flag                                   5940
C5_Close public transport                   19
C5_Flag                                  14819
C6_Stay at home requirements                28
C6_Flag                                   4488
C7_Restrictions on internal movement        63
C7_Flag                                   2675
C8_International travel controls            31
E1_Income support                           72
E1_Flag                                   4974
E2_Debt/contract relief                     72
E3_Fiscal measures                       14626
E4_International support                 14754
H1_Public inf

### Note: 
https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/anag-cw7u
From the healthdata.gov data we will take our matching keys:
* collection_week: This date indicates the start of the period of reporting (the starting Friday).
* state: [FAQ - 1. d)] The two digit state/territory code for the hospital.

ToDo:
* reduce the DF to the date range Jan 24, 2020 (Friday) through May 27, 2021 (Thursday); this will match the weeks in the data set we are merging with
* combine the covid dataframe into weekly numbers with weeks beginning on 1/24/20
* create a state feature from RegionCode
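
One way to build the weekly key, sketched on a toy frame with made-up values. It assumes the weekly value is the mean of the daily values and uses `pd.Grouper` with `freq='W-THU'`, whose bins end on Thursday and therefore run Friday through Thursday. The label is then shifted back six days to the starting Friday so it lines up with `collection_week`.

```python
import pandas as pd

# Toy daily frame: two states, 14 days starting Friday 2020-01-24.
# The index values are made up for illustration.
days = pd.date_range("2020-01-24", periods=14, freq="D")
daily = pd.DataFrame({
    "state": ["TX"] * 14 + ["NV"] * 14,
    "Date": list(days) * 2,
    "StringencyIndex": [0.0] * 7 + [14.0] * 7 + [7.0] * 14,
})

# 'W-THU' weeks end on Thursday, so each bin runs Friday through Thursday.
weekly = (
    daily.groupby(["state", pd.Grouper(key="Date", freq="W-THU")])
    .mean(numeric_only=True)
    .reset_index()
)
# Grouper labels each bin with its closing Thursday; shift back six days
# to get the starting Friday used by collection_week.
weekly["collection_week"] = weekly["Date"] - pd.Timedelta(days=6)
weekly = weekly.drop(columns="Date")
print(weekly)
```

With `state` and `collection_week` in place, the weekly frame can be joined to the hospital data on those two keys.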


In [None]:
# combine the weeks in the covid df
import datetime
min(covid.Date) + datetime.timedelta(6)
#covid.groupby(by='Date').sum()[['ConfirmedCases']]
# look up pd.Grouper
# https://stackoverflow.com/questions/45281297/group-by-week-in-pandas/45281418

Timestamp('2020-01-30 00:00:00')

In [None]:
covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24990 entries, 0 to 24989
Data columns (total 46 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Date                                   24990 non-null  datetime64[ns]
 1   C1_School closing                      24982 non-null  float64       
 2   C1_Flag                                22351 non-null  float64       
 3   C2_Workplace closing                   24950 non-null  float64       
 4   C2_Flag                                22077 non-null  float64       
 5   C3_Cancel public events                24969 non-null  float64       
 6   C3_Flag                                22487 non-null  float64       
 7   C4_Restrictions on gatherings          24990 non-null  float64       
 8   C4_Flag                                19050 non-null  float64       
 9   C5_Close public transport              24971 non-null  float6

In [None]:
covid.Date[0] - pd.to_timedelta(7, unit='d')

Timestamp('2020-01-17 00:00:00')

In [None]:
# average each state's daily values within Friday-to-Thursday weeks
# ('W-THU' bins end on Thursday, so each week starts on the preceding Friday)
weekly_covid = covid.groupby(['state', pd.Grouper(key='Date', freq='W-THU')]).mean(numeric_only=True)

weekly_covid.reset_index(inplace=True, drop=False)
weekly_covid.info()

In [None]:
weekly_covid.info()

---
## EDA
---

---
### Graph to compare Cases and Deaths over time
---

In [None]:
# get an idea of how the cases and deaths align over time when scaled
fig, ax1 = plt.subplots(figsize=(12,5))

# plot conf_cases
color = 'tab:red'
ax1.set_xlabel('Dates')
ax1.set_ylabel('Confirmed Cases')
# select the column before summing so non-numeric columns (e.g. state) are not aggregated
ax1.plot(covid.groupby(by='Date')['ConfirmedCases'].sum(), color=color, label='Conf_Cases')
ax1.tick_params(axis='y', labelcolor=color)

# add additional axes to same plot
ax2 = ax1.twinx()

# plot conf_deaths
color = 'tab:blue'
ax2.set_ylabel('Confirmed Deaths')
ax2.plot(covid.groupby(by='Date')['ConfirmedDeaths'].sum(), color=color, label='Conf_Deaths')
ax2.tick_params(axis='y', labelcolor=color)

# prevent any offsets
fig.tight_layout()

# get plotted objects and their labels
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.set_title('Explore the Similarity in Trajectories of Cases and Deaths')
# one combined legend for both axes; 'best' lets matplotlib pick the location
ax2.legend(lines + lines2, labels + labels2, loc='best')
plt.show()

In [None]:
covid