# COVID-19 Cases & Policies by Country

For this project, I'm interested in how effective different prevention policies (e.g. masks, restaurant closures) have been against the spread of COVID-19 and COVID-19-related deaths.

There are existing visualizations and analyses for this.  My goal is to play with combining multiple datasets, filtering, and exploring trends within those datasets using tools I haven't used before.

Observations from the analysis are called out in Markdown after each results plot.

About COVID dataset:  https://github.com/owid/covid-19-data/tree/master/public/data \
About Policies dataset:  https://github.com/OxCGRT/covid-policy-tracker

In [6]:
import pandas as pd

In [7]:
from datetime import datetime

In [8]:
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

#### Import datasets

In [None]:
# COVID impact by country:
cov = pd.read_csv(
    "https://covid.ourworldindata.org/data/owid-covid-data.csv", 
    parse_dates=True
)

In [None]:
# Policies enforced in response:
pol = pd.read_csv(
    'https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv', 
    parse_dates=True,
    low_memory=False
)

#### Update COVID df so headers match Policy df; reformat Policy dates

In [None]:
# Rename columns
cov.rename(columns={'iso_code': 'CountryCode', 'location': 'CountryName', 'date': 'Date'}, inplace=True)

In [None]:
# Move 'Date' to first column (for merge)
first_column = cov.pop('Date')
cov.insert(0, 'Date', first_column)

In [None]:
# Reformat pol 'Date' from yyyymmdd to yyyy-mm-dd
to_date = []
for d in pol['Date'].astype(str):
    to_date.append(d[0:4] + '-' + d[4:6] + '-' + d[6:8])

In [None]:
# Update Policy 'Date' with new formatted dates
pol['Date'] = to_date

In [None]:
# Move Policy 'Date' to first column (for merge)
first_column = pol.pop('Date')
pol.insert(0, 'Date', first_column)

In [None]:
# Sort by date
cov.sort_values(by=['Date'], inplace=True)
pol.sort_values(by=['Date'], inplace=True)

In [None]:
# Convert Date to datetime
cov['Date'] = pd.to_datetime(cov['Date'])
pol['Date'] = pd.to_datetime(pol['Date'])
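
As an aside, the string slicing and the later datetime conversion could probably be collapsed into one step by giving `pd.to_datetime` an explicit format. This is a minimal sketch on a made-up series (`raw_dates` is a hypothetical stand-in for the Policy `Date` column, assuming it holds yyyymmdd integers):

```python
import pandas as pd

# Hypothetical stand-in for the Policy 'Date' column (yyyymmdd integers)
raw_dates = pd.Series([20200131, 20200201, 20210228])

# Parse the digits directly with an explicit format instead of slicing strings
parsed = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

The explicit format avoids any ambiguity between day-first and month-first readings.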

#### Merge the two DataFrames

In [None]:
df = pd.merge_asof(cov, pol, on='Date', by='CountryName')
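
To make the behavior of `merge_asof` concrete, here is a small sketch on invented data (the frames `cases` and `policy` and their values are hypothetical). With the default backward direction, each case row picks up the most recent policy row for the same country on or before its date, and countries with no policy rows get `NaN`:

```python
import pandas as pd

# Invented case data, already sorted by Date (merge_asof requires sorted keys)
cases = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"]),
    "CountryName": ["A", "A", "A", "B"],
    "new_cases": [1, 2, 3, 10],
})

# Invented policy data: country A only, with a gap on 2020-03-02
policy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-03"]),
    "CountryName": ["A", "A"],
    "stringency_index": [10.0, 50.0],
})

# Backward as-of join: match on Date within each CountryName
merged = pd.merge_asof(cases, policy, on="Date", by="CountryName")
print(merged)
```

One thing worth checking on the real data: if the policy file contains sub-national rows under the same `CountryName` (I believe it may, but that is an assumption), the as-of match could pick up a state-level row instead of the national one, so filtering the policy frame to national rows first would be safer.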

#### Remove regions (e.g. World)
I realized regions and general categories were included as "countries" when I plotted the data.  For total cases per day, 'World' stood out as a clear outlier.  After removing 'World,' I was plotting populations and noticed the continents and the income ranges were also included.

In [None]:
# Aggregate rows (world, continents, income groups) that are not countries
regions = [
    'World', 'Europe', 'European Union', 'Asia', 'North America', 'South America',
    'Oceania', 'Africa', 'Upper middle income', 'Low income', 'Lower middle income',
    'High income'
]
df = df.loc[~df['CountryName'].isin(regions)]
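
An alternative to listing region names by hand might be to filter on the country code. As far as I can tell, the OWID file gives aggregate rows codes starting with `OWID_`, but this is an assumption to verify against the data, and at least one real country (Kosovo) also seems to use that prefix. A sketch on invented rows (`demo` and `keep_codes` are hypothetical names):

```python
import pandas as pd

# Invented rows mimicking the OWID 'iso_code' / 'location' columns
demo = pd.DataFrame({
    "iso_code": ["USA", "OWID_WRL", "NZL", "OWID_HIC", "OWID_KOS"],
    "location": ["United States", "World", "New Zealand", "High income", "Kosovo"],
})

# Assumed exception: a real country that carries the OWID_ prefix
keep_codes = {"OWID_KOS"}

# Flag rows whose code has the aggregate prefix, except the known exceptions
is_aggregate = demo["iso_code"].str.startswith("OWID_") & ~demo["iso_code"].isin(keep_codes)
countries_only = demo[~is_aggregate]

print(countries_only["location"].tolist())
```

This would have to run on the COVID frame before the merge (or on whichever code column survives the merge), since both source frames carry a country code column.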

#### Include only Feb'20 through Feb'21 to remove the effect of vaccines

In [None]:
df = df.loc[(df['Date'] > '2020-01-31') & (df['Date'] < '2021-03-01')]

In [None]:
#Get list of countries
df.CountryName.unique()

In [None]:
# List of columns
list(df.columns)

In [None]:
# Plot total cases vs time for a single country
plt.plot(df[df.CountryName == 'United States'].Date, df[df.CountryName == 'United States'].total_cases)
plt.title('U.S. Total COVID-19 cases vs time')
plt.xlabel('Date')
plt.ylabel('Cases')
plt.show()

In [None]:
# Plot multiple countries total cases on same plot
fig, ax = plt.subplots(figsize=(15,7))

# Select the column before summing so non-numeric columns are not aggregated
df.groupby(['Date','CountryName'])['total_cases'].sum().unstack().plot(ax=ax)

ax.set_title('Total COVID cases (all countries)')
ax.set_xlabel('Date')
ax.set_ylabel('Total Cases')
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), ncol=5)

plt.show()    

In [None]:
# Plot multiple countries daily cases on same plot
fig, ax = plt.subplots(figsize=(15,7))

# Select the column before summing so non-numeric columns are not aggregated
df.groupby(['Date','CountryName'])['new_cases_smoothed_per_million'].sum().unstack().plot(ax=ax)

ax.set_title('New COVID cases per million (smoothed) (all countries)')
ax.set_xlabel('Date')
ax.set_ylabel('New Cases')
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), ncol=5)

plt.show()    

#### Identify best way to filter data, e.g.

##### By GDP:

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x = df['CountryName'].sort_values().unique(), 
    y = df.groupby(['CountryName'])['gdp_per_capita'].max(),
    name = 'GDP per Capita - all countries'
))
fig.update_layout(xaxis_tickangle = 90)
fig.show()

##### By Population:

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x = df['CountryName'].sort_values().unique(), 
    y = df.groupby(['CountryName'])['population'].max(),
    name = 'Population - all countries'
))
fig.update_layout(xaxis_tickangle = 90)
fig.show()

##### By Total Cases:

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x = df['CountryName'].sort_values().unique(), 
    y = df.groupby(['CountryName'])['total_cases'].max(),
    name = 'Total Cases (max) - all countries'
))
fig.update_layout(xaxis_tickangle = 90)
fig.show()

##### By median stringency index:

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x = df['CountryName'].sort_values().unique(), 
    y = df.groupby(['CountryName'])['stringency_index'].median(),
    name = 'Stringency Index (median) - all countries'
))
fig.update_layout(xaxis_tickangle = 90)
fig.show()

In [None]:
px.histogram(df.groupby(['CountryName'])['stringency_index'].median())

#### Filter by low, medium, and high stringency countries, using histogram to ID threshold values

In [None]:
median_stringency = df.groupby(['CountryName'])['stringency_index'].median()

In [None]:
low_str = median_stringency[(median_stringency > 20) & (median_stringency < 25)]
hi_str = median_stringency[median_stringency > 85]
# Compare with a small tolerance rather than exact float equality
med_str = median_stringency[(median_stringency - 68.98).abs() < 0.005]

In [None]:
print(med_str)
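
The thresholds above were picked by eye from the histogram. A less hand-tuned option might be to bin countries by quantiles of the median stringency. This is only a sketch on invented values (`median_stringency_demo` and the 25%/75% cut points are my assumptions, not the thresholds used in this notebook):

```python
import pandas as pd

# Invented per-country median stringency values
median_stringency_demo = pd.Series(
    [12.0, 22.5, 40.0, 55.0, 69.0, 88.0],
    index=["A", "B", "C", "D", "E", "F"],
)

# Use the lower and upper quartiles as group boundaries
low_cut, high_cut = median_stringency_demo.quantile([0.25, 0.75])

# Label each country as low / medium / high (bins are right-inclusive)
groups = pd.cut(
    median_stringency_demo,
    bins=[-float("inf"), low_cut, high_cut, float("inf")],
    labels=["low", "medium", "high"],
)

print(groups)
```

This assigns every country to a group, so the resulting groups would be much larger than the handful of countries plotted below.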

#### Plot new cases vs time for low, medium, and high stringency countries
Questions:
- Do New Cases covary with Stringency? e.g.
    - Does Stringency increase when Cases go up?  
    - Do Cases go up when Stringency goes down?
- Did countries with High Stringency fare better than those with Low Stringency?
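
One rough way to put a number on the covariation questions would be a lagged correlation within a single country: correlate stringency on a given day with new cases some days later. The sketch below uses invented data in which cases mirror stringency two days earlier; `lagged_corr` is a hypothetical helper, and it assumes one row per day sorted by date:

```python
import pandas as pd

def lagged_corr(country_df, lag_days):
    """Pearson correlation of stringency with new cases `lag_days` later."""
    # Shift cases back so row t holds the cases observed at t + lag_days
    later_cases = country_df["new_cases_smoothed_per_million"].shift(-lag_days)
    # Series.corr ignores the rows left without a pair after shifting
    return country_df["stringency_index"].corr(later_cases)

# Invented series: cases two days later are 100 minus stringency
stringency = [80, 80, 60, 40, 40, 60, 80, 80, 60, 40]
cases = [50, 50] + [100 - s for s in stringency[:-2]]
demo_country = pd.DataFrame({
    "stringency_index": stringency,
    "new_cases_smoothed_per_million": cases,
})

print(lagged_corr(demo_country, 2))
```

A strongly negative value at some lag would be consistent with cases falling after stringency rises, though correlation alone cannot separate policy effects from governments reacting to case counts.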

##### Starting with Medium stringency countries:  

In [None]:
# Re-create the first line plot, with a different line for each country (1) daily cases and (2) stringency index
# Include secondary axis for stringency index 
# Solid line = cases; dotted = stringency index

#Initialize Figure
fig = make_subplots(specs=[[{"secondary_y": True}]])
colors = ['black', 'red', 'green', 'blue', 'fuchsia']

for item in range(0, len(med_str)):
    
    fig.add_trace(go.Scatter( 
        x = df[df.CountryName == med_str.index[item]].Date, 
        y = df[df.CountryName == med_str.index[item]].new_cases_smoothed_per_million,
        mode = 'lines',
        name = med_str.index[item] + ' - New Cases',
        line = dict(color=colors[item])
        ),
        secondary_y=False
        )
    
    fig.add_trace(go.Scatter( 
        x = df[df.CountryName == med_str.index[item]].Date, 
        y = df[df.CountryName == med_str.index[item]].stringency_index,
        mode = 'lines',
        name = med_str.index[item] + ' - Stringency',
        line = dict(color=colors[item], dash='dot', width=2)
        ),
        secondary_y=True
        )
                  
fig.update_layout(
    title = 'New Cases and Stringency:  Medium Stringency Countries'
    )

fig.update_xaxes(title_text='Date')

fig.update_yaxes(title_text='New Cases per Million', secondary_y=False)
fig.update_yaxes(title_text='Stringency Index', secondary_y=True)

fig.show()

Observations:
- The United States is such an outlier with New Cases that it's difficult to see trends across countries
- We can see Gambia had a dip in Stringency in July'20, then an increase in New Cases, which likely triggered the increase in Stringency again
- The United States' overall Stringency does not vary much despite large swings in New Cases.  Stringency was highest in May'20, even though New Cases gradually increased overall from there.  This may be due to pushback on lockdowns and mandates.

##### Next for Low Stringency Countries:

In [None]:
# Re-create the first line plot, with a different line for each country (1) daily cases and (2) stringency index
# Include secondary axis for stringency index 
# Solid line = cases; dotted = stringency index

#Initialize Figure
fig = make_subplots(specs=[[{"secondary_y": True}]])
colors = ['black', 'red', 'green', 'blue', 'fuchsia']

for item in range(0, len(low_str)):
    
    fig.add_trace(go.Scatter( 
        x = df[df.CountryName == low_str.index[item]].Date, 
        y = df[df.CountryName == low_str.index[item]].new_cases_smoothed_per_million,
        mode = 'lines',
        name = low_str.index[item] + ' - New Cases',
        line = dict(color=colors[item])
        ),
        secondary_y=False
        )
    
    fig.add_trace(go.Scatter( 
        x = df[df.CountryName == low_str.index[item]].Date, 
        y = df[df.CountryName == low_str.index[item]].stringency_index,
        mode = 'lines',
        name = low_str.index[item] + ' - Stringency',
        line = dict(color=colors[item], dash='dot')
        ),
        secondary_y=True
        )
                  
fig.update_layout(
    title = 'New Cases and Stringency:  Low Stringency Countries'
    )

fig.update_xaxes(title_text='Date')

fig.update_yaxes(title_text='New Cases per Million', secondary_y=False)
fig.update_yaxes(title_text='Stringency Index', secondary_y=True)

fig.show()

Observations:
- These countries all saw a large spike in cases early in the pandemic, and then limited peaks of new cases following
- This is in stark contrast to e.g. the U.S. or Cape Verde, which both saw rises and falls throughout 2020
- The overall number of New Cases per Million is generally an order of magnitude less than that for the medium stringency countries
- Interestingly, Mauritius had a dip in Stringency in Nov'20, a subsequent increase in Cases, but then kept Stringency low.  Cases still managed to decline and stay flat.

##### And for High Stringency Countries:

In [None]:
#Initialize Figure
fig = make_subplots(specs=[[{"secondary_y": True}]])
colors = ['black', 'red', 'green', 'blue', 'fuchsia']

for item in range(0, len(hi_str)):
    
    fig.add_trace(go.Scatter( 
        x = df[df.CountryName == hi_str.index[item]].Date, 
        y = df[df.CountryName == hi_str.index[item]].new_cases_smoothed_per_million,
        mode = 'lines',
        name = hi_str.index[item] + ' - New Cases',
        line = dict(color=colors[item])
        ),
        secondary_y=False
        )
    
    fig.add_trace(go.Scatter( 
        x = df[df.CountryName == hi_str.index[item]].Date, 
        y = df[df.CountryName == hi_str.index[item]].stringency_index,
        mode = 'lines',
        name = hi_str.index[item] + ' - Stringency',
        line = dict(color=colors[item], dash='dot')
        ),
        secondary_y=True
        )
                  
fig.update_layout(
    title = 'New Cases and Stringency:  High Stringency Countries'
    )

fig.update_xaxes(title_text='Date')

fig.update_yaxes(title_text='New Cases per Million', secondary_y=False)
fig.update_yaxes(title_text='Stringency Index', secondary_y=True)

fig.show()

Observations:
- All of these countries appear to have gone to "high stringency" early in the pandemic and stayed there
- It would be interesting to look at government type by stringency; it's possible more authoritarian governments had generally higher stringencies
- We can see Oman (pink) decreased in stringency in Nov'20 following a decrease in cases
- Eritrea (red) had an increase in stringency in late 2020 following an increase in cases

##### High, Medium, and Low on the same Plot:

In [None]:
#Initialize Figure
fig = make_subplots(specs=[[{"secondary_y": True}]])
low_colors = ['green', 'lime', 'chartreuse', 'darkgreen', 'darkolivegreen']
med_colors = ['mediumblue', 'skyblue', 'deepskyblue', 'darkslateblue', 'dodgerblue']
hi_colors = ['red', 'maroon', 'indianred', 'brown', 'darkgoldenrod']

# Loop over each group separately so groups of different sizes are handled
groups = [(low_str, low_colors), (med_str, med_colors), (hi_str, hi_colors)]

for group, group_colors in groups:
    for item in range(0, len(group)):

        fig.add_trace(go.Scatter( 
            x = df[df.CountryName == group.index[item]].Date, 
            y = df[df.CountryName == group.index[item]].new_cases_smoothed_per_million,
            mode = 'lines',
            name = group.index[item] + ' - New Cases',
            # Cycle through the palette if a group has more countries than colors
            line = dict(color=group_colors[item % len(group_colors)])
            ),
            secondary_y=False
            )
        
fig.update_layout(
    title = 'New Cases vs Time:  High (red), Med (blue), and Low (green) Stringency Countries'
    )

fig.update_xaxes(title_text='Date')

fig.update_yaxes(title_text='New Cases per Million', secondary_y=False)
fig.update_yaxes(title_text='Stringency Index', secondary_y=True)

fig.show()

Observations:
- Again, the U.S. is such an outlier it's tough to see other trends
- At a glance it appears the low stringency countries fared better than medium and even high stringency countries
- To investigate further, let's take out the medium stringency countries

##### Just High and Low to simplify the plot:

In [None]:
#Initialize Figure
fig = make_subplots(specs=[[{"secondary_y": True}]])
low_colors = ['green', 'lime', 'chartreuse', 'darkgreen', 'darkolivegreen']
med_colors = ['mediumblue', 'skyblue', 'deepskyblue', 'darkslateblue', 'dodgerblue']
hi_colors = ['red', 'maroon', 'indianred', 'brown', 'darkgoldenrod']

# Loop over the low and high groups separately so different group sizes are handled
groups = [(low_str, low_colors), (hi_str, hi_colors)]

for group, group_colors in groups:
    for item in range(0, len(group)):

        fig.add_trace(go.Scatter( 
            x = df[df.CountryName == group.index[item]].Date, 
            y = df[df.CountryName == group.index[item]].new_cases_smoothed_per_million,
            mode = 'lines',
            name = group.index[item] + ' - New Cases',
            # Cycle through the palette if a group has more countries than colors
            line = dict(color=group_colors[item % len(group_colors)])
            ),
            secondary_y=False
            )
        
fig.update_layout(
    title = 'New Cases vs Time:  High (red) and Low (green) Stringency Countries'
    )

fig.update_xaxes(title_text='Date')

fig.update_yaxes(title_text='New Cases per Million', secondary_y=False)
fig.update_yaxes(title_text='Stringency Index', secondary_y=True)

fig.show()

Observations:
- The countries with the highest stringency fared WORSE, and had generally more New Cases per Day throughout 2020, compared to the lower stringency countries.
- One possibility is that other policies beyond those included in the stringency metric may be more impactful in reducing New Cases.  For example, New Zealand is included in the Low Stringency group, but famously locked down its borders more tightly than most other countries.  So while it's listed as "low stringency," it did take other drastic steps to prevent spread.
- Another possibility is that there are other economic and political factors involved that impact High Stringency countries (e.g. type of government, GDP, proximity to other highly impacted countries, etc.) and cause those countries to have higher spread.
- Other visualizations of this data have plotted cases not vs absolute date, but vs "#Days From 100K Cases," or some other relative time point.  This likely makes more sense, since the virus hit countries at different times.
- Future exploration would include looking into filtering results prior to plotting, and also looking at other factors beyond Stringency that may have impacted virus spread. 
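
As a starting point for the relative-time idea, here is a sketch of how a "days since threshold" column might be computed per country. The helper `add_days_since_threshold`, the demo frame, and the threshold of 100 are all hypothetical; it assumes `Date` is already a datetime column and `total_cases` is cumulative:

```python
import pandas as pd

def add_days_since_threshold(frame, threshold):
    """Add days elapsed since each country's total_cases first reached threshold."""
    # First date on which each country met or exceeded the threshold
    reached = frame[frame["total_cases"] >= threshold]
    first_date = reached.groupby("CountryName")["Date"].min()

    out = frame.copy()
    # Negative values are days before the threshold; NaN if never reached
    out["days_since_threshold"] = (
        out["Date"] - out["CountryName"].map(first_date)
    ).dt.days
    return out

# Invented cumulative case counts for two countries
demo_dates = pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"])
demo_cases = pd.DataFrame({
    "Date": list(demo_dates) * 2,
    "CountryName": ["A"] * 4 + ["B"] * 4,
    "total_cases": [50, 100, 150, 200, 0, 0, 100, 300],
})

aligned = add_days_since_threshold(demo_cases, threshold=100)
print(aligned)
```

Plotting against `days_since_threshold` instead of `Date` would then line countries up on a common outbreak clock.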

### References:
1. Mathieu, E., Ritchie, H., Ortiz-Ospina, E. et al. A global database of COVID-19 vaccinations. Nat Hum Behav (2021). https://doi.org/10.1038/s41562-021-01122-8
2. Hannah Ritchie, Edouard Mathieu, Lucas Rodés-Guirao, Cameron Appel, Charlie Giattino, Esteban Ortiz-Ospina, Joe Hasell, Bobbie Macdonald, Diana Beltekian and Max Roser (2020) - "Coronavirus Pandemic (COVID-19)". Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/coronavirus' [Online Resource]
3. Thomas Hale, Noam Angrist, Rafael Goldszmidt, Beatriz Kira, Anna Petherick, Toby Phillips, Samuel Webster, Emily Cameron-Blake, Laura Hallas, Saptarshi Majumdar, and Helen Tatlow. (2021). “A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker).” Nature Human Behaviour. https://doi.org/10.1038/s41562-021-01079-8