# **Group Assignment**
## Covid-19 Data - Exploratory Data Analysis
**Group Participants:**
- Uxía Lojo
- Emiliano Puertas
- María Camila Sanabria
- Joshua Vanderspuy
- Sebastian Zambrano

### **Part II: Exploratory Data Analysis**

## **Introduction**
# Exploratory Data Analysis (EDA): Predicting Death Cases

**Objective**:  
Perform an exploratory data analysis to understand the severity of the virus, the distribution of the cases, and the impact of vaccination on deaths.

**_Key Questions:_**
- **How can demographic factors like age (older vs. younger populations) influence healthcare demand and mortality prediction?**

By analyzing the population ratios, businesses and healthcare organizations can better predict the impact of demographic factors on healthcare needs and adjust their services or policies accordingly, especially in regions with a higher proportion of elderly individuals.
- **What insights can we draw from the relationship between vaccination rates and COVID-19 outcomes (cases and deaths), and how can businesses optimize their responses or interventions?**

By examining the correlation between vaccination efforts and new cases/deaths, businesses and health authorities can refine vaccination strategies and focus on areas with lower vaccination uptake, potentially reducing future health-related disruptions.
- **How do the peaks in new COVID-19 cases and deaths over time help businesses anticipate potential risks and prepare for future disruptions?**

Identifying the periods of sharp increases in cases and deaths can help businesses predict and prepare for similar surges in the future, especially in areas where healthcare systems may be strained during high-demand periods.
- **How can the patterns in mortality rate and case trends across different countries guide global or regional strategies for managing public health crises?**

The comparison of mortality trends and case counts across countries helps identify which regions may be more vulnerable to health crises, guiding businesses and governments to tailor their responses and allocate resources where they’re most needed.

**Notebook Organization:**

[1. Importing Data](#1-importing-data)

[2. Creating new variables](#2-creating-new-variables)
- [2.1. Mortality Rate](#21-mortality-rate)
- [2.2. Cases per million](#cases-per-million)
- [2.3. Population Variables](#4-population-variables)

[3. Visual Analysis](#3-visual-analysis)
- [3.1. Population Ratio by Country](#population-ratios-by-country)
- [3.2. Male to Female Ratio](#population-ratio-by-country-male-to-female-ratio)
- [3.3. Evolution of New Cases](#evolution-of-new-cases)
- [3.4. Evolution of New Deaths](#evolution-of-new-deaths)
- [3.5. Evolution of New Cases and Deaths Over Time by Country](#evolution-of-new-cases-and-deaths-over-time-by-country)
- [3.6. Evolution of Mortality Rate Over Time](#evolution-of-mortality-rate-over-time)
- [3.7. Cases Per Million](#cases-per-million)
- [3.8. New Persons Fully Vaccinated vs. New Cases](#new-persons-fully-vaccinated-vs-new-cases)
- [3.9. New Persons Fully Vaccinated vs. New Deceases](#new-persons-fully-vaccinated-vs-new-deceases)

The structure will be divided into temporal evolution analysis, country comparisons, and the impact of key variables on mortality.

### **_1. Importing Data_**

Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import itertools

Loading our resulting table

In [2]:
macrotable = pd.read_csv("macrotable.csv")
macrotable.head()

Unnamed: 0,week,location_key,new_confirmed,new_deceased,new_hospitalized_patients,new_persons_fully_vaccinated,life_expectancy,population,population_male,population_female,population_age_00_09,population_age_10_19,population_age_20_29,population_age_30_39,population_age_40_49,population_age_50_59,population_age_60_69,population_age_70_79,population_age_80_and_older,country_name
0,2019-12-30/2020-01-05,DE,1.0,0.0,,,,82786787.0,40126479.0,41172726.0,7401202.0,7586334.0,9513883.0,10265460.0,10205383.0,13258896.0,10159451.0,7543815.0,5313340.0,Germany
1,2020-01-13/2020-01-19,DE,1.0,0.0,,,,82786787.0,40126479.0,41172726.0,7401202.0,7586334.0,9513883.0,10265460.0,10205383.0,13258896.0,10159451.0,7543815.0,5313340.0,Germany
2,2020-01-20/2020-01-26,DE,2.0,0.0,,,,82786787.0,40126479.0,41172726.0,7401202.0,7586334.0,9513883.0,10265460.0,10205383.0,13258896.0,10159451.0,7543815.0,5313340.0,Germany
3,2020-01-20/2020-01-26,US,0.0,0.0,,,77.973595,341338766.0,167361582.0,172779253.0,42034674.0,43600405.0,47667343.0,45409648.0,42444067.0,45231261.0,38309532.0,22458802.0,12982024.0,United States of America
4,2020-01-27/2020-02-02,DE,6.0,0.0,,,,82786787.0,40126479.0,41172726.0,7401202.0,7586334.0,9513883.0,10265460.0,10205383.0,13258896.0,10159451.0,7543815.0,5313340.0,Germany


### **_2. Creating new variables_**

#### **2.1. Mortality Rate** 
- `mortality_rate:` The proportion of weekly deceased cases relative to the weekly confirmed cases. This shows the lethality of the disease for each country and time period.

In [3]:
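# Weeks with zero confirmed cases produce inf or NaN; treat both as 0
macrotable['mortality_rate'] = (macrotable['new_deceased'] / macrotable['new_confirmed']).replace([np.inf, -np.inf], np.nan).round(2).fillna(0)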
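One detail worth checking is how this ratio behaves in weeks with zero confirmed cases. The sketch below uses made-up toy rows (not values from `macrotable.csv`) to show that pandas returns `NaN` for 0/0 and `inf` for a positive count divided by zero. Mapping `inf` to `NaN` before `fillna(0)` is one possible way to keep such weeks from distorting the charts; the names `toy`, `raw`, and `safe` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy weekly rows (made-up numbers, not taken from macrotable.csv)
toy = pd.DataFrame({
    "new_confirmed": [100.0, 0.0, 0.0, 50.0],
    "new_deceased": [2.0, 0.0, 3.0, np.nan],
})

# Plain division: 0/0 gives NaN, 3/0 gives inf, NaN/50 gives NaN
raw = toy["new_deceased"] / toy["new_confirmed"]

# Map infinities to NaN, round, then fill every undefined week with 0
safe = raw.replace([np.inf, -np.inf], np.nan).round(2).fillna(0)

print(raw.tolist())
print(safe.tolist())
```

Filling with 0 keeps the column numeric, but it also makes "no data" indistinguishable from "no deaths", so leaving `NaN` in place may be preferable for some analyses.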

#### **2.2. Cases per million**
- `cases_per_million:` The number of new confirmed cases per one million people. This standardizes case counts for better comparison across countries.

In [4]:
macrotable['cases_per_million'] = macrotable['new_confirmed'] / (macrotable['population'] / 1_000_000)

#### **2.3. Population variables**
- `population_older_60:` Total population aged 60 years or older.
- `population_older_60_rate:` The proportion of the population aged 60 years or older relative to the total population.
- `population_younger:` Total population aged 0-19.
- `population_younger_rate:` The proportion of the population aged 0-19 relative to the total population.
- `male_female_ratio:` The ratio of the male population to the female population.

- `population_older_60:`

In [5]:
macrotable['population_older_60'] = macrotable['population_age_60_69'] + macrotable['population_age_70_79'] + macrotable['population_age_80_and_older']

- `population_older_60_rate:`

In [6]:
macrotable['population_older_60_rate'] = round(macrotable['population_older_60'] / macrotable['population'],2)

- `population_younger:`

In [7]:
macrotable['population_younger'] = macrotable['population_age_00_09'] + macrotable['population_age_10_19']

- `population_younger_rate:`

In [8]:
macrotable['population_younger_rate'] = round(macrotable['population_younger'] / macrotable['population'],2)

- `male_female_ratio:`

In [9]:
macrotable['male_female_ratio'] = macrotable['population_male'] / macrotable['population_female']

### **_3. Visual Analysis_**

First, we create an auxiliary column and two auxiliary tables for our graphs.

In [10]:
# Auxiliary column for our graphs: the first date of each "start/end" week string
macrotable['week_start'] = pd.to_datetime(macrotable['week'].str.split('/').str[0])
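As a small illustration of the `week` format, the sketch below parses two of the week strings shown in the `head()` output into start and end dates. The names `week_end` and `span_days` are illustrative and are not used elsewhere in the notebook.

```python
import pandas as pd

# Two week strings copied from the macrotable.head() output
weeks = pd.Series(["2019-12-30/2020-01-05", "2020-01-13/2020-01-19"])

# Split "start/end" into two columns, then parse each as a date
parts = weeks.str.split("/", expand=True)
week_start = pd.to_datetime(parts[0])
week_end = pd.to_datetime(parts[1])

# Number of days between the first and last day of each week
span_days = (week_end - week_start).dt.days

print(week_start.tolist())
print(span_days.tolist())
```

Checking that every span equals 6 days is a cheap way to confirm that all rows describe full calendar weeks.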

In [11]:
auxiliary=macrotable.groupby(by='week_start')[['new_confirmed','new_deceased']].sum().reset_index()

In [12]:
auxiliary2=macrotable[['country_name','population_older_60_rate','population_younger_rate', 'male_female_ratio']].groupby(by='country_name').mean().reset_index()
auxiliary2[['population_older_60_rate', 'population_younger_rate', 'male_female_ratio']] = auxiliary2[['population_older_60_rate', 'population_younger_rate', 'male_female_ratio']] * 100

### **Population Ratios by Country**
To start our exploratory data analysis, we examine the age structure of each country's population. Specifically, we compare the share of older adults (aged 60+) with the share of younger individuals (aged 0-19). This comparison matters because age demographics play a pivotal role in healthcare demand and mortality risk. Understanding these patterns helps contextualize the challenges faced by different countries and identify which populations may be at higher risk.

In [13]:
fig=px.bar(
    auxiliary2,
    x='country_name',
    y=['population_older_60_rate','population_younger_rate'],
    barmode='group',
    color='variable',
    category_orders={'variable': ['population_older_60_rate', 'population_younger_rate']}, 
    color_discrete_map={
        'population_older_60_rate': '#0A8754', 
        'population_younger_rate': '#508CA4'
    },
    title= 'Population Ratio by Country',
    labels={'variable': 'Variable','country_name':'Country' },
    text_auto=True
)


fig.for_each_trace(lambda t: t.update(name='Aged 60+' if t.name == 'population_older_60_rate' else 'Aged 0-19'))
fig.update_traces(texttemplate='%{y:.2f}%')

fig.update_layout(
    paper_bgcolor='white',
    plot_bgcolor='white',
    xaxis_title='Country',
    yaxis_title=None,
    legend_title=None,
    font=dict(
        family='Arial',
        size=12,
        color='black'
    ),
    title={
        'font': {
            'size': 20,
            'weight': 'bold'
            },
        'x': 0.5
    },
    legend=dict(
        orientation="h",
        x=0.4,
        y=1.1)
)

fig.show()

From this first view, we observe that Germany, Spain, and Italy have a higher share of older adults (aged 60+) relative to younger individuals (aged 0-19) than the USA. This demographic structure could matter for understanding healthcare demand and predicting deaths, since a larger proportion of older people may correlate with greater vulnerability to certain health risks.

### **Population Ratio by Country (Male to Female Ratio)**

Next, we analyze the male-to-female ratio across the four countries to understand gender distribution patterns. Gender demographics can provide context for healthcare demands, as certain health conditions and mortality risks may vary by gender. This analysis serves as a foundational step to assess whether gender imbalances could impact our predictions for death cases.

In [14]:
fig=px.bar(
    auxiliary2,
    x='country_name',
    y='male_female_ratio',
    title='Male-to-Female Ratio by Country',
    labels={'male_female_ratio': 'Male-to-Female Ratio', 'country_name': 'Country'},
    text_auto=True
)
fig.update_traces(marker=dict(color='#E4D9FF'),
    texttemplate='%{y:.2f}%')

fig.update_layout(
    paper_bgcolor='white',
    plot_bgcolor='white',
    xaxis_title='Country',
    yaxis_title=None,
    legend_title=None,
    font=dict(
        family='Arial',
        size=12,
        color='black'
    ),
    title={
        'font': {
            'size': 20,
            'weight': 'bold'
            },
        'x': 0.5
    },
    legend=dict(
        orientation="h",
        x=0.4,
        y=1.1)
)

fig.show()

The analysis shows that all four countries have slightly more females than males, with a male-to-female ratio close to 1 in every case (the chart displays the ratio multiplied by 100). Although the gender distribution is nearly balanced, it is worth considering how such slight variations might influence healthcare outcomes.

#### **Evolution of New Cases**
We now examine the general evolution of new cases over time, aggregating data across all four countries. This analysis provides a high-level view of the pandemic's progression and helps identify key periods of increased transmission. Recognizing these trends is essential to understanding how the spread of cases may correlate with mortality rates and other variables.

In [15]:
fig1=px.line(
    auxiliary, #data
    x='week_start',
    y=['new_confirmed'],
    title='Evolution of New Cases Over Time',
     labels={'week_start': 'Date'}
)

fig1.for_each_trace(lambda t: t.update(name='New Cases' if t.name == 'new_confirmed' else 'New Deaths'))

fig1.update_layout(
    paper_bgcolor='white',
    plot_bgcolor='white',
    xaxis_title='Date',
    yaxis_title=None,
    legend_title=None,
    font=dict(
        family='Arial',
        size=12,
        color='black'
    ),
    title={
        'font': {
            'size': 20,
            'weight': 'bold'
            },
        'x': 0.5
    },
    legend=dict(
        orientation="h",
        x=0.4,
        y=1.1)
)

fig1.show()

The graph shows a slight peak in new cases around January 2021, followed by a dramatic surge in January 2022. This significant increase suggests a major wave during early 2022, likely influenced by factors such as the Christmas and New Year's festivities, seasonal effects, new variants, or changes in public health measures. These findings highlight critical moments where healthcare systems may have faced heightened strain, emphasizing the importance of preparedness during peak periods.
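Peaks like these can also be located programmatically instead of being read off the chart. The following is a minimal sketch on made-up weekly counts (not the notebook's data), assuming a simple rule in which a week counts as a peak when it exceeds both neighbouring weeks; `is_peak` and `peaks` are hypothetical names.

```python
import pandas as pd

# Toy weekly series (made-up values, for illustration only)
toy = pd.DataFrame({
    "week_start": pd.date_range("2021-01-04", periods=8, freq="7D"),
    "new_confirmed": [10, 40, 25, 20, 30, 90, 60, 15],
})

s = toy["new_confirmed"]

# A week is a local peak if it is higher than the week before and the week after.
# The first and last weeks compare against NaN and are therefore never flagged.
is_peak = (s > s.shift(1)) & (s > s.shift(-1))

# Largest peaks first
peaks = toy.loc[is_peak].sort_values("new_confirmed", ascending=False)
print(peaks)
```

On real data this rule will flag many small fluctuations, so smoothing the series or requiring a minimum height before flagging a peak would probably be needed.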

#### **Evolution of New Deaths**
Next, we analyze the general evolution of deaths due to COVID-19 over time, aggregated across all four countries. Understanding the timeline of death peaks helps us assess the impact of the pandemic during its most critical phases. This analysis is crucial for identifying patterns and key moments when healthcare systems may have faced the greatest challenges.

In [16]:
fig2=px.line(
    auxiliary, #data
    x='week_start',
    y=['new_deceased'],
    title='Evolution of New Deaths Over Time',
     labels={'week_start': 'Date'}
)

fig2.for_each_trace(lambda t: t.update(name='New Cases' if t.name == 'new_confirmed' else 'New Deaths'))

fig2.update_layout(
    paper_bgcolor='white',
    plot_bgcolor='white',
    xaxis_title='Date',
    yaxis_title=None,
    legend_title=None,
    font=dict(
        family='Arial',
        size=12,
        color='black'
    ),
    title={
        'font': {
            'size': 20,
            'weight': 'bold'
            },
        'x': 0.5
    },
    legend=dict(
        orientation="h",
        x=0.4,
        y=1.1)
)

fig2.show()

The graph reveals distinct peaks in deaths during April 2020, January 2021, September 2021, and January 2022, followed by a declining trend towards the end of the period. These peaks align with known waves of the pandemic, possibly driven by surges in cases, the emergence of new variants, or delays in vaccination rollout. The decline at the end suggests the impact of widespread vaccination and improved healthcare responses, marking a potential turning point in managing the pandemic's severity.

### **Evolution of New Cases and Deaths Over Time by Country**
To gain deeper insights, we examine the evolution of new COVID-19 cases and deaths by country. Breaking the data down by country allows us to compare the trajectories and identify any significant differences between them. This analysis helps contextualize how factors such as population size, healthcare infrastructure, and pandemic response might have influenced case and death trends.

In [17]:
fig3=px.line(
    macrotable, #data
    x='week_start',
    y=['new_confirmed','new_deceased'],
    title='Evolution of New Cases and Deaths Over Time by Country',
    facet_col='country_name',
    labels={'week_start': 'Date','country_name':'Country' }
)

fig3.for_each_annotation(lambda a: a.update(text=a.text.split('=')[1]))
fig3.for_each_trace(lambda t: t.update(name='New Cases' if t.name == 'new_confirmed' else 'New Deaths'))

fig3.update_layout(
    paper_bgcolor='white',
    plot_bgcolor='white',
    xaxis_title='Date',
    yaxis_title=None,
    legend_title=None,
    font=dict(
        family='Arial',
        size=12,
        color='black'
    ),
    title={
        'font': {
            'size': 20,
            'weight': 'bold'
            },
        'x': 0.5,
    },
    legend=dict(
        orientation="h",
        x=0.4,
        y=1.17)
)
fig3.show()

The graph highlights that the USA exhibits significantly larger peaks in both new cases and deaths compared to Germany, Spain, and Italy. This disparity is likely due to the USA's substantially higher population, which naturally results in a greater absolute number of cases and deaths. However, further analysis adjusting for population size could provide more nuanced insights into the relative severity of the pandemic across these countries.
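The population adjustment mentioned above can be sketched for deaths in the same way the notebook computes `cases_per_million`. The column `deaths_per_million` is hypothetical (it is not created in this notebook), and the two rows are made-up numbers chosen only to show how the ranking can flip after normalization.

```python
import pandas as pd

# Toy rows (made-up values): country A is ten times larger than country B
toy = pd.DataFrame({
    "country_name": ["A", "B"],
    "new_deceased": [500.0, 100.0],
    "population": [100_000_000.0, 10_000_000.0],
})

# Same normalization the notebook applies to new_confirmed
toy["deaths_per_million"] = toy["new_deceased"] / (toy["population"] / 1_000_000)

print(toy[["country_name", "new_deceased", "deaths_per_million"]])
```

Country A has more deaths in absolute terms, yet country B has the higher rate per million, which is the kind of difference the absolute charts cannot show.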

#### **Evolution of Mortality Rate Over Time**
We now analyze the mortality rate over time for each country. Mortality rate, defined here as the ratio of weekly deaths to weekly confirmed cases (strictly speaking, a case fatality ratio rather than a population mortality rate), provides an essential perspective on the severity of the pandemic and the effectiveness of healthcare responses. By examining its evolution, we can identify key moments where the risk of death was highest and assess how this metric has changed over time.

In [18]:
fig4=px.line(
    macrotable, #data
    x='week_start',
    y='mortality_rate',
    title='Mortality Rate by Country',
    facet_col='country_name',
    labels={'mortality_rate': 'Mortality Rate', 'week_start': 'Date', 'country_name' : 'Country', 'population_older_60_rate': 'Older People Ratio'}
)

fig4.for_each_annotation(lambda a: a.update(text=a.text.split('=')[1]))


fig4.update_layout(
    paper_bgcolor='white',
    plot_bgcolor='white',
    xaxis_title='Date',
    yaxis_title=None,
    legend_title=None,
    font=dict(
        family='Arial',
        size=12,
        color='black'
    ),
    title={
        'font': {
            'size': 20,
            'weight': 'bold'
            },
        'x': 0.5,
    },
    legend=dict(
        orientation="h",
        x=0.4,
        y=1.17)
)
fig4.show()

The graph shows a general decline in mortality rates over time across Germany and the US, with several noticeable peaks. These peaks may correspond to periods of healthcare system strain, the emergence of more severe variants, or delays in medical interventions. The overall downward trend likely reflects improvements in treatment protocols, increased vaccination rates, and broader testing, which captured more mild cases, thereby reducing the apparent mortality rate.
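Some of the spikes in a weekly ratio can simply reflect weeks with very few confirmed cases. One way to check this, sketched below on made-up numbers for a single country, is to pool deaths and cases over a rolling window before dividing. The window length of 2 and the names `weekly` and `pooled` are assumptions for illustration; with the real table the calculation would have to be done per country.

```python
import pandas as pd

# Toy weekly counts for one country (made-up values)
toy = pd.DataFrame({
    "new_confirmed": [10.0, 1000.0, 2000.0, 1000.0],
    "new_deceased": [2.0, 10.0, 20.0, 10.0],
})

# Week-by-week ratio: the first week looks extreme because it has only 10 cases
weekly = toy["new_deceased"] / toy["new_confirmed"]

# Pooled ratio: sum deaths and cases over a rolling window, then divide
window = 2
pooled = toy["new_deceased"].rolling(window).sum() / toy["new_confirmed"].rolling(window).sum()

print(pd.DataFrame({"weekly": weekly, "pooled": pooled}))
```

The pooled series is undefined for the first `window - 1` weeks and weights each week by its case count, so low-volume weeks no longer dominate the picture.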

#### **Cases per Million**
To adjust for population differences, we analyze COVID-19 cases per million people, with a separate line representing each country. This metric allows us to compare the pandemic's impact more equitably across nations, providing a clearer picture of relative case surges regardless of population size.

In [19]:
fig5=px.line(
    macrotable, #data
    x='week_start',
    y='cases_per_million',
    title='New Cases per Million',
    color='country_name',
    labels={'week_start': 'Date'}
)


fig5.update_layout(
    paper_bgcolor='white',
    plot_bgcolor='white',
    xaxis_title='Date',
    yaxis_title=None,
    legend_title=None,
    font=dict(
        family='Arial',
        size=12,
        color='black'
    ),
    title={
        'font': {
            'size': 20,
            'weight': 'bold'
            },
        'x': 0.5,
    },
    legend=dict(
        orientation="h",
        x=0.345,
        y=1.17)
)
fig5.show()

The graph reveals that all countries exhibit very similar trends in case peaks over time, with surges occurring at roughly the same periods. This synchronicity suggests that global factors, such as the emergence of new variants, seasonal effects, or shared policy responses, played a significant role in driving the waves of infections across these nations. Despite differences in population size, the timing of the peaks underscores the interconnected nature of the pandemic.
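The visual impression of synchronized waves could be quantified with pairwise correlations between the countries' weekly series. The sketch below assumes a long table with the notebook's column names but uses made-up values for two placeholder countries; `wide` and `corr` are illustrative names.

```python
import pandas as pd

# Toy long-format table (made-up values); country B is always half of country A
toy = pd.DataFrame({
    "week_start": pd.to_datetime(["2022-01-03", "2022-01-10", "2022-01-17"] * 2),
    "country_name": ["A"] * 3 + ["B"] * 3,
    "cases_per_million": [100.0, 300.0, 200.0, 50.0, 150.0, 100.0],
})

# One column per country, one row per week
wide = toy.pivot(index="week_start", columns="country_name", values="cases_per_million")

# Pearson correlation between every pair of countries
corr = wide.corr()
print(corr)
```

A correlation near 1 indicates that two countries rise and fall together, regardless of the absolute level of their curves. It says nothing about why they move together.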

#### **New Persons Fully Vaccinated vs. New Cases**
To explore the relationship between vaccination efforts and the spread of COVID-19, we use a scatter plot comparing the number of new persons fully vaccinated with the number of new cases in the United States, the only country in our table with vaccination data. This visualization aims to uncover patterns that might indicate the effectiveness of vaccination campaigns in controlling the spread of the virus.

In [20]:
fig7 = px.scatter(
    macrotable, 
    x='new_persons_fully_vaccinated',
    y='new_confirmed',
    hover_name='country_name',
    facet_col='country_name',
    title='New Persons Fully Vaccinated vs. New Cases',
    labels={'new_persons_fully_vaccinated': 'New People Vaccinated', 'new_confirmed': 'New Cases', 'country_name' : 'Country'}
)

fig7.for_each_annotation(lambda a: a.update(text=a.text.split('=')[1]))

fig7.update_layout(
    paper_bgcolor='white',
    plot_bgcolor='white',
    xaxis_title='New People Vaccinated',
    yaxis_title=None,
    legend_title=None,
    font=dict(
        family='Arial',
        size=12,
        color='black'
    ),
    title={
        'font': {
            'size': 20,
            'weight': 'bold'
            },
        'x': 0.5,
    },
    legend=dict(
        orientation="h",
        x=0.345,
        y=1.17)
)

fig7.show()

The scatter plot shows a high concentration of data points where new cases are numerous but the number of newly vaccinated individuals is low. This suggests a possible negative association between weekly vaccinations and case counts. However, the x-axis measures people newly vaccinated each week rather than cumulative coverage, and other factors (variants, seasonality, testing volume) may also contribute, so the plot is consistent with, but does not by itself demonstrate, a protective effect of vaccination.
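To put a number on the pattern seen in the scatter plot, one option is a correlation coefficient computed only on weeks where vaccination data exists. The sketch below runs on made-up values (not the notebook's data); the rank-based variant is included as an assumption-light alternative to Pearson, and `with_data`, `pearson`, and `rank_corr` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); the first two weeks have no vaccination data
toy = pd.DataFrame({
    "new_persons_fully_vaccinated": [np.nan, np.nan, 100.0, 200.0, 300.0, 400.0],
    "new_confirmed": [500.0, 800.0, 400.0, 300.0, 200.0, 100.0],
})

# Keep only weeks where both variables are observed
with_data = toy.dropna(subset=["new_persons_fully_vaccinated", "new_confirmed"])

x = with_data["new_persons_fully_vaccinated"]
y = with_data["new_confirmed"]

# Pearson correlation (linear association)
pearson = x.corr(y)

# Rank correlation: Pearson applied to ranks, less sensitive to outliers
rank_corr = x.rank().corr(y.rank())

print(len(with_data), pearson, rank_corr)
```

Both coefficients describe association only. Because the weekly series are time-ordered, a shared trend can produce a strong correlation without any causal link.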

#### **New Persons Fully Vaccinated vs. New Deceases**
Finally, we analyze the relationship between the number of new persons fully vaccinated and new deaths in the United States. This scatter plot provides further insight into how vaccination efforts may correlate with mortality, helping to assess whether higher vaccination rates could be linked to a reduction in deaths.

In [21]:

fig6 = px.scatter(
    macrotable, 
    x='new_persons_fully_vaccinated',
    y='new_deceased',
    hover_name='country_name',
    facet_col='country_name',
    title='New Persons Fully Vaccinated vs. New Deaths',
    labels={'new_persons_fully_vaccinated': 'New People Vaccinated', 'new_deceased': 'New Deaths', 'country_name' : 'Country'}
)

fig6.for_each_annotation(lambda a: a.update(text=a.text.split('=')[1]))

fig6.update_layout(
    paper_bgcolor='white',
    plot_bgcolor='white',
    xaxis_title='New People Vaccinated',
    yaxis_title=None,
    legend_title=None,
    font=dict(
        family='Arial',
        size=12,
        color='black'
    ),
    title={
        'font': {
            'size': 20,
            'weight': 'bold'
            },
        'x': 0.5,
    },
    legend=dict(
        orientation="h",
        x=0.345,
        y=1.17)
)

fig6.show()


Similar to the previous analysis of new cases, the scatter plot shows a concentration of data points where new deaths are higher and weekly vaccinations are relatively low. Since the data covers only the United States, this means that weeks with fewer newly vaccinated individuals tend to coincide with more deaths, which is consistent with vaccination playing a role in reducing mortality. However, other factors may also influence this trend, highlighting the need for a more comprehensive analysis.
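Because `new_persons_fully_vaccinated` counts people newly vaccinated each week, a cumulative coverage measure may be a more natural variable to compare against deaths. The sketch below builds one on made-up numbers for a placeholder country; `cumulative_fully_vaccinated` and `vaccinated_share` are hypothetical columns that do not exist in the notebook.

```python
import pandas as pd

# Toy weekly rows for one placeholder country (made-up values)
toy = pd.DataFrame({
    "country_name": ["A"] * 4,
    "week_start": pd.date_range("2021-01-04", periods=4, freq="7D"),
    "new_persons_fully_vaccinated": [0.0, 1_000.0, 3_000.0, 6_000.0],
    "population": [100_000.0] * 4,
})

# Make sure weeks are in chronological order within each country
toy = toy.sort_values(["country_name", "week_start"])

# Running total of fully vaccinated people per country
toy["cumulative_fully_vaccinated"] = (
    toy.groupby("country_name")["new_persons_fully_vaccinated"].cumsum()
)

# Share of the population fully vaccinated by each week
toy["vaccinated_share"] = toy["cumulative_fully_vaccinated"] / toy["population"]

print(toy[["week_start", "cumulative_fully_vaccinated", "vaccinated_share"]])
```

Plotting `vaccinated_share` against `new_deceased` would compare deaths with the level of protection in the population rather than with the pace of the campaign in a given week.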

### **Conclusions**
1. Demographics and Mortality

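**Insight**: Germany, Spain, and Italy have larger shares of people aged 60+, which may imply higher mortality risk. The male-to-female ratio is close to 1 in all four countries, so gender does not appear to be a relevant differentiator.\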
**Action**: Focus healthcare resources on the elderly.

2. Case Surges

**Insight**: Major peaks (Jan 2021, Jan 2022) stress healthcare systems.\
**Action**: Prepare for surge periods with scalable healthcare systems.

3. Mortality Trends

**Insight**: Declining mortality rates, with occasional spikes.\
**Action**: Swift mobilization of resources during peaks is crucial.

4. Vaccination Impact

**Insight**: In the US data, weeks with fewer newly vaccinated people tend to coincide with more deaths.\
**Action**: Prioritize vaccination efforts to reduce deaths.

5. Synchronized Trends

**Insight**: COVID-19 case peaks occur at similar times across the four countries.\
**Action**: Use global trends to plan international responses.

6. Future Preparedness

**Insight**: Future surges are likely.\
**Action**: Prepare with stronger healthcare systems and vaccination plans.
