# Global COVID-19 Analysis

This notebook provides a comprehensive analysis of the global impact of COVID-19 using country- and continent-level data. The analysis includes data cleaning, feature engineering, descriptive statistics, ranking analysis, continental comparisons, correlation analysis, and visualizations.

## 1. Libraries and Configuration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

plt.style.use('default')
sns.set_context('notebook')

## 2. Data Loading
The dataset is loaded from the data directory.

In [None]:
DATA_PATH = 'data/covid_19.csv'
df = pd.read_csv(DATA_PATH)
df.head()

## 3. Data Cleaning & Preprocessing
This step handles missing values, standardizes column names, and ensures correct data types.

In [None]:
df = df.rename(columns={
    'Cases': 'Confirmed',
    'population': 'Population',
    'country': 'Country',
    'continent': 'Continent'
})

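numeric_cols = ['Confirmed', 'Deaths', 'Recovered', 'Tests', 'Population']
# Convert to numeric first so that unparseable entries become NaN before imputation.
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Missing deaths and recoveries are treated as zero; missing test counts use the median.
df['Recovered'] = df['Recovered'].fillna(0)
df['Deaths'] = df['Deaths'].fillna(0)
df['Tests'] = df['Tests'].fillna(df['Tests'].median())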
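
The order of type coercion and imputation matters in this step. The sketch below uses a hypothetical toy column (not the real dataset) to show why coercing before filling is the safer order: the median is then computed over actual numbers, and no `NaN` is reintroduced afterwards.

```python
import pandas as pd

# Hypothetical toy data: 'Tests' arrives as text, with one unparseable entry.
toy = pd.DataFrame({'Tests': ['100', '300', 'n/a', None]})

# Coerce first: unparseable strings and missing entries both become NaN.
tests_numeric = pd.to_numeric(toy['Tests'], errors='coerce')

# Impute afterwards, so the median is taken over the two valid values (100 and 300).
tests_filled = tests_numeric.fillna(tests_numeric.median())

print(tests_filled.tolist())  # [100.0, 300.0, 200.0, 200.0]
```

Filling before coercing would leave the `'n/a'` entry in place, and the later conversion would turn it back into a missing value.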

## 4. Feature Engineering
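Four indicators are derived: fatality rate and recovery rate (each as a fraction of confirmed cases), and cases and deaths per million inhabitants. Rows with a zero denominator are set to NaN.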

In [None]:
df['Fatality_Rate'] = np.where(df['Confirmed'] > 0, df['Deaths'] / df['Confirmed'], np.nan)
df['Recovery_Rate'] = np.where(df['Confirmed'] > 0, df['Recovered'] / df['Confirmed'], np.nan)
df['Cases_per_Million'] = np.where(df['Population'] > 0,
                                      (df['Confirmed'] / df['Population']) * 1_000_000,
                                      np.nan)
df['Deaths_per_Million'] = np.where(df['Population'] > 0,
                                    (df['Deaths'] / df['Population']) * 1_000_000,
                                    np.nan)
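
As a worked example of these formulas, the sketch below applies them to two hypothetical rows (the numbers are invented for illustration), one of which has no confirmed cases.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second has zero confirmed cases.
toy = pd.DataFrame({
    'Confirmed': [2_000, 0],
    'Deaths': [50, 0],
    'Population': [4_000_000, 1_000_000],
})

# Same guarded-division pattern as above: NaN where the denominator is not positive.
toy['Fatality_Rate'] = np.where(toy['Confirmed'] > 0,
                                toy['Deaths'] / toy['Confirmed'],
                                np.nan)
toy['Cases_per_Million'] = np.where(toy['Population'] > 0,
                                    (toy['Confirmed'] / toy['Population']) * 1_000_000,
                                    np.nan)
toy['Deaths_per_Million'] = np.where(toy['Population'] > 0,
                                     (toy['Deaths'] / toy['Population']) * 1_000_000,
                                     np.nan)

print(toy)
```

For the first row, 50 deaths out of 2,000 cases gives a fatality rate of 0.025, and 2,000 cases in a population of 4 million gives about 500 cases per million. The second row gets a NaN fatality rate rather than a misleading zero.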

## 5. Descriptive Statistics
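This section reports global totals for cases, deaths, recoveries, and tests, followed by the mean and median of the row-level fatality and recovery rates.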

In [None]:
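# Exclude aggregate rows ('All' and continent totals) so that totals are not double-counted.
aggregate_rows = ['All', 'Europe', 'Asia', 'North-America', 'South-America', 'Africa', 'Oceania']
global_totals = df.loc[~df['Country'].isin(aggregate_rows), ['Confirmed', 'Deaths', 'Recovered', 'Tests']].sum()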
global_totals

avg_fatality_rate = df['Fatality_Rate'].mean()
avg_recovery_rate = df['Recovery_Rate'].mean()

avg_fatality_rate, avg_recovery_rate

median_fatality_rate = df['Fatality_Rate'].median()
median_recovery_rate = df['Recovery_Rate'].median()

median_fatality_rate, median_recovery_rate

summary_stats = pd.DataFrame({
    'Metric': ['Fatality Rate', 'Recovery Rate'],
    'Mean': [avg_fatality_rate, avg_recovery_rate],
    'Median': [median_fatality_rate, median_recovery_rate]
})

summary_stats
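
The mean of per-row rates weights every row equally, regardless of case count, so it can differ substantially from the pooled rate (total deaths divided by total cases). A minimal sketch with two hypothetical countries makes the difference visible; `mean_of_rates` and `pooled_rate` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical data: one large outbreak with a low rate, one small with a high rate.
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    'Confirmed': [1_000_000, 100],
    'Deaths': [10_000, 10],
})
toy['Fatality_Rate'] = toy['Deaths'] / toy['Confirmed']

# Unweighted mean of the two rates: (0.01 + 0.10) / 2
mean_of_rates = toy['Fatality_Rate'].mean()

# Pooled rate: every case counts equally, so country A dominates.
pooled_rate = toy['Deaths'].sum() / toy['Confirmed'].sum()

print(f"mean of rates: {mean_of_rates:.4f}, pooled rate: {pooled_rate:.4f}")
```

Here the unweighted mean is 0.055 while the pooled rate is roughly 0.010, so it is worth stating which of the two a "global" rate refers to.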

## 6. Ranking Analysis
Top 10 countries are identified based on confirmed cases, deaths, and testing volume.

In [None]:
exclude_entities = ['All', 'Europe', 'Asia', 'North-America', 'South-America', 'Africa', 'Oceania']
df_countries = df[~df['Country'].isin(exclude_entities)]

top10_cases = df_countries.groupby('Country')['Confirmed'].max().sort_values(ascending=False).head(10)
top10_deaths = df_countries.groupby('Country')['Deaths'].max().sort_values(ascending=False).head(10)
top10_tests = df_countries.groupby('Country')['Tests'].max().sort_values(ascending=False).head(10)

top10_cases, top10_deaths, top10_tests
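
The two ingredients of this ranking, dropping aggregate rows and taking the per-country maximum, can be checked on a small hypothetical table. The sketch assumes, as the `groupby(...).max()` above suggests, that a country may appear in more than one row of cumulative counts.

```python
import pandas as pd

# Hypothetical table: two aggregate rows and a country ('X') that appears twice.
toy = pd.DataFrame({
    'Country': ['All', 'Europe', 'X', 'X', 'Y'],
    'Confirmed': [900, 500, 100, 300, 200],
})

aggregates = ['All', 'Europe']
countries = toy[~toy['Country'].isin(aggregates)]

# With cumulative counts, the maximum per country is its latest total.
ranking = (
    countries.groupby('Country')['Confirmed']
    .max()
    .sort_values(ascending=False)
)

print(ranking)
```

Without the exclusion step, 'All' and 'Europe' would occupy the top of the ranking; without the `max()`, 'X' would be counted from two separate rows.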

## 7. Continental Comparison
COVID-19 impact is compared across continents.

In [None]:
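# Start from df_countries so that aggregate rows (e.g. 'Europe') are not summed with their member countries.
continent_df = df_countries[df_countries['Continent'].notna() & (df_countries['Continent'] != 'All')]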

continent_summary = continent_df.groupby('Continent')[['Confirmed', 'Deaths']].sum().reset_index()
continent_summary['Fatality_Rate'] = continent_summary['Deaths'] / continent_summary['Confirmed']
continent_summary

continent_cases = (
    continent_summary
    .set_index('Continent')['Confirmed']
    .sort_values(ascending=False)
)

continent_cases

## 8. Correlation Analysis
Relationships between testing, population size, and confirmed cases are examined.

In [None]:
country_level = df_countries.groupby('Country')[['Confirmed', 'Tests', 'Population']].max().reset_index()

country_level[['Tests', 'Confirmed']].corr(), country_level[['Population', 'Confirmed']].corr()
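
Country-level counts typically span several orders of magnitude, so a Pearson coefficient can be dominated by a few very large countries. As a complementary check, a rank-based (Spearman) correlation can be obtained by ranking both columns and then applying the default Pearson method. The data below are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Hypothetical, heavily skewed data: 'Confirmed' rises monotonically with 'Tests'.
toy = pd.DataFrame({
    'Tests': [1_000, 20_000, 300_000, 5_000_000],
    'Confirmed': [50, 400, 9_000, 2_000_000],
})

# Pearson correlation on the raw values (sensitive to the largest country).
pearson = toy['Tests'].corr(toy['Confirmed'])

# Spearman correlation: Pearson correlation of the ranks.
spearman = toy['Tests'].rank().corr(toy['Confirmed'].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Because the toy relationship is perfectly monotonic, the rank correlation reaches its maximum, while the Pearson value stays below it. Neither coefficient says anything about causation.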

## 9. Visualizations
Key findings are visualized using bar charts, scatter plots, and pie charts.

In [None]:
os.makedirs('figures', exist_ok=True)

plt.figure(figsize=(10,6))
plt.barh(top10_cases.index, top10_cases.values)
plt.gca().invert_yaxis()  # show the largest value at the top
plt.xlabel('Confirmed cases')
plt.title('Top 10 Countries by Confirmed Cases')
plt.tight_layout()
plt.savefig('figures/top10_confirmed_countries.png', dpi=300)
plt.show()

plt.figure(figsize=(8, 8))
plt.pie(
    continent_cases,
    labels=continent_cases.index,
    autopct='%1.1f%%',
    startangle=140
)
plt.title("Global Case Distribution by Continent")
plt.show()

plt.figure(figsize=(8, 6))
plt.scatter(country_level['Tests'], country_level['Confirmed'])
plt.xlabel("Number of tests")
plt.ylabel("Number of cases")
plt.title("The Relationship Between the Number of Tests and the Number of Detected Cases")
plt.show()


plt.figure(figsize=(8, 6))
plt.scatter(country_level['Population'], country_level['Confirmed'])
plt.xlabel("Population")
plt.ylabel("Number of cases")
plt.title("The Relationship Between Population Size and Case Numbers")
plt.show()


## 10. Conclusion
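The analysis highlights strong cross-country and cross-continental differences in COVID-19 outcomes. Testing capacity, healthcare infrastructure, and regional policies are plausible contributors, although only testing volume and population size are examined directly here, and the reported correlations do not establish causation.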