# COVID-19 Data Analysis — COVID vs Happiness

This notebook loads the provided COVID-19 datasets (confirmed & deaths) and the Worldwide Happiness Report, performs cleaning and exploratory data analysis (EDA), merges the datasets, and explores relationships between COVID-19 impact and happiness indicators (Score, GDP per capita, Healthy life expectancy, Social support, etc.).

Notes:
- Place the CSV files in the same folder as this notebook (covid19_Confirmed_dataset.csv, covid19_deaths_dataset.csv, worldwide_happiness_report.csv).
- The notebook uses simple heuristics (grouping, fuzzy-matching) to merge country names; manual adjustments may be needed for certain names.


In [ ]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from difflib import get_close_matches
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


In [ ]:
# Load datasets
cov_deaths = pd.read_csv('covid19_deaths_dataset.csv')
cov_confirmed = pd.read_csv('covid19_Confirmed_dataset.csv')
happiness = pd.read_csv('worldwide_happiness_report.csv')

print('deaths shape:', cov_deaths.shape)
print('confirmed shape:', cov_confirmed.shape)
print('happiness shape:', happiness.shape)


## Inspect the COVID data format
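The COVID files are in 'wide' format: a few metadata columns (Province/State, Country/Region, Lat, Long) followed by one column per date. We'll aggregate the rows to country level and keep the totals from the latest available date.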
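As a minimal sketch of that transformation, the toy frame below mimics the wide layout and reduces it to one latest total per country. The rows and numbers are invented for illustration and do not come from the provided CSV files.

```python
import pandas as pd

# Toy wide-format frame; all values are made up for illustration only.
toy = pd.DataFrame({
    'Province/State': ['A', 'B', None],
    'Country/Region': ['X', 'X', 'Y'],
    'Lat': [0.0, 1.0, 2.0],
    'Long': [0.0, 1.0, 2.0],
    '1/22/20': [1, 2, 0],
    '1/23/20': [3, 5, 4],
})

# Everything that is not a metadata column is treated as a date column.
meta_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
date_cols = [c for c in toy.columns if c not in meta_cols]
latest = date_cols[-1]

# Sum provinces within each country, keeping only the latest date.
toy_latest = toy.groupby('Country/Region')[latest].sum()
print(toy_latest)
```

Country X has two province rows, so its latest total is the sum of both.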

In [ ]:
# helper: return all date columns of a wide-format COVID dataframe
def get_date_cols(df):
    # Date columns are named like '1/22/20' and normally follow the metadata
    # columns (Province/State, Country/Region, Lat, Long).
    date_cols = []
    for c in df.columns:
        try:
            datetime.strptime(str(c), '%m/%d/%y')
            date_cols.append(c)
        except ValueError:
            # not a month/day/year column name; skip it
            continue
    if not date_cols:
        raise ValueError('No date-like (m/d/yy) columns found')
    return date_cols

date_cols_deaths = get_date_cols(cov_deaths)
date_cols_conf = get_date_cols(cov_confirmed)
print('deaths date columns sample:', date_cols_deaths[:3], '... last:', date_cols_deaths[-1])
print('confirmed date columns sample:', date_cols_conf[:3], '... last:', date_cols_conf[-1])


In [ ]:
# Aggregate to country-level using the latest available date
latest_death_col = date_cols_deaths[-1]
latest_conf_col = date_cols_conf[-1]

deaths_by_country = cov_deaths.groupby('Country/Region')[date_cols_deaths].sum().reset_index()
confirmed_by_country = cov_confirmed.groupby('Country/Region')[date_cols_conf].sum().reset_index()

deaths_latest = deaths_by_country[['Country/Region', latest_death_col]].rename(columns={latest_death_col: 'Deaths_latest'})
conf_latest = confirmed_by_country[['Country/Region', latest_conf_col]].rename(columns={latest_conf_col: 'Confirmed_latest'})

cov_latest = pd.merge(conf_latest, deaths_latest, on='Country/Region', how='outer')
cov_latest['Confirmed_latest'] = cov_latest['Confirmed_latest'].fillna(0).astype(int)
cov_latest['Deaths_latest'] = cov_latest['Deaths_latest'].fillna(0).astype(int)
cov_latest['CFR'] = np.where(cov_latest['Confirmed_latest']>0, cov_latest['Deaths_latest'] / cov_latest['Confirmed_latest'], np.nan)
cov_latest.sort_values('Confirmed_latest', ascending=False).head()


## Prepare happiness dataframe
- Strip whitespace from column names and rename the key columns


In [ ]:
h = happiness.copy()
# Standardize column names
h.columns = [c.strip() for c in h.columns]
h = h.rename(columns={'Country or region': 'Country', 'Overall rank': 'Rank'})
h.head()


## Merge datasets (fuzzy country matching)
We'll try direct matches first, then use difflib.get_close_matches for unmatched countries.
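To see why fuzzy matching needs a manual safety net, here is a small sketch using `difflib.get_close_matches` with the same `cutoff=0.8`. The candidate list and the `closest` helper are hypothetical and only serve the illustration; they are not the names in the happiness file.

```python
from difflib import get_close_matches

# Hypothetical candidate list standing in for the happiness country names.
happiness_names = ['United States', 'South Korea', 'Czech Republic', 'Nigeria']

def closest(name, candidates, cutoff=0.8):
    """Return the best fuzzy match for name, or None if nothing passes the cutoff."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for name in ['US', 'Czechia', 'Niger']:
    print(name, '->', closest(name, happiness_names))
```

Abbreviations and renamed countries ('US', 'Czechia') fall below the cutoff and return `None`, while a different country with a similar spelling ('Niger' against 'Nigeria') passes it. Both cases are reasons to review the mapping by hand.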

In [ ]:
cov = cov_latest.copy()
cov = cov.rename(columns={'Country/Region': 'Country'})

# Build mapping from COVID country names to Happiness country names
h_countries = h['Country'].unique().tolist()
cov_countries = cov['Country'].unique().tolist()

# Manual fixes for known name differences. They are checked before fuzzy
# matching so a near-miss cannot override them. Identity entries are not
# needed because exact matches are handled first.
manual = {
    'US': 'United States',
    'Korea, South': 'South Korea',
    'Taiwan*': 'Taiwan',
    'Czechia': 'Czech Republic',
}

mapping = {}
for c in cov_countries:
    if c in h_countries:
        mapping[c] = c
    elif c in manual:
        mapping[c] = manual[c]
    else:
        # fall back to the closest happiness name, or None if nothing is close enough
        matches = get_close_matches(c, h_countries, n=1, cutoff=0.8)
        mapping[c] = matches[0] if matches else None

# Apply mapping and build merged df
cov['Country_happiness'] = cov['Country'].map(mapping)
merged = cov.merge(h, left_on='Country_happiness', right_on='Country', how='left', suffixes=('_cov','_happy'))

print('Total COVID countries:', len(cov_countries))
print('Mapped to happiness names (not null):', merged['Country_happiness'].notnull().sum())
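# Both inputs have a 'Country' column, so the merge renames them to 'Country_cov' and 'Country_happy'
merged[['Country_cov', 'Country_happiness', 'Deaths_latest', 'Confirmed_latest', 'Score', 'GDP per capita']].head()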


In [ ]:
# Show unmatched countries for manual inspection
# 'Country_cov' is the COVID-side name (the merge suffixes the overlapping 'Country' column)
unmatched = merged[merged['Score'].isnull()][['Country_cov', 'Country_happiness']]
unmatched.head(30)


If there are unmatched countries you care about, you can update the `manual` mapping above and re-run the merge.
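One way to audit the mapping from both sides is an outer merge with pandas' `indicator=True`, which labels each row as `both`, `left_only`, or `right_only`. The sketch below uses placeholder country names and scores, not values from the real files.

```python
import pandas as pd

# Placeholder data: 'Gamma' has no mapped happiness name, 'Delta' has no COVID row.
cov_toy = pd.DataFrame({
    'Country_cov': ['Alpha', 'Beta*', 'Gamma'],
    'Country_happiness': ['Alpha', 'Beta', None],
})
happy_toy = pd.DataFrame({
    'Country': ['Alpha', 'Beta', 'Delta'],
    'Score': [7.0, 6.0, 5.0],  # made-up scores
})

check = cov_toy.merge(
    happy_toy,
    left_on='Country_happiness',
    right_on='Country',
    how='outer',
    indicator=True,  # adds a '_merge' column describing where each row came from
)

# COVID countries with no happiness match, and happiness countries never used.
unmapped = check.loc[check['_merge'] == 'left_only', 'Country_cov']
unused = check.loc[check['_merge'] == 'right_only', 'Country']
print('Unmapped COVID names:', list(unmapped))
print('Unused happiness names:', list(unused))
```

Names that appear in both lists under slightly different spellings are good candidates for new `manual` entries.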

Next: create analysis metrics and visualizations.

In [ ]:
# Prepare final analysis dataframe (drop rows without happiness Score)
analysis = merged.dropna(subset=['Score']).copy()

# Express CFR as a percentage and log-transform the heavily skewed count columns
analysis['Deaths_per_Confirmed_pct'] = analysis['CFR'] * 100
analysis['log_confirmed'] = np.log1p(analysis['Confirmed_latest'])
analysis['log_deaths'] = np.log1p(analysis['Deaths_latest'])

analysis[['Country_cov','Country_happiness','Score','GDP per capita','Deaths_latest','Confirmed_latest','Deaths_per_Confirmed_pct']].sample(10)


## Visualizations
1) Scatter: Happiness Score vs CFR (Deaths / Confirmed)
2) Scatter: Happiness Score vs Confirmed (log)
3) Correlation heatmap between happiness indicators and COVID metrics


In [ ]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=analysis, x='Score', y='Deaths_per_Confirmed_pct')
plt.xlabel('Happiness Score')
plt.ylabel('Deaths / Confirmed (%)')
plt.title('Happiness Score vs COVID Case Fatality (%)')
plt.grid(True)
plt.show()


In [ ]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=analysis, x='Score', y='log_confirmed')
plt.xlabel('Happiness Score')
plt.ylabel('log(Confirmed + 1)')
plt.title('Happiness Score vs Log Confirmed Cases')
plt.grid(True)
plt.show()


In [ ]:
# Correlation matrix
corr_cols = ['Score','GDP per capita','Social support','Healthy life expectancy','Freedom to make life choices','Generosity','Perceptions of corruption','Confirmed_latest','Deaths_latest','CFR']
sub = analysis[corr_cols].copy()
sub = sub.dropna()
corr = sub.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation between Happiness indicators and COVID metrics')
plt.show()


## Simple statistical checks
- Compute Pearson correlation between Score and CFR, Score and log_confirmed
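Pearson's r is only meaningful when both series cover the same rows, so missing values should be dropped pairwise before computing it. The sketch below shows this on invented numbers. It uses `numpy.corrcoef` as a stand-in for `scipy.stats.pearsonr`, which means it yields r but no p-value.

```python
import numpy as np
import pandas as pd

# Invented values; the last row has a missing Score and must be excluded.
toy = pd.DataFrame({
    'Score': [7.0, 6.0, 5.0, 4.0, np.nan],
    'CFR': [0.01, 0.02, 0.03, 0.04, 0.05],
})

# Drop rows where either variable is missing so x and y stay aligned.
pair = toy[['Score', 'CFR']].dropna()

# Off-diagonal element of the 2x2 correlation matrix is Pearson's r.
r = np.corrcoef(pair['Score'], pair['CFR'])[0, 1]
print(f'n={len(pair)}, r={r:.3f}')
```

The four remaining points lie on a straight line with negative slope, so r comes out at -1 up to floating-point error.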


In [ ]:
from scipy.stats import pearsonr

for a, b in [('Score', 'CFR'), ('Score', 'log_confirmed')]:
    # Drop missing values pairwise so x and y always have the same rows
    pair = analysis[[a, b]].dropna()
    try:
        r, p = pearsonr(pair[a].values, pair[b].values)
        print(f'Pearson r between {a} and {b}: r={r:.3f}, p={p:.3g} (n={len(pair)})')
    except Exception as e:
        print('Could not compute for', a, b, '->', e)


## Initial conclusions & next steps
- The scatter plots and correlation matrix give a first look at whether happier countries had a lower CFR or fewer confirmed cases. Raw per-country totals are strongly influenced by population size and testing policy, so these correlations are descriptive only.
- Next steps to improve rigor:
  - Obtain population data to compute per-capita metrics (cases/deaths per 100k).
  - Use time-series analysis to compare pandemic trajectories and relate them to policy / social indicators.
  - Improve country-name matching (use ISO codes or a curated mapping).
  - Consider controlling for confounders (age structure, testing rates, urbanization).
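
As a rough sketch of the first next step, per-capita rates can be computed once a population column is available. The `Population` column and every number below are hypothetical; the provided datasets do not include population.

```python
import pandas as pd

# Hypothetical countries with identical case counts but different populations.
toy = pd.DataFrame({
    'Country': ['Alpha', 'Beta'],
    'Confirmed_latest': [5000, 5000],
    'Deaths_latest': [100, 50],
    'Population': [1_000_000, 10_000_000],  # assumed column, not in the source data
})

# Rate per 100,000 inhabitants = count / population * 100,000
toy['Cases_per_100k'] = toy['Confirmed_latest'] / toy['Population'] * 100_000
toy['Deaths_per_100k'] = toy['Deaths_latest'] / toy['Population'] * 100_000
print(toy[['Country', 'Cases_per_100k', 'Deaths_per_100k']])
```

Both toy countries report the same number of cases, yet the per-100k rates differ by a factor of ten, which is the distortion the raw totals hide.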

Save this notebook and include the generated plots in your report/README.
