# Exploratory Data Analysis of the OWID COVID-19 Dataset

## Introduction
This notebook performs exploratory data analysis (EDA) on the COVID-19 dataset. The goal is to understand the structure of the data, clean and preprocess it, and extract insights that could help with our risk classification task. We will:
- Load and inspect the dataset
- Clean and preprocess it
- Analyze univariate distributions, time series trends, and bivariate relationships
- Engineer features that may be useful for classification
---
## Data Loading and Inspection

We start by importing the necessary libraries and loading the dataset. Adjust the file path as needed.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset
data = pd.read_csv('../data/owid_covid_data.csv', parse_dates=['date'])

# Display the first few rows
print("First 5 rows of the dataset:")
display(data.head())

# Display data information and summary statistics
print("Dataset Information:")
data.info()

print("\nSummary Statistics:")
display(data.describe())

First 5 rows of the dataset:


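| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG | Asia | Afghanistan | 2020-01-05 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NaN | NaN | NaN | NaN |
| 1 | AFG | Asia | Afghanistan | 2020-01-06 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NaN | NaN | NaN | NaN |
| 2 | AFG | Asia | Afghanistan | 2020-01-07 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NaN | NaN | NaN | NaN |
| 3 | AFG | Asia | Afghanistan | 2020-01-08 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NaN | NaN | NaN | NaN |
| 4 | AFG | Asia | Afghanistan | 2020-01-09 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN | ... | NaN | 37.746 | 0.5 | 64.83 | 0.511 | 41128772 | NaN | NaN | NaN | NaN |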


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429435 entries, 0 to 429434
Data columns (total 67 columns):
 #   Column                                      Non-Null Count   Dtype         
---  ------                                      --------------   -----         
 0   iso_code                                    429435 non-null  object        
 1   continent                                   402910 non-null  object        
 2   location                                    429435 non-null  object        
 3   date                                        429435 non-null  datetime64[ns]
 4   total_cases                                 411804 non-null  float64       
 5   new_cases                                   410159 non-null  float64       
 6   new_cases_smoothed                          408929 non-null  float64       
 7   total_deaths                                411804 non-null  float64       
 8   new_deaths                                  410608 no

| | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | total_cases_per_million | new_cases_per_million | new_cases_smoothed_per_million | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 429435 | 411804.0 | 410159.0 | 408929.0 | 411804.0 | 410608.0 | 409378.0 | 411804.0 | 410159.0 | 408929.0 | ... | 243817.0 | 161741.0 | 290689.0 | 390299.0 | 319127.0 | 429435.0 | 13411.0 | 13411.0 | 13411.0 | 13411.0 |
| mean | 2022-04-21 01:06:25.463691008 | 7365292.0 | 8017.36 | 8041.026 | 81259.57 | 71.852139 | 72.060873 | 112096.199396 | 122.357074 | 122.713844 | ... | 33.097723 | 50.649264 | 3.106912 | 73.702098 | 0.722139 | 152033600.0 | 56047.65 | 9.766431 | 10.925353 | 1772.6664 |
| min | 2020-01-01 00:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 7.7 | 1.188 | 0.1 | 53.28 | 0.394 | 47.0 | -37726.1 | -44.23 | -95.92 | -2936.4531 |
| 25% | 2021-03-05 00:00:00 | 6280.75 | 0.0 | 0.0 | 43.0 | 0.0 | 0.0 | 1916.1005 | 0.0 | 0.0 | ... | 22.6 | 20.859 | 1.3 | 69.5 | 0.602 | 523798.0 | 176.5 | 2.06 | -1.5 | 116.872242 |
| 50% | 2022-04-20 00:00:00 | 63653.0 | 0.0 | 12.0 | 799.0 | 0.0 | 0.0 | 29145.475 | 0.0 | 2.794 | ... | 33.1 | 49.542 | 2.5 | 75.05 | 0.74 | 6336393.0 | 6815.199 | 8.13 | 5.66 | 1270.8014 |
| 75% | 2023-06-08 00:00:00 | 758272.0 | 0.0 | 313.286 | 9574.0 | 0.0 | 3.143 | 156770.19 | 0.0 | 56.253 | ... | 41.5 | 82.502 | 4.21 | 79.46 | 0.829 | 32969520.0 | 39128.04 | 15.16 | 15.575 | 2883.02415 |
| max | 2024-08-14 00:00:00 | 775866800.0 | 44236230.0 | 6319461.0 | 7057132.0 | 103719.0 | 14817.0 | 763598.6 | 241758.23 | 34536.89 | ... | 78.1 | 100.0 | 13.8 | 86.75 | 0.957 | 7975105000.0 | 1349776.0 | 78.08 | 378.22 | 10293.515 |
| std | NaN | 44775820.0 | 229664.9 | 86616.11 | 441190.1 | 1368.32299 | 513.636567 | 162240.412419 | 1508.778583 | 559.701638 | ... | 13.853948 | 31.905375 | 2.549205 | 7.387914 | 0.148903 | 697540800.0 | 156869.1 | 12.040658 | 24.560706 | 1991.892769 |

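The non-null counts reported by `data.info()` can be easier to compare when expressed as fractions of missing values per column. The following is a minimal sketch, not part of the original notebook: the helper name `missing_fraction` is hypothetical, and a small toy DataFrame with made-up values stands in for the real file.

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the share of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "iso_code": ["AAA", "AAA", "BBB", "BBB"],
    "continent": ["X", "X", None, None],
    "new_cases": [1.0, np.nan, 3.0, 4.0],
})

print(missing_fraction(toy))
```

On the real data, a call such as `missing_fraction(data).head(10)` would list the ten sparsest columns, which may help decide which ones are worth keeping for the classification task.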

## Data Cleaning and Preprocessing

Here, we will look for missing values and clean the data if needed.
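Beyond missing values, a daily time series can also be checked for duplicated rows and for gaps in the calendar. The sketch below shows one possible check under stated assumptions: the helper `check_daily_series` is hypothetical, it expects a frame holding a single location, and the toy data is invented for illustration.

```python
import pandas as pd


def check_daily_series(df: pd.DataFrame, key: str = "iso_code", date_col: str = "date"):
    """Return (number of duplicated key/date rows, missing calendar dates).

    Intended for a frame that contains a single location.
    """
    # Rows that repeat an earlier (key, date) pair
    n_duplicates = int(df.duplicated(subset=[key, date_col]).sum())
    # Compare observed dates with a full daily range between first and last date
    dates = pd.DatetimeIndex(df[date_col].dropna().unique())
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    missing_dates = expected.difference(dates)
    return n_duplicates, missing_dates


# Toy single-location frame with one repeated date and one skipped day
toy = pd.DataFrame({
    "iso_code": ["AAA"] * 4,
    "date": pd.to_datetime(["2020-01-05", "2020-01-06", "2020-01-06", "2020-01-08"]),
})

n_dup, gaps = check_daily_series(toy)
print(n_dup)
print(list(gaps.strftime("%Y-%m-%d")))
```

For the toy frame this reports one duplicated row and one missing date (2020-01-07). Applied to a per-country subset of the real data, the same call would indicate whether the series is complete before any values are filled.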

In [None]:
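# Keep only the United States rows; an exact match avoids accidental substring hits
data = data[data['iso_code'] == 'USA'].copy()

print("\nRows with USA iso_code: ...\n", data)
print("Missing values per column in USA set:")
print(data.isnull().sum())

# Assumption: the original cell had no fill step before the "after filling" report.
# Forward-filling in date order is a placeholder strategy, not a confirmed choice.
data = data.sort_values('date').ffill()

print("\nMissing values after filling:")
print(data.isnull().sum())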


Rows with USA iso_code: ...
        iso_code      continent       location       date  total_cases  \
403451      USA  North America  United States 2020-01-05          0.0   
403452      USA  North America  United States 2020-01-06          0.0   
403453      USA  North America  United States 2020-01-07          0.0   
403454      USA  North America  United States 2020-01-08          0.0   
403455      USA  North America  United States 2020-01-09          0.0   
...         ...            ...            ...        ...          ...   
405120      USA  North America  United States 2024-07-31  103436829.0   
405121      USA  North America  United States 2024-08-01  103436829.0   
405122      USA  North America  United States 2024-08-02  103436829.0   
405123      USA  North America  United States 2024-08-03  103436829.0   
405124      USA  North America  United States 2024-08-04  103436829.0   

        new_cases  new_cases_smoothed  total_deaths  new_deaths  \
403451        0.0         