<a href="https://colab.research.google.com/github/izik-adio/Predictive-Modelling-for-COVID-19/blob/main/eda_to_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction
This notebook presents a data science capstone project focused on **predictive modeling for COVID-19** in public health. The goal is to extract actionable insights from historical COVID-19 data. Key tasks include:

1. **Data Preparation**: Cleaning, transforming, and engineering features from the Kaggle COVID-19 dataset to ensure high-quality inputs for analysis.
2. **Exploratory Data Analysis (EDA)**: Identifying trends, correlations, and key factors influencing COVID-19 spread and severity through visualizations.  
3. **Predictive Modeling**: Building and evaluating time-series and classification models to forecast trends and outcomes.  
4. **Visualization & Reporting**: Delivering insights via clear visualizations and a structured report to support decision-making for public health policies and resource allocation.  

This project applies data science to inform public health responses during the COVID-19 pandemic.

### The dataset comes from [Kaggle](https://www.kaggle.com/datasets/imdevskp/corona-virus-report). Below is a brief overview of what each CSV file contains:

* full_grouped.csv - Day-to-day, country-wise number of cases (no county/state/province-level data)
* covid_19_clean_complete.csv - Day-to-day, country-wise number of cases (includes county/state/province-level data)
* country_wise_latest.csv - Latest country-level number of cases
* day_wise.csv - Day-wise number of cases (no country-level data)
* usa_county_wise.csv - Day-to-day, county-level number of cases
* worldometer_data.csv - Latest data from https://www.worldometers.info/


In [7]:
import pandas as pd

In [8]:
data_url = "https://raw.githubusercontent.com/izik-adio/Predictive-Modelling-for-COVID-19/refs/heads/main/data/"
files = ["country_wise_latest.csv", "covid_19_clean_complete.csv", "day_wise.csv", "full_grouped.csv", "usa_county_wise.csv", "worldometer_data.csv"]

country_wise_latest = pd.read_csv(data_url + files[0])
covid_19_clean_complete = pd.read_csv(data_url + files[1])
day_wise = pd.read_csv(data_url + files[2])
full_grouped = pd.read_csv(data_url + files[3])
usa_country_wise = pd.read_csv(data_url + files[4])
worldometer_data = pd.read_csv(data_url + files[5])

# Data Cleaning and Preprocessing

In [13]:
# Handle inconsistencies in column names:
country_wise_latest = country_wise_latest.rename(columns={'Country/Region': 'Country_Region'})
worldometer_data = worldometer_data.rename(columns={'Country/Region': 'Country_Region'})
full_grouped = full_grouped.rename(columns={'Country/Region': 'Country_Region'})
covid_19_clean_complete = covid_19_clean_complete.rename(columns={'Country/Region': 'Country_Region'})
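
The four renames above follow the same pattern, so they could also be applied in a loop. The following is a minimal sketch, not the notebook's own method: `normalize_columns` is a hypothetical helper, and the two toy frames only stand in for the loaded CSVs.

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'Country/Region' renamed to 'Country_Region' (hypothetical helper)."""
    # rename() skips keys that are absent, so frames without the column pass through unchanged
    return df.rename(columns={"Country/Region": "Country_Region"})

# Toy stand-ins for the real frames (illustrative values only)
frames = {
    "country_wise_latest": pd.DataFrame({"Country/Region": ["A"], "Confirmed": [1]}),
    "day_wise": pd.DataFrame({"Date": ["2020-01-22"], "Confirmed": [2]}),
}

# Apply the same rename to every frame in one pass
frames = {name: normalize_columns(df) for name, df in frames.items()}
```

Keeping the frames in a dictionary means a newly added file gets the same treatment without another copy-pasted line.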

In [14]:
# Convert 'Date' column to datetime if needed.  Adjust format if necessary
covid_19_clean_complete['Date'] = pd.to_datetime(covid_19_clean_complete['Date'])
full_grouped['Date'] = pd.to_datetime(full_grouped['Date'])
day_wise['Date'] = pd.to_datetime(day_wise['Date'])
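
The note about adjusting the format can be made concrete. The sketch below assumes the dates are ISO `YYYY-MM-DD` strings, which should be checked against the actual files. It passes an explicit `format` and uses `errors="coerce"` so that unparseable values become `NaT` instead of raising.

```python
import pandas as pd

# Toy column; the ISO format is an assumption about the CSVs, not something verified here
raw = pd.Series(["2020-01-22", "2020-01-23", "not a date"])

# An explicit format avoids per-value inference; errors="coerce" maps bad values to NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Count values that failed to parse so they can be inspected before modeling
n_bad = int(parsed.isna().sum())
```

A non-zero `n_bad` on the real data would suggest that the assumed format does not match every row.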

In [15]:
# Merging the datasets
# Prioritize covid_19_clean_complete, which has daily data

merged_data = covid_19_clean_complete.copy()

# Merge with full_grouped (adds additional daily metrics)
merged_data = pd.merge(
    merged_data, full_grouped,
    on=['Country_Region', 'Date', 'Confirmed', 'Deaths', 'Recovered', 'Active', 'WHO Region'],
    how='left',
)

# Merge with country_wise_latest (adds latest overall metrics)
merged_data = pd.merge(merged_data, country_wise_latest, on=['Country_Region'], how='left', suffixes=('_daily', '_latest'))

# Merge with worldometer_data (population and other metrics)
merged_data = pd.merge(merged_data, worldometer_data, on=['Country_Region'], how='left', suffixes=('_covid', '_worldometer'))
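
A left merge on many key columns can leave rows unmatched without raising any error. One way to check is the `indicator=True` option of `pd.merge`, sketched here on toy frames with illustrative values. In this sketch the left frame has province-level rows and the right frame has country totals, which is an assumption made for the example.

```python
import pandas as pd

# Toy frames: province-level rows on the left, country totals on the right
left = pd.DataFrame({
    "Country_Region": ["A", "A", "B"],
    "Date": pd.to_datetime(["2020-03-01"] * 3),
    "Confirmed": [5, 7, 4],
})
right = pd.DataFrame({
    "Country_Region": ["A", "B"],
    "Date": pd.to_datetime(["2020-03-01"] * 2),
    "Confirmed": [12, 4],
    "New cases": [3, 1],
})

# indicator=True adds a '_merge' column marking each row as 'both' or 'left_only'
checked = pd.merge(left, right, on=["Country_Region", "Date", "Confirmed"],
                   how="left", indicator=True)

# Tally matched versus unmatched rows
match_counts = checked["_merge"].value_counts()
```

Rows flagged `left_only` receive nulls in every column contributed by the right frame, so a large `left_only` count would be a sign that the join keys are too strict.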

In [16]:
merged_data.duplicated().sum()

0

In [18]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49068 entries, 0 to 49067
Data columns (total 42 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Province/State          14664 non-null  object        
 1   Country_Region          49068 non-null  object        
 2   Lat                     49068 non-null  float64       
 3   Long                    49068 non-null  float64       
 4   Date                    49068 non-null  datetime64[ns]
 5   Confirmed_daily         49068 non-null  int64         
 6   Deaths_daily            49068 non-null  int64         
 7   Recovered_daily         49068 non-null  int64         
 8   Active_daily            49068 non-null  int64         
 9   WHO Region_daily        49068 non-null  object        
 10  New cases_daily         34353 non-null  float64       
 11  New deaths_daily        34353 non-null  float64       
 12  New recovered_daily     34353 non-null  float64       

In [20]:
# Drop columns that have too few non-null values
merged_data.drop(columns=['NewCases', "NewDeaths", "NewRecovered"], inplace=True)
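
Instead of naming sparse columns by hand, they could be selected with a threshold. The sketch below uses `DataFrame.dropna(axis=1, thresh=...)` on a toy frame. The 50% cutoff is an arbitrary assumption, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative): 'Sparse' is mostly missing, the others are well populated
df = pd.DataFrame({
    "Confirmed": [1, 2, 3, 4],
    "Sparse": [np.nan, np.nan, np.nan, 1.0],
    "Deaths": [0, 1, np.nan, 2],
})

# Keep a column only if at least half of its values are non-null (assumed cutoff)
min_non_null = int(np.ceil(0.5 * len(df)))
df_dense = df.dropna(axis=1, thresh=min_non_null)

# Non-null share per column, useful when choosing the cutoff
non_null_share = df.notna().mean()
```

Inspecting `non_null_share` first makes the choice of cutoff explicit rather than implied by a hard-coded column list.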

In [24]:
# Calculate the case fatality rate (CFR) as a percentage of confirmed cases
merged_data['CFR'] = (merged_data['Deaths_latest'] / merged_data['Confirmed_latest']) * 100
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49068 entries, 0 to 49067
Data columns (total 40 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Province/State          14664 non-null  object        
 1   Country_Region          49068 non-null  object        
 2   Lat                     49068 non-null  float64       
 3   Long                    49068 non-null  float64       
 4   Date                    49068 non-null  datetime64[ns]
 5   Confirmed_daily         49068 non-null  int64         
 6   Deaths_daily            49068 non-null  int64         
 7   Recovered_daily         49068 non-null  int64         
 8   Active_daily            49068 non-null  int64         
 9   WHO Region_daily        49068 non-null  object        
 10  New cases_daily         34353 non-null  float64       
 11  New deaths_daily        34353 non-null  float64       
 12  New recovered_daily     34353 non-null  float64       
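
The CFR computed above is deaths divided by confirmed cases, times 100. If a country has zero confirmed cases, that division yields `inf` or `NaN`. The following sketch shows one possible guard on toy values. Whether the real data contains such rows is not verified here.

```python
import numpy as np
import pandas as pd

# Toy values (illustrative only)
df = pd.DataFrame({"Deaths_latest": [2, 0, 0], "Confirmed_latest": [100, 50, 0]})

# Replace zero denominators with NaN so the ratio is NaN rather than inf or 0/0
confirmed = df["Confirmed_latest"].replace(0, np.nan)
df["CFR"] = df["Deaths_latest"] / confirmed * 100
```

Leaving the undefined cases as `NaN` keeps them visible in later null checks instead of hiding them behind a placeholder value.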

In [25]:
merged_data[['WHO Region_daily', 'WHO Region_latest', 'WHO Region']].head()

|   | WHO Region_daily | WHO Region_latest | WHO Region |
|---|---|---|---|
| 0 | Eastern Mediterranean | Eastern Mediterranean | EasternMediterranean |
| 1 | Europe | Europe | Europe |
| 2 | Africa | Africa | Africa |
| 3 | Europe | Europe | Europe |
| 4 | Africa | Africa | Africa |
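
The three region columns above appear to differ only in spacing ("EasternMediterranean" versus "Eastern Mediterranean"). Before treating two of them as redundant, the agreement can be measured with whitespace ignored. This is a minimal sketch on toy rows that mirror the values shown above, and `squash` is a hypothetical helper.

```python
import pandas as pd

# Toy rows mirroring the values shown above
df = pd.DataFrame({
    "WHO Region_daily": ["Eastern Mediterranean", "Europe", "Africa"],
    "WHO Region": ["EasternMediterranean", "Europe", "Africa"],
})

def squash(s: pd.Series) -> pd.Series:
    """Remove all whitespace so spacing differences do not count as mismatches (hypothetical helper)."""
    return s.str.replace(r"\s+", "", regex=True)

# Fraction of rows where the two columns agree once whitespace is ignored
agreement = (squash(df["WHO Region_daily"]) == squash(df["WHO Region"])).mean()
```

An `agreement` of 1.0 on the full data would support keeping a single region column. Anything lower would point to rows worth inspecting first.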


In [None]:
# Drop columns that are not carried forward (add or remove names as needed)
columns_to_drop = [
    'Province/State', 'Lat', 'Long', 'WHO Region_latest', 'WHO Region',
    'New cases_daily', 'New deaths_daily', 'New recovered_daily',
    'Confirmed_daily', 'Deaths_daily', 'Recovered_daily', 'Active_daily',
    'Continent', 'Population', 'TotalCases', 'TotalDeaths', 'TotalRecovered',
    'ActiveCases', 'Serious,Critical', 'Tot Cases/1M pop', 'Deaths/1M pop',
    'TotalTests', 'Tests/1M pop',
]

# errors='ignore' skips any listed column that is not present
merged_data_reduced = merged_data.drop(columns=columns_to_drop, errors='ignore')

# info() prints its summary directly and returns None, so no print() wrapper is needed
merged_data_reduced.info()

In [None]:
# Save the merged DataFrame (the full version, not merged_data_reduced)
merged_data.to_csv('merged_covid_data.csv', index=False)
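
CSV stores dates as plain text, so the `Date` dtype is lost on save. The sketch below shows a round trip on a toy frame, using `parse_dates` to restore it on reload. The temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile
import pandas as pd

# Toy frame (illustrative only)
df = pd.DataFrame({"Date": pd.to_datetime(["2020-01-22", "2020-01-23"]), "Confirmed": [1, 2]})

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "merged_covid_data.csv")
    df.to_csv(path, index=False)
    # parse_dates converts the text column back to a datetime dtype on load
    restored = pd.read_csv(path, parse_dates=["Date"])
```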