<a href="https://colab.research.google.com/github/izik-adio/Predictive-Modelling-for-COVID-19/blob/main/eda_to_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction
This notebook presents a data science capstone project focused on **predictive modeling for COVID-19** in public health. The goal is to extract actionable insights from historical COVID-19 data. Key tasks include:  

1. **Data Preparation**: Cleaning, transforming, and engineering features from the Kaggle COVID-19 dataset to ensure high-quality inputs for analysis.  
2. **Exploratory Data Analysis (EDA)**: Identifying trends, correlations, and key factors influencing COVID-19 spread and severity through visualizations.  
3. **Predictive Modeling**: Building and evaluating time-series and classification models to forecast trends and outcomes.  
4. **Visualization & Reporting**: Delivering insights via clear visualizations and a structured report to support decision-making for public health policies and resource allocation.  

This project is a comprehensive effort to leverage data science for improving public health responses during the COVID-19 pandemic.

### The dataset comes from [Kaggle](https://www.kaggle.com/datasets/imdevskp/corona-virus-report). Below is a brief overview of what each CSV file contains:

* full_grouped.csv - Daily case counts per country (no Province/State-level rows)
* covid_19_clean_complete.csv - Daily case counts per country, with Province/State-level rows and coordinates where available
* country_wise_latest.csv - Latest cumulative case counts per country
* day_wise.csv - Daily worldwide totals (no country-level breakdown)
* usa_county_wise.csv - Daily case counts per US county
* worldometer_data.csv - Latest data from https://www.worldometers.info/
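
A quick way to confirm which files carry sub-national detail is to look for a `Province/State` column. The sketch below is illustrative only: it uses a few values copied from the sample outputs shown in this notebook instead of the full files, and `has_subnational_detail` is a hypothetical helper name, not part of the project.

```python
import pandas as pd

# Small stand-ins built from rows shown in this notebook's sample outputs.
covid_19_clean_complete_sample = pd.DataFrame({
    "Province/State": ["Channel Islands", None],
    "Country/Region": ["United Kingdom", "Nicaragua"],
    "Date": ["2020-06-20", "2020-07-25"],
    "Confirmed": [570, 3439],
})
full_grouped_sample = pd.DataFrame({
    "Date": ["2020-06-24", "2020-02-13"],
    "Country/Region": ["Eritrea", "Angola"],
    "Confirmed": [144, 0],
})


def has_subnational_detail(df):
    """Hypothetical helper: True if the frame has a Province/State column."""
    return "Province/State" in df.columns


granularity = {
    "covid_19_clean_complete": has_subnational_detail(covid_19_clean_complete_sample),
    "full_grouped": has_subnational_detail(full_grouped_sample),
}
print(granularity)
```

The same check can be run on the full DataFrames once they are loaded.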


In [1]:
import pandas as pd

In [4]:
data_url = "https://raw.githubusercontent.com/izik-adio/Predictive-Modelling-for-COVID-19/refs/heads/main/data/"
files = ["country_wise_latest.csv", "covid_19_clean_complete.csv", "day_wise.csv", "full_grouped.csv", "usa_county_wise.csv", "worldometer_data.csv"]

country_wise_latest = pd.read_csv(data_url + files[0])
covid_19_clean_complete = pd.read_csv(data_url + files[1])
day_wise = pd.read_csv(data_url + files[2])
full_grouped = pd.read_csv(data_url + files[3])
usa_country_wise = pd.read_csv(data_url + files[4])
worldometer_data = pd.read_csv(data_url + files[5])

In [23]:
country_wise_latest.tail(2)

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
185,Zambia,4552,140,2815,1597,71,1,465,3.08,61.84,4.97,3326,1226,36.86,Africa
186,Zimbabwe,2704,36,542,2126,192,2,24,1.33,20.04,6.64,1713,991,57.85,Africa
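
The weekly columns appear to be derived from `Confirmed` and `Confirmed last week`. The sketch below recomputes them for the two rows shown above; the formulas are an assumption inferred from these values, not taken from the dataset documentation.

```python
import pandas as pd

# The two rows shown above, restricted to the columns needed here.
latest = pd.DataFrame({
    "Country/Region": ["Zambia", "Zimbabwe"],
    "Confirmed": [4552, 2704],
    "Confirmed last week": [3326, 1713],
})

# Assumed definitions: absolute change and percentage growth over one week.
latest["1 week change"] = latest["Confirmed"] - latest["Confirmed last week"]
latest["1 week % increase"] = (
    latest["1 week change"] / latest["Confirmed last week"] * 100
).round(2)

print(latest)
```

If the recomputed values match the file for every country, these columns are redundant with `Confirmed` and `Confirmed last week` and can be dropped or regenerated during feature engineering.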


In [24]:
worldometer_data.tail(2)

Unnamed: 0,Country/Region,Continent,Population,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,TotalTests,Tests/1M pop,WHO Region
207,Vatican City,Europe,801.0,12,,,,12.0,,0.0,,14981.0,,,,Europe
208,Western Sahara,Africa,598682.0,10,,1.0,,8.0,,1.0,,17.0,2.0,,,Africa
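
Both rows above contain several empty fields, so it is worth counting missing values per column before modeling. A minimal sketch, parsing only the two rows shown above from an in-memory string (the full `worldometer_data` frame would be checked the same way):

```python
import io

import pandas as pd

# The header and two rows shown above, copied verbatim.
csv_text = '''Unnamed: 0,Country/Region,Continent,Population,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,TotalTests,Tests/1M pop,WHO Region
207,Vatican City,Europe,801.0,12,,,,12.0,,0.0,,14981.0,,,,Europe
208,Western Sahara,Africa,598682.0,10,,1.0,,8.0,,1.0,,17.0,2.0,,,Africa
'''

# The first field is the row index, not a data column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Count missing values per column and show only columns with gaps.
missing_per_column = sample.isna().sum()
print(missing_per_column[missing_per_column > 0])
```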


In [22]:
day_wise.tail(2)

Unnamed: 0,Date,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,No. of countries
186,2020-07-26,16251796,648621,9293464,6309711,204606,4104,134721,3.99,57.18,6.98,187
187,2020-07-27,16480485,654036,9468087,6358362,228693,5415,174623,3.97,57.45,6.91,187
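
The `Active` and per-100 ratio columns can be reproduced from the raw counts. The sketch below does this for the two rows shown above; the formulas are an assumption inferred from the column names and these values.

```python
import pandas as pd

# The two rows shown above, raw counts only.
day = pd.DataFrame({
    "Date": ["2020-07-26", "2020-07-27"],
    "Confirmed": [16251796, 16480485],
    "Deaths": [648621, 654036],
    "Recovered": [9293464, 9468087],
})

# Assumed definitions of the derived columns.
day["Active"] = day["Confirmed"] - day["Deaths"] - day["Recovered"]
day["Deaths / 100 Cases"] = (day["Deaths"] / day["Confirmed"] * 100).round(2)
day["Recovered / 100 Cases"] = (day["Recovered"] / day["Confirmed"] * 100).round(2)
day["Deaths / 100 Recovered"] = (day["Deaths"] / day["Recovered"] * 100).round(2)

print(day)
```

Because these columns are deterministic functions of `Confirmed`, `Deaths`, and `Recovered`, feeding all of them to a model alongside the raw counts adds collinear features.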


In [8]:
covid_19_clean_complete.sample(2)

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
39367,Channel Islands,United Kingdom,49.3723,-2.3644,2020-06-20,570,48,512,10,Europe
48454,,Nicaragua,12.865416,-85.207229,2020-07-25,3439,108,2492,839,Americas


In [15]:
full_grouped.sample(2)

Unnamed: 0,Date,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,WHO Region
28853,2020-06-24,Eritrea,144,0,39,105,1,0,0,Africa
4118,2020-02-13,Angola,0,0,0,0,0,0,0,Africa


In [13]:
usa_country_wise.sample(2)

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Date,Confirmed,Deaths
437792,84005119,US,USA,840,5119.0,Pulaski,Arkansas,US,34.770541,-92.313551,"Pulaski, Arkansas, US",6/1/20,903,34
278084,84018171,US,USA,840,18171.0,Warren,Indiana,US,40.347281,-87.356027,"Warren, Indiana, US",4/14/20,3,1
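
Unlike the other files, which use ISO dates such as `2020-07-27`, this file stores dates as `6/1/20`. A minimal sketch of parsing them explicitly, assuming month/day/two-digit-year order (consistent with `4/14/20`, where 14 cannot be a month):

```python
import pandas as pd

# The two date strings from the rows shown above.
usa_dates = pd.Series(["6/1/20", "4/14/20"], name="Date")

# An explicit format avoids ambiguous day/month inference.
parsed = pd.to_datetime(usa_dates, format="%m/%d/%y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Parsing the column once, up front, lets the county-level data be joined or compared with the other files on a common datetime type.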


In [28]:
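# Name each DataFrame explicitly instead of searching globals() by object identity.
datasets = {
    "country_wise_latest": country_wise_latest,
    "covid_19_clean_complete": covid_19_clean_complete,
    "full_grouped": full_grouped,
    "day_wise": day_wise,
    "usa_country_wise": usa_country_wise,
    "worldometer_data": worldometer_data,
}
for name, df in datasets.items():
    print('-'*70)
    print(f"{name} has {df.shape[0]} rows and {df.shape[1]} columns")
    df.info()  # info() prints its own report and returns None, so no print() wrapper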

----------------------------------------------------------------------
country_wise_latest has 187 rows and 15 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country/Region          187 non-null    object 
 1   Confirmed               187 non-null    int64  
 2   Deaths                  187 non-null    int64  
 3   Recovered               187 non-null    int64  
 4   Active                  187 non-null    int64  
 5   New cases               187 non-null    int64  
 6   New deaths              187 non-null    int64  
 7   New recovered           187 non-null    int64  
 8   Deaths / 100 Cases      187 non-null    float64
 9   Recovered / 100 Cases   187 non-null    float64
 10  Deaths / 100 Recovered  187 non-null    float64
 11  Confirmed last week     187 non-null    int64  
 12  1 week change           187 n