# **README for Python Assignment**

# Part I: ETL Process for Covid-19 Data

This project is an **ETL (Extract, Transform, Load)** pipeline designed to process and aggregate Covid-19 data from various sources into a macrotable for analysis. The resulting macrotable provides insights at the weekly level for different countries, summarizing key metrics such as confirmed cases, deaths, and population details.

---

## **Table of Contents**
1. [Overview](#overview)
2. [Features](#features)
3. [Setup](#setup)
4. [Data Sources and Processing](#data-sources-and-processing)
5. [How to Run](#how-to-run)
6. [Part II: Exploratory Data Analysis](#part-ii-exploratory-data-analysis)
7. [Conclusion](#conclusion)


---

## **Overview**
The script consolidates multiple datasets (demographics, epidemiology, health, hospitalizations, vaccination data, and more) into a clean, comprehensive dataset. The final macrotable is saved as a CSV file containing key metrics aggregated by week and `country_name`.

---

## **Features**
- **Data Cleaning**: Removes unnecessary columns and fills missing values based on logical imputation methods.
- **Data Filtering**: Supports filtering by date range and country.
- **Weekly Aggregation**: Groups data by week and country to provide insights into trends over time.
- **Population Statistics**: Merges demographic data to enhance analysis with population metrics.
- **Cumulative Metrics**: Calculates cumulative confirmed and deceased cases for trend analysis.
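
The date-range and country filtering listed above could look roughly like the sketch below. It is an illustration rather than the script's actual code: the helper name `filter_rows` and the inclusive treatment of both bounds are assumptions, while the `date` and `country_name` columns are the ones referenced elsewhere in this README.

```python
import pandas as pd


def filter_rows(df, start=None, end=None, countries=None):
    """Keep rows dated within [start, end] and, optionally, in the given countries.

    Hypothetical helper: both date bounds are treated as inclusive (an assumption).
    """
    dates = pd.to_datetime(df["date"])
    mask = pd.Series(True, index=df.index)
    if start is not None:
        mask &= dates >= pd.Timestamp(start)
    if end is not None:
        mask &= dates <= pd.Timestamp(end)
    if countries:
        mask &= df["country_name"].isin(countries)
    return df[mask]
```

Leaving an argument as `None` skips that filter, so the same helper covers runs with no date or country restrictions.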

---

## **Setup**
### **Prerequisites**
- Python 3.8+
- Required libraries: `pandas` (third-party, must be installed); `argparse`, `os`, and `datetime` are part of the Python standard library.


## **Data Sources and Processing**

# ETL Process

## Step 1: Extract

The script loads data from six CSV files located in the specified directory:

1. **`demographics.csv`**: Population data, including gender and age distributions.
2. **`epidemiology.csv`**: Daily counts of confirmed cases and deaths.
3. **`health.csv`**: Health-related data such as hospital capacity.
4. **`hospitalizations.csv`**: Data on Covid-19-related hospitalizations.
5. **`index.csv`**: Location index linking each `location_key` to identifiers such as `country_code` and `country_name`.
6. **`vaccinations.csv`**: Vaccination counts.
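
The extract step can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `load_datasets` and the dictionary it returns are assumptions, and only the six file names come from the list above.

```python
from pathlib import Path

import pandas as pd

# File names taken from the list above.
DATASET_FILES = [
    "demographics.csv",
    "epidemiology.csv",
    "health.csv",
    "hospitalizations.csv",
    "index.csv",
    "vaccinations.csv",
]


def load_datasets(input_dir):
    """Read each expected CSV in input_dir into a DataFrame keyed by file stem.

    Hypothetical helper: raises FileNotFoundError if any file is missing.
    """
    folder = Path(input_dir)
    datasets = {}
    for file_name in DATASET_FILES:
        path = folder / file_name
        if not path.is_file():
            raise FileNotFoundError(f"Missing input file: {path}")
        datasets[path.stem] = pd.read_csv(path)
    return datasets
```

Failing early on a missing file keeps later merge errors from being mistaken for data problems.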

---

## Step 2: Transform

### 1. Merging Datasets
- Merges all datasets into a single macrotable using `location_key` and `date` as keys.
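
One way to express this merge is sketched below. It assumes that time-series tables carry both `location_key` and `date` while static tables (such as demographics) carry only `location_key`, and that left joins onto the first table are acceptable. The helper name `merge_datasets` is hypothetical.

```python
from functools import reduce

import pandas as pd


def merge_datasets(frames):
    """Left-join a list of DataFrames on whichever key columns each pair shares.

    Hypothetical helper: the first frame acts as the base table.
    """
    keys = ["location_key", "date"]

    def join(left, right):
        # Use both keys when available, otherwise fall back to location_key only.
        shared = [k for k in keys if k in left.columns and k in right.columns]
        return left.merge(right, on=shared, how="left")

    return reduce(join, frames)
```

A left join keeps every row of the base table, so locations or dates without a match simply receive missing values that the cleaning step can handle.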

### 2. Cleaning Data
- **Removing Non-Essential Columns**:
  - Columns like `wikidata_id`, `datacommons_id`, and `iso_3166_1_alpha_3` are dropped as they are not relevant to the analysis.
- **Handling Missing Values**:
  - Columns with more than 60% missing values are dropped.
  - Missing values in population-related fields are filled with the column mean.
  - `new_confirmed` and `new_deceased` are filled with `0` to indicate no reported cases.
  - Columns from the `vaccinations` and `hospitalizations` datasets are dropped: most of their values are null, data is available only for the US, and they are not crucial for predicting COVID-19 deaths.
- **Filtering Critical Columns**:
  - Rows with missing `date`, `country_code`, or `cumulative_confirmed` are removed.
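
The cleaning rules above can be sketched in pandas as shown below. The function name `clean_macrotable` is hypothetical, and the sketch assumes that population-related columns are numeric and share the `population` prefix; the 60% threshold and column names come from the rules above.

```python
import pandas as pd


def clean_macrotable(df, missing_threshold=0.60):
    """Apply the cleaning rules listed above to a merged DataFrame."""
    # Drop identifier columns that are not used in the analysis.
    id_columns = ["wikidata_id", "datacommons_id", "iso_3166_1_alpha_3"]
    df = df.drop(columns=id_columns, errors="ignore")

    # Drop columns where more than 60% of the values are missing.
    missing_share = df.isna().mean()
    df = df.loc[:, missing_share <= missing_threshold].copy()

    # Fill population-related columns with their column mean.
    # Assumption: these columns are numeric and start with "population".
    for column in [c for c in df.columns if c.startswith("population")]:
        df[column] = df[column].fillna(df[column].mean())

    # Treat missing daily counts as "no reported cases".
    for column in ["new_confirmed", "new_deceased"]:
        if column in df.columns:
            df[column] = df[column].fillna(0)

    # Remove rows that lack any of the critical fields.
    critical = ["date", "country_code", "cumulative_confirmed"]
    critical = [c for c in critical if c in df.columns]
    return df.dropna(subset=critical)
```

The order matters in this sketch: sparse columns are removed before imputation so that means are not computed for columns that will be discarded anyway.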

### 3. Aggregation
- Groups data by week and `country_name`.
- Computes weekly totals for:
  - `new_confirmed` (confirmed cases)
  - `new_deceased` (deaths)
- Calculates cumulative totals:
  - `cumulative_confirmed`
  - `cumulative_deceased`
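
A possible implementation of this aggregation is sketched below. It assumes weeks run Monday to Sunday (the pandas default for weekly periods) and that the cumulative columns are running sums of the weekly totals per country; the source does not state either detail, and the name `aggregate_weekly` is hypothetical.

```python
import pandas as pd


def aggregate_weekly(df):
    """Sum daily counts per country and week, then accumulate them over time."""
    df = df.copy()
    # Label each row with its calendar week, rendered as "start/end".
    df["week"] = pd.to_datetime(df["date"]).dt.to_period("W").astype(str)

    weekly = (
        df.groupby(["country_name", "week"], as_index=False)[
            ["new_confirmed", "new_deceased"]
        ]
        .sum()
        .sort_values(["country_name", "week"])
    )

    # Assumption: cumulative totals are running sums of the weekly totals.
    grouped = weekly.groupby("country_name")
    weekly["cumulative_confirmed"] = grouped["new_confirmed"].cumsum()
    weekly["cumulative_deceased"] = grouped["new_deceased"].cumsum()
    return weekly.reset_index(drop=True)
```

Because the week labels start with an ISO date, sorting them as strings also sorts them chronologically, which the running sums rely on.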

### 4. Demographic Integration
- Adds population data, such as male, female, and age-specific populations, by merging with the `demographics` dataset.

---

## Step 3: Load

- The final macrotable is saved as a CSV file to the user-specified output location.


## **How to Run**
### Command Syntax

Run the ETL script with the following command:

```bash
python etl.py <input-folder-path> -o <output-file-path> --start <start-date> --end <end-date> --countries <country-names>
```
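
For orientation, the command syntax above could be parsed with `argparse` roughly as follows. This is a sketch under assumptions, not the contents of `etl.py`: the long option name `--output`, the `required=True` setting, and accepting several space-separated country names are all guesses made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the command syntax shown above."""
    parser = argparse.ArgumentParser(description="Covid-19 ETL pipeline")
    parser.add_argument("input_folder", help="Folder containing the input CSV files")
    # Assumption: the long form --output and required=True are illustrative.
    parser.add_argument("-o", "--output", required=True, help="Output CSV path")
    # Assumption: dates are passed as ISO strings such as 2020-01-01.
    parser.add_argument("--start", help="First date to include")
    parser.add_argument("--end", help="Last date to include")
    # Assumption: several country names may be given, separated by spaces.
    parser.add_argument("--countries", nargs="+", help="Country names to keep")
    return parser


# Parse an explicit argument list instead of sys.argv to show the result.
args = build_parser().parse_args(
    ["data", "-o", "macrotable.csv", "--start", "2020-01-01",
     "--end", "2020-08-22", "--countries", "Spain", "Italy"]
)
```

With this layout, `args.countries` is a list of names and the date options default to `None` when omitted.
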

## **Output**

### File Format

The output is a CSV file with the following columns:

- **`week`**: Weekly range in `start/end` form (e.g., `2020-01-06/2020-01-12`).
- **`country_name`**: Country name.
- **`new_confirmed`**: Weekly confirmed cases.
- **`new_deceased`**: Weekly deaths.
- **`cumulative_confirmed`**: Cumulative confirmed cases.
- **`cumulative_deceased`**: Cumulative deaths.
- **Demographic Columns**:
  - `population`, `population_male`, `population_female`, and age-group distributions.


## **Decisions and Assumptions**

- **Exclusion of Hospitalization and Vaccination Data**:
  - These datasets were excluded because their missing values exceed the 60% threshold and they are not material for predicting deaths.

- **Filling Missing Values**:
  - Population-related columns were filled with the column mean.
  - `new_confirmed` and `new_deceased` were set to `0` for null values.

- **Cumulative Metrics**:
  - Cumulative confirmed and deceased cases were calculated to show trends over time.

- **Demographic Data Integration**:
  - Added population statistics to contextualize weekly metrics.

---

## **Part II: Exploratory Data Analysis**

## **Overview**

The exploratory data analysis (EDA) aims to uncover meaningful insights and patterns in the processed macrotable. The analysis focuses on demographic factors, temporal trends, and relationships between key variables. Additionally, new variables were introduced, such as **Mortality Rate** and **Vulnerability Index**, to enhance the depth of insights and enrich the final regression analysis for COVID-19 death prediction/modelling. The details provided here summarize key findings, with further elaboration available in the Jupyter Notebook.

---

## **Data Quality Considerations**

### **Low Data Quality for Spain and Italy**

- **Spain**:
  - Spain's data exhibits significant gaps in **confirmed case reporting**, making it challenging to analyze trends accurately.
  - Inconsistent weekly records and missing values for confirmed cases hinder the ability to reliably assess the progression of the pandemic in the country.
  - These issues particularly impact insights derived from variables like the **confirmed-to-deceased ratio** and the **vulnerability index**, which rely on confirmed case data.

- **Italy**:
  - Italy's data quality issues are most prominent in **mortality reporting**, with large gaps and missing values in the dataset.
  - Variability in the consistency of reported deaths across weeks reduces the reliability of calculated metrics such as the **mortality rate**.
  - This impacts the ability to analyze trends over time and weakens comparisons with other countries.

- **Germany**:
  - Germany only has **reliable data until Q1 2022**.

### **Impact on Analysis**
- The limitations in Spain and Italy's data directly affect:
  - **Comparisons Across Countries**: Data inconsistencies reduce the reliability of cross-country insights for confirmed cases (Spain) and mortality trends (Italy).
  - **Temporal Analysis**: Missing or incomplete weekly records hinder the ability to identify consistent temporal patterns for these countries.
  - **Regression Analysis**: Incomplete data affects the accuracy of predictive models, particularly for Italy's mortality trends and Spain's confirmed case progression.

These data quality issues underscore the need for caution when interpreting insights for Spain and Italy. Acknowledging these gaps is crucial to ensure realistic expectations and reliable conclusions in the analysis.

---

## **Data Insights**

### 1. **Demographic Overview**
- Analyzed age and gender distributions across countries.
- While demographic proportions are largely consistent across countries, **the United States has a significantly larger population**, which heavily influences absolute case counts.

### 2. **Key Variables**
- New Confirmed Cases
- New Deceased Cases
- Cumulative Cases (Confirmed and Deceased)
- Population Demographics (Male, Female, Age Groups)
- Derived Variables:
  - **Mortality Rate**: Percentage of deceased cases relative to confirmed cases.
  - **Vulnerability Index**: Proportion of confirmed cases weighted by the population age group most vulnerable to severe outcomes (70+).
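
The two derived variables can be illustrated with plain arithmetic. The sketch below is one possible reading of the definitions above, not the notebook's code: returning `0.0` when there are no confirmed cases is an assumption, and so is the exact weighting used for the vulnerability index (cases multiplied by the share of the population aged 70 or older).

```python
def mortality_rate(deceased, confirmed):
    """Deceased cases as a percentage of confirmed cases.

    Assumption: returns 0.0 when there are no confirmed cases.
    """
    if confirmed == 0:
        return 0.0
    return 100.0 * deceased / confirmed


def vulnerability_index(confirmed, population_70_plus, population):
    """Confirmed cases scaled by the share of the population aged 70 or older.

    Assumption: one possible reading of the definition above; the notebook
    may weight the cases differently.
    """
    return confirmed * population_70_plus / population
```

For example, 5 deaths among 200 confirmed cases give a mortality rate of 2.5, and 1,000 cases in a population where 15% are aged 70 or older give an index of 150 under this reading.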

---

## **EDA Visualizations and Insights**

- Throughout the EDA, visualizations are restricted to the selected start and end dates, with time expressed through the `week` column (a weekly range).

### **1. Weekly Trends of New Confirmed Cases**
- **Description**: Line chart comparing weekly confirmed cases across countries.
- **Insights**: 
  - The United States shows dramatic spikes during 2021 and 2022, reflecting larger outbreaks.
  - Germany, Italy, and Spain exhibit more consistent trends with smaller peaks.

### **2. Mortality Rate Over Time**
- **Description**: Line chart tracking the mortality rate across countries.
- **Insights**:
  - The mortality rate has steadily declined over time, particularly in the United States.
  - Differences in trends highlight the impact of healthcare infrastructure and pandemic management strategies.

### **3. Mortality Rate vs. Confirmed Cases**
- **Description**: Scatter plot examining the relationship between confirmed cases and mortality rates.
- **Insights**:
  - The mortality rate remains relatively stable despite significant increases in confirmed cases, suggesting improved treatments and preventive measures over time.

### **4. Age-Weighted Confirmed Cases**
- **Description**: Bar chart showing confirmed cases weighted by population age groups for each country.
- **Insights**:
  - Age-weighted cases emphasize the vulnerability of older populations.
  - The United States exhibits a disproportionate share of cases across all age groups.

### **5. Vulnerability Index**
- **Description**: Bar chart comparing average vulnerability indices across countries.
- **Insights**:
  - The vulnerability index highlights higher risks in the United States and Germany due to larger populations in vulnerable age groups (70+).

---

## **Correlation Analysis**

### **Correlation Heatmap**
- **Description**: Heatmap showing correlations between key variables (e.g., confirmed cases, deaths, population).
- **Insights**:
  - Strong correlations between new confirmed cases and new deceased cases reflect the pandemic's progression.
  - Population shows a weaker correlation, indicating demographic factors alone do not fully explain the trends.
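
The matrix behind such a heatmap can be computed with pandas. The sketch below assumes Pearson correlation and the three column names shown as defaults; the source does not say which method or columns the notebook uses, and `correlation_matrix` is a hypothetical helper.

```python
import pandas as pd


def correlation_matrix(
    macrotable, columns=("new_confirmed", "new_deceased", "population")
):
    """Pearson correlation between the selected numeric columns."""
    return macrotable[list(columns)].corr(method="pearson")
```

The resulting square DataFrame can be passed to a plotting function such as `seaborn.heatmap` to draw the chart.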

---

## **Regression Analysis**

### **Overview**
To explore the predictive power of key variables, a regression model was built using:
- Features:
  - **New Confirmed Cases**
  - **Population**
  - **Population Age 70–79**
  - **Population Age 80+**
- Target:
  - **New Deceased Cases**
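
As a rough illustration of this setup, the sketch below fits an ordinary least squares model with NumPy. The source does not specify the model type or library, so both are assumptions, and the helper names `fit_linear_model` and `predict` are hypothetical. Each feature row would hold the four features listed above, with new deceased cases as the target.

```python
import numpy as np


def fit_linear_model(features, target):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(target, dtype=float)
    # Prepend a column of ones so the first fitted value is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[0], solution[1:]


def predict(intercept, coefficients, features):
    """Apply the fitted intercept and coefficients to new feature rows."""
    return intercept + np.asarray(features, dtype=float) @ coefficients
```

`np.linalg.lstsq` returns a minimum-norm solution when feature columns are collinear, which is worth keeping in mind when several population columns move together.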

### **Actual vs. Predicted Deaths**
- **Description**: Scatter plot with a regression line showing the relationship between actual and predicted deaths.
- **Insights**:
  - The regression model reasonably predicts mortality trends.
  - Deviations highlight country-specific factors not captured by the model.

### **Temporal Analysis of Predictions**
- **Description**: Line chart comparing actual and predicted deaths over time for each country.
- **Insights**:
  - Predicted deaths align closely with actual trends, which supports the model's fit.
  - Temporal delays in predictions suggest factors like reporting lags or delayed healthcare interventions.

---

## **Conclusion**
The EDA demonstrates the value of integrating demographic data, derived variables, and statistical models. Key takeaways include:
1. Temporal trends and demographic factors heavily influence case and death rates.
2. New variables like **Mortality Rate** and **Vulnerability Index** provide additional depth to the analysis.
3. Correlation and regression analyses highlight relationships between key variables and support mortality predictions.

Further details, including the step-by-step code and additional charts, are available in the accompanying Jupyter Notebook.
