# Exploring COVID-19 Data: An In-depth Analysis

<div>
<img src=https://thanhtra.com.vn/data/images/0/2021/12/05/congdinh/blue-covid-banner.jpg width="1500">
</div>

# Introduction

In the face of the global COVID-19 pandemic, understanding the data behind the numbers is crucial. This project dives deep into COVID-19 datasets, leveraging data analysis and SQL querying to extract meaningful insights. We explore key metrics such as total cases, deaths, vaccinations, and their percentages, shedding light on the pandemic's impact worldwide.

## Project Highlights

1. **Connect to the Database:**
   - Use Python's `sqlite3` module and the `ipython-sql` extension to connect to a SQLite database for efficient data manipulation.

2. **Load the Dataset:**
   - Import COVID-19 datasets from CSV files into the SQLite database for easy access.

3. **Data Exploration:**
   - Preprocess `COVID_DEATHS` and `COVID_VACCINATIONS` datasets, correcting data types and creating processed tables.

4. **Global COVID-19 Overview:**
   - Analyze global COVID-19 statistics, including total cases, deaths, vaccination efforts, and related percentages.
   - Visualize trends in new COVID-19 cases, deaths, and vaccinations worldwide over time.

5. **COVID-19 Impact Across Continents:**
   - Explore COVID-19 metrics for continents, examining infection rates, mortality, and vaccination progress.

6. **COVID-19 Impact Across Income Levels:**
   - Analyze COVID-19 impact and vaccination responses based on income levels, highlighting disparities.

7. **COVID-19 Impact Across Countries:**
   - Delve into COVID-19 metrics for individual countries, focusing on population, total cases, deaths, vaccinations, and trends.
   - Visualize regional patterns, vaccination progress, and challenges faced by countries.

8. **COVID-19 Trends in Vietnam:**
   - Focus specifically on Vietnam's COVID-19 situation, examining total cases, deaths, vaccinations, and trends over time.
   - Visualize the progression of the pandemic in Vietnam, highlighting key insights.

## Skills Utilized

- **Joins:** Connecting data from multiple sources for comprehensive analysis.

- **Common Table Expressions (CTEs):** Simplifying complex queries and calculations.

- **Temporary Tables:** Storing interim results for efficient data manipulation.

- **Window Functions:** Performing computations over specified subsets of data.

- **Aggregate Functions:** Summarizing data to reveal trends and patterns.

- **Creating Views:** Organizing data for future reference and visualization.

- **Converting Data Types:** Ensuring data compatibility and accuracy.


This project not only showcases SQL proficiency but also aims to uncover trends, disparities, and progress in the fight against COVID-19. Join us on this data-driven journey as we unravel the story behind the numbers.

# Overview of the dataset

In this project, we focus on analyzing specific aspects of the extensive [COVID-19 dataset](https://ourworldindata.org/covid-deaths). This dataset is maintained by [Our World in Data](https://ourworldindata.org), with daily updates to ensure it reflects the latest information relevant to the COVID-19 pandemic. It covers a broad range of metrics sourced from reputable institutions, which supports its accuracy and reliability.

Our analysis homes in on the following key variables:

- **Confirmed Cases:** Total and new confirmed cases, smoothed averages, and cases per million people.
- **Confirmed Deaths:** Total and new deaths attributed to COVID-19, smoothed averages, and deaths per million people.
- **Vaccinations:** Total doses administered, people vaccinated, fully vaccinated, and booster doses.

The dataset we're using contains 390,786 rows spanning from January 1, 2020, to April 18, 2024. It is divided into two CSV files for easier analysis:

- **CovidDeaths.csv**: Contains the data on confirmed cases and deaths.
- **CovidVaccinations.csv**: Contains the data on vaccinations.

For further insights, you can explore the complete dataset [here](https://github.com/owid/covid-19-data/tree/master/public/data).

# Connect to the database

We will use the SQLite database in this project to easily work with SQL on Google Colab.

In [1]:
import sqlite3

# Connect to an SQLite database; use ':memory:' for an in-memory database
conn = sqlite3.connect('covid_data.db')

In [2]:
%%capture
# Install ipython-sql
!pip install ipython-sql

In [3]:
# Load the SQL extension
%load_ext sql

# Point the SQL magic at the same SQLite database file
%sql sqlite:///covid_data.db

# Load the dataset

We will import the dataset from CSV files into the SQLite database we've created.

In [4]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/Datasets/CovidDeaths.csv")
# if_exists="replace" keeps the row count stable when this cell is re-run
df.to_sql("COVID_DEATHS", conn, if_exists="replace", chunksize=100, index=False, method="multi")

390786

In [6]:
df = pd.read_csv("/content/drive/MyDrive/Datasets/CovidVaccinations.csv")
# if_exists="replace" keeps the row count stable when this cell is re-run
df.to_sql("COVID_VACCINATIONS", conn, if_exists="replace", chunksize=100, index=False, method="multi")

390786

# Data exploration

## Data preprocessing

Let's start with the `COVID_DEATHS` dataset, which contains the data on confirmed cases and deaths.

In [27]:
%%sql
SELECT continent, location, date, population, total_cases, new_cases, total_deaths, new_deaths
FROM COVID_DEATHS
ORDER BY location, date
LIMIT 100 -- due to limited resources

 * sqlite:///covid_data.db
Done.


continent,location,date,population,total_cases,new_cases,total_deaths,new_deaths
Asia,Afghanistan,1/1/2021,41128772,51848.0,0.0,2158.0,0.0
Asia,Afghanistan,1/1/2022,41128772,157902.0,0.0,7352.0,0.0
Asia,Afghanistan,1/1/2023,41128772,207579.0,257.0,7849.0,4.0
Asia,Afghanistan,1/1/2024,41128772,230375.0,0.0,7973.0,0.0
Asia,Afghanistan,1/10/2020,41128772,,0.0,,0.0
Asia,Afghanistan,1/10/2021,41128772,53489.0,780.0,2277.0,56.0
Asia,Afghanistan,1/10/2022,41128772,158345.0,0.0,7369.0,0.0
Asia,Afghanistan,1/10/2023,41128772,207780.0,0.0,7851.0,0.0
Asia,Afghanistan,1/10/2024,41128772,230642.0,0.0,7973.0,0.0
Asia,Afghanistan,1/11/2020,41128772,,0.0,,0.0


The dataset appears to be updated weekly: new cases and new deaths are reported only on Sundays (for example, 1/10/2021 and 1/1/2023 in the sample above). Any event dates mentioned in this analysis are therefore accurate only to the week.

Additionally, the `ORDER BY` on the `date` column does not produce chronological order in the previous query: `1/1/2021` appears before `1/10/2020`. Let's examine the table schema to identify the problem.
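To see why the ordering breaks, it helps to remember that text values are compared character by character. The following is a minimal sketch in plain Python, using a few illustrative date strings in the same `M/D/YYYY` style as the dataset (the list itself is made up for the example):

```python
from datetime import datetime

# Illustrative date strings in the dataset's M/D/YYYY style.
raw_dates = ["1/5/2020", "1/10/2020", "1/1/2021", "12/31/2020"]

# Sorting as text compares character by character, so the result is not chronological.
text_order = sorted(raw_dates)

# Parsing each string into a real date first gives chronological order.
date_order = sorted(raw_dates, key=lambda s: datetime.strptime(s, "%m/%d/%Y"))

print(text_order)
print(date_order)
```

The text sort places `1/1/2021` ahead of every 2020 date, which mirrors the ordering seen in the query output.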

In [28]:
%%sql
PRAGMA table_info(COVID_DEATHS)

 * sqlite:///covid_data.db
Done.


cid,name,type,notnull,dflt_value,pk
0,iso_code,TEXT,0,,0
1,continent,TEXT,0,,0
2,location,TEXT,0,,0
3,date,TEXT,0,,0
4,population,INTEGER,0,,0
5,total_cases,REAL,0,,0
6,new_cases,REAL,0,,0
7,new_cases_smoothed,REAL,0,,0
8,total_deaths,REAL,0,,0
9,new_deaths,REAL,0,,0


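We see that the `date` column is stored as text in `M/D/YYYY` format, so it sorts alphabetically rather than chronologically. To fix this, we will rewrite each value as an ISO 8601 `YYYY-MM-DD` string and declare the column as `DATE`. SQLite has no dedicated date storage class, so the values remain text, but ISO 8601 strings sort chronologically and work with SQLite's date functions.

Next, we'll save the processed `date` column along with the other columns we're interested in into a new table called `COVID_DEATHS_PROCESSED` for future analysis.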
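As an aside, the same conversion can be sketched outside pure SQL. The example below is an alternative illustration, not the approach used in this notebook: it registers a small Python helper as a SQL function on an in-memory database. The names `to_iso`, `TO_ISO`, and `raw_demo` are hypothetical and exist only for this sketch.

```python
import sqlite3
from datetime import datetime


def to_iso(text_date):
    """Convert 'M/D/YYYY' text to ISO 8601 'YYYY-MM-DD'; pass NULL through."""
    if text_date is None:
        return None
    return datetime.strptime(text_date, "%m/%d/%Y").date().isoformat()


conn_demo = sqlite3.connect(":memory:")
# Expose the Python helper to SQL under the hypothetical name TO_ISO.
conn_demo.create_function("TO_ISO", 1, to_iso)

conn_demo.execute("CREATE TABLE raw_demo (location TEXT, date TEXT)")
conn_demo.executemany(
    "INSERT INTO raw_demo VALUES (?, ?)",
    [
        ("Afghanistan", "1/10/2020"),
        ("Afghanistan", "1/5/2020"),
        ("Afghanistan", "12/31/2020"),
    ],
)

# Ordering by the converted value is chronological.
iso_rows = conn_demo.execute(
    "SELECT location, TO_ISO(date) AS iso_date FROM raw_demo ORDER BY iso_date"
).fetchall()
print(iso_rows)
```

A function registered this way is visible only on the connection where it was created, so it would not be available to the `%%sql` cells, which open their own connection. That is one reason a pure-SQL conversion is convenient here.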

In [29]:
%%sql
DROP TABLE IF EXISTS COVID_DEATHS_PROCESSED;

CREATE TABLE COVID_DEATHS_PROCESSED
(
    continent TEXT,
    location TEXT,
    date DATE,
    population REAL,
    total_cases REAL,
    new_cases REAL,
    total_deaths REAL,
    new_deaths REAL
);

INSERT INTO COVID_DEATHS_PROCESSED
WITH date_split AS (
    SELECT
      continent,
      location,
      SUBSTR(date, 1, INSTR(date, '/') - 1) AS month,
      SUBSTR(SUBSTR(date, INSTR(date, '/') + 1), 1, INSTR(SUBSTR(date, INSTR(date, '/') + 1), '/') - 1) AS date,
      SUBSTR(SUBSTR(date, INSTR(date, '/') + 1), INSTR(SUBSTR(date, INSTR(date, '/') + 1), '/') + 1) AS year,
      population,
      total_cases,
      new_cases,
      total_deaths,
      new_deaths
    FROM COVID_DEATHS
),
date_normalize AS (
    SELECT
      continent,
      location,
      year,
      CASE WHEN LENGTH(month) == 1 THEN '0' || month ELSE month END AS month,
      CASE WHEN LENGTH(date) == 1 THEN '0' || date ELSE date END AS date,
      population,
      total_cases,
      new_cases,
      total_deaths,
      new_deaths
    FROM date_split
)

SELECT
    continent,
    location,
    DATE(year || '-' || month || '-' || date) AS date,
    population,
    total_cases,
    new_cases,
    total_deaths,
    new_deaths
FROM date_normalize
ORDER BY location, date

 * sqlite:///covid_data.db
Done.
Done.
390786 rows affected.


[]

Let's take a look at our new table.

In [30]:
%%sql
SELECT *
FROM COVID_DEATHS_PROCESSED
LIMIT 100 -- due to limited resources

 * sqlite:///covid_data.db
Done.


continent,location,date,population,total_cases,new_cases,total_deaths,new_deaths
Asia,Afghanistan,2020-01-05,41128772.0,,0.0,,0.0
Asia,Afghanistan,2020-01-06,41128772.0,,0.0,,0.0
Asia,Afghanistan,2020-01-07,41128772.0,,0.0,,0.0
Asia,Afghanistan,2020-01-08,41128772.0,,0.0,,0.0
Asia,Afghanistan,2020-01-09,41128772.0,,0.0,,0.0
Asia,Afghanistan,2020-01-10,41128772.0,,0.0,,0.0
Asia,Afghanistan,2020-01-11,41128772.0,,0.0,,0.0
Asia,Afghanistan,2020-01-12,41128772.0,,0.0,,0.0
Asia,Afghanistan,2020-01-13,41128772.0,,0.0,,0.0
Asia,Afghanistan,2020-01-14,41128772.0,,0.0,,0.0


In [31]:
%%sql
PRAGMA table_info(COVID_DEATHS_PROCESSED)

 * sqlite:///covid_data.db
Done.


cid,name,type,notnull,dflt_value,pk
0,continent,TEXT,0,,0
1,location,TEXT,0,,0
2,date,DATE,0,,0
3,population,REAL,0,,0
4,total_cases,REAL,0,,0
5,new_cases,REAL,0,,0
6,total_deaths,REAL,0,,0
7,new_deaths,REAL,0,,0


Everything looks good. Now, let's list the distinct `continent` and `location` pairs in our dataset using the processed table.

In [32]:
%%sql
SELECT DISTINCT continent, location
FROM COVID_DEATHS_PROCESSED
ORDER BY 1, 2

 * sqlite:///covid_data.db
Done.


continent,location
,Africa
,Asia
,Europe
,European Union
,High income
,Low income
,Lower middle income
,North America
,Oceania
,South America


We can observe that there are rows where the `continent` is null, and the `location` is the continent itself (such as Asia, Africa, Europe, etc.), or descriptors like High income, Low income, Lower middle income, Upper middle income, or World.

For our further analysis in this project, to obtain accurate numbers for each country, we will filter out the rows where the `continent` is null. If we need to gather the numbers for each continent, we will use the rows where the `continent` is null and the `location` is the respective continent.
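The filtering rule described above can be sketched on a tiny in-memory table. The table name `deaths_demo` and all numbers are made up for illustration; only the `continent IS NULL` convention comes from the dataset.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE deaths_demo (continent TEXT, location TEXT, total_cases REAL)")
demo.executemany(
    "INSERT INTO deaths_demo VALUES (?, ?, ?)",
    [
        ("Asia", "Vietnam", 10.0),
        ("Europe", "France", 20.0),
        (None, "Asia", 100.0),   # aggregate row: continent is NULL
        (None, "World", 500.0),  # aggregate row: continent is NULL
    ],
)

# Country-level rows only.
countries = [
    row[0]
    for row in demo.execute(
        "SELECT location FROM deaths_demo WHERE continent IS NOT NULL ORDER BY location"
    )
]

# Aggregate rows (continents, income groups, World).
aggregates = [
    row[0]
    for row in demo.execute(
        "SELECT location FROM deaths_demo WHERE continent IS NULL ORDER BY location"
    )
]

print(countries)
print(aggregates)
```

Keeping the two groups apart avoids double counting, since a continent row already sums the countries inside it.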

Let's redirect our focus to the second table, `COVID_VACCINATIONS`.

In [33]:
%%sql
SELECT continent, location, date, total_vaccinations, new_vaccinations
FROM COVID_VACCINATIONS
ORDER BY location, date
LIMIT 100 -- due to limited resources

 * sqlite:///covid_data.db
Done.


continent,location,date,total_vaccinations,new_vaccinations
Asia,Afghanistan,1/1/2021,,
Asia,Afghanistan,1/1/2022,,
Asia,Afghanistan,1/1/2023,,
Asia,Afghanistan,1/1/2024,,
Asia,Afghanistan,1/10/2020,,
Asia,Afghanistan,1/10/2021,,
Asia,Afghanistan,1/10/2022,,
Asia,Afghanistan,1/10/2023,,
Asia,Afghanistan,1/10/2024,,
Asia,Afghanistan,1/11/2020,,


We've identified the same issue with the `date` column as in the `COVID_DEATHS` dataset. To address this, we'll preprocess it as we did earlier and then save it in a new table named `COVID_VACCINATIONS_PROCESSED`.

In [34]:
%%sql
DROP TABLE IF EXISTS COVID_VACCINATIONS_PROCESSED;

CREATE TABLE COVID_VACCINATIONS_PROCESSED
(
    continent TEXT,
    location TEXT,
    date DATE,
    total_vaccinations REAL,
    new_vaccinations REAL
);

INSERT INTO COVID_VACCINATIONS_PROCESSED
WITH date_split AS (
    SELECT
      continent,
      location,
      SUBSTR(date, 1, INSTR(date, '/') - 1) AS month,
      SUBSTR(SUBSTR(date, INSTR(date, '/') + 1), 1, INSTR(SUBSTR(date, INSTR(date, '/') + 1), '/') - 1) AS date,
      SUBSTR(SUBSTR(date, INSTR(date, '/') + 1), INSTR(SUBSTR(date, INSTR(date, '/') + 1), '/') + 1) AS year,
      total_vaccinations,
      new_vaccinations
    FROM COVID_VACCINATIONS
),
date_normalize AS (
    SELECT
      continent,
      location,
      year,
      CASE WHEN LENGTH(month) == 1 THEN '0' || month ELSE month END AS month,
      CASE WHEN LENGTH(date) == 1 THEN '0' || date ELSE date END AS date,
      total_vaccinations,
      new_vaccinations
    FROM date_split
)

SELECT
    continent,
    location,
    DATE(year || '-' || month || '-' || date) AS date,
    total_vaccinations,
    new_vaccinations
FROM date_normalize
ORDER BY location, date

 * sqlite:///covid_data.db
Done.
Done.
390786 rows affected.


[]

Let's take a look at our new table.

In [35]:
%%sql
SELECT *
FROM COVID_VACCINATIONS_PROCESSED
LIMIT 100 -- due to limited resources

 * sqlite:///covid_data.db
Done.


continent,location,date,total_vaccinations,new_vaccinations
Asia,Afghanistan,2020-01-05,,
Asia,Afghanistan,2020-01-06,,
Asia,Afghanistan,2020-01-07,,
Asia,Afghanistan,2020-01-08,,
Asia,Afghanistan,2020-01-09,,
Asia,Afghanistan,2020-01-10,,
Asia,Afghanistan,2020-01-11,,
Asia,Afghanistan,2020-01-12,,
Asia,Afghanistan,2020-01-13,,
Asia,Afghanistan,2020-01-14,,


We've identified another issue with our table: the `total_vaccinations` column has data in only a few rows, while most entries are empty. This could lead to a disjointed line chart if we aim to display total vaccinations on our dashboard. Additionally, the `new_vaccinations` column has many null values, which is unexpected given the increase in `total_vaccinations`.

To address these issues, we'll take the following steps:
- For every null value in the `total_vaccinations` column, we will carry forward the most recent earlier non-null value of `total_vaccinations` for that country (a forward fill). Rows before a country's first reported value stay null.
- We will then recalculate `new_vaccinations` as the difference between consecutive `total_vaccinations` values.

Finally, we will save the processed data in a new table called `COVID_VACCINATIONS_PROCESSED_V2`.
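The two steps above can be sketched on a five-row toy table before applying them to the full dataset. This is a minimal illustration, assuming SQLite 3.25 or newer (required for window functions); the table `vac_demo` and its values are made up. A running count of non-null values labels each "island" that starts at a reported total, and `FIRST_VALUE` then copies that total across the island.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE vac_demo (location TEXT, date TEXT, total_vaccinations REAL)")
demo.executemany(
    "INSERT INTO vac_demo VALUES (?, ?, ?)",
    [
        ("A", "2021-01-01", None),
        ("A", "2021-01-02", 100.0),
        ("A", "2021-01-03", None),
        ("A", "2021-01-04", None),
        ("A", "2021-01-05", 250.0),
    ],
)

query = """
WITH grouped AS (
    SELECT location, date, total_vaccinations,
           -- Running count of non-null totals: increases by one at each reported value.
           COUNT(total_vaccinations) OVER (PARTITION BY location ORDER BY date) AS grp
    FROM vac_demo
),
filled AS (
    SELECT location, date,
           -- Within each group, the first row (by date) holds the reported value.
           FIRST_VALUE(total_vaccinations)
               OVER (PARTITION BY location, grp ORDER BY date) AS total_filled
    FROM grouped
)
SELECT date, total_filled,
       total_filled - LAG(total_filled)
           OVER (PARTITION BY location ORDER BY date) AS new_vaccinations
FROM filled
ORDER BY date
"""

rows = demo.execute(query).fetchall()
for row in rows:
    print(row)
```

In this toy run, the days without a report inherit the previous total and therefore get a difference of zero, while the first reported day has no known previous total and keeps a null difference.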

In [36]:
%%sql
DROP TABLE IF EXISTS COVID_VACCINATIONS_PROCESSED_V2;

CREATE TABLE COVID_VACCINATIONS_PROCESSED_V2
(
    continent TEXT,
    location TEXT,
    date DATE,
    total_vaccinations REAL,
    new_vaccinations REAL
);

INSERT INTO COVID_VACCINATIONS_PROCESSED_V2
WITH total_vaccinations_portion AS (
    SELECT
        continent,
        location,
        date,
        total_vaccinations,
        SUM(CASE WHEN total_vaccinations IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY continent, location ORDER BY date) AS total_vaccinations_partition,
        new_vaccinations
    FROM COVID_VACCINATIONS_PROCESSED
),
new_total_vaccinations AS (
    SELECT
        continent,
        location,
        date,
        FIRST_VALUE(total_vaccinations) OVER(PARTITION BY total_vaccinations_partition, continent, location ORDER BY date) AS total_vaccinations,
        new_vaccinations
    FROM total_vaccinations_portion
)

SELECT
    continent,
    location,
    date,
    total_vaccinations,
    total_vaccinations - LAG(total_vaccinations) OVER(PARTITION BY continent, location ORDER BY date) AS new_vaccinations
FROM new_total_vaccinations
ORDER BY location, date

 * sqlite:///covid_data.db
Done.
Done.
390786 rows affected.


[]

Let's once more review our processed table.

In [37]:
%%sql
SELECT *
FROM COVID_VACCINATIONS_PROCESSED_V2
LIMIT 100 -- due to limited resources

 * sqlite:///covid_data.db
Done.


continent,location,date,total_vaccinations,new_vaccinations
Asia,Afghanistan,2020-01-05,,
Asia,Afghanistan,2020-01-06,,
Asia,Afghanistan,2020-01-07,,
Asia,Afghanistan,2020-01-08,,
Asia,Afghanistan,2020-01-09,,
Asia,Afghanistan,2020-01-10,,
Asia,Afghanistan,2020-01-11,,
Asia,Afghanistan,2020-01-12,,
Asia,Afghanistan,2020-01-13,,
Asia,Afghanistan,2020-01-14,,


Finally, let's create a view that combines the processed `COVID_DEATHS` and processed `COVID_VACCINATIONS` tables to facilitate easier analysis of the data in both tables later. Let's name it `COVID_COMBINE_VIEW`.
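One detail of the join is worth illustrating: the aggregate rows have a null `continent`, and in SQL `NULL = NULL` is not true, so a plain equality on `continent` would silently drop those rows. The sketch below uses two made-up tables (`d` and `v` are hypothetical names) to compare a plain equality, the `IFNULL` approach, and SQLite's null-safe `IS` operator.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE d (continent TEXT, location TEXT);
    CREATE TABLE v (continent TEXT, location TEXT);
    INSERT INTO d VALUES (NULL, 'World'), ('Asia', 'Vietnam');
    INSERT INTO v VALUES (NULL, 'World'), ('Asia', 'Vietnam');
    """
)


def count_matches(condition):
    """Count joined rows for a given continent-matching condition."""
    sql = f"SELECT COUNT(*) FROM d JOIN v ON {condition} AND d.location = v.location"
    return demo.execute(sql).fetchone()[0]


# Plain equality: the 'World' row is lost because NULL = NULL is not true.
plain = count_matches("d.continent = v.continent")

# Mapping NULL to an empty string makes both sides comparable.
with_ifnull = count_matches("IFNULL(d.continent, '') = IFNULL(v.continent, '')")

# SQLite's IS operator treats two NULLs as equal.
with_is = count_matches("d.continent IS v.continent")

print(plain, with_ifnull, with_is)
```

Either null-safe form keeps the aggregate rows in the joined result.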

In [38]:
%%sql
DROP VIEW IF EXISTS COVID_COMBINE_VIEW;

CREATE VIEW COVID_COMBINE_VIEW
AS
SELECT
    dae.*,
    vac.total_vaccinations,
    vac.new_vaccinations
FROM COVID_DEATHS_PROCESSED dae
JOIN COVID_VACCINATIONS_PROCESSED_V2 vac
    ON IFNULL(dae.continent, '') = IFNULL(vac.continent, '')
    AND dae.location = vac.location
    AND dae.date = vac.date
ORDER BY 1, 2, 3


So now we can query from the view.

In [39]:
%%sql
SELECT *
FROM COVID_COMBINE_VIEW
LIMIT 100 -- due to limited resources

 * sqlite:///covid_data.db
Done.


continent,location,date,population,total_cases,new_cases,total_deaths,new_deaths,total_vaccinations,new_vaccinations
,Africa,2020-01-05,1426736614.0,,0.0,,0.0,,
,Africa,2020-01-06,1426736614.0,,0.0,,0.0,,
,Africa,2020-01-07,1426736614.0,,0.0,,0.0,,
,Africa,2020-01-08,1426736614.0,,0.0,,0.0,,
,Africa,2020-01-09,1426736614.0,,0.0,,0.0,,
,Africa,2020-01-10,1426736614.0,,0.0,,0.0,,
,Africa,2020-01-11,1426736614.0,,0.0,,0.0,,
,Africa,2020-01-12,1426736614.0,,0.0,,0.0,,
,Africa,2020-01-13,1426736614.0,,0.0,,0.0,,
,Africa,2020-01-14,1426736614.0,,0.0,,0.0,,


Let's verify that the view has the same number of rows as the original tables.

In [40]:
%%sql
SELECT 'COVID_COMBINE_VIEW' AS TABLE_NAME, COUNT(*) AS ROW_COUNT
FROM COVID_COMBINE_VIEW

UNION ALL
SELECT 'COVID_DEATHS' AS TABLE_NAME, COUNT(*) AS ROW_COUNT
FROM COVID_DEATHS

UNION ALL
SELECT 'COVID_VACCINATIONS' AS TABLE_NAME, COUNT(*) AS ROW_COUNT
FROM COVID_VACCINATIONS

 * sqlite:///covid_data.db
Done.


TABLE_NAME,ROW_COUNT
COVID_COMBINE_VIEW,390786
COVID_DEATHS,390786
COVID_VACCINATIONS,390786


Everything looks good. Let's delve into our analysis.
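Equal row counts are a useful sanity check, but they would not by themselves reveal a case where some rows fail to match while others match more than once. A complementary check, sketched here on a made-up table (`processed_demo` is hypothetical), looks for join keys that appear more than once:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE processed_demo (location TEXT, date TEXT)")
demo.executemany(
    "INSERT INTO processed_demo VALUES (?, ?)",
    [
        ("Vietnam", "2020-01-05"),
        ("Vietnam", "2020-01-05"),  # deliberate duplicate for the demo
        ("Vietnam", "2020-01-12"),
    ],
)

# Any (location, date) pair listed here would be joined more than once.
duplicates = demo.execute(
    """
    SELECT location, date, COUNT(*) AS n
    FROM processed_demo
    GROUP BY location, date
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicates)
```

Running the same `GROUP BY ... HAVING` query against `COVID_DEATHS_PROCESSED` and `COVID_VACCINATIONS_PROCESSED_V2` should return no rows if each location and date pair is unique.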

## Global COVID-19 Overview

Let's explore global COVID-19 statistics.

In [41]:
%%sql
-- Let's look at the global numbers
SELECT
  MAX(population) AS global_population,
  MAX(total_cases) AS global_total_cases,
  MAX(total_deaths) AS global_total_deaths,
  MAX(total_cases) / MAX(population) AS global_covid_percentage,
  MAX(total_deaths) / MAX(total_cases) AS global_death_percentage,
  MAX(total_vaccinations) AS global_total_vaccinations,
  MAX(total_vaccinations) / MAX(population) AS global_avg_vaccination
FROM COVID_COMBINE_VIEW
WHERE location == 'World'

 * sqlite:///covid_data.db
Done.


global_population,global_total_cases,global_total_deaths,global_covid_percentage,global_death_percentage,global_total_vaccinations,global_avg_vaccination
7975105024.0,775251765.0,7043660.0,0.0972089725046861,0.0090856420043106,13570830469.0,1.7016491228843282


**Global COVID-19 Overview**

1. **COVID-19 Impact:**
   - The global COVID percentage (confirmed cases as a share of population) stands at 9.72%, indicating a significant portion of the global population affected by the virus.
   - The global death percentage (deaths as a share of confirmed cases) is 0.91%, reflecting the impact of COVID-19 on mortality worldwide.

2. **Vaccination Efforts:**
   - There have been a total of 13.57 billion vaccinations administered globally.
   - The average vaccination rate is 1.70 doses per person, indicating progress in vaccination campaigns.

3. **Global Comparison:**
   - These global figures provide a snapshot of the scale of the COVID-19 pandemic and the response through vaccinations.
   - While progress has been made with vaccinations, the COVID percentage highlights the ongoing challenges in controlling the spread.

This analysis provides an overview of the global COVID-19 situation, including total cases, deaths, vaccination efforts, and related percentages. It offers insights into the scale of the pandemic and the progress in vaccination campaigns worldwide.
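For readers who want to trace the percentages back to the raw totals, here is a short arithmetic check in Python. The three input figures are copied from the query output above; the variable names are chosen only for this sketch.

```python
# Figures copied from the global query output.
global_population = 7975105024.0
global_total_cases = 775251765.0
global_total_deaths = 7043660.0
global_total_vaccinations = 13570830469.0

# Confirmed cases as a percentage of population.
covid_pct = round(global_total_cases / global_population * 100, 2)

# Deaths as a percentage of confirmed cases.
death_pct = round(global_total_deaths / global_total_cases * 100, 2)

# Average number of doses administered per person.
doses_per_person = round(global_total_vaccinations / global_population, 2)

print(f"{covid_pct:.2f}% of population, {death_pct:.2f}% of cases, "
      f"{doses_per_person:.2f} doses per person")
```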

Let's explore the trend of new COVID-19 cases, deaths, and vaccinations worldwide over time.

In [42]:
%%sql
SELECT
  location,
  date,
  new_cases AS global_new_cases,
  new_deaths AS global_new_deaths,
  new_vaccinations AS global_new_vaccinations
FROM COVID_COMBINE_VIEW
WHERE location == 'World'
  AND NOT (new_cases == 0 AND new_deaths == 0)
ORDER BY date

 * sqlite:///covid_data.db
Done.


location,date,global_new_cases,global_new_deaths,global_new_vaccinations
World,2020-01-05,2.0,3.0,
World,2020-01-12,45.0,1.0,
World,2020-01-19,90.0,2.0,
World,2020-01-26,1896.0,56.0,
World,2020-02-02,12538.0,310.0,
World,2020-02-09,23059.0,545.0,
World,2020-02-16,31734.0,864.0,
World,2020-02-23,9578.0,692.0,
World,2020-03-01,8272.0,519.0,
World,2020-03-08,20207.0,650.0,


**Global COVID-19 Trends**

1. **COVID-19 Cases and Deaths:**
   - The data reveals fluctuating trends in new COVID-19 cases and deaths over time.
   - Peak periods are noticeable, such as in March and April 2020, as well as during the winter months of 2020 and 2021.
   - A decline in cases and deaths is observed in mid-2021, followed by a rise again in late 2021 and early 2022. Subsequent decreases and sporadic increases are seen up to March 2024.

2. **Vaccination Efforts:**
   - The timeline shows the introduction and scaling up of global vaccination efforts, particularly from December 2020 onwards.
   - Notable milestones in new vaccinations occurred in late 2020 to early 2021, with significant increases in doses administered.
   - Vaccination efforts continued throughout 2021 and 2022, with varying rates of new vaccinations per week.

3. **Recent Trends (2023 - 2024):**
   - In recent years (2023 - 2024), a decrease in new cases and deaths is observed, potentially indicating the impact of vaccination campaigns.
   - The data shows a more consistent rate of new vaccinations compared to earlier periods.

This analysis provides a detailed overview of the global COVID-19 situation, highlighting trends in new cases, deaths, and vaccination efforts from January 2020 to March 2024. The insights showcase the impact of various factors on the pandemic's trajectory and the progress made in vaccination campaigns globally.

## COVID-19 Impact Across Continents

Let's analyze COVID-19 statistics for individual continents.

In [43]:
%%sql
SELECT
  location,
  MAX(population) AS population,
  MAX(total_cases) AS total_cases,
  MAX(total_deaths) AS total_deaths,
  MAX(total_vaccinations) AS total_vaccinations,
  MAX(total_cases) / MAX(population) AS covid_percentage,
  MAX(total_deaths) / MAX(total_cases) AS death_percentage,
  MAX(total_vaccinations) / MAX(population) AS avg_vaccination
FROM COVID_COMBINE_VIEW
WHERE continent IS NULL
  AND location NOT IN ('World', 'High income', 'Upper middle income', 'Lower middle income', 'Low income')
GROUP BY location
ORDER BY location

 * sqlite:///covid_data.db
Done.


location,population,total_cases,total_deaths,total_vaccinations,covid_percentage,death_percentage,avg_vaccination
Africa,1426736614.0,13140491.0,259095.0,863138096.0,0.0092101729717016,0.0197172997569116,0.6049736773629895
Asia,4721383370.0,301392304.0,1636933.0,9102086165.0,0.0638355923213242,0.0054312368905079,1.927843060327465
Europe,744807803.0,252450811.0,2099805.0,1395261693.0,0.3389475915573887,0.0083176797558396,1.873317770544356
European Union,450146793.0,185619587.0,1260979.0,951109336.0,0.4123534586638719,0.0067933509624714,2.112887064375909
North America,600323657.0,124541232.0,1661466.0,1158127259.0,0.2074568119177086,0.0133406902542926,1.929171448594104
Oceania,45038860.0,14894138.0,32334.0,87655293.0,0.3306952707062301,0.0021709212040334,1.946214735452896
South America,436816679.0,68826532.0,1354009.0,964561963.0,0.1575638827655663,0.0196727767716089,2.2081619346774075


**COVID-19 Impact Across Continents**

1. **Variation in COVID Impact:**
   - **Europe** and the **European Union** have the highest COVID percentages at 33.89% and 41.24%, respectively.
   - **Africa** and **Asia** have lower COVID percentages at 0.92% and 6.38%, respectively.
   - **Oceania** is close behind Europe at 33.07%, while **North America** and **South America** fall in the middle at 20.75% and 15.76%, respectively.

2. **Differences in Death Percentages:**
   - **Europe** has a death percentage of 0.83%, slightly higher than the **European Union** at 0.68%.
   - **Oceania** has the lowest death percentage among continents at 0.22%, followed by **Asia** at 0.54%.
   - **Africa** and **South America** have the highest death percentages, at about 1.97% each.

3. **Vaccination Progress:**
   - **European Union** and **South America** have the highest average vaccination rates per person, with 2.11 and 2.21 doses respectively.
   - **Asia** and **North America** also have high average vaccination rates of 1.93 doses per person.
   - **Africa** has the lowest average vaccination rate at 0.60 doses per person.

This analysis provides a snapshot of the diverse impact of COVID-19 across continents, highlighting differences in infection rates, mortality, and vaccination efforts. It shows the importance of global collaboration and targeted responses to combat the pandemic effectively.
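Comparisons like these are easier to verify when the table is ranked explicitly. The sketch below copies the `death_percentage` column from the query output into a Python dictionary (the variable names are chosen for this example) and sorts it from lowest to highest.

```python
# death_percentage values copied from the continent query output.
death_percentage = {
    "Africa": 0.0197172997569116,
    "Asia": 0.0054312368905079,
    "Europe": 0.0083176797558396,
    "European Union": 0.0067933509624714,
    "North America": 0.0133406902542926,
    "Oceania": 0.0021709212040334,
    "South America": 0.0196727767716089,
}

# Rank locations from lowest to highest death percentage.
ranked = sorted(death_percentage.items(), key=lambda item: item[1])

for location, share in ranked:
    print(f"{location}: {share * 100:.2f}%")
```

The same pattern works for `covid_percentage` and `avg_vaccination`; in SQL, the equivalent is an `ORDER BY` on the computed column.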

How about examining the COVID-19 impact across different income groups?

In [44]:
%%sql
SELECT
  location,
  MAX(population) AS population,
  MAX(total_cases) AS total_cases,
  MAX(total_deaths) AS total_deaths,
  MAX(total_vaccinations) AS total_vaccinations,
  MAX(total_cases) / MAX(population) AS covid_percentage,
  MAX(total_deaths) / MAX(total_cases) AS death_percentage,
  MAX(total_vaccinations) / MAX(population) AS avg_vaccination
FROM COVID_COMBINE_VIEW
WHERE continent IS NULL
  AND location IN ('High income', 'Upper middle income', 'Lower middle income', 'Low income')
GROUP BY location
ORDER BY location

 * sqlite:///covid_data.db
Done.


location,population,total_cases,total_deaths,total_vaccinations,covid_percentage,death_percentage,avg_vaccination
High income,1250514600.0,428541299.0,2983098.0,2839741192.0,0.3426919597740002,0.0069610513781543,2.27085808674285
Low income,737604900.0,2328082.0,48045.0,333087284.0,0.003156272416303,0.0206371596876742,0.4515795434656142
Lower middle income,3432097300.0,97535147.0,1341389.0,4948362891.0,0.0284185261880541,0.0137528782316799,1.4417898032785958
Upper middle income,2525921300.0,245634014.0,2667162.0,5449551989.0,0.097245315600292,0.0108582763297594,2.157451219481779


**COVID-19 Impact Across Income Levels**

1. **COVID-19 Impact:**
   - **High Income** countries have the highest COVID percentage at 34.27%, indicating a significant portion of the population affected.
   - **Low Income** countries show the lowest COVID percentage at 0.32%, suggesting relatively lower spread.

2. **Death Rates:**
   - **Low Income** countries have the highest death percentage at 2.06%, despite lower overall cases.
   - **High Income** countries have the lowest death percentage at 0.70%, followed by **Upper Middle Income** countries at 1.09%.

3. **Vaccination Efforts:**
   - **High Income** and **Upper Middle Income** countries have the highest average vaccination rates, with 2.27 and 2.16 doses per person, respectively.
   - **Low Income** countries have the lowest average vaccination rate at 0.45 doses per person.

4. **Economic Disparities and COVID Impact:**
   - There is a clear disparity in COVID impact and vaccination rates based on income levels.
   - Lower income countries have lower COVID percentages but higher death percentages, indicating potential challenges in healthcare infrastructure.

This analysis sheds light on the disparities in COVID-19 impact and vaccination responses based on the income levels of countries, highlighting the importance of equitable access to vaccines and healthcare resources.

## COVID-19 Impact Across Countries

Let's delve into the latest COVID-19 metrics for each country, focusing on population, total cases, total deaths, total vaccinations, and derived metrics.

In [45]:
%%sql
SELECT
  location,
  MAX(population) AS population,
  MAX(total_cases) AS total_cases,
  MAX(total_deaths) AS total_deaths,
  MAX(total_vaccinations) AS total_vaccinations,
  MAX(total_cases) / MAX(population) AS covid_percentage,
  MAX(total_deaths) / MAX(total_cases) AS death_percentage,
  MAX(total_vaccinations) / MAX(population) AS avg_vaccination
FROM COVID_COMBINE_VIEW
WHERE continent IS NOT NULL
GROUP BY location
ORDER BY location

 * sqlite:///covid_data.db
Done.


location,population,total_cases,total_deaths,total_vaccinations,covid_percentage,death_percentage,avg_vaccination
Afghanistan,41128772.0,232948.0,7985.0,22606931.0,0.0056638695655683,0.0342780362999467,0.5496621926859377
Albania,2842318.0,334863.0,3605.0,3088966.0,0.117813348119387,0.0107655966768499,1.0867770601319064
Algeria,44903228.0,272017.0,6881.0,15267442.0,0.0060578495603924,0.025296213104328,0.3400076716088206
American Samoa,44295.0,8359.0,34.0,,0.1887120442487865,0.0040674721856681,
Andorra,79843.0,48015.0,159.0,157072.0,0.601367684080007,0.0033114651671352,1.9672607492203449
Angola,35588996.0,107357.0,1937.0,27722924.0,0.003016578495218,0.0180426055124491,0.7789746021494959
Anguilla,15877.0,3904.0,12.0,24604.0,0.2458902815393336,0.0030737704918032,1.5496630345783209
Antigua and Barbuda,93772.0,9106.0,146.0,136512.0,0.0971078786844687,0.0160333845815945,1.4557863754638911
Argentina,45510324.0,10130118.0,130845.0,116978521.0,0.2225894502530898,0.012916433944797,2.5703732849715597
Armenia,2780472.0,451831.0,8777.0,2256919.0,0.1625015465000187,0.0194254046313776,0.8117035524903685


**COVID-19 Impact Across Countries**

1. **Population vs. Total Cases:**
   - The country with the highest total cases is India, with 45,034,360 cases, likely due to its massive population of 1.42 billion.
   - Iceland has a relatively small population but a high case count compared to its size.
   - China, with its large population, has a relatively low total case count compared to countries with similar populations, which may reflect strict control measures, differences in testing and reporting, or both.

2. **COVID-19 Impact on Mortality:**
   - Mexico has a high death count (335,011) relative to its total cases (7,709,747), indicating a significant impact on mortality.
   - Lesotho, despite a low total case count (36,138), has a relatively high death count (709), suggesting a higher death rate.

3. **Vaccination Progress:**
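   - Gibraltar leads in vaccination, with total doses administered equal to 628.88% of its population (about 6.29 doses per person), followed by Iceland (562.92%) and Malta (227.66%). Because these figures count doses rather than people, they can exceed 100%.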
   - Some countries like American Samoa and Eritrea do not have vaccination data available.
   - Larger countries like India and China have made significant progress in vaccination numbers, with 2.2 billion and 3.5 billion doses administered respectively.

4. **Country Comparisons:**
   - The United States has the highest total vaccination count (4.87 billion doses), reflecting its large population.
   - European countries like Spain, Italy, and Germany have vaccination rates above 2 doses per person, showing progress in vaccination campaigns.
   - African countries like Madagascar and Mali have lower vaccination rates, likely due to challenges in distribution and access.

5. **Regional Patterns:**
   - Countries in Europe, particularly Luxembourg, Iceland, and Malta, show high vaccination rates per capita.
   - South American countries like Brazil and Argentina have relatively high total case counts and death counts.
   - Southeast Asian countries like Indonesia and Malaysia have moderate vaccination rates but relatively low case counts compared to their populations.

6. **Challenges:**
   - Some countries, like Cook Islands and Nauru, have low total case counts, likely due to their isolated locations or limited testing.
   - Several countries, including Eritrea and American Samoa, have no available data for total vaccinations, possibly due to reporting issues or lack of infrastructure.

7. **Conclusion:**
   - Vaccination rates vary significantly across countries, influenced by factors such as population size, access to vaccines, and healthcare infrastructure.
   - More populous countries tend to report higher total cases, since controlling the spread of the virus is harder at scale.
   - Mortality rates can vary widely, influenced by factors like healthcare capacity, age demographics, and public health measures.
   - Continuous monitoring and international cooperation are crucial to managing and mitigating the impact of COVID-19 on a global scale.

These insights provide a glimpse into the diverse impact of COVID-19 across different countries and regions, highlighting both successes and challenges in the ongoing battle against the pandemic.
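
The mortality comparison for Mexico and Lesotho can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `death_rate_pct` is hypothetical, and the inputs are the totals quoted in the mortality bullets above.

```python
def death_rate_pct(total_deaths: float, total_cases: float) -> float:
    """Return deaths as a percentage of confirmed cases."""
    return 100.0 * total_deaths / total_cases


# Totals quoted in the text above.
mexico = death_rate_pct(335_011, 7_709_747)
lesotho = death_rate_pct(709, 36_138)

print(f"Mexico:  {mexico:.2f}%")   # 4.35%
print(f"Lesotho: {lesotho:.2f}%")  # 1.96%
```

Expressing both countries as a ratio makes them comparable despite very different case totals.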

## COVID-19 Trends in Vietnam

After the global, continental, and country-level analysis, we now turn to Vietnam. The query below returns total cases, deaths, and vaccinations for Vietnam, along with three derived ratios.

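```sql
%%sql
-- Cumulative totals for Vietnam plus three derived ratios (fractions, not percentages).
SELECT
  location,
  date,
  population,
  total_cases,
  total_deaths,
  total_vaccinations,
  total_cases / population AS covid_percentage,        -- confirmed cases per person
  total_deaths / total_cases AS death_percentage,      -- deaths per confirmed case
  total_vaccinations / population AS avg_vaccination   -- doses per person
FROM COVID_COMBINE_VIEW
WHERE continent IS NOT NULL
  AND location = 'Vietnam'
  AND NOT (new_cases = 0 AND new_deaths = 0)  -- skip rows with no new cases and no new deaths
ORDER BY location, date
```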

| location | date | population | total_cases | total_deaths | total_vaccinations | covid_percentage | death_percentage | avg_vaccination |
|---|---|---|---|---|---|---|---|---|
| Vietnam | 2020-01-26 | 98186856.0 | 2.0 | | | 2.036932519766189e-08 | | |
| Vietnam | 2020-02-02 | 98186856.0 | 6.0 | | | 6.11079755929857e-08 | | |
| Vietnam | 2020-02-09 | 98186856.0 | 13.0 | | | 1.324006137848023e-07 | | |
| Vietnam | 2020-02-16 | 98186856.0 | 16.0 | | | 1.6295460158129518e-07 | | |
| Vietnam | 2020-03-08 | 98186856.0 | 20.0 | | | 2.0369325197661894e-07 | | |
| Vietnam | 2020-03-15 | 98186856.0 | 53.0 | | | 5.397871177380402e-07 | | |
| Vietnam | 2020-03-22 | 98186856.0 | 94.0 | | | 9.57358284290109e-07 | | |
| Vietnam | 2020-03-29 | 98186856.0 | 174.0 | | | 1.7721312921965849e-06 | | |
| Vietnam | 2020-04-05 | 98186856.0 | 240.0 | | | 2.444319023719428e-06 | | |
| Vietnam | 2020-04-12 | 98186856.0 | 258.0 | | | 2.627642950498384e-06 | | |


**Data Analysis: COVID-19 Trends in Vietnam**

**Overview:**
- **Location**: Vietnam
- **Population**: 98,186,856

**Key Metrics:**
- **Total Cases**: The number of confirmed COVID-19 cases.
- **Total Deaths**: The number of deaths due to COVID-19.
- **Total Vaccinations**: The total number of COVID-19 vaccinations administered.
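- **COVID-19 Percentage** (`covid_percentage`): Confirmed cases as a fraction of the population (`total_cases / population`); multiply by 100 for a percentage.
- **Death Percentage** (`death_percentage`): Deaths as a fraction of confirmed cases (`total_deaths / total_cases`).
- **Average Vaccination** (`avg_vaccination`): Vaccine doses administered per person (`total_vaccinations / population`).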
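
The derived columns can be reproduced outside the notebook with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the table `covid_demo` is a hypothetical stand-in for `COVID_COMBINE_VIEW`, loaded with the first two rows of the output above, and the ratios are scaled by 100 so that they are true percentages.

```python
import sqlite3

# In-memory database; covid_demo is a hypothetical stand-in for COVID_COMBINE_VIEW.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE covid_demo (
        location TEXT, date TEXT, population REAL,
        total_cases REAL, total_deaths REAL, total_vaccinations REAL
    )
    """
)

# First two rows of the query output; None is stored as SQL NULL.
sample_rows = [
    ("Vietnam", "2020-01-26", 98186856.0, 2.0, None, None),
    ("Vietnam", "2020-02-02", 98186856.0, 6.0, None, None),
]
conn.executemany("INSERT INTO covid_demo VALUES (?, ?, ?, ?, ?, ?)", sample_rows)

result = conn.execute(
    """
    SELECT
        date,
        100.0 * total_cases / population  AS covid_pct,
        100.0 * total_deaths / total_cases AS death_pct,
        total_vaccinations / population    AS doses_per_person
    FROM covid_demo
    WHERE location = 'Vietnam'
    ORDER BY date
    """
).fetchall()

for row in result:
    print(row)

conn.close()
```

Any ratio built on a NULL input comes back as `None`, which is why the death and vaccination columns are empty in the early rows. Multiplying by `100.0` also forces floating-point division, which matters in SQLite if the columns were ever stored as integers.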

**Insights:**

**Total Cases and Deaths:**
- The number of COVID-19 cases in Vietnam started from 2 on 2020-01-26 and gradually increased over time.
- Total cases reached 1,142,641 by 2022-09-11, with a significant increase seen in 2022.
- Total deaths, blank in the earliest rows, reached 43,130 by 2022-09-11. The death percentage remained relatively low.

**Vaccination Efforts:**
- Vaccination data starts from 2021-03-07, and the total number of vaccinations reached 258,723,446 by 2022-09-11.
- Doses per person (`avg_vaccination`) rose over time, reflecting the continued rollout of the vaccination campaign.
- `covid_percentage` is built from cumulative cases, so it does not fall once vaccinations begin. A slower rise after the campaign started would be consistent with an effect of vaccination, but would not prove one.
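
As a quick check on the scale of the vaccination figures, the doses-per-person ratio for the latest date can be computed directly. The sketch below only uses the population and the vaccination total quoted above; the variable names are illustrative.

```python
# Figures quoted in the text: population and total vaccinations as of 2022-09-11.
population = 98_186_856
total_vaccinations = 258_723_446

# Same ratio the query exposes as avg_vaccination.
doses_per_person = total_vaccinations / population

print(round(doses_per_person, 3))  # about 2.635 doses per person
```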

**Trends Over Time:**
- **COVID-19 Cases**: The number of cases showed fluctuations but generally increased over time.
- **Vaccinations**: Vaccination numbers rose steadily from the start, with a notable increase in the number of daily vaccinations.
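- **COVID-19 Percentage**: This ratio is cumulative (`total_cases / population`), so it rises or plateaus rather than declines. A slowdown in spread would show up as a flattening of the curve, and attributing it to vaccination would need further analysis.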
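
Since `total_cases` is a running total, changes in pace are easier to see in the differences between consecutive rows. A minimal sketch, using the ten totals from the query output above (the list and variable names are illustrative):

```python
# Cumulative totals from the first ten output rows (2020-01-26 to 2020-04-12).
total_cases = [2, 6, 13, 16, 20, 53, 94, 174, 240, 258]

# New cases between consecutive reported rows.
increments = [later - earlier for earlier, later in zip(total_cases, total_cases[1:])]

# A cumulative series never decreases, so every increment is non-negative.
is_cumulative = all(step >= 0 for step in increments)

print(increments)     # [4, 7, 3, 4, 33, 41, 80, 66, 18]
print(is_cumulative)  # True
```

The query drops rows with no new cases and no new deaths, so consecutive rows are not always one reporting period apart. Each increment is the growth between two reported rows, not a strict weekly figure.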

**Conclusion:**
- Vietnam saw a significant increase in cases in 2022, alongside continued growth in total vaccinations.
- The cumulative ratios reported here cannot show a decline in infections on their own. Judging the effect of the vaccination campaign would require comparing new-case rates before and after it began.
- Despite the rising cases, the death percentage remained relatively low, possibly reflecting the vaccination campaign's impact.

This analysis provides a snapshot of Vietnam's COVID-19 situation, highlighting the trends in cases, deaths, vaccinations, and their impact on the population.

# Conclusion

In conclusion, this project has provided a comprehensive analysis of COVID-19 data using SQL querying techniques. Here are the key takeaways:

- **Global Overview:** We explored the global impact of COVID-19, highlighting total cases, deaths, vaccination efforts, and related percentages. The data revealed significant progress in vaccination campaigns, impacting COVID-19 percentages and trends over time.

- **Continental and Income-Level Analysis:** Our analysis across continents and income levels showcased disparities in COVID-19 impact and vaccination rates. Higher income countries generally exhibited higher vaccination rates, while lower income countries faced challenges in controlling the spread.

- **Country-Specific Insights:** Delving into individual countries' data provided a nuanced view of the pandemic. We observed varied trends in total cases, deaths, and vaccination rates, reflecting regional differences in healthcare infrastructure and response strategies.

- **Vietnam's COVID-19 Trends:** Focusing on Vietnam, we analyzed its total cases, deaths, vaccinations, and trends over time. The data showed a notable increase in cases in 2022 alongside growing vaccination totals. The relatively low death percentage is consistent with a positive effect from these efforts, although this analysis does not establish causation.

## Key Learnings

- **Data Analysis Skills:** Utilizing SQL and data visualization tools, we gained insights into complex datasets.
- **Regional Disparities:** Highlighted the importance of equitable access to vaccines and healthcare resources across different regions and income levels.
- **Impact of Vaccination:** Observed that rising vaccination coverage coincided with a relatively low death percentage, which is consistent with effective vaccination campaigns.

## Future Considerations

- **Continued Monitoring:** Ongoing monitoring of COVID-19 data is crucial for adapting strategies and interventions.
- **Equitable Distribution:** Ensuring equitable distribution of vaccines to all regions and income levels remains a critical challenge.
- **Healthcare Infrastructure:** Investing in healthcare infrastructure and response capabilities is essential for future pandemic preparedness.

This project underscores the power of data analysis in understanding and combating global health crises. By examining COVID-19 data through multiple lenses, we gain valuable insights that can inform policy decisions, healthcare strategies, and international collaborations in the ongoing fight against the pandemic.