I will use SQL to review the global COVID-19 data (Source: [https://ourworldindata.org/covid-deaths](https://ourworldindata.org/covid-deaths)) and provide some insights.

Key skills: Window Function, Aggregate Function, CTE, Data Type Conversion 

Jupyter Notebook

I am only displaying the top 5 rows of each result for easier viewing in this notebook.

1\. First, let's look at the total cases, total deaths and total vaccinations data for Canada

<span style="color:#008000;">-- We have null total vaccination values for the most recent 5 days. Vaccination data reporting appears to lag cases and deaths by 5 days</span>

In [19]:
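select top (5) date, [location], total_cases, total_deaths, total_vaccinations
FROM dbo.covid
where location = 'Canada'
-- column 4 is total_deaths; it is cumulative, so sorting on it descending surfaces the latest rows,
-- but rows with equal total_deaths (such as 31-07 and 01-08 below) are not guaranteed to be in date order
order by 4 desc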


date,location,total_cases,total_deaths,total_vaccinations
02-08-2022,Canada,4099374,42969,
31-07-2022,Canada,4092722,42951,
01-08-2022,Canada,4093713,42951,
30-07-2022,Canada,4091778,42949,
29-07-2022,Canada,4090442,42937,
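
The 5-day lag noted above can also be measured directly instead of being read off the top rows. The query below is a minimal sketch that assumes the `date` column may be stored as `dd-mm-yyyy` text (which is how it is displayed in the output), hence the `TRY_CONVERT` with style 105. The names `CanadaDates`, `ReportDate`, `LatestRow`, `LatestVaccinationRow` and `LagDays` are illustrative and not part of the dataset.

```sql
-- Sketch: compare the newest row for Canada with the newest row that has vaccination data.
-- Assumption: [date] is either a date type or dd-mm-yyyy text (style 105 parses dd-mm-yyyy).
WITH CanadaDates AS (
    SELECT TRY_CONVERT(date, [date], 105) AS ReportDate,
           total_vaccinations
    FROM dbo.covid
    WHERE location = 'Canada'
)
SELECT MAX(ReportDate) AS LatestRow,
       MAX(CASE WHEN total_vaccinations IS NOT NULL THEN ReportDate END) AS LatestVaccinationRow,
       -- number of days between the last vaccination report and the last row overall
       DATEDIFF(day,
                MAX(CASE WHEN total_vaccinations IS NOT NULL THEN ReportDate END),
                MAX(ReportDate)) AS LagDays
FROM CanadaDates;
```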


2\. Let's sum the cases at each location to see which location has the highest total.

I have intentionally recreated total\_cases by summing new\_cases.

The US has the highest total cases, while Canada is ranked 32nd in terms of total cases.

Some countries do not have any reported COVID cases.

In [20]:

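select top (5) [location], SUM(new_cases) As TotalCases, SUM(new_deaths) As TotalDeath, SUM(new_vaccinations) As Totalvaccination
FROM dbo.covid
-- continent-level aggregates are stored as locations with a NULL continent, so exclude them
where continent is not null
group by location
-- column 3 is TotalDeath, so the rows shown are ranked by deaths; use "order by 2 desc" to rank by cases
order by 3 DESC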

location,TotalCases,TotalDeath,Totalvaccination
United States,91585520,1031240,603657696
Brazil,33785346,679093,446216563
India,44067144,519105,1959202084
Russia,18350864,374765,148805839
Mexico,6783380,320838,150702693
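
The ranking quoted above (Canada at 32nd by total cases) can be checked with a window function rather than by counting rows. This is a rough sketch under the same `continent is not null` filter. The CTE names `CountryTotals` and `Ranked` and the alias `CaseRank` are hypothetical names chosen for illustration.

```sql
-- Sketch: rank every country by summed new_cases, then look up specific countries.
WITH CountryTotals AS (
    SELECT location,
           SUM(new_cases) AS TotalCases
    FROM dbo.covid
    WHERE continent IS NOT NULL      -- drop continent-level aggregate rows
    GROUP BY location
),
Ranked AS (
    SELECT location,
           TotalCases,
           -- RANK leaves gaps after ties; NULL totals sort last when ordering descending
           RANK() OVER (ORDER BY TotalCases DESC) AS CaseRank
    FROM CountryTotals
)
SELECT location, TotalCases, CaseRank
FROM Ranked
WHERE location IN ('United States', 'Canada');
```

Ranking in a separate CTE before filtering matters here: filtering to Canada first would rank it against nothing but itself.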


3\. Let's look at the list of countries without any reported COVID cases.

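There are 15 countries with NULL reported COVID total cases. The empty result below is consistent with this cell having been re-run after the NULL values were set to 0 in step 6, in which case SUM returns 0 rather than NULL.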

We need to investigate further to ensure the total\_cases column is error-free.

In [21]:
-- first, let's create a temporary table from the previous example
Drop TABLE if exists TempCovid
CREATE Table TempCovid(Country NVARCHAR(255),TotalCases numeric)
insert into TempCovid
SELECT location, SUM(new_cases) As TotalCases
FROM dbo.covid
where continent is not null
GROUP by location

-- let's keep only the countries with NULL total cases
Select top (5) *
FROM TempCovid
where TotalCases is NULL

Country,TotalCases
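
The same list can be produced without a staging table by filtering the grouped result with `HAVING`. The following is a small sketch of that alternative, not the approach used in the cell above.

```sql
-- Sketch: countries whose summed new_cases is NULL, without creating TempCovid.
SELECT location
FROM dbo.covid
WHERE continent IS NOT NULL
GROUP BY location
HAVING SUM(new_cases) IS NULL   -- SUM is NULL only when every new_cases value for the location is NULL
ORDER BY location;
```

Because `SUM` skips NULL inputs, a location appears here only if it has no non-NULL `new_cases` value at all, so a location with all zeros is not returned.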


4\. Let's look at the trend in COVID cases and deaths as a percentage of population in Canada

There are NULL values in the existing total\_deaths column from Jan 23, 2020 to March 08, 2020.

In [22]:
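SELECT top (5) date, location, population, total_cases, total_deaths,
    (cast(total_cases as float)/population)*100 As PercentCases,
    -- cast here as well so both ratios use float division regardless of the column types
    (cast(total_deaths as float)/population)*100 As percentdeath
from dbo.covid
where location in ('Canada')
-- column 5 is total_deaths
order by 5 desc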


date,location,population,total_cases,total_deaths,PercentCases,percentdeath
02-08-2022,Canada,38155012,4099374,42969,10.743998717652088,0.1126169217297061
31-07-2022,Canada,38155012,4092722,42951,10.726564572958331,0.1125697457518818
01-08-2022,Canada,38155012,4093713,42951,10.729161872626328,0.1125697457518818
30-07-2022,Canada,38155012,4091778,42949,10.72409045501021,0.112564503976568
29-07-2022,Canada,38155012,4090442,42937,10.72058894910058,0.1125330533246851
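
The date range with missing total\_deaths mentioned above can be confirmed with a short aggregate. This sketch again assumes `date` may be `dd-mm-yyyy` text and converts it with style 105. The aliases `FirstNullDate`, `LastNullDate` and `NullRows` are illustrative.

```sql
-- Sketch: first and last Canadian rows where total_deaths is NULL, and how many there are.
SELECT MIN(TRY_CONVERT(date, [date], 105)) AS FirstNullDate,
       MAX(TRY_CONVERT(date, [date], 105)) AS LastNullDate,
       COUNT(*) AS NullRows
FROM dbo.covid
WHERE location = 'Canada'
  AND total_deaths IS NULL;
```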


5\. Let's try to correct the null values in new\_deaths and new\_cases in the Canadian data

In [23]:
-- let's list the rows with NULL new_deaths or new_cases in Canada
select top (5) date, new_cases, new_deaths
from dbo.covid
where location = 'Canada'
  and (new_deaths is null or new_cases is null)

date,new_cases,new_deaths


In [24]:



-- let's start by creating a new table from the original table
Drop TABLE if exists TempCovid1
CREATE Table TempCovid1(location NVARCHAR(255),pop numeric, new_cases numeric,new_deaths numeric)
insert into TempCovid1
SELECT location, population, new_cases, new_deaths
FROM dbo.covid
where location = 'Canada'

-- let's change all the null values in new_deaths to zero so that it doesn't affect our aggregate
UPDATE TempCovid1
SET new_deaths = 0
WHERE new_deaths is NULL

-- let's change all the null values in new_cases to zero so that they don't affect our aggregate
UPDATE TempCovid1
SET new_cases = 0
WHERE new_cases is NULL

-- let's recalculate the summation of cases and death
select avg(pop), sum(new_cases) as TotalCases, sum(new_deaths) as TotalDeaths, (SUM(new_cases)/max(pop))*100 As PercentCases, (sum(new_deaths)/max(pop))*100 As percentdeath
FROM TempCovid1

-- data cleaning is necessary for this dataset


(No column name),TotalCases,TotalDeaths,PercentCases,percentdeath
38155012.0,4107420,43329,10.765,0.1135
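
The same recalculation can be sketched as a single query that leaves the stored values untouched, using `COALESCE` to substitute 0 at read time. Note that `SUM` already skips NULL inputs, so the substitution only changes the result when every value in the group is NULL. The column aliases below are illustrative.

```sql
-- Sketch: totals and percentages for Canada without a staging table or UPDATE statements.
SELECT MAX(population) AS Population,
       SUM(COALESCE(new_cases, 0)) AS TotalCases,
       SUM(COALESCE(new_deaths, 0)) AS TotalDeaths,
       -- multiply by 100.0 first so the division is not done in integer arithmetic
       SUM(COALESCE(new_cases, 0)) * 100.0 / MAX(population) AS PercentCases,
       SUM(COALESCE(new_deaths, 0)) * 100.0 / MAX(population) AS PercentDeath
FROM dbo.covid
WHERE location = 'Canada';
```

Keeping the NULLs in the source table also preserves the distinction between "no report" and "zero reported", which an in-place update erases.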


6\. Let's compare total vaccinations with total deaths and cases across all locations

In [25]:
-- let's update the original covid table so that null values no longer affect the aggregations
update covid
set new_cases = 0
where new_cases is null

update covid
set new_deaths = 0
where new_deaths is null

update covid
set new_vaccinations = 0
where new_vaccinations is null


select top (5) date, location, population, new_cases, new_deaths,
    SUM(new_vaccinations) OVER (Partition by Location Order by location, date) as TotalVaccinated,
    SUM(new_cases) OVER (Partition by Location Order by location, date) as TotalCases1,
    SUM(new_deaths) OVER (Partition by Location Order by location, date) as TotalDeaths1
from covid
where [continent] is not NULL
order by 2


date,location,population,new_cases,new_deaths,TotalVaccinated,TotalCases1,TotalDeaths1
01-01-2021,Afghanistan,40099462,183,12,0,183,12
01-01-2022,Afghanistan,40099462,23,0,0,206,12
01-02-2021,Afghanistan,40099462,36,4,0,242,16
01-02-2022,Afghanistan,40099462,629,3,0,871,19
01-03-2020,Afghanistan,40099462,0,0,0,871,19
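
The dates in the output above appear in text order (01-01-2021, 01-01-2022, 01-02-2021, ...) rather than calendar order, which suggests the `date` column is stored as `dd-mm-yyyy` text and the running totals are being accumulated in that order. A minimal sketch of a chronological running total, assuming that storage format, converts the value inside the window's `ORDER BY`. `ReportDate` is an illustrative alias.

```sql
-- Sketch: running totals accumulated in calendar order.
-- Assumption: [date] holds dd-mm-yyyy text; style 105 parses that layout.
SELECT TOP (5)
       TRY_CONVERT(date, [date], 105) AS ReportDate,
       location, population, new_cases, new_deaths,
       SUM(new_vaccinations) OVER (PARTITION BY location
                                   ORDER BY TRY_CONVERT(date, [date], 105)) AS TotalVaccinated,
       SUM(new_cases) OVER (PARTITION BY location
                            ORDER BY TRY_CONVERT(date, [date], 105)) AS TotalCases1,
       SUM(new_deaths) OVER (PARTITION BY location
                             ORDER BY TRY_CONVERT(date, [date], 105)) AS TotalDeaths1
FROM dbo.covid
WHERE continent IS NOT NULL
ORDER BY location, ReportDate;
```

Since the window is already partitioned by location, ordering by the converted date alone is enough inside `OVER`. If the column is in fact a real date type, the conversion is harmless.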


7\. Let's use a CTE to look at the percentage of cases, deaths and vaccinations across all locations

In [26]:
with New_Covid_Table(date, location, population, new_cases, new_deaths, TotalVaccinated, TotalCases1, TotalDeaths1)
as (
    select date, location, population, new_cases, new_deaths,
        SUM(new_vaccinations) OVER (Partition by Location Order by location, date) as TotalVaccinated,
        SUM(new_cases) OVER (Partition by Location Order by location, date) as TotalCases1,
        SUM(new_deaths) OVER (Partition by Location Order by location, date) as TotalDeaths1
    from covid
    where [continent] is not NULL
)

select top(5) *, (TotalVaccinated/population)*100 as PercentVaccinated
, (TotalCases1/population)*100 as PercentCases
, (TotalDeaths1/population)*100 as PercentDeaths
FROM New_Covid_Table
ORDER by 2

date,location,population,new_cases,new_deaths,TotalVaccinated,TotalCases1,TotalDeaths1,PercentVaccinated,PercentCases,PercentDeaths
01-01-2021,Afghanistan,40099462,183,12,0,183,12,0,0.0004563652250496,2.9925588527846085e-05
01-01-2022,Afghanistan,40099462,23,0,0,206,12,0,0.0005137226030613,2.9925588527846085e-05
01-02-2021,Afghanistan,40099462,36,4,0,242,16,0,0.0006034993686448,3.990078470379478e-05
01-02-2022,Afghanistan,40099462,629,3,0,871,19,0,0.0021720989673128,4.73821818357563e-05
01-03-2020,Afghanistan,40099462,0,0,0,871,19,0,0.0021720989673128,4.73821818357563e-05
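
Because the query above returns one row per location per day, the top 5 rows all belong to Afghanistan. For a one-row-per-location comparison, a CTE can aggregate first and compute the percentages afterwards. This is a sketch only. `LocationTotals` and its column aliases are hypothetical names, and since new\_vaccinations appears to count doses, the vaccination figure reads as doses per 100 people and may exceed 100.

```sql
-- Sketch: one summary row per location, with percentages of population.
WITH LocationTotals AS (
    SELECT location,
           MAX(population) AS Population,
           SUM(new_cases) AS TotalCases,
           SUM(new_deaths) AS TotalDeaths,
           SUM(new_vaccinations) AS TotalVaccinations
    FROM dbo.covid
    WHERE continent IS NOT NULL
    GROUP BY location
)
SELECT TOP (5)
       location,
       Population,
       -- NULLIF avoids a divide-by-zero error if a population is recorded as 0
       CAST(TotalCases AS float) / NULLIF(Population, 0) * 100 AS PercentCases,
       CAST(TotalDeaths AS float) / NULLIF(Population, 0) * 100 AS PercentDeaths,
       CAST(TotalVaccinations AS float) / NULLIF(Population, 0) * 100 AS DosesPer100People
FROM LocationTotals
ORDER BY PercentCases DESC;
```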


8\. Let's create a view of the data for use in a visualization tool

In [27]:
Create VIEW Covid_Data
AS
select date, location, population,new_cases, new_deaths, 
SUM(new_vaccinations) OVER (Partition by Location Order by location, date) as TotalVaccinated, 
SUM(new_cases) OVER (Partition by Location Order by location, date) as TotalCases1, 
SUM(new_deaths) OVER (Partition by Location Order by location, date) as TotalDeaths1
from covid
where [continent] is not NULL



Msg 2714, Level 16, State 3, Procedure Covid_Data, Line 1
There is already an object named 'Covid_Data' in the database.
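
Msg 2714 is raised because a `Covid_Data` object already exists, presumably from an earlier run of this cell. On SQL Server 2016 SP1 or later, `CREATE OR ALTER VIEW` makes the cell safe to re-run. The sketch below keeps the same definition as above. On older versions the view would have to be dropped first in a separate batch.

```sql
-- Sketch: re-runnable version of the view definition (requires SQL Server 2016 SP1 or later).
CREATE OR ALTER VIEW Covid_Data
AS
SELECT date, location, population, new_cases, new_deaths,
       SUM(new_vaccinations) OVER (PARTITION BY location ORDER BY location, date) AS TotalVaccinated,
       SUM(new_cases) OVER (PARTITION BY location ORDER BY location, date) AS TotalCases1,
       SUM(new_deaths) OVER (PARTITION BY location ORDER BY location, date) AS TotalDeaths1
FROM covid
WHERE continent IS NOT NULL;
```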

9\. Summary of percent cases and percent deaths per continent

In [1]:
-- let's update the original covid table so that null values no longer affect the aggregations
update covid
set new_cases = 0
where new_cases is null

update covid
set new_deaths = 0
where new_deaths is null

update covid
set new_vaccinations = 0
where new_vaccinations is null

select location, max((convert(float,total_cases)/population)*100) as PercentCases,max((convert(float,total_deaths)/population)*100) as PercentDeaths
from covid
where continent is null and location in ('Africa','Asia','Europe','Oceania','North America','South America')
group by [location]

location,PercentCases,PercentDeaths
North America,18.22994825913136,0.2479009136356694
Asia,3.5469752930916982,0.0309796730688381
Africa,0.8807645200484564,0.0183894748382922
Oceania,25.83492066802824,0.0367685962249379
South America,14.349771141897646,0.3032980228187298
Europe,29.151152877620817,0.2517367069827728
