I will use SQL to review global COVID-19 data (Source: [https://ourworldindata.org/covid-deaths](https://ourworldindata.org/covid-deaths)) and provide some insights.

Key skills: Aggregate function, Create Table, Update Table, Data Type Conversion 

Report Format: Jupyter Notebook

1\. First, let's look at the total cases, total deaths and total vaccinations data for Canada

<span style="color:#008000;">-- Total vaccination values are null for the five most recent dates. Vaccination reporting appears to lag cases and deaths by about five days.</span>

In [21]:
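-- Column 4 is total_deaths. Because it is cumulative, sorting on it descending puts
-- the newest rows roughly first, but dates with equal totals can appear out of order.
select date, [location], total_cases, total_deaths, total_vaccinations
FROM dbo.covid
where location = 'Canada'
order by 4 desc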


date,location,total_cases,total_deaths,total_vaccinations
02-08-2022,Canada,4099374,42969.0,
31-07-2022,Canada,4092722,42951.0,
01-08-2022,Canada,4093713,42951.0,
30-07-2022,Canada,4091778,42949.0,
29-07-2022,Canada,4090442,42937.0,
28-07-2022,Canada,4084973,42905.0,87412791.0
27-07-2022,Canada,4074977,42777.0,87326880.0
26-07-2022,Canada,4071469,42739.0,87248306.0
25-07-2022,Canada,4064084,42695.0,87214618.0
24-07-2022,Canada,4062808,42691.0,87198207.0
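
The reporting lag noted above can also be measured directly rather than read off the rows. The following is a minimal sketch, assuming the date column holds dd-mm-yyyy values (style 105 in SQL Server). The aliases LatestDate, LatestVaccinationDate and LagDays are introduced here for illustration only.

```sql
-- Assumption: [date] is dd-mm-yyyy text; TRY_CONVERT returns NULL for values it cannot parse
SELECT
    MAX(TRY_CONVERT(date, [date], 105)) AS LatestDate,
    MAX(CASE WHEN total_vaccinations IS NOT NULL
             THEN TRY_CONVERT(date, [date], 105) END) AS LatestVaccinationDate,
    -- days between the last vaccination report and the last row of any kind
    DATEDIFF(day,
             MAX(CASE WHEN total_vaccinations IS NOT NULL
                      THEN TRY_CONVERT(date, [date], 105) END),
             MAX(TRY_CONVERT(date, [date], 105))) AS LagDays
FROM dbo.covid
WHERE location = 'Canada';
```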


2\. Let's sum the new cases for each location to see which location has the highest case count.

I have intentionally recreated total\_cases from new\_cases instead of using the existing column.

The United States has the highest total cases, while Canada ranks 32nd.

Some countries do not have any reported COVID cases.

In [29]:
select [location],
SUM(new_cases) As TotalCases,
SUM(new_deaths) As TotalDeath,
SUM(new_vaccinations) As Totalvaccination
FROM dbo.covid
-- continent-level aggregates are stored as locations with a NULL continent, so exclude them
where continent is not null
group by location
order by 2 DESC

location,TotalCases,TotalDeath,Totalvaccination
United States,91585520.0,1031240.0,603657696.0
India,44067144.0,519105.0,1959202084.0
France,34436630.0,152736.0,148561854.0
Brazil,33785346.0,679093.0,446216563.0
Germany,31044610.0,144381.0,183952920.0
United Kingdom,22555882.0,180072.0,147715062.0
Italy,21124792.0,172428.0,139717402.0
South Korea,20052304.0,25110.0,127443179.0
Russia,18350864.0,374765.0,148805839.0
Turkey,15066270.0,99341.0,148224840.0


3\. Let's look at the list of countries without any reported COVID cases.

There are 15 countries with a NULL total for reported COVID cases.

We need to investigate further to ensure the total\_cases column is error-free.

In [38]:
-- first, let's create a working table from the previous example
Drop TABLE if exists TempCovid
CREATE Table TempCovid(Country NVARCHAR(255), TotalCases numeric)
insert into TempCovid
SELECT location, SUM(new_cases) As TotalCases
FROM dbo.covid
where continent is not null
GROUP by location

-- let's keep only the countries with a NULL total
Select *
FROM TempCovid
where TotalCases is NULL

Country,TotalCases
Guernsey,
Puerto Rico,
Turkmenistan,
Northern Mariana Islands,
Pitcairn,
Tuvalu,
Guam,
Niue,
Northern Cyprus,
Sint Maarten (Dutch part),
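
SUM skips NULL inputs and returns NULL only when every input row is NULL, so a NULL total means a location has no non-null new\_cases values at all. One way to investigate further is to compare that with the existing total\_cases column. This is a sketch only, and the aliases TotalRows, RowsWithNewCases and MaxReportedTotalCases are illustrative names.

```sql
-- For each location with a NULL summed total, count its rows and check
-- whether the existing total_cases column holds any value
SELECT location,
       COUNT(*)         AS TotalRows,
       COUNT(new_cases) AS RowsWithNewCases,      -- COUNT(column) ignores NULLs
       MAX(total_cases) AS MaxReportedTotalCases  -- NULL if total_cases is also empty
FROM dbo.covid
WHERE continent IS NOT NULL
GROUP BY location
HAVING SUM(new_cases) IS NULL
ORDER BY location;
```

A location that shows a non-null MaxReportedTotalCases here would point to a mismatch between new\_cases and total\_cases.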


4\. Let's look at the trend in COVID cases and deaths as a percentage of the population in Canada

The existing total\_deaths column is null from Jan 23, 2020 to March 08, 2020.

In [5]:
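-- total_cases is cast to float so the division is not done as integer division
SELECT date, location, population, total_cases, total_deaths,
(cast(total_cases as float)/population)*100 As PercentCases,
(total_deaths/population)*100 As percentdeath
from dbo.covid
where location in ('Canada')
-- column 3 is population, which is constant for a single country, so this imposes no real row order
order by 3 desc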


date,location,population,total_cases,total_deaths,PercentCases,percentdeath
11-05-2020,Canada,38155012,71606,5836.0,0.1876712815605981,0.0152955003657186
12-05-2020,Canada,38155012,72723,5979.0,0.190598813073365,0.0156702873006565
13-05-2020,Canada,38155012,73950,6091.0,0.1938146422283919,0.0159638267182303
14-05-2020,Canada,38155012,75060,6213.0,0.1967238275275604,0.0162835750123732
15-05-2020,Canada,38155012,76331,6337.0,0.2000549757394913,0.0166085650818298
16-05-2020,Canada,38155012,77486,6452.0,0.2030821009832207,0.0169099671623743
17-05-2020,Canada,38155012,78694,6542.0,0.2062481332727663,0.0171458470514961
18-05-2020,Canada,38155012,79780,6637.0,0.2090944172681691,0.0173948313789024
19-05-2020,Canada,38155012,80782,6714.0,0.2117205467003915,0.0175966397284844
20-05-2020,Canada,38155012,82038,6832.0,0.2150123815974688,0.0179059044719996


5\. Let's try to correct the null values in new\_deaths and new\_cases in the Canadian data

In [30]:
-- let's list the rows where new_deaths or new_cases is null
-- (no location filter is applied here, so rows from every location are returned)
select date, new_cases, new_deaths
from covid
where new_deaths is null or new_cases is null

date,new_cases,new_deaths
21-01-2021,0.0,
22-01-2021,0.0,
23-01-2021,0.0,
24-01-2021,0.0,
25-01-2021,0.0,
26-01-2021,6.0,
27-01-2021,0.0,
28-01-2021,0.0,
29-01-2021,0.0,
30-01-2021,0.0,
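
To restrict the null check to Canada, the query needs a location filter. The following is a minimal sketch, assuming the date column holds dd-mm-yyyy values (style 105), which is why the sort converts it first.

```sql
-- Rows for Canada where either daily figure is missing, oldest first
SELECT [date], location, new_cases, new_deaths
FROM dbo.covid
WHERE location = 'Canada'
  AND (new_deaths IS NULL OR new_cases IS NULL)
ORDER BY TRY_CONVERT(date, [date], 105);  -- assumption: dd-mm-yyyy text
```

The parentheses matter here: without them, AND binds tighter than OR and the location filter would apply to only one of the two conditions.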


In [36]:



-- let's start by creating a new table from the original table, restricted to Canada
Drop TABLE if exists TempCovid1
CREATE Table TempCovid1(location NVARCHAR(255),pop numeric, new_cases numeric,new_deaths numeric)
insert into TempCovid1
SELECT location, population, new_cases, new_deaths
FROM dbo.covid
where location = 'Canada'

-- let's change all the null values in new_deaths to zero so that it doesn't affect our aggregate
UPDATE TempCovid1
SET new_deaths = 0
WHERE new_deaths is NULL

-- let's do the same for the null values in new_cases
UPDATE TempCovid1
SET new_cases = 0
WHERE new_cases is NULL

-- let's recalculate the summation of cases and death
select  sum(new_cases) as TotalCases, sum(new_deaths) as TotalDeaths, (SUM(new_cases)/max(pop))*100 As PercentCases, (sum(new_deaths)/max(pop))*100 As percentdeath
FROM TempCovid1

-- data cleaning is necessary for this dataset


TotalCases,TotalDeaths,PercentCases,percentdeath
4107420,43329,10.765,0.1135
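
Since SUM already skips NULL inputs, replacing NULLs with zero mainly matters for locations where every value is NULL, where the total would otherwise come back as NULL. As an alternative to updating the table, the substitution can be made at query time. This is a sketch of that option, not the approach taken in this notebook.

```sql
-- COALESCE turns an all-NULL total into 0 without modifying the stored data
SELECT location,
       COALESCE(SUM(new_cases), 0)  AS TotalCases,
       COALESCE(SUM(new_deaths), 0) AS TotalDeaths
FROM dbo.covid
WHERE continent IS NOT NULL
GROUP BY location
ORDER BY TotalCases DESC;
```

This keeps the distinction between "not reported" and "zero" in the source table, which an UPDATE would erase.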


6\. Let's compare running totals of vaccinations, deaths and cases across all locations

In [44]:
-- let's update the original covid table so that null values no longer affect the aggregates
update covid
set new_cases = 0
where new_cases is null

update covid
set new_deaths = 0
where new_deaths is null

update covid
set new_vaccinations = 0
where new_vaccinations is null


select date, location, population, new_cases, new_deaths,
SUM(new_vaccinations) OVER (Partition by Location Order by location, date) as TotalVaccinated,
SUM(new_cases) OVER (Partition by Location Order by location, date) as TotalCases1,
SUM(new_deaths) OVER (Partition by Location Order by location, date) as TotalDeaths1
from covid
where [continent] is not NULL
order by 2


date,location,population,new_cases,new_deaths,TotalVaccinated,TotalCases1,TotalDeaths1
01-01-2021,Afghanistan,40099462,183,12,0,183,12
01-01-2022,Afghanistan,40099462,23,0,0,206,12
01-02-2021,Afghanistan,40099462,36,4,0,242,16
01-02-2022,Afghanistan,40099462,629,3,0,871,19
01-03-2020,Afghanistan,40099462,0,0,0,871,19
01-03-2021,Afghanistan,40099462,19,1,0,890,20
01-03-2022,Afghanistan,40099462,220,11,0,1110,31
01-04-2020,Afghanistan,40099462,26,0,0,1136,31
01-04-2021,Afghanistan,40099462,63,5,0,1199,36
01-04-2022,Afghanistan,40099462,35,0,0,1234,36
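
The rows above appear in the order 01-01-2021, 01-01-2022, 01-02-2021, which suggests that the date column is stored as dd-mm-yyyy text and sorted alphabetically. If so, the running totals do not accumulate in chronological order. Below is a minimal sketch of one possible fix, assuming date is a text column in dd-mm-yyyy format (style 105 in SQL Server). The alias ReportDate is introduced here for illustration.

```sql
-- Assumption: [date] holds dd-mm-yyyy text; TRY_CONVERT returns NULL for unparseable values
SELECT TRY_CONVERT(date, [date], 105) AS ReportDate,
       location,
       population,
       new_cases,
       new_deaths,
       SUM(new_vaccinations) OVER (PARTITION BY location
                                   ORDER BY TRY_CONVERT(date, [date], 105)) AS TotalVaccinated,
       SUM(new_cases)        OVER (PARTITION BY location
                                   ORDER BY TRY_CONVERT(date, [date], 105)) AS TotalCases1,
       SUM(new_deaths)       OVER (PARTITION BY location
                                   ORDER BY TRY_CONVERT(date, [date], 105)) AS TotalDeaths1
FROM covid
WHERE continent IS NOT NULL
ORDER BY location, ReportDate;
```

Ordering the window by location as well is unnecessary, because each partition already contains a single location.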


7\. Let's use a CTE to look at the percentage of cases, deaths and vaccinations across all locations

In [50]:
with New_Covid_Table(date, location, population, new_cases, new_deaths, TotalVaccinated, TotalCases1, TotalDeaths1)
as (
select date, location, population, new_cases, new_deaths,
SUM(new_vaccinations) OVER (Partition by Location Order by location, date) as TotalVaccinated,
SUM(new_cases) OVER (Partition by Location Order by location, date) as TotalCases1,
SUM(new_deaths) OVER (Partition by Location Order by location, date) as TotalDeaths1
from covid
where [continent] is not NULL
)

select *, (TotalVaccinated/population)*100 as PercentVaccinated
, (TotalCases1/population)*100 as PercentCases
, (TotalDeaths1/population)*100 as PercentDeaths
FROM New_Covid_Table
ORDER by 2

date,location,population,new_cases,new_deaths,TotalVaccinated,TotalCases1,TotalDeaths1,PercentVaccinated,PercentCases,PercentDeaths
01-01-2021,Afghanistan,40099462,183,12,0,183,12,0.0,0.0004563652250496,2.9925588527846085e-05
01-01-2022,Afghanistan,40099462,23,0,0,206,12,0.0,0.0005137226030613,2.9925588527846085e-05
01-02-2021,Afghanistan,40099462,36,4,0,242,16,0.0,0.0006034993686448,3.990078470379478e-05
01-02-2022,Afghanistan,40099462,629,3,0,871,19,0.0,0.0021720989673128,4.73821818357563e-05
01-03-2020,Afghanistan,40099462,0,0,0,871,19,0.0,0.0021720989673128,4.73821818357563e-05
01-03-2021,Afghanistan,40099462,19,1,0,890,20,0.0,0.0022194811491485,4.987598087974347e-05
01-03-2022,Afghanistan,40099462,220,11,0,1110,31,0.0,0.0027681169388257,7.730777036360239e-05
01-04-2020,Afghanistan,40099462,26,0,0,1136,31,0.0,0.0028329557139694,7.730777036360239e-05
01-04-2021,Afghanistan,40099462,63,5,0,1199,36,0.0,0.0029900650537406,8.977676558353826e-05
01-04-2022,Afghanistan,40099462,35,0,0,1234,36,0.0,0.0030773480202801,8.977676558353826e-05


8\. Let's create a view of the data for use in a visualization tool

In [54]:
Create VIEW Covid_Data
AS
select date, location, population,new_cases, new_deaths, 
SUM(new_vaccinations) OVER (Partition by Location Order by location, date) as TotalVaccinated, 
SUM(new_cases) OVER (Partition by Location Order by location, date) as TotalCases1, 
SUM(new_deaths) OVER (Partition by Location Order by location, date) as TotalDeaths1
from covid
where [continent] is not NULL
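
Once the view exists, a visualization tool or a follow-up query can read from it like a table. The following is a hypothetical usage example. The filter on Canada and the TOP (10) limit are illustrative choices, not part of the original analysis.

```sql
-- Hypothetical usage: the ten rows with the highest running case total for one country
SELECT TOP (10) date, location, TotalCases1, TotalDeaths1, TotalVaccinated
FROM Covid_Data
WHERE location = 'Canada'
ORDER BY TotalCases1 DESC;
```

A SQL Server view cannot carry its own ORDER BY unless it also uses TOP or OFFSET, so any sorting belongs in the query that reads the view.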

