# Window LAG

## COVID-19 Data
Notes on the data: This data set was assembled from work by [Rodrigo Pombo](https://github.com/pomber/covid19), which is based on data from [Johns Hopkins University](https://systems.jhu.edu/research/public-health/ncov/), which is in turn based on the [World Health Organization](https://www.who.int/health-topics/coronavirus). The data was assembled on 21 April 2020; there are no plans to keep this data set up to date.

In [1]:
# Prerequisites
from pyhive import hive
%load_ext sql
%sql hive://cloudera@quickstart.cloudera:10000/sqlzoo
%config SqlMagic.displaylimit = 20



## Window Functions
The SQL Window functions include LAG, LEAD, RANK and NTILE. These functions operate over a "window" of rows - typically these are rows in the table that are in some sense adjacent.

## 1. Introducing the `covid` table

The example uses a WHERE clause to show the cases in 'Italy' in March.

```sql
SELECT name, DAY(whn),
 confirmed, deaths, recovered
 FROM covid
WHERE name = 'Italy'
AND MONTH(whn) = 3
ORDER BY whn
```

**Modify the query to show data from Spain**

In [2]:
%%sql
SELECT name, EXTRACT(DAY FROM whn) whn, confirmed, deaths, recovered
  FROM covid
    WHERE name='Spain' AND EXTRACT(MONTH FROM whn)=3
    ORDER BY whn;

 * postgresql://postgres:***@localhost/sqlzoo
31 rows affected.


name,whn,confirmed,deaths,recovered
Spain,1.0,84,0,2
Spain,2.0,120,0,2
Spain,3.0,165,1,2
Spain,4.0,222,2,2
Spain,5.0,259,3,2
Spain,6.0,400,5,2
Spain,7.0,500,10,30
Spain,8.0,673,17,30
Spain,9.0,1073,28,32
Spain,10.0,1695,35,32


## 2. Introducing the LAG function

The LAG function is used to show data from the preceding row of the table. When lining up rows, the data is partitioned by country name and ordered by the date whn. Partitioning by name means that each row is only ever paired with an earlier row for the same country, so here only data from Italy is considered.

```sql
SELECT name, DAY(whn), confirmed,
   LAG(whn, 1) OVER (PARTITION BY name ORDER BY whn)
 FROM covid
WHERE name = 'Italy'
AND MONTH(whn) = 3
ORDER BY whn
```

**Modify the query to show confirmed for the day before.**

In [3]:
%%sql
SELECT name, EXTRACT(DAY FROM whn), confirmed,
  LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY whn) AS lag_confirmed
    FROM covid
    WHERE name='Italy' AND EXTRACT(MONTH FROM whn)=3
    ORDER BY whn;

 * postgresql://postgres:***@localhost/sqlzoo
31 rows affected.


name,date_part,confirmed,lag_confirmed
Italy,1.0,1694,
Italy,2.0,2036,1694.0
Italy,3.0,2502,2036.0
Italy,4.0,3089,2502.0
Italy,5.0,3858,3089.0
Italy,6.0,4636,3858.0
Italy,7.0,5883,4636.0
Italy,8.0,7375,5883.0
Italy,9.0,9172,7375.0
Italy,10.0,10149,9172.0


### LAG operation

Here is the correct query showing the cases for the day before:

```sql
SELECT name, DAY(whn), confirmed,
   LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY whn) AS dbf
 FROM covid
WHERE name = 'Italy'
AND MONTH(whn) = 3
ORDER BY whn
```

Notice how each value in the dbf (day before) column matches the confirmed value in the row above, that is, diagonally up and to the left.

name | DAY(whn) | confirmed | dbf
------|---|------|----------
Italy | 1 | **1694** | null
Italy | 2 | 2036 | **1694**
Italy | 3 | 2502 | 2036
Italy | 4 | 3089 | 2502
Italy | 5 | **3858** | 3089
Italy | 6 | 4636 | **3858**
Italy | 7 | 5883 | 4636
Italy | 8 | 7375 | 5883
Italy | 9 | 9172 | 7375
Italy | 10 | 10149 | 9172
... | | |
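
The diagonal pattern can be reproduced outside the database. The following is a minimal sketch, assuming Python's built-in `sqlite3` module linked against SQLite 3.25 or later (the first release with window functions). The in-memory table and the variable names are illustrative; the rows are the first five Italy figures from the table above.

```python
import sqlite3

# In-memory stand-in for the covid table (illustrative, not the real data set).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (name TEXT, whn TEXT, confirmed INTEGER)")
conn.executemany(
    "INSERT INTO covid VALUES (?, ?, ?)",
    [
        ("Italy", "2020-03-01", 1694),
        ("Italy", "2020-03-02", 2036),
        ("Italy", "2020-03-03", 2502),
        ("Italy", "2020-03-04", 3089),
        ("Italy", "2020-03-05", 3858),
    ],
)

# LAG(confirmed, 1) returns the confirmed value from one row earlier
# within the same partition, or NULL when there is no earlier row.
lag_rows = conn.execute(
    """
    SELECT whn, confirmed,
           LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY whn) AS dbf
      FROM covid
     ORDER BY whn
    """
).fetchall()

for row in lag_rows:
    print(row)
```

The first row has no predecessor, so its `dbf` value comes back as `None`, which corresponds to the null in the table.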

## 3. Number of new cases

The number of confirmed cases is cumulative, but we can use LAG to recover the number of new cases reported for each day: subtract the previous day's total from the current day's total.

```sql
SELECT name, DAY(whn), confirmed,
   LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY whn)
 FROM covid
WHERE name = 'Italy'
AND MONTH(whn) = 3
ORDER BY whn
```

**Show the number of new cases for each day, for Italy, for March.**

In [4]:
%%sql
SELECT name, EXTRACT(DAY FROM whn), 
  confirmed - LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY whn) AS new
    FROM covid
    WHERE name = 'Italy' AND EXTRACT(MONTH FROM whn) = 3
    ORDER BY whn;

 * postgresql://postgres:***@localhost/sqlzoo
31 rows affected.


name,date_part,new
Italy,1.0,
Italy,2.0,342.0
Italy,3.0,466.0
Italy,4.0,587.0
Italy,5.0,769.0
Italy,6.0,778.0
Italy,7.0,1247.0
Italy,8.0,1492.0
Italy,9.0,1797.0
Italy,10.0,977.0
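
As a cross-check on the subtraction, here is a small sketch assuming Python's built-in `sqlite3` module with SQLite 3.25 or later. The table is an in-memory stand-in holding the first five Italy totals quoted earlier, and the alias `new_cases` is chosen for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (name TEXT, whn TEXT, confirmed INTEGER)")
conn.executemany(
    "INSERT INTO covid VALUES (?, ?, ?)",
    [
        ("Italy", "2020-03-01", 1694),
        ("Italy", "2020-03-02", 2036),
        ("Italy", "2020-03-03", 2502),
        ("Italy", "2020-03-04", 3089),
        ("Italy", "2020-03-05", 3858),
    ],
)

# Today's cumulative total minus yesterday's gives the daily increase.
new_rows = conn.execute(
    """
    SELECT whn,
           confirmed - LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY whn)
               AS new_cases
      FROM covid
     ORDER BY whn
    """
).fetchall()

for whn, new_cases in new_rows:
    print(whn, new_cases)

# The daily increases add back up to the change in the cumulative total.
total_new = sum(n for _, n in new_rows if n is not None)
print(total_new)
```

Subtracting from NULL yields NULL, which is why the first day has no value. The daily increases sum to the difference between the last and first cumulative totals.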


## 4. Weekly changes

The daily figures are necessarily estimates and contain reporting inaccuracies. Comparing figures over a longer time span, such as a week, mitigates some of these effects.

```sql
SELECT name, DATE_FORMAT(whn,'%Y-%m-%d'), confirmed
 FROM covid
WHERE name = 'Italy'
AND WEEKDAY(whn) = 0
ORDER BY whn
```

In MySQL, WEEKDAY returns 0 for Monday, so you can filter the data to view only Monday's figures with **WHERE WEEKDAY(whn) = 0**.

**Show the number of new cases in Italy for each week - show Monday only.**

In [5]:
%%sql
SELECT name, TO_CHAR(whn, 'yyyy-mm-dd'), 
  confirmed-LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY whn) AS new
    FROM covid
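    -- DOW runs from 0 (Sunday) to 6 (Saturday) in PostgreSQL, so DOW = 0 keeps Sundays.
    -- Use DOW = 1 to keep Mondays, matching WEEKDAY(whn) = 0 in MySQL.
    WHERE name = 'Italy' AND EXTRACT(DOW FROM whn) = 0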
    ORDER BY whn;

 * postgresql://postgres:***@localhost/sqlzoo
13 rows affected.


name,to_char,new
Italy,2020-01-26,
Italy,2020-02-02,2.0
Italy,2020-02-09,1.0
Italy,2020-02-16,0.0
Italy,2020-02-23,152.0
Italy,2020-03-01,1539.0
Italy,2020-03-08,5681.0
Italy,2020-03-15,17372.0
Italy,2020-03-22,34391.0
Italy,2020-03-29,38551.0
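
Weekday numbering differs between SQL dialects, which matters when porting the Monday filter. MySQL's `WEEKDAY()` counts Monday as 0, while PostgreSQL's `EXTRACT(DOW ...)` counts Sunday as 0. The sketch below uses only Python's standard `datetime` module to check which weekday the first date in the output above falls on; the variable names are illustrative.

```python
from datetime import date

first_row = date(2020, 1, 26)  # first date in the output above

# date.weekday() numbers Monday as 0, the same convention as MySQL's WEEKDAY().
mysql_style = first_row.weekday()

# PostgreSQL's DOW numbers Sunday as 0; isoweekday() gives Sunday as 7,
# so taking it modulo 7 reproduces the DOW convention.
postgres_dow = first_row.isoweekday() % 7

print(mysql_style, postgres_dow)

# The following day is the Monday that WEEKDAY(whn) = 0 would select.
next_day = date(2020, 1, 27)
print(next_day.weekday(), next_day.isoweekday() % 7)
```

Under these conventions, a filter of `DOW = 0` selects Sundays, and `DOW = 1` is the PostgreSQL counterpart of `WEEKDAY(whn) = 0`.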


## 5. LAG using a JOIN

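You can JOIN a table to itself using date arithmetic. LAG takes whichever row comes immediately before, whereas the JOIN matches on an exact date, so the two approaches give different results if data is missing.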

```sql
SELECT tw.name, DATE_FORMAT(tw.whn,'%Y-%m-%d'), 
 tw.confirmed, lw.confirmed
 FROM covid tw LEFT JOIN covid lw ON 
  DATE_ADD(lw.whn, INTERVAL 1 WEEK) = tw.whn
   AND tw.name=lw.name
WHERE tw.name = 'Italy'
ORDER BY tw.whn
```

**Show the number of new cases in Italy for each week - show Monday only.**

In the sample query we JOIN this week tw with last week lw using the DATE_ADD function.

In [6]:
%%sql
SELECT tw.name, TO_CHAR(tw.whn,'yyyy-mm-dd'), 
  tw.confirmed - lw.confirmed AS new
    FROM covid tw LEFT JOIN covid lw ON 
    lw.whn + INTERVAL '1 week' = tw.whn AND tw.name=lw.name
    -- DOW = 0 is Sunday in PostgreSQL. Use DOW = 1 to keep Mondays.
    WHERE tw.name = 'Italy' AND EXTRACT(DOW FROM tw.whn)=0
    ORDER BY tw.whn;

 * postgresql://postgres:***@localhost/sqlzoo
13 rows affected.


name,to_char,new
Italy,2020-01-26,
Italy,2020-02-02,2.0
Italy,2020-02-09,1.0
Italy,2020-02-16,0.0
Italy,2020-02-23,152.0
Italy,2020-03-01,1539.0
Italy,2020-03-08,5681.0
Italy,2020-03-15,17372.0
Italy,2020-03-22,34391.0
Italy,2020-03-29,38551.0
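
When no dates are missing, as here, the JOIN and LAG versions agree. To illustrate how they diverge, here is a minimal sketch assuming Python's built-in `sqlite3` module with SQLite 3.25 or later. The country `X` and its figures are made up for this example, with the week of 2020-03-16 deliberately left out; SQLite's `date(..., '+7 days')` stands in for `DATE_ADD`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (name TEXT, whn TEXT, confirmed INTEGER)")
# Hypothetical weekly figures; the row for 2020-03-16 is intentionally missing.
conn.executemany(
    "INSERT INTO covid VALUES (?, ?, ?)",
    [
        ("X", "2020-03-02", 100),
        ("X", "2020-03-09", 250),
        ("X", "2020-03-23", 900),
    ],
)

# LAG pairs each row with the previous row present, whatever its date.
lag_diff = conn.execute(
    """
    SELECT whn,
           confirmed - LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY whn)
      FROM covid
     ORDER BY whn
    """
).fetchall()

# The self-join pairs each row only with the row dated exactly 7 days earlier.
join_diff = conn.execute(
    """
    SELECT tw.whn, tw.confirmed - lw.confirmed
      FROM covid tw LEFT JOIN covid lw
        ON date(lw.whn, '+7 days') = tw.whn AND tw.name = lw.name
     ORDER BY tw.whn
    """
).fetchall()

print(lag_diff)
print(join_diff)
```

For the last row, LAG reports an increase that silently spans two weeks, while the JOIN reports no value because the row for the previous week does not exist.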


## 6. RANK()

The query below shows the number of confirmed cases together with the world ranking for cases.

The United States has the highest number, Spain is number 2...

Notice that while Spain has the second highest number of confirmed cases, Italy has the second highest number of deaths due to the virus.

```sql
SELECT 
   name,
   confirmed,
   RANK() OVER (ORDER BY confirmed DESC) rc,
   deaths
  FROM covid
WHERE whn = '2020-04-20'
ORDER BY confirmed DESC
```

**Include the ranking for the number of deaths in the table. Only include countries with a population of at least 10 million.**

In [7]:
%%sql
SELECT 
  covid.name, confirmed,
    RANK() OVER (ORDER BY confirmed DESC) rc,
    deaths, RANK() OVER (ORDER BY deaths DESC) rc2
    FROM covid JOIN world ON (covid.name=world.name)
    WHERE whn = '2020-04-20' AND population>=10000000
    ORDER BY confirmed DESC;

 * postgresql://postgres:***@localhost/sqlzoo
90 rows affected.


name,confirmed,rc,deaths,rc2
United States,784326,1,42094,1
Spain,200210,2,20852,3
Italy,181228,3,24114,2
France,156480,4,20292,4
Germany,147065,5,4862,8
United Kingdom,125856,6,16550,5
Turkey,90980,7,2140,12
China,83817,8,4636,9
Iran,83505,9,5209,7
Russia,47121,10,405,23
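
The two rankings are independent of each other, because each RANK() has its own ORDER BY. A minimal sketch, assuming Python's built-in `sqlite3` module with SQLite 3.25 or later, uses the first five rows of the output above. The alias `rd` is chosen for this example. Ranks are computed within these five rows only, so Germany's death rank differs from the full table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (name TEXT, confirmed INTEGER, deaths INTEGER)")
# Figures for 2020-04-20 copied from the first five rows of the output above.
conn.executemany(
    "INSERT INTO covid VALUES (?, ?, ?)",
    [
        ("United States", 784326, 42094),
        ("Spain", 200210, 20852),
        ("Italy", 181228, 24114),
        ("France", 156480, 20292),
        ("Germany", 147065, 4862),
    ],
)

# Each RANK() orders the same rows by a different column.
rank_rows = conn.execute(
    """
    SELECT name,
           RANK() OVER (ORDER BY confirmed DESC) AS rc,
           RANK() OVER (ORDER BY deaths DESC) AS rd
      FROM covid
     ORDER BY confirmed DESC
    """
).fetchall()

for row in rank_rows:
    print(row)
```

Spain and Italy swap places between the two rankings, as noted above.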


## 7. Infection rate

The query shown includes a JOIN to the world table so we can access the total population of each country and calculate infection rates (in cases per 100,000).

```sql
SELECT 
   world.name,
   ROUND(100000*confirmed/population,0)
  FROM covid JOIN world ON covid.name=world.name
WHERE whn = '2020-04-20' AND population > 10000000
ORDER BY population DESC
```

**Show the infection rate ranking for each country. Only include countries with a population of at least 10 million.**

In [8]:
%%sql
SELECT 
   world.name,
    ROUND(100000*confirmed/population,0),
    -- Ascending order gives rank 1 to the lowest rate. Add DESC to rank the highest rate first.
    RANK() OVER (ORDER BY confirmed/population)
    FROM covid JOIN world ON covid.name=world.name
    WHERE whn = '2020-04-20' AND population > 10000000
    ORDER BY population DESC

 * postgresql://postgres:***@localhost/sqlzoo
90 rows affected.


name,round,rank
China,6,52
India,1,28
United States,238,87
Indonesia,3,35
Pakistan,4,42
Brazil,19,65
Nigeria,0,16
Bangladesh,2,32
Russia,32,71
Mexico,7,54
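
The direction of the ORDER BY inside RANK() decides which end of the scale is rank 1. The sketch below, assuming Python's built-in `sqlite3` module with SQLite 3.25 or later, ranks the highest infection rate first by adding DESC. The countries `A`, `B` and `C` and all their figures are made up for this example, and the alias names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (name TEXT, whn TEXT, confirmed INTEGER)")
conn.execute("CREATE TABLE world (name TEXT, population INTEGER)")
# Hypothetical countries and figures, chosen so the arithmetic is easy to follow.
conn.executemany(
    "INSERT INTO covid VALUES (?, ?, ?)",
    [
        ("A", "2020-04-20", 5000),
        ("B", "2020-04-20", 30000),
        ("C", "2020-04-20", 1200),
    ],
)
conn.executemany(
    "INSERT INTO world VALUES (?, ?)",
    [("A", 20000000), ("B", 50000000), ("C", 12000000)],
)

rate_rows = conn.execute(
    """
    SELECT world.name,
           ROUND(100000.0 * confirmed / population, 0) AS per_100k,
           -- 1.0 * forces real division; SQLite truncates integer / integer.
           RANK() OVER (ORDER BY 1.0 * confirmed / population DESC) AS rate_rank
      FROM covid JOIN world ON covid.name = world.name
     WHERE whn = '2020-04-20' AND population >= 10000000
     ORDER BY rate_rank
    """
).fetchall()

for row in rate_rows:
    print(row)
```

With integer columns, `confirmed / population` would truncate to 0 for every country in SQLite and tie all the ranks, which is why the sketch multiplies by `1.0` first.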


## 8. Turning the corner

For each country that has had at least 1000 new cases in a single day, show the date of the peak number of new cases.

In [9]:
%%sql
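WITH t AS (
    -- Daily new cases per country. The first day has no previous row, so count it as 0.
    SELECT name, whn,
       COALESCE(confirmed-LAG(confirmed, 1) OVER 
                (PARTITION BY name ORDER BY whn), 0) AS new_cases
    FROM covid
), r AS (
    -- Rank the days of each country, largest daily increase first.
    SELECT name, whn, new_cases,
       RANK() OVER (PARTITION BY name ORDER BY new_cases DESC) AS rank_new
    FROM t
)
-- Keep the peak day of each country, provided the peak exceeds 1000 new cases.
SELECT name, whn, new_cases FROM r
    WHERE rank_new=1 AND new_cases>1000
    ORDER BY whn, name;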

 * postgresql://postgres:***@localhost/sqlzoo
26 rows affected.


name,whn,new_cases
China,2020-02-13,15136
Italy,2020-03-21,6557
Switzerland,2020-03-23,1321
Israel,2020-03-25,1131
Spain,2020-03-25,9630
Austria,2020-03-26,1321
Germany,2020-03-27,6933
Iran,2020-03-30,3186
Canada,2020-04-05,2778
Ecuador,2020-04-10,2196
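
The two-step pattern (LAG for daily increases, then RANK to pick the peak) can be tried on a small scale. This is a sketch only, assuming Python's built-in `sqlite3` module with SQLite 3.25 or later. The countries `A` and `B` and their cumulative totals are made up, and the threshold is written as `>= 1000` to match the wording "at least 1000".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (name TEXT, whn TEXT, confirmed INTEGER)")
# Hypothetical cumulative totals. A peaks at 2000 new cases, B never reaches 1000.
conn.executemany(
    "INSERT INTO covid VALUES (?, ?, ?)",
    [
        ("A", "2020-03-01", 100),
        ("A", "2020-03-02", 1300),
        ("A", "2020-03-03", 3300),
        ("A", "2020-03-04", 4000),
        ("B", "2020-03-01", 10),
        ("B", "2020-03-02", 60),
        ("B", "2020-03-03", 500),
        ("B", "2020-03-04", 700),
    ],
)

peak_rows = conn.execute(
    """
    WITH t AS (
        -- Step 1: daily increase per country, 0 for the first day.
        SELECT name, whn,
               COALESCE(confirmed - LAG(confirmed, 1) OVER
                        (PARTITION BY name ORDER BY whn), 0) AS new_cases
          FROM covid
    ), r AS (
        -- Step 2: rank each country's days by daily increase.
        SELECT name, whn, new_cases,
               RANK() OVER (PARTITION BY name ORDER BY new_cases DESC) AS rank_new
          FROM t
    )
    SELECT name, whn, new_cases FROM r
     WHERE rank_new = 1 AND new_cases >= 1000
     ORDER BY whn, name
    """
).fetchall()

print(peak_rows)
```

Country `B` is dropped because its largest daily increase is 440. Because RANK() assigns the same rank to ties, a country with two equally high peak days would appear twice.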
