<img src="img/dsci513_header2.png" width="600">

# Lab 3: Subqueries, window functions, CTEs

Total out of 29 marks.

## Instructions
---
rubric={mechanics:2}

- Follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/)

- Submit 3 files to Gradescope (***upload them separately, not as a zip file or folder***):
    - Fully rendered ipynb notebook
    - HTML of the fully rendered ipynb notebook
    - PDF of the fully rendered ipynb notebook

- We don't have challenging questions in this lab.
- Add a link to your GitHub repository here:

> NOTE: None of our labs are autograded, so Gradescope is used only to collect the 3 files listed above. Make sure that all 3 files are uploaded individually (not in a folder or a zipped folder).

## Getting set up
---

In [None]:
%load_ext sql
%config SqlMagic.displaylimit = 50

import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

In [None]:
username

In [None]:
%sql postgresql://{username}:{password}@{host}:{port}/world
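
The setup cell above percent-encodes the password before placing it in the connection URL, because characters such as `@` or `:` would otherwise be read as URL delimiters. Here is a minimal sketch of that idea. The helper name `build_postgres_url` and the credentials are hypothetical, and passing `safe=""` is an assumption that goes slightly beyond the setup cell: it also encodes `/`, which `urllib.parse.quote` leaves untouched by default.

```python
import urllib.parse


def build_postgres_url(user, password, host, port, dbname):
    """Assemble a postgresql:// URL, percent-encoding the password."""
    # safe="" also encodes "/", which quote() does not encode by default
    quoted = urllib.parse.quote(password, safe="")
    return f"postgresql://{user}:{quoted}@{host}:{port}/{dbname}"


# Hypothetical credentials, for illustration only
example_url = build_postgres_url("student", "p@ss:w/rd", "localhost", 5432, "world")
print(example_url)
```

If your password contains only letters and digits, the encoded and raw forms are identical, so quoting is harmless in that case.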

## Exercise 1: Simple subqueries
---

### 1.1

rubric={accuracy:2}

Suppose that we are interested in computing how far the life expectancy of each country in the `country` table is from its average value for all countries in the world. This can be expressed as `lifeexpectancy` $-$ `AVG(lifeexpectancy)`.

Write a query that lists country names and their life expectancy deviation from the world average.

- Eliminate rows with null values
- Sort your results in ascending order by the values in the life expectancy deviation column
- Round the values in the life expectancy deviation column to 1 decimal place

In [None]:
%%sql

...

### 1.2

rubric={reasoning:1}

Explain why you can't write a query for Exercise [1.1](#1.1) without using a subquery.

_Type your answer here, replacing this text._

### 1.3

rubric={accuracy:2}

Use your query in Exercise [1.1](#1.1) to return the same columns, but modify it so that it returns only countries whose population density (i.e. `population / surfacearea`) is greater than the world average.

On average, do the countries returned by your query have lower or higher life expectancy compared to the world average? (No computation needed; just look at the results to find out.)

_Type your answer here, replacing this text._

In [None]:
%%sql

...

### 1.4

rubric={accuracy:2}

Write a query that lists all continents and the number of countries in each continent that have a life expectancy greater than 77 years. If there are no countries in a continent that satisfy this condition, the value of the second column should be null.

Your result should look like this if ordered alphabetically by continent:

<img src="img/1_4.png" width="170">

> **Hint:** The result of a subquery is also a table that you can use in a join operation.
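
To illustrate the hint on unrelated data, here is a minimal sketch of joining a table to the result of a subquery. It assumes an in-memory SQLite database rather than the lab's PostgreSQL `world` database, and the `dept` and `employee` tables are hypothetical toy tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (name TEXT);
    CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO dept VALUES ('Design'), ('Legal'), ('Sales');
    INSERT INTO employee VALUES
        ('Ana', 'Sales', 90), ('Bo', 'Sales', 40),
        ('Cy', 'Design', 85), ('Di', 'Legal', 30);
""")

# The parenthesized subquery behaves like a table named "high".
# LEFT JOIN keeps every department, filling in NULL where the
# subquery produced no matching row.
derived_join_rows = conn.execute("""
    SELECT d.name, high.n
    FROM dept d
    LEFT JOIN (
        SELECT dept, COUNT(*) AS n
        FROM employee
        WHERE salary > 80
        GROUP BY dept
    ) AS high
    ON d.name = high.dept
    ORDER BY d.name
""").fetchall()
print(derived_join_rows)
conn.close()
```

In Python, the SQL null for the department without a match comes back as `None`.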

In [None]:
%%sql

...

### 1.5

rubric={accuracy:2}

Retrieve the names of non-European countries where one or more of the official languages of European countries are spoken (either officially or non-officially). Remove duplicate country names from your results, and sort the rows by country name in descending order.

For this query, I have provided some starter code for you:

In [None]:
%%sql

SELECT
    ...
FROM
    ...
JOIN
    ...
ON
    ...
WHERE
    c.continent ...
    AND
    cl.language ... (
        SELECT
            ...
        FROM
            ...
        JOIN
            ...
        ON
            ...
        WHERE
            ...
            AND
            ...
    )
ORDER BY
    ... DESC
;

### 1.6

rubric={accuracy:2}

Rewrite the following query using a subquery instead of a join:

```sql
SELECT
    c.name
FROM
    country c
JOIN
    city ci
ON
    c.capital = ci.id
WHERE
    ci.population > 5000000
;
```

In [None]:
%%sql

...

### 1.7

rubric={accuracy:2}

Which countries in the world are vast enough that all western European and Nordic countries could fit within them?

_Type your answer here, replacing this text._

In [None]:
%%sql

...

## Exercise 2: Correlated subqueries
---

### 2.1

rubric={accuracy:3}

Find the number of countries in each continent whose life expectancy is greater than the average value for their respective continent.

> **Hint:** The `lifeexpectancy` column contains a bunch of nulls. Be careful with your counting!

> **Note:** "Antarctica" will not appear in the final result.
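
As a reminder of how nulls interact with counting, here is a minimal sketch on a hypothetical toy table `reading`. It assumes an in-memory SQLite database rather than the lab's PostgreSQL server, but the null behaviour shown is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reading (sensor TEXT, value REAL);
    INSERT INTO reading VALUES ('a', 10.0), ('b', 20.0), ('c', NULL);
""")

# COUNT(*) counts rows; COUNT(value) and AVG(value) skip nulls.
null_counts = conn.execute(
    "SELECT COUNT(*), COUNT(value), AVG(value) FROM reading"
).fetchone()

# A comparison with NULL is neither true nor false, so the null row
# is filtered out by both a condition and its negation.
above = conn.execute(
    "SELECT COUNT(*) FROM reading WHERE value > 12"
).fetchone()[0]
not_above = conn.execute(
    "SELECT COUNT(*) FROM reading WHERE NOT (value > 12)"
).fetchone()[0]

print(null_counts, above, not_above)
conn.close()
```

Note that `above + not_above` is smaller than `COUNT(*)` here, which is exactly the kind of mismatch to watch for.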

In [None]:
%%sql

...

### 2.2

rubric={accuracy:2}

The results of your query for the previous question may not be very informative at first glance, because absolute counts do not reveal much unless we can relate them to the total number of countries in each continent.

Borrow your query from Exercise [2.1](#2.1) and modify it so that it shows the ratio of the number of countries in each continent whose life expectancy is greater than their continent's average to the total number of countries in that continent. Round your ratio values to 2 decimal places.

> **Hint:** Again, be careful with your counting, since the `lifeexpectancy` column contains a bunch of nulls, and we don't want to include them.

> **Hint:** That's right, you need to add one more subquery somewhere in your previous query!

In [None]:
%%sql

...

### 2.3

rubric={reasoning:2}

Consider this question:

In which European countries is English **not** spoken at all (i.e. not listed in the `countrylanguage` table)?

I have written the following SQL query to answer the above question:

```sql
SELECT
    DISTINCT c.name
FROM
    country c
JOIN
    countrylanguage cl
ON
    c.code = cl.countrycode
WHERE
    NOT cl.language ILIKE 'english'
    AND
    c.continent ILIKE 'europe'
;
```

However, when I run the above query I can find "United Kingdom" listed in the results, which is clearly incorrect. Can you tell me why I'm getting wrong results?

_Type your answer here, replacing this text._

### 2.4

rubric={accuracy:2}

Alright, let's figure out how to correctly answer the following question from Exercise [2.3](#2.3):

In which European countries is English **not** spoken at all (i.e. not listed in the `countrylanguage` table)?

**Note:** There's more than one way to answer the above question using a query. Here, I want you to use a correlated subquery.

In [None]:
%%sql

...

## Exercise 3: Window functions and CTEs
---

### 3.1

rubric={accuracy:1}

Rewrite the query you wrote for Exercise [1.1](#1.1), this time using window functions.

In [None]:
%%sql

...

### 3.2

rubric={accuracy:2}

Write a query that returns country, continent, and city names, as well as the ratio of each city's population to the population of the country where it's located, expressed as a percentage. Your query should also return each city's population rank among all cities in the same continent, ranked in descending order of population.

Your results should look like this:

<img src="img/3_2.png" width="700">

Use the above image to name and format your columns properly.

> **Note:** The order of your returned rows might be different from mine, but that is fine. 

In [None]:
%%sql

...

### 3.3

rubric={accuracy:2}

Suppose that we'd like to only choose the most populated city in each continent. The problem is, it's not possible to use a window function in the `WHERE` clause. But don't worry, it's not the end of the world!

Using your query in Exercise [3.2](#3.2), write a common table expression (CTE) to retrieve rows associated with the most populated cities of each continent. In each row, your query should only return the city name, along with the name of the country and continent where it's located.
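
The general pattern, computing a window function inside a CTE and filtering on its result in the outer query, can be sketched on unrelated data as follows. This assumes an in-memory SQLite database (version 3.25 or later, for window function support) rather than the lab's PostgreSQL server, and the `sale` table is a hypothetical toy table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (region TEXT, rep TEXT, amount INTEGER);
    INSERT INTO sale VALUES
        ('East', 'Ana', 50), ('East', 'Bo', 70),
        ('West', 'Cy', 30), ('West', 'Di', 20);
""")

# The window function is evaluated inside the CTE; the outer query
# can then filter on its result like on any ordinary column.
top_reps = conn.execute("""
    WITH ranked AS (
        SELECT
            region,
            rep,
            RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
        FROM sale
    )
    SELECT region, rep
    FROM ranked
    WHERE amount_rank = 1
    ORDER BY region
""").fetchall()
print(top_reps)
conn.close()
```

Because `RANK()` assigns the same rank to ties, this pattern can return more than one row per partition when two rows share the top value.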

In [None]:
%%sql

...