<img src="img/dsci513_header2.png" width="600">

# Lab 2: Grouping, joins, and table manipulation

Total out of 39 marks.

## Instructions
---
rubric={mechanics:2}

- Follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/)

- Submit 3 files to Gradescope (***upload them separately, not as a zip file or folder***):
    - Fully rendered ipynb notebook
    - HTML of the fully rendered ipynb notebook
    - PDF of the fully rendered ipynb notebook

- Add a link to your GitHub repository here:

> NOTE: There is no autograding for any of our labs; Gradescope is used only to collect the 3 files listed above. Make sure that each file is uploaded individually (not in a folder or a zipped folder).

## Imports and configurations
---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import csv

%matplotlib inline
%load_ext sql
%config SqlMagic.displaylimit = 30
%config SqlMagic.autolimit = 30

## Connecting to the database
---

Before running the following cell, make sure that you have the correct login information in the `credentials.json` file:

> Make sure you know where your `credentials.json` file is located relative to this notebook. Review the concepts of absolute and relative paths if needed.
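
As a minimal sketch of that distinction, the snippet below builds the relative path used in the next cell and resolves it to an absolute one. It assumes the notebook is started from the folder that contains `data/`; if the last line prints `False`, the working directory is probably not what you expect.

```python
from pathlib import Path

# Relative path: interpreted against the current working directory.
relative_path = Path("data") / "credentials.json"

# Absolute path: the same location spelled out from the filesystem root.
absolute_path = (Path.cwd() / relative_path).resolve()

print("Working directory:", Path.cwd())
print("Relative path:    ", relative_path)
print("Absolute path:    ", absolute_path)
print("File exists?      ", absolute_path.exists())
```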

In [None]:
import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
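
The password is passed through `urllib.parse.quote` because it ends up inside a connection URL, where characters such as `@` and `:` act as delimiters. The sketch below uses a made-up password (an assumption, not a real credential) to show what the quoting does.

```python
import urllib.parse

# Hypothetical password containing characters that are special in URLs.
raw_password = "p@ss:word#1"

# Percent-encode it so '@', ':' and '#' cannot be mistaken for URL delimiters.
quoted_password = urllib.parse.quote(raw_password)
print(quoted_password)  # p%40ss%3Aword%231

# By default '/' is left untouched; passing safe="" encodes it as well.
fully_quoted = urllib.parse.quote("a/b", safe="")
print(fully_quoted)  # a%2Fb
```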

For exercises 1 - 4, use the `pd.read_sql_query` function from Pandas to execute your queries. But we need to establish a connection to the `world` database first:

In [None]:
from sqlalchemy import create_engine, text
url = f'postgresql://{username}:{password}@{host}:{port}/world'
conn = create_engine(url)

To run a query, you can use the following code:

```python
query = """
YOUR QUERY HERE
"""

pd.read_sql_query(text(query), con=conn)
```

(The `text` function above wrapped around `query` is necessary to avoid interpreting special characters such as `%`)

**Note:** Since we read query results into Pandas dataframes, you'll see an **index column** appearing in your results. That is expected and nothing to worry about.

## Exercise 1: Aggregations and grouping
---

### 1.1

rubric={accuracy:2}

Write a query to answer the following question:

How much higher is the population of the most populated country in the world, with respect to the average population of all countries in the world, expressed in percent?

- You can find this value using the formula $(\text{pop} - \text{pop}_\text{avg}) / \text{pop}_\text{avg} \times 100$.
- Your query should print the value with only one digit after the decimal point, followed by the percent sign `%`, e.g. `2500.0%`.
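
To check your understanding of the formula and the expected output format, here is the same arithmetic in Python. The two population values are made up for illustration and are not taken from the `world` database.

```python
# Hypothetical populations; not taken from the `world` database.
pop = 1500.0      # most populated country
pop_avg = 100.0   # average over all countries

# (pop - pop_avg) / pop_avg * 100
percent_higher = (pop - pop_avg) / pop_avg * 100

# One digit after the decimal point, followed by a percent sign.
formatted = f"{percent_higher:.1f}%"
print(formatted)  # 1400.0%
```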

### 1.2

rubric={accuracy:2}

Write a query to answer the following question:

What is the maximum, average, and minimum population density (population per surface area [person / km$^2$]) of countries located in Europe?

- The values in the `surfacearea` column already have the required unit (km$^2$); no unit conversion is required.
- Your column headers should read _Max pop_density_, _Average pop_density_, and _Min pop_density_.
- Round all values to 2 decimal digits.

>**Note:** Remember that you have to convert approximate types (e.g. double precision or real) to `NUMERIC` to be able to use the `ROUND()` function.

### 1.3

rubric={reasoning:1}

We'd like to write a query to return the name of the country with the greatest surface area in the `world` database. Would the following query work as expected? Explain your answer in 2-3 sentences.

```sql
SELECT
    name, MAX(surfacearea)
FROM
    country
```

_Type your answer here, replacing this text._

### 1.4

rubric={accuracy:1}

Can you write a query to answer the question posed in [Exercise 1.3](#1.3), i.e., "Find the name of the country with the greatest surface area in the `world` database"? Your result should contain one column and one row containing the value described above.

> **Hint:** Surprisingly, you don't need to use aggregation!

### 1.5

rubric={accuracy:2}

Write a query that returns the total population of each region of the world according to the `country` table.

- Sort your results in descending order by each region's total population (Hint: Be careful not to sort alphabetically!).
- In order to increase the readability of the results, use the `to_char()` function to separate groups of thousands with commas (see the documentation [here](https://www.postgresql.org/docs/current/functions-formatting.html#FUNCTIONS-FORMATTING-NUMERIC-TABLE)). For example, to accommodate numbers going up to a billion, you can use `to_char(column, '9,999,999,999')`.
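
The sorting hint is easier to see with a small Python analogue. Once numbers are turned into formatted text, sorting compares characters rather than magnitudes. The totals below are made up, and Python's `f"{n:,}"` only stands in for the comma grouping, so treat this as an illustration of the pitfall rather than of `to_char()` itself.

```python
# Hypothetical totals; not taken from the `world` database.
totals = [9_000, 10_000_000, 200_000]

# Comma-grouped text versions of the same numbers.
formatted = [f"{n:,}" for n in totals]

# Sorting the text compares character by character ('9' > '2' > '1').
sorted_as_text = sorted(formatted, reverse=True)

# Sorting the numbers first and formatting afterwards gives the intended order.
sorted_as_numbers = [f"{n:,}" for n in sorted(totals, reverse=True)]

print(sorted_as_text)     # ['9,000', '200,000', '10,000,000']
print(sorted_as_numbers)  # ['10,000,000', '200,000', '9,000']
```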

### 1.6

rubric={accuracy:2}

What is the number of countries in each region that have a republic form of government? Sort your results by the number of countries in descending order.

> **Note:** Here, you are counting anything in `governmentform` that contains "Republic"/"republic" as a republic form of government. E.g., "Federal Republic" and "Socialistic Republic" are republic forms of government.
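
The matching rule in the note is a case-insensitive "contains" test. A Python sketch of that rule, using a few sample strings chosen for illustration (only the first and third appear in the note above), looks like this:

```python
# Sample values in the style of the `governmentform` column (partly hypothetical).
forms = [
    "Federal Republic",
    "Constitutional Monarchy",
    "Socialistic Republic",
    "People's republic",
]

# Lower-case each value first so "Republic" and "republic" both match.
is_republic = ["republic" in form.lower() for form in forms]
print(is_republic)  # [True, False, True, True]
```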

_Type your answer here, replacing this text._

### 1.7 (Challenging Question)

> **Note:** This question is challenging, and is meant to be attempted after you have completed the rest of the assignment (from all courses). Don't stress out if you can't solve it!

rubric={accuracy: 1}

Using the `countrylanguage` table, write a query to find the `countrycode` and number of spoken languages in countries where 

- Each listed language is spoken by at least 10% of the population,
- There are at least 2 spoken languages in those countries.

Sort the resulting rows by the number of listed languages in each country in descending order.

### 1.8

rubric={accuracy:2}

Write a query to find the `countrycode` of countries that have at least 3 official languages. To verify your results, also return the number of official languages and name the corresponding column `num_official_lang`. Sort the returned rows according to this column in descending order.

## Exercise 2: Why don't you `JOIN` us?
---

### 2.1

rubric={accuracy:2}

It's hard to figure out which countries we're talking about in Exercise [1.8](#1.8) just by looking at their codes. Copy the query you wrote in Exercise [1.8](#1.8) here, and modify it such that it returns the name of each country instead of country code.

>**Hint:** You need to join the `country` and `countrylanguage` tables.

### 2.2

rubric={accuracy:2}

Write a query that finds the ratio of the population of each country's capital city to its entire population shown as a percentage value, for countries that have a population of at least 1,000,000.

- Your query should list the country name, capital city, and the population ratio percentage value
- Name the population ratio column `pop_ratio`, and round the values to 1 decimal digit
- Sort your results in descending order by the population percentage
- Limit the number of returned countries to 20 in your SQL query

> **Hint:** Watch out for integer division; use type conversion if needed.

> **Note:** You might notice that the `pop_ratio` for Singapore is higher than 100%. This is because the capital city's listed population is higher than the country's listed population: one figure reports the official population, while the other also includes people temporarily living in Singapore. Don't worry about it, and leave it as it is.
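
The integer-division hint can be illustrated in Python, where `//` on positive integers behaves like PostgreSQL's `/` between two integer columns: the fractional part is discarded. The populations below are made up for the example.

```python
# Hypothetical populations; not taken from the `world` database.
capital_pop = 500_000
country_pop = 4_000_000

# Integer division throws away the fraction before multiplying by 100.
truncated = capital_pop // country_pop * 100

# Converting to a non-integer type first keeps the fraction.
exact = round(capital_pop / country_pop * 100, 1)

print(truncated)  # 0
print(exact)      # 12.5
```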

### 2.3

rubric={accuracy:2}

Write a query that returns:

- country name,
- average population of cities,
- number of listed cities

for each country.

- Pick meaningful names for the columns in your results. They can be anything you like.
- Use `to_char()` (which you've learned in a previous exercise in this lab) to format the average populations such that groups of thousands are separated by commas, and decimal digits are eliminated. For example, 1656782.25 should be shown as 1,656,782.
- Sort the results by the number of cities in each country in descending order
- Limit the number of returned rows to 20.

### 2.4 (Challenging Question)

> **Note:** This question is challenging, and is meant to be attempted after you have completed the rest of the assignment (from all courses). Don't stress out if you can't solve it!

rubric={accuracy: 1}

Write a query to return the following data for each country:

- country name
- region
- population
- number of official languages
- number of cities having a population of over 1 million

for countries that have **at least** 1 official language **AND** 1 city with a population of over 1 million.

Make sure to

- Give meaningful names to your derived columns
- Sort your results in descending order by the number of official languages in each country

> **Hint:** Since you need to do multiple joins, you'll end up with a lot of duplicates. Make sure to count only the unique values.
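
To see why duplicates appear, here is a small sketch that uses an in-memory SQLite database as a stand-in for Postgres. The tables `parent`, `child_a`, and `child_b` are hypothetical and unrelated to the `world` schema. One parent row with 2 rows in one child table and 3 in the other produces 2 × 3 = 6 joined rows, so a plain `COUNT` over-counts while `COUNT(DISTINCT ...)` does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE child_a (parent_id INTEGER, label TEXT);
CREATE TABLE child_b (parent_id INTEGER, label TEXT);
INSERT INTO parent VALUES (1, 'p1');
INSERT INTO child_a VALUES (1, 'a1'), (1, 'a2');
INSERT INTO child_b VALUES (1, 'b1'), (1, 'b2'), (1, 'b3');
""")

query = """
SELECT
    COUNT(a.label)          AS naive_count,  -- counts every joined row
    COUNT(DISTINCT a.label) AS distinct_a,   -- counts unique child_a rows
    COUNT(DISTINCT b.label) AS distinct_b    -- counts unique child_b rows
FROM parent p
JOIN child_a a ON a.parent_id = p.id
JOIN child_b b ON b.parent_id = p.id
GROUP BY p.id
"""

naive_count, distinct_a, distinct_b = conn.execute(query).fetchone()
print(naive_count, distinct_a, distinct_b)  # 6 2 3
```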

### 2.5

rubric={accuracy:3}

Now that we've learned about grouping, aggregation and joins, let's revisit the last problem of Lab 1 and try to arrive at the same result using pure SQL. I hope after writing this query entirely in SQL, you'll appreciate the convenience of extracting this kind of information, compared to how you've done it in Pandas!

So without further ado, let's write a query that answers this question:

What are the 10 most spoken languages in the world?

- Each row should show the language and the respective speaker population
- Sort your results by the second column in descending order
- Format the population numbers such that groups of thousands are separated with commas (you've already learned how to do this in previous exercises of this lab)
- Use meaningful column aliases that you like

Verify that you get exactly the same results as you obtained in Lab 1 using Pandas.

## Exercise 3: More joins with the IMDB database
---

In this exercise, you'll explore the `imdb` database more in depth and extract richer information by pulling data from various tables and joining them together.

In [None]:
conn = create_engine(f'postgresql://{username}:{password}@{host}:{port}/imdb')

### 3.1

rubric={accuracy:2}

Write a query that returns the names of all actors/actresses of the movie "Catch Me If You Can" (2002).

**Hint:** The data you need for this exercise is spread across the `movies`, `acting_roles`, and `names` tables.

### 3.2

rubric={accuracy:2}

Write a query that lists each movie genre along with the average runtime of movies belonging to each genre. Sort your results in descending order by the latter column.

### 3.3

rubric={accuracy:3}

Write a query to find the number of "drama" or "biography" movies in which either "Marlon Brando", "Gary Oldman", or "Robin Williams" played a role. Your query should list the actor's name, genre, and number of movies played by that actor in that genre.

## Exercise 4: Unter is the new Uber
---

In this exercise, we're going to create a database called `unter` and its tables from scratch, and then populate its tables with some fake data in the later exercises. The database `unter` is supposed to store data of employees, drivers, cars, etc. of a company which provides taxi services. Let's call our company _Unter_, because we want to be a rival to Uber!

(Uber in German means "over" or "above", so I've chosen "Unter" meaning "under" or "below" to oppose and compete with them even in name. But it's kinda obvious where the company's fate is headed with this name choice 😄)

Because you might want to drop your database several times as you try out new things, and you probably want to start fresh each time, doing this through the pgAdmin GUI every time may not be convenient. So I've given you the following cell to **drop your database `unter` forcefully** (regardless of whether there are connections to it) and re-create it immediately.

In [None]:
conn = psycopg2.connect(database='postgres', **login)

autocommit = psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT
conn.set_isolation_level(autocommit)
# Use one cursor for both statements, then release the cursor and the connection
cur = conn.cursor()
cur.execute("DROP DATABASE IF EXISTS unter WITH (FORCE);")
cur.execute("CREATE DATABASE unter;")
cur.close()
conn.close()

At this point, you can use pgAdmin to see that your new database `unter` is created. Don't forget to right-click your "Databases" group in the browser pane and click "Refresh". Alternatively, you can use the following cell to see if the database `unter` appears in the list of databases on your Postgres server:

In [None]:
conn = psycopg2.connect(database='postgres', **login)

with conn, conn.cursor() as cur:
    cur.execute("SELECT datname FROM pg_database;")
    print([i[0] for i in cur.fetchall()])

### 4.1

rubric={accuracy:3}

Now that you've created the database, it's time to create the tables we need. I've included `employee.csv` in the `data` folder of this assignment, which contains the data of employees of our company. Given this CSV file, your task is:

- Create a new table and call it `employee`
- Create an auto-incrementing column called `id` (Hint: Remember the `SERIAL` data type?). This is the primary-key column.
- Create other columns in your table that correspond to the columns you see in `employee.csv`. Use the same column names as in the CSV file.
- You also need to specify data types for your columns. To do this, inspect `employee.csv` and choose the most appropriate data types according to the values you see.
- Except for `exit_date`, none of the columns can be `NULL`.
- Values in column `sin` should be unique.
- The number of digits stored in column `sin` should be exactly 9 (Hint: Use the `CHECK` keyword together with `LENGTH()`).

> **Note:** Use your best judgment to choose the number of characters to allow in, for example, `VARCHAR(n)`. Just make sure that it's not unreasonably short or long. In any other case, it's arbitrary and up to you what length you use.
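
If you would like to see how `NOT NULL`, `UNIQUE`, and `CHECK` behave before writing your own table, the sketch below tries them out on an in-memory SQLite database, used here only as a stand-in for Postgres. The `badge` table and its 4-character `code` column are hypothetical and deliberately different from the `employee` table you are asked to create.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE badge (
    id   INTEGER PRIMARY KEY,
    code TEXT NOT NULL UNIQUE CHECK (LENGTH(code) = 4)
)
""")

def try_insert(code):
    """Return 'ok' if the row is accepted, else the constraint error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO badge (code) VALUES (?)", (code,))
        return "ok"
    except sqlite3.IntegrityError as err:
        return str(err)

results = {
    "valid": try_insert("A123"),      # accepted
    "too_short": try_insert("A1"),    # violates the CHECK constraint
    "duplicate": try_insert("A123"),  # violates the UNIQUE constraint
    "missing": try_insert(None),      # violates the NOT NULL constraint
}

for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

Only the first insert should succeed; the other three are rejected by the database itself, which is the point of declaring constraints in the table definition.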

Make sure to enforce the constraints in your table. Here is skeleton code for creating a table in Postgres using `psycopg2`:

```python
conn = psycopg2.connect(database='unter', **login)
conn.autocommit = True

with conn, conn.cursor() as cur:
    cur.execute("""
<your SQL create table query here>
    """)
```

The following cell drops the `employee` table if it already exists, so you can re-run the cell that creates it without getting an error.

In [None]:
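# Connect to `unter` (the previous connection points at the `postgres` database)
conn = psycopg2.connect(database='unter', **login)

with conn, conn.cursor() as cur: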
    cur.execute("""
DROP TABLE IF EXISTS employee;
    """)

In [None]:
conn = psycopg2.connect(database='unter', **login)
conn.autocommit = True

with conn, conn.cursor() as cur:
    cur.execute("""
        ...put your DDL here...
    """)

After creating your table, run `\d employee` in `psql` to see if your **columns** and **constraints** are properly created.

### 4.2

rubric={accuracy:2}

Being able to create things feels powerful, so let's create more!

Now let's create a table `driver` for the drivers who work at Unter. I've provided a data file called `driver.csv` in the `data` folder of this assignment. If you inspect this file, you'll see that the data about our drivers is very similar to our employee data. The only difference is that in `driver.csv` we also store driver's license information, in addition to the columns found in the `employee` table.

- You can use the same commands that you've written to create the `employee` table, but make sure to add more columns to accommodate the data in `driver.csv`.
- None of the new columns can be null.
- Add a constraint to the `driver_license` column so that its values are unique.

In [None]:
conn = psycopg2.connect(database='unter', **login)
conn.autocommit = True

with conn, conn.cursor() as cur:
    cur.execute("""
        ...
    """)

### 4.3

rubric={accuracy:2}

All right, almost there. We need to create two more tables, `car_model` and `cab`. 

`car_model` stores information about each type of car, whereas `cab` contains data about the particular cabs that the drivers of our company own. 

This time I've given you part of the `CREATE TABLE` statements for both tables, and you're in charge of adding constraints.

**`car_model` table:**

- Add the primary key constraint

**`cab` table:**

- Add the primary key constraint
- Each cab is a particular type of car, the information of which can be found in the `car_model` table. In order to ensure that each cab in the `cab` table can only be of the car types in the `car_model` table, the `car_model_id` column in the `cab` table should reference the `id` column of the `car_model` table. Add a constraint that enforces this.
- The owner of each cab in the `cab` table should be one of the drivers of our company. Add a constraint to the `cab` table such that `owner_id` references the `id` column of the `driver` table.
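
As a rough illustration of what a referencing constraint buys you, the sketch below uses an in-memory SQLite database as a stand-in for Postgres, with hypothetical `author` and `book` tables that are not part of this lab. SQLite needs `PRAGMA foreign_keys = ON` to enforce such constraints; that line is specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
CREATE TABLE author (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE book (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER REFERENCES author (id)
);
INSERT INTO author (id, name) VALUES (1, 'Hypothetical Author');
""")

# A book pointing at an existing author is accepted.
conn.execute("INSERT INTO book (title, author_id) VALUES ('Accepted', 1)")

# A book pointing at a non-existent author is rejected by the database.
try:
    conn.execute("INSERT INTO book (title, author_id) VALUES ('Rejected', 99)")
    outcome = "inserted"
except sqlite3.IntegrityError as err:
    outcome = str(err)

print(outcome)
book_count = conn.execute("SELECT COUNT(*) FROM book").fetchone()[0]
print(book_count)
```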

In [None]:
conn = psycopg2.connect(database='unter', **login)
conn.autocommit = True

with conn, conn.cursor() as cur:
    cur.execute("""
    CREATE TABLE car_model(
        id SERIAL,
        model_name VARCHAR(64) NOT NULL,
        miles_per_gallon REAL,
        year DATE,
        origin VARCHAR(32),
        
        ...
    );

    CREATE TABLE cab(
        id SERIAL,
        licence_plate VARCHAR(32) UNIQUE NOT NULL,
        car_model_id INT,
        owner_id INT,
        active BOOLEAN NOT NULL,

        ...
    );
    """)

### End

Congratulations, here is a cartoon ([source](https://xkcd.com/327/)) that only SQL people understand:

<img src="img/cartoon.png" width="800">

Have fun with SQL!