<img src="img/dsci513_header2.png" width="600">

# Lab 1: Introduction to relational databases

**Arman Seyed-Ahmadi, November 2021**

## Instructions
---
rubric={mechanics:2}

- Follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/)
- Make sure to upload a PDF version of your lab notebook to Gradescope, in addition to the `.ipynb` file. Use the `Webpdf` option of Jupyter Lab if `PDF` doesn't work.
- Add a link to your GitHub repository here:

## Imports and configurations
---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2

%matplotlib inline
%load_ext sql
%config SqlMagic.displaylimit = 20
%config SqlMagic.autolimit = 30

Before running the following cell, make sure that you have the correct login information in the `credentials.json` file:

In [2]:
import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

Use the following cell (copying it as needed) to connect to the database that you need for a question. If you're using the same database for several questions, you don't need to reconnect each time: use this cell only to establish the first connection and to switch from one database to another.

In [None]:
%sql postgresql://{username}:{password}@{host}:{port}/imdb_dsci513
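
The connection string above is an ordinary URL, which is why the password was passed through `urllib.parse.quote()` earlier: characters such as `@`, `:` or `#` would otherwise be read as URL delimiters. Here is a minimal sketch of how the pieces fit together. The credentials below are made up for illustration and are not the contents of your `credentials.json`.

```python
import urllib.parse

# Hypothetical credentials, for illustration only
login = {
    "user": "student",
    "password": "p@ss:word#1",
    "host": "localhost",
    "port": 5432,
}

# Percent-encode the password so that '@', ':' and '#' are not
# mistaken for the delimiters of the URL itself
password = urllib.parse.quote(login["password"])

url = f"postgresql://{login['user']}:{password}@{login['host']}:{login['port']}/imdb_dsci513"
print(url)
```

Decoding the quoted password with `urllib.parse.unquote()` gives back the original string, so nothing is lost by the encoding.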

## Exercise 1: Getting to know a database
---

This exercise does not involve any coding, just getting to know our way around in pgAdmin and `psql`. You can use either of these two options for answering the questions.

### 1.1

rubric={accuracy:2}

List the names of the tables that exist in the `imdb_dsci513` and `world_dsci513` databases.

_Your answer goes here, replacing this line._

### 1.2

rubric={accuracy:2}

List the column names of the `country` table in the `world_dsci513` database.

> **Hint**: You can find the answer using pgAdmin by right-clicking the table name and selecting "Properties", but it is much easier in my opinion to use one of `psql`'s meta commands to do this!

_Your answer goes here, replacing this line._

### 1.3

rubric={accuracy:2}

How many **unique** data types do you see in the `country` table of the `world_dsci513` database? List those unique datatypes.

> Remember? The data type of each column in a table determines its **domain**, i.e. the set of values that the column can store.

_Your answer goes here, replacing this line._

### 1.4

rubric={accuracy:1}

How many rows are there in the `names` table of the `imdb_dsci513` database? Use pgAdmin to answer this question. 

> **Hint:** Right-click on the table in pgAdmin and inspect the options you have. Also, you might want to check out the "Properties" tab in pgAdmin for your table.

_Your answer goes here, replacing this line._

## Exercise 2: Basic SQL queries
---

### 2.1

rubric={accuracy:3}

- Write a query that returns 5 rows from the columns `title`, `start_year`, and `rating` from the `movies` table in the `imdb_dsci513` database.

- How are the rows ordered by default? Provide your answer in a markdown cell below the code.

In [None]:
%%sql

...

_Your answer goes here, replacing this line._

### 2.2

rubric={accuracy:3}

We want to retrieve rows corresponding to the 10 top-rated movies in 2015 (year based on the `start_year` column) using the `movies` table in the `imdb_dsci513` database, but we only want those movies that have at least 10,000 votes.

> **Hint:** When trying to come up with a SQL statement for a given query, it's helpful to ask yourself these questions:
> - Which columns to retrieve?
> - Which table to choose columns from?
> - What filters (if any)?
> - Need to sort the results?
> - Duplicates ok?
> - How many rows to retrieve? As many as there are, or a specific number?

In [None]:
%%sql

...

### 2.3

rubric={accuracy:4}

We want to find out what percentage of movies in the `imdb_dsci513` database are rated no less than 7. Write a query that computes that percentage value with two digits after the decimal point, and prints the output as e.g. `10.25%`.

For this question, write one query to find the total count, and use the result manually in another query to compute the percentage (you will learn how to do this in a single query soon!).

> **Hint:** There is a SQL function for rounding numbers, remember?

In [None]:
%%sql

...

In [None]:
%%sql

...

### 2.4

rubric={accuracy:2}

Write a query to the `world_dsci513` database that returns all unique pairs of continents and regions in the `country` table.

In [None]:
%%sql

...

### 2.5 (OPTIONAL)

rubric={accuracy:1}

Write a query to the `world_dsci513` database that returns the **number** of unique pairs of continents and regions in the `country` table.

In [None]:
%%sql

...

### 2.6

rubric={accuracy:4}

- Query the `country` table from the `world_dsci513` database to find the population density (i.e. `(population) / (surface area)`) for every country located in Asia, Africa, and Europe.
- Name the resulting column `pop_density`, and round its values to 1 decimal digit.
- Your query should return the `name`, `region`, and `pop_density` columns only.
- Sort the resulting rows by `pop_density` in descending order.

> **Hint:** Some SQL functions don't work with inexact data types, so you'll need to do a type conversion to get the right result.

In [None]:
%%sql

...

### 2.7

rubric={accuracy:4}

Given the `country` table of the `world_dsci513` database, retrieve the name and percent change in GNP (gross national product) of countries whose GNP increased by between 0% and 50%. The percent change in GNP, $\epsilon_\text{GNP}$, is given by

$$
\epsilon_\text{GNP} = \frac{\text{GNP} - \text{GNP}_\text{old}}{\text{GNP}_\text{old}} \times 100
$$

- Round $\epsilon_\text{GNP}$ to 1 decimal digit, and show $\epsilon_\text{GNP}$ as, for example, `100.0%` (i.e. append a percent sign to the value).
- The column containing $\epsilon_\text{GNP}$ values should have the column name "**GNP % change**".
- Sort your results by $\epsilon_\text{GNP}$ in descending order.
- Eliminate rows which contain null values for $\epsilon_\text{GNP}$.

In [None]:
%%sql

...

### 2.8

rubric={accuracy:2}

How many of the countries retrieved in [Exercise 2.7](#2.7) have gained independence after 1950? Write a query that answers this question.

In [None]:
%%sql

...

### 2.9 (OPTIONAL)

rubric={accuracy:2}

- Write a query that returns the names of all countries that gained independence at some point in time (i.e. have a recorded independence year).
- We would also like to have a column named `Independent for (years):` which computes the number of years from independence until now (in integer values).
- Don't hard-code current year, e.g. 2021, in the query; we want our query to be useful in upcoming years too.
- Sort your results alphabetically by country name in ascending order.

> **Hint:** Watch out for nulls!

> **Hint:** There are several solutions to this exercise, and all of them are acceptable as long as your query returns the correct rows.

In [None]:
%%sql

...

### 2.10 (OPTIONAL)

rubric={accuracy:3}

The following SQL query finds the number of countries in Asia:

```sql
SELECT
    COUNT(country)
FROM
    country
WHERE
    continent = 'Asia'
;
```

Rewrite this query to obtain the same result without using a `WHERE` clause.

> **Hint:** Think about how `COUNT()` treats certain data types differently...

In [None]:
%%sql

...

## Exercise 3: Pattern matching
---

### 3.1

rubric={accuracy:3}

Write a query to find the country names both starting and ending with a vowel (i.e. "a", "e", "i", "o", and "u"). Also, discard names with more than one part (e.g. "United Kingdom"). Do not use regex for this question.

> **Hint:** You might initially think of `LIKE` keyword for finding names starting and ending with a vowel. It is certainly possible to answer this question using `LIKE`, but it would be an unnecessarily long query. Think about how you can "index" into the names, and use that in combination with `IN` to make the query shorter. You still need `LIKE` to discard names with multiple parts though!

> **Hint:** Multi-part country names have at least one space character.

In [None]:
%%sql

...

### 3.2

rubric={accuracy:2}

Use regex in a query to count the number of leaders (represented by the `headofstate` column) whose name contains one of the Roman numerals I, II, III, IV, V, VI, VII, or VIII (e.g. "Elisabeth II").

In [None]:
%%sql

...

### 3.3

rubric={accuracy:2}

- Given the table `country` of the `world_dsci513` database, write a query to find the names of the countries located in the Middle East region.
- Along with the column `name`, your query should also retrieve a derived column named `Republic?` that is boolean valued, and shows `True` if the country is run by a republic government, and `False` if not.
- Sort the results based on the `name` column alphabetically in ascending order.

In [None]:
%%sql

...

### 3.4

rubric={accuracy:2}

Use regex to find the number of movies in the `movies` table of the `imdb_dsci513` database whose titles contain both the word "the" (case insensitive) and a number.

In [None]:
%%sql

...

### 3.5

rubric={accuracy:3}

Given the `country` table from the `world_dsci513` database, retrieve the name, region, and population of all countries. Also, suppose that we'd like to clean up the column `governmentform`. Create a column called `gov_type` which, based on the value of the column `governmentform`, prints one of the following options:

- "Republic" if `governmentform` contains any form of the word "Republic"
- "Monarchy" if `governmentform` contains any form of the word "Monarchy"
- "Dependent" if `governmentform` contains either one of these words: "Territory", "Area", "Region", "Department", "Part"

And `NULL` for anything else.

> **Hint:** Remember to make sure your search pattern is case insensitive.

In [None]:
%%sql

...

## Exercise 4: Data retrieval with `psycopg2` and Pandas
---

### SQL and Python

`psycopg2` is the most popular PostgreSQL adapter for Python (see docs [here](https://www.psycopg.org/docs/index.html)). It allows us to send SQL queries directly to a database server and retrieve data into Python.

It is actually quite easy to use `psycopg2`. We first need to set up a connection to our database. Let's open up the JSON file storing our login information in Python:

In [30]:
with open('data/credentials.json') as f:
    login = json.load(f)

To make a connection, we use `psycopg2.connect()`:

In [31]:
conn = psycopg2.connect(database='world_dsci513',
                        user=login['user'],
                        password=login['password'],
                        host=login['host'],
                        port=login['port'])

Or because we've used the same names for our dictionary keys in `credentials.json` as the arguments to `psycopg2.connect()`, we can simply unpack `login` directly:

In [32]:
conn = psycopg2.connect(database='world_dsci513', **login)

Note that I'm setting the argument `database` separately such that I don't have to go back and modify `credentials.json` every time I want to change the database.

We'll keep this connection open for our whole working session. Leaving an idle connection open does no harm (it will terminate when you exit Python/Jupyter anyway), but it does consume system resources, so it's good practice to close it with `conn.close()` once you're finished with it.

Once we have a connection, we create `cursor` objects to perform operations and then `.execute()` a SQL statement. For various reasons which you can read more about in the [psycopg2 docs](https://www.psycopg.org/docs/usage.html#transactions-control), it's recommended to use Python's context managing `with` statement to create cursors:

In [33]:
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM country LIMIT 5")
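
To see what the `with conn` part does, it helps to know that in `psycopg2` the connection context manager wraps a transaction: it commits if the block finishes normally and rolls back if an exception is raised, and it does not close the connection. The sketch below reproduces that behaviour with Python's built-in `sqlite3` module, used here only as a stand-in so that the example runs without a database server. The table contents and the `count_rows` helper are made up for illustration.

```python
import sqlite3
from contextlib import closing

# Stand-in: an in-memory SQLite database instead of the course Postgres server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")


def count_rows(connection):
    """Return the number of rows currently in the country table."""
    with closing(connection.cursor()) as cur:
        cur.execute("SELECT COUNT(*) FROM country")
        return cur.fetchone()[0]


# Case 1: an exception inside the block rolls the transaction back
try:
    with conn:
        conn.execute("INSERT INTO country VALUES ('Atlantis', 1)")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
rows_after_failure = count_rows(conn)

# Case 2: a block that finishes normally is committed
with conn:
    conn.execute("INSERT INTO country VALUES ('Atlantis', 1)")
rows_after_success = count_rows(conn)

print(rows_after_failure, rows_after_success)

# The connection is still open after both blocks, so close it explicitly
conn.close()
```

One difference to keep in mind: `psycopg2` cursors can be used directly in a `with` statement, whereas `sqlite3` cursors cannot, which is why the stand-in uses `contextlib.closing`.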

To inspect the returned data we can use one of three methods:
- `cur.fetchone()`: returns a single row
- `cur.fetchmany(5)`: returns the specified numbers of rows
- `cur.fetchall()`: returns all rows

Note that the cursor behaves like an iterator: each call to one of the above methods consumes the rows it returns. Once you've iterated over all the rows, `fetchone()` returns `None` and `fetchmany()` and `fetchall()` return an empty list, so you need to run the `execute` statement again to read the data again. This lets you work through a large result in pieces instead of handling every row at once. For example:

In [34]:
with conn, conn.cursor() as cur:
    cur.execute("SELECT name, population FROM country LIMIT 5")
    for row in cur.fetchall():
        print(row)

('Afghanistan', 22720000)
('Netherlands', 15864000)
('Netherlands Antilles', 217000)
('Albania', 3401200)
('Algeria', 31471000)


If I try to iterate over 6 rows in this case, I'll get `None` back once all my data is exhausted:

In [35]:
with conn, conn.cursor() as cur:
    cur.execute("SELECT name, population FROM country LIMIT 5")
    for i in range(6):
        print(cur.fetchone())

('Afghanistan', 22720000)
('Netherlands', 15864000)
('Netherlands Antilles', 217000)
('Albania', 3401200)
('Algeria', 31471000)
None
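
The three fetch methods can also be mixed on the same cursor, each one picking up where the previous call stopped. Here is a small sketch of that, again using the built-in `sqlite3` module as a stand-in so that it runs without a server (both drivers follow the same Python DB-API for these methods). The five rows are copied from the output above.

```python
import sqlite3
from contextlib import closing

# Stand-in: an in-memory SQLite table holding the five rows shown above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")
rows = [
    ("Afghanistan", 22720000),
    ("Netherlands", 15864000),
    ("Netherlands Antilles", 217000),
    ("Albania", 3401200),
    ("Algeria", 31471000),
]
conn.executemany("INSERT INTO country VALUES (?, ?)", rows)

with closing(conn.cursor()) as cur:
    cur.execute("SELECT name, population FROM country ORDER BY rowid")
    first = cur.fetchone()           # row 1
    next_two = cur.fetchmany(2)      # rows 2 and 3
    rest = cur.fetchall()            # rows 4 and 5
    after_end_one = cur.fetchone()   # nothing left: None
    after_end_all = cur.fetchall()   # nothing left: empty list

conn.close()

print(first)
print(next_two)
print(rest)
print(after_end_one, after_end_all)
```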


We can even break queries over multiple lines for readability using Python's triple-quoted strings (`"""text"""`):

In [36]:
query = """
SELECT
  name, region, population
FROM
  country
LIMIT 5
;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for row in cur.fetchall():
        print(row)

('Afghanistan', 'Southern and Central Asia', 22720000)
('Netherlands', 'Western Europe', 15864000)
('Netherlands Antilles', 'Caribbean', 217000)
('Albania', 'Southern Europe', 3401200)
('Algeria', 'Northern Africa', 31471000)


That's all there is to it! Once we're done, we can close our connection to save system resources by running `conn.close()`.
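
If you'd rather not remember to call `conn.close()` yourself, one option is to wrap the connection in `contextlib.closing`, which calls `.close()` when the block ends. This is only a sketch of that pattern: it uses the built-in `sqlite3` module as a stand-in so that it runs without a server, and the one-row table is made up. With `psycopg2` you would wrap the `psycopg2.connect(...)` call in the same way.

```python
import sqlite3
from contextlib import closing

# closing(...) calls conn.close() when the block ends, even after an error
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE country (name TEXT)")
    conn.execute("INSERT INTO country VALUES ('Albania')")
    names = [row[0] for row in conn.execute("SELECT name FROM country")]

# The connection is closed now, so any further use raises an error
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(names, still_open)
```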

> **Note:** If you get the error `InternalError: current transaction is aborted, commands ignored until end of transaction block`, an earlier statement in the same transaction failed and the transaction was never rolled back. This typically happens when a query is run outside the `with` context manager, which would otherwise roll back for you. To get out of this state, run `conn.rollback()`. You can read more about why you need to do this in the [psycopg2 docs](https://www.psycopg.org/docs/usage.html#transactions-control).

### SQL and Pandas

`psycopg2` provides a basic interface with a Postgres database. As data scientists, you'll often be working with data in a Pandas dataframe, so it would be useful to be able to execute SQL statements and coerce the returned data directly into a dataframe. As we've seen in lecture 1, it is possible to convert the returned data into a dataframe with `ipython-sql`, but we also need to be able to do that outside a Jupyter notebook environment.

Luckily, this is super easy to do with the Pandas function `pd.read_sql_query()`. Let's give it a try. We'll reuse the `psycopg2` connection `conn` that we created before, so all we need is a query:

In [37]:
query = """
SELECT
  name, region, population
FROM
  country
LIMIT 5
;
"""

In [38]:
pd.read_sql_query(query, conn)

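|   | name                 | region                    | population |
|---|----------------------|---------------------------|------------|
| 0 | Afghanistan          | Southern and Central Asia | 22720000   |
| 1 | Netherlands          | Western Europe            | 15864000   |
| 2 | Netherlands Antilles | Caribbean                 | 217000     |
| 3 | Albania              | Southern Europe           | 3401200    |
| 4 | Algeria              | Northern Africa           | 31471000   |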
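
`pd.read_sql_query()` also accepts a `params` argument, which is handy when a filter value comes from Python rather than being typed into the query. The sketch below shows this with the built-in `sqlite3` module as a stand-in connection so that it runs without a server. The three rows are copied from the output above and the population threshold is arbitrary. Note that the placeholder syntax depends on the driver: `sqlite3` uses `?`, whereas `psycopg2` uses `%s`.

```python
import sqlite3

import pandas as pd

# Stand-in: an in-memory SQLite table with three of the rows shown above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, region TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Southern and Central Asia", 22720000),
        ("Netherlands", "Western Europe", 15864000),
        ("Netherlands Antilles", "Caribbean", 217000),
    ],
)
conn.commit()

# '?' is the sqlite3 placeholder; with psycopg2 it would be '%s'
query = """
SELECT
  name, population
FROM
  country
WHERE
  population > ?
ORDER BY
  population DESC
;
"""

df = pd.read_sql_query(query, conn, params=(1_000_000,))
conn.close()

print(df)
```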


### 4.1

rubric={accuracy:2,viz:2}

- Create a graph that contains the overlay of two histograms showing the distribution of "life expectancy", one for both continents of North and South America (collectively called "Americas"), and another for the continent of Europe.
- Extract your data from the `country` table in the `world_dsci513` database.
- Filtering of the data must be done using SQL queries executed through `psycopg2`.
- As for the formatting of the graph, please follow the [general visualization rubric](https://github.com/UBC-MDS/public/blob/55c2d336bb91e38301c9e9d025faf284449ca272/rubric/rubric_viz.md). You can use any bin size that you see fit for your histograms. Use any visualization package that you like.

In [None]:
...

### 4.2

rubric={accuracy:2,viz:1}

- Create a bar graph that shows the average life expectancy of each continent.
- Your plot should also show the standard deviation of life expectancy for each continent as an error bar.
- Use Pandas aggregators to obtain averages and the standard deviations after retrieving data using SQL.
- Also, remember to exclude rows that contain null values for life expectancy.

In [None]:
...

### 4.3

rubric={accuracy:4}

According to the `world_dsci513` database, which are the 10 most spoken languages in the world? Note that each language in a country is spoken by a particular percentage of the population.

Answering this question involves retrieving and combining data from multiple tables, as well as grouping and aggregation. Since we have not talked about either combining tables or grouping and aggregation in SQL yet, you need to do the retrieval part using SQL and the remaining parts using Pandas. My solution involves the following steps:

1. Use SQL queries to retrieve data from 2 different tables, and load the data into 2 Pandas dataframes using `pd.read_sql_query()`
2. Merge the tables using Pandas `.merge()` function that you've learned in DSCI 511
3. Create a new column for your dataframe to take into account the population percentages speaking a particular language
4. Drop unnecessary columns
5. Use Pandas functions `.groupby()` together with `.agg()`

For your SQL queries, only retrieve the columns that you need.

Your final result should look something like this:

| **language** | **speaking_population** |
|--------------|-------------------------|
| **lang1**    | 123456789               |
| **lang2**    | 123456789               |
| **lang3**    | 123456789               |
| ...          | ...                     |

(Note that you will get multiple column indexes because of grouping in Pandas, so your `speaking_population` column header will visually look _raised_ which is hard to reproduce in markdown. Don't worry if this happens, as it is supposed to.)

> **Note:** In order for merging to be possible in Pandas, column names should be the same. Think of how you can write your SQL query such that you have the same column names, based on which you want to do the merging.
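
To illustrate the two notes above on unrelated toy data, the sketch below uses a column alias (`AS`) so that two dataframes share a key column name, merges them, and then aggregates with `.groupby()` and `.agg()`. The `shop` and `sale` tables are entirely made up, and the built-in `sqlite3` module stands in for the Postgres connection so that the example runs without a server.

```python
import sqlite3

import pandas as pd

# Hypothetical toy tables, unrelated to the lab databases
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shop (id INTEGER, city TEXT)")
conn.execute("CREATE TABLE sale (shop_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO shop VALUES (?, ?)",
    [(1, "Vancouver"), (2, "Vancouver"), (3, "Kelowna")],
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?)",
    [(1, 10), (2, 5), (3, 7), (3, 1)],
)

# Alias shop.id to shop_id so both dataframes use the same key name
df_shop = pd.read_sql_query("SELECT id AS shop_id, city FROM shop", conn)
df_sale = pd.read_sql_query("SELECT shop_id, amount FROM sale", conn)
conn.close()

df = df_shop.merge(df_sale, on="shop_id")

# Passing a list of aggregators produces two-level ("raised") column headers
summary = df.groupby("city").agg({"amount": ["sum"]})
print(summary)
```

The printed `summary` has the column header split over two rows (`amount` above `sum`), which is the "raised" look mentioned above.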

In [None]:
df_name_pop = ...

df_language = ...

df_merged = ...

# compute the number of speakers for each language
df_merged['speaking_population'] = ...

# drop unnecessary columns
df_merged = ...

# chain multiple operations here to get the final result
df_merged ...