# Lecture 3: Summarizing & joining tables

## Announcements

```{admonition} Anonymous iclicker: How was your experience with installing PostgreSQL? (MAKE SURE IT IS ANONYMOUS)
- A. There are no issues with Postgres and pgadmin, and the installation I did at the start of MDS is all good.
- B. I had to reinstall it for some other reason, but it's all good.
- C. I forgot my password and had to reinstall it to make it work :( 
- D. Please don't ask me about it. 
```
```{margin}
<img src="img/recap.png" width="400px">
```

## Recap
- Data types: `INTEGER`, `REAL`, `TEXT`, `BOOLEAN`, `DATE`, and casting between them.
- More keywords: `LIKE`, `ILIKE`, `IN`, `BETWEEN`, `IS NULL`. YOU WILL END UP USING `ILIKE` A LOT.
- More operators `AND`, `OR`, `NOT` - Be careful of the order of precedence of these operators.
- Where to use double quotes `"` and single quotes `'`
- Column aliases using `AS`
- Derived columns: `runtime / 60::REAL AS runtime_hours`
- Use of the `CASE` keyword
- Some built-in functions: `ROUND`, `SUBSTR`, `LENGTH`, and many more.
- Order of execution of SQL statements vs Order of SQL clauses

```{admonition} Recap iclicker: The following is a "student" table. 

| student_id | name   | age |
|------------|--------|-----|
| 2          | Jane   | 30  |
| 3          | Alexander    | 10  |
| 4          | Emily  | 40  |
| 5          | Alex   | 20  |
| 6          | alex    | 80  |
| 7          | Alice  | 10  |
| 8          | Boalexo  | 50  |
| 9          | NULL  | 60  |

What is the output of the following query?

    SELECT name, age, age/2 AS age_half
    FROM student
    WHERE name ILIKE 'Alex%' OR name = 'NULL' ;

> Note: For student_id 9, the name is a missing value (`NULL`), not the string `'NULL'`. 

A.
| name   | age | age_half |
|--------|-----|----------|
| Alexander    | 10  | 5        |
| Alex   | 20  | 10       |
| alex    | 80  | 40       |
| NULL  | 60  | 30       |

B.
| name   | age | age_half |
|--------|-----|----------|
| Alexander    | 10  | 5        |
| Alex   | 20  | 10       |
| alex    | 80  | 40       |

C.
| name   | age | age_half |
|--------|-----|----------|
| Alexander    | 10  | 5        |
| Alex   | 20  | 10       |
| alex    | 80  | 40       |
| Boalexo  | 50  | 25  |

D.
| name   | age | age/2 |
|--------|-----|----------|
| Alexander    | 10  | 5        |
| Alex   | 20  | 10       |
| alex    | 80  | 40       |

```

```{toggle}
Answer B.

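If you got it wrong, please review the lecture notes and make sure you understand the following:
- `ILIKE` is case insensitive, so `'Alex%'` matches `Alexander`, `Alex`, and `alex`. The pattern has no leading `%`, so `Boalexo` does not match.
- `name = 'NULL'` compares against the string `'NULL'`, not a missing value. To match `NULL` values, you need to use `IS NULL` (or `IS NOT NULL` for the opposite).
- The alias `AS age_half` names the output column, so the header is `age_half`, not `age/2`.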
```
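
To check this behaviour without a running PostgreSQL server, here is a minimal sketch that rebuilds the `student` table in an in-memory SQLite database through Python's `sqlite3` module. Using SQLite is an assumption made for portability: it has no `ILIKE`, but its `LIKE` ignores ASCII case by default, so it stands in for `ILIKE` here.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (student_id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(2, "Jane", 30), (3, "Alexander", 10), (4, "Emily", 40), (5, "Alex", 20),
     (6, "alex", 80), (7, "Alice", 10), (8, "Boalexo", 50), (9, None, 60)],
)

# Comparing with the string 'NULL' never matches the missing name.
with_equals = conn.execute(
    "SELECT name, age, age / 2 AS age_half FROM student "
    "WHERE name LIKE 'Alex%' OR name = 'NULL' ORDER BY student_id"
).fetchall()

# IS NULL is the comparison that actually finds the missing name.
with_is_null = conn.execute(
    "SELECT name, age, age / 2 AS age_half FROM student "
    "WHERE name LIKE 'Alex%' OR name IS NULL ORDER BY student_id"
).fetchall()

print(with_equals)
print(with_is_null)
```

The first query returns the three rows of answer B. Only the second query also returns the row for student 9.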

## Themes
- Aggregations: If we want to do operations on a bunch of rows at once. 
- Grouping: If we want to do operations on a bunch of rows at once but grouped by some column(s).
- Joins: If we want to combine data from multiple tables.

Look at the example below:

In [1]:
%load_ext sql
%config SqlMagic.displaylimit = 20

In [2]:
import json
import urllib.parse

with open('../lectures/data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
# Display the login information to confirm that the correct file
# and values were loaded
login

{'host': 'localhost', 'port': 5432, 'user': 'postgres', 'password': 'postgres'}

In [3]:
%sql postgresql://{username}:{password}@{host}:{port}/world

In [4]:
%%sql

SELECT *
FROM country ;

 * postgresql://postgres:***@localhost:5432/world
239 rows affected.


code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1,AF
NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,15864000,78.3,371362.0,360478.0,Nederland,Constitutional Monarchy,Beatrix,5,NL
ANT,Netherlands Antilles,North America,Caribbean,800.0,,217000,74.7,1941.0,,Nederlandse Antillen,Nonmetropolitan Territory of The Netherlands,Beatrix,33,AN
ALB,Albania,Europe,Southern Europe,28748.0,1912.0,3401200,71.6,3205.0,2500.0,Shqipëria,Republic,Rexhep Mejdani,34,AL
DZA,Algeria,Africa,Northern Africa,2381741.0,1962.0,31471000,69.7,49982.0,46966.0,Al-Jazair/Algérie,Republic,Abdelaziz Bouteflika,35,DZ
ASM,American Samoa,Oceania,Polynesia,199.0,,68000,75.1,334.0,,Amerika Samoa,US Territory,George W. Bush,54,AS
AND,Andorra,Europe,Southern Europe,468.0,1278.0,78000,83.5,1630.0,,Andorra,Parliamentary Coprincipality,,55,AD
AGO,Angola,Africa,Central Africa,1246700.0,1975.0,12878000,38.3,6648.0,7984.0,Angola,Republic,José Eduardo dos Santos,56,AO
AIA,Anguilla,North America,Caribbean,96.0,,8000,76.1,63.2,,Anguilla,Dependent Territory of the UK,Elisabeth II,62,AI
ATG,Antigua and Barbuda,North America,Caribbean,442.0,1981.0,68000,70.5,612.0,584.0,Antigua and Barbuda,Constitutional Monarchy,Elisabeth II,63,AG


In [5]:
%%sql

SELECT count(*) AS count
FROM country ;

 * postgresql://postgres:***@localhost:5432/world
1 rows affected.


count
239


In [6]:
%%sql

SELECT count(headofstate) AS count
FROM country ;

 * postgresql://postgres:***@localhost:5432/world
1 rows affected.


count
238


```{admonition} Discussion: Why do you think the number differs between count(*) and count(a_column)?
<img src="img/discuss.png" width="120">
```

```{toggle}
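Except for `COUNT(*)`, all aggregation functions ignore NULLs. `COUNT(*)` counts rows, while `COUNT(headofstate)` counts only the rows where `headofstate` is not NULL. One country has no head of state recorded, hence 238 instead of 239.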
```
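
The same effect can be reproduced on three rows. This is a minimal sketch using an in-memory SQLite database as a stand-in for PostgreSQL (an assumption). The table `mini_country` and its values are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mini_country (name TEXT, headofstate TEXT, population INTEGER)"
)
# Hypothetical rows: one missing head of state, one missing population.
conn.executemany(
    "INSERT INTO mini_country VALUES (?, ?, ?)",
    [("A", "Beatrix", 10), ("B", None, 20), ("C", "Elisabeth II", None)],
)

counts = conn.execute(
    """
    SELECT COUNT(*),            -- counts rows, NULLs included
           COUNT(headofstate),  -- skips the NULL head of state
           COUNT(population),   -- skips the NULL population
           AVG(population)      -- averages only the non-NULL values
    FROM mini_country
    """
).fetchone()
print(counts)  # (3, 2, 2, 15.0)
```

Note that `AVG` divides by the number of non-NULL values (2), not by the number of rows (3).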

Let's play with some other aggregations:

In [7]:
%%sql
SELECT avg(population) AS avg_population
FROM country ;

 * postgresql://postgres:***@localhost:5432/world
1 rows affected.


avg_population
25434098.117154807


Here is a graphic showing how it works:

<img src="img/simpleavg.png" width="600">

## GROUP BY

But what if I want to find the ***average population for each continent***? 

> Divide tuples into groups and apply aggregate operations to each group.

We need `GROUP BY` to do that. Here is a graphic showing how it works:

<img src="img/groupby.png" width="600">

In [8]:
%%sql

SELECT continent, AVG(population)
FROM country
GROUP BY continent ;

 * postgresql://postgres:***@localhost:5432/world
7 rows affected.


continent,avg
Asia,72647562.74509805
South America,24698571.42857143
North America,13053864.864864863
Oceania,1085755.3571428573
Antarctica,0.0
Africa,13525431.034482758
Europe,15871186.95652174
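
To see the "divide into groups, then aggregate each group" idea on numbers that can be checked by hand, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL (an assumption), and the table `mini_country` with its populations is hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mini_country (name TEXT, continent TEXT, population INTEGER)")
# Hypothetical data, small enough to average by hand.
conn.executemany(
    "INSERT INTO mini_country VALUES (?, ?, ?)",
    [("A", "Asia", 100), ("B", "Asia", 300),
     ("C", "Europe", 50), ("D", "Europe", 150),
     ("E", "Oceania", 10)],
)

# One output row per continent; AVG is computed inside each group.
per_continent = conn.execute(
    "SELECT continent, AVG(population) FROM mini_country "
    "GROUP BY continent ORDER BY continent"
).fetchall()
print(per_continent)  # [('Asia', 200.0), ('Europe', 100.0), ('Oceania', 10.0)]
```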


```{admonition} Discussion: What if I want to find the names of countries with above-average populations? Try writing an SQL query for that.
<img src="img/discuss.png" width="120">
```

```{toggle}
Did you get this?

    SELECT name
    FROM country
    WHERE population > AVG(population);
```

```{admonition} Discussion: Why do you think we can't do this?
<img src="img/discuss.png" width="120">

    SELECT name
    FROM country
    WHERE population > AVG(population);
```

```{toggle}
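An aggregation function CANNOT be used in the `WHERE` clause. `WHERE` filters individual rows before any grouping or aggregation happens, so the aggregate has no value yet when the filter runs.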
```
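
The rejection can be observed directly. This is a minimal sketch using an in-memory SQLite database as a stand-in for PostgreSQL (an assumption). The exact error text differs between database engines, so the sketch only captures the message instead of relying on its wording. The two-row `country` table is hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO country VALUES (?, ?)", [("A", 10), ("B", 30)])

try:
    # The aggregate appears in WHERE, which is evaluated row by row.
    conn.execute("SELECT name FROM country WHERE population > AVG(population)")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(error_message)
```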

```{margin}
<img src="img/5.png" width="600">
```

```{seealso}
We will learn how to do this in lecture 5 when we learn about subqueries. We will revisit this question then and do this using subqueries.
```

```{important}
`GROUP BY` also creates a group for NULLs. Look at the example below:
```

In [9]:
%%sql

SELECT headofstate, count(*) AS count
FROM country
GROUP BY headofstate 
ORDER BY headofstate DESC
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/world
5 rows affected.


headofstate,count
,1
Ólafur Ragnar Grímsson,1
Émile Lahoud,1
tipe Mesic,1
kenraali Than Shwe,1
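
The NULL group is easier to spot on a handful of rows. This is a minimal sketch with an in-memory SQLite database standing in for PostgreSQL (an assumption). The table `mini_country` and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mini_country (name TEXT, headofstate TEXT)")
# Hypothetical rows: two countries share a head of state, two have none.
conn.executemany(
    "INSERT INTO mini_country VALUES (?, ?)",
    [("A", "Beatrix"), ("B", "Beatrix"), ("C", None), ("D", None), ("E", "Harald V")],
)

# All rows with a NULL headofstate end up together in a single group.
groups = dict(
    conn.execute(
        "SELECT headofstate, COUNT(*) FROM mini_country GROUP BY headofstate"
    ).fetchall()
)
print(groups)
```

In Python the NULL group shows up under the key `None`.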


## HAVING

What if I want to find the average population for each continent, but ***only for continents with an average population greater than 10 million***? 

```{admonition} Discussion: How does this question differ from the one we started GROUP BY with, "finding the ***average population for each continent***"?

<img src="img/discuss.png" width="120">
```

```{toggle}
The phrase ***only for continents with an average population greater than 10 million*** tells me that I now want to filter the groups. We use `WHERE` to filter individual rows, and we use the `HAVING` clause to filter the groups that we made using the `GROUP BY` clause.
```

Refer to the HAVING execution animation below: 

<img src="img/having.png" width="600">

In [10]:
%%sql
SELECT continent, AVG(population) AS avg_pop
FROM country
GROUP BY continent
HAVING AVG(population) > 10000000
ORDER BY avg_pop DESC;

 * postgresql://postgres:***@localhost:5432/world
5 rows affected.


continent,avg_pop
Asia,72647562.74509805
South America,24698571.42857143
Europe,15871186.95652174
Africa,13525431.034482758
North America,13053864.864864863


SQL query: Write a query to return the average and maximum population of cities in the city table for China, India, Canada, US, Australia, and Russia, ***but only for the countries that have more than 60 cities listed in the city table.***

Show the results for each country using the corresponding country code, and order the groups by the number of cities listed, from most to fewest.


In [11]:
%%sql
SELECT
    countrycode,
    AVG(population)::int,
    MAX(population)::int,
    COUNT(population) AS city_count
FROM
    city
WHERE
    countrycode IN ('CHN', 'IND', 'CAN', 'USA', 'AUS', 'RUS')
GROUP BY
    countrycode
HAVING
    COUNT(*) > 60
ORDER BY
    city_count DESC
;

 * postgresql://postgres:***@localhost:5432/world
4 rows affected.


countrycode,avg,max,city_count
CHN,484721,9696300,363
IND,361579,10500000,341
USA,286955,8008278,274
RUS,365877,8389200,189


Look at the graphic below to understand how `COUNT(*)` works together with `HAVING`:

<img src="img/havingcount.png" width="600">

```{admonition} Discussion: When should we use the WHERE clause vs. the HAVING clause?
<img src="img/discuss.png" width="120">

```

```{toggle}
- WHERE clause is used to filter rows before grouping.
- HAVING clause is used to filter groups after grouping.
```
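
Both filters can appear in the same query. The minimal sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL (an assumption). The `city` rows and the country codes `AAA`, `BBB`, and `CCC` are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (countrycode TEXT, population INTEGER)")
# Hypothetical data: AAA has 3 cities, BBB has 2, CCC has 3.
conn.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("AAA", 100), ("AAA", 200), ("AAA", 300),
     ("BBB", 50), ("BBB", 70),
     ("CCC", 500), ("CCC", 600), ("CCC", 700)],
)

result = conn.execute(
    """
    SELECT countrycode, AVG(population), COUNT(*) AS city_count
    FROM city
    WHERE countrycode IN ('AAA', 'BBB')  -- row filter: drops every CCC row
    GROUP BY countrycode
    HAVING COUNT(*) > 2                  -- group filter: drops the BBB group
    ORDER BY countrycode
    """
).fetchall()
print(result)  # [('AAA', 200.0, 3)]
```

`CCC` has enough cities to pass the `HAVING` condition, but its rows never reach the grouping step because `WHERE` removed them first.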

## Rules for using GROUP BY and HAVING

```sql
SELECT        [DISTINCT]  `target-list`
FROM         `relation-list`
WHERE        `qualification`
GROUP BY  `grouping-list`
HAVING      `group-qualification`
ORDER BY  `target-list`
```

```{margin}
<img src="img/rules.png" width="600">
```

```{admonition} RULES to keep in mind:
- The `target-list` contains 
    - "attribute names"
    - terms with aggregate operations (e.g., MIN (S.age)).
- Attributes in "attribute names" must also be in `grouping-list`.
    - each answer tuple corresponds to a group, 
    - group = a set of tuples with same value for all attributes in grouping-list
    - selected attributes must have a single value per group.
- Attributes in `group-qualification` are either in `grouping-list`  or are arguments to an aggregate operator.
```

Recap of the order in which the clauses are executed:

    `FROM` and `JOIN`
          |
        `WHERE`
          |
      `GROUP BY`
          |
        `HAVING`
          |
        `SELECT`
          |
      `DISTINCT`
          |
      `ORDER BY`
          |
        `LIMIT`

```{admonition} iclicker: Does the following query work?

    select continent, code, avg(population) as avg_pop
    from country
    group by continent
    having avg(population) > 100000000
    order by avg_pop desc;

A. Yes

B. No
```

```{toggle}
No, because `code` is neither in the grouping list nor inside an aggregate function, so it does not have a single value per continent.
```

## Multilevel grouping

We can also group on multiple columns. Let's look at an example:

In [12]:
%%sql

SELECT continent, region, AVG(population)::INT
FROM country
GROUP BY continent, region
ORDER BY continent, region;

 * postgresql://postgres:***@localhost:5432/world
25 rows affected.


continent,region,avg
Africa,Central Africa,10628000
Africa,Eastern Africa,12349950
Africa,Northern Africa,24752286
Africa,Southern Africa,9377200
Africa,Western Africa,13039529
Antarctica,Antarctica,0
Asia,Eastern Asia,188416000
Asia,Middle East,10465594
Asia,Southeast Asia,47140091
Asia,Southern and Central Asia,106484000


## Joins

```{margin}
<img src="img/join.png" width="600">
```
```sql
SELECT
    columns
FROM
    left_table
join_type
    right_table
ON
    join_condition
WHERE
    row_filter
GROUP BY
    columns
HAVING
    group_filter
ORDER BY
    columns
;
```

```{admonition} Discussion: Why do we have multiple tables instead of just one?
<img src="img/discuss.png" width="120">
```

```{toggle}
Please refer to lecture 1, and the anomalies that we discussed.
```

In [13]:
%sql CREATE DATABASE mds;

 * postgresql://postgres:***@localhost:5432/world
(psycopg2.errors.DuplicateDatabase) database "mds" already exists

[SQL: CREATE DATABASE mds;]
(Background on this error at: https://sqlalche.me/e/14/f405)


In [14]:
%sql postgresql://{username}:{password}@{host}:{port}/mds

In [15]:
%%sql

DROP TABLE IF EXISTS
    instructor,
    instructor_course,
    course_cohort
;

CREATE TABLE instructor (
    id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT,
    phone VARCHAR(12),
    department VARCHAR(50)
    )
;

INSERT INTO
    instructor (id, name, email, phone, department)
VALUES
    (1, 'Mike', 'mike@mds.ubc.ca', '605-332-2343', 'Computer Science'),
    (2, 'Tiffany', 'tiff@mds.ubc.ca', '445-794-2233', 'Neuroscience'),
    (3, 'Arman', 'arman@mds.ubc.ca', '935-738-5796', 'Physics'),
    (4, 'Varada', 'varada@mds.ubc.ca', '243-924-4446', 'Computer Science'),
    (5, 'Quan', 'quan@mds.ubc.ca', '644-818-0254', 'Economics'),
    (6, 'Joel', 'joel@mds.ubc.ca', '773-432-7669', 'Biomedical Engineering'),
    (7, 'Florencia', 'flor@mds.ubc.ca', '773-926-2837', 'Biology'),
    (8, 'Alexi', 'alexiu@mds.ubc.ca', '421-888-4550', 'Statistics'),
    (15, 'Vincenzo', 'vincenzo@mds.ubc.ca', '776-543-1212', 'Statistics'),
    (19, 'Gittu', 'gittu@mds.ubc.ca', '776-334-1132', 'Biomedical Engineering'),
    (16, 'Jessica', 'jessica@mds.ubc.ca', '211-990-1762', 'Computer Science')
;

    
CREATE TABLE instructor_course (
    id SERIAL PRIMARY KEY,
    instructor_id INTEGER,
    course TEXT,
    enrollment INTEGER,
    begins DATE
    )
;

INSERT INTO
    instructor_course (instructor_id, course, enrollment, begins)
VALUES
    (8, 'Statistical Inference and Computation I', 125, '2021-10-01'),
    (8, 'Regression II', 102, '2022-02-05'),
    (1, 'Descriptive Statistics and Probability', 79, '2021-09-10'),
    (1, 'Algorithms and Data Structures', 25, '2021-10-01'),
    (3, 'Algorithms and Data Structures', 25, '2021-10-01'),
    (3, 'Python Programming', 133, '2021-09-07'),
    (3, 'Databases & Data Retrieval', 118, '2021-11-16'),
    (6, 'Visualization I', 155, '2021-10-01'),
    (6, 'Privacy, Ethics & Security', 148, '2022-03-01'),
    (2, 'Programming for Data Manipulation', 160, '2021-09-08'),
    (7, 'Data Science Workflows', 98, '2021-09-15'),
    (2, 'Data Science Workflows', 98, '2021-09-15'),
    (12, 'Web & Cloud Computing', 78, '2022-02-10'),
    (10, 'Introduction to Optimization', NULL, '2022-09-01'),
    (9, 'Parallel Computing', NULL, '2023-01-10'),
    (13, 'Natural Language Processing', NULL, '2023-09-10')
;

CREATE TABLE course_cohort (
    id INTEGER,
    cohort VARCHAR(7)
    )
;

INSERT INTO
    course_cohort (id, cohort)
VALUES
    (13, 'MDS-CL'),
    (8, 'MDS-CL'),
    (1, 'MDS-CL'),
    (3, 'MDS-CL'),
    (1, 'MDS-V'),
    (9, 'MDS-V'),
    (3, 'MDS-V')
;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
Done.
Done.
11 rows affected.
Done.
16 rows affected.
Done.
7 rows affected.


[]

We have created dummy tables in the lecture notes to explain the joins. Let's look at the examples below.

Here, I am going to use just a few of those dummy rows (sounds crazy) so that I can walk you through the process. (You have already seen this in 523; I am just refreshing your memory.)

## Cross join 

```sql
SELECT *
FROM
    instructor
CROSS JOIN
    instructor_course
;
```

Here is an animation of what it looks like:

<img src="img/crossjoin.png" width="600">

```{admonition} Discussion: The following query gives an error. Can you think of the reason?
    SELECT
        name, id, course
    FROM
        instructor
    CROSS JOIN
        instructor_course
    LIMIT 10 ;

<img src="img/selectambig.png" width="500">
```

```{toggle}
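Both `instructor` and `instructor_course` have a column named `id`, so the unqualified `id` in the `SELECT` list is ambiguous.

<img src="img/ambiguos1.png" width="600">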
```

The following query works because we resolved the ambiguous column name by qualifying `id` with a table alias.

In [16]:
%%sql
SELECT
    name, i.id, course
FROM
    instructor AS i
CROSS JOIN
    instructor_course AS ic
LIMIT 10 ;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
10 rows affected.


name,id,course
Mike,1,Statistical Inference and Computation I
Tiffany,2,Statistical Inference and Computation I
Arman,3,Statistical Inference and Computation I
Varada,4,Statistical Inference and Computation I
Quan,5,Statistical Inference and Computation I
Joel,6,Statistical Inference and Computation I
Florencia,7,Statistical Inference and Computation I
Alexi,8,Statistical Inference and Computation I
Vincenzo,15,Statistical Inference and Computation I
Gittu,19,Statistical Inference and Computation I
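
Two properties of the cross join can be checked on small tables: it returns every pairing of rows (rows in the left table times rows in the right table), and an unqualified column that exists in both tables is rejected. This is a minimal sketch with an in-memory SQLite database standing in for PostgreSQL (an assumption). The error wording differs between engines, and the reduced tables below are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE instructor_course (id INTEGER, instructor_id INTEGER, course TEXT)")
conn.executemany("INSERT INTO instructor VALUES (?, ?)",
                 [(1, "Mike"), (2, "Tiffany"), (3, "Arman")])
conn.executemany("INSERT INTO instructor_course VALUES (?, ?, ?)",
                 [(1, 1, "Algorithms"), (2, 3, "Python Programming")])

# Qualifying id with the alias removes the ambiguity: 3 x 2 = 6 pairings.
pairs = conn.execute(
    "SELECT name, i.id, course FROM instructor AS i "
    "CROSS JOIN instructor_course AS ic"
).fetchall()
n_rows = len(pairs)

try:
    # id exists in both tables, so the engine cannot tell which one is meant.
    conn.execute("SELECT name, id, course FROM instructor CROSS JOIN instructor_course")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(n_rows)
print(error_message)
```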


```{important}
Once you create an alias for a table, you should only use the alias to refer to that table in the statement. For example, the following query would throw an error.
```

In [17]:
%%sql
SELECT
    name, instructor.id, course
FROM
    instructor AS i
CROSS JOIN
    instructor_course AS ic
LIMIT 10 ;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
(psycopg2.errors.UndefinedTable) invalid reference to FROM-clause entry for table "instructor"
LINE 2:     name, instructor.id, course
                  ^
HINT:  Perhaps you meant to reference the table alias "i".

[SQL: SELECT
    name, instructor.id, course
FROM
    instructor AS i
CROSS JOIN
    instructor_course AS ic
LIMIT 10 ;]
(Background on this error at: https://sqlalche.me/e/14/f405)


Let's get to other joins now.

- Inner join (mostly used)
- Left join
- Right join
- Full outer join

## Inner join

In [18]:
%%sql

SELECT
    name, i.id, ic.instructor_id, course
FROM
    instructor AS i
INNER JOIN -- INNER is optional
    instructor_course AS ic
ON
    i.id = ic.instructor_id
;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
12 rows affected.


name,id,instructor_id,course
Alexi,8,8,Statistical Inference and Computation I
Alexi,8,8,Regression II
Mike,1,1,Descriptive Statistics and Probability
Mike,1,1,Algorithms and Data Structures
Arman,3,3,Algorithms and Data Structures
Arman,3,3,Python Programming
Arman,3,3,Databases & Data Retrieval
Joel,6,6,Visualization I
Joel,6,6,"Privacy, Ethics & Security"
Tiffany,2,2,Programming for Data Manipulation


Here is an animation of what it looks like:

<img src="img/innerjoin.png" width="600">
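
An inner join keeps only the pairs of rows that satisfy the join condition. Rows without a partner on either side disappear. Here is a minimal sketch on reduced, hypothetical versions of the two tables, using an in-memory SQLite database as a stand-in for PostgreSQL (an assumption).

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE instructor_course (instructor_id INTEGER, course TEXT)")
# Varada (id 4) teaches nothing; instructor_id 12 matches no instructor.
conn.executemany("INSERT INTO instructor VALUES (?, ?)",
                 [(1, "Mike"), (2, "Tiffany"), (4, "Varada")])
conn.executemany("INSERT INTO instructor_course VALUES (?, ?)",
                 [(1, "Algorithms"), (1, "Statistics"),
                  (2, "Workflows"), (12, "Web & Cloud Computing")])

matched = conn.execute(
    "SELECT i.name, ic.course FROM instructor AS i "
    "INNER JOIN instructor_course AS ic ON i.id = ic.instructor_id "
    "ORDER BY i.id, ic.course"
).fetchall()
print(matched)
```

Neither Varada nor the course with `instructor_id` 12 appears in the result.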

## Left join

In [19]:
%%sql

SELECT
    * -- selecting * to show you the full joined result; a question that asks only for names should select just the name
FROM
    instructor AS i
LEFT JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
17 rows affected.


id,name,email,phone,department,id_1,instructor_id,course,enrollment,begins
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics,1.0,8.0,Statistical Inference and Computation I,125.0,2021-10-01
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics,2.0,8.0,Regression II,102.0,2022-02-05
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science,3.0,1.0,Descriptive Statistics and Probability,79.0,2021-09-10
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science,4.0,1.0,Algorithms and Data Structures,25.0,2021-10-01
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics,5.0,3.0,Algorithms and Data Structures,25.0,2021-10-01
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics,6.0,3.0,Python Programming,133.0,2021-09-07
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics,7.0,3.0,Databases & Data Retrieval,118.0,2021-11-16
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering,8.0,6.0,Visualization I,155.0,2021-10-01
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering,9.0,6.0,"Privacy, Ethics & Security",148.0,2022-03-01
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience,10.0,2.0,Programming for Data Manipulation,160.0,2021-09-08


How can this be helpful? As an example, we can return the names of instructors who don’t teach any courses with the following query:

In [20]:
%%sql

SELECT
    name 
FROM
    instructor AS i
LEFT JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
WHERE
    ic.course IS NULL
;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
5 rows affected.


name
Vincenzo
Quan
Gittu
Jessica
Varada
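
The pattern is: keep every instructor with a left join, then keep only the rows where the right-hand side came back empty. This is a minimal sketch on reduced, hypothetical tables, with an in-memory SQLite database standing in for PostgreSQL (an assumption).

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE instructor_course (instructor_id INTEGER, course TEXT)")
conn.executemany("INSERT INTO instructor VALUES (?, ?)",
                 [(1, "Mike"), (2, "Tiffany"), (4, "Varada")])
conn.executemany("INSERT INTO instructor_course VALUES (?, ?)",
                 [(1, "Algorithms"), (1, "Statistics"), (2, "Workflows")])

# Every instructor is kept; unmatched ones get NULL in the course column.
all_rows = conn.execute(
    "SELECT i.name, ic.course FROM instructor AS i "
    "LEFT JOIN instructor_course AS ic ON i.id = ic.instructor_id "
    "ORDER BY i.id, ic.course"
).fetchall()

# Filtering on that NULL leaves the instructors who teach nothing.
free = [row[0] for row in conn.execute(
    "SELECT i.name FROM instructor AS i "
    "LEFT JOIN instructor_course AS ic ON i.id = ic.instructor_id "
    "WHERE ic.course IS NULL"
)]
print(all_rows)
print(free)  # ['Varada']
```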


## Right join
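
A right join is the mirror image of a left join: it keeps every row of the right table and fills the left table's columns with NULL where there is no match. `a RIGHT JOIN b` therefore matches the same rows as `b LEFT JOIN a`. The minimal sketch below writes it in the mirrored `LEFT JOIN` form, because `RIGHT JOIN` is only available in newer SQLite versions and SQLite is used here as a stand-in for PostgreSQL (an assumption). The reduced tables are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE instructor_course (instructor_id INTEGER, course TEXT)")
conn.executemany("INSERT INTO instructor VALUES (?, ?)",
                 [(1, "Mike"), (2, "Tiffany")])
conn.executemany("INSERT INTO instructor_course VALUES (?, ?)",
                 [(1, "Algorithms"), (2, "Workflows"), (12, "Web & Cloud Computing")])

# Equivalent to: instructor AS i RIGHT JOIN instructor_course AS ic ON ...
# Every course is kept; courses without a matching instructor get a NULL name.
courses = conn.execute(
    "SELECT ic.course, i.name FROM instructor_course AS ic "
    "LEFT JOIN instructor AS i ON i.id = ic.instructor_id "
    "ORDER BY ic.course"
).fetchall()

# Courses that need an instructor are the rows where the instructor side is NULL.
unstaffed = [course for course, name in courses if name is None]
print(courses)
print(unstaffed)  # ['Web & Cloud Computing']
```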

## Full outer join

In [21]:
%%sql

SELECT
    *
FROM
    instructor AS i
FULL OUTER JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
21 rows affected.


id,name,email,phone,department,id_1,instructor_id,course,enrollment,begins
8.0,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics,1.0,8.0,Statistical Inference and Computation I,125.0,2021-10-01
8.0,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics,2.0,8.0,Regression II,102.0,2022-02-05
1.0,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science,3.0,1.0,Descriptive Statistics and Probability,79.0,2021-09-10
1.0,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science,4.0,1.0,Algorithms and Data Structures,25.0,2021-10-01
3.0,Arman,arman@mds.ubc.ca,935-738-5796,Physics,5.0,3.0,Algorithms and Data Structures,25.0,2021-10-01
3.0,Arman,arman@mds.ubc.ca,935-738-5796,Physics,6.0,3.0,Python Programming,133.0,2021-09-07
3.0,Arman,arman@mds.ubc.ca,935-738-5796,Physics,7.0,3.0,Databases & Data Retrieval,118.0,2021-11-16
6.0,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering,8.0,6.0,Visualization I,155.0,2021-10-01
6.0,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering,9.0,6.0,"Privacy, Ethics & Security",148.0,2022-03-01
2.0,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience,10.0,2.0,Programming for Data Manipulation,160.0,2021-09-08


How can this be helpful? We can now write a query to find instructors who are free to teach a course, and courses that need an instructor:

In [22]:
%%sql

SELECT
    *
FROM
    instructor AS i
FULL OUTER JOIN
    instructor_course AS ic
ON
    i.id = ic.instructor_id
WHERE
    i.name IS NULL
    OR
    ic.course IS NULL
;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
9 rows affected.


id,name,email,phone,department,id_1,instructor_id,course,enrollment,begins
,,,,,13.0,12.0,Web & Cloud Computing,78.0,2022-02-10
,,,,,14.0,10.0,Introduction to Optimization,,2022-09-01
,,,,,15.0,9.0,Parallel Computing,,2023-01-10
,,,,,16.0,13.0,Natural Language Processing,,2023-09-10
15.0,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics,,,,,
5.0,Quan,quan@mds.ubc.ca,644-818-0254,Economics,,,,,
19.0,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering,,,,,
16.0,Jessica,jessica@mds.ubc.ca,211-990-1762,Computer Science,,,,,
4.0,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science,,,,,


```{admonition} Homework discussion question: What’s the difference between a cross join and a full outer join?
<img src="img/discuss.png" width="120">
```

## Natural join

```{admonition} Discussion question: How will a natural join work for the following 2 tables?
<img src="img/2tables.png" width="600">
```

```{toggle}
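`NATURAL JOIN` joins on the columns that have the same name in both tables. So in this case it joins on the `id` column.

    SELECT *
    FROM
        instructor_course ic
    NATURAL JOIN
        instructor i
    ;

This matches the same rows as the query below (although with `SELECT *`, the natural join returns the shared `id` column only once):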
    
    SELECT *
    FROM
        instructor_course ic
    INNER JOIN
        instructor i
    ON
        ic.id = i.id
    ;

```

### Showcase (with common column)

In [23]:
%%sql
select * from instructor LIMIT 2;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
2 rows affected.


id,name,email,phone,department
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience


In [24]:
%%sql
select * from course_cohort LIMIT 2;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
2 rows affected.


id,cohort
13,MDS-CL
8,MDS-CL


In [25]:
%%sql

SELECT
    *
FROM
    instructor AS i
NATURAL JOIN
    course_cohort AS cc

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
5 rows affected.


id,name,email,phone,department,cohort
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics,MDS-CL
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science,MDS-CL
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics,MDS-CL
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science,MDS-V
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics,MDS-V


### Dummy tables to test out NATURAL JOIN (no common column)

We are going to recreate the `course_cohort` table so that it has no column names in common with the `instructor` table. The natural join will then return the Cartesian product (the cross join), which is nothing meaningful.

In [26]:
%%sql
drop table if EXISTS course_cohort;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
Done.


[]

In [27]:
%%sql
CREATE TABLE course_cohort (
    cohort_id INTEGER,
    cohort VARCHAR(7)
    )
;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
Done.


[]

In [28]:
%%sql
INSERT INTO
    course_cohort (cohort_id, cohort)
VALUES
    (13, 'MDS-CL'),
    (8, 'MDS-CL'),
    (1, 'MDS-CL'),
    (3, 'MDS-CL'),
    (1, 'MDS-V'),
    (9, 'MDS-V'),
    (3, 'MDS-V')
;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
7 rows affected.


[]

In [29]:
%%sql
select * from course_cohort LIMIT 2;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
2 rows affected.


cohort_id,cohort
13,MDS-CL
8,MDS-CL


In [30]:
%%sql

SELECT
    *
FROM
    instructor AS i
    NATURAL JOIN
    course_cohort AS cc
;

 * postgresql://postgres:***@localhost:5432/mds
   postgresql://postgres:***@localhost:5432/world
77 rows affected.


id,name,email,phone,department,cohort_id,cohort
1,Mike,mike@mds.ubc.ca,605-332-2343,Computer Science,13,MDS-CL
2,Tiffany,tiff@mds.ubc.ca,445-794-2233,Neuroscience,13,MDS-CL
3,Arman,arman@mds.ubc.ca,935-738-5796,Physics,13,MDS-CL
4,Varada,varada@mds.ubc.ca,243-924-4446,Computer Science,13,MDS-CL
5,Quan,quan@mds.ubc.ca,644-818-0254,Economics,13,MDS-CL
6,Joel,joel@mds.ubc.ca,773-432-7669,Biomedical Engineering,13,MDS-CL
7,Florencia,flor@mds.ubc.ca,773-926-2837,Biology,13,MDS-CL
8,Alexi,alexiu@mds.ubc.ca,421-888-4550,Statistics,13,MDS-CL
15,Vincenzo,vincenzo@mds.ubc.ca,776-543-1212,Statistics,13,MDS-CL
19,Gittu,gittu@mds.ubc.ca,776-334-1132,Biomedical Engineering,13,MDS-CL


```{note}
With no common column names, `NATURAL JOIN` returned a Cartesian product (cross join): 11 instructors × 7 cohort rows = 77 rows, none of them meaningful. Avoid `NATURAL JOIN`, or be extremely careful when you use it.
```
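
The two behaviours can be compared side by side by counting rows. This is a minimal sketch with an in-memory SQLite database standing in for PostgreSQL (an assumption). The reduced tables and the table name `course_cohort_renamed` are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE course_cohort (id INTEGER, cohort TEXT)")
conn.execute("CREATE TABLE course_cohort_renamed (cohort_id INTEGER, cohort TEXT)")

instructors = [(1, "Mike"), (3, "Arman"), (4, "Varada"), (8, "Alexi")]
cohorts = [(13, "MDS-CL"), (8, "MDS-CL"), (1, "MDS-CL"), (1, "MDS-V")]
conn.executemany("INSERT INTO instructor VALUES (?, ?)", instructors)
conn.executemany("INSERT INTO course_cohort VALUES (?, ?)", cohorts)
conn.executemany("INSERT INTO course_cohort_renamed VALUES (?, ?)", cohorts)

# Shared column name (id): the natural join matches rows on id.
cursor = conn.execute("SELECT * FROM instructor NATURAL JOIN course_cohort")
shared_columns = [col[0] for col in cursor.description]
shared_count = len(cursor.fetchall())

# No shared column name: every instructor is paired with every cohort row.
no_shared_count = len(
    conn.execute("SELECT * FROM instructor NATURAL JOIN course_cohort_renamed").fetchall()
)

print(shared_columns)
print(shared_count, no_shared_count)
```

With the shared `id` column, only ids 1 (twice) and 8 match, giving 3 rows. After the rename the same data produces 4 × 4 = 16 rows, and no error or warning is raised.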

## How to reason through a logic question
- Ask which tables you need to answer the question, and which joins (and what kind of join) you need to perform.
- Decide on the columns you want to select.
- Decide on the grouping you want to perform.
- Decide on the filtering you want to perform: `WHERE` filters rows, and `HAVING` filters groups.
- Order the results if needed.
- Limit the results if needed.

### How to test your logic
Populate the relations with a few dummy tuples, and work through the problem by hand to check that your query returns what you expect. 
Finally, the answer. :-) 
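
As one way to put this into practice, the sketch below answers a made-up question, "which instructors teach at least two courses?", on dummy tuples whose answer is easy to work out by hand, and then compares the query result with that hand-worked answer. It uses an in-memory SQLite database as a stand-in for PostgreSQL (an assumption). The question, the reduced tables, and the variable names are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE instructor_course (instructor_id INTEGER, course TEXT)")
# Dummy tuples: Mike teaches 2 courses, Arman 3, Tiffany 1, Varada 0.
conn.executemany("INSERT INTO instructor VALUES (?, ?)",
                 [(1, "Mike"), (2, "Tiffany"), (3, "Arman"), (4, "Varada")])
conn.executemany("INSERT INTO instructor_course VALUES (?, ?)",
                 [(1, "A"), (1, "B"), (3, "B"), (3, "C"), (3, "D"), (2, "E")])

# Worked out by hand from the dummy tuples above.
expected = [("Arman", 3), ("Mike", 2)]

actual = conn.execute(
    """
    SELECT i.name, COUNT(*) AS n_courses
    FROM instructor AS i
    INNER JOIN instructor_course AS ic   -- tables and kind of join
        ON i.id = ic.instructor_id
    GROUP BY i.id, i.name                -- one group per instructor
    HAVING COUNT(*) >= 2                 -- group filter
    ORDER BY n_courses DESC, i.name
    """
).fetchall()

print(actual == expected)
```

If the comparison prints `False`, the dummy tuples usually make it quick to see which step (join, grouping, or filter) went wrong.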

## Wrap up

<img src="img/wrapup.gif" width="300">

## Moral of the story
- Do you understand the use of the GROUP BY clause?
- Do you understand the use of the HAVING clause?
- Do you understand the use of the WHERE vs. HAVING clause?
- What are the rules to keep in mind when using GROUP BY, HAVING, and SELECT clauses?
- Do you understand the use of different joins?
- Why do you need to be careful with natural join?
- Why do we need to use an alias?
- Do you know how to use an alias?
- Do you know how to perform a join?
- Do you know how to reason through the logic?