# SQL Tutorial 2: Joins

-   Let's use a database of experimental plate assays

In [1]:
%load_ext sql
%config SqlMagic.autolimit = 0
%config SqlMagic.displaylimit = 0
%sql sqlite:///data/assays.db

-   The database has six tables, but we'll start with two: `department` and `staff`
-   Each schema listing below has these columns:
    -   `cid`: column ID (SQL lets us use numeric column IDs in some places, but please don't)
    -   `name`: column name
    -   `type`: data type (examples below are `INTEGER` and `TEXT`)
    -   `notnull`: are `null` values forbidden?
    -   `dflt_value`: default value for this column (we'll worry about this later)
    -   `pk`: is this the primary key for the table? (a major topic of this lesson)

**department**

| cid |   name   | type | notnull | dflt_value | pk |
|-----|----------|------|---------|------------|----|
| 0   | ident    | TEXT | 1       |            | 0  |
| 1   | name     | TEXT | 1       |            | 0  |
| 2   | building | TEXT | 1       |            | 0  |

**staff**

| cid |   name   |  type   | notnull | dflt_value | pk |
|-----|----------|---------|---------|------------|----|
| 0   | ident    | INTEGER | 0       |            | 1  |
| 1   | personal | TEXT    | 1       |            | 0  |
| 2   | family   | TEXT    | 1       |            | 0  |
| 3   | dept     | TEXT    | 0       |            | 0  |
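
The schema listings above have the same six columns that SQLite's `pragma table_info` reports. As a minimal sketch, assuming a `create table` statement that mirrors the `staff` listing (the tutorial's actual DDL is not shown, so the one below is a guess), the listing can be reproduced with Python's built-in `sqlite3` module:

```python
import sqlite3

# In-memory stand-in for the tutorial's `staff` table.
# The DDL is an assumption reconstructed from the schema listing.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table staff (
        ident    INTEGER primary key,
        personal TEXT not null,
        family   TEXT not null,
        dept     TEXT
    )
    """
)

# Each row is (cid, name, type, notnull, dflt_value, pk).
schema = conn.execute("pragma table_info(staff)").fetchall()
for row in schema:
    print(row)
```

In this sketch `dept` allows `null` because some staff have no department, which matters once we start joining.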

-   What do these tables contain?

In [2]:
%%sql
select * from department;

ident,name,building
gen,Genetics,Chesson
hist,Histology,Fashet Extension
mb,Molecular Biology,Chesson


In [3]:
%%sql
select * from staff;

ident,personal,family,dept
1,Jayesh,Ramanathan,
2,Suhana,Chatterjee,mb
3,Ryan,Jain,hist
4,Riaan,Chana,gen
5,Jivika,Shroff,mb
6,Madhup,Barman,mb
7,Yashvi,Chacko,mb
8,Saira,Sastry,mb
9,Badal,Dyal,
10,Ela,Jani,


-   Which department does each person work in?

In [4]:
%%sql
select
    staff.personal,
    staff.family,
    department.name
from staff inner join department
on staff.dept = department.ident;

personal,family,name
Suhana,Chatterjee,Molecular Biology
Ryan,Jain,Histology
Riaan,Chana,Genetics
Jivika,Shroff,Molecular Biology
Madhup,Barman,Molecular Biology
Yashvi,Chacko,Molecular Biology
Saira,Sastry,Molecular Biology


-   A *join* combines all rows from one table with all rows from another
    -   *inner join* only keeps the rows where the `on` condition is satisfied
-   Use `table.column` syntax to avoid ambiguity
    -   Both `staff` and `department` have an `ident` column
-   Staff with no department are not included in the results because `null` doesn't match a department ID
-   Use a *left outer join* to keep all rows from the left table even if there aren't matches
    -   Fill missing values with `null`

In [5]:
%%sql
select
    staff.personal,
    staff.family,
    department.name
from staff left outer join department
on staff.dept = department.ident;

personal,family,name
Jayesh,Ramanathan,
Suhana,Chatterjee,Molecular Biology
Ryan,Jain,Histology
Riaan,Chana,Genetics
Jivika,Shroff,Molecular Biology
Madhup,Barman,Molecular Biology
Yashvi,Chacko,Molecular Biology
Saira,Sastry,Molecular Biology
Badal,Dyal,
Ela,Jani,
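
The difference between the kinds of join discussed above shows up clearly in row counts. The sketch below is an illustration rather than part of the original notebook: it rebuilds `department` and `staff` in memory from the rows printed earlier (the column types are assumed) and counts what each join produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table department (ident text, name text, building text);
    create table staff (ident integer primary key, personal text, family text, dept text);
    """
)
conn.executemany(
    "insert into department values (?, ?, ?)",
    [
        ("gen", "Genetics", "Chesson"),
        ("hist", "Histology", "Fashet Extension"),
        ("mb", "Molecular Biology", "Chesson"),
    ],
)
conn.executemany(
    "insert into staff values (?, ?, ?, ?)",
    [
        (1, "Jayesh", "Ramanathan", None),
        (2, "Suhana", "Chatterjee", "mb"),
        (3, "Ryan", "Jain", "hist"),
        (4, "Riaan", "Chana", "gen"),
        (5, "Jivika", "Shroff", "mb"),
        (6, "Madhup", "Barman", "mb"),
        (7, "Yashvi", "Chacko", "mb"),
        (8, "Saira", "Sastry", "mb"),
        (9, "Badal", "Dyal", None),
        (10, "Ela", "Jani", None),
    ],
)


def count_rows(join_clause):
    """Count the rows produced by `staff <join_clause>`."""
    query = f"select count(*) from staff {join_clause}"
    return conn.execute(query).fetchone()[0]


on = "on staff.dept = department.ident"
cross_rows = count_rows("cross join department")  # every pairing of rows
inner_rows = count_rows(f"inner join department {on}")  # matches only
left_rows = count_rows(f"left outer join department {on}")  # plus unmatched staff

# Comparing null with anything yields null (None in Python), never true.
null_match = conn.execute("select null = 'mb'").fetchone()[0]

print(cross_rows, inner_rows, left_rows, null_match)  # 30 7 10 None
```

The three staff with no department account for the gap between 7 and 10.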


>   A *right outer join* keeps all rows from the right table, even if there aren't matches.
>   SQLite did not implement it before version 3.39.0, but you can achieve the same effect with a left outer join by swapping table order.
>   Version 3.39.0 and later support both *right outer join* and *full outer join*; the latter keeps unmatched rows from both sides (again, filling with nulls as needed).
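
Swapping table order, as the note above suggests, can be sketched as follows. This is an illustration only: it loads just three of the ten staff rows (a deliberate subset, so that one department ends up with nobody in it) and assumes the column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table department (ident text, name text, building text);
    create table staff (ident integer primary key, personal text, family text, dept text);
    """
)
conn.executemany(
    "insert into department values (?, ?, ?)",
    [
        ("gen", "Genetics", "Chesson"),
        ("hist", "Histology", "Fashet Extension"),
        ("mb", "Molecular Biology", "Chesson"),
    ],
)
# Subset of the staff table chosen for this sketch: nobody here is in `mb`.
conn.executemany(
    "insert into staff values (?, ?, ?, ?)",
    [
        (3, "Ryan", "Jain", "hist"),
        (4, "Riaan", "Chana", "gen"),
        (9, "Badal", "Dyal", None),
    ],
)

# `department` is now the left table, so every department is kept.
rows = conn.execute(
    """
    select department.name, staff.family
    from department left outer join staff
    on staff.dept = department.ident
    order by department.ident
    """
).fetchall()

for row in rows:
    print(row)
```

Molecular Biology appears with a null family name, while Badal Dyal (who has no department) is dropped because `staff` is no longer the table being preserved.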

-   Let's look at two more tables: `experiment` records the experiments that have been done, and `performed` records who has done which

**experiment**

| cid |  name   |  type   | notnull | dflt_value | pk |
|-----|---------|---------|---------|------------|----|
| 0   | ident   | INTEGER | 0       |            | 1  |
| 1   | kind    | TEXT    | 1       |            | 0  |
| 2   | started | TEXT    | 1       |            | 0  |
| 3   | ended   | TEXT    | 0       |            | 0  |

**performed**

| cid |    name    |  type   | notnull | dflt_value | pk |
|-----|------------|---------|---------|------------|----|
| 0   | staff      | INTEGER | 1       |            | 0  |
| 1   | experiment | INTEGER | 1       |            | 0  |

-   `performed` is sometimes called a *join table* because its only purpose is to connect two other tables.
-   Why is it needed?
-   The relationship between `department` and `staff` is *one-to-many*
    -   Represent this by storing a *foreign key* in `staff` that refers to a *primary key* in `department`
-   The relationship between `staff` and `experiment` is *many-to-many*
    -   One person might do many experiments
    -   Each experiment might be done by many people
    -   So store each (person, experiment) pair in `performed`
-   There are a lot of experiments, so let's be selective

In [6]:
%%sql
select *
from staff inner join performed inner join experiment
on staff.ident = performed.staff and performed.experiment = experiment.ident
where staff.ident in (1, 2);

ident,personal,family,dept,staff,experiment,ident_1,kind,started,ended
1,Jayesh,Ramanathan,,1,4,4,trial,2023-11-03,2023-11-04
1,Jayesh,Ramanathan,,1,6,6,trial,2023-11-07,2023-11-08
1,Jayesh,Ramanathan,,1,10,10,calibration,2023-11-01,2023-11-01
1,Jayesh,Ramanathan,,1,11,11,calibration,2023-11-09,2023-11-09
1,Jayesh,Ramanathan,,1,12,12,trial,2023-11-04,2023-11-06
1,Jayesh,Ramanathan,,1,17,17,calibration,2023-11-05,2023-11-05
1,Jayesh,Ramanathan,,1,27,27,trial,2023-11-09,
2,Suhana,Chatterjee,mb,2,33,33,trial,2023-11-03,2023-11-04
1,Jayesh,Ramanathan,,1,33,33,trial,2023-11-03,2023-11-04
1,Jayesh,Ramanathan,,1,34,34,calibration,2023-11-07,2023-11-07
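
The join-table design described above can be sketched in miniature. The DDL below is an assumption (the tutorial does not show how its tables were created, or whether they declare foreign keys), and only a few rows from the output above are loaded. The query gives each join its own `on` clause, which is an alternative spelling of the same three-table join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite ignores foreign keys unless asked
conn.executescript(
    """
    create table staff (ident integer primary key, personal text, family text);
    create table experiment (ident integer primary key, kind text, started text, ended text);
    -- Assumed DDL: one row per (person, experiment) pair.
    create table performed (
        staff      integer not null references staff (ident),
        experiment integer not null references experiment (ident)
    );
    """
)
conn.executemany(
    "insert into staff values (?, ?, ?)",
    [(1, "Jayesh", "Ramanathan"), (3, "Ryan", "Jain"), (7, "Yashvi", "Chacko")],
)
conn.executemany(
    "insert into experiment values (?, ?, ?, ?)",
    [
        (3, "trial", "2023-11-06", "2023-11-07"),
        (4, "trial", "2023-11-03", "2023-11-04"),
        (6, "trial", "2023-11-07", "2023-11-08"),
    ],
)
# Experiment 3 has two people; person 1 has two experiments.
conn.executemany("insert into performed values (?, ?)", [(3, 3), (7, 3), (1, 4), (1, 6)])

# A pair that points at a missing person violates the foreign key.
try:
    conn.execute("insert into performed values (99, 3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute(
    """
    select staff.family, experiment.ident
    from staff
    inner join performed on staff.ident = performed.staff
    inner join experiment on performed.experiment = experiment.ident
    order by experiment.ident, staff.ident
    """
).fetchall()

print(rejected, rows)
```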


-   First step: database combines `staff` with `performed` by matching primary key in the former to foreign key in the latter
    -   We'll only show the first few rows of the result

In [7]:
%%sql
select *
from staff inner join performed
on staff.ident = performed.staff
limit 5;

ident,personal,family,dept,staff,experiment
5,Jivika,Shroff,mb,5,1
8,Saira,Sastry,mb,8,2
3,Ryan,Jain,hist,3,3
7,Yashvi,Chacko,mb,7,3
1,Jayesh,Ramanathan,,1,4


-   Second step: database combines this temporary table with `experiment` by matching keys
    -   Again, only show a few rows

In [8]:
%%sql
select *
from staff inner join performed inner join experiment
on staff.ident = performed.staff and performed.experiment = experiment.ident
limit 5;

ident,personal,family,dept,staff,experiment,ident_1,kind,started,ended
5,Jivika,Shroff,mb,5,1,1,trial,2023-11-09,
8,Saira,Sastry,mb,8,2,2,trial,2023-11-04,2023-11-05
3,Ryan,Jain,hist,3,3,3,trial,2023-11-06,2023-11-07
7,Yashvi,Chacko,mb,7,3,3,trial,2023-11-06,2023-11-07
1,Jayesh,Ramanathan,,1,4,4,trial,2023-11-03,2023-11-04


-   Third step: database filters the result using `in` and a list of specific staff IDs
    -   `staff.ident in (1, 2)` is the same as `(staff.ident = 1) or (staff.ident = 2)`
    -   Again, only show a few rows (we saw the full output earlier)

In [9]:
%%sql
select *
from staff inner join performed inner join experiment
on staff.ident = performed.staff and performed.experiment = experiment.ident
where staff.ident in (1, 2)
limit 5;

ident,personal,family,dept,staff,experiment,ident_1,kind,started,ended
1,Jayesh,Ramanathan,,1,4,4,trial,2023-11-03,2023-11-04
1,Jayesh,Ramanathan,,1,6,6,trial,2023-11-07,2023-11-08
1,Jayesh,Ramanathan,,1,10,10,calibration,2023-11-01,2023-11-01
1,Jayesh,Ramanathan,,1,11,11,calibration,2023-11-09,2023-11-09
1,Jayesh,Ramanathan,,1,12,12,trial,2023-11-04,2023-11-06
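
The equivalence between `in` and a chain of `or` conditions mentioned above can be checked on a tiny table. This is a sketch with a cut-down, assumed version of `staff` holding four of the rows shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table staff (ident integer primary key, family text)")
conn.executemany(
    "insert into staff values (?, ?)",
    [(1, "Ramanathan"), (2, "Chatterjee"), (3, "Jain"), (4, "Chana")],
)

with_in = conn.execute(
    "select ident from staff where ident in (1, 2) order by ident"
).fetchall()
with_or = conn.execute(
    "select ident from staff where (ident = 1) or (ident = 2) order by ident"
).fetchall()

print(with_in == with_or, with_in)  # True [(1,), (2,)]
```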


-   Notice that the `ident` column from `experiment` has been named `ident_1` in the output
-   Better practice to slim down the columns (e.g., remove the columns from the join table) and rename any duplicates
-   Just for fun, we will order by start date and add `offset 5` to skip the first five rows
    -   Note: the database sorts *then* slices, so `limit` and `offset` apply to the ordered rows

In [15]:
%%sql
select
    staff.personal,
    staff.family,
    experiment.ident as experiment_id,
    experiment.kind,
    experiment.started,
    experiment.ended
from staff inner join performed inner join experiment
on staff.ident = performed.staff and performed.experiment = experiment.ident
where staff.ident in (1, 2)
order by experiment.started asc
limit 5 offset 5;

personal,family,experiment_id,kind,started,ended
Suhana,Chatterjee,37,trial,2023-11-04,2023-11-05
Jayesh,Ramanathan,42,trial,2023-11-04,2023-11-05
Jayesh,Ramanathan,17,calibration,2023-11-05,2023-11-05
Jayesh,Ramanathan,45,trial,2023-11-05,2023-11-06
Jayesh,Ramanathan,49,trial,2023-11-06,2023-11-08
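
"Sort then slice" can be made concrete by comparing the query with the same operation done by hand in Python. The sketch below uses a handful of (experiment ID, start date) pairs copied from the earlier output and an assumed two-column table. It is not the full `experiment` table, so the rows returned differ from those above.

```python
import sqlite3

# (ident, started) pairs taken from the earlier output; a partial sample only.
rows = [
    (4, "2023-11-03"),
    (6, "2023-11-07"),
    (10, "2023-11-01"),
    (11, "2023-11-09"),
    (12, "2023-11-04"),
    (17, "2023-11-05"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table experiment (ident integer primary key, started text)")
conn.executemany("insert into experiment values (?, ?)", rows)

# The database orders all rows first, then skips 2 and keeps the next 2.
page = [
    r[0]
    for r in conn.execute(
        "select ident from experiment order by started asc limit 2 offset 2"
    )
]

# The same thing by hand: sort, then slice.
same_in_python = [ident for ident, _ in sorted(rows, key=lambda r: r[1])[2:4]]

print(page, same_in_python)  # [12, 17] [12, 17]
```

ISO-formatted dates stored as text sort correctly because their lexical order matches their chronological order.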


-   How many experiments of each kind has each person been involved in?

In [21]:
%%sql
select
    staff.personal,
    staff.family,
    experiment.kind,
    count(*) as num_experiment_kind
from staff inner join performed inner join experiment
on staff.ident = performed.staff and performed.experiment = experiment.ident
group by staff.ident, experiment.kind
order by staff.family, staff.personal, experiment.kind;

personal,family,kind,num_experiment_kind
Madhup,Barman,calibration,2
Madhup,Barman,trial,1
Yashvi,Chacko,calibration,1
Yashvi,Chacko,trial,7
Riaan,Chana,trial,7
Suhana,Chatterjee,trial,2
Badal,Dyal,calibration,2
Badal,Dyal,trial,3
Ryan,Jain,calibration,3
Ryan,Jain,trial,3


-   Riaan Chana and Suhana Chatterjee weren't involved in any calibration experiments
-   We would like to see rows with a count of 0 for them