# Creating FOREIGN KEYS


In the Part 8 notebooks, we looked at how to define tables which represent entities, and some of the constraints that apply to individual tables, in particular the primary key. In the Part 9 notebooks, we will look more closely at foreign key relationships, which implement the relationships that hold *between* entities.

In this notebook, we will look at:

- how to define a foreign key constraint as part of a `CREATE TABLE` statement, and
- how to add foreign key constraints to existing tables.

You should spend around one hour on this notebook.

## Setting up

The next group of cells sets up your database connection and resets the database to a clean state. Check notebook *08.1 Data Definition Language in SQL* if you are unsure what the next cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb
print("Connecting with connection string : {}".format(DB_CONNECTION))
%sql $DB_CONNECTION

In [None]:
%run reset_databases.ipynb

## Foreign Key References as Part of a CREATE TABLE Statement



At this point it is worth observing how foreign keys are implemented *as constraints* (in PostgreSQL at least) when they are defined as part of a column definition in a `CREATE TABLE` statement.

In [Activity 9.3](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%209%20Relational%20data%20modelling&targetptr=3), we saw that the relationship between the doctor and patient was represented in the Entity Relationship Diagram as:

![Relationship between patient and doctor](patient-doctor-fk.jpg)

Each patient is assigned to exactly one doctor, and each doctor may be responsible for zero, one or many patients. We then saw in [Part 9, section 5](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%209%20Relational%20data%20modelling&targetptr=5) that the `doctor_id` column in the `patient` table can be used to reference the `doctor` table as a foreign key, thereby implementing the patient-doctor relationship. We can extend the definition of `CREATE TABLE` to implement such a relationship by defining appropriate foreign keys:


<code>CREATE TABLE &#x2329;table_name&#x232A;(   
     &#x2329;column_name&#x232A; &#x2329;data_type&#x232A;,   
     &#x2329;column_name&#x232A; &#x2329;data_type&#x232A;,
     ... 
     PRIMARY KEY (&#x2329;column_name&#x232A;, &#x2329;column_name&#x232A;, ...),
     ... 
     FOREIGN KEY (&#x2329;column_name&#x232A;, &#x2329;column_name&#x232A;, ...) REFERENCES &#x2329;table_name&#x232A;);</code>


As in notebook *08.1 Data Definition Language in SQL* we can define the `doctor` table using SQL's `CREATE TABLE` statement:

In [None]:
%%sql

DROP TABLE IF EXISTS doctor;

CREATE TABLE doctor (
    
    doctor_id CHAR(4),
    doctor_name VARCHAR(20),
    
    PRIMARY KEY (doctor_id)
 );

We will also define the `patient` table, but unlike in notebook *08.1 Data Definition Language in SQL*, we will now add a declaration that the column `doctor_id` in `patient` is a foreign key referencing the `doctor` table:

In [None]:
%%sql

DROP TABLE IF EXISTS patient;

CREATE TABLE patient (
    
    patient_id CHAR(4),
    patient_name VARCHAR(20),
    date_of_birth DATE,
    gender CHAR(6),
    height_cm DECIMAL(4,1),
    weight_kg DECIMAL(4,1),
    doctor_id CHAR(4),
    
    PRIMARY KEY (patient_id),
    
    FOREIGN KEY (doctor_id) REFERENCES doctor
 );

We can see the foreign key using the schema display magic:

In [None]:
%schema --connection_string $DB_CONNECTION -t patient,doctor

Note that the `patient.doctor_id` column now has a marker `(FK)` to show that it is a foreign key. The text `+ doctor_id` on the connector shows that it forms the foreign key referencing `doctor`.


You may recall from the definition of a foreign key in [Part 9, section 4](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%209%20Relational%20data%20modelling&targetptr=4) that the table containing the primary key is commonly referred to as the **parent table**, and the table containing the foreign key which refers to the parent table is commonly referred to as the corresponding **child table**.

In this case, the `doctor` table, which contains the primary key `doctor_id`, is the __parent__ table. The `patient` table, which contains `doctor_id` as a foreign key, is the __child__ table.
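
The parent and child roles can also be read back from the database itself. The following is a minimal sketch only: it assumes Python's built-in `sqlite3` module as a stand-in for the PostgreSQL server used in this notebook, it uses a cut-down version of the two tables, and it relies on SQLite's own `PRAGMA foreign_key_list` command, which PostgreSQL does not provide.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")

conn.execute("CREATE TABLE doctor (doctor_id CHAR(4) PRIMARY KEY, doctor_name VARCHAR(20))")
conn.execute("""
    CREATE TABLE patient (
        patient_id CHAR(4) PRIMARY KEY,
        doctor_id CHAR(4),
        FOREIGN KEY (doctor_id) REFERENCES doctor (doctor_id)
    )
""")

# Each row returned by the pragma has the form:
# (id, seq, table, from, to, on_update, on_delete, match)
fk_rows = conn.execute("PRAGMA foreign_key_list(patient)").fetchall()

# Keep (child column, parent table, parent column) for each foreign key.
relationships = [(row[3], row[2], row[4]) for row in fk_rows]

for child_column, parent_table, parent_column in relationships:
    print(f"patient.{child_column} -> {parent_table}.{parent_column}")
# prints: patient.doctor_id -> doctor.doctor_id
```

Here `patient` is the child table, because it holds the foreign key, and `doctor` is the parent table that the key points to.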

We can now add some data to the tables. The next cell adds two rows to the `doctor` table, to represent two doctors. (See notebook *08.2 Data Manipulation Language in SQL* if you need to remind yourself how SQL's `INSERT INTO` statements work.)

In [None]:
%%sql

INSERT INTO doctor(doctor_id, doctor_name)
VALUES ('d06', 'Gibson'),
       ('d07', 'Paxton');
    
SELECT *
FROM doctor;

We can also add some rows to the `patient` table. For each of these rows, the value in the `patient.doctor_id` column matches one of the values in the `doctor.doctor_id` column:

In [None]:
%%sql

INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES('p001', 'Thornton', '1980/01/22', 'F', 162.3, 71.6, 'd06'),
      ('p007', 'Tennent', '1980/04/01', 'M', 176.8, 70.9, 'd07'),
      ('p008', 'James', '1980/07/08', 'M', 167.9, 70.5, 'd07'),
      ('p009', 'Kay', '1980/09/25', 'F', 164.7, 53.2, 'd06');
    
SELECT *
FROM patient;

(Note that we are using the SQL dotted notation here, so that `patient.doctor_id` refers to the `doctor_id` column in the `patient` table, and `doctor.doctor_id` refers to the `doctor_id` column in the `doctor` table.)

### Activity 1

What do you think will happen if you try to execute the following statement (with the tables as they are populated at this point in the notebook, and with the foreign key constraint defined in the `patient` table)? Why?

```sql
INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p068', 'Monroe', '1981/02/21', 'F', 165, 62.6, 'd10');
```

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

The following cell executes the given statement:

In [None]:
%%sql

INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p068', 'Monroe', '1981/02/21', 'F', 165, 62.6, 'd10');

If you execute the statement, you should receive an `IntegrityError`, with the additional information that `insert or update on table "patient" violates foreign key constraint`. That is, in this case, the value provided in the `patient.doctor_id` column, `d10`, does not correspond to any of the values in the `doctor.doctor_id` column. This violates the foreign key constraint (as described in [Part 9, section 4](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%209%20Relational%20data%20modelling&targetptr=4.1)), and so this row cannot be inserted.

To add this row to the table, it is necessary for the `doctor` table to contain the value `d10` in the `doctor.doctor_id` column. If we were to add the row:

| doctor_id | doctor_name | 
| ------ | ------ | 
| d10 | Rampton | 

to the `doctor` table, we should then be able to add the necessary row to `patient`.

First, add the appropriate row to the `doctor` table:

In [None]:
%%sql

INSERT INTO doctor(doctor_id, doctor_name)
VALUES ('d10', 'Rampton');

In [None]:
%%sql

SELECT *
FROM doctor;

The `doctor` table should now contain a row with the value `d10` in the `doctor.doctor_id` column. This then permits the referencing row to be added to `patient` without raising a constraint violation error:

In [None]:
%%sql

INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p068', 'Monroe', '1981/02/21', 'F', 165, 62.6, 'd10');

In [None]:
%%sql

SELECT *
FROM patient;

As with primary key constraints, changes to the database are not allowed if they would result in a foreign key constraint being violated. In the case of foreign keys, each value in the foreign key column(s) of the referencing table must match a value in the primary key column(s) of the referenced table.
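
The same sequence of events can be reproduced outside the notebook's database. This is a rough sketch rather than the module's own method: it assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL, uses cut-down versions of the `doctor` and `patient` tables, and introduces a hypothetical helper, `try_insert_patient`, purely for illustration. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

conn.execute("CREATE TABLE doctor (doctor_id CHAR(4) PRIMARY KEY, doctor_name VARCHAR(20))")
conn.execute("""
    CREATE TABLE patient (
        patient_id CHAR(4) PRIMARY KEY,
        patient_name VARCHAR(20),
        doctor_id CHAR(4),
        FOREIGN KEY (doctor_id) REFERENCES doctor (doctor_id)
    )
""")
conn.execute("INSERT INTO doctor VALUES ('d06', 'Gibson')")


def try_insert_patient(patient_id, patient_name, doctor_id):
    """Hypothetical helper: True if the row was inserted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO patient VALUES (?, ?, ?)",
                     (patient_id, patient_name, doctor_id))
        return True
    except sqlite3.IntegrityError:
        return False


# There is no doctor 'd10' yet, so the child row is rejected.
rejected_first = not try_insert_patient('p068', 'Monroe', 'd10')

# Once the parent row exists, the same child row is accepted.
conn.execute("INSERT INTO doctor VALUES ('d10', 'Rampton')")
accepted_second = try_insert_patient('p068', 'Monroe', 'd10')

print(rejected_first, accepted_second)
```

The order matters: the parent row has to exist before any child row that refers to it can be added.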

#### End of Activity 1

-----------------------------------------------

### Activity 2

What do you think will happen if you try to execute the following statement, in which no value is provided for the foreign key, against the table as it is populated at this point in the notebook? Why?

```sql
INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p039', 'Maher', '1981/10/09', 'F', 161.9, 73.0, NULL);
```

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

The following cell executes the given statement:

In [None]:
%%sql

INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p039', 'Maher', '1981/10/09', 'F', 161.9, 73.0, NULL);

In [None]:
%%sql

SELECT *
FROM patient;

The `INSERT` statement should have succeeded in adding a row to `patient`, with no value in the `patient.doctor_id` column. In fact, foreign keys do not *have* to contain a value, but *if* they do, then that value must exist in the referenced column.

#### End of Activity 2

------------------------------------------------------------

The previous activity demonstrates one way of representing *optionality*. Where the foreign key column is not constrained to be `NOT NULL`, we are saying that a row in the referencing table does not have to have a value in the foreign key column, but if it does, then that value *must* correspond to a value in the referenced column.
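
To make the contrast concrete, the sketch below compares an optional foreign key with a mandatory one. It is illustrative only: it assumes Python's built-in `sqlite3` module in place of PostgreSQL, and the table names `patient_optional` and `patient_mandatory` and the helper `accepts` are hypothetical names that do not appear elsewhere in this notebook.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

conn.execute("CREATE TABLE doctor (doctor_id CHAR(4) PRIMARY KEY)")
conn.execute("INSERT INTO doctor VALUES ('d06')")

# Optional relationship: the foreign key column may be NULL.
conn.execute("""
    CREATE TABLE patient_optional (
        patient_id CHAR(4) PRIMARY KEY,
        doctor_id CHAR(4),
        FOREIGN KEY (doctor_id) REFERENCES doctor (doctor_id)
    )
""")

# Mandatory relationship: the foreign key column is also NOT NULL.
conn.execute("""
    CREATE TABLE patient_mandatory (
        patient_id CHAR(4) PRIMARY KEY,
        doctor_id CHAR(4) NOT NULL,
        FOREIGN KEY (doctor_id) REFERENCES doctor (doctor_id)
    )
""")


def accepts(table, patient_id, doctor_id):
    """Hypothetical helper: True if the row is inserted, False if a constraint rejects it."""
    try:
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (patient_id, doctor_id))
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "optional, NULL doctor": accepts("patient_optional", "p039", None),
    "optional, known doctor": accepts("patient_optional", "p001", "d06"),
    "optional, unknown doctor": accepts("patient_optional", "p068", "d99"),
    "mandatory, NULL doctor": accepts("patient_mandatory", "p039", None),
    "mandatory, known doctor": accepts("patient_mandatory", "p001", "d06"),
}

for case, accepted in results.items():
    print(f"{case}: {'accepted' if accepted else 'rejected'}")
```

Only the NULL case behaves differently between the two tables. A value that does not match any doctor is rejected by both.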

## Adding Foreign Key constraints to existing tables

In this section, we will use CSV files to create the database tables, and then define constraints on those. We will use the same techniques to generate SQL tables from dataframes as we saw in notebook *08.3 Adding column constraints to tables*. Refer back to that notebook if you need a reminder of how we used CSV files and the dataframe's `.to_sql` method to generate SQL tables.

We'll start by simply creating some tables from dataframes. As we have seen, this means that initially, no primary keys are defined on any of these tables.

We want to be able to implement the relationship between the `doctor` table and the `patient` table. Specifically, we want to declare that `doctor_id` in the `patient` table is a foreign key that references `doctor_id` in the `doctor` table.

If we just add a *foreign* key constraint to the `patient` table, what changes are made to that table and what changes, if any, are made to the `doctor` table (referred to via the foreign key relationship) as a result?

To investigate this, we need to drop the existing tables to clear the constraints on them:

In [None]:
%%sql

DROP TABLE IF EXISTS patient;

DROP TABLE IF EXISTS doctor;

Now we can use the `pandas.read_csv` function to read the `doctor.csv` file into a dataframe called `doctor_df`, and then use the `.to_sql` method to export the data into an SQL table:

In [None]:
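# Import the doctor.csv file into a dataframe
# (pandas may already have been imported by sql_init.ipynb; importing it again is harmless)
import pandas as pd

doctor_df = pd.read_csv('./sql_data/doctor.csv')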

doctor_df.to_sql('doctor',
                 DB_CONNECTION,
                 if_exists='replace',
                 index=False
                 )

We can use a `SELECT` query to see that the populated `doctor` table exists:

In [None]:
%%sql

SELECT *
FROM doctor
LIMIT 5;

We can now do exactly the same to create the populated `patient` table:

In [None]:
# Import the patient.csv file into a dataframe
patient_df=pd.read_csv('./sql_data/patient.csv')

# And use the `.to_sql` method to create a new table:

patient_df.to_sql('patient',
                 DB_CONNECTION,
                 if_exists='replace',
                 index=False
                 )

In [None]:
%%sql

SELECT *
FROM patient
LIMIT 5;

We have seen how to add constraints to tables in notebook *08.3 Adding column constraints to tables*. As a reminder, to add a constraint to an existing table in SQL, use the following statement:

<code>ALTER TABLE &#x2329;table name&#x232A;
ADD CONSTRAINT &#x2329;constraint name&#x232A;
    &#x2329;constraint definition&#x232A;;
</code>


Having exported the dataframe data to an SQL table, we should be able to add the foreign key constraint with the next cell:

In [None]:
%%sql

ALTER TABLE patient
ADD CONSTRAINT patient_doctor_fk
    FOREIGN KEY (doctor_id) REFERENCES doctor;

Perhaps surprisingly, the foreign key declaration raises a `ProgrammingError`, stating that `there is no primary key for referenced table "doctor"`. Where a foreign key constraint definition does not specify a particular column in the referenced table, the foreign key is assumed to refer to the referenced table's primary key. Therefore, to be able to define a foreign key in `patient`, we need to have the primary key defined in `doctor` (and we can define a primary key for `patient` at the same time):

In [None]:
%%sql 

ALTER TABLE doctor
ADD CONSTRAINT doctor_pk
    PRIMARY KEY (doctor_id);

ALTER TABLE patient
ADD CONSTRAINT patient_pk
    PRIMARY KEY (patient_id);

Now that we have defined the primary key for `doctor`, we should be able to add the foreign key constraint to `patient`:

In [None]:
%%sql

ALTER TABLE patient
ADD CONSTRAINT patient_doctor_fk
    FOREIGN KEY (doctor_id) REFERENCES doctor;

This should now run without raising an error. We can see the two tables and the relationship between them using the display magic:

In [None]:
%schema --connection_string $DB_CONNECTION -t patient,doctor

#### Recap

When we add a foreign key to the `patient` table that references the `doctor` table, we notice the following things:

- a constraint is added to the `patient` table that associates the column `patient.doctor_id` with a reference to the `doctor.doctor_id` column,
- the `doctor` table is not altered, other than by the requirement that the referenced column, `doctor_id`, be unique (which it is, by virtue of being the primary key), and
- the `patient.doctor_id` column is *not* required to contain only unique values, nor is it required to be `NOT NULL`.

(Again, `patient.doctor_id` means the `doctor_id` column in the `patient` table, and similarly, `doctor.doctor_id` means the `doctor_id` column in the `doctor` table.)
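
When a foreign key constraint is added to a table that already contains data, PostgreSQL by default checks the existing rows against it, so the `ALTER TABLE` statement fails if any row refers to a missing parent. It can therefore be useful to look for such rows first. The sketch below shows one possible check, written as an ordinary `SELECT` query. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the handful of rows are made up for illustration rather than taken from the CSV files.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")

# Unconstrained tables, as they would be after being created from dataframes.
conn.execute("CREATE TABLE doctor (doctor_id TEXT, doctor_name TEXT)")
conn.execute("CREATE TABLE patient (patient_id TEXT, patient_name TEXT, doctor_id TEXT)")

# Illustrative rows only.
conn.executemany("INSERT INTO doctor VALUES (?, ?)",
                 [("d06", "Gibson"), ("d07", "Paxton")])
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                 [("p001", "Thornton", "d06"),   # refers to an existing doctor
                  ("p039", "Maher", None),       # no doctor: allowed by a foreign key
                  ("p068", "Monroe", "d10")])    # refers to a doctor that does not exist

# Find child rows whose non-NULL doctor_id has no matching row in the parent table.
orphans = conn.execute("""
    SELECT patient_id, doctor_id
    FROM patient
    WHERE doctor_id IS NOT NULL
      AND NOT EXISTS (SELECT 1
                      FROM doctor
                      WHERE doctor.doctor_id = patient.doctor_id)
""").fetchall()

print(orphans)
# prints: [('p068', 'd10')]
```

An empty result suggests that the existing rows would satisfy the foreign key. Any rows that are returned would need to be corrected, or the missing parent rows added, before the constraint could be defined.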

### Activity 3

In notebook *08.3 Adding column constraints to tables*, we used four of the CSV files in the directory `sql_data` - `patient.csv`, `doctor.csv`, `drug.csv` and `prescription.csv` - to define the tables `patient`, `doctor`, `drug` and `prescription`. The complete ERD, including the relationships, is shown in the diagram:

![Full ERD for the hospital database](part09_notebooks_ERD.jpg)

Use the two files `drug.csv` and `prescription.csv` to extend the existing database so that it contains the four tables in the diagram, and define the appropriate constraints to define the relationships between all four entities.

Note that we can make multiple foreign key references from one table to several others. In this example, as well as referencing the `patient` table, the `prescription` entity also makes a second foreign key reference to the `drug` table. Specifically, the column `prescription.drug_code` is a foreign key that references the primary key of the `drug` table, `drug.drug_code`.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

To create the `drug` and `prescription` tables, we can carry out the same statements as in notebook *08.3 Adding column constraints to tables*. First create the `drug` table:

In [None]:
# Import the drug.csv file into a dataframe
drug_df=pd.read_csv('./sql_data/drug.csv')

# And use the `.to_sql` method to create a new table:

drug_df.to_sql('drug',
               DB_CONNECTION,
               if_exists='replace',
               index=False
              );

We can look at the first few rows to check that the table has been created:

In [None]:
%%sql

SELECT *
FROM drug
LIMIT 5;

And add the primary key:

In [None]:
%%sql

ALTER TABLE drug
ADD CONSTRAINT drug_pk
    PRIMARY KEY (drug_code);

We can add the `prescription` table in the same way:

In [None]:
# Import the prescription.csv file into a dataframe
prescription_df=pd.read_csv('./sql_data/prescription.csv')

# And use the `.to_sql` method to create a new table:

prescription_df.to_sql('prescription',
                       DB_CONNECTION,
                       if_exists='replace',
                       index=False
                      )

We can look at the first few rows to check that the table has been created:

In [None]:
%%sql

SELECT *
FROM prescription
LIMIT 5;

And add the primary key:

In [None]:
%%sql

ALTER TABLE prescription
ADD CONSTRAINT prescription_pk
    PRIMARY KEY (patient_id, doctor_id, drug_code, date);

We can now add the foreign keys. From the diagram (and from the discussion in [Part 9, section 5](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%209%20Relational%20data%20modelling&targetptr=5)), we know that the foreign keys that we require are:

- `patient.doctor_id` references `doctor`
- `prescription.patient_id` references `patient`
- `prescription.doctor_id` references `doctor`
- `prescription.drug_code` references `drug`


We have already defined the column `patient.doctor_id` to reference `doctor`, so we just need to add the remaining foreign keys:

In [None]:
%%sql

ALTER TABLE prescription
ADD CONSTRAINT prescription_patient_fk
    FOREIGN KEY (patient_id) REFERENCES patient;

In [None]:
%%sql

ALTER TABLE prescription
ADD CONSTRAINT prescription_doctor_fk
    FOREIGN KEY (doctor_id) REFERENCES doctor;

In [None]:
%%sql

ALTER TABLE prescription
ADD CONSTRAINT prescription_drug_fk
    FOREIGN KEY (drug_code) REFERENCES drug;

Finally, use the display magic to see the ERD:

In [None]:
%schema --connection_string $DB_CONNECTION

You may notice that in the diagram above, the arrow between the `patient` and `doctor` tables is of a different kind to the arrows from `prescription` to its parent tables. Remember that the three foreign keys in `prescription` (`prescription.patient_id`, `prescription.doctor_id` and `prescription.drug_code`) are all part of the primary key, and therefore may not be `NULL`. The form of the arrow shows that the foreign keys defined in the `prescription` table are required to be `NOT NULL`, whereas the `doctor_id` foreign key in the `patient` table *may* be `NULL`.

#### End of Activity 3

-----------------------------------------------

## Foreign keys referencing non-primary keys

So far, we have only used foreign keys to refer to the primary key in the parent table. Although it is usually the case that a foreign key column will reference the primary key of the parent table, this does not have to be the case. For example, suppose that, as well as their codes being unique, all drugs required unique names as well. We might decide to define a table which referenced `drug`, but using the drug's name rather than the code. 

Let's define a table `drug_order`, which contains three columns representing the order number of a particular drug, the quantity in the order, and the name of the drug in the order:

In [None]:
%%sql

DROP TABLE IF EXISTS drug_order;

CREATE TABLE drug_order (
    
    order_code CHAR(4),
    quantity INT,
    drug_name TEXT,
    
    PRIMARY KEY (order_code)
);

Although we could use the primary key as the reference from `drug_order`, we can also use the drug's name (after all, if the drug name is unique, it should not matter whether the drug is referred to by its name or its code).
 
To define the foreign key in this way, the column name must be given in the `FOREIGN KEY...REFERENCES` declaration, as well as the table that it appears in:

In [None]:
%%sql

ALTER TABLE drug_order
ADD CONSTRAINT drug_order_drug_fk
    FOREIGN KEY (drug_name) REFERENCES drug(drug_name);

When the previous cell is executed, a `ProgrammingError` is raised, with the explanation that `there is no unique constraint matching given keys for referenced table "drug"`. This error is raised here because, although all the values in the `drug_name` column of the `drug` table are currently unique, there is nothing to guarantee that this will continue to be the case in the future. Therefore, if we want to use this column as the referenced column, then it must be *constrained* to be unique. This guarantees that the column will continue to be an acceptable column to reference in future.
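
The requirement for a uniqueness constraint on the referenced column is not specific to PostgreSQL. As a hedged illustration, the sketch below assumes Python's built-in `sqlite3` module instead of PostgreSQL, a cut-down `drug` table containing a single made-up drug, and a hypothetical helper called `order_accepted`. One difference to be aware of: SQLite accepts the foreign key definition and only reports the problem (as a "foreign key mismatch") when a row is inserted into the child table, whereas PostgreSQL rejects the constraint definition itself.

```python
import sqlite3


def order_accepted(name_is_unique):
    """Hypothetical helper: build a fresh database, then report whether a
    drug_order row that references drug.drug_name can be inserted."""
    # Assumption: an in-memory SQLite database stands in for PostgreSQL.
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

    unique = "UNIQUE" if name_is_unique else ""
    conn.execute(f"CREATE TABLE drug (drug_code CHAR(4) PRIMARY KEY, drug_name TEXT {unique})")
    conn.execute("INSERT INTO drug VALUES ('x001', 'Exampline')")  # made-up drug

    conn.execute("""
        CREATE TABLE drug_order (
            order_code CHAR(4) PRIMARY KEY,
            quantity INT,
            drug_name TEXT,
            FOREIGN KEY (drug_name) REFERENCES drug (drug_name)
        )
    """)

    try:
        conn.execute("INSERT INTO drug_order VALUES ('o001', 10, 'Exampline')")
        return True
    except sqlite3.Error:
        # Without a uniqueness constraint on drug.drug_name, SQLite refuses the insert.
        return False
    finally:
        conn.close()


without_unique = order_accepted(name_is_unique=False)
with_unique = order_accepted(name_is_unique=True)
print(without_unique, with_unique)
```

In both databases the drug name stored in `drug` happens to be unique, but only the version in which `drug_name` is *declared* unique can act as the target of the foreign key.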

We can add this constraint as we saw in notebook *08.3 Adding column constraints to tables*:

In [None]:
%%sql

ALTER TABLE drug
ADD CONSTRAINT drug_name_unique
    UNIQUE (drug_name);

and now we can add the foreign key constraint:

In [None]:
%%sql

ALTER TABLE drug_order
ADD CONSTRAINT drug_order_drug_fk
    FOREIGN KEY (drug_name) REFERENCES drug(drug_name);

## What next?

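You have now seen how to define foreign keys on tables at the point of creating the table, how to add foreign key constraints to existing tables, and how a foreign key can reference a column other than the primary key, provided that column is constrained to be unique.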

In the next notebook, *09.2 Using foreign keys in SQL*, we will look in a bit more detail at the behaviour of foreign keys, and how careful use of foreign keys can help maintain the integrity of a database.