# Adding column constraints to tables

As we have seen in the online materials, an important aspect of databases is the ability to define *constraints* on the data. In this notebook, we will consider a number of constraints which can be defined on data to maintain integrity by restricting the values which can be placed in particular columns in a database.

This notebook divides roughly into two parts. In the first part, we will look at how to define constraints on an existing table. In the second part, we will look at how to use pandas to quickly create tables from dataframes.

You should spend around two hours on this notebook.

## Setting up

The next group of cells set up your database connection, and reset the database to a clean state. Check notebook *8.1 Data Definition Language in SQL* if you are unsure what the next cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb

print("Connecting with connection string : {}".format(DB_CONNECTION))

%sql $DB_CONNECTION

In [None]:
%run reset_databases.ipynb

## Defining and removing key constraints on existing tables

### A little more on primary key constraints

You have already seen one of the fundamental constraint types of relational databases: the primary key constraint. In the online materials, and in notebooks *8.1 Data Definition Language in SQL* and *8.2 Data Manipulation Language in SQL*, you saw how to define a primary key constraint on a table, and how an integrity error is raised if that constraint is violated.

In fact, although we have stressed how important a primary key is, SQL does allow tables to be defined without one. Let's redefine the `patient` table, but this time without including a clause to define the primary key. We will call this table `patient_no_pk` to emphasise that this version of the `patient` table has been defined without a primary key.

In [None]:
%%sql

DROP TABLE IF EXISTS patient_no_pk;

CREATE TABLE patient_no_pk (
    
    patient_id CHAR(4),
    patient_name VARCHAR(20),
    date_of_birth DATE,
    gender CHAR(6),
    height_cm DECIMAL(4,1),
    weight_kg DECIMAL(4,1),
    doctor_id CHAR(4)
 );

We can also add the first few rows of the table, and finish with a `SELECT` query to check that the rows have been added correctly:

In [None]:
%%sql

INSERT INTO patient_no_pk(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p001', 'Thornton', '1980-01-22', 'F', 162.3, 71.6, 'd06'),
       ('p007', 'Tennent', '1980-04-01', 'M', 176.8, 70.9, 'd07'),
       ('p008', 'James', '1980-07-08', 'M', 167.9, 70.5, 'd07'),
       ('p009', 'Kay', '1980-09-25', 'F', 164.7, 53.2, 'd06');

In [None]:
%%sql

SELECT *
FROM patient_no_pk;

The primary key constraint on a table's column states that the values in that column must all be different, and may not be `NULL`. To see that there is no primary key defined on the table `patient_no_pk`, we can try adding a row that would violate a primary key constraint. Remember that in notebook *8.2 Data Manipulation Language in SQL*, we were unable to add the following row to the `patient` table, because it violated the uniqueness requirement of the primary key constraint on the `patient_id` column:

| patient_id | patient_name | date_of_birth | gender | height_cm | weight_kg | doctor_id |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| p008 | Smith | 1981/03/13 | M | 169.3 | 81.7 | d06 |


As `patient_no_pk` does not have a primary key defined on it, we should be able to add this row:

In [None]:
%%sql

INSERT INTO patient_no_pk(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p008', 'Smith', '1981/03/13', 'M', 169.3, 81.7, 'd06');

This `INSERT` statement should have completed without raising an error. If we now query the contents of the table, the row should have been added. (The `ORDER BY` clause makes it easier to see the repeating value in the `patient_no_pk.patient_id` column.)

In [None]:
%%sql

SELECT *
FROM patient_no_pk
ORDER BY patient_id;

Similarly, we can add rows to the table which do not have a value in the `patient_id` column:

In [None]:
%%sql

INSERT INTO patient_no_pk(patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('Jones', '1982/07/04', 'F', 160.2, 68.0, 'd11');

SELECT *
FROM patient_no_pk;

You should now have a table `patient_no_pk` in which the column `patient_id` contains both repeated values and missing values, so `patient_id` cannot serve as a primary key for this table, with the given data.


### Adding a new constraint to a table with `ALTER TABLE ... ADD CONSTRAINT`

To add a constraint to an existing table in SQL, use the following:

<code>ALTER TABLE &#x2329;table name&#x232A;
ADD CONSTRAINT &#x2329;constraint name&#x232A;
    &#x2329;constraint definition&#x232A;;
</code>


So to add a constraint to `patient_no_pk` stating that the column `patient_id` should be the primary key, we could use a statement such as the following:

<code>ALTER TABLE patient_no_pk
ADD CONSTRAINT patient_no_pk_primary_key
    PRIMARY KEY(patient_id);</code>

which would create a primary key on the `patient_id` column of the `patient_no_pk` table, and give that constraint the name `patient_no_pk_primary_key`.

### Activity 1

What do you think will happen if you try to execute the following statement (on the table as it is populated at this point in the notebook)? Why?

```sql
ALTER TABLE patient_no_pk
ADD CONSTRAINT patient_no_pk_primary_key
    PRIMARY KEY(patient_id);
```

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

The following cell executes the given statement:

In [None]:
%%sql

ALTER TABLE patient_no_pk
ADD CONSTRAINT patient_no_pk_primary_key
    PRIMARY KEY(patient_id);



If you try to execute the cell, you should receive an `IntegrityError` with the additional information that `column "patient_id" contains null values`. In this case, the DBMS has tried to introduce a primary key constraint, but an error has occurred because the data in the table is inconsistent with the requirement of the primary key constraint (i.e. that all the values in `patient_id` be unique and non-`NULL`).

#### End of Activity 1

---------------------------------------------

### Constraint actions

As the previous activity showed, and as emphasised by the error raised by executing the `ADD CONSTRAINT` statement, constraints state properties of the database **that must be true**. A constraint does **not** mandate how that state should come about. As a result, if you try to define a constraint on a table where that constraint would be violated, then an error is raised when the constraint is defined, and the constraint is not added.

Rather than try to guess (possibly incorrectly) which rows should be removed from the table to make the constraint true, the DBMS prevents us from adding the constraint if the existing data would violate it.

Therefore, to add the primary key constraint to the table, we need to make sure that the table is consistent with the primary key *before* adding it.
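
In a larger table, the offending rows may not be obvious by eye. The following is a minimal sketch of how they could be located with two ordinary queries. It assumes an in-memory SQLite database, reached through Python's built-in `sqlite3` module, as a stand-in for the PostgreSQL server used in this notebook, and a cut-down two-column version of `patient_no_pk`; the two `SELECT` statements themselves are standard SQL.

```python
import sqlite3

# In-memory SQLite database: an assumption for illustration only,
# standing in for the course's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient_no_pk (patient_id CHAR(4), patient_name VARCHAR(20))"
)
conn.executemany(
    "INSERT INTO patient_no_pk VALUES (?, ?)",
    [("p001", "Thornton"), ("p007", "Tennent"), ("p008", "James"),
     ("p009", "Kay"), ("p008", "Smith"), (None, "Jones")],
)

# Values that occur more than once would break the uniqueness requirement.
duplicates = conn.execute(
    """
    SELECT patient_id, COUNT(*) AS occurrences
    FROM patient_no_pk
    WHERE patient_id IS NOT NULL
    GROUP BY patient_id
    HAVING COUNT(*) > 1
    """
).fetchall()

# Rows with no value at all would break the non-NULL requirement.
missing = conn.execute(
    "SELECT patient_name FROM patient_no_pk WHERE patient_id IS NULL"
).fetchall()

print("Repeated patient_id values:", duplicates)
print("Rows with no patient_id:", missing)
```

Queries like these only report the problem rows; deciding which of them to delete or correct remains a judgement for the database designer.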

Let's remove the two rows which are causing the problem:

In [None]:
%%sql

-- Remove the row for the patient with name Smith

DELETE FROM patient_no_pk
WHERE patient_name='Smith';

-- Remove the row where there is no value for patient_id

DELETE FROM patient_no_pk
WHERE patient_id IS NULL;

-- And show the contents of the patient_no_pk table

SELECT *
FROM patient_no_pk;

The table returned by the `SELECT` query should show that for each row of the table, there is a value in the column `patient_id`, and that none of the values are repeated. 

With the duplicate and missing `patient_id` values removed, we should be able to add the primary key constraint to the table:

In [None]:
%%sql

ALTER TABLE patient_no_pk
ADD CONSTRAINT patient_no_pk_primary_key
    PRIMARY KEY(patient_id);

This time, no error is raised, and the primary key constraint has been added (which we can see from the `(PK)` flag in the displayed version of the table definition):

In [None]:
%schema --connection_string $DB_CONNECTION -t patient_no_pk

If we now try to add the rows which violated the primary key constraint, an error will be raised:

In [None]:
%%sql

-- Attempt to add a repeated value to the primary key column

INSERT INTO patient_no_pk(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p008', 'Smith', '1981/03/13', 'M', 169.3, 81.7, 'd06');

In [None]:
%%sql

-- Attempt to add a row to the table with a missing value for the primary key

INSERT INTO patient_no_pk(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES (NULL, 'Jones', '1982/07/04', 'F', 160.2, 68.0, 'd11');

As these examples show, a constraint can only be added to a database where the database is already consistent with the constraint. Similarly, once a constraint has been added, data can only be added or removed if the resulting state of the database is consistent with that constraint. In both cases, an error will be raised and the database unchanged if the data in the database is inconsistent with the set of constraints defined upon it.
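
The "error raised, database unchanged" behaviour can also be observed from ordinary Python code. The sketch below is illustrative only: it assumes an in-memory SQLite database (via the built-in `sqlite3` module) rather than the notebook's PostgreSQL server, and a hypothetical two-column table `patient_demo`. It attempts the same two kinds of invalid insertion, catches the resulting errors, and then confirms that the table still holds only the original row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL (assumption)

# NOT NULL is spelled out because SQLite, unlike PostgreSQL, does not
# imply it for a non-integer PRIMARY KEY column.
conn.execute(
    "CREATE TABLE patient_demo ("
    " patient_id CHAR(4) NOT NULL,"
    " patient_name VARCHAR(20),"
    " PRIMARY KEY (patient_id))"
)
conn.execute("INSERT INTO patient_demo VALUES ('p008', 'James')")
conn.commit()

rejected = []
for row in [("p008", "Smith"), (None, "Jones")]:
    try:
        conn.execute("INSERT INTO patient_demo VALUES (?, ?)", row)
    except sqlite3.IntegrityError as err:
        # Record which row was refused and the reason the DBMS gave.
        rejected.append((row, str(err)))

rows_after = conn.execute("SELECT * FROM patient_demo").fetchall()
print("Rejected rows:", rejected)
print("Table contents afterwards:", rows_after)
```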

### Removing a constraint from a table with `ALTER TABLE ... DROP CONSTRAINT`

We previously stated that the form of statement required to add a constraint to a database was:

<code>    ALTER TABLE &#x2329;table name&#x232A;
    ADD CONSTRAINT &#x2329;constraint name&#x232A;
        &#x2329;constraint definition&#x232A;;
</code>


Although it is possible to omit the name of the constraint, it is good practice to include one, as constraints can then be removed with statements of the form:

<code>    ALTER TABLE &#x2329;table name&#x232A;
    DROP CONSTRAINT &#x2329;constraint name&#x232A;;
</code>


Because the primary key on `patient_no_pk` was given the name `patient_no_pk_primary_key` in the statement:

<code>    ALTER TABLE patient_no_pk
    ADD CONSTRAINT patient_no_pk_primary_key
        PRIMARY KEY(patient_id);</code>

we can remove the constraint again with the statement:

In [None]:
%%sql

ALTER TABLE patient_no_pk
DROP CONSTRAINT patient_no_pk_primary_key;

Once the `DROP CONSTRAINT` statement in the previous cell has been run, we can `INSERT` a row with a non-unique `patient_id` value, which raised an integrity error while the primary key was in place:

In [None]:
%%sql

-- Should work after the primary key constraint has been dropped

INSERT INTO patient_no_pk(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p008', 'Smith', '1981/03/13', 'M', 169.3, 81.7, 'd07');

SELECT *
FROM patient_no_pk;

Let's check the table definition again:

In [None]:
%schema --connection_string $DB_CONNECTION -t patient_no_pk

Let's remove the duplicate row and restore the primary key constraint:

In [None]:
%%sql

DELETE FROM patient_no_pk
WHERE patient_id='p008' AND patient_name='Smith';

ALTER TABLE patient_no_pk
ADD CONSTRAINT patient_no_pk_primary_key
    PRIMARY KEY(patient_id);

This ability to remove and redefine constraints can be extremely important when updating tables with complex interactions, such as with foreign keys.

## Further examples of constraints

As well as primary key constraints, there are many constraints that we might like to apply to the database in order to maintain integrity. A very important such constraint is the foreign key constraint which we will look at in part 9. Before then, we will look at some examples in this section of how constraints can be used to restrict the values which can be placed in particular columns in the database's tables.

For this section, we will go back to the `patient` table that we used in notebook *8.1 Data Definition Language in SQL* and notebook *8.2 Data Manipulation Language in SQL*. The next cells will redefine the table (including its primary key constraint), and populate it with the data given in notebook *8.2 Data Manipulation Language in SQL*. (The next three cells are identical to those used in notebook 8.2.)

In [None]:
%%sql

-- Redefine the patient table

DROP TABLE IF EXISTS patient;

CREATE TABLE patient (
    
    patient_id CHAR(4),
    patient_name VARCHAR(20),
    date_of_birth DATE,
    gender CHAR(6),
    height_cm DECIMAL(4,1),
    weight_kg DECIMAL(4,1),
    doctor_id CHAR(4),
    
    PRIMARY KEY (patient_id)
 );

In [None]:
%%sql

-- Add the data for the complete table

INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p001', 'Thornton', '1980-01-22', 'F', 162.3, 71.6, 'd06'),
       ('p007', 'Tennent', '1980-04-01', 'M', 176.8, 70.9, 'd07'),
       ('p008', 'James', '1980-07-08', 'M', 167.9, 70.5, 'd07'),
       ('p009', 'Kay', '1980-09-25', 'F', 164.7, 53.2, 'd06'),
       ('p015', 'Harris', '1980-12-04', 'M', 180.6, 64.3, 'd06'),
       ('p038', 'Ming', '1981-09-23', 'M', 186.3, 85.4, 'd11'),
       ('p039', 'Maher', '1981-10-09', 'F', 161.9, 73.0, 'd11'),
       ('p068', 'Monroe', '1981-02-21', 'F', 165.0, 62.6, 'd10'),
       ('p071', 'Harris', '1981-12-12', 'M', 186.3, 76.7, 'd10'),
       ('p078', 'Hunt', '1982-02-25', 'M', 179.9, 74.3, 'd10'),
       ('p079', 'Dixon', '1982-05-05', 'F', 163.9, 56.5, 'd06'),
       ('p080', 'Bell', '1982-06-11', 'F', 171.3, 49.2, 'd07'),
       ('p087', 'Reed', '1982-06-14', 'F', 160.0, 59.1, 'd07'),
       ('p088', 'Boswell', '1982-08-23', 'M', 168.4, 91.4, 'd06'),
       ('p089', 'Jarvis', '1982-11-09', 'F', 172.9, 53.4, 'd10');

INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, doctor_id)
VALUES ('p031', 'Rubinstein', '1980-12-23', 'F', 'd07'),
       ('p037', 'Boswell', '1981-06-11', 'F', 'd10');

In [None]:
%%sql

SELECT *
FROM patient
ORDER BY patient_id;

### `UNIQUE` constraints

It is often useful to be able to declare that all the values in a particular column must be different from each other. Of course, uniqueness is part of the primary key constraint, but often we would like to state that the values in one or more columns must be unique for other reasons (perhaps a company uses a payroll number as the primary key for its employees, but also requires that no two employees have the same internal telephone number).

The syntax for adding a `UNIQUE` constraint to a table is similar to that for adding a primary key constraint:

<code>ALTER TABLE &#x2329;table name&#x232A;
ADD CONSTRAINT &#x2329;constraint name&#x232A;
    UNIQUE (&#x2329;column 1&#x232A;, &#x2329;column 2&#x232A;, ..., &#x2329;column n&#x232A;);</code>

where the $n$-tuple of values in the columns <code>(&#x2329;column 1&#x232A;, &#x2329;column 2&#x232A;, ..., &#x2329;column n&#x232A;)</code> must be different for each row.

For example, in order to add a constraint to the `patient` table stating that each value in the `date_of_birth` column must be unique, we would use:

In [None]:
%%sql

ALTER TABLE patient
ADD CONSTRAINT date_of_birth_unique
    UNIQUE (date_of_birth);

To test that this constraint is working, we can try adding a row that contains a repeated value of one of the existing patients' dates of birth:

In [None]:
%%sql

INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p090', 'Wilson', '1980/07/08', 'M', 163.7, 72.3, 'd07');

The insertion raises an integrity error: the `UNIQUE` constraint has been violated.
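
The `date_of_birth` example involves a single column. To illustrate the multi-column case, where it is the combination of values that must be unique, here is a small sketch. It assumes an in-memory SQLite database (Python's built-in `sqlite3` module) rather than PostgreSQL, and a hypothetical table `visit` that is not part of the course database; the constraint is declared inside `CREATE TABLE` because SQLite does not offer `ALTER TABLE ... ADD CONSTRAINT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL (assumption)

# Hypothetical table: a patient may be recorded at most once per day.
conn.execute(
    "CREATE TABLE visit ("
    " patient_id CHAR(4),"
    " visit_date DATE,"
    " CONSTRAINT one_visit_per_day UNIQUE (patient_id, visit_date))"
)

# Each column may repeat on its own; only the pair has to be new.
conn.execute("INSERT INTO visit VALUES ('p001', '2020-01-06')")
conn.execute("INSERT INTO visit VALUES ('p001', '2020-01-07')")  # same patient, new date
conn.execute("INSERT INTO visit VALUES ('p007', '2020-01-06')")  # same date, new patient

try:
    conn.execute("INSERT INTO visit VALUES ('p001', '2020-01-06')")  # repeated pair
    pair_rejected = False
except sqlite3.IntegrityError as err:
    pair_rejected = True
    print("Rejected:", err)

row_count = conn.execute("SELECT COUNT(*) FROM visit").fetchone()[0]
print("Rows stored:", row_count)
```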

As mentioned earlier, having named constraints makes it easier to drop constraints if they are later found to be inappropriate. In this case, we might decide that insisting on a unique date of birth for each patient is not a sensible constraint for the database, in which case the constraint can be dropped using the form:

<code>ALTER TABLE &#x2329;table name&#x232A;
DROP CONSTRAINT &#x2329;constraint name&#x232A;;</code>

In this case, we can drop the constraint using the next cell:

In [None]:
%%sql

ALTER TABLE patient
DROP CONSTRAINT date_of_birth_unique;

And the row with the repeated date of birth can now be added:

In [None]:
%%sql

INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p090', 'Wilson', '1980/07/08', 'M', 163.7, 72.3, 'd07');

### `CHECK` constraints

The most general type of constraint in SQL is the *check constraint*: a check constraint is used to constrain the values that appear in each row of a table. The general form of the check constraint is:

<code>    ALTER TABLE &#x2329;table name&#x232A;
    ADD CONSTRAINT &#x2329;constraint name&#x232A;
        CHECK (&#x2329;condition&#x232A;);
</code>


where the <code>&#x2329;condition&#x232A;</code> must hold for each row in the table. The condition in a `CHECK` clause takes the same form as the condition in the `WHERE` clause of a `SELECT` query.

#### `NOT NULL` constraints

A common use for a check constraint is to be able to declare that none of the values in a particular column may be `NULL`. As with uniqueness, `NOT NULL`-ness is part of the primary key constraint, but often we would like to state that a particular value may not be `NULL` for other reasons (perhaps the company using a payroll number as the primary key for its employees requires that each employee has a name recorded in the database).

To state that a particular column may not be `NULL`, the constraint definition in the `ALTER TABLE` statement can be a check constraint whose condition uses `IS NOT NULL`, just as we might write in a `WHERE` clause. Here, we require that `patient_name` is never `NULL`:

In [None]:
%%sql

ALTER TABLE patient
ADD CONSTRAINT patient_name_not_null
    CHECK (patient_name IS NOT NULL);

We should now find that attempting to add a row to the `patient` table without a value for `patient_name` results in an error being raised:

In [None]:
%%sql

INSERT INTO patient(patient_id, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES('p096', '1983-07-23', 'F', 165.6, 55.3, 'd07');

Executing the previous cell raises an `IntegrityError`, with the additional information `new row for relation "patient" violates check constraint "patient_name_not_null"`.

As before, we can remove the constraint using the `DROP CONSTRAINT` statement:

In [None]:
%%sql

ALTER TABLE patient
DROP CONSTRAINT patient_name_not_null;

Now that the constraint has been dropped, we should be able to add the new row without an error being raised:

In [None]:
%%sql

INSERT INTO patient(patient_id, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES('p096', '1983-07-23', 'F', 165.6, 55.3, 'd07');

And we can query the `patient` table to ensure that the new row has been added:

In [None]:
%%sql

SELECT *
FROM patient;

There is now a row with a value of `p096` in the `patient_id` column, which does not have a value for `patient_name`.

### Activity 2

If the values in a column are constrained to be unique, but there is no `NOT NULL` constraint on the column, can multiple rows contain `NULL`?

If you are not sure, a good way to find out would be to write some simple code to investigate.

Write your answer or code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

To investigate this question, we can define our own small table with two columns: the primary key and one other. The aim will be to define the table:

| column_1 | column_2  |
|:--------:|:---------:|
| A        | NULL      | 
| B        | NULL      |

where `column_1` is the primary key, and `column_2` is constrained to be `UNIQUE`.


We start by defining the table: we can use `CHAR(1)` as the data type for `column_1`, and `INT` as the data type for `column_2` (although as we will not be entering any values for `column_2`, we could choose whatever data type we liked for it). We can call the table `test_table`:

In [None]:
%%sql

DROP TABLE IF EXISTS test_table;

CREATE TABLE test_table (

    column_1 CHAR(1),
    column_2 INT,

    PRIMARY KEY (column_1)
);


Next, we can add the `UNIQUE` constraint to `column_2`:

In [None]:
%%sql

ALTER TABLE test_table
ADD CONSTRAINT column_2_unique
    UNIQUE (column_2);

Now that the table is defined, we can insert some values:

In [None]:
%%sql

INSERT INTO test_table(column_1)
VALUES ('A'),
       ('B');

If we now query the contents of `test_table`, we can see that the required data has been added, with each row missing a value in `column_2`.

In [None]:
%%sql

SELECT *
FROM test_table;

As you can see, it is possible for multiple occurrences of `NULL` to appear in `column_2`, even though the values in that column are constrained to be unique. This shows that in SQL, separate occurrences of `NULL` are not considered to be equal.
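
The underlying reason is that comparing `NULL` with anything, including another `NULL`, does not give true. A minimal way to see this is sketched below; it assumes an in-memory SQLite database (Python's built-in `sqlite3` module) in place of the notebook's PostgreSQL server, which should not matter here because this treatment of `NULL` in comparisons is standard SQL behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL (assumption)

# Comparing NULL with NULL gives NULL ("unknown"), which Python receives as None.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# IS NULL is the test that actually answers "is this value missing?".
null_is_null = conn.execute("SELECT NULL IS NULL").fetchone()[0]

print("NULL = NULL gives:", null_equals_null)
print("NULL IS NULL gives:", null_is_null)
```

Because the equality test never comes out true, two rows that are both `NULL` in a `UNIQUE` column are not treated as duplicates of one another.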

#### End of Activity 2

--------------------------------------------------

### Constraining the values in a column

From when you worked through notebook *8.2 Data Manipulation Language in SQL*, you might recall that executing the SQL statement:

<code>INSERT INTO patient
VALUES('p071', 'Harris', '1981-12-12', 186.3, 76.7);</code>

resulted in the `patient` table containing the row:

| patient_id | patient_name | date_of_birth | gender | height_cm | weight_kg|doctor_id|
| ------ | ------ | ------ | ------ | ------ | ------|-------|
| p071 | Harris | 1981/12/12 | 186.3 | 76.7| None | None|

Clearly, the value 186.3 should not appear in the `gender` column of the `patient` table. The values which can be added to the `gender` column should only be either `F` or `M`. We can restrict the values in this way using a check constraint:

In [None]:
%%sql

ALTER TABLE patient
ADD CONSTRAINT gender_F_or_M
    CHECK (gender='F' OR gender='M');

The `patient` table now has the additional constraint `gender_F_or_M` defined so that for each row in the `patient` table, the condition `gender='F' OR gender='M'` must be true. To test that the constraint is working, we can try adding the row for the patient Harris with a value of `Male` in the `gender` column:

In [None]:
%%sql

INSERT INTO patient(patient_id, patient_name, date_of_birth, gender, height_cm, weight_kg, doctor_id)
VALUES ('p075', 'Harris', '1981-12-12', 'Male', 186.3, 76.7, 'd10');

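Executing the `INSERT` statement should have raised an `IntegrityError`, with additional information that `new row for relation "patient" violates check constraint "gender_f_or_m"`. (The constraint name appears in lower case because PostgreSQL folds unquoted identifiers to lower case.)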

Such constraints are primarily used to ensure the integrity of the database. If we now try the `INSERT` statement which created the incorrect row:

In [None]:
%%sql

INSERT INTO patient
VALUES('p071', 'Harris', '1981-12-12', 186.3, 76.7);

we find that the statement now raises an `IntegrityError` with an explanation that the check constraint has been violated. 

### Activity 3

What do you think will happen if you try to execute the following statement (on the table as it is populated at this point in the notebook, and with the `gender_F_or_M` constraint defined)? Why?

```sql
INSERT INTO patient(patient_id, patient_name, date_of_birth, height_cm, weight_kg, doctor_id)
VALUES ('p075', 'Harris', '1981-12-12', 186.3, 76.7, 'd10');
```

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

The following cell executes the given statement:

In [None]:
%%sql

INSERT INTO patient(patient_id, patient_name, date_of_birth, height_cm, weight_kg, doctor_id)
VALUES ('p075', 'Harris', '1981-12-12', 186.3, 76.7, 'd10');

If you try to execute the cell, you find that the `INSERT` statement has added a row containing the given data to the `patient` table, with a missing value in the `gender` column. We can see this with a suitable `SELECT` query:

In [None]:
%%sql

SELECT *
FROM patient;

This illustrates that missing values do not necessarily violate the constraint: if we wanted to force the user to provide a value for the `gender` column in the `patient` table, we would need to add a `NOT NULL` constraint as well as the check constraint.
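
The difference between the check constraint alone and the check constraint combined with `NOT NULL` can be sketched side by side. The example below is an illustration under stated assumptions: it uses an in-memory SQLite database (Python's built-in `sqlite3` module) instead of PostgreSQL, two hypothetical cut-down tables, and a hypothetical helper `try_insert`. The constraints are declared in `CREATE TABLE` because SQLite does not support adding them afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL (assumption)

# Hypothetical table with only the check constraint on gender.
conn.execute(
    "CREATE TABLE gender_check_only ("
    " patient_id CHAR(4),"
    " gender CHAR(6),"
    " CONSTRAINT gender_F_or_M CHECK (gender = 'F' OR gender = 'M'))"
)
# Hypothetical table with the same check constraint plus NOT NULL.
conn.execute(
    "CREATE TABLE gender_check_and_not_null ("
    " patient_id CHAR(4),"
    " gender CHAR(6) NOT NULL,"
    " CONSTRAINT gender_F_or_M CHECK (gender = 'F' OR gender = 'M'))"
)

def try_insert(table, gender):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(f"INSERT INTO {table} VALUES ('p075', ?)", (gender,))
        return True
    except sqlite3.IntegrityError:
        return False

# Try a valid value, an invalid value and a missing value on each table.
results = {
    (table, gender): try_insert(table, gender)
    for table in ("gender_check_only", "gender_check_and_not_null")
    for gender in ("M", "Male", None)
}
for key, accepted in results.items():
    print(key, "accepted" if accepted else "rejected")
```

With a missing value, the condition `gender = 'F' OR gender = 'M'` evaluates to `NULL` rather than to false, so the check constraint on its own lets the row through; only the `NOT NULL` declaration rejects it.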

#### End of Activity 3

----------------------------------------------

# Interoperability between pandas and SQL

Very often, the most convenient way of building a table will be from CSV files, or from existing structures within Python. In the rest of this notebook, we will look at some convenience functions that pandas provides which enable data to be quickly transferred into SQL tables from pandas dataframes. We will explore how to quickly generate a table from a pandas dataframe, and then look at the series of statements needed to alter the table definition and add constraints to it.

So far, we have entered all our data to SQL tables using `INSERT` statements. Found data is often distributed in the form of CSV files, and the most convenient way of dealing with it can be to pass it into a dataframe before exporting to a table in the relational database. To illustrate this process, we have created a CSV file containing the data for the `patient` table in the `sql_data` directory. You can see it here:

In [None]:
!head -n 5 sql_data/patient.csv

(Check notebook *02.2 Data file formats* if you need a reminder of how to use CSV files, or the Unix `head` command.)

Before we can import this data into a dataframe, we need to import pandas:

In [None]:
import pandas as pd

Now, as described in notebook *02.2.1 Data file formats - CSV*, we can use the `pd.read_csv` function to import the CSV file into a dataframe. (The `parse_dates` parameter in the argument list is used to convert the given column, `date_of_birth`, into `pandas.Timestamp` objects, also as described in that notebook.)

In [None]:
# Import the patient.csv file into a dataframe
patient_df=pd.read_csv('./sql_data/patient.csv',
                       parse_dates=['date_of_birth'])

#Look at the first few rows of the resulting dataframe
patient_df.head()

## Exporting data from a dataframe to SQL

Now that we have the data stored in a dataframe, it is straightforward to export it to an SQL table. The `.to_sql()` method defined on dataframes provides a convenience function that will create a database table corresponding to a dataframe if required, or add data from a dataframe to a pre-existing table. Whilst this can be handy for quickly getting data into a database, many of the table definitions created by the `.to_sql()` method will be lacking in terms of table structure (the selection of column data types is likely to be far from optimal) and constraints.

To start with, let's remove the existing `patient` table, so that it is no longer in the database:

In [None]:
%%sql

DROP TABLE IF EXISTS patient;

We can now use the `.to_sql` method to create a new table:

In [None]:
patient_df.to_sql('patient',
                  DB_CONNECTION,
                  if_exists='replace',
                  index=False
                  )

and check that the new table has indeed been created:

In [None]:
%%sql

SELECT *
FROM patient;

You should have found that a table has been created, containing the data from the CSV file.

The `.to_sql` method creates a new table, and enters the data from the dataframe. It is worth clarifying here what the various arguments to `.to_sql()` are doing:

- `'patient'`: the first argument is a string giving the name of the table which will be created when the method is called.

- `DB_CONNECTION` / `'postgresql://tm351_student:tm351_pwd@localhost:5432/tm351_clean'`: the second argument is the connection string, which we have also seen used to establish the database connection for the sql magic. This can be passed as a literal string or using a previously set connection string variable.

- `if_exists='replace'`: `if_exists` tells pandas what to do if a table `patient` already exists in the database. In this case, setting the parameter to `replace` tells pandas to drop the existing table and create a new version of it. Other values are `fail` (raise an error) and `append` (append the dataframe's data to the existing data in the table).

- `index=False`: `index` tells pandas whether or not to create a column in the table corresponding to the dataframe's index.

The full documentation can be seen by executing:

`pd.DataFrame.to_sql?`

in a code cell.
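
To see the three `if_exists` settings side by side, the following sketch may help. It is illustrative only: it assumes a throwaway in-memory SQLite database (which `.to_sql()` accepts as a `sqlite3.Connection`) in place of the notebook's PostgreSQL connection string, together with a hypothetical two-row dataframe `demo_df`, table name `patient_demo` and helper `row_count`.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for DB_CONNECTION (assumption)

# Hypothetical two-row dataframe used only for this illustration.
demo_df = pd.DataFrame({"patient_id": ["p001", "p007"],
                        "patient_name": ["Thornton", "Tennent"]})

def row_count():
    """Number of rows currently stored in patient_demo."""
    return conn.execute("SELECT COUNT(*) FROM patient_demo").fetchone()[0]

# 'replace': the table is created (or dropped and recreated) from the dataframe.
demo_df.to_sql("patient_demo", conn, if_exists="replace", index=False)
count_after_replace = row_count()

# 'append': the same rows are added again to the existing table.
demo_df.to_sql("patient_demo", conn, if_exists="append", index=False)
count_after_append = row_count()

# 'fail': pandas refuses to touch a table that already exists.
try:
    demo_df.to_sql("patient_demo", conn, if_exists="fail", index=False)
    fail_raised = False
except ValueError as err:
    fail_raised = True
    print("if_exists='fail':", err)

print("Rows after replace:", count_after_replace)
print("Rows after append:", count_after_append)
```

Note that `append` happily stores the same two patients twice: nothing in the table that pandas generates prevents repeated rows.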

<div class="alert alert-danger"><h4><em>pandas</em> Gotchas - Integers and NaNs</h4>
<p><em>pandas</em> <tt>Series</tt> and <tt>DataFrame</tt> columns with an <tt>int</tt> (integer) type <strong>cannot</strong> contain <tt>NaN</tt> or null values, unlike PostgreSQL tables, which <em>can</em> take a <tt>NULL</tt> value.</p>
<p>The <tt>%sql</tt> magic returns queries made on the database as a <em>pandas</em> <tt>DataFrame</tt>. If your query returns a table with a column of type <tt>INTEGER</tt> in the database, but the result contains one or more <tt>NULL</tt> values in the column, the corresponding column in the dataframe result will be a <tt>float</tt>, which <em>can</em> represent <tt>NaN</tt> values.</p>
</div>


## Adapting tables generated from pandas for relational databases

We now have a table with appropriate columns, and the correct data in those columns. However, we do not yet have the full benefits of those tables being in a relational database, as there are no integrity constraints defined on those tables.

Although a dataframe has an index, and this may often be associated with a unique set of values and used as a key, there is no concept of a *primary key* in a dataframe. Merely exporting the data from a dataframe into a relational table, with or without an explicit index, will not result in a primary key being defined on that table.

Also, the `.to_sql()` method does not necessarily know what the best types are for the columns in the new database table.

We can see how pandas has decided to create the `patient` table using the display magic:

In [None]:
%schema --connection_string $DB_CONNECTION -t patient

From the display we can see that pandas has made a reasonable guess as to the types of the different columns, although at a fairly coarse level of granularity:

- the `date_of_birth` column which had a `Timestamp` type in pandas has been converted to a `TIMESTAMP WITHOUT TIME ZONE` type in SQL;
- the `height_cm` and `weight_kg` columns have both been implemented as `DOUBLE PRECISION` numbers;
- the remaining columns have been implemented as `TEXT`.

To make the exported table more robust in database terms, the first thing to do, of course, is to define a primary key. We have seen in the earlier cells in this notebook how to implement a primary key, so we can do that for the new `patient` table:

In [None]:
%%sql

ALTER TABLE patient
ADD CONSTRAINT patient_primary_key
    PRIMARY KEY(patient_id);

Next, we might decide to redefine some of the column types. To do this, we can use an `ALTER TABLE` statement with the following form:

<code>ALTER TABLE &#x2329;table name&#x232A;
ALTER COLUMN &#x2329;column name&#x232A;
    TYPE &#x2329;data type&#x232A;;
</code>

where the <code>&#x2329;data type&#x232A;</code>s are as discussed in notebook *8.1 Data Definition Language in SQL*.

For the `patient` example, we might design our system so that patient identifiers are never more than four characters long.

We can modify the table structure to set the column type as a `CHAR(4)` type using the `ALTER TABLE` statement:

In [None]:
%%sql 

ALTER TABLE patient 
ALTER COLUMN patient_id 
    TYPE CHAR(4);

After running the previous cell, we can see the properties of the `patient` table again:

In [None]:
%schema --connection_string $DB_CONNECTION -t patient

The column `patient_id` has now been defined as the primary key, and has had its type changed to `CHAR(4)`.
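
Narrowing a column's type in this way is only safe if every value already stored fits the new type. A simple precaution is to look for values that are too long before altering the column. The sketch below shows one way this might be done; it assumes an in-memory SQLite database (Python's built-in `sqlite3` module) standing in for PostgreSQL, a hypothetical table `patient_demo`, and a deliberately over-long made-up identifier `p0089` that does not appear in the course data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL (assumption)
conn.execute("CREATE TABLE patient_demo (patient_id TEXT, patient_name TEXT)")
conn.executemany(
    "INSERT INTO patient_demo VALUES (?, ?)",
    [("p001", "Thornton"), ("p007", "Tennent"),
     ("p0089", "James")],  # 'p0089' is a made-up five-character identifier
)

MAX_LENGTH = 4  # the width we intend to give the column

# Any identifier longer than MAX_LENGTH would not fit the narrower type.
too_long = conn.execute(
    "SELECT patient_id FROM patient_demo WHERE LENGTH(patient_id) > ?",
    (MAX_LENGTH,),
).fetchall()

print("Identifiers that would not fit:", too_long)
```

An empty result suggests that the change of type can go ahead without affecting the stored data.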

Finally, as before, we might want to restrict the values which can be taken by the `gender` column to `F` or `M`. We can use the same `ADD CONSTRAINT` statement as earlier:

In [None]:
%%sql

ALTER TABLE patient
ADD CONSTRAINT gender_F_or_M
    CHECK (gender='F' OR gender='M');
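
A check constraint can only be added to a populated table if every existing row already satisfies it. The following is a minimal sketch of how rows which would violate the `gender` condition might be found beforehand. It is an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in, and the table `patient_demo` and its sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data, including an unexpected code, a lower-case
# value and a missing value
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient_demo (patient_id TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO patient_demo VALUES (?, ?)",
    [("p001", "F"), ("p002", "M"), ("p003", "X"), ("p004", "f"), ("p005", None)],
)

# Rows whose gender is neither 'F' nor 'M'. A NULL gender is not listed:
# the comparison evaluates to unknown rather than false, and a CHECK
# constraint only rejects rows for which its condition is false.
violations = conn.execute(
    "SELECT patient_id, gender FROM patient_demo "
    "WHERE NOT (gender = 'F' OR gender = 'M') "
    "ORDER BY patient_id"
).fetchall()
print("Rows that would violate the constraint:", violations)
```

If missing values should also be rejected, a separate `NOT NULL` constraint would be needed on the column.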

### Activity 4

The directory `sql_data` contains four CSV files: `patient.csv`, `doctor.csv`, `drug.csv` and `prescription.csv`. These contain data for the `patient`, `doctor`, `drug` and `prescription` entities respectively, as illustrated in the entity diagram:

![The entities as shown in Figure 9.2](notebook_8.3_entities.jpg)

We can list the files in the `sql_data` directory:

In [None]:
!ls sql_data

Use the CSV files and the entity diagram to create tables in the database for the three remaining entities (`doctor`, `drug` and `prescription`). You should both import the data from the CSV files, and define appropriate constraints on the created tables.

In [None]:
# Write your code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

The process is similar to what we have already seen for the `patient` table. For this exercise, we will just import the CSV files into tables and add primary keys. However, in a real application, you might want to change some of the column data types, or add additional check constraints in order to ensure that the database correctly represents the domain.

First, we will import the `doctor` data:

In [None]:
!head -n 5 sql_data/doctor.csv

In [None]:
# Import the doctor.csv file into a dataframe
doctor_df = pd.read_csv('./sql_data/doctor.csv')

# Look at the first few rows of the resulting dataframe
doctor_df.head()

We can now use the `.to_sql` method to create a new table:

In [None]:
doctor_df.to_sql('doctor',
                 DB_CONNECTION,
                 if_exists='replace',
                 index=False
                 )

and check that the new table has indeed been created:

In [None]:
%%sql

SELECT *
FROM doctor;

To define the primary key, we will use the entity diagram, which uses an asterisk to show that the `doctor_id` column is the primary key for the `doctor` table.

In [None]:
%%sql

ALTER TABLE doctor
ADD CONSTRAINT doctor_primary_key
    PRIMARY KEY(doctor_id);

And finally, we can use the display magic to check that the primary key has been correctly set:

In [None]:
%schema --connection_string $DB_CONNECTION -t doctor

We can carry out similar steps for the `drug` entity:

In [None]:
!head -n 5 sql_data/drug.csv

In [None]:
# Import the drug.csv file into a dataframe
drug_df = pd.read_csv('./sql_data/drug.csv')

# Look at the first few rows of the resulting dataframe
drug_df.head()

In [None]:
# Use the `.to_sql` method to create a new table:
    
drug_df.to_sql('drug',
               DB_CONNECTION,
               if_exists='replace',
               index=False
              )

We can then check that the new table has been created:

In [None]:
%%sql

SELECT *
FROM drug;

To define the primary key, we will use the entity diagram, which has starred the `drug_code` column as being the primary key for the `drug` table.

In [None]:
%%sql

ALTER TABLE drug
ADD CONSTRAINT drug_primary_key
    PRIMARY KEY(drug_code);

And use the display magic to check that the primary key has been correctly set:

In [None]:
%schema --connection_string $DB_CONNECTION -t drug

Finally, we can consider the `prescription` table. This is slightly more fiddly than the `doctor` and `drug` tables because the `date` column needs to be an appropriate data type, and the table has a composite primary key. However, the stages are more or less the same. First, we import the CSV file as a dataframe, this time specifying the column which needs to be parsed as a date:

In [None]:
!head -n 5 sql_data/prescription.csv

In [None]:
# Import the prescription.csv file into a dataframe
prescription_df = pd.read_csv('./sql_data/prescription.csv',
                              parse_dates=['date'])

# Look at the first few rows of the resulting dataframe
prescription_df.head()

In [None]:
# Use the `.to_sql` method to create a new table:
    
prescription_df.to_sql('prescription',
                       DB_CONNECTION,
                       if_exists='replace',
                       index=False
                      )

We can then check that the new table has indeed been created:

In [None]:
%%sql

SELECT *
FROM prescription;

To define the primary key, we will use the entity diagram, which has starred the four columns `patient_id`, `doctor_id`, `drug_code` and `date` as together forming the primary key for the `prescription` table. Although we have only defined single-column primary keys so far, it is straightforward to add a composite primary key:

In [None]:
%%sql

ALTER TABLE prescription
ADD CONSTRAINT prescription_primary_key
    PRIMARY KEY (patient_id, date, doctor_id, drug_code);
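
With a composite primary key, uniqueness is judged on the combination of the key columns rather than on any one of them. The following is a minimal sketch of this behaviour. It is an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in, and because SQLite does not support `ALTER TABLE ... ADD CONSTRAINT`, the key is declared when the table is created. The table `prescription_demo` and its sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prescription_demo (
        patient_id TEXT,
        date TEXT,
        doctor_id TEXT,
        drug_code TEXT,
        PRIMARY KEY (patient_id, date, doctor_id, drug_code)
    )
    """
)

insert = "INSERT INTO prescription_demo VALUES (?, ?, ?, ?)"
conn.execute(insert, ("p001", "2017-01-16", "d06", "N02"))
# Same patient, date and doctor but a different drug: the combination
# of key values is new, so this row is accepted.
conn.execute(insert, ("p001", "2017-01-16", "d06", "R05"))

try:
    # All four key columns match the first row, so this insert is rejected.
    conn.execute(insert, ("p001", "2017-01-16", "d06", "N02"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM prescription_demo").fetchone()[0]
print("Duplicate rejected:", duplicate_rejected)
print("Rows in table:", row_count)
```

Only the row which repeats all four key values is refused; rows which share some, but not all, of the key values are treated as distinct.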

And finally, we can use the display magic to check that the primary key has been correctly set:

In [None]:
%schema --connection_string $DB_CONNECTION -t prescription

#### End of Activity 4

----------------------------------------

## What next?

In this notebook, we have seen:

* how to add constraints to an existing table, and how to remove named constraints if they are not required,

* how constraints, and particularly primary keys, can be used to determine what values can or cannot be put into a database table, and when values may or may not be `NULL`,

* that constraints are fully *declarative*: that is, they determine what values are legitimate, not how possible violations should be managed, and

* how to use pandas' dataframe methods to export data from dataframes into relational tables.

You have now completed the notebooks for part 8 of the module. 