# Introduction to the SQL Data Definition Language

A *data definition language* allows us to define the structures that contain our data. In SQL, the data definition language allows us to add tables to the database, delete those tables, and define or modify constraints on those tables. 

In this notebook, we will look at basic table definition using the simple `Hospital` database described in [Activity 8.4](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%208%20Introduction%20to%20relational%20databases&targetptr=4) of the VLE course materials. 

In particular, we will look at:
* how to create tables in a database with the `CREATE` statement,
* how to remove tables from a database with the `DROP` statement, and
* how to define primary keys on tables.


In the next notebook, we will look at the Data Manipulation Language, which allows us to add data to the tables. For the moment, however, we will concentrate on the definition of the tables themselves.

You should spend around one hour on this notebook.

## Setting up the Relational Database 

In Parts 8-12, you will be using a relational database to manage data. The data in the database is *persistent*, that is, all the data that you have put into the database stays in place even after you shut down the notebooks in which the data was created. This is different from DataFrames, for example, which are deleted when the notebook closes.

You also need to *connect* to the database. The database server runs independently from the Jupyter notebook server, and to interact with it, you need to set up an explicit connection.

To address these points, we have written two scripts, one of which sets up the database connection, and another which resets the database to the same state at the beginning of each notebook. In the following notebooks, we will call the scripts at the beginning of the notebooks, so that you can dive straight into that notebook's content.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database, and query its contents (depending on what usage rights you have been granted). For the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https://tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate one of the following cells to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which you would generally not be granted on a multi-user database, unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. To set this up, we have created the script `sql_init.ipynb`, which uses the login credentials you provided in `DB_USER` and `DB_PWD` to create the connection string. The next cell executes the script:

In [None]:
%run sql_init.ipynb

If the script ran without error, you should now be able to see your database connection string, which is held in the variable `DB_CONNECTION`:

In [None]:
print(DB_CONNECTION)

The connection string is made up of several parts:

- `postgresql` : tells `ipython-sql` that we will use PostgreSQL as our database engine
- Your user name and password appear separated by a colon
- `localhost:5432` : the host (`localhost`) and the port (`5432`) on which the database engine is listening
- Finally, the string contains the name of the database (`tm351` for a local VCE, or your OUCU for the remote VCE)
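
To make the parts listed above concrete, here is a minimal sketch of how such a string could be assembled and then taken apart again. It assumes the usual URL layout `postgresql://user:password@host:port/database`. The helper `build_connection_string` and the credentials are hypothetical, and this is not the code used in `sql_init.ipynb`.

```python
from urllib.parse import quote, urlsplit


def build_connection_string(user, password, host="localhost", port=5432, database="tm351"):
    """Assemble a PostgreSQL connection URL from its parts (hypothetical helper)."""
    # Percent-encode the credentials so that characters such as '@' or ':'
    # in a password cannot be mistaken for URL separators.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{database}"


# Placeholder credentials for illustration only.
example = build_connection_string("example_user", "example_pwd")
parts = urlsplit(example)

print(parts.scheme)                # the database engine
print(parts.username)              # the user name
print(parts.hostname, parts.port)  # the host and the port
print(parts.path.lstrip("/"))      # the database name
```

Splitting the string back into its parts with `urlsplit` is a quick way to check which host, port and database a given connection string points at.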



We now connect to the database with the command:
```python
%sql <connection string>
```
where `%sql` is the `ipython-sql` magic, which was loaded with the command `%load_ext sql` in the `sql_init.ipynb` notebook. To use the connection string held in the Python variable `DB_CONNECTION`, we prefix the variable name with a `$` character, which tells the magic to substitute the variable's value when making the connection to the database:

In [None]:
%sql $DB_CONNECTION

You should now be connected to the PostgreSQL database.

### Reset the database

The notebooks in the next few parts all assume that you start off with the database in a clean state. To put the database into a clean state, run the next cell, which executes a notebook which ensures that any tables which we will use are in the state that we expect. Note that you *must* be correctly connected to the database to run this cell: if you are not, an error will be raised.

In [None]:
%run reset_databases.ipynb

### Examining the setup scripts

The notebook scripts `sql_init.ipynb` and `reset_databases.ipynb` are both contained in the same folder as this notebook. If you want to see how the environment is configured, or the SQL which is used to reset the databases, you are encouraged to look at these scripts.

In the remaining notebooks in the SQL parts of the module, we will not include this walkthrough: rather, we will just give the commands to set the authentication credentials, and run the setup scripts.

## Executing SQL queries

Having set up a connection to the database, we can now run some queries. Because `%load_ext sql` has loaded the SQL magic, you can run a cell as SQL by putting `%%sql` on the first line of a Jupyter code cell. The cell will then use the connection that we made with the `DB_CONNECTION` variable.

In PostgreSQL, there is a built-in table called `information_schema.tables`, which contains details of all the tables in the current database. We can see these tables by using a query which `SELECT`s all the columns in the `information_schema.tables` table (check the section *Projection using SQL* in notebook *03.2 Selecting and projecting, sorting and limiting* if you need to remind yourself how `SELECT` is used to return a number of table columns):

In [None]:
%%sql

SELECT *
FROM information_schema.tables
LIMIT 10;

By running the previous cell, you should have seen a list of tables which currently exist in the database (strictly, which appear in the current *namespace*). The `%%sql` declaration in the first line states that the rest of the cell is SQL rather than Python, and the remaining three lines select up to 10 rows of the `information_schema.tables` table.

Very often, it is useful to be able to pass the output of a query into a dataframe. To do this, the `%%sql` declaration should also contain the variable name, and the symbol `<<` as:

<code>%%sql &#x2329;variable name&#x232A; << </code>


For example, the next cell puts the results of the `SELECT` query into the variable `information_schema_tables_df`.

In [None]:
%%sql information_schema_tables_df <<

SELECT *
FROM information_schema.tables
LIMIT 10;

And we can see the `DataFrame` contained in the variable:

In [None]:
information_schema_tables_df

We have now seen how to call SQL within a notebook, and how to store any resulting tables as a `DataFrame` in a variable. We can now look at how to define tables themselves.

## Defining (`CREATE`ing) tables

In the entity diagrams for the `Hospital` database, introduced in [Activity 8.4](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%208%20Introduction%20to%20relational%20databases&targetptr=4), you have seen that there are four tables defined, those being:

* <code>patient</code>
* <code>doctor</code>
* <code>drug</code>
* <code>prescription</code>



These are not initially defined in the `tm351` database. If this is the first time that you have worked with this notebook, you can see that these tables are not defined by making a simple <code>SELECT</code> query on one of them:

In [None]:
%%sql

SELECT *
FROM patient;

This should have raised an error stating that the `relation "patient" does not exist`.

To define the table, we use the SQL <code>CREATE TABLE</code> statement. In this case, we would like to create the `patient` table which appears as the entity diagram:

| Patient        |
| :-------------: |
| <b>\*</b> patient_id     |
| patient_name     |
| date_of_birth   |
| gender  |
| height_cm     |
| weight_kg     |
| doctor_id    |

Recall that the first attribute, `patient_id`, is starred to indicate that it is the *primary key* (see [Activity 8.5](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%208%20Introduction%20to%20relational%20databases&targetptr=4) in the online materials).

To create the table, the SQL statement we need is:

In [None]:
%%sql

CREATE TABLE patient (
    
    patient_id CHAR(4),
    patient_name VARCHAR(20),
    date_of_birth DATE,
    gender CHAR(6),
    height_cm DECIMAL(4,1),
    weight_kg DECIMAL(4,1),
    doctor_id CHAR(4),
    
    PRIMARY KEY (patient_id)
 );

This <code>CREATE TABLE</code> statement is very simple, but demonstrates the key features of table creation. The standard syntax for the statement is:

<code>CREATE TABLE &#x2329;table_name&#x232A; (   
     &#x2329;column_name&#x232A; &#x2329;data_type&#x232A;,   
     &#x2329;column_name&#x232A; &#x2329;data_type&#x232A;,
     ... 
     PRIMARY KEY (&#x2329;column_name&#x232A;, &#x2329;column_name&#x232A;, ...) );</code>

The key features of the statement are:

* a name for the table, <code>&#x2329;table_name&#x232A;</code>
* a list of the table's columns, each given as the name of the column followed by the *data type* for that column, and 
* a `PRIMARY KEY` declaration, which identifies one or more of the declared columns as the primary key.

(The trailing semicolon is the end-of-statement marker. It can be omitted for single SQL statements, but if a cell contains multiple such statements, the semicolon is needed at the end of each statement. We will use the semicolon throughout.)
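
If you would like to try the same statement structure outside the notebook environment, the following sketch runs it from plain Python. It makes one loud assumption: an in-memory SQLite database (from Python's standard `sqlite3` module) stands in for the module's PostgreSQL server, so the wording of the error message differs from the one PostgreSQL reports, and SQLite does not enforce the declared column lengths.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")

# Before the table is defined, querying it raises an error.
try:
    conn.execute("SELECT * FROM patient;")
except sqlite3.OperationalError as error:
    print("Before CREATE TABLE:", error)

# The same CREATE TABLE statement as in the cell above.
conn.execute("""
    CREATE TABLE patient (
        patient_id CHAR(4),
        patient_name VARCHAR(20),
        date_of_birth DATE,
        gender CHAR(6),
        height_cm DECIMAL(4,1),
        weight_kg DECIMAL(4,1),
        doctor_id CHAR(4),

        PRIMARY KEY (patient_id)
    );
""")

# After the table is defined, the query succeeds but returns no rows.
cursor = conn.execute("SELECT * FROM patient;")
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```

The column names come back in the order in which they were declared, and the list of rows is empty because no data has been added yet.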

To see that the database does indeed contain the desired table (albeit without any constituent data yet) we can now have another go at the earlier <code>SELECT</code> query:

In [None]:
%%sql

SELECT *
FROM patient;

This query shows that the table now exists (we no longer get the `relation "patient" does not exist` error), but returns an empty result. To see the definition of the tables, including the defined columns, we can use a further magic called `schemadisplay_magic`. First we need to load the extension:

In [None]:
%load_ext schemadisplay_magic

Having loaded the extension, we can use it to view the tables currently defined by the user in the database.

As with the `%sql` magic, we can use the value set in the `DB_CONNECTION` variable:

In [None]:
%schema --connection_string $DB_CONNECTION

*Note that unlike the `sql` magic, where we only need to set up a connection once for the lifetime of a notebook session, we need to pass the connection string to the `%schema` magic each time we call it.*

This form of the `%schema` call displays all the user-defined tables. This is fine when there are only a couple of tables defined, but when the database becomes bigger and more complicated, it is more convenient to specify the table to be displayed. This can be done with the `-t` flag:

In [None]:
%schema --connection_string $DB_CONNECTION -t patient

Executing the previous cell should have displayed a `patient` entity diagram. The diagram has the 7 columns defined in the `patient` table, which correspond to the column names defined in the <code>CREATE TABLE</code> statement.

You should also see that the column `patient_id` has `(PK)` written after it. This is because we defined `patient_id` to be the primary key. As you saw in [Activity 8.5](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%208%20Introduction%20to%20relational%20databases&targetptr=4), no part of a primary key may be `NULL`, and no two rows in a table may have the same values for the primary key.
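
The uniqueness rule can be seen in action with a small sketch. As an assumption for illustration, it uses an in-memory SQLite database rather than PostgreSQL, a cut-down two-column version of `patient`, and invented rows. It also uses an `INSERT` statement, which is only introduced in the next notebook. The `NULL` rule is deliberately not shown, because SQLite (unlike PostgreSQL) accepts `NULL` in a non-integer primary key unless the column is also declared `NOT NULL`.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL, with a cut-down patient table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patient (
        patient_id CHAR(4),
        patient_name VARCHAR(20),

        PRIMARY KEY (patient_id)
    );
""")

# Invented example row.
conn.execute("INSERT INTO patient VALUES ('p001', 'Example One');")

# A second row with the same primary key value is rejected.
try:
    conn.execute("INSERT INTO patient VALUES ('p001', 'Example Two');")
    duplicate_rejected = False
except sqlite3.IntegrityError as error:
    duplicate_rejected = True
    print("Rejected:", error)

row_count = conn.execute("SELECT COUNT(*) FROM patient;").fetchone()[0]
print("Rows in patient:", row_count)
```

Only the first row is stored: the database itself refuses the duplicate, so no checking code is needed in the application.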

### Column data types

The <code>CREATE TABLE</code> statement for the `patient` table illustrates some of the standard SQL datatypes for columns.

The datatypes used in this notebook are:
    
| Datatype | Description |
|:----------|:-------------|
| `CHAR(n)`  | A fixed length character string of length `n` | 
| `VARCHAR(n)` | A variable length character string, of maximum length `n` |
| `INT`/`INTEGER` | An integer, whose minimum and maximum values are defined by the DBMS |
| `DATE` | A calendar date (typically parsed by the DBMS) |
| `DECIMAL(p, s)` | A decimal with precision `p` and scale `s` |

In the final case of `DECIMAL`s, the *precision* denotes the number of digits which can appear altogether in the number, and the *scale* denotes the number of digits which can appear after the decimal point. So for the column declaration:

<code>height_cm DECIMAL(4,1)</code>

the column <code>height_cm</code> can take decimal values of up to 999.9 (four figures altogether, with one after the decimal point). Values with more digits after the decimal point will be rounded, and if the resulting value exceeds the size of the container (for example, 999.97 would round to 1000.0), an error is raised.
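
The interplay of precision and scale can be mimicked in plain Python with the standard `decimal` module. The sketch below is an illustration of the rule just described, not of how the DBMS implements it: the function name `fit_decimal` is hypothetical, and ties are assumed to round away from zero.

```python
from decimal import Decimal, ROUND_HALF_UP


def fit_decimal(value, precision=4, scale=1):
    """Round value to `scale` decimal places and check it fits in `precision` digits."""
    # For scale 1 the quantum is Decimal('0.1'), i.e. one digit after the point.
    quantum = Decimal(1).scaleb(-scale)
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)

    # With precision 4 and scale 1, only three digits may appear before the
    # point, so the magnitude must stay below 1000.
    limit = Decimal(10) ** (precision - scale)
    if abs(rounded) >= limit:
        raise ValueError(f"{value} does not fit in DECIMAL({precision},{scale})")
    return rounded


print(fit_decimal("172.46"))  # 172.5
print(fit_decimal("999.94"))  # 999.9

try:
    fit_decimal("999.97")     # rounds to 1000.0, which needs five digits
except ValueError as error:
    print("Error:", error)
```

Passing the values as strings avoids the small inaccuracies that binary floating point numbers would otherwise introduce before the rounding takes place.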

### Primary key constraints

As discussed in [Part 8, section 4](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%208%20Introduction%20to%20relational%20databases&targetptr=4), the primary key of a table provides a unique identifier for that particular entity. In the <code>CREATE TABLE</code> statement to define `patient`, only a single column was needed to define an appropriate primary key. However, it is possible to define tables in which the primary key is composed of multiple columns. For example, the entity diagram for the `prescription` table is given as:

| Prescription        |
| :-------------: |
| <b>\*</b> patient_id     |
| <b>\*</b> doctor_id     |
| <b>\*</b> drug_code     |
| <b>\*</b> date   |
| dosage  |
| duration     |



The definition has the same structure as previously, but in this case we pass a tuple of the columns `patient_id`, `doctor_id`, `drug_code` and `date` as the primary key:

In [None]:
%%sql

CREATE TABLE prescription (
    
    patient_id CHAR(4),
    doctor_id CHAR(4),
    drug_code CHAR(8),
    date DATE,
    dosage INT,
    duration INT,
    
    PRIMARY KEY (patient_id, doctor_id, drug_code, date)
 );

As before, we can check that the table has been created with a <code>SELECT</code> query:

In [None]:
%%sql

SELECT *
FROM prescription;


The `SELECT` query should return an empty table, without raising an error. Also as before, we can see the columns that have been defined on this table by looking at the appropriate rows with a `%schema` call: 

In [None]:
%schema  --connection_string $DB_CONNECTION -t prescription

In this case, the four columns `patient_id`, `doctor_id`, `drug_code` and `date` all have the `(PK)` marker, which is what we would expect as the primary key is made up of all four columns.
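
With a composite primary key, it is the *combination* of values that must be unique, not each column on its own. The following sketch illustrates this under the assumption that an in-memory SQLite database stands in for PostgreSQL. The rows are invented, and the `INSERT` statement it uses is only introduced in the next notebook.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prescription (
        patient_id CHAR(4),
        doctor_id CHAR(4),
        drug_code CHAR(8),
        date DATE,
        dosage INT,
        duration INT,

        PRIMARY KEY (patient_id, doctor_id, drug_code, date)
    );
""")

insert = "INSERT INTO prescription VALUES (?, ?, ?, ?, ?, ?);"

# Invented rows: same patient, doctor and drug, but different dates.
conn.execute(insert, ("p001", "d006", "drug0001", "2024-01-05", 10, 7))
conn.execute(insert, ("p001", "d006", "drug0001", "2024-02-05", 10, 7))

# Repeating all four key values is rejected, even though the
# non-key columns (dosage and duration) are different.
try:
    conn.execute(insert, ("p001", "d006", "drug0001", "2024-01-05", 20, 14))
    duplicate_rejected = False
except sqlite3.IntegrityError as error:
    duplicate_rejected = True
    print("Rejected:", error)

row_count = conn.execute("SELECT COUNT(*) FROM prescription;").fetchone()[0]
print("Rows in prescription:", row_count)
```

The first two rows share three of the four key values and are both accepted; only the third row, which repeats all four, is refused.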

### Activity 1

The column `patient_id` in this example has a data type of `CHAR(4)`. What difficulties might come about if a different data type were declared, such as `CHAR(8)`?

Write your answer in this cell.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Because the `patient_id` column of `patient` should contain the same values as the `patient_id` column of `prescription`, if these two columns have different data types, then values could appear in one table which could not be represented in the other.

For example, if the `patient_id` column of `prescription` were declared as `CHAR(8)`, a string like `'PATIENT'` could be entered into it, but the same value could not be entered into the `patient_id` column of `patient`, which is declared as `CHAR(4)`. As a result, the data could end up being inconsistent.

#### End of Activity 1

----------------------------------------

## Removing (<code>DROP</code>ping) tables

Having seen how to create tables, it is useful to be able to remove any tables that you have created. To do so, use the <code>DROP TABLE</code> statement. The following cell will remove the `prescription` table (and any data stored in it) from the database:

In [None]:
%%sql

DROP TABLE prescription;

We can check that the table has indeed been removed with a `SELECT` query on the `prescription` table:

In [None]:
%%sql

SELECT *
FROM prescription;

The cell should have raised an error stating that `relation "prescription" does not exist`. 

If you have been working through the notebook, by this point you should have a database which still contains the table `patient`, but which does not contain the table `prescription`.

### Activity 2

What happens if you try to drop a table which does not exist in the database? Write an SQL statement to try to `DROP` the `prescription` table.

In [None]:
# Enter your code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

To find out what happens, let's try to `DROP` the `prescription` table using the same SQL statement as before:

In [None]:
%%sql

DROP TABLE prescription;

The previous cell should have raised an error stating that `table "prescription" does not exist`.

#### End of Activity 2

------------------------------------------------------

### ... `IF EXISTS`

To avoid an error being raised when a non-existent table is dropped, we can add `IF EXISTS` to the `DROP` statement. In such a case, SQL will only drop the table if it exists in the database:

In [None]:
%%sql

DROP TABLE IF EXISTS prescription;

Although the `prescription` table does not exist in the database, the `IF EXISTS` prevents an error being raised when we try to drop it.
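
The difference between the two forms of the statement can also be seen from plain Python. This sketch assumes an in-memory SQLite database as a stand-in for PostgreSQL, so the error text is SQLite's rather than the PostgreSQL message quoted earlier.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL. The database starts out empty,
# so there is no prescription table to drop.
conn = sqlite3.connect(":memory:")

# A plain DROP TABLE on a missing table raises an error.
try:
    conn.execute("DROP TABLE prescription;")
    plain_drop_failed = False
except sqlite3.OperationalError as error:
    plain_drop_failed = True
    print("DROP TABLE:", error)

# With IF EXISTS, the statement simply does nothing.
conn.execute("DROP TABLE IF EXISTS prescription;")
if_exists_succeeded = True
print("DROP TABLE IF EXISTS completed without an error")
```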

A common idiom is to add a `DROP TABLE IF EXISTS` statement before `CREATE TABLE` statements. We have seen that attempting to drop a non-existent table will raise an error: the same is true when trying to create a table that already exists. If you still have the `patient` table defined in your database, executing the following cell should raise an error:

In [None]:
%%sql

CREATE TABLE patient (
    
    patient_id CHAR(4),
    patient_name VARCHAR(20),
    date_of_birth DATE,
    gender CHAR(6),
    height_cm DECIMAL(4,1),
    weight_kg DECIMAL(4,1),
    doctor_id CHAR(4),
    
    PRIMARY KEY (patient_id)
);

This error can be avoided by using `DROP TABLE IF EXISTS` before the `CREATE TABLE` statement:

In [None]:
%%sql

DROP TABLE IF EXISTS patient;

CREATE TABLE patient (
    
    patient_id CHAR(4),
    patient_name VARCHAR(20),
    date_of_birth DATE,
    gender CHAR(6),
    height_cm DECIMAL(4,1),
    weight_kg DECIMAL(4,1),
    doctor_id CHAR(4),
    
    PRIMARY KEY (patient_id)
);

Of course, this should be used with care! Dropping a table also removes any data stored in it, and it is often very useful to be warned that you are trying to `DROP` a non-existent table, which may indicate a bug in your code.
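
To see why care is needed, the sketch below wraps the drop-then-create idiom in a small helper and shows that rerunning it silently discards existing rows. The assumptions are an in-memory SQLite database in place of PostgreSQL, a cut-down version of `patient`, an invented row, and a hypothetical helper name, `recreate_table`.

```python
import sqlite3

# Cut-down version of the patient table, for illustration only.
CREATE_PATIENT = """
    CREATE TABLE patient (
        patient_id CHAR(4),
        patient_name VARCHAR(20),

        PRIMARY KEY (patient_id)
    );
"""


def recreate_table(conn, table_name, create_statement):
    """Drop table_name if it exists, then create it afresh (hypothetical helper)."""
    # Table names cannot be passed as query parameters, so only call this
    # with names you have written yourself.
    conn.execute(f"DROP TABLE IF EXISTS {table_name};")
    conn.execute(create_statement)


# Assumption: SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")

# First run: the table does not exist yet, so only the CREATE has any effect.
recreate_table(conn, "patient", CREATE_PATIENT)
conn.execute("INSERT INTO patient VALUES ('p001', 'Example One');")
rows_before = conn.execute("SELECT COUNT(*) FROM patient;").fetchone()[0]

# Second run: no error is raised, but the stored row is lost.
recreate_table(conn, "patient", CREATE_PATIENT)
rows_after = conn.execute("SELECT COUNT(*) FROM patient;").fetchone()[0]

print(rows_before, rows_after)
```

The second call runs without complaint, which is exactly why the idiom is convenient in setup scripts and risky anywhere the table might hold data you want to keep.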

### Activity 3

The ERD contains some further tables. To practise the techniques in this section, write and execute suitable SQL to create the remaining two tables, `doctor` and `drug`. The tables are repeated here:


| Doctor        |
| :-------------: |
| <b>\*</b> doctor_id     |
| doctor_name    |

| Drug        |
| :-------------: |
| <b>\*</b> drug_code     |
| drug_name |


You should choose appropriate data types for your columns, and ensure that you have defined a primary key.





In [None]:
# Enter your code in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

It is fairly straightforward to create the tables using similar `CREATE TABLE` statements to those used for `patient` and `prescription`.

In [None]:
%%sql

CREATE TABLE doctor (
    
    doctor_id CHAR(4),
    doctor_name VARCHAR(20),
    
    PRIMARY KEY (doctor_id)
 );

CREATE TABLE drug (
    
    drug_code CHAR(8),
    drug_name VARCHAR(20),
    
    PRIMARY KEY (drug_code)
 );



We have made some simple assumptions here about the data types required for the columns. However, note that the type for `drug_code` defined here must be the same as the type defined in the `prescription` table, and the type for `doctor_id` must be the same type as defined in both the `patient` and `prescription` tables. These sets of columns will contain the same items of data (i.e. drug codes or doctor identifiers), and so they must be the same type in order to be comparable. This is the same reason that the data type for `patient_id` needs to be the same across tables, as we identified in Activity 1.
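
One way to check this kind of consistency is to ask the database for the declared type of a column in each table and compare the answers. The sketch below does this under stated assumptions: an in-memory SQLite database stands in for PostgreSQL, the tables are cut down to the columns of interest, and the helper `declared_types` is hypothetical. It relies on SQLite's `PRAGMA table_info`; in PostgreSQL, the corresponding details are held in `information_schema.columns`.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL, with cut-down tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (
        patient_id CHAR(4), doctor_id CHAR(4),
        PRIMARY KEY (patient_id));
    CREATE TABLE prescription (
        patient_id CHAR(4), doctor_id CHAR(4), drug_code CHAR(8), date DATE,
        PRIMARY KEY (patient_id, doctor_id, drug_code, date));
    CREATE TABLE doctor (
        doctor_id CHAR(4), doctor_name VARCHAR(20),
        PRIMARY KEY (doctor_id));
    CREATE TABLE drug (
        drug_code CHAR(8), drug_name VARCHAR(20),
        PRIMARY KEY (drug_code));
""")


def declared_types(conn, column_name, table_names):
    """Map each table name to the declared type of column_name (hypothetical helper)."""
    types = {}
    for table_name in table_names:
        # Each row of PRAGMA table_info is (cid, name, type, notnull, dflt_value, pk).
        for _cid, name, declared_type, *_rest in conn.execute(f"PRAGMA table_info({table_name});"):
            if name == column_name:
                types[table_name] = declared_type
    return types


doctor_id_types = declared_types(conn, "doctor_id", ["patient", "prescription", "doctor"])
drug_code_types = declared_types(conn, "drug_code", ["prescription", "drug"])

print(doctor_id_types)
print(drug_code_types)

# A column is declared consistently if only one distinct type appears.
print(len(set(doctor_id_types.values())) == 1)
print(len(set(drug_code_types.values())) == 1)
```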

#### End of Activity 3

-------------------------------------------------

## What next?

You have now completed this notebook. We have covered how to add and remove tables from the database using the `CREATE TABLE` and `DROP TABLE` statements, how to define appropriate types on the tables' columns, and how to define the primary key.

You can now move on to notebook *08.2 Data Manipulation Language in SQL*.