# <a name="top"></a>Problems with unnormalised data

In this notebook you will see some problems which can arise when dealing with data that has not been normalised. We will continue to use the running example of a hospital database.

This notebook refers back to the notebooks from Parts 8 and 9, which used the Hospital database with the tables `patient`, `doctor`, `drug` and `prescription`. However, unlike those notebooks, in which the data was organised into four separate tables to represent the four different entities, in this notebook we will start off by assuming that all the data is held in one large table, and look at some of the problems that can arise from this representation.



This notebook uses the original `public` schema, and starts with an empty database, so none of the four tables have been defined to begin with.

You should spend around 45 minutes on this notebook.

## Setting up

The next group of cells sets up your database connection and resets the database to a clean state. Check notebook *08.1 Data Definition Language in SQL* if you are unsure what these cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb
print("Connecting with connection string : {}".format(DB_CONNECTION))
%sql $DB_CONNECTION

In [None]:
%run reset_databases.ipynb

## An unnormalised `prescription` table

For the sake of this exercise, imagine that _all_ the information held about patients, doctors, and drugs is contained in a bundle of patient records which have the following form:

<img src="images/tm351-patient_record.png" alt="Drawing" style="width: 75%;"/>

The information at the top of the form gives the identifier and name of the patient for whom the record has been made out, as well as the identifier and name of the doctor who is responsible for that patient. The table in the lower part of the record contains the drug code, drug name, dosage and duration for each treatment that has been prescribed to the patient, as well as the date of the prescription, and the identifier of the prescribing doctor (who may not be the same as the patient's responsible doctor). So in this case, the patient called Thornton with identifier p001 has the doctor named Gibson with identifier d06 as their responsible doctor. The patient's responsible doctor prescribed three of the treatments in the record, with one (the one prescribed on 15th Jun 2017) being prescribed by the doctor with identifier d07.

Obviously this record format is not representative of the records kept in a real hospital, and the solutions to the problems below are likely to be obvious. However, this simple example is still sufficiently rich to illustrate the problems that can occur.

In real systems, the same problems can exist, but will be hidden in the depths of a much more complex data model. Therefore, the problems are rarely obvious on first inspection. By working through a simple example, you should become familiar with the problems that come from unnormalised data and how they may manifest. In the rest of this Part, you'll look at how to restructure the database to avoid these problems.

### Define the unnormalised table

In the same way as we did in notebook *08.3 Adding column constraints to tables*, we'll load the prescription data from a file. As you saw, this is easier if we read the data into a pandas DataFrame and then load that into PostgreSQL with `to_sql()`.

For this notebook, we have created a CSV file containing the unnormalised data, which we have put into the file named `unnormalised_prescription.csv` in this folder. As usual, you can see the first few lines with a call to `head`:

In [None]:
!head unnormalised_prescription.csv

In [None]:
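# pandas may already have been imported by sql_init.ipynb; importing it again is harmless
import pandas as pd

unnormalised_prescription_df = pd.read_csv('unnormalised_prescription.csv', parse_dates=['date'])
unnormalised_prescription_df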

As we saw in notebook *08.3 Adding column constraints to tables*, we can use the `.to_sql()` method of this dataframe to add a new table with this data to the database. We will call the unnormalised table `unnormalised_prescription`:

In [None]:
unnormalised_prescription_df.to_sql('unnormalised_prescription',
                                    DB_CONNECTION,
                                    if_exists='replace',
                                    index=False
                                   )

As discussed in [Part 8, section 4](https://learn2.open.ac.uk/mod/oucontent/view.php?id=1349962&section=4), the primary key of our `prescription` table is:
<code>(patient_id, prescribing_doctor_id, drug_code, date)</code>. (Note that in this table the key uses the column `prescribing_doctor_id`, to distinguish it from `doctor_id`, which identifies the patient's responsible doctor.)
So let's add this as a primary key to the new `unnormalised_prescription` table:

In [None]:
%%sql

ALTER TABLE unnormalised_prescription 
ADD CONSTRAINT unnormalised_prescription_primary_key
    PRIMARY KEY (patient_id, prescribing_doctor_id, drug_code, date);

And use a `SELECT` query to check that the table has been created and appears as we'd expect:

In [None]:
%%sql

SELECT * 
FROM unnormalised_prescription;
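
The `ALTER TABLE` statement earlier can only succeed if no two rows share the same combination of key values, and if none of the key columns contains a `NULL`. The following is a minimal sketch of how you might check both conditions before adding such a constraint. It uses Python's built-in `sqlite3` module and a few made-up rows rather than the module's PostgreSQL database, so the table name `demo_prescription` and its contents are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite database standing in for the real PostgreSQL one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE demo_prescription (
        patient_id TEXT,
        prescribing_doctor_id TEXT,
        drug_code TEXT,
        date TEXT,
        drug_name TEXT
    )
""")

# Hypothetical rows: the last two share the same candidate key values.
conn.executemany(
    "INSERT INTO demo_prescription VALUES (?, ?, ?, ?, ?)",
    [
        ("p001", "d06", "O17663", "2017-05-15", "Omeprazole"),
        ("p001", "d06", "T02378", "2017-05-15", "Tramadol"),
        ("p001", "d06", "T02378", "2017-05-15", "Tramadol"),
    ],
)

# Any key combination that occurs more than once would violate the primary key.
duplicate_keys = conn.execute("""
    SELECT patient_id, prescribing_doctor_id, drug_code, date, COUNT(*)
    FROM demo_prescription
    GROUP BY patient_id, prescribing_doctor_id, drug_code, date
    HAVING COUNT(*) > 1
""").fetchall()

# Rows with a NULL in any key column would also be rejected.
null_key_rows = conn.execute("""
    SELECT COUNT(*)
    FROM demo_prescription
    WHERE patient_id IS NULL
       OR prescribing_doctor_id IS NULL
       OR drug_code IS NULL
       OR date IS NULL
""").fetchone()[0]

print("Duplicated keys:", duplicate_keys)
print("Rows with a NULL key column:", null_key_rows)
```

If both checks come back empty (or zero), the chosen columns are at least consistent with being a primary key for the data currently in the table.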

## Insertion anomalies

In the exercises below, we will look at some of the issues that are raised by the primary key of `(patient_id, prescribing_doctor_id, drug_code, date)` in the large `unnormalised_prescription` table. As we saw in Part 8, the primary key limits what information can be added to this table. In the exercises, we'll look at some problems that are raised by using the table as it stands, in its unnormalised form.

### Activity 1

Remember from notebook *08.2 Data Manipulation Language in SQL*, that the basic form of an `INSERT` statement for adding data to a table is:

<code>INSERT INTO &#x2329;table name&#x232A;( &#x2329;column 1&#x232A;, &#x2329;column 2&#x232A;,... &#x2329;column n&#x232A; )
VALUES ( &#x2329;value 1&#x232A;, &#x2329;value 2&#x232A;, ..., &#x2329;value n&#x232A;); </code>

Pravastatin is an alternative to simvastatin, treating much the same conditions with much the same doses. The hospital wants to make this available for prescription. 

Write an `INSERT` statement to add pravastatin, with `drug_code` P1234, to the `unnormalised_prescription` table.

In [None]:
# Write your code in this cell

Explain what problems are raised by using a simple `INSERT` statement for this task.

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Following notebook *08.2 Data Manipulation Language in SQL*, a statement to add the drug data should require a simple SQL `INSERT` command:

In [None]:
%%sql

INSERT INTO unnormalised_prescription (drug_code, drug_name) 
    VALUES ('P1234', 'Pravastatin');

This raises an `IntegrityError` because of integrity constraints on the `unnormalised_prescription` table's primary key. The composite primary key requires non-`NULL` values for each of `patient_id`, `prescribing_doctor_id`, `drug_code`, and `date`. If we attempt to record a drug before it's prescribed, there are no values for `patient_id`, `prescribing_doctor_id` or `date`. 

The attempt to add a row to `unnormalised_prescription` with just a drug code and drug name produces a row that violates the primary key constraint. PostgreSQL prevents us from adding it, reporting it as an `IntegrityError`.

Therefore we can't use this table to record information on drugs, unless those drugs have actually been used in a prescription. Normalisation will tell us how to split this one table into several parts so that we can record information about drugs without reference to patients, doctors or prescriptions. 

#### End of Activity 1

-----------------------------------

### Activity 2

A patient by the name of Kay has just been admitted to the hospital. Kay has been assigned identifier `p009`, and Dr James, with identifier `d07`, is leading Kay's care. Kay has just been admitted, so has received no drugs yet.

Write an `INSERT` statement to add Kay's details to the `unnormalised_prescription` table.

In [None]:
# Write your code in this cell

Explain what problems are raised by using a simple `INSERT` statement for this task.

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Inserting the patient's data should be a simple SQL `INSERT` command:

In [None]:
%%sql

INSERT INTO unnormalised_prescription (patient_id, patient_name, doctor_id, doctor_name) 
    VALUES ('p009', 'Kay', 'd07', 'James');

Again, this fails because of integrity constraints on the `unnormalised_prescription` table's primary key. If we're recording just a patient before they've had any drugs prescribed, there are no values for `prescribing_doctor_id`, `drug_code` or `date`. 

The attempt to add this row to `unnormalised_prescription` (with just a `patient_id` present for the primary key) produces a row that violates the primary key constraint. Again, PostgreSQL prevents us from adding it, reporting it as an `IntegrityError`.

#### End of Activity 2

-----------------------------------

### Discussion

These examples show that this database structure isn't fit for purpose, in that it does not allow us to add information about some aspects of the hospital data if other aspects are not available. If we were filling out a spreadsheet, we could record partial information in partial rows, for example:

|patient_id|patient_name|doctor_id|doctor_name|prescribing_doctor_id|drug_code|date |drug_name   |dosage|duration|
|----------|------------|---------|-----------|---------------------|---------|-----|------------|------|--------|
|      -   |     -      |   -     |    -      |  -                  | P1234   |  -  |Pravastatin |  -   |   -    |
|p009      |Kay         |d07      |James      |  -                  |  -      | -   |  -         | -    |   -    |
|p001      |Thornton    |d06      |Gibson     |   d06               |O17663   |15/05/2017 |Omeprazole |40 mg |1 x day,Daily      |
|p001      |Thornton    |d06      |Gibson     |  d06  |T02378|15/05/2017   |Tramadol   |50 mg |3 x day,As required|

However, by using a relational database to store the information, we commit to maintaining certain data integrity properties. In the hospital case, the integrity constraint requires that all parts of the primary key are known: `patient_id`, `prescribing_doctor_id`, `drug_code`, and `date`. This constraint ensures that all data rows in the database are well-formed. 

However, this data structure is too restrictive for some of the data we might want to add. We can't add a patient to the database until they have been prescribed a drug, and we can't add a drug until it has been prescribed to a patient. But from what you learned about primary keys in Part 8, we can't relax any of the constraints on the database. That means we need to look at a different database structure. Again, normalisation will tell us what that structure should be.
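
The insertion anomaly can be reproduced outside the hospital database. The following is a minimal sketch using Python's built-in `sqlite3` module with a cut-down, hypothetical table called `demo_prescription`; it is not the PostgreSQL setup used in this notebook. One assumption to note: SQLite only rejects `NULL` in these key columns when `NOT NULL` is declared explicitly, so the sketch spells that out to mimic the behaviour you saw in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NOT NULL is stated explicitly so that SQLite behaves like PostgreSQL here.
conn.execute("""
    CREATE TABLE demo_prescription (
        patient_id TEXT NOT NULL,
        prescribing_doctor_id TEXT NOT NULL,
        drug_code TEXT NOT NULL,
        date TEXT NOT NULL,
        drug_name TEXT,
        PRIMARY KEY (patient_id, prescribing_doctor_id, drug_code, date)
    )
""")

try:
    # A drug on its own, with no patient, prescribing doctor or date.
    conn.execute(
        "INSERT INTO demo_prescription (drug_code, drug_name) VALUES (?, ?)",
        ("P1234", "Pravastatin"),
    )
    insert_rejected = False
except sqlite3.IntegrityError as error:
    insert_rejected = True
    print("Insert rejected:", error)

# The table is still empty: the drug could not be recorded.
row_count = conn.execute("SELECT COUNT(*) FROM demo_prescription").fetchone()[0]
print("Rows in table:", row_count)
```

The same pattern would apply to a patient with no prescriptions: any row that leaves part of the key empty is refused, however useful the rest of the row might be.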

## Deletion anomalies

The fact that all the data is held in a single table means that deletions can have unintended consequences.

### Activity 3

Remember from notebook *08.2 Data Manipulation Language in SQL*, that the basic form of a `DELETE` statement for removing data from a table is:

<code>DELETE FROM &#x2329;table_name&#x232A;
WHERE &#x2329;condition&#x232A;;</code>

The `unnormalised_prescription` table contains the following row, which describes a prescription for the drug with code `O17663` which was made out for the patient with identifier `p001` on 15th May, 2017 by the doctor with identifier `d06`:

In [None]:
%%sql 

SELECT * 
FROM unnormalised_prescription
WHERE patient_id = 'p001'
    AND prescribing_doctor_id='d06'
    AND drug_code = 'O17663'
    AND date = '2017-05-15';

It turns out that this record of Thornton's prescription of Omeprazole was added to the database in error. Correct the error by writing a `DELETE` statement to remove this record from the database. Having done that, write a `SELECT` query to find the drug code for Omeprazole.

In [None]:
# Write your code in this cell

Explain what problems have been created by this deletion on the `unnormalised_prescription` table.

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

There are two parts to the task. First we need to remove the unwanted row from the `unnormalised_prescription` table:

In [None]:
%%sql 

DELETE FROM unnormalised_prescription 
WHERE patient_id = 'p001'
    AND prescribing_doctor_id='d06'
    AND drug_code = 'O17663'
    AND date = '2017-05-15';

And next, we need to write a query to find the drug code for Omeprazole:

In [None]:
%%sql

SELECT drug_code
FROM unnormalised_prescription
WHERE drug_name = 'Omeprazole';

The query does not return any data: all the information on Omeprazole has disappeared.

#### End of Activity 3

-------------------------------------------------

### Discussion

Because drugs (and other information) can only exist in the context of prescriptions, we can easily lose information about drugs, patients, and doctors if we remove all references to them from the `unnormalised_prescription` table. 

This is a similar failing to the insertion anomalies above. We want to be able to record facts about drugs (and patients, and doctors) independently of whether they're involved in a prescription. However, the unnormalised data we're using and the constraints on the primary key mean we can't do that. 

Again, this indicates that this data model does not support the tasks we need it for: we need to be able to record information about drugs (and patients and doctors) separately from the prescription of drugs to patients.

This anomaly is closely related to the insertion anomalies above, and has the same solution: a properly normalised database will allow us to delete information about prescriptions without forcing us to delete information about the drugs.
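
To see why the outcome depends on how many rows mention a drug, here is a small sketch of the deletion anomaly. It again uses Python's built-in `sqlite3` module and a hypothetical `demo_prescription` table, and the row for patient `p002` is made up for the illustration. The helper functions `codes_for` and `delete_prescription` are also our own names, not part of the module materials.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE demo_prescription (
        patient_id TEXT NOT NULL,
        prescribing_doctor_id TEXT NOT NULL,
        drug_code TEXT NOT NULL,
        date TEXT NOT NULL,
        drug_name TEXT,
        PRIMARY KEY (patient_id, prescribing_doctor_id, drug_code, date)
    )
""")
conn.executemany(
    "INSERT INTO demo_prescription VALUES (?, ?, ?, ?, ?)",
    [
        ("p001", "d06", "O17663", "2017-05-15", "Omeprazole"),
        ("p001", "d06", "T02378", "2017-05-15", "Tramadol"),
        ("p002", "d06", "T02378", "2017-05-16", "Tramadol"),  # made-up row
    ],
)


def codes_for(drug_name):
    """Return the distinct drug codes recorded against a drug name."""
    rows = conn.execute(
        "SELECT DISTINCT drug_code FROM demo_prescription WHERE drug_name = ?",
        (drug_name,),
    ).fetchall()
    return [code for (code,) in rows]


def delete_prescription(patient_id, doctor_id, drug_code, date):
    """Delete one prescription, identified by its full primary key."""
    conn.execute(
        "DELETE FROM demo_prescription WHERE patient_id = ? "
        "AND prescribing_doctor_id = ? AND drug_code = ? AND date = ?",
        (patient_id, doctor_id, drug_code, date),
    )


# Omeprazole appears in one row only, so deleting that row loses the drug.
delete_prescription("p001", "d06", "O17663", "2017-05-15")
omeprazole_codes = codes_for("Omeprazole")

# Tramadol appears in two rows, so it survives the loss of one of them.
delete_prescription("p001", "d06", "T02378", "2017-05-15")
tramadol_codes = codes_for("Tramadol")

print("Omeprazole codes:", omeprazole_codes)
print("Tramadol codes:", tramadol_codes)
```

Whether a drug's details survive a deletion therefore depends on an accident of the data (how many prescriptions happen to mention it), which is not something a user deleting a single prescription should have to think about.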

## Update anomalies

Finally in this notebook, we will look at how unnormalised data can cause problems when we attempt to update data in an unnormalised table. 

### Activity 4

Remember from notebook *08.2 Data Manipulation Language in SQL*, that the basic form of an `UPDATE` statement for changing data in a table is:

<code>UPDATE &#x2329;table_name&#x232A;
SET column1 = value1, column2 = value2, ...
    WHERE &#x2329;condition&#x232A;;</code>

The `unnormalised_prescription` table contains the following row, which describes a prescription for the drug Tamsulosin which was made out for the patient with identifier `p007` on 19th June, 2017 by the doctor with identifier `d07`:

In [None]:
%%sql 

SELECT *
FROM unnormalised_prescription
WHERE patient_id = 'p007'
    AND prescribing_doctor_id='d07'
    AND drug_name = 'Tamsulosin' 
    AND date = '2017-06-19';

Again, it turns out that this row in the database incorrectly records the details of the prescription: this prescription was actually for Simvastatin, rather than Tamsulosin. Correct the error by writing an `UPDATE` statement to alter this record in the database so that the drug code is set to Simvastatin's code, `S33558`, rather than Tamsulosin's code of `T05223`.

In [None]:
# Write your code in this cell

Based on what you have learned from Activities 1, 2 and 3, what problems do you think may be created by this update on the `unnormalised_prescription` table?

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

For this task, we need to update the misrecorded row in the `unnormalised_prescription` table:

In [None]:
%%sql

UPDATE unnormalised_prescription
SET drug_code = 'S33558'
    WHERE patient_id = 'p007'
        AND prescribing_doctor_id = 'd07'
        AND drug_code = 'T05223'
        AND date = '2017-06-19';

Having made the update, let's look at how the drug information appears in the database.

In [None]:
%%sql 

SELECT DISTINCT drug_code, drug_name
FROM unnormalised_prescription;


There are now two problems: the drug code `S33558` is linked to both Simvastatin and Tamsulosin. Similarly, the drug Tamsulosin is linked to drug codes `S33558` and `T05223`. Our attempt at correcting an error in the database has resulted in ambiguities between the drug codes and drug names.

#### End of Activity 4

--------------------------------------------

### Discussion

Our attempt to update a record in the unnormalised table has resulted in inconsistent data. The combination of drug code and drug name is repeated across several rows in the unnormalised data. If the drug code is updated in just some of those rows, there will be different drug code / drug name combinations in the table.

A similar effect would happen if we tried to update the name for a particular code, but did not update all the associated rows at the same time to maintain consistency.

Fundamentally, this problem occurs because the same information is recorded in several different rows of the table. Updating just some of the rows causes them to become inconsistent with other rows in the table. With different rows holding different information (such as different names for the same drug code, or the same name for two different drug codes), the user doesn't know which is correct.
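
Inconsistencies of this kind can be detected with a grouping query, even if they cannot be resolved automatically. The sketch below uses Python's built-in `sqlite3` module and a hypothetical, cut-down `demo_prescription` table; the patient identifiers `p003` and `p005` are invented for the example, and the third row imitates the state left behind by the partial update in Activity 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE demo_prescription (
        patient_id TEXT,
        drug_code TEXT,
        drug_name TEXT
    )
""")
conn.executemany(
    "INSERT INTO demo_prescription VALUES (?, ?, ?)",
    [
        ("p003", "S33558", "Simvastatin"),  # made-up row
        ("p005", "T05223", "Tamsulosin"),   # made-up row
        ("p007", "S33558", "Tamsulosin"),   # code changed, name left behind
    ],
)

# Drug codes that are paired with more than one drug name.
codes_with_many_names = conn.execute("""
    SELECT drug_code, COUNT(DISTINCT drug_name)
    FROM demo_prescription
    GROUP BY drug_code
    HAVING COUNT(DISTINCT drug_name) > 1
    ORDER BY drug_code
""").fetchall()

# Drug names that are paired with more than one drug code.
names_with_many_codes = conn.execute("""
    SELECT drug_name, COUNT(DISTINCT drug_code)
    FROM demo_prescription
    GROUP BY drug_name
    HAVING COUNT(DISTINCT drug_code) > 1
    ORDER BY drug_name
""").fetchall()

print("Codes with several names:", codes_with_many_names)
print("Names with several codes:", names_with_many_codes)
```

Note that these queries only tell us that something is inconsistent. They cannot tell us which of the conflicting rows holds the correct value.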

_Normalisation_ will tell us how to separate the data into different entities so that information is recorded just once, while still being useful to the database user.

## What next?


This notebook has shown some of the problems with unnormalised data. 

The insertion and deletion anomalies arise because our unnormalised data contains a combination of several entities. As we discussed in Part 8, the `prescription` table contains information about prescriptions, but it also contains data about patients, drugs, and doctors via the primary key. However, in the original Hospital database, most of the information about these entities was divided into separate tables. The structure of the `unnormalised_prescription` table means that we are often prevented from dealing with these entities independently.

The update anomaly occurs because data is repeated across several rows. There are several instances of each drug code and drug name, and of each patient identifier and patient name. When just one of these rows is updated, it can become inconsistent with the other rows in the table. Again, this is because the unnormalised data contains copies of these other entities, rather than splitting them out into different tables.

_Normalisation_ is a principled process for identifying and splitting out all the different entities that exist in a dataset. Once you've normalised the hospital data, you'll see how the anomalies you encountered here are absent when dealing with normalised data.

Chapter 4 of Harrington gives an informal description of how to identify entities. In contrast, normalisation is a formal process of structuring database tables, working from the characteristics of the data and fields themselves.



If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to notebook *10.2 Normalisation - Antique Opticals*.