# <a name="top"></a>Normalisation - Hospital. Our solution


This Notebook shows how to normalise the prescription example from the TM351 running example.

We strongly suggest you often refer back to the [_Antique Opticals_ normalisation example](10.2%20Normalisation%20-%20Antique%20Opticals.ipynb) as you work through this Notebook. All the techniques you need to complete this exercise are given in that example. You're not expected to write wholly new SQL or Pandas statements here; instead, you should be applying the _Antique Opticals_ examples to this case study.

* [Moving to first normal form (1NF)](#1nf)
* [Moving to second normal form (2NF)](#2nf)
* [Moving to third normal form (3NF)](#3nf)

## The data


This is the data you will be normalising. Your task is to move this data from the unnormalised form given below into a collection of relations in 3NF, implemented as a collection of PostgreSQL tables.

You should refer to notebook *10.2 Normalisation - Antique Opticals* for examples of how to carry out each of the steps.

An example form which is the source of the data is shown below.

<img src="images/tm351-patient_record.png" alt="Drawing" style="width: 75%;"/>

The functional dependencies in this example are:

| This attribute | functionally defines this attribute |
| ------------- |:------------- |
| `patient_id`  | `patient_name` |
| `patient_id`  | `doctor_id` |
| `doctor_id`   | `doctor_name`  |
| `drug_code`   |  `drug_name`   |
| (`patient_id`, `prescribing_doctor_id`, `drug_code`, `date`) | `dosage`   |
| (`patient_id`, `prescribing_doctor_id`, `drug_code`, `date`) | `duration` |

You should use the same data as in Notebook 10.1, in which we imported the unnormalised data from the csv file `unnormalised_prescription.csv`, as:

In [None]:
!head unnormalised_prescription.csv

This notebook, *10.4 Our solution to Normalisation - the Hospital scenario*, contains our solution to the exercise.

### When things go wrong

You will almost certainly make mistakes during the process of working through this notebook. When you do, just clear out the database and repeat the steps you know work.

To clear out the database, re-run the database cleanup cell, `%run reset_databases.ipynb`, in the *Setting up* section below (making sure you have an active connection).


## Setting up

The next group of cells sets up your database connection and resets the database to a clean state. Check notebook *08.1 Data Definition Language in SQL* if you are unsure what the next cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb
print("Connecting with connection string : {}".format(DB_CONNECTION))
%sql $DB_CONNECTION

In [None]:
%run reset_databases.ipynb

## Load the data


The next cell loads the unnormalised data into a DataFrame.

The functional dependencies are repeated below it for reference.

In [None]:
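# pandas may already have been imported by sql_init.ipynb; importing it again is harmless.
import pandas as pd

prescriptions_detail = pd.read_csv('unnormalised_prescription.csv', parse_dates=['date'])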
prescriptions_detail

The functional dependencies in this example are:

| This attribute | functionally defines this attribute |
| ------------- |:------------- |
| `patient_id`  | `patient_name` |
| `patient_id`  | `doctor_id` |
| `doctor_id`   | `doctor_name`  |
| `drug_code`   |  `drug_name`   |
| (`patient_id`, `prescribing_doctor_id`, `drug_code`, `date`) | `dosage`   |
| (`patient_id`, `prescribing_doctor_id`, `drug_code`, `date`) | `duration` |

# <a name="1nf"></a> Moving from unnormalised data to first normal form (1NF)
* [Top](#top)

Convert the data above into one or more relations, each in 1NF. Verify that the normalised tables accurately represent the original data. 

One relation should use `patient_id` as its primary key.


Remember the mantra: to be in third normal form, 
> attributes must be dependent on the key, the whole key, and nothing but the key.

Where there are multiple values of an attribute for a key, a _repeating group_, we need to extract the repeating values into a new relation.

More formally, **a relation in 1NF has no repeating groups**.
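One way to spot a repeating group in a DataFrame is to count how many rows, and how many distinct values per column, occur for each value of the intended key. The following is a minimal sketch on a few made-up rows (the toy values are hypothetical and are not taken from `unnormalised_prescription.csv`):

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    'patient_id':   ['p1', 'p1', 'p2'],
    'patient_name': ['Ann', 'Ann', 'Bob'],
    'drug_code':    ['d1', 'd2', 'd1'],
})

# Number of rows recorded against each value of the intended key.
rows_per_patient = toy.groupby('patient_id').size()

# Key values that occur more than once carry a repeating group.
repeating = rows_per_patient[rows_per_patient > 1]
print(repeating)

# Columns that take more than one value for some patient_id
# are the ones that belong to the repeating group.
values_per_patient = toy.groupby('patient_id').nunique()
repeating_columns = [c for c in values_per_patient.columns
                     if c != 'patient_id' and (values_per_patient[c] > 1).any()]
print(repeating_columns)
```

In this toy data, `patient_name` has a single value per patient, so it can stay with `patient_id`, while `drug_code` varies and would move to a new relation.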

## Solution

We want `patient_id` to be the key, but the layout shows there are multiple values for some attributes for each `patient_id`: any given patient might have had several drugs prescribed.

Hence, the `date`, `drug_code`, and so on do not depend on just the `patient_id`. 

That means this dataset is not in 1NF. We move to 1NF by extracting repeating groups into separate relations.

There are two 1NF relations here: 
1. `patients_doctors`, one for each `patient_id`
2. `prescribed_drugs`, several for each `patient_id`.

We'll work out the correct key for `prescribed_drugs` when we've rearranged the data a bit and can see things more clearly.

Let's start by pulling out the patient data into a new DataFrame. 

In [None]:
prescriptions_detail.columns

In [None]:
patients_doctors = prescriptions_detail[['patient_id', 'patient_name', 'doctor_id', 
                                 'doctor_name']].drop_duplicates()
patients_doctors

We can now ask which columns are *candidate keys* for this relation.

In [None]:
for c in patients_doctors.columns:
    print(c + ' : ' + str(patients_doctors[c].is_unique))

`patient_id` seems like it will do the job, which is what we were expecting.

Now we'll extract the `prescribed_drugs` relation in the same way as above: pull out the columns and drop the duplicates.

In [None]:
prescribed_drugs = prescriptions_detail[['patient_id', 'prescribing_doctor_id', 'date',
       'drug_code', 'drug_name', 'dosage', 'duration']].drop_duplicates()
prescribed_drugs

In [None]:
for c in prescribed_drugs.columns:
    print(c + ' : ' + str(prescribed_drugs[c].is_unique))

No column is a candidate key, suggesting we need a composite primary key. Let's try `(patient_id, drug_code)`.

In [None]:
(prescribed_drugs['patient_id'].astype(str) + prescribed_drugs['drug_code'].astype(str)).is_unique

So `(patient_id, drug_code)` doesn't work either. How about if we include the prescribing doctor? 

In [None]:
(prescribed_drugs['patient_id'].astype(str) +
 prescribed_drugs['prescribing_doctor_id'].astype(str) +
 prescribed_drugs['drug_code'].astype(str)).is_unique

So `(patient_id, prescribing_doctor_id, drug_code)` still doesn't work. How about if we include the date? 

(Note the need to convert the date to a string, so it can be appended to the other keys.)
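As an aside, joining key values into one string without a separator can report a false collision when different value pairs happen to produce the same text. A hedged alternative is to let pandas compare the columns as a tuple with `DataFrame.duplicated`, which should also avoid the need to convert dates to strings. The helper name `is_candidate_key` and the toy values below are hypothetical:

```python
import pandas as pd

# Hypothetical toy rows, chosen to show a false collision.
toy = pd.DataFrame({
    'patient_id': ['1', '12'],
    'drug_code':  ['23', '3'],
})

# Concatenating without a separator maps both rows to the string '123'.
concatenated = toy['patient_id'] + toy['drug_code']
print(concatenated.is_unique)

def is_candidate_key(df, columns):
    """Return True if no two rows of df share the same values in columns."""
    return not df.duplicated(subset=columns).any()

# Comparing the two columns as a pair shows the rows are distinct.
print(is_candidate_key(toy, ['patient_id', 'drug_code']))
```

On the real data, the same helper could be called with the list of column names being tested as a composite key.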

In [None]:
(prescribed_drugs['patient_id'].astype(str) +
 prescribed_drugs['prescribing_doctor_id'].astype(str) +
 prescribed_drugs['drug_code'].astype(str) +
 prescribed_drugs['date'].astype(str)).is_unique

That works. We now have two relations in 1NF. Let's make sure we can combine them to recreate the original dataset. As in notebook *10.2 Normalisation - Antique Opticals*, we can use the `.equals()` method to compare the dataframes, remembering to make sure that the two dataframes we're comparing have their columns in the same order:

In [None]:
patients_doctors.merge(prescribed_drugs, on=['patient_id'])

In [None]:
reconstruct_prescriptions_detail=patients_doctors.merge(prescribed_drugs, on=['patient_id'])[prescriptions_detail.columns]

reconstruct_prescriptions_detail

In [None]:
prescriptions_detail.equals(reconstruct_prescriptions_detail[prescriptions_detail.columns])
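Note that `.equals()` is sensitive to row order and index labels as well as to values, so a correct reconstruction can still compare as unequal if its rows come back in a different order. The sketch below shows one possible way to compare two DataFrames while ignoring row order. The helper `same_rows` is a hypothetical name, and the comparison remains sensitive to column dtypes:

```python
import pandas as pd

def same_rows(left, right):
    """Compare two DataFrames, ignoring row order and index labels."""
    if set(left.columns) != set(right.columns):
        return False
    columns = list(left.columns)
    # Sort both frames on every column and discard the old index labels.
    left_sorted = left.sort_values(columns).reset_index(drop=True)
    right_sorted = right[columns].sort_values(columns).reset_index(drop=True)
    return left_sorted.equals(right_sorted)

# Hypothetical toy rows: the same data in reverse row order.
original = pd.DataFrame({'patient_id': ['p1', 'p2'], 'drug_code': ['d1', 'd2']})
shuffled = original.iloc[::-1]

print(original.equals(shuffled))     # row order and index differ
print(same_rows(original, shuffled)) # same set of rows
```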

Now that we have two 1NF relations, let's put them in PostgreSQL for the subsequent steps.

In [None]:
patients_doctors.to_sql('patient_doctor', 
                        DB_CONNECTION,
                        if_exists='replace',
                        index=False)

In [None]:
prescribed_drugs.to_sql('prescribed_drug',
                        DB_CONNECTION,
                        if_exists='replace',
                        index=False)

Add the primary keys, plus a foreign key from `prescribed_drug` to `patient_doctor`, and make sure the DBMS is happy. Because we'll be modifying these tables below, we won't add any further constraints yet.

In [None]:
%%sql

ALTER TABLE patient_doctor
ADD CONSTRAINT patient_doctor_pk
    PRIMARY KEY (patient_id);

ALTER TABLE prescribed_drug
ADD CONSTRAINT prescribed_drug_pk
    PRIMARY KEY (patient_id, prescribing_doctor_id, drug_code, date);

ALTER TABLE prescribed_drug 
ADD CONSTRAINT prescribed_drug_patient_doctor_fk
    FOREIGN KEY (patient_id) REFERENCES patient_doctor;

In [None]:
%%sql 

SELECT *
FROM patient_doctor;

In [None]:
%%sql 

SELECT *
FROM prescribed_drug;

Now we make sure we can recreate the original dataset from the PostgreSQL tables.

(Convenience: get Jupyter to print the column names in a form we can cut-and-paste into the SQL query.)

In [None]:
', '.join(prescriptions_detail)

In [None]:
%%sql prescriptions_detail_recreated <<

SELECT patient_doctor.patient_id, patient_name, doctor_id, doctor_name, prescribing_doctor_id, 
       drug_code, date, drug_name, dosage, duration
FROM patient_doctor, prescribed_drug
WHERE patient_doctor.patient_id = prescribed_drug.patient_id;

Pull that SQL query result into a new DataFrame and give it appropriate column names.

In [None]:
prescriptions_detail_recreated
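How the query result becomes a DataFrame depends on the SQL magic configuration set up in `sql_init.ipynb`, which is not shown here. As a hedged illustration of the general idea, the sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL (an assumption made so that the example is self-contained), with hypothetical toy rows, and reads a join result into a DataFrame with `pd.read_sql_query`:

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(':memory:')

# Hypothetical toy rows, invented for illustration only.
patient_doctor = pd.DataFrame({'patient_id': ['p1', 'p2'],
                               'patient_name': ['Ann', 'Bob']})
prescribed_drug = pd.DataFrame({'patient_id': ['p1', 'p1', 'p2'],
                                'drug_code': ['d1', 'd2', 'd1']})
patient_doctor.to_sql('patient_doctor', conn, index=False)
prescribed_drug.to_sql('prescribed_drug', conn, index=False)

# Column aliases fix the column names of the resulting DataFrame,
# and ORDER BY fixes the row order.
query = """
SELECT patient_doctor.patient_id AS patient_id,
       patient_name AS patient_name,
       drug_code AS drug_code
FROM patient_doctor, prescribed_drug
WHERE patient_doctor.patient_id = prescribed_drug.patient_id
ORDER BY patient_doctor.patient_id, prescribed_drug.drug_code;
"""
recreated = pd.read_sql_query(query, conn)
conn.close()

print(list(recreated.columns))
print(recreated)
```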

And again, check that the recreated dataset is the same as the one we started with.

In [None]:
prescriptions_detail.equals(prescriptions_detail_recreated)

### The current ERD
For interest, this is the ERD of where we are now.

![ERD of first normal form](images/prescription-1nf.png)

# <a name="2nf"></a>Moving to second normal form (2NF)
* [Top](#top)

Convert the 1NF tables you created above into a collection of relations, implemented as PostgreSQL tables, each in 2NF. 

The functional dependencies in this example are:

| This attribute | functionally defines this attribute |
| ------------- |:------------- |
| `patient_id`  | `patient_name` |
| `patient_id`  | `doctor_id` |
| `doctor_id`   | `doctor_name`  |
| `drug_code`   |  `drug_name`   |
| (`patient_id`, `prescribing_doctor_id`, `drug_code`, `date`) | `dosage`   |
| (`patient_id`, `prescribing_doctor_id`, `drug_code`, `date`) | `duration` |

To reiterate, to be in third normal form, 
> attributes must be dependent on the key, the whole key, and nothing but the key.

Formally, **a relation in 2NF has all attributes functionally dependent on the whole of the primary key**.
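A violation of 2NF shows up as a *partial dependency*: an attribute determined by only part of a composite key. The sketch below tests each proper subset of a key against each non-key attribute. The helper names `determines` and `partial_dependencies` and the toy rows are hypothetical. Bear in mind that sample data can only refute a functional dependency, never prove it; the stated dependencies remain the authority.

```python
from itertools import combinations

import pandas as pd

def determines(df, lhs, rhs):
    """True if every combination of lhs values has at most one rhs value in df."""
    return bool((df.groupby(list(lhs))[rhs].nunique() <= 1).all())

def partial_dependencies(df, key, non_key):
    """List (key_subset, attribute) pairs where a proper subset of key determines attribute."""
    found = []
    for size in range(1, len(key)):
        for subset in combinations(key, size):
            for attribute in non_key:
                if determines(df, subset, attribute):
                    found.append((subset, attribute))
    return found

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p2', 'p2'],
    'drug_code':  ['d1', 'd2', 'd1', 'd2'],
    'drug_name':  ['aspirin', 'codeine', 'aspirin', 'codeine'],
    'dosage':     ['10mg', '20mg', '30mg', '40mg'],
})

print(partial_dependencies(toy, ['patient_id', 'drug_code'], ['drug_name', 'dosage']))
```

In the toy data, only `drug_name` is reported, because it is fixed by `drug_code` alone, whereas `dosage` needs both parts of the key.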


## Solution

There is only one relation with a composite key, `prescribed_drug` with key `(patient_id, prescribing_doctor_id, date, drug_code)`.

In [None]:
%%sql

SELECT *
FROM prescribed_drug;

The functional dependencies show that `drug_name` is dependent on just `drug_code`, but `dosage` and `duration` depend on the whole of the primary key.

That suggests we should pull out `drug_code` and `drug_name` into a separate `drug` table, leaving `prescription` with columns `patient_id`, `prescribing_doctor_id`, `date`, `drug_code`, `dosage`, and `duration`, with primary key `(patient_id, prescribing_doctor_id, date, drug_code)`.

First, let's check that there's only one value for `drug_name` for each `drug_code`:

In [None]:
%%sql 

SELECT drug_code
FROM prescribed_drug
GROUP BY drug_code
HAVING COUNT (DISTINCT drug_name) > 1;

There are no values of `drug_code` which have more than one associated value in `drug_name`.

In [None]:
%%sql 
DROP TABLE IF EXISTS drug CASCADE;

CREATE TABLE drug AS
    SELECT DISTINCT drug_code, drug_name
    FROM prescribed_drug;
    
ALTER TABLE drug 
ADD CONSTRAINT drug_pk PRIMARY KEY (drug_code);

SELECT *
FROM drug;

Now we create the `prescription` table and define the foreign keys connecting it to `drug` and `patient_doctor`.

In [None]:
%%sql

DROP TABLE IF EXISTS prescription;

CREATE TABLE prescription AS
    SELECT DISTINCT patient_id, prescribing_doctor_id, date, drug_code, dosage, duration
    FROM prescribed_drug;
    
ALTER TABLE prescription
ADD CONSTRAINT prescription_pk
    PRIMARY KEY (patient_id, prescribing_doctor_id, drug_code, date);

ALTER TABLE prescription 
ADD CONSTRAINT prescription_drug_fk
    FOREIGN KEY (drug_code) REFERENCES drug;
    
ALTER TABLE prescription
ADD CONSTRAINT prescription_patient_doctor_fk
    FOREIGN KEY (patient_id) REFERENCES patient_doctor;

SELECT * 
FROM prescription;

Now check we can recreate the `prescribed_drug` table.

In [None]:
%%sql recreated_prescribed_drugs << 

SELECT patient_id, prescribing_doctor_id, date, prescription.drug_code, drug_name, dosage, duration
FROM prescription, drug
WHERE prescription.drug_code = drug.drug_code
ORDER BY patient_id, prescribing_doctor_id, date, drug_code;

In [None]:
recreated_prescribed_drugs

That seems OK, but let's check formally.

In [None]:
%%sql prescribed_drugs << 

SELECT patient_id, prescribing_doctor_id, date, drug_code, drug_name, dosage, duration
FROM prescribed_drug
ORDER BY patient_id, prescribing_doctor_id, date, drug_code;

(In this case, we know that the columns and rows will be correctly aligned, because the order of columns was determined in the `SELECT` clause, and both dataframes have been `ORDERed BY` the primary key. Therefore we can call `.equals` without needing to reorder `recreated_prescribed_drugs`'s columns.)

In [None]:
prescribed_drugs.equals(recreated_prescribed_drugs)

Success!

### Cleanup
Let's remove the `prescribed_drug` table.

In [None]:
%%sql

DROP TABLE prescribed_drug;

### The current ERD
For interest, this is the ERD of where we are now.

![ERD of second normal form](images/prescription-2nf.png)

# <a name="3nf"></a>Moving to Third Normal Form (3NF)
* [Top](#top)


Convert the 2NF tables you created above into a collection of relations, implemented as PostgreSQL tables, each in 3NF. 

The functional dependencies in this example are:

| This attribute | functionally defines this attribute |
| ------------- |:------------- |
| `patient_id`  | `patient_name` |
| `patient_id`  | `doctor_id` |
| `doctor_id`   | `doctor_name`  |
| `drug_code`   |  `drug_name`   |
| (`patient_id`, `prescribing_doctor_id`, `drug_code`, `date`) | `dosage`   |
| (`patient_id`, `prescribing_doctor_id`, `drug_code`, `date`) | `duration` |

To be in third normal form, 
> attributes must be dependent on the key, the whole key, and nothing but the key.

To move to third normal form (3NF), we have to ensure the final clause of that mantra ("nothing but the key"): each attribute is directly functionally dependent on the key, and not functionally dependent on any other attribute. As before, we ensure this is true by splitting relations as necessary, while ensuring that all relations remain in 2NF (and hence also in 1NF). 

Formally, **a relation in 3NF has all attributes _directly_ functionally dependent on the whole of the primary key**.

## Solution

All the functional dependencies are direct, except for `doctor_name`. In `patient_doctor`, this field depends on the key `patient_id` only transitively: `patient_id` determines `doctor_id`, and `doctor_id` determines `doctor_name`, but `doctor_id` is not a key anywhere. That means we should split the `patient_doctor` table into `patient` and `doctor` tables.
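Before doing this in SQL, here is a minimal pandas sketch of the same idea on a few made-up rows (the values are hypothetical): confirm that a non-key column determines another non-key column, split on that column, and check that a join restores the original rows.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
patient_doctor = pd.DataFrame({
    'patient_id':   ['p1', 'p2', 'p3'],
    'patient_name': ['Ann', 'Bob', 'Cat'],
    'doctor_id':    ['d1', 'd1', 'd2'],
    'doctor_name':  ['Dr Gray', 'Dr Gray', 'Dr Hale'],
})

# doctor_id is not the key, yet each doctor_id has exactly one doctor_name.
names_per_doctor = patient_doctor.groupby('doctor_id')['doctor_name'].nunique()
print((names_per_doctor == 1).all())

# Split the relation on the determining column.
doctor = patient_doctor[['doctor_id', 'doctor_name']].drop_duplicates()
patient = patient_doctor[['patient_id', 'patient_name', 'doctor_id']].drop_duplicates()

# Rejoin, restore the column and row order, and compare with the original.
rejoined = patient.merge(doctor, on='doctor_id')[list(patient_doctor.columns)]
rejoined = rejoined.sort_values('patient_id').reset_index(drop=True)
print(rejoined.equals(patient_doctor))
```

After the split, each doctor's name is stored once in `doctor`, rather than once per patient.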

First, the `doctor`. Is there just one name for each `doctor_id`?

In [None]:
%%sql

SELECT doctor_id
FROM patient_doctor
GROUP BY doctor_id
HAVING COUNT (DISTINCT doctor_name) > 1;

Looks good. Now to create the `doctor` table.

In [None]:
%%sql
DROP TABLE IF EXISTS doctor CASCADE;

CREATE TABLE doctor AS
    SELECT DISTINCT doctor_id, doctor_name
    FROM patient_doctor;
    
ALTER TABLE doctor
ADD CONSTRAINT doctor_pk
    PRIMARY KEY (doctor_id);

SELECT *
FROM doctor;

Now the `patient`. 

In [None]:
%%sql

DROP TABLE IF EXISTS patient CASCADE;

CREATE TABLE patient AS
    SELECT DISTINCT patient_id, patient_name, doctor_id
    FROM patient_doctor;

ALTER TABLE patient 
ADD CONSTRAINT patient_pk
    PRIMARY KEY (patient_id);

ALTER TABLE patient
ADD CONSTRAINT patient_doctor_fk 
    FOREIGN KEY (doctor_id) REFERENCES doctor;

SELECT *
FROM patient;

Can we recreate the `patient_doctor` table from the normalised tables?

In [None]:
%%sql recreated_patient_doctor <<

SELECT patient_id, patient_name, doctor.doctor_id, doctor_name
FROM patient, doctor
WHERE doctor.doctor_id = patient.doctor_id
ORDER BY patient_id;

In [None]:
recreated_patient_doctor

This looks good, but again, let's check.

In [None]:
%%sql patient_doctor <<

SELECT patient_id, patient_name, doctor_id, doctor_name
FROM patient_doctor
ORDER BY patient_id;

In [None]:
recreated_patient_doctor.equals(patient_doctor)

Success!

We can now tidy up the foreign key constraints so that `prescription` references `patient` (rather than `patient_doctor`), and then drop the `patient_doctor` table.

In [None]:
%%sql

ALTER TABLE prescription 
DROP CONSTRAINT prescription_patient_doctor_fk;

ALTER TABLE prescription
ADD CONSTRAINT prescription_patient_fk 
    FOREIGN KEY (patient_id) REFERENCES patient;
    
DROP TABLE patient_doctor;

And finally, can we recreate the original dataset from the normalised tables?

In [None]:
%%sql recreated_prescription_details <<

SELECT patient.patient_id, patient_name, doctor.doctor_id, doctor_name, 
       prescribing_doctor_id, drug.drug_code, date, drug_name, dosage, duration
FROM patient, doctor, prescription, drug
WHERE doctor.doctor_id = patient.doctor_id 
    AND prescription.patient_id = patient.patient_id
    AND prescription.drug_code = drug.drug_code
ORDER BY patient_id, prescribing_doctor_id, date, drug_code;

In [None]:
recreated_prescription_details

In [None]:
prescriptions_detail.equals(recreated_prescription_details)

Success! We have successfully normalised the data associated with the prescription form.

### The current ERD
For interest, this is the ERD of where we are now.

![ERD of third normal form](images/prescription-3nf.png)

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to notebook *10.5 Improvements with normalised data*.