# <a name="top"></a>Normalisation - Antique Opticals: 45 minutes to scan: most time taken in 10.3


This Notebook will walk you through the process of normalisation, using the example of _Antique Opticals_ taken from Harrington. The example is based on normalising the order data shown in Harrington figure 3.2, but extended to show all the steps moving to third normal form (3NF). Briefly, the data represents a set of customer orders for DVDs. See the module material for more information on this case study.

You should follow this notebook to understand the steps undertaken in normalisation, but you need not attempt to reimplement the steps here. Instead, **treat this notebook as a reference guide**. As you're working through and understanding this Notebook, you may also find it useful to refer to Harrington chapter 7 for the fine details of normalisation and normal forms. However, the key ideas behind the normal forms are repeated here.

In Notebook 10.3, you will normalise the Prescription running example, following a similar process to here. While you work through Notebook 10.3, refer back to this notebook for how to address the tasks in normalisation.

* [Moving to first normal form (1NF)](#1nf)
* [Moving to second normal form (2NF)](#2nf)
* [Moving to third normal form (3NF)](#3nf)
* [Discussion](#discussion)

## Setting up

The next group of cells set up your database connection, and reset the database to a clean state. Check notebook *08.1 Data Definition Language in SQL* if you are unsure what the next cells do.

You may need to change the given values of the variables `DB_USER` and `DB_PWD`, depending on which environment you are using.

In [None]:
# Make the connection

%run sql_init.ipynb
print("Connecting with connection string : {}".format(DB_CONNECTION))
%sql $DB_CONNECTION

In [None]:
%run reset_databases.ipynb

## <a name="1nf"></a> Moving from unnormalised data to first normal form (1NF)
* [Top](#top)

Remember the mantra: to be in third normal form, 

> attributes must be dependent on the key, the whole key, and nothing but the key.

This is our final destination, but first we need to move to first normal form (1NF). To be in 1NF, we have to ensure that the final clause of the mantra is true: each attribute must depend on nothing but the key. In other words, for one value of a given key in a relation, there must be exactly one value for each attribute. If there are multiple values for each key, it means the attribute value depends on more than just the key.

Where there are multiple values of an attribute for a key, a _repeating group_, we need to extract the repeating values into a new relation.

More formally, **a relation in 1NF has no repeating groups**.
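To make the idea of a repeating group concrete, here is a minimal sketch in plain Python. The record, its values and the variable names are hypothetical and are not taken from the _Antique Opticals_ data. It shows one customer holding a list of orders, and how moving that list into its own relation leaves exactly one value per attribute per key.

```python
# Hypothetical unnormalised record: one customer with a repeating group of orders.
unnormalised = {
    "customer_number": 1,
    "surname": "Example",
    "orders": [  # the repeating group: several values for one key
        {"order_number": 101, "order_date": "2024-01-05"},
        {"order_number": 102, "order_date": "2024-02-10"},
    ],
}

# The customer relation keeps only the single-valued attributes.
customer = {k: v for k, v in unnormalised.items() if k != "orders"}

# The repeating group becomes its own relation, carrying the key that links back.
orders = [
    {"customer_number": unnormalised["customer_number"], **order}
    for order in unnormalised["orders"]
]

print(customer)
print(orders)
```

Each row of `orders` now has a single `order_date`, and `customer_number` plays the role of a foreign key back to `customer`.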


As an implementation note, we'll be working initially with DataFrames before moving to PostgreSQL. The tables in SQL databases, like PostgreSQL, are based on relations which are, by definition, in first normal form (1NF). Therefore, SQL databases have difficulty representing truly unnormalised data where there can be more than one value for each attribute for a given key. We'll handle the unnormalised data in a DataFrame, then move the data into PostgreSQL once it's in 1NF. 

(Actually, most RDBMSs allow for non-primitive data in table cells, such as storing lists of numbers, but we'll not go into that here for simplicity of presentation.)


Let's start by loading some sample data into a DataFrame and see where we are. We have created an initial CSV file containing some data for the Antique Opticals database in the file `antique-opticals.csv`, which we have included in this folder.

As in notebook 10.1, we will use pandas' `.read_csv()` function to import the data from the CSV file into a dataframe:

In [None]:
import pandas as pd  # harmless if the set-up cells have already imported pandas
orders_detail_df = pd.read_csv('antique-opticals.csv')
orders_detail_df

It's quite wide, so let's look at the first few rows. We can use the `.T` attribute to transpose these rows and make the data a bit more readable:

In [None]:
# Show the first 5 rows of orders_detail_df, transposed

orders_detail_df.head().T

Our understanding of the domain suggests that there should be a `customer` relation with `customer_number` as the key, but the layout shows there are multiple values for some attributes for each `customer_number`. For example, we can see that Reed Calderon made two orders, on 29th July and 12th November, and there were two disks in each order. 

Hence, the `order_date` does not depend on just the `customer_number`. Nor does `title` depend on just the `customer_number`, and `title` does not depend on just the `order_number` either.

That means this dataset is not in 1NF.

In order to move to 1NF, we need to extract the repeating groups into separate relations.

There are three 1NF relations here: 
1. `customer`, with one row for each `customer_number`,
2. `order`, with one row for each `order_number`, and
3. `order_item`, with several rows for each `order_number`.

We'll work out the correct key for `order_item` when we've rearranged the data a bit and can see things more clearly.


### The `customer` relation

Let's start by pulling out the customer data into a new DataFrame. We can create a new DataFrame, `customers_df`, by pulling out the columns that we expect to appear in `customers_df`, which are `customer_number`, `first_name`, `surname`, `street`, `postcode` and `phone_number`:

In [None]:
customers_df = orders_detail_df[
    ['customer_number', 'first_name', 'surname', 'street', 'postcode', 'phone_number']
]

customers_df.head()

Unfortunately, this is a bit messy, with duplication of rows where a customer created several orders. We can remove these repeated rows with the `.drop_duplicates()` method. Note that as used here, `.drop_duplicates()` returns a new dataframe, rather than making the update in-place.

In [None]:
# Use the drop_duplicates method to remove repeated rows

customers_df=customers_df.drop_duplicates()
customers_df

As you learned in Part 8, each relation needs to have a *primary key*, which is unique for each row in the relation. pandas' Series have an attribute `.is_unique`, which is True if all the members of the series are different, and False otherwise.

We can use this attribute to determine which columns are candidate keys for the relation (remember from [Part 8, section 4](https://learn2.open.ac.uk/mod/oucontent/olinkremote.php?website=TM351&targetdoc=Part%208%20Introduction%20to%20relational%20databases&targetptr=4) that a candidate key is "a column (or combination of columns) which uniquely identifies each row"):

In [None]:
for c in customers_df.columns:
    print(c + ' : ' + str(customers_df[c].is_unique))

The columns `customer_number`, `street`, `postcode` and `phone_number` all contain unique values, so any of these columns could form the primary key.

In fact, `customer_number` looks as though it will do the job, which is what we were expecting.


### The `orders` relation

Now we'll extract the `orders` relation in the same way as above: pull out the columns, and drop the duplicates.

First, we'll pull out the columns. For the `orders` relation, the columns we want are `order_number`, `customer_number` and `order_date`.

In [None]:
# Select the appropriate columns from orders_detail

orders_df = orders_detail_df[['order_number', 'customer_number', 'order_date']]

# and remove any duplicated rows:

orders_df=orders_df.drop_duplicates()
orders_df

What could the primary key be for `orders`?

In [None]:
for c in orders_df.columns:
    print(c + ' : ' + str(orders_df[c].is_unique))

`order_number` seems like a good candidate key for this relation.

### The `order_items` relation

Finally, we need to extract the `order_items` relation. As before, specify the columns, and then drop the duplicated rows:

In [None]:
# Select the appropriate columns from orders_detail_df
order_items_df = orders_detail_df[
    ['order_number', 'item_number', 'title', 'price', 'item_dispatched', 'distributor_id', 'distributor']]

# and remove any duplicated rows:
order_items_df=order_items_df.drop_duplicates()

order_items_df

So what should we use as a primary key for `order_items`? Again, let's look for columns which do not contain any duplicates:

In [None]:
for c in order_items_df.columns:
    print(c + ' : ' + str(order_items_df[c].is_unique))

In this case, no single column is a candidate key, implying that we need a combination of columns to make a composite key. Let's try `(order_number, item_number)` (note how we convert both series to strings using the `.astype(str)` method):

In [None]:
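# Join the two columns with a separator so that distinct pairs such as (1, 23) and (12, 3) cannot collide
(order_items_df['order_number'].astype(str) + '-' + order_items_df['item_number'].astype(str)).is_unique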

The `.is_unique` attribute is now True. The combination of `order_number` and `item_number` contains no duplicates, and is therefore a candidate key.
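Building a string from the key columns is one way to test a composite key. An alternative sketch, which avoids string handling altogether, uses pandas' `DataFrame.duplicated()` with its `subset` argument. The helper name `is_candidate_key` and the toy data are hypothetical. The toy data is chosen so that joining the digits with no separator makes two different pairs look identical.

```python
import pandas as pd

# Hypothetical data: (1, 23) and (12, 3) are different (order, item) pairs.
toy_df = pd.DataFrame({
    "order_number": [1, 12, 2],
    "item_number": [23, 3, 23],
})

def is_candidate_key(df, columns):
    """Return True if no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()

# Joining the digits with no separator turns both pairs into '123'.
joined = toy_df["order_number"].astype(str) + toy_df["item_number"].astype(str)
print(joined.is_unique)                                            # False
print(is_candidate_key(toy_df, ["order_number", "item_number"]))   # True
print(is_candidate_key(toy_df, ["item_number"]))                   # False
```

As with `.is_unique`, a True result only tells us that the sample data contains no duplicates. Whether the columns really form a key is a question about the domain.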

### Recreating the original table

We now have three dataframes representing relations in 1NF: `customers_df`, `orders_df`, and `order_items_df`. Let's make sure we can combine them to recreate the original table. 

([`merge`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) is covered in notebook *3.3 Combining data from multiple datasets*; the syntax is 

<code>this_dataframe.merge(that_dataframe, on=['column_1', 'column_2', ...])</code>

This does an inner join on `this_dataframe` and `that_dataframe`, using the columns specified. The columns must be present in both DataFrames for this to work. Should you wish to use some kind of outer join, use the `how` keyword argument.)
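As a small illustration of the `how` keyword, here is a sketch using two made-up DataFrames (the names and values are hypothetical, not taken from `antique-opticals.csv`). The `indicator=True` argument adds a `_merge` column that records which side each row came from, which makes unmatched rows easy to spot.

```python
import pandas as pd

# Hypothetical data: customer 3 has placed no orders.
customers = pd.DataFrame({"customer_number": [1, 2, 3],
                          "surname": ["Ash", "Birch", "Cedar"]})
orders = pd.DataFrame({"order_number": [10, 11, 12],
                       "customer_number": [1, 1, 2]})

# The default is how='inner': customers without orders are dropped.
inner = customers.merge(orders, on=["customer_number"])

# A left join keeps every customer; unmatched rows are marked 'left_only'.
left = customers.merge(orders, on=["customer_number"], how="left", indicator=True)

print(len(inner))   # 3
print(left[left["_merge"] == "left_only"])
```

For the recombination in this notebook an inner join is enough, because every customer in the sample has at least one order.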

In [None]:
recombined_customers_df=customers_df.merge(orders_df, on=['customer_number']).merge(order_items_df, on=['order_number'])

recombined_customers_df

It looks OK, but are all the values the same? We want to compare with the original dataframe, `orders_detail_df`. For this, we can use the `.equals()` method on a dataframe. However, there's a small gotcha: the columns need to be in the same order in both dataframes. Therefore, we need to reorder the columns when we call `.equals()`. In the next cell, the expression

<code>recombined_customers_df[orders_detail_df.columns]</code>

returns the dataframe `recombined_customers_df`, but with the columns in the same order as the columns in `orders_detail_df`. This allows us to do the equality test:

In [None]:
orders_detail_df.equals(recombined_customers_df[orders_detail_df.columns])

The equality check should have returned True, showing that the dataframes `customers_df`, `orders_df` and `order_items_df` can be reconstructed into the dataframe `orders_detail_df`.
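Column order is not the only thing `.equals()` is sensitive to: it also compares row order, index labels and column dtypes. If a comparison unexpectedly returns False, a helper along the following lines may help. This is a sketch under our own assumptions; the function name `same_content` and the toy data are hypothetical.

```python
import pandas as pd

def same_content(left, right, sort_by):
    """Compare two DataFrames, ignoring column order, row order and index labels."""
    right = right[left.columns]                               # align column order
    left = left.sort_values(sort_by).reset_index(drop=True)   # align row order and index
    right = right.sort_values(sort_by).reset_index(drop=True)
    return left.equals(right)

# Same rows, but with columns and rows in a different order.
a = pd.DataFrame({"order_number": [1, 1, 2], "item_number": [5, 6, 5]})
b = pd.DataFrame({"item_number": [5, 5, 6], "order_number": [2, 1, 1]})

print(a.equals(b))                                           # False
print(same_content(a, b, ["order_number", "item_number"]))   # True
```

The helper does not change dtypes, so an integer column compared with a float column holding the same numbers would still be reported as different.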

### Putting the 1NF relations into PostgreSQL

Now that we have three 1NF relations, let's put them in PostgreSQL for the subsequent steps.

Note that we have to call the `order` table `disk_order`, as `ORDER` itself is an SQL reserved word.

For each of the dataframes, we will create a table in PostgreSQL, and add the primary key that we have identified, and appropriate foreign keys. Check Notebook *09.1: Defining Foreign Keys in SQL* if you need to remind yourself how foreign keys are defined in SQL.

First, create the `customer` table from the `customers_df` dataframe:

In [None]:
customers_df.to_sql('customer',
                    DB_CONNECTION,
                    if_exists='replace',
                    index=False)

and add the primary key:

In [None]:
%%sql

ALTER TABLE customer 
ADD CONSTRAINT customer_pk
    PRIMARY KEY (customer_number);

Next, create the `disk_order` table from the `orders_df` dataframe:

In [None]:
orders_df.to_sql('disk_order',
                 DB_CONNECTION,
                 if_exists='replace',
                 index=False)

and add the primary key and a foreign key referencing the `customer` table:

In [None]:
%%sql

ALTER TABLE disk_order
ADD CONSTRAINT disk_order_pk
    PRIMARY KEY (order_number);

ALTER TABLE disk_order
ADD CONSTRAINT disk_order_customer_fk 
    FOREIGN KEY (customer_number) REFERENCES customer;

Finally, create the `order_item` table from the `order_items_df` dataframe:

In [None]:
order_items_df.to_sql('order_item',
                      DB_CONNECTION,
                      if_exists='replace',
                      index=False)

and add the composite primary key and a foreign key referencing the `disk_order` table:

In [None]:
%%sql

ALTER TABLE order_item 
ADD CONSTRAINT order_item_pk
    PRIMARY KEY (order_number, item_number);

ALTER TABLE order_item
ADD CONSTRAINT order_item_disk_order_fk 
    FOREIGN KEY (order_number) REFERENCES disk_order;

### Check the PostgreSQL tables

Let's check that we can extract the data from the database tables. Start with simple queries on each of the `customer`, `disk_order` and `order_item` tables:

In [None]:
%%sql 

SELECT *
FROM customer;

In [None]:
%%sql

SELECT * 
FROM disk_order;

In [None]:
%%sql

SELECT *
FROM order_item;

Now we make sure we can recreate the original dataset from the PostgreSQL tables.

(Convenience: get Jupyter to print the column names in a form we can cut-and-paste into the SQL query.)

In [None]:
', '.join(orders_detail_df)

Note how we're using the `<<` notation of SQL Magic to put the results of the query into a Python variable. We've also told SQL magic to convert SQL result sets into pandas DataFrames for us. (See the section *Executing SQL queries* in  Notebook *08.1 Data Definition Language in SQL.ipynb* to see the use of the `<<` notation.)

In [None]:
%%sql orders_detail_recreated <<

SELECT customer.customer_number, first_name, surname, street, postcode, phone_number, 
    disk_order.order_number, order_date, item_number, title, price, item_dispatched, 
    distributor_id, distributor
FROM customer, disk_order, order_item
WHERE customer.customer_number = disk_order.customer_number
    AND disk_order.order_number = order_item.order_number
ORDER BY customer_number, order_number, item_number;

In [None]:
orders_detail_recreated

And again, check that the recreated dataset is the same as the one we started with.

In [None]:
orders_detail_df.equals(orders_detail_recreated)

### Activity 1
Draw the ERD of these three relations: `customer`, `disk_order`, and `order_item`.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

<img src="images/antique-opticals-1nf.png" alt="Antique Opticals 1NF ERD" style="width: 50%;"/>

The participation conditions depend on the specifics of the domain. It seems reasonable that every order item is part of an order. It seems reasonable (though not necessarily so) that every order contains at least one item. It seems reasonable that every order relates to a customer, but that not every customer must have an order. However, in a real situation, you'd need to check these assumptions against the actual business rules.

#### End of Activity 1

-------------------------------------------

## <a name="2nf"></a>Moving to second normal form (2NF)
* [Top](#top)


To reiterate, to be in third normal form, 
> attributes must be dependent on the key, the whole key, and nothing but the key.

We have relations in 1NF. To move to second normal form (2NF), we have to ensure the second clause of that mantra: each attribute depends on all elements of a composite primary key (while all relations remain in 1NF). 

"Depends" in this context means _functionally dependent_: attribute *a* depends on attribute *b* if, when we know the value of *b*, there is precisely one value of *a*. Which functional dependencies hold for a given dataset depends on the real-world context, and is not something that can be gleaned from the data alone.

For example, for _Antique Opticals_, `surname` is functionally dependent on `customer_number`: if we know the `customer_number`, we know the `surname`. The reverse is not true: if we know the `surname`, we don't necessarily know the `customer_number` (consider the `surname` "Blankley" in the _Antique Opticals_ example). 
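Although data cannot prove that a functional dependency holds, it can show that one does not. The following sketch, in plain Python, checks whether a proposed dependency is consistent with a list of rows. The function name `holds` and the customer numbers are hypothetical; only the surname is borrowed from the example above.

```python
def holds(rows, determinant, dependent):
    """Check whether `determinant -> dependent` is consistent with `rows`.

    `rows` is a list of dicts. A False result refutes the dependency;
    a True result only means this data does not contradict it.
    """
    seen = {}
    for row in rows:
        key = row[determinant]
        # Remember the first dependent value seen for each determinant value.
        if seen.setdefault(key, row[dependent]) != row[dependent]:
            return False
    return True

# Hypothetical rows: two different customers share a surname.
rows = [
    {"customer_number": 1, "surname": "Blankley"},
    {"customer_number": 2, "surname": "Blankley"},
    {"customer_number": 1, "surname": "Blankley"},
]

print(holds(rows, "customer_number", "surname"))  # True
print(holds(rows, "surname", "customer_number"))  # False
```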

Formally, **a relation in 2NF has all attributes functionally dependent on the whole of the primary key**.



The functional dependencies in this example are:

| This attribute | functionally determines this attribute |
| ------------- |:------------- |
| `customer_number` | `first_name` |
| `customer_number` | `surname` |
| `customer_number` | `street` |
| `customer_number` | `postcode` |
| `customer_number` | `phone_number` |
| `order_number` | `order_date` |
| `order_number` | `customer_number` |
| `item_number` | `title` |
| `item_number` | `price` |
| `item_number` | `distributor_id`|
| (`order_number`, `item_number`) | `item_dispatched` |
| `distributor_id` | `distributor` |

In moving to 2NF, we extract the attributes which are dependent on part of the primary key into new relations, such that in each relation, the attributes are dependent on all parts of the key.

We only need consider relations with composite primary keys; relations with simple primary keys are automatically in 2NF.

In the _Antique Opticals_ example, there is only one relation (table) with a composite key: `order_item`, with key `(order_number, item_number)`.

In [None]:
%%sql 

SELECT *
FROM order_item;

In this relation, most attributes describe the item, not the particular state of the ordered item. The functional dependencies tell us that the `title`, `price`, `distributor_id` and `distributor` depend on the `item_number` alone. `item_dispatched` depends on the combination of `order_number` and `item_number`. No attributes in this relation depend on the `order_number` alone.

That means we should split `order_item` into two relations: one with the existing composite primary key holding just the `item_dispatched` attribute, with all other attributes moving to an `item` relation.
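The reasoning above can be sketched as a small classification in plain Python. The dictionary encodes the functional dependencies listed in the table earlier in this section that involve `order_item`; the variable names and the wording of the labels are our own, for illustration only.

```python
# Functional dependencies relevant to order_item:
# determinant (a tuple of attributes) -> dependent attributes.
dependencies = {
    ("item_number",): ["title", "price", "distributor_id"],
    ("order_number", "item_number"): ["item_dispatched"],
    ("distributor_id",): ["distributor"],
}
primary_key = {"order_number", "item_number"}

classification = {}
for determinant, dependents in dependencies.items():
    if set(determinant) == primary_key:
        kind = "full dependency on the key"
    elif set(determinant) < primary_key:        # proper subset of the key
        kind = "partial dependency (violates 2NF)"
    else:
        kind = "depends on a non-key attribute (a 3NF concern)"
    classification[determinant] = kind
    print(determinant, dependents, kind)
```

Only the attributes determined by `item_number` alone are partial dependencies, so they are the ones that move out of `order_item` at this step. The `distributor` attribute travels with `distributor_id` for now and is dealt with when we move to 3NF.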

First, let's check that there's only one value of `title`, `price`, `distributor_id` and `distributor` for each `item_number`. The `DISTINCT` keyword means we only count how many _different_ values there are. The `GROUP BY` means we calculate the counts for each `item_number`.

In [None]:
%%sql

SELECT item_number, COUNT(DISTINCT title) AS n_title,  COUNT(DISTINCT price) AS n_price,  
    COUNT(DISTINCT distributor_id) AS n_dist_id,  COUNT(DISTINCT distributor) AS n_dist
FROM order_item
GROUP BY item_number;

Looks good, but if there were many items, we shouldn't rely on someone scanning down the list. Let's ask the database for any items that have more than one value for any of these attributes (using `HAVING` to filter the groups). An empty result means the dependencies hold in this data.

In [None]:
%%sql

SELECT item_number
FROM order_item
GROUP BY item_number
HAVING COUNT(DISTINCT title) > 1 OR COUNT(DISTINCT price) > 1 OR  
    COUNT(DISTINCT distributor_id) > 1 OR COUNT(DISTINCT distributor) > 1;

All seems fine, so now let's create the two new tables.
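To see what this kind of check reports when a dependency is violated, here is a self-contained sketch. Note the assumptions: it uses an in-memory SQLite database through Python's standard `sqlite3` module rather than PostgreSQL, and the table contents are made up so that item 200 appears with two different titles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_item (order_number INTEGER, item_number INTEGER, title TEXT)"
)
# Hypothetical rows: item 200 has two different titles, so item_number -> title fails.
conn.executemany(
    "INSERT INTO order_item VALUES (?, ?, ?)",
    [(1, 100, "Title A"), (2, 100, "Title A"), (3, 200, "Title B"), (4, 200, "Title C")],
)

violations = conn.execute(
    """
    SELECT item_number
    FROM order_item
    GROUP BY item_number
    HAVING COUNT(DISTINCT title) > 1
    """
).fetchall()

print(violations)  # [(200,)]
conn.close()
```

A non-empty result like this would tell us that the data needs cleaning, or that our understanding of the dependency is wrong, before the table can be split.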

Note the `CREATE TABLE ... AS SELECT ...` notation. As we saw in notebook *08.2 Data Manipulation Language in SQL*, this creates a new table and immediately populates it. The column names come from how they appear in the result of the `SELECT`; the column types come from the source table; and the values inserted into the new table come from the results of the query. We use `DISTINCT` to prevent taking multiple rows where an item has been ordered more than once.

We also create the primary key on this table, and quickly check that it looks sensible.

In [None]:
%%sql 

DROP TABLE IF EXISTS distributor_item CASCADE;

CREATE TABLE distributor_item AS
    SELECT DISTINCT item_number, title, price, distributor_id, distributor
    FROM order_item;
    
ALTER TABLE distributor_item
ADD CONSTRAINT distributor_item_pk
    PRIMARY KEY (item_number);

SELECT * 
FROM distributor_item;

We can now create the `order_line` table in much the same way. We also create the primary key constraint and the two foreign key constraints, connecting `order_line` to `disk_order` and `distributor_item`.

In [None]:
%%sql

DROP TABLE IF EXISTS order_line;

CREATE TABLE order_line AS
    SELECT DISTINCT order_number, item_number, item_dispatched
    FROM order_item;
    
ALTER TABLE order_line 
ADD CONSTRAINT order_line_pk
    PRIMARY KEY (order_number, item_number);

ALTER TABLE order_line 
ADD CONSTRAINT order_line_distributor_item_fk 
    FOREIGN KEY (item_number) REFERENCES distributor_item;

ALTER TABLE order_line
ADD CONSTRAINT order_line_disk_order_fk 
    FOREIGN KEY (order_number) REFERENCES disk_order (order_number);
    
SELECT * 
FROM order_line;

Now check we can recreate the `order_item` table.

In [None]:
%%sql recreated_order_items << 

SELECT order_line.order_number, order_line.item_number, title, price, item_dispatched, 
    distributor_id, distributor 
FROM order_line, distributor_item
WHERE order_line.item_number = distributor_item.item_number
ORDER BY order_number, item_number;

In [None]:
recreated_order_items

That seems OK, but let's check formally.

In [None]:
%%sql order_items << 

SELECT order_number, item_number, title, price, item_dispatched, 
    distributor_id, distributor 
FROM order_item
ORDER BY order_number, item_number;

In [None]:
pd.DataFrame(recreated_order_items)

We can now compare the result sets using the `.equals()` method:

In [None]:
order_items.equals(recreated_order_items)

Success!

### Clean up
Finally, let's get rid of that `order_item` table.

In [None]:
%%sql

DROP TABLE order_item;

### Activity 2

Draw an ERD of the relations `customer`, `disk_order`, `order_line`, and `distributor_item`.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

<img src="images/antique-opticals-2nf.png" alt="Antique Opticals 2NF ERD" style="width: 50%;"/>

Again, the participation conditions depend on the domain. While every order line must be for some item, it's not clear if _Antique Opticals_ can hold items without an order. 

#### End of Activity 2

--------------------------------------------------------

## <a name="3nf"></a>Moving to Third Normal Form (3NF)
* [Top](#top)



To be in third normal form, 
> attributes must be dependent on the key, the whole key, and nothing but the key.

To move to third normal form (3NF), we have to ensure the first clause of that mantra: each attribute is directly functionally dependent on the key, and not functionally dependent on any other attribute. As before, we ensure this is true by splitting relations as necessary, while ensuring that all relations remain in 2NF (and hence also in 1NF). 

Formally, **a relation in 3NF has all attributes _directly_ functionally dependent on the whole of the primary key**.



To reiterate, the functional dependencies in a dataset are defined by the real-world context. The functional dependencies in this example are:

| This attribute | functionally determines this attribute |
| ------------- |:------------- |
| `customer_number` | `first_name` |
| `customer_number` | `surname` |
| `customer_number` | `street` |
| `customer_number` | `postcode` |
| `customer_number` | `phone_number` |
| `order_number` | `order_date` |
| `order_number` | `customer_number` |
| `item_number` | `title` |
| `item_number` | `price` |
| `item_number` | `distributor_id`|
| (`order_number`, `item_number`) | `item_dispatched` |
| `distributor_id` | `distributor` |

In the _Antique Opticals_ domain, all attributes are directly dependent on their respective primary keys apart from the `distributor`. This is directly dependent on the `distributor_id` attribute, not the `item_number`. 
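The effect of removing this transitive dependency can be sketched in plain Python before we do it in SQL. The rows below are hypothetical stand-ins shaped like the `distributor_item` table (with `price` left out for brevity), not the real data.

```python
# Hypothetical rows shaped like distributor_item.
distributor_item = [
    {"item_number": 1, "title": "Title A", "distributor_id": 10, "distributor": "Dist X"},
    {"item_number": 2, "title": "Title B", "distributor_id": 10, "distributor": "Dist X"},
    {"item_number": 3, "title": "Title C", "distributor_id": 20, "distributor": "Dist Y"},
]

# distributor depends on distributor_id, so it moves to its own relation,
# keyed by distributor_id and stored once per distributor.
distributor = {row["distributor_id"]: row["distributor"] for row in distributor_item}

# item keeps distributor_id as a foreign key but drops the distributor name.
item = [{k: v for k, v in row.items() if k != "distributor"} for row in distributor_item]

# Joining on distributor_id recovers the original rows.
rejoined = [{**row, "distributor": distributor[row["distributor_id"]]} for row in item]

print(distributor)
print(rejoined == distributor_item)
```

Each distributor name is now stored in one place, so correcting a name means changing one row rather than every item that distributor supplies.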

Now we can create the new `item` and `distributor` tables, pulling data from the `distributor_item` table.

First, the `distributor`. Check that there is only one `distributor` for each `distributor_id`.

In [None]:
%%sql

SELECT distributor_id
FROM distributor_item
GROUP BY distributor_id
HAVING COUNT(DISTINCT distributor) > 1;

That's fine, so create the `distributor` table and add its primary key.

In [None]:
%%sql

DROP TABLE IF EXISTS distributor CASCADE;

CREATE TABLE distributor AS
    SELECT DISTINCT distributor_id, distributor
    FROM distributor_item;
    
ALTER TABLE distributor
ADD CONSTRAINT distributor_pk
 PRIMARY KEY (distributor_id);

SELECT *
FROM distributor;

Now the `item`. No need to check for uniqueness as we already have `item_number` as the primary key in `distributor_item`.

In [None]:
%%sql

DROP TABLE IF EXISTS item CASCADE;

CREATE TABLE item AS
    SELECT DISTINCT item_number, title, price, distributor_id
    FROM distributor_item;
    
ALTER TABLE item
ADD CONSTRAINT item_pk
    PRIMARY KEY (item_number);

ALTER TABLE item
ADD CONSTRAINT item_distributor_fk 
    FOREIGN KEY (distributor_id) REFERENCES distributor;

SELECT *
FROM item;

Finally, add the foreign key constraint from `order_line` to `item`.

In [None]:
%%sql

ALTER TABLE order_line
ADD CONSTRAINT order_line_item_fk 
    FOREIGN KEY (item_number) REFERENCES item;

Can we recreate the `distributor_item` table from the normalised tables?

In [None]:
%%sql recreated_distributor_items <<

SELECT item_number, title, price, item.distributor_id, distributor
FROM item, distributor
WHERE distributor.distributor_id = item.distributor_id
ORDER BY item_number;

In [None]:
recreated_distributor_items

This looks good, but again, let's check.

In [None]:
%%sql distributor_items <<

SELECT item_number, title, price, distributor_id, distributor
FROM distributor_item
ORDER BY item_number;

In [None]:
distributor_items.equals(recreated_distributor_items)

Success!

And finally, can we recreate the original dataset from the normalised tables?

As we're joining five tables, the SQL query is rather long, but not complicated.

In [None]:
%%sql order_details_recreated << 

SELECT customer.customer_number, first_name, surname, street, postcode, phone_number, 
    disk_order.order_number, order_date, item.item_number, title, price, item_dispatched, 
    distributor.distributor_id, distributor
FROM customer, disk_order, order_line, item, distributor
WHERE customer.customer_number = disk_order.customer_number
    AND disk_order.order_number = order_line.order_number
    AND item.item_number = order_line.item_number
    AND distributor.distributor_id = item.distributor_id
ORDER BY customer.customer_number, disk_order.order_number, order_line.item_number;

In [None]:
order_details_recreated

In [None]:
orders_detail_df.equals(order_details_recreated)

Success!

### Cleanup
Let's now get rid of the `distributor_item` table. Note that first we have to drop the foreign key constraint from `order_line` to `distributor_item`.

In [None]:
%%sql

ALTER TABLE order_line
DROP CONSTRAINT order_line_distributor_item_fk;

DROP TABLE distributor_item;

### Activity 3

Draw an ERD of the relations `customer`, `disk_order`, `order_line`, `distributor`, and `item`.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

<img src="images/antique-opticals-3nf.png" alt="Antique Opticals 3NF ERD" style="width: 50%;"/>

Again, the participation conditions depend on the domain. It could well be that we can have items without a known distributor, and distributors who are not the source of any items. On the other hand, these relationships could be mandatory.

#### End of Activity 3

---------------------------------------

This concludes the normalisation worked example. We've seen how to decompose a large table into a set of normalised relations, and shown that those relations can be recombined to recover the original table. 

# <a name="discussion"></a>Discussion
* [Top](#top)



There are choices to be made at all steps in normalisation. In particular, we have choices to make about where to start. 

In this example, we started with the `customer_number` key, thinking that the `orders` relation would have multiple values for each `customer_number`. That moved us to splitting the original dataset into three relations in 1NF, and hence the rest of the steps above.

The example in figure 3.2 of Harrington would lead us to start with data with `order_number` as the primary key. That wouldn't have customer details as repeating groups with orders.

Alternatively, we could have started with rows indexed by `(order_number, item_number)`. That relation would be in 1NF.


### Activity 4

If we'd started with either the `order_and_customer` (i.e. one row per order) or `ordered_item_and_order_and_customer` (i.e. one row per ordered item) relations, would we have ended with the same set of relations after moving to 3NF?

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Depending on where we started, the move to 1NF would be easier. In the most extreme example, starting with rows indexed by `(order_number, item_number)`, the move to 1NF would be trivial. If we started with `order_and_customer`, the move to 1NF would have generated the same `order_item` relation as we did in the main notebook. In both cases, the move to 2NF would again pull out the `distributor_item` relation. 

The move to 3NF would be much more involved, as there are many transitive functional dependencies in the original dataset. It would be at this step that the `customer`, `order`, and `distributor` relations would have been identified, based on the attributes which functionally depend on those relations' primary keys. For instance, the `order_and_customer` relation would have attributes like `surname` and `postcode` functionally dependent on the `customer_number`, not the `order_number`. That would have prompted us to split out the `customer`s from the `order`s. 

#### End of Activity 4

---------------------------------------

## Real-world considerations

While the process you've covered here outlines all the steps of normalisation, we've made some assumptions about the requirements of any OLTP (OnLine Transaction Processing: see Part 8, section 2) system using this data, assumptions that would not be valid in a live system.

### Activity 5

Consider the `price` attribute.

Why would the way we have treated `price` not be acceptable in a live system?

**Hint:** Consider an order for "Batman Returns" made now versus an order for "Batman Returns" made a year ago.

Write your answer in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

The price could well have changed over the past year. If so, what would happen if a customer phoned up today, querying an order made last year? Perhaps more importantly, how much income should we declare to the tax service over the whole year?

We need to distinguish between the price we would charge for an order placed now, and the price we actually charged a customer on an order. How we implement that depends on many factors. We could just store the price charged in the order line, copied from the current price when the order is placed. We could maintain a list of prices, annotated with the dates they're effective, and look up the charged price by using a combination of item number and order date. As with many things in computing, there's a trade-off between processing time and storage space used.
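As an illustration of the second option, here is a minimal sketch of looking up the price that was in effect on a given order date. The price history, the dates and the function name `price_on` are all hypothetical; a real system would hold this history in a table keyed by item number and effective date.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical price history for one item: (effective_from, price), sorted by date.
price_history = [
    (date(2023, 1, 1), 12.99),
    (date(2023, 9, 1), 9.99),
    (date(2024, 3, 1), 7.99),
]

def price_on(history, order_date):
    """Return the price in effect on `order_date`, or None if no price applies yet."""
    effective_dates = [effective_from for effective_from, _ in history]
    # Position of the first entry that takes effect after order_date.
    position = bisect_right(effective_dates, order_date)
    if position == 0:
        return None
    return history[position - 1][1]

print(price_on(price_history, date(2023, 6, 15)))  # 12.99
print(price_on(price_history, date(2024, 3, 1)))   # 7.99
```

The first option, copying the price into the order line when the order is placed, needs no lookup at all but stores the price once per order line.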

A similar situation occurs with customer addresses, customer names, perhaps DVD titles, and many other things. The different ways of dealing with these issues are outside the scope of this module, but you should be aware of the issues.

#### End of Activity 5

------------------------------------------------------

# Summary

In this notebook, you've followed the process of normalisation from unnormalised data to data in 3NF. You did this by splitting relations into new relations to ensure the new relations are in the required normal form. 

The formal details of the process are in the Harrington book, but remember the mnemonic: to be in third normal form, 
> attributes must be dependent on the key, the whole key, and nothing but the key.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to notebook *10.3 Normalisation - the Hospital scenario*.