# Multi-value columns (optional)
In this Notebook, you will compare how to access and update data stored in a single-value column 
(*normalised data*) with the same data stored in a multi-value column (*unnormalised data*). You will use the following three different representations of movie genres:

- `movie_genre` table (Movies dataset)

movie_id | genre
---------|------
1 | Adventure
1 | Animation
1 | Children
1 | Comedy
1 | Fantasy

- `movie_genres_array` table - movie genres stored as 
SQL [`ARRAY`](http://www.postgresql.org/docs/9.3/static/arrays.html) elements.

movie_id | genres
---------|------
1 | ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']

- `movie_genres_list` table - movie genres stored as a 'pipe' ('|') separated list.

movie_id | genres
---------|------
1 | 'Adventure&#124;Animation&#124;Children&#124;Comedy&#124;Fantasy'

Enable access to the PostgreSQL database engine via [SQL Cell Magic](https://pypi.python.org/pypi/ipython-sql).

In [None]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/tm351test

In [None]:
%%sql
DROP TABLE IF EXISTS movie_genre CASCADE;

CREATE TABLE movie_genre (
 movie_id INTEGER NOT NULL,
 genre VARCHAR(20) NOT NULL,
 PRIMARY KEY (movie_id, genre)
);

DROP TABLE IF EXISTS movie_genres_array CASCADE;

CREATE TABLE movie_genres_array (
 movie_id INTEGER NOT NULL,
 genres VARCHAR(20)[] NOT NULL,
 PRIMARY KEY (movie_id)
);

DROP TABLE IF EXISTS movie_genres_list CASCADE;

CREATE TABLE movie_genres_list (
 movie_id INTEGER NOT NULL,
 genres VARCHAR(100) NOT NULL,
 PRIMARY KEY (movie_id)
);

Populate the tables from files using [Psycopg](http://initd.org/psycopg/docs/index.html), 
a PostgreSQL database adapter for Python.

In [None]:
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as psqlg

In [None]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()

# open movie_genre.dat
io = open('data/movie_genre.dat', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'movie_genre')
# close movie_genre.dat
io.close()
# commit transaction
conn.commit()

# open movie_genres_array.dat
io = open('data/movie_genres_array.dat', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'movie_genres_array')
# close movie_genres_array.dat
io.close()
# commit transaction
conn.commit()

# open movie_genres_list.dat
io = open('data/movie_genres_list.dat', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'movie_genres_list')
# close movie_genres_list.dat
io.close()
# commit transaction
conn.commit()

# close cursor
c.close()
# close database connection
conn.close()

- `movie_genre` table (Movies dataset)

In [None]:
%%sql
SELECT * 
FROM movie_genre
WHERE movie_id = 1;

- `movie_genres_array` table - movie genres stored as 
SQL [`ARRAY`](http://www.postgresql.org/docs/9.3/static/arrays.html) elements.

In [None]:
%%sql
SELECT * 
FROM movie_genres_array
WHERE movie_id = 1;

- `movie_genres_list` table - movie genres stored as a 'pipe' ('|') separated list.

In [None]:
%%sql
SELECT * 
FROM movie_genres_list
WHERE movie_id = 1;

## Searching the data

To illustrate how to search the data stored in the three different representations of movie genres, let's determine 
the number of movies in the Movies dataset that are classified under the `Horror` genre.

- `movie_genre` table (Movies dataset)

In [None]:
%%sql
SELECT COUNT(*) AS number_of_horror_movies
FROM movie_genre
WHERE genre = 'Horror';

- `movie_genres_array` table - movie genres stored as 
SQL [`ARRAY`](http://www.postgresql.org/docs/9.3/static/arrays.html) elements.

In [None]:
%%sql
SELECT COUNT(*) AS number_of_horror_movies
FROM movie_genres_array
WHERE 'Horror' = ANY(genres);

Notes:

The result of [`ANY`](http://www.postgresql.org/docs/9.3/static/functions-comparisons.html) is *true* if any array 
element matches the supplied string.
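
As a rough analogy only, `= ANY(...)` behaves much like Python's built-in `any` applied to a list. The sketch below uses a small in-memory dictionary as a stand-in for the `movie_genres_array` table; the rows for movies 2 and 3 and the helper name `count_movies_with_genre` are invented for illustration.

```python
# Hypothetical stand-in for movie_genres_array: movie_id -> list of genres.
# Only movie 1 mirrors the example row shown earlier; movies 2 and 3 are made up.
movie_genres_array = {
    1: ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
    2: ['Horror', 'Thriller'],
    3: ['Comedy', 'Horror'],
}

def count_movies_with_genre(table, wanted):
    """Count rows whose genre list contains an element equal to `wanted`."""
    return sum(1 for genres in table.values()
               if any(genre == wanted for genre in genres))

number_of_horror_movies = count_movies_with_genre(movie_genres_array, 'Horror')
print(number_of_horror_movies)
```

The comparison is made element by element, so a genre only counts when a whole array element equals the supplied string.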

- `movie_genres_list` table - movie genres stored as a 'pipe' ('|') separated list.

In [None]:
%%sql
SELECT COUNT(*) AS number_of_horror_movies
FROM movie_genres_list
WHERE genres LIKE '%Horror%';

Notes:
    
The [`LIKE`](http://www.postgresql.org/docs/9.3/static/functions-matching.html) expression returns *true* if the 
column value matches the supplied pattern where `%` matches any sequence of zero or more characters. 
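
Because `%Horror%` matches the text anywhere in the string, a pattern of this kind would also match any longer genre name that happens to contain it. The Python sketch below shows the difference between a substring test and a whole-item test on a delimited list. The sample values, including the genre `Horror-Comedy`, are invented for illustration and are not claimed to occur in the Movies dataset.

```python
# Hypothetical pipe-separated values; 'Horror-Comedy' is invented purely to
# show the substring effect.
rows = ['Comedy|Horror', 'Horror-Comedy|Drama', 'Adventure|Fantasy']

def substring_match(genres, wanted):
    """Roughly what LIKE '%wanted%' does: match anywhere in the string."""
    return wanted in genres

def whole_item_match(genres, wanted, delimiter='|'):
    """Match only when `wanted` is a complete item of the delimited list."""
    return wanted in genres.split(delimiter)

substring_hits = [r for r in rows if substring_match(r, 'Horror')]
whole_item_hits = [r for r in rows if whole_item_match(r, 'Horror')]
print(substring_hits)
print(whole_item_hits)
```

Whether the substring behaviour matters depends on the genre names actually present in the data.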

## Extracting the data

To illustrate how to extract the data stored in the three different representations of movie genres, let's determine 
the number of movies classified under each genre.

- `movie_genre` table (Movies dataset)

In [None]:
%%sql
SELECT genre, COUNT(*) AS number_of_movies
FROM movie_genre
GROUP BY genre
ORDER BY genre;

- `movie_genres_array` table - movie genres stored as 
SQL [`ARRAY`](http://www.postgresql.org/docs/9.3/static/arrays.html) elements.

In [None]:
%%sql
SELECT genre, COUNT(*) AS number_of_movies
FROM (SELECT movie_id, UNNEST(genres) AS genre
      FROM movie_genres_array) AS unnested
GROUP BY genre
ORDER BY genre;

Notes:

The [`UNNEST`](http://www.postgresql.org/docs/9.3/static/functions-array.html#ARRAY-FUNCTIONS-TABLE) function 
expands an array into a set of rows as illustrated by the following `SELECT` statement.

In [None]:
%%sql
SELECT movie_id, UNNEST(genres) AS genre
FROM movie_genres_array
WHERE movie_id = 1;
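
In Python terms, the effect of `UNNEST` on the array column can be sketched as a nested loop that emits one row per array element. This is only an analogy: the function name `unnest` and the in-memory row below are assumptions made for illustration.

```python
def unnest(rows):
    """Yield one (movie_id, genre) pair per array element, in array order."""
    for movie_id, genres in rows:
        for genre in genres:
            yield (movie_id, genre)

# In-memory copy of the example row for movie 1 shown at the top of the Notebook.
array_rows = [(1, ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'])]

unnested = list(unnest(array_rows))
for row in unnested:
    print(row)
```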

Notes:
    
The table produced by executing the 

`SELECT movie_id, UNNEST(genres) AS genre FROM movie_genres_array` 

statement contains the same rows as the `movie_genre` table, as the following comparison of the two ordered result sets illustrates. You will learn about nested `SELECT` statements (SQL subqueries) in Part 11.

In [None]:
movie_genre = %sql \
 SELECT * \
 FROM movie_genre \
 ORDER BY movie_id, genre;

movie_genres_array_unnested = %sql \
 SELECT movie_id, UNNEST(genres) AS genre \
 FROM movie_genres_array \
 ORDER BY movie_id, genre;
    
movie_genre == movie_genres_array_unnested

- `movie_genres_list` table - movie genres stored as a 'pipe' ('|') separated list.

In [None]:
%%sql
SELECT genre, COUNT(*) AS number_of_movies
FROM (SELECT movie_id, UNNEST(STRING_TO_ARRAY(genres,'|')) AS genre
      FROM movie_genres_list) AS unnested
GROUP BY genre
ORDER BY genre;

Notes:

The [`STRING_TO_ARRAY`](http://www.postgresql.org/docs/9.3/static/functions-array.html#ARRAY-FUNCTIONS-TABLE) 
function splits the supplied string into array elements using the supplied delimiter.
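
The split-then-count logic of the query above can be sketched in Python with `str.split` and `collections.Counter`. The second row and the helper name `count_movies_per_genre` are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for movie_genres_list; movie 2 is a made-up row.
list_rows = [
    (1, 'Adventure|Animation|Children|Comedy|Fantasy'),
    (2, 'Comedy|Romance'),
]

def count_movies_per_genre(rows, delimiter='|'):
    """Split each delimited string into genres and count rows per genre."""
    counts = Counter()
    for _movie_id, genres in rows:
        counts.update(genres.split(delimiter))
    # Sort by genre name, mirroring ORDER BY genre.
    return sorted(counts.items())

genre_counts = count_movies_per_genre(list_rows)
for genre, number_of_movies in genre_counts:
    print(genre, number_of_movies)
```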

## Updating the data

To illustrate how to update the data stored in the three different representations of movie genres, let's suppose we 
have to replace the genre *Animation* with *Drama* for the movie identified by the `movie_id` of 1.

- `movie_genre` table (Movies dataset)

In [None]:
%%sql
UPDATE movie_genre
 SET genre = 'Drama'
 WHERE movie_id = 1 AND genre = 'Animation';
    
SELECT *
FROM movie_genre
WHERE movie_id = 1;

- `movie_genres_array` table - movie genres stored as 
SQL [`ARRAY`](http://www.postgresql.org/docs/9.3/static/arrays.html) elements.

In [None]:
%%sql
UPDATE movie_genres_array
 SET genres = ARRAY_REPLACE(genres,'Animation','Drama')
 WHERE movie_id = 1;

SELECT *
FROM movie_genres_array
WHERE movie_id = 1;

Notes:
    
The [`ARRAY_REPLACE`](http://www.postgresql.org/docs/9.3/static/functions-array.html#ARRAY-FUNCTIONS-TABLE) 
function replaces each array element equal to the value supplied with a new value. Other 
[`ARRAY`](http://www.postgresql.org/docs/9.3/static/functions-array.html#ARRAY-FUNCTIONS-TABLE) functions 
are provided to add and delete elements.
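
The element-wise behaviour described above can be sketched in Python as a list comprehension. The function name `array_replace` is chosen here only to mirror the SQL function; it is not a library call.

```python
def array_replace(items, old, new):
    """Return a new list with every element equal to `old` replaced by `new`."""
    return [new if item == old else item for item in items]

genres = ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
updated = array_replace(genres, 'Animation', 'Drama')
print(updated)
```

Only elements that are equal to the supplied value as a whole are replaced; other elements are left untouched.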

- `movie_genres_list` table - movie genres stored as a 'pipe' ('|') separated list.

In [None]:
%%sql
UPDATE movie_genres_list
 SET genres = REPLACE(genres,'Animation','Drama')
 WHERE movie_id = 1;

SELECT *
FROM movie_genres_list
WHERE movie_id = 1;

Notes:
    
The [`REPLACE`](http://www.postgresql.org/docs/9.3/static/functions-string.html) function replaces all occurrences 
of a substring in a string with another substring. 
Other [`STRING`](http://www.postgresql.org/docs/9.3/static/functions-string.html) functions are provided to add and 
delete substrings.
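
Because the replacement works on substrings rather than on list items, it can also alter a longer genre name that contains the search text. Python's `str.replace` behaves in a similar way, as the sketch below suggests. The genres `Crime` and `True Crime` and the helper `replace_item` are invented for illustration.

```python
# Hypothetical pipe-separated value; the genre names are made up to show the effect.
genres = 'Crime|True Crime|Drama'

# Substring replacement: every occurrence of 'Crime' changes,
# including the one inside 'True Crime'.
naive = genres.replace('Crime', 'Thriller')

def replace_item(genres, old, new, delimiter='|'):
    """Replace only the list items that are exactly equal to `old`."""
    return delimiter.join(new if item == old else item
                          for item in genres.split(delimiter))

careful = replace_item(genres, 'Crime', 'Thriller')
print(naive)
print(careful)
```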

## Surrogate keys

In Exercise 10.4 we introduced *surrogate keys* as a means of identifying rows in a table when no natural identifier exists.

A *surrogate key* is typically generated automatically by the DBMS. We will illustrate how this can be 
accomplished with PostgreSQL using the example given in Exercise 10.4.

- Create and populate the unnormalised `book` table

In [None]:
%%sql
DROP TABLE IF EXISTS book CASCADE;
CREATE TABLE book (
 isbn CHAR(14) NOT NULL,
 title VARCHAR(100) NOT NULL,
 authors VARCHAR(250) NOT NULL,
 cost DECIMAL(5,2) NOT NULL,
 PRIMARY KEY (isbn)
);

In [None]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()

# open book+authors.dat
io = open('data/book+authors.dat', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'book')
# close book+authors.dat
io.close()
# commit transaction
conn.commit()

# close cursor
c.close()
# close database connection
conn.close()

In [None]:
%%sql
SELECT *
FROM book
ORDER BY isbn;

- Create and populate the `author` table.

In [None]:
%%sql
DROP TABLE IF EXISTS author CASCADE;
CREATE TABLE author (
 author_id SERIAL,
 author_name VARCHAR(25) NOT NULL,
 PRIMARY KEY (author_id)
);

INSERT INTO author(author_name) 
 SELECT UNNEST(STRING_TO_ARRAY(authors,', ')) AS author_name
 FROM book
 GROUP BY author_name;

SELECT *
FROM author
ORDER BY author_id;

Notes:

As the primary key of the `author` table, `author_id`, is defined as a 
[`SERIAL`](http://www.postgresql.org/docs/9.3/static/datatype-numeric.html#DATATYPE-SERIAL) type, it will 
be automatically assigned integer values from a sequence generator each time a row is added to the table.
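
The idea of a sequence generator can be sketched in Python with `itertools.count`. This is only an analogy for how values are handed out one at a time; the author names and the helper `insert_author` are invented for illustration.

```python
from itertools import count

# A simple stand-in for a sequence generator that starts at 1.
author_id_sequence = count(start=1)

def insert_author(table, author_name):
    """Append a row, taking the next value from the sequence as its key."""
    author_id = next(author_id_sequence)
    table.append((author_id, author_name))
    return author_id

author = []
for name in ['A. Author', 'B. Writer']:  # made-up names
    insert_author(author, name)
print(author)
```

The caller supplies only the author's name; the key value comes from the generator.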

- Create and populate the `book_authors` table.

In [None]:
%%sql
DROP TABLE IF EXISTS book_authors CASCADE;
CREATE TABLE book_authors (
 isbn CHAR(14) NOT NULL,
 author_id INTEGER NOT NULL,
 PRIMARY KEY (isbn, author_id)
);

INSERT INTO book_authors(isbn, author_id)
 SELECT isbn, author_id
 FROM (SELECT isbn, UNNEST(STRING_TO_ARRAY(authors,', ')) AS author_name
       FROM book) AS authors
      NATURAL JOIN author;

SELECT *
FROM book_authors
ORDER BY (isbn, author_id);

Notes:

The `book_authors` table is populated by expanding the comma-separated `authors` value of each row in the `book` table 
into a set of rows, and then matching each author's name against the `author_name` column of the `author` table.
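
The same split-and-match steps can be sketched in Python. The book rows, ISBN placeholders and author names below are invented for illustration, and assigning identifiers with `enumerate` is only a stand-in for the `SERIAL` column.

```python
# Hypothetical unnormalised rows: (isbn, title, comma-separated authors).
book = [
    ('isbn-1', 'First Title', 'A. Author, B. Writer'),
    ('isbn-2', 'Second Title', 'B. Writer'),
]

# Step 1: collect the distinct author names and give each a surrogate key.
names = sorted({name
                for _isbn, _title, authors in book
                for name in authors.split(', ')})
author = {name: author_id for author_id, name in enumerate(names, start=1)}

# Step 2: expand each book's author list and look up each name's key.
book_authors = sorted((isbn, author[name])
                      for isbn, _title, authors in book
                      for name in authors.split(', '))
print(book_authors)
```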

- Recreate the contents of the original `book` table, with one row for each book and author pair.

In [None]:
%%sql
SELECT isbn, title, author_name, cost
FROM (book NATURAL JOIN book_authors) NATURAL JOIN author
ORDER BY isbn, author_name;

## Summary
In this Notebook you have compared how to access and update data stored in a single-value column (normalised data) 
with the same data stored in a multi-value column (unnormalised data). You have also seen how *surrogate keys* are 
implemented in PostgreSQL.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `10.6 Referential integrity and referential actions`.