# SQL DML
In this Notebook you will use the SQL SELECT statement to answer queries about the data recorded in

  (a) the `patient` table
    
  (b) the Movies dataset.

Enable access to the PostgreSQL database engine via SQL cell magic.

In [1]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/tm351test

'Connected: test@tm351test'

## (a) the `patient` table

As the `patient` table is updated by other Notebooks, recreate it.

In [2]:
%%sql
DROP TABLE IF EXISTS patient CASCADE;

CREATE TABLE patient (
  patient_id CHAR(4) NOT NULL
    CHECK (patient_id SIMILAR TO 'p[0-9][0-9][0-9]'),
  patient_name VARCHAR(20) NOT NULL,
  date_of_birth DATE NOT NULL,
  gender CHAR(1) NOT NULL
    CHECK (gender = 'F' OR gender = 'M'),
  height DECIMAL(4,1)
    CHECK (height > 0),
  weight DECIMAL(4,1)
    CHECK (weight > 0),
 PRIMARY KEY (patient_id)
 );

Done.
Done.


[]

Populate the `patient` table from a CSV file named `patient.csv` using [Psycopg](http://initd.org/psycopg/docs/index.html),
a PostgreSQL database adapter for Python.

In [4]:
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as psqlg

In [5]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()
# open patient.csv
io = open('data/patient.csv', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'patient', sep=',', null='')
# close patient.csv
io.close()
# commit transaction
conn.commit()
# close cursor
c.close()
# close database connection
conn.close()
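
As a rough illustration of what the `null=''` argument in the cell above does, the following sketch mimics the load using only the Python standard library. It assumes an in-memory SQLite database in place of PostgreSQL and a short inline CSV string in place of `data/patient.csv`; the helper name `load_rows` is hypothetical. Empty fields are converted to `None` so that they are stored as SQL `NULL` rather than as empty strings.

```python
import csv
import io
import sqlite3

# Two rows in the same comma-separated layout as the patient data;
# the second row has empty height and weight fields.
SAMPLE_CSV = (
    "p001,Thornton,1980-01-22,F,162.3,71.6\n"
    "p031,Rubinstein,1980-12-23,F,,\n"
)


def load_rows(conn, text):
    """Insert CSV rows into patient, storing empty fields as NULL (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    rows = [[field if field != "" else None for field in row] for row in reader]
    conn.executemany("INSERT INTO patient VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for PostgreSQL
conn.execute(
    "CREATE TABLE patient (patient_id TEXT PRIMARY KEY, patient_name TEXT, "
    "date_of_birth TEXT, gender TEXT, height REAL, weight REAL)"
)
loaded = load_rows(conn, SAMPLE_CSV)
missing = conn.execute(
    "SELECT COUNT(*) FROM patient WHERE weight IS NULL"
).fetchone()[0]
print(loaded, missing)
```

Without the conversion, the empty fields would not be recognised as missing values, which matters later when counting patients whose weight has been recorded.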

In [6]:
%%sql
SELECT * 
FROM patient
ORDER BY patient_id;

17 rows affected.


patient_id,patient_name,date_of_birth,gender,height,weight
p001,Thornton,1980-01-22,F,162.3,71.6
p007,Tennent,1980-04-01,M,176.8,70.9
p008,James,1980-07-08,M,167.9,70.5
p009,Kay,1980-09-25,F,164.7,53.2
p015,Harris,1980-12-04,M,180.6,64.3
p031,Rubinstein,1980-12-23,F,,
p037,Boswell,1981-06-11,F,,
p038,Ming,1981-09-23,M,186.3,85.4
p039,Maher,1981-10-09,F,161.9,73.0
p068,Monroe,1981-10-21,F,165.0,62.6


## Activity 1 - `patient` table
Execute SQL `SELECT` statements to answer the following queries about patients:
1. Give the details of female patients who were born before 1981.
2. For each birth year, give the number of patients who were born that year, the number whose weight has been 
recorded, and the minimum, maximum and average weights.
3. Give the number of female patients and male patients who are 'overweight' according to their 
[BMI (Body Mass Index)](https://en.wikipedia.org/wiki/Body_mass_index).

In [15]:
%%sql
SELECT * FROM patient
WHERE gender = 'F'
  AND extract(YEAR FROM date_of_birth) < 1981

3 rows affected.


patient_id,patient_name,date_of_birth,gender,height,weight
p001,Thornton,1980-01-22,F,162.3,71.6
p009,Kay,1980-09-25,F,164.7,53.2
p031,Rubinstein,1980-12-23,F,,


Solutions can be found in the `09.2.soln SQL DML` Notebook, 
but please DO attempt the activity yourself before looking at these solutions.

In [32]:
%%sql
SELECT extract(YEAR FROM date_of_birth) as BIRTH_YEAR, 
COUNT(patient_id) as TOTAL_BORN 
FROM patient
GROUP BY extract(YEAR FROM date_of_birth)


SyntaxError: invalid syntax (<ipython-input-32-4aa6d4d5dda7>, line 7)

In [35]:
%%sql
SELECT extract(YEAR FROM date_of_birth) as BIRTH_YEAR, 
COUNT(patient_id) as TOTAL_BORN_WITH_WEIGHT_RECORDED 
FROM patient
WHERE weight IS NOT NULL
GROUP BY extract(YEAR FROM date_of_birth)

3 rows affected.


birth_year,total_born_with_weight_recorded
1982.0,6
1981.0,4
1980.0,5


In [36]:
%%sql
SELECT extract(YEAR FROM date_of_birth) as BIRTH_YEAR, 
MIN(weight) as MIN_WEIGHT 
FROM patient
WHERE weight IS NOT NULL
GROUP BY extract(YEAR FROM date_of_birth)

3 rows affected.


birth_year,min_weight
1982.0,49.2
1981.0,62.6
1980.0,53.2


In [37]:
%%sql
SELECT extract(YEAR FROM date_of_birth) as BIRTH_YEAR, 
MAX(weight) as MAX_WEIGHT 
FROM patient
WHERE weight IS NOT NULL
GROUP BY extract(YEAR FROM date_of_birth)

3 rows affected.


birth_year,max_weight
1982.0,91.4
1981.0,85.4
1980.0,71.6


In [38]:
%%sql
SELECT extract(YEAR FROM date_of_birth) as BIRTH_YEAR, 
AVG(weight) as AVG_WEIGHT 
FROM patient
WHERE weight IS NOT NULL
GROUP BY extract(YEAR FROM date_of_birth)

3 rows affected.


birth_year,avg_weight
1982.0,63.98333333333333
1981.0,74.425
1980.0,66.1
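
The second query in the activity asks for all of these figures per birth year, and the separate queries above can be combined into a single `GROUP BY` statement. The sketch below is one possible way to do that, not the module's solution. It assumes an in-memory SQLite database loaded with only the ten patient rows displayed earlier (not the full 17-row table), so SQLite's `strftime` is used where PostgreSQL would use `extract(YEAR FROM date_of_birth)`. The key point is that `COUNT(*)` counts every row, whereas `COUNT(weight)` and the other aggregates ignore `NULL` weights, so no `WHERE weight IS NOT NULL` clause is needed.

```python
import sqlite3

# The ten patient rows displayed earlier in this Notebook (a partial sample).
ROWS = [
    ("p001", "1980-01-22", 71.6),
    ("p007", "1980-04-01", 70.9),
    ("p008", "1980-07-08", 70.5),
    ("p009", "1980-09-25", 53.2),
    ("p015", "1980-12-04", 64.3),
    ("p031", "1980-12-23", None),
    ("p037", "1981-06-11", None),
    ("p038", "1981-09-23", 85.4),
    ("p039", "1981-10-09", 73.0),
    ("p068", "1981-10-21", 62.6),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for PostgreSQL
conn.execute(
    "CREATE TABLE patient (patient_id TEXT, date_of_birth TEXT, weight REAL)"
)
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)", ROWS)

QUERY = """
SELECT CAST(strftime('%Y', date_of_birth) AS INTEGER) AS birth_year,
       COUNT(*)      AS total_born,
       COUNT(weight) AS weight_recorded,  -- COUNT(column) skips NULLs
       MIN(weight)   AS min_weight,
       MAX(weight)   AS max_weight,
       AVG(weight)   AS avg_weight
FROM patient
GROUP BY birth_year
ORDER BY birth_year
"""
summary = conn.execute(QUERY).fetchall()
for row in summary:
    print(row)
```

Because the sample is incomplete, the figures for 1981 differ from those reported for the full table, and 1982 does not appear at all.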


In [59]:
%%sql
SELECT gender, COUNT(*)
FROM patient
WHERE (weight / (height / 100)^2) > 24
GROUP BY gender

2 rows affected.


gender,count
F,2
M,3
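
The query above treats a BMI greater than 24 as overweight, whereas the commonly cited cut-off is a BMI of 25 or more, with BMI defined as weight in kilograms divided by the square of height in metres. The sketch below shows the effect of the stricter threshold. It assumes an in-memory SQLite database holding only the ten patient rows displayed earlier, so its counts are not those of the full table. SQLite has no `^` operator, so the square is written as a product. Rows with a `NULL` height or weight drop out automatically because the comparison evaluates to `NULL`.

```python
import sqlite3

# Sample patients from the earlier listing: (id, gender, height in cm, weight in kg).
ROWS = [
    ("p001", "F", 162.3, 71.6),
    ("p007", "M", 176.8, 70.9),
    ("p008", "M", 167.9, 70.5),
    ("p009", "F", 164.7, 53.2),
    ("p015", "M", 180.6, 64.3),
    ("p031", "F", None, None),
    ("p037", "F", None, None),
    ("p038", "M", 186.3, 85.4),
    ("p039", "F", 161.9, 73.0),
    ("p068", "F", 165.0, 62.6),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for PostgreSQL
conn.execute(
    "CREATE TABLE patient (patient_id TEXT, gender TEXT, height REAL, weight REAL)"
)
conn.executemany("INSERT INTO patient VALUES (?, ?, ?, ?)", ROWS)

QUERY = """
SELECT gender, COUNT(*) AS overweight
FROM patient
WHERE weight / ((height / 100.0) * (height / 100.0)) >= 25
GROUP BY gender
ORDER BY gender
"""
overweight = conn.execute(QUERY).fetchall()
print(overweight)
```

Some definitions reserve 'overweight' for a BMI from 25 up to but not including 30; if that reading is intended, an upper bound would need to be added to the `WHERE` clause.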


## (b) the Movies dataset

This Notebook uses only the `movie` table from the Movies dataset.

`movie (movie_id, title, year, rt_all_critics_rating, rt_top_critics_rating, rt_audience_rating, ml_user_rating)`

Each row records the following data about a particular movie identified by the `movie_id` primary key (PK) column.

column | description
------ | -----------
movie_id  (PK) | movie identifier
title | movie title
year | year of release
rt_all_critics_rating | RottenTomatoes - all critics: average rating
rt_top_critics_rating | RottenTomatoes - top critics: average rating
rt_audience_rating | RottenTomatoes - audience: average rating
ml_user_rating | MovieLens - users: average rating



In [60]:
%%sql
DROP TABLE IF EXISTS movie;

CREATE TABLE movie(
 movie_id INTEGER NOT NULL,
 title VARCHAR(250) NOT NULL,
 year INTEGER NOT NULL,
 rt_all_critics_rating REAL,
 rt_top_critics_rating REAL,
 rt_audience_rating REAL,
 ml_user_rating REAL,
 PRIMARY KEY (movie_id)
);

Done.
Done.


[]

Populate the `movie` table from the file named `movie.dat` using Psycopg.

In [61]:
# open a connection to the PostgreSQL database tm351test
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)
# create a cursor
c = conn.cursor()
# open movie.dat
io = open('data/movie.dat', 'r')
# execute the PostgreSQL copy command
c.copy_from(io, 'movie')
# close movie.dat
io.close()
# commit transaction
conn.commit()
# close cursor
c.close()
# close database connection
conn.close()

In [62]:
%%sql
SELECT * 
FROM movie
ORDER BY movie_id
LIMIT 10;

10 rows affected.


movie_id,title,year,rt_all_critics_rating,rt_top_critics_rating,rt_audience_rating,ml_user_rating
1,Toy Story,1995,9.0,8.5,3.7,3.9
2,Jumanji,1995,5.6,5.8,3.2,3.2
3,Grumpier Old Men,1995,5.9,7.0,3.2,3.2
4,Waiting to Exhale,1995,5.6,5.5,3.3,2.9
5,Father of the Bride Part II,1995,5.3,5.4,3.0,3.1
6,Heat,1995,7.7,7.2,3.9,3.8
7,Sabrina,1995,7.4,7.2,3.8,3.4
8,Tom and Huck,1995,4.2,0.0,2.7,3.1
9,Sudden Death,1995,5.2,5.6,2.6,3.0
10,GoldenEye,1995,6.8,6.2,3.4,3.4


## Activity 2 - Movies dataset I
Characterise the data in the `movie` table by executing SQL `SELECT` statements to answer the following questions: 

    1 How many movies are there?
    2 How many unique movie titles are there?
    3 What are the earliest and latest years of release?
    4 What are the ranges of values for critics, audience and user ratings?
    5 Missing data - How many movies are recorded without:
        5.1 a title?
        5.2 a year of release?
        5.3 critics, audience or user ratings?

Compare your answers with those from the same questions asked in the `08.1 Movies dataset` Notebook.  

In [79]:
%%sql
SELECT COUNT(*) as total_movies, 
COUNT(DISTINCT title) as unique_titles,
MIN(year) as earliest_year,
MAX(year) as latest_year,
MIN(rt_top_critics_rating) as rt_top_critics_rating_min,
MAX(rt_top_critics_rating) as rt_top_critics_rating_max,
MIN(rt_audience_rating) as rt_audience_rating_min,
MAX(rt_audience_rating) as rt_audience_rating_max,
MIN(ml_user_rating) as ml_user_rating_min,
MAX(ml_user_rating) as ml_user_rating_max
FROM movie

1 rows affected.


total_movies,unique_titles,earliest_year,latest_year,rt_top_critics_rating_min,rt_top_critics_rating_max,rt_audience_rating_min,rt_audience_rating_max,ml_user_rating_min,ml_user_rating_max
10681,10410,1915,2008,0.0,10.0,0.0,5.0,0.5,5.0


In [82]:
%%sql
SELECT COUNT(movie_id) as movies_no_title
FROM movie
WHERE title IS NULL

1 rows affected.


movies_no_title
0


In [84]:
%%sql
SELECT COUNT(movie_id) as no_year
FROM movie
WHERE year IS NULL

1 rows affected.


no_year
0


In [85]:
%%sql
SELECT COUNT(movie_id) as no_review
FROM movie
WHERE rt_top_critics_rating IS NULL

1 rows affected.


no_review
714
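
The activity asks about missing critics, audience and user ratings, while the count above covers only `rt_top_critics_rating`. One way to check every rating column in a single pass is to subtract `COUNT(column)`, which skips `NULL`s, from `COUNT(*)`. The sketch below is a minimal illustration of that idea, assuming an in-memory SQLite database, three rows copied from the listing above, and one invented row (labelled as hypothetical) that lacks the three RottenTomatoes ratings.

```python
import sqlite3

ROWS = [
    (1, "Toy Story", 1995, 9.0, 8.5, 3.7, 3.9),
    (2, "Jumanji", 1995, 5.6, 5.8, 3.2, 3.2),
    (3, "Grumpier Old Men", 1995, 5.9, 7.0, 3.2, 3.2),
    # Hypothetical row, not from the dataset: no RottenTomatoes ratings.
    (-1, "Hypothetical Film", 2000, None, None, None, 3.0),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for PostgreSQL
conn.execute(
    "CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT, year INTEGER, "
    "rt_all_critics_rating REAL, rt_top_critics_rating REAL, "
    "rt_audience_rating REAL, ml_user_rating REAL)"
)
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)

QUERY = """
SELECT COUNT(*) - COUNT(rt_all_critics_rating) AS no_all_critics,
       COUNT(*) - COUNT(rt_top_critics_rating) AS no_top_critics,
       COUNT(*) - COUNT(rt_audience_rating)    AS no_audience,
       COUNT(*) - COUNT(ml_user_rating)        AS no_ml_user
FROM movie
"""
missing_counts = conn.execute(QUERY).fetchone()
print(missing_counts)
```

The same `SELECT` should run unchanged against the PostgreSQL `movie` table, since it uses only standard aggregate functions.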


Solutions can be found in the `09.2.soln SQL DML` Notebook, 
but please DO attempt the activity yourself before looking at these solutions.

## Activity 3 - Movies dataset II
Execute SQL `SELECT` statements to answer the following queries about movies: 

    1 How many movies have the word 'Dog' in their title?
    2 Movies are often remade and released with the same name. Which movies have been made more than 3 times?
    3 How many movies have been released each decade? Plot the results as a histogram.

In [88]:
%%sql
SELECT * FROM movie
WHERE title LIKE '%dog%'

3 rows affected.


movie_id,title,year,rt_all_critics_rating,rt_top_critics_rating,rt_audience_rating,ml_user_rating
8528,Dodgeball: A True Underdog Story,2004,6.3,6.3,3.4,3.3
54278,Underdog,2007,4.0,4.0,3.1,2.7
63082,Slumdog Millionaire,2008,8.2,8.2,4.0,4.2
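
In PostgreSQL, `LIKE` is case-sensitive, so the pattern `'%dog%'` above finds only lowercase occurrences and misses titles containing 'Dog' with a capital letter; `ILIKE` is the case-insensitive variant. The sketch below emulates the different matching choices in plain Python so the effect can be seen without a database. The first three titles come from the result above; the last two are illustrative titles assumed for this sketch and are not confirmed to be in the dataset. The helper name `count_matching` is hypothetical.

```python
import re

TITLES = [
    "Dodgeball: A True Underdog Story",  # from the result above
    "Underdog",                          # from the result above
    "Slumdog Millionaire",               # from the result above
    "Dog Day Afternoon",                 # illustrative title (assumption)
    "Reservoir Dogs",                    # illustrative title (assumption)
]


def count_matching(titles, pattern, flags=0):
    """Count the titles in which the regular expression occurs (hypothetical helper)."""
    regex = re.compile(pattern, flags)
    return sum(1 for title in titles if regex.search(title))


lower_only = count_matching(TITLES, r"dog")               # like LIKE '%dog%'
capitalised = count_matching(TITLES, r"Dog")              # like LIKE '%Dog%'
any_case = count_matching(TITLES, r"dog", re.IGNORECASE)  # like ILIKE '%dog%'
whole_word = count_matching(TITLES, r"\bDog\b")           # 'Dog' as a separate word
print(lower_only, capitalised, any_case, whole_word)
```

Which of these readings the question intends is a judgment call; since it asks how many movies match, the SQL answer would also wrap the chosen condition in `COUNT(*)` rather than listing the rows.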


In [94]:
%%sql
SELECT title, count(title) 
FROM movie
GROUP BY title
HAVING count(title) > 3

1 rows affected.


title,count
Hamlet,5


In [95]:
%%sql
SELECT year, count(*) as movies_made
FROM movie
GROUP BY year

94 rows affected.


year,movies_made
1970,71
2000,405
1975,74
1947,39
1962,69
1980,161
1981,178
1931,16
1972,83
1956,53
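
The query above groups by year, whereas the activity asks for counts per decade. A common approach is to rely on integer division: `(year / 10) * 10` maps 1975 to 1970, and this works the same way for `INTEGER` columns in PostgreSQL and SQLite. The sketch below assumes an in-memory SQLite database rebuilt from only the ten per-year counts displayed above, so the totals are partial. A crude text bar chart stands in for the histogram; a plotting library could be used instead.

```python
import sqlite3

# Per-year counts copied from the ten rows displayed above (a partial sample).
YEAR_COUNTS = {
    1970: 71, 2000: 405, 1975: 74, 1947: 39, 1962: 69,
    1980: 161, 1981: 178, 1931: 16, 1972: 83, 1956: 53,
}

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for PostgreSQL
conn.execute("CREATE TABLE movie (year INTEGER NOT NULL)")
# Expand each count back into one row per movie so COUNT(*) can be used.
conn.executemany(
    "INSERT INTO movie VALUES (?)",
    [(year,) for year, n in YEAR_COUNTS.items() for _ in range(n)],
)

QUERY = """
SELECT (year / 10) * 10 AS decade, COUNT(*) AS movies_made
FROM movie
GROUP BY decade
ORDER BY decade
"""
by_decade = conn.execute(QUERY).fetchall()
for decade, movies_made in by_decade:
    # One '#' per 20 movies gives a rough text histogram.
    print(decade, "#" * (movies_made // 20), movies_made)
```

Adding `ORDER BY decade` also fixes the arbitrary row order seen in the per-year output, which matters when the results are plotted.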


Solutions can be found in the `09.2.soln SQL DML` Notebook, 
but please DO attempt the activity yourself before looking at these solutions.

## Summary
In this Notebook you have used the SQL SELECT statement to answer queries about the data recorded in

  (a) the `patient` table
    
  (b) the Movies dataset.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `09.3 SQL views`.