# Advanced SQL query syntax part 1 - WHERE


**Author:** 'Felipe Millacura'
    
**Date:** '14th March 2021'

### Learning Objectives


* Be able to use additional comparison operators, with `AND` and `OR` combinations
* Write queries using `BETWEEN`, `NOT` and `IN`
* Understand and be able to use inexact comparisons with `LIKE` and wildcards
* Understand `IS NULL`
* Be able to create column aliases using `AS`
* Use `DISTINCT` to return unique records by column
* Understand and be able to use aggregate functions
* Be able to sort records and `LIMIT` the number returned
* Understand `GROUP BY` and `HAVING` for group level aggregation and filtering



## Introduction

As stated earlier, a data analyst reads records in databases more often than they create, update or delete them (recall CRUD). So it makes sense to focus our effort on learning `SELECT` syntax beyond simple `WHERE` clauses. We’ll work through a series of examples using the `dvd_rental` local `PostgreSQL` database that we set up in this lesson.

We’ll then come back to look at how we can manipulate the returned data from a query.


## The dataset

We will query the Sakila DVD Rental database. The Sakila database holds information about a company that rents movie DVDs; it is a 15-table demo dataset for [PostgreSQL](https://www.postgresqltutorial.com/postgresql-sample-database/).

**Note:** One quirk you may notice as you explore this "fake" database is that the rental dates are all from 2005 and 2006, while the payment dates are all from 2007. Don't worry about this. 


To assist you in the queries ahead, the **Entity Relationship Diagram (ERD)** for the DVD Rental database is provided below. You can find further information about ERDs [here](https://www.smartdraw.com/entity-relationship-diagram/)

![](images/database_dvd.png)

## Setup 

First, we need to install PostgreSQL on our local machine (or use Udacity's workplace). The instructions differ by OS, but you can find them here for [Windows](https://www.postgresqltutorial.com/install-postgresql/), [Mac](https://www.postgresqltutorial.com/install-postgresql-macos/) or [Linux](https://www.postgresqltutorial.com/install-postgresql-linux/).

Once PostgreSQL has been properly installed you will need to follow some [additional steps](https://www.postgresqltutorial.com/connect-to-postgresql-database/) to *restore* your `dvd_rental` database 

Once installed, we need to establish a Python connection to the database. Remember that we need to use the *magic commands* to load the `ipython-sql` extension.

In [2]:
%load_ext sql


Optionally, you can create an engine for later use using `sqlalchemy`'s `create_engine` 

In [None]:
from sqlalchemy import create_engine

In [None]:
# default
engine = create_engine('postgresql://postgres:trialpostgres@localhost/dvdrental')


This allows you to store SQL query results directly in a pandas DataFrame by using [`pd.read_sql()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html)

In [None]:
import pandas as pd

In [None]:
df_customer = pd.read_sql('SELECT some_column FROM some_table', engine)

df_customer.head(4)

## Connecting to a PostgreSQL database

To connect `ipython-sql` to your database use the following format:

```python
# Format
%sql dialect+driver://username:password@host:port/database
# Example
%sql postgresql://postgres:password123@localhost/dvdrental
            
```          




* `dialect+driver` in this case would just be `postgresql`, but feel free to use a different database software here.
* `username:password` is where you will substitute your username and password.
* `host` is usually just localhost.
* `port` does not need to be specified most of the time.
* `database` is the name of the database to connect to.
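
To make the parts of the connection URL concrete, here is a minimal sketch that splits the example URL using Python's standard `urllib.parse` module. This is only an illustration of the format; it is not how `ipython-sql` parses the URL internally, and the `components` dictionary is a name made up for this example.

```python
from urllib.parse import urlsplit

# Example URL from above; the password is a placeholder, not a real credential.
url = "postgresql://postgres:password123@localhost/dvdrental"

parts = urlsplit(url)

# Map each piece of the URL onto the names used in the format string.
components = {
    "dialect+driver": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,  # None when the URL does not specify a port
    "database": parts.path.lstrip("/"),
}

for name, value in components.items():
    print(f"{name}: {value}")
```

Because the example URL omits the port, `components["port"]` is `None` here.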

In [3]:
%sql  postgresql://postgres:trialpostgres@localhost/dvdrental

Now we are ready to work with our database.

In [4]:
%sql SELECT * FROM actor

 * postgresql://postgres:***@localhost/dvdrental
200 rows affected.


actor_id,first_name,last_name,last_update
1,Penelope,Guiness,2013-05-26 14:47:57.620000
2,Nick,Wahlberg,2013-05-26 14:47:57.620000
3,Ed,Chase,2013-05-26 14:47:57.620000
4,Jennifer,Davis,2013-05-26 14:47:57.620000
5,Johnny,Lollobrigida,2013-05-26 14:47:57.620000
6,Bette,Nicholson,2013-05-26 14:47:57.620000
7,Grace,Mostel,2013-05-26 14:47:57.620000
8,Matthew,Johansson,2013-05-26 14:47:57.620000
9,Joe,Swank,2013-05-26 14:47:57.620000
10,Christian,Gable,2013-05-26 14:47:57.620000


## Simple `WHERE` clauses

So far we've seen pretty simple `WHERE` clauses, e.g. find the actor with `actor_id` equal to 3.


In [5]:
%%sql 

SELECT * 
FROM actor 
WHERE actor_id = 3

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


actor_id,first_name,last_name,last_update
3,Ed,Chase,2013-05-26 14:47:57.620000


## Additional comparison operators

OK, but what about this problem:<br>

> "Find all films with a `rental_duration` of 3 days or less."

We can solve this using operators other than `=`


| operator | meaning |
| --- | --- |
| != | not equal to |
| > | greater than |
| < | less than |
| >= | greater than or equal to |
| <= | less than or equal to |

In [6]:
%%sql 

SELECT * 
FROM film 
WHERE rental_duration <= 3

 * postgresql://postgres:***@localhost/dvdrental
203 rows affected.


film_id,title,description,release_year,language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features,fulltext
2,Ace Goldfinger,A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China,2006,1,3,4.99,48,12.99,G,2013-05-26 14:50:58.951000,"['Trailers', 'Deleted Scenes']",'ace':1 'administr':9 'ancient':19 'astound':4 'car':17 'china':20 'databas':8 'epistl':5 'explor':12 'find':15 'goldfing':2 'must':14
6,Agent Truman,A Intrepid Panorama of a Robot And a Boy who must Escape a Sumo Wrestler in Ancient China,2006,1,3,2.99,169,17.99,PG,2013-05-26 14:50:58.951000,['Deleted Scenes'],'agent':1 'ancient':19 'boy':11 'china':20 'escap':14 'intrepid':4 'must':13 'panorama':5 'robot':8 'sumo':16 'truman':2 'wrestler':17
9,Alabama Devil,A Thoughtful Panorama of a Database Administrator And a Mad Scientist who must Outgun a Mad Scientist in A Jet Boat,2006,1,3,2.99,114,21.99,PG-13,2013-05-26 14:50:58.951000,"['Trailers', 'Deleted Scenes']","'administr':9 'alabama':1 'boat':23 'databas':8 'devil':2 'jet':22 'mad':12,18 'must':15 'outgun':16 'panorama':5 'scientist':13,19 'thought':4"
17,Alone Trip,A Fast-Paced Character Study of a Composer And a Dog who must Outgun a Boat in An Abandoned Fun House,2006,1,3,0.99,82,14.99,R,2013-05-26 14:50:58.951000,"['Trailers', 'Behind the Scenes']",'abandon':22 'alon':1 'boat':19 'charact':7 'compos':11 'dog':14 'fast':5 'fast-pac':4 'fun':23 'hous':24 'must':16 'outgun':17 'pace':6 'studi':8 'trip':2
21,American Circus,A Insightful Drama of a Girl And a Astronaut who must Face a Database Administrator in A Shark Tank,2006,1,3,4.99,129,17.99,R,2013-05-26 14:50:58.951000,"['Commentaries', 'Behind the Scenes']",'administr':17 'american':1 'astronaut':11 'circus':2 'databas':16 'drama':5 'face':14 'girl':8 'insight':4 'must':13 'shark':20 'tank':21
23,Anaconda Confessions,A Lacklusture Display of a Dentist And a Dentist who must Fight a Girl in Australia,2006,1,3,0.99,92,9.99,R,2013-05-26 14:50:58.951000,"['Trailers', 'Deleted Scenes']","'anaconda':1 'australia':18 'confess':2 'dentist':8,11 'display':5 'fight':14 'girl':16 'lacklustur':4 'must':13"
25,Angels Life,A Thoughtful Display of a Woman And a Astronaut who must Battle a Robot in Berlin,2006,1,3,2.99,74,15.99,G,2013-05-26 14:50:58.951000,['Trailers'],'angel':1 'astronaut':11 'battl':14 'berlin':18 'display':5 'life':2 'must':13 'robot':16 'thought':4 'woman':8
26,Annie Identity,A Amazing Panorama of a Pastry Chef And a Boat who must Escape a Woman in An Abandoned Amusement Park,2006,1,3,0.99,86,15.99,G,2013-05-26 14:50:58.951000,"['Commentaries', 'Deleted Scenes']",'abandon':20 'amaz':4 'amus':21 'anni':1 'boat':12 'chef':9 'escap':15 'ident':2 'must':14 'panorama':5 'park':22 'pastri':8 'woman':17
37,Arizona Bang,A Brilliant Panorama of a Mad Scientist And a Mad Cow who must Meet a Pioneer in A Monastery,2006,1,3,2.99,121,28.99,PG,2013-05-26 14:50:58.951000,"['Trailers', 'Deleted Scenes']","'arizona':1 'bang':2 'brilliant':4 'cow':13 'mad':8,12 'meet':16 'monasteri':21 'must':15 'panorama':5 'pioneer':18 'scientist':9"
46,Autumn Crow,A Beautiful Tale of a Dentist And a Mad Cow who must Battle a Moose in The Sahara Desert,2006,1,3,4.99,108,13.99,G,2013-05-26 14:50:58.951000,"['Trailers', 'Commentaries', 'Deleted Scenes', 'Behind the Scenes']",'autumn':1 'battl':15 'beauti':4 'cow':12 'crow':2 'dentist':8 'desert':21 'mad':11 'moos':17 'must':14 'sahara':20 'tale':5


<blockquote class='task'>

Task - 2 mins
<br>
    
Write and execute a query answering this problem:
<br>
  <br>  
<center>"Find all countries other than Brazil."</center>

<details>
<summary><b>Solution</b></summary>

 %%sql 
           SELECT * 
           FROM country 
           WHERE country != 'Brazil'

</details>
</blockquote>

In [8]:
%%sql

SELECT * 
FROM country
WHERE country != 'Brazil'

 * postgresql://postgres:***@localhost/dvdrental
108 rows affected.


country_id,country,last_update
1,Afghanistan,2006-02-15 09:44:00
2,Algeria,2006-02-15 09:44:00
3,American Samoa,2006-02-15 09:44:00
4,Angola,2006-02-15 09:44:00
5,Anguilla,2006-02-15 09:44:00
6,Argentina,2006-02-15 09:44:00
7,Armenia,2006-02-15 09:44:00
8,Australia,2006-02-15 09:44:00
9,Austria,2006-02-15 09:44:00
10,Azerbaijan,2006-02-15 09:44:00


## `AND` and `OR` 

If required, we can create more complex clauses using the `AND` and `OR` operators:

> "Find all films with a `rental_rate` greater than or equal to 4 and a `rental_duration` greater than 3 days."

In [9]:
%%sql

SELECT film_id, title, rental_rate, rental_duration
FROM film 
WHERE rental_rate >= 4 AND rental_duration > 3

 * postgresql://postgres:***@localhost/dvdrental
274 rows affected.


film_id,title,rental_rate,rental_duration
133,Chamber Italian,4.99,7
384,Grosse Wonderful,4.99,5
8,Airport Pollock,4.99,6
98,Bright Encounters,4.99,4
7,Airplane Sierra,4.99,6
10,Aladdin Calendar,4.99,6
13,Ali Forever,4.99,4
20,Amelie Hellfighters,4.99,4
28,Anthem Luke,4.99,5
31,Apache Divine,4.99,5


Sometimes we have to be careful with the **order of evaluation** of conditions. Consider the following example:

> "Find all films with a `rental_rate` equal to 4.99 and either a `rental_duration` of 3 or lower or a `release_year` of 2006".



The logic of this as written is fairly clear. All the returned films should have a rental rate equal to 4.99 **and** they should either have been released in 2006 **or** have a rental duration of 3 or less!


In [10]:
%%sql

SELECT film_id, title, rental_rate, rental_duration, release_year
FROM film 
WHERE rental_rate = 4.99 AND rental_duration <= 3 OR release_year = 2006

 * postgresql://postgres:***@localhost/dvdrental
1000 rows affected.


film_id,title,rental_rate,rental_duration,release_year
133,Chamber Italian,4.99,7,2006
384,Grosse Wonderful,4.99,5,2006
8,Airport Pollock,4.99,6,2006
98,Bright Encounters,4.99,4,2006
1,Academy Dinosaur,0.99,6,2006
2,Ace Goldfinger,4.99,3,2006
3,Adaptation Holes,2.99,7,2006
4,Affair Prejudice,2.99,5,2006
5,African Egg,2.99,6,2006
6,Agent Truman,2.99,3,2006


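Argh! We get films with rental rates other than 4.99! What's gone wrong? We need to worry about the order of evaluation. `AND` binds more tightly than `OR`, so the query above was evaluated as `(rental_rate = 4.99 AND rental_duration <= 3) OR release_year = 2006`, and every film released in 2006 was returned regardless of its rental rate. We want the `OR` operation to execute **before** the `AND` operation, and we enforce this by using parentheses!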
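
The effect of operator precedence can be reproduced on a tiny made-up table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for PostgreSQL (both give `AND` higher precedence than `OR`); the four films and the `titles()` helper are invented for this illustration and are not part of the DVD Rental data.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE film (title TEXT, rental_rate REAL, "
    "rental_duration INTEGER, release_year INTEGER)"
)
# Made-up rows chosen so that each condition can be told apart.
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?, ?)",
    [
        ("Cheap And Short", 0.99, 3, 2006),
        ("Cheap And Long", 0.99, 7, 2006),
        ("Pricey And Short", 4.99, 3, 2005),
        ("Pricey And Long", 4.99, 7, 2005),
    ],
)

def titles(where):
    """Return the sorted titles matching a WHERE condition (hypothetical helper)."""
    sql = f"SELECT title FROM film WHERE {where} ORDER BY title"
    return [row[0] for row in conn.execute(sql)]

# No parentheses: AND is evaluated first.
unparenthesised = titles(
    "rental_rate = 4.99 AND rental_duration <= 3 OR release_year = 2006"
)
# The same grouping written out explicitly.
and_first = titles(
    "(rental_rate = 4.99 AND rental_duration <= 3) OR release_year = 2006"
)
# Parentheses force the OR to be evaluated first.
or_first = titles(
    "rental_rate = 4.99 AND (rental_duration <= 3 OR release_year = 2006)"
)

print(unparenthesised == and_first)
print(or_first)
```

Only the parenthesised `OR` version restricts the result to films with a rental rate of 4.99.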


In [11]:
%%sql

SELECT film_id, title, rental_rate, rental_duration, release_year
FROM film 
WHERE rental_rate = 4.99 AND (rental_duration <= 3 OR release_year = 2006)

 * postgresql://postgres:***@localhost/dvdrental
336 rows affected.


film_id,title,rental_rate,rental_duration,release_year
133,Chamber Italian,4.99,7,2006
384,Grosse Wonderful,4.99,5,2006
8,Airport Pollock,4.99,6,2006
98,Bright Encounters,4.99,4,2006
2,Ace Goldfinger,4.99,3,2006
7,Airplane Sierra,4.99,6,2006
10,Aladdin Calendar,4.99,6,2006
13,Ali Forever,4.99,4,2006
20,Amelie Hellfighters,4.99,4,2006
21,American Circus,4.99,3,2006



That's better! Let's see a few more examples using `AND` and `OR`:

> "Find all films with a rental duration between 3 and 5 days, inclusive".


In [13]:
%%sql

SELECT film_id, title, description
FROM film 
WHERE rental_duration >= 3 AND rental_duration <= 5

 * postgresql://postgres:***@localhost/dvdrental
597 rows affected.


film_id,title,description
384,Grosse Wonderful,A Epic Drama of a Cat And a Explorer who must Redeem a Moose in Australia
98,Bright Encounters,A Fateful Yarn of a Lumberjack And a Feminist who must Conquer a Student in A Jet Boat
2,Ace Goldfinger,A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China
4,Affair Prejudice,A Fanciful Documentary of a Frisbee And a Lumberjack who must Chase a Monkey in A Shark Tank
6,Agent Truman,A Intrepid Panorama of a Robot And a Boy who must Escape a Sumo Wrestler in Ancient China
9,Alabama Devil,A Thoughtful Panorama of a Database Administrator And a Mad Scientist who must Outgun a Mad Scientist in A Jet Boat
213,Date Speed,A Touching Saga of a Composer And a Moose who must Discover a Dentist in A MySQL Convention
13,Ali Forever,A Action-Packed Drama of a Dentist And a Crocodile who must Battle a Feminist in The Canadian Rockies
15,Alien Center,A Brilliant Drama of a Cat And a Mad Scientist who must Battle a Feminist in A MySQL Convention
17,Alone Trip,A Fast-Paced Character Study of a Composer And a Dog who must Outgun a Boat in An Abandoned Fun House


> "Find all films that were released in years other than 2006".


In [14]:
%%sql

SELECT film_id, title, description, release_year
FROM film 
WHERE release_year < '2006' OR release_year > '2006'

 * postgresql://postgres:***@localhost/dvdrental
0 rows affected.


film_id,title,description,release_year


In [15]:
%%sql

SELECT * 
FROM film 
WHERE release_year != '2006' 

 * postgresql://postgres:***@localhost/dvdrental
0 rows affected.


film_id,title,description,release_year,language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features,fulltext


## `BETWEEN`, `NOT` and `IN`

The syntax in the last two examples is clumsy: it would be better to be able to define a range and then `SELECT` for records with fields in/not in that range. The `BETWEEN` keyword lets us do just that! 

Let's rewrite one of the previous queries using `BETWEEN`:

In [16]:
%%sql

SELECT film_id, title, description
FROM film 
WHERE rental_duration BETWEEN 3 AND  5 

 * postgresql://postgres:***@localhost/dvdrental
597 rows affected.


film_id,title,description
384,Grosse Wonderful,A Epic Drama of a Cat And a Explorer who must Redeem a Moose in Australia
98,Bright Encounters,A Fateful Yarn of a Lumberjack And a Feminist who must Conquer a Student in A Jet Boat
2,Ace Goldfinger,A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China
4,Affair Prejudice,A Fanciful Documentary of a Frisbee And a Lumberjack who must Chase a Monkey in A Shark Tank
6,Agent Truman,A Intrepid Panorama of a Robot And a Boy who must Escape a Sumo Wrestler in Ancient China
9,Alabama Devil,A Thoughtful Panorama of a Database Administrator And a Mad Scientist who must Outgun a Mad Scientist in A Jet Boat
213,Date Speed,A Touching Saga of a Composer And a Moose who must Discover a Dentist in A MySQL Convention
13,Ali Forever,A Action-Packed Drama of a Dentist And a Crocodile who must Battle a Feminist in The Canadian Rockies
15,Alien Center,A Brilliant Drama of a Cat And a Mad Scientist who must Battle a Feminist in A MySQL Convention
17,Alone Trip,A Fast-Paced Character Study of a Composer And a Dog who must Outgun a Boat in An Abandoned Fun House


We can also check for the opposite using `NOT`

In [17]:
%%sql

SELECT * 
FROM film 
WHERE rental_duration NOT BETWEEN 3 AND  5

 * postgresql://postgres:***@localhost/dvdrental
403 rows affected.


film_id,title,description,release_year,language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features,fulltext
133,Chamber Italian,A Fateful Reflection of a Moose And a Husband who must Overcome a Monkey in Nigeria,2006,1,7,4.99,117,14.99,NC-17,2013-05-26 14:50:58.951000,['Trailers'],'chamber':1 'fate':4 'husband':11 'italian':2 'monkey':16 'moos':8 'must':13 'nigeria':18 'overcom':14 'reflect':5
8,Airport Pollock,A Epic Tale of a Moose And a Girl who must Confront a Monkey in Ancient India,2006,1,6,4.99,54,15.99,R,2013-05-26 14:50:58.951000,['Trailers'],'airport':1 'ancient':18 'confront':14 'epic':4 'girl':11 'india':19 'monkey':16 'moos':8 'must':13 'pollock':2 'tale':5
1,Academy Dinosaur,A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies,2006,1,6,0.99,86,20.99,PG,2013-05-26 14:50:58.951000,"['Deleted Scenes', 'Behind the Scenes']",'academi':1 'battl':15 'canadian':20 'dinosaur':2 'drama':5 'epic':4 'feminist':8 'mad':11 'must':14 'rocki':21 'scientist':12 'teacher':17
3,Adaptation Holes,A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory,2006,1,7,2.99,50,18.99,NC-17,2013-05-26 14:50:58.951000,"['Trailers', 'Deleted Scenes']","'adapt':1 'astound':4 'baloon':19 'car':11 'factori':20 'hole':2 'lumberjack':8,16 'must':13 'reflect':5 'sink':14"
5,African Egg,A Fast-Paced Documentary of a Pastry Chef And a Dentist who must Pursue a Forensic Psychologist in The Gulf of Mexico,2006,1,6,2.99,130,22.99,G,2013-05-26 14:50:58.951000,['Deleted Scenes'],'african':1 'chef':11 'dentist':14 'documentari':7 'egg':2 'fast':5 'fast-pac':4 'forens':19 'gulf':23 'mexico':25 'must':16 'pace':6 'pastri':10 'psychologist':20 'pursu':17
7,Airplane Sierra,A Touching Saga of a Hunter And a Butler who must Discover a Butler in A Jet Boat,2006,1,6,4.99,62,28.99,PG-13,2013-05-26 14:50:58.951000,"['Trailers', 'Deleted Scenes']","'airplan':1 'boat':20 'butler':11,16 'discov':14 'hunter':8 'jet':19 'must':13 'saga':5 'sierra':2 'touch':4"
10,Aladdin Calendar,A Action-Packed Tale of a Man And a Lumberjack who must Reach a Feminist in Ancient China,2006,1,6,4.99,63,24.99,NC-17,2013-05-26 14:50:58.951000,"['Trailers', 'Deleted Scenes']",'action':5 'action-pack':4 'aladdin':1 'ancient':20 'calendar':2 'china':21 'feminist':18 'lumberjack':13 'man':10 'must':15 'pack':6 'reach':16 'tale':7
11,Alamo Videotape,A Boring Epistle of a Butler And a Cat who must Fight a Pastry Chef in A MySQL Convention,2006,1,6,0.99,126,16.99,G,2013-05-26 14:50:58.951000,"['Commentaries', 'Behind the Scenes']",'alamo':1 'bore':4 'butler':8 'cat':11 'chef':17 'convent':21 'epistl':5 'fight':14 'must':13 'mysql':20 'pastri':16 'videotap':2
12,Alaska Phantom,A Fanciful Saga of a Hunter And a Pastry Chef who must Vanquish a Boy in Australia,2006,1,6,0.99,136,22.99,PG,2013-05-26 14:50:58.951000,"['Commentaries', 'Deleted Scenes']",'alaska':1 'australia':19 'boy':17 'chef':12 'fanci':4 'hunter':8 'must':14 'pastri':11 'phantom':2 'saga':5 'vanquish':15
14,Alice Fantasia,A Emotional Drama of a A Shark And a Database Administrator who must Vanquish a Pioneer in Soviet Georgia,2006,1,6,0.99,94,23.99,NC-17,2013-05-26 14:50:58.951000,"['Trailers', 'Deleted Scenes', 'Behind the Scenes']",'administr':13 'alic':1 'databas':12 'drama':5 'emot':4 'fantasia':2 'georgia':21 'must':15 'pioneer':18 'shark':9 'soviet':20 'vanquish':16


Note these two points:

* the range defined by `BETWEEN` is **inclusive** of the end points. So, in the first example, records with `rental_duration` of exactly 3 or 5 will be selected
* we use the `NOT BETWEEN` combination in the second example to select all records with `rental_duration` not in the range (where, again, the end points are included in the range).
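
The two points above can be checked on a small made-up table. This is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the seven rows and the `durations()` helper are assumptions made for the illustration.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, rental_duration INTEGER)")
# One made-up film for each rental duration from 1 to 7.
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [(f"Film {d}", d) for d in range(1, 8)],
)

def durations(where):
    """Return the sorted rental durations matching a WHERE condition."""
    sql = f"SELECT rental_duration FROM film WHERE {where} ORDER BY rental_duration"
    return [row[0] for row in conn.execute(sql)]

inside = durations("rental_duration BETWEEN 3 AND 5")
outside = durations("rental_duration NOT BETWEEN 3 AND 5")
explicit = durations("rental_duration >= 3 AND rental_duration <= 5")

print(inside)    # end points 3 and 5 are included
print(outside)   # end points 3 and 5 are excluded
print(inside == explicit)
```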


The `IN` operator helps us deal with discrete valued fields more efficiently. Consider:


> "Find whether the countries Spain, South Africa, Ireland or Germany are in our database"

The long-winded way to write this is

In [18]:
%%sql

SELECT *
FROM country
WHERE country = 'Spain' OR country = 'South Africa' OR country = 'Ireland' OR country = 'Germany'

 * postgresql://postgres:***@localhost/dvdrental
3 rows affected.


country_id,country,last_update
38,Germany,2006-02-15 09:44:00
85,South Africa,2006-02-15 09:44:00
87,Spain,2006-02-15 09:44:00


or using `IN`

In [19]:
%%sql

SELECT * 
FROM country 
WHERE country IN ('Spain', 'South Africa', 'Ireland', 'Germany')


 * postgresql://postgres:***@localhost/dvdrental
3 rows affected.


country_id,country,last_update
38,Germany,2006-02-15 09:44:00
85,South Africa,2006-02-15 09:44:00
87,Spain,2006-02-15 09:44:00


We can also use `NOT` with `IN`

> "Find all countries other than Finland, Argentina or Canada."


In [20]:
%%sql

SELECT * 
FROM country 
WHERE country NOT IN ('Finland', 'Argentina', 'Canada')

 * postgresql://postgres:***@localhost/dvdrental
106 rows affected.


country_id,country,last_update
1,Afghanistan,2006-02-15 09:44:00
2,Algeria,2006-02-15 09:44:00
3,American Samoa,2006-02-15 09:44:00
4,Angola,2006-02-15 09:44:00
5,Anguilla,2006-02-15 09:44:00
7,Armenia,2006-02-15 09:44:00
8,Australia,2006-02-15 09:44:00
9,Austria,2006-02-15 09:44:00
10,Azerbaijan,2006-02-15 09:44:00
11,Bahrain,2006-02-15 09:44:00
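
When the list of values comes from Python rather than being typed by hand, `IN` combines naturally with query parameters. The following is a minimal sketch using the built-in `sqlite3` module as a stand-in for PostgreSQL; the five-row `country` table is made up, and the `?` placeholder style is specific to `sqlite3` (other drivers use different placeholder styles).

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (country TEXT)")
conn.executemany(
    "INSERT INTO country VALUES (?)",
    [("Argentina",), ("Canada",), ("Germany",), ("South Africa",), ("Spain",)],
)

wanted = ["Spain", "South Africa", "Ireland", "Germany"]
# Build one "?" placeholder per value, e.g. "?, ?, ?, ?".
placeholders = ", ".join("?" for _ in wanted)

found = [
    row[0]
    for row in conn.execute(
        f"SELECT country FROM country WHERE country IN ({placeholders}) "
        "ORDER BY country",
        wanted,
    )
]
others = [
    row[0]
    for row in conn.execute(
        f"SELECT country FROM country WHERE country NOT IN ({placeholders}) "
        "ORDER BY country",
        wanted,
    )
]

print(found)   # Ireland is not in the table, so it is simply absent
print(others)
```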


## `LIKE` and wildcards

Your manager comes to you and says

> "I was talking with a colleague about a movie last month, it was about a Crocodile in Space or something. I can't remember the  name exactly, I think it began 'Conne...' something. Can you find it?"

You can see we're dealing with an **inexact comparison** here. How do we do this? Using the `LIKE` operator with a **wildcard**.

In [21]:
%%sql

SELECT film_id, title, description
FROM film 
WHERE title LIKE 'Conne%'

 * postgresql://postgres:***@localhost/dvdrental
2 rows affected.


film_id,title,description
177,Connecticut Tramp,A Unbelieveable Drama of a Crocodile And a Mad Cow who must Reach a Dentist in A Shark Tank
178,Connection Microcosmos,A Fateful Documentary of a Crocodile And a Husband who must Face a Husband in The First Manned Space Station


Here are the wildcards we can use: 

| wildcard | meaning |
| --- | --- |
| _ | exactly one character |
| % | any sequence of zero or more characters |


We can place wildcards **anywhere** inside the string in the condition:


> "Find all actors with last names containing the phrase 'ere' anywhere"

In [22]:
%%sql

SELECT * 
FROM actor
WHERE last_name LIKE '%ere%'

 * postgresql://postgres:***@localhost/dvdrental
3 rows affected.


actor_id,first_name,last_name,last_update
41,Jodie,Degeneres,2013-05-26 14:47:57.620000
107,Gina,Degeneres,2013-05-26 14:47:57.620000
166,Nick,Degeneres,2013-05-26 14:47:57.620000


> "Find all actors with a first name beginning with 'D'"

In [26]:
%%sql

SELECT * 
FROM actor 
WHERE first_name ILIKE 'd%'

 * postgresql://postgres:***@localhost/dvdrental
7 rows affected.


actor_id,first_name,last_name,last_update
18,Dan,Torn,2013-05-26 14:47:57.620000
56,Dan,Harris,2013-05-26 14:47:57.620000
59,Dustin,Tautou,2013-05-26 14:47:57.620000
95,Daryl,Wahlberg,2013-05-26 14:47:57.620000
116,Dan,Streep,2013-05-26 14:47:57.620000
129,Daryl,Crawford,2013-05-26 14:47:57.620000
182,Debbie,Akroyd,2013-05-26 14:47:57.620000


<blockquote class='task'>
<b>Task - 2 mins</b> Write a query using `LIKE` and wildcards to answer:

    
    "Find all actors having 'a' as the second letter of their first names."

<details>
<summary><b>Hint</b></summary>
You can use a '_' wildcard for the first letter of `first_name`.
</details>
<br>
<details>
<summary><b>Solution</b></summary>
%%sql
SELECT * 
FROM actor 
WHERE first_name LIKE '_a%'

</details>
</blockquote>


`LIKE` distinguishes between capital and lower-case letters. If we need a case-insensitive version, we can use `ILIKE`, a PostgreSQL extension that is not part of standard SQL, as we did in the query above.
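
One way to build intuition for the wildcards is to translate a `LIKE` pattern into a regular expression. The sketch below is purely illustrative: `like_to_regex()` and `sql_like()` are hypothetical helpers written for this lesson, they ignore `LIKE` escape characters, and they are not how PostgreSQL implements pattern matching.

```python
import re

def like_to_regex(pattern):
    """Translate a LIKE pattern into an equivalent regular expression string."""
    pieces = []
    for ch in pattern:
        if ch == "%":
            pieces.append(".*")           # any sequence of zero or more characters
        elif ch == "_":
            pieces.append(".")            # exactly one character
        else:
            pieces.append(re.escape(ch))  # everything else matches literally
    return "".join(pieces)

def sql_like(value, pattern, case_insensitive=False):
    """Mimic LIKE (or ILIKE when case_insensitive=True) on a Python string."""
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    # LIKE must match the whole value, hence fullmatch rather than search.
    return re.fullmatch(like_to_regex(pattern), value, flags) is not None

print(sql_like("Connecticut Tramp", "Conne%"))
print(sql_like("Degeneres", "%ere%"))
print(sql_like("Dan", "d%"))                         # LIKE is case-sensitive
print(sql_like("Dan", "d%", case_insensitive=True))  # ILIKE-style match
print(sql_like("Dan", "_a%"))
```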


## `IS NULL`

> "We need to ensure our staff records are up-to-date. Find all the staff who do not have a picture in the system."

In [27]:
%%sql 

SELECT *
FROM staff
WHERE picture IS NULL 

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


staff_id,first_name,last_name,address_id,email,store_id,active,username,password,last_update,picture
2,Jon,Stephens,4,Jon.Stephens@sakilastaff.com,2,True,Jon,8cb2237d0679ca88db6464eac60da96345513964,2006-05-16 16:13:11.793280,



We can also use the `NOT` operator here too: `IS NOT NULL` is a valid condition!
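
It is worth seeing why we write `IS NULL` rather than `= NULL`: comparing anything with `NULL` using `=` yields `NULL` rather than true, so no row passes the filter. A minimal sketch, using the built-in `sqlite3` module as a stand-in for PostgreSQL (both follow this rule); the two staff rows and the `names()` helper are made up for the illustration.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (first_name TEXT, picture BLOB)")
# Made-up rows: one member of staff with a picture, one without.
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Mike", b"\x89PNG"), ("Jon", None)],
)

def names(where):
    """Return the first names of staff matching a WHERE condition."""
    return [row[0] for row in conn.execute(f"SELECT first_name FROM staff WHERE {where}")]

equals_null = names("picture = NULL")      # the comparison is never true
is_null = names("picture IS NULL")
is_not_null = names("picture IS NOT NULL")

print(equals_null)
print(is_null)
print(is_not_null)
```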


## Recap - Full query syntax

In summary, here are the different keyword components of a `SELECT` query, the order in which they must appear, and whether they are required or optional <br>

This table shows where we've got to:<br>

| Order | Keyword | Specifies | Required? |
| --- | --- |--- | --- |
| 1 | SELECT | Column to query | Yes |
| 2 | FROM | Table to query |  Yes |
| 3 | WHERE | Row-level filter | No |

<br>
while here's a look ahead at what's still to come:<br>

| Order | Keyword | Specifies | Required? |
| --- | --- |--- | --- |
| 4 | GROUP BY | Grouping for aggregates | No |
| 5 | HAVING | Group-level filter | No |
| 6 | ORDER BY | Sort order | No |
| 7 | LIMIT | How many records to return | No |




## Manipulating returned data

After we have applied conditions to filter data in a `SELECT` statement, we can also manipulate what is returned. The easiest way to do this is to limit the returned fields!

In [31]:
%%sql 

SELECT actor_id, first_name, last_name
FROM actor
WHERE last_update = '2013-05-26 14:47:57.620000'

 * postgresql://postgres:***@localhost/dvdrental
200 rows affected.


actor_id,first_name,last_name
1,Penelope,Guiness
2,Nick,Wahlberg
3,Ed,Chase
4,Jennifer,Davis
5,Johnny,Lollobrigida
6,Bette,Nicholson
7,Grace,Mostel
8,Matthew,Johansson
9,Joe,Swank
10,Christian,Gable


So here we return only the `actor_id`, `first_name` and `last_name`.

SQL offers additional operators to manipulate the return.

## Aliases via `AS`

> “Can we get a list of all actors with their first and last names combined together into one field called full_name?”
    
**Column aliases** are the way to solve problems like these! We use the `CONCAT()` function to **concatenate** (this is just a fancy way of saying ‘join strings together’) each pair of names into the full name. We set up a column alias using `AS full_name` to store the concatenated strings.

In [33]:
%%sql
SELECT actor_id, first_name, last_name, CONCAT(first_name, ' ', last_name) AS full_name 
FROM actor

 * postgresql://postgres:***@localhost/dvdrental
200 rows affected.


actor_id,first_name,last_name,full_name
1,Penelope,Guiness,Penelope Guiness
2,Nick,Wahlberg,Nick Wahlberg
3,Ed,Chase,Ed Chase
4,Jennifer,Davis,Jennifer Davis
5,Johnny,Lollobrigida,Johnny Lollobrigida
6,Bette,Nicholson,Bette Nicholson
7,Grace,Mostel,Grace Mostel
8,Matthew,Johansson,Matthew Johansson
9,Joe,Swank,Joe Swank
10,Christian,Gable,Christian Gable


The new `full_name` column appears at the right of the output. There is a potential problem, though: PostgreSQL's `CONCAT()` ignores `NULL` arguments, so a record with a missing first or last name would end up with a single name in `full_name`. This reflects a problem with the underlying data. We could add a `WHERE` clause to filter out these problem rows.

In [34]:
%%sql

SELECT *, CONCAT(first_name, ' ', last_name) AS full_name 
FROM actor 
WHERE first_name IS NOT NULL AND last_name IS NOT NULL

 * postgresql://postgres:***@localhost/dvdrental
200 rows affected.


actor_id,first_name,last_name,last_update,full_name
1,Penelope,Guiness,2013-05-26 14:47:57.620000,Penelope Guiness
2,Nick,Wahlberg,2013-05-26 14:47:57.620000,Nick Wahlberg
3,Ed,Chase,2013-05-26 14:47:57.620000,Ed Chase
4,Jennifer,Davis,2013-05-26 14:47:57.620000,Jennifer Davis
5,Johnny,Lollobrigida,2013-05-26 14:47:57.620000,Johnny Lollobrigida
6,Bette,Nicholson,2013-05-26 14:47:57.620000,Bette Nicholson
7,Grace,Mostel,2013-05-26 14:47:57.620000,Grace Mostel
8,Matthew,Johansson,2013-05-26 14:47:57.620000,Matthew Johansson
9,Joe,Swank,2013-05-26 14:47:57.620000,Joe Swank
10,Christian,Gable,2013-05-26 14:47:57.620000,Christian Gable


It is good practice to use aliases when creating new columns or using aggregate functions (which we will come onto soon), so that when someone else (including your future self!) uses your output, the column and result names have meaning.
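
To see what an alias changes, we can inspect the column names a query reports. This is a minimal sketch using the built-in `sqlite3` module as a stand-in for PostgreSQL. Note one substitution: it joins strings with the standard `||` operator instead of `CONCAT()`, because older SQLite versions do not provide `CONCAT()`. The two actor rows are copied from the output above purely as sample data.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO actor VALUES (?, ?)",
    [("Penelope", "Guiness"), ("Nick", "Wahlberg")],
)

# Without an alias, the database invents a name for the computed column.
unnamed = conn.execute(
    "SELECT first_name || ' ' || last_name FROM actor ORDER BY last_name"
)
# With an alias, the column gets the name we chose.
aliased = conn.execute(
    "SELECT first_name || ' ' || last_name AS full_name FROM actor ORDER BY last_name"
)

# cursor.description holds one tuple per column; the first item is the column name.
unnamed_col = unnamed.description[0][0]
aliased_col = aliased.description[0][0]
full_names = [row[0] for row in aliased]

print(unnamed_col)
print(aliased_col)
print(full_names)
```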

## `DISTINCT`

> “Our database may be out of date! After the recent restructuring, we should now have 16 film categories in the database. How many categories do films belong to at present in the database?”

How do we solve this problem? It’s not enough to simply return **all** the categories of the films, as there will be a large amount of duplication. Instead, we need the **unique** list of categories. The `DISTINCT` keyword removes duplicate rows from the result. Note that `DISTINCT` is a keyword rather than a function: it applies to the whole combination of selected columns, so `SELECT DISTINCT(category_id), name` means the same as `SELECT DISTINCT category_id, name`.

In [35]:
%%sql

SELECT DISTINCT(category_id), name
FROM category

 * postgresql://postgres:***@localhost/dvdrental
16 rows affected.


category_id,name
16,Travel
3,Children
11,Horror
6,Documentary
7,Drama
2,Animation
9,Foreign
8,Family
15,Sports
12,Music
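
The effect of `DISTINCT` is easier to see on a table that actually contains duplicates. Below is a minimal sketch using the built-in `sqlite3` module as a stand-in for PostgreSQL; the five rows are made up, loosely modelled on a film-to-category link table with `film_id` and `category_id` columns.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film_category (film_id INTEGER, category_id INTEGER)")
# Made-up rows: five films spread over three categories.
conn.executemany(
    "INSERT INTO film_category VALUES (?, ?)",
    [(1, 6), (2, 11), (3, 6), (4, 11), (5, 8)],
)

all_ids = [
    row[0]
    for row in conn.execute(
        "SELECT category_id FROM film_category ORDER BY category_id"
    )
]
unique_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT category_id FROM film_category ORDER BY category_id"
    )
]
# DISTINCT applies to the whole row: every (film_id, category_id) pair is
# already unique, so nothing is removed here.
unique_pairs = conn.execute(
    "SELECT DISTINCT film_id, category_id FROM film_category"
).fetchall()

print(all_ids)
print(unique_ids)
print(len(unique_pairs))
```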


## Aggregate functions

> “How many actor records were updated at 14:47:57 on 2013-05-26?”

The `COUNT()` aggregate function can help us with counting problems:

In [38]:
%%sql
SELECT COUNT(*) 
FROM actor 
WHERE last_update = '2013-05-26 14:47:57.620000' 

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


count
200


In addition to `COUNT()`, we have the following aggregate functions: 

| function | purpose |
| --- | --- |
| `SUM()` | sum of a column |
| `AVG()` | average of a column |
| `MIN()` | minimum value of a column |
| `MAX()` | maximum value of a column |
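
Here is a minimal sketch of these functions working together, using the built-in `sqlite3` module as a stand-in for PostgreSQL. The four-row table is made up, and it includes one `NULL` to show that `COUNT(*)` counts rows while the other aggregates skip `NULL` values.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, length INTEGER)")
# Made-up rows; film D has no recorded length.
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [("A", 50), ("B", 100), ("C", 150), ("D", None)],
)

row = conn.execute(
    """
    SELECT COUNT(*)      AS n_rows,
           COUNT(length) AS n_lengths,
           SUM(length)   AS total_length,
           AVG(length)   AS mean_length,
           MIN(length)   AS shortest,
           MAX(length)   AS longest
    FROM film
    """
).fetchone()

n_rows, n_lengths, total_length, mean_length, shortest, longest = row
print(row)
```

The average is taken over the three non-`NULL` lengths, not over all four rows.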

As mentioned above, it's good practice to use an alias (using the `AS` keyword) to give meaning to the result, such as:

In [39]:
%%sql
SELECT COUNT(*) AS updated_on_2013
FROM actor 
WHERE last_update = '2013-05-26 14:47:57.620000' 

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


updated_on_2013
200


**Task - 5 mins** Design queries using aggregate functions and what you have learned so far to answer the following question:

"What are the maximum and minimum rental rates among all films?"

<br>
<details>
    <summary><b>Solution</b></summary>

   %%sql  SELECT MAX(rental_rate) AS max_rating
           FROM film
    
  %%sql SELECT MIN(rental_rate) AS min_rating
           FROM film   
   

    
    
  ### Could also do it in a single query:
    
  %%sql SELECT MAX(rental_rate) AS max_rating,
           MIN(rental_rate) AS min_rating
           FROM film
    
</details>

## Sorting by columns

The `ORDER BY` clause lets us **sort** the results of queries, either in descending (`DESC`) or ascending (`ASC`) order. `ORDER BY` and its associated keywords **always come after** any `WHERE` clause.

The `LIMIT` clause is a natural partner to `ORDER BY`: it lets us limit **how many** records are returned by a query.

So, we saw before the minimum and maximum rental rates of films in the database. Let's find out which films have those rates using the new clauses!


In [44]:
%%sql 

SELECT rental_rate AS lowest_rate
FROM film 
WHERE rental_rate IS NOT NULL 
ORDER BY rental_rate ASC 
LIMIT 1

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


lowest_rate
0.99


In [47]:
%%sql 

SELECT title, rental_rate AS highest_rate
FROM film 
WHERE rental_rate IS NOT NULL 
ORDER BY rental_rate DESC 


 * postgresql://postgres:***@localhost/dvdrental
1000 rows affected.


title,highest_rate
French Holiday,4.99
Bucket Brotherhood,4.99
Frisco Forrest,4.99
Prejudice Oleander,4.99
Frontier Cabin,4.99
Poseidon Forever,4.99
Fugitive Maguire,4.99
Wyoming Storm,4.99
Pluto Oleander,4.99
Platoon Instinct,4.99


`NULL`s in the column we're sorting on can cause problems: by default PostgreSQL treats them as larger than any other value, so they come last in ascending order and first in descending order. We can either filter them out with a `WHERE` clause, as we did above, or use the modifiers `NULLS FIRST` or `NULLS LAST` to specify where to put them in the list of records. These modifiers are always placed immediately after `DESC` or `ASC` for the respective column.

Let's rewrite the queries above using these operators:


In [49]:
%%sql 

SELECT title, rental_rate 
FROM film 
ORDER BY rental_rate ASC NULLS LAST
LIMIT 1

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


title,rental_rate
Academy Dinosaur,0.99


In [None]:
%%sql 

SELECT * 
FROM film 
ORDER BY rental_rate DESC NULLS LAST
LIMIT 1
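
To see `NULL` placement in action without a PostgreSQL connection, here is a small sketch using Python's built-in `sqlite3` module and an invented three-row table. Two assumptions to keep in mind: SQLite's default is the opposite of PostgreSQL's (it sorts `NULL`s first in ascending order), and instead of the `NULLS LAST` keyword the sketch sorts on `rental_rate IS NULL` first, a portable way to get the same effect.

```python
import sqlite3

# Hypothetical data: one film has no rental rate recorded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, rental_rate REAL)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [("Alpha", 2.99), ("Bravo", None), ("Charlie", 0.99)],
)

# Plain ascending sort: SQLite places the NULL row first.
default_order = [row[0] for row in conn.execute(
    "SELECT title FROM film ORDER BY rental_rate ASC")]

# "rental_rate IS NULL" is 0 for real values and 1 for NULLs, so sorting on it
# first pushes the NULL row to the end, mimicking NULLS LAST.
nulls_last = [row[0] for row in conn.execute(
    "SELECT title FROM film ORDER BY rental_rate IS NULL, rental_rate ASC")]

print(default_order)
print(nulls_last)
```

With `LIMIT 1` added, only the second ordering would return a film that actually has a rate.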

Finally, we can perform **multi-level sorts** on two or more columns:


> "Order films by rental rate, highest first, and then alphabetically by film title." 


In [50]:
%%sql 
SELECT rental_rate, title
FROM film 
ORDER BY rental_rate DESC NULLS LAST, title ASC NULLS LAST

 * postgresql://postgres:***@localhost/dvdrental
1000 rows affected.


rental_rate,title
4.99,Ace Goldfinger
4.99,Airplane Sierra
4.99,Airport Pollock
4.99,Aladdin Calendar
4.99,Ali Forever
4.99,Amelie Hellfighters
4.99,American Circus
4.99,Anthem Luke
4.99,Apache Divine
4.99,Apocalypse Flamingos
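
The way the second sort column only breaks ties left by the first can be checked on a handful of rows. This is a minimal sketch with Python's `sqlite3` module and made-up film titles, not the real `film` table.

```python
import sqlite3

# Hypothetical films: two share each rental rate, so ties must be broken.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, rental_rate REAL)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [("Delta", 4.99), ("Bravo", 0.99), ("Alpha", 4.99), ("Charlie", 0.99)],
)

# First level: rate, highest first. Second level: title, alphabetical,
# applied only among rows that have the same rate.
rows = conn.execute(
    "SELECT rental_rate, title FROM film ORDER BY rental_rate DESC, title ASC"
).fetchall()

for rate, title in rows:
    print(rate, title)
```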


## MODE

How do we get 'the most common' value in a field? In statistics, this would correspond to the **mode**, but ANSI SQL (the language standard) does not offer a `MODE()` aggregate function (although PostgreSQL does as of version 9.4). So we need a way round that in cases where we are asked to find the most commonly occurring values. For example, if we were asked:


> "Find the most common rental rate across the whole database."

We will use a combination of the functions we have learnt so far to do this. Let's first write a `SELECT` to find "the most common rental rate (`rental_rate`) across the whole database". Let's think about the steps involved:

1. filter out any film with null `rental_rate`
2. group films by `rental_rate`
3. count the number of films in each group
4. return the `rental_rate` of the group with the highest count

In [55]:
%%sql 

SELECT rental_rate AS mode
FROM film
WHERE rental_rate IS NOT NULL
GROUP BY rental_rate
ORDER BY COUNT(rental_rate) DESC
LIMIT 1


 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


mode
0.99


This is the most common `rental_rate` in the entire database. In the next session we will learn about subqueries, and so how to use the result above to return the films that have this rental rate.
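
One caveat with the `LIMIT 1` approach is that it returns a single row even when several values tie for the highest count. The sketch below, using Python's `sqlite3` module and invented data in which two rates tie, runs the filter, group and count steps in SQL and then does the final step in Python so that every tied value is kept. Finishing in Python is a choice made here only because subqueries have not been covered yet.

```python
import sqlite3

# Hypothetical films: 0.99 and 4.99 each appear twice, and one rate is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, rental_rate REAL)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [("Alpha", 0.99), ("Bravo", 0.99), ("Charlie", 4.99),
     ("Delta", 4.99), ("Echo", 2.99), ("Foxtrot", None)],
)

# Steps 1 to 3: drop NULLs, group by rate, count each group, biggest first.
counts = conn.execute(
    """
    SELECT rental_rate, COUNT(rental_rate) AS n
    FROM film
    WHERE rental_rate IS NOT NULL
    GROUP BY rental_rate
    ORDER BY n DESC, rental_rate ASC
    """
).fetchall()

# Step 4, tie-aware: keep every rate whose count equals the highest count.
highest = counts[0][1]
modes = [rate for rate, n in counts if n == highest]

print(counts)
print(modes)
```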


## Grouping records

Consider the following problem:

> "Find the number of films with each rental duration available in the database." 


We could solve this manually, but it would be a real pain. First, we would need to get a list of the rental durations (we saw how to do this earlier using `DISTINCT`), and then write a query using a `COUNT()` aggregate with a `WHERE` clause specifying each duration in turn, something like:


In [56]:
%%sql 

SELECT COUNT(film_id) AS number_films
FROM film 
WHERE rental_duration = 3

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


number_films
203


and so on many more times for each rental duration. Annoying, and not very general code: we need to know all rental durations before we start! Thankfully, SQL gives us the `GROUP BY` operator to automate this!


## `GROUP BY`

In [57]:
%%sql

SELECT rental_duration, COUNT(film_id) AS number_films
FROM film
GROUP BY rental_duration

 * postgresql://postgres:***@localhost/dvdrental
5 rows affected.


rental_duration,number_films
4,203
6,212
7,191
3,203
5,191


Yay, this looks more useful! Note what SQL has done here: it first **groups** records by a specified column (`rental_duration` in this case), and then applies the aggregate function (`COUNT()`) to **each group**.
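
To make the "group, then aggregate each group" idea concrete, the following sketch compares the manual one-query-per-value approach with a single `GROUP BY`. It uses Python's `sqlite3` module and a made-up six-row table, so the counts are illustrative only.

```python
import sqlite3

# Hypothetical films with three different rental durations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER, rental_duration INTEGER)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [(1, 3), (2, 3), (3, 5), (4, 7), (5, 7), (6, 7)],
)

# Manual approach: one COUNT query per duration we already know about.
manual = {}
for duration in (3, 5, 7):
    (n,) = conn.execute(
        "SELECT COUNT(film_id) FROM film WHERE rental_duration = ?",
        (duration,),
    ).fetchone()
    manual[duration] = n

# GROUP BY: a single query, with no need to know the durations in advance.
grouped = dict(conn.execute(
    "SELECT rental_duration, COUNT(film_id) FROM film GROUP BY rental_duration"
))

print(manual)
print(grouped)
```

Both routes give the same counts, but only the `GROUP BY` version keeps working when a new rental duration appears in the table.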

Let's see a few more examples of grouping in queries:


> How many films are there at each rental rate?

In [58]:
%%sql

SELECT rental_rate, COUNT(film_id) AS number_ratings
FROM film
GROUP BY rental_rate

 * postgresql://postgres:***@localhost/dvdrental
3 rows affected.


rental_rate,number_ratings
2.99,323
4.99,336
0.99,341


> "How many actors share each first name beginning with 'E'?" 


In [59]:
%%sql

SELECT first_name, COUNT(actor_id) AS number_actors
FROM actor
WHERE first_name LIKE 'E%'
GROUP BY first_name

 * postgresql://postgres:***@localhost/dvdrental
5 rows affected.


first_name,number_actors
Ed,3
Elvis,1
Emily,1
Ellen,1
Ewan,1


You'll see here that we are counting an `id` column (`actor_id`). Earlier we used `COUNT(*)`, which counts all rows returned. The two differ only if there are `NULL`s in the column being counted, because `COUNT(column)` skips them.

Let's look at the difference using the staff `picture` column, which contains `NULL`s:


In [60]:
%%sql

SELECT COUNT(picture)
FROM staff

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


count
1


In [61]:
%%sql

SELECT COUNT(*)
FROM staff

 * postgresql://postgres:***@localhost/dvdrental
1 rows affected.


count
2


The numbers differ because there is one `NULL` entry in the `picture` column, so `COUNT(picture)` returns only the number of non-null `picture` entries. 
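
Both counts can also be requested in one query, and subtracting them gives the number of `NULL`s directly. A minimal sketch with Python's `sqlite3` module and a hypothetical two-row `staff` table (the picture bytes are a placeholder, not real image data):

```python
import sqlite3

# Hypothetical staff table: one member has a picture, the other has NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (staff_id INTEGER, picture BLOB)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [(1, b"\x89PNG"), (2, None)],
)

# COUNT(*) counts rows; COUNT(picture) counts only non-NULL picture values.
total, with_picture = conn.execute(
    "SELECT COUNT(*), COUNT(picture) FROM staff"
).fetchone()
missing = total - with_picture

print(total, with_picture, missing)
```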


## `HAVING`

So far we've seen that the `WHERE` clause lets us filter **records**, but what if we wish to filter **groups** by some value of an aggregate function? This is where the `HAVING` operator comes in!

Imagine one of our earlier queries had been even more specific:


> For films with a rental rate of 2.99 or 4.99, show the number of films for each rental duration, keeping only the durations that have more than 130 such films

In [67]:
%%sql

SELECT rental_duration, COUNT(film_id) AS number_films
FROM film
WHERE rental_rate IN (2.99, 4.99) 
GROUP BY rental_duration
HAVING COUNT(film_id) > 130


 * postgresql://postgres:***@localhost/dvdrental
4 rows affected.


rental_duration,number_films
4,131
6,136
7,132
5,135


We've added a `HAVING` clause **after** the `GROUP BY`. Notice that it filters using an **aggregate function** of a column of the original data, which a `WHERE` clause cannot do because `WHERE` is applied before the rows are grouped.
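
The split between row-level and group-level filtering can be tried out on a few rows. The sketch below uses Python's `sqlite3` module with an invented eight-film table and a threshold of 2 instead of 130. It also attempts to put the aggregate in `WHERE`, which SQLite rejects with an error.

```python
import sqlite3

# Hypothetical films; rates and durations are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE film (film_id INTEGER, rental_rate REAL, rental_duration INTEGER)"
)
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [(1, 0.99, 3), (2, 2.99, 3), (3, 4.99, 3), (4, 2.99, 5),
     (5, 0.99, 5), (6, 4.99, 7), (7, 2.99, 7), (8, 4.99, 7)],
)

rows = conn.execute(
    """
    SELECT rental_duration, COUNT(film_id) AS number_films
    FROM film
    WHERE rental_rate IN (2.99, 4.99)   -- row filter, applied before grouping
    GROUP BY rental_duration
    HAVING COUNT(film_id) >= 2          -- group filter, applied after counting
    ORDER BY rental_duration
    """
).fetchall()

# An aggregate inside WHERE is not allowed: the groups do not exist yet.
try:
    conn.execute(
        "SELECT rental_duration FROM film "
        "WHERE COUNT(film_id) >= 2 GROUP BY rental_duration"
    )
    where_accepts_aggregate = True
except sqlite3.OperationalError:
    where_accepts_aggregate = False

print(rows)
print(where_accepts_aggregate)
```

The 0.99 films are removed by `WHERE` before any counting happens, and the duration-5 group is then dropped by `HAVING` because only one qualifying film remains in it.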


# Recap - full query syntax

Now that we've completed our discussion of query syntax, here are **all** the different components of a `SELECT` query, the order in which they must appear, and whether they are required or optional:

| Order | Keyword | Specifies | Required? |
| --- | --- |--- | --- |
| 1 | SELECT | Columns to query | Yes |
| 2 | AS | Column alias | No |
| 3 | FROM | Table to query |  Yes |
| 4 | WHERE | Row-level filter | No |
| 5 | GROUP BY| Grouping for aggregates | No |
| 6 | HAVING | Group-level filter | No |
| 7 | ORDER BY | Sort order | No |
| 8 | LIMIT | How many records to return | No |

## Useful mnemonic for order of SQL

| Keyword | Mnemonic |
| --- | --- |
| SELECT | So |
| FROM | Few |
| WHERE | Workers |
| GROUP BY | Go |
| HAVING | Home |
| ORDER BY | On Time |

