# 4. Hard SQL questions

In this section, we work through some hard SQL questions to improve our query-writing skills.

## Configure the SQL connection
Make sure your database server is up and running.


In [1]:
%load_ext sql
%config SqlMagic.autocommit=False
%config SqlMagic.autolimit=20
%config SqlMagic.displaylimit=20
%sql postgresql://pliu:northwind@127.0.0.1:5432/northwind

## 4.1 Question 1 High-value customers

We want to send all of our high-value customers a special VIP gift. We define high-value customers as those
who have made at least one order with a total value (not including the discount) of $10,000 or more. We
only want to consider orders made in the year 1997.


Your result rows should look like:

```text
 customer_id |        company_name        | order_id | total_order_amount 
-------------+----------------------------+----------+--------------------
 QUICK       | QUICK-Stop                 |    10691 |           10164.80
 QUICK       | QUICK-Stop                 |    10540 |           10191.70
 RATTC       | Rattlesnake Canyon Grocery |    10479 |           10495.60
 QUICK       | QUICK-Stop                 |    10515 |           10588.50
 SIMOB       | Simons bistro              |    10417 |           11283.20
 MEREP       | Mère Paillarde             |    10424 |           11493.20

```

### Hint

First, let's get the necessary fields for all orders made in the year 1997. Don't bother grouping yet, just work on
the where clause. You'll need the **customer_id and company_name from customers; order_id from orders; and quantity and unit_price from order_details**. Order by the total amount of the order, in ascending order, to match the result above.
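
The first step described above can be sketched as follows. This is only an illustration against the same Northwind tables used in this section. The half-open date range is an alternative to extract(year from ...) and assumes that order_date is stored as a date; the line_amount alias is a name chosen here for illustration.

```sql
-- Step 1 sketch: one row per line item of every 1997 order, no grouping yet.
-- Assumes order_date is a date column (question 10 shows how to check column types).
select o.customer_id,
       c.company_name,
       o.order_id,
       od.unit_price,
       od.quantity,
       od.unit_price * od.quantity as line_amount  -- discount ignored on purpose
from orders o
inner join order_details od on o.order_id = od.order_id
inner join customers c on o.customer_id = c.customer_id
where o.order_date >= date '1997-01-01'
  and o.order_date <  date '1998-01-01'
order by o.order_id;
```

Once these rows look right, grouping by customer and order and summing line_amount gives the total per order.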



In [24]:
%%sql

select o.customer_id, c.company_name, o.order_id, 
round(cast(sum(od.unit_price*od.quantity) as numeric),2) as total_order_amount
from orders o
inner join order_details od
on o.order_id=od.order_id
inner join customers c
on o.customer_id=c.customer_id
where (extract(year from o.order_date)=1997)
group by o.customer_id, c.company_name, o.order_id
having sum(od.unit_price*od.quantity) >= 10000
order by total_order_amount;

 * postgresql://pliu:***@127.0.0.1:5432/northwind
6 rows affected.


customer_id,company_name,order_id,total_order_amount
QUICK,QUICK-Stop,10691,10164.8
QUICK,QUICK-Stop,10540,10191.7
RATTC,Rattlesnake Canyon Grocery,10479,10495.6
QUICK,QUICK-Stop,10515,10588.5
SIMOB,Simons bistro,10417,11283.2
MEREP,Mère Paillarde,10424,11493.2


## 4.2 Question 2 High-value customers - total orders

The manager has changed his mind. Instead of requiring that customers have at least one individual order totaling
$10,000 or more, he wants to define high-value customers as those whose orders total $15,000 or more in 1997.

How would you change the answer to the problem above? Sort the result by total_order_amount in descending order.


Your result rows should look like:

```text
 customer_id |         company_name         | total_order_amount 
-------------+------------------------------+--------------------
 QUICK       | QUICK-Stop                   |           64238.00
 SAVEA       | Save-a-lot Markets           |           60672.64
 ERNSH       | Ernst Handel                 |           53467.38
 MEREP       | Mère Paillarde               |           26087.10
 HUNGO       | Hungry Owl All-Night Grocers |           23959.05
 RATTC       | Rattlesnake Canyon Grocery   |           19658.70
 SIMOB       | Simons bistro                |           17482.15


```

### Hint

This query is almost identical to the one above, but there are just a few lines you need to delete or comment
out to group at a different level.

In [25]:
%%sql

select o.customer_id, c.company_name, 
round(cast(sum(od.unit_price*od.quantity) as numeric),2) as total_order_amount
from orders o
inner join order_details od
on o.order_id=od.order_id
inner join customers c
on o.customer_id=c.customer_id
where (extract(year from o.order_date)=1997)
group by o.customer_id, c.company_name
having sum(od.unit_price*od.quantity) >= 15000
order by total_order_amount desc;

 * postgresql://pliu:***@127.0.0.1:5432/northwind
7 rows affected.


customer_id,company_name,total_order_amount
QUICK,QUICK-Stop,64238.0
SAVEA,Save-a-lot Markets,60672.64
ERNSH,Ernst Handel,53467.38
MEREP,Mère Paillarde,26087.1
HUNGO,Hungry Owl All-Night Grocers,23959.05
RATTC,Rattlesnake Canyon Grocery,19658.7
SIMOB,Simons bistro,17482.15


## 4.3 Question 3 High-value customers - with discount

Change the above query to use the discount when calculating high-value customers. Order by the total amount which includes the discount.


Your result rows should look like:

```text
 customer_id |         company_name         | total_order_amount_without_discount | total_order_amount_with_discount 
-------------+------------------------------+-------------------------------------+----------------------------------
 QUICK       | QUICK-Stop                   |                            64238.00 |                         61109.91
 SAVEA       | Save-a-lot Markets           |                            60672.64 |                         57713.57
 ERNSH       | Ernst Handel                 |                            53467.38 |                         48096.26
 MEREP       | Mère Paillarde               |                            26087.10 |                         23332.31
 HUNGO       | Hungry Owl All-Night Grocers |                            23959.05 |                         20454.40
 RATTC       | Rattlesnake Canyon Grocery   |                            19658.70 |                         19383.75
 SIMOB       | Simons bistro                |                            17482.15 |                         16232.41


```

### Hint

To start out, just use the order_details table and look at how the **discount column** is structured.
Then include the discount in the total order amount calculation.
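
The discount values that appear in this section (0, 0.05, 0.15, 0.25) suggest that the column holds a fraction of the line amount rather than a currency amount. Under that assumption, the discounted amount of a line is unit_price * quantity * (1 - discount). Here is a minimal sketch on inline rows copied from order 10263 (shown in question 8), so it runs without touching any table. The CTE name sample_lines and the output aliases are made up for this illustration.

```sql
-- Inline copy of the four line items of order 10263; no real table is read.
with sample_lines (product_id, unit_price, quantity, discount) as (
    values (24,  3.6, 28, 0.00),
           (74,  8.0, 36, 0.25),
           (30, 20.7, 60, 0.25),
           (16, 13.9, 60, 0.25)
)
select product_id,
       unit_price * quantity                  as amount_without_discount,
       unit_price * quantity * (1 - discount) as amount_with_discount  -- discount is a fraction
from sample_lines;
```

For the line with no discount, both amounts should be equal; for the others, the discounted amount should be 75% of the undiscounted one.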

In [29]:
%%sql


select o.customer_id, c.company_name,
round(cast(sum(od.unit_price*od.quantity) as numeric),2) as total_order_amount_without_discount,
round(cast(sum((od.unit_price*od.quantity)*(1-discount)) as numeric),2) as total_order_amount_with_discount
from orders o
inner join order_details od
on o.order_id=od.order_id
inner join customers c
on o.customer_id=c.customer_id
where (extract(year from o.order_date)=1997)
group by o.customer_id, c.company_name
having (sum((od.unit_price*od.quantity) * (1-od.discount))) >= 15000
order by total_order_amount_with_discount desc;

 * postgresql://pliu:***@127.0.0.1:5432/northwind
7 rows affected.


customer_id,company_name,total_order_amount_without_discount,total_order_amount_with_discount
QUICK,QUICK-Stop,64238.0,61109.91
SAVEA,Save-a-lot Markets,60672.64,57713.57
ERNSH,Ernst Handel,53467.38,48096.26
MEREP,Mère Paillarde,26087.1,23332.31
HUNGO,Hungry Owl All-Night Grocers,23959.05,20454.4
RATTC,Rattlesnake Canyon Grocery,19658.7,19383.75
SIMOB,Simons bistro,17482.15,16232.41


## 4.4 Question 4 Month-end orders

At the end of the month, salespeople are likely to try much harder to get orders, to meet their month-end quotas. Show all orders made on the last day of the month. Order by employee_id and order_id.


Your result rows should look like:

```text
 employee_id | order_id | order_date 
-------------+----------+------------
           1 |    10461 | 1997-02-28
           1 |    10616 | 1997-07-31
           2 |    10583 | 1997-06-30
           2 |    10686 | 1997-09-30
           2 |    10989 | 1998-03-31
           2 |    11060 | 1998-04-30
           3 |    10432 | 1997-01-31
           3 |    10806 | 1997-12-31

```

### Hint

Some database servers ship a built-in function that returns the last day of the month for a given date, for example EOMONTH(date) in **MS SQL Server** and LAST_DAY(date) in **MySQL** and **Oracle**. **PostgreSQL** has no such built-in function, but we can define our own. Below is an example of how to define such a function in PostgreSQL.

```sql
-- The last_day function takes a date as input and returns a new date,
-- which is the last day of the month of the input date

CREATE OR REPLACE FUNCTION last_day(date)
RETURNS date AS
$$
  SELECT (date_trunc('MONTH', $1) + INTERVAL '1 MONTH - 1 day')::date;
$$ LANGUAGE 'sql' IMMUTABLE STRICT;

```

Use the above function in your filter to get the orders.
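
To see what the expression inside last_day computes, the sketch below applies the same arithmetic to a few inline dates, so it runs even before the function is created. The dates and the aliases input_date and month_end are chosen here for illustration.

```sql
-- Same expression as in last_day: go to the first day of the month,
-- add one month, then step back one day.
select d as input_date,
       (date_trunc('month', d) + interval '1 month - 1 day')::date as month_end
from (values (date '1996-02-10'),   -- February of a leap year
             (date '1997-02-10'),   -- February of a regular year
             (date '1997-12-31')    -- already a month end
     ) as t(d);
```

The leap-year row should come back as 1996-02-29, the regular February as 1997-02-28, and the last row unchanged, which is exactly the property the filter order_date = last_day(order_date) relies on.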

In [31]:
%%sql
-- The last_day function takes a date as input and returns a new date,
-- which is the last day of the month of the input date

CREATE OR REPLACE FUNCTION last_day(date)
RETURNS date AS
$$
  SELECT (date_trunc('MONTH', $1) + INTERVAL '1 MONTH - 1 day')::date;
$$ LANGUAGE 'sql' IMMUTABLE STRICT;

 * postgresql://pliu:***@127.0.0.1:5432/northwind
Done.


[]

In [34]:
%%sql

select employee_id, order_id, order_date 
from orders
where order_date=last_day(order_date)
order by employee_id, order_id 

 * postgresql://pliu:***@127.0.0.1:5432/northwind
26 rows affected.


employee_id,order_id,order_date
1,10461,1997-02-28
1,10616,1997-07-31
2,10583,1997-06-30
2,10686,1997-09-30
2,10989,1998-03-31
2,11060,1998-04-30
3,10432,1997-01-31
3,10806,1997-12-31
3,10988,1998-03-31
3,11063,1998-04-30


## 4.5 Question 5 Orders with many line items

The Northwind mobile app developers are testing an app that customers will use to show orders. In order to make
sure that even the largest orders will show up correctly on the app, they'd like some samples of orders that have lots of individual line items. Show the 10 orders with the most line items, in order of total line items.


Your result rows should look like:

```text
 order_id | total_order_details 
----------+---------------------
    11077 |                  25
    10979 |                   6
    10657 |                   6
    10847 |                   6
    10360 |                   5
    10893 |                   5
    10553 |                   5
    10294 |                   5
    10514 |                   5
    11064 |                   5

```

### Hint

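Use group by with the aggregate function count. Because several orders tie at 5 line items, which of them make it into the top 10 is not deterministic unless you add a tie-breaker such as order_id to the order by.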

In [38]:
%%sql

select order_id, count(order_id) as total_order_details
from order_details
group by order_id
order by total_order_details desc
limit 10;

 * postgresql://pliu:***@127.0.0.1:5432/northwind
10 rows affected.


order_id,total_order_details
11077,25
10979,6
10657,6
10847,6
10360,5
10893,5
10553,5
10294,5
10514,5
11064,5


## 4.6 Question 6 Orders - random assortment

The Northwind mobile app developers would now like to just get a random assortment of orders for beta testing on
their app. Show a random set of 2% of all orders.



### Hint
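
One possible approach, offered as a sketch rather than the intended answer, is to shuffle the rows with random() and keep 2% of them. The CTE sample_orders is a hypothetical stand-in built with generate_series so that the block runs on its own; against the real database you would select from orders instead.

```sql
-- Hypothetical stand-in for the orders table: 830 consecutive order ids.
with sample_orders as (
    select generate_series(10248, 11077) as order_id
)
select order_id
from sample_orders
order by random()                                                 -- shuffle the rows
limit (select ceil(count(*) * 0.02)::bigint from sample_orders);  -- keep 2%, rounded up
```

Each run returns a different set of rows. On a real table, PostgreSQL 9.5 and later also accepts tablesample bernoulli (2) after the table name, which keeps each row with a probability of 2%, so the number of rows returned varies from run to run.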

In [None]:
%%sql

## 4.7 Question 7 Orders - accidental double-entry

Janet Leverling, one of the salespeople, has come to you with a request. She thinks that she accidentally double-entered
a line item on an order, with a **different product_id, but the same quantity**. She remembers that the
quantity was 60 or more.

Show all the order_ids with line items that match the above condition, in order of order_id.


Your result rows should look like:

```text
 order_id 
----------
    10263
    10658
    10990
    11030


```

### Hint

You might start out with something like this: 
```sql
select order_id, product_id, quantity
from order_details where quantity>=60

```
However, this will only give us the orders where at least one order detail has a quantity of 60 or more. We need to
show all orders with more than one order detail with a quantity of 60 or more. Also, the same value for quantity
needs to be there more than once.

Try using group by, then filter the aggregate with a having clause.
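
To see why grouping by both order_id and quantity helps, the sketch below counts line items per (order_id, quantity) pair on inline rows. The rows for orders 10263 and 10658 are copied from the results shown in question 8; order 99999 is a hypothetical order added here to show a case that should not match. The CTE name sample_details and the alias lines_with_this_quantity are also made up for this illustration.

```sql
with sample_details (order_id, product_id, quantity) as (
    values (10263, 24, 28), (10263, 74, 36), (10263, 30, 60), (10263, 16, 60),
           (10658, 60, 55), (10658, 21, 60), (10658, 40, 70), (10658, 77, 70),
           (99999,  1, 60), (99999,  2, 70)   -- hypothetical order, not in Northwind
)
select order_id,
       quantity,
       count(*) as lines_with_this_quantity   -- line items sharing this quantity
from sample_details
where quantity >= 60
group by order_id, quantity
order by order_id, quantity;
```

Only the groups (10263, 60) and (10658, 70) have a count above 1. Order 99999 has two line items of 60 or more, but with different quantities, so each of its groups has a count of 1 and a having filter on the count would drop it.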

In [6]:
%%sql

select order_id
from order_details
where quantity>=60
group by order_id,quantity
having count(product_id)>1
order by order_id;

 * postgresql://pliu:***@127.0.0.1:5432/northwind
4 rows affected.


order_id
10263
10658
10990
11030


## 4.8 Question 8 Orders - accidental double-entry details

Based on the previous question, we now want to show all columns of details of the order, for orders that match the above criteria.


Your result rows should look like:

```text
 order_id | product_id | unit_price | quantity | discount 
----------+------------+------------+----------+----------
    10263 |         24 |        3.6 |       28 |        0
    10263 |         74 |          8 |       36 |     0.25
    10263 |         30 |       20.7 |       60 |     0.25
    10263 |         16 |       13.9 |       60 |     0.25
    10658 |         60 |         34 |       55 |     0.05
    10658 |         21 |         10 |       60 |        0
    10658 |         40 |       18.4 |       70 |     0.05
    10658 |         77 |         13 |       70 |     0.05
    10990 |         34 |         14 |       60 |     0.15
    10990 |         21 |         10 |       65 |        0
    10990 |         55 |         24 |       65 |     0.15
    10990 |         61 |       28.5 |       66 |     0.15


```

You should have **16 rows in total**

### Hint

There are many ways of doing this, including CTE (common table expression) and derived tables. I suggest
using a CTE and a subquery. Here's a good article on CTEs (https://technet.microsoft.com/en-us/library/ms175972.aspx).

After building the CTE, you can use either a join or a filter. The example below uses an inner join. Change it to use a filter and check that you get the same result.

```sql
with target_order_ids as (select order_id
from order_details
where quantity>=60
group by order_id,quantity
having count(product_id)>1
order by order_id)

select od.order_id, product_id, unit_price, quantity, discount
from order_details od
inner join target_order_ids t
on od.order_id=t.order_id
order by od.order_id, quantity;
```

In [18]:
%%sql

with target_order_ids as (select order_id
from order_details
where quantity>=60
group by order_id,quantity
having count(product_id)>1
order by order_id)

select order_id, product_id, unit_price, quantity, discount
from order_details
where order_id in (select order_id from target_order_ids)
order by order_id, quantity;

 * postgresql://pliu:***@127.0.0.1:5432/northwind
16 rows affected.


order_id,product_id,unit_price,quantity,discount
10263,24,3.6,28,0.0
10263,74,8.0,36,0.25
10263,30,20.7,60,0.25
10263,16,13.9,60,0.25
10658,60,34.0,55,0.05
10658,21,10.0,60,0.0
10658,40,18.4,70,0.05
10658,77,13.0,70,0.05
10990,34,14.0,60,0.15
10990,21,10.0,65,0.0


## 4.9 Question 9 Orders - accidental double-entry, get details with derived table

Here's another way of getting the same results as in the previous problem, using a **derived table instead of a CTE**. 


Your result rows should look like:

```text
 order_id | product_id | unit_price | quantity | discount 
----------+------------+------------+----------+----------
    10263 |         24 |        3.6 |       28 |        0
    10263 |         74 |          8 |       36 |     0.25
    10263 |         30 |       20.7 |       60 |     0.25
    10263 |         16 |       13.9 |       60 |     0.25
    10658 |         60 |         34 |       55 |     0.05
    10658 |         21 |         10 |       60 |        0
    10658 |         40 |       18.4 |       70 |     0.05
    10658 |         77 |         13 |       70 |     0.05

```

### Hint

A join can take the name of a table, or a select statement that returns a new table (a derived table).

In [19]:
%%sql

select od.order_id, product_id, unit_price, quantity, discount
from order_details od
join (
     select order_id
     from order_details
     where quantity >= 60
     group by order_id, quantity
     having count(product_id) > 1
     ) target_order_ids
on target_order_ids.order_id = od.order_id
order by order_id, quantity;

 * postgresql://pliu:***@127.0.0.1:5432/northwind
16 rows affected.


order_id,product_id,unit_price,quantity,discount
10263,24,3.6,28,0.0
10263,74,8.0,36,0.25
10263,30,20.7,60,0.25
10263,16,13.9,60,0.25
10658,60,34.0,55,0.05
10658,21,10.0,60,0.0
10658,40,18.4,70,0.05
10658,77,13.0,70,0.05
10990,34,14.0,60,0.15
10990,21,10.0,65,0.0


## 4.10 Question 10 Late orders

Some customers are complaining about their orders arriving late. Which orders are late?


Your result rows should look like:

```text
 order_id | order_date | required_date | shipped_date 
----------+------------+---------------+--------------
    10264 | 1996-07-24 | 1996-08-21    | 1996-08-23
    10271 | 1996-08-01 | 1996-08-29    | 1996-08-30
    10280 | 1996-08-14 | 1996-09-11    | 1996-09-12

```
You should have 40 rows in total.

### Hint

We consider an order late when shipped_date >= required_date.
Note: if required_date or shipped_date is stored as a string type, >= compares the values as text rather than as dates, which can give wrong results. In that case, convert the strings to the date type before comparing.
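
The difference between comparing text and comparing dates can be sketched with two literal values. The sample values and the MM/DD/YYYY format are assumptions chosen for illustration; they are not taken from the Northwind data.

```sql
-- 5 September 1996 versus 1 October 1996, written as MM/DD/YYYY strings.
select '9/5/1996'::text >= '10/1/1996'::text           as text_comparison,  -- compares characters
       to_date('9/5/1996', 'MM/DD/YYYY')
           >= to_date('10/1/1996', 'MM/DD/YYYY')        as date_comparison;  -- compares calendar dates
```

The text comparison should come back true, because the character 9 sorts after the character 1, while the date comparison should come back false, because September comes before October.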


The information_schema.columns catalog contains information on the columns of all tables. To check the column types of a table, query this catalog. For example:

```sql
SELECT 
   table_name, 
   column_name, 
   data_type 
FROM 
   information_schema.columns
WHERE 
   table_name = 'orders';
```



In [21]:
%%sql

select order_id, order_date, required_date, shipped_date
from orders
where shipped_date>=required_date

 * postgresql://pliu:***@127.0.0.1:5432/northwind
40 rows affected.


order_id,order_date,required_date,shipped_date
10264,1996-07-24,1996-08-21,1996-08-23
10271,1996-08-01,1996-08-29,1996-08-30
10280,1996-08-14,1996-09-11,1996-09-12
10302,1996-09-10,1996-10-08,1996-10-09
10309,1996-09-19,1996-10-17,1996-10-23
10320,1996-10-03,1996-10-17,1996-10-18
10380,1996-12-12,1997-01-09,1997-01-16
10423,1997-01-23,1997-02-06,1997-02-24
10427,1997-01-27,1997-02-24,1997-03-03
10433,1997-02-03,1997-03-03,1997-03-04


## 4.11 Question 11 Late orders - which employees?

Some salespeople have more orders arriving late than others. Maybe they're not following up on the order
process, and need more training. Which salespeople have the most orders arriving late?


Your result rows should look like:

```text
 employee_id | last_name | total_late_orders 
-------------+-----------+-------------------
           4 | Peacock   |                10
           8 | Callahan  |                 5
           9 | Dodsworth |                 5
           3 | Leverling |                 5
           2 | Fuller    |                 4
           7 | King      |                 4

```
You should have 9 rows in total

### Hint

There are many solutions. One is to use a common table expression that aggregates the late orders first, which makes the join smaller.

```sql
with late_orders as (
    select employee_id, count(order_id) as total_late_orders 
    from 
        orders
    where 
        shipped_date>=required_date
    group by employee_id
                    )

select e.employee_id, e.last_name,total_late_orders
from employees e
join late_orders l
on e.employee_id=l.employee_id
order by total_late_orders desc;
```

Or you can join the two tables first, then use group by to get the answer.

In [26]:
%%sql

select e.employee_id, e.last_name,count(o.order_id) as total_late_orders 
from employees e
join orders o
on e.employee_id=o.employee_id
where shipped_date>=required_date
group by e.employee_id, e.last_name
order by total_late_orders desc;




 * postgresql://pliu:***@127.0.0.1:5432/northwind
9 rows affected.


employee_id,last_name,total_late_orders
4,Peacock,10
8,Callahan,5
9,Dodsworth,5
3,Leverling,5
7,King,4
2,Fuller,4
6,Suyama,3
1,Davolio,3
5,Buchanan,1


## 4.12 Question 12 Late orders vs. total orders

Andrew, the VP of sales, has been doing some more thinking about the problem of late orders. He
realizes that just looking at the number of orders arriving late for each salesperson isn't a good idea. It needs to be compared against the total number of orders per salesperson. Order the result by employee_id.


Your result rows should look like:

```text
 employee_id | last_name | all_orders | late_orders 
-------------+-----------+------------+-------------
           1 | Davolio   |        123 |           3
           2 | Fuller    |         96 |           4
           3 | Leverling |        127 |           5
           4 | Peacock   |        156 |          10
           5 | Buchanan  |         42 |           1
           6 | Suyama    |         67 |           3
           7 | King      |         72 |           4
           8 | Callahan  |        104 |           5
           9 | Dodsworth |         43 |           5

```

### Hint

If you have multiple CTEs, separate them with a comma (**,**). A SQL query can only have one **with** keyword.
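
The comma-separated form can be sketched on a few inline rows. The employee names are taken from this section, but the orders in sample_orders are invented for illustration, and the CTE names sample_employees and sample_orders are hypothetical. The sketch also uses a left join with coalesce, a design choice made here so that an employee with no late orders still appears with a count of 0.

```sql
with sample_employees (employee_id, last_name) as (
    values (1, 'Davolio'), (2, 'Fuller'), (3, 'Leverling')
),
sample_orders (order_id, employee_id, required_date, shipped_date) as (
    -- invented rows, not Northwind data
    values (1, 1, date '1997-01-10', date '1997-01-12'),  -- shipped late
           (2, 1, date '1997-02-10', date '1997-02-01'),
           (3, 2, date '1997-03-10', date '1997-03-05'),
           (4, 3, date '1997-04-10', null)                -- not shipped yet
),
all_orders as (
    select employee_id, count(*) as all_orders
    from sample_orders
    group by employee_id
),
late_orders as (
    select employee_id, count(*) as late_orders
    from sample_orders
    where shipped_date >= required_date
    group by employee_id
)
select e.employee_id,
       e.last_name,
       a.all_orders,
       coalesce(l.late_orders, 0) as late_orders  -- 0 when no late order exists
from sample_employees e
join all_orders a on a.employee_id = e.employee_id
left join late_orders l on l.employee_id = e.employee_id
order by e.employee_id;
```

With an inner join on late_orders, Fuller and Leverling would disappear from this sample result because neither has a late order.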

In [33]:
%%sql


with all_orders as (
    select employee_id, count(order_id) as all_orders 
    from 
        orders
    group by employee_id                   
),

late_orders as (
   select e.employee_id, e.last_name,count(o.order_id) as late_orders 
   from employees e
   join orders o
   on e.employee_id=o.employee_id
   where shipped_date>=required_date
   group by e.employee_id, e.last_name
   )

select l.employee_id, l.last_name, a.all_orders, l.late_orders
from  all_orders a
join late_orders l
on a.employee_id=l.employee_id
order by a.employee_id;



 * postgresql://pliu:***@127.0.0.1:5432/northwind
9 rows affected.


employee_id,last_name,all_orders,late_orders
1,Davolio,123,3
2,Fuller,96,4
3,Leverling,127,5
4,Peacock,156,10
5,Buchanan,42,1
6,Suyama,67,3
7,King,72,4
8,Callahan,104,5
9,Dodsworth,43,5


In [None]:
## 4.13 Question 13 



### Hint