# SQL for Data Analysis
<hr>

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#SQL-Subqueries-and-Temporary-Tables" data-toc-modified-id="SQL-Subqueries-and-Temporary-Tables-1">SQL Subqueries and Temporary Tables</a></span></li><li><span><a href="#Introduction" data-toc-modified-id="Introduction-2">Introduction</a></span></li><li><span><a href="#Subqueries" data-toc-modified-id="Subqueries-3">Subqueries</a></span><ul class="toc-item"><li><span><a href="#Subquery-Formatting" data-toc-modified-id="Subquery-Formatting-3.1">Subquery Formatting</a></span></li><li><span><a href="#More-on-Subqueries" data-toc-modified-id="More-on-Subqueries-3.2">More on Subqueries</a></span></li><li><span><a href="#Practice-Questions" data-toc-modified-id="Practice-Questions-3.3">Practice Questions</a></span></li></ul></li><li><span><a href="#Common-Table-Expressions" data-toc-modified-id="Common-Table-Expressions-4">Common Table Expressions</a></span><ul class="toc-item"><li><span><a href="#Practice-Questions" data-toc-modified-id="Practice-Questions-4.1">Practice Questions</a></span></li></ul></li></ul></div>

## SQL Subqueries and Temporary Tables

## Introduction

Sometimes the question you are trying to answer cannot be answered by querying the existing tables in a database directly.

Both <b>subqueries</b> and <b>table expressions</b> let you write a query that builds an intermediate table from the existing tables, and then write a second query that works with this newly created table.

## Subqueries

> Subqueries allow you to answer more complex questions than you can by querying a single database table directly.

**Example:**

Find the average number of events per day for each channel.

**Step 1:** Count all the events for each channel on each day.
```sql
SELECT DATE_TRUNC('day',occurred_at) AS day, 
       channel, 
       COUNT(*) AS event_count
FROM web_events
GROUP BY 1, 2
ORDER BY 1
```
**Step 2:** Use the query from Step 1 as a subquery in the FROM clause and average the event_count values it produces. A subquery in the FROM clause should be given an alias, written after the closing parenthesis (many databases require one). An * may be used in the outer SELECT to pull all of the columns from the subquery.
```sql
SELECT channel,
       AVG(event_count) AS avg_event_count
FROM
(SELECT DATE_TRUNC('day',occurred_at) AS day, 
       channel, 
       COUNT(*) AS event_count
FROM web_events
GROUP BY 1, 2
ORDER BY 1
) sub
GROUP BY 1
ORDER BY 2 DESC
```
**Step 3:** Because the outer query now sorts on the new aggregation, the ORDER BY clause in the subquery has no effect on the final result and can be removed.
```sql
SELECT channel,
       AVG(event_count) AS avg_event_count
FROM
(SELECT DATE_TRUNC('day',occurred_at) AS day, 
       channel, 
       COUNT(*) AS event_count
FROM web_events
GROUP BY 1, 2
) sub
GROUP BY 1
ORDER BY 2 DESC
```
<i>Note: The query in Step 1 is called the inner query and the query in Step 2 is the outer query. The outer query will run across the result set created by the inner query</i>.
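
The inner/outer pattern above can be tried end to end on a throwaway database. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module, a hypothetical six-row `web_events` table with invented rows, and SQLite's `date()` as a stand-in for `DATE_TRUNC('day', ...)`, which SQLite does not provide.

```python
import sqlite3

# Hypothetical miniature of web_events; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (occurred_at TEXT, channel TEXT)")
conn.executemany(
    "INSERT INTO web_events VALUES (?, ?)",
    [
        ("2016-01-01 09:00:00", "direct"),
        ("2016-01-01 10:30:00", "direct"),
        ("2016-01-01 11:00:00", "direct"),
        ("2016-01-02 08:15:00", "direct"),
        ("2016-01-01 12:00:00", "facebook"),
        ("2016-01-02 13:45:00", "facebook"),
    ],
)

# Assumption: date(occurred_at) replaces DATE_TRUNC('day', occurred_at) in SQLite.
query = """
SELECT channel,
       AVG(event_count) AS avg_event_count
FROM (SELECT date(occurred_at) AS day,
             channel,
             COUNT(*) AS event_count
      FROM web_events
      GROUP BY date(occurred_at), channel) sub
GROUP BY channel
ORDER BY avg_event_count DESC
"""

# The inner query yields one row per (day, channel); the outer query averages them.
avg_events = conn.execute(query).fetchall()
for channel, avg_event_count in avg_events:
    print(channel, avg_event_count)
```

With these rows, the inner query produces daily counts of 3 and 1 for direct and 1 and 1 for facebook, so the outer query averages over two rows per channel.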

### Subquery Formatting

>The important thing to remember when using subqueries is to provide some way for the reader to easily determine which parts of the query will be executed together. 

**Well Formatted Query**

**Example 1**
```sql
SELECT *
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
                channel, COUNT(*) as events
      FROM web_events 
      GROUP BY 1,2
      ORDER BY 3 DESC) sub;
```
**Example 2**
```sql
SELECT *
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
                channel, COUNT(*) as events
      FROM web_events 
      GROUP BY 1,2
      ORDER BY 3 DESC) sub
GROUP BY day, channel, events
ORDER BY 2 DESC;
```

### More on Subqueries

Subqueries can be used in several places within a query: anywhere you might use a table name, a column name, or an individual value. They are especially useful in conditional logic, in conjunction with WHERE and JOIN clauses, or in the WHEN portion of a CASE statement.

**Example:**

Return only orders that occurred in the same month as Parch & Posies' first order ever.

<i>**Note:** Code is broken into steps for ease of understanding</i>.

**Step 1:** Get the date of the first order
```sql
SELECT MIN(occurred_at) AS min
  FROM orders
```
**Step 2:** Add a DATE_TRUNC function to get the month
```sql
SELECT DATE_TRUNC('month', MIN(occurred_at)) AS min_month
  FROM orders
```
**Step 3:** Write an outer query that uses this to filter the orders table and sorts by the occurred_at column.
```sql
SELECT *
  FROM orders
 WHERE DATE_TRUNC('month', occurred_at) =
       (SELECT DATE_TRUNC('month', MIN(occurred_at)) AS min_month
          FROM orders)
 ORDER BY occurred_at
```
**Note:** This query works because the result of the subquery is only one cell. Most conditional logic (=, >, <, and so on) works with subqueries that return a one-cell result. When the inner query returns multiple rows, use IN (or a related operator such as ANY, ALL, or EXISTS) instead.

>**Expert Tip**

>Note that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or set of values in the IN case) rather than as a table.

>Also, notice that the query here compares against a single value. If the subquery returned an entire column, IN would be needed for the comparison. If it returned an entire table, the subquery would have to be placed in the FROM clause with an alias, and any further logic would be applied to that table.
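
To make the difference between a one-cell subquery and a multi-row subquery concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The four-row `orders` table is hypothetical, and `strftime('%Y-%m', ...)` is used as a stand-in for `DATE_TRUNC('month', ...)`, which SQLite lacks.

```python
import sqlite3

# Hypothetical miniature of the orders table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, account_id INTEGER, occurred_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10, "2013-12-04 04:22:44"),
        (2, 20, "2013-12-15 10:00:00"),
        (3, 10, "2014-01-03 08:30:00"),
        (4, 30, "2014-02-11 17:45:00"),
    ],
)

# One-cell subquery: it returns a single month, so = is enough and no alias is needed.
first_month_ids = [row[0] for row in conn.execute("""
SELECT id
  FROM orders
 WHERE strftime('%Y-%m', occurred_at) =
       (SELECT strftime('%Y-%m', MIN(occurred_at))
          FROM orders)
 ORDER BY occurred_at
""")]

# Multi-row subquery: it returns a column of account ids, so IN is used instead of =.
in_ids = [row[0] for row in conn.execute("""
SELECT id
  FROM orders
 WHERE account_id IN (SELECT account_id
                        FROM orders
                       WHERE occurred_at < '2014-01-01')
 ORDER BY id
""")]

print(first_month_ids, in_ids)
```

The first query keeps only the orders from the month of the earliest order. The second keeps every order from any account that ordered before 2014.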

### Practice Questions

<b>1. Provide the name of the sales_rep in each region with the largest amount of total_amt_usd sales.</b>

>First, find the total_amt_usd totals associated with each sales rep, and the region in which they are located.

```sql
SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC;
```
>Next, pull the max for each region and use the result in the final step.
```sql
SELECT region_name, MAX(total_amt) total_amt
     FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
             FROM sales_reps s
             JOIN accounts a
             ON a.sales_rep_id = s.id
             JOIN orders o
             ON o.account_id = a.id
             JOIN region r
             ON r.id = s.region_id
             GROUP BY 1, 2) t1
     GROUP BY 1;
```
>JOIN the two tables above where the region and amount match.
```sql
SELECT t3.rep_name, t3.region_name, t3.total_amt
FROM(SELECT region_name, MAX(total_amt) total_amt
     FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
             FROM sales_reps s
             JOIN accounts a
             ON a.sales_rep_id = s.id
             JOIN orders o
             ON o.account_id = a.id
             JOIN region r
             ON r.id = s.region_id
             GROUP BY 1, 2) t1
     GROUP BY 1) t2
JOIN (SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
     FROM sales_reps s
     JOIN accounts a
     ON a.sales_rep_id = s.id
     JOIN orders o
     ON o.account_id = a.id
     JOIN region r
     ON r.id = s.region_id
     GROUP BY 1,2
     ORDER BY 3 DESC) t3
ON t3.region_name = t2.region_name AND t3.total_amt = t2.total_amt;
```
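
The "join the per-group maximum back to the totals" step is easier to see on a tiny table. The sketch below assumes a hypothetical, pre-aggregated `rep_totals` table standing in for the rep-level subquery above. Its names and amounts are invented, and it runs on Python's built-in `sqlite3` module.

```python
import sqlite3

# Hypothetical stand-in for the rep-level totals subquery; values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rep_totals (rep_name TEXT, region_name TEXT, total_amt INTEGER)")
conn.executemany(
    "INSERT INTO rep_totals VALUES (?, ?, ?)",
    [
        ("Ann", "West", 500),
        ("Bob", "West", 300),
        ("Cy", "East", 200),
        ("Dee", "East", 450),
    ],
)

# t2 holds the maximum per region; joining on region and amount recovers the rep's name.
top_reps = conn.execute("""
SELECT t3.rep_name, t3.region_name, t3.total_amt
FROM (SELECT region_name, MAX(total_amt) AS total_amt
      FROM rep_totals
      GROUP BY region_name) t2
JOIN rep_totals t3
ON t3.region_name = t2.region_name AND t3.total_amt = t2.total_amt
ORDER BY t3.region_name
""").fetchall()

print(top_reps)
```

Because the join matches on the amount, two reps tied for the top total in a region would both be returned.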

<b>2. For the region with the largest sales total_amt_usd, how many total orders were placed?</b>

>First pull the total_amt_usd for each region.
```sql
SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name;
```
>Then pull the max amount from this table.
```sql
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
             FROM sales_reps s
             JOIN accounts a
             ON a.sales_rep_id = s.id
             JOIN orders o
             ON o.account_id = a.id
             JOIN region r
             ON r.id = s.region_id
             GROUP BY r.name) sub;
```
>Finally, pull the total orders for the region with this amount:
```sql
SELECT r.name, COUNT(o.total) total_orders
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (
      SELECT MAX(total_amt)
      FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
              FROM sales_reps s
              JOIN accounts a
              ON a.sales_rep_id = s.id
              JOIN orders o
              ON o.account_id = a.id
              JOIN region r
              ON r.id = s.region_id
              GROUP BY r.name) sub);
```

<b>3. How many accounts had more total purchases than the account name which has bought the most standard_qty paper throughout their lifetime as a customer?</b>

>First, find the account that bought the most standard_qty paper as well as the total amount:
```sql
SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
```
>Now use this to pull all the accounts with more total sales:
```sql
SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total 
                   FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std, SUM(o.total) total
                         FROM accounts a
                         JOIN orders o
                         ON o.account_id = a.id
                         GROUP BY 1
                         ORDER BY 2 DESC
                         LIMIT 1) sub);
```
>This is now a list of all the accounts with more total purchases. We can get the count with just another simple subquery:
```sql
SELECT COUNT(*)
FROM (SELECT a.name
       FROM orders o
       JOIN accounts a
       ON a.id = o.account_id
       GROUP BY 1
       HAVING SUM(o.total) > (SELECT total 
                   FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std, SUM(o.total) total
                         FROM accounts a
                         JOIN orders o
                         ON o.account_id = a.id
                         GROUP BY 1
                         ORDER BY 2 DESC
                         LIMIT 1) inner_tab)
             ) counter_tab;
```

<b>4. For the customer that spent the most (in total over their lifetime as a customer) total_amt_usd, how many web_events did they have for each channel?</b>

>First pull the customer with the most spent in lifetime value.
```sql
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1;
```
>Next, look at the number of events on each channel this company had, which we can match with just the id.
```sql
SELECT a.name, w.channel, COUNT(*)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id AND a.id =  (SELECT id
                     FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
                           FROM orders o
                           JOIN accounts a
                           ON a.id = o.account_id
                           GROUP BY a.id, a.name
                           ORDER BY 3 DESC
                           LIMIT 1) inner_table)
GROUP BY 1, 2
ORDER BY 3 DESC;
```

<b>5. What is the lifetime average amount spent in terms of total_amt_usd for the top 10 total spending accounts?</b>

>First, find the top 10 accounts in terms of highest total_amt_usd.
```sql
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10;
```
>Next find average of these 10 amounts.
```sql
SELECT AVG(tot_spent)
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
      FROM orders o
      JOIN accounts a
      ON a.id = o.account_id
      GROUP BY a.id, a.name
      ORDER BY 3 DESC
       LIMIT 10) temp;
```

<b>6. What is the lifetime average amount spent in terms of total_amt_usd, including only the companies that spent more per order, on average, than the average of all orders?</b>

>First, pull the average total_amt_usd across all orders.
```sql
SELECT AVG(o.total_amt_usd) avg_all
FROM orders o
```
>Then, pull the accounts whose average order amount is greater than this overall average.
```sql
SELECT o.account_id, AVG(o.total_amt_usd)
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
                               FROM orders o);
```
>Finally, find the average of these values.
```sql
SELECT AVG(avg_amt)
FROM (SELECT o.account_id, AVG(o.total_amt_usd) avg_amt
    FROM orders o
    GROUP BY 1
    HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
                                   FROM orders o)) temp_table;
```

## Common Table Expressions

>The WITH statement is often called a Common Table Expression, or CTE. CTEs serve the same purpose as subqueries, and they are often preferred in practice because naming each intermediate table up front tends to make the logic easier for a future reader to follow.

<b>Example:</b>

**Find the average number of events for each channel per day.**
>**Solution: using subquery**
```sql
SELECT channel, AVG(events) AS average_events
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
             channel, COUNT(*) as events
      FROM web_events 
      GROUP BY 1,2) sub
GROUP BY channel
ORDER BY 2 DESC;
```
>**Solution: using WITH statement**

+ First, pull the inner statement
```sql
SELECT DATE_TRUNC('day',occurred_at) AS day, 
       channel, COUNT(*) as events
FROM web_events 
GROUP BY 1,2
```
+ This is the part we put in the WITH statement. Notice, we are aliasing the table as events below:
```sql
WITH events AS (
          SELECT DATE_TRUNC('day',occurred_at) AS day, 
                        channel, COUNT(*) as events
          FROM web_events 
          GROUP BY 1,2)
```
+ We can then use this newly created events table as if it is any other table in our database:
```sql
WITH events AS (
          SELECT DATE_TRUNC('day',occurred_at) AS day, 
                        channel, COUNT(*) as events
          FROM web_events 
          GROUP BY 1,2)
SELECT channel, AVG(events) AS average_events
FROM events
GROUP BY channel
ORDER BY 2 DESC;
```
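
As a quick check that the two forms are interchangeable, the sketch below runs a subquery version and a WITH version against the same hypothetical `web_events` rows and compares the results. It uses Python's built-in `sqlite3` module, substitutes `date()` for `DATE_TRUNC('day', ...)`, and names the CTE `daily_events` (a name chosen here for illustration) so that it does not share a name with its `events` column.

```python
import sqlite3

# Hypothetical miniature of web_events; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE web_events (occurred_at TEXT, channel TEXT);
INSERT INTO web_events VALUES
    ('2016-03-01 09:00:00', 'organic'),
    ('2016-03-01 15:20:00', 'organic'),
    ('2016-03-02 11:05:00', 'organic'),
    ('2016-03-02 18:40:00', 'organic'),
    ('2016-03-01 10:10:00', 'twitter'),
    ('2016-03-03 12:00:00', 'twitter');
""")

# Subquery form: the intermediate table is defined inline in the FROM clause.
subquery_sql = """
SELECT channel, AVG(events) AS average_events
FROM (SELECT date(occurred_at) AS day,
             channel, COUNT(*) AS events
      FROM web_events
      GROUP BY date(occurred_at), channel) sub
GROUP BY channel
ORDER BY average_events DESC
"""

# CTE form: the same intermediate table is named first, then queried like any table.
cte_sql = """
WITH daily_events AS (
          SELECT date(occurred_at) AS day,
                 channel, COUNT(*) AS events
          FROM web_events
          GROUP BY date(occurred_at), channel)
SELECT channel, AVG(events) AS average_events
FROM daily_events
GROUP BY channel
ORDER BY average_events DESC
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(subquery_rows == cte_rows, cte_rows)
```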

> **We can create additional tables to pull from in the following way:**

```sql
WITH table1 AS (
          SELECT *
          FROM web_events),

     table2 AS (
          SELECT *
          FROM accounts)
SELECT *
FROM table1
JOIN table2
ON table1.account_id = table2.id;
```
**Note:**

+ When creating multiple tables using **WITH**, add a comma after each table definition except the last one, which is followed directly by your final query.

+ The new table name is always aliased using **table_name AS**, which is followed by your query nested between parentheses.
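
The two notes above can be seen together in a small runnable sketch. It assumes hypothetical two-row `accounts` and three-row `web_events` tables with invented data, and CTE names (`event_counts`, `named_accounts`) chosen purely for illustration. It runs on Python's built-in `sqlite3` module.

```python
import sqlite3

# Hypothetical miniature tables; the names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE web_events (account_id INTEGER, channel TEXT);
INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO web_events VALUES (1, 'direct'), (1, 'facebook'), (2, 'direct');
""")

# Two CTEs: a comma follows the first definition, none follows the last one.
sql = """
WITH event_counts AS (
          SELECT account_id, COUNT(*) AS n_events
          FROM web_events
          GROUP BY account_id),

     named_accounts AS (
          SELECT id, name
          FROM accounts)
SELECT named_accounts.name, event_counts.n_events
FROM event_counts
JOIN named_accounts
ON event_counts.account_id = named_accounts.id
ORDER BY named_accounts.name
"""

events_per_account = conn.execute(sql).fetchall()
print(events_per_account)
```

A later CTE may also select from an earlier one, which is how the practice solutions below chain t1 into t2.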

### Practice Questions

<b>1. Provide the name of the sales_rep in each region with the largest amount of total_amt_usd sales.</b>
```sql
WITH t1 AS (
  SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
   FROM sales_reps s
   JOIN accounts a
   ON a.sales_rep_id = s.id
   JOIN orders o
   ON o.account_id = a.id
   JOIN region r
   ON r.id = s.region_id
   GROUP BY 1,2
   ORDER BY 3 DESC), 
t2 AS (
   SELECT region_name, MAX(total_amt) total_amt
   FROM t1
   GROUP BY 1)
SELECT t1.rep_name, t1.region_name, t1.total_amt
FROM t1
JOIN t2
ON t1.region_name = t2.region_name AND t1.total_amt = t2.total_amt;
```

<b>2. For the region with the largest sales total_amt_usd, how many total orders were placed? </b>
```sql
WITH t1 AS (
   SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
   FROM sales_reps s
   JOIN accounts a
   ON a.sales_rep_id = s.id
   JOIN orders o
   ON o.account_id = a.id
   JOIN region r
   ON r.id = s.region_id
   GROUP BY r.name), 
t2 AS (
   SELECT MAX(total_amt)
   FROM t1)
SELECT r.name, COUNT(o.total) total_orders
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (SELECT * FROM t2);
```

<b>3. For the account that purchased the most (in total over their lifetime as a customer) standard_qty paper, how many accounts still had more in total purchases? </b>
```sql
WITH t1 AS (
  SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
  FROM accounts a
  JOIN orders o
  ON o.account_id = a.id
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 1), 
t2 AS (
  SELECT a.name
  FROM orders o
  JOIN accounts a
  ON a.id = o.account_id
  GROUP BY 1
  HAVING SUM(o.total) > (SELECT total FROM t1))
SELECT COUNT(*)
FROM t2;
```

<b>4. For the customer that spent the most (in total over their lifetime as a customer) total_amt_usd, how many web_events did they have for each channel?</b>
```sql
WITH t1 AS (
   SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
   FROM orders o
   JOIN accounts a
   ON a.id = o.account_id
   GROUP BY a.id, a.name
   ORDER BY 3 DESC
   LIMIT 1)
SELECT a.name, w.channel, COUNT(*)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id AND a.id =  (SELECT id FROM t1)
GROUP BY 1, 2
ORDER BY 3 DESC;
```

<b>5. What is the lifetime average amount spent in terms of total_amt_usd for the top 10 total spending accounts?</b>
```sql
WITH t1 AS (
   SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
   FROM orders o
   JOIN accounts a
   ON a.id = o.account_id
   GROUP BY a.id, a.name
   ORDER BY 3 DESC
   LIMIT 10)
SELECT AVG(tot_spent)
FROM t1;
```

<b>6. What is the lifetime average amount spent in terms of total_amt_usd, including only the companies that spent more per order, on average, than the average of all orders?</b>
```sql
WITH t1 AS (
   SELECT AVG(o.total_amt_usd) avg_all
   FROM orders o
   JOIN accounts a
   ON a.id = o.account_id),
t2 AS (
   SELECT o.account_id, AVG(o.total_amt_usd) avg_amt
   FROM orders o
   GROUP BY 1
   HAVING AVG(o.total_amt_usd) > (SELECT * FROM t1))
SELECT AVG(avg_amt)
FROM t2;
```