### Setup

In [90]:
%%capture
%load_ext sql
%sql sqlite:///chinook.db

## Overview of the data

Let's start by getting familiar with our data. Remember that we can query the database to get a list of all tables and views in our database:

In [91]:
%%sql
SELECT 
    name,
    type
FROM sqlite_master
WHERE type IN ('table', 'view')

 * sqlite:///chinook.db
Done.


name,type
album,table
artist,table
customer,table
employee,table
genre,table
invoice,table
invoice_line,table
media_type,table
playlist,table
playlist_track,table
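
As a minimal sketch of the same catalog query outside the `%%sql` magic, the snippet below uses Python's built-in `sqlite3` module against a throwaway in-memory database. The `demo_table` and `demo_view` names are made up for illustration and are not part of Chinook.

```python
import sqlite3

# In-memory database, so nothing touches chinook.db (object names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_table (id INTEGER PRIMARY KEY, label TEXT)")
conn.execute("CREATE VIEW demo_view AS SELECT label FROM demo_table")

# sqlite_master holds one row per schema object; keep only tables and views.
catalog = conn.execute(
    """
    SELECT name, type
    FROM sqlite_master
    WHERE type IN ('table', 'view')
    ORDER BY name
    """
).fetchall()

for name, obj_type in catalog:
    print(name, obj_type)
```

Pointing `sqlite3.connect` at `chinook.db` instead of `:memory:` should list the same objects as the cell above.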


### Selecting albums to purchase

The Chinook record store has just signed a deal with a new record label, and you've been tasked with selecting the first three albums that will be added to the store, from a list of four. All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:


![1.png](attachment:1.png)

The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.

You'll need to write a query to find out which genres sell the most tracks in the USA, write up a summary of your findings, and make a recommendation for the three artists whose albums we should purchase for the store.


1. *Write a query that returns each genre, with the number of tracks sold in the USA:*

    - *in absolute numbers*
    - *in percentages.*
    
2. *Write a paragraph that interprets the data and makes a recommendation for the three artists whose albums we should purchase for the store, based on sales of tracks from their genres.*

In [92]:
%%sql

SELECT SUM(quantity)
FROM invoice_line

 * sqlite:///chinook.db
Done.


SUM(quantity)
4757


In [94]:
SELECT g.name genre,
       SUM(il.quantity),
       CAST( SUM(il.quantity) AS FLOAT ) / (SELECT SUM(quantity) FROM invoice_line WHERE bi)
    FROM invoice_line il INNER JOIN genres g ON il.track_id = g.track_id
        INNER JOIN invoice i ON il.invoice_id = i.invoice_id
    WHERE i.billing_country = 'USA'
    GROUP BY g.name

IndentationError: unindent does not match any outer indentation level (<tokenize>, line 4)

In [None]:
%%sql

-- Table genre of each track_id
WITH genres AS (
        SELECT t.track_id, 
               g.name
        FROM track t INNER JOIN genre g ON t.genre_id = g.genre_id),

     sales_us AS(
         SELECT  SUM(il.quantity) total_us
             FROM invoice_line il INNER JOIN invoice i ON il.invoice_id = i.invoice_id
             WHERE i.billing_country = 'USA')

SELECT g.name genre,
       SUM(il.quantity) total_sales,
       ROUND( CAST( SUM(il.quantity) AS FLOAT ) / (SELECT total_us FROM sales_us) * 100 , 2) || ' %' sales_percentage
        
    FROM invoice_line il INNER JOIN genres g ON il.track_id = g.track_id
        INNER JOIN invoice i ON il.invoice_id = i.invoice_id
    WHERE i.billing_country = 'USA'
    GROUP BY g.name
    ORDER BY total_sales DESC
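
The percentage column above divides each group's sum by a grand total taken from a scalar subquery. Here is a small sketch of that pattern on made-up data, assuming a hypothetical `sales` table rather than the real Chinook schema, so the arithmetic is easy to check by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical toy data: 10 tracks sold in total.
    CREATE TABLE sales (genre TEXT, quantity INTEGER);
    INSERT INTO sales VALUES
        ('Rock', 5), ('Rock', 3), ('Pop', 1), ('Blues', 1);
    """
)

rows = conn.execute(
    """
    SELECT genre,
           SUM(quantity) AS total_sales,
           -- CAST avoids integer division; the subquery is the grand total.
           ROUND(CAST(SUM(quantity) AS FLOAT)
                 / (SELECT SUM(quantity) FROM sales) * 100, 2) AS sales_percentage
    FROM sales
    GROUP BY genre
    ORDER BY total_sales DESC, genre
    """
).fetchall()

for genre, total_sales, pct in rows:
    print(genre, total_sales, pct)
```

This sketch keeps the percentage numeric instead of appending `' %'`, which is one possible choice if the column needs to be sorted or summed later.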

**Recommendation**

Based on the data above, I recommend buying the albums by Red Tone, Meteor and the Girls, and Slim Jim Bites. Of the four options, hip-hop is the genre that sells the fewest tracks in the USA. The gap is small, but it is enough to separate otherwise comparable albums.

### Analyzing employee sales performance

Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. You have been asked to analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.

You might like to consider whether any extra columns from the employee table explain any variance you see, or whether the variance might instead be indicative of employee performance.

*1) Write a query that finds the total dollar amount of sales assigned to each sales support agent within the company. Add any extra attributes for that employee that you find are relevant to the analysis.*

*2) Write a short statement describing your results, and providing a possible interpretation.*

In [None]:
%%sql

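WITH ss_agents AS ( 
        SELECT employee_id,
               first_name || ' ' || last_name name,
               -- Assumes dates are stored as text starting with the four-digit year
               2021 - CAST(SUBSTR(birthdate, 1, 4) AS INTEGER) age,
               2021 - CAST(SUBSTR(hire_date, 1, 4) AS INTEGER) antiquity, 
               country
            FROM employee
            WHERE title = 'Sales Support Agent' ),
   
    customer_sales AS ( 
        SELECT c.support_rep_id,
               -- DISTINCT: the join yields one row per invoice, not per customer
               COUNT(DISTINCT c.customer_id) number_customers,
               SUM(CAST(i.total AS FLOAT)) total
        FROM customer c INNER JOIN invoice i ON c.customer_id = i.customer_id
        GROUP BY c.support_rep_id )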

SELECT ss.name, 
       ss.age,
       ss.antiquity,
       ss.country,
       cs.number_customers,
       cs.total
    FROM ss_agents ss 
        INNER JOIN customer_sales cs ON ss.employee_id = cs.support_rep_id
    ORDER BY cs.total DESC

Jane Peacock outperformed her colleagues: she was assigned almost the same number of customers as Margaret, yet her customers generated a higher total dollar amount of sales.
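
One detail worth checking in a query like this is what a `COUNT` measures once `customer` has been joined to `invoice`. The sketch below uses hypothetical toy rows (two customers, three invoices, one support rep) to show that a plain `COUNT(customer_id)` counts joined rows, while `COUNT(DISTINCT customer_id)` counts customers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical toy data, not the real Chinook rows.
    CREATE TABLE customer (customer_id INTEGER, support_rep_id INTEGER);
    CREATE TABLE invoice  (invoice_id INTEGER, customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 10), (2, 10);
    INSERT INTO invoice  VALUES (100, 1, 5.0), (101, 1, 7.0), (102, 2, 3.0);
    """
)

row = conn.execute(
    """
    SELECT c.support_rep_id,
           COUNT(c.customer_id)          AS joined_rows,  -- one per invoice
           COUNT(DISTINCT c.customer_id) AS customers,    -- one per customer
           SUM(i.total)                  AS total
    FROM customer c
    INNER JOIN invoice i ON c.customer_id = i.customer_id
    GROUP BY c.support_rep_id
    """
).fetchone()

print(row)
```

With these toy rows the rep has 3 joined rows but only 2 customers, so the two counts are not interchangeable.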

### Analyzing sales by country

Your next task is to analyze the sales data for customers from each different country. You have been given guidance to use the **country value from the customers table**, and **ignore the country from the billing address** in the invoice table.

In particular, you have been directed to calculate data, for each country, on the:

- total number of customers
- total value of sales
- average value of sales per customer
- average order value

Because there are a number of countries with only one customer, you should group these customers as "Other" in your analysis. You can use the following 'trick' to force the ordering of "Other" to last in your analysis.


If there is a particular value that you would like to force to the top or bottom of the results, you can put what would normally be your outermost query in a subquery with a case statement that adds a numeric column, and then sort by that column in the outer query. Here's an example - let's start by creating a view so we're working with a manageable number of rows:

In [None]:
%%sql

CREATE VIEW top_5_names AS
     SELECT
         first_name,
         count(customer_id) count
     FROM customer
     GROUP BY 1
     ORDER BY 2 DESC
     LIMIT 5;


SELECT * FROM top_5_names;

Next, inside a subquery, we'll select all values from our view and add a sorting column using a case statement, before sorting using that new column in the outer query.

In [None]:
%%sql

SELECT
    first_name,
    count
FROM
    (
    SELECT
        t5.*,
        CASE
            WHEN t5.first_name = 'Mark' THEN 1
            ELSE 0
        END AS sort
    FROM top_5_names t5
   )
ORDER BY sort ASC

You should be able to adapt this technique into your query to force 'Other' to the bottom of your results. When working through this exercise, you will need multiple subqueries and joins. Imagine you work on a team of data analysts, and write your query so that it will be able to be easily read and understood by your colleagues.
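
As a rough sketch of how the sorting column can be combined with a second sort key, the example below uses a hypothetical `country_sales` table with made-up totals. The `sort` column pushes "Other" to the bottom, and `total DESC` orders the remaining rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical toy data; 'Other' has the second-highest total on purpose.
    CREATE TABLE country_sales (country TEXT, total REAL);
    INSERT INTO country_sales VALUES
        ('Other', 900.0), ('USA', 1000.0), ('Canada', 500.0);
    """
)

ordered = [
    row[0]
    for row in conn.execute(
        """
        SELECT country, total
        FROM (
            SELECT cs.*,
                   CASE WHEN cs.country = 'Other' THEN 1 ELSE 0 END AS sort
            FROM country_sales cs
        )
        ORDER BY sort ASC, total DESC
        """
    )
]

print(ordered)
```

Sorting by `sort` alone would leave the order of the other rows unspecified, which is why the sketch adds `total DESC` as a tie-breaker.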

### Exercise

Write a query that collates data on purchases from different countries.
- Where a country has only one customer, collect them into an "Other" group.
- The results should be sorted by the total sales from highest to lowest, with the "Other" group at the very bottom.
- For each country, include:
    - total number of customers
    - total value of sales
    - average value of sales per customer
    - average order value

In [None]:
%%sql

DROP VIEW IF EXISTS customers;
DROP VIEW IF EXISTS countries;

CREATE VIEW customers AS 
      SELECT c.customer_id,
           CASE WHEN (SELECT COUNT(*) 
                       FROM customer
                       WHERE country = c.country) = 1 THEN 'Other'
                ELSE c.country 
            END AS country_name,
           SUM(i.total) total,
           COUNT(i.invoice_id) n_orders
        FROM customer c INNER JOIN invoice i ON c.customer_id = i.customer_id
        GROUP BY c.customer_id;

CREATE VIEW countries AS 
       SELECT country_name,
              COUNT(customer_id) "number of customers",
              ROUND(SUM(total),2) "total value of sales",
              ROUND(CAST(SUM(total) AS FLOAT) / COUNT(customer_id),2) "average value of sales per customer",
              ROUND(CAST(SUM(total) AS FLOAT) / SUM(n_orders),2) "average order value"
        FROM customers
        GROUP BY country_name
        HAVING COUNT(customer_id) > 1
        ORDER BY 3 DESC;
    

SELECT * 
FROM ( SELECT cs.*,  
              CASE 
                 WHEN cs.country_name = 'Other' THEN 1
                 ELSE 0
              END AS sort 
        FROM countries cs )
ORDER BY sort ASC, "total value of sales" DESC

## Albums vs individual tracks

The Chinook store is set up so that customers can make purchases in one of two ways:

- purchase a whole album
- purchase a collection of one or more individual tracks.

The store does not let customers purchase a whole album, and then add individual tracks to that same purchase (unless they do that by choosing each track manually). When customers purchase albums they are charged the same price as if they had purchased each of those tracks separately.

Management are currently considering changing their purchasing strategy to save money. The strategy they are considering is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.

We have been asked to find out what percentage of purchases are individual tracks vs whole albums, so that management can use this data to understand the effect this decision might have on overall revenue.

It is very common when you are performing an analysis to have 'edge cases' which prevent you from getting a 100% accurate answer to your question. In this instance, we have two edge cases to consider:

- Albums that have only one or two tracks are likely to be purchased by customers as part of a collection of individual tracks.
- Customers may decide to manually select every track from an album, and then add a few individual tracks from other albums to their purchase.

In the first case, since our analysis is concerned with maximizing revenue we can safely ignore albums consisting of only a few tracks. The company has previously done analysis to confirm that **the second case does not happen often, so we can ignore this case also**.


In order to answer the question, we're going to have to **identify whether each invoice has all the tracks from an album**. We can do this by getting the list of tracks from an invoice and comparing it to the list of tracks from an album. We can find the album to compare the purchase to by looking up the album that one of the purchased tracks belongs to. It doesn't matter which track we pick, since if it's an album purchase, that album will be the same for all tracks.

Up until now, we've only ever compared two single values, using operators like `=`, `!=` and `LIKE`. To compare two tables of values, we can use the `EXCEPT` operator that we learned in the previous mission.

Let's say we had three tables in a database, as shown in the diagram below:

![Screenshot_1.png](attachment:Screenshot_1.png)

We want to find a way to compare the letter columns from test_table_2 and test_table_3 to test_table_1, to see whether they are identical to test_table_1. Let's use EXCEPT and see what we get with the first two tables:

![Screenshot_2.png](attachment:Screenshot_2.png)

Here's the table that is returned:

![Screenshot_3.png](attachment:Screenshot_3.png)

Now, let's compare what we get with test_table_1 and test_table_3:

![Screenshot_4.png](attachment:Screenshot_4.png)

If you run this directly in SQLite, you will get no result at all. This is useful to us - we can check whether the result of an EXCEPT between two subqueries IS NULL. If we reverse the order of the tables around the EXCEPT operator, we get the same thing.

Let's try reversing the order of the EXCEPT operator for the first two tables:

![Screenshot_5.png](attachment:Screenshot_5.png)

Here, we get a null value even though the two tables are not identical. That's because all of the values for letter in test_table_2 are also in test_table_1, even if test_table_1 has an extra value.

Because of this, we need to combine both variations with an AND clause:

![Screenshot_6.png](attachment:Screenshot_6.png)

![Screenshot_7.png](attachment:Screenshot_7.png)

Once we've made the comparison, we can wrap it in a CASE statement to add a column that tells us if that invoice was an album purchase or not.
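
The two-way comparison can be sketched end to end with Python's `sqlite3` module. The screenshots are not reproduced here, so the letters below are assumptions chosen to match the description: test_table_2 is missing one value from test_table_1, and test_table_3 is identical to it. The `is_identical` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed contents, picked to match the behaviour described in the text.
    CREATE TABLE test_table_1 (letter TEXT);
    CREATE TABLE test_table_2 (letter TEXT);
    CREATE TABLE test_table_3 (letter TEXT);
    INSERT INTO test_table_1 VALUES ('A'), ('B'), ('C'), ('D');
    INSERT INTO test_table_2 VALUES ('A'), ('B'), ('C');
    INSERT INTO test_table_3 VALUES ('A'), ('B'), ('C'), ('D');
    """
)


def is_identical(other_table):
    """Return 'yes' if other_table holds the same letters as test_table_1."""
    # Table names cannot be bound as parameters, so only allow known names.
    if other_table not in {"test_table_2", "test_table_3"}:
        raise ValueError(f"unexpected table: {other_table}")
    query = f"""
        SELECT CASE
                 WHEN (SELECT letter FROM test_table_1
                       EXCEPT
                       SELECT letter FROM {other_table}) IS NULL
                  AND (SELECT letter FROM {other_table}
                       EXCEPT
                       SELECT letter FROM test_table_1) IS NULL
                 THEN 'yes'
                 ELSE 'no'
               END
    """
    return conn.execute(query).fetchone()[0]


# One direction alone is not enough: this returns no rows for a non-identical table.
one_way = conn.execute(
    "SELECT letter FROM test_table_2 EXCEPT SELECT letter FROM test_table_1"
).fetchall()

print(one_way, is_identical("test_table_2"), is_identical("test_table_3"))
```

Under these assumed contents, the single-direction check comes back empty for test_table_2, while the combined check only reports a match for test_table_3.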

You have everything you need to collate data on album vs single track purchases. This is easily the hardest query you have written so far, so take your time, and remember the query writing tips from the first screen!

#### Exercise

1. Write a query that categorizes each invoice as either an album purchase or not, and calculates the following summary statistics:
    - Number of invoices
    - Percentage of invoices
    
    
2. Write one to two sentences explaining your findings, and making a prospective recommendation on whether the Chinook store should continue to buy full albums from record companies.

In [127]:
%%sql

-- Table with the invoice_id and the track_id of one song
WITH first_track AS ( 
   SELECT i.invoice_id,
          il.track_id,
          t.album_id
      FROM invoice i INNER JOIN invoice_line il ON i.invoice_id = il.invoice_id
                     INNER JOIN track t ON t.track_id = il.track_id
      GROUP BY i.invoice_id
    ), 


    excepting AS (
        SELECT ft.*, 
               CASE 
                 WHEN  ( 
                     SELECT track_id 
                     FROM invoice_line 
                     WHERE invoice_id = ft.invoice_id

                     EXCEPT

                     SELECT track_id
                     FROM track
                     WHERE album_id = ft.album_id
                    ) IS NULL

                AND

                       (
                       SELECT track_id
                        FROM track
                        WHERE album_id = ft.album_id

                        EXCEPT

                        SELECT track_id
                        FROM invoice_line
                        WHERE invoice_id = ft.invoice_id
                    ) IS NULL

                    THEN 'yes'
                    ELSE 'no'
                END AS album_purchase
          FROM first_track ft  )

-- Main query

SELECT album_purchase,
       COUNT(invoice_id) number_of_invoices,
       ROUND(CAST(COUNT(invoice_id) AS FLOAT) / (SELECT COUNT(*) FROM excepting) * 100, 2) percentage_of_invoices
  FROM excepting
  GROUP BY album_purchase

 * sqlite:///chinook.db
Done.


album_purchase,number_of_invoices,percentage_of_invoices
no,500,81.43
yes,114,18.57
