# Answering Business Questions using SQL

In this project, we're going to practice using SQL skills to answer business questions. We'll use the Chinook database provided as a SQLite database file called `chinook.db`. A copy of the database schema is below.

![Chinook database schema](img/chinook-schema.svg)

## Creating Helper Functions
We'll create some helper functions in Python to save time, using a context manager to handle the connection to the SQLite database.
* `run_query()` takes a SQL query as an argument and returns a pandas dataframe of the results.
* `run_command()` takes a SQL command as an argument and executes it using the `sqlite3` module. Setting `conn.isolation_level = None` tells the module to autocommit any changes.
* `show_tables()` calls `run_query()` to return a list of all tables and views in the database.

In [1]:
import sqlite3, pandas

def run_query(query):
    with sqlite3.connect('dataset/chinook.db') as conn:
        return pandas.read_sql(query, conn)
    
def run_command(query):
    with sqlite3.connect('dataset/chinook.db') as conn:
        conn.isolation_level = None
        conn.execute(query)
        
def show_tables():
    query = "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view')"
    return run_query(query)

show_tables()

Unnamed: 0,name,type
0,album,table
1,artist,table
2,customer,table
3,employee,table
4,genre,table
5,invoice,table
6,invoice_line,table
7,media_type,table
8,playlist,table
9,playlist_track,table
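
One detail worth noting about the helpers above: the `sqlite3` connection context manager commits or rolls back the open transaction on exit, but it does not close the connection. The sketch below shows one way to close it explicitly with `contextlib.closing`. The function name `list_tables()` and the `db_path` parameter are hypothetical and not part of the original helpers. It uses only the standard library, so it returns plain tuples rather than a pandas dataframe.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return (name, type) rows for every table and view in the database.

    Hypothetical variant of show_tables(); db_path is an assumed parameter.
    """
    # closing() guarantees conn.close() on exit. The sqlite3 connection's own
    # context manager only commits or rolls back the current transaction.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name, type FROM sqlite_master "
            "WHERE type IN ('table', 'view') "
            "ORDER BY name"
        ).fetchall()
    return rows
```

Calling `list_tables('dataset/chinook.db')` should list the same tables and views as `show_tables()`, assuming the database file is at that path.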


## Selecting Albums to Purchase

The Chinook record store has just signed a deal with a new record label, and you've been tasked with selecting the first three albums that will be added to the store, from a list of four. All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:

Artist Name|Genre
:-|:-
Regal|Hip-Hop
Red Tone|Punk
Meteor and the Girls|Pop
Slim Jim Bites|Blues

The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.

In [2]:
query = '''
        WITH invoice_usa AS
           ( SELECT il.invoice_line_id, 
                    il.track_id
               FROM invoice_line AS il
                    INNER JOIN invoice AS i
                    ON il.invoice_id = i.invoice_id
              WHERE billing_country = 'USA'
           )
        SELECT g.name AS genre_name,
               COUNT(iu.invoice_line_id) AS num_tracks_sold,
               CAST(COUNT(iu.invoice_line_id) AS FLOAT) / (SELECT COUNT(*) FROM invoice_usa) AS percentage_tracks_sold
          FROM genre AS g
               LEFT JOIN track AS t
               ON g.genre_id = t.genre_id
               LEFT JOIN invoice_usa AS iu
               ON t.track_id = iu.track_id
         GROUP BY genre_name
         ORDER BY num_tracks_sold DESC
        '''
run_query(query)

Unnamed: 0,genre_name,num_tracks_sold,percentage_tracks_sold
0,Rock,561,0.533777
1,Alternative & Punk,130,0.123692
2,Metal,124,0.117983
3,R&B/Soul,53,0.050428
4,Blues,36,0.034253
5,Alternative,35,0.033302
6,Pop,22,0.020932
7,Latin,22,0.020932
8,Hip Hop/Rap,20,0.019029
9,Jazz,14,0.013321


Based on sales of tracks from their genres in the USA, we should select the following three albums from the list of four, since Hip-Hop is the least popular of the four genres:
* 1st: Red Tone (Punk)
* 2nd: Slim Jim Bites (Blues)
* 3rd: Meteor and the Girls (Pop)
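
The ranking above can be reproduced from the query output with a few lines of Python. This is a minimal sketch: the sales figures are copied from the output table, and the mapping from each artist's genre to a Chinook genre name (for example "Punk" to "Alternative & Punk" and "Hip-Hop" to "Hip Hop/Rap") is an assumption made for illustration.

```python
# Tracks sold in the USA per genre, copied from the query output above.
usa_tracks_sold = {
    "Alternative & Punk": 130,
    "Blues": 36,
    "Pop": 22,
    "Hip Hop/Rap": 20,
}

# Assumed mapping from each new artist to the closest Chinook genre name.
candidate_genres = {
    "Regal": "Hip Hop/Rap",
    "Red Tone": "Alternative & Punk",
    "Meteor and the Girls": "Pop",
    "Slim Jim Bites": "Blues",
}

# Rank artists by how many tracks their genre sold, best seller first.
ranked = sorted(
    candidate_genres,
    key=lambda artist: usa_tracks_sold[candidate_genres[artist]],
    reverse=True,
)
top_three = ranked[:3]
print(top_three)
```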

## Analyzing Employee Sales Performance

Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. You have been asked to analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.

In [3]:
query = '''
        SELECT e.first_name || ' ' || e.last_name AS employee_name,
               e.title,
               e.reports_to,
               e.hire_date,
               SUM(i.total) AS total_sales
          FROM employee AS e
               LEFT JOIN customer AS c
               ON e.employee_id = c.support_rep_id
               LEFT JOIN invoice AS i
               ON c.customer_id = i.customer_id
         WHERE title = 'Sales Support Agent'
         GROUP BY employee_name
         ORDER BY total_sales DESC
        '''
run_query(query)

Unnamed: 0,employee_name,title,reports_to,hire_date,total_sales
0,Jane Peacock,Sales Support Agent,2,2017-04-01 00:00:00,1731.51
1,Margaret Park,Sales Support Agent,2,2017-05-03 00:00:00,1584.0
2,Steve Johnson,Sales Support Agent,2,2017-10-17 00:00:00,1393.92


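The differences in total sales roughly correspond to the differences in hiring dates: Jane Peacock, the earliest hire, has the highest total, while Steve Johnson, the most recent hire, has the lowest.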
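
To put rough numbers on that comparison, the sketch below computes the relative sales gap and the hire-date gap between two agents. The figures are copied from the query output above, and the `compare()` helper is hypothetical, introduced only for illustration.

```python
from datetime import date

# (hire date, total sales) per agent, copied from the query output above.
agents = {
    "Jane Peacock": (date(2017, 4, 1), 1731.51),
    "Margaret Park": (date(2017, 5, 3), 1584.00),
    "Steve Johnson": (date(2017, 10, 17), 1393.92),
}


def compare(first, second):
    """Return (relative sales gap of first over second, days between hire dates)."""
    hired_first, sales_first = agents[first]
    hired_second, sales_second = agents[second]
    sales_gap = (sales_first - sales_second) / sales_second
    days_apart = (hired_second - hired_first).days
    return sales_gap, days_apart


sales_gap, days_apart = compare("Jane Peacock", "Steve Johnson")
print(f"Sales gap: {sales_gap:.1%}, hired {days_apart} days apart")
```

A fairer comparison would normalize sales by time employed, which would require a reference date that the data shown here does not provide.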

## Analyzing Sales by Country
Your next task is to analyze the sales data for customers from each country. In particular, you have been asked to calculate, for each country, the:

* total number of customers
* total value of sales
* average value of sales per customer
* average order value

Because there are a number of countries with only one customer, you should group these customers as "Other" in your analysis.
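
The grouping rule can be sketched in plain Python before writing it in SQL. This is an illustration only: the `label_countries()` helper and the sample list are hypothetical, and the label "Others" matches the one used in the query below.

```python
from collections import Counter


def label_countries(customer_countries):
    """Replace any country that has exactly one customer with 'Others'."""
    counts = Counter(customer_countries)
    return [
        country if counts[country] > 1 else "Others"
        for country in customer_countries
    ]


# One entry per customer (made-up sample, not Chinook data).
sample = ["USA", "USA", "Chile", "India", "India"]
print(label_countries(sample))
```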


In [4]:
query = '''
        WITH country_or_others AS
           ( SELECT CASE
                        WHEN ( SELECT COUNT(*)
                                 FROM customer
                                WHERE country = c.country) = 1 THEN 'Others'
                        ELSE c.country
                    END AS country,
                    c.customer_id,
                    i.invoice_id,
                    i.total
               FROM customer AS c
                    LEFT JOIN invoice AS i
                    ON c.customer_id = i.customer_id
           )
        SELECT
               country,
               num_customers,
               total_sales,
               avg_sales,
               avg_order_value
          FROM
              (
              SELECT country,
                     COUNT(DISTINCT customer_id) AS num_customers,
                     SUM(total) AS total_sales,
                     SUM(total) / COUNT(DISTINCT customer_id) AS avg_sales,
                     SUM(total) / COUNT(DISTINCT invoice_id) AS avg_order_value,
                     CASE
                         WHEN country <> 'Others' THEN 1
                         ELSE 0
                     END AS sort
                FROM country_or_others
               GROUP BY country
              )
         ORDER BY sort DESC, total_sales DESC;
        '''
run_query(query)

Unnamed: 0,country,num_customers,total_sales,avg_sales,avg_order_value
0,USA,13,1040.49,80.037692,7.942672
1,Canada,8,535.59,66.94875,7.047237
2,Brazil,5,427.68,85.536,7.011148
3,France,5,389.07,77.814,7.7814
4,Germany,4,334.62,83.655,8.161463
5,Czech Republic,2,273.24,136.62,9.108
6,United Kingdom,3,245.52,81.84,8.768571
7,Portugal,2,185.13,92.565,6.383793
8,India,2,183.15,91.575,8.721429
9,Others,15,1094.94,72.996,7.448571


Based on the data, there may be opportunity in the following countries:
* Czech Republic
* United Kingdom
* India

It's worth keeping in mind that the amount of data from each of these countries is relatively low. Because of this, we should be cautious about spending too much money on new marketing campaigns, as the sample size is not large enough to give us high confidence. A better approach would be to run small campaigns in these countries, then collect and analyze data on the new customers to make sure that these trends hold.

## Albums vs Individual Tracks

The Chinook store is set up in a way that allows customers to make purchases in one of two ways:

* purchase a whole album
* purchase a collection of one or more individual tracks.

The store does not let customers purchase a whole album, and then add individual tracks to that same purchase (unless they do that by choosing each track manually). When customers purchase albums they are charged the same price as if they had purchased each of those tracks separately.

Management are currently considering changing their purchasing strategy to save money. The strategy they are considering is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album.

We have been asked to find out what percentage of purchases are individual tracks vs whole albums, so that management can use this data to understand the effect this decision might have on overall revenue.

It is very common when you are performing an analysis to have 'edge cases' which prevent you from getting a 100% accurate answer to your question. In this instance, we have two edge cases to consider:

* Albums that have only one or two tracks are likely to be purchased by customers as part of a collection of individual tracks.
* Customers may decide to manually select every track from an album, and then add a few individual tracks from other albums to their purchase.

In the first case, since our analysis is concerned with maximizing revenue, we can safely ignore albums consisting of only a few tracks. The company has previously done analysis to confirm that the second case does not happen often, so we can ignore this case also.

In order to answer the question, we're going to have to identify whether each invoice has all the tracks from an album. We can do this by getting the list of tracks from an invoice and comparing it to the list of tracks from an album. We can find the album to compare the purchase to by looking up the album that one of the purchased tracks belongs to. It doesn't matter which track we pick, since if it's an album purchase, that album will be the same for all tracks.
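
The comparison described above amounts to a set-equality check. The sketch below illustrates the idea in plain Python with made-up data. The `classify_invoice()` helper and both lookup dictionaries are hypothetical stand-ins for the `track` and `invoice_line` tables, not part of the notebook's code.

```python
def classify_invoice(invoice_track_ids, track_to_album, album_to_tracks):
    """Label an invoice 'Album' if it contains exactly one album's tracks."""
    invoice_tracks = set(invoice_track_ids)
    if not invoice_tracks:
        return "Individual"
    # Any purchased track will do: for an album purchase they share one album.
    any_track = next(iter(invoice_tracks))
    album_tracks = set(album_to_tracks[track_to_album[any_track]])
    # Equal sets mean no album track is missing and no extra track was added.
    return "Album" if invoice_tracks == album_tracks else "Individual"


# Made-up data: album 10 has tracks 1-3, album 20 has tracks 4-5.
album_to_tracks = {10: [1, 2, 3], 20: [4, 5]}
track_to_album = {1: 10, 2: 10, 3: 10, 4: 20, 5: 20}

print(classify_invoice([1, 2, 3], track_to_album, album_to_tracks))
print(classify_invoice([1, 2], track_to_album, album_to_tracks))
print(classify_invoice([1, 2, 3, 4], track_to_album, album_to_tracks))
```

The SQL query below expresses the same check with two `EXCEPT` subqueries, one in each direction, both of which must come back empty.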

In [5]:
query = '''
        WITH album_or_not AS
           ( SELECT i.invoice_id,
                    CASE
                        WHEN ( (
                                SELECT track_id 
                                  FROM track
                                 WHERE album_id = t.album_id
                                EXCEPT
                                SELECT track_id
                                  FROM invoice_line
                                 WHERE invoice_id = i.invoice_id
                               ) IS NULL
                               AND
                               (
                                SELECT track_id
                                  FROM invoice_line
                                 WHERE invoice_id = i.invoice_id
                                EXCEPT
                                SELECT track_id
                                  FROM track
                                 WHERE album_id = t.album_id
                               ) IS NULL 
                             ) THEN 'Album'
                        ELSE 'Individual'
                    END AS category
               FROM invoice AS i
                    INNER JOIN invoice_line AS il
                    ON i.invoice_id = il.invoice_id
                    INNER JOIN track AS t
                    ON il.track_id = t.track_id
              GROUP BY i.invoice_id
           )
        SELECT category,
               COUNT(category) AS num_invoices,
               CAST(COUNT(category) AS FLOAT) / ( SELECT COUNT(*) FROM album_or_not ) AS percent_invoices
          FROM album_or_not
         GROUP BY category
         ORDER BY num_invoices DESC;
        '''
run_query(query)

Unnamed: 0,category,num_invoices,percent_invoices
0,Individual,500,0.814332
1,Album,114,0.185668


Album purchases account for 18.6% of purchases. Based on this data, I would recommend against purchasing only selected tracks from albums from record companies, since nearly one fifth of purchases, and the revenue attached to them, could be put at risk.
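
As a quick check of the arithmetic, the shares can be recomputed from the invoice counts in the output above. Note that these are shares of invoices rather than of revenue; the query does not weight purchases by their value.

```python
# Invoice counts copied from the query output above.
invoice_counts = {"Individual": 500, "Album": 114}

total = sum(invoice_counts.values())
shares = {category: count / total for category, count in invoice_counts.items()}

print(f"Album share of invoices: {shares['Album']:.1%}")
```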