[Table of Contents](../../index.ipynb)

# FRC Analytics with Python - Session 18
# Structured Query Language (SQL) - Part III
**Last Updated: 25 November 2021**

In this third session on SQL we'll learn about creating tables and indexes and running upsert and aggregate queries.

## I. Notebook Setup

### A. If Using Google Colab
It's best if you clone the *pyclass_frc* GitHub repo and run this notebook from your local computer. But if you would like to run it from Google Colab, uncomment and run the line in the next cell. (*Don't delete the exclamation point at the start of the line!*) The cell will download the *Chinook* database file required for this session.

In [None]:
# !wget -nv https://raw.githubusercontent.com/irs1318dev/pyclass_frc/master/sessions/s19_SQL_III/chinook.sqlite3

### B. Imports and Database File
Run the next cell to set up the notebook to work with our SQLite databases.

In [None]:
import sqlite3

import pandas as pd

# Database files
robotics_db = 'robotics.sqlite3'
chinook_db = 'chinook.sqlite3'

### C. Reset Robotics Database
Several SQL commands in this notebook will generate an error if we try to execute them more than once. If you would like to re-run the SQL commands in this notebook, or you are getting a `UNIQUE constraint failed:...` error message, run the next cell to revert the database to its initial condition.

In [None]:
# Run this cell to revert the robotics.sqlite3 database back to its initial condition.
rcon = sqlite3.connect(robotics_db)
rcon.execute("DROP TABLE IF EXISTS Checkouts;")
rcon.execute("DROP TABLE IF EXISTS Members;")
rcon.execute("DROP TABLE IF EXISTS Tools;")
rcon.commit()
rcon.close()   

### D. SQL References
For your convenience, here are the SQL references that were discussed in sessions 16 and 17.
* [Official SQLite documentation](https://www.sqlite.org/lang.html)
* [Python sqlite3 Package Documentation](https://docs.python.org/3/library/sqlite3.html)
* [W3 Schools SQL Tutorial](https://www.w3schools.com/Sql/default.asp) 
* [Tutorialspoint SQLite Tutorial](https://www.tutorialspoint.com/sqlite/index.htm)

## II. Working with Tables

### A. Creating a Basic Table
Now that we've run SQL queries against numerous tables, it's time to explain how we create them. Suppose we wanted to create a database to keep track of the members of our robotics team. We want to use the database to track who is planning on attending competitions, who has turned in their permission slips, who has paid their dues, and other stuff like that. The database will need several tables. We'll start with a *Members* table.

In [None]:
# Create a table
query = """
    CREATE TABLE IF NOT EXISTS Members(
            member_id INTEGER PRIMARY KEY,
           given_name TEXT NOT NULL COLLATE NOCASE,
              surname TEXT NOT NULL COLLATE NOCASE,
                 role TEXT,
                email TEXT NOT NULL UNIQUE,
               mobile TEXT UNIQUE,
                phone TEXT,
          date_joined TEXT DEFAULT CURRENT_DATE);
"""
rcon = sqlite3.connect(robotics_db)
rcon.execute(query)
rcon.commit()
display(pd.read_sql_query("SELECT * FROM Members;", rcon))
rcon.close()

The preceding query created a new, empty table.

`CREATE TABLE` queries are not difficult to understand. They start with the clause `CREATE TABLE` and include a definition for each table column in parentheses. Let's look at the components of the *Members* query.
* The phrase `IF NOT EXISTS` is optional. If we left it out and ran the `CREATE TABLE` query twice, we would get an error the second time we ran the query because the table would already exist. If the `IF NOT EXISTS` phrase is present, the query does nothing if the table already exists. The phrase is included in this query because code cells are often run more than once in a Jupyter notebook.
* Each column definition includes the column's name and datatype.
* SQLite has few datatypes compared to other database systems. The most common datatypes are `TEXT`, `INTEGER`, `REAL`, and `NUMERIC`. The `NUMERIC` datatype is intended to store both integers and real numbers (i.e., floating point numbers). SQLite also has a `BLOB` datatype, which stands for binary large object. `BLOB` columns can store any sequence of bytes, such as a photo or an audio file.
* The phrase `NOT NULL` is a type of *column constraint*. It prevents users from adding records where the `NOT NULL` column is empty.
* `UNIQUE` is another type of column constraint. SQLite will not allow users to create records if the value in the `UNIQUE` column is the same as the value from another record in the table.
* The `DEFAULT CURRENT_DATE` phrase causes SQLite to enter the current date into the *date_joined* column if the user does not specify a different date. The `CURRENT_DATE` keyword could be replaced with any literal value, like 0, or 'bubble tea'.
* The `COLLATE NOCASE` clause tells SQLite to ignore case when comparing text values. This will affect how `WHERE` clauses operate.
* Finally, the `PRIMARY KEY` phrase instructs SQLite that the *member_id* column will be the primary key for the table. Remember, primary keys are used to identify records in a table. SQLite will automatically insert a unique integer value into this column whenever we add a record without specifying a *member_id*.
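The constraints described above can be seen in action with a small sketch. It uses a throwaway in-memory database and a shortened version of the *Members* table (an assumption made for brevity), so it does not touch *robotics.sqlite3*.

```python
import sqlite3

# In-memory database: nothing here is written to robotics.sqlite3.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Members(
            member_id INTEGER PRIMARY KEY,
           given_name TEXT NOT NULL COLLATE NOCASE,
                email TEXT NOT NULL UNIQUE,
          date_joined TEXT DEFAULT CURRENT_DATE);
""")
con.execute("""
    INSERT INTO Members (given_name, email)
         VALUES ('Heng', 'heng.pan@frcteam.org');
""")

bad_inserts = [
    # Violates NOT NULL: no given_name is supplied.
    "INSERT INTO Members (email) VALUES ('nobody@frcteam.org');",
    # Violates UNIQUE: this email is already in the table.
    """INSERT INTO Members (given_name, email)
            VALUES ('Henry', 'heng.pan@frcteam.org');""",
]
errors = []
for statement in bad_inserts:
    try:
        con.execute(statement)
    except sqlite3.IntegrityError as err:
        errors.append(str(err))
        print("Rejected:", err)

# COLLATE NOCASE: searching for 'HENG' still finds 'Heng'.
# member_id and date_joined were filled in by SQLite.
rows = con.execute("""
    SELECT member_id, given_name, date_joined
      FROM Members
     WHERE given_name = 'HENG';
""").fetchall()
print(rows)
con.close()
```

Both bad inserts raise `sqlite3.IntegrityError`, which is the same error type we will run into again in the UPSERT section.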

Let's add some data to our new table.

In [None]:
# Add records to our table
members = [
    {'given_name': 'Aishwarya',
     'surname': 'Comar',
     'role': 'Mentor',
     'email': 'aishwarya.comar@frcteam.org'},
    {'given_name': 'Heng',
     'surname': 'Pan',
     'role': 'Student',
     'email': 'heng.pan@frcteam.org'},
    {'given_name': 'Pamela',
     'surname': 'Robinson',
     'role': 'Student',
     'email': 'pamela.robinson@frcteam.org'}
]

query = """
    INSERT INTO Members (given_name, surname, role, email)
                VALUES (:given_name, :surname, :role, :email);
"""
rcon = sqlite3.connect('robotics.sqlite3')
rcon.executemany(query, members)
rcon.commit()
display(pd.read_sql_query("SELECT * FROM Members;", rcon))
rcon.close()

We used the `sqlite3` package's `.executemany()` method to add multiple rows at one time. We passed a list of dictionary objects to the `.executemany()` method and Python matched the dictionary keys to the named parameters in the SQL query. We also could have used question marks (`?`) in the SQL query instead of named parameters like `:given_name`. If question marks are used, the list should contain tuples instead of dictionary objects, and the order of the values within each tuple should match the column order in the SQL query.
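Here is a minimal sketch of the question-mark style. It runs against an in-memory database with a simplified *Members* table (no constraints, an assumption made to keep the example short).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Members(
        member_id INTEGER PRIMARY KEY,
        given_name TEXT, surname TEXT, role TEXT, email TEXT);
""")

# Each tuple lists its values in the same order as the columns
# named in the INSERT clause.
members = [
    ('Aishwarya', 'Comar', 'Mentor', 'aishwarya.comar@frcteam.org'),
    ('Heng', 'Pan', 'Student', 'heng.pan@frcteam.org'),
]
query = """
    INSERT INTO Members (given_name, surname, role, email)
                VALUES (?, ?, ?, ?);
"""
con.executemany(query, members)
con.commit()

rows = con.execute("""
    SELECT member_id, given_name, surname
      FROM Members
  ORDER BY member_id;
""").fetchall()
print(rows)
con.close()
```

Named parameters are usually easier to read when a query has many columns, because the values can't silently end up in the wrong column.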

### B. SQLite Datatypes
SQLite considers datatypes to be suggestions, not strict requirements. SQLite will store any type of data in any column regardless of the specified datatype. This is not true for other database systems, which are strictly typed. Other database systems might throw an error if a user tries to store a value like 'Spartabots' in an integer column. While SQLite does not throw errors if data doesn't match a column's datatype, SQLite does use the specified datatype to decide how the value will be stored on disk.
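The following sketch illustrates this behavior with SQLite's built-in `typeof()` function. The *Demo* table and its values are made up for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Demo(team_number INTEGER);")

# An integer, text that looks like an integer, and text that doesn't.
con.executemany(
    "INSERT INTO Demo (team_number) VALUES (?);",
    [(1318,), ('2976',), ('Spartabots',)])

# typeof() reports how SQLite actually stored each value.
rows = con.execute("""
    SELECT team_number, typeof(team_number)
      FROM Demo
  ORDER BY rowid;
""").fetchall()
for value, storage in rows:
    print(repr(value), storage)
con.close()
```

The text `'2976'` is converted to an integer because the column's declared datatype is `INTEGER`, while `'Spartabots'` can't be converted and is stored as text without any error.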

You might have noted that SQLite does not have a Boolean datatype. Use the INTEGER datatype for Boolean data, with 1 for *true* and 0 for *false*.

Additionally, SQLite has no datatype for dates or times. SQLite can store date-time values in TEXT, REAL, or INTEGER columns.
* If an INTEGER column is used, the date-time is stored as the number of seconds that have elapsed since January first, 1970 (i.e., Unix time).
* If a REAL column is used, the date-time is stored as the number of days since noon in Greenwich, UK on November 24th, 4714 BCE. 
* If a TEXT column is used, the date-time is stored as a string using [ISO8601 format](https://en.wikipedia.org/wiki/ISO_8601), `'YYYY-MM-DD HH:MM:SS.SSS'`. The value need not include the time components, but must be ordered from longer time periods to shorter time periods, i.e., from years to months to days to hours and so on.
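As a rough illustration of the three storage options, the sketch below uses SQLite's built-in `strftime()`, `julianday()`, and `date()` functions. The specific dates are arbitrary examples.

```python
import sqlite3

con = sqlite3.connect(":memory:")
unix_seconds, julian_day, iso_text = con.execute("""
    SELECT CAST(strftime('%s', '1970-01-02 00:00:00') AS INTEGER),
           julianday('2000-01-01 12:00:00'),
           date('2021-11-25 18:30:00');
""").fetchone()
con.close()

print(unix_seconds)  # INTEGER style: seconds since 1970-01-01
print(julian_day)    # REAL style: days since noon, Nov. 24, 4714 BCE
print(iso_text)      # TEXT style: ISO 8601 string

# Because ISO 8601 runs from longer periods to shorter periods,
# sorting the text alphabetically also sorts it chronologically.
checkout_dates = ['2021-10-03', '2021-09-15', '2021-10-01']
print(sorted(checkout_dates))
```

The last two lines show one practical reason the year-month-day ordering matters: an `ORDER BY` on a TEXT date column gives the expected chronological order.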

The Issaquah Robotics Society prefers to use TEXT datatypes for date-time values, so that the date-time values don't require additional processing to be human-readable.

The [SQLite documentation provides additional information on datatypes.](https://www.sqlite.org/datatype3.html)

### C. Multi-Column Primary Keys
Our *Members* table has a single-column primary key. Single-column keys are common, but there are situations where we might want a multi-column primary key.

Here's a situation where one might want a multi-column primary key. Suppose that you decide to add a tool inventory program to your robotics team database. We'll need a *Tools* table. The next cell creates the *Tools* table and adds two records.

In [None]:
# Create a Tools table (single-column primary key)
rcon = sqlite3.connect('robotics.sqlite3')

query = """
    CREATE TABLE IF NOT EXISTS Tools(
              tool_id INTEGER PRIMARY KEY,
            tool_type TEXT NOT NULL,
          tool_number INTEGER UNIQUE NOT NULL,
          description TEXT NOT NULL COLLATE NOCASE,
        date_acquired TEXT);
"""

rcon.execute(query)

tools = [
    ('drill', 1, 'Ryobi 18V Lithium-Ion Cordless Drill', None),
    ('5-axis mill', 2, 'Grob G150 5-axis Universal Machining Center',
     '2024-08-01')
]

query = """
    INSERT INTO Tools (tool_type, tool_number, description, date_acquired)
                VALUES (?, ?, ?, ?);
"""
rcon.executemany(query, tools)
rcon.commit()
display(pd.read_sql_query("SELECT * FROM Tools;", rcon))
rcon.close()

Now suppose you want to require team members to check out tools, and that you will use a database to record to whom a tool is checked out at any given time. We want to store this information in a *Checkouts* table, which links the *Members* and *Tools* tables.

In [None]:
# Create a Checkouts table
rcon = sqlite3.connect('robotics.sqlite3')

query = """
    CREATE TABLE IF NOT EXISTS Checkouts(
            tool_id INTEGER,
          member_id INTEGER,
      checkout_date TEXT NOT NULL DEFAULT CURRENT_DATE,
       checkin_date TEXT,
      
         CONSTRAINT CheckoutsKey
        PRIMARY KEY (tool_id, member_id)
    );
"""

rcon.execute(query)

checkouts = [
    (1, 1, '2021-10-01', '2021-10-01'),
    (2, 2, '2021-10-02', None),
    (1, 3, '2021-10-03', None)
]

query = """
    INSERT INTO Checkouts (tool_id, member_id, checkout_date, checkin_date)
                   VALUES (?, ?, ?, ?);    
"""
rcon.executemany(query, checkouts)
rcon.commit()
display(pd.read_sql_query("SELECT * FROM Checkouts;", rcon))
rcon.close()

The `CREATE TABLE` query for the *Checkouts* table uses a table-level constraint to add a multi-column primary key. The table's primary key consists of both the *tool_id* and *member_id* columns. This means that we can have duplicates in the *tool_id* or *member_id* columns, but we can't have two records with the same values in both the *tool_id* and *member_id* columns.

There are other types of table constraints in addition to primary keys. Table constraints can be used to create multi-key uniqueness constraints, check constraints that verify data values meet some specified criteria, and foreign key constraints that limit values to key values that are present in some other table. You can learn more about table-level constraints by reading through the official [SQLite documentation](https://www.sqlite.org/syntax/table-constraint.html) or at [sqlitetutorial.net](https://www.sqlitetutorial.net/).
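The sketch below shows one way the three other table constraints might look. The *Cabinets* and *Parts* tables and the constraint names are hypothetical and are not part of the robotics database. Note that SQLite only enforces foreign keys on connections where `PRAGMA foreign_keys` has been turned on.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Foreign key enforcement is off by default in SQLite.
con.execute("PRAGMA foreign_keys = ON;")
con.execute("""
    CREATE TABLE Cabinets(
        cabinet_id INTEGER PRIMARY KEY,
             label TEXT NOT NULL);
""")
con.execute("""
    CREATE TABLE Parts(
           part_id INTEGER PRIMARY KEY,
        cabinet_id INTEGER NOT NULL,
            drawer INTEGER NOT NULL,
          quantity INTEGER NOT NULL,

        CONSTRAINT OnePartPerDrawer UNIQUE (cabinet_id, drawer),
        CONSTRAINT NoNegativeStock CHECK (quantity >= 0),
        CONSTRAINT KnownCabinet FOREIGN KEY (cabinet_id)
                   REFERENCES Cabinets (cabinet_id)
    );
""")
con.execute("INSERT INTO Cabinets (cabinet_id, label) VALUES (1, 'Red');")
con.execute(
    "INSERT INTO Parts (cabinet_id, drawer, quantity) VALUES (1, 1, 40);")

bad_rows = [
    (1, 1, 5),   # drawer 1 of cabinet 1 is already used -> UNIQUE
    (1, 2, -3),  # negative quantity -> CHECK
    (9, 1, 5),   # there is no cabinet 9 -> FOREIGN KEY
]
rejected = []
for row in bad_rows:
    try:
        con.execute("""
            INSERT INTO Parts (cabinet_id, drawer, quantity)
                 VALUES (?, ?, ?);
        """, row)
    except sqlite3.IntegrityError as err:
        rejected.append(str(err))
        print(row, "rejected:", err)
con.close()
```

Each of the three bad rows is rejected by a different constraint, and the error message indicates which kind of constraint failed.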

The *Checkouts* table is not a realistic example with respect to multi-column primary keys. We'll see if you can figure out why in exercise #4.

### D. Indexes
The databases that we use for our robotics activities are small and we don't worry much about performance. But performance is a concern for databases used by large businesses, with millions of rows and terabytes of information -- especially if the databases are driving a web application and we want to minimize latency. Indexes are a database tool that can speed up queries for large databases.

Consider our *Members* table. We expect that `SELECT` queries with `WHERE surname = ...` clauses would occur frequently on the *Members* table. With a small database, SQLite can quickly search through all rows in the *Members* table to find the desired record. But if there were millions of records, such a search would take considerable time. SQLite has a `CREATE INDEX` statement that we can use to add an index to the *Members* table.

In [None]:
# Create an index
query = """
    CREATE INDEX IF NOT EXISTS surname_index
        ON Members (surname, given_name);
"""
rcon = sqlite3.connect('robotics.sqlite3')
rcon.execute(query)
rcon.commit()
display(pd.read_sql_query("SELECT * FROM sqlite_master;", rcon))
rcon.close()

We used the `CREATE INDEX` query to add an index on the *surname* and *given_name* columns of the *Members* table. Then we displayed the *sqlite_master* table to show that *surname_index* appears in it. We can also see in the *sqlite_master* table that indexes were automatically created for the primary key in the *Checkouts* table and the unique columns in the *Members* table.

To understand how indexes speed up searches, consider the index at the back of a book. When we create an index in a SQL database, we create a data structure that cross-references values to locations within the database, similar to how a book index cross-references terms to page numbers. Suppose I want to look up normal probability distributions in my statistics textbook. I could scan every page, starting at page 1, for mentions of the normal probability distribution. This is analogous to searching an un-indexed column with a `SELECT` query. Or I could go to the index at the back of the book. Since the index is sorted alphabetically, I can quickly locate the index entry for normal probability distributions, get the number of the page that contains the description of normal probability distributions, and turn directly to that page. This is analogous to searching an indexed column.

There is a cost to using indexes. Every time we add a new record or update the indexed value of an existing record, we have to update the index, which takes a small amount of time for each insert or update. Since most records are retrieved more frequently than they are written, we come out ahead if we add an index for frequently searched fields. But if we had a table that was updated often and read from rarely, we might want to avoid indexes on that table. [See SQLite's official documentation for more information on indexes.](https://www.sqlite.org/lang_createindex.html)
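One way to check whether SQLite will use an index is to put `EXPLAIN QUERY PLAN` in front of a query. The sketch below runs the same search before and after creating an index on a simplified, in-memory *Members* table. The exact wording of the plan varies between SQLite versions, so the code only prints it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Members(
        member_id INTEGER PRIMARY KEY,
        given_name TEXT,
        surname TEXT);
""")
search = """
    EXPLAIN QUERY PLAN
    SELECT given_name FROM Members WHERE surname = 'Pan';
"""

# The last column of each EXPLAIN QUERY PLAN row describes one step.
plan_before = [row[-1] for row in con.execute(search)]

con.execute("CREATE INDEX surname_index ON Members (surname, given_name);")
plan_after = [row[-1] for row in con.execute(search)]
con.close()

print("Without index:", plan_before)
print("With index:   ", plan_after)
```

Without the index, the plan describes a scan of the whole table, like reading every page of the book. With the index, the plan mentions *surname_index*, like turning to the back of the book first.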

### E. Exercises 1 - 6

#### Ex. #1
Create an *Activities* table to store information about team activities. The table should have a primary key, activity type (e.g., meeting, competition, outreach event), date, location, and a field that indicates whether permission slips are required.

In [None]:
# Ex. #1



#### Ex. #2
Use an `INSERT` query to add three records to the *Activities* table from exercise #1. Make up the activities. At least one of the activities should require a permission slip.

In [None]:
# Ex. #2.



#### Ex. #3.
What datatype did you use for the field that indicates whether permission slips are required? Why?

In [None]:
# Ex. #3
#
#

#### Ex. #4
Explain why the *Checkouts* table is unrealistic.

**HINT:** What happens if someone tries to check out a tool that they've checked out before?

In [None]:
# Ex. #4
#
#

#### Ex. #5
Create an *Attendance* table to record which members attended the activities contained in the *Activities* table from exercise #1.
* The *Attendance* table should have foreign keys for both the *Members* and *Activities* tables.
* Use a multi-column primary key for the *Attendance* table.
* The *Attendance* table should indicate whether the member has turned in their permission slip.

In [None]:
# Ex. #5



#### Ex. #6
Is it realistic to use a multi-column primary key for the *Attendance* table? If so, why would it be realistic to use a multi-column key in the *Attendance* table, but not in the *Checkouts* table?

In [None]:
# Ex. #6
#
#

### F. Altering Tables
Use SQL's `ALTER TABLE` statement to rename a table, rename a column, add a column, or drop (delete) a column. For example, the following query will add a new column to the *Members* table.
```sql
ALTER TABLE Members
 ADD COLUMN graduation_year INTEGER;
```
The full syntax for `ALTER TABLE` statements is explained in the following syntax diagram from [SQLite's official documentation](https://www.sqlite.org/lang_altertable.html).

#### Alter Table Query
![ALTER TABLE Syntax](images/alter-table.png)

To read the diagram, begin at the start of the path and follow the arrows. Every round box with an uppercase term or punctuation symbol, such as ALTER or COLUMN or ".", represents a literal value that is typed directly into the query. Every round box in lowercase represents a value supplied by the user. Different types of queries are selected by following different paths. Optional items have a bypass path and items that can be repeated have a loop. 

Items in rectangular boxes, like "column-def" refer to a subordinate diagram. The subordinate diagram for column definitions is included below. Refer to SQLite's official documentation for other subordinate diagrams.

#### Column Def Clause
![Column Def](images/column-def.png)

The ability to read a syntax diagram comes in handy. Syntax diagrams can be used to precisely specify the syntax for a variety of languages. Mathematically speaking, a syntax diagram is a directed graph where the nodes are the terms and the paths show which terms are allowed. [Check out Wikipedia's article on syntax diagrams to learn more.](https://en.wikipedia.org/wiki/Syntax_diagram)

Other table schema changes can only be completed by creating a new table, transferring the old table's data to the new table, and deleting the old table. [See the SQLite documentation for additional guidance.](https://www.sqlite.org/lang_altertable.html) 
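The sketch below shows both approaches on a simplified, in-memory *Members* table: first a change that `ALTER TABLE` handles directly, then a rebuild that adds a `NOT NULL` constraint to an existing column. The *Members_new* name is arbitrary, and the rebuild is a simplified version of the procedure in the SQLite documentation, which also covers indexes, triggers, views, and foreign keys.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Members(
        member_id INTEGER PRIMARY KEY,
        given_name TEXT,
        phone TEXT);
""")
con.execute(
    "INSERT INTO Members (given_name, phone) VALUES ('Heng', '555-555-5555');")

# 1. A change that ALTER TABLE supports directly.
con.execute("ALTER TABLE Members ADD COLUMN graduation_year INTEGER;")
con.commit()

# 2. A change it does not support: make given_name NOT NULL.
#    Create the new table, copy the data, drop the old table, rename.
con.executescript("""
    BEGIN;
    CREATE TABLE Members_new(
        member_id INTEGER PRIMARY KEY,
        given_name TEXT NOT NULL,
        phone TEXT,
        graduation_year INTEGER);
    INSERT INTO Members_new (member_id, given_name, phone, graduation_year)
         SELECT member_id, given_name, phone, graduation_year
           FROM Members;
    DROP TABLE Members;
    ALTER TABLE Members_new RENAME TO Members;
    COMMIT;
""")

# PRAGMA table_info returns (cid, name, type, notnull, default, pk).
columns = [(row[1], row[3]) for row in
           con.execute("PRAGMA table_info(Members);")]
rows = con.execute("SELECT * FROM Members;").fetchall()
print(columns)
print(rows)
con.close()
```

Running the rebuild inside a single transaction means that if any step fails, the original table is left untouched.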

### G. Dropping Tables
Use a `DROP TABLE` query to delete a table from a database. For example:

```sql
DROP TABLE IF EXISTS Members;
```

## III. UPSERT Queries
### A. INSERT Conflicts
What would happen if we tried to insert a team member into the *Members* table who already existed in the *Members* table?

In [None]:
query = """
    INSERT INTO Members (given_name, surname, role, email, mobile)
         VALUES (:given_name, :surname, :role, :email, :mobile);
"""

student = {
    'given_name': 'Heng',
     'surname': 'Pan',
     'role': 'Student',
     'email': 'heng.pan@frcteam.org',
     'mobile': '555-555-5555'}

rcon = sqlite3.connect('robotics.sqlite3')
rcon.execute(query, student)
rcon.commit()
rcon.close()

SQLite threw an `IntegrityError` because we tried to add a record with an email address that already existed in the *email* column and the *email* column has a `UNIQUE` constraint.

Suppose one of the team's members has typed personal information into a form and we need to make sure that information gets into the *Members* table. We don't know if there is already a record in the *Members* table for the team member. What do we do?

One option is to run a `SELECT` query and check for a record with the same email address. If the record exists, run an `UPDATE` query; if it doesn't, run an `INSERT` query. This is doable, but inefficient because we are now running two queries for every update.
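For comparison, here is a rough sketch of that two-query approach. The `save_member()` function is a hypothetical helper written for this example, and the table is a simplified in-memory version of *Members*.

```python
import sqlite3


def save_member(con, member):
    """Insert the member, or update the row that has the same email."""
    found = con.execute(
        "SELECT member_id FROM Members WHERE email = :email;",
        member).fetchone()
    if found is None:
        con.execute("""
            INSERT INTO Members (given_name, surname, email, mobile)
                 VALUES (:given_name, :surname, :email, :mobile);
        """, member)
        return "inserted"
    con.execute("""
        UPDATE Members
           SET given_name = :given_name, surname = :surname,
               mobile = :mobile
         WHERE email = :email;
    """, member)
    return "updated"


con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Members(
        member_id INTEGER PRIMARY KEY,
        given_name TEXT, surname TEXT,
        email TEXT NOT NULL UNIQUE,
        mobile TEXT UNIQUE);
""")
student = {'given_name': 'Heng', 'surname': 'Pan',
           'email': 'heng.pan@frcteam.org', 'mobile': '555-555-5555'}

first = save_member(con, student)    # No matching email yet
student['mobile'] = '555-555-1234'
second = save_member(con, student)   # Same email, new mobile number
con.commit()

rows = con.execute("SELECT given_name, mobile FROM Members;").fetchall()
print(first, second, rows)
con.close()
```

Every call runs two queries, and we had to write the branching logic ourselves.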

### B. UPSERT to the Rescue
A better option is to add an UPSERT clause to our `INSERT` query.

In [None]:
# UPSERT Query

# THIS CELL WILL NOT RUN ON GOOGLE COLAB
# Upsert was not added to SQLite until version 3.24 and Google Colab uses
# an older version of SQLite

query = """
    INSERT INTO Members (given_name, surname, role, email, mobile)
         VALUES (:given_name, :surname, :role, :email, :mobile)
    ON CONFLICT (email) DO UPDATE
            SET given_name=:given_name,
                role=:role, email=:email, mobile=:mobile;
"""

student = {
    'given_name': 'Heng',
     'surname': 'Pan',
     'role': 'Student',
     'email': 'heng.pan@frcteam.org',
     'mobile': '555-555-5555'}

rcon = sqlite3.connect('robotics.sqlite3')
rcon.execute(query, student)
rcon.commit()
display(pd.read_sql_query("SELECT * FROM Members;", rcon))
rcon.close()

UPSERT is a combination of the words UPDATE and INSERT. UPSERT is a common term related to SQL queries, but there is no `UPSERT` keyword. When we talk about UPSERT queries, we're referring to `INSERT` queries with an added `ON CONFLICT` clause that converts the `INSERT` query into an `UPDATE` query when a conflict occurs. The preceding UPSERT query converts to an `UPDATE` query if the new record's email conflicts with a value that is already in the *email* column. The `ON CONFLICT ... DO UPDATE` clause allows us to specify exactly what columns get updated and what they get updated to.

We could also use `DO NOTHING` instead of `DO UPDATE`.

In [None]:
# Do Nothing Query

# THIS CELL WILL NOT RUN ON GOOGLE COLAB
# Upsert was not added to SQLite until version 3.24 and Google Colab uses
# an older version of SQLite

query = """
    INSERT INTO Members (given_name, surname, role, email, mobile)
         VALUES (:given_name, :surname, :role, :email, :mobile)
    ON CONFLICT (email) DO NOTHING;
"""

student = {
    'given_name': 'Aishwarya',
    'surname': 'Comar',
    'role': 'Mentor',
    'email': 'aishwarya.comar@frcteam.org',
    'mobile': '999-999-9999'}

rcon = sqlite3.connect('robotics.sqlite3')
rcon.execute(query, student)
rcon.commit()
display(pd.read_sql_query("SELECT * FROM Members;", rcon))
rcon.close()

When `DO NOTHING` is used, then the query leaves rows with conflicts unchanged.
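The two conflict actions can be compared side by side with a small sketch. It uses an in-memory database with a simplified *Members* table, and it assumes that your Python installation uses SQLite version 3.24 or later.

```python
import sqlite3

print("SQLite version:", sqlite3.sqlite_version)  # UPSERT needs 3.24+

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Members(
        member_id INTEGER PRIMARY KEY,
        given_name TEXT,
        email TEXT NOT NULL UNIQUE,
        mobile TEXT);
""")
con.execute("""
    INSERT INTO Members (given_name, email)
         VALUES ('Heng', 'heng.pan@frcteam.org');
""")

update_query = """
    INSERT INTO Members (given_name, email, mobile)
         VALUES (:given_name, :email, :mobile)
    ON CONFLICT (email) DO UPDATE
            SET mobile = :mobile;
"""
nothing_query = """
    INSERT INTO Members (given_name, email, mobile)
         VALUES (:given_name, :email, :mobile)
    ON CONFLICT (email) DO NOTHING;
"""
check = "SELECT given_name, mobile FROM Members;"

# DO UPDATE: only the columns listed after SET are changed,
# so given_name stays 'Heng' even though we passed 'Henry'.
con.execute(update_query, {'given_name': 'Henry',
                           'email': 'heng.pan@frcteam.org',
                           'mobile': '555-555-5555'})
after_update = con.execute(check).fetchall()

# DO NOTHING: the conflicting row is left exactly as it was.
con.execute(nothing_query, {'given_name': 'Henry',
                            'email': 'heng.pan@frcteam.org',
                            'mobile': '999-999-9999'})
after_nothing = con.execute(check).fetchall()
con.commit()
con.close()

print(after_update)
print(after_nothing)
```

In both cases the table still holds a single record for the email address, and no `IntegrityError` is raised.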

### C. Exercise #7
Use an INSERT query to add a tool to the *Tools* table. Add an `ON CONFLICT` clause that will update the *description* column if there is a conflict with the *tool_number* column.

In [None]:
# Ex. #7



## IV. Aggregate Queries
We will be using the *Chinook* training database in this section. Since we'll be connecting to the database and displaying query results over and over, it makes sense to create a function that will handle the repetitive work.

In [None]:
def show_query(query, head=None, db=chinook_db):
    con = sqlite3.connect(db)
    if head is not None:
        display(pd.read_sql_query(query, con).head(head))
    else:
        display(pd.read_sql_query(query, con))
    con.close()

Suppose you took a summer job at the Chinook music store, and the general manager wanted to know from which countries the store receives the most orders. We could get the data from the *Invoice* table, which contains a record for each order.

In [None]:
# Invoice Table
query = """SELECT * FROM Invoice;"""
show_query(query, 6)

Unfortunately, the *Invoice* table doesn't directly indicate how many orders come from each country. We could filter the table to a single country with a `WHERE` clause and then count the records.

In [None]:
# Count the orders from a single country
query = """
    SELECT COUNT(*) AS "Germany Orders"
      FROM Invoice
     WHERE BillingCountry = 'Germany';
"""
show_query(query)

This works, but we would have to run the query multiple times, once for each country that occurs in the *BillingCountry* column. 

### A. Aggregate Queries to the Rescue
Fortunately SQL provides a solution that's easier than manually counting rows or running several queries: aggregate queries.

In [None]:
# An Aggregate Query
query = """
    SELECT BillingCountry, COUNT(*) AS Quantity
      FROM Invoice
  GROUP BY BillingCountry
  ORDER BY Quantity DESC;
"""
show_query(query, 10)

The preceding query gives us exactly what we want. For every country that appears in the *BillingCountry* column, the query counts and displays the number of invoices.

An aggregate query is any query that combines data from several rows into a single summary value. For example, the *Invoice* table contained 56 records representing invoices from Canada. The aggregate query *aggregated* the information from all 56 rows into a single row with the number of invoices just for Canada.

There are two characteristics that turned the query into an aggregate query.
* The `COUNT()` function is an aggregate function that aggregates records by counting them.
* The `GROUP BY BillingCountry` clause causes SQL to divide the records into separate groups based on the contents of the *BillingCountry* column. Consequently, the `COUNT()` function returns a separate count for each country.
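If you have used `collections.Counter` in Python, `GROUP BY` with `COUNT()` does much the same job. The sketch below compares the two on a handful of made-up invoices (not rows from the *Chinook* database).

```python
import sqlite3
from collections import Counter

# Made-up data: (BillingCountry, Total)
invoices = [
    ('Canada', 8.91), ('Germany', 1.98), ('Canada', 13.86),
    ('Brazil', 5.94), ('Canada', 0.99), ('Germany', 10.89),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Invoice(BillingCountry TEXT, Total REAL);")
con.executemany("INSERT INTO Invoice VALUES (?, ?);", invoices)
sql_counts = con.execute("""
    SELECT BillingCountry, COUNT(*) AS Quantity
      FROM Invoice
  GROUP BY BillingCountry
  ORDER BY Quantity DESC;
""").fetchall()
con.close()

# The same grouping and counting done in plain Python.
python_counts = Counter(country for country, total in invoices)

print(sql_counts)
print(python_counts.most_common())
```

Six invoice rows are aggregated into three result rows, one per country.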

Let's try removing the `GROUP BY` clause and see what happens.

In [None]:
# What happens if we remove GROUP BY?
query = """
    SELECT BillingCountry, COUNT(*) AS Quantity
      FROM Invoice
  ORDER BY Quantity DESC;
"""
show_query(query, 10)

Without the `GROUP BY` clause, the `COUNT()` function causes SQL to count all of the rows in the *Invoice* table and return just a single row. The *BillingCountry* column expression does not contain an aggregate function, so SQL arbitrarily picks a country and displays it in the *BillingCountry* results column.

What happens if we omit the `COUNT()` function and just include the `GROUP BY` clause?

In [None]:
# What happens if we remove COUNT()?
query = """
    SELECT BillingCountry
      FROM Invoice
  GROUP BY BillingCountry;
"""
show_query(query, 6)

Without the aggregate function, the result of the `GROUP BY` clause is very similar to using the `DISTINCT` keyword.

### B. Multiple Column `GROUP BY` Clauses
Suppose the general manager wanted to drill down further and see how many orders came from individual cities. We could get that information by adding another column to the `GROUP BY` clause.

In [None]:
# Multiple Columns in GROUP BY
query = """
    SELECT BillingCountry, BillingCity, COUNT(*) AS Quantity
      FROM Invoice
  GROUP BY BillingCountry, BillingCity
  ORDER BY BillingCountry;
"""
show_query(query, 10)

The `GROUP BY` clause can contain multiple columns. With multiple columns, SQL will return a row for every distinct combination of the two columns. Brazil, for example, has four cities that appear in the *Invoice* table, so the aggregate query generates four different rows for Brazil, one for each city.

### C. `WHERE` vs. `HAVING`
Suppose the general manager only wants to track orders of &dollar;10 or more. We can use a `WHERE` clause to filter out orders of less than &dollar;10.

In [None]:
# Using a WHERE clause in an aggregate query
query = """
    SELECT BillingCountry, COUNT(*) AS Quantity
      FROM Invoice
     WHERE Total >= 10
  GROUP BY BillingCountry
  ORDER BY Quantity DESC;
"""
show_query(query, 10)

The following query returns only countries with 25 or more orders.

In [None]:
# Using a HAVING clause
query = """
    SELECT BillingCountry, COUNT(*) AS Quantity
      FROM Invoice
  GROUP BY BillingCountry
  HAVING Quantity >= 25
  ORDER BY Quantity DESC;
"""
show_query(query)

Why did one query use `WHERE` and the other `HAVING`? The answer is that `WHERE` filters records *before* the aggregation takes place, and `HAVING` filters records *after* the aggregation.

So for the query that used `WHERE`:
1. The `WHERE` clause removed all rows from the *Invoice* table where *Total* was less than 10.
2. `GROUP BY` split the rows into different groups by *BillingCountry*.
3. `COUNT()` counted the number of rows in each group.

For the query that used `HAVING`:
1. `GROUP BY` split the rows into different groups by *BillingCountry*.
2. `COUNT()` aggregated the rows in each group into a single row containing the row count.
3. The `HAVING` clause eliminated all groups with fewer than 25 rows.
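To make the difference concrete, the sketch below runs a `WHERE` query and a `HAVING` query against the same six made-up invoices. The thresholds are scaled down to suit the tiny table.

```python
import sqlite3

# Made-up data: (BillingCountry, Total)
invoices = [
    ('Canada', 12.0), ('Canada', 15.0), ('Canada', 3.0),
    ('Germany', 11.0), ('Germany', 2.0),
    ('Brazil', 20.0),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Invoice(BillingCountry TEXT, Total REAL);")
con.executemany("INSERT INTO Invoice VALUES (?, ?);", invoices)

# WHERE removes small invoices first, then the survivors are counted.
where_rows = con.execute("""
    SELECT BillingCountry, COUNT(*) AS Quantity
      FROM Invoice
     WHERE Total >= 10
  GROUP BY BillingCountry
  ORDER BY Quantity DESC, BillingCountry;
""").fetchall()

# HAVING counts every invoice first, then removes small groups.
having_rows = con.execute("""
    SELECT BillingCountry, COUNT(*) AS Quantity
      FROM Invoice
  GROUP BY BillingCountry
    HAVING Quantity >= 2
  ORDER BY Quantity DESC, BillingCountry;
""").fetchall()
con.close()

print(where_rows)
print(having_rows)
```

Brazil appears in the `WHERE` results because its one invoice is large, but it is missing from the `HAVING` results because its group has only one row. Canada's count differs between the two queries because `WHERE` discarded its small invoice before counting.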

Note that the `WHERE` clause occurs *before* the `GROUP BY` clause, and the `HAVING` clause occurs afterwards. This order matches the logical order of operations. In fact, SQL will throw an error if you try to put `WHERE` after `GROUP BY` or `HAVING` before `GROUP BY`.

In [None]:
# Throws an error - WHERE must occur before GROUP BY
query = """
    SELECT BillingCountry, COUNT(*) AS Quantity
      FROM Invoice
  GROUP BY BillingCountry
  WHERE Total > 10
  ORDER BY Quantity DESC;
"""
show_query(query)

### D. More Aggregate Functions
SQL provides several aggregate functions, not just `COUNT()`. The next query uses three different aggregate functions.

In [None]:
# More Aggregate Functions
query = """
    SELECT BillingCountry, COUNT(*) AS Quantity,
           AVG(Total) AS "Average Order", MAX(Total) AS "Max Order"
      FROM Invoice
  GROUP BY BillingCountry
  HAVING Quantity >= 20
  ORDER BY Quantity DESC;
"""
show_query(query)

[The full list of SQLite aggregate functions is available in the official SQLite documentation.](https://www.sqlite.org/lang_aggfunc.html) `SUM()` and `MIN()` are two other common aggregate functions.
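Here is a brief sketch of `SUM()` and `MIN()` alongside `MAX()` and `AVG()`, again using a few made-up invoices so the results are easy to verify by hand.

```python
import sqlite3

# Made-up data: (BillingCountry, Total)
invoices = [
    ('Canada', 2.0), ('Canada', 4.0), ('Canada', 6.0),
    ('Germany', 1.0), ('Germany', 3.0),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Invoice(BillingCountry TEXT, Total REAL);")
con.executemany("INSERT INTO Invoice VALUES (?, ?);", invoices)

rows = con.execute("""
    SELECT BillingCountry,
           SUM(Total) AS "Total Sales",
           MIN(Total) AS "Min Order",
           MAX(Total) AS "Max Order",
           AVG(Total) AS "Average Order"
      FROM Invoice
  GROUP BY BillingCountry
  ORDER BY BillingCountry;
""").fetchall()
con.close()

for row in rows:
    print(row)
```

Each aggregate function is evaluated separately for every group, so a single query can return several summary values per country.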

### E. Joins in Aggregate Queries
Joins can be included in aggregate queries. The next query identifies the five biggest customers in terms of sales. 

In [None]:
# Joins in aggregate queries
query = """
    SELECT FirstName, LastName, Customer.CustomerId, SUM(Total) AS Total
      FROM Invoice
 LEFT JOIN Customer ON Invoice.CustomerId = Customer.CustomerId
  GROUP BY FirstName, LastName, Customer.CustomerId
  ORDER BY Total DESC
  LIMIT 5;
"""
show_query(query)

### F. Exercises 8 - 14

#### Ex. #8
Show the total sales by state. Your results should have three columns: country, state, and total.

In [None]:
# Ex #8.
query = """
  -- Write query in this string.
  
  
"""
show_query(query)

#### Ex. #9
Again, show the total sales by state. But this time, exclude orders where the *BillingState* column is NULL.

In [None]:
# Ex #9.
query = """
  -- Write query in this string.
  
  
"""
show_query(query)

#### Ex. #10
Did you use a `WHERE` or a `HAVING` clause in exercise #9? Could you have used the other type of clause? Why or why not?

In [None]:
# Ex. #10
#
#

#### Ex. #11
Which five companies provided the most sales? Include the company name, the number of sales, and the sales total in your results.

In [None]:
# Ex. #11
query = """
  -- Write query in this string.
  
  
"""
show_query(query)

#### Ex. #12
The query in section IV.E grouped by the *CustomerId* field. Was this required? What could have happened if we left the *CustomerId* field out of the query?

In [None]:
# Ex. #12
#
#

#### Ex. #13
Find the total sales attributed to each sales representative. Your results should include the last name and first name of the sales representative, as well as their sales total.

HINT: This requires a three-table join.

In [None]:
# Ex. #13
query = """
  -- Write query in this string.
  
  
"""
show_query(query)

#### Ex. #14
Using the *InvoiceLine* and other tables, determine the top three music genres in Canada by tracks sold.

In [None]:
# Ex. #14
query = """
  -- Write query in this string.
  
  
"""
show_query(query)

## V. Save Your Work
Once you have completed the exercises, save a copy of the notebook outside of the git repository (outside of the *pyclass_frc* folder). Include your name in the file name. Send the notebook file to another student to check your answers.

## VI. Concept and Terminology Review
You should be able to define the following terms or describe the concept. 
* `CREATE TABLE`
* `INTEGER`, `REAL`, `NUMERIC`, and `TEXT`
* `COLLATE NOCASE`
* `NOT NULL`
* `UNIQUE`
* `PRIMARY KEY`
* `.executemany()`
* `CONSTRAINT`
* `CREATE INDEX`
* `UPSERT`
* `ON CONFLICT DO UPDATE`
* `ON CONFLICT DO NOTHING`
* Aggregate Query
* Aggregate Function
* `GROUP BY`
* `COUNT()`
* `SUM()`
* `AVG()`
* `MAX()`
* `MIN()`
* `HAVING`

[Table of Contents](../../index.ipynb)