In [1]:
from IPython.display import Image

# SQL Part 3

# Plan for Today

* Review
* Table relationships
* JOIN clauses
* HAVING clauses

# Review

Review what SQL is, why we are learning it, the basic structure of a query, relational databases, tables:
* We are working with **structured data** 

* This data is stored in a **relational database** : a database that contains a series of tables that relate to each other and can be connected (how do we connect them?) to show relationships

* The anatomy of a **database table** :  

<img src="Anatomy of a Table.png" width="400" height="400" />

# Working with Multiple Tables

**Data Aggregation** : The process of gathering data from multiple sources in order to combine it into a single, summarized collection.

Because of the way a relational database is set up, data is often spread across multiple tables. In a well-organized database, a single table usually describes a single kind of thing, but the information that we want as analysts often comes from the relationships between tables. 

Why do they make tables that way? It is called **database normalization**, and it keeps you from having to change things in multiple places. For instance, if I had a single table containing all the students and their classes, then wanted to change a student’s name, I would have to change it multiple times, once for each class the student enrolled in. Normalization makes databases more efficient and helps ensure data quality, but the drawback is that the data becomes less human readable and is spread out across tables. JOINs are how we stitch the data back together. 
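
The student example above can be sketched with Python's built-in `sqlite3` module. The `students` and `enrollments` tables, their columns, and the sample rows are hypothetical, made up only for illustration:

```python
import sqlite3

# Hypothetical normalized schema: names live in one table, classes in another.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, student_name TEXT);
    CREATE TABLE enrollments (student_id INTEGER, class_name TEXT);
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO enrollments VALUES (1, 'Math'), (1, 'Art'), (1, 'History'), (2, 'Math');
""")

# The name is stored in exactly one row, so one UPDATE fixes it everywhere.
cursor = conn.execute("UPDATE students SET student_name = 'Anna' WHERE student_id = 1")
rows_changed = cursor.rowcount

# A JOIN stitches the corrected name back onto every enrollment.
rows = conn.execute("""
    SELECT students.student_name, enrollments.class_name
    FROM students
    INNER JOIN enrollments ON students.student_id = enrollments.student_id
    WHERE students.student_id = 1
    ORDER BY enrollments.class_name
""").fetchall()
print(rows_changed, rows)
```

Only one row changes even though the student is enrolled in three classes.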

<img src="Table Relationships.png" width="700" height="700" />

To get data from multiple tables, we have to use the JOIN clause.

<img src="https://c.tenor.com/5zNA2C94itMAAAAd/kitty-join-us.gif" width="300" height="300" />

You can only join tables that have attributes/columns in common.

# The Keys to It All 

<img src="https://bestanimations.com/media/keys/28804563key-animated-gif-3.gif#.Yqo8R2ZAC3Y.link" width="300" height="300" />

When you want to join tables, there must be something in common between the tables that you can join on! Otherwise you are basically concatenating them without regard to the relationships between the two. These columns with common values are usually the primary and foreign keys.
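
To see what joining without a shared column looks like, here is a small sketch using Python's built-in `sqlite3` module. The `customers` and `sales` tables and their rows are hypothetical sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO sales VALUES (10, 1, 5), (11, 1, 7), (12, 2, 3), (13, 3, 9);
""")

# No shared column used: every customer is paired with every sale (3 x 4 rows).
unrelated = conn.execute("SELECT * FROM customers CROSS JOIN sales").fetchall()

# Joining on the shared key keeps only the pairs that belong together.
related = conn.execute("""
    SELECT * FROM customers
    INNER JOIN sales ON customers.customer_id = sales.customer_id
""").fetchall()
print(len(unrelated), len(related))
```

The first query returns 12 rows, most of them meaningless pairings; the second returns the 4 rows where the customer really made the sale.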

**<span style="color:red">Jargon Alert</span>** : People often say that the columns that have values in common between tables are the columns you are 'joining on', because of the ON clause in the query.

# Primary Keys

**Primary Key** : The column that has a unique value for every record/row. Often this primary key is an 'ID' column that exists to guarantee that each row can be identified uniquely.

# Foreign Keys

**Foreign Key** : A column in a table that holds values from the primary key column of another table. These values are not necessarily unique for every row. A table can include multiple foreign keys.
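
A minimal sketch of both kinds of key, using Python's built-in `sqlite3` module. The `customers` and `sales` tables are hypothetical, and note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` (other systems typically enforce them by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT)")
conn.execute("""
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id)  -- foreign key
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
# The same foreign key value may repeat: one customer, many sales.
conn.execute("INSERT INTO sales VALUES (10, 1), (11, 1)")


def is_rejected(statement):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(statement)
        return False
    except sqlite3.IntegrityError:
        return True


duplicate_pk = is_rejected("INSERT INTO customers VALUES (1, 'Ben')")  # id 1 is taken
orphan_fk = is_rejected("INSERT INTO sales VALUES (12, 99)")  # there is no customer 99
print(duplicate_pk, orphan_fk)
```

Both inserts are refused: the first breaks primary key uniqueness, the second points at a customer that does not exist.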

# Together

In order to understand how the tables of a database work together, sometimes you will use a database diagram like this:

<img src="Database Diagram.png" width="500" height="500" />

In this diagram, the lines connecting tables indicate which table has the primary key (key side) and which table has the foreign key (infinity symbol). 

BookAuthor is an example of a table with a **composite key** : BookID is not unique, and AuthorID is not unique, but the combination of BookID and AuthorID is unique.
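
As a rough illustration of the composite key, the sketch below builds a `BookAuthor` table with Python's built-in `sqlite3` module. The table definition and the ID values are assumptions; the diagram may define the table differently:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE BookAuthor (
        BookID INTEGER,
        AuthorID INTEGER,
        PRIMARY KEY (BookID, AuthorID)  -- the pair must be unique
    )
""")
# BookID 1 repeats and AuthorID 7 repeats, but no pair repeats.
conn.execute("INSERT INTO BookAuthor VALUES (1, 7), (1, 8), (2, 7)")

try:
    conn.execute("INSERT INTO BookAuthor VALUES (1, 7)")  # this pair already exists
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM BookAuthor").fetchone()[0]
print(pair_rejected, row_count)
```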

# All the Joins

<img src="All Join Types.png" width="700" height="700" />

Image from *https://www.sqlshack.com/internals-of-physical-join-operators-nested-loops-join-hash-match-join-merge-join-in-sql-server/*

**<span style="color:orange">NOTE</span>** : the 'OUTER' keyword in LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN is optional 

# Inner Join

Gets only the records/rows where the columns you are joining on have the same value in both tables.

<img src="Inner Join.png" width="700" height="700" />

In [None]:
SELECT
    -- list the columns you want from both tables here, separated by commas
    table_name1.column_names,
    table_name2.column_names
FROM
    table_name1
INNER JOIN 
    table_name2
ON 
    -- indicate which columns have shared values that can be matched
    table_name1.column_name = table_name2.column_name
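
To try an inner join on real rows, here is a small sketch using Python's built-in `sqlite3` module. The `customers` and `sales` tables and their contents are hypothetical sample data: customer 3 has no sale, and sale 12 points at a customer who is not in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, sales_rep TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO sales VALUES (10, 1, 'Dee'), (11, 2, 'Eli'), (12, 4, 'Fay');
""")

# Only customer_id values present in BOTH tables (1 and 2) survive the join.
inner_rows = conn.execute("""
    SELECT customers.customer_name, sales.sales_rep
    FROM customers
    INNER JOIN sales
    ON customers.customer_id = sales.customer_id
    ORDER BY customers.customer_name
""").fetchall()
print(inner_rows)
```

Cy (no sale) and Fay's sale (no matching customer) are both dropped.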

# Full Join

Gets all of the rows from both tables, and matches them up when possible based on having the same value in the columns that were joined on. If a row has no match, the columns that would have come from the other table are filled with NULL.

<img src="Outer Join.png" width="700" height="700" />

In [None]:
SELECT
    -- list the columns you want from both tables here, separated by commas
    table_name1.column_names,
    table_name2.column_names
FROM
    table_name1
FULL JOIN 
    table_name2
ON 
    -- indicate which columns have shared values that can be matched
    table_name1.column_name = table_name2.column_name
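
The sketch below shows what a full join returns, using Python's built-in `sqlite3` module and hypothetical `customers` and `sales` tables. One caveat: SQLite only gained FULL JOIN in version 3.39, so to stay runnable on older installs this example does not use FULL JOIN itself. It emulates it with a LEFT JOIN plus the unmatched rows from the other table; in a system that supports it, a single FULL JOIN would replace the UNION ALL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, sales_rep TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO sales VALUES (10, 1, 'Dee'), (11, 2, 'Eli'), (12, 4, 'Fay');
""")

full_rows = conn.execute("""
    -- every customer, with sales columns NULL when there is no match
    SELECT customers.customer_name, sales.sales_rep
    FROM customers
    LEFT JOIN sales ON customers.customer_id = sales.customer_id
    UNION ALL
    -- plus the sales that matched no customer at all
    SELECT customers.customer_name, sales.sales_rep
    FROM sales
    LEFT JOIN customers ON customers.customer_id = sales.customer_id
    WHERE customers.customer_id IS NULL
""").fetchall()
print(full_rows)
```

Python shows SQL NULL as `None`, so Cy appears with no sales rep and Fay appears with no customer name.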

# Left Join

All of the records/rows from one table (the 'left' one) are returned, and then only the records from the other table (the 'right' one) that match up to rows on the first table are included. Left rows with no match get NULL in the columns from the right table.

<img src="Left Join.png" width="700" height="700" />

In [None]:
SELECT
    -- list the columns you want from both tables here, separated by commas
    table_name1.column_names,
    table_name2.column_names
FROM
    table_name1 -- the 'left' table
LEFT JOIN 
    table_name2 -- the 'right' table
ON 
    -- indicate which columns have shared values that can be matched
    table_name1.column_name = table_name2.column_name
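
A small runnable sketch of a left join, using Python's built-in `sqlite3` module. The `customers` and `sales` tables and their rows are hypothetical sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, sales_rep TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO sales VALUES (10, 1, 'Dee'), (11, 2, 'Eli'), (12, 4, 'Fay');
""")

# Every customer is kept; sales columns are NULL (None) when nothing matches.
left_rows = conn.execute("""
    SELECT customers.customer_name, sales.sales_rep
    FROM customers          -- the 'left' table
    LEFT JOIN sales         -- the 'right' table
    ON customers.customer_id = sales.customer_id
    ORDER BY customers.customer_name
""").fetchall()
print(left_rows)
```

Cy is kept even without a sale, while Fay's sale is dropped because it matches no row in the left table.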

# Right Join

All of the records/rows from the 'right' table are returned, and then only the records from the 'left' table that match up to rows on the 'right' table are included.

Rarely used - most of the time people just switch the order of the tables and stick with the left join. For instance the following two queries are equivalent:

In [None]:
SELECT
    -- list the columns you want from both tables here, separated by commas
    customers.customer_name,
    sales.sales_rep
FROM
    customers -- the 'left' table
LEFT JOIN 
    sales -- the 'right' table
ON 
    -- indicate which columns have shared values that can be matched
    customers.customer_id = sales.customer_id

In [None]:
SELECT
    -- same columns as the LEFT JOIN version above
    customers.customer_name,
    sales.sales_rep
FROM
    sales -- the 'left' table
RIGHT JOIN 
    customers -- the 'right' table
ON 
    -- indicate which columns have shared values that can be matched
    sales.customer_id = customers.customer_id

# HAVING vs. WHERE Clause

**WHERE** : used to filter based on an original column; applies the condition before the rows are grouped

**HAVING** : used to filter based on a value that is created with an aggregate function; applies the condition after the rows are grouped, so it is almost always used with a GROUP BY clause (without GROUP BY its behavior varies by RDBMS: the whole result may be treated as a single group, or the condition may act much like WHERE, so avoid doing that to prevent confusion)
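
The difference is easiest to see on the same data. This sketch uses Python's built-in `sqlite3` module with a hypothetical `sales` table; the threshold values are arbitrary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sales_rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        (1, 'Dee', 100), (2, 'Dee', 40),
        (3, 'Eli', 30),  (4, 'Eli', 20),
        (5, 'Fay', 60);
""")

# WHERE drops individual sales under 50 BEFORE the totals are computed.
where_rows = conn.execute("""
    SELECT sales_rep, SUM(amount) AS total
    FROM sales
    WHERE amount >= 50
    GROUP BY sales_rep
    ORDER BY sales_rep
""").fetchall()

# HAVING drops whole groups whose total is under 100 AFTER grouping.
having_rows = conn.execute("""
    SELECT sales_rep, SUM(amount) AS total
    FROM sales
    GROUP BY sales_rep
    HAVING SUM(amount) >= 100
    ORDER BY sales_rep
""").fetchall()
print(where_rows, having_rows)
```

With WHERE, Dee's total is 100 because her 40 sale was removed first; with HAVING, her total is the full 140 and the other reps are filtered out as groups.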

# All Together Now

The order of clauses when writing a SQL query:
1. SELECT
2. FROM
3. JOIN
4. ON
5. WHERE
6. GROUP BY
7. HAVING
8. ORDER BY
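
Here is one query that uses every clause in the order listed above, run through Python's built-in `sqlite3` module. The `customers` and `sales` tables, the sample rows, and the thresholds are all hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO sales VALUES
        (10, 1, 80), (11, 1, 40), (12, 2, 5),
        (13, 2, 200), (14, 3, 30), (15, 3, 8);
""")

totals = conn.execute("""
    SELECT customers.customer_name, SUM(sales.amount) AS total
    FROM customers
    INNER JOIN sales
    ON customers.customer_id = sales.customer_id
    WHERE sales.amount > 10              -- drop small individual sales
    GROUP BY customers.customer_name
    HAVING SUM(sales.amount) >= 100      -- drop customers with small totals
    ORDER BY total DESC
""").fetchall()
print(totals)
```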

# General Order of Execution

This is not necessarily guaranteed (it may depend on the RDBMS), but it is a good general rule of thumb.

Why do we care? Because it can help explain behavior and understand results (e.g. many systems do not allow SELECT aliases in the HAVING clause, because HAVING is evaluated before SELECT; MySQL is one exception)

1. FROM + JOIN - you have to know where it is coming from before you get it!
2. WHERE - filter early so that later functions are applied to as few records as possible
3. GROUP BY - need to know what the groups are before you apply any functions to it
4. HAVING - filter based on the results of the GROUP BY, so has to come after it
5. SELECT - what specifically you want from the data determined by the above clauses
6. DISTINCT - keep only the unique rows among the selected columns
7. ORDER BY - organize the data
8. TOP (LIMIT in some systems, e.g. MySQL) - keep only the first few rows
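
One consequence of this order can be checked directly: WHERE runs before GROUP BY, so it cannot use an aggregate, while HAVING can. The sketch below uses Python's built-in `sqlite3` module and a hypothetical `sales` table; the exact error differs between systems, and here SQLite reports it as an `OperationalError`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sales_rep TEXT);
    INSERT INTO sales VALUES (1, 'Dee'), (2, 'Dee'), (3, 'Eli');
""")

# WHERE runs before GROUP BY, so no group counts exist yet.
try:
    conn.execute("SELECT sales_rep FROM sales WHERE COUNT(*) > 1 GROUP BY sales_rep")
    where_accepts_aggregate = True
except sqlite3.OperationalError:
    where_accepts_aggregate = False

# HAVING runs after GROUP BY, so the same condition is valid here.
having_rows = conn.execute(
    "SELECT sales_rep FROM sales GROUP BY sales_rep HAVING COUNT(*) > 1"
).fetchall()
print(where_accepts_aggregate, having_rows)
```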