<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510**
# **Lesson 3: Advanced SELECT Statements** 
_Retrieving data from multiple tables._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- Set algebra as implemented by SQL SELECT queries
- The use of implicit joins, explicit joins, and subqueries
- Variations of the SQL JOIN operator (natural, equijoin, theta join)
- Subqueries as SQL expressions 

### **Skills / Know how to ...**
- Use JOIN operators to connect data from multiple tables
- Write subqueries for common use cases where JOIN is insufficient
- Use `WITH` statements to simplify complex queries with subqueries

--------
## **LESSON 3 HIGHLIGHTS**

In [None]:
#@title Run this cell if the video does not appear TODO REPLACE WITH NEW VIDEO
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/YkDLv6CtEnc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

### **Run this boilerplate code before continuing on.** 
 

In [None]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

# Install the Python to MySQL DBI connector
!pip install pymysql

%sql mysql+pymysql://buan6510student:buan6510@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016

**Rerun this code as needed to get and keep your software and database connection fresh. Also, note that we are again using the baseball database from Lesson 2, where you will find a diagram with tables and relationships.**  

---
## **BIG PICTURE: The Need for Data Integration**

 ---
## **Multi-Table `SELECT` Queries**

In this lesson we will explore three different ways to combine data from multiple tables in SQL:
- Implicit joins that use relational algebra
- Explicit joins that use the JOIN operator
- Subqueries that nest *inside* other queries

---
## **Implicit Joins**

Implicit joins do not have the `JOIN` keyword in them. That the code is performing a join is *implied*. 

The format of an implicit join is:
```
SELECT *
FROM TableA, TableB
WHERE TableA.columnX = TableB.columnX
```  

It looks pretty simple:
- List two tables in the `FROM` clause.
- In the `WHERE` clause set rules that match columns from one table with columns in the other table.
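To make the pattern concrete, here is a minimal sketch that runs an implicit join from Python. It assumes an in-memory SQLite database rather than the course MySQL server, and the `Players` and `Clubs` tables (with their rows) are hypothetical toy data, not part of the baseball database.

```python
import sqlite3

# Assumption: a throwaway in-memory SQLite database with hypothetical toy tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Players (playerID TEXT, nameLast TEXT, teamID TEXT);
    CREATE TABLE Clubs   (teamID TEXT, clubName TEXT);
    INSERT INTO Players VALUES ('p1', 'Aaron', 'ATL'),
                               ('p2', 'Ruth',  'NYA'),
                               ('p3', 'Mays',  'SFN');
    INSERT INTO Clubs   VALUES ('ATL', 'Braves'),
                               ('NYA', 'Yankees');
""")

# Implicit join: two tables in FROM, the matching rule in WHERE.
implicit_rows = conn.execute("""
    SELECT Players.nameLast, Clubs.clubName
    FROM Players, Clubs
    WHERE Players.teamID = Clubs.teamID
    ORDER BY Players.nameLast
""").fetchall()

print(implicit_rows)  # [('Aaron', 'Braves'), ('Ruth', 'Yankees')]
```

Mays does not appear in the result because no row in `Clubs` has a matching `teamID`; only row pairs that satisfy the `WHERE` rule survive.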

In fact, in the earliest versions of SQL, implicit joins were the only way to merge data from multiple tables. However, they can cause serious problems if the `WHERE` clause is wrong or missing. 

The issue is not so much with the `WHERE` clause as with the `FROM` clause. Consider the following query, which omits the `WHERE` clause entirely: 
```
SELECT *
FROM TableA, TableB
```
This is a so-called **cross join**, which we will revisit in detail in Lesson 4. A cross join matches each row in the first table (`TableA`) with each row in the second table (`TableB`). The total number of rows in the result is then given by the product of the row counts for the two tables. 
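The row-count rule is easy to check on a small scale. The sketch below assumes an in-memory SQLite database and two hypothetical tables with 3 and 4 rows; it is only meant to illustrate the product rule, not to mirror the course database.

```python
import sqlite3

# Assumption: tiny hypothetical tables in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (columnX INTEGER);
    CREATE TABLE TableB (columnX INTEGER);
    INSERT INTO TableA VALUES (1), (2), (3);
    INSERT INTO TableB VALUES (1), (2), (3), (4);
""")

# Row counts for each table on its own.
rows_a = conn.execute("SELECT count(*) FROM TableA").fetchone()[0]
rows_b = conn.execute("SELECT count(*) FROM TableB").fetchone()[0]

# Row count for the cross join (no WHERE clause at all).
cross_rows = conn.execute("SELECT count(*) FROM TableA, TableB").fetchone()[0]

print(rows_a, rows_b, cross_rows)  # 3 4 12
```

Every row of `TableA` is paired with every row of `TableB`, so the cross join has 3 x 4 = 12 rows.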

Let's try this ourselves. The following code does a cross join of two of the smaller tables in our baseball database:
```
SELECT nameLast, teamid
FROM Master, Teams     -- note: draws rows from two tables
```



We can easily determine the number of rows in each of the two tables: 

In [16]:
%%sql
-- How many players are there?
SELECT count(nameLast) 
FROM Master;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


count(nameLast)
19105


In [17]:
%%sql
-- How many teams are there?
SELECT count(*)
FROM Teams;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


count(*)
2835


The total number of rows in the cross join would then be 19105 x 2835 = ...

In [18]:
%%sql
SELECT 19105 * 2835;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
1 rows affected.


19105 * 2835
54162675


That's a little over 54 million rows. And that was with two modestly sized tables! Imagine if one of the tables had a million rows. The query would take virtually forever to complete. (Actually, the query would eventually die, having exhausted all storage and possibly locking you out, but who really wants that?)

While you may be thinking "But I would never forget the `WHERE` clause. I'm not that careless!" it is better never to take the chance of accidentally crashing a server due to a SQL bug. 

**So, while we include implicit joins here for completeness, they are inherently dangerous and to be avoided whenever possible.**

---
## **Explicit Joins**

Explicit joins use the `JOIN` operator to merge tables. They were added to SQL many years ago to lessen the risk and potentially improve the speed of performing table joins. 

There are several kinds of explicit joins, but the most common form is:
```
SELECT * 
FROM TableA JOIN TableB ON (TableA.columnX = TableB.columnX)
``` 

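You'll notice that it includes the same basic information as the implicit join (two tables plus a join condition that must be met), except that the join condition now sits right next to the tables it connects, which makes it much harder to forget. The parentheses are optional, but they make the condition easy to spot. In standard SQL, leaving out the `ON` clause is a syntax error. Be aware, though, that some databases (MySQL among them) accept a bare `JOIN` with no condition and quietly treat it as a cross join, so always write the condition. 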
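As a quick check that the explicit form carries the same information, the following sketch runs the query both ways and compares the results. It assumes an in-memory SQLite database; the `labelA` and `labelB` columns and all rows are hypothetical additions for illustration.

```python
import sqlite3

# Assumption: hypothetical toy tables in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (columnX INTEGER, labelA TEXT);
    CREATE TABLE TableB (columnX INTEGER, labelB TEXT);
    INSERT INTO TableA VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO TableB VALUES (2, 'b2'), (3, 'b3'), (4, 'b4');
""")

# Explicit join: the join condition lives in the ON clause.
explicit_rows = conn.execute("""
    SELECT TableA.labelA, TableB.labelB
    FROM TableA JOIN TableB ON (TableA.columnX = TableB.columnX)
    ORDER BY TableA.columnX
""").fetchall()

# Implicit join: the same condition, moved to the WHERE clause.
implicit_rows = conn.execute("""
    SELECT TableA.labelA, TableB.labelB
    FROM TableA, TableB
    WHERE TableA.columnX = TableB.columnX
    ORDER BY TableA.columnX
""").fetchall()

print(explicit_rows)                   # [('a2', 'b2'), ('a3', 'b3')]
print(explicit_rows == implicit_rows)  # True
```

Both queries return the same rows; the difference is where the join condition is written and how easy it is to leave out by accident.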

We'll look at all of the various types of explicit joins a little later in the lesson. 


### **Natural Joins, Equijoins, and Theta Joins**

DON'T FORGET TO TALK ABOUT SURROGATE KEYS

### **`JOIN ... USING (...)`**



### **`JOIN ... ON`**



### **`INNER JOIN`, `LEFT JOIN`, `RIGHT JOIN`, and `OUTER JOIN`**

---
## **Subqueries**

A subquery is an entire `SELECT` query used as an expression inside another query. To convert any query into a query expression, just wrap it in parentheses like this:   
```(SELECT nameLast FROM Master)```  
For short queries it is okay to leave everything on one line, but for longer queries it is better to start each clause on a new line:
```
( SELECT nameLast
  FROM Master)
```
Notice how the clauses of the subquery are left-aligned (via spaces) so that they form a solid vertical line on the left when you read them. That makes it easier to tell where a subquery starts and ends. Anything with the same indentation is in the same subquery. If we embed a subquery inside another subquery (making the queries three deep), then we indent a little more to the right to keep the alignment clean. 
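Here is one possible illustration of a parenthesized query used as an expression, with the indentation style described above. It assumes an in-memory SQLite database and a hypothetical `Scores` table; the subquery computes a single value (the average) that the outer query compares against.

```python
import sqlite3

# Assumption: a hypothetical Scores table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Scores (playerID TEXT, runs INTEGER);
    INSERT INTO Scores VALUES ('p1', 10), ('p2', 30), ('p3', 50);
""")

# The parenthesized SELECT is a subquery used as an expression.
# Its clauses share one indentation level, so it reads as a single unit.
above_average = conn.execute("""
    SELECT playerID, runs
    FROM Scores
    WHERE runs > ( SELECT avg(runs)
                   FROM Scores )
    ORDER BY playerID
""").fetchall()

print(above_average)  # [('p3', 50)]
```

The average of 10, 30, and 50 is 30, so only the row with 50 runs passes the strict `>` comparison.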

### **Subqueries in the `SELECT` Clause**

### **Subqueries in the `FROM` Clause**

### **Subqueries in the `WHERE` Clause**

### **Subqueries in the `WITH` Clause**

### **A Note about *Correlated* Subqueries**






---
## **Usage: Joins vs Subqueries vs Views**

---
## **PRO TIPS: How to write queries correctly the first time, every time**

---
## **SQL AND BEYOND: Google BigQuery**


## **Congratulations! You've made it to the end of Lesson 3.**

You now know pretty much everything you need to know about `SELECT` queries. If there is anything else you need to know, then at least you have a solid foundation on which to build. 

Quiz 2 will test your understanding of the relevant theory and your ability to write short `SELECT` queries *without the ability to run them in Jupyter*.