<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510**
# **Lesson 2: Basic SELECT Statements** 
_Retrieving data from a single table._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- The sequence and purpose of each clause of SQL select queries (SELECT, FROM, WHERE, GROUP BY, etc.)
- How SELECT queries compare to similar operations in Excel or Pandas 

### **Skills / Know how to ...**
- Write basic SQL SELECT / FROM / WHERE queries
- Apply functions and conditionals where required
- Calculate aggregate quantities like AVG, SUM, etc.
- Group records using column selectors

--------
## **LESSON 2 HIGHLIGHTS**

In [None]:
#@title Run this cell if the video does not appear (TODO: replace with new video)
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/YkDLv6CtEnc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

## **BIG PICTURE: SQL as a universal data access language** 

Among programming skills, coding in SQL is just not like the others. Consider, for example, the January 2021 *Tiobe Index* of popular programming languages, where SQL is listed as #12. Every other language on the list can be used to build apps and systems. We call them general purpose languages. SQL, however, is explicitly a special purpose language, designed for data management and only data management. Of the others on the list the most similar in that respect is R. Though it is certainly possible to create apps in R, the focus is on data analysis and only data analysis. Everything else is strictly general purpose.  So, if SQL is so out of step with the rest of the programming language universe, why is it considered a critical data analytics skill? Because it does its job extremely well, of course, but there is more to it than that. 

![Tiobe Index](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L2_tiobe_index.png)

SQL is an ancient programming language. Of the others on the list, the only one that could be considered a contemporary is C. (Assembly Language is technically not a single language but a kind of language, so it does not have an age *per se*.) Both C and SQL were developed and first released in the early 1970s. The others came along years or even *decades* later. By then C and SQL were already ubiquitous, in use by programmers around the world. 

Age and "first mover advantage" are not all of the SQL story though. Recall the three-tiered model from Lesson 1? Each of the tiers is treated by the others as a black box, with internal implementation details hidden behind a *public interface* (often called an API). As long as a given tier can handle certain requests, phrased in such and such a way and producing results in a specified format, then nobody has to care *how* it works. That is extremely powerful! SQL happens to provide just such an interface. For basic CRUD operations in the data tier, SQL provides a complete and standard set of requests and responses that any application or system can rely on. Further, if a better implementation of the standards embodied by SQL comes along, we can upgrade to the newer technology with little or no change to existing code. 

So, when asked "Why SQL?" the best answer is "Why not SQL?" because that is the right way to think of it. SQL works well for a great many applications supported by millions of programmers. Thus, the burden of proof for an application programmer to disqualify SQL is to show that SQL can't get the job done and then propose something that does. Even then, the smart boss will instinctively ask for the same data integrity protections provided by SQL. In the end, using anything else is like swimming upstream, where relatively gentle currents on a clear day can become deadly in a storm. In other words, never assume that what works for local storage on an isolated iPhone will work for the cloud servers it connects with. The volume of data and the likelihood of corruption are just too high. Some companies, usually startups, that made the wrong choice are having to live with their decisions and ... scaling up to SQL compliance. Why not start there from the beginning? 

## **SQL History, Standards, and Use Cases**

Structured Query Language (SQL) was developed by researchers at IBM in the early 1970s. The original name was SEQUEL, which is how many of us pronounce SQL to this very day. Unlike human languages, programming languages have version numbers that track the evolution of the language over time. For many years, SQL was whatever IBM defined it to be, but in 1986 it was finally standardized by the American National Standards Institute (ANSI). The best-known early revision is [SQL-92](https://en.wikipedia.org/wiki/SQL-92), published in 1992. There have been newer versions over the many years since (e.g., SQL:2008 and SQL:2011) that extend the original SQL standards to include things like XML data and object-oriented concepts. We will come back to these newer features in the latter portions of this course.  

SQL is not a full-featured language like C, C#, Java, Python, etc. Instead, it is considered a *data sublanguage* for creating and processing *relational* data and metadata. As discussed above, for this purpose it is ubiquitous, with support in just about any programming language on earth. 

While we think of SQL as a language, the SQL *standard* defines five different kinds of language, each of which represents a different *use case*:
- Data Definition Language (DDL)
- Data Manipulation Language (DML)
- SQL/Persistent Stored Modules (SQL/PSM)
- Transaction Control Language (TCL)
- Data Control Language (DCL)

We are going to focus on DML and DDL statements. The rest are mostly for SQL DB Administrators/Engineers. However, it is good to know what they are when the engineers bring them up in a meeting. 

The difference between data definition and data manipulation lies in what kinds of data are being addressed. DDL is used to create, describe, modify, or discard *metadata*: 
- Tables, columns, etc. 
- Relationships, keys, etc. 

DML is used to retrieve, add, update, or delete *data*:
- SELECT statements retrieve data from tables
- INSERT, UPDATE, and DELETE statements manage the data in the tables
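To make the DDL/DML split concrete, here is a minimal sketch using Python's built-in `sqlite3` module with a throwaway in-memory database. The `players` table and its columns are invented for illustration and are not part of the Lahman database; only the two Aaron rows borrow values that appear later in this lesson.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# DDL: define metadata (a hypothetical table and its columns).
conn.execute(
    "CREATE TABLE players (player_id TEXT PRIMARY KEY, name TEXT, debut_year INTEGER)"
)

# DML: add, change, remove, and retrieve the data itself.
conn.execute("INSERT INTO players VALUES ('aaronha01', 'Hank Aaron', 1954)")
conn.execute("INSERT INTO players VALUES ('aaronto01', 'Tommie Aaron', 1962)")
conn.execute("UPDATE players SET name = 'Henry Aaron' WHERE player_id = 'aaronha01'")
conn.execute("DELETE FROM players WHERE player_id = 'aaronto01'")
rows = conn.execute("SELECT player_id, name, debut_year FROM players").fetchall()
print(rows)

# DDL again: discard the metadata (and the data along with it).
conn.execute("DROP TABLE players")
conn.close()
```

Notice that `CREATE TABLE` and `DROP TABLE` act on the structure of the database, while the other four statements act only on the rows stored inside that structure.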

In addition to DDL and DML, we will also consider a few relevant SQL/PSM, TCL, and DCL statements but always with an eye toward how they interact with DDL and DML. You're here for analytics, not engineering, after all. 

---


## **Preliminaries** 

In this class we will try to work with live data whenever possible. That means each lesson will start with:
- Boilerplate code to link in any needed software.
- Connection to the live database. If it is a database that we have not seen before then we will also take a look at the data model before running any queries. 

### **Software Prep** 

Run the cell below, which loads all the software we'll need for the rest of the lesson. 

In [7]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

# Install the Python to MySQL DBI connector
!pip install pymysql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


### **The Database**
We'll be working with the Lahman 2016 dataset, which has baseball statistics for every Major League player since ... forever. It goes all the way back to the very beginning! 

The data model follows the so-called Snowflake Schema pattern we'll explore in more detail in Lesson 10. The database is huge, with a couple dozen tables. For now we will have to content ourselves with the simplified Entity Relationship Diagram below. As the name implies, the `Master` table is of central importance. It represents the players, with a smattering of other personnel (coaches, owners, radio/TV announcers) mixed in for good measure. 

![Lahman 2016 ERD](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L2_baseball_stats_schema.png)

The database itself is located in the cloud. Run the cell below to open a connection. 

In [9]:
%sql mysql+pymysql://buan6510student:buan6510@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016

'Connected: buan6510student@lahman2016'

Take note of the connection string, which follows the database URL format used by SQLAlchemy, the library behind the `%sql` magic. The general format is 

`protocol :// username : password @ server / database`

Let's take this one apart:
- `protocol` $\rightarrow$ `mysql+pymysql`  (`mysql` DBMS using the just installed `pymysql` connector) 
- `username` $\rightarrow$ `buan6510student` (a dummy account with read-only access to the database)
- `password` $\rightarrow$ `buan6510` 
- `server` $\rightarrow$ `database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com` (notice the `rds.amazonaws` in the URL)
- `database` $\rightarrow$ `lahman2016`
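The same breakdown can be reproduced in code. The sketch below uses `urlsplit` from Python's standard library, which understands this URL layout well enough for illustration; it is not how the `%sql` magic parses the string internally, and the dictionary keys simply mirror the labels in the list above.

```python
from urllib.parse import urlsplit

url = (
    "mysql+pymysql://buan6510student:buan6510"
    "@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016"
)
parts = urlsplit(url)

# Map the generic URL pieces onto the labels used in this lesson.
connection_parts = {
    "protocol": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "server": parts.hostname,
    "database": parts.path.lstrip("/"),  # drop the leading slash
}
for name, value in connection_parts.items():
    print(f"{name:>8} -> {value}")
```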

The database resides in a virtualized RDS server hosted by Amazon Web Services. The DBMS is MySQL, a very popular choice explored more in the **SQL AND BEYOND** segment at the end of the lesson. 

Heads up: Normally, one would not disclose account credentials in this way, but since we are using a read-only account on a public dataset the risk is minimal. In a production setting we would take great pains to store database credentials in a separate file in a non-internet accessible location.
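One common way to keep credentials out of a notebook is to read them from environment variables at run time. The sketch below is only an illustration of that idea: the variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_NAME`) and the helper function are hypothetical, and the demonstration uses placeholder values rather than real credentials.

```python
import os
from urllib.parse import quote_plus


def build_connection_url(env=os.environ):
    """Assemble a connection string from environment-style settings.

    The variable names used here are assumptions made for this sketch.
    """
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])  # escape characters like '@' or ':'
    host = env["DB_HOST"]
    database = env["DB_NAME"]
    return f"mysql+pymysql://{user}:{password}@{host}/{database}"


# Demonstration with placeholder values instead of the real environment.
example_env = {
    "DB_USER": "student",
    "DB_PASSWORD": "secret",
    "DB_HOST": "db.example.com",
    "DB_NAME": "lahman2016",
}
example_url = build_connection_url(example_env)
print(example_url)
```

With the variables set outside the notebook, the credentials never appear in the saved file or its output cells.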

**Now that we have everything set up we can continue on to learning about SQL SELECT statements.**

---


## **SQL SELECT Statements ... one clause at a time**

By far the most common use case for SQL is retrieving data from a relational database. For many of you, learning how to do just that is why you are taking this course. 

The good news is that we only need to consider one kind of SQL statement to do it. Every data retrieval query has the following structure:
```
SELECT columns
FROM tables
WHERE row-conditions
GROUP BY columns
HAVING group-conditions
ORDER BY columns
LIMIT max-rows;
```
Each of the words in CAPITALS is a SQL keyword that introduces a *clause* of the statement. The clauses are always given in exactly this order, regardless of what the data is used for. The only clause that is strictly required is `SELECT`; all of the others are optional, added on when needed. Most of these have been in SQL since the beginning (`LIMIT` is a widely supported extension rather than part of the original standard), and they will be covered one at a time below. In Lesson 3 we will also explore a more recent addition, the `WITH` clause, that simplifies certain complex queries that draw from multiple tables. 

Heads up: the closing semi-colon `;` is technically required at the end of each SQL statement (query), regardless of the number of clauses. However, if there is only one query in a given cell then we can safely omit it. That said, you might as well get used to typing it each time.
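As a preview of how the clauses fit together, here is a small sketch that runs one query containing all seven clauses against an in-memory SQLite database. The `batting` table, its columns, and every row in it are made-up toy data chosen only so that each clause visibly does something; none of it comes from the Lahman database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batting (player TEXT, team TEXT, year INTEGER, hr INTEGER)")
conn.executemany(
    "INSERT INTO batting VALUES (?, ?, ?, ?)",
    [
        ("p1", "A", 2014, 40),  # removed by WHERE (too early)
        ("p1", "A", 2015, 20),
        ("p2", "A", 2016, 15),
        ("p3", "B", 2015, 10),
        ("p4", "B", 2016, 5),   # team B totals 15, removed by HAVING
        ("p5", "C", 2015, 25),
        ("p6", "C", 2016, 30),
        ("p7", "D", 2016, 31),  # team D survives HAVING but is cut by LIMIT
    ],
)

query = """
SELECT team, SUM(hr) AS total_hr
FROM batting
WHERE year >= 2015
GROUP BY team
HAVING SUM(hr) >= 30
ORDER BY total_hr DESC
LIMIT 2;
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

Reading the query top to bottom: `WHERE` filters individual rows, `GROUP BY` collapses the survivors into one row per team, `HAVING` filters those groups, `ORDER BY` sorts what is left, and `LIMIT` keeps only the first two rows.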

### **The `SELECT` Clause**

```SELECT columns```

The `SELECT` clause is used to indicate which data columns we want. The columns are always provided as a comma-separated list. 

While the columns will *almost* always be selected from tables, we can actually use `SELECT` on its own as an expression calculator:

In [28]:
%%sql
SELECT 1+1, 2*10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
   mysql+pymysql://dealsuser:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/deals
1 rows affected.


1+1,2*10
2,20


The result is a tabular *resultset*; when it has just one row and one column it can be treated as a scalar value. If we like, we can give the columns names using the `AS` modifier. We'll look at some more advanced uses of `AS` later in this lesson.  

In [29]:
%%sql
SELECT 1+1 AS one, 2*10 AS two;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
   mysql+pymysql://dealsuser:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/deals
1 rows affected.


one,two
2,20


### **The `FROM` Clause**
If we want to use data from tables then we'll need to specify them in the `FROM` clause. The cell below uses the Lahman2016 database of career baseball statistics, where the `Master` table contains a list of every Major League Player since ... forever. (Yes, Tommie Aaron was Hank Aaron's brother. Baseball talent tends to run in families.)


In [13]:
%%sql buan6510student@lahman2016
SELECT nameFirst, nameLast
FROM Master
LIMIT 10;

10 rows affected.


nameFirst,nameLast
David,Aardsma
Hank,Aaron
Tommie,Aaron
Don,Aase
Andy,Abad
Fernando,Abad
John,Abadie
Ed,Abbaticchio
Bert,Abbey
Charlie,Abbey


The connection identifier after `%%sql` was created when we first connected to the database. We only need to use it when switching which database we want to work with. We also used a `LIMIT` clause to restrict the number of rows returned. There have been a lot of MLB players over the years, too many to scroll through in this lesson. 

Heads up: Table names are case sensitive on this MySQL server (as they are by default when MySQL runs on Linux), so the following results in an error:


In [15]:
%%sql
SELECT nameFirst, nameLast
FROM master
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
   mysql+pymysql://dealsuser:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/deals
(pymysql.err.ProgrammingError) (1146, "Table 'lahman2016.master' doesn't exist")
[SQL: SELECT nameFirst, nameLast
FROM master
LIMIT 10]
(Background on this error at: http://sqlalche.me/e/13/f405)


If we want to select all columns from the `Master` table, we use the `*` wildcard:


In [16]:
%%sql
SELECT *
FROM Master
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
   mysql+pymysql://dealsuser:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/deals
10 rows affected.


playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215,75,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
aaronto01,1939,8,5,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01
abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01
abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2016-09-25,abadf001,abadfe01
abadijo01,1850,11,4,USA,PA,Philadelphia,1905.0,5.0,17.0,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01
abbated01,1877,4,15,USA,PA,Latrobe,1957.0,1.0,6.0,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01
abbeybe01,1869,11,11,USA,VT,Essex,1962.0,6.0,11.0,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01
abbeych01,1866,10,14,USA,NE,Falls City,1926.0,4.0,27.0,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01


Generally, **we should not rely on the column order** when using a wildcard. The columns come back in whatever order the table definition lists them, which may not be the order we want and can change if the table is ever restructured. If we want the columns in a particular order, then we need to list them that way in the `SELECT` clause. 

We can also use the wildcard to count the number of rows in the table.

In [27]:
%%sql
SELECT COUNT(*)
FROM Master;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
   mysql+pymysql://dealsuser:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/deals
1 rows affected.


COUNT(*)
19105


The `COUNT()` function does exactly what it appears to do. It counts the number of rows. In this case we are counting entire rows, but we can also just count the number of unique values in a given column.

In [None]:
%%sql
SELECT COUNT(DISTINCT nameLast)
FROM Master; 
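The difference between counting rows and counting unique values is easy to check on a tiny table. The sketch below recreates a five-row stand-in for `Master` in an in-memory SQLite database, using the first few names from the listing above; only the two name columns are included, which is a simplification made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Master (nameFirst TEXT, nameLast TEXT)")
conn.executemany(
    "INSERT INTO Master VALUES (?, ?)",
    [
        ("Hank", "Aaron"),
        ("Tommie", "Aaron"),
        ("Don", "Aase"),
        ("Andy", "Abad"),
        ("Fernando", "Abad"),
    ],
)

# COUNT(*) counts rows; COUNT(DISTINCT col) counts unique non-NULL values.
total_rows = conn.execute("SELECT COUNT(*) FROM Master").fetchone()[0]
unique_last_names = conn.execute(
    "SELECT COUNT(DISTINCT nameLast) FROM Master"
).fetchone()[0]
print(total_rows, unique_last_names)
conn.close()
```

Five people share only three last names, so the two counts differ.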

As we shall see in Lesson 3, **we can retrieve data from multiple tables using a `JOIN` operator.** The query below produces a list of people in the National Baseball Hall of Fame (a.k.a. "Cooperstown").

In [22]:
%%sql
SELECT nameFirst, nameLast
FROM HallOfFame
    JOIN Master USING (playerID)
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
   mysql+pymysql://dealsuser:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/deals
10 rows affected.


nameFirst,nameLast
Hank,Aaron
Jim,Abbott
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams
Babe,Adams


Oops. The query was supposed to list 10 people. Instead it lists just three, with Babe Adams listed multiple times. Actually, that's by design. We can see what's going on by including the `yearid` column from the `HallOfFame` table. 



In [23]:
%%sql
SELECT nameFirst, nameLast, yearid
FROM HallOfFame
    JOIN Master USING (playerID)
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
   mysql+pymysql://dealsuser:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/deals
10 rows affected.


nameFirst,nameLast,yearid
Hank,Aaron,1982
Jim,Abbott,2005
Babe,Adams,1937
Babe,Adams,1938
Babe,Adams,1939
Babe,Adams,1942
Babe,Adams,1945
Babe,Adams,1946
Babe,Adams,1947
Babe,Adams,1948


The `HallOfFame` table lists a player each year the player's induction was considered. An all time great player like Hank Aaron will be considered just once and immediately inducted. Most are not so lucky. While Jim Abbott was inducted into the College Baseball Hall of Fame, he was considered only once (in 2005) for his professional career and did not meet the standard for further consideration. Babe Adams was considered repeatedly in the 1930s and 1940s and is eligible for consideration again in 2021. 

**If we want to see each name just once, we can use the `DISTINCT` modifier:**

In [24]:
%%sql
SELECT DISTINCT nameFirst, nameLast
FROM HallOfFame
    JOIN Master USING (playerID)
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
   mysql+pymysql://dealsuser:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/deals
10 rows affected.


nameFirst,nameLast
Hank,Aaron
Jim,Abbott
Babe,Adams
Bobby,Adams
Sparky,Adams
Tommie,Agee
Rick,Aguilera
Jack,Aker
Doyle,Alexander
Pete,Alexander


That's better. That it includes people not actually enshrined in the Hall of Fame is a bug that we will fix in a moment. We can (and will) deal with the presence of multiple years per player in the `GROUP BY` section. 
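The repeated names and the effect of `DISTINCT` can be reproduced on a miniature scale. The sketch below builds two tiny stand-in tables in an in-memory SQLite database. The names and years are taken from the output above, but the tables are cut down to a few columns, and the `playerID` values for Abbott and Adams are placeholders invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Master (playerID TEXT, nameFirst TEXT, nameLast TEXT)")
conn.execute("CREATE TABLE HallOfFame (playerID TEXT, yearid INTEGER)")
conn.executemany(
    "INSERT INTO Master VALUES (?, ?, ?)",
    [
        ("aaronha01", "Hank", "Aaron"),
        ("id_abbott", "Jim", "Abbott"),  # placeholder ID
        ("id_adams", "Babe", "Adams"),   # placeholder ID
    ],
)
# One HallOfFame row per year in which a player was considered.
conn.executemany(
    "INSERT INTO HallOfFame VALUES (?, ?)",
    [
        ("aaronha01", 1982),
        ("id_abbott", 2005),
        ("id_adams", 1937),
        ("id_adams", 1938),
        ("id_adams", 1939),
    ],
)

joined = conn.execute(
    "SELECT nameFirst, nameLast FROM HallOfFame JOIN Master USING (playerID)"
).fetchall()
unique_names = conn.execute(
    "SELECT DISTINCT nameFirst, nameLast FROM HallOfFame JOIN Master USING (playerID)"
).fetchall()
print(len(joined), "joined rows;", len(unique_names), "distinct names")
conn.close()
```

The join yields one row per `HallOfFame` entry, so Babe Adams appears three times; `DISTINCT` then collapses identical name pairs into one.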

### **The `WHERE` Clause**

The `WHERE` clause is used to place conditions (restrictions) on which rows we want from the specified tables. SQL conditions are phrased as "boolean expressions" that resolve to either True or False. The following query returns only players (i.e., not coaches, media, etc.) that were ultimately inducted. 



In [30]:
%%sql 
SELECT nameFirst, nameLast, yearID as induction_year
FROM HallOfFame
    JOIN Master USING (playerID)
WHERE inducted='Y' and category='Player'
LIMIT 10;

 * mysql+pymysql://buan6510student:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/lahman2016
   mysql+pymysql://dealsuser:***@database-01202.c55qjoeogr2p.us-east-2.rds.amazonaws.com/deals
10 rows affected.


nameFirst,nameLast,induction_year
Hank,Aaron,1982
Pete,Alexander,1938
Roberto,Alomar,2011
Cap,Anson,1939
Luis,Aparicio,1984
Luke,Appling,1964
Richie,Ashburn,1995
Earl,Averill,1975
Jeff,Bagwell,2017
Home Run,Baker,1955


Don't you just love it when players go exclusively by their nicknames? Or do you really suppose somebody named their son "Home Run" just to be aspirational?

We'll come back to boolean expressions like `inducted='Y' and category='Player'` in a separate section later in this lesson.
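In the meantime, it may help to see the condition evaluated row by row. The sketch below uses an in-memory SQLite database with a four-row stand-in for `HallOfFame`. The Aaron, Abbott, and Adams rows follow the discussion above, while the `id_*` values and the inducted manager row are hypothetical additions made for this example. SQLite reports boolean results as 1 (true) or 0 (false).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HallOfFame (playerID TEXT, yearid INTEGER, inducted TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO HallOfFame VALUES (?, ?, ?, ?)",
    [
        ("aaronha01", 1982, "Y", "Player"),
        ("id_abbott", 2005, "N", "Player"),    # placeholder ID
        ("id_adams", 1937, "N", "Player"),     # placeholder ID
        ("id_manager", 1990, "Y", "Manager"),  # hypothetical non-player row
    ],
)

# Show the value of the boolean expression for every row.
flags = conn.execute(
    "SELECT playerID, (inducted='Y' AND category='Player') AS keep "
    "FROM HallOfFame ORDER BY yearid"
).fetchall()

# WHERE keeps only the rows for which the expression is true.
kept = conn.execute(
    "SELECT playerID FROM HallOfFame WHERE inducted='Y' AND category='Player'"
).fetchall()
print(flags)
print(kept)
conn.close()
```

Both halves of the `AND` must be true for a row to survive, which is why the inducted manager and the non-inducted players are all filtered out.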

### **The `GROUP BY` Clause**





### **The `HAVING` Clause**



### **The `ORDER BY` Clause**



---
## **Aliases and Views**

### **Column Aliases**

### **Expression Aliases**

### **Views (Query Aliases)**


---
## **Boolean Expressions**

## **Complex Calculations with SQL Functions and CASE Expressions**

## **Grouping and Aggregation Quirks**

---
## **PRO TIPS: How to write queries that work the first time, every time.**

---
## **SQL AND BEYOND: MySQL DBMS**

## **Congratulations! You've made it to the end of Lesson 2.**

In Lesson 3 we will continue on to more advanced SELECT statements that involve multiple tables or subqueries. That's about as far as most data scientists go ... but we will of course continue on with even more advanced topics. There is *a lot more* to SQL than just retrieving data. 