# Intro to SQL -- Clauses and Joins

In [None]:
!wget https://github.com/gt-cse-6040/bootcamp/raw/main/Module%201/Session%204/example.db

## We're going to do a high level introduction to SQL.

#### This is not intended to be a comprehensive introduction; instead, we will cover topics that students have historically had trouble with in the class.

### First, just a little bit about SQL and relational databases.

SQL stands for Structured Query Language. It is generally pronounced "es-queue-el" or "sequel". SQL is the industry standard for communicating with relational databases.

And while the Web is a great source of "new" data, most real-world business data appears, arguably, in more traditional relational database systems. These databases are "tuned" for the task of managing tabular data (e.g. tibbles) with complex relationships.

Data is stored in tables, which are made up of rows and columns. Each row represents a different entity (of whatever is in that table) and each column represents a different attribute of that entity.

For a good visual, think of an Excel spreadsheet, with rows and columns.

#### So what do we do with relational databases?

Our task in relational database reporting is to take a requirement for information and translate it into a SQL query that returns the requested data. To do that, we ask three questions:

1. Which tables do we need?

2. Which columns do we need?

3. How will we use the columns to get the information requested?

## Be patient with the material below; we first need to introduce the terminology and concepts.

## We will show examples of everything in subsequent notebooks.

### What makes up an SQL query?

**SQL query clauses -- Order of appearance in the query**

#### Homework NB9, Part 2 has a more in-depth discussion of the order of execution, so please review it for more detail on this topic.

There are 6 possible clauses in an SQL query.

They must appear in the query in the following order:

1. ***SELECT*** -- Which columns/data elements are to be included in the result set.


2. ***FROM*** -- The tables that are the source of the data to be returned.


3. ***WHERE***
    
    a. The columns from different tables that are equivalent and define how the tables are joined together.
    
    b. Any filtering criteria for the query, to return a subset of the data. Note that this filtering is done PRIOR to any aggregations.


4. ***GROUP BY*** -- If aggregating, these are the columns that the aggregations are based on.


5. ***HAVING*** -- Filtering on data after aggregations have been performed.


6. ***ORDER BY*** -- Sorting the data.


There is a 7th clause, **which is not universal to all databases,** but it is available in SQLite. It is the **LIMIT** clause, which tells the database how many rows to return. It would be last in the order of the query, and would also execute last.
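To make the ordering concrete, here is a minimal sketch that uses every clause, in the required order, in one query. It runs against a small in-memory SQLite database; the `orders` table, its columns, and its values are hypothetical and are not part of `example.db`.

```python
import sqlite3

# Hypothetical in-memory table for illustration only (not from example.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Ann", "East", 50.0),
        ("Ann", "East", 70.0),
        ("Bob", "West", 20.0),
        ("Cat", "East", 200.0),
        ("Dan", "East", 5.0),
        ("Eve", "West", 300.0),
    ],
)

query = """
    SELECT customer, SUM(amount) AS total  -- 1. columns to return
    FROM orders                            -- 2. source table
    WHERE region = 'East'                  -- 3. filter rows before aggregating
    GROUP BY customer                      -- 4. one group per customer
    HAVING SUM(amount) > 10                -- 5. filter groups after aggregating
    ORDER BY total DESC                    -- 6. sort the result
    LIMIT 2                                -- 7. cap the number of rows (SQLite)
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('Cat', 200.0), ('Ann', 120.0)]
conn.close()
```

Rearranging the clauses (for example, placing `WHERE` after `GROUP BY`) should produce a syntax error rather than a different result.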

**SQL query clauses -- Order of execution**

The 6 clauses execute in the following order:

1. ***FROM*** -- Which tables are in scope for where the data will come from.


2. ***WHERE*** -- How are the tables related to each other (joins), and any filtering that is to be done. At the conclusion of this step, all of the detail rows that make up the dataset to be returned will be in memory.
    
      a. The result set will include all eligible rows to be returned, from all tables. It does not include any aggregating or filtering of aggregates.
        
      b. The result set includes all of the columns in all of the tables at this point.


3. ***GROUP BY*** -- Perform any groupings that need to be done for the aggregations. Each grouping represents a separate entity at this point in the process. Remember "split-apply-combine" from the pandas groupby() function last week? This is the "split" step.


4. ***HAVING*** -- Filter the groupings from the last step, keeping only those that meet the criteria.


5. ***SELECT*** -- Choose only the columns that are required to be returned. Also perform any data manipulations (string manipulation, for example) that are required.


6. ***ORDER BY*** -- Once only the final set of rows to be returned remains, it is sorted in whatever order is specified.
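One practical consequence of the execution order above is that `WHERE` and `HAVING` can give different answers for what looks like the same condition, because `WHERE` runs before the grouping and `HAVING` runs after it. The sketch below illustrates this with a hypothetical `orders` table created in memory (the table and its values are assumptions for illustration, not from `example.db`).

```python
import sqlite3

# Hypothetical in-memory table for illustration only (not from example.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ann", 50), ("Ann", 70), ("Bob", 20), ("Bob", 30)],
)

# WHERE runs before GROUP BY: individual rows with amount <= 40 are dropped
# first, so Bob has no rows left to aggregate.
where_rows = conn.execute(
    """
    SELECT customer, SUM(amount)
    FROM orders
    WHERE amount > 40
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

# HAVING runs after GROUP BY: every row is summed first, and then whole
# groups are kept or dropped based on the aggregate.
having_rows = conn.execute(
    """
    SELECT customer, SUM(amount)
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 40
    ORDER BY customer
    """
).fetchall()

print(where_rows)   # [('Ann', 120)]
print(having_rows)  # [('Ann', 120), ('Bob', 50)]
conn.close()
```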

### Now let's look at joins

Notebook 9, Part 0 contains the link to an outstanding resource for visualizing joins, which are typically represented by various Venn diagrams.

https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins

The joins we will deal with in this class are inner, outer, left, and right (these are the same as the options for the pandas merge "how" parameter from last week).

Also, remember how we discussed the "left" and "right" tables in pandas? The same applies here. The first listed table is the "left" table and the second listed is the "right" table.

**Inner join**

This is the most common join, and the easiest to understand. A query using this join will return only the rows that have a match in both tables.

Additionally, this is the default join in SQL, so if the join type is not specified (a bare JOIN, or tables related to each other in the WHERE clause as described above), then an INNER JOIN is assumed.

![inner%20join.png](https://github.com/gt-cse-6040/bootcamp/blob/main/Module%201/Session%204/inner%20join.png?raw=1)
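As a minimal sketch of an inner join, the example below builds two small hypothetical tables in memory (`students` and `enrollments`, which are assumptions for illustration and not part of `example.db`). It writes the same join two ways: with an explicit `INNER JOIN ... ON`, and with the tables listed in `FROM` and related in the `WHERE` clause.

```python
import sqlite3

# Hypothetical in-memory tables for illustration only (not from example.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrollments (student_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cat")]
)
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [(1, "SQL"), (2, "Python"), (4, "R")],
)

# Explicit inner join: only students with a matching enrollment are returned.
explicit_rows = conn.execute(
    """
    SELECT s.name, e.course
    FROM students AS s
    INNER JOIN enrollments AS e ON s.id = e.student_id
    ORDER BY s.name
    """
).fetchall()

# Same join expressed through the WHERE clause.
implicit_rows = conn.execute(
    """
    SELECT s.name, e.course
    FROM students AS s, enrollments AS e
    WHERE s.id = e.student_id
    ORDER BY s.name
    """
).fetchall()

print(explicit_rows)  # [('Ann', 'SQL'), ('Bob', 'Python')]
print(explicit_rows == implicit_rows)  # True
conn.close()
```

Cat (no enrollment) and the enrollment for student 4 (no student record) are both absent from the result, because neither has a match in the other table.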

**Outer join**

This is also known as FULL OUTER JOIN or FULL JOIN. The query will return all of the rows from both tables, whether or not there is a match in the other table. All columns will be populated for the rows that have a match, and for those rows from either table that do not have a match, a NULL value will be returned for the non-matching columns.

![outer%20join.png](https://github.com/gt-cse-6040/bootcamp/blob/main/Module%201/Session%204/outer%20join.png?raw=1)
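The sketch below illustrates the outer join result on two hypothetical in-memory tables (`students` and `enrollments`, assumptions for illustration and not part of `example.db`). Because SQLite only supports `FULL OUTER JOIN` directly from version 3.39 onward, this example swaps in a common emulation instead: a `LEFT JOIN`, combined via `UNION ALL` with the unmatched rows from the other table. On a newer SQLite, a single `FULL OUTER JOIN` should return the same rows.

```python
import sqlite3

# Hypothetical in-memory tables for illustration only (not from example.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrollments (student_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cat")]
)
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [(1, "SQL"), (2, "Python"), (4, "R")],
)

# Emulated full outer join (not the FULL OUTER JOIN keyword itself):
#   part 1: every student, with enrollment columns NULL when unmatched
#   part 2: only the enrollments that have no matching student
outer_rows = conn.execute(
    """
    SELECT s.name, e.course
    FROM students AS s
    LEFT JOIN enrollments AS e ON s.id = e.student_id
    UNION ALL
    SELECT s.name, e.course
    FROM enrollments AS e
    LEFT JOIN students AS s ON s.id = e.student_id
    WHERE s.id IS NULL
    """
).fetchall()

# NULL values come back to Python as None.
for row in outer_rows:
    print(row)
conn.close()
```

The result contains four rows: the two matched pairs, Cat with no course, and the course R with no student name.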

**Left join**

This join will return all of the rows from the left table, whether or not there is a match from the right table. Any records that match from the right table will also be included, and for those left table rows that do not have a match, the right table columns in the SELECT statement will return a NULL value.

![left%20join.png](https://github.com/gt-cse-6040/bootcamp/blob/main/Module%201/Session%204/left%20join.png?raw=1)
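Here is a minimal left join sketch, again using two hypothetical in-memory tables (`students` on the left, `enrollments` on the right; both are assumptions for illustration and not part of `example.db`).

```python
import sqlite3

# Hypothetical in-memory tables for illustration only (not from example.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrollments (student_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cat")]
)
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [(1, "SQL"), (2, "Python"), (4, "R")],
)

# students is listed first, so it is the "left" table: every student is kept,
# and the course is NULL (None in Python) when there is no matching enrollment.
left_rows = conn.execute(
    """
    SELECT s.name, e.course
    FROM students AS s
    LEFT JOIN enrollments AS e ON s.id = e.student_id
    ORDER BY s.name
    """
).fetchall()

print(left_rows)  # [('Ann', 'SQL'), ('Bob', 'Python'), ('Cat', None)]
conn.close()
```

The enrollment for student 4 does not appear, because unmatched rows from the right table are not kept.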

**Right join**

This is the opposite of the left join.

This join will return all of the rows from the right table, whether or not there is a match from the left table. Any records that match from the left table will also be included, and for those right table rows that do not have a match, the left table columns in the SELECT statement will return a NULL value.

![right%20join.png](https://github.com/gt-cse-6040/bootcamp/blob/main/Module%201/Session%204/right%20join.png?raw=1)
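Since a right join mirrors a left join, one way to see its result is to swap the table order and use `LEFT JOIN`. The sketch below does exactly that on two hypothetical in-memory tables (`students` and `enrollments`, assumptions for illustration and not part of `example.db`). It is an emulation, chosen because SQLite only accepts the `RIGHT JOIN` keyword from version 3.39 onward; on a newer SQLite, `students RIGHT JOIN enrollments` should return the same rows.

```python
import sqlite3

# Hypothetical in-memory tables for illustration only (not from example.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrollments (student_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cat")]
)
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [(1, "SQL"), (2, "Python"), (4, "R")],
)

# Equivalent of "students RIGHT JOIN enrollments", written as a LEFT JOIN
# with the tables swapped: every enrollment is kept, and the student name is
# NULL (None in Python) when there is no matching student.
right_rows = conn.execute(
    """
    SELECT s.name, e.course
    FROM enrollments AS e
    LEFT JOIN students AS s ON s.id = e.student_id
    ORDER BY e.course
    """
).fetchall()

print(right_rows)  # [('Bob', 'Python'), (None, 'R'), ('Ann', 'SQL')]
conn.close()
```

Cat does not appear here, because she has no enrollment and unmatched rows from the left table (`students`) are not kept.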

## What are your questions on SQL clauses and joins?