<img src="https://s3-us-west-2.amazonaws.com/dsci/6007/assets/tables.png" style="width: 400px;"/>

----
By the end of this session, you should be able to
----

- Draw the general architecture of an RDBMS
- Explain the connection between SQL and Relational Algebra
- Construct a SQL query for a business user, syntax parser, and execution engine
- Use EXPLAIN and INDEXES
- Use some SQL tips n' tricks

---
Data Systems Architecture
----

Jupyter Notebook:
<img src="https://s3-us-west-2.amazonaws.com/dsci/6007/assets/jupyter.png" style="width: 400px;"/>

Highlights:

- Decouple elements:
    - Browser (aka, client) is separate from backend
    - Server is separate from storage
    - Everything is separate from the kernel
    
Here, the kernel is the process that actually runs your code: the notebook server sends it each cell, and it sends back the results. (This is a different thing from an operating system kernel, which is the core of the OS and handles input/output requests from software.)

Jupyter Notebook can use [many kernels](https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages) (e.g., Python, Bash, Scala, ...)

RDBMS also has separation of concerns
-----

<img src="https://s3-us-west-2.amazonaws.com/dsci/6007/assets/sql_server.gif" style="width: 400px;"/>

Given that architecture, there can be many clients, and __none__ of them can directly access the database. The server receives queries, processes them, and then finds data on disk.

In non-dysfunctional systems 😜, the server is big and beefy and thus can handle big, complex requests.

Remember: __hardware is cheaper than engineer time__.

The software on the server, aka the SQL execution engine, is highly optimized.

> It takes 10 years to build a good SQL engine

Most SQL engines are smarter than you!

> Math is cheaper than hardware (or engineer time)

Which is faster: for-loops or vector operations?
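One way to see the for-loop versus set-based question in miniature is to compute the same total twice: once row by row in client code, and once as a single aggregate inside the engine. The sketch below uses Python's built-in `sqlite3` module and a made-up `orders` table (the table and column names are assumptions for illustration). It only shows that both routes give the same answer; it makes no timing claim.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)",
    [(n,) for n in range(1, 1001)],
)

# Row-at-a-time: pull every row to the client and add it up in a for-loop.
loop_total = 0
for (amount,) in conn.execute("SELECT amount FROM orders"):
    loop_total += amount

# Set-at-a-time: ask the engine for the aggregate and get back one row.
(sql_total,) = conn.execute("SELECT SUM(amount) FROM orders").fetchone()

print(loop_total, sql_total)
```

The second form ships one number to the client instead of a thousand rows, which is the kind of work the server is built for.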

<img src="https://s3-us-west-2.amazonaws.com/dsci/6007/assets/architecture.png" style="width: 400px;"/>

---
Relational Algebra 101
----

Basic operations:

- Selection (σ): Selects a subset of rows from relation.
- Projection (π): Deletes unwanted columns from relation.

<img src="http://gerardnico.com/wiki/_media/algebra_of_tables.jpg" style="width: 400px;"/>

Combinations:

- Cross-product (X): Allows us to combine two relations, each row of one relation is paired with each row of another relation.
- Set-difference (-): Tuples in relation 1, but not in relation 2.
- Union (U): Tuples in relation 1 or in relation 2 (or both).

Other operations: intersection, join, division, renaming(ρ)

Since each operation returns a relation from a relation, operations can be composed (aka tables "all the way down").
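To make "tables all the way down" concrete, here is a minimal sketch of the operators in plain Python, with a relation modeled as a list of dicts. The `employee` data and the function names are hypothetical; a real engine works on far more efficient structures.

```python
# A relation is modeled as a list of rows; each row is a dict (hypothetical data).
employee = [
    {"name": "Ada", "dept": "eng", "years": 1},
    {"name": "Bob", "dept": "ops", "years": 7},
    {"name": "Cy", "dept": "eng", "years": 3},
]

def select(relation, predicate):
    """Selection (sigma): keep the rows for which predicate is true."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Projection (pi): keep only the named columns, dropping duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result

def rename(relation, old, new):
    """Renaming (rho): rename one column."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in relation]

def cross(left, right):
    """Cross-product: pair every left row with every right row (distinct column names assumed)."""
    return [{**l, **r} for l in left for r in right]

def union(r1, r2):
    """Union: rows in r1 or in r2."""
    return r1 + [row for row in r2 if row not in r1]

def difference(r1, r2):
    """Set-difference: rows in r1 but not in r2."""
    return [row for row in r1 if row not in r2]

# Every operator takes relations and returns a relation, so calls nest freely.
eng = select(employee, lambda r: r["dept"] == "eng")
veterans = select(employee, lambda r: r["years"] >= 3)
eng_names = rename(project(eng, ["name"]), "name", "engineer")
print(eng_names)
print(difference(eng, veterans))
```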

[Source](http://www.cs.montana.edu/~halla/csci440/n6/n6.html)  
[Source](http://www3.cs.stonybrook.edu/~kifer/Courses/cse532/slides/ch5.pdf)  
[Source](http://www.cs.cornell.edu/projects/btr/bioinformaticsschool/slides/gehrke.pdf)  

Relational algebra is operational: it serves as an internal representation for query evaluation plans.

Given these primitives, the optimizer does its best to return the result efficiently.

```sql
select name as rookies
from employee
where hire_date > current_date - interval '365' day;
```

Equivalent relational algebra expression:
    ρrookies(πname(σhire_date>current_date - interval '365' day(employee)))

---
SQL Parser
---

The main tasks for a SQL Parser are:

1. Check that the query is correctly specified
2. Resolve names and references
3. Convert the query into the internal format used by the optimizer
4. Verify that the user is authorized to execute the query

[Source](https://blog.acolyer.org/2015/01/20/architecture-of-a-database-system/)
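The first two tasks are easy to observe from the outside. In this sketch, SQLite (via Python's built-in `sqlite3`) stands in for a server-class engine, and the `employee` table is a hypothetical example. A misspelled keyword fails the syntax check, while a misspelled table or column fails name resolution. SQLite has no user accounts, so the authorization step is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, hire_date TEXT)")

queries = {
    "valid": "SELECT name FROM employee",
    "bad syntax": "SELEC name FROM employee",        # fails task 1
    "unknown table": "SELECT name FROM employees",   # fails task 2
    "unknown column": "SELECT salary FROM employee", # fails task 2
}

outcomes = {}
for label, sql in queries.items():
    try:
        conn.execute(sql)
        outcomes[label] = "ok"
    except sqlite3.OperationalError as exc:
        # The query is rejected before any data is touched.
        outcomes[label] = str(exc)

for label, outcome in outcomes.items():
    print(f"{label}: {outcome}")
```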

----
3 kinds of "order of operations"
----

1. Order for formulating a SQL query based on business question (inside -> outside)
2. Order for SQL language syntax
3. Order for execution plan 

---
Formulating a SQL query
----

### Klang's conjecture
> If you can't solve a problem without programming;
> you can't solve a problem with programming.

Therefore, draw your DB tables and relations, set diagrams, and materialized views before you start coding.

Build it up piece by piece. Make sure each piece gives you what you expect.

[Source](http://programmers.stackexchange.com/questions/144602/how-do-i-make-complex-sql-queries-easier-to-write)
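As a small illustration of building inside-out, suppose the business question is "which departments have an average salary above 100?". The sketch below (SQLite through Python, with a hypothetical `employee` table) checks the inner piece first, then wraps it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ada", "eng", 120), ("Cy", "eng", 100), ("Bob", "ops", 80), ("Di", "ops", 90)],
)

# Piece 1: one row per department with its average salary. Inspect it first.
step1 = "SELECT dept, AVG(salary) AS avg_salary FROM employee GROUP BY dept"
piece1 = conn.execute(step1 + " ORDER BY dept").fetchall()
print(piece1)

# Piece 2: only once piece 1 looks right, wrap it and apply the filter.
step2 = f"SELECT dept FROM ({step1}) WHERE avg_salary > 100"
piece2 = conn.execute(step2).fetchall()
print(piece2)
```

If the final answer looks wrong, you already know the inner piece is sound, so the bug is in the outer layer.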

---
SQL language syntax
----

```sql
WITH <common_table_expression>
SELECT select_list INTO new_table
FROM table_source
JOIN table_source
ON join_condition
WHERE search_condition
GROUP BY group_by_expression
HAVING search_condition
ORDER BY order_expression
```

[Source](http://stackoverflow.com/questions/4654608/what-is-the-correct-order-of-these-clauses-while-writing-a-sql-query)

### SQL Order of Operations

SQL does not perform operations "top to bottom".  

Rather, it logically evaluates the clauses in the following order:

1. FROM, JOIN
2. WHERE
3. GROUP BY 
4. HAVING
5. SELECT
6. ORDER BY

Thus, heavy filtering in WHERE tends to speed up a query: rows are discarded early, before grouping, projection, and sorting have to touch them.
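The difference between WHERE (before grouping) and HAVING (after grouping) falls straight out of this order. A minimal sketch, assuming a made-up `sales` table and using SQLite through Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 5), ("east", 50), ("east", 60), ("west", 70), ("west", 3), ("north", 2)],
)

# WHERE removes small rows first, so the groups are counted over fewer rows.
filtered_first = conn.execute("""
    SELECT region, COUNT(*) AS n      -- 5. SELECT
    FROM sales                        -- 1. FROM
    WHERE amount > 10                 -- 2. WHERE: filters rows
    GROUP BY region                   -- 3. GROUP BY
    HAVING COUNT(*) >= 2              -- 4. HAVING: filters groups
    ORDER BY region                   -- 6. ORDER BY
""").fetchall()

# Without WHERE, every row reaches GROUP BY and different groups survive HAVING.
unfiltered = conn.execute("""
    SELECT region, COUNT(*) AS n
    FROM sales
    GROUP BY region
    HAVING COUNT(*) >= 2
    ORDER BY region
""").fetchall()

print(filtered_first)
print(unfiltered)
```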

---
SQL's EXPLAIN
----

EXPLAIN -- show the execution plan of a statement 

The execution plan shows how the table(s) referenced by the statement will be scanned (e.g., full table or index) and if multiple tables are referenced, what join algorithms will be used to bring together the required rows from each input table.

The most critical part of the display is the estimated statement execution cost, which is the planner's guess at how long it will take to run the statement.

Look at the percentage of time spent in each subsection of the plan, and consider what the engine is doing. 

[Source](https://www.postgresql.org/docs/9.0/static/sql-explain.html)

Let's check out an example on [SQL fiddle](http://sqlfiddle.com/#!7/2a3f7/3)
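If SQL Fiddle is unavailable, a plan can also be inspected locally. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's built-in `sqlite3`; the two tables are hypothetical. SQLite's output is much terser than Postgres's (no cost estimates), and the exact wording of each step varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
""")

# Ask for the plan instead of the rows. Nothing is actually executed.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT c.country, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
""").fetchall()

# The last column of each plan row is a human-readable description of one step.
details = [row[-1] for row in plan]
for step in details:
    print(step)
```

Each printed line describes how one table is accessed: a SCAN reads the whole table, while a SEARCH uses a key or an index to jump to matching rows.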

---
Indexes
----

What is going on with this query? Is it going to be fast or slow? Why?

```sql
SELECT *
FROM phone_book
WHERE last_name = 'Zuckerberg'
```

If this were a physical phone book, would it be fast or slow? Why?

An index is a separate, sorted data structure that points back to the rows of the table.

```sql
ALTER TABLE phone_book
ADD INDEX (last_name)
```

Indexes prevent full table scans because only needed data is found and retrieved.
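The effect shows up directly in the query plan. This sketch runs the phone book query in SQLite through Python, before and after adding an index. SQLite does not accept `ALTER TABLE ... ADD INDEX`, so it uses `CREATE INDEX` instead. The index name and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phone_book (first_name TEXT, last_name TEXT, number TEXT)")
conn.executemany(
    "INSERT INTO phone_book VALUES (?, ?, ?)",
    [("Mark", "Zuckerberg", "555-0100"), ("Ada", "Lovelace", "555-0101")],
)

query = "SELECT * FROM phone_book WHERE last_name = 'Zuckerberg'"

def plan_details(sql):
    """Return the human-readable step descriptions from EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan_details(query)  # no index yet: the whole table is scanned

# Hypothetical index name; SQLite's equivalent of ADD INDEX (last_name).
conn.execute("CREATE INDEX idx_phone_book_last_name ON phone_book (last_name)")

after = plan_details(query)   # the planner can now search the index

print(before)
print(after)
```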

Remember, we are dealing with a transactional system where the most common use case is a row append.

Indexes are a trade-off between space and time (like most computer science algorithms)

Indexes take up more disk space (essentially you are making a sorted data structure that points to the original data) and slow down INSERT, UPDATE, and DELETE queries.

---
Check for understanding
---

<details><summary>
Why are those queries slower?
</summary>
You have to change the data and the metadata.
</details>

<details><summary>
What is the time advantage of INDEX?
</summary>
Specific queries will be faster because they will not need a full table scan.
</details>

[Source](http://shop.oreilly.com/product/0636920022343.do)

----
Practical Optimizations
---

1. Always LIMIT
2. Avoid ORDER BY
3. Avoid *
4. Think about your COUNT functions

__ALWAYS PUT LIMITS ON REQUESTS__

__Forgot LIMIT in my SQL request__
![](http://tclhost.com/4xeBSQk.gif)

[Source](http://thecodinglove.com/post/121672594790/forgot-limit-in-my-sql-request)
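A minimal sketch of the habit, using SQLite through Python and a hypothetical `events` table: peek at a handful of rows rather than pulling the whole table to the client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"event-{n}",) for n in range(10000)],
)

# LIMIT caps the result at 5 rows, however large the table is.
preview = conn.execute("SELECT id, payload FROM events LIMIT 5").fetchall()
print(len(preview))
```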

---
Check for understanding
---

```python
from string import ascii_lowercase as alphabet
from random import choice

a = [choice(alphabet) for _ in range(20)]
print(a)
print(sorted(a))
```

```
['b', 'm', 'k', 'e', 'a', 'x', 'f', 'u', 'p', 'n', 'l', 'a', 'u', 'i', 'o', 'a', 'p', 'a', 'y', 'a']
['a', 'a', 'a', 'a', 'a', 'b', 'e', 'f', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'p', 'u', 'u', 'x', 'y']
```


Which is faster: `print(a)` or `print(sorted(a))`?

Why?

Which is faster?

```sql
SELECT * 
FROM customers
ORDER BY country;
```

```sql
SELECT * 
FROM customers;
```
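One way to check your answer is to compare the two query plans. The sketch below does this in SQLite through Python with an empty, hypothetical `customers` table. Other engines phrase their plans differently, but the extra sort step shows up in some form whenever no index already provides the order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")

def plan_details(sql):
    """Return the human-readable step descriptions from EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

unsorted_plan = plan_details("SELECT * FROM customers")
sorted_plan = plan_details("SELECT * FROM customers ORDER BY country")

print(unsorted_plan)
print(sorted_plan)  # includes an extra step that builds a temporary structure for sorting
```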

----
Fun with COUNT()
----

<img src="https://media.giphy.com/media/N1pMn2QOtG8Q8/giphy.gif" style="width: 400px;"/>

count(important_column) is fast. 🐰
count(*) is slow. 🐢  

<details><summary>
Why?
</summary>
COUNT(*) does an index scan (or a table scan), not an index seek, which is why SELECT COUNT(*) can take so long. <br>  The reason is that COUNT(*) needs to look at every record in the table. <br>  How large the gap is depends on the engine. Also note that COUNT(column) skips NULLs, so the two can return different numbers.
</details>

COUNT(important_column) is fast.
COUNT(DISTINCT(important_column)) is slow. 🐢

<details><summary>
Why?
</summary>
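COUNT(column) only has to tally rows, and some engines can answer it from an index or from metadata.  <br>
COUNT(DISTINCT column) also has to deduplicate the values, which usually means scanning the whole column and then sorting or hashing it.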
</details>

What is a Data Analyst to do?

https://www.periscopedata.com/blog/use-subqueries-to-count-distinct-50x-faster.html
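Before optimizing any of these, it helps to be sure what each form counts. The sketch below uses SQLite through Python with a hypothetical `visits` table containing duplicates and a NULL. The last query is a simplified version of the "deduplicate in a subquery, then count" shape. It demonstrates equivalence only; whether it is faster depends on the engine and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?)",
    [(1,), (1,), (2,), (None,), (3,), (3,)],
)

count_star, count_col, count_distinct = conn.execute(
    "SELECT COUNT(*), COUNT(user_id), COUNT(DISTINCT user_id) FROM visits"
).fetchone()

# Deduplicate first, then count. COUNT(user_id) in the outer query skips the
# NULL row that SELECT DISTINCT keeps, so the answer matches COUNT(DISTINCT).
(count_via_subquery,) = conn.execute(
    "SELECT COUNT(user_id) FROM (SELECT DISTINCT user_id FROM visits)"
).fetchone()

print(count_star, count_col, count_distinct, count_via_subquery)
```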

---
OYO
---

- Design your database
    - Hash strings (e.g., create a state table then use state id instead of state abbreviation)
    - Reduce normalization (more about that tomorrow)
    - Have fewer rows (favor wide tables over tall tables)
- Use a column database for analytics (more about that during the NoSQL day)
- Cache frequent queries
    - Temporary tables are golden
    - Buy your DBA 🍻 and have the person create "special" tables for you
- Use functions
    - Postgres has sooooooo many functions. Use them.
    - Create UDFs (user-defined functions). You can even write them [in Python in Postgres](https://www.postgresql.org/docs/9.0/static/plpython.html)
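To get a feel for UDFs without a Postgres server, here is a rough analogue in SQLite: Python's built-in `sqlite3` module lets you register a Python function and call it from SQL. This is a stand-in, not PL/Python itself. The `initials` function and the `employee` table are hypothetical.

```python
import sqlite3

def initials(full_name):
    """Hypothetical UDF: 'ada lovelace' becomes 'AL'."""
    return "".join(part[0].upper() for part in full_name.split())

conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name "initials", taking 1 argument.
conn.create_function("initials", 1, initials)

conn.execute("CREATE TABLE employee (name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?)",
    [("ada lovelace",), ("grace hopper",)],
)

rows = conn.execute(
    "SELECT name, initials(name) FROM employee ORDER BY name"
).fetchall()
print(rows)
```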

---
Summary
----

- RDBMS are designed to have many concurrent users and balance their needs.
- Optimizers are better at SQL than you.
- A bit of Relational Algebra goes a long way in understanding and optimizing SQL
- There are 3 ways of looking at a query: requester, writer, and doer
- EXPLAIN may help you to understand queries better (so you can go home at night)
- INDEXES are powerful (but use responsibly)
- Think before you hit RUN!

<br>
<br> 
<br>

----