# A Trip Down SQLane: 13 Tips and Tricks For SQL
-------------------------------

## I. Introduction
------------------

In this post I want to take a trip down SQL memory lane, or just SQLane for short. [Structured Query Language (SQL)](https://en.wikipedia.org/wiki/SQL) is something that I have been using in many different forms for years. On this blog I have written prior posts about [SQLite and PostgreSQL](https://michael-harmon.com/posts/sqlwars/), [NoSQL](https://michael-harmon.com/posts/sentimentanalysis1/) and [DuckDB](https://michael-harmon.com/posts/polarsduckdb/). Elsewhere I have used [Postgres](https://www.postgresql.org/), [Teradata](https://www.teradata.com/), [Snowflake](https://www.snowflake.com/en/), [Impala](https://impala.apache.org/), [HiveQL](https://hive.apache.org/) and [SparkSQL](https://spark.apache.org/sql/). SparkSQL, and [Apache Spark](https://spark.apache.org/) more generally, hold a special place in my heart. The ability to switch between SQL statements and dataframe operations, as well as incorporate arbitrary transformations and actions using Python, Scala or Java, makes Spark an incredibly powerful tool.

In this post, I will go over techniques that have been extremely helpful to me in the past. These won't be introductory techniques or queries; the internet is littered with those. I'll go over some more intermediate and lesser-known querying techniques. The main topics I'll cover are:

1. [Conditional Statements](https://www.geeksforgeeks.org/sql/sql-conditional-expressions/)
2. [Window Functions](https://www.geeksforgeeks.org/sql/window-functions-in-sql/)
3. [Array Operations](https://www.postgresql.org/docs/current/functions-array.html)
4. [Special Types of Joins](https://www.w3schools.com/sql/sql_join.asp)

I'll also note a few Spark-specific details that are useful in practice. One thing to note is that I use SparkSQL for both SQL queries and dataframe operations. The Spark API is extremely well written and the syntax mirrors SQL so closely that I usually just think of the two as interchangeable. To some degree they are, but I have found that in a few specific cases the dataframe API provides advantages, which I will call out. Lastly, I will try to be a little more succinct in this post and use realistic examples from online platforms such as [LeetCode](https://leetcode.com/) and [DataLemur](https://datalemur.com/) to give you an idea of how to apply these techniques.

## II. Conditional Expressions
-------------------------
Conditional expressions return different values depending on whether certain conditions are met. They play the role that "if-else" statements play in other languages. I'll start out with simple text functions that rely on if-then logic under the hood.

### 1.  TRIM, LOWER, And Regular Expressions 
These functions are extremely helpful when it comes to cleaning text. The [TRIM](http://w3schools.com/sql/func_sqlserver_trim.asp) function removes extra whitespace from both ends of a string. There are also versions that only remove extra spaces on the left side, [LTRIM](https://www.w3schools.com/sql/func_sqlserver_ltrim.asp), and on the right side, [RTRIM](https://www.w3schools.com/sql/func_sqlserver_rtrim.asp). The [LOWER](https://www.w3schools.com/sql/func_sqlserver_lower.asp) function converts all text to lower case (or [UPPER](https://www.w3schools.com/sql/func_sqlserver_upper.asp) if you prefer upper case). Lastly, regular expressions are extremely helpful in SQL since most engines provide them as built-in functions. One particularly helpful function is [REGEXP_REPLACE](https://duckdb.org/docs/stable/sql/functions/regular_expressions#regexp_replacestring-pattern-replacement-options), which searches for text that matches a pattern and replaces it with the specified text. Let's go through a simple example.

Say I am searching for all records of "Michael Harmon" in the database shown below:

In [7]:
import duckdb
query1 = open('queries/create_names.sql', 'r').read()
duckdb.query(query1)
duckdb.query("SELECT id, name FROM names")

┌───────┬──────────────────────┐
│  ID   │         name         │
│ int32 │       varchar        │
├───────┼──────────────────────┤
│     1 │ Michael Harmon       │
│     2 │ Dr. Michael Harmon   │
│     3 │ mr. michael harmon   │
│     4 │  Michael Harmon      │
│     5 │ David Michael Harmon │
└───────┴──────────────────────┘

I would expect to get records 1-4 back. If I write a simple naive query using `name = 'Michael Harmon'` I would only get the first result:

In [8]:
duckdb.query("SELECT id, name FROM names WHERE name = 'Michael Harmon'")

┌───────┬────────────────┐
│  ID   │      name      │
│ int32 │    varchar     │
├───────┼────────────────┤
│     1 │ Michael Harmon │
└───────┴────────────────┘

Instead I'll use `TRIM(LOWER(name))` to make everything the same case and remove extra spaces, which captures record 4. I could then use a wildcard to pick up records 2 and 3,

In [15]:
duckdb.query("SELECT id, name FROM names WHERE TRIM(LOWER(name)) LIKE '%michael harmon'")

┌───────┬──────────────────────┐
│  ID   │         name         │
│ int32 │       varchar        │
├───────┼──────────────────────┤
│     1 │ Michael Harmon       │
│     2 │ Dr. Michael Harmon   │
│     3 │ mr. michael harmon   │
│     4 │  Michael Harmon      │
│     5 │ David Michael Harmon │
└───────┴──────────────────────┘

But that would be a mistake since it would capture record 5! Instead let's use a regular expression to strip out `Dr.` and `mr.` by replacing that text with an empty string:

In [23]:
duckdb.query("""
SELECT 
    id,
    name
FROM 
    names
WHERE
    TRIM(
        REGEXP_REPLACE(
                LOWER(name), '(mr[.]|dr[.])', ''  -- [.] matches a literal period; a bare . matches any character
             )
        ) = 'michael harmon'
""")

┌───────┬────────────────────┐
│  ID   │        name        │
│ int32 │      varchar       │
├───────┼────────────────────┤
│     1 │ Michael Harmon     │
│     2 │ Dr. Michael Harmon │
│     3 │ mr. michael harmon │
│     4 │  Michael Harmon    │
└───────┴────────────────────┘

Now we can move on to truly conditional statements!

### 2.  Conditional Statements
In most programming languages "if-else" statements are pretty common. In SQL the equivalent is `CASE WHEN ... THEN ... ELSE ... END`. You can enumerate any number of cases, and the `ELSE` branch covers the rows that don't match any of the ones specified.

As a simple example, let's say we want to know which names in the `names` table are shorter than 16 characters. We could add a new column to the table with this information as shown below:

In [30]:
duckdb.query(""" 
SELECT 
    id, 
    name, 
    CASE WHEN LENGTH(name) < 16 THEN TRUE
         ELSE FALSE 
    END AS less_than_16_chars
FROM 
    names
""")

┌───────┬──────────────────────┬────────────────────┐
│  ID   │         name         │ less_than_16_chars │
│ int32 │       varchar        │      boolean       │
├───────┼──────────────────────┼────────────────────┤
│     1 │ Michael Harmon       │ true               │
│     2 │ Dr. Michael Harmon   │ false              │
│     3 │ mr. michael harmon   │ false              │
│     4 │  Michael Harmon      │ true               │
│     5 │ David Michael Harmon │ false              │
└───────┴──────────────────────┴────────────────────┘

You can add more conditions by adding more `WHEN ... THEN ...` branches, but the expression must always end with `END`.
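As a minimal sketch of a multi-branch `CASE`, the query below sorts each name into a length bucket. It uses Python's built-in `sqlite3` module rather than DuckDB so that it runs without `queries/create_names.sql`. The rows are copied from the output above; the single leading space in record 4, the bucket names and the cut-offs are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite table standing in for the DuckDB `names` table.
# Rows copied from the output above; record 4's single leading space is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [
        (1, "Michael Harmon"),
        (2, "Dr. Michael Harmon"),
        (3, "mr. michael harmon"),
        (4, " Michael Harmon"),
        (5, "David Michael Harmon"),
    ],
)

# Branches are evaluated top to bottom and the first matching WHEN wins,
# so the second branch only sees names with 16 or more characters.
length_buckets = conn.execute("""
SELECT
    id,
    CASE WHEN LENGTH(name) < 16 THEN 'short'
         WHEN LENGTH(name) < 20 THEN 'medium'
         ELSE 'long'
    END AS length_bucket
FROM
    names
ORDER BY id
""").fetchall()

for row in length_buckets:
    print(row)
```

Because the branches are checked in order, the second `WHEN` does not need to repeat the lower bound.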

### 3.  Conditional Statements With Aggregations

Conditional statements can also be used in conjunction with aggregation functions to create more complex queries. For example, you might want to count the number of names that are shorter than a certain length, as shown below,

In [33]:
duckdb.query(""" 
SELECT 
    SUM(CASE WHEN LENGTH(name) < 16 THEN 1 END) AS count_less_than_16_chars
FROM 
    names
""")

┌──────────────────────────┐
│ count_less_than_16_chars │
│          int128          │
├──────────────────────────┤
│                        2 │
└──────────────────────────┘

An example of where I have used conditional statements with aggregations is the [Monthly Transactions I](https://leetcode.com/problems/monthly-transactions-i/description/) problem on LeetCode; the [solution](https://github.com/mdh266/SQL-Practice/blob/master/leetcode/monthly-transactions-i.sql) is on my GitHub page. :-)
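The same pattern extends naturally to `GROUP BY`. The sketch below is loosely modelled on that kind of problem, but it is not the linked solution: the `transactions` table, its columns and its rows are hypothetical, and it uses Python's built-in `sqlite3` so that it is self-contained.

```python
import sqlite3

# Hypothetical transactions table, made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER, state TEXT, amount INTEGER, trans_month TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (1, "approved", 1000, "2018-12"),
        (2, "declined", 2000, "2018-12"),
        (3, "approved", 2000, "2019-01"),
        (4, "approved", 500, "2019-01"),
    ],
)

# Each SUM only picks up the rows that satisfy its CASE condition.
# ELSE 0 makes a group with no matching rows return 0 instead of NULL.
monthly = conn.execute("""
SELECT
    trans_month,
    COUNT(*) AS trans_count,
    SUM(CASE WHEN state = 'approved' THEN 1 ELSE 0 END) AS approved_count,
    SUM(CASE WHEN state = 'approved' THEN amount ELSE 0 END) AS approved_amount
FROM
    transactions
GROUP BY trans_month
ORDER BY trans_month
""").fetchall()

for row in monthly:
    print(row)
```

This computes several conditional totals in a single pass over the table instead of running one filtered query per condition.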

## III. Window Functions
----------------------
[Window functions](https://en.wikipedia.org/wiki/Window_function_(SQL)) are another extremely important concept in SQL. A window function uses values from one or more rows that are related to one another through a so-called partition to return a value for each row. This is a little abstract, so consider a company table that lists each employee, their department and their salary, like below,



In [10]:
import duckdb
duckdb.query(open('queries/create_employees.sql', 'r').read())
duckdb.query("SELECT employee_id, employee_name, department, salary FROM employees")

┌─────────────┬───────────────┬────────────┬────────┐
│ employee_id │ employee_name │ department │ salary │
│    int32    │    varchar    │  varchar   │ int32  │
├─────────────┼───────────────┼────────────┼────────┤
│           1 │ Alice Johnson │ Sales      │  90000 │
│           2 │ Bob Smith     │ Marketing  │  75000 │
│           3 │ Charlie Brown │ Sales      │  95000 │
│           4 │ Diana Prince  │ Sales      │  70000 │
│           5 │ Ethan Hunt    │ Marketing  │  20000 │
└─────────────┴───────────────┴────────────┴────────┘

We could find the average salary per department with the aggregation,

In [11]:
duckdb.query("""
SELECT 
    department,
    AVG(salary) dept_avg_salary
FROM
    employees
GROUP BY 1
""")

┌────────────┬─────────────────┐
│ department │ dept_avg_salary │
│  varchar   │     double      │
├────────────┼─────────────────┤
│ Marketing  │         47500.0 │
│ Sales      │         85000.0 │
└────────────┴─────────────────┘

But what if we want to attach the department average to each employee? We could do a nested query where we perform the aggregation and then join on the department, but this is kind of sloppy. Instead we can partition employees by their department and average over each department as shown,

In [13]:
duckdb.query("""
SELECT 
    employee_id,
    employee_name,
    department,
    salary,
    AVG(salary) OVER(PARTITION BY department) AS dept_avg_salary
FROM
    employees
""")

┌─────────────┬───────────────┬────────────┬────────┬─────────────────┐
│ employee_id │ employee_name │ department │ salary │ dept_avg_salary │
│    int32    │    varchar    │  varchar   │ int32  │     double      │
├─────────────┼───────────────┼────────────┼────────┼─────────────────┤
│           1 │ Alice Johnson │ Sales      │  90000 │         85000.0 │
│           3 │ Charlie Brown │ Sales      │  95000 │         85000.0 │
│           4 │ Diana Prince  │ Sales      │  70000 │         85000.0 │
│           2 │ Bob Smith     │ Marketing  │  75000 │         47500.0 │
│           5 │ Ethan Hunt    │ Marketing  │  20000 │         47500.0 │
└─────────────┴───────────────┴────────────┴────────┴─────────────────┘

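Interestingly, DuckDB returns the results in an order that reflects the partitioning instead of the original ordering! SQL does not guarantee any row order unless the query ends with an `ORDER BY`, so add one if the order matters.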
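To make the comparison concrete, here is a sketch that runs both the window-function version and the nested aggregate-and-join version, then checks that they agree. It uses Python's built-in `sqlite3` instead of DuckDB so that it does not depend on `queries/create_employees.sql`; the rows are copied from the table above, and both queries add an explicit `ORDER BY` so the row order is deterministic.

```python
import sqlite3

# In-memory SQLite copy of the employees table shown above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id INTEGER, employee_name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Alice Johnson", "Sales", 90000),
        (2, "Bob Smith", "Marketing", 75000),
        (3, "Charlie Brown", "Sales", 95000),
        (4, "Diana Prince", "Sales", 70000),
        (5, "Ethan Hunt", "Marketing", 20000),
    ],
)

# Window-function version: one pass, no join.
window_rows = conn.execute("""
SELECT
    employee_id,
    department,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary
FROM
    employees
ORDER BY employee_id
""").fetchall()

# Nested-query version: aggregate per department, then join back.
join_rows = conn.execute("""
SELECT
    e.employee_id,
    e.department,
    d.dept_avg_salary
FROM
    employees e
JOIN (
    SELECT department, AVG(salary) AS dept_avg_salary
    FROM employees
    GROUP BY department
) d ON e.department = d.department
ORDER BY e.employee_id
""").fetchall()

print(window_rows == join_rows)
```

With the `ORDER BY` in place, the rows come back in `employee_id` order regardless of how the engine evaluates the partitions.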

The `PARTITION BY` clause defines which rows belong to the same window: here, all employees in the same department. Aggregates are not the only functions that can be applied over a window; for example, you could also rank the employees in each department by their salary, which is the subject of the next tip.

### 4. RANK, DENSE_RANK, ROW_NUMBER 
* RANK
* DENSE_RANK: https://github.com/mdh266/SQL-Practice/blob/master/datalemur/sql-top-three-salaries.sql
* ROW_NUMBER: https://github.com/mdh266/SQL-Practice/blob/master/datalemur/sql-third-transaction.sql
* Why you need partition by key for large tables
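A minimal sketch of how the three ranking functions differ when there are ties. The `staff` table and its rows are hypothetical (the `employees` table above has no tied salaries), and the query runs on Python's built-in `sqlite3` so that it is self-contained.

```python
import sqlite3

# Hypothetical table with a tied salary in Sales, made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (employee_name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("Ann", "Sales", 90000),
        ("Ben", "Sales", 90000),
        ("Cat", "Sales", 70000),
        ("Dan", "Marketing", 75000),
    ],
)

# RANK leaves a gap after ties, DENSE_RANK does not, and ROW_NUMBER never repeats.
# ROW_NUMBER gets employee_name as a tie-breaker so its output is deterministic.
ranked = conn.execute("""
SELECT
    employee_name,
    department,
    salary,
    RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rnk,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC, employee_name) AS row_num
FROM
    staff
ORDER BY department, row_num
""").fetchall()

for row in ranked:
    print(row)
```

In the Sales partition, Ann and Ben share rank 1; Cat then gets `RANK` 3 but `DENSE_RANK` 2, while `ROW_NUMBER` simply counts 1, 2, 3.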

### 5. QUALIFY

### 6. LEAD & LAG
* Year over year growth: https://github.com/mdh266/SQL-Practice/blob/master/datalemur/yoy-growth-rate.sql
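A minimal sketch of a year-over-year growth calculation with `LAG`. This is not the linked solution: the `yearly_spend` table and its values are hypothetical, and the query runs on Python's built-in `sqlite3`.

```python
import sqlite3

# Hypothetical yearly totals, made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yearly_spend (year INTEGER, spend INTEGER)")
conn.executemany(
    "INSERT INTO yearly_spend VALUES (?, ?)",
    [(2020, 100), (2021, 150), (2022, 120)],
)

# LAG(spend) looks one row back in the ORDER BY year sequence.
# The first year has no previous row, so LAG and the growth rate are both NULL.
yoy = conn.execute("""
SELECT
    year,
    spend,
    LAG(spend) OVER (ORDER BY year) AS prev_year_spend,
    ROUND(
        100.0 * (spend - LAG(spend) OVER (ORDER BY year))
              / LAG(spend) OVER (ORDER BY year),
        2
    ) AS yoy_growth_pct
FROM
    yearly_spend
ORDER BY year
""").fetchall()

for row in yoy:
    print(row)
```

`LEAD` works the same way but looks at the following row instead of the previous one. With several products or customers in one table, a `PARTITION BY` inside the `OVER` clause would keep each series separate.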

### 7.  Moving Averages
* https://github.com/mdh266/SQL-Practice/blob/master/datalemur/rolling-average-tweets.sql
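A minimal sketch of a three-row moving average using a window frame. This is not the linked solution: the `daily_tweets` table and its counts are hypothetical, and the query runs on Python's built-in `sqlite3`.

```python
import sqlite3

# Hypothetical daily counts, made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_tweets (tweet_date TEXT, tweet_count INTEGER)")
conn.executemany(
    "INSERT INTO daily_tweets VALUES (?, ?)",
    [
        ("2022-06-01", 2),
        ("2022-06-02", 1),
        ("2022-06-03", 3),
        ("2022-06-04", 5),
    ],
)

# The frame covers the current row plus the two rows before it.
# Early rows simply average over the rows that exist so far.
rolling = conn.execute("""
SELECT
    tweet_date,
    tweet_count,
    AVG(tweet_count) OVER (
        ORDER BY tweet_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_3d
FROM
    daily_tweets
ORDER BY tweet_date
""").fetchall()

for row in rolling:
    print(row)
```

Note that `ROWS` counts rows, not calendar days, so this only equals a three-day average when every day has exactly one row.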

### 8. Multiple Expressions in SparkSQL

## IV. Array Operations
-------------------

### 9. COLLECT_SET/ARRAY_AGG

* https://github.com/mdh266/SQL-Practice/blob/master/datalemur/frequently-purchased-pairs.sql

### 10. EXPLODE


## V. Special Joins
----------------
### 11. Cross Joins
* https://github.com/mdh266/SQL-Practice/blob/master/datalemur/repeated-payments.sql

### 11. Filtering by Join
* Repartitioning after with SparkSQL

### 12. Conditional Assignments With A Join

### 13. Broadcast Join in SparkSQL

### 14. Left Anti Join with SparkSQL
* https://github.com/mdh266/SQL-Practice/blob/master/leetcode/CustomersDontOrder.sql
* With a left join and filter (needs to be outside the nested join)!
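A minimal sketch of the left-join-and-filter form of an anti join, under one reading of the note above: the `IS NULL` filter goes in the outer `WHERE` clause, after the join has been evaluated, rather than in the `ON` condition. The `customers` and `orders` tables are hypothetical and the query runs on Python's built-in `sqlite3`; SparkSQL can also express the same thing directly with `LEFT ANTI JOIN`.

```python
import sqlite3

# Hypothetical customers and orders, made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Joe"), (2, "Henry"), (3, "Sam"), (4, "Max")],
)
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 3), (2, 1)])

# LEFT JOIN keeps every customer; customers without an order get NULLs
# in the orders columns, and the WHERE clause keeps only those rows.
anti_join = conn.execute("""
SELECT
    c.name
FROM
    customers c
LEFT JOIN
    orders o ON c.id = o.customer_id
WHERE
    o.customer_id IS NULL
ORDER BY c.name
""").fetchall()

# Equivalent formulation with NOT EXISTS, used here as a cross-check.
not_exists = conn.execute("""
SELECT
    c.name
FROM
    customers c
WHERE
    NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
ORDER BY c.name
""").fetchall()

print(anti_join, anti_join == not_exists)
```

Moving the `IS NULL` test into the `ON` clause would change the result: a left join keeps every left-hand row no matter what the `ON` condition says, so no customers would be filtered out.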

<!-- 14. Conditional Joins -->

## VI. Conclusion