# PostgreSQL NOTES

These are notes on SQL syntax. While these notes focus on PostgreSQL, most of the querying syntax carries over to other SQL dialects. For example, MySQL's syntax is nearly identical to PostgreSQL's, apart from a few minor points.

## Preface

### Sample Datasets to Work with

As an initial matter, we can use [this SQL IDE](https://www.db-fiddle.com/) for playing around with SQL datasets.

[Sample Dataset 1: Novels](https://raw.githubusercontent.com/ephs08kmp/sql_workshop_schema/master/sql_example.txt) <br>
[Sample Dataset 2: Train Stations](https://raw.githubusercontent.com/ephs08kmp/sql_workshop_schema/master/sql_workshop.txt)
***
#### Good Practice

One cannot claim to be "proficient" at PostgreSQL without [completing these exercises in full!](https://pgexercises.com/) Many of the examples in these notes (the `cd.*` tables) come from those exercises.

***
***
# TABLE OF CONTENTS

### I. Loading Data in SQL
1. Loading from .sql file
2. Loading from .csv and other files
3. Manual Creation of Tables (in a PostgreSQL Database)
4. Altering a Table
    
### II. SQL Syntax
1. Basic Searching & Selection
2. Working with Strings and Datetimes
3. "Case When" Statements
4. Aggregation/Grouping and Subquerying
5. Joins and CTEs
6. Modifying Data
    
### III. Miscellaneous SQL Notes
1. Syntax of Arithmetic Operations
2. Union, Intersect, and Except Functionality (Basic/10)

***
***
# I. LOADING DATA INTO PostgreSQL

Three main ways:
1. Load data into a database from a _.sql file_.
    - The tool that will be referred to in these notes is the open-source __pgAdmin 4__.
2. Load data into a database from a different tabular file, such as a _.csv file_.
3. Create the table parameters manually within the database.    

***
##  1. NORMAL LOADING FROM .sql FILE

_How to do in __PgAdmin__:_

1. (Right click on) Database - Create -> Database
2. (Right click on) (Newly Created Database) - Query Tool...
3. On the query tool toolbar, open file

### What is a 'Database File' and What is a 'Query Tool'?

- __Database File__ = a folder containing all relevant `.sql` files
<br>
- __Query Tool__ = the tool used to run queries against the database, which is the main point of SQL (see Section II below)

***
## 2. IMPORT DATA FROM .csv FILE

__Two Main ways to import data from non-.sql files: via (a) the Query Tool or via (b) the database program itself.__

### A. Create Table Manually under Query Tool

   - `CREATE TABLE` (follow the Create Table syntax in Section I.3 below, i.e., listing column name and type)
   - Populating the newly created table via the query tool requires:
      ```SQL
      COPY table_name(column_name1, column_name2, column_name3...)
      FROM 'C:\Users\jafon\Documents\PythonMaterials\Data\AirBnB_Listings\listings.csv' 
      WITH DELIMITER ',' csv HEADER;
      ```
    
### B. Import Data into Newly Created Table

(Under Browser): Database - Schemas - public - Tables (right click) -> Create -> Table
 - Name the Table (something different from the table created under the query)
 - Under Columns - populate the column names by clicking on the dropdown and selecting the table you just created via the query (or you could manually write out the columns here; you can skip Part A if you do this)
 - Click save and create the table 
 - Access this newly created table under the browser -> (right click) -> Import/Export -> Choose Import
 - Select location of dataset file
 - Header YES or ON, DELIMITER ','
 - Save and Import

***
***

### C. POTENTIAL ERRORS TO LOADING DATA (and their solutions):

- ___"There is no such file or directory"___ or ___"permission denied":___
     - There is a permission error from Windows blocking pgAdmin from accessing the file folder
     - Right-click on the folder that contains the desired file -> Properties -> Change user permissions
        <br><br>    
- ___"extra data after last expected column":___
     - The number of columns being imported does not match up with the number of columns listed in the `COPY table_name(...)` query<br><br>
    - This raises an important point about data loading for PostgreSQL, __IMPORTANT!!!!:__ `COPY` reads ***ALL*** of the columns in the csv, so they must line up with the listed columns; you cannot pick and choose!
    
      - There are two ways to fix this: <br>
      1. Deleting unnecessary columns in the original csv dataset - saving a new copy and loading from there
          - This is called "preprocessing" the data - and it is the easiest way
      2. Import ALL columns FIRST from csv, THEN modify the table
          - a. 
          ```SQL
          ALTER TABLE table_name DROP COLUMN column5;
          ALTER TABLE table_name DROP COLUMN column2; -- etc.
          ```
          - b. This may be too process-heavy/cumbersome for super large datasets
              - You can skip this by creating an intermediary table AND THEN picking and choosing from there: <br>
              ```SQL
              CREATE TEMPORARY TABLE t (x1 integer, ... , x10 text);
              -- Copy from the file into it:
              COPY t (x1, ... , x10)
              FROM '/path/to/my_file'
              WITH (FORMAT csv);
              
              -- Insert only the wanted columns into the real table:
              INSERT INTO my_table (x2, x5, x7, x10)
              SELECT x2, x5, x7, x10
              FROM t;
              
              DROP TABLE t;
              ```   
              
__NOTE!!!: This is a good method for re-ordering columns too!__
<br><br>
Overall,
<br><br>
The inability to easily choose column order and modify tables is a severe limitation of PostgreSQL/pgAdmin. Other SQL database programs (e.g., MySQL) allow you to pick and choose columns normally.

***
## 3. TABLE CREATION - THE MAIN WAY TO DO BUSINESS!

### A. Overview

Instead of importing data from an existing dataset, we can alternatively create our own table! ___However,___ in doing so, we need to do more than just set dimensions and input values. We need to name _everything_, set data types, add constraints if necessary, and of course, do it all without getting errors!<br>

Below is an example of a simple table creation. Keep in mind this only creates the table - it does not have any values inputted into the newly created table.

```SQL
CREATE TABLE table_name (
    some_column_name TYPE column_constraint,
    column_2_name TYPE, 
    column_3_name TYPE column_constraint
    );
```
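
As a concrete sketch of the template above, the following creates a made-up `listings` table. The table name, column names, and chosen types are all assumptions for illustration (types and constraints are explained in the next subsections).

```SQL
-- Hypothetical table; names and types are illustrative assumptions.
CREATE TABLE listings (
    listing_id   INTEGER PRIMARY KEY,               -- unique, non-null identifier
    host_name    TEXT NOT NULL,                     -- free-form text, required
    nightly_rate NUMERIC CHECK (nightly_rate > 0),  -- must be positive when present
    first_listed DATE                               -- no constraint: NULLs allowed
);
```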

***

### B. What is TYPE?

__TYPE__ is the declared data type of the values within that column. Yes, all values within the same column MUST share one data type. A catch-all text type such as "VarChar" (variable characters) will accept almost any value as text, but this is ill-advised and should only be used (1) as a placeholder or (2) if the data type isn't important for that column and you want to avoid errors during importing. 
<br><br>
Examples of type are: 
- TEXT 
- INTEGER 
- NUMERIC 
- VARCHAR(100) _(number of max characters, regardless of type)_
- DATE. 
    - Check the internet for more data types!

#### 1. Two Notes on VARCHAR(x)...

1. VARCHAR is a good placeholder, but if you plan on performing any querying operations on that column that require a certain data type (numerical operations, date/time operations, etc.), you should go with a more specific data type.
    - One good reason to use VARCHAR is to avoid type-specific errors while importing the data...which will LIKELY happen for data that is not 100% clean (which is always the case)!!!!!
<br>

2. CSV data is flat text with no notion of arrays. PostgreSQL does have array column types, and text that looks like an array literal (for example, values wrapped in curly braces `{...}`) can trigger an error such as "malformed array literal" if it is loaded into one. If this happens, choose a different column type, or preprocess the data to steer away from arrays.

***

### C. What are Constraints?

A `column_constraint` can be:<br><br>
1. A __CHECK__ constraint: a boolean expression that every row must satisfy (the row is accepted when the expression evaluates to "true" or "null"). Two examples:
    - `price numeric CHECK (price > 0)` ---- (for 'checking' positive prices)
    - (There can be multiple CHECKs, including a final table-level "CHECK" that is not attached to any one particular column):<br>

```SQL
price numeric CHECK (price > 0), 
discounted_price numeric CHECK (discounted_price > 0), 
CHECK (price > discounted_price);
```

***    
2. __NOT NULL__ (constrain by only non-null's)
    - Example:
```SQL
name text NOT NULL,
price numeric NOT NULL CHECK (price > 0) 
```

    - A non-null filter like the one in the second example can also be applied at query time, such as:
    ```SQL
    select * from calendar
    where price is not null
    order by 4 DESC;
    ```
***
3. __UNIQUE__ (constrain by no duplicate values within that column)
    - Example: `product_no integer UNIQUE`

    - Can also combine columns, such as `UNIQUE (a, c)`, making sure there are no duplicate _combinations_ of "a _and_ c values". This means that for row values for columns a and c: <br>
    12 24 (1) <br>
    12 12 (2) <br>
    12 24 (3) <br>
    22 24 (4) <br>
    Row (3) would be rejected with an error because it duplicates row (1), and all other rows would be kept.
<br>
***
4. __PRIMARY KEY__ (equivalent to UNIQUE + NOT NULL on a column)
    - can also do PRIMARY KEY (a, c) too
<br><br>
Check the internet for the other constraints that can be done!
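
To tie the four constraint types together, here is a minimal sketch of one table that uses all of them, followed by three inserts that mirror the (a, c) rows discussed under UNIQUE. The `products` table and its columns are hypothetical.

```SQL
-- Hypothetical table combining the constraints described above.
CREATE TABLE products (
    product_no       INTEGER PRIMARY KEY,                 -- UNIQUE + NOT NULL
    name             TEXT NOT NULL,
    price            NUMERIC NOT NULL CHECK (price > 0),
    discounted_price NUMERIC CHECK (discounted_price > 0),
    a                INTEGER,
    c                INTEGER,
    CHECK (price > discounted_price),   -- table-level check across two columns
    UNIQUE (a, c)                       -- no two rows may share the same (a, c) pair
);

INSERT INTO products VALUES (1, 'widget', 10, 8, 12, 24);  -- accepted
INSERT INTO products VALUES (2, 'gadget', 10, 8, 12, 12);  -- accepted: (12, 12) is a new pair
INSERT INTO products VALUES (3, 'gizmo',  10, 8, 12, 24);  -- rejected: duplicate (12, 24)
```
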
***
***

### D. Adding Data to a Newly Created Table

Two Ways:<br>
1. Load an entire dataset into the table. _See_ Section I.2, _supra_, for more information on how to do that.
2. Manually entering in the data. We can do this via the `INSERT INTO` call! <br> 
__Example__:
```SQL
insert into cd.facilities -- table name (the "FROM" name)
    (facid, name, membercost, guestcost, initialoutlay, monthlymaintenance) -- The columns to insert 
    values -- (You don't need the above column name line if adding values to all columns in order...)
        (9, 'Spa', 20, 30, 100000, 800),
        (10, 'Squash Court 2', 3.5, 17.5, 5000, 80); -- Two rows of data are added here. Add as many as you like, which will be appended to the cd.facilities table ***in order*** of the rows listed!!!
```

For modifying/overwriting and deleting data, _see_ Section II.6 ("Modifying Data").
***

## 4. Altering an Already Created Table

__IMPORTANT!:__ <br>
Once a table is created, the order of the columns CANNOT be modified. What you _can_ do is rename all of the columns before populating with data by right clicking on each column and modifying until you are satisfied with the order...

### How to Alter Columns

__Answer:__ The `ALTER TABLE` call
```SQL
ALTER TABLE table_name
ALTER COLUMN column_name TYPE new_type; -- new_type is the desired data type
```

If that doesn't work because PostgreSQL cannot convert the existing values automatically, spell out the conversion with `USING`:
```SQL
ALTER TABLE table_name
ALTER COLUMN column_name TYPE new_type
USING column_name::new_type;
```

Example:
```SQL
ALTER TABLE calendar
    ALTER COLUMN price TYPE MONEY USING price::MONEY,
    ALTER COLUMN price SET NOT NULL; -- NOT NULL is set as a separate action, not inside TYPE
```
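
Beyond changing a type, `ALTER TABLE` covers the other common column changes. The sketch below reuses the `calendar` table from the example above; the `notes`/`comments` column is a hypothetical one added only for illustration.

```SQL
-- Hypothetical column on the calendar table used above.
ALTER TABLE calendar ADD COLUMN notes TEXT;                 -- add a new column (always appended last)
ALTER TABLE calendar RENAME COLUMN notes TO comments;       -- rename a column
ALTER TABLE calendar ALTER COLUMN comments SET DEFAULT '';  -- set a default for future inserts
ALTER TABLE calendar DROP COLUMN comments;                  -- remove it again
```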

***
***
# II. QUERYING TOOLS & SYNTAX

The heart and soul of SQL; otherwise, it would be called "Structured Just-Kinda'-Looking-At-The-Data Language", which doesn't have the same kind of ring and pithiness as does "Structured Query Language".<br>

### SECTION CONTENTS:
1. Basic Searching & Selection
2. Working with Strings and Datetimes
3. "Case When" Statements
4. Aggregation/Grouping and Subquerying
5. Joins and CTEs
6. Modifying Data
***
***

## 1. Basics of Querying: Searching & Selection

This is a mile-high overview of the format of a basic search, i.e., a search on a single data table with no grouping, no joins, and no subquerying:

```SQL
SELECT -- what you want to select
    column1,
    column5,
    column4 AS newcolumnname -- you can rename stuff (see II.1.A. infra)
FROM datatablename -- i.e., the source of the columns. If they are located in another table, you ***need*** to do a JOIN
WHERE 
    column5 > 0 -- limiting boolean(s) applied to the ***entire*** table output
    AND
    column4 > 0
    OR 
    (column5 < 0 AND column4 < 0)
    
ORDER BY column5 DESC -- Organizing the data by column data, and the direction of the order
LIMIT 10; -- Limiting output of query to number of lines
```

### Miscellaneous Notes Before We Jump In

- __Proper Terminology of SQL__
    - According to [PostgreSQL documentation](https://www.postgresql.org/files/documentation/pdf/12/postgresql-12-US.pdf), the list of columns to be selected is a "__Select List__", and the table name in the FROM section is called a "__Table Expression__". To avoid confusion, we will use more intuitive terms such as "list of columns" and "table name", respectively.<br><br>
- __How to Comment in SQL__ 
    - If you haven't figured it out, a double dash ('--') adds a comment into the query. 
        - Block comments use `/* comments */`, with _both_ ends of the comment designation required. PostgreSQL and MySQL both accept either style.
    - As any programmer will tell you, __COMMENTING IS KEY. COMMENTING IS REQUIRED.__ <br><br>
- __How to Format SQL Code__
    - The line breaks, capitalization, spacing, and indentation of the query are __ALL__ stylistic; ignoring them won't cause an error (you can even write the above query all on one line!). Their purpose is ease of reading.
        - The indentation and formatting presented here is typical SQL formatting (well..._I_ like to think it's pretty legit...).
        - The optional capitalization applies to SQL commands as well as to unquoted names of tables and columns, which PostgreSQL folds to lowercase.
    - The terminal semicolon signals the end of a statement. Many tools let you omit it for a single query, but it is required to separate multiple statements, so it is a good habit to always include it.
    
<br>

Subsections are ordered in the syntactical order of a PostgreSQL query.

### A. SELECT

Using __`SELECT`__ is often the first thing you want to do. You "select" which columns you would like to output from the query, with each column separated by a comma (listing "things" in a series in SQL requires commas, or else you'll get an error!). 
- __The order of the SELECT is important!!!__ What is outputted is a table with the columns in the order in which you listed.
- If you want to select the entire dataset, you use an asterisk (\*), such as `SELECT * FROM datatable1`. 
- If you only want to obtain _unique_ values, you would add __`DISTINCT`__ to SELECT: `SELECT DISTINCT ... FROM datatable`. <br>
<br>
You can also manipulate what you are selecting in line: <br>

```SQL
SELECT
    column1 AS Total_View,
    column2 * 1.06, -- Yes, you can do any arithmetical operations in line in SQL!
    column5 * column7 AS Total_Revenue
FROM 
    datatablename1;
```

As you can see, the selected data can be manipulated to output something different, such as a dual-column multiplication to output a column of the multiplication result. Sometimes, you will want to manipulate data based on data _already_ manipulated. To do this two-step manipulation process, we will need to do a ___subquery___ and then select from that. _See_ Section II.4.D, _infra_.
- This two-step concept is also used for joining multiple tables - you do a "sub-join" (whose result is a "Common Table Expression", _see_ Section II.5, _infra_), and then join a third table to that.

***

#### Sidenote: What is 'AS'???

The __`AS`__ call is used to ___rename___ _any_ data object, whether it be a column name, table name (though not advised), or to assign a name to a full conditional/loop, subquery, or CTE. If renaming a column, the new name will appear in the query output.
- Using the word 'AS' in the query is optional. The query would produce the same result for column1 if it was just `column1 Total_View`.
- If you want multiple words as the new name, you need double quotes: `column1 AS "Total View"`. _See_ II.2., _infra_ for more information on the syntax of strings.
- If `column5*column7` was not renamed, PostgreSQL would name the output column `?column?`. Ugly.
    - The same applies to `column2 * 1.06` here: any unnamed expression gets the placeholder name `?column?`, whereas a plain column such as `column2` would keep its own name.
***

### B. WHERE

#### (i) Non-Aggregate Queries - The Basics of WHERE

__`WHERE`__ is the process to screen out undesired data. This is done via a boolean expression, and the expression _can_ include a column not specified in the select clause!<br>
__Examples:__
- `WHERE column2 > 0` (Simplest example) <br><br>
- `WHERE column2 * 1.06 > 0` (Simplest example plus in line arithmetic) <br><br>
- `WHERE column2 IS NOT NULL` (Excluding null values)<br><br>
- `WHERE column3 LIKE 'Mr.%'` (Use "LIKE" for string pattern matching; here, any column3 value that starts with 'Mr.'. _See_ II.2. for more information)<br><br>
- `WHERE column3 > 0 AND (column4 < 0 OR column5 > column6)` (Using boolean operators to combine multiple conditions to be satisfied)<br><br>
- `WHERE column4 > '2012-09-01'` (Datetime values natively understand inequality operators; here, the column4 value must be _after_ the YYYY-MM-DD date)<br><br>
- `WHERE column1 BETWEEN 20 and 50` (Equivalent to column1 >= 20 AND column1 <= 50)
    - `WHERE column1 NOT BETWEEN 20 and 50` (Equivalent to column1 < 20 OR column1 > 50)<br><br>
- `WHERE column3 IN (SELECT column2 FROM datatable2)` ("If the column3 value is also found anywhere in column2 of the second data table, it will satisfy this WHERE and will be outputted")<br><br>
- `WHERE columncountry NOT IN ('Germany', 'UK', 'France')` ("Output every value in columncountry that is not found in that created list").
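
Several of the conditions above can be combined in one query. This is a minimal sketch that assumes the `cd.members` table from the exercises has `firstname`, `surname`, and `joindate` columns; the specific names being matched are made up.

```SQL
-- Assumes cd.members(firstname, surname, joindate); the literal values are illustrative.
SELECT firstname, surname, joindate
FROM cd.members
WHERE joindate IS NOT NULL                 -- exclude missing dates
    AND joindate >= '2012-08-01'           -- on or after 1 Aug 2012
    AND joindate < '2012-10-01'            -- strictly before 1 Oct 2012
    AND (surname LIKE 'S%' OR firstname IN ('Tim', 'Anne'))
ORDER BY joindate;
```

The parentheses matter: AND binds more tightly than OR, so without them the OR branch would bypass the date conditions.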

#### (ii) WHERE Functionality in Aggregate Queries - "HAVING"

(_See_ Section II.4 for more information on Aggregation in SQL queries)<br><br>

When using an aggregated column in a WHERE clause, the correct syntax is to use the call __`HAVING`__, exemplified below:

```SQL
SELECT
	facid,
	SUM(slots) as "Total Slots"
FROM
	cd.bookings
GROUP BY 1
HAVING SUM(slots) > 1000

ORDER BY facid;
```

The practical effect of HAVING is the same as WHERE: it limits the output based on a condition. The distinction is in timing: WHERE screens out input rows _before_ they are grouped and aggregated, whereas HAVING lets all remaining rows be aggregated first and _then_ screens out the aggregated groups based on the condition. This is why an aggregate such as SUM(slots) can appear in HAVING but not in WHERE. The more you know!
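
WHERE and HAVING can appear in the same query. A sketch on the same `cd.bookings` table (the date range and the threshold of 100 are arbitrary choices for illustration):

```SQL
SELECT
    facid,
    SUM(slots) AS "Total Slots"
FROM cd.bookings
WHERE starttime >= '2012-09-01'      -- row filter: applied before grouping
    AND starttime < '2012-10-01'
GROUP BY facid
HAVING SUM(slots) > 100              -- group filter: applied after aggregation
ORDER BY facid;
```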

***
### C. ORDER BY

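__`ORDER BY`__ organizes the data by the defined column(s). Like the WHERE section, the ORDER BY section can usually refer to columns that are not in the SELECT section; the main exceptions are queries using SELECT DISTINCT or GROUP BY, which restrict ORDER BY to the selected (or grouped/aggregated) values.<br><br>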
- This section can include multiple ORDER BY columns. For example, if it was ORDER BY column1, column3, SQL would order the output by column1 values, then once that is done, if there are any ties in the rows of data, those ties would be ordered by values in column3.<br><br>
- Ordering can be done either by ASCending or DESCending, such as ORDER BY column1 DESC. ASC is the default, so it need not be specified in the section (i.e., "ORDER BY column1" = "ORDER BY column1 ASC").<br><br>
- Instead of re-writing the entire column names in the ORDER BY section, one can simply refer to the outputted columns by the column number. See below for an example.
    - If using names in the ORDER BY section, either the original column name or the new name assigned by AS will work. That means ORDER BY column1 and ORDER BY Total_View are equivalent, as are ORDER BY column5\*column7 and ORDER BY Total_Revenue. Similarly, for an aggregate function, like SUM(column2), you can ORDER BY SUM(column2) or by its alias. Note that an alias cannot be used inside a larger expression in ORDER BY, and it cannot be used in WHERE or HAVING at all.
<br>

__Full Example:__

```SQL
SELECT
    column1 AS Total_View,
    column2 * 1.06,
    column5 * column7 AS Total_Revenue
FROM 
    datatablename1
ORDER BY 3 DESC, 1;
```

This example translates to:
> _Select column1 (1), column2 (2), and column5\*column7 (3). Organize these values by first listing the values in column5\*column7 (identified by the third-listed column selection, i.e., "3") by largest values to smallest, a/k/a "descending" order. If there are remaining unordered rows (by virtue of there being ties in the values in (3)), then order by column1 (1) from smallest to largest values, a/k/a "ascending" order._
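
Two more ORDER BY variations that PostgreSQL accepts, sketched against the `cd` tables used elsewhere in these notes: ordering by an output alias, and ordering by a column that is not selected at all.

```SQL
-- Order by the alias given in the select list, then break ties by facid.
SELECT
    facid,
    SUM(slots) AS total_slots
FROM cd.bookings
GROUP BY facid
ORDER BY total_slots DESC, facid;

-- Order by a column (joindate) that does not appear in the output.
SELECT firstname, surname
FROM cd.members
ORDER BY joindate DESC;
```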

***
### D. LIMIT

The simplest of the sections: __`LIMIT`__ "limits" the output of a query to a specified number of lines. For example, LIMIT 1000 is the first 1000 rows of data. LIMIT 1 is just one row of data.<br><br>
- ORDER BY is important in these instances!
    - A janky way of getting the max value of something is to ORDER BY that column DESC and LIMIT 1. That one row yields the max value of the ordered column (if the column contains NULLs, add NULLS LAST, since PostgreSQL sorts NULLs first in descending order). _See_ Aggregation, Section II.4.C, _infra_, for the correct way of getting a max value (or any other aggregated value).<br><br>
- After LIMIT is the rarely used __`OFFSET`__, which is used to skip X number of rows before yielding values. For the same obvious reasons as ORDER BY in the context of LIMIT, ORDER BY is extremely important when using OFFSET; you don't want to offset rows that you want to see!    
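
A small sketch of LIMIT and OFFSET working together as simple pagination, again assuming the `cd.members` columns used elsewhere in these notes:

```SQL
SELECT firstname, surname, joindate
FROM cd.members
ORDER BY joindate DESC   -- without ORDER BY, which rows get skipped is arbitrary
LIMIT 10 OFFSET 20;      -- skip the 20 newest members, return the next 10
```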

***
***
## 2. WORKING WITH STRINGS IN SQL

### A. Using Strings in a SQL Query

SQL can handle string values (including datetime values written as text), which are always represented in single quotes (' ').
- Double quotes are reserved for identifiers, i.e., names of columns and tables, which is why a multi-word name after AS uses them. This is standard SQL rather than a PostgreSQL-specific thing.
<br>

For WHERE and CASE WHEN sections, where strings can be placed into boolean checks, we use `LIKE`. The table below gives some examples.
<br>

Separately, one can use __`SIMILAR TO`__ instead of LIKE to utilize SQL-standard regular expression syntax, or the `~` operator for full POSIX regular expressions. Regex syntax is not covered in these notes [but can easily be found on the internet](https://www.petefreitag.com/cheatsheets/regex/).


| QUERY | BOOLEAN<br>RESULT | EXPLANATION | 
| :----: | :--------: | :------: |
| `'abc' LIKE 'abc'` | TRUE | Exact match between the string ('abc')<br>and what is being checked ('abc') | 
| `'abc' LIKE 'c'` | FALSE | Needs to be exact match for it to be True. |
| `'abc' LIKE 'a%'` | TRUE | The % sign signals to match with<br>"0 or more characters following the 'a' character".<br>This means '%a%' would still be true. |
| `'abc' LIKE '%a'` | FALSE | The LIKE is looking for any string _ending_ in 'a' here,<br>which does not match with 'abc'. |
| `'abc' LIKE '_b_'` | TRUE | The '\_' character means any one single character. |
| `'abc' LIKE '%ab_%'` | TRUE | Incorporates both above '%' and '\_'<br>special characters in one boolean check. |
| `'abc' SIMILAR TO 'a[bcde]c'`| TRUE | \[ \] designates any letter within the bracketed list<br>in that position. Equivalent to `a[b-e]c`.<br>Brackets require `SIMILAR TO`; PostgreSQL's `LIKE` treats them literally. |
| `'abc' SIMILAR TO 'a[^bonp]c'`| FALSE | ^ designates any letter _not_ within the bracketed list<br>in that position ('b' is in the list, so no match). Adding another bracket would be<br>interpreted as a new character; use only one bracket. |

### B. String Concatenation

If you have two columns, First_Name and Last_Name, and you wanted to output the full name in a single column, how would you do that?<br>
__Answer:__ _String Concatenation!_ This is done via the double-pipe operator `||`.
<br>

For example, if you had "Roy" and "Robinson" in the respective columns, you would do:
```SQL
SELECT
    First_Name ||' '|| Last_Name AS Name -- Need the space (' ') so the output isn't RoyRobinson
FROM namelist;
```
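
One thing to watch for: if either side of `||` is NULL, the whole result is NULL. The built-in `CONCAT` and `CONCAT_WS` functions skip NULLs instead. A quick sketch with literal values:

```SQL
SELECT
    'Roy' || ' ' || 'Robinson'              AS full_name,      -- Roy Robinson
    'Roy' || ' ' || NULL                    AS null_result,    -- NULL
    CONCAT('Roy', ' ', NULL)                AS concat_result,  -- 'Roy ' (the NULL is ignored)
    CONCAT_WS(' ', 'Roy', NULL, 'Robinson') AS ws_result;      -- Roy Robinson
```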

***
### C. A Note on Datetime

Datetime data objects are different than strings but are treated the same way: they are identified by a YYYY-MM-DD format enveloped by single quotes, such as `'2012-09-01'`. Such datetime values are typically encoded with a 24-hour timestamp as well (HH:MM:SS.nnnnnn), such as `'2012-09-01 08:44:01'`. Identifying a date without a timestamp is equivalent to YYYY-MM-DD 00:00:00.
<br>

Limiting by time can be done via simple inequality. For example, if you were looking for values within the month of September 2012, it would be:
```SQL
SELECT joindate, [...]
FROM datatable
WHERE
    joindate >= '2012-09-01' AND
    joindate < '2012-10-01';
```
#### (i) Additional Datetime Functionality

Not surprisingly, datetime data objects have additional functions that string data objects do not have. Examples are:
- __`EXTRACT`:__ 
    - EXTRACT desired level of timing (year, month, hour, etc.) from a timestamp or manually created time interval. 
    - Example: `SELECT EXTRACT(hour FROM timestamp '2001-02-16 20:38:30')` yields '20'.
    <br><br>
- __`DATE_TRUNC`:__
    - Round down a timestamp to a desired level of timing. 
    - Example: `SELECT DATE_TRUNC('hour', timestamp '2001-02-16 20:38:30')` yields '2001-02-16 20:00:00'.
    <br><br>
- __`AGE`:__ 
    - Subtracts the timestamp from CURRENT_DATE (current_date automatically determined from database clock).
    - Does not yield time-of-day precision - subtraction is done from midnight of current_date.
    - Example: `SELECT AGE(timestamp '2001-02-16')` yields '18 years 10 months 12 days' (from 2019-12-28).
<br><br>
Example in an actual query:
```SQL
select facid, extract(month from starttime) as month, sum(slots) as "Total Slots"
	from cd.bookings
	where
		starttime >= '2012-01-01'
		and starttime < '2013-01-01'
	group by facid, month
order by facid, month;   
```
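
The query above uses EXTRACT, which returns only the month number, so it relies on the WHERE section to stay within a single year. A sketch of the same idea with DATE_TRUNC, which keeps the year attached to each month:

```SQL
SELECT
    DATE_TRUNC('month', starttime) AS month_start,  -- e.g. 2012-09-01 00:00:00
    SUM(slots) AS total_slots
FROM cd.bookings
WHERE starttime >= '2012-01-01'
    AND starttime < '2013-01-01'
GROUP BY 1
ORDER BY 1;
```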

***
***
## 3. CASE WHEN STATEMENTS

A __`CASE WHEN`__ statement is nothing more than an if/else conditional evaluated for each row of the table. These are commonly used as classification techniques.  The general syntax is:
> `CASE WHEN condition1 THEN value1 ELSE value2 END AS newcolumnname`

__Example:__

```SQL
SELECT player_name, -- While in this example, weight is listed as a column,
       weight,      -- it is not required to be listed for the CASE WHEN (similar to WHERE in this regard)
       CASE WHEN weight > 250 THEN 'over 250'
            WHEN weight > 200 THEN '201-250' -- This is how you do multiple conditions in one CASE
            WHEN weight > 175 THEN '176-200' -- Don't need BETWEEN's since the query is analyzed top-down!
            ELSE '175 or under' END AS weight_group
FROM benn.college_football_players
```

Again, the line breaks, indentation, and other stylistic functions are for ease of code reading - the query can be compiled all in one line, with or without parentheses, as seen below:

```SQL
SELECT
    (CASE WHEN dockcount > 20 THEN 'large' ELSE 'small' END) station_size,
    COUNT(*) as station_count
FROM 
    stations
GROUP BY 1;
```

As a final note, CASE WHEN results can be incorporated into further arithmetic operations, as shown below:

```SQL
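SELECT 
    fac.name, 
    SUM(bks.slots * CASE -- here, the results of a CASE WHEN are being multiplied by the value of another column
        WHEN bks.memid = 0 THEN fac.guestcost
        ELSE fac.membercost
        END) AS revenue
FROM cd.bookings bks
    JOIN cd.facilities fac -- name and the cost columns live in cd.facilities, so a join is needed (see Section II.5)
    ON bks.facid = fac.facid
GROUP BY fac.name
ORDER BY revenue;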
```

***
***
## 4. AGGREGATION, GROUPING, & SUBQUERYING

### A. Definition

__Aggregation__ is a fancy word for combining all values in a column, and picking a ___single___ value from that grouping, whether it be the maximum value of the group, the average value of the group, the number count of values of the total group, sum, median, etc. <br>

### B. Syntax of Aggregation and the Reasoning Behind it

Because aggregating things results in a single aggregated value, we need to create a __`GROUP BY`__ section to yield a full table. For example, SELECT MAX(weight) FROM vehicles will give one value: the maximum weight listed in the entire vehicles table. The moment we select something else, such as SELECT vehicle_type, MAX(weight) FROM vehicles, SQL won't know how to reconcile a single value column with a column of rows of different types of vehicle values spanning the entire length of the vehicles dataset. By doing SELECT vehicle_type, MAX(weight) FROM vehicles GROUP BY vehicle_type, SQL now knows to output a table of the maximum weight of the vehicle in _each_ vehicle type category. 
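
Written out as code, the three steps in the paragraph above look like this. The `vehicles` table with `vehicle_type` and `weight` columns is hypothetical.

```SQL
-- Hypothetical vehicles(vehicle_type, weight) table.
SELECT MAX(weight) FROM vehicles;                   -- one row: the overall maximum

-- SELECT vehicle_type, MAX(weight) FROM vehicles;  -- error: vehicle_type must appear in GROUP BY

SELECT vehicle_type, MAX(weight) AS max_weight
FROM vehicles
GROUP BY vehicle_type;                              -- one row per vehicle type
```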

Based on this methodology, there are __TWO AGGREGATION RULES THAT MUST BE FOLLOWED:__
1. An aggregated column __must NOT__ be in the GROUP BY section.
2. __ALL non-aggregated columns MUST__ be found in the GROUP BY section.<br>

If either of these two rules is not followed, you will get an error in compilation. Here is an example of an aggregated query:

```SQL
SELECT
    city, 
    lat AS latitude,
    longt AS longitude,
    MAX(time),
    COUNT(*) AS station_count -- While it is optional, it's a good idea to rename aggregated columns
FROM
    stations
GROUP BY 1, 2, 3; -- Like ORDER BY, you can use column numbers instead of full names in this section
```

***
### C. Common Pitfalls During SQL Aggregation

1. Order of operations is important. If you wanted to find [the total number of members who have made at least one booking](https://pgexercises.com/questions/aggregates/members1.html), you would need to do `SELECT COUNT(DISTINCT memid) FROM cd.bookings` and not `SELECT DISTINCT COUNT(memid) FROM cd.bookings`. The latter counts every booking first and only then applies DISTINCT to the single resulting number, which does nothing.<br><br>

2. If you are trying to output a _single row_ of an aggregated column alongside an _entire non-aggregated column_, well, you can't really do that. For example, `select firstname, surname, max(joindate) from cd.members` will throw an error, since SQL can't reconcile the aggregated single value with entire database rows. To address this, you can do one of two things:
    - a. To get a max value, ORDER BY DESC the column containing the desired value, and LIMIT 1 for that single row. _See_ Section II.1.D., _supra_. It's a cheap workaround, but it gives you what you want. This obviously doesn't work for averages or other non-extrema values.
        - This also doesn't account for the possibility that there be multiple rows with the same max value for that column! <br><br>
        
    - b. You do a ___SUBQUERY___! See below subsection for more information. 
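
The first pitfall is easy to see on a tiny made-up dataset. This sketch builds six inline "bookings" made by three different members, so no real table is needed:

```SQL
-- Inline sample data: six bookings by members 1, 2, and 3.
WITH bookings(memid) AS (VALUES (1), (1), (2), (3), (3), (3))
SELECT
    COUNT(memid)          AS all_bookings,      -- 6
    COUNT(DISTINCT memid) AS distinct_members   -- 3
FROM bookings;
```

Running `SELECT DISTINCT COUNT(memid)` on the same data would return 6, since DISTINCT is applied to the one-row result.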

***
### D. Subquerying

__Subquerying:__ As the name implies, subquerying is a query within another query. It is identified by a parenthetical expression with a separate SELECT inside.

Applying subquerying to the example in Section II.4.C., the correct syntax is: 
```SQL
    SELECT 
        firstname, surname, joindate
	FROM cd.members
	WHERE joindate = 
		(SELECT MAX(joindate) 
			FROM cd.members);      
```

The pseudocode explanation is:<br>
> _Select the three columns from the table cd.members. Limit that output to only those values within the joindate column that equal the value of MAX(joindate) from the table cd.members._


As seen in the above example, a subquery is used as a limitation to the output that you desire. In addition to aggregation queries, this limiting functionality can be applied to [the intersection of joining tables](https://pgexercises.com/questions/joins/sub.html).
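
A subquery can also sit in the FROM section, which is how the "two-step manipulation" mentioned in Section II.1.A is done: the inner query builds a result, and the outer query selects from it. A minimal sketch, assuming (as in the exercises) that each slot in `cd.bookings` is half an hour:

```SQL
SELECT
    facid,
    total_slots,
    total_slots * 0.5 AS total_hours   -- assumption: one slot = half an hour
FROM (
    SELECT facid, SUM(slots) AS total_slots
    FROM cd.bookings
    GROUP BY facid
) AS per_facility                      -- name the subquery; older PostgreSQL versions require an alias
WHERE total_slots > 1000               -- the alias is usable here because it is a real column of the subquery
ORDER BY total_slots DESC;
```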

***
***
## 5. JOINS AND COMMON TABLE EXPRESSIONS ("CTEs")

### A. Definition

In the opening paragraph to Section II, I said that querying was the "heart and soul" of SQL. Joining is the heart and soul of the heart and soul. __Joining__ is combining two data tables via a column in common between the two. The syntax is as follows:

```SQL
SELECT [...] 
FROM table1 
    JOIN table2 
    ON table1.mutualcolumn = table2.mutualcolumn
```

### B. Basic Joining of Two Data Tables

Here is an example of a __`JOIN`__:

```SQL
SELECT
	bks.starttime as start,
	fac.name as name
FROM
	cd.bookings bks
	LEFT JOIN cd.facilities fac
	ON bks.facid = fac.facid
WHERE
	bks.starttime > '2012-09-21' AND
	bks.starttime < '2012-09-22' AND
	fac.name LIKE 'Tennis Court %';
```
#### Example Breakdown:

1. The joining column does not need to be in the SELECT section.


2. The joining column does not need to have the same column name in the two data tables. Here both tables happen to call it facid, but `ON bks.facid = fac.id` would work just as well if the second table named its column id. All that matters when it comes to the joining column is that the column _values_ have to be the same between the two tables.


3. The preceding shorthand of "bks" and "fac" (a table _alias_) identifies which table each column comes from. Aliases are assigned in the FROM section, where you can make the designation anything you want.

    - Aliases aren't required. You can write cd.bookings.starttime AS start and cd.facilities.name instead. They are almost always used (1) for ease of reading and (2) to avoid tediously retyping table names.
    <br><br>
    - Regardless of whether you use aliases or not, any column name that exists in more than one of the joined tables MUST be prefixed with its table; otherwise PostgreSQL rejects the column reference as ambiguous. Prefixing every column is a good habit, because it shows the reader where each column comes from.

4. You _can_ join a table [onto itself!](https://pgexercises.com/questions/joins/self.html)
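
To make point 4 concrete, here is a minimal sketch of a self-join. The table `demo_members` and its rows are hypothetical, loosely modeled on the memid and recommendedby columns of cd.members. The trick is to list the same table twice under two different aliases, so that each alias behaves like its own table.

```SQL
-- Hypothetical stand-in for cd.members, trimmed to the columns needed here.
CREATE TEMP TABLE demo_members (
    memid         INTEGER PRIMARY KEY,
    firstname     TEXT,
    recommendedby INTEGER  -- memid of the recommending member, NULL if none
);

INSERT INTO demo_members (memid, firstname, recommendedby) VALUES
    (1, 'Darren', NULL),
    (2, 'Tracy',  1),
    (3, 'Tim',    1),
    (4, 'Janice', 2);

-- "mems" is the member being listed; "recs" is a second copy of the same
-- table, used to look up whoever recommended that member.
SELECT
    mems.firstname AS member,
    recs.firstname AS recommender
FROM demo_members mems
    LEFT JOIN demo_members recs
    ON recs.memid = mems.recommendedby
ORDER BY mems.memid;
```

With these rows, Tracy and Tim are both paired with Darren, Janice is paired with Tracy, and Darren (who has no recommender) is kept with a NULL recommender because of the LEFT JOIN.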

***
### C. Types of Joins

There are four types:


1. __`INNER JOIN`__
    - Join only those rows with an overlap in values in the join column between the two tables. Non-overlap rows are discarded. 
    - This is the default (and most common) join: saying `JOIN ___ ON___` is equivalent to `INNER JOIN ____ ON ____`.
    
    
2. __`LEFT OUTER JOIN`__
    - This join returns every row of the data table "to the left" of the JOIN call. Where a left-table row has no matching value in the right data table, the right table's columns are listed as NULL in that output row.
    - Functionally equivalent to `LEFT JOIN ___ ON___`.
    
    
3. __`RIGHT OUTER JOIN`__
    - Same thing as LEFT JOIN, but every row of the data table on the right side of the JOIN is returned, and the left data table's columns are NULL wherever there is no match with the right data table.
    - Functionally equivalent to `RIGHT JOIN ___ ON___`.
    - If possible, use LEFT JOINs because humans read left to right, and no one wants to wade through NULLs before getting to what they want!
    
    
4. __`FULL OUTER JOIN`__
    - This treats both data tables as optional data; _every_ row from both tables is returned, and the columns of whichever side has no match are NULL.
    - Functionally equivalent to `FULL JOIN ___ ON___`. (PostgreSQL does not accept a bare `OUTER JOIN`.)
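
The four join types can be compared side by side with a small sketch. The tables `demo_bookings` and `demo_facilities` and their rows are hypothetical; they are set up so that one booking points at a facility that does not exist and one facility has no bookings.

```SQL
-- Hypothetical toy tables (not the real cd schema).
CREATE TEMP TABLE demo_facilities (facid INTEGER, name TEXT);
CREATE TEMP TABLE demo_bookings (bookid INTEGER, facid INTEGER);

INSERT INTO demo_facilities (facid, name) VALUES
    (0, 'Tennis Court 1'),
    (1, 'Tennis Court 2'),
    (2, 'Badminton Court');   -- never booked

INSERT INTO demo_bookings (bookid, facid) VALUES
    (10, 0),
    (11, 0),
    (12, 1),
    (13, 9);                  -- facid 9 has no row in demo_facilities

-- INNER JOIN: only the 3 bookings that match a facility.
SELECT bks.bookid, fac.name
FROM demo_bookings bks
    INNER JOIN demo_facilities fac ON bks.facid = fac.facid;

-- LEFT JOIN: all 4 bookings; booking 13 gets a NULL name.
SELECT bks.bookid, fac.name
FROM demo_bookings bks
    LEFT JOIN demo_facilities fac ON bks.facid = fac.facid;

-- RIGHT JOIN: the 3 matches plus 'Badminton Court' with a NULL bookid (4 rows).
SELECT bks.bookid, fac.name
FROM demo_bookings bks
    RIGHT JOIN demo_facilities fac ON bks.facid = fac.facid;

-- FULL JOIN: the 3 matches plus both unmatched rows (5 rows).
SELECT bks.bookid, fac.name
FROM demo_bookings bks
    FULL JOIN demo_facilities fac ON bks.facid = fac.facid;
```

Running the four queries in order and counting the rows (3, 4, 4, 5) is a quick way to confirm which side each join treats as optional.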

***
### D. Joining Multiple Tables

This is simpler than it sounds: it is simply doing multiple JOIN lines! <br>
__Example:__

```SQL
SELECT DISTINCT
	mems.firstname ||' '|| mems.surname as member,
	facs.name as facility

FROM cd.members mems
	JOIN cd.bookings bks on mems.memid = bks.memid
	JOIN cd.facilities facs on bks.facid = facs.facid
WHERE
	facs.name LIKE 'Tennis Court%'
```

In this example, cd.bookings and cd.members have a column with common values: memid. Once that join is finished, that joined entity (the "new table") is then joined to cd.facilities on facid. The order and conceptualization of the joins are important if you intend to do a left or right join.

### E. Performing a Join via a Subquery

This is a little more advanced but nonetheless important enough to include in these notes.

__[Question:](https://pgexercises.com/questions/joins/sub.html)__ Output a list of all members, including the individual who recommended them (if any), without using any joins.

__Answer:__

```SQL
select distinct mems.firstname || ' ' ||  mems.surname as member,
	(select recs.firstname || ' ' || recs.surname as recommender 
		from cd.members recs 
		where recs.memid = mems.recommendedby
	)
	from 
		cd.members mems
order by member;
```

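__Pseudocode Explanation:__
> _For each member row in cd.members (mems), run the parenthetical query against a second copy of cd.members (recs) to find the one row whose memid equals that member's recommendedby value, and output that person's full name as the recommender. Members with no recommender get NULL in that column, which is the same result a LEFT JOIN of the table onto itself would give._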


Straight JOINS



CTEs (Documentation 7.8, p. 140)


Subquerying

***
***

## 6. MODIFYING OR UPDATING EXISTING DATA

## JOINS

```SQL
-- JOINs join MULTIPLE TABLES together.
-- In this instance, MUST identify the table that a specific column comes from.
SELECT
    trips.trip_id,
    trips.start_station,
    stations.lat,
    stations.long
FROM
    trips
JOIN
    stations
ON
    trips.start_station = stations.name; -- where to do the join
-- Note that the VALUES of the two ON columns ***MUST BE THE SAME TO DO THE JOIN***
-- (the column names need not be the same, just the values of those columns).
-- If two tables share no column, see the example in CTEs below
-- on how to get around this (do multiple joins across tables).

-- Can also use "aliases" to shorten table names. The above is the same as:
SELECT
    t.trip_id,
    t.start_station,
    s.lat,
    s.long
FROM
    trips t
JOIN
    stations s
ON
    t.start_station = s.name;
```

__Types of joins:__

1. INNER JOIN: returns only matching rows, dropping everything else. This is the SQL default, i.e., the same as JOIN.
2. FULL OUTER JOIN: returns everything from both tables, leaving non-matches as null values. Not recommended, as it may choke up the computer when viewing large databases.
3. LEFT JOIN: returns everything in the left table, leaving non-matches from the right table as null values.
4. RIGHT JOIN: returns everything in the right table, leaving non-matches from the left table as null values. This is more difficult to read since we read left to right; it is better to flip the tables around in the query and do a left join instead.


## COMMON TABLE EXPRESSIONS (CTEs)

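A CTE lets you join a table onto a previously processed query (as opposed to joining two tables). This is important because aggregation functions happen AFTER joins occur, so anything that has to be aggregated before the join must be computed in its own query first.

The general form of a CTE is:

```SQL
WITH
    name_of_cte    -- name of the CTE to refer to later
AS (
    [first query]
)
[second query, which can use name_of_cte like a table];
```
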
__Example to break down:__

```SQL
WITH
    locations
AS (
    -- A simple query to get the averages of lat and long on a city level.
    SELECT
        city,
        AVG(lat) AS lat,
        AVG(long) AS long
    FROM
        stations
    GROUP BY 1
)

-- Joining the locations table we created with the trips table to count trips.
SELECT
    l.city,
    l.lat,
    l.long,
    COUNT(*)
FROM
    locations l

-- We need an intermediate join to go from locations to stations
-- because the trips table does not have a "city" column.
JOIN
    stations s
ON
    l.city = s.city
JOIN
    trips t
ON
    t.start_station = s.name
GROUP BY 1,2,3;
```
__Breakdown of example:__

Under *locations*, I find the averages of the lat and long coordinates for the stations of every city. I then want to join those values onto the trips table. HOWEVER, because the trips table does not have a city name column, I must join locations.city onto stations.city, THEN DO A SECOND JOIN of trips.start_station onto stations.name.

Because both joins are inner joins, only stations with at least one trip survive, and each joined row represents one trip. Grouping by city, lat, and long then outputs one row per city: the average lat, the average long, and the count of the number of trips that started in that city.
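
The CTE pattern above can be tried on a tiny dataset. This is a minimal sketch: the tables `demo_stations` and `demo_trips`, the city names, and the coordinates are all hypothetical. It also shows why the averages are computed in the CTE. If `AVG` is applied after joining to trips, a station is counted once per trip, so busy stations pull the average toward themselves.

```SQL
-- Hypothetical toy versions of the stations and trips tables.
CREATE TEMP TABLE demo_stations (name TEXT, city TEXT, lat NUMERIC, long NUMERIC);
CREATE TEMP TABLE demo_trips (trip_id INTEGER, start_station TEXT);

INSERT INTO demo_stations (name, city, lat, long) VALUES
    ('A St', 'Springfield', 10.0, 20.0),
    ('B St', 'Springfield', 12.0, 22.0),
    ('C St', 'Shelbyville', 30.0, 40.0);

INSERT INTO demo_trips (trip_id, start_station) VALUES
    (1, 'A St'),
    (2, 'A St'),
    (3, 'B St'),
    (4, 'C St');

-- With the CTE: averages are taken per city BEFORE joining to trips.
-- Springfield comes out as lat 11, long 21, with 3 trips.
WITH locations AS (
    SELECT city, AVG(lat) AS lat, AVG(long) AS long
    FROM demo_stations
    GROUP BY city
)
SELECT l.city, l.lat, l.long, COUNT(*) AS trips
FROM locations l
    JOIN demo_stations s ON l.city = s.city
    JOIN demo_trips t ON t.start_station = s.name
GROUP BY l.city, l.lat, l.long
ORDER BY l.city;

-- Without the CTE: AVG runs over the joined rows, so 'A St' (2 trips)
-- is counted twice and Springfield's lat becomes (10 + 10 + 12) / 3.
SELECT s.city, AVG(s.lat) AS lat, COUNT(*) AS trips
FROM demo_stations s
    JOIN demo_trips t ON t.start_station = s.name
GROUP BY s.city
ORDER BY s.city;
```

Both queries report 3 trips for Springfield, but only the CTE version reports the plain average of the two station latitudes.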

# III. MISCELLANEOUS SQL FUNCTIONS AND NOTES

## 1. Arithmetic 

Round (Ceil/Floor)
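
As a placeholder for this subsection, here is a minimal sketch of the three functions named above, applied to literal values chosen only for illustration. `ROUND` takes an optional second argument for the number of decimal places, `CEIL` rounds up to the next whole number, and `FLOOR` rounds down.

```SQL
SELECT
    ROUND(2.567, 2) AS rounded_2dp,  -- 2.57
    ROUND(2.567)    AS rounded,      -- 3
    CEIL(2.1)       AS ceiling_pos,  -- 3
    FLOOR(2.9)      AS floor_pos,    -- 2
    CEIL(-2.1)      AS ceiling_neg,  -- -2 (up means toward positive infinity)
    FLOOR(-2.1)     AS floor_neg;    -- -3 (down means toward negative infinity)
```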

## 2. UNION vs. INTERSECT vs. EXCEPT

```SQL
select surname
from cd.members
union
select name
from cd.facilities;
```

The UNION operator does what you might expect: it combines the results of two SQL queries into a single table. The caveat is that the results of both queries must have the same number of columns and compatible data types.

UNION removes duplicate rows, while UNION ALL does not. Use UNION ALL by default, and switch to UNION only when you need duplicates removed, since removing them is extra work for the database.
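
The set operators named in this section's heading can be compared on a small example. This is a minimal sketch: the tables `demo_surnames` and `demo_names` and their values are made up, with one value ('Baker') deliberately present in both tables and one value ('Smith') duplicated within a table.

```SQL
-- Hypothetical single-column tables (values are made up).
CREATE TEMP TABLE demo_surnames (surname TEXT);
CREATE TEMP TABLE demo_names (name TEXT);

INSERT INTO demo_surnames (surname) VALUES ('Smith'), ('Smith'), ('Baker');
INSERT INTO demo_names (name) VALUES ('Baker'), ('Tennis Court 1');

-- UNION: all values from both queries, duplicates removed (3 rows).
SELECT surname FROM demo_surnames
UNION
SELECT name FROM demo_names;

-- UNION ALL: all values from both queries, duplicates kept (5 rows).
SELECT surname FROM demo_surnames
UNION ALL
SELECT name FROM demo_names;

-- INTERSECT: only values that appear in both queries (1 row: 'Baker').
SELECT surname FROM demo_surnames
INTERSECT
SELECT name FROM demo_names;

-- EXCEPT: values from the first query that are absent from the second
-- (1 row: 'Smith', since EXCEPT also removes duplicates).
SELECT surname FROM demo_surnames
EXCEPT
SELECT name FROM demo_names;
```

As with UNION, the two queries on either side of INTERSECT or EXCEPT must return the same number of columns with compatible data types.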

Add Subsection names in any _see_ infra or supra references