# Training: SQL (Light-users) 👩‍💻
Welcome to the training notebook on using SQL.

This notebook is pitched at light-users who perform basic querying operations to retrieve lightly-wrangled data from SQL.

They will have **READ-ONLY** or higher access to the database.

***

## What is SQL? 🗃
SQL stands for *Structured Query Language*. Amongst bedroom geeks like myself, it is commonly pronounced *'sequel'* rather than spelt out as S-Q-L (two syllables instead of three).

It is mainly used for the following cases:
- Effectively storing data (especially at a larger scale)
- Efficiently reshaping large-scale data in a format suitable for analysis
- Rapidly performing simple computations of large-scale data
    + Such as addition, subtraction, multiplication and division 

It is not so good for:
- Customising the formatting of table outputs (use thangs like Excel or R for that) 
- Creating plots (use thangs like Excel or R for that)
- Writing reports (use thangs like Word or R for that)
- Nicely drawing maps (use thangs like R for that)
    + Technically, you can display maps in SQL, but they tend to look basic

*Note: when we say "large-scale data", we mean data that exceeds the amount of memory (RAM) on the computer you are doing your analysis on.*

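> **Tip:** SQL keywords are not case sensitive, so `SELECT` and `select` behave the same. Whether text comparisons and object names are case sensitive depends on how the server and database are configured (their *collation*).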

![SQL meme woman at cat image](https://live.staticflickr.com/65535/49136979022_bcbf5443aa_z.jpg "SQL meme woman at cat image")

***

## How do I use SQL? 💃
Although this training session is being delivered in an **Azure Data Studio** notebook, it is recommended that you do most of your SQL-related work in Microsoft's **SQL Server Management Studio (SSMS)**.

This is because SSMS offers a better graphical user interface (GUI) that lets you easily explore databases, tables, views, stored procedures etc., whereas Azure Data Studio does not have the same functionality.

In SSMS, you can visually navigate around the Server; by expanding sub folders in the Object Explorer pane you can explore the databases you have access to. You can right click on the objects to perform particular actions, such as generating code to preview a table or viewing table properties.

The following details will need to be entered to access the data:
- Server: **<*your_sql_server_name*>**
- Database: **AdventureWorks**

The AdventureWorks database is a collection of tables which contain a sample dataset for us to play with. 

***

## What will this session cover? 👁
This session will show you how to do the following things in SQL:
1. Basic relational database theory
1. Basic query to retrieve all records from a table - `SELECT...`
1. Choosing unique entries in a column - `DISTINCT`
1. Filtering data from a table - `WHERE...`
1. Sorting/Arranging/Ordering data from a table - `ORDER BY...`
1. Grouping your data to aggregate it - `...<aggregation_function> ... GROUP BY...`
1. Joining data from different tables next to each other - `<join_type>...ON...` 
1. Attaching data from similarly structured tables on top of each other - `...UNION...`
1. Creating new columns in your data - `...[<created_column_name>]`
1. Changing columns to a different data type - `CAST([<column_name>] AS <data_type>)...` 
1. Using conditional if-else statements - `CASE WHEN [<column_name>] = x THEN y...ELSE <value_to_change_to> END...`


***

## Basic relational database theory and T-SQL 🧠

Relational databases are the basis for SQL, and for many widely used database systems such as MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access.

A relational database is a collection of data sets organised into tables, each made up of records (rows) and columns. Relational databases (RDBs) define relationships between tables, so that tables can be linked to share information, which makes data easier to search, organise and report on.

A core concept of relational databases is **reducing redundancy in data**. The aim is for each piece of data to be stored once and only once. If data is stored redundantly, the copies can contradict each other, which reduces data integrity. 

> **Example:** There are multiple different spellings of Local Authority names across the Department. A database should aim to have one table linking LA Numbers to LA Names and then use LA Numbers in any other related data sets. This should be applied across all tables.
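
As a minimal sketch of this idea, the query below builds two tiny tables inline using `VALUES`: a lookup table holding each LA name once, and a data table that refers to LAs by number only. The LA numbers, names and pupil counts are invented purely for illustration, and the join syntax used here is covered in section 6.

```sql
-- table_data: hypothetical data set that stores LA numbers only
-- table_lookup: hypothetical lookup table, each LA name is stored once
SELECT
    table_data.[LANumber]
    ,table_lookup.[LAName]
    ,table_data.[PupilCount]
FROM (VALUES (101, 250), (102, 400), (101, 175)) AS table_data ([LANumber], [PupilCount])
LEFT JOIN (VALUES (101, 'Authority A'), (102, 'Authority B')) AS table_lookup ([LANumber], [LAName])
    ON table_data.[LANumber] = table_lookup.[LANumber];
```

Because each name lives in one place only, correcting a spelling in the lookup table corrects it everywhere the LA number is used.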

**Tables** 🛏

In SSMS, within the tables folder you will find all the raw data that is stored in the database.

You can further expand the tables with the + symbol to take a look at its columns, keys etc. 

Each column is assigned a data type that its values cannot violate, along with a setting for whether it is allowed to be blank (`NULL`) or not. The most commonly used data types are as follows:

- nvarchar/varchar - text (`nvarchar` stores Unicode text, `varchar` stores non-Unicode text)
- int - integers (whole numbers)
- decimal - decimal numbers of a specified precision
- float - approximate floating point numbers
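
To get a feel for why the data type matters, the sketch below runs the same division using three different types. It does not read from any table; the values are literals chosen for illustration.

```sql
-- Same calculation, three different data types
SELECT
    [IntDivision] = CAST(10 AS INT) / 3                   -- whole-number division, returns 3
    ,[DecimalDivision] = CAST(10 AS DECIMAL(10, 2)) / 3   -- keeps decimal places
    ,[FloatDivision] = CAST(10 AS FLOAT) / 3;             -- approximate floating point result
```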

Each table should have a *Primary Key* that all other fields in the table are related to. If one is set for a table, you can see it under the *Keys* folder of that table. 🗝

**Schemas** 👩‍👩‍👦

Objects within a database can be grouped together into different schemas. 

Permissions on tables are generally inherited from permissions on the schemas, i.e. if you have permission to use a schema, you have permissions to use all the tables within it. 
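
One way to see how tables are grouped into schemas without clicking through the Object Explorer is to query `INFORMATION_SCHEMA.TABLES`, a built-in view in SQL Server. This is only a sketch: what it returns depends on the database you are connected to and on your permissions.

```sql
-- List the tables you can see, alongside the schema each one belongs to
SELECT
    [TABLE_SCHEMA]
    ,[TABLE_NAME]
FROM INFORMATION_SCHEMA.TABLES
WHERE [TABLE_TYPE] = 'BASE TABLE'   -- leave out views
ORDER BY
    [TABLE_SCHEMA]
    ,[TABLE_NAME];
```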

**Transact-SQL**

Transact-SQL (T-SQL) is the language we use to retrieve data from Microsoft SQL Databases. 

This section outlines the basic commands associated with this.

> **Tip:** T-SQL is Microsoft specific and other variations of SQL will not always work on Microsoft SQL Databases due to differences in syntaxes.  

This matters when you search online for solutions to common SQL problems: the solution needs to be written in T-SQL to work with Microsoft databases.


In [2]:
/* Set database to query from */
USE [AdventureWorks];

# 1. How do I retrieve data? 👓 
A basic query in SQL consists of the `SELECT` statement. It enables you to retrieve records from one or more tables. The syntax is as follows:

```
SELECT 
[column_name_1],
...
[column_name_n]
FROM [schema_name].[table_name]
```
Replacing the column names with `*` will return **all** of the columns in the table.

> **TIP:** `TOP` - Type and execute `SELECT TOP 1000 * FROM <table_name>` to get a feel for what a table *looks* like. Returning only 1000 rows is quicker to load and lets you briefly see what sort of values are in your chosen table. Without an `ORDER BY`, which 1000 rows you get back is not guaranteed.

> **TIP FOR FORMATTING:** Putting the comma on a new line in front of the [column_name] allows you to quickly comment out columns you don't want to view.

> **TIP FOR COMMENTING CODE:** Inserting `--` in front of code comments it out. The hot-key for commenting out a line or bulk selection in SQL is typically `Ctrl-K` then `Ctrl-C`. To uncomment, do `Ctrl-K` then `Ctrl-U`.
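
Putting the three tips above together, a quick preview query might look like the sketch below. It uses the `[Person].[Address]` table from AdventureWorks; the choice of 10 rows and of which column to comment out is arbitrary.

```sql
-- Preview a handful of rows, with one column temporarily commented out
SELECT TOP 10
    [AddressID]
    ,[AddressLine1]
    --,[City]
    ,[PostalCode]
FROM [Person].[Address];
```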



In [3]:
-- Conduct humble select statement
-- to retrieve records on people's addresses
SELECT 
    [AddressID]
    ,[AddressLine1]
    ,[City]
    ,[StateProvinceID]
    ,[PostalCode]
FROM [Person].[Address];

# 2. What can I do to only return unique values in a column? 🥢
Sometimes you are only interested in the unique values in a column, for example to understand all the possible values that column can take. To do so, use the `DISTINCT` keyword.

**USE CASE:** *As Head of Sales of a socially-irresponsible, run-of-the-mill global coffee chain that sells bland coffee and even blander customer service,  I want to see all the types of discounts we offer so that I can come up with new sale strategies that mask the quality of our products to improve sales.*

In [4]:
-- retrieve all unique types of discounts offered
SELECT DISTINCT [Description]
FROM [Sales].[SpecialOffer];
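
`DISTINCT` applies to the whole row that you select, not just to the first column. The sketch below uses a small inline table of made-up offers (built with `VALUES`, so it runs without any database tables) to show that listing two columns returns each unique *combination* of values.

```sql
-- Four rows go in, but only three unique combinations come out
SELECT DISTINCT
    [Type]
    ,[Category]
FROM (VALUES
    ('Volume Discount', 'Reseller')
    ,('Volume Discount', 'Reseller')
    ,('Volume Discount', 'Customer')
    ,('Seasonal Discount', 'Reseller')
) AS table_offers ([Type], [Category]);
```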

# 3. How do I retrieve specific data? 👽
To retrieve/filter for specific records, rather than all records, you will need to use the `WHERE` clause in addition to the `SELECT` statement.

> **USER STORY:** *As a veteran police investigator with a long and decorated track-record of using a range of persuasion techniques and tools to draw information out of criminals, I want to see all the address data for my city, so that I can see where my suspects live.*

> **TIP:** To retrieve records based on a partial piece of text, you can utilise the `LIKE` operator and `%` wildcard. For instance, if you want to look for all cities beginning with "Pho", then you type and execute `SELECT * FROM [Person].[Address] WHERE [City] LIKE 'Pho%'`. Alternatively, if you want to look for all cities ending with "nix", then type and execute `SELECT * FROM [Person].[Address] WHERE [City] LIKE '%nix'`. Or if you want to look at all cities with "oeni" in them, then type and execute `SELECT * FROM [Person].[Address] WHERE [City] LIKE '%oeni%'`.
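
As a runnable sketch of pattern matching with `LIKE`, the query below lists each distinct city name that starts with a given prefix. Swap the pattern for `'%nix'` or `'%oeni%'` to try the "ends with" and "contains" forms; which cities come back depends on the data in your copy of AdventureWorks.

```sql
-- List each city whose name starts with 'Pho'
SELECT DISTINCT [City]
FROM [Person].[Address]
WHERE [City] LIKE 'Pho%';
```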

In [5]:
-- Retrieve all the people's addresses data
-- who live in the city of 'Phoenix'
SELECT 
    [AddressID]
    ,[AddressLine1]
    ,[City]
    ,[StateProvinceID]
    ,[PostalCode]
FROM [Person].[Address]
WHERE [City] = 'Phoenix';

-- Retrieve all the people's addresses data
-- who live in the city of 'Phoenix' or 'Cambridge'
SELECT 
    [AddressID]
    ,[AddressLine1]
    ,[City]
    ,[StateProvinceID]
    ,[PostalCode]
FROM [Person].[Address]
WHERE [City] IN ('Phoenix', 'Cambridge');


# 4. How do I arrange the data in a specific order? 🐊
To retrieve records and organise them in ascending or descending order, you will need to use the `ORDER BY` clause in addition to the `SELECT` statement.

> **USER STORY:** *As an analyst for the central government of a surveillance society, I want to see all the address data for all cities, so I can see who I can focus my attentions on.*

In [5]:
-- Retrieve all the people's addresses data
-- and sort them by city in ascending order
SELECT 
    [AddressID]
    ,[AddressLine1]
    ,[City]
    ,[StateProvinceID]
    ,[PostalCode]
FROM [Person].[Address]
ORDER BY [City] ASC;
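
You can also sort on more than one column and mix directions. The sketch below sorts by state/province in ascending order and then, within each one, by city in descending order. `ASC` is the default, so it can be left out.

```sql
-- Sort by state/province first, then by city in reverse alphabetical order
SELECT 
    [AddressID]
    ,[City]
    ,[StateProvinceID]
FROM [Person].[Address]
ORDER BY 
    [StateProvinceID] ASC
    ,[City] DESC;
```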

# 5. How do I perform aggregations on grouped variables? 🍹
To retrieve aggregated grouped records, such as a count of all the records belonging to a city, combine the `GROUP BY` clause with an aggregation function such as `SUM()`, `COUNT()`, `AVG()`, `MAX()` or `MIN()`. Note that T-SQL has no built-in `MEDIAN()` or `MODE()` aggregation function.

> **USER STORY:** *As a third-rate hacker with no grudges nor malicious intent, I want to see all the records held for each state and city, so I can see how complete the secret service's individualised data is against population data that's helpfully published by the state.*

In [6]:
-- Retrieve a count of all the records belonging to each state and city combination
SELECT
    [StateProvinceID]
    ,[City]
    ,[CountRecords] = COUNT(*)
FROM [Person].[Address]
GROUP BY 
    [StateProvinceID]
    ,[City];
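
Several aggregation functions can be used in the same query. The sketch below assumes the `[Production].[Product]` table in your copy of AdventureWorks has `[Color]` and `[ListPrice]` columns, as the standard sample database does.

```sql
-- Summarise list prices for each product colour
SELECT
    [Color]
    ,[CountProducts] = COUNT(*)
    ,[MinListPrice] = MIN([ListPrice])
    ,[MaxListPrice] = MAX([ListPrice])
    ,[AvgListPrice] = AVG([ListPrice])
FROM [Production].[Product]
GROUP BY [Color];
```

Every column in the `SELECT` list that is not inside an aggregation function must also appear in the `GROUP BY` clause.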

# 6. How do I join multiple tables together? 🤺
To join multiple tables together and bring in additional columns to work on, use `FROM <table_name_one> LEFT/RIGHT/INNER/FULL JOIN <table_name_two> ON <table_name_one_column_for_join> = <table_name_two_column_for_join>`.

![SQL Join Image](https://www.dofactory.com/Images/sql-joins.png "SQL Join image")

> **USER STORY:** *As an "intelligence" analyst for a rapidly-growing start-up seeking to aggressively expand into new markets within a highly-competitive industry, I want to see all the address records for each individual and business, so that I can see who owns a business within my industry and take gentle action to improve my company's success in entering the market.*

> **TIP:** What's not often said is that a `LEFT/RIGHT JOIN` can return more records than there were in the *left* (or *right*) table you started with. This happens when a single row in the table you are joining on to matches several rows in the table you are joining in: that row is repeated once per match. In other words, the table being joined in has "duplicate" rows for the columns you are joining on, even though those rows may be unique once more columns are included.
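
The sketch below reproduces this row-multiplying effect with two tiny inline tables built with `VALUES`, so it runs without any database tables. The IDs are made up for illustration: the left table has two rows, but address 1 matches two rows on the right.

```sql
-- Left table has 2 rows; the result has 3 rows:
-- address 1 appears twice (one per match), address 2 once with a NULL
SELECT
    table_left.[AddressID]
    ,table_right.[BusinessEntityID]
FROM (VALUES (1), (2)) AS table_left ([AddressID])
LEFT JOIN (VALUES (1, 100), (1, 200)) AS table_right ([AddressID], [BusinessEntityID])
    ON table_left.[AddressID] = table_right.[AddressID];
```

Comparing the row count before and after a join is a quick way to spot when this has happened.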

In [7]:
-- Retrieve all records of people's addresses alongside their business addresses
SELECT 
    table_address.[AddressID]
    ,table_address.[AddressLine1]
    ,table_address.[City]
    ,table_address.[StateProvinceID]
    ,table_address.[PostalCode]
    ,table_business.[BusinessEntityID]
    ,table_business.[AddressTypeID]
FROM [Person].[Address] AS table_address
LEFT JOIN [Person].[BusinessEntityAddress] AS table_business
    ON table_address.[AddressID] = table_business.[AddressID];

# 7. How do I attach multiple tables together? 🏖
To append multiple tables to increase the number of rows to work on, you'll need to include the `UNION` operator between several `SELECT <column_name_one>...<column_name_n> FROM <table_name_i>` statements.

>**USER STORY:** *As a grizzled consultant for the governors of Phoenix and New York, I want to see all the address records from an address table on the city of "Phoenix" and an address table on the city of "New York", so that I can legitimise my falsified population growth forecasts of these two cities and thereby improve my profile to reach new clients and swindle money from them.*

>**NOTE:** You can only `UNION` queries that return the same number of columns, in the same order, with compatible data types.

> **TIP:** `UNION` removes duplicate rows from the combined result, whereas `UNION ALL` keeps all rows from the several tables, including duplicate ones.
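
The difference is easiest to see side by side. The sketch below uses two tiny inline tables built with `VALUES` (the city names are just sample values), so it runs without any database tables.

```sql
-- UNION: duplicate rows are removed, so this returns 2 rows
SELECT [City] FROM (VALUES ('Phoenix'), ('Phoenix')) AS table_one ([City])
UNION
SELECT [City] FROM (VALUES ('Phoenix'), ('New York')) AS table_two ([City]);

-- UNION ALL: every row is kept, so this returns 4 rows
SELECT [City] FROM (VALUES ('Phoenix'), ('Phoenix')) AS table_one ([City])
UNION ALL
SELECT [City] FROM (VALUES ('Phoenix'), ('New York')) AS table_two ([City]);
```

Note that `UNION` removes duplicates that sit within a single table as well as those shared between tables.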

In [6]:
-- Retrieve all the people's addresses data
-- who live in the city of 'Phoenix' or 'New York'
SELECT 
    [AddressID]
    ,[AddressLine1]
    ,[City]
    ,[StateProvinceID]
    ,[PostalCode]
FROM [Person].[Address]
WHERE [City] = 'Phoenix'
UNION
SELECT 
    [AddressID]
    ,[AddressLine1]
    ,[City]
    ,[StateProvinceID]
    ,[PostalCode]
FROM [Person].[Address]
WHERE [City] = 'New York';

## EXERCISE: `UNION`-ing tables together
**Question:** Based on what we covered so far, can you think of an alternative way of writing the above query in (7.)? Please write your answer in the cell below.

In [0]:
-- please write your query here

# 8.0 Is there a way to add my own column to the data? 🥘

To create an additional column to your queried table, you write the name of the new column within the list of columns specified in your `SELECT` statement.

> **USER STORY:** *As a business analyst with questionable morals for the State Department of Defence, I want to include some contextual information that I know from experience in working with the city of Phoenix, so that senior military strategists can focus recruitment efforts for the next war to plunder another nation's natural resources.*

In [10]:
-- Include contextual information as extra column
SELECT 
    [AddressID]
    ,[AddressLine1]
    ,[City]
    ,[StateProvinceID]
    ,[Context] = 'Renowned for high-end spa resorts, Jack Nicklaus–designed golf courses and vibrant nightclubs'
    ,[PostalCode]
FROM [Person].[Address]
WHERE [City] = 'Phoenix';
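
A new column does not have to be a fixed piece of text; it can also be calculated from existing columns. The sketch below uses the built-in `UPPER()` function as an arbitrary example of a calculation, and also shows the `AS` way of naming a column, which does the same job as the `[new_column_name] = ...` form used in this notebook.

```sql
-- Two equivalent ways of naming a calculated column
SELECT 
    [AddressID]
    ,[City]
    ,[CityUpper] = UPPER([City])        -- name first, then expression
    ,UPPER([City]) AS [CityUpperToo]    -- expression first, then name
FROM [Person].[Address]
WHERE [City] = 'Phoenix';
```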

# 8.1 Can I change the data type of my data? 🔊
You may need to change the data type of columns in your data to perform certain operations such as:

1. Changing to an `INT` or `FLOAT` datatype so you can use aggregation functions like `SUM` and `AVG`.
1. Changing to a `VARCHAR(n)` or `NVARCHAR(n)` datatype so you can concatenate with other string/text columns.
1. Changing the two columns that you're joining two tables on to the same datatype so they can join!

To change the data type, you will need to build on (8.0) by creating or overwriting an existing column and utilise the `CAST()` command.

> **USER STORY:** *As the Director of Strategy for a shady financial institution with clients of questionable standing, I want to see all credit cards with similar numbers recorded in our system to identify whose cards need replacing so they can evade the regulatory authorities.*

In [12]:
-- Retrieve records of all credit card numbers that are similar,
-- where by similar we mean not having zeros at the front.
-- Currently, the column of interest is treated as an nvarchar one,
-- so '093455' is valid.
-- By converting the column to a bigint one, it will then become 93455.

SELECT 
    [CardType]
    ,[CardNumber] = CAST ([CardNumber] AS BIGINT)
    ,[ExpMonth]
    ,[ExpYear]
FROM [Sales].[CreditCard];
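
To see what `CAST()` does to individual values, the sketch below casts a few literals rather than table columns, so it runs without any tables. The values are made up for illustration.

```sql
-- Casting literal values between data types
SELECT
    [TextToNumber] = CAST('093455' AS BIGINT)                        -- leading zero is dropped: 93455
    ,[NumberToText] = CAST(93455 AS VARCHAR(10)) + ' is now text'    -- can now be joined to other text
    ,[DecimalToInt] = CAST(9.99 AS INT);                             -- decimal places are cut off: 9
```

If a value cannot be converted, for example `CAST('abc' AS INT)`, the query fails with an error. `TRY_CAST()` returns `NULL` instead in that situation.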

## EXERCISE: `CAST`-ing column to a specified datatype
**Question:** Using the `GETDATE()` function to get today's date and time, can you cast this to only get the date instead? Please write your answer in the cell below. 

**Note:** *A `SELECT` statement need not always end with `FROM <table_name>`.*

In [0]:
-- please write your query here

# 8.2 How can I return a value based on another value in a different column? 🕹
If you ever want to do the equivalent of an *if-else* statement, the SQL equivalent is `[new_column_name] = CASE WHEN ... THEN ... END`. This builds on (8.0) as we need to create a new column.

> **USER STORY:** *As a pencil-pusher for a large, faceless multinational corporation that sells a variety of wares, I want to see what the currency rates are so I can convert the value of our sales into our home currency.*

> **RANT:** Really long and winding `CASE WHEN` statements are often seen in SQL code that is attempting to do what a lookup table is meant to do. Don't write a long and winding `CASE WHEN` statement: they're hard to read, and are the equivalent of nested *IF()* formulas in Excel. Please, don't. One's eyes will bleed. 


In [7]:
-- first and bad attempt to do a lookup
-- from currency code to currency name
SELECT 
    [FromCurrencyCode]
    ,[FromCurrencyName] = CASE
        WHEN [FromCurrencyCode] = 'AED' THEN 'Emirati Dirham'
        WHEN [FromCurrencyCode] = 'AFA' THEN 'Afghani'
        WHEN [FromCurrencyCode] = 'ALL' THEN 'Lek'
        -- and so on....damn, this is really tedious to type out
        ELSE NULL
        END
    ,[ToCurrencyCode]
    ,[AverageRate]
    ,[EndOfDayRate]
FROM [Sales].[CurrencyRate]
;
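
`CASE WHEN` is a better fit when the conditions are ranges or rules rather than a one-to-one lookup. The sketch below bands a few made-up prices, held in an inline table built with `VALUES`, into three groups. The band names and thresholds are arbitrary.

```sql
-- Conditions are checked from top to bottom and the first match wins
SELECT
    [ListPrice]
    ,[PriceBand] = CASE
        WHEN [ListPrice] < 100 THEN 'Low'
        WHEN [ListPrice] < 1000 THEN 'Medium'
        ELSE 'High'
        END
FROM (VALUES (49.99), (539.99), (3578.27)) AS table_prices ([ListPrice]);
```

Because the first match wins, the second condition only ever sees prices of 100 or more, so there is no need to write `>= 100 AND < 1000`.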

## EXERCISE: Avoiding long and winding `CASE WHEN` statements
**Question:** Using the `[Sales].[Currency]` as your look-up table, find a way to do what you aim to do in (8.2) but without using `CASE WHEN` statements. Please write your answer in the cell below.

**Hint:** You might want to refer to what we covered earlier.


In [0]:
-- please write your query here

## EXERCISE: `CAST`-ing columns so they can be concatenated
**Question:** Using the `[ExpMonth]` and `[ExpYear]` columns, and assuming all expiration dates fall on the first of each month, *e.g. 2018/05/01*, can you create a new column that has the full date of expiry, ensuring that this is of the date datatype? Please write your answer in the cell below. 

**Hint:** This exercise brings together the concepts learnt so far of:
- Creating a new column
- Casting columns to the right datatype
- Using conditional if/else (`CASE WHEN`) statements

You will also need to search for how to concatenate columns and may possibly need to use the `LEN()` function.

**Note:** This is actually quite a difficult exercise.

In [0]:
-- please write your query here

## EXERCISE: Returning first non-`NULL` entry across multiple columns 🎲
**Question:** Using the `COALESCE()` function, can you rewrite the below query?

**Note:** This exercise shows you a shorthand way of creating a new column that takes the first non-`NULL` entry across several existing columns.

```
SELECT [Name]
    ,[ProductNumber]
    ,[ProductSubcategoryID]
    ,[ProductModelID]
    ,[ProductId] = CASE
        WHEN [ProductSubcategoryID] IS NULL THEN [ProductModelID]
        ELSE [ProductSubcategoryID]
        END
FROM [Production].[Product];
```

In [0]:
-- please write your query here