# NEW SQL PROGRAMMER SESSION - DSCI 513


## Import and Configuration

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import csv

%matplotlib inline
%load_ext sql
%config SqlMagic.displaylimit = 30
%config SqlMagic.autolimit = 30

In [None]:
import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

You can run the queries with the `sql` magic, with psycopg2, or with pandas `read_sql_query`; you would need to adapt the code accordingly. Instructions are given in lab1 and lab2.

In [None]:
from sqlalchemy import create_engine, text
conn = create_engine(f'postgresql://{username}:{password}@{host}:{port}/world')

In [None]:
%sql postgresql://{username}:{password}@{host}:{port}/world
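For the psycopg2 option mentioned above, the general shape is: connect, open a cursor, execute, fetch. The sketch below uses Python's built-in `sqlite3` module as a stand-in, because it follows the same DB-API pattern and needs no server or credentials. The two-row `country` table is hypothetical and exists only so the query has something to return.

```python
import sqlite3

# Stand-in for psycopg2.connect(...): SQLite needs no server or credentials.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical two-row table, only so the query has something to return.
cursor.execute("CREATE TABLE country (name TEXT, continent TEXT)")
cursor.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("Canada", "North America"), ("Japan", "Asia")],
)

# Parameterized query: sqlite3 uses ? placeholders, psycopg2 uses %s.
cursor.execute("SELECT name FROM country WHERE continent = ?", ("Asia",))
rows = cursor.fetchall()
print(rows)

connection.close()
```

With psycopg2 the connection call and placeholder style differ, but the cursor, `execute`, and `fetchall` steps look the same.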

## Introduction to SQL and Database Terminologies

Welcome to your first steps in SQL and databases! Here's a quick rundown of some essential terms you'll need to be familiar with.



- **Database:** A collection of structured data stored electronically. Think of it as a digital filing cabinet where your data is organized and accessible.

- **Relational Database:** A type of database that stores data in tables that can relate to each other through common fields. It's like having multiple filing cabinets where files are interlinked.

- **RDBMS (Relational Database Management System):** Software that helps you manage relational databases. It lets you create, manipulate, and query the data in those databases. Examples include PostgreSQL, MySQL, and SQLite.

- **PostgreSQL (Postgres):** A powerful, open-source RDBMS known for its robustness and high performance. It's like a brand of filing cabinet known for its durability and advanced features.

- **pgAdmin:** A popular and feature-rich open-source administration tool for PostgreSQL. Think of it as the key to your filing cabinet, allowing you to open, organize, and find documents.

Remember, "relational" here refers to how data can be split into different tables for efficiency, but then related together (like a customer to their orders).

## Relational Database

Imagine a database containing *student_info*, *course_info*, and *instructor_info* tables, each potentially with many fields. The most memory-efficient way to manage enrollments and class assignments is to use linking tables. This avoids duplicating data and allows a flexible design where many students can be enrolled in many classes, and many classes can be assigned to each instructor.

Here's a description of how the database could be structured:

- **student_info Table:** This table stores all relevant information about students. Each student has a unique student_id which is the primary key.

- **course_info Table:** Contains details about courses with a unique class_id as the primary key.

- **instructor_info Table:** Holds data on instructors, with each having a unique instructor_id as the primary key.

**\$\$\$ Million Dollar Question: What can you do to link these tables without repeating data?**

You need linking tables:

- **enrollments Table:** A junction table to manage the many-to-many relationship between students and classes. It has its own primary key (enrollment_id) and includes student_id and class_id as foreign keys.

- **class_instructors Table:** Handles the one-to-many relationship between instructors and classes, where one instructor can teach many classes. It contains class_id and instructor_id, and each (class_id, instructor_id) pair is unique.

By using these linking tables, you can efficiently enroll students in classes and assign instructors without having to repeat student or instructor information for each class they're associated with. This not only saves memory but also ensures data integrity and simplifies updates to the database since each piece of information is only stored once.

Example Tables with Relations:

**<center>Students Table</center>**
| student_id | student_name |many other info!|
|------------|--------------|----------------|
| 1          | Daniel       |Blah Blah Blah! |
| 2          | James        |Blah Blah Blah! |

<br>

**<center>Classes Table</center>**
| class_id | class_name   |many other info!|
|----------|--------------|----------------|
| 1        | DSCI513      |Blah Blah Blah! |
| 2        | DSCI522      |Blah Blah Blah! |

<br>

**<center>Instructors Table</center>**
| instructor_id | instructor_name |many other info!|
|---------------|-----------------|----------------|
| 1             | Gittu           |Blah Blah Blah! |
| 2             | Tiff            |Blah Blah Blah! |

<br>

**<center>Enrollments Table (Junction Table for Many-to-Many Relationship)</center>**
| enrollment_id | student_id | class_id |
|---------------|------------|----------|
| 1             | 1          | 1        |
| 2             | 1          | 2        |
| 3             | 2          | 1        |

<br>

**<center>Teaching Assignments Table (For One-to-Many Relationship)</center>**
| class_id | instructor_id |
|----------|---------------|
| 1        | 1             |
| 2        | 2             |
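
The example tables above can be written out as a schema. This is a minimal sketch using Python's built-in `sqlite3` as a stand-in for PostgreSQL, so the column types and constraint details are assumptions, not the course's actual schema. Only the id and name columns from the example tables are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE student_info (student_id INTEGER PRIMARY KEY, student_name TEXT);
CREATE TABLE course_info (class_id INTEGER PRIMARY KEY, class_name TEXT);
CREATE TABLE instructor_info (instructor_id INTEGER PRIMARY KEY, instructor_name TEXT);
CREATE TABLE enrollments (
    enrollment_id INTEGER PRIMARY KEY,
    student_id INTEGER REFERENCES student_info(student_id),
    class_id INTEGER REFERENCES course_info(class_id)
);
CREATE TABLE class_instructors (
    class_id INTEGER REFERENCES course_info(class_id),
    instructor_id INTEGER REFERENCES instructor_info(instructor_id),
    UNIQUE (class_id, instructor_id)
);
INSERT INTO student_info VALUES (1, 'Daniel'), (2, 'James');
INSERT INTO course_info VALUES (1, 'DSCI513'), (2, 'DSCI522');
INSERT INTO instructor_info VALUES (1, 'Gittu'), (2, 'Tiff');
INSERT INTO enrollments VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);
INSERT INTO class_instructors VALUES (1, 1), (2, 2);
""")

# Follow the foreign keys from each enrollment to the student, class, and instructor.
roster = conn.execute("""
    SELECT s.student_name, c.class_name, i.instructor_name
    FROM enrollments AS e
    JOIN student_info AS s ON e.student_id = s.student_id
    JOIN course_info AS c ON e.class_id = c.class_id
    JOIN class_instructors AS ci ON c.class_id = ci.class_id
    JOIN instructor_info AS i ON ci.instructor_id = i.instructor_id
    ORDER BY e.enrollment_id
""").fetchall()

for row in roster:
    print(row)
```

Each student, class, and instructor name is stored exactly once; the linking tables hold only ids.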


## Basic SQL Commands and Syntax
When working with SQL and databases, it's crucial to start with the basic commands that allow you to retrieve and manipulate data. In the context of a 'world' database, which contains information about countries, their cities, and other related data, the SELECT, FROM, and WHERE clauses are fundamental. 

### SQL Syntax: SELECT, FROM, and WHERE

The most basic, yet powerful commands are:

- **SELECT**: This command is used to specify the columns that you want to retrieve from a database table.

- **FROM**: This clause specifies the table from which to retrieve the data.

- **WHERE**: This clause allows you to filter the query based on specific conditions.

When used together, these commands form the backbone of most SQL queries.

#### The `*` Operator
- The asterisk (`*`) is a wildcard character that tells SQL to select all columns for each row that meets the query criteria.

In [None]:
%%sql
SELECT * FROM country;

#### Retrieving Specific Columns

In [None]:
%%sql
SELECT name, continent FROM country;

Here, instead of `*`, we specify that only the *name* and *continent* columns of each country should be returned.

#### Using the `WHERE` Clause

In [None]:
%%sql
SELECT name, continent FROM country WHERE population > 50000000;


This query fetches the name and continent of countries with populations greater than 50 million.

Remember, the `SELECT` statement can be combined with various other clauses to refine your queries even further.

In [None]:
%%sql
SELECT name, continent, to_char(population,'9,999,999,999')
FROM country WHERE population > 50000000;
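
The same filter can be tried without a PostgreSQL server. This is a small sketch using Python's built-in `sqlite3` and a hypothetical three-row `country` table with made-up names and populations. Since `to_char` is a PostgreSQL function, the thousands separators are added in Python here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, continent TEXT, population INTEGER)")
# Made-up rows for illustration only; these are not values from the world database.
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [
        ("Country A", "Asia", 120000000),
        ("Country B", "Europe", 8000000),
        ("Country C", "Africa", 65000000),
    ],
)

# WHERE keeps only the rows whose population exceeds 50 million.
big = conn.execute(
    "SELECT name, continent, population FROM country "
    "WHERE population > 50000000 ORDER BY name"
).fetchall()

# Format the population with comma separators on the Python side.
formatted = [(name, continent, f"{population:,}") for name, continent, population in big]
print(formatted)
```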

### Aliasing in SQL: Using AS and Implicit Aliases

Aliasing is useful for improving the readability of your SQL queries or when you need to avoid naming conflicts. Here are two ways to use aliasing:

#### Using `AS` for Aliasing
You can use the `AS` keyword to give a table or a column a temporary name during the execution of a query.


In [None]:
%%sql
SELECT co.name AS country_name
FROM country AS co

In this example, `country` is aliased as `co`, and the `name` column is given the alias `country_name` for clarity in the output.

#### Aliasing Without `AS`

SQL also allows you to alias tables and columns without explicitly using the `AS` keyword.

In [None]:
%%sql
SELECT co.name country_name
FROM country co

While both methods are correct, using `AS` can make it clearer that you are defining an alias, especially for those who are new to SQL.
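
To see that the two forms are equivalent, compare the column name each query reports. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, with a hypothetical one-row `country` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT)")
conn.execute("INSERT INTO country VALUES ('Country A')")  # placeholder row

with_as = conn.execute("SELECT co.name AS country_name FROM country AS co")
without_as = conn.execute("SELECT co.name country_name FROM country co")

# cursor.description holds one tuple per output column; item 0 is the column name.
name_with_as = with_as.description[0][0]
name_without_as = without_as.description[0][0]
print(name_with_as, name_without_as)
```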

### SQL Aggregation: COUNT and GROUP BY

Aggregation functions compute a single result from a set of input values. The `COUNT` function is one of the simplest aggregate functions, returning the number of rows that match the query criteria.


In [None]:
%%sql
SELECT COUNT(*) FROM country;

This returns the total number of rows in the country table.

### Understanding `GROUP BY`

The `GROUP BY` clause groups rows that have the same values in specified columns into summary rows. It's often used with aggregate functions like `COUNT`, `MAX`, `MIN`, `SUM`, and `AVG`.

When you use GROUP BY, the result set includes one row for each group. **Only the columns that are included in the GROUP BY clause or are used with an aggregate function can be used in the SELECT statement.**

In [None]:
%%sql
SELECT countrycode, COUNT(*) 
FROM city 
GROUP BY countrycode;


This query groups the cities by their `countrycode` and then counts how many cities are in each country.

#### Detailed Explanation of GROUP BY

When you group by a column, SQL essentially partitions the rows according to the unique values in that column. After partitioning, it **aggregates** the data in each partition *according to the functions you specify (like `COUNT`)*.

Consider the following points for GROUP BY:

- It collects all the rows with the same value in the grouping column into a single group.
- Then, for each group, it performs the aggregation.
- You ***CANNOT*** use non-aggregated columns (those not mentioned in the `GROUP BY` clause) in your `SELECT` statement **unless** they are used with an aggregate function.

For instance:

In [None]:
%%sql
SELECT countrycode, COUNT(name), ROUND(AVG(population)) 
FROM city 
GROUP BY countrycode;


In this query, we group the cities by `countrycode` and then count the cities and calculate the average population for each group. The city `name` can be counted because `COUNT` is an aggregate function, but you cannot simply select column `name` because it's not a part of the `GROUP BY` clause and it's not being used with an aggregate function. **PAUSE AND PONDER WHY WE CAN'T DO THIS**

**HINT** : This is because, after grouping, there are potentially multiple values for each non-grouped column within each group, and SQL doesn't inherently know how you want to handle those multiple values unless you tell it explicitly with an aggregate function.
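
The partition-then-aggregate idea can be mimicked in plain Python. This is only a conceptual sketch with made-up rows, not how a database engine works internally.

```python
from collections import defaultdict

# Made-up (name, countrycode, population) rows for illustration.
cities = [
    ("City 1", "AAA", 100),
    ("City 2", "AAA", 300),
    ("City 3", "BBB", 50),
]

# Step 1: partition the rows by the grouping column.
groups = defaultdict(list)
for name, countrycode, population in cities:
    groups[countrycode].append((name, population))

# Step 2: reduce each partition to a single summary row.
summary = {}
for code, rows in groups.items():
    populations = [population for _, population in rows]
    summary[code] = (len(rows), round(sum(populations) / len(populations)))

print(summary)
print(groups["AAA"])  # two names in one group: which one would a bare `name` return?
```

The group for `AAA` holds two city names but produces only one output row, which is why a bare `name` has no single value to show.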

### SQL Filtering: HAVING vs WHERE

In SQL, `WHERE` and `HAVING` both filter the data returned by a query, but they do so at different stages.

- **WHERE**: Filters rows before grouping is done.
- **HAVING**: Filters groups created by `GROUP BY`.

In [None]:
%%sql
SELECT countrycode, COUNT(*) AS city_count
FROM city
GROUP BY countrycode
HAVING COUNT(*) > 5; --Try this with `WHERE` and see what happens!

This query will return only those countrycode values that have more than 5 cities. The `HAVING` clause is used here because it filters the groups based on the count of cities, which is an aggregated value.

#### Difference from WHERE

You cannot use `WHERE` to filter based on aggregated values like the count of rows, sum, average, etc. `WHERE` does not have access to aggregate data **because it's executed before the `GROUP BY` operation (double BAM!).**

In [None]:
%%sql
SELECT countrycode, COUNT(*) AS city_count
FROM city
WHERE population > 100000 -- Filters rows where city population is more than 100000
GROUP BY countrycode
HAVING COUNT(*) > 5; -- Filters groups having more than 5 cities


In this query, `WHERE` is used to filter individual cities with populations greater than 100,000 before grouping, and `HAVING` is applied after grouping to ensure we only get groups where there are more than 5 cities **that meet the `WHERE` condition.**

Remember that `WHERE` is for *individual rows* and `HAVING` is for *aggregated data*. This distinction is **crucial** and often a source of confusion for those new to SQL.
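
The effect of the two filters is easy to see on a tiny table. The sketch below uses Python's built-in `sqlite3` with made-up rows, and lowers the `HAVING` threshold from 5 to 1 so that a handful of rows is enough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, countrycode TEXT, population INTEGER)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [
        ("City 1", "AAA", 200000), ("City 2", "AAA", 150000), ("City 3", "AAA", 50000),
        ("City 4", "BBB", 300000), ("City 5", "BBB", 90000),
        ("City 6", "CCC", 20000), ("City 7", "CCC", 30000),
    ],
)

# HAVING alone: every city is counted, then small groups are dropped.
having_only = conn.execute("""
    SELECT countrycode, COUNT(*) FROM city
    GROUP BY countrycode
    HAVING COUNT(*) > 1
    ORDER BY countrycode
""").fetchall()

# WHERE first: small cities are removed before counting, so the counts shrink.
where_then_having = conn.execute("""
    SELECT countrycode, COUNT(*) FROM city
    WHERE population > 100000
    GROUP BY countrycode
    HAVING COUNT(*) > 1
    ORDER BY countrycode
""").fetchall()

print(having_only)
print(where_then_having)
```

`BBB` drops out of the second result because only one of its cities survives the `WHERE` filter, so its group no longer passes `HAVING`.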

### SQL Joins: Combining Data from Multiple Tables

Joins in SQL are used to retrieve data from multiple tables by creating a logical connection between them. This is typically done using a common column, such as an ID.

#### Important Aspects of Joins

- **Selecting the Right Columns to Join On**: It's crucial to join tables on columns that have a logical relationship, usually a foreign key in one table that references a primary key in another table. 

- **Avoiding Cross Joins**: If you don't specify the join condition correctly, you might end up with a cross join, which is a Cartesian product of the rows from the tables involved in the join. This means every row of one table is paired with all rows of the other table, which can result in a massive and usually incorrect result set.

- **Using More Than One Join**: In complex queries, you may need to join more than two tables, or sometimes the same table multiple times. In such cases, it's essential to keep your code clear and understandable.

- **Aliasing for Clarity**: When joining multiple tables, especially when they have columns with the same name, or when joining the same two tables multiple times, aliasing becomes important to avoid confusion and to specify which column from which table you're referring to.


In [None]:
%%sql
SELECT *
FROM country, city; --Incorrect Join Leading to Cross Join

This query does not specify how `country` and `city` are related. It will return the Cartesian product of the two tables, which is probably not what you want. The correct version would look like:

In [None]:
%%sql
SELECT *
FROM country
JOIN city ON country.code = city.countrycode;

Here, we're joining `country` and `city` on the country code, which is stored as `code` in `country` and as `countrycode` in `city`. This correctly associates each city with its country.
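
Row counts make the difference concrete. This is a minimal sketch using Python's built-in `sqlite3` with hypothetical two-row `country` and three-row `city` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (code TEXT, name TEXT)")
conn.execute("CREATE TABLE city (name TEXT, countrycode TEXT)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("AAA", "Country A"), ("BBB", "Country B")],
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("City 1", "AAA"), ("City 2", "AAA"), ("City 3", "BBB")],
)

# No join condition: every country row is paired with every city row.
cross_count = conn.execute("SELECT COUNT(*) FROM country, city").fetchone()[0]

# With a join condition: each city is paired only with its own country.
join_count = conn.execute(
    "SELECT COUNT(*) FROM country JOIN city ON country.code = city.countrycode"
).fetchone()[0]

print(cross_count, join_count)
```

The cross join yields 2 × 3 = 6 rows, while the proper join yields one row per city.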

#### Joining Multiple Tables with Aliasing


In [None]:
%%sql
SELECT c.name AS country_name, ci.name AS city_name, l.language
FROM country AS c
JOIN city AS ci ON c.code = ci.countrycode
JOIN countrylanguage AS l ON c.code = l.countrycode
WHERE ci.population > 100000 AND l.isofficial = 'T';

In this query, we join three tables: `country`, `city`, and `countrylanguage`. Aliases like `c` for `country`, `ci` for `city`, and `l` for `countrylanguage` help differentiate columns and make the query more readable.

**Fun Fact:** Replace `l.language` in the `SELECT` statement with `STRING_AGG(l.language, ', ') AS languages` and see what happens!

#### Joining Same Tables Multiple Times

Sometimes you need to join the same table more than once to get different pieces of information. In such cases, aliasing is not just helpful, it's necessary to distinguish the different roles the same table is playing in each join.

Let's say we want to list countries with their capitals and also include information about other major cities.


In [None]:
%%sql
SELECT 
    ctry.name AS country_name, 
    capital.name AS capital_name, 
    city.name AS city_name
FROM 
    country AS ctry
JOIN 
    city AS capital ON ctry.capital = capital.id -- First join to get the capital
JOIN 
    city ON ctry.code = city.countrycode AND city.id != ctry.capital -- Second join to get other cities

In this example, we are:

- Joining country to city first to get the capital city. We alias city as `capital` for clarity.
- Then we join country to city again to get other cities within the country, making sure to exclude the capital city by ensuring city.id is not the same as the country.capital.

This query returns one row for each non-capital city, showing the country name, its capital, and the other city, so the two roles of the `city` table stay clearly separated.

**Fun Fact:** Try the following code! Can you guess what it's doing?

In [None]:
%%sql
SELECT 
    ctry.name AS country_name, 
    capital.name AS capital_name, 
    STRING_AGG(city.name, ', ') AS city_num
FROM 
    country AS ctry
JOIN 
    city AS capital ON ctry.capital = capital.id -- First join to get the capital
JOIN 
    city ON ctry.code = city.countrycode AND city.id != ctry.capital -- Second join to get other cities
GROUP BY
    ctry.name, capital.name
HAVING
    COUNT(*)>10
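
To experiment with this pattern offline, the sketch below rebuilds it in Python's built-in `sqlite3` with made-up rows. Two things are assumptions of the sketch: `group_concat` stands in for PostgreSQL's `STRING_AGG`, and the `HAVING` threshold is lowered from 10 to 1 so that five city rows are enough.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (code TEXT, name TEXT, capital INTEGER)")
conn.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT, countrycode TEXT)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [("AAA", "Country A", 1), ("BBB", "Country B", 4)],
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [
        (1, "Capital A", "AAA"), (2, "City A2", "AAA"), (3, "City A3", "AAA"),
        (4, "Capital B", "BBB"), (5, "City B2", "BBB"),
    ],
)

rows = conn.execute("""
    SELECT ctry.name, capital.name, group_concat(city.name, ', ')
    FROM country AS ctry
    JOIN city AS capital ON ctry.capital = capital.id
    JOIN city ON ctry.code = city.countrycode AND city.id != ctry.capital
    GROUP BY ctry.name, capital.name
    HAVING COUNT(*) > 1
""").fetchall()

for country_name, capital_name, other_cities in rows:
    print(country_name, "|", capital_name, "|", other_cities)
```

Only `Country A` is printed: it has two non-capital cities, while `Country B` has one and is removed by `HAVING`.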