<img src="img/dsci513_header2.png" width="600">

# Lecture 1: Introduction to relational databases

**Arman Seyed-Ahmadi, November 2021**

## Course announcements


## Lecture outline

- Why use databases
- The relational model
- Query languages, SQL, Postgres
- How to run SQL
- Basic SQL queries

## Why not spreadsheets?

At this point in MDS, you have a good idea of why spreadsheet software like Excel or its equivalents is not suitable for most data science purposes.

Pandas:
- reproducible
- range of functionalities
- scalable
- fast
- can be automated

## Why not Pandas?

But Pandas was pretty nice and powerful, wasn't it? Let's see.

---
**Question:**

What kind of problems do you think you might run into using a pandas dataframe like the above?

---

Think about what happens if:

- Your dataframe is 100 GB in size
- Multiple people want to use and make changes to the dataset simultaneously
- You want to be able to manage what each user can do
- You want to be able to let different users see different parts of the dataset
- You don't want to store everything in one place
- You want to restrict the kind of data to be stored
- The dataset file is corrupted
- The system crashes halfway through making a change
- You want to optimize access to your data
- ...


## Databases and database management systems

You guessed it right! A **database management system (DBMS)** addresses all of the above problems.

**What is a database?**
A database is an organized collection of related data.

**What is a database management system?**
A DBMS is a collection of programs that enables users to create, query, modify and manage a database in an optimized and efficient manner. A DBMS relieves us from worrying about how the data is stored and managed.

Using a DBMS ensures:
- Data independence
- Efficient data access
- Data integrity
- Data security
- Concurrent access
- Crash recovery
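
To make one of these guarantees concrete, here is a minimal sketch of what happens when a change fails halfway through. It uses Python's built-in `sqlite3` module as a stand-in for a full DBMS; the `accounts` table, its rows and the simulated crash are all made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for a full DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # Using the connection as a context manager wraps the block in a
    # transaction: commit on success, rollback if an exception is raised.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE owner = 'alice'")
        raise RuntimeError("simulated crash halfway through the transfer")
        # The matching credit to 'bob' would go here, but is never reached.
except RuntimeError:
    pass

# The half-finished debit was rolled back, so both balances are unchanged.
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(balances)
```

With a plain file or an in-memory dataframe, we would have to write this kind of all-or-nothing logic ourselves.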

There are different types of DBMS for different kinds of data:

- Relational (most widely used)
- Document
- Hierarchical
- Network
- Object-oriented
- Graph

---

**Remember:**
    
database $\ne$ database management system

---

### Data model

A data model is the way we choose to represent data. We usually try to model the data in a way that is closer to how we think about the data.

You probably remember from when we talked about tidy data, that we like to see
- each observation or measurement as a **row**
- each variable or attribute as a **column**


Is that the only way to represent data? No! But that's the one that makes sense for a variety of applications.

That particular way of representing the data is called a **data model**.

---

**Example:**

In a graph database for a social media application, people may be represented as the nodes of a graph, whereas graph edges may define the relationship of each person to another person.

---

---

**Example:**

In a document database, we may choose to store information about students of a university as individual documents. Inside each document, attributes are stored as key-value pairs.

---

In this course, we'll talk mostly about **relational** DBMSs (RDBMS) and briefly about **non-relational** DBMSs.

## The relational model

### Why the relational model?

Take a moment and think about the kind of problems that you may run into if you choose to store data in a single table.

<img src="./img/lecture1/table.png" width="800">

The most famous data model today is the relational model, while other models have also gained traction in the past few years.

The relational model works with **entities** and **relationships**. It is based on set theory in mathematics and was introduced by Edgar Codd (IBM) in 1970 ([more details here](https://en.wikipedia.org/wiki/Relational_model)). Its foundation in **set theory** is the reason you will hear words like "tuples", "domain", "union", "cross product", etc.

---

**Example:**

Entities:
- students in a school
- employees of an organization
- cars of a rental company
- houses in a city

Relationships:
- students to a department
- purchases to customers
- movies to actors
- customers to a bank
    
---

In a relational model, entities and relationships are both **sets of tuples** called **relations**. These relations are represented as **tables** with rows and columns.

**What is a relational database?** A collection of relations

**Relation**: made up of two parts, a schema and an instance

**Schema**: specifies
1. Name of a relation
2. Name and domain of each attribute

**Domain**: A set of constraints that determines the type, length, format, range, uniqueness and nullability of values stored for an attribute.

---

**Example:**

Student (**sid**: _string_, **name**: _string_, **login**: _string_, **age**: _integer_, **gpa**: _real_)

---
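
As a rough sketch of how a schema and its domains translate into something executable, the snippet below declares the `Student` relation with Python's built-in `sqlite3` module. SQLite is only a stand-in here (in this course we use Postgres), and the specific constraints, such as the GPA range, are my own assumptions rather than part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema: a relation name plus a name and a domain for each attribute.
conn.execute(
    """
    CREATE TABLE Student (
        sid   TEXT PRIMARY KEY,
        name  TEXT NOT NULL,
        login TEXT UNIQUE,
        age   INTEGER CHECK (age >= 0),
        gpa   REAL CHECK (gpa BETWEEN 0.0 AND 4.0)   -- hypothetical range
    )
    """
)

# A row that respects every domain is accepted.
conn.execute("INSERT INTO Student VALUES ('23792', 'Arman', 'arman@mds', 28, 2.5)")

# A row whose gpa falls outside its domain is rejected by the engine.
try:
    conn.execute("INSERT INTO Student VALUES ('99999', 'Nobody', 'nobody@mds', 30, 7.3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT sid, name FROM Student").fetchall()
print(rejected, rows)
```

The rows that end up in the table at any given moment are the *instance*; the `CREATE TABLE` statement is the *schema* they have to follow.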

**Instance**: a particular relation that follows a certain schema

**Relational Database Schema**: collection of schemas in the database

**Database Instance**: a collection of instances of its relations

### Anatomy of a table

<img src="img/lecture1/table_anatomy.png" width="700">

### Properties of a table (or relation)

- Contains data about a particular entity or relationship between entities.
- Has a unique name in a database.
- Each row-column intersection stores a single value, that is to say values should be atomic.
- Has at least one column and zero or more rows.

### Properties of rows (or tuples, records)

- The order of rows is not important. Therefore, there is no index-based method to retrieve rows like in Pandas.
- Contains information about an instance of the entity/relationship which the table represents.
- Each row is identified by its _primary_ key (we'll learn more about this in future lectures).

### Properties of columns (or attributes)

- Has a unique name in a table.
- Represents a particular property of the entity/relationship which the table represents.
- The order of columns is not important. Therefore, there is no index-based method to retrieve columns like in Pandas.
- Is valued according to a domain (rules for values).

(Schema is similar to a class, table doesn't have a counterpart, rows are like object instances)

---

**Question:**
Can you name a few differences between tables in a database and spreadsheets? How about between tables and Pandas dataframes?

---

## Query language in a DBMS

**What is a query?** A question that we ask about the data. The result of a query is a new relation.

In order to talk to the database and ask questions, we need to speak its language. A DBMS
- provides a specialized language for us to write our queries
- optimizes how our queries are executed

The data query language (DQL) is part of a bigger set of languages for working with data in a relational database, which consists of
- data definition language (DDL) for creating, altering and deleting tables
- data manipulation language (DML) for inserting new data, updating values, etc.
- data query language (DQL) for querying and retrieving data
- data control language (DCL) for management and controlling user access, rights and privileges
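
Here is a small sketch of what statements from these sub-languages look like in practice, run through Python's built-in `sqlite3` module. The `course` table and its contents are hypothetical, and SQLite is used only because it needs no server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure of a table.
conn.execute("CREATE TABLE course (code TEXT PRIMARY KEY, title TEXT)")

# DML: insert and update rows.
conn.execute("INSERT INTO course VALUES ('DSCI 513', 'Databases')")
conn.execute(
    "UPDATE course SET title = 'Databases and Data Retrieval' WHERE code = 'DSCI 513'"
)

# DQL: retrieve rows.
courses = conn.execute("SELECT code, title FROM course").fetchall()
print(courses)

# DCL (e.g. GRANT / REVOKE) has no counterpart in SQLite, which has no user
# accounts; in a client-server DBMS such as Postgres it would appear here.
```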

## What is SQL?

Well, it's finally time to learn about SQL!

- SQL stands for Structured Query Language ([or... does it?](https://en.wikipedia.org/wiki/SQL#History)).
- It is a programming language that we use to talk to a relational DBMS.
- It was originally developed at IBM in the 1970s to manipulate and retrieve data stored in their DBMS, System R.
- SQL ≠ relational model ≠ database ≠ DBMS

### A peek at SQL queries

Suppose that we have the following table (relation) in our database, and 

> we want to retrieve the names, ages and GPAs of students older than 25.

|  sid  | name      | login      | age | gpa |
|-------|-----------|------------|-----|-----|
| 23792 | Arman     | arman@mds  | 28  | 2.5 |
| 82347 | Varada    | varada@mds | 29  | 2.9 |
| 11238 | Tiffany   | tiff@mds   | 23  | 2.8 |
| 87263 | Mike      | mike@mds   | 19  | 3.8 |
| 13298 | Joel      | joel@mds   | 25  | 3.2 |
| 91287 | Florencia | flor@mds   | 20  | 3.3 |

We can write this as the following SQL query:

```sql
SELECT
    name, age, gpa
FROM
    Students
WHERE
    age > 25;
```

Running the above query should return this relation:

| name   | age | gpa |
|--------|-----|-----|
| Arman  | 28  | 2.5 |
| Varada | 29  | 2.9 |

### SQL syntax

Let's dissect the different parts of our SQL query here:

```sql
SELECT
    name, age, gpa
FROM
    Students
WHERE
    age > 25;
```

A SQL statement consists of keywords, clauses, identifiers, a terminating semicolon and sometimes comments, which together form a complete, executable and independent piece of code.

```sql
SELECT
```
- `SELECT` is the **keyword** that appears in every SQL query. It is used to select and return data from columns, subject to the conditions that follow it.

- `SELECT` is very powerful, but not dangerous: a `SELECT` statement never changes any values or tables in the database.

- The fact that we select only a few columns (instead of all of them) is called **projection** in database terms.

```sql
name, age, gpa
Students
```

- These are called **identifiers**, and refer to the labels of columns and tables that exist in the database.

```sql
FROM
```
- This is another keyword that tells SQL which relation (i.e. table) to retrieve the columns from.

```sql
WHERE
```
- Yet another SQL keyword that is used to place a condition on the returned values.

- We can also have comments in a SQL query by preceding text with `--`:

```sql
-- Hey, I'm a comment!
-- ===========================
SELECT
    name, age, gpa  -- column names
FROM
    Students        -- table name
WHERE
    age > 25;       -- condition
```

- Block comments are also possible by enclosing comment lines in `/*` and `*/`:

```sql
/*
This is our first SQL query, and we
are learning about the following keywords:
SELECT
FROM
WHERE
*/

SELECT
    name, age, gpa
FROM
    Students
WHERE
    age > 25;
```

- Don't forget that every SQL statement needs to be terminated with a `;`.

- SQL keywords are traditionally written in upper case letters, but that is not a requirement. I prefer to follow this tradition because it makes the query more readable.

- A keyword together with the identifiers, expressions, etc. that follow it forms a clause. For example:

```sql
SELECT
    name, age, gpa  -- columns are chosen here
FROM
    Students        -- table is specified here
WHERE
    age > 25;       -- filter is applied here
```

- It is common to put each clause or each keyword on a different line, but there is no generally agreed-upon style.

- In general, it doesn't matter whether the entire SQL statement is on one line or broken over several lines. Anything that comes before a `;` belongs to the same statement.

There are many other keywords that we will use throughout DSCI 513. The ones that you just saw are a few that are commonly used when querying data.

> Note that SQL **is not imperative** (like Python or C++); it is a **declarative** language: We don't tell SQL **how** to retrieve data, but **what** to retrieve. For instance, we didn't write a for loop to retrieve the data from each row according to a certain condition. We told SQL what we wanted, and SQL did it for us.
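
To see the contrast side by side, here is a small sketch that answers the same question twice: once with an explicit Python loop, and once with a SQL query run through the built-in `sqlite3` module. The rows are a made-up subset of the student table.

```python
import sqlite3

rows = [("Arman", 28, 2.5), ("Varada", 29, 2.9), ("Tiffany", 23, 2.8), ("Mike", 19, 3.8)]

# Imperative: we spell out HOW to find the rows, one step at a time.
imperative = []
for name, age, gpa in rows:
    if age > 25:
        imperative.append((name, gpa))

# Declarative: we state WHAT we want and let the engine decide how to get it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, age INTEGER, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", rows)
declarative = conn.execute("SELECT name, gpa FROM Students WHERE age > 25").fetchall()

print(sorted(imperative) == sorted(declarative))
```

Both approaches give the same rows, but only the SQL version leaves the engine free to choose how to scan, filter and return the data.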

### Flavours of SQL

- SQL is not owned by a particular company or organization
- It was adopted as a database language standard by the American National Standards Institute (ANSI) in 1986, and by the International Organization for Standardization (ISO) in 1987.
- However, there are various SQL flavors and implementations, such as Oracle SQL, MySQL (open source), PostgreSQL (open source), IBM DB2, Microsoft SQL Server, Microsoft Access, SQLite (open source).
- These implementations have slightly different syntax and various additional features.
- In DSCI 513, we use **PostgreSQL**

<img src="img/lecture1/flavours_sql.png" width="700">

## What is PostgreSQL?

[PostgreSQL](https://www.postgresql.org/about/) (also known simply by its nickname _Postgres_) is an open-source, cross-platform DBMS that implements the relational model. PostgreSQL is very reliable with great performance characteristics, and is equipped with almost all features of the commercial and proprietary DBMSs.

PostgreSQL appeared in the 1980s as a research project at the University of California, Berkeley. It was meant to improve on an earlier prototype relational DBMS called INGRES, which explains the name Postgres, short for PostINGRES. [Here](https://medium.com/launch-school/a-brief-history-of-postgresql-36d8d392c611) is an informative blog post about PostgreSQL's history if you're interested!

## The client-server model

Similar to most other DBMSs, Postgres works based on a **client-server** model. In this model:

- The DBMS along with its databases and data are all stored on a host computer where the database server resides. This is typically a powerful machine with high processing power and large storage.
- Client hosts are usually personal computers with GUIs that can connect to a database server to access the data.

In this model, the clients and the server are connected over a network. The heavy lifting of processing, managing and storing large amounts of data is done by the server host, and clients only retrieve the data that they need.

<img src="img/lecture1/client-server.png" width="500">

> Although sometimes used interchangeably, there is a difference between a **client/server** and a **client/server host**. A host is a device, whereas a client/server is a piece of software. For example, you can simultaneously have multiple client programs connected to a remote database. Similarly, a remote host is a device (i.e. a computer) that might have several server programs running concurrently.

The idea of client-server models for databases has become the standard of computation and storage today, known as **cloud computing**:

- Today we rarely store movie or music files on our computers. This is why most of us get by with laptops that have only 256/512 GB of space: most of what would take up space is already provided as a cloud service (e.g. Netflix, Spotify, YouTube), or is stored on cloud storage services (e.g. OneDrive, Dropbox, Google Drive).
- We rarely run production-stage computation-intensive jobs on our own computers. All such computations are done on cloud-computing services (e.g. Google Cloud Platform, Amazon Web Services, Microsoft Azure). I personally haven't run a single simulation code on my own computer, nor have I ever stored any raw data locally. I use my computer mainly as an interface to access the services that I want.

> Note that there are certain situations where one might want to **locally** benefit from the advantages of storing data in a database. A relational database engine that works only with local databases is SQLite. If you're curious to find out the use cases for **SQLite**, take a look [here](https://www.sqlite.org/whentouse.html).

Whenever we use Postgres (or any other client-server DBMS), the first step before anything else is to **connect** to the database server. This is why we will talk about _host address_, _port_, _username_, and _password_ when we try to use a database.

## How to run SQL in Postgres?

Well, we have a variety of options to run our SQL statements in PostgreSQL:

- pgAdmin is the official web-based GUI for interacting with PostgreSQL databases
- `psql` is PostgreSQL's interactive command-line interface
- `%sql` and `%%sql` magic commands in Jupyter notebooks, which are provided by the `ipython-sql` package
- `psycopg2` is the most widely used Python adapter for PostgreSQL databases
- Using the `read_sql_query()` function in Pandas

I will demonstrate the usage of all these interfaces here.

### pgAdmin

I will demo this in the lecture.

> I like to use a `Shift + Enter` keyboard shortcut to run my queries. You can configure this too by going to Preferences -> Query Tool -> Keyboard shortcuts and changing "Execute query" to `Shift + Enter`.

### `psql`

This is PostgreSQL's command-line tool that allows us to interactively run SQL statements as well as "meta" commands. I introduce a couple of useful `psql` meta commands here, but you can find all the other ones in the Postgres documentation [here](https://www.postgresql.org/docs/current/app-psql.html) or a shorter version in this [cheatsheet](http://www.postgresonline.com/downloads/special_feature/postgresql83_psql_cheatsheet.pdf).

| Command | Usage                                         |
|---------|-----------------------------------------------|
| `\l`    | list all databases                            |
| `\c`    | connect to a database                         |
| `\cd`   | change directory                              |
| `\!`    | execute shell commands                        |
| `\i`    | execute commands from file                    |
| `\d`    | list tables and views                         |
| `\d+`   | list tables and views with additional info    |
| `\dt`   | list tables                                   |
| `\dt+`  | list tables with additional info              |
| `\h`    | view help on SQL commands                     |
| `\?`    | view help on psql meta commands               |
| `\q`    | quit interactive shell                        |

> Note that you don't need to terminate meta commands with `;`.

### `ipython-sql` (`%sql` and `%%sql`)

`ipython-sql` is a package that enables us to run SQL statements right from a Jupyter notebook. This package is included in the `dsci513env.yaml` environment file, so you should have it installed in your conda environment. In order to use it, we should load it first:

In [1]:
%load_ext sql

Now we need the host address of where the database is stored, along with a username and a password.

It is always a bad idea to store login information directly in a notebook or code file, for security reasons. For example, you don't want to commit your sensitive login information to a Git repo.

In order to avoid that, we store that kind of information in a separate file, like `credentials.json` here, and read the username and password into our IPython session:

In [2]:
import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

Also make sure to add the file name (e.g. `credentials.json`) to your `.gitignore` file, so you don't accidentally commit it.
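
You might wonder why the password goes through `urllib.parse.quote` at all. The sketch below shows the idea with made-up credentials: characters such as `@`, `:` and `/` have a special meaning inside a URL, so they have to be percent-encoded before they are placed in the connection string. Passing `safe=""` is my own choice here, since by default `quote` leaves `/` unescaped.

```python
from urllib.parse import quote

# Hypothetical credentials, for illustration only.
username = "postgres"
raw_password = "p@ss/word:1"
host, port = "localhost", 5432

# quote() leaves "/" unescaped by default, so pass safe="" to escape it too.
password = quote(raw_password, safe="")

url = f"postgresql://{username}:{password}@{host}:{port}/world_dsci513"
print(url)
```

Without the encoding, the `@` inside the password would be read as the separator between the credentials and the host.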

Now we can establish the connection to the `world_dsci513` database using the following code:

In [3]:
%sql postgresql://{username}:{password}@{host}:{port}/world_dsci513

'Connected: postgres@world_dsci513'

Note that we have used the `%sql` line magic to treat the rest of the line as input to the magic command. This is similar to the `%timeit` magic that we used in DSCI 511.

We can also use `%%sql` cell magic to apply the magic to an entire notebook cell.

A limited number of `psql` meta commands (e.g. `\l`, `\d`) can also be executed here. This is made possible through the `pgspecial` package. For example, let's list all databases that exist on our PostgreSQL server:

In [5]:
%sql \l

 * postgresql://postgres:***@localhost:5432/world_dsci513
4 rows affected.


Name,Owner,Encoding,Collate,Ctype,Access privileges
postgres,postgres,UTF8,C,C,
template0,postgres,UTF8,C,C,=c/postgres postgres=CTc/postgres
template1,postgres,UTF8,C,C,=c/postgres postgres=CTc/postgres
world_dsci513,postgres,UTF8,C,C,


Or list the relations (i.e. tables) in the current database:

In [6]:
%sql \d

 * postgresql://postgres:***@localhost:5432/world_dsci513
3 rows affected.


Schema,Name,Type,Owner
public,city,table,postgres
public,country,table,postgres
public,countrylanguage,table,postgres


Let's run some SQL statements now. Let's retrieve the `name` and `population` columns from the `country` table:

In [6]:
%sql SELECT name, population FROM country;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
Afghanistan,22720000
Netherlands,15864000
Netherlands Antilles,217000
Albania,3401200
Algeria,31471000
American Samoa,68000
Andorra,78000
Angola,12878000
Anguilla,8000
Antigua and Barbuda,68000


#### Limiting returned and displayed rows

As you can see, all rows are returned and displayed by default. This behaviour can be problematic if our table is very large for two reasons:
1. Retrieving large tables can be slow, and maybe not necessary
2. Displaying a lot of rows clutters our Jupyter notebook

We can modify the `ipython-sql` configuration to limit the number of returned and displayed rows. For example, here we change the display limit:

In [8]:
%config SqlMagic.displaylimit = 20

In [9]:
%sql SELECT name, population FROM country;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
Afghanistan,22720000
Netherlands,15864000
Netherlands Antilles,217000
Albania,3401200
Algeria,31471000
American Samoa,68000
Andorra,78000
Angola,12878000
Anguilla,8000
Antigua and Barbuda,68000


Looks good. Let's apply the magic to an entire cell so that we can break the lines:

In [10]:
%%sql

SELECT
    name, population
FROM
    country
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
Afghanistan,22720000
Netherlands,15864000
Netherlands Antilles,217000
Albania,3401200
Algeria,31471000
American Samoa,68000
Andorra,78000
Angola,12878000
Anguilla,8000
Antigua and Barbuda,68000


We can use `*` to retrieve all columns:

In [11]:
%%sql

SELECT
    *
FROM
    country
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


code,name,continent,region,surfacearea,indepyear,population,lifeexpectancy,gnp,gnpold,localname,governmentform,headofstate,capital,code2
AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,22720000,45.9,5976.0,,Afganistan/Afqanestan,Islamic Emirate,Mohammad Omar,1,AF
NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,15864000,78.3,371362.0,360478.0,Nederland,Constitutional Monarchy,Beatrix,5,NL
ANT,Netherlands Antilles,North America,Caribbean,800.0,,217000,74.7,1941.0,,Nederlandse Antillen,Nonmetropolitan Territory of The Netherlands,Beatrix,33,AN
ALB,Albania,Europe,Southern Europe,28748.0,1912.0,3401200,71.6,3205.0,2500.0,Shqipëria,Republic,Rexhep Mejdani,34,AL
DZA,Algeria,Africa,Northern Africa,2381741.0,1962.0,31471000,69.7,49982.0,46966.0,Al-Jazair/Algérie,Republic,Abdelaziz Bouteflika,35,DZ
ASM,American Samoa,Oceania,Polynesia,199.0,,68000,75.1,334.0,,Amerika Samoa,US Territory,George W. Bush,54,AS
AND,Andorra,Europe,Southern Europe,468.0,1278.0,78000,83.5,1630.0,,Andorra,Parliamentary Coprincipality,,55,AD
AGO,Angola,Africa,Central Africa,1246700.0,1975.0,12878000,38.3,6648.0,7984.0,Angola,Republic,José Eduardo dos Santos,56,AO
AIA,Anguilla,North America,Caribbean,96.0,,8000,76.1,63.2,,Anguilla,Dependent Territory of the UK,Elisabeth II,62,AI
ATG,Antigua and Barbuda,North America,Caribbean,442.0,1981.0,68000,70.5,612.0,584.0,Antigua and Barbuda,Constitutional Monarchy,Elisabeth II,63,AG


#### Assigning returned rows to Python variables

Single line queries:

In [12]:
query_output = %sql SELECT name, population FROM country;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


Multi-line queries:

In [13]:
%%sql query_output <<

SELECT
    name, population
FROM
    country
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.
Returning data to local variable query_output


In [14]:
query_output

name,population
Afghanistan,22720000
Netherlands,15864000
Netherlands Antilles,217000
Albania,3401200
Algeria,31471000
American Samoa,68000
Andorra,78000
Angola,12878000
Anguilla,8000
Antigua and Barbuda,68000


Although this looks like a dataframe, `query_output` is not a dataframe:

In [15]:
type(query_output)

sql.run.ResultSet

But we can easily convert it to a Pandas dataframe:

In [16]:
df = query_output.DataFrame()

In [17]:
type(df)

pandas.core.frame.DataFrame

In [18]:
df['name']

0                                       Afghanistan
1                                       Netherlands
2                              Netherlands Antilles
3                                           Albania
4                                           Algeria
                           ...                     
234                  British Indian Ocean Territory
235    South Georgia and the South Sandwich Islands
236               Heard Island and McDonald Islands
237                     French Southern territories
238            United States Minor Outlying Islands
Name: name, Length: 239, dtype: object

In [19]:
df.loc[df['name'] == 'Canada', 'population']

88    31147000
Name: population, dtype: int64

If you want the result of every query to be automatically converted to a Pandas dataframe, there is an option for that in `ipython-sql`:

In [20]:
%config SqlMagic.autopandas = True

In [21]:
new_query = %sql SELECT name, population FROM country;
new_query

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


Unnamed: 0,name,population
0,Afghanistan,22720000
1,Netherlands,15864000
2,Netherlands Antilles,217000
3,Albania,3401200
4,Algeria,31471000
...,...,...
234,British Indian Ocean Territory,0
235,South Georgia and the South Sandwich Islands,0
236,Heard Island and McDonald Islands,0
237,French Southern territories,0


In [22]:
type(new_query)

pandas.core.frame.DataFrame

In [23]:
%config SqlMagic.autopandas = False

#### Embedding variables

Much like when we embed variables in strings in Python using f-strings, we can do the same in `ipython-sql` by preceding the variable name with a `:`, i.e. a colon:

In [24]:
loc1 = 'Canada'

%sql SELECT name, population FROM country WHERE name = :loc1

 * postgresql://postgres:***@localhost:5432/world_dsci513
1 rows affected.


name,population
Canada,31147000
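
The `:loc1` syntax resembles the named placeholders that Python database drivers offer. As a rough analogy (not a description of how `ipython-sql` works internally), here is the same query with the built-in `sqlite3` module and a tiny made-up `country` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("Canada", 31147000), ("Albania", 3401200)],
)

loc1 = "Canada"

# The value is passed separately from the SQL text, so it is never
# spliced into the query string the way an f-string would do.
matches = conn.execute(
    "SELECT name, population FROM country WHERE name = :loc1",
    {"loc1": loc1},
).fetchall()
print(matches)
```

Keeping values out of the query string like this is also the usual way to avoid SQL injection when the value comes from user input.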


## More SQL commands

### `DISTINCT`

The `DISTINCT` keyword is used to return only distinct rows from a table, and ignore duplicates:

```sql
SELECT
    DISTINCT column1, column2, ...
FROM
    table1;
```

Note that `DISTINCT` is applied to **all columns** that we list after `SELECT`, and returns all distinct combinations of values stored in those columns. In the above code snippet, columns other than `column1` and `column2` can still have duplicate values.

In [25]:
%%sql

SELECT
    DISTINCT continent
FROM
    country
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
7 rows affected.


continent
Asia
South America
North America
Oceania
Antarctica
Africa
Europe


In [26]:
%%sql

SELECT
    DISTINCT continent, region
FROM
    country
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
25 rows affected.


continent,region
Oceania,Melanesia
Oceania,Australia and New Zealand
North America,Central America
Africa,Northern Africa
Asia,Eastern Asia
Oceania,Polynesia
Europe,Nordic Countries
Asia,Middle East
Oceania,Micronesia/Caribbean
Europe,Baltic Countries


### `DISTINCT ON`

`DISTINCT ON` is not standard SQL, but a useful Postgres extension that returns distinct rows based only on the values of the column (or columns) listed in parentheses, whereas `DISTINCT` applies to all selected columns.

```sql
SELECT
    DISTINCT ON (column1) column1, column2
FROM
    table1
;
```

Note that only the first row of each duplicate group is returned. Unless an `ORDER BY` clause fixes the order of the rows, it's not predictable which row in the duplicate group is returned as the first row!

In [27]:
%%sql

SELECT
    DISTINCT ON (countrycode) countrycode, name
FROM
    city
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
232 rows affected.


countrycode,name
ABW,Oranjestad
AFG,Mazar-e-Sharif
AGO,Luanda
AIA,The Valley
ALB,Tirana
AND,Andorra la Vella
ANT,Willemstad
ARE,Abu Dhabi
ARG,San Salvador de Jujuy
ARM,Yerevan


### `ORDER BY`

The `ORDER BY` keyword sorts the results according to one or more columns:

```sql
SELECT
    column1, column2, ...
FROM
    table1
ORDER BY
    column1, column2, ...;
```

The rows are sorted in **ascending** order by default.

In [28]:
%%sql

SELECT
    name, population
FROM
    country
ORDER BY
    population
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
Heard Island and McDonald Islands,0
United States Minor Outlying Islands,0
South Georgia and the South Sandwich Islands,0
Antarctica,0
Bouvet Island,0
British Indian Ocean Territory,0
French Southern territories,0
Pitcairn,50
Cocos (Keeling) Islands,600
Holy See (Vatican City State),1000


We can also sort the returned rows in descending order by adding the `DESC` keyword after the column names. In fact, there is an `ASC` keyword as well for ascending sorting, which is optional:

```sql
SELECT
    column1, column2, ...
FROM
    table1
ORDER BY
    column1 [ASC|DESC], column2 [ASC|DESC], ...;
```

In [29]:
%%sql

SELECT
    name, population
FROM
    country
ORDER BY
    population DESC
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
239 rows affected.


name,population
China,1277558000
India,1013662000
United States,278357000
Indonesia,212107000
Brazil,170115000
Pakistan,156483000
Russian Federation,146934000
Bangladesh,129155000
Japan,126714000
Nigeria,111506000


### `LIMIT`

We've already talked about how we can limit the number of returned rows from the database using `ipython-sql`'s configuration, but that is specific to the `ipython-sql` extension. With SQL in general, we can use the `LIMIT` keyword to limit the number of returned rows:

```sql
SELECT
    column1, column2, ...
FROM
    table1
LIMIT
    N_ROWS;
```

In [30]:
%%sql

SELECT
    name, continent
FROM
    country
LIMIT
    5
;

 * postgresql://postgres:***@localhost:5432/world_dsci513
5 rows affected.


name,continent
Afghanistan,Asia
Netherlands,Europe
Netherlands Antilles,North America
Albania,Europe
Algeria,Africa


It is also possible to skip the first `n` rows by supplying the optional `OFFSET` keyword:

```sql
SELECT
    column1, column2, ...
FROM
    table1
LIMIT
    N_ROWS OFFSET N_OFFSET;
```

In [31]:
%%sql

SELECT
    name, continent
FROM
    country
LIMIT
    5 OFFSET 10;

 * postgresql://postgres:***@localhost:5432/world_dsci513
5 rows affected.


name,continent
United Arab Emirates,Asia
Argentina,South America
Armenia,Asia
Aruba,North America
Australia,Oceania


---
    
**Remember:**

The order of SQL keywords does matter: `SELECT`, `FROM`, `JOIN`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, `LIMIT`
    
---

## SQL-Pandas similarity

|SQL Command|Functionality|Example|Pandas Equivalent|
|---|---|---|---|
|`SELECT`|Extracts data from a database|`SELECT surname, email FROM info;`|`df[["surname", "email"]]`|
|`LIMIT`|Limits the number of rows returned|`SELECT * FROM info`<br>`LIMIT 5;`|`df.head()`|
|`COUNT()`|Counts how many rows are returned; note the parentheses, because it's a function|`SELECT COUNT(*) FROM info;`|`df.shape[0]`|
|`SELECT DISTINCT`|Returns only unique values|`SELECT DISTINCT city FROM info;`|`df[["city"]].drop_duplicates()`|
|`WHERE`|Filters data based on one or more conditions (using `>`, `<`, `=`, `!=`, etc.)|`SELECT * FROM info` <br>`WHERE city='Vancouver';`|`df.query("city == 'Vancouver'")`|
|`ORDER BY`|Sorts returned data in ascending (default) or descending order|`SELECT * FROM info` <br> `ORDER BY stu_id;` <br>`SELECT * FROM info`<br>`ORDER BY stu_id DESC;`|`df.sort_values(by="stu_id")`<br>`df.sort_values(by="stu_id", ascending=False)`|
|`MIN()`, `MAX()`, `AVG()`|Performs specified operation on selected data|`SELECT MIN(dsci_511) FROM grades;`|`df["dsci_511"].min()`|

More resources on comparing SQL and Pandas for data retrieval:
- [Pandas documentation](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html)
- [Pandas cheatsheet for SQL people from Kaggle](https://www.kaggle.com/adilaliyev/pandas-cheatsheet-for-sql-people)