# Lecture 2: Data types, filtering, functions

## Announcements

### About labs 
- Lab 1 this week. For the deadline, check MDS calendar.
- ***Make sure everyone at least gets your database (with data loaded) and environment ready before coming to the lab*** so that you can start working on the lab questions and be more productive during lab. Utilize lab hours and OH to get help. 
- Watch the video [here](important) if you want to see how to create and load database. I mentioned 1 video as a must watch, and please watch it.
- Remember to check [slack etiquette](https://pages.github.ubc.ca/MDS-2024-25/DSCI_513_database-data-retr_students/README.html#slack-etiquette) 

```{important}
You might not be getting responses to your questions during the weekends, holidays, and off hours. ***So please make sure you start working on it early***. Make use of lab hours and office hours.
```

### Other announcements
- Lecture notes have more information. Make sure you read them. Themes are just a summary of what we did in the lecture (there could be things that I haven't added to the themes). You can ignore things that are mentioned as optional in lecture notes.
- [For all timelines, please check here](https://pages.github.ubc.ca/MDS-2024-25/DSCI_513_database-data-retr_students/README.html#timelines). If you don't see anything within the timeline I mentioned there, please post in slack.

```{margin}
<img src="img/recap.png" width="400px">
```

## Recap
- WHY and WHEN to use databases. It's benefits.
- Relational databases. Breaking the table to reduce redundancy (we don't want anomalies).
- Recipe for a table: Name of relation + Attributes (name and domain)
- DDL vs DML
- Basic keywords: `SELECT`, `FROM`, `DISTINCT`, `WHERE`, `ORDER BY`, `LIMIT`

```{admonition} Recap iclicker: Following is a "student" table. 

| student_id | name   | age |
|------------|--------|-----|
| 1          | Emily   | 20  |
| 4          | Emily  | 19  |
| 8          | Emily  | 19  |
| 2          | Jane   | 22  |
| 3          | Tom    | 21  |
| 5          | Alex   | 23  |
| 6          | Alex    | 23  |
| 7          | Alice  | 40  |


What is the output of the following query?

    SELECT DISTINCT name, age
    FROM student
    ORDER BY age
    LIMIT 2;

A.  
| name   | age |
|--------|-----|
| Emily   | 19  |
| Emily    | 20  |

B.  
| name   | age |
|--------|-----|
| Emily   | 19  |
| Tom    | 21 |

C.  
| name   | age |
|--------|-----|
| Emily   | 19  |
| Emily    | 19  |

D.
| name   | age |
|--------|-----|
| Alice   | 40  |
| Alex    | 23  |

```

```{toggle}
> A

- Brush up on `DISTINCT` and `LIMIT` keywords.
- By default, `ORDER BY` is ascending. 
```

## ***Todays theme:***
- Data types (go quickly through them)
- Filtering data (using `WHERE`)
    - Use of `LIKE`,`ILIKE`, `IN`, `BETWEEN`, `IS NULL` (you can also prefix it with `NOT`)
    - Combining conditions using `AND`, `OR`, `NOT` - also use of `()` to group conditions, and order of precedence
- Column aliases using `AS`
- Derived columns
- Use of `CASE` keyword
- Some inbuilt functions (go quickly through them)

In [2]:
%load_ext sql
%config SqlMagic.displaylimit = 20

In [3]:
# This is how you deal with credentials in a notebook
import json
import urllib.parse
## replace with your own credentials
with open('../lectures/data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
# Just printing the logo information to see I am getting the correct file and 
# information
login

{'host': 'localhost', 'port': 5432, 'user': 'postgres', 'password': 'postgres'}

This time I am connecting to the `imdb` database

In [4]:
%sql postgresql://{username}:{password}@{host}:{port}/imdb

## Data types (go quickly through them)

>This is mostly used when creating a table, which typically follows the database design and modeling phase. As data scientists, you might not often create tables, but you will query them. However, it is beneficial to know the data types available in the database you are working with.

Postgres supports

- boolean
- character
- number
- datetime
- binary

###  Why do I need to know about datatypes?

- Of course, we want it to create a table, and that is part of our schema ( remember the recipe and Powerpuff Girls??). The database designers decide them.

<img src="../lectures/img/lecture1/table_anatomy.png" width="600">

```{margin}
<img src="img/lecture4.png" width="600">
```

Like in the above example, we have a table called Students with 5 columns. Each column has a name and a datatype. Following how a database admin would create a table in Postgresql. (***We will learn more about DDL in lecture 4. You will know about constraints like "PRIMARY KEY"***)

```sql
CREATE TABLE Students (
    sid VARCHAR(255) PRIMARY KEY,
    name VARCHAR(255),
    login VARCHAR(255),
    age INTEGER,
    gpa REAL
);
```
> Since our focus is more on data retrieval. Should we need to know about data types?

YES!! Sometimes, we want to apply `casting` to some columns to perform certain operations. Many of these castings are performed implicitly by the Postgres engine (remember the `DBMS` engine). But there might be a few that we need to explicitly specify. So, if we want to explicitly convert a datatype, we use the syntax below.

```sql
CAST(<column> AS <data_type>)

# specific to Postgres

<column>::<data_type>
```

For example, see a couple of cases below. DON'T WORRY IF YOU DON'T GET IT NOW. We will learn about it in this lecture and revisit this code at the end.

In [5]:
%%sql

SELECT

-- 555 ilike '5%',
555::TEXT ilike '5%',

-- '5' + '55', -- fails, since it doesn't do implicit casting

'5'::INTEGER + '55'::INTEGER,

5/2, -- NO implicit casting, and gives integer result 2

5/2::float, -- explicit casting to float gives 2.5, so make sure you do this

lower(SUBSTR('5555555', LENGTH('5555555'), 1)) IN ('5', 'a')

-- lower(SUBSTR(5555555, LENGTH('5555555'), 1)) IN ('5', 'a') -- need explicit casting with substr

-- lower(SUBSTR('5555555', LENGTH(5555555), 1)) IN ('5', 'a') -- need explicit casting with length
;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


?column?,?column?_1,?column?_2,?column?_3,?column?_4
True,60,2,2.5,True


From the above example, it is clear that we want to know about datatypes and how to cast them. So, let us get started.

So, let's learn about each of these datatypes.

### Boolean

In [6]:
%%sql
SELECT
    'TRUE'::BOOLEAN,
    'T'::BOOLEAN,
    '0'::BOOLEAN,
    'NO'::BOOLEAN
;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


bool,bool_1,bool_2,bool_3
True,True,False,False


### Characters

The character data type is used to represent fixed-length and variable-length character strings. This type can be defined using the following keywords:

- `CHAR(n)`: a string of exactly n characters padded with spaces
- `VARCHAR(n)`: a variable set of n characters
- `TEXT`, which is a Postgres-specific type for which there is practically no limit on the number of characters.

In [7]:
%%sql

SELECT
    'Arman'::CHAR(50),
    'Arman'::VARCHAR(50),
    'Arman'::VARCHAR(2), -- mostly used
    'Arman'::TEXT
;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


bpchar,varchar,varchar_1,text
Arman,Arman,Ar,Arman


```{seealso}
Wanna know more about the difference between CHAR and VARCHAR? Check out this [link](https://www.geeksforgeeks.org/char-vs-varchar-in-sql/)
```

```{admonition} Optional discussion: What is the difference between CHAR and VARCHAR?
<img src="img/discuss.png" width="120">
```

```{toggle}
- Char
    - Used to store character string value of fixed length.
    - faster than Varchar
    - static memory allocation

- Varchar
    - Used to store character string value of variable length.
    - slower than char
    - dynamic memory allocation
```

### Numbers

Numerical values in Postgres belong to the following general categories:
- Integers
- Floating-point numbers
- Arbitrary precision numbers

```{note}
Don't want to memorize these ranges.
```
**Integers:**

| Name     | Storage Size | Description                | Range                                        |
|----------|--------------|----------------------------|----------------------------------------------|
| `smallint` | 2 bytes      | small-range integer        | -32768 to +32767                             |
| `integer`  | 4 bytes      | typical choice for integer | -2147483648 to +2147483647                   |
| `bigint`   | 8 bytes      | large-range integer        | -9223372036854775808 to +9223372036854775807 |
| `serial`      | 4 bytes | auto-incrementing integer       | 1 to 2147483647          |
| `bigserial`   | 8 bytes | large auto-incrementing integer | 1 to 9223372036854775807 |

**Floating-point numbers:**

| Name     | Storage Size | Description                | Range                                        |
|----------|--------------|----------------------------|----------------------------------------------|
| `real`             | 4 bytes  | variable-precision, inexact     | at least 6 decimal digits (implementation dependent) |
| `double precision` | 8 bytes  | variable-precision, inexact     | at least 15 decimal digits (implementation dependent) |

**Arbitrary precision numbers**

| Name     | Storage Size | Description                | Range                                        |
|----------|--------------|----------------------------|----------------------------------------------|
| `numeric`          | variable | user-specified precision, exact | 131072 digits before and 16383 digits after the decimal point |
| `decimal`          | variable | user-specified precision, exact | 131072 digits before and 16383 digits after the decimal point |

In [None]:
%%sql
SELECT 44.7874::SMALLINT,
44.7874::INT,
5/2, -- missing precision
5/2::real,
'183.123456789'::real,
'183.123456789659986566656'::double precision,
'183.123456789659986566656'::numeric(25, 22); -- Total number of digits is 25 and 22 digits after decimal

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


int2,int4,?column?,?column?_1,float4,float8,numeric
45,45,2,2.5,183.12346,183.12345678966,183.12345678966


```{admonition} Discussion: Why do we want to use numeric over float (real and double precision)?
<img src="img/discuss.png" width="120">
```

```{toggle}
- Numeric values have a much larger range than floats.
```

```{admonition} Discussion: Why do we want to use float (real and double precision) over numeric?
<img src="img/discuss.png" width="120">
```

```{toggle}
- If the number's precision is within the float's range, then floating-point values will be faster to work with.
```

### Date and time

In [9]:
%%sql
SELECT
    '1/23/2021'::DATE,
    'today'::DATE,
    'tomorrow'::DATE,
    '2:24pm'::TIME,
    '2:24 PM PST'::TIME WITH TIME ZONE,
        'now'::TIME WITH TIME ZONE,
        '2021-11-18 8:30:00'::TIMESTAMP,
         '2021-11-18 8:30:00'::TIMESTAMPTZ;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


date,date_1,date_2,time,timetz,timetz_1,timestamp,timestamptz
2021-01-23,2024-11-19,2024-11-20,14:24:00,14:24:00-08:00,15:13:11.778485-08:00,2021-11-18 08:30:00,2021-11-18 08:30:00-08:00


In [10]:
%sql SHOW TIMEZONE;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


TimeZone
America/Vancouver


```{admonition} As you can see, the timezone value here is "America/Vancouver". From where do you think we are getting this timezone value?

<img src="img/discuss.png" width="120">
```

```{toggle}
The local server timezone is what it is getting
```

Now, let us explicitly set a timezone and see what happens.

In [11]:
%sql SET timezone = 'America/New_York';
# -- SET timezone = 'America/Los_Angeles'';

 * postgresql://postgres:***@localhost:5432/imdb
Done.


[]

In [12]:
%sql SHOW TIMEZONE;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


TimeZone
America/New_York


In [13]:
%%sql

SELECT
    '1/23/2021'::DATE,
    'today'::DATE,
    'tomorrow'::DATE,
    '2:24pm'::TIME,
    '2:24 PM PST'::TIME WITH TIME ZONE,
        'now'::TIME WITH TIME ZONE,
        '2021-11-18 8:30:00'::TIMESTAMP,
         '2021-11-18 8:30:00'::TIMESTAMPTZ;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


date,date_1,date_2,time,timetz,timetz_1,timestamp,timestamptz
2021-01-23,2024-11-19,2024-11-20,14:24:00,14:24:00-08:00,18:13:11.867711-05:00,2021-11-18 08:30:00,2021-11-18 08:30:00-05:00


```{important}
Postgres does not store timezone information. It always internally stores TIMESTAMPTZ in UTC value and does the back-conversion using the local time zone of the database server or the timezone specified by the user.
```

```{margin}
<img src="img/null.gif" width="600">
```
### Nulls

```{note}
You will see how to filter NULL values in the next section.
```

A null is a marker indicating that a column's value is unknown or not entered yet. A null is not equal to 0 or an empty string. In fact, a null is not even equal to another null!!!

How different environments show nulls:

- ipython-sql -> None
- psql -> blank space
- pgAdmin -> [null]

## More SQL keywords

>Following topics are quite important for Data scientists. Let's look into some common traps new SQL programmers can fall into.

```{margin}
<img src="img/trap.gif" width="400px">
```

### WHERE (Filtering rows)

In [14]:
%%sql

SELECT * -- * returns all columns
FROM movies
WHERE title = 'Lost Highway' ;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10116922,Lost Highway,,1997,,134,7.6,120549


```{admonition} iclicker: Based on the below 3 queries, answer the following options:

***Query 1:***

    SELECT * 
    FROM movies
    WHERE title = 'Lost Highway' ;

***Query 2:***

    SELECT * 
    FROM movies
    WHERE title = 'lost highway' ;

***Query 3:***

    SELECT * 
    FROM movies
    WHERE title = "Lost Highway" ;


The following are options (select the best that apply):

A) Only Query 1 and Query 2 will return the same result

B) Only Query 1 and Query 3 will return the same result

C) Queries 1, 2, and 3 will return the same result

D) None of the queries will return the same result

```


```{toggle}
Answer: D

```{important}
- Having "Lost Highway" instead of 'Lost Highway' won't work. So, it is a single quote that you want to use.
- Be careful. Whatever in the WHERE clause is case sensitive. So 'Lost Highway' is not equal to 'lost highway'
```

Here are things that usually go into a where clause

- ***Condition and operator***

| Condition        | Operator                        |
|------------------|---------------------------------|
| Comparison       | `=`, `<>`, `<`, `<=`, `>`, `>=` |
| Pattern matching | `LIKE`                          |
| Range            | `BETWEEN`                       |
| List             | `IN`                            |
| Null testing     | `IS NULL`                       |

- ***Combining conditions with logical operators***

We can combine multiple conditions logical/boolean operators AND, OR, and NOT.

When there are multiple logical operators, 
- ***NOT is evaluated first***, 
- ***then AND*** and 
- ***finally OR***. 

```{margin}
<img src="img/notandor.png" >
```
However, we can use parentheses to change the order of evaluation. 

```{caution}
Be extremely careful when using NOT, AND, and OR together in a single condition.
```

In [15]:
%%sql

SELECT * FROM movies
WHERE start_year = 2015 OR start_year = 2018 AND rating > 8;

 * postgresql://postgres:***@localhost:5432/imdb
1048 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10369610,Jurassic World,,2015,,124,7.0,547391
10420293,The Stanford Prison Experiment,,2015,,122,6.9,33319
10478970,Ant-Man,,2015,,117,7.3,517941
10790770,Miles Ahead,,2015,,100,6.4,8650
10884732,The Wedding Ringer,,2015,,101,6.6,67575
11533089,Tab Hunter Confidential,,2015,,90,7.8,2852
11596363,The Big Short,,2015,,130,7.8,318033
11598642,Z for Zachariah,,2015,,98,6.0,25985
11618448,Racing Extinction,,2015,,90,8.3,7042
11638355,The Man from U.N.C.L.E.,,2015,,116,7.3,245184


Here is how the execution order is

<img src="img/withoutpara.png" width="1200">

In [16]:
%%sql

SELECT * FROM movies
WHERE (start_year = 2015 OR start_year = 2018) AND rating > 8;

 * postgresql://postgres:***@localhost:5432/imdb
119 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
11618448,Racing Extinction,,2015,,90,8.3,7042
12096673,Inside Out,,2015,,95,8.2,550606
12473476,Be Here Now,,2015,,100,8.7,2863
12631186,Baahubali: The Beginning,Bahubali: The Beginning,2015,,159,8.1,94989
12865822,All the World in a Design School,,2015,,59,8.4,1270
13170832,Room,,2015,,118,8.2,326042
13270538,Requiem for the American Dream,,2015,,73,8.1,8061
13717510,The Drop Box,,2015,,79,8.1,604
13865286,My Lonely Me,,2015,,95,8.2,671
14112208,Kuttram Kadithal,,2015,,120,8.1,638


Here is how the execution order is

<img src="img/withpara.png" width="1200">

```{admonition} SQL query: Find the number of movies in the movie_genres table that are NOT listed as 'drama'?

<img src="img/discuss.png" width="120">

Here are the first few rows of the ***movie_genres*** table:

| movieid | genre  |
|---------|--------|
|   100   | drama  |
|   200   | comedy |
|   300   | action |
|   400   | drama  |

***Hint:*** "COUNT()" function to count the number of returned rows. Eg: ***COUNT(DISTINCT movie_id)***
```

```{admonition} iclicker: Who thinks the below query is going to give what the question is asking for (Find the number of movies in the movie_genres table that are NOT listed as 'drama')?

    SELECT COUNT(DISTINCT movie_id) -- COUNT() function to count the number of returned rows 
    FROM movie_genres
    WHERE genre <> 'drama';

A) YES

B) NO

```{toggle}
So here is an animation of this scenario with some sample rows.

<img src="img/dramaquerysplit1.png" width="600">
```


```{margin}
<img src="img/5.png" width="600">
<img src="img/edgecases.png" width="600">
```

```{admonition} iclicker: Now, let's look at some more rows. After looking at the following rows, do you think the SQL query is giving exactly the same what is asked for (Find the number of movies in the movie_genres table that are NOT listed as 'drama'?)?

| movieid | genre  |
|---------|--------|
|   100   | drama  |
|   200   | comedy |
|   300   | action |
|   400   | drama  |
|   400   | comedy |


    SELECT COUNT(DISTINCT movie_id)
    FROM movie_genres
    WHERE genre <> 'drama';

A) YES

B) NO

```{toggle}
<img src="img/dramaquerysplit2.png" width="600">

```{tip}
Make sure you consider any edge cases when writing your SQL query. We will see how to deal with this issue in our upcoming lectures. We will learn how to do this in lecture 5 when we learn about subqueries. We will revisit this question then and do this using subqueries.
```

### LIKE and ILIKE

LIKE is used for string matching:

- `_` stands for any one character and 
- `%` stands for 0 or more arbitrary characters

LIKE is case sensitive and ILIKE case insensitive.

In [17]:
%%sql
SELECT
    'Arman' LIKE 'A%',
    'Gittu' NOT LIKE 'g%',
    'Gittu' ILIKE 'g%',
    'UBC' LIKE '_B_',
    'MDS is Woohoo!' LIKE '%!_',
    'Hello' LIKE '% %'; --looking for something before and after a space

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


?column?,?column?_1,?column?_2,?column?_3,?column?_4,?column?_5
True,True,True,True,False,False


### IN
Something that is going to be pretty useful.

---

**Example:** Retrieve rows from the `movie` table that correspond to the movies `'Donnie Brasco'`, `'The Usual Suspects'`, `'Schindler''s List'`, `'Shutter Island'`, `'A Beautiful Mind'`.

---

In [24]:
%%sql
-- you can add NOT IN
SELECT
    *
FROM
    movies
WHERE
    title IN ('Donnie Brasco',
              'The Usual Suspects',
              'Schindler','s List',
              'Shutter Island',
              'A Beautiful Mind'
               )
;

 * postgresql://postgres:***@localhost:5432/imdb
4 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10114814,The Usual Suspects,,1995,,106,8.5,922333
10119008,Donnie Brasco,,1997,,127,7.7,258120
10268978,A Beautiful Mind,,2001,,135,8.2,784095
11130884,Shutter Island,,2010,,138,8.1,1027318


### BETWEEN (just a fancy keyword or use <= and >= )

In [None]:
%%sql

SELECT 
    5 BETWEEN 1 AND 10, -- Note: BETWEEN is inclusive of both ends of the interval.
    DATE '2021-11-01' BETWEEN DATE '2021-01-01' AND DATE '2021-11-10',
    'w' NOT BETWEEN 'e' AND 'm';

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


?column?,?column?_1,?column?_2
True,True,True


`BETWEEN` query what we wrote above is pretty much same as what is written below.

In [19]:
%%sql
SELECT 
    5 >= 1 AND 5 <= 10,
    DATE '2021-11-01' >= DATE '2021-01-01' AND  DATE '2021-11-01' <= '2021-11-10',
    'w' >= 'e' AND 'w' <= 'm';

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


?column?,?column?_1,?column?_2
True,True,False


---

**Example:** Retrieve the name, production year and rating of the top 5 movies from the movie table that are produced between 2018 and 2020, and have a rating of at least 8.5 with at least 100000 votes. Sort the results in descending order based on ratings.

---

In [20]:
%%sql

SELECT title, start_year, rating
FROM movies
WHERE
    start_year BETWEEN 2018 AND 2020
    AND
    rating >= 8
    AND
    nvotes >= 100000
ORDER BY rating
LIMIT 5 ;

 * postgresql://postgres:***@localhost:5432/imdb
5 rows affected.


title,start_year,rating
Once Upon a Time... in Hollywood,2019,8.0
Toy Story 4,2019,8.0
Bohemian Rhapsody,2018,8.0
Green Book,2018,8.2
Spider-Man: Into the Spider-Verse,2018,8.4


### IS NULL

This doesn't work.

In [21]:
%%sql

SELECT *
FROM movies
WHERE orig_title = NULL ;-- this will not work

 * postgresql://postgres:***@localhost:5432/imdb
0 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes


This is to check if a column has null values.

In [22]:
%%sql

SELECT *
FROM movies
WHERE orig_title IS NULL; -- type with orig_title = NULL and see what happens

 * postgresql://postgres:***@localhost:5432/imdb
17788 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10035423,Kate & Leopold,,2001,,118,6.4,74982
10042742,Mister 880,,1950,,90,7.1,1171
10041181,Black Hand,,1950,,92,6.4,666
10041387,Francis,,1950,,91,6.4,979
10042052,Woman in Hiding,,1950,,92,6.9,553
10042179,Abbott and Costello in the Foreign Legion,,1950,,80,6.6,2573
10042200,Annie Get Your Gun,,1950,,107,6.9,4050
10042206,Armored Car Robbery,,1950,,67,7.0,2077
10042208,The Asphalt Jungle,,1950,,112,7.9,22106
10042211,Atom Man vs. Superman,,1950,,252,7.0,579


In [23]:
%%sql

SELECT * FROM movies
WHERE orig_title IS NOT NULL ;

 * postgresql://postgres:***@localhost:5432/imdb
8270 rows affected.


id,title,orig_title,start_year,end_year,runtime,rating,nvotes
10041719,Orpheus,Orphée,1950,,95,8.0,9346
10041931,Stromboli,"Stromboli, terra di Dio",1950,,107,7.3,5239
10042355,Story of a Love Affair,Cronaca di un amore,1950,,98,7.1,2209
10042619,Diary of a Country Priest,Journal d'un curé de campagne,1951,,115,8.0,8621
10042692,Variety Lights,Luci del varietà,1950,,97,7.1,2416
10042804,The Young and the Damned,Los olvidados,1950,,85,8.3,16453
10042810,Operation Disaster,Morning Departure,1950,,102,7.0,668
10042876,Rashomon,Rashômon,1950,,88,8.2,138304
10042906,La Ronde,La ronde,1950,,93,7.6,4456
10043048,To Joy,Till glädje,1950,,98,7.2,2109


### Column Aliases with AS

In [24]:
%%sql

SELECT
    title AS movieTitle,
    orig_title AS 'original Title',
    runtime AS Duration

FROM movies AS m;

 * postgresql://postgres:***@localhost:5432/imdb
(psycopg2.errors.SyntaxError) syntax error at or near "'original Title'"
LINE 3:     orig_title AS 'original Title',
                          ^

[SQL: SELECT
    title AS movieTitle,
    orig_title AS 'original Title',
    runtime AS Duration

FROM movies AS m;]
(Background on this error at: https://sqlalche.me/e/14/f405)


```{note}
The keyword AS is optional. I usually choose to use it because it makes the query more readable. So the below also works.
```

In [25]:
%%sql

SELECT
    title movieTitle,
    /*Only situation where we refer to a column with double quotes, as there is a white space*/
    orig_title "original Title", -- See here we are using double quotes
    runtime Duration
FROM
    movies m;

 * postgresql://postgres:***@localhost:5432/imdb
26058 rows affected.


movietitle,original Title,duration
Kate & Leopold,,118
Mister 880,,90
Black Hand,,92
Francis,,91
Orpheus,Orphée,95
Stromboli,"Stromboli, terra di Dio",107
Woman in Hiding,,92
Abbott and Costello in the Foreign Legion,,80
Annie Get Your Gun,,107
Armored Car Robbery,,67


```{margin}
<img src="img/joinslec4.png" width="600">
```

```{note}
We will often use table aliases when we work on SQL joins in the upcoming lectures! You MUST use it in some cases, which we will see in lecture 3.
```

```{admonition} iclicker: Does the below query execute with no issues ?

    SELECT
        title AS movieTitle,
        orig_title AS "oringinal Title",
        runtime AS Duration
    FROM
        movies
    WHERE
        Duration > 100;

a) Yes

b) No
```

```{toggle}
No. Check the order of execution. Next topic...
```

## Order of execution/processing in SQL

> Check the PrairieLearn question

[https://us.prairielearn.com/pl/course_instance/167527/instructor/question/9265385/preview/?variant_id=108271439](https://us.prairielearn.com/pl/course_instance/167527/instructor/question/9265385/preview/?variant_id=108271439)

***VERY important:*** Keep the following in mind.

Order of execution/processing in SQL:

    FROM and JOIN
        |
      WHERE
        |
    GROUP BY
        |
      HAVING
        |
      SELECT
        |
    DISTINCT
        |
    ORDER BY
        |
      LIMIT

Contrast this with the order of SQL clauses in a statement:

[https://us.prairielearn.com/pl/course_instance/167527/instructor/question/9265384/preview](https://us.prairielearn.com/pl/course_instance/167527/instructor/question/9265384/preview)

      SELECT
        |
      FROM
        |
      JOIN
        |
      WHERE
        |
    GROUP BY
        |
      HAVING
        |
    ORDER BY
        |
      LIMIT

## Derived columns

In [26]:
%%sql

SELECT
    title,
    /*Here we derive column runtime_hours from runtime*/
    runtime / 60::REAL AS runtime_hours
FROM
    movies;


 * postgresql://postgres:***@localhost:5432/imdb
26058 rows affected.


title,runtime_hours
Kate & Leopold,1.9666666666666663
Mister 880,1.5
Black Hand,1.5333333333333334
Francis,1.5166666666666666
Orpheus,1.5833333333333333
Stromboli,1.7833333333333334
Woman in Hiding,1.5333333333333334
Abbott and Costello in the Foreign Legion,1.3333333333333333
Annie Get Your Gun,1.7833333333333334
Armored Car Robbery,1.1166666666666667


```{admonition} iclicker: Will the following query work ?

    SELECT
        title,
        runtime / 60::REAL AS runtime_hours
    FROM
        movies
        where runtime_hours > 2;

A) YES

B) NO

```

```{toggle}
NO. Look at the order of execution. 

So explicitly specify the derived column name as runtime / 60::REAL in WHERE Clause. Check below:

```sql
SELECT
    title,
    runtime / 60::REAL AS runtime_hours
FROM
    movies
    where runtime / 60::REAL > 2;
```

## Case statement

Concept just like the case statement that you learned in R. But the syntax is quite different.

In [27]:
%%sql

SELECT
    title,
    runtime,
    CASE
        WHEN runtime > 90 THEN 'long'
        WHEN runtime BETWEEN 30 AND 90 THEN 'normal'
        ELSE 'short'
    END AS duration
FROM
    movies
;

 * postgresql://postgres:***@localhost:5432/imdb
26058 rows affected.


title,runtime,duration
Kate & Leopold,118,long
Mister 880,90,normal
Black Hand,92,long
Francis,91,long
Orpheus,95,long
Stromboli,107,long
Woman in Hiding,92,long
Abbott and Costello in the Foreign Legion,80,normal
Annie Get Your Gun,107,long
Armored Car Robbery,67,normal


## Functions & operators (go quickly through them)

>Only go through it if time permits, more like functions in R. Also these might be different in different databases. No no much use in memorizing them. Maybe worth putting them in cheatsheet.

```{note}
- you might not need to memorize all things here.
- that's why you have the cheatsheet so that you can follow them.
```
> Going fast here! There are many! Listed a few here, but more in lecture notes.

### Math

In [28]:
%%sql

SELECT
    25 * 2,
    ABS(-2^10),
    ROUND(23.24545, 2),
    SQRT(25),
    PI()
;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


?column?,abs,round,sqrt,pi
50,1024.0,23.25,5.0,3.141592653589793


### Strings
>You will be using this in labs. So you will get practice. 

In [29]:
%%sql
-- || is used for string concatenation
SELECT
    'Hello' || title || 'runtime is' || ROUND(runtime / 60., 1) || ' with rating '|| rating || ' / 10.',
      SUBSTR(title, LENGTH(title) - 3, 3) AS "Last 3 characters"
FROM
    movies
;

 * postgresql://postgres:***@localhost:5432/imdb
26058 rows affected.


?column?,Last 3 characters
HelloKate & Leopoldruntime is2.0 with rating 6.4 / 10.,pol
HelloMister 880runtime is1.5 with rating 7.1 / 10.,88
HelloBlack Handruntime is1.5 with rating 6.4 / 10.,Han
HelloFrancisruntime is1.5 with rating 6.4 / 10.,nci
HelloOrpheusruntime is1.6 with rating 8 / 10.,heu
HelloStromboliruntime is1.8 with rating 7.3 / 10.,bol
HelloWoman in Hidingruntime is1.5 with rating 6.9 / 10.,din
HelloAbbott and Costello in the Foreign Legionruntime is1.3 with rating 6.6 / 10.,gio
HelloAnnie Get Your Gunruntime is1.8 with rating 6.9 / 10.,Gu
HelloArmored Car Robberyruntime is1.1 with rating 7 / 10.,ber


Following sometimes comes useful when you want to convert a number to a readable format with commas.

In [30]:
%%sql
select to_char(10000000, '9,999,999,999');

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


to_char
10000000


### Date

In [31]:
%%sql

SELECT
    CURRENT_DATE,
    NOW(),
    CURRENT_TIMESTAMP(0),
    CURRENT_TIME(0),
    LOCALTIMESTAMP(0),
    LOCALTIME(2),
    EXTRACT(year FROM '2021-11-15 8:00:00'::TIMESTAMP),
    age(NOW(), '1987-01-05'::TIMESTAMP)
;

 * postgresql://postgres:***@localhost:5432/imdb
1 rows affected.


current_date,now,current_timestamp,current_time,localtimestamp,localtime,extract,age
2023-11-16,2023-11-16 23:57:03.044226-05:00,2023-11-16 23:57:03-05:00,23:57:03-05:00,2023-11-16 23:57:03,23:57:03.040000,2021,"13451 days, 23:57:03.044226"


## Wrap up

<img src="img/wrapup.gif" width="300">

Now let's look again at the "Fully Loaded" SQL whatever we started this lecture with.

```{margin}
<img src="img/loaded.png" width="600">
```

```sql
SET timezone = 'America/New_York';
SELECT
44.7874::INT,
-- 555 ilike '5%', - fails, since 555 is not a string, and it doesn't do implicit casting
555::TEXT ilike '5%',
-- '5' + '55', -- fails, since it doesn't do implicit casting
'5'::INTEGER + '55'::INTEGER,
5 + '55', -- Implicit casting
'2' between 1 AND '3',
'2021-11-01', -- Type character
DATE '2021-11-01', -- type date
'2021-11-01'::DATE, -- type date
DATE '2021-11-01' BETWEEN DATE '2021-01-01' AND DATE '2021-11-10',
'2021-11-01' BETWEEN '2021-01-01' AND '2021-11-10', --Implicit casting
'w' BETWEEN 'e' AND 'm',
5/2, -- NO implicit casting, and gives integer result 2
5/2::float, -- explicit casting to float gives 2.5, so make sure you do this
round('1.333', 2), -- Implicit casting, of character '1.333' to float 1.333
-- round('hello', 2), -- try implicit casting, fails since not possible
CURRENT_DATE, -- this gives current date in date type, pg_typeof() to know type
EXTRACT(year FROM CURRENT_DATE) - '1985', -- implicit casting with 1985 character to number 1985
-- EXTRACT( year FROM CURRENT_DATE::VARCHAR) - '1985', --implicit casting doesn't happen with date in character type
EXTRACT(year FROM CURRENT_DATE::VARCHAR::date)::INT - '1985'::INT, -- explicit casting
'today'::TIMESTAMPTZ,
lower(SUBSTR('5555555', LENGTH('5555555'), 1)) IN ('5', 'a')
-- lower(SUBSTR(5555555, LENGTH('5555555'), 1)) IN ('5', 'a') -- need explicit casting with substr
-- lower(SUBSTR('5555555', LENGTH(5555555), 1)) IN ('5', 'a') -- need explicit casting with length
;
```

```sql
SELECT
    'Title is ' || title AS "Title space", -- using || to concatenate, use of Alias, remember double quotes is used only in this situation
    'w' NOT BETWEEN 'e' AND 'm', -- between and not between, inclusive of both ends of the interval
    round(runtime / 60, 2) AS runtime_hours, -- derived column and use of some functions
    CASE                                    -- CASE WHEN same like what we learned in 523
        WHEN runtime > 90 THEN 'long'
        WHEN runtime BETWEEN 30 AND 90 THEN 'normal'
        ELSE 'short'
    END AS duration
FROM
    movies m
WHERE runtime / 60::REAL > 2 AND -- remember why we are not giving runtime_hours > 2, order of execution matters.
     title IN ('Donnie Brasco',     -- IN operator
              'The Usual Suspects',
              'Schindler''s List',
              'Shutter Island',
              'A Beautiful Mind'
               ) OR -- When multiple logical operators, then use () to make sure order of execution is correct, else NOT --> AND --> OR 
             title NOT ILIKE '% % %' AND -- asking not to include movies with more than 3 words
             title ILIKE '% %' AND -- asking for movies with 2 parts (or 2 words), so single word movies are excluded
             orig_title IS NOT NULL; -- NuLL is treated differently hence orig_tiltle = NULL will not work
```

what all should we recollect from the above code ?

- Do we know how to use alias AS ? And where to use double quotes ?
- Do you know how to use casting ?
- Do we know how to use derived columns ?
- Do we know how to use case statements ?
- Do we know how to use logical operators ? And how to use them with parentheses, as you should be careful about the order of execution.
- Why we used `WHERE runtime / 60::REAL > 2` instead of using `WHERE runtime_hours > 2`?


## Moral of the story

- Do we understand different datatypes and where they are used?
    - boolean
    - character
    - number
    - datetime
    - binary
- We looked into some casting functions.
- We learned how to filter rows using `WHERE` clause.
- Looked into logical operators `AND`, `OR`, `NOT`. Why the order operator is important?
- We learned about `IN`, `BETWEEN`, `IS NULL` keywords and how to use it
- We learned about column aliases and table aliases and why they are important (table aliases will be used a lot in the next lecture when we deal with JOINS)
- We learned about order of execution - And why it is important
- We learned about derived columns.
- We learned about case statements.