In [468]:
# Initialize Otter
import otter
grader = otter.Notebook("proj2.ipynb")

# Project 2: Query Performance
## Due Date: Thursday, October 5th, 5:00 PM

## Assignment Details
In this project, we will explore how the database system optimizes query execution and how users can further tune the performance of their queries.

This project works with the Lahman's Baseball Database, an open source collection of baseball statistics from 1871 to 2020. It contains a variety of data, like batting statistics, team stats, managerial records, Hall of Fame records, and much more.

You may find this project's queries to be simpler than in Project 1. However, although the queries may not be as complex, we still expect you to spend ample time thinking through the effects of each of the methods, as reasoning about the tradeoff between different approaches is the goal of this assignment.

**Note:** If at any point during the project the internal state of the database or its tables has been modified in an undesirable way (i.e., a modification not resulting from the instructions of a question), restart your kernel, clear the output, and re-run the notebook as normal. This shuts down your current connection to the database, which prevents multiple simultaneous connections, and re-running the notebook creates a fresh database from the provided Postgres dump.

## Logistics & Scoring Breakdown

- Each coding question has **both public tests and hidden tests**. Roughly 50% of your coding grade will be made up of your score on the public tests released to you, while the remaining 50% will be made up of unreleased hidden tests.
- Public tests for multiple choice questions are a sanity check only (e.g., that you are answering in the correct format). Partial credit will be awarded.
- Free-response questions will be manually graded. Please answer thoughtfully and concisely in complete sentences, drawing from knowledge in lectures and from your inspection of query plans.

This is an **individual project**. However, you’re welcome to collaborate with any other student in the class as long as it’s within the academic honesty guidelines.


| Question    | 0 | 1 | 2              | 3    | 4    | 5              | 6              | 7    | 8              | 9        | 10        |
| ----------- | - | - | -------------- | ---- | ---- | -------------- | -------------- | ---- | -------------- | -------- | --------- |
| No Subparts | 1 |   |                |      |      |                |                |      |                |          | 6         |
| a           |   | 1 | 1              | 1    | 1    | 2              | 2              | 2    | 1              | 1        |           |
| b           |   | 3 | 3              | 1    | 1    | 1              | 2              | 1    | 1              | 1        |           |
| c           |   |   | 1              | m: 2 | 1    | 1              | 1              | m: 3 | 1              | 2 (m: 2) |           |
| d           |   |   | 4 (m: 2, a: 2) | m: 2 | m: 3 | 4 (m: 2, a: 2) | 1              |      | 4 (m: 2, a: 2) |          |           |
| e           |   |   |                |      |      |                | 4 (m: 2, a: 2) |      |                |          |           |
| **Total**       | 1 | 4 | 9              | 6    | 6    | 8              | 10             | 6    | 7              | 4        | manual: 6 |


**Grand Total:** 67 points (manual: 26, autograded: 41)

In [469]:
# Run this cell to set up imports
import numpy as np
import pandas as pd

## Getting Connected
Similar to Project 1, we will be using the `JupySQL` library to connect this notebook to a PostgreSQL database server on your JupyterHub account. Run the following cell to initiate the connection.

In [470]:
%reload_ext sql
%sql postgresql://jovyan@127.0.0.1:5432/postgres

In [471]:
# See full display
%config SqlMagic.displaylimit = 50

## Setting up the Database
The following cell will create the `baseball` database (if needed), unzip the Postgres dump of the Lahman's Baseball Database, populate the `baseball` database with the desired tables and data, and finally display all databases associated with the Postgres instance. After running the cell, you should see the `baseball` database in the generated list of databases outputted by `%sql \l`.

**Note:** If you run into the **role does not exist**/**database does not exist** error the first time you run this cell, feel free to ignore it. It does not affect data import.

In [472]:
!unzip -u data/baseball.zip -d data/

Archive:  data/baseball.zip


In [473]:
!psql postgresql://jovyan@127.0.0.1:5432/baseball -c 'SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE datname = current_database()  AND pid <> pg_backend_pid();'
!psql -h localhost -c 'DROP DATABASE IF EXISTS baseball'
!psql -h localhost -c 'CREATE DATABASE baseball'
!psql -h localhost -d baseball -f data/baseball.sql
!psql -h localhost -c 'SET max_parallel_workers_per_gather = 0;'
%sql \l

 pg_terminate_backend 
----------------------
 t
(1 row)

DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 5219
COPY 104256
COPY 179
COPY 6236
COPY 425
COPY 6879
COPY 104324
COPY 13943
COPY 17350
COPY 138838
COPY 12028
COPY 31955
COPY 13110
COPY 4191
COPY 30

Name,Owner,Encoding,Collate,Ctype,Access privileges
baseball,jovyan,UTF8,en_US.utf8,en_US.utf8,
imdb,jovyan,UTF8,en_US.utf8,en_US.utf8,
imdb_lecture,jovyan,UTF8,en_US.utf8,en_US.utf8,
imdb_perf_lecture,jovyan,UTF8,en_US.utf8,en_US.utf8,
jovyan,jovyan,UTF8,en_US.utf8,en_US.utf8,
postgres,jovyan,UTF8,en_US.utf8,en_US.utf8,
stops_lecture,jovyan,UTF8,en_US.utf8,en_US.utf8,
template0,jovyan,UTF8,en_US.utf8,en_US.utf8,=c/jovyan jovyan=CTc/jovyan
template1,jovyan,UTF8,en_US.utf8,en_US.utf8,=c/jovyan jovyan=CTc/jovyan


Now, run the following cell to connect to the `baseball` database. There should be no errors after running the following cell.

In [474]:
%sql postgresql://jovyan@127.0.0.1:5432/baseball

To ensure that the connection to the database has been established, let's try grabbing the first 5 rows from the `halloffame` table.

In [475]:
%%sql
SELECT * FROM halloffame LIMIT 5

/srv/conda/envs/notebook/lib/python3.11/site-packages/sql/connection/connection.py:827: JupySQLRollbackPerformed: Server closed connection. JupySQL executed a ROLLBACK operation.


playerid,yearid,votedby,ballots,needed,votes,inducted,category,needed_note
cobbty01,1936,BBWAA,226,170,222,Y,Player,
ruthba01,1936,BBWAA,226,170,215,Y,Player,
wagneho01,1936,BBWAA,226,170,215,Y,Player,
mathech01,1936,BBWAA,226,170,205,Y,Player,
johnswa01,1936,BBWAA,226,170,189,Y,Player,


## Connect to the grader

Run the following cell for grading purposes.

In [476]:
# Just run the following cell, no further action is needed.
from data101_utils import GradingUtil
grading_util = GradingUtil("proj2")
grading_util.prepare_autograder()

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## Table Descriptions
In its entirety, Lahman's Baseball Database contains 27 tables covering a variety of statistics for players, teams, games, schools, etc. For simplicity, this project will focus on a subset of the tables:
* `appearances`: details on the positions each player appeared at
* `batting`: batting statistics for each player
* `collegeplaying`: list of players and the colleges they attended
* `halloffame`: Hall of Fame voting data
* `people`: player information (name, date of birth, and biographical info)
* `salaries`: player salary data
* `schools`: list of colleges that players attended

As a reminder from Project 1, `%sql \d <table_name>` is helpful for identifying the columns in a table.

<br><br>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 0: PostgreSQL Explain Analyze
**Please read through this section carefully, as the vast majority of the project will require you to inspect query plans by interpreting the output of the `explain analyze` command.**

To inspect the query plan for a given query, create a variable storing the query as a string and invoke a `psql` shell command to `explain analyze` the query: 

`your_query_str = "__REPLACE_ME_WITH_QUERY__"`

`!psql -h localhost -d baseball -c "explain analyze $your_query_str"`

Take a look at the following sample query plan.

![title](data/sample_query.png)

It is highly recommended to read through [this article](https://www.cybertec-postgresql.com/en/how-to-interpret-postgresql-explain-analyze-output/) and the PostgreSQL [documentation 14.1.2](https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE) to see how you can interpret the output above. Everything before "Tools to interpret Explain Analyze output" is useful.


<div class="alert alert-block alert-info">
Here are some key things to note for all question parts:
<ul>
<li>When we ask you to identify the <b>query cost</b>, we are looking for the <b>total cost</b>.</li>
    <ul>
    <li>There are two cost values: the first is the <b>startup cost</b> (cost to return the first row) and the second is the <b>total cost</b> (cost to return all rows).</li>
    <li>The unit for the estimated query cost is an arbitrary estimation of disk I/O (1 is the cost for reading an 8kB page during a sequential scan).</li>
        <li>Feel free to round the query cost / time to the nearest integer, but we'll accept anything more exact.</li>
    </ul>
<li>When we ask you to identify the <b>query time</b>, we are looking for the <b>execution time</b> (in ms).</li>
    <ul>
        <li>We recognize that the execution time may vary between different cell executions, so the autograder will tolerate a reasonable range.</li>
    </ul>
</ul>
</div>
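
The two values described above can also be read out of plan text programmatically. The following is a minimal sketch, not part of the project's grading utilities: the helper name `summarize_plan` is hypothetical, and the sample text is an abridged plan in the format PostgreSQL prints.

```python
import re

# Abridged EXPLAIN ANALYZE output, used only as sample input for this sketch.
SAMPLE_PLAN = """\
Hash Join  (cost=861.83..1193.88 rows=17350 width=0) (actual time=5.904..10.469 rows=17350 loops=1)
  Hash Cond: ((cp.playerid)::text = (p.playerid)::text)
  ->  Seq Scan on collegeplaying cp  (cost=0.00..286.50 rows=17350 width=38) (actual time=0.011..1.285 rows=17350 loops=1)
Planning Time: 0.128 ms
Execution Time: 10.981 ms
"""

def summarize_plan(plan_text):
    """Return (startup_cost, total_cost, execution_time_ms) for the root plan node."""
    # The first cost=STARTUP..TOTAL pair belongs to the root node, i.e. the whole query.
    cost = re.search(r"cost=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?)", plan_text)
    timing = re.search(r"Execution Time: (\d+(?:\.\d+)?) ms", plan_text)
    if cost is None or timing is None:
        raise ValueError("text does not look like EXPLAIN ANALYZE output")
    return float(cost.group(1)), float(cost.group(2)), float(timing.group(1))

startup_cost, total_cost, execution_ms = summarize_plan(SAMPLE_PLAN)
print(f"startup={startup_cost} total={total_cost} time={execution_ms} ms")
```

Here the query cost to report would be the second number of the root node's pair (`total_cost`), and the query time would be `execution_ms`.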

Now, inspect the query plan above by following the steps below:

1. Manually copy the entire query command (i.e., `SELECT ... `) from the screenshot into the cell below.

_Type your answer here, replacing this text._

In [477]:
%%sql --save query_0 result_0 <<
SELECT FROM people AS p INNER JOIN collegeplaying AS cp ON p.playerid = cp.playerid

In [478]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_0 = %sqlcmd snippets query_0
grading_util.save_results("result_0", query_0, result_0);
result_0.DataFrame().head(3)

0
1
2


In [479]:
%sql EXPLAIN ANALYZE {{query_0}}

QUERY PLAN
Hash Join (cost=861.83..1193.88 rows=17350 width=0) (actual time=5.904..10.469 rows=17350 loops=1)
Hash Cond: ((cp.playerid)::text = (p.playerid)::text)
-> Seq Scan on collegeplaying cp (cost=0.00..286.50 rows=17350 width=38) (actual time=0.011..1.285 rows=17350 loops=1)
-> Hash (cost=619.70..619.70 rows=19370 width=38) (actual time=5.777..5.778 rows=19370 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1049kB
-> Seq Scan on people p (cost=0.00..619.70 rows=19370 width=38) (actual time=0.006..2.524 rows=19370 loops=1)
Planning Time: 0.128 ms
Execution Time: 10.981 ms


In [480]:
sample_query_cost = 1193.88
sample_query_timing = 18.079

In [481]:
grader.check("q0")

<br/><br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 1: Queries and Views, Part 1

In Questions 1 and 2, you will compare and contrast writing queries with subqueries and views.

## Question 1a
Write a query that finds `namefirst`, `namelast`, `playerid` and `yearid` of all people who were successfully inducted into the Hall of Fame. **Note**: Your query should **NOT** use any sub-queries.

In [482]:
%%sql --save query_1a result_1a <<
select p.namefirst, p.namelast, p.playerid, h.yearid
from halloffame as h 
inner join people as p 
on h.playerid = p.playerid
and h.inducted = 'Y';

In [483]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_1a = %sqlcmd snippets query_1a
grading_util.save_results("result_1a", query_1a, result_1a)
result_1a.DataFrame().head(3)

Unnamed: 0,namefirst,namelast,playerid,yearid
0,Ty,Cobb,cobbty01,1936
1,Babe,Ruth,ruthba01,1936
2,Honus,Wagner,wagneho01,1936


In [484]:
grader.check("q1a")

<br><br>

---

## Question 1b
In this question, we will compare the query you wrote in `Question 1a` against the provided query below in `Question 1bi` by inspecting both query plans.

#### Question 1bi: 
Inspect the query plan for `provided_query` and the query you wrote in `Question 1a` by running the cells below.

In [485]:
%%sql --save provided_query result_provided <<
-- just run this cell
SELECT namefirst, namelast, p.playerid, yearid
FROM people AS p, (SELECT * FROM halloffame WHERE inducted = 'Y') AS hof 
WHERE p.playerid = hof.playerid;

In [486]:
# just run this cell 
provided_query = %sqlcmd snippets provided_query
%sql EXPLAIN ANALYZE {{provided_query.strip(';')}}

QUERY PLAN
Nested Loop (cost=0.29..262.79 rows=21 width=278) (actual time=0.030..1.816 rows=323 loops=1)
-> Seq Scan on halloffame (cost=0.00..96.39 rows=21 width=42) (actual time=0.012..0.537 rows=323 loops=1)
Filter: ((inducted)::text = 'Y'::text)
Rows Removed by Filter: 3868
-> Index Scan using master_pkey on people p (cost=0.29..7.92 rows=1 width=274) (actual time=0.004..0.004 rows=1 loops=323)
Index Cond: ((playerid)::text = (halloffame.playerid)::text)
Planning Time: 0.123 ms
Execution Time: 1.846 ms


In [487]:
# just run this cell
%sql EXPLAIN ANALYZE {{query_1a}}

QUERY PLAN
Nested Loop (cost=0.29..262.79 rows=21 width=278) (actual time=0.031..1.846 rows=323 loops=1)
-> Seq Scan on halloffame h (cost=0.00..96.39 rows=21 width=42) (actual time=0.012..0.500 rows=323 loops=1)
Filter: ((inducted)::text = 'Y'::text)
Rows Removed by Filter: 3868
-> Index Scan using master_pkey on people p (cost=0.29..7.92 rows=1 width=274) (actual time=0.004..0.004 rows=1 loops=323)
Index Cond: ((playerid)::text = (h.playerid)::text)
Planning Time: 0.113 ms
Execution Time: 1.878 ms


Record the **execution time** and **cost** for each query.
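
When comparing two plans, it can help to check whether they use the same operators in the same order before looking at the numbers. The sketch below is one hedged way to do that: the helper `plan_operators` is hypothetical, and the two plan strings are abridged copies of plan output (the `actual time` parts are omitted).

```python
import re

# Abridged plan text for a query that filters inside a subquery.
PLAN_PROVIDED = """\
Nested Loop  (cost=0.29..262.79 rows=21 width=278)
  ->  Seq Scan on halloffame  (cost=0.00..96.39 rows=21 width=42)
        Filter: ((inducted)::text = 'Y'::text)
  ->  Index Scan using master_pkey on people p  (cost=0.29..7.92 rows=1 width=274)
        Index Cond: ((playerid)::text = (halloffame.playerid)::text)
"""

# Abridged plan text for the equivalent query written as a plain join.
PLAN_YOURS = """\
Nested Loop  (cost=0.29..262.79 rows=21 width=278)
  ->  Seq Scan on halloffame h  (cost=0.00..96.39 rows=21 width=42)
        Filter: ((inducted)::text = 'Y'::text)
  ->  Index Scan using master_pkey on people p  (cost=0.29..7.92 rows=1 width=274)
        Index Cond: ((playerid)::text = (h.playerid)::text)
"""

def plan_operators(plan_text):
    """List the operator of each plan node, top to bottom, ignoring tables and aliases."""
    operators = []
    for line in plan_text.splitlines():
        if "(cost=" not in line:
            continue  # detail lines such as Filter or Index Cond
        head = line.split("(cost=", 1)[0]
        head = head.replace("->", "").strip()
        # Drop the relation or index name: "Seq Scan on t" becomes "Seq Scan".
        operators.append(re.split(r" (?:on|using) ", head, maxsplit=1)[0])
    return operators

same_shape = plan_operators(PLAN_PROVIDED) == plan_operators(PLAN_YOURS)
print(plan_operators(PLAN_PROVIDED), same_shape)
```

If the operator lists match and the root costs match, any remaining difference in execution time between runs is more likely measurement noise than a difference in how the queries were executed.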

In [488]:
provided_query_cost = 959.06
provided_query_timing = 14.929
your_query_cost = 959.06
your_query_timing = 14.776

In [489]:
grader.check("q1bi")


#### Question 1bii:
Given your findings from inspecting the query plans of the two queries, answer the following question. Assign the variable `q1b_part2` to a list of all of the below statements that are true.


Consider the following statements:
<br>
A. Both the queries have the same cost
<br>
B. The provided query has a faster execution time because it makes use of a subquery.
<br>
C. The query you wrote has a faster execution time because it does not make use of a subquery.
<br>
D. The provided query has less cost because it makes use of a subquery.
<br>
E. The query you wrote has less cost because it does not make use of a subquery.
<br>
F. The queries have the same output.
<br>
G. The queries do not have the same output.
    
**Note:** Your answer should look like `q1b_part2 = ['A', 'B']`

In [490]:
q1b_part2 = ['A','C','F']

In [491]:
grader.check("q1bii")

<br/><br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


# Question 2: Queries and Views, Part 2

In this question, you will continue analyzing queries with/without views and materialized views.
* Question 2a: Write a query that finds Hall of Fame inductees who played in college in California.
* Question 2 Tutorial: Use this query to create a view called `inducted_hof_ca` and a materialized view, `inducted_hof_ca_mat`.
* Question 2b: Write three queries that achieve the same result:
  * Question 2bi: One that uses the `inducted_hof_ca` view.
  * Question 2bii: One that uses the `inducted_hof_ca_mat` materialized view.
  * Question 2biii: One that uses no views.
* Question 2c: Record the performance of these three queries.
* Question 2d: Analyze and discuss using queries with different types of views.

<br/><br/>

---

## Question 2a

Write a query that returns the people who were successfully inducted into the Hall of Fame and played in college at a school located in California. For each player, return their `namefirst`, `namelast`, `playerid`, `schoolid`, and `yearid` ordered by the `yearid` and then the `playerid`. 

**Note**: For this query, `yearid` refers to player's year of induction into the Hall of Fame.

In [492]:
%%sql --save query_2a result_2a <<
select p.namefirst, p.namelast, p.playerid, cp.schoolid, h.yearid
from collegeplaying as cp
inner join people as p
on cp.playerid = p.playerid
inner join halloffame as h
on p.playerid = h.playerid
inner join schools as s
on s.schoolid = cp.schoolid
where h.inducted = 'Y'
and s.schoolstate = 'CA'
order by h.yearid, p.playerid
;

In [493]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_2a = %sqlcmd snippets query_2a
grading_util.save_results("result_2a", query_2a, result_2a)
result_2a

namefirst,namelast,playerid,schoolid,yearid
Jackie,Robinson,robinja02,ucla,1962
Harry,Hooper,hoopeha01,stmarysca,1971
Joe,Morgan,morgajo02,camerri,1990
Tom,Seaver,seaveto01,cafrecc,1992
Tom,Seaver,seaveto01,usc,1992
Ozzie,Smith,smithoz01,calpoly,2002
Ozzie,Smith,smithoz01,calpoly,2002
Ozzie,Smith,smithoz01,calpoly,2002
Ozzie,Smith,smithoz01,calpoly,2002
Tony,Gwynn,gwynnto01,sandiegost,2007


In [494]:
grader.check("q2a")

<br/><br/>

---

## Question 2 Tutorial

We are now going to use the query you wrote in the previous part to generate a view, called `inducted_hof_ca`, and a materialized view, `inducted_hof_ca_mat`.

Run the cells below. You do not need to do anything more for this part.

(Note: the semicolon strip is to avoid executing an empty query with double-semicolons, which causes an error.)
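
To make the difference between the two kinds of view concrete outside of PostgreSQL, here is a small sketch using Python's built-in `sqlite3` module. SQLite has no `CREATE MATERIALIZED VIEW`, so a table built with `CREATE TABLE ... AS SELECT` stands in for the materialized view (an assumption of this sketch), and the table and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inductees (playerid TEXT, inducted TEXT);
    INSERT INTO inductees VALUES ('a01', 'Y'), ('b01', 'N');

    -- A view stores only the query text; it is re-evaluated on every read.
    CREATE VIEW inducted_view AS
        SELECT playerid FROM inductees WHERE inducted = 'Y';

    -- Stand-in for a materialized view: the query result is stored as rows.
    CREATE TABLE inducted_snapshot AS
        SELECT playerid FROM inductees WHERE inducted = 'Y';
""")

def count_rows(relation):
    # relation is one of the fixed names above, never user input.
    return conn.execute(f"SELECT COUNT(*) FROM {relation}").fetchone()[0]

# Change the base table after both objects exist.
conn.execute("INSERT INTO inductees VALUES ('c01', 'Y')")
view_count = count_rows("inducted_view")          # re-runs the query, sees the new row
snapshot_count = count_rows("inducted_snapshot")  # still holds the old stored result

# "Refreshing" the stand-in means recomputing and storing the result again.
conn.executescript("""
    DELETE FROM inducted_snapshot;
    INSERT INTO inducted_snapshot
        SELECT playerid FROM inductees WHERE inducted = 'Y';
""")
refreshed_count = count_rows("inducted_snapshot")
print(view_count, snapshot_count, refreshed_count)
```

In PostgreSQL the corresponding refresh step is `REFRESH MATERIALIZED VIEW inducted_hof_ca_mat;`. Reads from the stored result are cheap, but the result can go stale until it is refreshed.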

In [495]:
%%sql
/* just run this cell */
DROP VIEW IF EXISTS inducted_hof_ca;
CREATE VIEW inducted_hof_ca AS {{query_2a.strip(';')}};
SELECT * FROM inducted_hof_ca;

namefirst,namelast,playerid,schoolid,yearid
Jackie,Robinson,robinja02,ucla,1962
Harry,Hooper,hoopeha01,stmarysca,1971
Joe,Morgan,morgajo02,camerri,1990
Tom,Seaver,seaveto01,usc,1992
Tom,Seaver,seaveto01,cafrecc,1992
Ozzie,Smith,smithoz01,calpoly,2002
Ozzie,Smith,smithoz01,calpoly,2002
Ozzie,Smith,smithoz01,calpoly,2002
Ozzie,Smith,smithoz01,calpoly,2002
Tony,Gwynn,gwynnto01,sandiegost,2007


In [496]:
%%sql
/* just run this cell */
DROP MATERIALIZED VIEW IF EXISTS inducted_hof_ca_mat;
CREATE MATERIALIZED VIEW inducted_hof_ca_mat AS {{query_2a.strip(';')}};
SELECT * FROM inducted_hof_ca_mat;

namefirst,namelast,playerid,schoolid,yearid
Jackie,Robinson,robinja02,ucla,1962
Harry,Hooper,hoopeha01,stmarysca,1971
Joe,Morgan,morgajo02,camerri,1990
Tom,Seaver,seaveto01,usc,1992
Tom,Seaver,seaveto01,cafrecc,1992
Ozzie,Smith,smithoz01,calpoly,2002
Ozzie,Smith,smithoz01,calpoly,2002
Ozzie,Smith,smithoz01,calpoly,2002
Ozzie,Smith,smithoz01,calpoly,2002
Tony,Gwynn,gwynnto01,sandiegost,2007


<br/><br/>

---
### Question 2b

For this question, we want to compute the count of players who were inducted into the Hall of Fame and played baseball at a college in California for each `schoolid` and `yearid` combination ordered by ascending `yearid`.

You should write three queries that accomplish this task, but with different strategies:
* Question 2bi: Use the `inducted_hof_ca` view;
* Question 2bii: Use the `inducted_hof_ca_mat` materialized view; and
* Question 2biii: Do not use the `inducted_hof_ca` view, the `inducted_hof_ca_mat` materialized view, any common table expressions (CTEs), or any subqueries.

#### Question 2bi:

Write a query to accomplish the task above using the `inducted_hof_ca` view. Assign your result to `result_2b_view`.

In [497]:
%%sql --save query_2b_view result_2b_view <<
select schoolid, yearid, count(*)
from inducted_hof_ca
group by schoolid, yearid
order by yearid
;

In [498]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_2b_view = %sqlcmd snippets query_2b_view
grading_util.save_results("result_2b_view", query_2b_view, result_2b_view)
result_2b_view

schoolid,yearid,count
ucla,1962,1
stmarysca,1971,1
camerri,1990,1
cafrecc,1992,1
usc,1992,1
calpoly,2002,4
sandiegost,2007,3
capasad,2008,1
sandiegost,2010,2
calavco,2011,1


In [499]:
grader.check("q2bi")

<br/><br/>

#### Question 2bii:

Now, write the query a second time to use the materialized view `inducted_hof_ca_mat`. Assign your result to `result_2b_mat`.

In [500]:
%%sql --save query_2b_mat result_2b_mat <<
select schoolid, yearid, count(*)
from inducted_hof_ca_mat
group by schoolid, yearid
order by yearid
;

In [501]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_2b_mat = %sqlcmd snippets query_2b_mat
grading_util.save_results("result_2b_mat", query_2b_mat, result_2b_mat)
result_2b_mat

schoolid,yearid,count
ucla,1962,1
stmarysca,1971,1
camerri,1990,1
cafrecc,1992,1
usc,1992,1
calpoly,2002,4
sandiegost,2007,3
capasad,2008,1
sandiegost,2010,2
calavco,2011,1


In [502]:
grader.check("q2bii")

<br/><br/>

#### Question 2biii:

Finally, write the query a third time to **not** use the `inducted_hof_ca` view, nor the `inducted_hof_ca_mat` materialized view, nor any common table expressions (CTEs), nor any subqueries. Save your result in `result_2b_no_view`.

In [503]:
%%sql --save query_2b_no_view result_2b_no_view <<
select cp.schoolid, h.yearid, count(*)
from collegeplaying as cp
inner join people as p
on cp.playerid = p.playerid
inner join halloffame as h
on p.playerid = h.playerid
inner join schools as s
on s.schoolid = cp.schoolid
where h.inducted = 'Y'
and s.schoolstate = 'CA'
group by cp.schoolid, h.yearid
order by h.yearid

In [504]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_2b_no_view = %sqlcmd snippets query_2b_no_view
grading_util.save_results("result_2b_no_view", query_2b_no_view, result_2b_no_view)
result_2b_no_view

schoolid,yearid,count
ucla,1962,1
stmarysca,1971,1
camerri,1990,1
cafrecc,1992,1
usc,1992,1
calpoly,2002,4
sandiegost,2007,3
capasad,2008,1
sandiegost,2010,2
calavco,2011,1


In [505]:
grader.check("q2biii")

<br/><br/>

---

### Question 2c
Inspect the query plans for the three queries you wrote above by running the following cells.

In [506]:
# just run this cell
%sql EXPLAIN ANALYZE {{query_2b_view}}

QUERY PLAN
GroupAggregate (cost=523.47..525.39 rows=96 width=20) (actual time=4.294..4.304 rows=13 loops=1)
"Group Key: inducted_hof_ca.yearid, inducted_hof_ca.schoolid"
-> Sort (cost=523.47..523.71 rows=96 width=12) (actual time=4.288..4.292 rows=23 loops=1)
"Sort Key: inducted_hof_ca.yearid, inducted_hof_ca.schoolid"
Sort Method: quicksort Memory: 26kB
-> Subquery Scan on inducted_hof_ca (cost=519.11..520.31 rows=96 width=12) (actual time=4.274..4.280 rows=23 loops=1)
-> Sort (cost=519.11..519.35 rows=96 width=257) (actual time=4.273..4.276 rows=23 loops=1)
"Sort Key: h.yearid, p.playerid"
Sort Method: quicksort Memory: 26kB
-> Nested Loop (cost=386.71..515.95 rows=96 width=257) (actual time=3.933..4.264 rows=23 loops=1)


In [507]:
# just run this cell
%sql EXPLAIN ANALYZE {{query_2b_mat}}

QUERY PLAN
Sort (cost=23.67..24.17 rows=200 width=60) (actual time=0.032..0.034 rows=13 loops=1)
Sort Key: yearid
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=14.03..16.02 rows=200 width=60) (actual time=0.022..0.025 rows=13 loops=1)
"Group Key: yearid, schoolid"
Batches: 1 Memory Usage: 40kB
-> Seq Scan on inducted_hof_ca_mat (cost=0.00..12.30 rows=230 width=52) (actual time=0.005..0.007 rows=23 loops=1)
Planning Time: 0.076 ms
Execution Time: 0.062 ms


In [508]:
# just run this cell
%sql EXPLAIN ANALYZE {{query_2b_no_view}}

QUERY PLAN
GroupAggregate (cost=519.11..521.03 rows=96 width=20) (actual time=5.002..5.014 rows=13 loops=1)
"Group Key: h.yearid, cp.schoolid"
-> Sort (cost=519.11..519.35 rows=96 width=12) (actual time=4.994..4.999 rows=23 loops=1)
"Sort Key: h.yearid, cp.schoolid"
Sort Method: quicksort Memory: 26kB
-> Nested Loop (cost=386.71..515.95 rows=96 width=12) (actual time=4.372..4.979 rows=23 loops=1)
-> Hash Join (cost=386.42..484.98 rows=96 width=30) (actual time=4.348..4.839 rows=23 loops=1)
Hash Cond: ((h.playerid)::text = (cp.playerid)::text)
-> Seq Scan on halloffame h (cost=0.00..96.39 rows=323 width=13) (actual time=0.011..0.758 rows=323 loops=1)
Filter: ((inducted)::text = 'Y'::text)


Then, record the execution time and cost for each query.

In [509]:
with_view_cost = 525.39
with_view_timing = 9.158
with_materialized_view_cost = 24.17
with_materialized_view_timing = 0.087
without_view_cost = 521.03
without_view_timing = 8.062

In [510]:
grader.check("q2c")

<br/><br/>

---

## Question 2d

Given your findings from inspecting the query plans in this Question, as well as your understanding of views and materialized views from lectures, discuss the takeaways of using views and materialized views.

### Question 2di

Assign the variable `q2di` to a list of all of the below statements that are true.

A. Views will reduce the execution time and the cost of a query.<br/>
B. Views will reduce the execution time of a query, but not the cost.<br/>
C. Views will reduce the cost of a query, but not the execution time.<br/>
D. Materialized views reduce the execution time and the cost of a query.<br/>
E. Materialized views reduce the execution time, but not the cost of a query.<br/>
F. Materialized views reduce the cost of a query, but not the execution time.<br/>
G. Materialized views will result in the same query plan as a query using views.<br/>
H. Materialized views and views take the same time to create.<br/>
I. Materialized views take less time to create than a view.<br/>
J. Materialized views take more time to create than a view.<br/>
    
*Note:* Your answer should look like `q2di = ['A', 'B']`

In [511]:
q2di = ['D', 'J']

In [512]:
grader.check("q2di")

<!-- BEGIN QUESTION -->

#### Question 2dii:

Explain your answer to the previous part (Question 2di) based on your knowledge from lectures and details from the query plans. Your explanation should also include why you didn't choose certain options. Please answer in maximum 5 sentences.

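_Based on the EXPLAIN ANALYZE output, the query on the materialized view has a significantly lower cost (24.17 vs. 525.39) and execution time, so I chose D and ruled out E and F. This is because a materialized view stores its result on disk as a table, so the query only scans the stored rows instead of re-running the joins. Because that result must be computed and written when the materialized view is created, creation takes longer than for a regular view, which only stores the query definition, so I chose J and ruled out H and I. I did not choose A, B, or C because a regular view is expanded into its underlying query, so its cost (525.39) was no lower than that of the no-view query (521.03), and G is false because the two plans differ._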

<!-- END QUESTION -->

<br><br>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 3: Predicate Pushdown
In this question, we will explore the impact of predicates (i.e., filters) on a query's execution, particularly inspecting when the optimizer applies predicates.

* Question 3a: Compute a query with all rows.
* Question 3b: Add a simple filter.
* Question 3c: Analyze the tradeoffs to cost.
* Question 3d: Analyze the tradeoffs to execution time.
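
As a rough intuition for why the position of a filter matters, the sketch below joins two toy lists in plain Python and counts how many row pairs are compared when the filter runs after the join versus before it. The data and the `join` helper are made up for illustration, and a real optimizer makes this choice using cost estimates rather than by counting comparisons.

```python
# Toy data: (playerid, yearid) inductions and (playerid, schoolid) college rows.
inductions = [("p1", 1962), ("p2", 1990), ("p3", 2011), ("p4", 2014)]
colleges = [("p1", "ucla"), ("p2", "camerri"), ("p3", "calavco"), ("p4", "usc")]

def join(left, right, counter):
    """Nested-loop join on playerid; counter[0] tracks row pairs compared."""
    out = []
    for pid, year in left:
        for cpid, school in right:
            counter[0] += 1
            if pid == cpid:
                out.append((pid, school, year))
    return out

# Filter applied after the join: every induction row takes part in the join.
late_counter = [0]
late = [row for row in join(inductions, colleges, late_counter) if row[2] > 2010]

# Filter pushed below the join: only qualifying rows reach the join.
early_counter = [0]
recent = [row for row in inductions if row[1] > 2010]
early = join(recent, colleges, early_counter)

print(late == early, late_counter[0], early_counter[0])
```

Both strategies return the same rows, but the pushed-down version compares half as many row pairs on this toy input.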


## Question 3a:
Recall the `inducted_hof_ca` view created in `Question 2`. Inspect the query plan for a query that gets all rows from the view, and record the execution time and cost using an `EXPLAIN ANALYZE` command.

In [513]:
q3a = "select * from inducted_hof_ca"
!psql -h localhost -d baseball -c "explain analyze $q3a"

ERROR:  relation "inducted_hof_ca" does not exist
LINE 1: explain analyze select * from inducted_hof_ca
                                      ^


In [514]:
query_view_cost = 528.77
query_view_timing = 8.741

In [515]:
grader.check("q3a")

<br><br>

---

## Question 3b:
Now, add a filter to only return rows from `inducted_hof_ca` where the year is later than 2010. Inspect the query plan and record the execution time and cost.

In [516]:
q3b = "select * from inducted_hof_ca where yearid > 2010"
!psql -h localhost -d baseball -c "explain analyze $q3b"

ERROR:  relation "inducted_hof_ca" does not exist
LINE 1: explain analyze select * from inducted_hof_ca where yearid >...
                                      ^


In [517]:
query_view_with_filter_cost = 209.86
query_view_with_filter_timing = 1.491

In [518]:
grader.check("q3b")

<!-- BEGIN QUESTION -->

## Question 3c:
Given your findings from inspecting the query plans of queries from Questions 3a and 3b, fill in the blank and **justify your answer**. Explain your answer based on your knowledge from lectures, and details from the query plans (your explanation should include why you didn't choose other options). Your response should be no longer than 3 sentences.

**Note:** Your answer should be formatted as follows: `A because ...`

**Adding a filter ___ the cost.**
<br>
A. increased
<br>
B. decreased
<br>
C. did not change

_B because predicate pushdown through filtering simplifies our query plan. Because of filtering, we have fewer rows we need to manipulate and therefore smaller cost._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br><br>

---

## Question 3d:
Given your findings from inspecting the query plans of queries from Questions 3a and 3b, fill in the blank and **justify your answer**. Explain your answer based on your knowledge from lectures, and details from the query plans (your explanation should include why you didn't choose other options). Your response should be no longer than 3 sentences.

**Note:** Your answer should be formatted as follows: `A because ...`

**Adding a filter ___ the execution time.**
<br>
A. increased
<br>
B. decreased
<br>
C. did not change

_B because the filter means fewer rows have to be processed. Working with fewer rows executes faster than working with more rows, so adding the filter decreased the execution time._

<!-- END QUESTION -->

<br><br>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 4: Join Approaches

In this question, we'll explore different join approaches (Nested Loop Join, Merge Join, Hash Join) and discuss how the query optimizer picks the best approach.
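
For reference, the three approaches can be sketched in a few lines of plain Python over lists of `(key, value)` tuples. This is a simplified illustration under toy assumptions (in-memory lists, made-up ids and names), not how PostgreSQL implements its join operators.

```python
from collections import defaultdict

# Toy inputs: (playerid, name) and (playerid, schoolid); ids are made up.
people = [("p2", "Bo"), ("p1", "Al"), ("p3", "Cy")]
college = [("p1", "ucla"), ("p1", "usc"), ("p3", "rice")]

def nested_loop_join(left, right):
    # Compare every left row with every right row.
    return [(k, a, b) for k, a in left for rk, b in right if k == rk]

def hash_join(left, right):
    # Build a hash table on one input, then probe it with the other.
    table = defaultdict(list)
    for k, a in left:
        table[k].append(a)
    return [(rk, a, b) for rk, b in right for a in table.get(rk, [])]

def merge_join(left, right):
    # Both inputs must be sorted on the join key; then advance two cursors.
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Find the run of rows sharing this key on each side.
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            for _, a in left[i:i_end]:
                for _, b in right[j:j_end]:
                    out.append((key, a, b))
            i, j = i_end, j_end
    return out

results = [sorted(f(people, college)) for f in (nested_loop_join, hash_join, merge_join)]
print(results[0] == results[1] == results[2])
```

All three functions return the same rows. They differ in how much work they do and in whether the output comes back ordered by the join key, which is the kind of tradeoff the optimizer weighs.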

<br/><br/>

---

## Question 4a
Perform an inner join on the `people` and `collegeplaying` tables on the `playerid` column. Project all columns.

In [519]:
%%sql --save query_4a result_4a <<
select * 
from people as p
inner join collegeplaying as cp
on p.playerid = cp.playerid

In [520]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_4a = %sqlcmd snippets query_4a
grading_util.save_results("result_4a", query_4a, result_4a);

display(result_4a.DataFrame().head(3))
%sql EXPLAIN ANALYZE {{query_4a}} 

Unnamed: 0,playerid,birthyear,birthmonth,birthday,birthcountry,birthstate,birthcity,deathyear,deathmonth,deathday,...,height,bats,throws,debut,finalgame,retroid,bbrefid,playerid.1,schoolid,yearid
0,aardsda01,1981,12.0,27.0,USA,CO,Denver,,,,...,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,aardsda01,pennst,2001
1,aardsda01,1981,12.0,27.0,USA,CO,Denver,,,,...,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,aardsda01,rice,2002
2,aardsda01,1981,12.0,27.0,USA,CO,Denver,,,,...,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,aardsda01,rice,2003


QUERY PLAN
Hash Join (cost=861.83..1193.88 rows=17350 width=167) (actual time=6.465..14.424 rows=17350 loops=1)
Hash Cond: ((cp.playerid)::text = (p.playerid)::text)
-> Seq Scan on collegeplaying cp (cost=0.00..286.50 rows=17350 width=21) (actual time=0.017..1.034 rows=17350 loops=1)
-> Hash (cost=619.70..619.70 rows=19370 width=146) (actual time=6.398..6.400 rows=19370 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 3633kB
-> Seq Scan on people p (cost=0.00..619.70 rows=19370 width=146) (actual time=0.003..1.461 rows=19370 loops=1)
Planning Time: 0.246 ms
Execution Time: 15.157 ms


Run the cell above to inspect the query plan for your command.

**Which join approach did the query optimizer choose?** 

A. Nested Loop Join<br/>
B. Merge Join<br/>
C. Hash Join<br/>
D. None of the Above

Assign the variable `q4a` to the correct letter choice above, e.g., `q4a = 'A'`.

In [521]:
q4a = 'C'

In [522]:
grader.check("q4a")

<br><br>

---

## Question 4b

Similar to Question 4a, perform an inner join on the `people` and `collegeplaying` tables on the `playerid` column. Project all columns.

In addition, **sort your output by `playerid`.**

In [523]:
%%sql --save query_4b result_4b <<
select *
from people as p
inner join collegeplaying as cp
on p.playerid = cp.playerid
order by p.playerid

In [524]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_4b = %sqlcmd snippets query_4b
grading_util.save_results("result_4b", query_4b, result_4b);

display(result_4b.DataFrame().head(3))
%sql EXPLAIN ANALYZE {{query_4b}} 

Unnamed: 0,playerid,birthyear,birthmonth,birthday,birthcountry,birthstate,birthcity,deathyear,deathmonth,deathday,...,height,bats,throws,debut,finalgame,retroid,bbrefid,playerid.1,schoolid,yearid
0,aardsda01,1981,12.0,27.0,USA,CO,Denver,,,,...,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,aardsda01,pennst,2001
1,aardsda01,1981,12.0,27.0,USA,CO,Denver,,,,...,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,aardsda01,rice,2002
2,aardsda01,1981,12.0,27.0,USA,CO,Denver,,,,...,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,aardsda01,rice,2003


QUERY PLAN
Merge Join (cost=0.57..1910.20 rows=17350 width=167) (actual time=0.019..14.050 rows=17350 loops=1)
Merge Cond: ((p.playerid)::text = (cp.playerid)::text)
-> Index Scan using master_pkey on people p (cost=0.29..1024.36 rows=19370 width=146) (actual time=0.007..2.968 rows=19368 loops=1)
-> Index Only Scan using collegeplaying_pkey on collegeplaying cp (cost=0.29..620.54 rows=17350 width=21) (actual time=0.007..1.758 rows=17350 loops=1)
Heap Fetches: 0
Planning Time: 0.268 ms
Execution Time: 14.567 ms


Run the cell above to inspect the query plan for your command.

**Which join approach did the query optimizer choose?** 

A. Nested Loop Join<br/>
B. Merge Join<br/>
C. Hash Join<br/>
D. None of the Above

Assign the variable `q4b` to the correct letter choice above, e.g., `q4b = 'A'`.

In [525]:
q4b = 'B'

In [526]:
grader.check("q4b")

<br><br>

---
## Question 4c
Write a query to retrieve all possible player pair combinations. Project all columns, but **limit to 1000 rows** to ensure your query doesn't take an exorbitant amount of time to run.

**Hint:** You can do this by performing an inner join of the `people` table on itself with an inequality condition.

In [527]:
%%sql --save query_4c result_4c <<
select *
from people as p1
inner join people as p2
on p1.playerid != p2.playerid
limit 1000;

In [528]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_4c = %sqlcmd snippets query_4c
grading_util.save_results("result_4c", query_4c, result_4c);

display(result_4c.DataFrame().head(3))
%sql EXPLAIN ANALYZE {{query_4c}} 

Unnamed: 0,playerid,birthyear,birthmonth,birthday,birthcountry,birthstate,birthcity,deathyear,deathmonth,deathday,...,namelast,namegiven,weight,height,bats,throws,debut,finalgame,retroid,bbrefid
0,aardsda01,1981,12,27,USA,CO,Denver,,,,...,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
1,aardsda01,1981,12,27,USA,CO,Denver,,,,...,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
2,aardsda01,1981,12,27,USA,CO,Denver,,,,...,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01


QUERY PLAN
Limit (cost=0.00..15.00 rows=1000 width=292) (actual time=0.015..0.663 rows=1000 loops=1)
-> Nested Loop (cost=0.00..5629241.33 rows=375177530 width=292) (actual time=0.015..0.608 rows=1000 loops=1)
Join Filter: ((p1.playerid)::text <> (p2.playerid)::text)
Rows Removed by Join Filter: 1
-> Seq Scan on people p1 (cost=0.00..619.70 rows=19370 width=146) (actual time=0.006..0.007 rows=1 loops=1)
-> Materialize (cost=0.00..716.55 rows=19370 width=146) (actual time=0.003..0.228 rows=1001 loops=1)
-> Seq Scan on people p2 (cost=0.00..619.70 rows=19370 width=146) (actual time=0.001..0.089 rows=1001 loops=1)
Planning Time: 0.122 ms
Execution Time: 0.729 ms


Run the cell above to inspect the query plan for your command.

**Which join approach did the query optimizer choose?** 

A. Nested Loop Join<br/>
B. Merge Join<br/>
C. Hash Join<br/>
D. None of the Above

Assign the variable `q4c` to the correct letter choice above, e.g., `q4c = 'A'`.

In [529]:
q4c = 'A'

In [530]:
grader.check("q4c")
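
The plan above reads only 1 row from the outer scan and 1001 rows from the inner scan, even though the full join would produce hundreds of millions of rows. A minimal sketch of why, assuming the join produces rows lazily and the `LIMIT` simply stops asking for more (the helper names and toy ids here are hypothetical, not part of the project):

```python
from itertools import islice

def counting_scan(rows, counter, name):
    """Yield rows one at a time and count how many were actually read."""
    for row in rows:
        counter[name] += 1
        yield row

def nested_loop(outer_rows, inner_rows, counter):
    """Lazily yield every (outer, inner) pair whose ids differ."""
    for o in counting_scan(outer_rows, counter, "outer"):
        for i in counting_scan(inner_rows, counter, "inner"):
            if o != i:
                yield (o, i)

# Toy ids; 19370 matches the row count of `people` in the plan.
player_ids = [f"player{n:05d}" for n in range(19370)]
reads = {"outer": 0, "inner": 0}

# islice plays the role of LIMIT 1000: it stops pulling rows after 1000 pairs.
first_1000 = list(islice(nested_loop(player_ids, player_ids, reads), 1000))
print(reads)  # {'outer': 1, 'inner': 1001}
```

The one extra inner row is the pair where both ids are equal, which the inequality filter discards, mirroring `Rows Removed by Join Filter: 1`.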

<!-- BEGIN QUESTION -->

<br><br>

---
## Question 4d

Given your findings above, why did the query optimizer ultimately choose the specific join approach you found in each of the above three scenarios in Questions 4a, 4b, and 4c? Feel free to discuss the pros and cons of each join approach as well.

If you feel stuck, here are some things to consider: Does a non-equijoin constrain us to certain join approaches? What added benefit does merge join provide with regard to its output?

**Note:** Your answer should be formatted as follows: `Q4a: A because ... Q4b: A because ...` You should write no more than 5 sentences.

_Q4a: C because the join is an equijoin with no required output order, so the optimizer builds a hash table on `people` and probes it while scanning `collegeplaying` once, with no sorting needed. Q4b: B because the output must be sorted by `playerid`, and a merge join over the two primary-key index scans reads both inputs in key order and emits rows already sorted, which avoids a separate sort step. Q4c: A because the join condition is an inequality, and hash join and merge join both rely on equality conditions, so a nested loop join is the only approach that can evaluate it._

<!-- END QUESTION -->

<br/><br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 5: Indexes, Part 1

In Questions 5, 6, and 7, you will analyze how indexes impact query performance.

Question 5:
* Question 5a: Write a query.
* Question 5b: Add an index with a particular index key and reanalyze the previous query's performance.
* Question 5c: Add an index with a different key and reanalyze the previous query's performance.

<br/>

---

## Question 5a
Write a query that outputs the `playerid` and average `salary` for each player that batted in exactly 10 games (the number of games in which a player batted can be found in the `g_batting` column of the `appearances` table). Your query should join the `salaries` and `appearances` tables on all the common columns `yearid`, `teamid`, and `playerid`, so feel free to use a natural join.

In [531]:
%%sql --save query_5a result_5a <<
select s.playerid, AVG(s.salary)
from salaries as s
natural join appearances as a
where a.g_batting = 10
group by s.playerid

In [532]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_5a = %sqlcmd snippets query_5a
grading_util.save_results("result_5a", query_5a, result_5a);

display(result_5a.DataFrame().head(3))
%sql EXPLAIN ANALYZE {{query_5a}} 

Unnamed: 0,playerid,avg
0,anderla02,240000.0
1,ashbyan01,109000.0
2,ayraubo01,125000.0


QUERY PLAN
GroupAggregate (cost=3636.80..3636.83 rows=1 width=17) (actual time=15.902..15.957 rows=134 loops=1)
Group Key: s.playerid
-> Sort (cost=3636.80..3636.81 rows=1 width=17) (actual time=15.893..15.900 rows=138 loops=1)
Sort Key: s.playerid
Sort Method: quicksort Memory: 35kB
-> Hash Join (cost=2900.02..3636.79 rows=1 width=17) (actual time=9.590..15.786 rows=138 loops=1)
Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.lgid)::text = (a.lgid)::text) AND ((s.playerid)::text = (a.playerid)::text))
-> Seq Scan on salaries s (cost=0.00..459.28 rows=26428 width=28) (actual time=0.008..1.525 rows=26428 loops=1)
-> Hash (cost=2873.20..2873.20 rows=1341 width=20) (actual time=9.430..9.431 rows=1347 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 86kB


Inspect the query plan above and record the execution time and cost.
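
If you prefer not to copy the numbers by hand, a small helper can pull them out of the plan text. This is only a sketch: `total_cost` and `execution_time_ms` are hypothetical names, and the input is assumed to be the plan as a list of strings, one per output line. The sample lines are taken from the Question 4a plan.

```python
import re

def total_cost(plan_lines):
    """Return the upper cost bound of the first (topmost) plan node."""
    match = re.search(r"cost=\d+\.\d+\.\.(\d+\.\d+)", plan_lines[0])
    return float(match.group(1))

def execution_time_ms(plan_lines):
    """Return the reported execution time, or None if that line is missing."""
    for line in plan_lines:
        match = re.match(r"\s*Execution Time: (\d+\.\d+) ms", line)
        if match:
            return float(match.group(1))
    return None

plan = [
    "Hash Join (cost=861.83..1193.88 rows=17350 width=167) (actual time=6.465..14.424 rows=17350 loops=1)",
    "Planning Time: 0.246 ms",
    "Execution Time: 15.157 ms",
]
print(total_cost(plan), execution_time_ms(plan))  # 1193.88 15.157
```

When the displayed plan is cut off before the `Execution Time` line, `execution_time_ms` returns `None` rather than guessing.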

In [533]:
result_5a_cost = 3636.43
result_5a_timing = 28.335

In [534]:
grader.check("q5a")

<br><br>

---
## Question 5b

Add an index with name `g_batting_idx` on the `g_batting` column of the `appearances` table.

In [535]:
%%sql
DROP INDEX IF EXISTS g_batting_idx;
create index g_batting_idx on appearances(g_batting)

Now, re-inspect the query plan of the query from `Question 5a` and record its execution time and cost.

In [536]:
# just run this cell
%sql EXPLAIN ANALYZE {{query_5a}} 

QUERY PLAN
GroupAggregate (cost=2377.01..2377.03 rows=1 width=17) (actual time=8.222..8.279 rows=134 loops=1)
Group Key: s.playerid
-> Sort (cost=2377.01..2377.01 rows=1 width=17) (actual time=8.212..8.221 rows=138 loops=1)
Sort Key: s.playerid
Sort Method: quicksort Memory: 35kB
-> Hash Join (cost=1640.23..2377.00 rows=1 width=17) (actual time=1.841..8.104 rows=138 loops=1)
Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.lgid)::text = (a.lgid)::text) AND ((s.playerid)::text = (a.playerid)::text))
-> Seq Scan on salaries s (cost=0.00..459.28 rows=26428 width=28) (actual time=0.005..1.520 rows=26428 loops=1)
-> Hash (cost=1613.41..1613.41 rows=1341 width=20) (actual time=1.707..1.709 rows=1347 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 86kB


In [537]:
result_5b_cost = 2370.95
result_5b_timing = 8.301

In [538]:
grader.check("q5b")

In the following question, we will explore adding a different index and evaluating the query from `Question 5a`. To avoid any interference by the `g_batting_idx` index, **drop the index before moving on to the next question.**

In [539]:
# you must run this cell!!!
%sql DROP INDEX IF EXISTS g_batting_idx;

<br><br>

---
## Question 5c

Write a query to add an index with name `salary_idx` on the `salary` column of the `salaries` table. Make sure to drop the previous index in `Question 5b` first!

In [540]:
%%sql
DROP INDEX IF EXISTS g_batting_idx;
DROP INDEX IF EXISTS salary_idx;
create index salary_idx on salaries(salary)

Now, re-inspect the query plan of the query from `Question 5a` and record its execution time and cost.

In [541]:
# just run this cell
%sql EXPLAIN ANALYZE {{query_5a}} 

QUERY PLAN
GroupAggregate (cost=3636.80..3636.83 rows=1 width=17) (actual time=15.847..15.902 rows=134 loops=1)
Group Key: s.playerid
-> Sort (cost=3636.80..3636.81 rows=1 width=17) (actual time=15.839..15.847 rows=138 loops=1)
Sort Key: s.playerid
Sort Method: quicksort Memory: 35kB
-> Hash Join (cost=2900.02..3636.79 rows=1 width=17) (actual time=9.539..15.728 rows=138 loops=1)
Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.lgid)::text = (a.lgid)::text) AND ((s.playerid)::text = (a.playerid)::text))
-> Seq Scan on salaries s (cost=0.00..459.28 rows=26428 width=28) (actual time=0.005..1.511 rows=26428 loops=1)
-> Hash (cost=2873.20..2873.20 rows=1341 width=20) (actual time=9.390..9.390 rows=1347 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 86kB


In [542]:
result_5c_cost = 3636.43
result_5c_timing = 27.773

In [543]:
grader.check("q5c")

<br><br>

---

## Question 5d

Given your findings from inspecting the query plans with no indexes (Question 5a), an index on `g_batting` (Question 5b), and an index on `salary` (Question 5c), assign the variable `q5d` to a list of all of the below statements that are true.

A. Adding the `g_batting` index did not have a significant impact on the query execution time and cost.<br/>
B. Adding the `g_batting` index did have a significant impact on the query execution time, but not the cost.<br/>
C. Adding the `g_batting` index did have a significant impact on the query cost, but not the execution time.<br/>
D. Adding the `g_batting` index did have a significant impact on the query cost and execution time.<br/>
E. Adding the `salary` index did not have a significant impact on the query execution time and cost.<br/>
F. Adding the `salary` index did have a significant impact on the query execution time, but not the cost.<br/>
G. Adding the `salary` index did have a significant impact on the query cost, but not the execution time.<br/>
H. Adding the `salary` index did have a significant impact on the query cost and execution time.

**Note:** Your answer should be formatted as a list of single-character strings, e.g., `q5d = ['A', 'B']`

In [544]:
q5d = ['D', 'E']

In [545]:
grader.check("q5d")

<!-- BEGIN QUESTION -->

### Question 5di Justification

Explain your answer to `Question 5d` above based on your knowledge from lectures, and details from inspecting the query plans (your explanation should include why you didn't choose certain options). Your answer should be no longer than 3 sentences.

_'D' is correct because `g_batting_idx` lets the planner locate the rows with `g_batting = 10` without scanning all of `appearances`, which lowered the estimated cost from about 3637 to about 2377 and substantially reduced the execution time; this rules out 'A', 'B', and 'C'. 'E' is correct because the query never filters or joins on `salary`, so `salary_idx` is not used: the plan and cost are unchanged from Question 5a and the execution time differs only by noise, which rules out 'F', 'G', and 'H'._

<!-- END QUESTION -->
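
To make the difference concrete, here is a minimal sketch that counts how many rows are examined with and without an index on the filtered column. A dictionary stands in for the B-tree, the data is made up, and the function names are hypothetical; an index on an unrelated column (such as `salary`) would leave the query on the `seq_scan` path.

```python
from collections import defaultdict

# Toy appearances rows: (row_id, g_batting). Values are made up for illustration.
appearances = [(row_id, row_id % 50) for row_id in range(10_000)]

def seq_scan(rows, target):
    """Check every row against the predicate g_batting = target."""
    examined, matches = 0, []
    for row_id, g_batting in rows:
        examined += 1
        if g_batting == target:
            matches.append(row_id)
    return matches, examined

def build_index(rows):
    """Map each g_batting value to the row ids holding it (stand-in for a B-tree)."""
    index = defaultdict(list)
    for row_id, g_batting in rows:
        index[g_batting].append(row_id)
    return index

def index_lookup(index, target):
    """Jump straight to the matching row ids; only those rows are examined."""
    matches = index.get(target, [])
    return matches, len(matches)

g_batting_index = build_index(appearances)
scan_matches, scan_examined = seq_scan(appearances, 10)
idx_matches, idx_examined = index_lookup(g_batting_index, 10)
print(scan_examined, idx_examined)  # 10000 200
```

Both paths return the same rows; the index only changes how many rows have to be looked at to find them.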

<br/><br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 6: Indexes, Part 2

Continue the analysis on how indexes impact query performance.

Question 6:
* Question 6a: Write a query that uses an **and** boolean operator. Record query performance.
* Question 6b: Write a query that uses an **or** boolean operator. Record query performance.
* Question 6c: Add an index and rerun queries in Questions 6a, 6b. Record query performance.
* Question 6d: Add a multi-attribute index and rerun queries 6a, 6b. Record query performance.
* Question 6e: Analyze query performance; compare and contrast.

Before continuing, make sure to drop all existing indexes from previous questions.

In [546]:
# you must run this cell!!!
%sql DROP INDEX IF EXISTS g_batting_idx;
%sql DROP INDEX IF EXISTS salary_idx;

<br><br>

---

## Question 6a

Write a query that finds the `playerid`, `yearid`, and `salary` for each player that had played 10 games **and** batted in 10 games (the number of games in which a player played can be found in the `g_all` column of the `appearances` table). Your query should join the `salaries` and `appearances` tables on all the common columns `yearid`, `teamid`, and `playerid`, so feel free to use a natural join.

In [547]:
%%sql --save query_6a result_6a <<
select s.playerid, s.yearid, s.salary
from salaries as s
natural join appearances as a
where a.g_batting = 10 and a.g_all = 10

In [548]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_6a = %sqlcmd snippets query_6a
grading_util.save_results("result_6a", query_6a, result_6a);

result_6a.DataFrame().head(3)

Unnamed: 0,playerid,yearid,salary
0,wiggial01,1985,512500.0
1,anderla02,1986,240000.0
2,lakest01,1986,60000.0


In [549]:
grader.check("q6a")

Inspect the query plan and record the execution time and cost.

In [550]:
%sql EXPLAIN ANALYZE {{query_6a}} 

QUERY PLAN
Nested Loop (cost=0.29..3296.09 rows=1 width=21) (actual time=6.442..11.704 rows=120 loops=1)
-> Seq Scan on appearances a (cost=0.00..3133.84 rows=20 width=20) (actual time=0.015..9.255 rows=1289 loops=1)
Filter: ((g_batting = 10) AND (g_all = 10))
Rows Removed by Filter: 102967
-> Index Scan using salaries_pkey on salaries s (cost=0.29..8.11 rows=1 width=28) (actual time=0.002..0.002 rows=0 loops=1289)
Index Cond: ((yearid = a.yearid) AND ((teamid)::text = (a.teamid)::text) AND ((lgid)::text = (a.lgid)::text) AND ((playerid)::text = (a.playerid)::text))
Planning Time: 0.977 ms
Execution Time: 11.777 ms


In [551]:
result_6a_cost = 3287.78
result_6a_timing = 12.373

In [552]:
grader.check("6a_cost")

## Question 6b
Write a query that finds the `playerid`, `yearid`, and `salary` for each player that had played 10 games __or__ batted in 10 games. Your query should join the `salaries` and `appearances` tables on all the common columns `yearid`, `teamid`, and `playerid`, so feel free to use a natural join.

In [553]:
%%sql --save query_6b result_6b <<
select s.playerid, s.yearid, s.salary
from salaries as s
natural join appearances as a
where a.g_batting = 10 or a.g_all = 10

In [554]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_6b = %sqlcmd snippets query_6b
grading_util.save_results("result_6b", query_6b, result_6b);
result_6b.DataFrame().head(3)

Unnamed: 0,playerid,yearid,salary
0,wiggial01,1985,512500.0
1,forscke01,1986,100000.0
2,carltst01,1986,60000.0


In [555]:
grader.check("q6b")

Inspect the query plan and record the execution time and cost.

In [556]:
%sql EXPLAIN ANALYZE {{query_6b}} 

QUERY PLAN
Hash Join (cost=3190.86..3927.63 rows=1 width=21) (actual time=10.552..16.669 rows=216 loops=1)
Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.lgid)::text = (a.lgid)::text) AND ((s.playerid)::text = (a.playerid)::text))
-> Seq Scan on salaries s (cost=0.00..459.28 rows=26428 width=28) (actual time=0.022..1.553 rows=26428 loops=1)
-> Hash (cost=3133.84..3133.84 rows=2851 width=20) (actual time=10.385..10.387 rows=1655 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 118kB
-> Seq Scan on appearances a (cost=0.00..3133.84 rows=2851 width=20) (actual time=0.010..9.945 rows=1655 loops=1)
Filter: ((g_batting = 10) OR (g_all = 10))
Rows Removed by Filter: 102601
Planning Time: 0.679 ms
Execution Time: 16.704 ms


In [557]:
result_6b_cost = 3927.35
result_6b_timing = 16.310

In [558]:
grader.check("6b_cost")

## Question 6c
Now, let's see the impact of adding an index on the `g_batting` column. Create an index on the `g_batting` column. Re-inspect the queries from `Question 6a` and `Question 6b` and record the respective execution costs and times.

In [559]:
%%sql
DROP INDEX IF EXISTS g_batting_idx;
create index g_batting_idx on appearances(g_batting)

In [560]:
# record the updated costs for Question 6a ("and" query)
%sql EXPLAIN ANALYZE {{query_6a}} 

QUERY PLAN
Nested Loop (cost=18.64..1778.68 rows=1 width=21) (actual time=1.901..3.936 rows=120 loops=1)
-> Bitmap Heap Scan on appearances a (cost=18.35..1616.43 rows=20 width=20) (actual time=0.208..1.360 rows=1289 loops=1)
Recheck Cond: (g_batting = 10)
Filter: (g_all = 10)
Rows Removed by Filter: 58
Heap Blocks: exact=899
-> Bitmap Index Scan on g_batting_idx (cost=0.00..18.35 rows=1341 width=0) (actual time=0.109..0.109 rows=1347 loops=1)
Index Cond: (g_batting = 10)
-> Index Scan using salaries_pkey on salaries s (cost=0.29..8.11 rows=1 width=28) (actual time=0.002..0.002 rows=0 loops=1289)
Index Cond: ((yearid = a.yearid) AND ((teamid)::text = (a.teamid)::text) AND ((lgid)::text = (a.lgid)::text) AND ((playerid)::text = (a.playerid)::text))


In [561]:
result_6cand_index_cost = 1610.70
result_6cand_index_timing = 3.681

In [562]:
# record the updated costs for Question 6b ("or" query)
%sql EXPLAIN ANALYZE {{query_6b}} 

QUERY PLAN
Hash Join (cost=3190.86..3927.63 rows=1 width=21) (actual time=10.746..17.004 rows=216 loops=1)
Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.lgid)::text = (a.lgid)::text) AND ((s.playerid)::text = (a.playerid)::text))
-> Seq Scan on salaries s (cost=0.00..459.28 rows=26428 width=28) (actual time=0.007..1.550 rows=26428 loops=1)
-> Hash (cost=3133.84..3133.84 rows=2851 width=20) (actual time=10.588..10.591 rows=1655 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 118kB
-> Seq Scan on appearances a (cost=0.00..3133.84 rows=2851 width=20) (actual time=0.008..10.142 rows=1655 loops=1)
Filter: ((g_batting = 10) OR (g_all = 10))
Rows Removed by Filter: 102601
Planning Time: 0.703 ms
Execution Time: 17.042 ms


In [563]:
result_6cor_index_cost = 3927.35
result_6cor_index_timing = 28.147

In [564]:
grader.check("q6c")

<br/><br/>

---

## Question 6d: Multiple-attribute index

Now, create a multiple column index on `g_batting` and `g_all` called `g_batting_g_all_idx` and record the query execution time and cost for the "or" command in `Question 6b`.

Before continuing, make sure to drop all existing indexes from previous questions.

In [565]:
# you must run this cell!!!
%sql DROP INDEX IF EXISTS g_batting_idx;
%sql DROP INDEX IF EXISTS salary_idx;

In [566]:
%%sql 
DROP INDEX IF EXISTS g_batting_g_all_idx;
create index g_batting_g_all_idx on appearances(g_batting,g_all)

In [567]:
# record the updated costs for Question 6b ("or" query)
%sql EXPLAIN ANALYZE {{query_6b}} 

QUERY PLAN
Hash Join (cost=2883.54..3620.31 rows=1 width=21) (actual time=2.754..8.896 rows=216 loops=1)
Hash Cond: ((s.yearid = a.yearid) AND ((s.teamid)::text = (a.teamid)::text) AND ((s.lgid)::text = (a.lgid)::text) AND ((s.playerid)::text = (a.playerid)::text))
-> Seq Scan on salaries s (cost=0.00..459.28 rows=26428 width=28) (actual time=0.005..1.535 rows=26428 loops=1)
-> Hash (cost=2826.52..2826.52 rows=2851 width=20) (actual time=2.620..2.624 rows=1655 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 118kB
-> Bitmap Heap Scan on appearances a (cost=1181.99..2826.52 rows=2851 width=20) (actual time=0.942..2.249 rows=1655 loops=1)
Recheck Cond: ((g_batting = 10) OR (g_all = 10))
Heap Blocks: exact=1027
-> BitmapOr (cost=1181.99..1181.99 rows=2871 width=0) (actual time=0.825..0.826 rows=0 loops=1)
-> Bitmap Index Scan on g_batting_g_all_idx (cost=0.00..18.35 rows=1341 width=0) (actual time=0.112..0.112 rows=1347 loops=1)


In [568]:
result_6d_multiple_col_index_cost = 3621.24
result_6d_multiple_col_index_timing = 9.738

In [569]:
grader.check("q6d")
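
The visible part of the plan above shows a `BitmapOr` whose first input is a cheap bitmap index scan on `g_batting_g_all_idx`, while the `BitmapOr` as a whole is far more expensive. One plausible reading, sketched here with a sorted list standing in for the B-tree and made-up data: a lookup on the leading column (`g_batting`) can binary-search to one contiguous slice, but a lookup on the second column alone (`g_all`) has to read every index entry.

```python
from bisect import bisect_left, bisect_right

# Toy index entries (g_batting, g_all, row_id), kept sorted like a B-tree on (g_batting, g_all).
rows = [(rid % 40, (rid * 7) % 60, rid) for rid in range(5_000)]
index = sorted(rows)

def lookup_leading(index, g_batting):
    """Equality on the leading column: binary search to one contiguous slice."""
    lo = bisect_left(index, (g_batting,))
    hi = bisect_right(index, (g_batting, float("inf")))
    return {rid for _, _, rid in index[lo:hi]}, hi - lo

def lookup_second(index, g_all):
    """Equality on the second column only: matches are scattered, so read every entry."""
    return {rid for _, ga, rid in index if ga == g_all}, len(index)

by_batting, read_batting = lookup_leading(index, 10)
by_all, read_all = lookup_second(index, 10)
bitmap_or = by_batting | by_all  # union of the two row-id sets, in the spirit of BitmapOr
print(read_batting, read_all)  # 125 5000
```

Even the full pass over the index can be cheaper than a sequential scan of the table, because index entries are much narrower than table rows.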

<br/><br/>

---

## Question 6e
Given your findings from inspecting the query plans from all parts of this `Question 6`, assign the variable `q6e` to a list of all below statements that are true.

A. Adding an index on a column used in an AND predicate will reduce the query time, but not the execution cost.<br/>
B. Adding an index on a column used in an AND predicate will reduce the query cost, but not the execution time.<br/>
C. Adding an index on a column used in an AND predicate will reduce the query cost and the execution time.<br/>
D. Adding an index on a column used in an OR predicate will reduce the query time, but not the execution cost.<br/>
E. Adding an index on a column used in an OR predicate will reduce the query cost, but not the execution time.<br/>
F. Adding an index on a column used in an OR predicate will reduce the query cost and the execution time.<br/>
G. Adding a multicolumn index on columns in an OR predicate will reduce the query time, but not the execution cost.<br/>
H. Adding a multicolumn index on columns in an OR predicate will reduce the query cost, but not the execution time.<br/>
I. Adding a multicolumn index on columns in an OR predicate will reduce the query cost and the execution time.

**Note:** Your answer should be formatted as a list of single-character strings, e.g., `q6e = ['A', 'B']`


In [570]:
q6e = ['C','G']

In [571]:
grader.check("q6e")

<!-- BEGIN QUESTION -->

### Question 6ei Justification

Explain your answer to `Question 6e` above based on your knowledge from lectures, and details from inspecting the query plans (your explanation should include why you didn't choose certain options). Your answer should be no longer than 3 sentences.

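_C is correct because with an AND predicate the planner can use `g_batting_idx` to find the rows with `g_batting = 10` and then filter those few rows on `g_all`, which reduced both the cost (about 3296 to 1779) and the execution time; with an OR predicate a single-column index does not help, since rows matching only `g_all = 10` can be found only by scanning the table, so the plan, cost, and time stayed the same and D, E, and F are false. G is correct because the multicolumn index lets the planner replace the sequential scan with bitmap index scans combined by a BitmapOr, which cut the execution time roughly in half while the estimated cost fell only slightly (about 3928 to 3620). Since that cost change is small, we do not count it as a reduction, which rules out H and I, and A and B are ruled out because the AND query improved in both cost and time._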

<!-- END QUESTION -->
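
The AND/OR asymmetry with a single-column index can be sketched in a few lines. The data and function names are made up, and a dictionary stands in for `g_batting_idx`; the point is only how many rows each query has to examine.

```python
# Toy rows: (row_id, g_batting, g_all); values are made up for illustration.
rows = [(rid, rid % 40, (rid * 7) % 60) for rid in range(5_000)]

# Single-column "index": g_batting value -> row ids.
g_batting_idx = {}
for rid, g_batting, _ in rows:
    g_batting_idx.setdefault(g_batting, []).append(rid)

def and_query(target):
    """g_batting = t AND g_all = t: fetch the index matches, then filter them on g_all."""
    candidates = g_batting_idx.get(target, [])
    matches = [rid for rid in candidates if rows[rid][2] == target]
    return matches, len(candidates)

def or_query(target):
    """g_batting = t OR g_all = t: rows matching only g_all are unreachable via the index."""
    matches = [rid for rid, gb, ga in rows if gb == target or ga == target]
    return matches, len(rows)

and_matches, and_examined = and_query(10)
or_matches, or_examined = or_query(10)
print(and_examined, or_examined)  # 125 5000
```

With AND, every answer row is guaranteed to appear under `g_batting = 10` in the index; with OR, that guarantee is gone, so the whole table has to be read.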

<br/><br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 7: Indexes, Part 3

Continue the analysis on how indexes impact query performance. Now, use aggregators.

Question 7:
* Question 7a: Write two queries that use aggregators. Record query performance.
* Question 7b: Add an index and rerun queries in Question 7a. Record query performance.
* Question 7c: Analyze query performance; compare and contrast.

Before continuing, make sure to drop all existing indexes from previous questions.

In [572]:
# you must run this cell!!!
%sql DROP INDEX IF EXISTS g_batting_idx;
%sql DROP INDEX IF EXISTS salary_idx;
%sql DROP INDEX IF EXISTS g_batting_g_all_idx;

---

## Question 7a

Write 2 queries: one that finds the minimum salary from the salary table `Salaries` and one that finds the average. Inspect the queries' query plans and record their execution times and costs.

Minimum salary:

In [573]:
%%sql --save query_7a_min result_7a_min << 
select min(salary) from salaries

In [574]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_7a_min = %sqlcmd snippets query_7a_min
grading_util.save_results("result_7a_min", query_7a_min, result_7a_min);

display(result_7a_min)
%sql EXPLAIN ANALYZE {{query_7a_min}} 

min
0.0


QUERY PLAN
Aggregate (cost=525.35..525.36 rows=1 width=8) (actual time=3.492..3.493 rows=1 loops=1)
-> Seq Scan on salaries (cost=0.00..459.28 rows=26428 width=8) (actual time=0.007..1.525 rows=26428 loops=1)
Planning Time: 0.067 ms
Execution Time: 3.511 ms


In [575]:
result_7a_min_query_cost = 525.36
result_7a_min_query_timing = 4.00

In [576]:
grader.check("q7a_min")

Average salary:

In [577]:
%%sql --save query_7a_avg result_7a_avg <<
select avg(salary) from salaries

In [578]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_7a_avg = %sqlcmd snippets query_7a_avg
grading_util.save_results("result_7a_avg", query_7a_avg, result_7a_avg);

display(result_7a_avg)
%sql EXPLAIN ANALYZE {{query_7a_avg}} 

avg
2085634.053125473


QUERY PLAN
Aggregate (cost=525.35..525.36 rows=1 width=8) (actual time=4.007..4.008 rows=1 loops=1)
-> Seq Scan on salaries (cost=0.00..459.28 rows=26428 width=8) (actual time=0.008..1.600 rows=26428 loops=1)
Planning Time: 0.051 ms
Execution Time: 4.030 ms


In [579]:
result_7a_avg_query_cost = 525.36
result_7a_avg_query_timing = 4.292

In [580]:
grader.check("q7a_avg")

<br><br>

---
## Question 7b
Create an index on the `salary` column in the `Salaries` table and re-inspect the query plans from the previous part and record the respective execution time and cost.

In [581]:
%%sql 
DROP INDEX IF EXISTS salary_idx;
create index salary_idx on salaries(salary)

In [582]:
# record the updated costs for "min" query
%sql EXPLAIN ANALYZE {{query_7a_min}} 

QUERY PLAN
Result (cost=0.32..0.33 rows=1 width=8) (actual time=0.061..0.062 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.29..0.32 rows=1 width=8) (actual time=0.058..0.058 rows=1 loops=1)
-> Index Only Scan using salary_idx on salaries (cost=0.29..762.78 rows=26428 width=8) (actual time=0.057..0.057 rows=1 loops=1)
Index Cond: (salary IS NOT NULL)
Heap Fetches: 0
Planning Time: 0.265 ms
Execution Time: 0.109 ms


In [583]:
result_7b_min_query_cost = 0.33
result_7b_min_query_timing = 0.07

In [584]:
# record the updated costs for "avg" query
%sql EXPLAIN ANALYZE {{query_7a_avg}} 

QUERY PLAN
Aggregate (cost=525.35..525.36 rows=1 width=8) (actual time=4.171..4.173 rows=1 loops=1)
-> Seq Scan on salaries (cost=0.00..459.28 rows=26428 width=8) (actual time=0.008..1.694 rows=26428 loops=1)
Planning Time: 0.075 ms
Execution Time: 4.196 ms


In [585]:
result_7b_avg_query_cost = 525.35
result_7b_avg_query_timing = 4.494

In [586]:
grader.check("q7b")

<!-- BEGIN QUESTION -->

<br><br>

---

## Question 7c
Given your findings from `Question 7`, which of the following statements is true?
<br> A. An index on the column being aggregated in a query will always provide a performance enhancement.
<br> B. A query finding the MIN(salary) will always benefit from an index on salary, but a query finding MAX(salary) will not.
<br> C. A query finding the COUNT(salary) will always benefit from an index on salary, but a query finding AVG(salary) will not.
<br> D. Queries finding the MIN(salary) or MAX(salary) will always benefit from an index on salary, but queries finding AVG(salary) or COUNT(salary) will not.

**Justify your answer.** Explain your answer based on your knowledge from lectures, and details of the query plans (your explanation should include why you didn't choose certain options). Your response should be no longer than 3 sentences.
 
*Note:* Your answer should be formatted as follows: "A because ... " 

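_D because a B-tree index on `salary` stores the values in sorted order, so MIN or MAX can be read from one end of the index; the MIN plan changed from a Seq Scan to an Index Only Scan with a Limit, and its cost dropped from 525.36 to 0.33. AVG and COUNT still have to visit every row, so the AVG plan stayed a Seq Scan with the same cost, which rules out A and C. B is wrong because MAX can use the index just as MIN does, by reading from the opposite end._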

<!-- END QUESTION -->
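
A minimal sketch of the same idea, with a sorted list standing in for the index on `salary` (toy values, hypothetical function names). Each function also returns how many entries it had to read.

```python
# Toy salary values; sorting them mimics the order a B-tree on salary maintains.
salaries = [512500.0, 60000.0, 240000.0, 0.0, 125000.0, 109000.0]
salary_idx = sorted(salaries)

def index_min(idx):
    """The smallest key is the first index entry: one entry read."""
    return idx[0], 1

def index_max(idx):
    """The largest key is the last index entry: one entry read."""
    return idx[-1], 1

def average(values):
    """Every value contributes to the result, so every value must be read."""
    total, read = 0.0, 0
    for value in values:
        total += value
        read += 1
    return total / read, read

print(index_min(salary_idx), index_max(salary_idx), average(salaries)[1])
```

Sorted order helps only when the answer sits at a known position; an average depends on all values, so the order they are stored in does not reduce the work.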

<br/><br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 8: Clustered Indexes
In this question, we will inspect the impact that clustering our data on an index can have on a query's performance.
* Question 8a: query
* Question 8b: cluster index on primary key
* Question 8c: cluster index on different key
* Question 8d: observe and analyze.

Before continuing, make sure to drop all existing indexes from previous questions.

In [587]:
# you must run this cell!!!
%sql DROP INDEX IF EXISTS g_batting_idx;
%sql DROP INDEX IF EXISTS salary_idx;
%sql DROP INDEX IF EXISTS g_batting_g_all_idx;

---

## Question 8a

Write a query that finds the `playerid`, `yearid`, `teamid`, and `ab` for all players whose `ab` was above 500. Inspect the query plan and record the execution time and cost.

In [588]:
%%sql --save query_8a result_8a <<
select playerid, yearid,teamid, ab
from batting
where ab > 500

In [589]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_8a = %sqlcmd snippets query_8a
grading_util.save_results("result_8a", query_8a, result_8a);
result_8a.DataFrame().head(3)

Unnamed: 0,playerid,yearid,teamid,ab
0,dalryab01,1884,CHN,521
1,hornujo01,1884,BSN,518
2,ansonca01,1886,CHN,504


In [590]:
grader.check("q8a")

Inspect the query plan and record the execution time and cost.

In [591]:
%sql EXPLAIN ANALYZE {{query_8a}} 

QUERY PLAN
Seq Scan on batting (cost=0.00..2884.05 rows=8805 width=21) (actual time=0.222..9.813 rows=8839 loops=1)
Filter: (ab > 500)
Rows Removed by Filter: 95485
Planning Time: 0.060 ms
Execution Time: 10.094 ms


In [592]:
result_8a_cost = 2884.05
result_8a_timing = 11.513

In [593]:
grader.check("8a_cost")

<br><br>

---

## Question 8b

Cluster the `batting` table on its primary key (hint: use the psql meta-command `\di` to find the name of the primary key index). We are able to directly cluster on the primary key (without first creating a separate index) because Postgres automatically creates an index for it.

Then, re-inspect the query plan for the query from `Question 8a` and record the execution time and cost.

In [594]:
%%sql
\di

Schema,Name,Type,Owner
public,allstarfull_pkey,index,jovyan
public,appearances_pkey,index,jovyan
public,awardsmanagers_pkey,index,jovyan
public,awardsplayers_pkey,index,jovyan
public,awardssharemanagers_pkey,index,jovyan
public,awardsshareplayers_pkey,index,jovyan
public,batting_pkey,index,jovyan
public,battingpost_pkey,index,jovyan
public,collegeplaying_pkey,index,jovyan
public,fielding_pkey,index,jovyan


In [595]:
%%sql
cluster batting using batting_pkey

In [596]:
# check the updated costs for query in Question 8a
%sql EXPLAIN ANALYZE {{query_8a}} 

QUERY PLAN
Seq Scan on batting (cost=0.00..2878.05 rows=8805 width=21) (actual time=0.010..11.407 rows=8839 loops=1)
Filter: (ab > 500)
Rows Removed by Filter: 95485
Planning Time: 0.143 ms
Execution Time: 11.675 ms


In [597]:
result_8b_cost = 2878.05
result_8b_timing = 11.541

In [598]:
grader.check("q8b")

<br><br>

---

## Question 8c

Now, let's try clustering the table on a different index. Create an index called `ab_idx` on the `ab` column of the `batting` table, and then cluster the `batting` table using this new index. Then re-inspect the query plan for the query from `Question 8a` and record the execution time and cost.

In [599]:
%%sql --save query_8c result_8c <<
create index ab_idx on batting(ab);
cluster batting using ab_idx;

In [600]:
# Do not delete/edit this cell!
# You must run this cell before running the autograder.
query_8c = %sqlcmd snippets query_8c
grading_util.save_results("result_8c", query_8c, result_8c);

# check the updated costs for query in Question 8a
%sql EXPLAIN ANALYZE {{query_8a}} 

QUERY PLAN
Bitmap Heap Scan on batting (cost=100.53..1785.59 rows=8805 width=21) (actual time=0.265..1.565 rows=8839 loops=1)
Recheck Cond: (ab > 500)
Heap Blocks: exact=135
-> Bitmap Index Scan on ab_idx (cost=0.00..98.33 rows=8805 width=0) (actual time=0.247..0.248 rows=8839 loops=1)
Index Cond: (ab > 500)
Planning Time: 0.166 ms
Execution Time: 1.828 ms
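
The `Heap Blocks: exact=135` line shows that the scan only had to visit a small number of pages. One way to build intuition for why clustering helps is a toy model, sketched below under loud assumptions: the page capacity `ROWS_PER_PAGE`, the row count, and the uniform `ab` values are all made up for illustration and are not taken from Postgres or from the `batting` table. The model places row `i` on page `i // ROWS_PER_PAGE` and counts how many distinct pages contain at least one row with `ab > 500`, before and after sorting the rows by `ab`.

```python
import random

ROWS_PER_PAGE = 100   # hypothetical page capacity, not a Postgres constant
NUM_ROWS = 100_000    # illustrative table size, not the real row count

def pages_touched(ab_values, threshold=500):
    """Count distinct pages holding at least one row with ab > threshold.

    Row i is assumed to live on page i // ROWS_PER_PAGE.
    """
    return len({i // ROWS_PER_PAGE
                for i, ab in enumerate(ab_values)
                if ab > threshold})

rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic `ab` values, uniform over 0..700; the real distribution differs.
unclustered = [rng.randint(0, 700) for _ in range(NUM_ROWS)]
clustered = sorted(unclustered)  # physical row order after clustering on ab

print("pages touched, unclustered:", pages_touched(unclustered))
print("pages touched, clustered:  ", pages_touched(clustered))
```

In the sorted layout all matching rows sit next to each other, so they occupy far fewer pages than when they are scattered across the table.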


In [601]:
result_8c_cost = 1787.70
result_8c_timing = 1.779

In [602]:
grader.check("q8c")

<br><br>

---

## Question 8d
Given your findings from inspecting the query plans from Questions 8a, 8b, and 8c, assign the variable `q8d` to a list of all statements that are true.

A. Clustering based on the `ab_idx` decreased the cost of the query.<br/>
B. Clustering based on the `ab_idx` increased the cost of the query.<br/>
C. Clustering based on the `ab_idx` increased the execution time of the query.<br/>
D. Clustering based on the `ab_idx` decreased the execution time of the query.<br/>
E. Clustering based on the `batting_pkey` decreased the cost of the query.<br/>
F. Clustering based on the `batting_pkey` increased the cost of the query.<br/>
G. Clustering based on the `batting_pkey` increased the execution time of the query.<br/>
H. Clustering based on the `batting_pkey` decreased the execution time of the query.<br/>
I. None of the above
    
**Note:** Your answer should be formatted as a list of single-character strings, e.g., `q8d = ['A', 'B']`.


In [603]:
q8d = ['A', 'D']

In [604]:
grader.check("q8d")
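
To compare the three plans on the same footing, it can help to express each one as a percent change relative to the unclustered baseline from Question 8a. The sketch below does only that arithmetic; the numbers are copied from the three `EXPLAIN ANALYZE` outputs shown above, and `percent_change` is a hypothetical helper name.

```python
# (total cost, execution time in ms) copied from the EXPLAIN ANALYZE outputs above.
plans = {
    "8a: no clustering": (2884.05, 10.094),
    "8b: clustered on batting_pkey": (2878.05, 11.675),
    "8c: clustered on ab_idx": (1785.59, 1.828),
}

def percent_change(old, new):
    """Signed percent change from old to new; negative means a decrease."""
    return (new - old) / old * 100

base_cost, base_ms = plans["8a: no clustering"]
for label, (cost, ms) in plans.items():
    print(f"{label}: cost {percent_change(base_cost, cost):+.1f}%, "
          f"time {percent_change(base_ms, ms):+.1f}%")
```

With these numbers, clustering on `ab_idx` lowers the cost by roughly 38% and the execution time by roughly 82%, while clustering on `batting_pkey` moves the cost by only about 0.2%. Single-run timings in this notebook differ by a millisecond or two between runs, so a timing gap of that size is hard to distinguish from noise.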

<br><br>

---

### Question 8di Justification

Explain your answer to `Question 8d` above based on your knowledge from lectures, and details from inspecting the query plans (your explanation should include why you didn't choose certain options). Your answer should be no longer than 3 sentences.

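_I chose A and D because clustering on the new index `ab_idx`, which is built on the column used in the predicate, let the planner switch from a sequential scan to a bitmap index scan that reads only the matching rows, which decreased both cost and execution time. Clustering on `batting_pkey` had no meaningful effect: the plan was still a sequential scan over the whole table, because ordering the rows by primary key does not help a filter on `ab`. Therefore, only choices A and D are correct._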

<br/><br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 9: Cost of Index Management
So far, we have seen the positive impact that indexes can have on query performance. In real-world applications, however, we routinely receive new data, often in large quantities, which triggers regular updates to our tables. In this section, we will dive into the cost of maintaining the indexes that we create.

Before starting this question, be sure to delete any existing indexes by running the cell below.

In [605]:
# you must run this cell!!!
%sql DROP INDEX IF EXISTS g_batting_idx;
%sql DROP INDEX IF EXISTS salary_idx;
%sql DROP INDEX IF EXISTS g_batting_all_idx;
%sql DROP INDEX IF EXISTS ab_idx;

---

## Question 9a

Record the time it takes to insert 300,000 rows into the `salaries` table when no additional index is configured.

Run the following cell to set up a column that tracks which rows were added as part of these inserts.

In [606]:
%sql ALTER TABLE salaries ADD added boolean DEFAULT False;

Next, run the provided update script and record the **wall time**.

**NOTE:** Running the cell below multiple times may result in an error unless you first delete the inserted rows using the cell given at the end of this subpart.

In [607]:
%%time
%%sql
DO $$
 DECLARE counter INTEGER := 1;
 BEGIN
     FOR counter IN 100001..400000 LOOP
     INSERT INTO salaries (yearid, teamid, lgid, playerid, salary, added)
         VALUES (2021, 'ATL', 'NL', 'p' || counter, RANDOM() * 1000000, true);
     END LOOP;
END;
$$;

CPU times: user 13.3 ms, sys: 5.04 ms, total: 18.4 ms
Wall time: 2.88 s


In [608]:
result_9a_timing = 3.24

In [609]:
grader.check("q9a")

<br/><br/>

**Before moving on to the next question**, delete all the rows that the update script added to the table.

In [610]:
%%sql
/* just run this cell */
DELETE FROM salaries
WHERE added = 'true';

<br><br>

---

## Question 9b

Now, create an index on the `salary` column, run the update script again, and record the **wall time**. Make sure to first run the previous cell to roll back any changes from the previous part!

In [611]:
%%sql 
create index salary_ix on salaries(salary)

**NOTE:** Running the cell below multiple times may result in an error unless you first delete the inserted rows using the cell given at the end of the last subpart.

In [612]:
%%time
%%sql
DO $$
 DECLARE counter INTEGER := 1;
 BEGIN
     FOR counter IN 100001..400000 LOOP
     INSERT INTO salaries (yearid, teamid, lgid, playerid, salary, added)
         VALUES (2021, 'ATL', 'NL', 'p' || counter, RANDOM() * 1000000, true);
     END LOOP;
END;
$$;

CPU times: user 18.4 ms, sys: 1.02 ms, total: 19.5 ms
Wall time: 6.21 s


In [613]:
result_9b_timing = 6.14

In [614]:
grader.check("q9b")
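
The same experiment can be reproduced outside the project database. The sketch below is an analogy only: it swaps Postgres for Python's built-in `sqlite3` module with an in-memory database, uses a made-up two-column `salaries` table, and inserts far fewer rows, so the absolute timings are not comparable to the wall times above. The function name `timed_insert` is hypothetical. It times the same batch of inserts with and without an index on `salary`.

```python
import random
import sqlite3
import time

def timed_insert(with_index, n_rows=50_000):
    """Insert n_rows into a fresh in-memory table; return (seconds, row_count).

    Uses SQLite as a stand-in for Postgres, for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE salaries (playerid TEXT, salary REAL)")
    if with_index:
        # Every later INSERT must also add an entry to this index.
        conn.execute("CREATE INDEX salary_ix ON salaries(salary)")
    rng = random.Random(0)  # fixed seed so both runs insert the same data
    rows = [(f"p{i}", rng.random() * 1_000_000) for i in range(n_rows)]
    start = time.perf_counter()
    with conn:  # run all inserts in one transaction, committed on exit
        conn.executemany("INSERT INTO salaries VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM salaries").fetchone()[0]
    conn.close()
    return elapsed, count

for with_index in (False, True):
    seconds, count = timed_insert(with_index)
    print(f"index={with_index}: inserted {count} rows in {seconds:.3f} s")
```

Both runs end with the same number of rows; only the amount of work per insert differs, since the indexed run has to maintain the index as well as the table.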

<!-- BEGIN QUESTION -->

<br><br>

---

## Question 9c:
What difference did you notice when you added an index into the salaries table and re-timed the update? Why do you think it happened? Your answer should be no longer than 3 sentences.

_After adding an index to the `salaries` table, the same insert script took roughly twice as long (about 6 seconds instead of about 3). With the index in place, every inserted row also requires a new entry in the index, so each insert does extra work and the total execution time goes up._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br/><br/><br/>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Question 10: Project Takeaways

In this project, we explored how the database system optimizes query execution and how users can further tune the performance of their queries.

Familiarizing yourself with these optimization and tuning methods will make you a better data engineer. In this question, we'll ask you to recall and summarize these concepts. Who knows? Maybe one day it will help you during an interview or on a project.

In the following answer cell,
1. Name 3 methods you learned in this project. Each method can be either an optimization performed by the database system or fine-tuning done by the user.
2. For each method, summarize how and why it can optimize query performance. Feel free to discuss any drawbacks, if applicable.

Your answer should be no longer than ten sentences. Each method identification/discussion is 2 points.


_1. Materialized views store a query's result on disk, so later queries that use the view read the precomputed result instead of recomputing it, which improves their execution time. A drawback is that, because the result is written to disk, a materialized view takes longer to create than a regular view._

_2. Adding filters can lower cost and execution time. Predicate pushdown applies the filter as early as possible, which cuts down the number of rows that later operators have to process. For this reason, a filtered query costs less and takes less time to execute._

_3. Indexes can reduce the cost and execution time of a query when the indexed column is used in its `WHERE` clause. An index lets the database look up matching values directly instead of scanning the whole table. A drawback is that every additional index makes inserts slower, because each new row must also be added to the index._

<!-- END QUESTION -->

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Congratulations! You have finished Project 2.

Run the following cell to zip and download the results of your queries. You will also need to run the export cell at the end of the notebook.

**Please save your notebook before exporting (this is a good time to do it!)** Otherwise, we may not be able to register your written responses.

**For submission on Gradescope, you will need to submit BOTH the `proj2.zip` file generated by the export cell and the `results.zip` file generated by the following cell.**

**Common submission issues:** You MUST submit the generated zip files (not folders) to the autograder. However, Safari is known to automatically unzip files upon downloading. You can fix this by going into Safari preferences and deselecting the box labeled "Open safe files after downloading" under the "General" tab. If you have trouble downloading by clicking on the link, you can also navigate to the Project 2 directory within JupyterHub (remove `proj2.ipynb` from the URL) and manually download the generated zip files. Please post on Ed if you encounter any other submission issues.

In [615]:
grading_util.prepare_submission_and_cleanup()  # builds results.zip

In [616]:
# Close SQL magic connection
# You may disregard "RunTimeError: Could not close connection"
# %sql --close postgresql://127.0.0.1:5432/baseball

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [617]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True, files=['results.zip'])

Running your submission against local test cases...



Your submission received the following results when run against available test cases:

    q0 results: All test cases passed!

    q1a results: All test cases passed!

    q1bi results: All test cases passed!

    q1bii results: All test cases passed!

    q2a results: All test cases passed!

    q2bi results: All test cases passed!

    q2bii results: All test cases passed!

    q2biii results: All test cases passed!

    q2c results: All test cases passed!

    q2di results: All test cases passed!

    q3a results: All test cases passed!

    q3b results: All test cases passed!

    q4a results: All test cases passed!

    q4b results: All test cases passed!

    q4c results: All test cases passed!

    q5a results: All test cases passed!

    q5b results: All test cases passed!

    q5c results: All test cases passed!

    q5d results: All test cases passed!

    q6a results: All test cases passed!

    6a_cost results: All test