# HW4. Indexes & Query Processing

## Objectives

In this assignment, you will review what you have learned in the Views and Indexes and Query Processing modules. You will further practice:
 - How indexing can change query processing
 - How indexing changes query performance
 - How B-Trees store records
 - Query processing and optimization

## Q1 (10 points): Indexing

In this question, you will be asked to select suitable indexes to speed up query performance and examine the query plan of a SQL query.

We are going to use a new database called flights.db. In the database, there is a single table, called FLIGHTS. The following shows its schema:

    FLIGHTS(fid, year, month_id, day_of_month, day_of_week_id, 
            carrier_id, flight_num, origin_city, origin_state, 
            dest_city, dest_state, departure_delay, taxi_out, 
            arrival_delay, canceled, actual_time, distance)

Note that this task only needs to use four attributes: `carrier_id`, `origin_city`, `actual_time`, and `dest_city`.

In [0]:
%load_ext sql

In [2]:
%sql sqlite:///flight.db

'Connected: @flight.db'

Consider the following queries:

```sqlite
(a)  SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Seattle WA' AND actual_time <= 180;
```


```sqlite
(b)  SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Gunnison CO' AND actual_time <= 180;
```


```sqlite
(c)  SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Seattle WA' AND actual_time <= 30;
```

Choose one single simple index (index on one attribute) that is most likely to speed up all three queries. Write down the `CREATE INDEX` statement and explain why you chose that index below.

Q1.1. (1 point) What is the CREATE INDEX statement?

In [3]:
%%sql
CREATE INDEX idx1 on Flights(actual_time)

 * sqlite:///flight.db
Done.


[]

Q1.2. (1 point) Why did you choose the index? 

I chose `actual_time` because there is already a clustered index for the selection query, whereas the `actual_time` predicate is less efficient because it is a range comparison.



Open a command line shell and start the sqlite program. Connect to the provided flights.db, and check whether the FLIGHTS table has the index that you indicate above. If not, add this index to the FLIGHTS table. 

Q1.3. (0.5 points) Does the FLIGHTS table have the index that you indicated above?

In [4]:
%%sql
SELECT * FROM Flights INDEXED BY idx1 WHERE actual_time <= 15


 * sqlite:///flight.db
Done.


fid,year,month_id,day_of_month,day_of_week_id,carrier_id,flight_num,origin_city,origin_state,dest_city,dest_state,departure_delay,taxi_out,arrival_delay,canceled,actual_time,distance
713473,2015,7,22,3,AS,65,Wrangell AK,Alaska,Petersburg AK,Alaska,-13.0,3.0,-19.0,0,14.0,31.0
395267,2005,7,14,4,OO,6107,Los Angeles CA,California,Oxnard/Ventura CA,California,0.0,2.0,-15.0,0,15.0,49.0


Yes, the FLIGHTS table has the index idx1.
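Besides forcing the index with `INDEXED BY`, the schema itself can be inspected. The following is a minimal sketch, assuming a toy in-memory table in place of the real database file (the reduced column list is an illustration, not the actual schema); it lists the indexes that SQLite records for `Flights` in `sqlite_master`.

```python
import sqlite3

# Toy in-memory stand-in for the flights database (assumption: reduced schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Flights (fid INTEGER PRIMARY KEY, carrier_id TEXT, "
    "origin_city TEXT, dest_city TEXT, actual_time REAL)"
)
conn.execute("CREATE INDEX idx1 ON Flights(actual_time)")

# sqlite_master holds one row per schema object; keep only indexes on Flights.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'Flights' ORDER BY name"
    )
]
print(index_names)
```

In the sqlite3 shell, `.indexes Flights` gives the same information.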

Q1.4. (1.5 points) Please check whether each query used the index or not. 

**Hint:** you can use `EXPLAIN QUERY PLAN` to see the query plan of each query. Indicate for each query if it used the index or not. 

In [5]:
%%sql
EXPLAIN QUERY PLAN SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Seattle WA' AND actual_time <= 180

 * sqlite:///flight.db
Done.


selectid,order,from,detail
0,0,0,SEARCH TABLE Flights USING INDEX idx1 (actual_time<?)
0,0,0,USE TEMP B-TREE FOR DISTINCT


The index idx1 was used.

In [6]:
%%sql
EXPLAIN QUERY PLAN SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Gunnison CO' AND actual_time <= 180

 * sqlite:///flight.db
Done.


selectid,order,from,detail
0,0,0,SEARCH TABLE Flights USING INDEX idx1 (actual_time<?)
0,0,0,USE TEMP B-TREE FOR DISTINCT


The index idx1 was used.

In [7]:
%%sql
EXPLAIN QUERY PLAN SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Seattle WA' AND actual_time <= 30

 * sqlite:///flight.db
Done.


selectid,order,from,detail
0,0,0,SEARCH TABLE Flights USING INDEX idx1 (actual_time<?)
0,0,0,USE TEMP B-TREE FOR DISTINCT


The index idx1 was used.
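The three plan checks above can also be scripted in one loop. This is a sketch under stated assumptions: it runs against an empty in-memory table with a reduced schema, so the plans it prints may differ from those on the real data. The plan's detail text is read from the last column of each row.

```python
import sqlite3

# Toy in-memory stand-in for the flights database (assumption: reduced schema, no rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Flights (fid INTEGER PRIMARY KEY, carrier_id TEXT, "
    "origin_city TEXT, dest_city TEXT, actual_time REAL)"
)
conn.execute("CREATE INDEX idx1 ON Flights(actual_time)")

# Parameters of queries (a), (b), and (c).
queries = {
    "a": ("Seattle WA", 180),
    "b": ("Gunnison CO", 180),
    "c": ("Seattle WA", 30),
}

plans = {}
for label, (city, max_time) in queries.items():
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT DISTINCT carrier_id FROM Flights "
        "WHERE origin_city = ? AND actual_time <= ?",
        (city, max_time),
    ).fetchall()
    # The last column of every plan row is the human-readable detail string.
    plans[label] = [row[-1] for row in rows]
    print(label, plans[label])
```

A detail string that contains `USING INDEX idx1` indicates that the index was chosen for that query.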

Now, consider this query:

```sqlite
(d) SELECT DISTINCT F2.origin_city
     FROM Flights F1, Flights F2
     WHERE F1.dest_city = F2.dest_city
         AND F1.origin_city='Gunnison CO'
         AND F1.actual_time <= 30;
```

Q1.5. (2 points) Choose one simple index (index on one attribute), different from the index for the question above, that is likely to speed up this query. Write down the `CREATE INDEX` statement.

In [8]:
%%sql
CREATE INDEX idx2 on Flights(origin_city)

 * sqlite:///flight.db
Done.


[]

Check whether the FLIGHTS table has this second index that you indicate above. If not, add this index to the FLIGHTS table. 

In [9]:
%%sql
SELECT * FROM Flights INDEXED BY idx2 WHERE origin_city = 'Seattle WA' AND actual_time <= 35

 * sqlite:///flight.db
Done.


fid,year,month_id,day_of_month,day_of_week_id,carrier_id,flight_num,origin_city,origin_state,dest_city,dest_state,departure_delay,taxi_out,arrival_delay,canceled,actual_time,distance
555195,2005,7,6,3,WN,1216,Seattle WA,Washington,Spokane WA,Washington,10.0,5.0,-10.0,0,35.0,224.0
610178,2005,7,25,1,WN,1678,Seattle WA,Washington,Spokane WA,Washington,0.0,12.0,-22.0,0,33.0,224.0


Now we want to know how effective the two indexes are. We compare the runtimes of the queries with and without indexes. 

**Hint:** Use `.timer on` in the sqlite3 command-line shell to turn the SQL timer on.
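Outside the sqlite3 shell, a comparable measurement can be taken with `time.perf_counter`. The sketch below is only an illustration: the rows are synthetic, the city names are invented, and the helper `timed` is hypothetical. It times the same query before and after creating an index on `origin_city` and confirms that the result does not change.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Flights (fid INTEGER PRIMARY KEY, carrier_id TEXT, "
    "origin_city TEXT, actual_time REAL)"
)
# Synthetic rows, invented purely for illustration.
rows = [
    (i, "C%d" % (i % 10), "City %d" % (i % 100), float(i % 400))
    for i in range(20000)
]
conn.executemany("INSERT INTO Flights VALUES (?, ?, ?, ?)", rows)

QUERY = (
    "SELECT DISTINCT carrier_id FROM Flights "
    "WHERE origin_city = 'City 7' AND actual_time <= 180"
)

def timed(sql):
    """Run sql once and return (result rows, elapsed seconds)."""
    start = time.perf_counter()
    result = conn.execute(sql).fetchall()
    return result, time.perf_counter() - start

before, t_before = timed(QUERY)
conn.execute("CREATE INDEX idx2 ON Flights(origin_city)")
after, t_after = timed(QUERY)

print("without index: %.3f ms" % (t_before * 1000))
print("with index:    %.3f ms" % (t_after * 1000))
```

A single run is noisy, so repeating each measurement several times and comparing the medians gives a fairer picture.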

Q1.6. (2 points) Execute queries (a) to (d) on the FLIGHTS table that does not have the two indexes. Please record the runtime of each query. 

In [10]:
%%sql
DROP INDEX idx1

 * sqlite:///flight.db
Done.


[]

In [11]:
%%sql
DROP INDEX idx2 

 * sqlite:///flight.db
Done.


[]

In [12]:
%%time
%%sql
SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Seattle WA' AND actual_time <= 180

 * sqlite:///flight.db
Done.
CPU times: user 102 ms, sys: 21.9 ms, total: 124 ms
Wall time: 124 ms


carrier_id
AS
DL
EV
F9
HP
NW
OO
UA
WN
AA


In [13]:
%%time
%%sql
SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Gunnison CO' AND actual_time <= 180

 * sqlite:///flight.db
Done.
CPU times: user 113 ms, sys: 25.7 ms, total: 138 ms
Wall time: 139 ms


carrier_id
OO


In [14]:
%%time
%%sql
SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Seattle WA' AND actual_time <= 30

 * sqlite:///flight.db
Done.
CPU times: user 130 ms, sys: 27 ms, total: 157 ms
Wall time: 156 ms


carrier_id


In [15]:
%%time
%%sql
SELECT DISTINCT F2.origin_city
     FROM Flights F1, Flights F2
     WHERE F1.dest_city = F2.dest_city
         AND F1.origin_city='Gunnison CO'
         AND F1.actual_time <= 30

 * sqlite:///flight.db
Done.
CPU times: user 96.4 ms, sys: 35.2 ms, total: 132 ms
Wall time: 131 ms


origin_city


Q1.7. (2 points) Execute queries (a) to (d) on the FLIGHTS table that has the two indexes. Please record the runtime of each query. 

In [16]:
%%sql
CREATE INDEX idx1 on Flights(actual_time)


 * sqlite:///flight.db
Done.


[]

In [17]:
%%sql
CREATE INDEX idx2 on Flights(origin_city)

 * sqlite:///flight.db
Done.


[]

In [18]:
%%time
%%sql
SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Seattle WA' AND actual_time <= 180

 * sqlite:///flight.db
Done.
CPU times: user 12.9 ms, sys: 7.02 ms, total: 20 ms
Wall time: 20.7 ms


carrier_id
AS
DL
EV
F9
HP
NW
OO
UA
WN
AA


In [19]:
%%time
%%sql
SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Gunnison CO' AND actual_time <= 180

 * sqlite:///flight.db
Done.
CPU times: user 1.93 ms, sys: 0 ns, total: 1.93 ms
Wall time: 2.11 ms


carrier_id
OO


In [20]:
%%time
%%sql
SELECT DISTINCT carrier_id
     FROM Flights
     WHERE origin_city = 'Seattle WA' AND actual_time <= 30

 * sqlite:///flight.db
Done.
CPU times: user 11.7 ms, sys: 10.2 ms, total: 21.9 ms
Wall time: 23.9 ms


carrier_id


In [21]:
%%time
%%sql
SELECT DISTINCT F2.origin_city
     FROM Flights F1, Flights F2
     WHERE F1.dest_city = F2.dest_city
         AND F1.origin_city='Gunnison CO'
         AND F1.actual_time <= 30

 * sqlite:///flight.db
Done.
CPU times: user 11.5 s, sys: 337 ms, total: 11.8 s
Wall time: 11.8 s


origin_city


Runtime summary (wall time):

 - (a) before index: 124 ms; after index: 20.7 ms
 - (b) before index: 139 ms; after index: 2.11 ms
 - (c) before index: 156 ms; after index: 23.9 ms
 - (d) before index: 131 ms; after index: 11.8 s


## Q2 (6 points): B-Trees

Assume:

    (1) blocks can hold either 10 records or 99 keys and 100 pointers
    (2) the average B-tree node is 70% full. This means it will have 69 keys and 70 pointers. 

We can use B-trees as part of several different structures. For each structure described in the questions Q2.1 to Q2.3 below, determine: 

    (a) the total number of blocks needed for a 1,000,000-record file
    (b) the average number of disk I/O’s to retrieve a record given its search key

You may assume nothing is in memory initially, and the search key is the primary key for the records.

**Q2.1. (2 points) The data file is a sequential file, sorted on the search key, with 10 records per block. The B-tree is a dense index.**

a) 114,494 blocks = 100,000 (= 1,000,000/10, for data; @ lowest level) + 14,286 (= 1,000,000/70 @ leaf level) + 204 (= 14,286/70 @ pre-leaf level) + 3 (= 204/70 @ second level) + 1 (the root)

b) 5 disk I/Os: the B-tree has four levels, so a lookup reads four index blocks (root to leaf) and then one data block.

**Q2.2. (2 points) The data file is a sequential file, sorted on the search key, with 10 records per block. The B-tree is a sparse index.**

a) 101,450 blocks = 100,000 (= 1,000,000/10, for data; @ lowest level) + 1,429 (= 100,000/70 @ leaf level) + 20 (= 1,429/70 @ second level) + 1 (the root)

b) 4 disk I/Os: three index blocks (root, second level, leaf) plus one data block.
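The arithmetic in Q2.1 and Q2.2 can be checked with a short script. This is a sketch, assuming each level's node count is rounded to the nearest integer, as the figures above do (rounding up would give slightly different counts). The function name `index_levels` is hypothetical.

```python
def index_levels(entries, fanout=70):
    """Node counts per B-tree level, leaves first.

    Assumption: counts are rounded to the nearest integer, matching the
    worked answers, and a level that fits in one node becomes the root.
    """
    levels = [round(entries / fanout)]
    while levels[-1] > 1:
        parents = levels[-1] / fanout
        levels.append(1 if parents <= 1 else round(parents))
    return levels

RECORDS = 1_000_000
DATA_BLOCKS = RECORDS // 10  # 10 records per data block

# Dense index: one leaf entry per record.
dense = index_levels(RECORDS)
dense_total = DATA_BLOCKS + sum(dense)
dense_ios = len(dense) + 1  # one read per index level, plus the data block

# Sparse index: one leaf entry per data block.
sparse = index_levels(DATA_BLOCKS)
sparse_total = DATA_BLOCKS + sum(sparse)
sparse_ios = len(sparse) + 1

print("dense: ", dense, dense_total, dense_ios)
print("sparse:", sparse, sparse_total, sparse_ios)
```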

**Q2.3. (2 points) The data file consists of records in no particular order, packed 10 to a block. The B-tree is a dense index.**

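a) 114,494 blocks, the same as in Q2.1. A dense index has one key-pointer pair per record whether or not the data file is sorted, so the B-tree still needs 14,286 + 204 + 3 + 1 = 14,494 blocks, and the data file, packed 10 records to a block, still occupies 100,000 blocks (= 1,000,000/10).

b) 5 disk I/Os: four index blocks (root to leaf) plus one data block.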

## Q3 (9 points): Query Processing


In the first assignment, given the bank database below:

 - Customer = {<span style="text-decoration:underline">customerID</span>, firstName, lastName, income, birthDate}
 - Account = {<span style="text-decoration:underline">accNumber</span>, type, balance, branchNumber<sup>FK-Branch</sup>}
 - Owns = {<span style="text-decoration:underline">customerID</span><sup>FK-Customer</sup>, <span style="text-decoration:underline">accNumber</span><sup>FK-Account</sup>}
 - Transactions = {<span style="text-decoration:underline">transNumber</span>, <span style="text-decoration:underline">accNumber</span><sup>FK-Account</sup>, amount}
 - Employee = {<span style="text-decoration:underline">sin</span>, firstName, lastName, salary, branchNumber<sup>FK-Branch</sup>}
 - Branch = {<span style="text-decoration:underline">branchNumber</span>, branchName, managerSIN<sup>FK-Employee</sup>, budget}

you wrote a SQL query to:

Show account number, account type, account balance, and transaction amount of the accounts with balance higher than 100,000 and transaction amounts higher than 15,000, starting with the accounts with the highest transaction amount and highest account balance. 
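For reference, one plausible form of such a query is sketched here against a tiny in-memory database. The sample rows are invented for illustration, the column types are assumptions, and the query written in the first assignment may differ in its details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the two tables the query touches; column types are assumptions.
conn.executescript("""
CREATE TABLE Account (
    accNumber INTEGER PRIMARY KEY, type TEXT, balance REAL, branchNumber INTEGER);
CREATE TABLE Transactions (
    transNumber INTEGER, accNumber INTEGER, amount REAL,
    PRIMARY KEY (transNumber, accNumber));
INSERT INTO Account VALUES
    (1, 'SAV', 250000, 1), (2, 'CHQ', 50000, 1), (3, 'BUS', 120000, 2);
INSERT INTO Transactions VALUES
    (1, 1, 20000), (2, 1, 500), (3, 2, 30000), (4, 3, 16000);
""")

rows = conn.execute("""
    SELECT A.accNumber, A.type, A.balance, T.amount
    FROM Account A
    JOIN Transactions T ON A.accNumber = T.accNumber
    WHERE A.balance > 100000 AND T.amount > 15000
    ORDER BY T.amount DESC, A.balance DESC
""").fetchall()

for row in rows:
    print(row)
```

Account 2 is filtered out by the balance condition, and transaction 2 by the amount condition, so only two rows remain.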

Q3.1. (3 points) Parse your query into a query parse tree.

<img src="https://drive.google.com/uc?export=view&id=1te8rGZKPJtk-V7zVn_xG0iTixKt_HxkD" alt="ParseTree.png" style="width: 800px;"/>


Q3.2. (3 points) Convert your parse tree to the equivalent relational algebraic representation (rewrite if necessary).

<img src="https://drive.google.com/uc?export=view&id=1RlXkanDqA0Fi5q6zbPPcSZR1-VTd6dxF" alt="QP.png" style="width: 800px;"/> 

Q3.3. (3 points) Assume you have a million records in each of the six tables above. If needed, make the necessary assumptions about your storage blocks, as well as about the characteristics of the data in bank.db. Can you enumerate the size and cost of the intermediate tables in your query plan?




**Accounts with a balance over 100,000**

This is a range query, so the result size depends on how account balances are distributed.

Assume 10% of accounts have a balance over 100,000.

T(B) = 1M * 10% = 0.1M

V(B, accNumber) = 0.1M (since accNumber is the primary key of Account)

**Transactions with an amount over 15,000**

This is a range query, so the result size depends on how transaction amounts are distributed.

Assume 1% of transactions have an amount over 15,000.

T(A) = 1M * 1% = 0.01M

Assume each transaction over 15,000 belongs to a different account, so V(A, accNumber) = 0.01M.

**Join of Account and Transactions**

T(U) = T(B ⋈ A) = T(B) * T(A) / max(V(B, accNumber), V(A, accNumber))

max(V(B, accNumber), V(A, accNumber)) = 0.1M

Therefore, T(U) = 0.1M * 0.01M / 0.1M = 0.01M

V(U, accNumber) = 0.01M

The final projection retrieves 4 columns from U.
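The size estimates above can be reproduced numerically. This is a minimal sketch: the 10% and 1% selectivities are the assumptions stated in the answer, and the function name `estimate_join_size` is hypothetical.

```python
def estimate_join_size(t_left, t_right, v_left, v_right):
    """Equijoin size estimate: T(L) * T(R) / max(V(L, a), V(R, a))."""
    return t_left * t_right // max(v_left, v_right)

TABLE_ROWS = 1_000_000  # one million records per table, as given

# Selections (assumed selectivities: 10% of accounts, 1% of transactions).
t_b = TABLE_ROWS * 10 // 100   # accounts with balance > 100,000
t_a = TABLE_ROWS * 1 // 100    # transactions with amount > 15,000

# Distinct accNumber values: primary key in B; assumed all distinct in A.
v_b = t_b
v_a = t_a

t_u = estimate_join_size(t_b, t_a, v_b, v_a)
print(t_b, t_a, t_u)
```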



## Submission

Complete the answers to the questions in the [hw4.ipynb](hw4.ipynb) notebook, zip the notebook together with any additional files you used into a file named HW4.zip, and submit it through Canvas to your Homework (4) activity.