# SQL index

1. What is an index?
2. How do we use an index?
3. When should we use an index, and when should we not?

A table scan is when the database reads the table row by row to find matching rows. The cost of a table scan is proportional to the number of rows.

An index is used to avoid table scans, which can take a very long time on large tables.

The query planner decides how the database is going to run a query, and you can ask it to show that plan.

```sql
explain query plan
select * from movies where director = 'Guy Ritchie';
```
This is a database-specific command: `explain query plan` works in SQLite, while other databases have their own variants (for example, `EXPLAIN` in MySQL and PostgreSQL).
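To see what the planner reports, here is a minimal sketch that runs the command through Python's built-in `sqlite3` module. The in-memory `movies` table and the helper name `query_plan` are assumptions made for illustration; the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

# Hypothetical in-memory table; column names mirror the movies examples in these notes.
conn = sqlite3.connect(":memory:")
conn.execute("create table movies (title text, director text)")


def query_plan(conn, sql):
    """Return the human-readable detail text of each step in the plan."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


plan = query_plan(conn, "select * from movies where director = 'Guy Ritchie'")
print(plan)  # no index exists yet, so the step should describe a SCAN of movies
```

A step that starts with `SCAN` means a table scan; a step that starts with `SEARCH` means the database can jump to the matching rows.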

### Drawbacks of Using Indexes

* `Storage Overhead:` Indexes require additional storage space, which can be substantial for large tables or multiple indexes.
* `Slower Write Operations:` INSERT, UPDATE, and DELETE operations can be slower because the database must update the index each time the data changes.
* `Maintenance Overhead:` Indexes need to be maintained and updated as data is added, modified, or deleted, which can increase database maintenance costs.
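The storage overhead can be observed directly. The sketch below, using Python's built-in `sqlite3`, counts database pages before and after creating an index. The table contents are made-up sample data, and the actual page counts depend on the SQLite build and page size, so only the direction of the change is meaningful.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table movies (title text, director text)")
# Made-up sample rows, only there to give the index something to store.
conn.executemany(
    "insert into movies values (?, ?)",
    [(f"title {i}", f"director {i % 100}") for i in range(10_000)],
)
conn.commit()


def page_count(conn):
    """Number of pages the database currently occupies."""
    return conn.execute("pragma page_count").fetchone()[0]


pages_before = page_count(conn)
conn.execute("create index idx_director on movies (director)")
conn.commit()
pages_after = page_count(conn)

# The index is a second structure stored next to the table, so the page count grows.
print(pages_before, pages_after)
```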


An index is a separate data structure that the database uses to look up data quickly.

### Creating an Index
```sql
select
  *
from
  movies
where
  director = 'Guy Ritchie';
  
create index idx_director on movies (director); -- index the column that the where clause filters on
drop index idx_director; -- remove the index when it is no longer needed
```

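An index is not always the best choice. On a small table, or for a query that returns most of the rows, a table scan can be just as fast, and every index adds the costs listed above.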

### B+ Trees

Why does using an index speed up look-ups?
A B+ tree looks like a binary search tree (BST), but it is balanced, each node can hold many keys and have more than two children, and all data is stored in the leaf nodes as pairs of indexed value and primary key.
Every look-up walks from the root down to a leaf node. This design has several benefits:
1. **Efficient Disk I/O:** B+ trees are optimized for systems where disk I/O is the bottleneck. They `minimize the number of disk accesses` required to locate data, which is crucial for databases that store large amounts of data on disk.

2. **Minimized Tree Height:** By `allowing multiple keys and children per node`, B+ trees reduce the height of the tree, ensuring that searches, insertions, and deletions can be performed quickly.

3. **Range Queries:** `B+ trees are particularly good for range queries` (e.g., finding all values between two keys). `Because all leaf nodes are linked`, it's easy to traverse from one leaf to the next without needing to go back up the tree.

4. **Balanced Structure:** B+ trees maintain `a balanced structure`, ensuring `consistent performance for all operations.` This balancing is done automatically during insertion and deletion operations, making B+ trees self-balancing.
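The points above can be illustrated with a deliberately simplified model. This is a sketch, not how any database implements its B+ tree: it is read-only, has only two levels (one root plus linked leaves), is built once from sorted data, and assumes unique keys and at least one item. The names `Leaf`, `TinyBPlusTree`, and `range_scan` are hypothetical.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, keys, values):
        self.keys = keys      # sorted keys stored in this leaf
        self.values = values  # values[i] belongs to keys[i]
        self.next = None      # link to the next leaf, used by range scans


class TinyBPlusTree:
    """Read-only, two-level B+ tree sketch built by bulk loading sorted data."""

    def __init__(self, items, leaf_size=4):
        items = sorted(items, key=lambda kv: kv[0])
        self.leaves = []
        for i in range(0, len(items), leaf_size):
            chunk = items[i:i + leaf_size]
            leaf = Leaf([k for k, _ in chunk], [v for _, v in chunk])
            if self.leaves:
                self.leaves[-1].next = leaf  # chain the leaves together
            self.leaves.append(leaf)
        # Root node: the smallest key of every leaf except the first.
        self.separators = [leaf.keys[0] for leaf in self.leaves[1:]]

    def _find_leaf(self, key):
        # One binary search in the root picks the only leaf that can hold key.
        return self.leaves[bisect_right(self.separators, key)]

    def get(self, key):
        leaf = self._find_leaf(key)
        i = bisect_left(leaf.keys, key)
        if i < len(leaf.keys) and leaf.keys[i] == key:
            return leaf.values[i]
        return None

    def range_scan(self, low, high):
        """Yield (key, value) pairs with low <= key <= high."""
        leaf = self._find_leaf(low)
        while leaf is not None:
            for k, v in zip(leaf.keys, leaf.values):
                if k > high:
                    return
                if k >= low:
                    yield k, v
            leaf = leaf.next  # follow the leaf link instead of going back to the root


tree = TinyBPlusTree([(year, f"movie {year}") for year in range(2000, 2020)])
print(tree.get(2010))
print(list(tree.range_scan(2003, 2006)))
```

The range scan descends from the root only once; after that it moves sideways through the linked leaves, which is the property point 3 describes.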

### Indexes and Keys

In a database, when you create an index, the `index is stored in a B+ tree format`. This structure allows the database to efficiently traverse from the root of the tree to the appropriate `leaf node where the indexed key is stored`. The leaf nodes of the B+ tree contain pointers to the actual data records in the table.

* **Data Retrieval:** Once the leaf node is found, it provides `a direct pointer to the location of the actual data record`. This means the data can be retrieved in a similarly efficient manner, without the need to perform a full table scan.

* In a **clustered index,** `the actual table data is stored in the leaf nodes of the B+ tree itself.` Therefore, searching for data using a clustered index involves traversing the B+ tree and directly reaching the data stored in its leaf nodes. (The primary key is typically the clustered index for the table.)

* For **non-clustered indexes**, `the B+ tree stores pointers (references) to the actual rows in the table.` After finding the index entry using the B+ tree, a second step is required to access the actual table data using these pointers.
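SQLite's query plans make this difference visible. In the sketch below (Python's built-in `sqlite3`), the table itself is keyed by its integer primary key, which plays the role of the clustered index, while `idx_director` is a separate, non-clustered index. The table definition and the helper name `plan_detail` are assumptions, and the plan wording varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table movies (id integer primary key, title text, director text)"
)
conn.execute("create index idx_director on movies (director)")


def plan_detail(conn, sql):
    """Join the detail text of all plan steps into one string."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Look-up by primary key: the table's own tree is searched, no extra index needed.
by_pk = plan_detail(conn, "select * from movies where id = 1")

# Look-up through the secondary index: find the entry, then fetch the row (second step).
by_director = plan_detail(
    conn, "select * from movies where director = 'Guy Ritchie'"
)

# Every requested column is in the index, so the second step is skipped.
covered = plan_detail(
    conn, "select director from movies where director = 'Guy Ritchie'"
)

print(by_pk)
print(by_director)
print(covered)
```

The third query is reported as using a covering index: since the index already contains everything the query asks for, the database never has to visit the table rows.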

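In MySQL (InnoDB), a hidden row ID is not created by default. InnoDB generates one only when the table has neither a primary key nor a unique index whose columns are all NOT NULL.

The database automatically creates an index for a column declared UNIQUE, even if that column is not the primary key.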
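In SQLite the automatically created index can be listed. A minimal sketch with Python's built-in `sqlite3` follows; the `imdb_id` column is a hypothetical example of a unique column, and other databases name their automatic indexes differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# imdb_id is a hypothetical unique column; no index is created explicitly.
conn.execute("create table movies (title text, imdb_id text unique)")

# pragma index_list returns one row per index; the second column is the index name.
index_names = [row[1] for row in conn.execute("pragma index_list(movies)")]
print(index_names)

rows = conn.execute(
    "explain query plan select * from movies where imdb_id = 'tt0208092'"
).fetchall()
plan = " | ".join(row[-1] for row in rows)
print(plan)  # the look-up should go through the automatically created index
```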

### Multi-Column Indexes

1. **Order of Columns in Index Matters:**

   * The order in which columns are declared in a multi-column index significantly affects how the index is used by the SQL query optimizer.
   * The SQL engine can use the index for efficient lookups only if the query's WHERE clause conditions align with the leading (left-most) columns of the index.

2. **Index Utilization Stops After a Range Condition:**

   * When a range condition (like <, >, BETWEEN, or !=) is applied to one of the indexed columns, the use of the index can stop at that column: the columns that come after it in the index no longer narrow the search.
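Both rules can be checked against SQLite's planner. The sketch below (Python's built-in `sqlite3`) uses a hypothetical index `idx_demo` on `(release_date, rating, revenue)` and a hypothetical helper `plan_detail`. No `ANALYZE` statistics are collected, and the plan wording varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table movies "
    "(title text, release_date integer, rating integer, revenue integer)"
)
conn.execute("create index idx_demo on movies (release_date, rating, revenue)")


def plan_detail(conn, sql):
    """Join the detail text of all plan steps into one string."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Rule 1: a condition on the leading column can use the index.
leading = plan_detail(conn, "select title from movies where release_date = 2022")

# Rule 1: a condition that skips the leading column falls back to a scan.
not_leading = plan_detail(conn, "select title from movies where rating = 7")

# Rule 2: after the range condition on rating, revenue is no longer used by the index.
after_range = plan_detail(
    conn,
    "select title from movies "
    "where release_date = 2022 and rating > 5 and revenue = 100",
)

print(leading)
print(not_leading)
print(after_range)
```

In the last plan, the conditions listed in parentheses show how far into the index the search reaches: it stops at `rating`, and `revenue` is checked row by row afterwards.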

```sql
create index idx on movies (revenue, release_date, rating);

explain query plan
select
  title
from
  movies
where
  release_date = 2022
  and rating = 7
  and revenue > 100; -- only the revenue condition uses idx, because revenue is its leading column
```
To optimize the above query, consider changing the index order if the primary use case involves equality checks on release_date and rating:
```sql
-- Revised Multi-Column Index
CREATE INDEX idx_movies_optimized ON movies (release_date, rating, revenue);
```

With this revised index:

* **release_date = 2022 and rating = 7:** The index can efficiently use both conditions because they are equality conditions on the leading columns.
* **revenue > 100:** The range condition now applies to the last column in the index. The index can still be used to filter by revenue after release_date and rating have been used for equality filtering.
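The effect of the column order can be compared side by side. This sketch (Python's built-in `sqlite3`) builds the same table twice, once per index order, and prints the plan for the query above. The column types and the helper name `plan_with_index` are assumptions, and the plan wording varies by SQLite version.

```python
import sqlite3

QUERY = (
    "select title from movies "
    "where release_date = 2022 and rating = 7 and revenue > 100"
)


def plan_with_index(index_columns):
    """Create a fresh movies table with one index and return the query plan text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table movies "
        "(title text, release_date integer, rating integer, revenue integer)"
    )
    conn.execute(f"create index idx on movies ({index_columns})")
    rows = conn.execute("explain query plan " + QUERY).fetchall()
    conn.close()
    return " | ".join(row[-1] for row in rows)


original = plan_with_index("revenue, release_date, rating")
revised = plan_with_index("release_date, rating, revenue")

print(original)  # the equality conditions cannot be used for the index search
print(revised)   # both equality conditions and the revenue range are used
```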