### Optimizing Postgres Databases: Vacuuming Postgres Databases
#### These are exercises done as part of <a href = "www.dataquest.io"> DataQuest</a>'s Data Engineer Path
This is not replicated for commercial use; strictly personal development.<br>
All exercises are (c) DataQuest, with slight modifications so they run against the Postgres server on my localhost.

> Shouldn't the speed be the same? Why would query speeds be affected by a few deletes? In this mission, we will learn the process by which Postgres runs destructive commands, the reason why it can have a non-trivial effect on querying speeds, and the internal tools to reclaim the lost speed.
>
>DataQuest

#### Vacuuming Postgres Databases
<b>1.</b> Instructions:
- Use the provided `cur` object.
- Run the `DELETE FROM` command on `homeless_by_coc` to delete all the rows in the table.
- Reload the data by running `INSERT` or running a `COPY FROM` psycopg2 cursor query that loads data from the  `homeless_by_coc.csv` file into the `homeless_by_coc` table.
    - Commit your changes.
- Using `execute()`, count the number of rows from `homeless_by_coc`.
- Assign the `int` return value to `homeless_rows`.

```python
import psycopg2

conn = psycopg2.connect(dbname="dq", user="hud_admin", password="abc123")
cur = conn.cursor()

cur.execute("DELETE FROM homeless_by_coc")

filename = 'homeless_by_coc.csv'
with open(filename) as f:
    cur.copy_expert('COPY homeless_by_coc FROM STDIN WITH CSV HEADER', f)
conn.commit()

cur.execute("SELECT COUNT(*) FROM homeless_by_coc")
homeless_rows = cur.fetchone()[0]
```

`DELETE`<br>
Instead of removing the rows from the table immediately, Postgres marks them as dead. Once the transaction commits, the dead rows become eligible for removal, but they stay on disk until a vacuum cleans them up.<br>
- Dead rows help keep consistency and isolation within a transaction.<br>
- Dead rows increase table size and lengthen query times.
- To check whether a table has any lingering dead rows, we query `pg_stat_all_tables`, a statistics view in `pg_catalog` that contains a collection of helpful per-table statistics. Its `n_dead_tup` column holds the estimated number of dead rows.
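
To make the idea of dead rows concrete, here is a toy model in plain Python. It is only an illustration of "mark as dead now, remove later"; the `ToyTable` class and its methods are made up for this note and say nothing about how Postgres actually stores tuples on disk.

```python
class ToyTable:
    """Toy model of dead rows; NOT how Postgres really stores tuples."""

    def __init__(self):
        self._rows = []  # each entry is [value, is_dead]

    def insert(self, value):
        self._rows.append([value, False])

    def delete_all(self):
        # Mark every row as dead instead of dropping it right away.
        for row in self._rows:
            row[1] = True

    def vacuum(self):
        # Physically discard the rows that were marked dead.
        self._rows = [row for row in self._rows if not row[1]]

    @property
    def live(self):
        return sum(1 for _, is_dead in self._rows if not is_dead)

    @property
    def dead(self):
        return sum(1 for _, is_dead in self._rows if is_dead)


table = ToyTable()
for value in ("a", "b", "c"):
    table.insert(value)

table.delete_all()              # the three rows are now dead, not gone
for value in ("a", "b", "c"):
    table.insert(value)         # the reload adds three fresh live rows

before = (table.live, table.dead)
table.vacuum()
after = (table.live, table.dead)
print(before, after)
```

After the delete and reload the toy table holds as many dead rows as live ones, which mirrors why a scan has more to wade through until a vacuum runs.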

Transactions are a way to ensure multiple users can concurrently run commands.<br>

All transactions follow a specific set of properties called ACID.

- Atomicity: If one thing fails in the transaction, the whole transaction fails.
- Consistency: A transaction will move the database from one valid state to another.
- Isolation: Concurrent transactions leave the database in the same state as if they had run one after another.
- Durability: Once the transaction is committed, it stays committed regardless of a crash, power outage, etc.
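
Atomicity is the easiest of these properties to see in code. The sketch below uses the standard library's `sqlite3` module as a stand-in so it runs without a Postgres server; the `demo` table and the `insert_all_or_nothing` helper are made up for this note. With psycopg2 the same idea is expressed through `conn.commit()` and `conn.rollback()`.

```python
import sqlite3


def insert_all_or_nothing(conn, rows):
    """Insert every row in one transaction; roll everything back on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for name, value in rows:
                conn.execute(
                    "INSERT INTO demo (name, value) VALUES (?, ?)",
                    (name, value),
                )
    except sqlite3.IntegrityError:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (name TEXT PRIMARY KEY, value INTEGER NOT NULL)")
conn.commit()

# The second row violates NOT NULL, so the first row must not survive either.
failed_ok = insert_all_or_nothing(conn, [("a", 1), ("b", None)])
count_after_failure = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]

# A batch with only valid rows is committed as a whole.
success_ok = insert_all_or_nothing(conn, [("a", 1), ("b", 2)])
count_after_success = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
```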

<b>2.</b> Instructions
- Use the provided `cur` object.
- Before the `DELETE` command, find the number of dead rows for the `homeless_by_coc` table.
    - Print the result.
- After loading the table, find the number of dead rows for the `homeless_by_coc` table.
- Assign the `int` return value to `homeless_dead_rows`.

```python
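import psycopg2

conn = psycopg2.connect(dbname="dq", user="hud_admin", password="abc123")
cur = conn.cursor()

dead_rows_query = "SELECT n_dead_tup FROM pg_stat_all_tables WHERE relname = 'homeless_by_coc'"

# Dead rows before the delete.
cur.execute(dead_rows_query)
print(cur.fetchone()[0])  # prints 0

# Delete every row, then reload the table from the CSV file.
cur.execute("DELETE FROM homeless_by_coc")
with open('homeless_by_coc.csv') as f:
    cur.copy_expert('COPY homeless_by_coc FROM STDIN WITH CSV HEADER', f)
conn.commit()

# Dead rows after the reload: query n_dead_tup again, not COUNT(*).
# Expect roughly the 86529 deleted rows; the statistics may lag slightly.
cur.execute(dead_rows_query)
homeless_dead_rows = cur.fetchone()[0]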
```

<font color = 'blue'>Now to try this on the Valenbisi data. I've deleted and re-created these tables many times, so I expected to find dead rows. However, recent versions of Postgres routinely clean up dead rows for you.</font>

```python
import psycopg2
import pprint as pp

conn = psycopg2.connect(dbname="valenbisi2018", user="nmolivo")
cur = conn.cursor()

cur.execute("SELECT n_dead_tup FROM pg_stat_all_tables WHERE relname = 'vbstatic'")
print(cur.fetchone()[0])
```

```
[Output]
0
```
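
A count of zero dead rows does not say whether a vacuum has actually run. As a rough sketch, the same view also exposes `n_live_tup`, `last_vacuum`, and `last_autovacuum`, so a small helper can fetch them together. The function name `fetch_vacuum_stats` is my own, and `cur` is assumed to be a psycopg2 cursor (which uses `%s` placeholders).

```python
def fetch_vacuum_stats(cur, table_name):
    """Return vacuum-related statistics for one table, or None if not found.

    `cur` is assumed to be a psycopg2 cursor. If the same table name exists
    in several schemas, only the first matching row is returned.
    """
    cur.execute(
        "SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum "
        "FROM pg_stat_all_tables WHERE relname = %s",
        (table_name,),
    )
    row = cur.fetchone()
    if row is None:
        return None
    keys = ("n_live_tup", "n_dead_tup", "last_vacuum", "last_autovacuum")
    return dict(zip(keys, row))
```

Calling `fetch_vacuum_stats(cur, 'vbstatic')` on the connection above should show a timestamp in `last_autovacuum` if autovacuum is what cleaned the table.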


<b>3.</b> Instructions:
- Use the provided `cur` object.
- Note, we have already deleted the rows for you.
- Try running a vacuum on `homeless_by_coc`.

`VACUUM`
- If you run `VACUUM` without a table name, it vacuums every user-created table the currently logged-in user has access to.
- Vacuuming a table removes the rows marked as dead.
- `VACUUM` cannot run inside a transaction block, so issuing it through a psycopg2 connection in its default mode fails.
- To run `VACUUM` outside a transaction block, we need to explicitly set the `autocommit` property of the `psycopg2` connection object.
    - By setting `autocommit` to `True`, you are signalling to the `psycopg2` driver that you do not want your queries to run in a transaction block.
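
One way to package the autocommit dance is a small helper that switches `autocommit` on, runs the vacuum, and restores the previous setting. This is only a sketch: `vacuum_table` is a name I made up, `conn` is assumed to be a psycopg2 connection with no transaction currently open, and the identifier check is deliberately strict because a table name cannot be passed as a query parameter.

```python
def vacuum_table(conn, table_name, analyze=False):
    """Run VACUUM (optionally with ANALYZE) on one table outside a transaction.

    Assumes `conn` is a psycopg2 connection that is not in the middle of a
    transaction; psycopg2 refuses to change autocommit while one is open.
    """
    # Conservative check: rejects anything that is not a plain identifier,
    # which also rules out schema-qualified names such as "public.t".
    if not table_name.isidentifier():
        raise ValueError("unexpected table name: %r" % (table_name,))

    previous = conn.autocommit
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    try:
        cur = conn.cursor()
        try:
            command = "VACUUM ANALYZE " if analyze else "VACUUM "
            cur.execute(command + table_name)
        finally:
            cur.close()
    finally:
        conn.autocommit = previous  # put the connection back as we found it
```

For example, `vacuum_table(conn, 'homeless_by_coc')` would issue `VACUUM homeless_by_coc` and leave the connection's `autocommit` setting unchanged afterwards.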

<b>4. </b>Instructions:
- Use the provided `cur` and `conn` objects.
- Disable transaction blocks on the connection object.
- Find the number of dead rows for the `homeless_by_coc` table.
    - Print the result.
- Run a vacuum on `homeless_by_coc`.
- After vacuuming the table, find the number of dead rows for the `homeless_by_coc` table.
- Assign the int return value to `homeless_dead_rows`.

```python
import psycopg2

conn = psycopg2.connect(dbname="dq", user="hud_admin", password="abc123")
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# Dead rows before the vacuum.
cur.execute("SELECT n_dead_tup FROM pg_stat_all_tables WHERE relname='homeless_by_coc'")
print(cur.fetchone()[0])

cur.execute("VACUUM homeless_by_coc")

# Dead rows after the vacuum; fetchone()[0] yields the int, not a tuple.
cur.execute("SELECT n_dead_tup FROM pg_stat_all_tables WHERE relname='homeless_by_coc'")
homeless_dead_rows = cur.fetchone()[0]
```

<b>5. </b>Instructions:
- Use the provided `cur` and `conn` objects.
    - Set the connection to execute outside transaction blocks.
- Run an `EXPLAIN` query for a select all query on `homeless_by_coc`.
    - Pretty print the results.
- Vacuum analyze `homeless_by_coc`.
- Rerun the explain query.
- Pretty print the results from the explain query.

```python
import pprint as pp

import psycopg2

conn = psycopg2.connect(dbname="dq", user="hud_admin", password="abc123")
conn.autocommit = True
cur = conn.cursor()

cur.execute("EXPLAIN SELECT * FROM homeless_by_coc")
pp.pprint(cur.fetchall())

cur.execute("VACUUM ANALYZE homeless_by_coc")
cur.execute("EXPLAIN SELECT * FROM homeless_by_coc")
pp.pprint(cur.fetchall())
```

```
[Output]
[('Seq Scan on homeless_by_coc  (cost=0.00..2974.24 rows=41024 width=480)',)]
[('Seq Scan on homeless_by_coc  (cost=0.00..3429.29 rows=86529 width=88)',)]
```
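
The two plan lines are easier to compare once the numbers are pulled out. The sketch below parses the estimates from the `EXPLAIN` strings shown above; `parse_plan_line` is a helper I made up, and the regular expression only targets this simple one-line plan format. After `VACUUM ANALYZE`, the row estimate matches the 86529 rows actually in the table and the estimated row width drops from 480 to 88.

```python
import re

# Matches the "(cost=startup..total rows=N width=N)" suffix of a plan line.
PLAN_PATTERN = re.compile(
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)


def parse_plan_line(line):
    """Extract the planner's estimates from a single EXPLAIN output line."""
    match = PLAN_PATTERN.search(line)
    if match is None:
        raise ValueError("not a recognised plan line: %r" % (line,))
    return {
        "startup_cost": float(match.group("startup")),
        "total_cost": float(match.group("total")),
        "rows": int(match.group("rows")),
        "width": int(match.group("width")),
    }


before = parse_plan_line(
    "Seq Scan on homeless_by_coc  (cost=0.00..2974.24 rows=41024 width=480)"
)
after = parse_plan_line(
    "Seq Scan on homeless_by_coc  (cost=0.00..3429.29 rows=86529 width=88)"
)
print(before["rows"], "->", after["rows"])
print(before["width"], "->", after["width"])
```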

<b>6.</b> Instructions:
- Use the provided `cur` and `conn` objects.
- Set the connection to execute outside a transaction block.
- Using `cur.execute()`, vacuum full all user created tables.

The most powerful and risky `VACUUM` option: `FULL`
- Reclaims disk space and returns it to the database server.
- Claims an <b>exclusive</b> lock on the table it is vacuuming.
    - This means that no insert, update, or delete queries can be issued against that table for the duration of the vacuum.
    - Select queries on the table are blocked as well until the vacuum finishes.
- A general `VACUUM` removes dead rows from the table and makes their space reusable. However, that disk space is generally not returned to the operating system; it stays assigned to the table as free space to be used when more data is inserted.
- `VACUUM FULL` rewrites the table into a new file without the dead space and returns the freed disk space to the server.

```python
import psycopg2

conn = psycopg2.connect(dbname="dq", user="hud_admin", password="abc123")
conn.autocommit = True
cur = conn.cursor()
cur.execute("VACUUM FULL")
```
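
To check how much space a `VACUUM FULL` actually gave back, the table size can be measured before and after with Postgres's `pg_total_relation_size` function. This is a minimal sketch: the helper names `table_size_bytes` and `reclaimed_bytes` are mine, and `cur` is assumed to be a psycopg2 cursor on a connection that already has `autocommit = True`.

```python
def table_size_bytes(cur, table_name):
    """Total on-disk size of a table, including its indexes, in bytes."""
    cur.execute("SELECT pg_total_relation_size(%s::regclass)", (table_name,))
    return cur.fetchone()[0]


def reclaimed_bytes(cur, table_name, vacuum_sql):
    """Run `vacuum_sql` and report how many bytes the table shrank by.

    Assumes the cursor's connection is in autocommit mode, since VACUUM
    cannot run inside a transaction block.
    """
    size_before = table_size_bytes(cur, table_name)
    cur.execute(vacuum_sql)
    size_after = table_size_bytes(cur, table_name)
    return size_before - size_after
```

For a single table this could be called as `reclaimed_bytes(cur, 'homeless_by_coc', 'VACUUM FULL homeless_by_coc')`; a plain `VACUUM` will often report little or no change because the space stays assigned to the table.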

```python
import psycopg2
conn = psycopg2.connect(dbname = "valenbisi2018", user = "nmolivo")
conn.autocommit = True
cur = conn.cursor()
cur.execute("VACUUM FULL")
```

> Postgres has a feature called <b>autovacuum</b> and it runs periodically on your tables to ensure that dead rows are removed, and your statistics are up to date.
>
> In the latest versions of Postgres, autovacuum is on by default, and requires no additional setup.
>
> When do we explicitly vacuum tables?
> 1. Are you running your normal analysis tasks without major table deletes and load? Then, leave vacuuming to the autovacuum.
>
> 2. Have you recently deleted a significant amount of data in your tables, and you want to follow it up with complex analysis commands? Then, run a `VACUUM` or `VACUUM ANALYZE` to ensure optimized query commands.
>
> 3. Are your tables growing out of control, and is there little free space left on the database server? Then, disable all queries and run a `VACUUM FULL` to reclaim a significant amount of space.
>
>DataQuest
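
The three questions above boil down to a small decision rule. The function below is just my own restatement of that checklist, with made-up names and boolean inputs; deciding what counts as a "recent bulk delete" or "low disk space" is still a judgment call.

```python
def choose_vacuum_command(bulk_delete_recently, disk_space_low):
    """Map the checklist above to a command, or None to rely on autovacuum."""
    if disk_space_low:
        # Case 3: tables out of control and little free space left.
        return "VACUUM FULL"
    if bulk_delete_recently:
        # Case 2: large deletes followed by complex analysis queries.
        return "VACUUM ANALYZE"
    # Case 1: normal workload, leave it to autovacuum.
    return None


print(choose_vacuum_command(bulk_delete_recently=True, disk_space_low=False))
```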