### These are exercises done as part of <a href = "www.dataquest.io"> DataQuest</a>'s Data Engineer Path
This is not replicated for commercial use; strictly personal development.<br>
All exercises are (c) DataQuest, with slight modifications so they run against the Postgres server on my localhost

<font color = 'blue'>We had some practice creating tables in the first mission, but this one goes deeper into datatypes and what to consider when assigning them. In the last mission, I used the Postgres documentation to decide which datatypes to use, so this will be a good double check of that work.<br><br><b>Overall, it's important to know your data and assign it the correct datatypes; otherwise values can be truncated, rounded, or rejected, which corrupts the data.</b></font>

#### Creating Tables Mission
<b>1.  </b>Instructions:
- Using the provided `cur` object, execute the `SELECT` query from the example on the table `ign_reviews`.
- Call `print()` on the description property of `cur`.

In [5]:
import psycopg2
conn = psycopg2.connect("dbname=valenbisi2018 user=vbuser password=vbisi2018")
cur = conn.cursor()
cur.execute('SELECT * FROM vbstatic LIMIT 0')
print(cur.description)
conn.close()

(Column(name='id', type_code=20, display_size=None, internal_size=8, precision=None, scale=None, null_ok=None), Column(name='update', type_code=1043, display_size=None, internal_size=255, precision=None, scale=None, null_ok=None), Column(name='available', type_code=23, display_size=None, internal_size=4, precision=None, scale=None, null_ok=None), Column(name='free', type_code=23, display_size=None, internal_size=4, precision=None, scale=None, null_ok=None), Column(name='name', type_code=1043, display_size=None, internal_size=255, precision=None, scale=None, null_ok=None), Column(name='long', type_code=1700, display_size=None, internal_size=-1, precision=65535, scale=65535, null_ok=None), Column(name='lat', type_code=1700, display_size=None, internal_size=-1, precision=65535, scale=65535, null_ok=None), Column(name='total', type_code=23, display_size=None, internal_size=4, precision=None, scale=None, null_ok=None))
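The `type_code` values in this output are PostgreSQL type OIDs. Here is a minimal sketch of turning them into readable names, assuming a hand-written lookup. `PG_TYPE_NAMES`, `summarize_description`, and the stand-in `Column` tuple are hypothetical names introduced for illustration, and the lookup only covers the four OIDs that appear above.

```python
from collections import namedtuple

# Hypothetical lookup covering only the OIDs seen in the output above.
PG_TYPE_NAMES = {
    20: "bigint",
    23: "integer",
    1043: "varchar",
    1700: "numeric",
}

# Stand-in for psycopg2's Column objects so the sketch runs without a database.
Column = namedtuple("Column", ["name", "type_code"])


def summarize_description(description):
    """Map each column name to a readable type name (or the raw OID if unknown)."""
    return {
        col.name: PG_TYPE_NAMES.get(col.type_code, col.type_code)
        for col in description
    }


sample = [
    Column("id", 20),
    Column("update", 1043),
    Column("available", 23),
    Column("long", 1700),
]
summary = summarize_description(sample)
print(summary)
```

With a live connection, passing `cur.description` in place of `sample` should work the same way, since psycopg2's column objects expose `name` and `type_code`.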


<font color = 'blue'>One error I foresee as I add more data to our vbstatic table, so that it one day becomes vbdynamic, populated with data scraped using APIs and an AWS instance, is that the 'id' field will become too large.<br><br>According to the <a href = "https://www.postgresql.org/docs/9.1/static/datatype-numeric.html">Postgres Documentation</a>, data type `integer` is stored in 4 bytes (32 bits), so it tops out at 2147483647.<br><br> How can we properly store this value? For now, we will use datatype `bigserial`, because we want it to auto-increment as more data is added. We will also use `numeric` for our Longitude and Latitude values, since they get pretty precise.</font>

>### Numeric Types
>|Name|Storage Size|Description|Range|
>|------|------|------|------|
>|`smallint`|2 bytes|small-range integer|-32768 to +32767|
>|`integer`|4 bytes|typical choice for integer|-2147483648 to +2147483647|
>|`bigint`|8 bytes|large-range integer|-9223372036854775808 to 9223372036854775807|
>|`decimal`|variable|user-specified precision, exact|up to 131072 digits before the decimal point; up to 16383 digits after the decimal point|
>|`numeric`|variable|user-specified precision, exact|up to 131072 digits before the decimal point; up to 16383 digits after the decimal point|
>|`real`|4 bytes|variable-precision, inexact|6 decimal digits precision|
>|`double precision`|8 bytes|variable-precision, inexact|15 decimal digits precision|
>|`serial`|4 bytes|autoincrementing integer|1 to 2147483647|
>|`bigserial`|8 bytes|large autoincrementing integer|1 to 9223372036854775807|
>
><a href = "https://www.postgresql.org/docs/9.1/static/datatype-numeric.html">Postgres Documentation</a>
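The ranges in the table can be used to pick the smallest integer type that fits a column. This is a rough sketch under the assumption that you already know the minimum and maximum values you expect; `INTEGER_TYPES` and `smallest_integer_type` are hypothetical helper names, not part of Postgres or psycopg2.

```python
# Ranges copied from the Numeric Types table above.
INTEGER_TYPES = [
    ("smallint", -32768, 32767),
    ("integer", -2147483648, 2147483647),
    ("bigint", -9223372036854775808, 9223372036854775807),
]


def smallest_integer_type(min_value, max_value):
    """Return the narrowest integer type whose range covers [min_value, max_value]."""
    for name, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name
    # Nothing fits: fall back to the arbitrary-precision type.
    return "numeric"


print(smallest_integer_type(0, 30000))
print(smallest_integer_type(1, 2147483648))
```

An id that passes 2147483647 no longer fits `integer`, which is the case that motivates `bigint`/`bigserial` for a table that keeps growing.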

<b>2.  </b>Instructions:
- Use the provided `cur` object.
- Create a table `ign_reviews` that contains a single field using the correct type for this data.
- Set the `id` column as the `PRIMARY KEY`.
- Commit your changes using the `conn` object.

<font color = 'blue'>For this exercise, I will delete table `vbstatic2` using the command line: `valenbisi2018=# DROP TABLE vbstatic2;` and recreate it in this notebook.<br><br>Remember that for exercises where I create tables rather than just query them, I switch to user `nmolivo`, which is a Superuser.<br><Br> To see a description of all your users, type the following command into the CL: `\du`. Mine produces the following output in the CL:<br><Br>
`Role name |                         Attributes                         | Member of `<br>
`-----------+------------------------------------------------------------+-----------`<br>
` nmolivo   | Superuser, Create role, Create DB                          | {}`<br>
` postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}`<br>
` vbuser    |                                                            | {}`<br>
</font>

In [8]:
conn = psycopg2.connect("dbname=valenbisi2018 user=nmolivo")
cur = conn.cursor()
cur.execute('CREATE TABLE vbstatic2 (id BIGSERIAL PRIMARY KEY, \
            update VARCHAR(255), \
            available INT, \
            free INT, \
            name VARCHAR(255), \
            long NUMERIC, \
            lat NUMERIC, \
            total INT)');
conn.commit()
conn.close()

<b>3.  </b>Instructions:
- Import the `csv` module.
- Find the maximum character size of the `name` field using the `csv.reader` object.
- Assign the size to the variable `max_name_len`.

<font color = 'blue'>The actual DQ exercise has you checking for the largest score, `max_score`. I'm more concerned that the size of my `VARCHAR()` columns can be optimized, so I will be performing a similar exercise on `vbstatic2`'s `name` column.</font>

In [24]:
import csv
with open('vb_table.csv') as f:
    next(f)
    reader = csv.reader(f)
    unique_name_lens = [len(row[4]) for row in reader] #name column is row[4]

max_len = max(unique_name_lens)

In [25]:
#Wow, with max len 50, we can definitely cut down VARCHAR(255) to VARCHAR(55); technically adding a little bit extra, just in case.
max_len

50

In [19]:
import csv
with open('vb_table.csv') as f:
    next(f)
    reader = csv.reader(f)
    unique_name_lens = [len(row[5]) for row in reader] #long column is row[5]

max_len = max(unique_name_lens)

In [21]:
#ok, so our longitude and latitude can be limited to just 15 decimal digits of precision; datatype double precision 
max_len

12

<font color = 'blue'>Note that I stored my datetime variable `update` as a character datatype. I did this because I populated my table from a csv, and in my trial and error the csv's date strings were not in a format I could load into one of SQL's datetime datatypes without corrupting the data. I'm hoping to find a proper answer as I work my way through the DQ modules. In the meantime, let's make sure the `update` column's `VARCHAR` length is optimized:</font>

>### Character Types
>|Name|Description|
>|-----|-----|
>|`character varying(n), varchar(n)`|variable-length with limit|
>|`character(n), char(n)`|fixed-length, blank padded|
>|`text`|variable unlimited length|
>
> <a href = "https://www.postgresql.org/docs/9.1/static/datatype-character.html">Postgres Documentation</a>

>`CHAR(N)` pads any empty space of a character with whitespace " " characters while `VARCHAR(N)` does not.
>
> The only reason the `CHAR` datatype is implemented is to keep Postgres consistent with the SQL specification.
>
>In conclusion, when using Postgres, it's better to use the `TEXT` field for uncertain sizes and `VARCHAR(N)` for ones you know the maximum length.
>
> DataQuest

In [22]:
import csv
with open('vb_table.csv') as f:
    next(f)
    reader = csv.reader(f)
    unique_name_lens = [len(row[1]) for row in reader] #update column is row[1]

max_len = max(unique_name_lens)

In [23]:
#looks like update can be limited to len 24.
max_len

19
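The three cells above repeat the same loop for one column at a time. As a sketch of how the check could be done for every column in one pass, the helper below uses `csv.DictReader`; `max_field_lengths` is a hypothetical name, and the in-memory sample rows are illustrative stand-ins for `vb_table.csv`.

```python
import csv
import io


def max_field_lengths(lines):
    """Return {column name: longest string length} for a CSV with a header row."""
    reader = csv.DictReader(lines)
    longest = {}
    for row in reader:
        for field, value in row.items():
            longest[field] = max(longest.get(field, 0), len(value))
    return longest


# Illustrative sample; with the real file, pass the open file object instead.
sample_csv = io.StringIO(
    "id,update,name\n"
    "0,20/02/2018 05:27:07,001_GUILLEN_DE_CASTRO\n"
    "1,20/02/2018 05:27:07,031_CALLE SALAMANCA\n"
)
lengths = max_field_lengths(sample_csv)
print(lengths)
```

Because `DictReader` consumes the header itself, there is no need for the separate `next(f)` call in this version.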

<b>4.  </b>Instructions:
- Use the provided `cur` object.
- Add columns with the proper datatype and length.
- Commit your changes using the `conn` object.
- Note: If you're having trouble running the `CREATE TABLE` command, you can drop the table with `DROP TABLE` before creating it.

<font color = 'blue'>Good point; I've been using the CL to drop table!</font>

In [27]:
conn = psycopg2.connect("dbname=valenbisi2018 user=nmolivo")
cur = conn.cursor()
# Add your field and type here.
cur.execute("DROP TABLE vbstatic2");
cur.execute('CREATE TABLE vbstatic2 (id BIGSERIAL PRIMARY KEY, \
            update VARCHAR(24), \
            available INT, \
            free INT, \
            name VARCHAR(55), \
            long DOUBLE PRECISION, \
            lat DOUBLE PRECISION, \
            total INT)');
conn.commit()
conn.close()

<b>5.  `BIGINT`, `VARCHAR()`, `TEXT` </b>Instructions:
- Use the provided `cur` object.
- Add the title, url, platform, and genre columns with the proper datatype and/or length if required.
- Commit your changes using the `conn` object.

<font color = 'blue'>The following exercises cover the remaining datatypes in DQ's `reviews` data. I will provide the answers in markdown cells, as we've already accounted for all the different datatypes in `vbstatic`. <br><br>Note that while these are the answers that DQ accepts as correct, the columns are not in the same order as in the dataframe being used. In practice, I found this creates errors.</font>

```python
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE ign_reviews (
        id BIGINT PRIMARY KEY,
        score_phrase VARCHAR(11),
        title TEXT,
        url TEXT,
        platform VARCHAR(20),
        genre TEXT
    )
""")
conn.commit()
```

<b>6.  `DECIMAL`</b>:
- Use the provided `cur` object.
- Add the `score` column with the proper float-like datatype, precision, and scale.
- Commit your changes using the `conn` object.

```python
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
# Add your field and type here.
cur.execute("""
    CREATE TABLE ign_reviews (
        id BIGINT PRIMARY KEY,
        score_phrase VARCHAR(11),
        title TEXT,
        url TEXT,
        platform VARCHAR(20),
        genre TEXT,
        score DECIMAL(3, 1)
    )
""")
# for datatype DECIMAL, 3 is the total number of digits in the number, 1 is the number of digits after the decimal
conn.commit()
```
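To make the precision/scale rule concrete, here is a small sketch that approximates the check with Python's `decimal` module. `fits_numeric` is a hypothetical helper, and it assumes Postgres first rounds a value to the declared scale (ties away from zero) and then rejects it if the result needs more digits than the precision allows.

```python
from decimal import Decimal, ROUND_HALF_UP


def fits_numeric(value, precision, scale):
    """Approximate whether value would fit in NUMERIC(precision, scale)."""
    # Smallest representable step, e.g. scale=1 -> Decimal('0.1').
    quantum = Decimal(1).scaleb(-scale)
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    # precision - scale digits are left for the integer part.
    limit = Decimal(10) ** (precision - scale)
    return abs(rounded) < limit


print(fits_numeric("10.0", 3, 1))   # a top review score
print(fits_numeric("100.0", 3, 1))  # one digit too many
```

For `DECIMAL(3, 1)` the largest storable value is 99.9, which comfortably covers review scores from 0.0 to 10.0.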

<b>7.  `BOOLEAN`</b>:
- Use the provided `cur` object.
- Add the `editors_choice` column with the proper datatype.
- Commit your changes using the `conn` object.

> Valid literal values for the "true" state are:
> 
> TRUE 't' 'true' 'y' 'yes' 'on' '1'
> 
> For the "false" state, the following values can be used:
> 
> FALSE 'f' 'false' 'n' 'no' 'off' '0'
>
> DataQuest

```python
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
# Add your field and type here.
cur.execute("""
    CREATE TABLE ign_reviews (
        id BIGINT PRIMARY KEY,
        score_phrase VARCHAR(11),
        title TEXT,
        url TEXT,
        platform VARCHAR(20),
        genre TEXT,
        score DECIMAL(3, 1),
        editors_choice BOOLEAN
    )
""")
conn.commit()
```
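Before loading a CSV into a `BOOLEAN` column, it can help to check that every value is one of the literals quoted above. The sketch below does that in plain Python; `parse_bool_literal` is a hypothetical helper, and lower-casing and stripping the input is my own assumption about how lenient the check should be.

```python
# String literals taken from the DataQuest quote above.
TRUE_LITERALS = {"t", "true", "y", "yes", "on", "1"}
FALSE_LITERALS = {"f", "false", "n", "no", "off", "0"}


def parse_bool_literal(text):
    """Convert a boolean literal string to a Python bool, or raise ValueError."""
    token = text.strip().lower()
    if token in TRUE_LITERALS:
        return True
    if token in FALSE_LITERALS:
        return False
    raise ValueError(f"not a recognised boolean literal: {text!r}")


print(parse_bool_literal("Y"))
print(parse_bool_literal("off"))
```

Values that fall through to the `ValueError` are the ones worth inspecting before running the `INSERT`.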

>### Datetime Types
>|Name|Storage Size|Description|Low Value|High Value|Resolution|
>|-----|-----|-----|-----|-----|-----|
>|`timestamp [ (p) ] [ without time zone ]`|8 bytes|both date and time (no time zone)|4713 BC|294276 AD|1 microsecond / 14 digits|
>|`timestamp [ (p) ] with time zone`|8 bytes|both date and time, with time zone|4713 BC|294276 AD|1 microsecond / 14 digits|
>|`date`|4 bytes|date (no time of day)|4713 BC|5874897 AD|1 day|
>|`time [ (p) ] [ without time zone ]`|8 bytes|time of day (no date)|00:00:00|24:00:00|1 microsecond / 14 digits|
>|`time [ (p) ] with time zone`|12 bytes|times of day only, with time zone|00:00:00+1459|24:00:00-1459|1 microsecond / 14 digits|
>|`interval [ fields ] [ (p) ]`|16 bytes|time interval|-178000000 years|178000000 years|1 microsecond / 14 digits|
>
> <a href = "https://www.postgresql.org/docs/9.1/static/datatype-datetime.html">Postgres Documentation</a>

<b>8.  `DATE`</b>:
- Use the provided `cur` object.
- Create the last column, `release_date`, with the proper datetime type.
- Import the `csv` module and `date` module.
- Using the `csv` module, transform the `year`, `month`, and `day` values into a date object for each row.
- Insert the values into the created table using the `INSERT` statement from above.
- Commit your changes using the `conn` object.

```python
import csv
from datetime import date

import psycopg2

# Open a single connection and cursor for both the CREATE TABLE and the inserts.
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE ign_reviews (
        id BIGINT PRIMARY KEY,
        score_phrase VARCHAR(11),
        title TEXT,
        url TEXT,
        platform VARCHAR(20),
        score DECIMAL(3, 1),
        genre TEXT,
        editors_choice BOOLEAN,
        release_date DATE
    )
""")

with open('ign.csv', 'r') as f:
    next(f)
    reader = csv.reader(f)
    for row in reader:
        updated_row = row[:8]
        updated_row.append(date(int(row[8]), int(row[9]), int(row[10])))
        cur.execute("INSERT INTO ign_reviews VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)", updated_row)
conn.commit()
```

<font color = 'blue'>OK, I'll manipulate my csv data to parse the day, month, and year (plus the time) out of the `update` strings. This will be important to incorporate into our eventual scraper/data collection.</font>

In [47]:
import numpy as np
import pandas as pd
from datetime import date, datetime

In [54]:
data = pd.read_csv('../valencia-data-projects/valenbisi/vb_data/data.csv')

In [57]:
data.drop('Unnamed: 0', axis =1, inplace = True)

In [59]:
data = data[['update','available','free','total','name','Long','Lat']]

In [60]:
data.rename(columns = {'Long':'long', 'Lat':'lat'}, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  **kwargs)


In [62]:
data['update']= [datetime.strptime(x , '%d/%m/%Y %H:%M:%S') for x in data['update']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [67]:
type(data['update'][2])

pandas._libs.tslib.Timestamp
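The same conversion can be sketched with only the standard library, which may be handy inside a scraper that does not use pandas. `SOURCE_FORMAT` and `to_iso_timestamp` are hypothetical names; the format string is the one used in the cell above, and the output is the ISO 8601 style that a `TIMESTAMP` column accepts as input.

```python
from datetime import datetime

# Day/month/year layout used by the `update` column in the csv.
SOURCE_FORMAT = "%d/%m/%Y %H:%M:%S"


def to_iso_timestamp(text):
    """Parse a dd/mm/yyyy HH:MM:SS string and return it as 'YYYY-MM-DD HH:MM:SS'."""
    return datetime.strptime(text, SOURCE_FORMAT).isoformat(sep=" ")


converted = to_iso_timestamp("20/02/2018 05:27:07")
print(converted)  # 2018-02-20 05:27:07
```

Writing the year first removes the day/month ambiguity that makes strings like `05/02/2018` risky to load directly.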

In [42]:
data

Unnamed: 0,id,update,available,free,total,name,long,lat
0,0,20/02/2018 05:27:07,0,24,25,004_PLAZA_DE_LA_VIRGEN_CALLE_BAILIA,-0.375341,39.476747
1,1,20/02/2018 05:27:07,3,17,20,007_PZA_DEL_MERCADO_TAULA_DE_CANVIS,-0.379184,39.474872
2,2,20/02/2018 05:27:07,14,0,15,168_AVDA. MALVARROSA,-0.327885,39.476871
3,3,20/02/2018 05:27:07,0,16,16,229_CALLE_AITANA_ESQ_AVDA_BURJASSOT,-0.394805,39.493104
4,4,20/02/2018 05:27:07,2,18,20,248_AVDA_TRES_CRUCES_JOSE_MARIA_MORTES_LERMA,-0.404968,39.462840
5,5,20/02/2018 05:27:07,5,11,16,220_CALLE_CASTAN_TOBEÑAS_ESQ_CALLE_DE_GOYA,-0.398320,39.473855
6,6,20/02/2018 05:27:07,6,19,25,001_GUILLEN_DE_CASTRO,-0.382928,39.480042
7,7,20/02/2018 05:27:07,9,6,15,026_CALLE_SAN_JOSE_DE_CALASANZ,-0.386051,39.466195
8,8,20/02/2018 05:27:07,7,12,20,027_CALLE_SAN_VICENTE_MARTIR_129,-0.381850,39.463362
9,9,20/02/2018 05:27:07,0,20,20,031_CALLE SALAMANCA,-0.365027,39.467365


In [69]:
conn = psycopg2.connect("dbname=valenbisi2018 user=nmolivo")
cur = conn.cursor()
# Add your field and type here.
cur.execute("DROP TABLE vbstatic;")
cur.execute('CREATE TABLE vbstatic (id BIGSERIAL PRIMARY KEY, \
            update TIMESTAMP, \
            available INT, \
            free INT, \
            total INT, \
            name VARCHAR(55), \
            long DOUBLE PRECISION, \
            lat DOUBLE PRECISION)');
conn.commit()
conn.close()

In [98]:
conn = psycopg2.connect("dbname=valenbisi2018 user=nmolivo")
cur = conn.cursor()
cur.execute("DROP TABLE vbstatic;")
conn.close()

### <font color = 'blue'>Let's make a SQL table from a pandas DataFrame using SQLAlchemy</font>

`dialect+driver://username:password@host:port/database`<br>
<a href = 'http://docs.sqlalchemy.org/en/latest/core/engines.html'>SQL Alchemy Documentation: Engine Configuration</a><br>
<a href = 'http://docs.sqlalchemy.org/en/latest/core/type_basics.html#sql-standard-and-multiple-vendor-types'>SQL Alchemy Documentation: dtypes</a>
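As an illustration of the URL pattern above, the sketch below assembles the string from its parts with the standard library. `build_engine_url` is a hypothetical helper, the default port 5432 is an assumption about a stock Postgres install, and the credentials in the example are made up. Escaping with `quote_plus` matters when a password contains characters such as `@` or `/`.

```python
from urllib.parse import quote_plus


def build_engine_url(user, password, host, database,
                     port=5432, dialect="postgresql", driver="psycopg2"):
    """Assemble dialect+driver://username:password@host:port/database."""
    return (
        f"{dialect}+{driver}://"
        f"{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"
    )


# Made-up password chosen to show the escaping.
url = build_engine_url("nmolivo", "p@ss/word", "localhost", "valenbisi2018")
print(url)
```

The resulting string can be passed to `create_engine` in place of a hand-typed URL.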

In [85]:
import sqlalchemy

In [89]:
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://nmolivo:MYPASSWORD@localhost/valenbisi2018')
data.to_sql('vbstatic', engine, dtype = {'id': sqlalchemy.types.BIGINT, \
                                         'update':sqlalchemy.types.TIMESTAMP(timezone=False), \
                                         'available':sqlalchemy.types.INT, \
                                         'free':sqlalchemy.types.INT, \
                                         'total':sqlalchemy.types.INT, \
                                         'name':sqlalchemy.types.VARCHAR(length=55), \
                                         'long': sqlalchemy.types.Float(precision=15), \
                                         'lat': sqlalchemy.types.Float(precision=15)})

In [96]:
with engine.connect() as conn:
    conn.execute('ALTER TABLE vbstatic ADD PRIMARY KEY (index);')

<font color = 'blue'>Yay! Using SQLAlchemy was way more efficient.</font>