### Postgres for Data Engineers Project: Storing Tropical Storm Data
#### These are exercises done as part of <a href = "www.dataquest.io"> DataQuest</a>'s Data Engineer Path
This is not replicated for commercial use; strictly personal development.<br>
All exercises are (c) DataQuest, with slight modifications so they use my PostgreSQL server on localhost.<br><br>
Download Tropical Storm data from the <a href = "https://data.world/dhs/historical-tropical-storm">data.world</a>

<font color = 'blue'>Before we create the table `stormdata` in our database `dq_exercises`, I want to examine the data so we can choose the correct datatypes.</font>

In [1]:
import psycopg2
import pandas as pd

In [2]:
data = pd.read_csv('https://dq-content.s3.amazonaws.com/251/storm_data.csv')

In [3]:
data.head()

Unnamed: 0,FID,YEAR,MONTH,DAY,AD_TIME,BTID,NAME,LAT,LONG,WIND_KTS,PRESSURE,CAT,BASIN,Shape_Leng
0,2001,1957,8,8,1800Z,63,NOTNAMED,22.5,-140.0,50,0,TS,Eastern Pacific,1.140175
1,2002,1961,10,3,1200Z,116,PAULINE,22.1,-140.2,45,0,TS,Eastern Pacific,1.16619
2,2003,1962,8,29,0600Z,124,C,18.0,-140.0,45,0,TS,Eastern Pacific,2.10238
3,2004,1967,7,14,0600Z,168,DENISE,16.6,-139.5,45,0,TS,Eastern Pacific,2.12132
4,2005,1972,8,16,1200Z,251,DIANA,18.5,-139.8,70,0,H1,Eastern Pacific,1.702939


In [4]:
list(data.columns)

['FID',
 'YEAR',
 'MONTH',
 'DAY',
 'AD_TIME',
 'BTID',
 'NAME',
 'LAT',
 'LONG',
 'WIND_KTS',
 'PRESSURE',
 'CAT',
 'BASIN',
 'Shape_Leng']

In [5]:
len(data)

59228


<font color = 'blue'>The pandas DataFrame method <a href = "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html">`info()`</a> tells us the dtype of each column. Note that pandas reports string columns as `object`.</font>

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59228 entries, 0 to 59227
Data columns (total 14 columns):
FID           59228 non-null int64
YEAR          59228 non-null int64
MONTH         59228 non-null int64
DAY           59228 non-null int64
AD_TIME       59228 non-null object
BTID          59228 non-null int64
NAME          59228 non-null object
LAT           59228 non-null float64
LONG          59228 non-null float64
WIND_KTS      59228 non-null int64
PRESSURE      59228 non-null int64
CAT           59228 non-null object
BASIN         59228 non-null object
Shape_Leng    59228 non-null float64
dtypes: float64(3), int64(7), object(4)
memory usage: 6.3+ MB


<font color = 'blue'>When choosing datatypes, it helps to know the longest value in each column. The loop below computes `max_len`, the maximum string length per column, prints it, and stores it in a dictionary in case we need it later:</font>

In [8]:
col_len_dict = {}
for col in data.columns:
    # Length of the longest value in this column once converted to a string.
    # Every row is checked, including row 0.
    max_len = int(data[col].map(lambda value: len(str(value))).max())
    print(col + " " + str(max_len))
    col_len_dict[col] = max_len

FID 5
YEAR 4
MONTH 2
DAY 2
AD_TIME 5
BTID 4
NAME 9
LAT 4
LONG 7
WIND_KTS 3
PRESSURE 4
CAT 2
BASIN 15
Shape_Leng 9


<font color = 'blue'>At first glance, we need to find the precision and scale required for the decimal columns and the length required for the string columns. So we'll do the following steps:<br>
- Remove the 'Z' character from `AD_TIME` and convert it to a time value <br>
- See the max and min values for <br>
&nbsp;&nbsp;&nbsp;&nbsp;- `LONG`<br>
&nbsp;&nbsp;&nbsp;&nbsp;- `LAT`<br>
&nbsp;&nbsp;&nbsp;&nbsp;- `Shape_Leng`<br>
&nbsp;&nbsp;&nbsp;&nbsp;- `WIND_KTS`<br>
&nbsp;&nbsp;&nbsp;&nbsp;- `PRESSURE`</font>

In [9]:
from datetime import datetime

In [10]:
data['AD_TIME'] = [datetime.strptime(a[0:2]+":"+a[2:4], "%H:%M").time() for a in data['AD_TIME']]
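
<font color = 'blue'>The time of day can also be combined with the year, month, and day into a single timestamp on the Python side. The block below is a minimal sketch, assuming every `AD_TIME` value has the form `1800Z` (four digits followed by a literal `Z`). The helper name `combine_timestamp` is hypothetical and not part of the exercise.</font>

```python
from datetime import datetime


def combine_timestamp(year, month, day, ad_time):
    """Build one datetime from separate YEAR, MONTH, DAY and AD_TIME fields.

    Assumes ad_time is formatted like '1800Z' (HHMM plus a literal 'Z').
    """
    stamp = f"{int(year):04d}-{int(month):02d}-{int(day):02d} {ad_time}"
    # The trailing 'Z' in the format string is matched literally,
    # so it does not need to be stripped first.
    return datetime.strptime(stamp, "%Y-%m-%d %H%MZ")


# First row of the storm data: 1957-08-08 at 1800Z
print(combine_timestamp(1957, 8, 8, "1800Z"))  # 1957-08-08 18:00:00
```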

In [19]:
print("LONG min = "+str(min(data['LONG'])))
print("LONG max = "+str(max(data['LONG'])))
print()
print("LAT min = "+str(min(data['LAT'])))
print("LAT max = "+str(max(data['LAT'])))
print()
print("Shape_Leng min = "+str(min(data['Shape_Leng'])))
print("Shape_Leng max = "+str(max(data['Shape_Leng'])))
print()
print("WIND_KTS min = "+str(min(data['WIND_KTS'])))
print("WIND_KTS max = "+str(max(data['WIND_KTS'])))
print()
print("PRESSURE min = "+str(min(data['PRESSURE'])))
print("PRESSURE max = "+str(max(data['PRESSURE'])))

LONG min = -180.0
LONG max = 180.0

LAT min = 4.2
LAT max = 69.0

Shape_Leng min = 0.0
Shape_Leng max = 11.18034

WIND_KTS min = 10
WIND_KTS max = 165

PRESSURE min = 0
PRESSURE max = 1024
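
<font color = 'blue'>The min and max tell us how many digits we need before the decimal point, but the number of digits after it depends on every value in the column. The block below is a rough sketch of how the smallest `NUMERIC(precision, scale)` could be derived from a list of values. `numeric_spec` is a hypothetical helper, and it assumes the column contains no missing values (true for this dataset according to `info()`).</font>

```python
from decimal import Decimal


def numeric_spec(values):
    """Return the smallest (precision, scale) able to hold every value."""
    int_digits = 0
    scale = 0
    for value in values:
        # Going through str() keeps 22.5 as Decimal('22.5') instead of
        # the long binary expansion of the float.
        _, digits, exponent = Decimal(str(value)).as_tuple()
        scale = max(scale, max(0, -exponent))
        int_digits = max(int_digits, max(0, len(digits) + exponent))
    # Precision counts all digits, scale only those after the decimal point.
    return int_digits + scale, scale


print(numeric_spec([-180.0, 180.0]))   # (4, 1)
print(numeric_spec([0.0, 11.18034]))   # (7, 5)
```

<font color = 'blue'>In practice this would be run over a whole column, for example `numeric_spec(data['LONG'])`, and the result can be rounded up to leave headroom for future rows.</font>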


In [12]:
import sqlalchemy

<font color = 'blue'>Remember, if the code below runs into errors, you can always remove the created table and start over with:</font>
```python
conn = psycopg2.connect("dbname=dq_exercises user=nmolivo")
cur = conn.cursor()
cur.execute("DROP TABLE stormdata")
conn.commit()
```
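
<font color = 'blue'>A plain `DROP TABLE` raises an error when the table does not exist yet. The sketch below wraps the same idea in a hypothetical helper, `drop_table_if_exists`, that works with any DB-API connection (such as the `psycopg2` connection above) and uses `DROP TABLE IF EXISTS` so it can be run repeatedly. The name check is an assumption of this sketch: it only accepts simple lowercase identifiers.</font>

```python
import re


def drop_table_if_exists(conn, table):
    """Drop `table` through a DB-API connection; do nothing if it is absent."""
    # Table names cannot be passed as query parameters,
    # so validate the name before interpolating it into the SQL.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    cur = conn.cursor()
    try:
        cur.execute(f"DROP TABLE IF EXISTS {table}")
        conn.commit()
    finally:
        cur.close()
```

<font color = 'blue'>Usage would look like `drop_table_if_exists(conn, "stormdata")` with the connection opened as shown above.</font>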

### <font color = 'blue'>CREATE TABLE</font>

<font color = 'blue'>First I'm going to make all our column names lowercase. PostgreSQL folds unquoted identifiers to lowercase, so lowercase column names spare us from double-quoting every column when we write SQL from Python.</font>

In [60]:
# A flat list of names; a nested list would create a MultiIndex in recent pandas versions
data.columns = ['fid', 'year', 'month', 'day', 'ad_time', 'btid', 'name', 'lat', 'long', 'wind_kts', 'pressure', 'cat', 'basin', 'shape_len']
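
<font color = 'blue'>Typing every name by hand works for 14 columns, but the renaming can also be done programmatically. The block below is a small sketch using a hypothetical helper, `to_pg_identifier`, that lowercases a name and replaces anything other than letters, digits, and underscores. Note that it would produce `shape_leng`, whereas this notebook deliberately shortens that column to `shape_len`.</font>

```python
import re


def to_pg_identifier(name):
    """Turn a column header into a lowercase name that needs no quoting."""
    # Collapse runs of non-word characters into one underscore,
    # then trim underscores left at either end.
    cleaned = re.sub(r"\W+", "_", name.strip()).strip("_")
    return cleaned.lower()


columns = ["FID", "AD_TIME", "Shape_Leng"]
print([to_pg_identifier(c) for c in columns])  # ['fid', 'ad_time', 'shape_leng']
```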

In [62]:
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://nmolivo:MYPASSWORD@localhost/dq_exercises')
# Map each DataFrame column to an explicit SQL type instead of letting pandas infer one
data.to_sql('stormdata', engine, dtype={
    'fid': sqlalchemy.types.INT,
    'year': sqlalchemy.types.INT,
    'month': sqlalchemy.types.INT,
    'day': sqlalchemy.types.INT,
    'ad_time': sqlalchemy.types.TIME(timezone=False),
    'btid': sqlalchemy.types.CHAR(length=4),
    'name': sqlalchemy.types.CHAR(length=9),
    'lat': sqlalchemy.types.NUMERIC(precision=4, scale=2, asdecimal=True),
    'long': sqlalchemy.types.NUMERIC(precision=7, scale=4, asdecimal=True),
    'wind_kts': sqlalchemy.types.INT,
    'pressure': sqlalchemy.types.INT,
    'cat': sqlalchemy.types.CHAR(length=2),
    'basin': sqlalchemy.types.CHAR(length=15),
    'shape_len': sqlalchemy.types.NUMERIC(precision=9, scale=7, asdecimal=True),
})

In [63]:
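# engine.begin() commits on success; raw SQL strings must be wrapped in text() in SQLAlchemy 2.0
with engine.begin() as conn:
    conn.execute(sqlalchemy.text('ALTER TABLE stormdata ADD PRIMARY KEY ("fid");'))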

### <font color = 'blue'>CREATE USERS</font>

In [4]:
conn = psycopg2.connect("dbname=dq_exercises user=nmolivo")
cur = conn.cursor()
cur.execute("""
CREATE USER stormadmin WITH CREATEDB PASSWORD 'admin123';
CREATE GROUP stormusers NOLOGIN;
REVOKE ALL ON stormdata FROM stormusers;
GRANT SELECT ON stormdata TO stormusers;
""")
conn.commit()

### <font color = 'blue'>INSERT NEW DATA</font>

In [66]:
data2 = pd.read_csv('https://dq-content.s3.amazonaws.com/251/storm_data_additional.csv')

In [76]:
data2.head()

Unnamed: 0,fid,date,btid,name,lat,long,wind_kts,pressure,cat,basin,shape_len
0,97fc91afc6acbb8df4563a90b8b1c4fa,1851-06-25 00:00:00,1,NOTNAMED,28.0,-94.8,80,0,H1,North Atlantic,0.6
1,174b0313d3601872c6fd2c65150eef1c,1851-06-25 06:00:00,1,NOTNAMED,28.0,-95.4,80,0,H1,North Atlantic,0.6
2,74f4a8b3a417f9509ce5f285f5666a99,1851-06-25 12:00:00,1,NOTNAMED,28.0,-96.0,80,0,H1,North Atlantic,0.509902
3,a17379f02b1d82946a49cd865931e8ad,1851-06-25 18:00:00,1,NOTNAMED,28.1,-96.5,80,0,H1,North Atlantic,0.509902
4,013c8fb523008502e67c70c9ddc17dc5,1851-06-26 00:00:00,1,NOTNAMED,28.2,-97.0,70,0,H1,North Atlantic,0.608276


In [74]:
# A flat list of names; a nested list would create a MultiIndex in recent pandas versions
data2.columns = ['fid', 'date', 'btid', 'name', 'lat', 'long', 'wind_kts', 'pressure', 'cat', 'basin', 'shape_len']

In [68]:
col_len_dict2 = {}
for col in list(data2.columns):
    max_len = 0
    for i in range(0,len(data2)):
        if max_len < len(str(data2[col][i])):
            max_len = len(str(data2[col][i]))
    print(col + " " + str(max_len))
    col_len_dict2[col] = max_len

FID 32
DATETIME 19
BTID 4
NAME 9
LAT 4
LONG 7
WIND_KTS 3
PRESSURE 4
CAT 2
BASIN 15
Shape_Leng 9


<font color = 'blue'>This data is structured differently from our first file: `fid` is now a 32-character string, and the separate year, month, day, and time columns are replaced by a single `date` timestamp. To insert it, we need to modify our existing table to accept the new datatypes. I used the <a href = "https://www.postgresql.org/docs/7.4/static/functions-formatting.html">PostgreSQL Documentation</a> as a reference for date formatting.</font>

In [71]:
conn = psycopg2.connect("dbname=dq_exercises user=nmolivo")
cur = conn.cursor()
cur.execute("""
ALTER TABLE stormdata ADD COLUMN date TIMESTAMP;
-- to_timestamp keeps the time of day; to_date would discard it
UPDATE stormdata SET date = to_timestamp(year || '-' || month || '-' || day || ' ' || ad_time, 'YYYY-MM-DD HH24:MI:SS');
""")
conn.commit()

<font color ='blue'>Remove extra columns</font>

In [73]:
conn = psycopg2.connect("dbname=dq_exercises user=nmolivo")
cur = conn.cursor()
cur.execute("""
ALTER TABLE stormdata DROP COLUMN day;
ALTER TABLE stormdata DROP COLUMN month;
ALTER TABLE stormdata DROP COLUMN year;
ALTER TABLE stormdata DROP COLUMN ad_time;
""")
conn.commit()

<font color = 'blue'>Change columns to proper datatypes</font>

In [78]:
conn = psycopg2.connect("dbname=dq_exercises user=nmolivo")
cur = conn.cursor()
cur.execute("""
ALTER TABLE stormdata ALTER COLUMN fid TYPE VARCHAR(32);
""")
conn.commit()

<font color = 'blue'>To insert our second dataframe, `data2` into our table, we'll need to<br>
1. Make sure the columns are in the correct order<br>
2. Turn the dataframe into a list of tuples representing each row.</font>

In [79]:
data2 = data2[['fid', 'btid', 'name', 'lat', 'long', 'wind_kts', 'pressure', 'cat', 'basin', 'shape_len', 'date']]

In [88]:
values = data2.values.tolist()

<font color = 'blue'>Here it is, the big insert!</font>

In [90]:
sql = "INSERT INTO stormdata(fid, btid, name, lat, long, wind_kts, pressure, cat, basin, shape_len, date) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
conn = psycopg2.connect("dbname=dq_exercises user=nmolivo")
cur = conn.cursor()
cur.executemany(sql, values)
conn.commit()
cur.close()
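
<font color = 'blue'>With roughly 59,000 rows, a single `executemany` call is fine here. For larger files it can help to send the rows in batches. The block below is a minimal sketch of a hypothetical `chunked` helper; the batch size of 5000 in the commented usage is an arbitrary assumption, not a tuned value.</font>

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Intended use with the cursor and statement from the cell above:
# for batch in chunked(values, 5000):
#     cur.executemany(sql, batch)
# conn.commit()
print(list(chunked([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```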

<font color = 'blue'>Always check that your appends and merges worked correctly. I compared the number below with the record count that Postico shows for the table, and they match.</font>

In [91]:
len(data) + len(data2)

118456
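
<font color = 'blue'>The same check can be done from Python instead of a GUI client. The block below is a minimal sketch with a hypothetical helper, `count_rows`, that runs `SELECT COUNT(*)` through any DB-API connection. As an assumption of this sketch, it only accepts simple lowercase table names because identifiers cannot be passed as query parameters.</font>

```python
import re


def count_rows(conn, table):
    """Return the number of rows in `table` using a DB-API connection."""
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    cur = conn.cursor()
    try:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        cur.close()
```

<font color = 'blue'>With the `psycopg2` connection from earlier, the comparison becomes `count_rows(conn, "stormdata") == len(data) + len(data2)`.</font>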