# Storing Storm Data

Recently, the _International Hurricane Watchgroup_ (IHW) has been asked to update their analysis tools. Because of the increase in public awareness of hurricanes, they are required to be more diligent with the analysis of historical hurricane data they share across the organization. They have asked us to work with their team to productionize their services.

After we accepted the job, their team told us that they have had trouble sharing data across teams and keeping it consistent. From what they've told us, their method of sharing data with their data analysts has been to save a CSV file on their local servers and have every analyst pull it down. Each analyst then uses a local SQLite engine to store the CSV, run queries, and send the results around.

From what they have told us, we think that this is an inefficient way of sharing data. To help us understand what we will be working on, they have sent us a CSV file. It contains the following fields:

- `fid` - ID for the row
- `year` - Recorded year
- `month` - Recorded month
- `day` - Recorded day of the month
- `ad_time` - Recorded time in UTC
- `btid` - Hurricane ID
- `name` - Name of the hurricane
- `lat` - Latitude of the recorded location
- `long` - Longitude of the recorded location
- `wind_kts` - Wind speed in knots
- `pressure` - Atmospheric pressure of the hurricane
- `cat` - Hurricane category
- `basin` - The basin the hurricane is located in
- `shape_leng` - Hurricane shape length

We need to create a database that meets the following requirements:

- Database for the IHW to store their tables.
- Table that contains the fields detailed in the CSV file
- User that can update, read, and insert into a table of the data.
- Insert the data into the table.


[File Download Link](https://dq-content.s3.amazonaws.com/251/storm_data.csv)

## Import Libraries & Helper Functions

In [59]:
# import libraries
import csv
import datetime
import psycopg2

import pandas as pd
import pprint as pp

_Helper Functions_

In [17]:
# explore the uniqueness of each column's values
def display_columns_unique_values(df):
    '''Display the number of unique values in each column.
        - if a column has only 1 unique value, that value is displayed too
        - if the unique values cannot be computed, the type of the column's first value is displayed
    '''
    for column in df.columns:

        try:
            unique_values = df[column].unique()
            num_unique_values = len(unique_values)
            result = f"Column {column.upper():>15}: {num_unique_values}"

            if num_unique_values == 1:
                result += f"|Unique values: {unique_values}"

            print(result)

        # columns holding unhashable objects (e.g. dicts) raise TypeError,
        # so only the type of their first value is reported
        except TypeError:
            print(f"Column {column.upper()}: Type: {type(df[column].iloc[0])}")

## Downloading CSV file

In [1]:
!ls

project01_storing_storm_data.ipynb  storm_data.csv


Alternative approach: _download the file_

In [10]:
import csv
import io
from urllib import request

file_link = 'https://dq-content.s3.amazonaws.com/251/storm_data.csv'

response = request.urlopen(file_link)
reader = csv.reader(io.TextIOWrapper(response))

for line in reader:
    print(line)
    break

['\ufeffFID', 'YEAR', 'MONTH', 'DAY', 'AD_TIME', 'BTID', 'NAME', 'LAT', 'LONG', 'WIND_KTS', 'PRESSURE', 'CAT', 'BASIN', 'Shape_Leng']
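The first header field printed above starts with `\ufeff`, a byte order mark (BOM). The following is a minimal sketch of one way to drop it, assuming the file is UTF-8 encoded: decode with Python's `utf-8-sig` codec. The in-memory bytes below are an abbreviated stand-in for the HTTP response, not the real file.

```python
import csv
import io

# Stand-in for the downloaded bytes: a UTF-8 BOM followed by an
# abbreviated header and one row (sample data, not the real file).
raw = "\ufeffFID,YEAR,MONTH\n2001,1957,8\n".encode("utf-8")

# Plain UTF-8 decoding keeps the BOM attached to the first field.
plain = next(csv.reader(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")))

# The "utf-8-sig" codec skips a leading BOM if one is present.
clean = next(csv.reader(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="")))

print(plain)  # ['\ufeffFID', 'YEAR', 'MONTH']
print(clean)  # ['FID', 'YEAR', 'MONTH']
```

With the real download, the same idea would be `io.TextIOWrapper(response, encoding="utf-8-sig")`.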


## Exploring the Dataset

In [13]:
df = pd.read_csv('storm_data.csv')

In [14]:
df.head()

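|   | FID | YEAR | MONTH | DAY | AD_TIME | BTID | NAME | LAT | LONG | WIND_KTS | PRESSURE | CAT | BASIN | Shape_Leng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 1957 | 8 | 8 | 1800Z | 63 | NOTNAMED | 22.5 | -140.0 | 50 | 0 | TS | Eastern Pacific | 1.140175 |
| 1 | 2002 | 1961 | 10 | 3 | 1200Z | 116 | PAULINE | 22.1 | -140.2 | 45 | 0 | TS | Eastern Pacific | 1.16619 |
| 2 | 2003 | 1962 | 8 | 29 | 0600Z | 124 | C | 18.0 | -140.0 | 45 | 0 | TS | Eastern Pacific | 2.10238 |
| 3 | 2004 | 1967 | 7 | 14 | 0600Z | 168 | DENISE | 16.6 | -139.5 | 45 | 0 | TS | Eastern Pacific | 2.12132 |
| 4 | 2005 | 1972 | 8 | 16 | 1200Z | 251 | DIANA | 18.5 | -139.8 | 70 | 0 | H1 | Eastern Pacific | 1.702939 |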


In [16]:
df.shape

(59228, 14)

In [15]:
display_columns_unique_values(df)

Column             FID: 59228
Column            YEAR: 158
Column           MONTH: 12
Column             DAY: 31
Column         AD_TIME: 4
Column            BTID: 1410
Column            NAME: 482
Column             LAT: 582
Column            LONG: 1950
Column        WIND_KTS: 61
Column        PRESSURE: 131
Column             CAT: 12
Column           BASIN: 2
Column      SHAPE_LENG: 957


Explore each column's unique values:

In [27]:
df['BTID'].value_counts().sort_index()

1       18
2       25
3       25
4       55
5       23
6       21
7       64
8       19
9       31
10      73
11      27
12      17
13       9
14      61
15      15
16       5
17      25
18      29
19      33
20      26
21      21
22      35
23      17
24      27
25       9
26      17
27      29
28      23
29      17
30      21
        ..
1381    22
1382    22
1383    39
1384    19
1385    25
1386    13
1387     9
1388    25
1389     7
1390    18
1391    12
1392    30
1393    52
1394    23
1395     8
1396    72
1397    17
1398    26
1399    13
1400    51
1401    46
1402    46
1403    57
1404    32
1405    22
1406    33
1407     7
1408    13
1409    31
1410    34
Name: BTID, Length: 1410, dtype: int64

In [25]:
df['NAME'].unique()

array(['NOTNAMED', 'PAULINE', 'C', 'DENISE', 'DIANA', 'KRISTY', 'KAY',
       'MAGGIE', 'GREG', 'JOVA', 'DANIEL', 'DOUGLAS', 'HECTOR', 'ERICK',
       'CELESTE', 'GEORGETTE', 'ORLENE', 'YOLANDA', 'EUGENE', 'FERNANDA',
       'LANE', 'DOLORES', 'KATHERINE', 'DORA', 'BARBARA', 'NARDA', 'IONE',
       'GUILLERMO', 'JIMENA', 'BUD', 'ELIDA', 'KANOA', 'ROSALIE', 'DOT',
       'CONNIE', 'RAMONA', 'KATE', 'KATHLEEN', 'LORRAINE', 'HILARY',
       'JOHN', 'DOREEN', 'MIRIAM', 'PATRICIA', 'FICO', 'EMILIA', 'GILMA',
       'EMA', 'SONIA', 'GIL', 'KENNA', 'KENNETH', 'RAYMOND', 'LINDA',
       'RICK', 'FAUSTO', 'ESTELLE', 'MARIE', 'POLO', 'FEFA', 'JAVIER',
       'ROSLYN', 'FABIO', 'FELICIA', 'DARBY', 'IRAH', 'ELEANOR',
       'CLAUDIA', 'HERNAN', 'HENRIETTE', 'ISMAEL', 'JULIO', 'ENRIQUE',
       'KEVIN', 'BLANCA', 'LALA', 'MADELINE', 'GWEN', 'HYACINTH', 'MARTY',
       'FRANK', 'IGNACIO', 'ALIKA', 'LOWELL', 'ALETTA', 'ANDRES', 'ISIS',
       'TARA', 'WILA', 'DALILIA', 'BORIS', 'GENEVIEVE', 'IOKE', '

In [28]:
max_len = 0
for name in df['NAME'].unique():
    max_len = max(max_len, len(name))
print('Max length of name:', max_len)

Max length of name: 9


In [31]:
df['WIND_KTS'].unique()

array([ 50,  45,  70,  30,  75,  60, 100,  35,  40,  25,  80,  85,  90,
        15,  65,  20,  95,  55, 110, 120, 125, 115, 105, 130, 135, 140,
       150, 145,  77,  84, 155,  67, 160,  10, 165,  34,  43,  52,  28,
        29,  27,  87, 102,  98,  94,  33,  32,  31,  62, 118, 117,  54,
        56,  68,  58,  82,  63,  37,  48,  93,  78])

In [34]:
df['WIND_KTS'].value_counts().sort_index()

10       59
15      196
20      916
25     5056
27        2
28        3
29        2
30     5980
31        2
32        2
33        2
34        1
35     6002
37        1
40     4957
43        2
45     6052
48        1
50     4303
52        2
54        1
55     2488
56        1
58        1
60     3272
62        2
63        3
65     2682
67        1
68        2
       ... 
75     2866
77        6
78        2
80     2151
82        1
84        3
85     1738
87        1
90     2075
93        1
94        1
95      889
98        1
100    1160
102       2
105     794
110     650
115     520
117       1
118       1
120     395
125     261
130     172
135      95
140     112
145      32
150      21
155      13
160       8
165       3
Name: WIND_KTS, Length: 61, dtype: int64

In [35]:
df['PRESSURE'].unique()

array([   0, 1005, 1012, 1007, 1002, 1011,  997,  973,  992, 1010, 1008,
        960,  968, 1006, 1009, 1000, 1003,  972,  975, 1004,  994,  976,
        990,  995,  980,  999,  996,  998,  955,  985,  969,  963,  987,
       1014,  970,  966,  974, 1001,  979,  984,  946,  950,  978,  961,
        948,  981,  988,  977,  965, 1013,  989,  993,  944,  952, 1015,
        986,  949,  991,  962,  958,  935,  928,  926,  929,  936,  932,
        940,  982,  951,  956,  943,  957,  953,  930,  920,  959,  945,
        954,  983,  964,  933,  941,  947,  939,  938,  971,  921,  900,
        919,  924,  923,  967,  937,  942,  934,  925,  910,  915,  927,
        931,  906,  903,  905,  917,  902,  913,  907,  892,  899,  914,
        888,  889,  922,  912,  901,  916,  882,  911,  918, 1016, 1020,
       1017, 1018, 1019, 1021, 1022, 1023, 1024,  908,  897,  909])

In [37]:
df['PRESSURE'].value_counts().sort_index()

0       37002
882         1
888         1
889         1
892         5
897         2
899         1
900         6
901         1
902         2
903         1
905         5
906         2
907         1
908         2
909         3
910        23
911         1
912         2
913         3
914         4
915         7
916         3
917         3
918         1
919         4
920        18
921        12
922         4
923        12
        ...  
995       497
996       301
997       564
998       453
999       389
1000     1146
1001      456
1002      766
1003      716
1004      740
1005     1248
1006      969
1007      924
1008      961
1009     1172
1010      830
1011      361
1012      329
1013      155
1014      106
1015       51
1016       28
1017       15
1018        9
1019        5
1020       12
1021        4
1022        4
1023        4
1024        1
Name: PRESSURE, Length: 131, dtype: int64

In [38]:
df['CAT'].unique()

array(['TS', 'H1', 'TD', 'H3', 'H2', 'L', 'H4', 'H5', 'E', 'W', 'SS',
       'SD'], dtype=object)

In [19]:
df['BASIN'].unique()

array(['Eastern Pacific', 'North Atlantic'], dtype=object)
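The manual inspection above feeds into the choice of column types. As a rough sketch of how that check could be automated, the two hypothetical helpers below (`fits_smallint` and `decimal_shape`, not part of the original notebook) test whether values fit PostgreSQL's `SMALLINT` range and estimate the `DECIMAL(precision, scale)` that string values need. They use only the five sample rows shown by `df.head()`, so results on the full columns may be wider, and they assume every value is a plain decimal string (no `NaN`).

```python
from decimal import Decimal

SMALLINT_MIN, SMALLINT_MAX = -32768, 32767  # PostgreSQL SMALLINT range


def fits_smallint(values):
    """True if every value is a whole number inside the SMALLINT range."""
    return all(
        Decimal(v) == Decimal(v).to_integral_value()
        and SMALLINT_MIN <= Decimal(v) <= SMALLINT_MAX
        for v in values
    )


def decimal_shape(values):
    """Return a (precision, scale) pair large enough for all values."""
    int_digits = frac_digits = 0
    for v in values:
        _, digits, exponent = Decimal(v).as_tuple()
        frac_digits = max(frac_digits, max(0, -exponent))
        # count at least one digit before the decimal point
        int_digits = max(int_digits, max(1, len(digits) + exponent))
    return int_digits + frac_digits, frac_digits


# Values copied from the five rows displayed by df.head()
lat = ["22.5", "22.1", "18.0", "16.6", "18.5"]
long_ = ["-140.0", "-140.2", "-140.0", "-139.5", "-139.8"]
wind = ["50", "45", "45", "45", "70"]
shape = ["1.140175", "1.16619", "2.10238", "2.12132", "1.702939"]

print(fits_smallint(wind))   # True
print(fits_smallint(lat))    # False: latitudes have a fractional part
print(decimal_shape(long_))  # (4, 1)
print(decimal_shape(shape))  # (7, 6)
```

Running the helpers over whole columns, rather than a sample, would be needed before trusting the result for a schema.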

**Data Types to Use**:

- `fid` - Row ID - INTEGER
- `year`, `month`, `day`, `ad_time` - combined into a single TIMESTAMP column, `recorded_at` (the recorded times are in UTC)
- `btid` - Hurricane ID - INTEGER
- `name` - Name of the hurricane - VARCHAR(10) (the longest name has 9 characters)
- `lat` - Latitude of the recorded location - DECIMAL(4, 1), since the values have one decimal place
- `long` - Longitude of the recorded location - DECIMAL(4, 1)
- `wind_kts` - Wind speed in knots - SMALLINT
- `pressure` - Atmospheric pressure of the hurricane - INTEGER
- `cat` - Hurricane category - VARCHAR(2), stored as `category`
- `basin` - The basin the hurricane is located in - VARCHAR(16)
- `shape_leng` - Hurricane shape length - DECIMAL(8, 6), i.e. up to 2 digits before the decimal point and 6 digits after it
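The list above combines four CSV fields into one timestamp. A minimal sketch of that step follows, assuming `ad_time` always has the form `HHMMZ` (as in `1800Z`). The function name `parse_recorded_at` is made up for illustration. It returns a timezone-aware UTC value; storing the offset in Postgres would require a `TIMESTAMP WITH TIME ZONE` column.

```python
from datetime import datetime, timezone


def parse_recorded_at(year, month, day, ad_time):
    """Combine the four CSV fields into one timezone-aware UTC datetime.

    ad_time is expected in the form 'HHMMZ', e.g. '1800Z'.
    """
    if len(ad_time) != 5 or not ad_time.endswith("Z"):
        raise ValueError(f"unexpected AD_TIME format: {ad_time!r}")
    return datetime(
        int(year), int(month), int(day),
        hour=int(ad_time[:2]), minute=int(ad_time[2:4]),
        tzinfo=timezone.utc,
    )


# First row of the dataset: 1957-08-08 at 1800Z
ts = parse_recorded_at("1957", "8", "8", "1800Z")
print(ts.isoformat())  # 1957-08-08T18:00:00+00:00
```

Rejecting unexpected `ad_time` formats early keeps a malformed row from silently becoming a wrong timestamp.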

## Create the Database and Users

In [71]:
conn = psycopg2.connect(dbname="johannes", user="johannes")
cursor = conn.cursor()

# CREATE DATABASE cannot run inside a transaction block, so enable autocommit
conn.autocommit = True

cursor.execute('CREATE DATABASE ihw')
cursor.execute("CREATE USER production WITH PASSWORD 'abc123'")
cursor.execute("CREATE USER analyst WITH PASSWORD 'def456'")

conn.close()
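Re-running the cell above fails once the database or users already exist. One possible way around that, sketched here under the assumption that the names are trusted constants, is to look them up in the `pg_database` and `pg_roles` catalogs first. `ensure_database_and_users` is a hypothetical helper, and `cursor` is expected to be a psycopg2 cursor on an autocommit connection.

```python
def ensure_database_and_users(cursor, dbname, users):
    """Create the database and roles only if they do not exist yet.

    `users` maps role name -> password. Identifiers are interpolated
    directly, so they must be trusted constants, never user input.
    """
    cursor.execute("SELECT 1 FROM pg_database WHERE datname = %s", (dbname,))
    if cursor.fetchone() is None:
        cursor.execute(f'CREATE DATABASE "{dbname}"')

    for name, password in users.items():
        cursor.execute("SELECT 1 FROM pg_roles WHERE rolname = %s", (name,))
        if cursor.fetchone() is None:
            # psycopg2 substitutes %s client-side as a quoted string literal
            cursor.execute(f'CREATE USER "{name}" WITH PASSWORD %s', (password,))


# Hypothetical usage, mirroring the cell above:
# ensure_database_and_users(cursor, "ihw",
#                           {"production": "abc123", "analyst": "def456"})
```

For identifiers that are not constants, `psycopg2.sql.Identifier` would be the safer way to build these statements.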

## Create the Table

In [74]:
conn = psycopg2.connect(dbname="ihw", user="johannes")
cursor = conn.cursor()

cursor.execute("""
CREATE TABLE hurricanes(
    fid INTEGER PRIMARY KEY,
    recorded_at TIMESTAMP,
    btid INTEGER,
    name VARCHAR(10),
    lat DECIMAL(4, 1),
    long DECIMAL(4, 1),
    wind_kts SMALLINT,
    pressure INTEGER,
    category VARCHAR(2),
    basin VARCHAR(16),
    shape_leng DECIMAL(8, 6)
)
""")

conn.commit()
conn.close()

## Manage User Privileges

With the table set up, it's now time to configure a user on the Postgres database that can insert, update, and read the data but not delete it. This ensures that someone who gets hold of this user's credentials cannot issue a destructive `DELETE`. Essentially, this is a "data production" user whose job is to write new data and update existing data in the table.

Furthermore, even though it wasn't part of the spec, we know that the IHW team's analysts only run read queries on the data. Since the analysts are only familiar with SQLite, they may not be well-versed in working with a production database, so handing them the general production user would be risky. We therefore give them a separate `analyst` user that can only read.

In [75]:
conn = psycopg2.connect(dbname="ihw", user="johannes")
cursor = conn.cursor()

cursor.execute("""
REVOKE ALL ON hurricanes FROM production;
REVOKE ALL ON hurricanes FROM analyst;
GRANT SELECT, INSERT, UPDATE ON hurricanes TO production;
GRANT SELECT ON hurricanes TO analyst;
""")

conn.commit()
conn.close()
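To confirm that the grants above had the intended effect, the privileges can be queried back. The sketch below uses PostgreSQL's built-in `has_table_privilege` function; the wrapper `table_privileges` is a hypothetical helper, and `cursor` is assumed to be a psycopg2 cursor connected to the `ihw` database.

```python
def table_privileges(cursor, role, table,
                     privileges=("SELECT", "INSERT", "UPDATE", "DELETE")):
    """Return {privilege: bool} for `role` on `table`."""
    result = {}
    for privilege in privileges:
        cursor.execute(
            "SELECT has_table_privilege(%s, %s, %s)",
            (role, table, privilege),
        )
        result[privilege] = cursor.fetchone()[0]
    return result


# Hypothetical usage:
# table_privileges(cursor, "production", "hurricanes")
# table_privileges(cursor, "analyst", "hurricanes")
```

Given the statements above, `production` would be expected to report `DELETE` as `False`, and `analyst` to report only `SELECT` as `True`.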

## Insert the Data

In [77]:
# unable to login to database as user 'production' --> will try via regular py script
conn = psycopg2.connect(dbname='ihw', user='johannes')

In [79]:
conn = psycopg2.connect(dbname='ihw', user='johannes')
cursor = conn.cursor()
conn.autocommit = True

with open('storm_data.csv', 'r') as f:

    next(f) # skip header line
    reader = csv.reader(f)

    rows = []
    for row in reader:
        # combine YEAR, MONTH, DAY and AD_TIME (e.g. '1800Z') into one datetime
        recorded_at = datetime.datetime(int(row[1]), int(row[2]), int(row[3]),
                                        hour=int(row[4][:2]), minute=int(row[4][2:-1]))

        # keep FID, the combined timestamp, and the remaining nine fields
        new_row = [row[0], recorded_at] + row[5:]

        # mogrify quotes the values and returns one "(...)" tuple as bytes
        rows.append(
            cursor.mogrify(
                "(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
                new_row
            ).decode('utf-8')
        )

    # insert all rows with a single multi-row INSERT statement
    cursor.execute("INSERT INTO hurricanes VALUES " + ",".join(rows))
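The cell above builds one very large `INSERT` statement. A possible alternative, sketched here rather than taken from the original notebook, is to write the transformed rows into an in-memory CSV buffer and hand it to Postgres with `COPY`. The helpers `to_table_row` and `build_copy_buffer` are hypothetical; the sample row is the first row of the dataset.

```python
import csv
import io
from datetime import datetime


def to_table_row(row):
    """Map one 14-field CSV row onto the 11 columns of the hurricanes table."""
    recorded_at = datetime(int(row[1]), int(row[2]), int(row[3]),
                           hour=int(row[4][:2]), minute=int(row[4][2:4]))
    return [row[0], recorded_at.isoformat(sep=" ")] + row[5:]


def build_copy_buffer(csv_rows):
    """Write the transformed rows as CSV into a rewound in-memory buffer."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in csv_rows:
        writer.writerow(to_table_row(row))
    buffer.seek(0)
    return buffer


sample = [["2001", "1957", "8", "8", "1800Z", "63", "NOTNAMED", "22.5",
           "-140.0", "50", "0", "TS", "Eastern Pacific", "1.140175"]]
buf = build_copy_buffer(sample)
print(buf.getvalue().strip())
# 2001,1957-08-08 18:00:00,63,NOTNAMED,22.5,-140.0,50,0,TS,Eastern Pacific,1.140175

# Hypothetical usage with a psycopg2 cursor:
# cursor.copy_expert("COPY hurricanes FROM STDIN WITH (FORMAT csv)", buf)
```

`COPY ... FROM STDIN` requires only the `INSERT` privilege on the table, so it would also work for a user without `DELETE` rights.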

In [80]:
# validate data
cursor.execute('select count(1) from hurricanes')
pp.pprint(cursor.fetchall())

[(59228,)]


In [16]:
df.shape

(59228, 14)

-> This matches the number of rows in our original dataset!

## Additional Steps

Here's just a small list of features we can add to our Postgres Database:

- A `readonly` group instead of just one read-only user (`analyst`).
- Try downloading the following [file](https://dq-content.s3.amazonaws.com/251/storm_data_additional.csv) that contains a new dataset that is slightly different.
    - See if you can insert it in your created table.
- Launch your Postgres instance! Right now only you have access to your data on your local machine. To share it, you need to use something like a **cloud-storage solution**.
    - [Launch Postgres on AWS](https://aws.amazon.com/getting-started/tutorials/create-connect-postgresql-db/)
    - [Launch Postgres with Heroku](https://www.heroku.com/postgres)
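For the first item in the list, a `readonly` group could look roughly like the sketch below. It assumes the `hurricanes` table and the `analyst` user created earlier; the role name `readonly` and the helper `apply_statements` are illustrative choices, and `cursor` stands for a psycopg2 cursor connected to the `ihw` database.

```python
# A group is a role that cannot log in; members inherit its privileges.
READONLY_GROUP_SQL = [
    "CREATE ROLE readonly NOLOGIN",
    "GRANT SELECT ON hurricanes TO readonly",
    "GRANT readonly TO analyst",
]


def apply_statements(cursor, statements):
    """Execute each SQL statement in order on the given cursor."""
    for statement in statements:
        cursor.execute(statement)


# Hypothetical usage:
# apply_statements(cursor, READONLY_GROUP_SQL)
# conn.commit()
```

With this layout, a new analyst would only need `GRANT readonly TO <user>` instead of a separate table-level grant.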