# *Storing Storm Data*

### Introduction

Recently, the International Hurricane Watchgroup (IHW) has been asked to update their analysis tools. Because of the increase in public awareness of hurricanes, they are required to be more diligent with the analysis of historical hurricane data they share across the organization. They have asked you, someone with experience in databases, to work with their team to productionize their services.

After you accept the job, their team tells you that they have been having trouble sharing data across teams and keeping it consistent. From what they've told you, their method of sharing the data with their data analysts has been to save a CSV file on their local servers and have every data analyst pull the data down. Then, each analyst uses a local SQLite engine to store the CSV, run their queries, and send their results around.

From what they have told you, you might be thinking that this is an inefficient way of sharing data. To understand what you will be working on, they have sent you a CSV file. Their CSV file contains the following fields:

>fid - ID for the row<br>
year - Recorded year<br>
month - Recorded month<br>
day - Recorded date<br>
ad_time - Recorded time in UTC<br>
btid - Hurricane ID<br>
name - Name of the hurricane<br>
lat - Latitude of the recorded location<br>
long - Longitude of the recorded location<br>
wind_kts - Wind speed in knots<br>
pressure - Atmospheric pressure of the hurricane<br>
cat - Hurricane category<br>
basin - The basin the hurricane is located in<br>
shape_leng - Hurricane shape length<br>


In this Guided Project, you will use the local version of Postgres that you installed in the previous project. This differs from previous Guided Projects because you will be using your own notebook and your own Python environment instead of Dataquest's. Your job is to show that you can create a database that meets the following requirements:

- Database for the IHW to store their tables.
- Table that contains the fields detailed in the CSV file
- User that can update, read, and insert into a table of the data.
- Insert the data into the table.

In the next screen, the IHW has given you a sample file to work with. So, let's get started!

### Downloading CSV File

There are two ways you can work with the CSV file in this project. The first is to download the file from this link. The second is to download it programmatically with Python and read it like any other file, using `urllib` from the Python standard library.

```
import csv
import io
from urllib import request

# Open the URL and wrap the binary response so csv.reader receives text
response = request.urlopen('https://www.example.com/some_file.csv')
reader = csv.reader(io.TextIOWrapper(response))

for line in reader:
    print(line)
```
    
The `urlopen()` function is part of the `request` module in the `urllib` package. It opens the requested URL and returns an HTTP response object. Before we can use the response object as a text file, we need another module, `io`.

We use the `io` module to make the response behave like a text file by wrapping it in `io.TextIOWrapper`. This class wraps a binary stream, such as the response object, and decodes its bytes so that it can be read like a text file opened in Python. You can choose to open the file locally or use `urlopen()` in your own project, but we will assume you are using `urlopen()`.
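
As a minimal offline sketch of the wrapping step, the snippet below substitutes an in-memory `io.BytesIO` stream for the HTTP response. That substitution and the three-column sample bytes are assumptions made so the example runs without network access; with a real response from `urlopen()`, the wrapping call would look the same.

```python
import csv
import io

# Stand-in for the bytes an HTTP response would deliver (illustrative data only).
raw_bytes = b"FID,YEAR,NAME\n2001,1957,NOTNAMED\n2002,1961,PAULINE\n"
fake_response = io.BytesIO(raw_bytes)

# TextIOWrapper decodes the binary stream into text lines for csv.reader.
# newline="" lets the csv module handle line endings itself.
text_stream = io.TextIOWrapper(fake_response, encoding="utf-8", newline="")
rows = list(csv.reader(text_stream))

header, records = rows[0], rows[1:]
print(header)
print(records)
```

Passing `encoding` explicitly avoids depending on the platform default when decoding the bytes.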

In [1]:
import io 
import csv
from urllib import request
import pandas as pd

response = request.urlopen("https://dq-content.s3.amazonaws.com/251/storm_data.csv")

df = pd.read_csv(io.TextIOWrapper(response))
df.head()

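| | FID | YEAR | MONTH | DAY | AD_TIME | BTID | NAME | LAT | LONG | WIND_KTS | PRESSURE | CAT | BASIN | Shape_Leng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 1957 | 8 | 8 | 1800Z | 63 | NOTNAMED | 22.5 | -140.0 | 50 | 0 | TS | Eastern Pacific | 1.140175 |
| 1 | 2002 | 1961 | 10 | 3 | 1200Z | 116 | PAULINE | 22.1 | -140.2 | 45 | 0 | TS | Eastern Pacific | 1.16619 |
| 2 | 2003 | 1962 | 8 | 29 | 0600Z | 124 | C | 18.0 | -140.0 | 45 | 0 | TS | Eastern Pacific | 2.10238 |
| 3 | 2004 | 1967 | 7 | 14 | 0600Z | 168 | DENISE | 16.6 | -139.5 | 45 | 0 | TS | Eastern Pacific | 2.12132 |
| 4 | 2005 | 1972 | 8 | 16 | 1200Z | 251 | DIANA | 18.5 | -139.8 | 70 | 0 | H1 | Eastern Pacific | 1.702939 |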


### Exploring the Columns and Deciding Their Data Types

In [2]:
columns = list(df.columns)
num_columns = columns[:6] + columns[7:-3] + columns[-1:]
print(num_columns)

for item in num_columns:
    print(max(df[item].value_counts().index), len(str(max(df[item].value_counts().index))))

['FID', 'YEAR', 'MONTH', 'DAY', 'AD_TIME', 'BTID', 'LAT', 'LONG', 'WIND_KTS', 'PRESSURE', 'Shape_Leng']
59228 5
2008 4
12 2
31 2
1800Z 5
1410 4
69.0 4
180.0 5
165 3
1024 4
11.18034 8


#### NUMERICAL COLUMNS DATATYPES:

- For `FID`:
    - We will use the INTEGER datatype, since its largest value is 59228, which exceeds the SMALLINT maximum of 32767.<br><br>
- Columns `YEAR`, `MONTH`, and `DAY` together represent a date, and `AD_TIME` records the time in [UTC (Coordinated Universal Time)](https://en.wikipedia.org/wiki/Coordinated_Universal_Time):
    - Here, we will combine all of them into a single column and use the TIMESTAMP datatype for this column.<br><br>
- For `BTID`, `WIND_KTS` and `PRESSURE`:
    - We will use SMALLINT, since their maximum values are 1410, 165 and 1024 respectively, all within the SMALLINT maximum of 32767.<br><br>
- For `LAT` and `LONG`:
    - We will use the DECIMAL datatype with precision 4 and scale 1, since the values have at most 3 digits before the decimal point and 1 digit after it.<br><br>
- For `Shape_Leng`:
    - We will use the DECIMAL datatype with precision 8 and scale 6, since the values have at most 2 digits before the decimal point and 6 digits after it.
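
The digit-counting reasoning in the list above can be sketched in code. The helpers below are hypothetical (they are not part of the notebook) and assume every value is a finite number whose `str()` form shows all the digits of interest; the sample values are taken from the outputs shown earlier.

```python
from decimal import Decimal

SMALLINT_MIN, SMALLINT_MAX = -32768, 32767  # PostgreSQL SMALLINT range


def fits_smallint(values):
    """True if every integer in `values` fits in a PostgreSQL SMALLINT."""
    return all(SMALLINT_MIN <= v <= SMALLINT_MAX for v in values)


def decimal_precision_scale(values):
    """Smallest (precision, scale) able to hold every finite value."""
    int_digits = 0  # most digits seen before the decimal point
    scale = 0       # most digits seen after the decimal point
    for value in values:
        _, digits, exponent = Decimal(str(value)).as_tuple()
        scale = max(scale, -exponent)
        int_digits = max(int_digits, len(digits) + exponent)
    return int_digits + scale, scale


print(fits_smallint([1410, 165, 1024]))                  # BTID, WIND_KTS, PRESSURE maxima
print(fits_smallint([59228]))                            # FID maximum
print(decimal_precision_scale([22.5, -140.2, 180.0]))    # LAT / LONG samples
print(decimal_precision_scale([1.140175, 11.18034]))     # Shape_Leng samples
```

Checking the full column, rather than only its maximum, matters for signed values: the widest longitude may be negative.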

In [3]:
str_columns = [columns[x] for x in (6, 11, 12)]
print(str_columns)

for item in str_columns:
    print(max([(len(x), x) for x in df[item].unique()]))

['NAME', 'CAT', 'BASIN']
(9, 'SEBASTIEN')
(2, 'TS')
(15, 'Eastern Pacific')


#### STRING COLUMNS DATATYPES
- For `NAME`:
    - We will use VARCHAR(10) since max length is 9
- For `CAT`:
    - We will use VARCHAR(2) since max length is 2
- For `BASIN`:
    - We will use VARCHAR(16) since max length is 15
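
The length check behind these choices can also be written as a small helper. This is a sketch under assumptions: `longest_string` is a hypothetical name, and skipping non-string entries (such as a `NaN` for a missing value) is a design choice, since calling `len()` on a float would raise an error.

```python
def longest_string(values):
    """Return (length, value) for the longest string, ignoring non-strings."""
    strings = [v for v in values if isinstance(v, str)]
    return max((len(v), v) for v in strings)


# Sample values from the notebook output, plus a missing value to show the skip.
names = ["NOTNAMED", "PAULINE", "C", "SEBASTIEN", float("nan")]
length, value = longest_string(names)
print(length, value)
```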

### Creating the Table

In [4]:
# Create Database First

import psycopg2
from datetime import datetime

conn = psycopg2.connect("dbname=postgres user=postgres password=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

cur.execute("DROP DATABASE IF EXISTS ihw")
cur.execute("CREATE DATABASE ihw")
conn.close()

In [5]:
conn = psycopg2.connect("dbname=ihw user=postgres password=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS hurricanes")

cur.execute("""
    CREATE TABLE hurricanes (
        fid INTEGER PRIMARY KEY,
        date TIMESTAMP,
        btid SMALLINT,
        name VARCHAR(10),
        lat DECIMAL(4, 1), 
        long DECIMAL(4, 1), 
        wind_kts SMALLINT, 
        pressure SMALLINT,
        category VARCHAR(2),
        basin VARCHAR(16),
        shape_length  DECIMAL(8, 6)
        )
    """)

conn.close()

### Creating Users

With a table set up, it's now time to create a user on the Postgres database that can insert, update, and read the data but not delete it. This ensures that someone who gets hold of this user's credentials cannot issue a destructive command. Essentially, this is like creating a "data production" user whose job is to always write new and existing data to the table.

Furthermore, even though it wasn't part of the spec, we know that the IHW team's analysts only run read queries on the data. Also, since the analysts only know SQLite queries, they may not be well-versed in a production database. As such, it might be risky to hand them a general production user for querying the data.

From what you have learned about security and restricting well-meaning users, it might be a good idea to restrict those analysts from some commands. Those commands can be anything from adding new data to the table to changing existing values. You should decide which commands should be granted to the analyst user.

In [6]:
conn = psycopg2.connect("dbname=ihw user=postgres password=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE USER ihw_production WITH PASSWORD 'ihw.production.whi'")
cur.execute("CREATE USER ihw_analyst WITH PASSWORD 'ihw.analyst.whi'")

cur.execute("REVOKE ALL ON hurricanes FROM ihw_production")
cur.execute("REVOKE ALL ON hurricanes FROM ihw_analyst")
cur.execute("GRANT INSERT, UPDATE, SELECT ON hurricanes TO ihw_production")
cur.execute("GRANT SELECT ON hurricanes TO ihw_analyst")
conn.close()

### Creating Read-Only Group

In [8]:
conn = psycopg2.connect("dbname=ihw user=postgres password=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()
cur.execute("DROP GROUP IF EXISTS analysts")

cur.execute("CREATE GROUP analysts NOLOGIN")
conn.close()
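
The cell above creates the `analysts` group but does not grant it any privileges or members. One possible follow-up, sketched here as an assumption rather than as part of the original project, is to give the group `SELECT` on the table and enroll `ihw_analyst` in it. The helper `readonly_group_statements` is hypothetical and only builds the SQL strings; the identifiers are interpolated directly, so they must be trusted constants, never user input.

```python
def readonly_group_statements(group, table, members):
    """Build the SQL that makes `group` read-only on `table` and enrolls `members`."""
    statements = [
        f"REVOKE ALL ON {table} FROM {group}",
        f"GRANT SELECT ON {table} TO {group}",
    ]
    # GRANT <group> TO <role> adds the role as a member of the group.
    statements += [f"GRANT {group} TO {member}" for member in members]
    return statements


statements = readonly_group_statements("analysts", "hurricanes", ["ihw_analyst"])
for statement in statements:
    # With a live superuser connection, each line would go to cur.execute(statement).
    print(statement)
```

Granting through the group means a new analyst only needs to be added to `analysts` instead of receiving table privileges individually.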

### Insert the Data

In [9]:
conn = psycopg2.connect("dbname=ihw user=ihw_production password=ihw.production.whi host=localhost")
conn.autocommit = True
cur = conn.cursor()

response = request.urlopen("https://dq-content.s3.amazonaws.com/251/storm_data.csv")
reader = csv.reader(io.TextIOWrapper(response))
next(reader)

mogrified_values = []

for row in reader:
    date = datetime(int(row[1]), int(row[2]), int(row[3]), hour=int(row[4][:2]), minute=int(row[4][2:-1]))
    updated_row = [row[0], date] + row[5:]
#     print(updated_row)

    mogrified = cur.mogrify("(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", updated_row).decode('utf-8')
    mogrified_values.append(mogrified)
    
cur.execute("INSERT INTO hurricanes VALUES " + ",".join(mogrified_values))
conn.close()
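
The row transformation inside the loop above can be isolated into a small helper, which makes it easy to check without a database connection. This is a sketch, not the notebook's own code: `parse_row` is a hypothetical name, and it assumes `AD_TIME` always has the form `HHMMZ`, as in the sample rows.

```python
from datetime import datetime


def parse_row(row):
    """Collapse YEAR, MONTH, DAY and AD_TIME into one datetime; keep the other fields."""
    fid, year, month, day, ad_time = row[:5]
    hhmm = ad_time.rstrip("Z")  # "1800Z" -> "1800"
    recorded_at = datetime(
        int(year), int(month), int(day),
        hour=int(hhmm[:2]), minute=int(hhmm[2:]),
    )
    # 2 leading fields + the 9 remaining CSV fields = 11 table columns
    return [fid, recorded_at] + row[5:]


# First data row of the CSV, as shown in the preview earlier in the notebook.
sample_row = ["2001", "1957", "8", "8", "1800Z", "63", "NOTNAMED", "22.5",
              "-140.0", "50", "0", "TS", "Eastern Pacific", "1.140175"]
parsed = parse_row(sample_row)
print(parsed)
```

The returned list lines up with the eleven `%s` placeholders used when inserting into `hurricanes`.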

### Confirming Data is read into PostgreSQL

In [11]:
conn = psycopg2.connect("dbname=ihw user=ihw_analyst password=ihw.analyst.whi host=localhost")
# conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT * FROM hurricanes limit 3")
print(cur.fetchall())

conn.close()

[(2001, datetime.datetime(1957, 8, 8, 18, 0), 63, 'NOTNAMED', Decimal('22.5'), Decimal('-140.0'), 50, 0, 'TS', 'Eastern Pacific', Decimal('1.140175')), (2002, datetime.datetime(1961, 10, 3, 12, 0), 116, 'PAULINE', Decimal('22.1'), Decimal('-140.2'), 45, 0, 'TS', 'Eastern Pacific', Decimal('1.166190')), (2003, datetime.datetime(1962, 8, 29, 6, 0), 124, 'C', Decimal('18.0'), Decimal('-140.0'), 45, 0, 'TS', 'Eastern Pacific', Decimal('2.102380'))]


In [None]:
conn = psycopg2.connect("dbname=ihw user=postgres password=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

cur.execute("REVOKE CONNECT ON DATABASE ihw FROM public;")
conn.close()

In [None]:
conn = psycopg2.connect("dbname=ihw user=postgres password=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""SELECT pid, pg_terminate_backend(pid) 
FROM pg_stat_activity 
WHERE datname = 'ihw' AND pid <> pg_backend_pid();""")
conn.close()