# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [3]:
# Do all imports and installs here
import pandas as pd
import re
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import types as t
from datetime import datetime, timedelta

# all_etl contains methods used to initiate the Spark session
# hurdat_etl and nexrad_etl contain the ETL steps necessary to process and QC the HURDAT and NEXRAD data
from all_etl import *
from hurdat_etl import *
from nexrad_etl import *
from sql_queries import *

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

This data supports historical reconstruction of hurricanes. It would serve as the underlying database for analytics and for developing tropical weather models.

The data is spatially and temporally indexed, so a user can retrieve records within a given space and time window. 

The main tools I used are Spark for data processing, ARM's pyart package to process radar files, and SidewalkLabs/Google's S2 package for spatial indexing.

https://medium.com/@ligz/installing-standalone-spark-on-windows-made-easy-with-powershell-7f7309799bc7
https://arm-doe.github.io/pyart/
https://www.sidewalklabs.com/blog/s2-cells-and-space-filling-curves-keys-to-building-better-digital-map-tools-for-cities/


#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

HURDAT
NOAA's Hurricane Research Division maintains the HURDAT project, which analyzes tropical storm tracks and strengths. The data includes the geographic position of each storm at 6-hour intervals, along with the wind speed and air pressure at each reading.

At the time of this project, the HURDAT database includes about 52 thousand rows (storm track points).

https://www.aoml.noaa.gov/hrd/data_sub/re_anal.html

NEXRAD
The National Weather Service operates NEXRAD Doppler radar stations across the US, which operate continuously throughout the day. The primary products of Doppler radar measurements are the reflectivity and velocity of the atmosphere.

A single NEXRAD file contains 11 million records, representing reflectivity/velocity measurements for one radar station at one point in time. Thousands of NEXRAD files are generated per day, so this project uses only a single file as a proof of concept.

https://www.ncdc.noaa.gov/data-access/radar-data/noaa-big-data-project
https://s3.amazonaws.com/noaa-nexrad-level2/index.html
https://www.nsstc.uah.edu/users/brian.freitag/AWS_Radar_with_Python.html

In [5]:
# Read in the data here

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

HURDAT
* Header rows are interspersed among the data rows, so the file cannot be read as a single flat table.
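
One way to handle the interspersed header rows is to carry the most recent header forward while reading the file. The sketch below is not the code in `hurdat_etl`; it assumes the comma-delimited HURDAT2 layout (a header row with storm ID, name, and row count, followed by track rows that start with an 8-digit date), and the function names, output field names, and sample lines are made up for illustration.

```python
def to_signed_degrees(text):
    """Convert text such as '59.5W' to signed decimal degrees (-59.5)."""
    value, hemisphere = float(text[:-1]), text[-1].upper()
    return -value if hemisphere in ("S", "W") else value


def parse_hurdat_lines(lines):
    """Attach each track row to the most recent header row above it."""
    records = []
    current_storm = None
    for raw in lines:
        fields = [f.strip() for f in raw.strip().rstrip(",").split(",")]
        if fields == [""]:
            continue  # skip blank lines
        # Assumption: header rows start with a two-letter basin code,
        # while track rows start with a numeric date.
        if fields[0][:2].isalpha():
            current_storm = {"storm_id": fields[0], "name": fields[1]}
            continue
        if current_storm is None:
            raise ValueError("track row found before any header row")
        records.append({
            "storm_id": current_storm["storm_id"],
            "name": current_storm["name"],
            "date": fields[0],
            "time": fields[1],
            "status": fields[3],
            "lat": to_signed_degrees(fields[4]),
            "lon": to_signed_degrees(fields[5]),
            "max_wind_kt": int(fields[6]),
            "min_pressure_mb": int(fields[7]),
        })
    return records


# Synthetic sample lines (not real storms), shaped like the assumed layout.
sample_lines = [
    "AL011900, EXAMPLE, 2,",
    "19000101, 0000,  , TS, 15.0N, 59.5W, 45, 1006,",
    "19000101, 0600,  , TS, 15.5N, 60.0W, 50, 1004,",
    "AL021900, SECOND, 1,",
    "19000201, 1200,  , HU, 20.0N, 70.0W, 80, 980,",
]
rows = parse_hurdat_lines(sample_lines)
```

Each output row carries its parent storm's ID and name, so the result can be loaded as a flat table. Missing-value sentinels in the source are passed through unchanged here and would need separate handling.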

NEXRAD
* Data is in azimuth-elevation (polar) coordinates rather than lat/lon.
* Data is stored as a 6480 by 1832 matrix and needs to be reshaped.
* The data set is sparse: a value is recorded only where there is reflectivity or velocity to record.
* Time stamps are offsets from the scan start time, which is embedded in text and needs to be extracted.
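
The reshaping and time-stamp issues can be sketched as follows. This is a minimal illustration rather than the code in `nexrad_etl`: the function names, the use of `None` for empty gates, and the `seconds since <ISO time>` shape of the start-time text are all assumptions.

```python
import re
from datetime import datetime, timedelta


def parse_start_time(units_text):
    """Extract the scan start time from text like
    'seconds since 2017-08-25T12:00:00Z' (assumed format)."""
    match = re.search(r"(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})", units_text)
    if match is None:
        raise ValueError(f"no start time found in {units_text!r}")
    return datetime.strptime(" ".join(match.groups()), "%Y-%m-%d %H:%M:%S")


def to_long_records(grid, ray_offsets_s, start_time):
    """Flatten a rays-by-gates grid into one record per non-missing sample.

    grid          : list of rows, one per ray; None marks an empty gate
    ray_offsets_s : seconds after start_time at which each ray was measured
    """
    records = []
    for ray_index, (row, offset) in enumerate(zip(grid, ray_offsets_s)):
        ray_time = start_time + timedelta(seconds=offset)
        for gate_index, value in enumerate(row):
            if value is None:
                continue  # sparse data: skip gates with nothing recorded
            records.append((ray_index, gate_index, ray_time, value))
    return records


# Tiny made-up grid standing in for the full 6480 by 1832 matrix.
start = parse_start_time("seconds since 2017-08-25T12:00:00Z")
records = to_long_records([[None, 10.0], [5.5, None]], [0.0, 1.5], start)
```

Dropping the empty gates during the reshape keeps the long-form table far smaller than the full matrix.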


    
#### Cleaning Steps
Document steps necessary to clean the data

* Transformed the coordinate system from polar to grid (NEXRAD).
* Converted lat/lon to S2 cell.
* Reshaped the data.
* Flagged header rows and iterated over the table to associate child rows with their parent row (HURDAT).
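
The polar-to-grid step can be illustrated with the standard radar beam geometry. The sketch below is an assumption about how such a conversion might look, not the project's implementation: it uses the common 4/3 effective Earth radius model for beam height and a simple flat-earth offset for lat/lon, which is only a reasonable approximation close to the radar. The function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0
# 4/3 effective radius: a common convention for standard atmospheric refraction.
EFFECTIVE_RADIUS_M = EARTH_RADIUS_M * 4.0 / 3.0


def beam_to_offsets(range_m, azimuth_deg, elevation_deg):
    """Return (east_m, north_m, height_m) of a gate relative to the radar."""
    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    r_eff = EFFECTIVE_RADIUS_M
    # Height of the beam above the antenna.
    height = math.sqrt(
        range_m ** 2 + r_eff ** 2 + 2.0 * range_m * r_eff * math.sin(el)
    ) - r_eff
    # Distance along the ground from the radar to the point under the gate.
    ground = r_eff * math.asin(range_m * math.cos(el) / (r_eff + height))
    # Azimuth is measured clockwise from north.
    return ground * math.sin(az), ground * math.cos(az), height


def offsets_to_lat_lon(east_m, north_m, radar_lat_deg, radar_lon_deg):
    """Approximate lat/lon from local offsets (flat-earth approximation)."""
    lat = radar_lat_deg + math.degrees(north_m / EARTH_RADIUS_M)
    lon = radar_lon_deg + math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(radar_lat_deg)))
    )
    return lat, lon


east, north, height = beam_to_offsets(1000.0, 90.0, 0.0)
```

The resulting lat/lon pair is what the S2 cell conversion would then take as input.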

In [None]:
# Performing cleaning tasks here

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

HURDAT
* fact table: track point: wind speeds, pressure, lat, lon, S2 cell, time, storm
* dim table: storm: storm info
* dim table: spatial: S2 cell, centroid, parent cell
* dim table: time: datetime hierarchy

NEXRAD
* fact table: sample: reflectivity, velocity, lat, lon, S2 cell, alt, time, station
* dim table: station: station info
* dim table: spatial: S2 cell, centroid, parent cell
* dim table: time: datetime hierarchy
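
Both models share a time dimension built from a datetime hierarchy. As a rough sketch of what one row of that table could hold, the function below splits a timestamp into its levels. The column names and the `time_key` format are hypothetical and are not taken from `sql_queries`.

```python
from datetime import datetime


def time_dimension_row(ts):
    """Break a timestamp into the hierarchy levels of a time dimension row."""
    _, iso_week, iso_weekday = ts.isocalendar()
    return {
        "time_key": ts.strftime("%Y%m%d%H%M%S"),  # hypothetical key format
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "minute": ts.minute,
        "week": iso_week,        # ISO week number
        "weekday": iso_weekday,  # ISO weekday, Monday = 1
    }


row = time_dimension_row(datetime(2017, 8, 25, 12, 0, 0))
```

Fact rows would then reference `time_key` instead of repeating the individual date parts.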

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
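
As an illustration of the count and uniqueness checks, the sketch below compares a source row count with the loaded row count and looks for duplicate keys. The function names and the sample rows are hypothetical, and the checks run on plain Python objects rather than Spark DataFrames.

```python
from collections import Counter


def check_row_count(source_count, loaded_count, table_name):
    """Raise if a table is empty or its row count differs from the source."""
    if loaded_count == 0:
        raise ValueError(f"{table_name}: no rows loaded")
    if loaded_count != source_count:
        raise ValueError(
            f"{table_name}: expected {source_count} rows, loaded {loaded_count}"
        )


def find_duplicate_keys(rows, key_fields):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up rows: the first two share the same (storm_id, time) key.
sample_rows = [
    {"storm_id": "A", "time": "0000"},
    {"storm_id": "A", "time": "0000"},
    {"storm_id": "A", "time": "0600"},
]
duplicates = find_duplicate_keys(sample_rows, ["storm_id", "time"])
```

For HURDAT, the source count used in such a check would need to exclude the header rows, since they do not become track points.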
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
  * The data was increased by 100x.
  * The data populates a dashboard that must be updated on a daily basis by 7am every day.
  * The database needed to be accessed by 100+ people.