# Homework 9: Surfs Up!
### Grant T. Aguinaldo


<img src='images/surfs-up.jpeg' />

Congratulations! You've decided to treat yourself to a long holiday vacation in Honolulu, Hawaii! To help with your trip planning, you decided to do some climate analysis on the area. Because you are such an awesome person, you have decided to share your ninja analytical skills with the community by providing a climate analysis API. The following outlines what you need to do.

## Step 1 - Data Engineering

The climate data for Hawaii is provided through two CSV files. Start by using Python and Pandas to inspect the content of these files and clean the data.

* Create a Jupyter Notebook file called `data_engineering.ipynb` and use this to complete all of your Data Engineering tasks.

* Use Pandas to read in the measurement and station CSV files as DataFrames.

* Inspect the data for NaNs and missing values. You must decide what to do with this data.

* Save your cleaned CSV files with the prefix `clean_`.
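
The inspect-and-save steps above could look roughly like the sketch below. It uses a tiny inline frame in place of the real CSV files, and the values in it are illustrative. Dropping rows with a missing `prcp` is only one possible policy, and the output name `clean_measurements.csv` is an assumption based on the `clean_` prefix requirement.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for pd.read_csv(...) on the measurements file; values are illustrative.
measurements = pd.DataFrame({
    "station": ["USC00519397", "USC00519397", "USC00513117"],
    "date": ["2010-01-01", "2010-01-06", "2010-01-01"],
    "prcp": [0.08, np.nan, 0.28],
    "tobs": [65, 73, 67],
})

# Count missing values per column to see where cleaning is needed.
missing_per_column = measurements.isnull().sum()
print(missing_per_column)

# One possible policy (an assumption): drop rows that lack a precipitation reading.
clean = measurements.dropna(subset=["prcp"])

# Save with the required `clean_` prefix (written to a temporary directory here).
out_path = os.path.join(tempfile.mkdtemp(), "clean_measurements.csv")
clean.to_csv(out_path, index=False)
```

Filling the gaps with a per-station average instead of dropping rows is an equally valid choice; it keeps every temperature observation at the cost of introducing estimated precipitation values.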

---

## Step 2 - Database Engineering

Use SQLAlchemy to model your table schemas and create a sqlite database for your tables. You will need one table for measurements and one for stations.

* Create a Jupyter Notebook called `database_engineering.ipynb` and use this to complete all of your Database Engineering work.

* Use Pandas to read your cleaned measurements and stations CSV data.

* Use the `engine` and connection string to create a database called `hawaii.sqlite`.

* Use `declarative_base` and create ORM classes for each table.

  * You will need a class for `Measurement` and for `Station`.

  * Make sure to define your primary keys.

* Once you have your ORM classes defined, create the tables in the database using `create_all`.
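
A minimal sketch of the ORM setup described above, assuming SQLAlchemy 1.4 or later (where `declarative_base` is importable from `sqlalchemy.orm`). The table names, the surrogate `id` primary keys, and the column types are assumptions chosen to match the CSV columns. An in-memory database is used so the sketch leaves no file behind.

```python
from sqlalchemy import Column, Float, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Station(Base):
    __tablename__ = "station"
    id = Column(Integer, primary_key=True)  # surrogate key (assumption)
    station = Column(String)
    name = Column(String)
    latitude = Column(Float)
    longitude = Column(Float)
    elevation = Column(Float)


class Measurement(Base):
    __tablename__ = "measurement"
    id = Column(Integer, primary_key=True)  # surrogate key (assumption)
    station = Column(String)
    date = Column(String)  # kept as 'YYYY-MM-DD' text
    prcp = Column(Float)
    tobs = Column(Integer)


# "sqlite://" is in-memory; "sqlite:///hawaii.sqlite" would create the file instead.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

table_names = sorted(inspect(engine).get_table_names())
print(table_names)
```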

---

## Step 3 - Climate Analysis and Exploration

You are now ready to use Python and SQLAlchemy to do basic climate analysis and data exploration on your new weather station tables. All of the following analysis should be completed using SQLAlchemy ORM queries, Pandas, and Matplotlib.

* Create a Jupyter Notebook file called `climate_analysis.ipynb` and use it to complete your climate analysis and data exploration.

* Choose a start date and end date for your trip. Make sure that your vacation range is approximately 3-15 days total.

* Use SQLAlchemy `create_engine` to connect to your sqlite database.

* Use SQLAlchemy `automap_base()` to reflect your tables into classes and save a reference to those classes called `Station` and `Measurement`.

### Precipitation Analysis

* Design a query to retrieve the last 12 months of precipitation data.

* Select only the `date` and `prcp` values.

* Load the query results into a Pandas DataFrame and set the index to the date column.

* Plot the results using the DataFrame `plot` method.

<center><img src='images/precip.png' /></center>

* Use Pandas to print the summary statistics for the precipitation data.
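
One way to work out the 12-month window is sketched below. The `rows` list is illustrative and stands in for the `(date, prcp)` tuples an ORM query would return; in the notebook the same cutoff string would go into the query's filter instead.

```python
import datetime as dt

import pandas as pd

# Illustrative rows standing in for the (date, prcp) tuples returned by a query.
rows = [
    ("2016-08-22", 0.40),
    ("2016-08-23", 0.00),
    ("2017-01-15", 0.12),
    ("2017-08-23", 0.45),
]

# The window ends at the most recent date in the data and starts 365 days earlier.
last_date = max(dt.date.fromisoformat(d) for d, _ in rows)
first_date = last_date - dt.timedelta(days=365)

# ISO dates sort correctly as strings, so a plain string comparison works.
recent = [(d, p) for d, p in rows if d >= first_date.isoformat()]

precip = pd.DataFrame(recent, columns=["date", "prcp"]).set_index("date")
summary = precip.describe()
print(summary)
```

Calling `precip.plot()` on the resulting frame would then draw the precipitation chart.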

### Station Analysis

* Design a query to calculate the total number of stations.

* Design a query to find the most active stations.

  * List the stations and observation counts in descending order

  * Which station has the highest number of observations?

* Design a query to retrieve the last 12 months of temperature observation data (tobs).

  * Filter by the station with the highest number of observations.

  * Plot the results as a histogram with `bins=12`.

  <center><img src='images/temp_hist.png' height="400px" /></center>
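
The "most active stations" query can be sketched with `func.count` and `group_by`. This assumes SQLAlchemy 1.4 or later, a trimmed-down `Measurement` class, and a handful of illustrative rows in an in-memory database rather than the real `hawaii.sqlite`.

```python
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Measurement(Base):
    __tablename__ = "measurement"
    id = Column(Integer, primary_key=True)
    station = Column(String)
    tobs = Column(Integer)


engine = create_engine("sqlite://")  # in-memory stand-in for hawaii.sqlite
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Illustrative rows only; the real table has thousands of observations.
    session.add_all([
        Measurement(station="USC00519281", tobs=70),
        Measurement(station="USC00519281", tobs=72),
        Measurement(station="USC00519397", tobs=65),
    ])
    session.commit()

    # Count observations per station, most active first.
    activity = (
        session.query(Measurement.station, func.count(Measurement.id))
        .group_by(Measurement.station)
        .order_by(func.count(Measurement.id).desc())
        .all()
    )

most_active = activity[0][0]
print([tuple(row) for row in activity])
```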

### Temperature Analysis

* Write a function called `calc_temps` that will accept a start date and end date in the format `%Y-%m-%d` and return the minimum, average, and maximum temperatures for that range of dates.

* Use the `calc_temps` function to calculate the min, avg, and max temperatures for your trip using the matching dates from the previous year (e.g. use "2017-01-01" if your trip start date was "2018-01-01").

* Plot the min, avg, and max temperature from your previous query as a bar chart.

  * Use the average temperature as the bar height.

  * Use the peak-to-peak (tmax-tmin) value as the y error bar (yerr).

<center><img src='images/temp_avg.png' height="400px"/></center>
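
A possible shape for `calc_temps` is sketched below, assuming SQLAlchemy 1.4 or later and dates stored as `YYYY-MM-DD` text. Passing the `session` in as an argument is a choice made here to keep the sketch self-contained; the assignment only specifies the start and end dates. The rows are illustrative.

```python
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Measurement(Base):
    __tablename__ = "measurement"
    id = Column(Integer, primary_key=True)
    date = Column(String)  # stored as 'YYYY-MM-DD' text
    tobs = Column(Integer)


def calc_temps(session, start_date, end_date):
    """Return (tmin, tavg, tmax) for dates in [start_date, end_date]."""
    return (
        session.query(
            func.min(Measurement.tobs),
            func.avg(Measurement.tobs),
            func.max(Measurement.tobs),
        )
        .filter(Measurement.date >= start_date)
        .filter(Measurement.date <= end_date)
        .one()
    )


engine = create_engine("sqlite://")  # in-memory stand-in for hawaii.sqlite
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        Measurement(date="2017-01-01", tobs=62),
        Measurement(date="2017-01-02", tobs=70),
        Measurement(date="2017-01-03", tobs=78),
        Measurement(date="2017-02-01", tobs=90),  # outside the range below
    ])
    session.commit()
    tmin, tavg, tmax = calc_temps(session, "2017-01-01", "2017-01-07")

# The bar chart would use tavg as the bar height and (tmax - tmin) as yerr.
yerr = tmax - tmin
print(tmin, tavg, tmax, yerr)
```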


### Optional Recommended Analysis

* The following are optional challenge queries. These are highly recommended to attempt, but not required for the homework.

  * Calculate the rainfall per weather station using the previous year's matching dates.

* Calculate the daily normals. Normals are the averages for min, avg, and max temperatures.

  * Create a function called `daily_normals` that will calculate the daily normals for a specific date. This date string will be in the format `%m-%d`. Be sure to use all historic tobs that match that date string.

  * Create a list of dates for your trip in the format `%m-%d`. Use the `daily_normals` function to calculate the normals for each date string and append the results to a list.

  * Load the list of daily normals into a Pandas DataFrame and set the index equal to the date.

  * Use Pandas to plot an area plot (`stacked=False`) for the daily normals.

  <center><img src="images/daily_normals.png" /></center>
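
The daily-normals idea can be sketched with SQLite's `strftime`, reached through SQLAlchemy's generic `func` namespace, to match on month and day across every year. As before, SQLAlchemy 1.4 or later is assumed, the `session` argument is an addition for self-containment, and the rows are illustrative.

```python
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Measurement(Base):
    __tablename__ = "measurement"
    id = Column(Integer, primary_key=True)
    date = Column(String)  # stored as 'YYYY-MM-DD' text
    tobs = Column(Integer)


def daily_normals(session, month_day):
    """Return (tmin, tavg, tmax) over all years for a '%m-%d' date string."""
    row = (
        session.query(
            func.min(Measurement.tobs),
            func.avg(Measurement.tobs),
            func.max(Measurement.tobs),
        )
        # strftime is evaluated by SQLite, so every year's matching day is included.
        .filter(func.strftime("%m-%d", Measurement.date) == month_day)
        .one()
    )
    return tuple(row)


engine = create_engine("sqlite://")  # in-memory stand-in for hawaii.sqlite
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        Measurement(date="2015-01-01", tobs=60),
        Measurement(date="2016-01-01", tobs=66),
        Measurement(date="2017-01-01", tobs=72),
        Measurement(date="2017-01-02", tobs=80),
    ])
    session.commit()

    trip_dates = ["01-01", "01-02"]
    normals = [daily_normals(session, d) for d in trip_dates]

print(normals)
```

The `normals` list can be passed to `pd.DataFrame` with `trip_dates` as the index before plotting.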

---

## Step 4 - Climate App

Now that you have completed your initial analysis, design a Flask API based on the queries that you have just developed.

* Use Flask to create your routes.

### Routes

* `/api/v1.0/precipitation`

  * Query for the dates and temperature observations from the last year.

  * Convert the query results to a Dictionary using `date` as the key and `tobs` as the value.

  * Return the JSON representation of your dictionary.

* `/api/v1.0/stations`

  * Return a JSON list of stations from the dataset.

* `/api/v1.0/tobs`

  * Return a JSON list of Temperature Observations (tobs) for the previous year.

* `/api/v1.0/<start>` and `/api/v1.0/<start>/<end>`

  * Return a JSON list of the minimum temperature, the average temperature, and the maximum temperature for a given start or start-end range.

  * When given the start only, calculate `TMIN`, `TAVG`, and `TMAX` for all dates greater than and equal to the start date.

  * When given the start and the end date, calculate the `TMIN`, `TAVG`, and `TMAX` for dates between the start and end date inclusive.
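
A rough sketch of how the start and start/end routes might be wired up is shown below. The `OBSERVATIONS` dictionary, the `temp_stats` helper, and the 404 response for an empty range are all hypothetical; in the real app the data would come from the database queries developed in Step 3.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative in-memory data standing in for the database (date -> tobs).
OBSERVATIONS = {"2017-01-01": 62, "2017-01-02": 70, "2017-01-03": 78}


def temp_stats(start, end=None):
    """Return TMIN, TAVG, TMAX for dates >= start and, if given, <= end."""
    temps = [
        t for d, t in OBSERVATIONS.items()
        if d >= start and (end is None or d <= end)
    ]
    if not temps:
        return None
    return {"TMIN": min(temps), "TAVG": sum(temps) / len(temps), "TMAX": max(temps)}


@app.route("/api/v1.0/precipitation")
def precipitation():
    # Dictionary keyed by date, returned as JSON.
    return jsonify(OBSERVATIONS)


@app.route("/api/v1.0/<start>")
@app.route("/api/v1.0/<start>/<end>")
def stats(start, end=None):
    result = temp_stats(start, end)
    if result is None:
        return jsonify({"error": "no observations in range"}), 404
    return jsonify(result)
```

Stacking two `@app.route` decorators on one view lets the same function serve both the start-only and the start/end URL, with `end` defaulting to `None`.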

## Hints

* You will need to join the station and measurement tables for some of the analysis queries.

* Use Flask `jsonify` to convert your API data into a valid JSON response object.

## Copyright

Coding Boot Camp © 2017. All Rights Reserved.


In [1]:
import pandas as pd
import re
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
%matplotlib inline

In [2]:
measure = './Resources/hawaii_measurements.csv'
station = './Resources/hawaii_stations.csv'

In [3]:
df_measure = pd.read_csv(measure)
df_station = pd.read_csv(station)

In [4]:
df_measure.head()

Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [5]:
df_measure.shape

(19550, 4)

In [6]:
df_station

Unnamed: 0,station,name,latitude,longitude,elevation
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0
1,USC00513117,"KANEOHE 838.1, HI US",21.4234,-157.8015,14.6
2,USC00514830,"KUALOA RANCH HEADQUARTERS 886.9, HI US",21.5213,-157.8374,7.0
3,USC00517948,"PEARL CITY, HI US",21.3934,-157.9751,11.9
4,USC00518838,"UPPER WAHIAWA 874.3, HI US",21.4992,-158.0111,306.6
5,USC00519523,"WAIMANALO EXPERIMENTAL FARM, HI US",21.33556,-157.71139,19.5
6,USC00519281,"WAIHEE 837.5, HI US",21.45167,-157.84889,32.9
7,USC00511918,"HONOLULU OBSERVATORY 702.2, HI US",21.3152,-157.9992,0.9
8,USC00516128,"MANOA LYON ARBO 785.2, HI US",21.3331,-157.8025,152.4


***
### Descriptions of the columns of the combined dataset.

For this analysis, we'll assume the following about the data set. 

* First, the column `station` is the US weather station number. 
* Second, the column `name` is the station name, including the number as well as the state and the country of the station. 
* Third, the `latitude` and `longitude` are the coordinates of the station.
* Fourth, the `elevation` is provided in the units of feet.
* Fifth, the column `prcp` is the amount of precipitation measured by that station on the given day (noted in column `date`).
* Finally, the column `tobs` is the temperature observed at the station, in Fahrenheit.  
***

In [7]:
df_station.shape

(9, 5)

In [8]:
df = pd.merge(df_station, df_measure, on = 'station', how='inner')
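
As a hedged aside, `pd.merge` accepts a `validate` argument that can guard a join like this one. The sketch below uses two small illustrative frames, not the real data, and checks that each station id appears only once on the station side.

```python
import pandas as pd

# Illustrative frames; the tobs value for the second station is made up.
stations = pd.DataFrame({
    "station": ["USC00519397", "USC00513117"],
    "name": ["WAIKIKI 717.2, HI US", "KANEOHE 838.1, HI US"],
})
measurements = pd.DataFrame({
    "station": ["USC00519397", "USC00519397", "USC00513117"],
    "tobs": [65, 63, 67],
})

# validate="one_to_many" raises MergeError if a station id repeats in `stations`.
merged = pd.merge(stations, measurements, on="station", how="inner",
                  validate="one_to_many")
print(merged.shape)
```

If the merged row count equals the measurement row count, every measurement found exactly one matching station.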

In [9]:
df.head()

Unnamed: 0,station,name,latitude,longitude,elevation,date,prcp,tobs
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-01,0.08,65
1,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-02,0.0,63
2,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-03,0.0,74
3,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-04,0.0,76
4,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-06,,73


In [10]:
df.shape

(19550, 8)

In [11]:
df['month'] = ''
df['day'] = ''
df['year'] = ''

In [12]:
index_row_problem = []

# Split each 'YYYY-MM-DD' date string into year, month, and day columns.
for index, row in df.iterrows():
    try:
        year, month, day = re.split(r'[-]+', row["date"])
        df.at[index, 'month'] = month
        df.at[index, 'year'] = year
        df.at[index, 'day'] = day
    except (TypeError, ValueError):
        # Keep rows whose date is missing or malformed for later inspection.
        index_row_problem.append(row)
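
The same split can also be done without a Python-level loop. The sketch below is an alternative, not what this notebook ran, and uses a small illustrative frame named `df_dates` (a hypothetical name) so it can be tried on its own.

```python
import pandas as pd

df_dates = pd.DataFrame({"date": ["2010-01-01", "2010-01-06", "2011-12-30"]})

# Split 'YYYY-MM-DD' into three string columns in one pass.
parts = df_dates["date"].str.split("-", expand=True)
df_dates["year"], df_dates["month"], df_dates["day"] = parts[0], parts[1], parts[2]
print(df_dates)
```

The resulting columns hold zero-padded strings such as `'01'`, the same format the later month filters rely on.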

In [13]:
df.head()

Unnamed: 0,station,name,latitude,longitude,elevation,date,prcp,tobs,month,day,year
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-01,0.08,65,1,1,2010
1,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-02,0.0,63,1,2,2010
2,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-03,0.0,74,1,3,2010
3,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-04,0.0,76,1,4,2010
4,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-06,,73,1,6,2010


In [14]:
df['date_format'] = pd.to_datetime(df['date'])

In [15]:
df.head()

Unnamed: 0,station,name,latitude,longitude,elevation,date,prcp,tobs,month,day,year,date_format
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-01,0.08,65,1,1,2010,2010-01-01
1,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-02,0.0,63,1,2,2010,2010-01-02
2,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-03,0.0,74,1,3,2010,2010-01-03
3,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-04,0.0,76,1,4,2010,2010-01-04
4,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,2010-01-06,,73,1,6,2010,2010-01-06


In [16]:
del df['date']

In [17]:
df.head()

Unnamed: 0,station,name,latitude,longitude,elevation,prcp,tobs,month,day,year,date_format
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.08,65,1,1,2010,2010-01-01
1,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,63,1,2,2010,2010-01-02
2,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,74,1,3,2010,2010-01-03
3,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,76,1,4,2010,2010-01-04
4,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,,73,1,6,2010,2010-01-06


In [18]:
df.describe()

Unnamed: 0,latitude,longitude,elevation,prcp,tobs
count,19550.0,19550.0,19550.0,18103.0,19550.0
mean,21.382151,-157.839901,39.858363,0.160644,73.097954
std,0.079017,0.085735,64.987876,0.468746,4.523527
min,21.2716,-158.0111,0.9,0.0,53.0
25%,21.3331,-157.84889,7.0,0.0,70.0
50%,21.33556,-157.8168,14.6,0.01,73.0
75%,21.45167,-157.8015,32.9,0.11,76.0
max,21.5213,-157.71139,306.6,11.53,87.0


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19550 entries, 0 to 19549
Data columns (total 11 columns):
station        19550 non-null object
name           19550 non-null object
latitude       19550 non-null float64
longitude      19550 non-null float64
elevation      19550 non-null float64
prcp           18103 non-null float64
tobs           19550 non-null int64
month          19550 non-null object
day            19550 non-null object
year           19550 non-null object
date_format    19550 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(1), object(5)
memory usage: 2.4+ MB


***

### Approach to missing data. 
From `df.info()` we see that `prcp` has 18,103 non-null values out of 19,550 rows, so 1,447 data points are missing. For this analysis, we'll fill each missing value with the mean of the available `prcp` readings from the same station in the same calendar month (across all years).
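
The fill rule can be illustrated on a toy frame with `groupby(...).transform('mean')`. This is a compact sketch of the same idea, not the loop used in this notebook; the station labels `A` and `B` and all values are made up.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B"],
    "month": ["01", "01", "01", "02", "01", "01"],
    "prcp": [0.10, np.nan, 0.30, 0.50, np.nan, 0.40],
})

# Mean of the available readings for each (station, month) group, aligned row by row.
group_mean = demo.groupby(["name", "month"])["prcp"].transform("mean")

# Only the missing entries are replaced; observed values are left untouched.
demo["prcp"] = demo["prcp"].fillna(group_mean)
print(demo)
```

Here the missing January reading for station `A` becomes the mean of 0.10 and 0.30, and the one for station `B` takes that station's only other January value.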

***

In [20]:
df.head()

Unnamed: 0,station,name,latitude,longitude,elevation,prcp,tobs,month,day,year,date_format
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.08,65,1,1,2010,2010-01-01
1,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,63,1,2,2010,2010-01-02
2,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,74,1,3,2010,2010-01-03
3,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,76,1,4,2010,2010-01-04
4,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,,73,1,6,2010,2010-01-06


In [21]:
# Station names in order of first appearance.
unique_names_city = df['name'].unique().tolist()
unique_names_city

['WAIKIKI 717.2, HI US',
 'KANEOHE 838.1, HI US',
 'KUALOA RANCH HEADQUARTERS 886.9, HI US',
 'PEARL CITY, HI US',
 'UPPER WAHIAWA 874.3, HI US',
 'WAIMANALO EXPERIMENTAL FARM, HI US',
 'WAIHEE 837.5, HI US',
 'HONOLULU OBSERVATORY 702.2, HI US',
 'MANOA LYON ARBO 785.2, HI US']

In [22]:
# Month strings in order of first appearance.
unique_months = df['month'].unique().tolist()
unique_months

['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']

In [23]:
mean_ =  df[(df['name'] == 'WAIKIKI 717.2, HI US') & (df['month'] == '01')]['prcp'].mean()
mean_

0.040085836909871206

In [24]:
df[(df['name'] == 'WAIKIKI 717.2, HI US') & (df['month'] == '02')]['prcp'].mean()

0.0673611111111111

In [25]:
df[(df['name'] == 'WAIKIKI 717.2, HI US') & (df['month'] == '12')]['prcp'].mean()

0.07531400966183568

In [26]:
df.loc[(df['prcp'].isnull()) & (df['name'] == 'WAIKIKI 717.2, HI US') & (df['month'] == '01'), 'prcp']

4      NaN
26     NaN
341    NaN
1045   NaN
1046   NaN
1410   NaN
1411   NaN
Name: prcp, dtype: float64

In [27]:
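def fill_mean(dataset):
    """Fill missing `prcp` values in place with the station's mean for that month.

    Relies on the notebook-level lists `unique_names_city` and `unique_months`.
    """
    for each_city in unique_names_city:
        for each_month in unique_months:
            # Rows belonging to this station and calendar month (all years).
            in_group = (dataset['name'] == each_city) & (dataset['month'] == each_month)

            # Series.mean() skips NaN, so only observed readings are averaged.
            mean_prcp = dataset.loc[in_group, 'prcp'].mean()

            dataset.loc[in_group & dataset['prcp'].isnull(), 'prcp'] = mean_prcp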

In [28]:
fill_mean(df)

In [29]:
df.head()

Unnamed: 0,station,name,latitude,longitude,elevation,prcp,tobs,month,day,year,date_format
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.08,65,1,1,2010,2010-01-01
1,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,63,1,2,2010,2010-01-02
2,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,74,1,3,2010,2010-01-03
3,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,76,1,4,2010,2010-01-04
4,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.040086,73,1,6,2010,2010-01-06


In [30]:
df.describe()

Unnamed: 0,latitude,longitude,elevation,prcp,tobs
count,19550.0,19550.0,19550.0,19550.0,19550.0
mean,21.382151,-157.839901,39.858363,0.158283,73.097954
std,0.079017,0.085735,64.987876,0.452465,4.523527
min,21.2716,-158.0111,0.9,0.0,53.0
25%,21.3331,-157.84889,7.0,0.0,70.0
50%,21.33556,-157.8168,14.6,0.02,73.0
75%,21.45167,-157.8015,32.9,0.12,76.0
max,21.5213,-157.71139,306.6,11.53,87.0


In [31]:
df.loc[(df['prcp'].isnull()) & (df['name'] == 'WAIKIKI 717.2, HI US') & (df['month'] == '01'), 'prcp']

Series([], Name: prcp, dtype: float64)

In [32]:
df.iloc[4]

station                 USC00519397
name           WAIKIKI 717.2, HI US
latitude                    21.2716
longitude                  -157.817
elevation                         3
prcp                      0.0400858
tobs                             73
month                            01
day                              06
year                           2010
date_format     2010-01-06 00:00:00
Name: 4, dtype: object

In [33]:
df.iloc[26]

station                 USC00519397
name           WAIKIKI 717.2, HI US
latitude                    21.2716
longitude                  -157.817
elevation                         3
prcp                      0.0400858
tobs                             70
month                            01
day                              30
year                           2010
date_format     2010-01-30 00:00:00
Name: 26, dtype: object

In [34]:
df.iloc[341]

station                 USC00519397
name           WAIKIKI 717.2, HI US
latitude                    21.2716
longitude                  -157.817
elevation                         3
prcp                      0.0400858
tobs                             68
month                            01
day                              13
year                           2011
date_format     2011-01-13 00:00:00
Name: 341, dtype: object

In [35]:
df.columns.tolist()

['station',
 'name',
 'latitude',
 'longitude',
 'elevation',
 'prcp',
 'tobs',
 'month',
 'day',
 'year',
 'date_format']

In [36]:
df.head()

Unnamed: 0,station,name,latitude,longitude,elevation,prcp,tobs,month,day,year,date_format
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.08,65,1,1,2010,2010-01-01
1,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,63,1,2,2010,2010-01-02
2,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,74,1,3,2010,2010-01-03
3,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.0,76,1,4,2010,2010-01-04
4,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0,0.040086,73,1,6,2010,2010-01-06


In [38]:
df.to_csv('datafile.csv', index=False)

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19550 entries, 0 to 19549
Data columns (total 11 columns):
station        19550 non-null object
name           19550 non-null object
latitude       19550 non-null float64
longitude      19550 non-null float64
elevation      19550 non-null float64
prcp           19550 non-null float64
tobs           19550 non-null int64
month          19550 non-null object
day            19550 non-null object
year           19550 non-null object
date_format    19550 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(4), int64(1), object(5)
memory usage: 2.4+ MB
