# Visualize Solar Radiation Data

The data in this notebook come from the [National Solar Radiation Data Base](http://rredc.nrel.gov/solar/old_data/nsrdb/), specifically the [1991 - 2010 update to the National Solar Radiation Database](http://rredc.nrel.gov/solar/old_data/nsrdb/1991-2010/). The data set consists of CSV files of values [measured at USAF weather stations](http://rredc.nrel.gov/solar/old_data/nsrdb/1991-2010/hourly/list_by_USAFN.html).

## Setup

Run the `download_sample_data.py` script to download LiDAR data from the [Puget Sound LiDAR Consortium](http://pugetsoundlidar.ess.washington.edu) and the other example data sets.

From your local clone of the `datashader` repository:
```
cd examples
conda env create -f environment.yml
source activate ds 
python download_sample_data.py
```
Note: on Windows, replace `source activate ds` with `activate ds`.

In [None]:
import glob
import os
import re

from collections import defaultdict
from dask.distributed import Client
from holoviews.operation import decimate
from holoviews.operation.datashader import dynspread
import dask
import dask.dataframe as dd
import holoviews as hv
import numpy as np
import pandas as pd

hv.notebook_extension('bokeh')
decimate.max_samples=1000
dynspread.max_px=20
dynspread.threshold=0.5

client = Client()

In [None]:
NUM_STATIONS = None  # set to an integer to limit the run to a subset of the stations in SOLAR_FILES

In [None]:
SOLAR_FNAME_PATTERN = os.path.join('data', '72*', '*solar.csv')
SOLAR_FILES = glob.glob(SOLAR_FNAME_PATTERN)
META_FILE = os.path.join('data', 'NSRDB_StationsMeta.csv')

get_station_yr = lambda fname: tuple(map(int, os.path.basename(fname).split('_')[:2]))
STATION_COMBOS = defaultdict(list)
for fname in SOLAR_FILES:
    k, v = get_station_yr(fname)
    STATION_COMBOS[k].append([v, fname])
choices = tuple(STATION_COMBOS)
if NUM_STATIONS:
    choices = choices[:NUM_STATIONS]
STATION_COMBOS = {k: STATION_COMBOS[k] for k in choices}
files_for_station = lambda station: [x[1] for x in STATION_COMBOS[station]]
station_year_files = lambda station, year: [x for x in files_for_station(station) if '_{}_'.format(year) in x]
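
The grouping in the cell above depends on the first two `_`-separated tokens of each file name being the station id and the year. The sketch below illustrates that parsing on a few made-up paths. The file names, the station ids, and the helper name `station_and_year` are assumptions for illustration only and are not taken from the downloaded data.

```python
import os
from collections import defaultdict

# Hypothetical paths, assuming files are named <station>_<year>_solar.csv
example_files = [
    os.path.join('data', '722780', '722780_1991_solar.csv'),
    os.path.join('data', '722780', '722780_1992_solar.csv'),
    os.path.join('data', '724940', '724940_1991_solar.csv'),
]


def station_and_year(fname):
    """Return (station, year) parsed from the first two '_'-separated tokens."""
    station, year = os.path.basename(fname).split('_')[:2]
    return int(station), int(year)


# Group (year, file name) pairs under each station id
by_station = defaultdict(list)
for fname in example_files:
    station, year = station_and_year(fname)
    by_station[station].append((year, fname))

print(sorted(by_station))
print([year for year, _ in by_station[722780]])
```

Storing the year next to the file name keeps the later per-year lookup simple, at the cost of trusting the naming convention.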

In [None]:
def clean_col_names(dframe):
    # Raw strings avoid invalid-escape warnings for \( \) and \s
    cols = [re.sub(r'_$', '', re.sub(r'[/:\(\)_\s^-]+', '_', col.replace('%', '_pcent_'))).lower()
            for col in dframe.columns]
    dframe.columns = cols
    return dframe
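
To see what `clean_col_names` does to a header, the sketch below applies the same substitutions to plain strings instead of a `DataFrame`. The helper name `clean_name` is hypothetical, and the example headers are assumptions about what the CSVs contain, inferred from the cleaned names used later in this notebook (such as `hh_mm_lst` and `metstat_dir_wh_m_2`).

```python
import re


def clean_name(col):
    """Apply the same substitutions as clean_col_names to one header string."""
    col = col.replace('%', '_pcent_')
    # Collapse runs of separators and punctuation into a single underscore
    col = re.sub(r'[/:\(\)_\s^-]+', '_', col)
    # Drop a trailing underscore, then lowercase
    return re.sub(r'_$', '', col).lower()


# Assumed example headers; the real CSV headers may differ
for header in ['YYYY-MM-DD', 'HH:MM (LST)', 'METSTAT Dir (Wh/m^2)']:
    print(header, '->', clean_name(header))
```

The cleaned names are valid Python identifiers, which is what allows attribute access such as `dframe.hh_mm_lst` in the cells that follow.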

In [None]:
meta_df = clean_col_names(pd.read_csv(META_FILE, index_col='USAF'))

In [None]:
meta_df.loc[list(STATION_COMBOS)]

In [None]:
keep_cols = ['date', 'y', 'x', 'julian_hr', 'year', 'usaf', 'month', 'hour']

@dask.delayed
def read_one_fname(usaf_station, fname):
    dframe = clean_col_names(pd.read_csv(fname))
    station_data = meta_df.loc[usaf_station]
    hour_offset = dframe.hh_mm_lst.map(lambda x: pd.Timedelta(hours=int(x.split(':')[0])))   
    keep = keep_cols + [col for col in dframe.columns
                        if ('metstat' in col or col in keep_cols)
                        and 'flg' not in col]
    dframe['date'] = pd.to_datetime(dframe.yyyy_mm_dd) + hour_offset
    dframe['month'] = dframe.date.dt.month
    dframe['hour'] = dframe.date.dt.hour
    dframe['usaf'] = usaf_station
    dframe['y'], dframe['x'] = station_data.nsrdb_lat_dd, station_data.nsrdb_lon_dd 
    dframe['julian_hr'] = dframe.date.dt.hour + (dframe.date.dt.dayofyear - 1) * 24
    dframe['year'] = dframe.date.dt.year
    dframe[dframe <= -999] = np.nan  # np.NaN was removed in NumPy 2.0
    return dframe.loc[:, keep]

def read_one_station(station):
    '''Read one USAF station's 1991 to 2001 CSVs, using one dask.delayed call per year'''
    files = files_for_station(station)
    return dd.from_delayed([read_one_fname(station, fname) for fname in files]).compute()
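
The `julian_hr` column computed in `read_one_fname` is an hour-of-year index: the hour of day plus 24 times the zero-based day of year. The sketch below reproduces that formula with the standard library only, to show how leap years shift the index. The function name `julian_hr` here is a stand-in for illustration, not part of the notebook's code.

```python
from datetime import datetime


def julian_hr(ts):
    """Hour of year, starting at 0 for 00:00 on January 1."""
    day_of_year = ts.timetuple().tm_yday
    return ts.hour + (day_of_year - 1) * 24


# Last hour of a non-leap year and of a leap year
last_hour_2001 = julian_hr(datetime(2001, 12, 31, 23))
last_hour_2000 = julian_hr(datetime(2000, 12, 31, 23))
print(last_hour_2001 + 1, last_hour_2000 + 1)  # hours per year

# The same calendar date lands on different indices after February 29
print(julian_hr(datetime(2001, 3, 1, 0)), julian_hr(datetime(2000, 3, 1, 0)))
```

Because of this offset, grouping several years by `julian_hr` mixes calendar dates that are one day apart from March onward in leap years.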

In [None]:
example_usaf = tuple(STATION_COMBOS)[0]
df = read_one_station(example_usaf)

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
desc = df.date.describe()
desc

The next cell builds readable labels for the time series plots and box plots produced by the groupby operations below.

In [None]:
direct, dif_h, glo_h = ('Direct Normal', 
                        'Diffuse Horizontal', 
                        'Global Horizontal',)
labels = {}
watt_hrs_m2_cols = [col for col in df.columns if 'wh_m_2' in col and 'suny' not in col]
for col in watt_hrs_m2_cols:
    label_1 = "Clear Sky " if 'csky' in col else "Measured "
    label_2 = direct if '_dir_' in col else glo_h if '_glo_' in col else dif_h
    labels[col] = label_1 + label_2
labels

In [None]:
def get_station_quantiles(station=None, grouper='julian_hr', usaf_data=None):
    '''Given a station name or dataframe, group by time bins and compute quartiles
    Parameters:
        station:    Integer name of a USAF weather station 
                    (folder names holding years' CSVs)
        grouper:    One of "julian_hr", "hour", "month" or "month_hour"
                    (Note that julian_hr is not standardized relative to leap
                    years: non-leap years have 8760 hrs, leap years 8784 hrs)
        usaf_data:  Dataframe of a station's CSVs, used instead of
                    reading the files for a station name
    Returns:
        summary_df: Dataframe with 25%, 50%, 75% for each column
    '''

    if usaf_data is None:
        usaf_data = read_one_station(station)
    if grouper == 'hour':
        group_var = usaf_data.date.dt.hour
    elif grouper == 'month':
        group_var = usaf_data.date.dt.month
    elif grouper == 'month_hour':
        group_var = [usaf_data.date.dt.month, usaf_data.date.dt.hour]
    else:
        group_var = grouper
    usaf_data = usaf_data.groupby(group_var)
    usaf_data = usaf_data[keep_cols + watt_hrs_m2_cols]
    low = usaf_data.quantile(0.25)
    median = usaf_data.median()
    hi = usaf_data.quantile(0.75)
    median[grouper] = median.index.values
    median['usaf'] = station
    # For the low, hi quartiles subset the columns
    # for smaller joins - do not include 3 copies of x,y,date, etc
    join_arg_cols = [col for col in low.columns if col not in keep_cols]
    summary_df = median.join(low[join_arg_cols], 
                             rsuffix='_low').join(hi[join_arg_cols], rsuffix='_hi')
    return summary_df

Get an hour-of-year (`julian_hr`) summary for one USAF station using `pandas.DataFrame.groupby`.

In [None]:
julian_summary = get_station_quantiles(station=example_usaf, grouper='julian_hr',)
julian_summary.head()

The function `get_station_quantiles` returns a `DataFrame` with
 * spatial coordinates `x` and `y`
 * columns related to clear sky solar radiation (columns with `_csky_` as a token)
 * measured solar radiation (columns without `_csky_` as a token)
 * some date / time related columns helpful for `groupby` operations
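
The quartile summary can be illustrated on a tiny synthetic frame. This is a minimal sketch of the groupby-then-join pattern used in `get_station_quantiles`, not the function itself. The column name `ghi` and all values are made up for illustration.

```python
import pandas as pd

# Synthetic data: four observations for each of two hours (values are made up)
frame = pd.DataFrame({
    'hour': [0, 0, 0, 0, 1, 1, 1, 1],
    'ghi': [0.0, 10.0, 20.0, 30.0, 100.0, 200.0, 300.0, 400.0],
})

grouped = frame.groupby('hour')[['ghi']]
low = grouped.quantile(0.25)
median = grouped.median()
hi = grouped.quantile(0.75)

# rsuffix renames only the overlapping columns of the right-hand frame,
# so the median keeps the plain column name
summary = median.join(low, rsuffix='_low').join(hi, rsuffix='_hi')
print(summary)
```

Keeping the median under the original column name and suffixing the quartiles is what lets `plot_gen` look up `col`, `col + '_low'`, and `col + '_hi'` from a single frame.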

In [None]:
julian_summary.columns

In [None]:
def plot_gen(station=None, grouper='julian_hr', usaf_data=None):
    '''Given a station name or dataframe, build quartile curves per time bin
    Parameters:
        station:    Integer name of a USAF weather station 
                    (folder names holding years' CSVs)
        grouper:    One of "julian_hr", "hour", "month" or "month_hour"
        usaf_data:  Dataframe of a station's CSVs, used instead of
                    reading the files for a station name
    Returns:
        curves:     Dictionary of hv.Curve overlays showing the
                    25%, 50%, 75% percentiles
    '''
    summary_df = get_station_quantiles(station=station, 
                                       grouper=grouper, 
                                       usaf_data=usaf_data)
    curves = {}
    kw = dict(style=dict(s=2,alpha=0.5))
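    # Iterate over (column, readable label) pairs; zipping with the dict
    # itself would yield the column names again instead of the labels
    for col, label in labels.items():
        # pd.DatetimeIndex(start=...) was removed in pandas 1.0
        dates = pd.date_range(start=pd.Timestamp('2001-01-01'),
                              freq='H', 
                              periods=summary_df.shape[0])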
        median_col = summary_df[col]
        low_col = summary_df[col + '_low']
        hi_col = summary_df[col + '_hi']
        hi = hv.Curve((dates, hi_col), label=label + ' (upper quartile)')(**kw)
        low = hv.Curve((dates, low_col),label=label + ' (lower quartile)')(**kw)
        median = hv.Curve((dates, median_col), label=label)(**kw)
        plot_id = tuple(col.replace('metstat_', '').replace('_wh_m_2', '').split('_'))
        curves[plot_id] = low * median * hi
        curves[plot_id].group = labels[col]
    return curves

Run `plot_gen` (defined above) with an example USAF station to get a dictionary of `holoviews.Curve` overlays. Each overlay combines the 25%, 50%, and 75% time series with the `*` operator, which `holoviews` overloads to overlay `Curve` and other `holoviews.element` objects on shared axes.

In [None]:
hour_of_year = plot_gen(station=example_usaf)

Now we have a dictionary with short tuple keys for plots of the 25%, 50%, and 75% quantiles of:
 * `('glo',)`: Measured Global Horizontal
 * `('dir',)`: Measured Direct Normal
 * `('dif',)`: Measured Diffuse Horizontal
 * `('csky', 'glo')`: Clear Sky Global Horizontal
 * `('csky', 'dir')`: Clear Sky Direct Normal
 * `('csky', 'dif')`: Clear Sky Diffuse Horizontal
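
The tuple keys and readable labels listed above both derive from the cleaned column names. The sketch below mirrors that logic on a list of column names. Only `metstat_dir_wh_m_2` appears verbatim in this notebook; the other names are assumed to follow the same pattern, and the helpers `plot_key` and `plot_label` are hypothetical stand-ins.

```python
def plot_key(col):
    """Strip the shared prefix and unit suffix, then split into a tuple key."""
    return tuple(col.replace('metstat_', '').replace('_wh_m_2', '').split('_'))


def plot_label(col):
    """Build a readable label from the tokens in the column name."""
    prefix = 'Clear Sky ' if 'csky' in col else 'Measured '
    if '_dir_' in col:
        kind = 'Direct Normal'
    elif '_glo_' in col:
        kind = 'Global Horizontal'
    else:
        kind = 'Diffuse Horizontal'
    return prefix + kind


# Assumed column names following the pattern of 'metstat_dir_wh_m_2'
columns = ['metstat_glo_wh_m_2', 'metstat_dir_wh_m_2', 'metstat_csky_dir_wh_m_2']
for col in columns:
    print(plot_key(col), plot_label(col))
```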

In [None]:
list(hour_of_year)

In [None]:
%%opts Curve [width=700 height=500]
%%opts Layout [tabs=True]
hour_of_year[('dir',)] + hour_of_year[('csky', 'dir')] 

In [None]:
%%opts Curve [width=700 height=500 ]
%%opts Layout [tabs=True]
hour_of_year[('glo',)] + hour_of_year[('csky', 'glo')] + hour_of_year[('dif',)] + hour_of_year[('csky', 'dif')]

The next cells repeat the groupby operations for hour of day.

In [None]:
usaf_data = read_one_station(example_usaf)
hour_of_day = plot_gen(grouper='hour', usaf_data=usaf_data)

In [None]:
%%opts Curve [width=700 height=500]
%%opts Layout [tabs=True]
hour_of_day[('dir',)] + hour_of_day[('csky', 'dir')] 

When grouping by hour of day or month of year, the number of groups on the horizontal axis is small enough for box plots to show distributions legibly.  The next cell uses `holoviews.BoxWhisker` plots to show the direct normal radiation.

In [None]:
%%opts BoxWhisker [width=600 height=600]
%%opts Layout [tabs=True]
(hv.BoxWhisker(usaf_data, kdims=['hour'], vdims=['metstat_dir_wh_m_2'],
               group='Direct Normal - Hour of Day') +
 hv.BoxWhisker(usaf_data, kdims=['month'], vdims=['metstat_dir_wh_m_2'],
               group='Direct Normal - Month of Year'))