# Sample code for processing netCDF4 files for the Kaggle Solar Energy Prediction Competition.
https://www.kaggle.com/c/ams-2014-solar-energy-prediction-contest#description


In [None]:
# A good link to start wrapping your head around netcdf data format: 
# https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf-tutorial.html#Intro

# This is a good link describing the dataset for the competition.
# https://www.kaggle.com/c/ams-2014-solar-energy-prediction-contest/discussion/5057

# It is important to understand that the data provided is the *prediction* of a parameter
# (e.g. the *predicted* total precipitation), rather than the *observed* value.
# Once you load a netcdf4 file you get a dictionary-like object that holds the helper/axis
# variables plus the actual data. The actual data is a big array of shape (5113, 11, 5, 9, 16):
#   - 5113 daily predictions from 1994 to 2007,
#   - 11 ensemble members of the GEFS (different submodel predictions, I think),
#   - 5 forecast lead times (I think the forecast is released at midnight, so these are
#     12, 15, 18, 21, and 24 hours out),
#   - 9 latitudes and 16 longitudes giving where each prediction applies spatially.
# The GEFS is a weather model that predicts various quantities at various locations,
# and the data is those predictions.

# A good python code sample if you prefer a hacker's approach:
# http://schubert.atmos.colostate.edu/~cslocum/netcdf_example.html
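
The shape described above can be explored without the real files. The following is a minimal sketch on synthetic numbers, assuming the axis order is (time, ensemble member, forecast hour, latitude, longitude) as described; the array is cut down to 4 days to keep it small, and the names `data`, `ens_mean`, `at_18h` and `one_point` are illustrative only.

```python
import numpy as np

# Axis 0: one entry per day from 1994-01-01 through 2007-12-31.
days = np.arange("1994-01-01", "2008-01-01", dtype="datetime64[D]")
print(len(days))  # 5113

# Synthetic stand-in for one variable (random values, NOT competition data).
# Assumed axis order: (time, ensemble member, forecast hour, lat, lon).
rng = np.random.default_rng(0)
data = rng.normal(size=(4, 11, 5, 9, 16))
forecast_hours = [12, 15, 18, 21, 24]

ens_mean = data.mean(axis=1)                     # average the 11 ensemble members
at_18h = ens_mean[:, forecast_hours.index(18)]   # keep only the 18-hour forecast
one_point = at_18h[:, 0, 0]                      # time series at the first grid point

print(ens_mean.shape, at_18h.shape, one_point.shape)
```

Averaging over the ensemble axis first is only one possible reduction; keeping all 11 members as separate features is equally valid.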


In [None]:
# Installing the netCDF4 Python library may be non-trivial. The code below is confirmed to work on an
# AWS SageMaker notebook with the 'conda python 3' kernel.

In [4]:
!conda install -c anaconda netcdf4 --yes
from netCDF4 import Dataset

#!conda install -c ioos xarray -y

Solving environment: done


  current version: 4.5.12
  latest version: 4.7.10

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/python3

  added / updated specs: 
    - netcdf4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.6.16          |           py36_0         154 KB  anaconda
    libssh2-1.8.2              |       h1ba5d50_0         250 KB  anaconda
    cftime-1.0.3.4             |   py36hdd07704_1         310 KB  anaconda
    libnetcdf-4.6.1            |       h10edf3e_2         1.3 MB  anaconda
    ca-certificates-2019.5.15  |                0         133 KB  anaconda
    openssl-1.1.1              |       h7b6447c_0         5.0 MB  anaconda
    netcdf4-1.4.2              |   py36h4b4f87f_0         526 KB  anaconda
    curl-7.65.2                |       hbc8304

The easiest (not necessarily the fastest) way to transfer the data to your SageMaker machine is to:
1. download it to your local machine (e.g. your laptop)
2. upload the file to your AWS S3 bucket (e.g. s3://sergey-ML-workshop)
3. download the file from the AWS S3 bucket to the machine that hosts your SageMaker notebook.
    3a. in the SageMaker Jupyter console, open a terminal window.
    3b. in the terminal window, copy the file to your data directory. Example:
        $ cd SageMaker
        $ cd <YourProjectDirectory>
        $ mkdir data
        $ mkdir data/train
        $ cd data/train
        $ aws s3 cp s3://sergey-ML-workshop/data.nc .
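
Step 3 can also be done from a notebook cell instead of the terminal. This is a rough sketch, assuming `boto3` is installed and the notebook's role can read the bucket; `split_s3_uri` and `download_from_s3` are hypothetical helper names, and only the URI parsing is exercised here.

```python
import os
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

def download_from_s3(uri, dest_dir="data/train"):
    """Download one object into dest_dir (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so split_s3_uri works without boto3
    bucket, key = split_s3_uri(uri)
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(key))
    boto3.client("s3").download_file(bucket, key, dest)
    return dest

# Parsing only; calling download_from_s3 would contact AWS.
print(split_s3_uri("s3://sergey-ML-workshop/data.nc"))
```

Calling `download_from_s3("s3://sergey-ML-workshop/data.nc")` would then be the counterpart of the `aws s3 cp` command, under the same assumptions.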
        
        

In [1]:
import numpy as np
import netCDF4
import pandas as pd
from pandas import Series
import xarray as xr


ds = xr.open_dataset('/home/ec2-user/SageMaker/train/tmax_2m_latlon_subset_19940101_20071231.nc')
df = ds.to_dataframe()

# loop through columns you want to use

def get_split(df, freq='D', split_type='train', cols_to_use=('Maximum_temperature',)):
    """Return one {'start', 'target'} dict per column of df for the requested split.

    freq and cols_to_use are accepted but currently unused: every column is exported.
    """
    n_rows = df.shape[0]

    # use 70% for training
    if split_type == 'train':
        lower_bound = 0
        upper_bound = round(n_rows * .7)

    # use 15% for validation
    elif split_type == 'validation':
        lower_bound = round(n_rows * .7)
        upper_bound = round(n_rows * .85)

    # use 15% for test
    elif split_type == 'test':
        lower_bound = round(n_rows * .85)
        upper_bound = n_rows

    else:
        raise ValueError("split_type must be 'train', 'validation' or 'test'")

    # NOTE: the start is fixed at 1994-01-01 for every split,
    # including the validation and test splits.
    start_dataset = pd.Timestamp("1994-01-01 00:00:00")

    rt_set = []
    for h in list(df):
        target_column = df[h].values.tolist()[lower_bound:upper_bound]

        # create a new json object for each column
        json_obj = {'start': str(start_dataset),
                    'target': target_column}
        rt_set.append(json_obj)

    return rt_set
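
`get_split` slices rows by position. As far as I understand xarray's `to_dataframe()`, the resulting frame has one row per combination of the dataset's index coordinates rather than one row per day, so a positional 70/15/15 cut is not necessarily a cut by date. Below is a minimal sketch of one alternative on a toy frame, assuming you want a single value per day: average over every index level except `time`, split chronologically, and give each split its own start date. The names `toy`, `daily` and `chronological_split` are hypothetical, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ds.to_dataframe(): a MultiIndex of (time, lat, lon).
times = pd.date_range("1994-01-01", periods=20, freq="D")
index = pd.MultiIndex.from_product(
    [times, [31.0, 32.0], [254.0, 255.0]], names=["time", "lat", "lon"]
)
toy = pd.DataFrame(
    {"Maximum_temperature": np.arange(len(index), dtype=float)}, index=index
)

# Collapse to one value per day by averaging over the other index levels.
daily = toy["Maximum_temperature"].groupby(level="time").mean()

def chronological_split(series, split_type="train"):
    """Return a {'start', 'target'} dict for a 70/15/15 split by date order."""
    n = len(series)
    bounds = {
        "train": (0, round(n * 0.7)),
        "validation": (round(n * 0.7), round(n * 0.85)),
        "test": (round(n * 0.85), n),
    }
    lower, upper = bounds[split_type]
    part = series.iloc[lower:upper]
    # The start is the first date actually present in this split.
    return {"start": str(part.index[0]), "target": part.tolist()}

train = chronological_split(daily, "train")
test = chronological_split(daily, "test")
print(train["start"], len(train["target"]))
print(test["start"], len(test["target"]))
```

Averaging is only one way to reduce the grid; selecting a single grid point or ensemble member before splitting would work the same way.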




In [5]:
train_set = get_split(df, 'D')
test_set = get_split(df, 'D', split_type = 'test')

In [8]:
import json

In [9]:
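def write_dicts_to_file(path, data):
    # write one JSON object per line so the records can be read back individually
    with open(path, 'wb') as fp:
        for d in data:
            fp.write(json.dumps(d).encode("utf-8"))
            fp.write("\n".encode("utf-8"))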
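
A quick way to check a file of this kind is to write a few records one JSON object per line and read them back. This is a small sketch with made-up records and a temporary file; the names `records`, `path` and `loaded` are illustrative only.

```python
import json
import os
import tempfile

# Made-up records in the same {'start', 'target'} layout.
records = [
    {"start": "1994-01-01 00:00:00", "target": [1.0, 2.5]},
    {"start": "1994-01-01 00:00:00", "target": [3.0]},
]

path = os.path.join(tempfile.mkdtemp(), "example.json")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as fp:
    for record in records:
        fp.write(json.dumps(record) + "\n")

# Read the file back line by line, skipping blank lines.
with open(path, encoding="utf-8") as fp:
    loaded = [json.loads(line) for line in fp if line.strip()]

print(loaded == records)
```

Without the newline separator the objects run together on a single line, and `json.loads` can no longer parse the line as one value.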


In [10]:
write_dicts_to_file('max_temp_train.json', train_set)
write_dicts_to_file('max_temp_test.json', test_set)

In [11]:
!aws s3 cp max_temp_train.json s3://forecasting-do-not-delete/train/max_temp_train.json
!aws s3 cp max_temp_test.json s3://forecasting-do-not-delete/test/max_temp_test.json

upload: ./max_temp_train.json to s3://forecasting-do-not-delete/train/max_temp_train.json
upload: ./max_temp_test.json to s3://forecasting-do-not-delete/test/max_temp_test.json
