# ReadMe
This notebook shows the process for loading and preprocessing data and training and optimizing a model. Written as a notebook instead of markdown for extra interactive fun!

## Preprocess, Train, and Optimize
```data_utility.py``` and ```training_utility.py``` host the functions to open data, preprocess it, and train and optimize a model. Specifically, ```training_utility.py```'s ```main()``` function runs all of these steps together. Run the code below to try it out!

In [None]:
import training_utility

arch = training_utility.ModelArchitecture(steps_in=4, steps_out=1, resample_rate_min=60)
opt_info = training_utility.OptimizationInfo(n_trials=1, n_splits=5, n_epochs=1, min_improvement=0, patience=5)

training_utility.main('../ex_training_data', arch, opt_info, train_test_ratio=0.75, generate_new_data=True, export_folder='results')

Congrats, you just preprocessed, trained, and optimized a (very poorly performing but quick to train) model!

Let's break this down. ```main()``` has six parameters you can adjust:
- ```dirname```: The directory of the training data. Results and preprocessed data will also be exported in this directory. See **"Data Preprocessing"** for more information on how this folder must be structured.
- ```arch```: A ```ModelArchitecture``` object, which contains information about how the model is structured.
- ```opt_info```: An ```OptimizationInfo``` object, which contains information about how you want to structure the optimization.
- ```train_test_ratio``` (optional): A float between 0 and 1. What fraction of the dataset to use as training data. The exact amount may be adjusted slightly so the split does not split days into fractions. Defaults to 0.75.
- ```generate_new_data``` (optional): If true, will open and preprocess data from scratch. Otherwise, will search for pickled preprocessed data in ```dirname```. Defaults to False.
- ```export_folder``` (optional): Name of folder within ```dirname``` to store the model, optimization study, and data scalers. Defaults to an empty string (exporting to ```dirname```).
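
To picture how ```train_test_ratio``` can shift so that no day is cut in half, here is a minimal sketch. It is not the code in ```training_utility.py```: the helper ```day_aligned_split_index``` and its rounding rule are hypothetical, and it assumes every day contributes the same number of samples.

```python
def day_aligned_split_index(n_samples, samples_per_day, train_test_ratio=0.75):
    """Return the index separating train from test, snapped to a day boundary.

    Hypothetical helper: assumes each day has exactly `samples_per_day` samples.
    """
    if not 0 < train_test_ratio < 1:
        raise ValueError("train_test_ratio must be between 0 and 1")
    n_days = n_samples // samples_per_day
    if n_days < 2:
        raise ValueError("need at least two full days to split")
    # Round the training share to a whole number of days,
    # keeping at least one day on each side of the split.
    n_train_days = round(n_days * train_test_ratio)
    n_train_days = max(1, min(n_days - 1, n_train_days))
    return n_train_days * samples_per_day


# 10 days of hourly data: 0.75 would put the split at 7.5 days,
# so the effective ratio ends up slightly different from the requested one.
split = day_aligned_split_index(n_samples=240, samples_per_day=24)
print(split, split / 240)
```

The point is only that the effective ratio can differ a little from the one you pass in.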

Now, let's break down what goes into ```ModelArchitecture```.

In [None]:
arch = training_utility.ModelArchitecture(steps_in=4, steps_out=1, resample_rate_min=60)

There are three parameters useful in this context:
- ```steps_in```: The number of time steps input into the model to base its predictions on.
- ```steps_out```: The length of the model's predictions in time steps.
- ```resample_rate_min```: The length of each time step in minutes.
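
To see how ```steps_in``` and ```steps_out``` interact, here is a minimal sliding-window sketch. The helper ```make_windows``` is hypothetical and only illustrates the idea; the actual windowing in ```data_utility.py``` may differ (for example, it may treat each day separately).

```python
def make_windows(series, steps_in, steps_out):
    """Slice a 1-D sequence into (input, target) pairs with a sliding window."""
    X, y = [], []
    last_start = len(series) - steps_in - steps_out
    for start in range(last_start + 1):
        # steps_in consecutive values form the model input...
        X.append(series[start:start + steps_in])
        # ...and the next steps_out values are what the model should predict.
        y.append(series[start + steps_in:start + steps_in + steps_out])
    return X, y


# Six hourly readings (resample_rate_min=60), steps_in=4, steps_out=1
hourly = [10, 20, 30, 40, 50, 60]
X, y = make_windows(hourly, steps_in=4, steps_out=1)
print(X)  # [[10, 20, 30, 40], [20, 30, 40, 50]]
print(y)  # [[50], [60]]
```

With ```resample_rate_min=60```, each input window above covers 4 hours and each prediction covers the following hour.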

And finally, let's break down what goes into ```OptimizationInfo```.

In [None]:
opt_info = training_utility.OptimizationInfo(n_trials=1, n_splits=5, n_epochs=1, min_improvement=0, patience=5)

Here are the parameters:
- ```n_trials```: How many optimization trials to run.
- ```n_splits``` (optional): How many splits to use in time series cross validation. Defaults to 5.
- ```n_epochs``` (optional): How many epochs to train the model on. Defaults to 140.
- ```batch_size``` (optional): Number of samples processed before each model update. If set to -1, the number of samples per day is used instead. Defaults to -1.
- ```min_improvement``` (optional): Minimum improvement before early stopping. If < 0, early stopping is disabled. Defaults to 0.
- ```patience``` (optional): Number of epochs to wait for improvement before early stopping. Defaults to 5.
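
Here is a rough sketch of how ```min_improvement``` and ```patience``` could work together. The stopping rule below is an assumption for illustration, not taken from ```training_utility.py```, and ```epochs_until_stop``` is a hypothetical helper.

```python
def epochs_until_stop(losses, min_improvement=0, patience=5):
    """Return how many epochs run before early stopping would trigger.

    Assumed rule (not taken from training_utility.py): an epoch counts as an
    improvement when it lowers the best loss so far by more than
    `min_improvement`; training stops after `patience` epochs in a row
    without an improvement.
    """
    if min_improvement < 0:
        # A negative value disables early stopping: every epoch runs.
        return len(losses)
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_improvement:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return len(losses)


losses = [2.5, 2.0, 2.1, 2.05, 2.2, 1.9]
print(epochs_until_stop(losses, min_improvement=0, patience=2))   # 4
print(epochs_until_stop(losses, min_improvement=-1, patience=2))  # 6
```

Under this assumed rule, a larger ```patience``` tolerates longer plateaus, and a larger ```min_improvement``` demands bigger gains before an epoch counts as progress.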

**Note on seasonal models:** If you would like to train a seasonal model, read the section "Seasonal Models" below.

And that's pretty much it! You can play with the variables to create a model of your liking.

## Data Preprocessing
You can also preprocess and pickle data for later use without training an entire model. All functions needed to load data and preprocess it are written and documented in ```data_utility.py```. You can look there for more specifics on what's going on, but the ```main()``` function covers everything.

In [1]:
import data_utility
X, y = data_utility.main(steps_in=16, steps_out=4, resample_rate_min=15, dirname='../ex_training_data',
    seasons=[[2,3,4],[5,6,7],[8,9,10],[11,12,1]], write_to_file=False)

The ```main()``` function has a few key inputs:
- ```steps_in```: The number of time steps input into the model to base its predictions on.
- ```steps_out```: The length of the model's predictions in time steps.
- ```resample_rate_min```: The length of each time step in minutes.
- ```dirname```: The directory of the training data. Preprocessed data will be exported to a file in this directory if ```write_to_file``` is true.
- ```seasons``` (optional): An array grouping each month into seasons. Providing this will train a seasonal model. See Seasonal Models section below for more information. Defaults to None.
- ```write_to_file``` (optional): Whether or not to pickle the preprocessed data to file. Defaults to False.

```main()``` also returns ```X``` and ```y```, which are the data fed into the model and the data its predictions are compared against, respectively.

Note that the given directory must be structured as follows:
- **Solar irradiance data** with suffix **.nc** in a folder called **sol_irr**
- **Cloud coverage data** with suffix **.cdf** in a folder called **cloud_coverage**

*You can change the names of these folders and suffixes of these files at the top of ```data_utility.py```.*
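
If you want to sanity-check a directory before preprocessing, a small sketch like the following can help. It is not part of ```data_utility.py```: ```find_layout_problems``` is a hypothetical helper, and it assumes the data files sit directly inside each folder, using the default folder names and suffixes listed above.

```python
from pathlib import Path

# Default folder names and file suffixes as listed above; adjust them if
# you changed the settings at the top of data_utility.py.
EXPECTED_LAYOUT = {"sol_irr": ".nc", "cloud_coverage": ".cdf"}


def find_layout_problems(dirname, layout=EXPECTED_LAYOUT):
    """Return a list of human-readable problems with the data directory."""
    problems = []
    root = Path(dirname)
    for folder, suffix in layout.items():
        subdir = root / folder
        if not subdir.is_dir():
            problems.append(f"missing folder: {subdir}")
        elif not any(p.suffix == suffix for p in subdir.iterdir()):
            problems.append(f"no {suffix} files in {subdir}")
    return problems


# Prints nothing if the layout looks right.
for problem in find_layout_problems("../ex_training_data"):
    print(problem)
```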

This code was created based on the ARM [Radiative Flux Analysis](https://www.arm.gov/capabilities/vaps/radfluxanal) (solar irradiance) and [TSI Sky Cover](https://www.arm.gov/capabilities/instruments/tsi) (cloud coverage) data from the Southern Great Plains site. It opens data files using xarray, so any file type supported by xarray should work fine. However, using other datasets will require changing the input and output variable names in ```data_utility.py```'s ```PreprocessingInstructions``` and may require additional rewriting.

## Seasonal Models
A **non-seasonal model** is one model for all times of the year.<br>
A **seasonal model** is a model fine-tuned to different times of the year. Practically speaking, it consists of distinct models, each of which trains on a specific subset of the data. Data can be split by months.
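
To make the month-based split concrete, here is a minimal sketch that buckets records by season. ```split_by_season``` and the sample records are hypothetical; the real preprocessing in ```data_utility.py``` works on the loaded datasets rather than on plain tuples.

```python
seasons = [[2, 3, 4], [5, 6, 7], [8, 9, 10], [11, 12, 1]]


def split_by_season(records, seasons):
    """Group (month, value) records under the season containing each month."""
    # Keys are str(season), e.g. '[5, 6, 7]'.
    groups = {str(season): [] for season in seasons}
    for month, value in records:
        for season in seasons:
            if month in season:
                groups[str(season)].append(value)
                break
    return groups


records = [(1, 0.2), (6, 0.9), (7, 0.8), (12, 0.1)]
print(split_by_season(records, seasons))
# {'[2, 3, 4]': [], '[5, 6, 7]': [0.9, 0.8], '[8, 9, 10]': [], '[11, 12, 1]': [0.2, 0.1]}
```

Each non-empty group would then get its own model. A season can wrap around the new year, as ```[11, 12, 1]``` does here.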

```training_utility.py``` will load or open data and train the model on all the data given. This works well for non-seasonal models. However, if you would like to train a seasonal model, you will have to finagle a little bit: first load and preprocess the data, and then call ```training_utility.py```'s ```main()``` on each season's data, as done below.

In [4]:
import training_utility
import data_utility
import os
import warnings

# first preprocess and save the data (write_to_file must be True)
# data will be saved at 'dirname/season/filename', e.g., 'dirname/[5, 6, 7]/14in.1out.60min.[5, 6, 7].X.pickle'
steps_in, steps_out = 14, 1
resample_rate_min=60
dirname='../ex_training_data'
seasons=[[2,3,4],[5,6,7],[8,9,10],[11,12,1]]
data_utility.main(steps_in=steps_in, steps_out=steps_out, resample_rate_min=resample_rate_min, dirname=dirname, seasons=seasons,
    write_to_file=True)

# now iteratively call training_utility.main() for each season's data (generate_new_data=False)
arch = training_utility.ModelArchitecture(steps_in=steps_in, steps_out=steps_out, resample_rate_min=resample_rate_min)
opt_info = training_utility.OptimizationInfo(n_trials=1, n_splits=5, n_epochs=1, min_improvement=0, patience=5)

for season in seasons:
    season_dirname = os.path.join(dirname, str(season))
    try:
        training_utility.main(season_dirname, arch, opt_info, train_test_ratio=0.75, generate_new_data=False, export_folder='results')
    except FileNotFoundError as err:
        # in case there is no data for that season
        warnings.warn(f"Could not load data for {season_dirname}, error message: {err}. The dataset may not contain data "
            "for this season, so there is no file to open.")

---------paths to open determined


[I 2022-08-14 15:37:43,570] A new study created in memory with name: no-name-475638a3-809d-4285-b915-7763fe9aad9b


---------sol irr data opened
---------cloud coverage data opened
---------data merged
---------data preprocessed
---------time series set up
---------pickled to file, path: ../ex_training_data\[5, 6, 7]\14in.1out.60min.[5, 6, 7].X.pickle
---------pickled to file, path: ../ex_training_data\[5, 6, 7]\14in.1out.60min.[5, 6, 7].y.pickle
-----data loaded
-----begin training
-----epoch #1
loss=2.499190367294957, improvement, best_idx=0,curr_idx=1
[inf, 2.499190367294957]
-----training complete. evaluating model...


[I 2022-08-14 15:38:00,579] Trial 0 finished with value: 0.6491945399220171 and parameters: {'n_neurons': 128, 'n_layers': 4, 'cell_type': 'SimpleRNN'}. Best is trial 0 with value: 0.6491945399220171.


-----model evaluated
-----training and optimization done
INFO:tensorflow:Assets written to: ../ex_training_data\[5, 6, 7]\results\best_model\assets


-----best model saved to ../ex_training_data\[5, 6, 7]\results\best_model
-----study saved to ../ex_training_data\[5, 6, 7]\results\14in.1out.study.pickle
-----scalers saved to ../ex_training_data\[5, 6, 7]\results\scalers.pickle


