## RONIN Walkthrough 
___

This notebook contains a brief introduction to training and evaluating a random forest radar quality control algorithm using Julia. 

In [None]:
##Begin by loading dependencies 
using Pkg
Pkg.activate(".")
Pkg.instantiate() 
##Make sure Julia can see our module 
push!(LOAD_PATH, "./src/")

###Load key functionality 
###This will take a while the first time you do it 
using Ronin 

# 1) Splitting data into training and testing sets
___
We'll begin by partitioning our data into scans that will be used for training the model (training set) and scans that will be used to evaluate model performance (testing set). It's important to keep these separate. 
<h2><span style="color:Red">WARNING: This function will begin by DELETING the training and testing directories to clean them</span></h2>
It then softlinks the divided files into their respective directories. 
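
To make the clean-then-link behaviour concrete, here is a minimal sketch of the same idea in plain Julia. It is not Ronin's implementation: the helper name `sketch_split!`, the 80/20 split fraction, and the random shuffle are all assumptions made for illustration, and `split_training_testing!` may select files quite differently.

```julia
using Random

# Hypothetical helper, NOT Ronin's implementation: illustrates "clean the
# output directories, then softlink each file into one of them".
function sketch_split!(case_dirs::Vector{String}, train_dir::String, test_dir::String;
                       train_frac::Float64 = 0.8, rng = Random.default_rng())
    for dir in (train_dir, test_dir)
        rm(dir; force = true, recursive = true)   # deletes any existing contents
        mkpath(dir)
    end
    for case_dir in case_dirs
        # Absolute targets keep the links valid regardless of the working directory
        files = shuffle(rng, readdir(abspath(case_dir); join = true))
        n_train = round(Int, train_frac * length(files))
        for (i, file) in enumerate(files)
            dest = i <= n_train ? train_dir : test_dir
            symlink(file, joinpath(dest, basename(file)))
        end
    end
    return nothing
end
```

Because the link targets are resolved with `abspath`, the links stay valid from any working directory, which is one reason absolute paths matter in this step. The sketch assumes file names are unique across cases; a repeated name would make `symlink` throw.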

In [None]:
###Make sure to use absolute paths here 
##These are EXAMPLES, make sure to edit for your own directory setup
CASE_PATHS= ["./CFRADIALS/CASES/BAMEX", 
             "./CFRADIALS/CASES/HAGUPIT", 
             "./CFRADIALS/CASES/RITA", 
             "./CFRADIALS/CASES/VORTEX"]

TRAINING_PATH = "./CFRADIALS/CASES/TRAINING"
TESTING_PATH  = "./CFRADIALS/CASES/TESTING"

###CASE_PATHS is already a vector of paths, so pass it directly 
split_training_testing!(CASE_PATHS, TRAINING_PATH, TESTING_PATH)

# 2) Configure model
___
We'll now set up a configuration object for use in our model. This structure contains 
key information and settings such as the number of models to use, the decision thresholds for each model, and the locations to write output data to. 

At a high level, the first step in the process is calculating a set of input features containing information about each gate in the training radar sweeps.
`task_paths` lists the files specifying which features the user wishes to calculate, one file per model. It should be the same length as the number of models in the chain (more below). 
`input_path` is the path to the file or directory where the sweeps are located. 
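
Several settings in this walkthrough (`task_paths`, and later `met_probs` and `task_weights`) need one entry per model. A small sanity check along the following lines can catch a mismatch before a long training run starts. The helper `check_per_pass_lengths` and the file names are hypothetical and are not part of Ronin, which performs its own enforcement.

```julia
# Hypothetical check, not part of Ronin: every per-pass setting needs exactly
# one entry per model in the chain.
function check_per_pass_lengths(num_models::Int; settings...)
    for (name, value) in settings
        length(value) == num_models ||
            throw(ArgumentError("$name has $(length(value)) entries, expected $num_models"))
    end
    return true
end

# Placeholder values for a 2-pass model
check_per_pass_lengths(2;
    task_paths = ["tasks_1.txt", "tasks_2.txt"],
    met_probs  = [(0.1f0, 0.9f0), (0.1f0, 0.9f0)])
```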

In [None]:
task_paths = ["./standard_model/final_model_tasks_1.txt", "./standard_model/final_model_tasks_2.txt"]
input_path = TRAINING_PATH

Ronin can also create a "multi-pass" model, where a first model is trained on the full dataset and successive models are trained on subsets of these data. The motivation for this setup is to leverage the probabilistic information provided by the random forest approach. Consider a gate where 90% of the trees in the random forest agree on a certain classification: gates such as this may have fundamentally different characteristics than gates where the RF model is more evenly split between classes. It is then natural to expect that training a model specifically on gates of the second type may result in improved classification accuracy. Configuring a multi-pass model involves specifying the number of models to use in the composite, as well as the range of probabilities for which a gate moves on to the next pass. Both are explained below. 

We'll start with a 2-pass model. Grid search testing on the validation dataset has shown that this number of passes best balances classification performance against the retention of meteorological data. 

In [None]:
num_models = 2

Now, we'll define which gates are passed on to successive passes. 


`pass_1_probs = (.1,.9)`


This means that gates that between 10% and 90% (inclusive) of the trees classify as meteorological will be passed on to the second pass. 
Gates that <10% of the trees classify as meteorological will be assigned a label of non-meteorological, and 
gates that >90% of the trees classify as meteorological will be assigned a label of meteorological. This can be done for more passes, but we're just doing 2 as a minimal example.
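
The routing rule for a non-final pass can be written out as a short sketch. The function `route_gate` and the symbols it returns are hypothetical names chosen for illustration, not part of Ronin's API; `p` stands for the fraction of trees that classify a gate as meteorological.

```julia
# Hypothetical helper, not part of Ronin: route one gate after a non-final pass.
# `p` is the fraction of trees that classified the gate as meteorological.
function route_gate(p::Real, probs::Tuple{<:Real, <:Real})
    lo, hi = minmax(probs...)
    if p < lo
        return :non_meteorological   # confident NMD, labelled immediately
    elseif p > hi
        return :meteorological       # confident MD, labelled immediately
    else
        return :next_pass            # ambiguous, handed to the next model
    end
end

pass_1_probs = (0.1, 0.9)
# Gates exactly at 0.1 or 0.9 fall inside the inclusive range and move on
route_gate.([0.05, 0.10, 0.50, 0.90, 0.97], Ref(pass_1_probs))
```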

Met probabilities for the final pass of any composite model are interpreted somewhat differently. The maximum of the two probabilities is taken as the threshold: gates that >= that fraction of the trees classify as meteorological will be assigned a label of meteorological/MD, with all other gates being assigned a label of non-meteorological/NMD. For example, if one were to set 

`final_met_prob = (.1,.9)`

gates where >=90% of trees agree on a classification of meteorological would be assigned a label of meteorological/MD, with all other gates being assigned a label of non-meteorological/NMD.
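
As a minimal sketch of the final-pass rule, assuming `p` is the fraction of trees voting meteorological for a gate (`final_label` is a hypothetical name, not a Ronin function):

```julia
# Hypothetical helper, not part of Ronin: label one gate on the final pass.
function final_label(p::Real, probs::Tuple{<:Real, <:Real})
    threshold = maximum(probs)       # only the larger of the two values is used
    return p >= threshold ? :MD : :NMD
end

final_met_prob = (0.1, 0.9)
# A gate exactly at the threshold is kept as meteorological
final_label.([0.50, 0.90, 0.95], Ref(final_met_prob))
```

Unlike a non-final pass, no gate is deferred here: every gate receives either MD or NMD.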


In [None]:

initial_met_prob = (.1f0, .9f0)
final_met_prob = (.1f0, .9f0)

###Combine into vector for model configuration object 
###It's important to note that length(met_probs) is enforced to be equal to num_models 
met_probs = [initial_met_prob, final_met_prob]

Another important feature of Ronin is its implementation of spatial features. These calculations take into account not only the gate of interest, but the gates surrounding it as well. The concept can be loosely equated to convolutions in a neural network. As such, it's important to specify weights for each surrounding observation/gate. Ronin provides a series of default weight matrices that can be used to do so. More detail follows. 
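
To illustrate how a weight matrix shapes a spatial feature, the following sketch computes a weighted mean over the gates surrounding a point in a 2D field. It is an illustration under stated assumptions, not Ronin's code: the function `weighted_window_mean`, the toy field, and both example windows are made up here, and Ronin's own feature calculations and array layout may differ.

```julia
# Hypothetical sketch, not Ronin's implementation: weighted mean of the gates
# surrounding (i, j), ignoring out-of-bounds and NaN (missing) neighbours.
function weighted_window_mean(field::AbstractMatrix{<:Real},
                              weights::AbstractMatrix{<:Real}, i::Int, j::Int)
    half_r, half_c = size(weights) .÷ 2          # window sizes assumed odd
    total, weight_sum = 0.0, 0.0
    for dj in -half_c:half_c, di in -half_r:half_r
        ii, jj = i + di, j + dj
        checkbounds(Bool, field, ii, jj) || continue
        value = field[ii, jj]
        isnan(value) && continue
        w = weights[di + half_r + 1, dj + half_c + 1]
        total += w * value
        weight_sum += w
    end
    return weight_sum == 0 ? NaN : total / weight_sum
end

field = Float64[r + 10c for r in 1:9, c in 1:9]   # toy 9x9 "sweep"
full_window = ones(7, 7)                          # every neighbour counts equally
row_only = zeros(7, 7); row_only[4, :] .= 1.0     # nonzero along one dimension only

weighted_window_mean(field, full_window, 1, 1)    # near an edge: fewer neighbours
weighted_window_mean(field, row_only, 1, 1)       # only varies the second index
```

A window with nonzero weights along a single dimension, in the spirit of the azimuth-only and range-only defaults, restricts the feature to variation along that dimension, while a placeholder window suits tasks that only look at the gate itself.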

In [None]:

###The following are default windows specified in RoninConstants.jl 
###Standard 7x7 window 
sw = Ronin.standard_window 
###7x7 window with only nonzero weights in azimuth dimension 
aw = Ronin.azi_window
###7x7 window with only nonzero weights in range dimension 
rw = Ronin.range_window 
###Placeholder window for tasks that do not require spatial context 
pw = Ronin.placeholder_window 

###Specify a weight matrix for each individual task in the configuration file 
###For this model, we are being aggressive and there are a lot of tasks in the task file, 
###so add a lot of different windows in range and azimuth. 
weight_vec_1 = [rw, aw, sw, rw, aw, sw, rw, aw, sw, rw, aw, sw, pw, pw, pw, pw, pw]
weight_vec_2 = [pw, pw, sw, sw, sw, sw, sw, pw, pw, pw, pw, pw]
###Specify a weight vector for each model pass 
###length(task_weights) is enforced to be equal to num_models (should have a set of weights for each pass) 
task_weights = [weight_vec_1, weight_vec_2] 


In [None]:
base_name = "raw_model"
base_name_features = "output_features" 
###List of paths to output trained models to. Enforced to be same size as num_models 
model_output_paths = [base_name * "_$(i-1).jld2" for i in 1:num_models ]
###List of paths to output calculated features to. Enforced to be same size as num_models 
feature_output_paths = [base_name_features * "_$(i-1).h5" for i in 1:num_models]


###Options are "balanced" or "". If "balanced", the decision trees will be trained 
###on a weighted version of the existing classes in order to combat class imbalance 
class_weights = "balanced"

###Name of variable in cfradials that has already had interactive QC applied 
QC_var = "VG"

###Name of a variable in the input cfradials that has not had postprocessing applied.
###It masks which gates are predicted upon: gates where this variable is MISSING
###are considered to contain no data and are removed.
###Generally useful to have this be a raw variable 
remove_var = "VEL"

###Whether or not the input features for the model have already been calculated 
file_preprocessed = [false, false]

###Where to write out the masks to in cfradial file. 
mask_names = ["PASS_1_MASK", "PASS_2_MASK"]




In [None]:
###Create model config object
config = ModelConfig(num_models = num_models, model_output_paths = model_output_paths,
                     met_probs = met_probs, feature_output_paths = feature_output_paths,
                     input_path = TRAINING_PATH, task_mode = "nan",
                     file_preprocessed = file_preprocessed, task_paths = task_paths,
                     QC_var = QC_var, remove_var = remove_var, QC_mask = false,
                     mask_names = mask_names, VARS_TO_QC = ["VEL"],
                     class_weights = class_weights, HAS_INTERACTIVE_QC = true,
                     task_weights = task_weights)

# 3) Train a composite model!
___
Now that we have set up our model configuration, we simply invoke the `train_multi_model` function. This will likely take a long time (1hr+), especially when training 2 or more models in a chain. 
<b>Data will be written to the cfradial files during this process.</b>

In [None]:
###Train composite model! 
train_multi_model(config)  

# 4) Verify the efficacy of the model on the testing dataset 
___
We'll reuse the `ModelConfig` struct from above, but this time substitute the path to the testing data for `input_path`. 

In [None]:
###Switch input data to testing set 
config.input_path = TESTING_PATH
###Let's also dial up the final met probs a bit to ensure the greatest amount of NMD removal possible 
config.met_probs = [(.1f0, .9f0), (.1f0, .95f0)]


## Now, call the `composite_prediction` function

In [None]:
###I recommend setting `write_predictions_out` to `true` so that predictions
###can be retained for later usage 
predictions, verification, indexers = composite_prediction(config; write_predictions_out=true, prediction_outfile="NEW_MODEL_PREDICTIONS_OUT.h5")

## Now, let's see how the model did using the `get_contingency` function
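
For intuition about what a column-normalized contingency matrix contains, here is a minimal sketch on toy data. `sketch_contingency` is a hypothetical stand-in and makes no claim about the exact layout or return type of Ronin's `get_contingency`; the row and column ordering is an assumption.

```julia
# Hypothetical sketch, not Ronin's `get_contingency`: rows are predicted
# classes, columns are true classes, both ordered (meteorological, non-met).
function sketch_contingency(predictions::AbstractVector{Bool},
                            truth::AbstractVector{Bool}; normalize::Bool = false)
    length(predictions) == length(truth) ||
        throw(DimensionMismatch("predictions and truth must have equal length"))
    counts = zeros(Float64, 2, 2)
    for (p, t) in zip(predictions, truth)
        counts[p ? 1 : 2, t ? 1 : 2] += 1
    end
    # Dividing by column sums makes each true-class column add to 1
    return normalize ? counts ./ sum(counts; dims = 1) : counts
end

truth       = [true, true, true, false, false]
predictions = [true, true, false, false, true]
sketch_contingency(predictions, truth; normalize = true)
```

With normalization, the diagonal entries read as the fraction of each true class that was predicted correctly. A class with no true samples would produce `NaN` in its column in this sketch.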

In [None]:
###If `normalize` is set to `true`, this will return a contingency matrix where 
###each column contains the predictions as a fraction of the total number of true values (each column will add to 1)
get_contingency(predictions, Vector{Bool}(verification[:]))

## Looks pretty good! Let's now use it to actually apply quality control to the testing scans. 

In [None]:
QC_scan(config)