# Data Driven Presales Evaluation

This Jupyter notebook evaluates whether the data in your csv file(s) is sufficient for creating simulators with supervised learning methods. Learning the state transitions, $(\underline{s}, \underline{a}) \rightarrow \underline{s}'$, from data is growing in popularity; however, your data may not be adequately distributed over the ranges needed for your Reinforcement Learning use case. This notebook is split into three sections:

- Data Relevance
- Sparsity
- Data Distribution Confidence

This notebook uses the `nbgrader` package to 'grade' your data quality, distributions, and the feasibility of creating approximated simulations from your data. A score of 100 means you passed all the tests. Each test takes one of three forms: assert conditions that a notebook cell must run without error, a `Y/N` prompt asking for your agreement, or Subject Matter Expert (SME) data ranges that you supply. The `nbgrader` package allows certain snippets of code to be hidden from you to simplify the use of this notebook. You can tell that code is hidden when a cell can no longer be edited in the Jupyter notebook.

> To pass tests, you may have to create a new cell and write code to filter/smooth/manipulate your data 

Run all cells successfully to assess whether a data driven simulator can be adequately created from your data. Once you have run the cells without assertion errors, double-check that your notebook passes the tests by clicking the `Validate` button in the Jupyter notebook. If all tests pass, export the notebook as a PDF to share.

In [None]:
import pandas as pd
from scipy import stats
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import yaml

gitroot = os.popen('git rev-parse --show-toplevel').read()
os.chdir(gitroot.rstrip())

### SECTION A: Data Relevance

- load csv file(s)
- define potential inputs/outputs
- check NaN
- define cadence of state transitions
- check outliers and plot each individual dataset
- check NaN after concatenating
- use feature importances to determine best features
- re-define features (inputs)
- save as single csv, named approved_data.csv

`Step 1`: Add path to filenames as strings to the `filenames` list.

In [None]:
filenames = [
    'example_data.csv',
    'data.csv',
]

The test below awards 10 points if the data can be successfully loaded into the Jupyter notebook, i.e. every path in `filenames` points to a readable csv file.
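
Before running the load cell, it can help to confirm that every entry in `filenames` points to an existing file. A minimal sketch, assuming a hypothetical helper `missing_files` that is not part of this notebook:

```python
from pathlib import Path


def missing_files(paths):
    """Return the entries of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Hypothetical usage: an empty list means every path exists on disk.
print(missing_files(['example_data.csv', 'data.csv']))
```

Relative paths are resolved against the current working directory, which the setup cell changes to the root of the git repository.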

In [None]:
df_list = []
for location in filenames:
    read_datafile = pd.read_csv(location)
    df_list.append(read_datafile)

`Step 2`: Add any potential feature names to the dictionary, labeling each as a state or an action as shown below. Be broader than you think you need to be: later, this notebook uses feature importances to help you determine which features matter most.

```python
config['IO']['feature_name'] = {
    'name1': 'state',
    'name2': 'state',
    'name3': 'action',
}
```
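
Later steps often need the state names and action names separately. One way to split the dictionary, sketched with a hypothetical helper `split_features` that is not defined elsewhere in this notebook:

```python
def split_features(feature_map):
    """Split a {name: 'state' | 'action'} mapping into two ordered lists."""
    # Anything that is neither a state nor an action is most likely a typo.
    unknown = {n: k for n, k in feature_map.items() if k not in ('state', 'action')}
    if unknown:
        raise ValueError(f'Unrecognized feature types: {unknown}')
    states = [name for name, kind in feature_map.items() if kind == 'state']
    actions = [name for name, kind in feature_map.items() if kind == 'action']
    return states, actions


states, actions = split_features({'name1': 'state', 'name2': 'state', 'name3': 'action'})
print(states, actions)  # ['name1', 'name2'] ['name3']
```

Raising on unknown labels is a design choice of this sketch; it surfaces a misspelled `'state'` or `'action'` early instead of silently dropping the feature.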

In [None]:
with open('config/config_model.yml') as conf:
    config = yaml.full_load(conf)

# TODO: Modify the dictionary so each entry reads 'feature_name': 'state' or 'action'
config['IO']['feature_name'] = {
    'theta': 'state',
    'alpha': 'state',
    'theta_dot': 'state',
    'alpha_dot': 'state',
    'Vm': 'action',
}

`Step 3`: Add the desired states to be predicted by the supervised learning simulator. Typically these are the next states after a timestep, $(\underline{s}, \underline{a}) \rightarrow \underline{s}'$. However, there may be additional features in the data that you wish to predict. This is okay too; just make sure the input features carry enough proxy information to determine them.

```python
config['IO']['output_name'] = [
    'name1',
    'name2',
]
```

In [None]:
# TODO: Modify list to consist of predicted states
config['IO']['output_name'] = [
    'theta',
    'alpha',
    'theta_dot',
    'alpha_dot',
]

with open('config/config_model.yml', 'w') as conf:
    yaml.dump(config, conf, sort_keys=False)

# Collect the feature names (dictionary keys) in their configured order
feature_names = list(config['IO']['feature_name'])

The test below checks each csv for NaN (Not a Number) or SNA (Signal Not Available) values.
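
The notebook's own check is hidden, and the `hasNaN` helper used later is not shown here. As a rough illustration only, a per-dataset NaN check could look like the sketch below; `frame_has_nan` is a hypothetical name, and the two small frames are made-up data.

```python
import numpy as np
import pandas as pd


def frame_has_nan(df):
    """Return True if any cell in `df` is NaN/None, otherwise False."""
    return bool(df.isna().to_numpy().any())


clean = pd.DataFrame({'theta': [0.0, 0.1], 'Vm': [1.0, -1.0]})
dirty = pd.DataFrame({'theta': [0.0, np.nan], 'Vm': [1.0, -1.0]})
check = [frame_has_nan(clean), frame_has_nan(dirty)]
print(check)  # [False, True]
```

A sentinel number written in place of an unavailable signal is an ordinary value to pandas, so a check like this would not flag it; such values are more likely to surface in the outlier check.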

In [None]:

for x in check_nan:
    assert(x == None or x == False)

`Step 4`: Set the timelag, i.e. the number of iterations spanned by one state transition, $(\underline{s}, \underline{a}) \rightarrow \underline{s}'$. Think of it as the number of csv rows that make up the timestep of a "steady state" transition: a change in an input to the system is reflected in the outputs after this many sample measurements.
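
To make the role of `timelag` concrete, the sketch below pairs each row with the outputs observed `timelag` rows later. It assumes the rows of a single csv are evenly sampled and in time order; `make_transitions` and the `next_` column prefix are hypothetical and not taken from this notebook's hidden code.

```python
import pandas as pd


def make_transitions(df, output_names, timelag=1):
    """Pair each row (s, a) with the outputs observed `timelag` rows later."""
    # shift(-timelag) moves future rows up so they align with the current row.
    targets = df[output_names].shift(-timelag).add_prefix('next_')
    paired = pd.concat([df, targets], axis=1)
    # The final `timelag` rows have no future observation, so drop them.
    return paired.iloc[:max(len(df) - timelag, 0)]


data = pd.DataFrame({'theta': [0.0, 0.1, 0.3, 0.6], 'Vm': [1.0, 1.0, 0.0, 0.0]})
pairs = make_transitions(data, ['theta'], timelag=2)
print(pairs)
```

With `timelag=2`, the first row is paired with the `theta` value recorded two rows later, and the last two rows are dropped because they have no target.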

In [None]:
timelag = 1

The test below finds outliers in each dataset and plots the original states and actions with the outliers overlaid. Outliers can occur due to noisy sensors, abnormal conditions, or a Signal Not Available (SNA) condition, where the value defaults to a very large or very small number.

- fits to the data in each csv and checks whether any data lies more than 3 std from the mean
- plots states and actions, overlayed with outliers
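
The outlier test itself is hidden. One simple way to flag values outside 3 std is a per-column z-score, sketched below on synthetic data; `find_outliers` is a hypothetical helper, and the hidden cell may fit the data differently.

```python
import numpy as np
import pandas as pd


def find_outliers(df, n_std=3.0):
    """Return a boolean frame marking values more than `n_std` std from the column mean."""
    z = (df - df.mean()) / df.std(ddof=0)
    return z.abs() > n_std


rng = np.random.default_rng(0)
signal = pd.DataFrame({'theta': rng.normal(0.0, 1.0, 500)})
signal.loc[100, 'theta'] = 50.0  # inject an obvious spike, e.g. an SNA default value
mask = find_outliers(signal)
print(signal[mask['theta']])
```

Note that a large spike inflates the standard deviation it is measured against, so filtering and re-running the check can reveal further, smaller outliers.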

In [None]:
## Check for Outliers


`Step 5`: The notebook will prompt you to accept the outliers (`Y/N`). A few outliers are okay as long as they make sense to you and will not interfere with learning the normal conditions you expect the simulator to model.

If you need to manipulate the data further, enter `N`.

In [None]:
accept_outliers = input('Do you accept the outliers in the datasets plotted above? Enter "Y" to accept, or "N" to go back and filter or smooth the data: ')

# Accept Y/Yes in any letter case; any other answer fails the test
assert accept_outliers.strip().lower() in ('y', 'yes'), "Filter or smooth the data to remove outliers before Step 5, then re-run the cells up to this point"

After you accept the outliers, this notebook concatenates the data and checks for NaNs again, since datasets that are missing features (columns) can introduce NaNs when combined.
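
The toy example below shows why the second check matters: when two frames with different columns are concatenated, pandas aligns on column names and fills the gaps with NaN. The frames `a` and `b` are made-up data for illustration.

```python
import pandas as pd

a = pd.DataFrame({'theta': [0.0, 0.1], 'Vm': [1.0, -1.0]})
b = pd.DataFrame({'theta': [0.2, 0.3]})  # this dataset has no 'Vm' column

# Concatenation aligns on column names and fills the gaps with NaN.
combined = pd.concat([a, b], ignore_index=True, sort=False)
nan_counts = combined.isna().sum()
print(nan_counts)
```

Here both rows contributed by `b` end up with NaN in the `Vm` column, even though neither input frame contained a NaN on its own.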

In [None]:
# Concatenate only the selected feature columns from every dataset
dfs = pd.concat([df[feature_names] for df in df_list], sort=False)

print(dfs.head())

# Check the concatenated data for NaNs
check_nan = hasNaN(dfs.to_numpy())
assert(check_nan is None or check_nan == False)

We have now qualified your datasets enough to export them to a single csv named `approved_data.csv`. You have finished the first section, `Data Relevance`. This says nothing yet about data sparsity and distribution confidence, which are the next two sections.

In [None]:
dfs.to_csv('approved_data.csv', mode='w', index=False)

csv_to_pickle('approved_data.csv', timelag=timelag)

The cell below determines the `feature importances`, which quantify how valuable each input is in explaining the target variable. The feature importances sum to one, and the largest value marks the most important feature. This is a useful trick for designing inputs/outputs for supervised learning. We determine the feature importances for each of the predicted outputs, based on the features provided in the cells above. They are then plotted with a legend that designates the predicted value.

> If you are NOT satisfied with your chosen model inputs, modify the inputs in the cell above and run through all the cells leading up to this again.
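
The feature importance cell is hidden, so the estimator it uses is not known from this notebook. As an illustration only, tree ensembles in scikit-learn expose importances that sum to one; the sketch below assumes a `RandomForestRegressor` and synthetic data in which the target depends mostly on `Vm`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))                    # synthetic columns: theta, alpha, Vm
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=400)   # target driven mostly by Vm

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = dict(zip(['theta', 'alpha', 'Vm'], model.feature_importances_))
print(importances)
```

In this setup `Vm` receives nearly all of the importance, which is the kind of signal to look for when deciding which inputs to keep.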

In [None]:
## Feature Importances


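`Step 6`: Enter `Y` if you accept your current model inputs and outputs. If not, enter `N`, re-enter the states and actions in Step 2, and re-run the cells to visualize the feature importances again.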

In [None]:
accept_features = input('Do you accept the features based upon the feature importances shown? Enter "Y" to continue, or "N" to re-enter the states and actions and run through the cells again: ')

# Accept Y/Yes in any letter case; any other answer fails the test
assert accept_features.strip().lower() in ('y', 'yes'), "Re-enter states and actions at Step 2 and re-run the cells to visualize the feature importances again."

### SECTION B: Sparsity

- define Subject Matter Expert (SME) limits on feature ranges
- plot histograms
- compare SME limits with data limits using 2 std

In [None]:
### Template limits to copy/paste into next cell


`Step 7`: Modify the template of min/max values for each feature. This is where you define the Subject Matter Expert (SME) limits. Define the range in which it would be reasonable to run the simulator, regardless of what is captured in the data. These are the limits within which Reinforcement Learning will reasonably explore to provide novel solutions.

> Copy/Paste the above template below the `%%writefile config/sme_limits.yml` line and run the cell to write to the file. 

In [None]:
%%writefile config/sme_limits.yml
theta:
  min: -1.5708
  max: 1.5708
alpha:
  min: -3.14159
  max: 3.14159
theta_dot:
  min: -7.822916413465077
  max: 7.610718493430506
alpha_dot:
  min: -12.841118663208904
  max: 11.931522401818508
Vm:
  min: -3
  max: 3

The following test will plot the histograms for each of the features and check whether or not the data's mean $\pm 2$ std is larger than the SME limits. 

> The tests assume your data has a Gaussian distribution, so other shapes, e.g. bi-modal data, can be problematic.
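
The hidden test is not shown, so the sketch below is only one reading of the comparison: a feature passes when its mean $\pm 2$ std spans the whole SME range. The helper `covers_sme_range` and the synthetic `vm` series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def covers_sme_range(series, sme_min, sme_max, n_std=2.0):
    """True if mean +/- n_std * std of `series` spans the whole [sme_min, sme_max] range."""
    mean, std = series.mean(), series.std()
    return bool(mean - n_std * std <= sme_min and mean + n_std * std >= sme_max)


rng = np.random.default_rng(0)
vm = pd.Series(rng.uniform(-3.0, 3.0, 1000))  # synthetic stand-in for the Vm column

narrow_ok = covers_sme_range(vm, sme_min=-3.0, sme_max=3.0)
wide_ok = covers_sme_range(vm, sme_min=-10.0, sme_max=10.0)
print(narrow_ok, wide_ok)
```

Under this reading, SME limits of $\pm 3$ are covered by the data, while limits of $\pm 10$ ask the simulator to operate well outside what the data supports.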

In [None]:
## Histogram and report sparsity index


### SECTION C: Data Distribution Trust (Confidence on Interpolation)

- evaluate region confidence with model upper bound with 2 std from mean
- evaluate region confidence with model lower bound with 2 std from mean
- evaluate region confidence with SME max
- evaluate region confidence with SME min

We fit a Gaussian Mixture Model (GMM) to the data to cluster it into distributions with means and covariances. We can then sample the GMM with a random state-action pair and evaluate which regions to trust, compared against the SME's desired limits.
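
As a rough sketch of the idea, the example below fits scikit-learn's `GaussianMixture` to synthetic two-feature data and compares the log-likelihood of a point inside the data with one far outside it. The data, the choice of `score_samples` as the confidence measure, and the query points are assumptions; the hidden cells may evaluate confidence differently.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic (state, action) samples standing in for the approved data.
data = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(500, 2))

# Two components, matching the number of features in this toy dataset.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

# score_samples returns the log-likelihood of each point under the fitted mixture.
inside = gmm.score_samples(np.array([[0.0, 0.0]]))[0]
outside = gmm.score_samples(np.array([[8.0, 4.0]]))[0]
print(inside > outside)
```

A much lower log-likelihood at an SME limit than inside the data suggests the simulator would be extrapolating there rather than interpolating.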

In [None]:
## Create GMM using the same number of components as the number of features


In [None]:
## Evaluate region confidence with model upper bound with 2 std from mean


In [None]:
## Evaluate region confidence with model lower bound with 2 std from mean


In [None]:
## Evaluate region confidence with SME max


In [None]:
## Evaluate region confidence with SME min
