In [1]:
import ray
import pandas as pd


In [None]:
ray.init()

In [None]:
import asd

# Basic Example

## Load Data

The example data consists of six columns: three input columns (linear x, linear x + 1 and random values) and three output columns (sin of linear x, 1 + sin of linear x and random values).
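
For readers without `example_data.csv` at hand, the following is a minimal sketch of how a comparable data set could be generated. Only the target names `sinx`, `sinx_plus1` and `randomy` appear in this notebook; the input names `linx`, `linx_plus1` and `randomx`, the row count and the x range are assumptions made for illustration.

```python
import math
import random


def make_example_columns(n_rows=200, seed=0):
    """Build three input and three target columns as plain Python lists."""
    rng = random.Random(seed)
    # Evenly spaced x values on [0, 2*pi]; range and row count are assumptions.
    linx = [2 * math.pi * i / (n_rows - 1) for i in range(n_rows)]
    return {
        # Hypothetical input column names (not taken from the CSV file).
        "linx": linx,
        "linx_plus1": [x + 1 for x in linx],
        "randomx": [rng.random() for _ in range(n_rows)],
        # Target column names as used in the cells of this notebook.
        "sinx": [math.sin(x) for x in linx],
        "sinx_plus1": [1 + math.sin(x) for x in linx],
        "randomy": [rng.random() for _ in range(n_rows)],
    }


columns = make_example_columns()
print({name: len(values) for name, values in columns.items()})
```

Passing the resulting dictionary to `pd.DataFrame(columns)` would turn it into a frame with the same layout as `data`.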

In [None]:
data = pd.read_csv("example_data.csv", dtype=float)

In [None]:
data.head()

## Get Combinations function

This function can be used to determine the overall number of combinations the predictability routine analyses, given the number of data columns, fitting type, etc. This makes it possible to estimate the overall runtime of the predictability routine.

The function returns a list of combination tuples, where the first `inputs` elements of each tuple correspond to the inputs and the remaining `outputs` elements to the targets. The argument `targets` can be used to define columns that should be regarded exclusively as targets.
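
To make the tuple layout concrete, here is a small stand-alone sketch of such an enumeration. It is not the implementation of `get_column_combinations`; it assumes that inputs are drawn only from the non-target columns and outputs only from `targets`, which is one possible reading of the description above. The helper name `column_combinations` is hypothetical.

```python
from itertools import combinations


def column_combinations(all_cols, inputs, outputs, targets):
    """Enumerate (input..., output...) tuples under the stated assumption."""
    target_set = set(targets)
    # Assumption: columns listed in `targets` never act as inputs.
    input_pool = [col for col in all_cols if col not in target_set]
    result = []
    for ins in combinations(input_pool, inputs):
        # Assumption: outputs are drawn from `targets` only.
        for outs in combinations(targets, outputs):
            result.append(ins + outs)
    return result


combos = column_combinations([1, 2, 3, 4, 5, 6], inputs=3, outputs=1, targets=[5, 6])
print(len(combos))  # 8 under the assumption above
print(combos[0])    # (1, 2, 3, 5)
```

Each tuple lists the input columns first and the target column(s) last, matching the layout described above.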

In [None]:
# applied to a numerical example:
asd.get_column_combinations(all_cols=[1, 2, 3, 4, 5, 6],
                            inputs=3,
                            outputs=1,
                            targets=[5, 6]
                            )

The argument `amount_only` can be used to output the amount of combinations only:

In [None]:
asd.get_column_combinations(all_cols=[1, 2, 3, 4, 5, 6],
                            inputs=3,
                            outputs=1,
                            targets=[5, 6],
                            amount_only=True
                            )
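
The number of combinations can also be worked out by hand, which is useful for a rough runtime estimate. The sketch below assumes that inputs come only from the non-target columns and outputs only from the target columns, and that work is spread evenly over the workers. Both function names and the per-fit time are hypothetical and not part of `asd`.

```python
from math import comb


def count_combinations(n_cols, n_targets, inputs, outputs):
    """Closed-form count: choose inputs from non-targets, outputs from targets."""
    n_input_candidates = n_cols - n_targets
    return comb(n_input_candidates, inputs) * comb(n_targets, outputs)


def estimate_runtime_seconds(n_combinations, seconds_per_fit, n_workers=1):
    """Rough estimate that assumes ideal parallel speed-up across workers."""
    return n_combinations * seconds_per_fit / n_workers


n = count_combinations(n_cols=6, n_targets=2, inputs=3, outputs=1)
print(n)  # 8
# seconds_per_fit is a placeholder; measure it on a single combination first.
print(estimate_runtime_seconds(n, seconds_per_fit=2.0, n_workers=4))  # 4.0
```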

Applied to the data above:

In [None]:
asd.get_column_combinations(all_cols=data.columns, inputs=2, outputs=1, targets=["sinx", "randomy", "sinx_plus1"])

In [None]:
asd.get_column_combinations(all_cols=data.columns, inputs=2, outputs=1, targets=["sinx", "randomy", "sinx_plus1"], amount_only=True)

## Predictability function

Here the predictability function is run over all possible 1+1 combinations (one input, one target), where the target is either sin, sin + 1 or random values and the input is linear x, linear x + 1 or random values.

(The purpose of the sin + 1 column is to have a tuple that contains only positive values, so that a power law fit can be applied.)
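
Why positivity matters can be seen from one common way of fitting a power law $y = a x^b$: taking logarithms turns it into the straight line $\log y = \log a + b \log x$, and logarithms are only defined for positive values. The sketch below illustrates this approach; whether `asd` fits power laws this way is not stated in this notebook, and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    if any(v <= 0 for v in xs) or any(v <= 0 for v in ys):
        raise ValueError("a log-log fit needs strictly positive inputs and outputs")
    log_x = [math.log(v) for v in xs]
    log_y = [math.log(v) for v in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Ordinary least squares for slope b and intercept log(a).
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x ** 1.5 for x in xs]  # exact power law with a=2, b=1.5
a, b = fit_power_law(xs, ys)
print(round(a, 6), round(b, 6))
```

A column such as sin x takes negative values, so this kind of fit would raise an error there, whereas 1 + sin x stays non-negative.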

### "normal" routine

In [None]:
metrics_dict, datas_dict = asd.run_predictability(data=data,
                                                  input_cols=1,
                                                  output_cols=1,
                                                  col_set=None,
                                                  targets=["sinx", "randomy", "sinx_plus1"],
                                                  method="kNN",
                                                  random_state_split=None,
                                                  refined_n_best=0
                                                  )

_Note that starting the Ray instance accounts for most of the runtime. If you run the cell above a second time, the runtime drops to about 1 s._
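
To give an intuition for what a kNN-based predictability check measures, here is a self-contained sketch in plain Python. It is an illustration under assumptions, not the `asd` implementation: it uses a single hold-out split, one input column, `k=5` neighbours and the R² score, and all function names are hypothetical.

```python
import math
import random


def knn_predict(train_x, train_y, query, k=5):
    """Average the targets of the k training points closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def holdout_r2(xs, ys, k=5, test_fraction=0.25, seed=0):
    """Shuffle, split into train/test, and score kNN predictions on the test part."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    train_x = [xs[i] for i in train]
    train_y = [ys[i] for i in train]
    preds = [knn_predict(train_x, train_y, xs[i], k) for i in test]
    return r2_score([ys[i] for i in test], preds)


rng = random.Random(1)
xs = [2 * math.pi * i / 199 for i in range(200)]
sin_score = holdout_r2(xs, [math.sin(x) for x in xs])
noise_score = holdout_r2(xs, [rng.random() for _ in xs])
print("sin target:", sin_score, "random target:", noise_score)
```

A target that depends smoothly on the input scores close to 1, while a purely random target scores around zero or below, which is the kind of contrast the results table is meant to expose.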

### greedy routine

In [None]:
greedy_metrics_dict, greedy_datas_dict = asd.run_predictability(data=data,
                                                                input_cols=2,
                                                                output_cols=1,
                                                                col_set=None,
                                                                targets=["sinx", "randomy", "sinx_plus1"],
                                                                method="kNN",
                                                                random_state_split=None,
                                                                greedy=True,
                                                                refined_n_best=0
                                                                )

### Results

In [None]:
pd.DataFrame.from_dict(metrics_dict).transpose()

In [None]:
pd.DataFrame.from_dict(greedy_metrics_dict).transpose()

### Structure of returned data dictionary

In [None]:
# Inspect the entry that belongs to the first analysed tuple.
struc_dict = datas_dict[next(iter(datas_dict))]
for key, value in struc_dict.items():
    if isinstance(value, dict):
        # Nested dictionaries: list their keys.
        print(key, "\t dict with key(s):\t", list(value.keys()))
    else:
        # Everything else is expected to be array-like and expose `.shape`.
        print(key, "\t type:\t", type(value), "\t shape:\t", value.shape)

### Plotting

In [None]:
asd.predictability_plot(datas_dict, list(datas_dict.keys())[2],
                        plot_along=["linear",
                                    "mean",
                                    "pl" # note that power law fits are only performed on all-positive input & output data.
                                    ]
                        )

_Note that you can similarly plot the results of the greedy run._

## Tuple Selection function

This function can be used to limit the number of tuples that are subsequently analysed in more detail.
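
Conceptually, such a selection amounts to ranking the tuples by a score and keeping the top `n_best`. The sketch below shows this idea on a toy dictionary; the metric name `r2`, the scores and the helper `select_best_tuples` are made up for illustration and do not reflect the actual contents of `metrics_dict` or the criterion used by `tuple_selection`.

```python
def select_best_tuples(metrics, score_key, n_best):
    """Return the n_best tuple keys with the highest value of `score_key`."""
    ranked = sorted(metrics, key=lambda tup: metrics[tup][score_key], reverse=True)
    return ranked[:n_best]


# Made-up scores purely for illustration; not results from this notebook.
toy_metrics = {
    ("linx", "sinx"): {"r2": 0.99},
    ("randomx", "sinx"): {"r2": -0.10},
    ("linx", "randomy"): {"r2": -0.05},
    ("linx", "sinx_plus1"): {"r2": 0.98},
}

best = select_best_tuples(toy_metrics, score_key="r2", n_best=2)
print(best)  # [('linx', 'sinx'), ('linx', 'sinx_plus1')]
```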

In [None]:
selected_tuples = asd.tuple_selection(metrics_dict, n_best=3)
selected_tuples

## Refine Predictability function

This function is used to further refine the predictability of the previously selected best tuples.

In [None]:
refined_metrics_dict, refined_datas_dict = asd.refine_predictability(best_tuples=selected_tuples,
                                                                     data_dict=datas_dict,
                                                                     time_left_for_this_task=60,
                                                                     n_jobs=-1,
                                                                     use_ray=True,
                                                                     )

In [None]:
pd.DataFrame.from_dict(refined_metrics_dict).transpose()

### Structure of returned dictionaries

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
# Inspect the entry that belongs to the first refined tuple.
struc_dict = refined_datas_dict[next(iter(refined_datas_dict))]
for key, value in struc_dict.items():
    if isinstance(value, dict):
        # Keys of nested dictionaries are deliberately not listed here.
        print(key, "\t dict")
    elif isinstance(value, Pipeline):
        # Fitted scikit-learn pipelines have no `.shape`, so only print the type.
        print(key, "\t type:\t", type(value))
    else:
        print(key, "\t type:\t", type(value), "\t shape:\t", value.shape)

### Plotting

To compare the refined results with the initial ones in the plot, we need to pass the initial datas dict via ``initial_datas_dict`` and set ``refined_plot=True``.

In [None]:
asd.predictability_plot(refined_datas_dict,
                        list(refined_datas_dict.keys())[2],
                        refined_plot=True,
                        initial_datas_dict=datas_dict,
                        plot_along=["linear",
                                    "mean",
                                    "pl", # note that power law fits are only performed on all-positive input & output data.
                                    "init"]
                        )

## Predictability with direct refined run

With ``refined_n_best > 0``, an immediate refined run is performed within the ``run_predictability`` routine itself.

In [None]:
diref_metrics_dict, diref_datas_dict = asd.run_predictability(data=data,
                                                              input_cols=1,
                                                              output_cols=1,
                                                              col_set=None,
                                                              targets=["sinx", "randomy", "sinx_plus1"],
                                                              method="kNN",
                                                              random_state_split=None,
                                                              refined_n_best=3,
                                                              )

In [None]:
pd.DataFrame.from_dict(diref_metrics_dict).transpose()

To plot the results, we again need to set ``refined_plot=True``. Since all data is already included in the datas dict returned above, no second datas dict has to be passed.

In [None]:
asd.predictability_plot(diref_datas_dict,
                        list(diref_datas_dict.keys())[2],
                        refined_plot=True,
                        plot_along=["init",
                                    "linear",
                                    "mean",
                                    "pl" # note that power law fits are only performed on all-positive input & output data.
                                    ]
                        )