# Working with Structure Function in TAPE

In [None]:
import numpy as np

from tape.analysis.structurefunction2 import calc_sf2

from tape.analysis.structure_function.base_argument_container import StructureFunctionArgumentContainer
from tape.analysis.structure_function.basic.calculator import BasicStructureFunctionCalculator

import tape

## Structure Function Calculators and Argument Containers

Several different Structure Function calculation methods have been implemented.
Each Structure Function calculator is a child class of the `StructureFunctionCalculator` base class.

A dictionary of all available SF calculation methods named `SF_METHODS` is dynamically
generated at run time.
The dictionary is a mapping of unique SF calculation method name to calculator class.
Generally a user would not interact with `SF_METHODS` other than to view the list of SF calculation method names.

The `__init__` docstring for each SF calculator class contains references to the original papers.

In [None]:
from tape.analysis.structure_function.calculator_registrar import SF_METHODS

print(SF_METHODS.keys())

help(SF_METHODS["bauer_2009b"])

Because there are many configuration options for a given Structure Function calculation,
and to avoid both cluttering the API and the use of kwargs, we leverage a
`StructureFunctionArgumentContainer` `dataclass` object.

Each Structure Function calculator instance will accept a `StructureFunctionArgumentContainer` object.
Every Structure Function calculator is required to implement an `expected_argument_container`
static method that returns the specific `StructureFunctionArgumentContainer` class required by that calculator.

Here we show how to check for the expected argument container class, instantiate
one, and check the value of one of the arguments it contains.

Note: 
- The full list of arguments and their functionality can be seen by calling 
`help(StructureFunctionArgumentContainer)`.
- It's possible that a particular type of Structure Function calculator will need
additional unique input arguments, so `StructureFunctionArgumentContainer` can be subclassed.
An example of subclassing the argument container is shown later in this notebook.

In [None]:
# Note that we do not need to create an instance of the BasicStructureFunctionCalculator to call the `expected_argument_container` static method.
arg_container_type = BasicStructureFunctionCalculator.expected_argument_container()

# Create an instance of the returned argument container type.
arg_container = arg_container_type()

# Print out the default value for one particular argument.
print(arg_container.bin_count_target)

# Show the full list of arguments and their definitions.
help(StructureFunctionArgumentContainer)

Using the argument container created above, we can create an instance of a "basic" Structure Function calculator method.

In [None]:
# Create an instance of the "basic" calculator using the `SF_METHODS` dictionary.
# We provide an empty array of "data" as well as the argument container created
# in the previous cell.
basic_calculator = SF_METHODS["basic"]([], arg_container)

# Since no input data was provided, we expect an empty output: `([], [])`
basic_calculator.calculate()

## `calc_sf2` - The Driver for Structure Function

The vast majority of the time, users will not interact directly with a Structure
Function calculator.
Instead, we expect users to call the `calc_sf2` function either directly or via an `Ensemble` or `TimeSeries`.

`calc_sf2` provides the most intuitive API to calculate the Structure Function for
input data. The API is described thoroughly in the [calc_sf2 documentation](https://tape.readthedocs.io/en/latest/autoapi/tape/analysis/structurefunction2/index.html#tape.analysis.structurefunction2.calc_sf2).

Below we show various small examples of calling `calc_sf2` directly.

In [None]:
times = [1.11, 2.23, 3.45, 4.01, 5.67, 6.32, 7.88, 8.2]
fluxes = [0.11, 0.23, 0.45, 0.01, 0.67, 0.32, 0.88, 0.2]

# note that `errors`, `bands`, and `lightcurve_ids` are optional input arguments.
errors = [0.1, 0.023, 0.045, 0.1, 0.067, 0.032, 0.8, 0.02]
bands = np.array(["r"] * len(fluxes))
lightcurve_ids = [1, 1, 1, 1, 1, 1, 1, 1]

res = calc_sf2(times, fluxes, errors, bands, lightcurve_ids)

print(res)

In [None]:
# Same as above, but we explicitly create an argument_container and produce the same results as before

times = [1.11, 2.23, 3.45, 4.01, 5.67, 6.32, 7.88, 8.2]
fluxes = [0.11, 0.23, 0.45, 0.01, 0.67, 0.32, 0.88, 0.2]

# note again: `errors`, `bands`, `lightcurve_ids` and `arg_container` are optional input arguments.
errors = [0.1, 0.023, 0.045, 0.1, 0.067, 0.032, 0.8, 0.02]
bands = np.array(["r"] * len(fluxes))
lightcurve_ids = [1, 1, 1, 1, 1, 1, 1, 1]

arg_container = StructureFunctionArgumentContainer()

res = calc_sf2(times, fluxes, errors, bands, lightcurve_ids, argument_container=arg_container)

print(res)

In [None]:
# Same as before, but being more explicit about the arguments in the argument container

times = [1.11, 2.23, 3.45, 4.01, 5.67, 6.32, 7.88, 8.2]
fluxes = [0.11, 0.23, 0.45, 0.01, 0.67, 0.32, 0.88, 0.2]
errors = [0.1, 0.023, 0.045, 0.1, 0.067, 0.032, 0.8, 0.02]
bands = np.array(["r"] * len(fluxes))
lightcurve_ids = [1, 1, 1, 1, 1, 1, 1, 1]

# Note, not all arguments need to be provided, nor do they have to be set at the
# time the object is instantiated.
arg_container = StructureFunctionArgumentContainer(band_to_calc=None, combine=False)
arg_container.bins = None
arg_container.bin_method = "size"

res = calc_sf2(times, fluxes, errors, bands, lightcurve_ids, argument_container=arg_container)

print(res)

### Using `calc_sf2` to run Structure Function calculations on an Ensemble

Here we show how one would work with `calc_sf2` using an `Ensemble` object.
For more information about `Ensembles` see the example code in the ["Working with the tape Ensemble object"](https://tape.readthedocs.io/en/latest/notebooks/working_with_the_ensemble.html) notebook.

In [None]:
from tape.ensemble import Ensemble

ens = Ensemble()  # initialize an ensemble object

# Read in data from a parquet file
ens.from_parquet(
    "../../tests/tape_tests/data/source/test_source.parquet",
    id_col="ps1_objid",
    time_col="midPointTai",
    flux_col="psFlux",
    err_col="psFluxErr",
    band_col="filterName",
)

In [None]:
# Call batch on the ensemble providing no additional arguments.
res = ens.batch(calc_sf2, compute=True)
res

In [None]:
# Create a StructureFunctionArgumentContainer with lots of non-default argument values

arg_container = StructureFunctionArgumentContainer()
arg_container.band_to_calc = ["r"]
arg_container.combine = True
arg_container.bin_method = "loglength"
arg_container.bin_count_target = 40

res = ens.batch(calc_sf2, compute=True, argument_container=arg_container)
res

### Using `calc_sf2` to run Structure Function calculations on a TimeSeries

Here we show how to work with `calc_sf2` using a `TimeSeries` object. 
For more information about working with `TimeSeries` objects, see the example code in the ["Working with the tape Timeseries object"](https://tape.readthedocs.io/en/latest/notebooks/working_with_the_timeseries.html) notebook.

In [None]:
# Pull out a TimeSeries object from the previous Ensemble
ts = ens.to_timeseries(88472935274829959)  # provided a target object id

# Calculate the Structure Function for this lightcurve.
res = ts.sf2()
print(res)

In [None]:
# Create a StructureFunctionArgumentContainer with lots of non-default argument values

arg_container = StructureFunctionArgumentContainer()
arg_container.band_to_calc = ["r"]
arg_container.bin_method = "size"
arg_container.bin_count_target = 100
arg_container.equally_weight_lightcurves = True
arg_container.number_lightcurve_samples = 10000
arg_container.calculation_repetitions = 100
arg_container.report_upper_lower_error_separately = True

# Calculate the Structure Function for this lightcurve.
res = ts.sf2(argument_container=arg_container)
print(res)

## Implementing a custom Structure Function calculator

There may be times when the provided set of Structure Function calculator methods is insufficient for your work.
It is possible to extend the `StructureFunctionCalculator` base class to create your own calculator method while making use of foundational code.

The example below uses the `calculate` method from the `basic` structure function calculator method, and shows how the data is typically manipulated.
Note that a unique name must be provided in the `name_id` static method.
This is the name that will be provided when calling the `calc_sf2` function to determine which SF calculator method to use.

Note that this new SF calculator method must be registered before it is used.
We demonstrate how to register the new calculator method at the end of the next cell. 
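To see why a registration step can be necessary, consider a registry that is filled by scanning the subclasses of a base class. A class defined after the scan is not in the dictionary until the scan runs again. The sketch below illustrates that general pattern only; it is not TAPE's implementation, and `BaseCalculator`, `REGISTRY`, and `update_registry` are hypothetical names.

```python
class BaseCalculator:
    """Hypothetical base class standing in for a calculator base class."""

    @staticmethod
    def name_id() -> str:
        raise NotImplementedError


class BasicCalculator(BaseCalculator):
    @staticmethod
    def name_id() -> str:
        return "basic"


# Hypothetical registry: maps a unique method name to a calculator class.
REGISTRY = {}


def update_registry() -> None:
    """Add every direct subclass of BaseCalculator to REGISTRY, keyed by name."""
    for subclass in BaseCalculator.__subclasses__():
        REGISTRY[subclass.name_id()] = subclass


update_registry()
keys_before = sorted(REGISTRY)


# A subclass defined after the registry was last populated.
class ExperimentalCalculator(BaseCalculator):
    @staticmethod
    def name_id() -> str:
        return "experimental"


keys_stale = sorted(REGISTRY)  # the new class is not registered yet
update_registry()
keys_after = sorted(REGISTRY)  # now it is

print(keys_before, keys_stale, keys_after)
```

Keying the dictionary on `name_id()` is also why each calculator needs a unique name: two classes returning the same name would overwrite one another in a registry like this.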

In [None]:
# Creating a new subclass of `StructureFunctionCalculator` to implement a new SF calculator method
from tape.analysis.structure_function.base_calculator import StructureFunctionCalculator


class ExperimentalStructureFunctionCalculator(StructureFunctionCalculator):
    def calculate(self):
        # self._calculate_binned_statistics is provided by the parent class,
        # `StructureFunctionCalculator`, and will operate on time and flux
        # differences contained in the StructureFunctionLightCurve objects by
        # default.
        dts, sfs = self._calculate_binned_statistics()

        return dts, sfs

    @staticmethod
    def name_id() -> str:
        return "experimental"

    @staticmethod
    def expected_argument_container() -> type:
        return StructureFunctionArgumentContainer


# Register the new calculator method using `update_sf_subclasses`.
from tape.analysis.structure_function.calculator_registrar import update_sf_subclasses

print("Current list of Structure Function calculator methods")
print(SF_METHODS.keys())

# update the dictionary of structure functions to include the new subclass
update_sf_subclasses()

print("Updated list now includes the new 'experimental' method.")
print(SF_METHODS.keys())

We can now make use of the updated registry of Structure Function calculator methods as shown below.

In [None]:
# Create an instance of our new calculator method using the dictionary, and call the `calculate` method
experimental_calculator = SF_METHODS["experimental"]([], arg_container)

# given no inputs, we expect an empty output
experimental_calculator.calculate()

## Implementing a custom StructureFunctionArgumentContainer

The customization provided by the default `StructureFunctionArgumentContainer`
may need to be extended to support a bespoke `StructureFunctionCalculator` subclass.

Below we show the approach for implementing a new `StructureFunctionArgumentContainer`
as well as a mathematically trivial new `StructureFunctionCalculator` to show the
new container in action. 

First we'll create a new subclass of `StructureFunctionArgumentContainer`.

In [None]:
from dataclasses import dataclass


@dataclass
class HitchhikerStructureFunctionArgumentContainer(StructureFunctionArgumentContainer):
    # note, due to the way dataclass inheritance works, all parameters must have a default value!
    find_meaning_of_life: bool = False
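The default-value requirement comes from how Python dataclasses order inherited fields: parent fields come first, and a field without a default may not follow one that has a default. The sketch below demonstrates this with hypothetical classes (`BaseArguments`, `ExtendedArguments`, `BrokenArguments`) that stand in for the real containers; the field names `n_bins` and `verbose` are illustrative only.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseArguments:
    # Hypothetical parent container: every field has a default.
    n_bins: int = 10
    verbose: bool = False


@dataclass
class ExtendedArguments(BaseArguments):
    # A defaulted field appended after the parent's defaulted fields is fine.
    find_meaning_of_life: bool = False


try:

    @dataclass
    class BrokenArguments(BaseArguments):
        # No default: this field would follow the parent's defaulted fields.
        required_flag: bool

    error_message = None
except TypeError as err:
    # The dataclass decorator rejects the class when it is defined.
    error_message = str(err)

print([f.name for f in fields(ExtendedArguments)])
print(error_message)
```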

Next we'll create and register a new `StructureFunctionCalculator`.

In [None]:
from tape.analysis.structure_function.base_calculator import StructureFunctionCalculator


class HitchhikerStructureFunctionCalculator(StructureFunctionCalculator):
    def calculate(self):
        # self._calculate_binned_statistics is provided by the parent class,
        # `StructureFunctionCalculator`, and will operate on time and flux
        # differences contained in the StructureFunctionLightCurve objects by
        # default.
        dts, sfs = self._calculate_binned_statistics()

        if self._argument_container.find_meaning_of_life:
            dts = np.full_like(dts, 42.0)
            sfs = np.full_like(sfs, 42.0)

        return dts, sfs

    @staticmethod
    def name_id() -> str:
        return "hitchhiker"

    # Note that we've specified the newly created argument container
    @staticmethod
    def expected_argument_container() -> type:
        return HitchhikerStructureFunctionArgumentContainer


# register the new subclass and make sure we see "hitchhiker" in our list
update_sf_subclasses()
print(SF_METHODS.keys())

Finally, we'll make use of our new calculator by calling it with a default argument
set, and then also providing non-default arguments to an instance of the
`HitchhikerStructureFunctionArgumentContainer`.

#### All default arguments
Here we see that our new Structure Function calculator returns some reasonable values using
all the default arguments.

In [None]:
times = [1.11, 2.23, 3.45, 4.01, 5.67, 6.32, 7.88, 8.2]
fluxes = [0.11, 0.23, 0.45, 0.01, 0.67, 0.32, 0.88, 0.2]

# note that `errors`, `bands`, and `lightcurve_ids` are optional input arguments.
errors = [0.1, 0.023, 0.045, 0.1, 0.067, 0.032, 0.8, 0.02]
bands = np.array(["r"] * len(fluxes))
lightcurve_ids = [1, 1, 1, 1, 1, 1, 1, 1]

res = calc_sf2(times, fluxes, errors, bands, lightcurve_ids, sf_method="hitchhiker")

print(res)

#### Non-default arguments
Here we've requested that the Structure Function calculator find the meaning of
life by setting `arg_container.find_meaning_of_life = True`.

Note: We also have set the preferred structure function calculator to use with
`arg_container.sf_method = "hitchhiker"`.

In [None]:
times = [1.11, 2.23, 3.45, 4.01, 5.67, 6.32, 7.88, 8.2]
fluxes = [0.11, 0.23, 0.45, 0.01, 0.67, 0.32, 0.88, 0.2]

# note that `errors`, `bands`, and `lightcurve_ids` are optional input arguments.
errors = [0.1, 0.023, 0.045, 0.1, 0.067, 0.032, 0.8, 0.02]
bands = np.array(["r"] * len(fluxes))
lightcurve_ids = [1, 1, 1, 1, 1, 1, 1, 1]

arg_container_class = SF_METHODS["hitchhiker"].expected_argument_container()
arg_container = arg_container_class()

# If using an argument container, define the requested structure function here.
arg_container.sf_method = "hitchhiker"
arg_container.find_meaning_of_life = True

res = calc_sf2(times, fluxes, errors, bands, lightcurve_ids, argument_container=arg_container)

print(res)

## The `StructureFunctionLightCurve` object

After data is passed to the `calc_sf2` function (either directly or via `Ensemble` or `TimeSeries` objects)
and validated, a list of `StructureFunctionLightCurve` objects is created
and passed to the requested Structure Function calculator.

By default, when a `StructureFunctionLightCurve` object is instantiated, the time
and flux differences, as well as the sum of squared errors, are automatically
calculated.
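As a rough illustration of what these derived quantities look like, the NumPy sketch below builds all pairwise differences with broadcasting and keeps each pair once by selecting positive time differences. It is not the library's code: the input values are made up, and it assumes the "sum of squared errors" for a pair is `errors[m]**2 + errors[n]**2`.

```python
import numpy as np

# Made-up light curve with four observations at distinct times.
times = np.array([1.0, 2.5, 4.0, 7.0])
fluxes = np.array([0.1, 0.4, 0.2, 0.9])
errors = np.array([0.01, 0.02, 0.03, 0.04])

# Entry [i, j] of each matrix combines observation j with observation i.
dt_matrix = times[np.newaxis, :] - times[:, np.newaxis]
dflux_matrix = fluxes[np.newaxis, :] - fluxes[:, np.newaxis]
# Assumption: the per-pair error term is the sum of the two squared errors.
err2_matrix = errors[np.newaxis, :] ** 2 + errors[:, np.newaxis] ** 2

# Keep each pair exactly once by selecting positive time differences.
keep = dt_matrix > 0
d_times = dt_matrix[keep]
d_fluxes = dflux_matrix[keep]
sum_squared_errors = err2_matrix[keep]

n = times.size
print(d_times.size, n * (n - 1) // 2)
```

Because the same boolean mask is applied to all three matrices, element `k` of each output array refers to the same pair of observations.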

Users generally will not interact with `StructureFunctionLightCurve` objects outside
of the context of a `StructureFunctionCalculator` subclass, but for demonstration
we'll show how a light curve object is created, as well as how to access the primary
and derived data.

In [None]:
from tape.analysis.structure_function.sf_light_curve import StructureFunctionLightCurve

# note: `StructureFunctionLightCurve` expects numpy arrays as inputs
times = np.array([1.11, 2.23, 3.45, 4.01, 5.67, 6.32, 7.88, 8.2, np.nan])
fluxes = np.array([0.11, 0.23, 0.45, 0.01, 0.67, 0.32, 0.88, 0.2, np.nan])
errors = np.array([0.1, 0.023, 0.045, 0.1, 0.067, 0.032, 0.8, 0.02, np.nan])

# When instantiated, `nan` values will be removed and the resulting arrays
# are required to have the same length.
sf_lightcurve = StructureFunctionLightCurve(times=times, fluxes=fluxes, errors=errors)

# To retrieve the cleaned primary data, use the following 'protected' variables.
print("Cleaned input data")
print(type(sf_lightcurve._times), sf_lightcurve._times)
print(type(sf_lightcurve._fluxes), sf_lightcurve._fluxes)
print(type(sf_lightcurve._errors), sf_lightcurve._errors)

# Additionally, the pair-wise difference and summed errors will be automatically generated
# Notes:
# - Only pair-wise differences where time_1 - time_2 > 0 are retained.
# - These are also considered 'protected' variables.
expected_number_of_differences = len(sf_lightcurve._times) * (len(sf_lightcurve._times) - 1) / 2
print(f"\nExpected number of differences N * (N - 1) / 2: {expected_number_of_differences}")
print(f"Number of differences found: {len(sf_lightcurve._all_d_times)}\n")
print(f"All time differences:\n{sf_lightcurve._all_d_times}")
print(f"All flux differences:\n{sf_lightcurve._all_d_fluxes}")
print(f"All summed errors:\n{sf_lightcurve._all_sum_squared_error}")

The full set of differences is maintained in memory, and can be sampled repeatedly
without having to recalculate.

By default, the full set of differences is made available as public variables.
These variables will change when the user requests a new sub-sample using the
`select_difference_samples` method as shown below.

Note that the randomly selected values will not have the same order as the original,
but for any given index `i` in a sample_* array, it is guaranteed that the
corresponding indexes in the other sample_* arrays were produced from the same
pair of primary data values.

For instance if:
- `sf_lightcurve.sample_d_times[i] = sf_lightcurve._times[m] - sf_lightcurve._times[n]`

Then the following is guaranteed:
- `sf_lightcurve.sample_d_fluxes[i] = sf_lightcurve._fluxes[m] - sf_lightcurve._fluxes[n]`
- `sf_lightcurve.sample_sum_squared_error[i] = sf_lightcurve._errors[m]**2 + sf_lightcurve._errors[n]**2`
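
One simple way to obtain this kind of alignment guarantee is to draw a single set of random positions and index every array with it. The sketch below shows that idea with NumPy; it is an illustration under assumed names (`all_d_times`, `all_d_fluxes`, `all_sum_squared_error` are made-up arrays here), not a description of how `select_difference_samples` is implemented.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up, already-aligned difference arrays (28 pairs, as for 8 observations).
all_d_times = np.arange(1.0, 29.0)
all_d_fluxes = all_d_times * 10.0
all_sum_squared_error = all_d_times * 100.0

# Draw 12 distinct positions once, without replacement ...
idx = rng.choice(all_d_times.size, size=12, replace=False)

# ... and apply the same positions to every array so that entry i of each
# sample still refers to the same underlying pair of observations.
sample_d_times = all_d_times[idx]
sample_d_fluxes = all_d_fluxes[idx]
sample_sum_squared_error = all_sum_squared_error[idx]

print(sample_d_times)
print(sample_d_fluxes)
```

The full arrays are left untouched, so a new sample can be drawn at any time without recomputing the differences.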

In [None]:
# Print out the public variables without sub-sampling. The results are identical
# to the 'protected' variable values at this point.
print(f"All time differences:\n{sf_lightcurve.sample_d_times}")
print(f"All flux differences:\n{sf_lightcurve.sample_d_fluxes}")
print(f"All summed errors:\n{sf_lightcurve.sample_sum_squared_error}")

# Now we randomly select a sub-sample of the times/fluxes/errors without replacement
sf_lightcurve.select_difference_samples(12)

print("\nWe expect 12 values now")
print(f"All time differences:\n{sf_lightcurve.sample_d_times}")
print(f"All flux differences:\n{sf_lightcurve.sample_d_fluxes}")
print(f"All summed errors:\n{sf_lightcurve.sample_sum_squared_error}")

# Calling `select_difference_samples` again, will produce a new random sample
sf_lightcurve.select_difference_samples(12)

print("\nWe expect 12 different values now")
print(f"All time differences:\n{sf_lightcurve.sample_d_times}")
print(f"All flux differences:\n{sf_lightcurve.sample_d_fluxes}")
print(f"All summed errors:\n{sf_lightcurve.sample_sum_squared_error}")

## Accessing `StructureFunctionLightCurve` Data in a Custom Structure Function Calculator
If the standard difference values are not appropriate for a custom Structure Function,
you can access the primary data via the `self._lightcurves` array in your calculator
as demonstrated here.

In the following Structure Function calculator we'll use the sum of errors
instead of the sum of squared error in our contrived calculation.
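
For intuition about the binning step that such a calculator relies on, the sketch below averages a set of per-pair values within bins of time difference using plain NumPy. It is a generic illustration and not the implementation of `_calculate_binned_statistics`; the helper `binned_mean`, the bin edges, and the input values are all hypothetical.

```python
import numpy as np


def binned_mean(d_times, values, bin_edges):
    """Return bin centers and the mean of `values` in each time-difference bin."""
    counts, _ = np.histogram(d_times, bins=bin_edges)
    sums, _ = np.histogram(d_times, bins=bin_edges, weights=values)

    # Leave empty bins as NaN rather than dividing by zero.
    means = np.full(counts.shape, np.nan)
    filled = counts > 0
    means[filled] = sums[filled] / counts[filled]

    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    return centers, means


# Made-up pairwise time and flux differences.
d_times = np.array([0.5, 1.5, 1.8, 2.2, 2.9])
d_fluxes = np.array([0.1, -0.2, 0.4, 0.3, -0.5])
bin_edges = np.array([0.0, 1.0, 2.0, 3.0])

centers, mean_abs_d_flux = binned_mean(d_times, np.abs(d_fluxes), bin_edges)
print(centers)
print(mean_abs_d_flux)
```

Any per-pair quantity with the same length as `d_times` can be passed as `values`, which is the general idea behind supplying custom sample values to a binning routine.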

In [None]:
class SumErrorStructureFunctionCalculator(StructureFunctionCalculator):
    def calculate(self):
        # Here we gather data from public variables in the list of `StructureFunctionLightCurve` objects
        # and pass it to the `self._calculate_binned_statistics` method.
        values_to_be_binned = [np.abs(lc.sample_d_fluxes) for lc in self._lightcurves]
        dts, mean_d_flux_per_bin = self._calculate_binned_statistics(sample_values=values_to_be_binned)

        # Here we directly access the 'protected' original data in the lightcurves
        # and perform a simple operation on them, similar to what occurs in
        # `StructureFunctionLightCurve._calculate_differences`, except that we
        # do not square the error values.
        values_to_be_binned = [self.error_sum(lc._times, lc._errors) for lc in self._lightcurves]
        _, mean_err_per_bin = self._calculate_binned_statistics(sample_values=values_to_be_binned)

        # perform some calculation using the time-binned data
        sfs = np.abs(np.square(mean_d_flux_per_bin) - mean_err_per_bin)

        return dts, sfs

    def error_sum(self, times, errors):
        # Calculate all delta times, and summed errors.
        # Keep only error sums that correspond to time differences > 0
        dt_matrix = times.reshape((1, times.size)) - times.reshape((times.size, 1))

        err2_matrix = np.abs(errors.reshape((1, errors.size)) + errors.reshape((errors.size, 1)))
        return err2_matrix[dt_matrix > 0].flatten()

    @staticmethod
    def name_id() -> str:
        return "sum_error"

    @staticmethod
    def expected_argument_container() -> type:
        return StructureFunctionArgumentContainer


# register the new subclass
update_sf_subclasses()

Now that the new calculator has been registered using `update_sf_subclasses()`,
we can call `calc_sf2` and request the "sum_error" calculator as shown below.

The results are not meaningful, but the overall flow shows how we create a custom
`StructureFunctionCalculator` and access/manipulate data contained in the list of
`StructureFunctionLightCurve` objects, all of which is driven using the `calc_sf2` function.

In [None]:
times = [1.11, 2.23, 3.45, 4.01, 5.67, 6.32, 7.88, 8.2]
fluxes = [0.11, 0.23, 0.45, 0.01, 0.67, 0.32, 0.88, 0.2]
errors = [0.1, 0.023, 0.045, 0.1, 0.067, 0.032, 0.8, 0.02]
bands = np.array(["r"] * len(fluxes))

res = calc_sf2(times, fluxes, errors, bands, sf_method="sum_error")

print(res)