# Common Data Operations with `TAPE`

In this notebook, we'll highlight a handful of common dataframe operations that can be performed within `TAPE`. 

> **_Note:_**
`TAPE` extends the `Pandas`/`Dask` API, and so users familiar with those APIs can expect many operations to be near-identical when working with `TAPE`.

Let's consider a small example dataset of Stripe 82 RR Lyrae stars:

In [None]:
from tape import Ensemble

ens = Ensemble()

ens.from_dataset("s82_rrlyrae", sorted=True)

## Inspection

These functions provide views into the contents of your `Ensemble` dataframes, which is especially important when dealing with data volumes too large to bring into memory all at once.

### Lazy View of an `EnsembleFrame`

The most basic inspection method is to evaluate the `EnsembleFrame` (dataframe) objects themselves. This returns a lazy view of the `EnsembleFrame`: its structure is shown, but no data is loaded.

In [None]:
ens.object

In [None]:
ens.source

### Using `compute()` to view the data

When an `EnsembleFrame`'s contents are small enough to fit into memory, you can use `compute()` to view the actual data.

> **_Note:_**
`compute()` performs all of the loading, filtering, and analysis needed to produce the in-memory result, so it can take a long time!

In [None]:
ens.object.compute()

### Grab small in-memory views with `head()`

Often, you'll want to peek at your data even though the full dataset is too large for memory. `head()` brings only the first few rows into memory.

> **_Note:_**
Some partitions may be empty, and `head` will have to traverse these empty partitions to find enough rows for your result. A table with many partitions (on the order of 100k) can be costly to traverse, even when the result is ultimately empty.

In [None]:
ens.source.head(5)  # grabs the first 5 rows

# can also use tail to grab the last 5 rows

## Getting Individual Lightcurves

Several methods exist to access individual lightcurves within the `Ensemble`.

### Access using a known ID

If you'd like to access a particular lightcurve given an ID, you can use the `to_timeseries()` function. This allows you to supply a given object ID, and returns a `TimeSeries` object (see [working_with_the_timeseries](working_with_the_timeseries.ipynb)).

> **_Note:_**
This loads data from all available bands.

In [None]:
ts = ens.to_timeseries(13350)
ts.data

In [None]:
import matplotlib.pyplot as plt

for band in ts.data.band.unique():
    plt.errorbar(
        ts.data.loc[band]["mjd"],
        ts.data.loc[band]["flux"],
        yerr=ts.data.loc[band]["error"],
        fmt=".",
        label=band,
    )

plt.ylim(16, 20)
plt.legend()
plt.title(ts.meta["id"])

### Access a random lightcurve

Alternatively, if you aren't interested in a particular lightcurve, you can draw a random one from the `Ensemble` using `Ensemble.select_random_timeseries()`.

In [None]:
ens.select_random_timeseries(seed=1).data

## Filtering


### Queries
Queries mirror the `Pandas` implementation. Specifically, the function takes a string that provides an expression indicating which rows to **keep**.

In [None]:
# define a query to remove the top 5% of flux values
highest_flux = ens.source[ens._flux_col].quantile(0.95).compute()
ens.source.query(f"{ens._flux_col} < {highest_flux}").compute()
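Because queries follow the `Pandas` semantics, the keep-the-matching-rows behavior can be checked on a small in-memory frame. The sketch below uses plain `pandas` rather than `TAPE`; the `toy_source` frame and its values are invented for illustration and are not drawn from the Stripe 82 dataset.

```python
import pandas as pd

# Toy stand-in for a source table (values are invented for illustration)
toy_source = pd.DataFrame({"flux": [15.2, 16.8, 17.1, 18.4, 25.0]})

# Threshold at the 95th percentile of the flux column
flux_cut = float(toy_source["flux"].quantile(0.95))

# query() keeps the rows for which the expression evaluates to True
kept = toy_source.query(f"flux < {flux_cut}")
```

Only the single brightest row lies above the cut here, so `kept` retains the other four rows.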


> **_Note:_**
When filtering, or doing any operation that modifies a dataframe, the result is a new dataframe that does not automatically update the `Ensemble`. If you'd like to update the `Ensemble` with the result of any of the following operations, be sure to add `.update_ensemble()` to the end of the call.

### Filtering by Number of Observations

Filters based on the number of observations are supported more directly within the `TAPE` API. First, use the dedicated function `Ensemble.calc_nobs()` to calculate the number of observations per lightcurve:

In [None]:
ens.calc_nobs(by_band=True, temporary=False)

ens.object.head(5)[["nobs_u", "nobs_g", "nobs_r", "nobs_i", "nobs_z", "nobs_total"]]

You can then query on these columns as normal.

In [None]:
ens.object.query("nobs_total > 322")[["nobs_u", "nobs_g", "nobs_r", "nobs_i", "nobs_z", "nobs_total"]].head(5)

Alternatively, if you'd like to just quickly filter by the number of total observations, you can use `Ensemble.prune()`.

In [None]:
ens.prune(322)  # equivalent to the above
ens.object[["nobs_total"]].head(5)
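To make the per-band and total observation counts used in this section concrete, here is a conceptual sketch in plain `pandas`. It is not how `TAPE` implements `calc_nobs()`; the `toy_source` table, its `id` and `band` columns, and the threshold of 1 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy source table: one row per observation (invented values)
toy_source = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 3],
        "band": ["g", "g", "r", "g", "r", "r"],
    }
)

# Count observations per object and band, with one column per band
nobs = toy_source.groupby(["id", "band"]).size().unstack(fill_value=0)
nobs = nobs.add_prefix("nobs_")

# Total number of observations per object across all bands
nobs["nobs_total"] = nobs.sum(axis=1)

# Keep only the objects with more than one observation in total
well_sampled = nobs.query("nobs_total > 1")
```

The resulting `nobs` frame has one row per object, which is why the counts above are stored on the object table rather than the source table.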

### Removing NaNs

Removing rows with NaN values follows the `Pandas` API, using `dropna()`:

In [None]:
# Remove any rows with a NaN value in any of the specified columns
ens.source.dropna(subset=["flux", "mjd", "error", "band"]).update_ensemble()
ens.source
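The effect of the `subset` argument is easiest to see on a small in-memory frame. This is a plain `pandas` sketch with invented values; the `note` column is a hypothetical extra column included only to show that missing values outside `subset` are ignored.

```python
import numpy as np
import pandas as pd

# Toy table with missing values in several columns (invented values)
toy_source = pd.DataFrame(
    {
        "flux": [16.1, np.nan, 17.3, 18.0],
        "error": [0.1, 0.2, np.nan, 0.1],
        "note": [None, "ok", "ok", None],  # not listed in subset below
    }
)

# A row is dropped only if a NaN appears in one of the subset columns
cleaned = toy_source.dropna(subset=["flux", "error"])
```

Rows 0 and 3 survive even though their `note` value is missing, because `note` is not part of `subset`.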

## Analysis

### Applying Functions with `Ensemble.batch()`

The `Ensemble` provides a powerful batching interface, `Ensemble.batch()`, with in-built parallelization (provided the input data is in multiple partitions).

In [None]:
import numpy as np


# Defining a simple function
def my_flux_average(flux_array, band_array, method="mean", band=None):
    """Return the mean or median flux in a single band, or None if no band is given"""
    if band is None:
        return None
    band_flux = flux_array[band_array == band]  # Keep only the fluxes observed in this band
    if method == "mean":
        return np.mean(band_flux)
    if method == "median":
        return np.median(band_flux)
    raise ValueError(f"Unknown method: {method}")  # Avoid returning an undefined result

With the function defined, we next supply it to `Ensemble.batch()`. The column labels of the `Ensemble` columns we want to use as arguments must be provided, as well as any keyword arguments. In this case, we pass along `"flux"` and `"band"`, so that the `Ensemble` will map those columns to `flux_array` and `band_array` respectively. We also pass `method='median'` and `band='g'`, which will pass those kwargs along to `my_flux_average`.

In [None]:
# Applying the function to the ensemble
res = ens.batch(my_flux_average, "flux", "band", meta=None, method="median", band="g")
res.compute()

`Ensemble.batch()` supports many different variations of custom user functions, and additionally has a small suite of tailored analysis functions designed for it. For more details on batch, see the [batch showcase](batch_showcase.ipynb).
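Conceptually, a batch call evaluates the user function once per lightcurve, handing it the requested columns as arrays. The sketch below imitates that idea with a plain `pandas` group-by loop. It is an illustration under stated assumptions, not `TAPE`'s implementation (which distributes the work across partitions); `band_median`, `toy_source`, and the `id` column are hypothetical names.

```python
import numpy as np
import pandas as pd


def band_median(flux_array, band_array, band):
    """Median flux of the observations taken in the given band."""
    return np.median(flux_array[band_array == band])


# Toy source table holding two lightcurves (invented values)
toy_source = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2],
        "band": ["g", "g", "r", "g", "r", "g"],
        "flux": [16.0, 18.0, 20.0, 15.0, 19.0, 17.0],
    }
)

# Call the function once per object, passing the chosen columns as arrays
result = pd.Series(
    {
        obj_id: band_median(lc["flux"].to_numpy(), lc["band"].to_numpy(), band="g")
        for obj_id, lc in toy_source.groupby("id")
    },
    name="result",
)
```

The output has one value per object ID, which is what makes it natural to attach a batch result to the object table.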

### Column Assignment

The ensemble object supports assignment through the `Pandas` `assign` function. We can pass in either a callable or a series to assign to the new column. New column names are produced automatically from the argument name.

For example, if we want to compute the lower bound of an error range as the estimated flux minus twice the estimated error, we would use:

In [None]:
lower_bnd = ens.source.assign(lower_bnd=lambda x: x["flux"] - 2.0 * x["error"])
lower_bnd.head(5)

We can also assign our computed batch result as a new object column using the same methodology.

In [None]:
ens.object.assign(g_average=res["result"])[["ra", "dec", "g_average"]].head(5)
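The two forms of `assign` used above (a callable and a precomputed series) can be compared on a small `pandas` frame. The values and the `upper_bnd` column are invented for illustration.

```python
import pandas as pd

toy_source = pd.DataFrame({"flux": [16.0, 17.5], "error": [0.5, 0.25]})

# Callable: evaluated against the frame it is being assigned to
with_callable = toy_source.assign(lower_bnd=lambda x: x["flux"] - 2.0 * x["error"])

# Series: computed up front, then aligned on the index during assignment
upper = toy_source["flux"] + 2.0 * toy_source["error"]
with_series = toy_source.assign(upper_bnd=upper)
```

In both cases `assign` returns a new frame and leaves `toy_source` unchanged, and the new column takes its name from the keyword argument.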

## Dask Tips


### Using `persist()` to Save Computation Time

When calling `compute()`, all work needed to produce the in-memory result is performed. This work is repeated each time `compute()` is called, which can duplicate a lot of computation, especially in exploratory notebooks where you're testing different workflows. In such cases, it can be advantageous to call `persist()`.

`persist()` returns a lazy view of a result, but actively begins computing that result behind the scenes and keeps it in memory, so successive calls simply reuse the persisted result rather than recomputing it. Because the result is held in memory, `persist()` should only be used when your data can fit into memory.

In [None]:
ens.source.persist()  # persist performs all queued data loading tasks
ens.source.compute()  # which allows compute to just pull the result immediately.

### Repartitioning

With `Dask` and `TAPE`, data is stored in separate sub-containers called "partitions". [`Dask` has recommendations](https://docs.dask.org/en/stable/best-practices.html#dask-best-practices) for the optimal amount of data stored in a given partition. Even if the initial data follows these recommendations, filtering steps can leave partitions containing very little data. In this case, it may be best to call `repartition()`.

In [None]:
ens.source.repartition(partition_size="100MB")  # partitions of about 100MB are generally recommended
# In this case, we have a small set of data that easily fits into one partition

### Sampling


In addition to filtering by specific constraints, it's possible to select a subset of your data to work with. `Ensemble.sample()` will randomly select a fraction of objects from the full object list. This will return a new ensemble object to work with.

In [None]:
subset_ens = ens.sample(frac=0.5)  # select ~half of the objects

print("Number of pre-sampled objects: ", len(ens.object))
print("Number of post-sampled objects: ", len(subset_ens.object))

For reproducible results, you can also specify a random seed via the `random_state` parameter. By re-using the same seed in your `random_state`, you can ensure that a given `Ensemble` will always be sampled the same way.

In [None]:
subset_ens = ens.sample(
    frac=0.2,  # select a ~fifth of the objects
    random_state=53783594,  # set a random seed for reproducibility
)

print("Number of pre-sampled objects: ", len(ens.object))
print("Number of post-sampled objects: ", len(subset_ens.object))
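The role of `random_state` can be demonstrated with the analogous `pandas` call on a toy object table. This is only a sketch: the frame is invented, and plain `pandas` returns an exact row count for a given `frac`, whereas the counts printed above for `Ensemble.sample()` are approximate.

```python
import pandas as pd

# Toy object table with 100 objects (invented values)
toy_object = pd.DataFrame({"nobs_total": range(100)})

# The same seed yields the same selection of rows
first = toy_object.sample(frac=0.2, random_state=42)
second = toy_object.sample(frac=0.2, random_state=42)

same_selection = first.index.equals(second.index)
```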

> **_Note:_**
Using `Ensemble.sample` to filter large datasets is not recommended, as it does not handle repartitioning. Instead, use partition slicing, as shown below.

In [None]:
# partition slicing

# specify a subset of partitions, propagates to the object table automatically
ens.source.partitions[0:1].update_ensemble()

### Saving Intermediate Results

In some situations, you may find yourself running a given workflow many times. Due to the nature of lazy computation, this will involve repeated execution of data I/O, pre-processing steps, initial analysis, etc. In these situations, it may be more efficient to save the ensemble state to disk once these initial processing steps are complete. To accomplish this, we can use the `Ensemble.save_ensemble()` function.

In [None]:
ens.object.head(5)

In [None]:
ens.save_ensemble(".", "ensemble", additional_frames=False)  # Saves to disk

The above command creates an "ensemble" directory in the current working directory. This directory contains a subdirectory of parquet files for each saved `EnsembleFrame` object. The object (unless it has no columns) and source frames are always saved, while the `additional_frames` kwarg controls which other `EnsembleFrame` objects are saved with them. Setting `additional_frames` to `True` or `False` saves all or none of the additional `EnsembleFrame` objects respectively, so the call above saves only the object and source frames.

From here, we can just load the ensemble from disk.

In [None]:
new_ens = Ensemble()
new_ens.from_ensemble("./ensemble")