# Common Data Operations with `TAPE`

In this notebook, we'll highlight a handful of common dataframe operations that can be performed within `TAPE`. 

> **_Note:_**
TAPE extends the `Pandas`/`Dask` API, and so users familiar with those APIs can expect many operations to be near-identical when working with `TAPE`.

Let's consider a small example dataset of Stripe 82 RR Lyrae stars:

In [None]:
from tape import Ensemble

ens = Ensemble()

ens.from_dataset("s82_rrlyrae")

The call to `Ensemble.from_dataset()` above loads the example data into the `Ensemble`.

## Inspection

These functions provide views into the contents of your `Ensemble` dataframes, which is especially important when dealing with data volumes too large to bring into memory all at once.

### Lazy View of an `EnsembleFrame`

The most basic inspection method is to display the `EnsembleFrame` (dataframe) objects themselves. This returns a lazy view of the `EnsembleFrame`: the structure is shown, but no data is loaded.

In [None]:
ens.object

In [None]:
ens.source

### Using `compute()` to view the data

When the contents of an `EnsembleFrame` are small enough to fit into memory, you can use `compute()` to view the actual data.

> **_Note:_**
`compute()` does more than display the data: it runs any loading, filtering, and analysis needed to produce the in-memory result. As such, it can take a long time!

In [None]:
ens.object.compute()

### Grab small in-memory views with `head()`

Often, you'll want to peek at your data even though the full dataset is too large for memory.

> **_Note:_**
By default this only looks at the first partition of data, so any operations that remove all data from the first partition will produce an empty head result. Specify `npartitions=-1` to grab from all partitions.


In [None]:
ens.source.head(5, npartitions=-1) # grabs the first 5 rows 

# can also use tail to grab the last 5 rows
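
To illustrate why a filter can leave `head()` empty, here is a minimal pandas-only sketch that mimics partitions with a plain list of DataFrames. The helper `head_across` and the toy data are hypothetical, introduced only for illustration; this is not how `Dask` or `TAPE` implement `head`.

```python
import pandas as pd

# Hypothetical stand-in for a partitioned frame: a list of pandas DataFrames.
partitions = [
    pd.DataFrame({"id": [1, 1, 2], "flux": [0.5, 0.7, 0.6]}),
    pd.DataFrame({"id": [3, 3, 4], "flux": [5.1, 4.8, 5.5]}),
]

# A filter that happens to remove every row of the first partition.
filtered = [part[part["flux"] > 1.0] for part in partitions]


def head_across(parts, n, npartitions=1):
    """Return the first n rows, reading only the first `npartitions` partitions (-1 reads all)."""
    chosen = parts if npartitions == -1 else parts[:npartitions]
    return pd.concat(chosen).head(n)


first_only = head_across(filtered, 2)                  # reads only the (now empty) first partition
all_parts = head_across(filtered, 2, npartitions=-1)   # reads every partition

print(len(first_only), len(all_parts))
```

Reading only the first partition is cheap, which is presumably why it is the default; the trade-off is that an aggressive filter can make the result look empty even when later partitions still hold data.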

## Getting Individual Lightcurves

Several methods exist to access individual lightcurves within the `Ensemble`.

### Access using a known ID

If you'd like to access a particular lightcurve given an ID, you can use the `to_timeseries` function. This allows you to supply a given object ID, and returns a `TimeSeries` object (see <working_with_the_timeseries>).

In [None]:
ts = ens.to_timeseries(13350)
ts.data

In [None]:
import matplotlib.pyplot as plt

for band in ts.data.band.unique():
    plt.errorbar(ts.data.loc[band]["mjd"], 
                 ts.data.loc[band]["flux"], 
                 yerr=ts.data.loc[band]["error"],
                 fmt=".", 
                 label=band)

plt.ylim(16,20)
plt.legend()
plt.title(ts.meta["id"])

### Access a random lightcurve

Alternatively, if you aren't interested in a particular lightcurve, you can draw a random one from the `Ensemble` using `Ensemble.select_random_timeseries`.

In [None]:
ens.select_random_timeseries(seed=1).data

## Filtering


### Queries
Queries mirror the `Pandas` implementation. Specifically, the function takes a string that provides an expression indicating which rows to **keep**.

In [None]:
# define a query to remove the top 5% of flux values
highest_flux = ens.source[ens._flux_col].quantile(0.95).compute()
ens.source.query(f"{ens._flux_col} < {highest_flux}").compute()
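
Since queries mirror `Pandas`, the same pattern can be tried on a small in-memory table without an `Ensemble`. The sketch below uses plain pandas; the toy data and the `flux` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy source table; the column names are assumptions chosen to resemble this notebook.
source = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "flux": [10.0, 12.0, 50.0, 11.0, 9.0],
})

# Flux value below which 95% of the rows fall.
threshold = source["flux"].quantile(0.95)

# query() keeps the rows for which the expression evaluates to True.
kept = source.query(f"flux < {threshold}")

print(kept)
```

Only the row with the outlying flux of 50.0 is dropped, because it is the only one at or above the 95th-percentile threshold.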


> **_Note:_**
When filtering, or doing any operation that modifies a dataframe, the result is a new dataframe that does not automatically update the `Ensemble`. If you'd like to update the `Ensemble` with the result of any of the following operations, be sure to add `.update_ensemble()` to the end of the call.

### Filtering by Number of Observations

Filtering on the number of observations is supported more directly by the TAPE API. First, use the dedicated function `Ensemble.calc_nobs()` to calculate the number of observations per lightcurve:

In [None]:
ens.calc_nobs(by_band=True)

ens.object[['nobs_u','nobs_g','nobs_r','nobs_i','nobs_z','nobs_total']].head(5)

You can then query on these columns as normal.

In [None]:
ens.object.query("nobs_total > 322")[['nobs_u','nobs_g','nobs_r','nobs_i','nobs_z','nobs_total']].head(5)

Alternatively, if you'd like to just quickly filter by the number of total observations, you can use `Ensemble.prune()`.

In [None]:
ens.prune(322) # equivalent to the above
ens.object[["nobs_total"]].head(5)
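
Conceptually, per-band observation counts amount to a group-by-and-count over the source table. The following pandas sketch shows that idea on toy data; it is an illustration under assumed column names, not the implementation behind `Ensemble.calc_nobs()` or `Ensemble.prune()`, and the inclusive cut is a choice made here rather than a statement about `prune`.

```python
import pandas as pd

# Toy source table: one row per observation (assumed column names).
source = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3],
    "band": ["g", "g", "r", "g", "r", "r"],
    "flux": [1.0, 1.1, 0.9, 2.0, 2.1, 3.0],
})

# Count rows per (id, band), then pivot the bands into columns.
nobs = source.groupby(["id", "band"]).size().unstack(fill_value=0)
nobs.columns = [f"nobs_{band}" for band in nobs.columns]
nobs["nobs_total"] = nobs.sum(axis=1)

# Keep only objects with at least `threshold` observations (a hypothetical cut).
threshold = 2
pruned = nobs[nobs["nobs_total"] >= threshold]

print(pruned)
```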

### Removing NaNs

Removing rows with NaN values follows the `Pandas` API, using `dropna()`:

In [None]:
# Remove any rows with a NaN value in any of the specified columns
ens.source.dropna(subset=["flux", "mjd", "error", "band"]).update_ensemble()
ens.source
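
The effect of `subset` is easiest to see on a small pandas frame: only the listed columns are checked, so missing values elsewhere do not cause a row to be dropped. The toy data and the `note` column below are hypothetical.

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({
    "mjd":  [1.0, 2.0, np.nan, 4.0],
    "flux": [10.0, np.nan, 12.0, 13.0],
    "note": [None, "ok", "ok", None],  # hypothetical column, not part of the subset
})

# Drop a row only if "mjd" or "flux" is NaN; "note" is ignored.
cleaned = source.dropna(subset=["mjd", "flux"])

print(cleaned)
```

Rows 1 and 2 are removed because of the NaN in `flux` and `mjd` respectively, while rows 0 and 3 survive despite their missing `note`.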

## Analysis

### Applying Functions with Batch

The `Ensemble` provides a powerful batching interface, `Ensemble.batch`, with in-built parallelization (provided the input data is in multiple partitions).

In [None]:
import numpy as np

# Defining a simple function
def my_flux_average(flux_array, band_array, method="mean", band=None):
    """Return the average flux in `band`, or None if no band is given."""
    if band is None:
        return None
    # Keep only the fluxes observed in the requested band
    band_flux = flux_array[band_array == band]
    if method == "mean":
        return np.mean(band_flux)
    if method == "median":
        return np.median(band_flux)
    # Fail loudly instead of returning an undefined result
    raise ValueError(f"Unknown method: {method!r}")

With the function defined, we next supply it to `Ensemble.batch`. The labels of the `Ensemble` columns we want to use as arguments must be provided, along with any keyword arguments. In this case, we pass `"flux"` and `"band"`, so that the `Ensemble` maps those columns to `flux_array` and `band_array` respectively. We also pass `method="median"` and `band="g"`, which are forwarded as keyword arguments to `my_flux_average`.

In [None]:
# Applying the function to the ensemble
res = ens.batch(my_flux_average, "flux", "band", meta=None, method="median", band="g")
res.compute()

`Ensemble.batch()` supports many different variations of custom user functions, and additionally has a small suite of tailored analysis functions designed for it. For more details on batch, see the <batch_showcase>.
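
As a rough mental model, a batch call applies the function once per lightcurve, handing it the requested columns as arrays. The pandas sketch below imitates that on toy data with a plain group-by loop; it is an assumption-laden illustration (no parallelization, no `meta` handling) and not the mechanism `Ensemble.batch` uses internally. The function `band_median` is a hypothetical stand-in.

```python
import numpy as np
import pandas as pd


def band_median(flux_array, band_array, band):
    """Median flux of the observations taken in `band`."""
    return np.median(flux_array[band_array == band])


# Toy source table: one row per observation (assumed column names).
source = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2],
    "band": ["g", "g", "r", "g", "r", "g"],
    "flux": [1.0, 3.0, 9.0, 2.0, 8.0, 6.0],
})

# One function call per lightcurve: group by object ID and pass
# the requested columns to the function as NumPy arrays.
result = pd.Series({
    obj_id: band_median(group["flux"].to_numpy(), group["band"].to_numpy(), band="g")
    for obj_id, group in source.groupby("id")
})

print(result)
```

The result holds one value per object ID, which matches the one-row-per-lightcurve shape you would expect from a function that reduces each lightcurve to a single number.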

## Other Useful Functions


### Using `persist()` to save computation time

### Repartitioning

### Sampling

### Saving Intermediate Results