# Tansu Nested-Dask Demo

This basic demo takes an [outlier detection method provided by Tansu Daylan](https://github.com/tdaylan/miletos/blob/71f7d18542f3ef82808eba588bad6361c8297351/miletos/main.py#L5192) and applies it to a subset of a HiPSCat-formatted ZTF catalog.

# Load large catalog data with LSDB

Here we load a small part of ZTF DR14, stored as a HiPSCat catalog, using [LSDB](https://lsdb.readthedocs.io/).

The notebook is an adaptation of [the tutorial](https://github.com/lincc-frameworks/Rare_Gems_Demo/blob/main/Notebook_2_Basic_Time_Domain.ipynb) presented by Neven Caplar at the Rare Gems in Big Data conference, May 2024. 



## Install dependencies for the notebook

The notebook requires `nested-dask` and a few other packages:
- `lsdb` to load and join the "object" (pointing) and "source" (detection) ZTF catalogs
- `aiohttp`, an optional `lsdb` dependency, to download the data over the web
- `light-curve` to extract features from light curves
- `matplotlib` to plot the results

In [1]:
# Uncomment the following line to install nested-dask
# %pip install nested-dask

# Uncomment the following line to install the other dependencies
# %pip install --quiet lsdb matplotlib

In [2]:
import dask.array
import dask.distributed
import matplotlib.pyplot as plt
import nested_pandas as npd
import numpy as np
import pandas as pd
from dask_expr import from_legacy_dataframe
from lsdb import read_hipscat
from matplotlib.colors import LogNorm
from nested_dask import NestedFrame

## Load ZTF DR14
For demonstration purposes we use a light version of the ZTF DR14 catalog distributed by LINCC Frameworks: a half-degree circle around RA=180, Dec=10.
We load the data over HTTPS as two LSDB catalogs: objects (metadata catalog) and sources (light curve catalog).

In [3]:
catalogs_dir = "https://epyc.astro.washington.edu/~lincc-frameworks/half_degree_surveys/ztf"

lsdb_object = read_hipscat(
    f"{catalogs_dir}/ztf_object",
    columns=["ra", "dec", "ps1_objid"],
)
lsdb_source = read_hipscat(
    f"{catalogs_dir}/ztf_source",
    columns=["mjd", "mag", "magerr", "band", "ps1_objid", "catflags"],
)
lc_columns = ["mjd", "mag", "magerr", "band", "catflags"]

We need to merge these two catalogs to get the light curve data.
This is done with LSDB's `.join()` method, which gives us a new catalog with all the columns from both catalogs.

In [4]:
# We can ignore the warning here: for this particular case we don't need a margin cache
lsdb_joined = lsdb_object.join(
    lsdb_source,
    left_on="ps1_objid",
    right_on="ps1_objid",
    suffixes=("", ""),
)
joined_ddf = lsdb_joined._ddf
joined_ddf



| npartitions=4 | ra | dec | ps1_objid | mjd | mag | magerr | band | catflags |
|---|---|---|---|---|---|---|---|---|
| 7799953079628988416 | float64 | float64 | int64 | float64 | float32 | float32 | string | int16 |
| 7812056503627546624 | ... | ... | ... | ... | ... | ... | ... | ... |
| 7824159927626104832 | ... | ... | ... | ... | ... | ... | ... | ... |
| 7836263351624663040 | ... | ... | ... | ... | ... | ... | ... | ... |
| 18446744073709551615 | ... | ... | ... | ... | ... | ... | ... | ... |
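To illustrate why the joined table repeats the object columns on every detection row, here is a minimal plain-pandas analogue. It is only a sketch: the tiny tables and their values are made up, and `DataFrame.merge` stands in for the LSDB join rather than reproducing how LSDB implements it.

```python
import pandas as pd

# Toy "object" table: one row per object (values are made up).
objects = pd.DataFrame(
    {"ps1_objid": [1, 2], "ra": [180.0, 180.1], "dec": [10.0, 10.2]}
)

# Toy "source" table: one row per detection (values are made up).
sources = pd.DataFrame(
    {
        "ps1_objid": [1, 1, 1, 2, 2],
        "mjd": [58000.1, 58001.1, 58002.1, 58000.2, 58003.2],
        "mag": [18.1, 18.2, 18.0, 19.5, 19.4],
    }
)

# Joining on the shared key repeats each object's columns
# once per matching detection.
joined = objects.merge(sources, on="ps1_objid", how="inner")
print(joined)
```

Object 1 appears on three rows and object 2 on two, each time carrying the same `ra` and `dec`. This redundancy is what the nested representation in the next section removes.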


## Convert LSDB joined catalog to `nested_dask.NestedFrame`

First, we plan the computation to convert the joined Dask DataFrame to a NestedFrame.

In [5]:
def convert_to_nested_frame(df: pd.DataFrame, nested_columns: list[str]):
    other_columns = [col for col in df.columns if col not in nested_columns]

    # Since object rows are repeated, we just drop duplicates
    object_df = df[other_columns].groupby(level=0).first()
    nested_frame = npd.NestedFrame(object_df)

    source_df = df[nested_columns]
    # lc is for light curve
    return nested_frame.add_nested(source_df, "lc")


ddf = joined_ddf.map_partitions(
    lambda df: convert_to_nested_frame(df, nested_columns=lc_columns),
    meta=convert_to_nested_frame(joined_ddf._meta, nested_columns=lc_columns),
)
nested_ddf = NestedFrame.from_dask_dataframe(from_legacy_dataframe(ddf))
nested_ddf

| npartitions=4 | ra | dec | ps1_objid | lc |
|---|---|---|---|---|
| 7799953079628988416 | float64 | float64 | int64 | `nested<mjd: [double], mag: [float], magerr: [float], band: [large_string], catflags: [int16]>` |
| 7812056503627546624 | ... | ... | ... | ... |
| 7824159927626104832 | ... | ... | ... | ... |
| 7836263351624663040 | ... | ... | ... | ... |
| 18446744073709551615 | ... | ... | ... | ... |
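The conversion above can be pictured with plain pandas. The sketch below is an illustration under stated assumptions, not the `nested_pandas` implementation: the data are made up, and per-object Python lists stand in for the packed `lc` column.

```python
import pandas as pd

# Toy joined partition: the index plays the role of `_hipscat_index`,
# and the object columns (ra, dec) repeat on every detection row.
flat = pd.DataFrame(
    {
        "ra": [180.0, 180.0, 180.0, 180.1, 180.1],
        "dec": [10.0, 10.0, 10.0, 10.2, 10.2],
        "mjd": [58000.1, 58001.1, 58002.1, 58000.2, 58003.2],
        "mag": [18.1, 18.2, 18.0, 19.5, 19.4],
    },
    index=pd.Index([100, 100, 100, 200, 200], name="_hipscat_index"),
)
nested_columns = ["mjd", "mag"]
other_columns = [c for c in flat.columns if c not in nested_columns]

# One row per object: keep the first of the repeated object rows.
object_df = flat[other_columns].groupby(level=0).first()

# One list per object and light-curve column: a plain-pandas stand-in
# for the packed "lc" column.
lc_lists = flat[nested_columns].groupby(level=0).agg(list)

packed = object_df.join(lc_lists)
print(packed)
```

The result has one row per object, with the detections gathered next to the object's metadata. The real `NestedFrame` stores the light curve in a single typed nested column instead of Python lists.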


In [6]:
nested_ddf.index.compute()

Index([7800225727302860800, 7800225745451614208, 7800225756323250176,
       7800231075577331712, 7800231096259444736, 7800231096368496640,
       7800231110515884032, 7800231189779841024, 7800231189779841025,
       7800231193001066496,
       ...
       7836515719817723904, 7836515729640783872, 7836516281506332672,
       7836516286824710144, 7836516293317492736, 7836516302796619776,
       7836516325198397440, 7836516331062034432, 7836516334828519424,
       7836516348237709312],
      dtype='uint64', name='_hipscat_index', length=9817)

Now we filter our dataframe on the `catflags` column (a value of 0 corresponds to perfect observing conditions) and keep only detections whose `band` equals `r`.
After filtering the detections, we count the number of detections per object and keep only those objects with more than 100 detections.

In [7]:
r_band = nested_ddf.query("lc.catflags == 0 and lc.band == 'r'")
nobs = r_band.reduce(np.size, "lc.mjd", meta={0: int}).rename(columns={0: "nobs"})
r_band = r_band[nobs["nobs"] > 100]
r_band

| npartitions=4 | ra | dec | ps1_objid | lc |
|---|---|---|---|---|
| 7799953079628988416 | float64 | float64 | int64 | `nested<mjd: [double], mag: [float], magerr: [float], band: [large_string], catflags: [int16]>` |
| 7812056503627546624 | ... | ... | ... | ... |
| 7824159927626104832 | ... | ... | ... | ... |
| 7836263351624663040 | ... | ... | ... | ... |
| 18446744073709551615 | ... | ... | ... | ... |
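The filter-then-count logic can be checked on a toy flat table with plain pandas. This is a sketch with made-up data, and `min_nobs` is a hypothetical threshold chosen small enough for the toy example. The notebook itself performs these steps on the nested `lc` column.

```python
import pandas as pd

# Toy detections; the index plays the role of the object index.
flat = pd.DataFrame(
    {
        "band": ["r", "r", "g", "r", "r", "r"],
        "catflags": [0, 0, 0, 1, 0, 0],
        "mjd": [58000.1, 58001.1, 58002.1, 58000.2, 58003.2, 58004.3],
    },
    index=pd.Index([100, 100, 100, 200, 200, 300], name="_hipscat_index"),
)

min_nobs = 1  # hypothetical threshold for this toy data

# Step 1: keep only clean r-band detections.
good = flat.query("catflags == 0 and band == 'r'")

# Step 2: count the surviving detections per object.
nobs = good.groupby(level=0).size()

# Step 3: keep only objects with more than `min_nobs` detections.
keep = nobs[nobs > min_nobs].index
filtered = good[good.index.isin(keep)]
print(nobs)
print(filtered)
```

Note that the count is taken after the detection-level filter, so an object with many flagged or non-r-band detections can still be dropped.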


Later we are going to extract features, so we need to prepare light-curve data to be in the same float format.

### Extract outliers from ZTF light curves

Here is an example function, courtesy of Tansu Daylan, that searches for periodic outliers.

In [8]:
def srch_outlperi(time,  # time of samples
                  flux,  # relative flux of samples
                  stdvflux,  # relative flux error of samples (not used in the body below)
                  n_outliers=5,  # number of outliers to include in the search
                  verbose=True,  # Boolean flag to diagnose (not used in the body below)
                 ):
    """
    Search for periodic outliers in a computationally efficient way
    """
    
    # indices of the outliers
    indxtimesort = np.argsort(flux)[::-1][:n_outliers]
    
    # the times of the outliers
    timeoutl = time[indxtimesort]
    
    # number of differences between times of outlier samples
    numbdiff = int(n_outliers * (n_outliers - 1) / 2)
    
    # differences between times of outlier samples
    difftimeoutl = np.empty(numbdiff)
    
    # compute the differences between times of outlier samples
    listtemp = []
    c = 0
    indxoutl = np.arange(n_outliers)
    for a in indxoutl:
        for b in indxoutl:
            if a >= b:
                continue
            listtemp.append([a, b])
            difftimeoutl[c] = abs(timeoutl[a] - timeoutl[b])
            c += 1
    
    # indices that sort the differences between times of outlier samples
    indxsort = np.argsort(difftimeoutl)
    
    # sorted differences between times of outlier samples
    difftimeoutlsort = difftimeoutl[indxsort]

    # fractional differences between differences of times of outlier samples
    frddtimeoutlsort = (difftimeoutlsort[1:] - difftimeoutlsort[:-1]) / ((difftimeoutlsort[1:] + difftimeoutlsort[:-1]) / 2.)

    # index of the minimum fractional difference between differences of times of outlier samples
    indxfrddtimeoutlsort = np.argmin(frddtimeoutlsort)
    
    # minimum fractional difference between differences of times of outlier samples
    minmfrddtimeoutlsort = frddtimeoutlsort[indxfrddtimeoutlsort]
    
    # estimate of the epoch
    epoccomp = timeoutl[0]
    
    # estimate of the period
    pericomp = difftimeoutlsort[indxfrddtimeoutlsort]
    
    # output dictionary
    dictoutp = dict()
    
    # populate the output dictionary
    if minmfrddtimeoutlsort < 0.1:
        dictoutp['boolposi'] = True
        dictoutp['pericomp'] = pericomp
        dictoutp['epocmtracomp'] = epoccomp
    else:
        dictoutp['boolposi'] = False
    dictoutp['minmfrddtimeoutlsort'] = minmfrddtimeoutlsort
    dictoutp['timeoutl'] = timeoutl 
    
    return dictoutp
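To make the pairwise-difference idea concrete, here is a compact re-sketch applied to a synthetic light curve. It is not the original function: `periodic_outlier_sketch`, its `tolerance` parameter, and the output key names are hypothetical, and the data are made up so that the five largest samples are exactly 7.5 days apart.

```python
from itertools import combinations

import numpy as np


def periodic_outlier_sketch(time, flux, n_outliers=5, tolerance=0.1):
    """Compact re-sketch of the pairwise-difference idea (hypothetical helper)."""
    # Times of the n_outliers samples with the largest flux values.
    top = np.argsort(flux)[::-1][:n_outliers]
    t_out = time[top]

    # Sorted absolute time differences over all unordered pairs.
    diffs = np.sort([abs(a - b) for a, b in combinations(t_out, 2)])

    # Fractional change between consecutive sorted differences.
    frac = (diffs[1:] - diffs[:-1]) / ((diffs[1:] + diffs[:-1]) / 2.0)
    best = int(np.argmin(frac))

    return {
        "is_periodic": bool(frac[best] < tolerance),
        "period": float(diffs[best]),
        "epoch": float(t_out[0]),
        "min_frac": float(frac[best]),
    }


# Synthetic light curve: flat baseline with five spikes spaced 7.5 days apart.
time = 100.0 + 0.5 * np.arange(100)
flux = np.ones_like(time)
spike_idx = np.array([0, 15, 30, 45, 60])
flux[spike_idx] = [2.0, 1.9, 1.8, 1.7, 1.6]

result = periodic_outlier_sketch(time, flux)
print(result)
```

With five equally spaced spikes, four of the ten pairwise differences equal the spacing itself, so the smallest sorted differences are identical. Their fractional difference is zero and the spacing is reported as the period. If two pairwise differences are both zero (for example, duplicated timestamps), the same ratio becomes 0/0, which NumPy reports as a runtime warning.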

Now we take that analysis function and use `NestedFrame.reduce` to apply it across our r-band subsample of ZTF light curves.

In [9]:
outliers = r_band.reduce(
    srch_outlperi,
    "lc.mjd", 
    "lc.mag", 
    "lc.magerr",
    meta={
        "boolposi": bool,
        "pericomp": float,
        "epocmtracomp": float,
        "minmfrddtimeoutlsort": float,
        "timeoutl": object,
    },
)
outliers

| npartitions=4 | boolposi | pericomp | epocmtracomp | minmfrddtimeoutlsort | timeoutl |
|---|---|---|---|---|---|
| 7799953079628988416 | bool | float64 | float64 | float64 | object |
| 7812056503627546624 | ... | ... | ... | ... | ... |
| 7824159927626104832 | ... | ... | ... | ... | ... |
| 7836263351624663040 | ... | ... | ... | ... | ... |
| 18446744073709551615 | ... | ... | ... | ... | ... |
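The shape of this result (one row per object, one column per dictionary key) can be mimicked in plain pandas. The sketch below is an analogue under stated assumptions, not the `nested-dask` implementation: `summarize` is a hypothetical per-object function and the light curves are made up.

```python
import numpy as np
import pandas as pd

# Toy light curves keyed by a made-up object index.
light_curves = {
    100: {"mjd": np.array([1.0, 2.0, 4.0]), "mag": np.array([18.0, 18.5, 18.2])},
    200: {"mjd": np.array([1.5, 3.5]), "mag": np.array([19.0, 19.4])},
}


def summarize(mjd, mag):
    """Hypothetical per-object function returning a dict of named outputs."""
    return {"nobs": int(mjd.size), "faintest_mjd": float(mjd[np.argmax(mag)])}


# Call the function once per object with that object's arrays, then turn
# the dictionary keys into columns.
rows = {idx: summarize(lc["mjd"], lc["mag"]) for idx, lc in light_curves.items()}
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```

In the notebook the `meta` argument plays the role that pandas infers here: it declares the output column names and dtypes up front, because the lazy Dask graph cannot inspect real results before `compute` is called.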


We can call Dask's `compute` to materialize the lazily evaluated `outliers` table.

In [10]:
outliers_res = outliers.compute()
outliers_res



| _hipscat_index | boolposi | pericomp | epocmtracomp | minmfrddtimeoutlsort | timeoutl |
|---|---|---|---|---|---|
| 7800231189779841024 | True | 361.02131 | 58994.21023 | 0.024210 | [58994.21023, 59306.27599, 59699.26419, 58936.... |
| 7800231189779841025 | True | 361.02131 | 58994.21023 | 0.024210 | [58994.21023, 59306.27599, 59699.26419, 58936.... |
| 7800231210621337600 | True | 736.96942 | 58954.23602 | 0.052880 | [58954.23602, 58217.2666, 58513.43167, 58895.4... |
| 7800231224164745216 | True | 364.00500 | 58791.54719 | 0.005410 | [58791.54719, 59157.5267, 59521.5317, 58884.56... |
| 7800231386677248000 | True | 607.28716 | 58886.31573 | 0.030704 | [58886.31573, 58905.25268, 58443.47561, 59512.... |
| ... | ... | ... | ... | ... | ... |
| 7836515150491287552 | True | 479.35273 | 58676.19122 | 0.005054 | [58676.19122, 59155.54395, 59199.50761, 59266.... |
| 7836515455631097856 | True | 629.21083 | 58593.32809 | 0.087855 | [58593.32809, 59521.53174, 58447.49496, 58834.... |
| 7836515485691674624 | True | 556.21288 | 58783.53799 | 0.007157 | [58783.53799, 58888.44479, 58593.32809, 59149.... |
| 7836515553312243712 | True | 303.16333 | 59380.20809 | 0.016058 | [59380.20809, 59184.50576, 58881.34243, 58894.... |
