# Run a Periodogram Across Full ZTF Sources

This notebook is an adaptation of the Nested Dask [tutorial for loading HiPSCat data](https://nested-dask.readthedocs.io/en/latest/tutorials/work_with_lsdb.html).

There we loaded a small subset of ZTF DR14; here we try to load the full ZTF DR20 and run the `light-curve` package's periodogram across all of ZTF on epyc.

## Install dependencies for the notebook

The notebook requires `nested-dask` and a few other packages:
- `lsdb` to load and join "object" (pointing) and "source" (detection) ZTF catalogs
- `aiohttp`, an optional `lsdb` dependency, to download the data over the web
- `light-curve` to extract features from light curves
- `matplotlib` to plot the results

In [None]:
# Uncomment the following line to install nested-dask
# %pip install nested-dask

# Comment out the following line to skip installing the dependencies
%pip install --quiet lsdb aiohttp light-curve matplotlib

In [None]:
%pip install --quiet git+https://github.com/astronomy-commons/lsdb.git

In [None]:
import dask.array
import dask.distributed
import light_curve as licu
import matplotlib.pyplot as plt
import nested_pandas as npd
import numpy as np
import pandas as pd
from dask_expr import from_legacy_dataframe
from lsdb import read_hipscat
from matplotlib.colors import LogNorm
from nested_dask import NestedFrame

Some additional setup for using Dask on epyc:

In [None]:
from tqdm import tqdm
import dask
dask.config.set({"temporary-directory": "/epyc/ssd/users/wbeebe/tmp"})

#from dask.diagnostics import ProgressBar
#ProgressBar().register()
# Unclear how we want shuffle compression configured
#dask.config.set({"dataframe.shuffle-compression": 'Snappy'})

## Load ZTF DR14
We load ZTF DR14 as two LSDB catalogs: objects (metadata catalog) and source (light curve catalog).
Unlike the half-degree HTTPS sample around RA=180, Dec=10 used in the original tutorial, the catalogs here are read from a local directory on epyc and restricted by the cone search defined below.

## Setup cone search
Here we apply a cone search when loading the catalogs with LSDB to keep the data small: a 0.6-degree radius around RA=254, Dec=35.

In [None]:
from lsdb.core.search import ConeSearch
search_area = ConeSearch(ra=254, dec=35, radius_arcsec=0.6 * 3600)
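
To make the cone geometry concrete, the sketch below checks whether a sky position lies within the search radius using the haversine formula. It is a plain NumPy illustration, not how LSDB implements `ConeSearch`; the helper `angular_separation_deg` and the two test positions are hypothetical.

```python
import numpy as np


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = np.radians([ra1, dec1, ra2, dec2])
    sin_ddec = np.sin((dec2 - dec1) / 2.0)
    sin_dra = np.sin((ra2 - ra1) / 2.0)
    h = sin_ddec**2 + np.cos(dec1) * np.cos(dec2) * sin_dra**2
    return np.degrees(2.0 * np.arcsin(np.sqrt(h)))


# Same cone as in the cell above: 0.6 degrees expressed in arcseconds
radius_arcsec = 0.6 * 3600
radius_deg = radius_arcsec / 3600.0
center_ra, center_dec = 254.0, 35.0

# Hypothetical positions: one close to the center, one a full degree away in RA
inside = angular_separation_deg(center_ra, center_dec, 254.3, 35.2) <= radius_deg
outside = angular_separation_deg(center_ra, center_dec, 255.0, 35.0) <= radius_deg
print(inside, outside)
```

Note that a one-degree offset in RA at Dec=35 corresponds to less than one degree on the sky, which is why the separation has to be computed on the sphere rather than from coordinate differences.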


In [None]:
#half_degree_catalogs_dir = "https://epyc.astro.washington.edu/~lincc-frameworks/half_degree_surveys/ztf/"
# catalogs_dir = "https://epyc.astro.washington.edu/~lincc-frameworks/hipscat_surveys"
catalogs_dir = "/data3/epyc/data3/hipscat/catalogs/ztf_axs"


lsdb_object = read_hipscat(
    f"{catalogs_dir}/ztf_dr14",
    columns=["ra", "dec", "ps1_objid"],
    search_filter=search_area,
)
lsdb_source = read_hipscat(
    f"{catalogs_dir}/ztf_zource",
    columns=["mjd", "ra", "dec", "mag", "magerr", "band", "ps1_objid", "catflags"],
    search_filter=search_area,
)
lc_columns = ["mjd", "mag", "magerr", "band", "catflags"]

In [None]:
lsdb_object.head(100)

In [None]:
lsdb_source.head(100)

In [None]:
print(len(lsdb_object._ddf), len(lsdb_source._ddf))

We need to merge these two catalogs to get the light curve data.
This is done with LSDB's `.join_nested()` method, which gives us a new catalog with a nested frame of ZTF sources. For this tutorial, we use the underlying nested dataframe for the rest of the analysis rather than the LSDB catalog directly.

In [None]:
# Nesting Sources into Object
nested_ddf = lsdb_object.join_nested(lsdb_source, left_on="ps1_objid", right_on="ps1_objid", nested_column_name="lc")

# TODO remove once have added LSDB wrappers for nested_dask (reduce, dropna, etc)
nested_ddf = nested_ddf._ddf
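
Conceptually, nesting attaches to every object row the list of source rows that share its `ps1_objid`. The sketch below mimics that idea with plain Python containers and made-up rows; it is only an analogy, not the `join_nested` implementation, and how LSDB treats objects without any sources may differ from the empty list used here.

```python
from collections import defaultdict

# Made-up object and source rows (hypothetical values)
objects = [
    {"ps1_objid": 101, "ra": 254.00, "dec": 35.00},
    {"ps1_objid": 102, "ra": 254.10, "dec": 35.05},
]
sources = [
    {"ps1_objid": 101, "mjd": 58300.1, "mag": 15.2},
    {"ps1_objid": 102, "mjd": 58300.2, "mag": 17.8},
    {"ps1_objid": 101, "mjd": 58301.1, "mag": 15.3},
]

# Group the detections by the join key
by_object = defaultdict(list)
for row in sources:
    by_object[row["ps1_objid"]].append({"mjd": row["mjd"], "mag": row["mag"]})

# Attach each group to its object as a nested "lc" entry
nested = [{**obj, "lc": by_object.get(obj["ps1_objid"], [])} for obj in objects]
print(nested[0])
```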

## Convert LSDB joined catalog to `nested_dask.NestedFrame`

First, we plan the computation to convert the joined Dask DataFrame to a NestedFrame.

Now we filter the dataframe on the `catflags` column (a value of 0 corresponds to perfect observing conditions) and keep only detections whose `band` equals `r`.
After filtering the detections, we count the detections per object and keep only the objects with more than 10 of them.

In [None]:
r_band = nested_ddf.query("lc.catflags == 0 and lc.band == 'r'")
nobs = r_band.reduce(np.size, "lc.mjd", meta={0: int}).rename(columns={0: "nobs"})
r_band = r_band[nobs["nobs"] > 10]
r_band
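
The filter-then-count logic can be illustrated on toy data without Dask. This is a plain NumPy analogue under assumed inputs (the object ids, flag values, and the helper `count_good_r` are hypothetical), not the `nested_dask` API.

```python
import numpy as np

# Toy light curves keyed by a made-up object id
light_curves = {
    1: {
        "band": np.array(["r"] * 12 + ["g"] * 3),
        "catflags": np.zeros(15, dtype=int),
    },
    2: {
        "band": np.array(["r"] * 12),
        "catflags": np.array([0] * 8 + [32768] * 4),
    },
}


def count_good_r(lc):
    """Number of r-band detections with no quality flags set."""
    good = (lc["catflags"] == 0) & (lc["band"] == "r")
    return int(good.sum())


nobs = {objid: count_good_r(lc) for objid, lc in light_curves.items()}
# Keep only objects with more than 10 clean r-band detections
kept = [objid for objid, n in nobs.items() if n > 10]
print(nobs, kept)
```

Object 2 has twelve r-band detections but only eight of them are unflagged, so it is dropped by the threshold.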

Later we extract features, which requires all light-curve arrays to share the same floating-point dtype.

### Extract features from ZTF light curves

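Now we extract the top periodogram peak for each light curve, which yields two features: the peak period and its signal-to-noise ratio.

The extractor in this notebook contains only the periodogram; other features such as the mean magnitude, Von Neumann's eta, excess variance, and the number of observations are not computed here.

We use the [`light-curve`](https://github.com/light-curve/light-curve-python) package for this purpose.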
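
For intuition about what a periodogram peak is, here is a minimal NumPy sketch of the classical Lomb-Scargle periodogram applied to a synthetic sinusoid. It is not the `light-curve` implementation (which has its own frequency grid and fast algorithm); the function name, the frequency grid, and the simulated light curve are all assumptions made for illustration.

```python
import numpy as np


def lomb_scargle_power(t, y, freqs):
    """Classical Lomb-Scargle power at each trial frequency (cycles per day)."""
    y = y - y.mean()
    power = np.empty_like(freqs)
    for i, f in enumerate(freqs):
        omega = 2.0 * np.pi * f
        # The time offset tau makes the sine and cosine terms orthogonal
        tau = np.arctan2(np.sin(2 * omega * t).sum(), np.cos(2 * omega * t).sum()) / (2 * omega)
        c = np.cos(omega * (t - tau))
        s = np.sin(omega * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power / y.var()


# Synthetic, irregularly sampled light curve with a known period
rng = np.random.default_rng(0)
true_period = 0.55  # days
t = np.sort(rng.uniform(0, 100, 300))
y = 15 + 0.3 * np.sin(2 * np.pi * t / true_period) + rng.normal(0, 0.05, t.size)

freqs = np.linspace(0.05, 5.0, 5000)
power = lomb_scargle_power(t, y, freqs)
best_period = 1.0 / freqs[np.argmax(power)]
print(f"recovered period: {best_period:.4f} days")
```

The highest peak of the power spectrum marks the best-fitting sinusoidal period, which is the quantity the periodogram feature reports together with a measure of how far that peak stands above the rest.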

In [None]:
extractor = licu.Extractor(
    licu.Periodogram(
        peaks=1,
        max_freq_factor=1.0, # Currently 1.0 for fast runs, will raise for more interesting graphs later
        fast=True,
    ),  # Yields two features: the peak period and the signal-to-noise ratio of the peak
)


# light-curve requires all arrays to be the same dtype.
# It also requires the time array to be ordered and to have no duplicates.
def extract_features(mjd, mag, **kwargs):
    # Offset the dates so that float32 keeps a time precision of roughly 10 seconds
    t = np.asarray(mjd - 60000, dtype=np.float32)
    _, sort_index = np.unique(t, return_index=True)
    features = extractor(
        t[sort_index],
        mag[sort_index],
        **kwargs,
    )
    # Return the features as a dictionary
    return dict(zip(extractor.names, features))


features = r_band.reduce(
    extract_features,
    "lc.mjd",
    "lc.mag",
    meta={name: np.float32 for name in extractor.names},
)

df_w_features = r_band.join(features)
df_w_features
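
The time handling inside `extract_features` can be checked in isolation. The sketch below uses made-up epochs to show that `np.unique(..., return_index=True)` yields indices that both sort the times and drop duplicates, and uses `np.spacing` to estimate the float32 time resolution with and without the 60000-day offset. The values 58000 and 1800 are representative magnitudes chosen for illustration, not taken from the catalog.

```python
import numpy as np

# Made-up epochs: unsorted, with one duplicated time stamp
mjd = np.array([58300.5, 58100.25, 58200.75, 58100.25])
mag = np.array([15.3, 15.1, 15.2, 15.15], dtype=np.float32)

t = np.asarray(mjd - 60000, dtype=np.float32)
# return_index gives the first occurrence of each unique value, in sorted order
_, sort_index = np.unique(t, return_index=True)
t_clean = t[sort_index]
mag_clean = mag[sort_index]
print(sort_index, t_clean, mag_clean)

# Gap between adjacent float32 values, converted from days to seconds
seconds_per_day = 86400.0
res_raw = float(np.spacing(np.float32(58000.0))) * seconds_per_day
res_offset = float(np.spacing(np.float32(1800.0))) * seconds_per_day
print(f"float32 step near MJD 58000: {res_raw:.1f} s")
print(f"float32 step near |MJD - 60000| = 1800: {res_offset:.1f} s")
```

Subtracting a constant close to the observation dates before casting shrinks the float32 step from minutes to seconds, which is why the offset is applied before the conversion.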

Before we actually run the computation, let's create a Dask client, which allows us to run it in parallel.

In [None]:
client = dask.distributed.Client()

Now we can collect some statistics and plot them.

In [None]:
lg_period_bins = np.linspace(-1, 2, 101)
lg_snr_bins = np.linspace(0, 2, 101)

lg_period = dask.array.log10(df_w_features["period_0"].to_dask_array())
lg_snr = dask.array.log10(df_w_features["period_s_to_n_0"].to_dask_array())

hist2d = dask.array.histogram2d(
    lg_period,
    lg_snr,
    bins=[lg_period_bins, lg_snr_bins],
)
# Run the computation
hist2d = hist2d[0].compute()

# Plot the 2D histogram
plt.imshow(
    hist2d.T,
    extent=(lg_period_bins[0], lg_period_bins[-1], lg_snr_bins[0], lg_snr_bins[-1]),
    origin="lower",
    norm=LogNorm(vmin=1, vmax=hist2d.max()),
)
plt.colorbar(label="Number of stars")
plt.xlabel("lg Period/day")
plt.ylabel("lg S/N")
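
The transpose passed to `imshow` above follows from how 2D histograms are laid out. The small NumPy example below, on made-up points, shows the convention of `numpy.histogram2d`; the assumption here is that the Dask version follows the same axis ordering.

```python
import numpy as np

# Three made-up points: two in the first x-bin / second y-bin, one in the last x-bin / first y-bin
x = np.array([0.5, 0.5, 2.5])
y = np.array([1.5, 1.5, 0.5])
x_bins = np.array([0.0, 1.0, 2.0, 3.0])
y_bins = np.array([0.0, 1.0, 2.0])

counts, _, _ = np.histogram2d(x, y, bins=[x_bins, y_bins])
print(counts.shape)  # first axis follows x, second axis follows y

# imshow draws rows along the vertical axis, so x must become the column index
image = counts.T
print(image.shape)
```

With `origin="lower"` and the transposed array, the horizontal axis of the image corresponds to the first histogram variable and the vertical axis to the second.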

Let's select a star with a high signal-to-noise ratio and plot its phase-folded light curve. We also exclude periods close to 1 day, because they are very likely to be bogus.

In [None]:
# No brightness cut: the extractor above computes only the periodogram features
bright_periodic_stars = df_w_features.query(
    "period_s_to_n_0 > 20 and (period_0 < 0.9 or period_0 > 1.1)"
)
df = bright_periodic_stars.compute()

obj = df.iloc[0]
lc = obj["lc"]
period = obj["period_0"]
ra = obj["ra"]
dec = obj["dec"]

plt.errorbar(lc["mjd"] % period / period, lc["mag"], lc["magerr"], fmt="o")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Magnitude")
plt.xlim([0, 1])
plt.title(f"RA={ra:.4f}, Dec={dec:.4f} period={period:.4f} days")

print("Search for this object on the SNAD ZTF Viewer:")
print(f"https://ztf.snad.space/search/{ra}%20{dec}/1")

It looks like a nice RR Lyrae star!
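
The phase used for the plot above is simply the fractional part of the time measured in units of the period. A minimal sketch with made-up epochs and a hypothetical period of 0.5 days:

```python
import numpy as np

period = 0.5  # days, hypothetical
mjd = np.array([58000.0, 58000.125, 58000.5, 58001.625, 58003.25])

# Fraction of a cycle elapsed at each epoch, always in [0, 1)
phase = mjd % period / period
print(phase)

# For a strictly periodic signal, epochs with equal phase have equal magnitude
mag = 15 + 0.3 * np.sin(2 * np.pi * mjd / period)
print(mag[1], mag[3])
```

Observations taken many cycles apart land at the same phase, so folding stacks all cycles on top of each other and reveals the shape of a single pulsation.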

In [None]:
client.close()