# Load large catalog data with LSDB

Here we load a small part of ZTF DR14, stored as a HiPSCat catalog, using [LSDB](https://lsdb.readthedocs.io/).

## Install LSDB and its dependencies and import the necessary modules

We also need `aiohttp`, an optional LSDB dependency that is required to access catalog data over the web.

In [None]:
import pandas as pd

# Comment out the following line to skip LSDB installation
%pip install aiohttp lsdb

In [None]:
import nested_pandas as npd
from lsdb import read_hipscat
from nested_dask import NestedFrame
from nested_pandas.series.packer import pack

## Load ZTF DR14
For demonstration purposes, we use a light version of the ZTF DR14 catalog distributed by LINCC Frameworks: a half-degree circle around RA=180, Dec=10.
We load the data over HTTPS as two LSDB catalogs: objects (metadata catalog) and sources (light curve catalog).

In [None]:
# No trailing slash here: the f-strings below add the path separator themselves
catalogs_dir = "https://epyc.astro.washington.edu/~lincc-frameworks/half_degree_surveys/ztf"

lsdb_object = read_hipscat(
    f"{catalogs_dir}/ztf_object",
    columns=["ra", "dec", "ps1_objid"],
)
lsdb_source = read_hipscat(
    f"{catalogs_dir}/ztf_source",
    columns=["mjd", "mag", "magerr", "band", "ps1_objid", "catflags"],
)

We need to join these two catalogs to attach the light curve data to each object.
This is done with LSDB's `.join()` method, which returns a new catalog with all the columns from both catalogs.

In [None]:
# We can ignore the warning here: in this particular case we don't need a margin cache
lsdb_joined = lsdb_object.join(
    lsdb_source,
    left_on="ps1_objid",
    right_on="ps1_objid",
    suffixes=("", ""),
)
joined_ddf = lsdb_joined._ddf
joined_ddf
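
To get a feel for the row layout such a join produces, here is a minimal pandas-only sketch. The toy values are made up, and `pandas.DataFrame.merge` stands in for the LSDB join purely as an illustration; it is not how LSDB implements `.join()`.

```python
import pandas as pd

# Hypothetical toy data: two objects and three detections in total
objects = pd.DataFrame(
    {"ps1_objid": [1, 2], "ra": [180.0, 180.1], "dec": [10.0, 10.2]}
)
sources = pd.DataFrame(
    {
        "ps1_objid": [1, 1, 2],
        "mjd": [58000.0, 58001.0, 58000.5],
        "mag": [18.2, 18.3, 19.1],
    }
)

# An inner join on the shared key yields one row per detection,
# so the object columns (ra, dec) are repeated for every detection.
joined = objects.merge(sources, on="ps1_objid", how="inner")
print(joined)
```

The repeated object columns are the reason the next step collapses the table back to one row per object.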

## Convert LSDB joined catalog to `nested_dask.NestedFrame`

First, we plan the computation to convert the joined Dask DataFrame to a NestedFrame.

In [None]:
def convert_to_nested_frame(df: pd.DataFrame, nested_columns: list[str]):
    other_columns = [col for col in df.columns if col not in nested_columns]

    # Object rows are repeated once per source row, so we keep
    # the first row for each index value to drop the duplicates
    object_df = df[other_columns].groupby(level=0).first()
    nested_frame = npd.NestedFrame(object_df)

    source_df = df[nested_columns]
    # lc is for light curve
    # https://github.com/lincc-frameworks/nested-pandas/issues/88
    # nested_frame.add_nested(source_df, 'lc')
    nested_frame["lc"] = pack(source_df, name="lc")

    return nested_frame


ddf = joined_ddf.map_partitions(
    lambda df: convert_to_nested_frame(df, nested_columns=lsdb_source.columns),
    meta=convert_to_nested_frame(joined_ddf._meta, nested_columns=lsdb_source.columns),
)
nested_ddf = NestedFrame.from_dask_dataframe(ddf)
nested_ddf
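
The per-partition logic can be tried on a toy table without `nested_pandas`. The sketch below assumes, as in the conversion function above, that rows belonging to the same object share an index value. The index name, the values, and the dictionary of per-object tables are illustrative stand-ins; the dictionary is not the packed `lc` column that `pack` produces.

```python
import pandas as pd

# Hypothetical joined partition: the index value identifies the object
joined = pd.DataFrame(
    {
        "ra": [180.0, 180.0, 180.1],
        "dec": [10.0, 10.0, 10.2],
        "mjd": [58000.0, 58001.0, 58000.5],
        "mag": [18.2, 18.3, 19.1],
    },
    index=pd.Index([100, 100, 200], name="object_index"),
)
nested_columns = ["mjd", "mag"]
other_columns = [col for col in joined.columns if col not in nested_columns]

# One row per object: keep the first row for each index value
object_df = joined[other_columns].groupby(level=0).first()

# One small table of detections per object (a stand-in for the nested column)
light_curves = {
    idx: group.reset_index(drop=True)
    for idx, group in joined[nested_columns].groupby(level=0)
}

print(object_df)
print(light_curves[100])
```

Note that `GroupBy.first()` returns the first non-null value in each column, which matches plain deduplication when the repeated object rows are identical.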

Second, we run the computation, which returns the NestedFrame in memory.

In [None]:
ndf = nested_ddf.compute()
ndf