# Sprint Demo 2025-23-18 - `from_dataframe` arg naming

In [8]:
# Just to get rid of the squiggly lines in the code examples--they aren't really meant to be executed.

import lsdb
import math
small_sky_source_df = None
self = None
partition_size = None

## Question: if we specified a bytewise partitioning threshold to `from_dataframe`...would we do it like this or like this?

![dog-pants](https://www.researchgate.net/profile/Ray-Lc/publication/352806710/figure/fig1/AS:11431281124566317@1677990839084/Thought-experiment-If-a-dog-wore-pants-how-would-the-dog-wear-them-2.ppm)


### Currently, we have options `partition_size` and `threshold`

- `threshold`: just used as-is
- `partition_size`: adjusted (never upward) so that every partition ends up with the same or a very similar row count, so the result is close to what was set, but not quite
- If neither is specified, we estimate how many rows per pixel would make each pixel around 1 GiB
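
As a rough illustration of the default case in the last bullet, here is a minimal sketch of turning a byte target into a row count. The helper name `estimate_rows_per_partition` and its parameters are hypothetical, and it assumes rows are roughly uniform in size; the total byte count could come from something like pandas `DataFrame.memory_usage(deep=True).sum()`. It is not the actual lsdb implementation.

```python
ONE_GIB = 1 << 30


def estimate_rows_per_partition(total_bytes, num_rows, target_bytes=ONE_GIB):
    """Hypothetical helper: rows per partition so each is about `target_bytes`.

    Assumes every row takes roughly the same amount of memory.
    """
    if num_rows <= 0 or total_bytes <= 0:
        raise ValueError("num_rows and total_bytes must be positive")
    # Integer form of target_bytes / (total_bytes / num_rows), rounded down,
    # with a floor of one row per partition.
    return max(1, target_bytes * num_rows // total_bytes)


# A 4 GiB table with 1,000,000 rows: about a quarter of the rows
# fit in a 1 GiB partition.
rows = estimate_rows_per_partition(total_bytes=4 * ONE_GIB, num_rows=1_000_000)
print(rows)
```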

In [1]:
!pip install mermaid-py



In [5]:
from mermaid import Mermaid

graph_definition = """
flowchart TD
    A([_calculate_threshold called]) --> B{Are both threshold AND partition_size set?}
    B -->|Yes| E([Error: threshold and partition_size cannot both be set])

    B -->|No| C{Is threshold set?}
    C -->|Yes| T([Use threshold]) --> Z

    C -->|No| D{Is partition_size set?}
    D -->|Yes| P1([Adjust so each partition will have same or very similar row count]) --> Z

    D -->|No| N([Estimate a row count s.t. each partition is about 1 GiB]) --> Z
    
    Z{Set self.threshold}
"""
Mermaid(graph_definition)

In [None]:
# Setting partition_size:

if partition_size is not None:
    # Round the number of partitions to the next integer, otherwise the
    # number of pixels per partition may exceed the threshold
    num_partitions = math.ceil(len(self.dataframe) / partition_size)
    return len(self.dataframe) // num_partitions

# And then that gets set to self.threshold
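
To make the adjustment concrete, here is a standalone sketch of the decision flow from the flowchart. The function name `calculate_threshold` and the `default_threshold` parameter (a stand-in for the roughly 1 GiB estimate) are hypothetical; only the `partition_size` arithmetic is taken from the snippet above.

```python
import math


def calculate_threshold(num_rows, threshold=None, partition_size=None, default_threshold=None):
    """Hypothetical standalone version of the threshold decision flow."""
    if threshold is not None and partition_size is not None:
        raise ValueError("threshold and partition_size cannot both be set")
    if threshold is not None:
        # Used as-is.
        return threshold
    if partition_size is not None:
        # Round the partition count up, then spread the rows evenly.
        num_partitions = math.ceil(num_rows / partition_size)
        return num_rows // num_partitions
    # Neither is set: fall back to the estimate (passed in here as a placeholder).
    return default_threshold


# 1,000 rows with partition_size=300: ceil(1000 / 300) = 4 partitions,
# so the threshold becomes 1000 // 4 = 250 rather than 300.
adjusted = calculate_threshold(1000, partition_size=300)
print(adjusted)
```

So a user who asks for 300 rows per partition ends up with a threshold of 250, which is the "similar, but not quite" behavior described earlier.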

### However, I've added an option to specify a memory-size partitioning threshold

- I'm currently calling this `partition_size_bytes` (but open to alternatives)
- Operates just like hats-import bytewise partitioning: enforces a *maximum* size (in bytes)

## Option A: remove `threshold` as an argument

- I like this because I find `partition_size` and `partition_size_bytes` to be a little more descriptive
- Note: I'd also want to make `partition_size` be a simple maximum size here (instead of the adjusted size)
- Also, we have the argument `margin_threshold`, which is an entirely different kind of threshold, so I like creating a conceptual distinction between the two
- However, this is a somewhat breaking change. Within the spirit of the upcoming `v0.8.0`, but still a consideration.
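
If the breaking change turns out to be a concern, one possible way to soften it would be to keep accepting the old name for a release while warning about it. This is only a sketch under that assumption: `resolve_partition_size` is a hypothetical helper, not something that exists in lsdb, and it assumes `partition_size` has already become a simple maximum so the old `threshold` value can be passed through unchanged.

```python
import warnings


def resolve_partition_size(partition_size=None, threshold=None):
    """Hypothetical shim: accept the old `threshold` name but warn about it."""
    if threshold is not None:
        if partition_size is not None:
            raise ValueError("threshold and partition_size cannot both be set")
        warnings.warn(
            "`threshold` is deprecated; use `partition_size` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumes partition_size is a plain maximum, like threshold was.
        return threshold
    return partition_size


# The new name passes straight through with no warning.
size = resolve_partition_size(partition_size=500)
print(size)
```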

In [None]:
# Option A:

lsdb.from_dataframe(
    small_sky_source_df,
    ra_column="source_ra",
    dec_column="source_dec",
    partition_size=None,
    partition_size_bytes=(1 << 30),  # 1 GiB
    margin_threshold=None,
)

## Option B: the same, but with `threshold` and `threshold_bytes` instead

- Pros of this include being a little more similar to the arguments used in hats-import (`pixel_threshold` and `bytes_pixel_threshold`)
  - I guess an option B2 would be just using these same arg names, but...even more of a breaking change
- Cons: we have `threshold` and `margin_threshold` as unrelated args with similar names
- Still a somewhat breaking change


In [None]:
# Option B:

lsdb.from_dataframe(
    small_sky_source_df,
    ra_column="source_ra",
    dec_column="source_dec",
    threshold=None,
    threshold_bytes=(1 << 30),  # 1 GiB
    margin_threshold=None,
)

## Option C: keep all existing args (preserving behavior), and add `partition_size_bytes`

- :(

In [None]:
# Option C:

lsdb.from_dataframe(
    small_sky_source_df,
    ra_column="source_ra",
    dec_column="source_dec",
    threshold=None,
    partition_size=None,
    partition_size_bytes=(1 << 30),  # 1 GiB
    margin_threshold=None,
)

## Option D: ???

Idk, you tell me! Want to avoid tearing down a Chesterton's fence here