scan pandas improvement #2405

acquamarin · 2023-11-14T03:17:13Z

A pandas DataFrame can be a dictionary. (DuckDB binds it using numpy::bind.)
Support NumPy arrays of objects.
Implement import/variable cache mechanism in pybind to optimize performance.
Support category type.
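For context on the object-array and category items, here is a small sketch of how pandas stores such columns. The column names are illustrative only. In the pandas 2.x releases current at the time of this thread, a list of Python strings is typically stored with `object` dtype (newer pandas versions may report a dedicated string dtype instead), and a categorical column is stored as integer codes plus a list of categories.

```python
import pandas as pd

# Illustrative frame: an integer column and a string column built from plain lists
df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})

# A categorical column: pandas keeps integer codes and the distinct categories
df["group"] = pd.Categorical(["a", "b", "a"])

# Map each column name to its dtype name, e.g. "int64", "object", "category"
dtypes = {col: str(dtype) for col, dtype in df.dtypes.items()}

# The two pieces a scanner would need in order to read a category column
codes = df["group"].cat.codes.tolist()
categories = df["group"].cat.categories.tolist()

print(dtypes)
print(codes, categories)
```

A scan implementation would presumably need to handle both representations, but how Kùzu does so internally is not described in this thread.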

prrao87 · 2024-01-22T15:36:34Z

@acquamarin and @ray6080, I'm wondering if this feature can be prioritized? I'm looking at showcasing the pandas DataFrame scan functionality via a Kùzu byte, but in my view this functionality is not very useful right now because it can only handle integer and float arrays (the example code below fails when you input a list of strings).

Also, from a usability and DevEx perspective, I think we should auto-cast Python lists to the corresponding numpy type, without the user having to manually specify numpy arrays, like we currently show in the docs.

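import kuzu
import pandas as pd

# Assumed setup, not shown in the original snippet; the database path is illustrative
db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)

person = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6, 7],
        # "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Fred", "George"],
        "age": [42, 23, 33, 57, 67, 39, 11],
        "height": [167, 172, 183, 199, 149, 154, 165],
        "is_student": [False, True, False, False, False, False, True],
    }
)

# Scan the DataFrame by variable name and convert height from cm to inches
result = conn.execute(
    """
    CALL READ_PANDAS("person")
    RETURN age as age, height / 2.54 as height_in_inch
    """
).get_as_df()

# The above code fails when we uncomment the string column ("name").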

To make this a useful feature, I'm recommending that we allow the user to specify all inputs to a pandas DataFrame as Python lists, rather than manually converting to numpy arrays on their end. If there's an error in casting the Python type to a numpy array (e.g., if the user mistakenly adds a mixed-type list such as [1, "a"]), the error message should be informative enough to show that there's a type issue, so it might make sense to perform this check on the Python side.
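As a rough illustration of what such a Python-side check could look like, here is a minimal sketch. The function name `check_columns` and the error wording are hypothetical and not part of Kùzu's API. It is also deliberately strict: it compares exact Python types, so a column mixing `int` and `float` (or `int` and `bool`) would be flagged as well.

```python
from collections import Counter
from typing import Any, Mapping, Sequence


def check_columns(data: Mapping[str, Sequence[Any]]) -> None:
    """Raise TypeError naming the first column whose values mix Python types.

    Hypothetical helper for illustration only; not an existing Kùzu function.
    """
    for name, values in data.items():
        # Count how many values of each exact Python type the column holds
        type_counts = Counter(type(v).__name__ for v in values)
        if len(type_counts) > 1:
            found = ", ".join(f"{t} x{n}" for t, n in sorted(type_counts.items()))
            raise TypeError(
                f"Column '{name}' mixes Python types ({found}); "
                "use a single type per column."
            )


# A mixed-type list like [1, "a"] is reported with the column name and types
try:
    check_columns({"id": [1, 2, 3], "name": [1, "a"]})
except TypeError as exc:
    message = str(exc)
    print(message)
```

Running a check like this before the lists are cast to numpy arrays means the user sees the offending column name rather than a lower-level casting error.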

ray6080 · 2024-01-23T02:46:15Z

@prrao87 Yeah, I totally agree with this. We should prioritize supporting this to make the feature more usable. I think we can discuss with @acquamarin how much needs to be done, and whether we can schedule it before the next release.

acquamarin mentioned this issue Nov 14, 2023

Implement scan pandas #2403

Merged

semihsalihoglu-uw assigned ray6080 and hououou Jan 8, 2024

semihsalihoglu-uw added the feature (New features or missing components of existing features) and apis labels Jan 8, 2024

ray6080 unassigned ray6080 and hououou Apr 22, 2024
