scan pandas improvement #2405

Open
2 of 4 tasks
acquamarin opened this issue Nov 14, 2023 · 2 comments
Labels
apis, feature (New features or missing components of existing features)

Comments

@acquamarin
Collaborator

acquamarin commented Nov 14, 2023

  • The pandas DataFrame can be a dictionary. (DuckDB binds it using numpy::bind.)
  • Support numpy arrays of objects.
  • Implement an import/variable cache mechanism in pybind to optimize performance.
  • Support the category type.
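
For context on the "numpy array of objects" and "category type" items, here is a minimal sketch of what those two representations look like on the Python side. It uses only pandas and numpy and says nothing about how the scan would actually bind them; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A numpy array of objects: each element is a reference to an arbitrary
# Python object, so strings, None, and other values can share one array.
names = np.array(["Alice", "Bob", None], dtype=object)

# A categorical column: pandas stores small integer codes plus a lookup
# table of the distinct categories.
colors = pd.Series(["red", "blue", "red"], dtype="category")
codes = colors.cat.codes.to_numpy()        # integer code for each row
categories = list(colors.cat.categories)   # distinct labels, in category order

# Mapping each code back through the lookup table recovers the row values.
decoded = [categories[c] for c in codes]
```

A scan that supports the category type would presumably need to read both the codes and the category table, while an object array requires inspecting the Python objects themselves.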
@semihsalihoglu-uw added the apis and feature labels on Jan 8, 2024
@prrao87
Member

prrao87 commented Jan 22, 2024

@acquamarin and @ray6080, I'm wondering if this feature can be prioritized? I'd like to showcase the pandas DataFrame scan functionality via a Kùzu byte, but in my view this functionality is not very useful right now because it can only handle integer and float arrays (the example code below fails when you input a list of strings).

Also, from a usability and DevEx perspective, I think we should auto-cast Python lists to the corresponding numpy type, so that users don't have to manually specify numpy arrays as our docs currently show.

import kuzu
import pandas as pd

# Database path is a placeholder; any local directory works.
db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)

person = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6, 7],
        # "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Fred", "George"],
        "age": [42, 23, 33, 57, 67, 39, 11],
        "height": [167, 172, 183, 199, 149, 154, 165],
        "is_student": [False, True, False, False, False, False, True],
    }
)

# Scan the DataFrame by variable name and convert height from cm to inches.
result = conn.execute(
    """
    CALL READ_PANDAS("person")
    RETURN age as age, height / 2.54 as height_in_inch
    """
).get_as_df()

# The above code fails when we uncomment the string column ("name").

To make this a useful feature, I recommend that we allow the user to specify all inputs to a pandas DataFrame as Python lists, rather than manually converting them to numpy arrays on their end. If casting the Python values to a numpy array fails (e.g., if the user mistakenly passes a mixed-type list such as [1, "a"]), the error message should be informative enough to make clear that there's a type issue, so it might make sense to perform this check on the Python side.
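
As a rough illustration of the Python-side check described above, here is a minimal sketch. The helper name `lists_to_numpy` is hypothetical and not part of Kùzu's API, and the rule that only int/float mixtures are tolerated is an assumption made for the example.

```python
import numpy as np


def lists_to_numpy(columns):
    """Hypothetical helper: convert each Python list to a numpy array,
    rejecting mixed-type lists with a descriptive error."""
    arrays = {}
    for name, values in columns.items():
        kinds = {type(v) for v in values}
        # Assumption: int + float is an acceptable numeric column;
        # any other mixture of types is treated as a user mistake.
        if len(kinds) > 1 and kinds != {int, float}:
            found = ", ".join(sorted(k.__name__ for k in kinds))
            raise TypeError(
                f"column {name!r} mixes Python types ({found}); "
                "each column must hold values of a single type"
            )
        arrays[name] = np.asarray(values)
    return arrays


person = lists_to_numpy(
    {
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "is_student": [False, True, False],
    }
)
```

Calling `lists_to_numpy({"bad": [1, "a"]})` raises a `TypeError` that names the offending column and the types found, which is the kind of message a user could act on before the data ever reaches the scan.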

@ray6080
Contributor

ray6080 commented Jan 23, 2024

@prrao87 Yeah, I totally agree. We should prioritize this to make the feature more usable. I think we can discuss with @acquamarin how much needs to be done and when we can schedule it so that it lands before the next release.

5 participants