Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added implementation of nunique function #29

Merged
merged 10 commits into from
Jan 22, 2024

Conversation

chraberturas
Copy link

@chraberturas chraberturas commented Jan 16, 2024

Feature

What does this change introduce?

An implementation of the nunique function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html

This is intended to be a 1:1 implementation of the nunique function from pandas.

There are two changes:

  • This implementation returns a dictionary instead of a Series.
  • In accordance with the correspondence between q nulls and Python nulls established in the .pd conversion, the empty string is regarded as a null value for columns of type symbol.

General

  • Has an example been added to demo the new feature?
  • Have existing examples been updated or tested?
  • Have you added any new Environment Variables/Configuration Options? If yes please tick the boxes below as applicable
    • Addition to reimporter logic within src/pykx/pykx.q and src/pykx/reimporter.py
    • Have updated the src/pykx/util.py logic which is used for environment variable
  • If there have been any dependency updates have they been reflected in all files?
    • pyproject.toml
    • docs/getting-started/installing.md
    • conda-recipe/meta.yaml
    • README.md
  • If any examples have been updated has it's associated .zip been updated

Code

  • Has all temporary code used during development been removed?
  • Has all commented out (unused) code been removed?
  • Where reasonable have you ensured there is no duplication of existing code?
  • If applicable for your use-case have you ensured that the code is performant?

Testing

  • Have unit tests been created or existing ones updated to test this new functionality?

Documentation

  • Has documentation been added for all public code?
  • Has a release note been included for the new feature?
  • Has any documentation which would benefit from this feature been updated to use the most up to date functionality?
  • If a new class has been added has a documentation stub .md file associated with it been created?
  • If any documentation page has been created has it been added to mkdocs.yml
  • Have you checked your changes with a spell checker? (US English)

@chraberturas chraberturas added documentation Improvements or additions to documentation python tests work in progress Working on it labels Jan 16, 2024
@chraberturas chraberturas added this to the Pandas API 2nd Block milestone Jan 16, 2024
@chraberturas chraberturas self-assigned this Jan 16, 2024
@chraberturas chraberturas added Ready to review and removed work in progress Working on it labels Jan 16, 2024
@chraberturas chraberturas marked this pull request as ready for review January 16, 2024 14:26
@chraberturas chraberturas linked an issue Jan 16, 2024 that may be closed by this pull request
Copy link

@nipsn nipsn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall OK, but there are some adjustments that I think would need to be made before merging.

Also, some interesting flags have been raised in regards to mixed value columns that will need addressing at some point.

src/pykx/pandas_api/pandas_meta.py Outdated Show resolved Hide resolved
tests/test_pandas_api.py Outdated Show resolved Hide resolved
tests/test_pandas_api.py Outdated Show resolved Hide resolved
tests/test_pandas_api.py Outdated Show resolved Hide resolved
docs/user-guide/advanced/Pandas_API.ipynb Show resolved Hide resolved
Copy link

@nipsn nipsn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Your solution to type checking on columns looks more robust than what I proposed originally.

@nipsn
Copy link

nipsn commented Jan 18, 2024

Just spoke with the PyKX team. They told me that for this specific case, they would expect the behavior to be closer to what in q would be count distinct.

I think that as it stands right now the implementation follows these lines pretty closely. However, I think that in case we had a mixed type column we should no longer raise a NotImplementedError, so it closely resembles the q behavior. If it fails in q, it should fail in this version as well. We should also change the unit test that monitored this error to instead expect a QError.

TLDR: If we have several nulls of different types on a single column, they should all count as distinct values.

@chraberturas chraberturas merged commit abcc1b4 into feature/pandas-api-2nd-block Jan 22, 2024
1 check passed
tortolavivo23 pushed a commit that referenced this pull request Mar 8, 2024
* Added implementation of nunique function
* Added test for handling strings nulls (" "), differentiating behavior between Python and kdb+
* Suggested changes. Error with mixed lists and tests for this case.
* QError for mixed lists (suggested by Kx)
* minor: rename filternan (suggested)
---------

Co-authored-by: chraberturas <christian.aberturas@hablapps.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation python Ready to review tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Implement nunique from Pandas API
6 participants