@idanmoradarthas idanmoradarthas commented Nov 2, 2025

Overview

This pull request enhances the visualize_feature function in the ds_utils/preprocess.py module by introducing new parameters for displaying count labels on bar plots and customizing the order of categories. These changes address the limitations highlighted in issue #77, where users noted the difficulty in estimating exact counts from the y-axis in categorical visualizations. By replacing the Seaborn countplot with a custom Matplotlib-based implementation, we gain greater flexibility, including optional count annotations and sorting options. This improves readability, especially for high-cardinality data, and makes the visualizations more suitable for reports and presentations.

The updates also include dependency upgrades, a package version bump, refreshed documentation with examples, updated test cases, and new baseline images to reflect the changes.

Key Changes

New Features in visualize_feature

  • Added show_counts Parameter:

    • Type: Boolean (default: True)
    • Functionality: When enabled, displays the exact count values on top of each bar in count plots for categorical, object, boolean, and integer features.
    • This directly resolves the core request from issue #77 ("Add count labels to bar plots in visualize_feature method") by making frequencies explicit without relying on axis scales.
    • For backward compatibility, users can set show_counts=False to revert to the previous behavior.
    • Additional considerations: Counts are formatted with thousand separators (e.g., "742,454") for readability. Font sizes can be adjusted via Matplotlib if needed for dense plots.
  • Added order Parameter:

    • Type: String, List of strings, or None (default: None)
    • Functionality: Allows custom sorting of categories in bar plots.
      • String options:
        • "count_desc": Sort by descending count (most frequent first).
        • "count_asc": Sort by ascending count.
        • "alpha_asc": Sort alphabetically in ascending order.
        • "alpha_desc": Sort alphabetically in descending order.
      • List: Provide an explicit list of category names for custom ordering.
    • If None, uses the default index order from value_counts().
    • This feature enhances data exploration by allowing users to focus on high/low frequency items or maintain a specific sequence.
  • Added ax Parameter:

    • Type: Matplotlib Axes (optional)
    • Functionality: Allows plotting on a user-provided Axes object for integration into larger figures.
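
The ordering rules listed above can be sketched in plain Python. This is only an illustration of the described behavior, not the code in ds_utils/preprocess.py: the names `resolve_order` and `VALID_ORDERS` are hypothetical, a `Counter` stands in for the result of value_counts(), and the handling of an explicit list (returned unchanged) is an assumption.

```python
from collections import Counter

# Hypothetical names for illustration; not part of ds_utils.
VALID_ORDERS = ("count_desc", "count_asc", "alpha_asc", "alpha_desc")


def resolve_order(counts, order=None):
    """Return category names in the sequence the bars would be drawn."""
    if order is None:
        # Keep whatever order the counts mapping already has.
        return list(counts)
    if isinstance(order, str):
        if order not in VALID_ORDERS:
            raise ValueError(
                f"Invalid order {order!r}; expected one of {VALID_ORDERS} or a list."
            )
        if order.startswith("count"):
            # Sort category names by their frequency.
            return sorted(counts, key=counts.get, reverse=(order == "count_desc"))
        # Sort category names alphabetically by their string form.
        return sorted(counts, key=str, reverse=(order == "alpha_desc"))
    # Explicit list: assumed to be used exactly as given.
    return list(order)


counts = Counter(["b", "a", "b", "c", "b", "a"])  # b: 3, a: 2, c: 1
print(resolve_order(counts, "count_asc"))
print(resolve_order(counts, "alpha_desc"))
print(resolve_order(counts, ["c", "a", "b"]))
```

Rejecting unknown strings up front mirrors the ValueError behavior that the tests in this PR check for.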

Code Refactoring

  • Introduced a new helper function _plot_count_bar in ds_utils/preprocess.py:

    • Handles the creation of bar charts from a pandas.Series of value counts.
    • Supports the order and show_counts logic, including sorting the series and annotating bars with ax.bar_label().
    • Replaces the direct call to sns.countplot, so this plot no longer depends on Seaborn and is easier to maintain and customize.
  • Updated visualize_feature to use _plot_count_bar for relevant feature types (categorical, object, boolean, integer).

  • Minor adjustments for handling high-cardinality features (e.g., limiting to top 10 categories with a warning, as before).
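
As a rough picture of what a helper like this does, here is a minimal Matplotlib sketch. It is not the actual `_plot_count_bar`: the function name `plot_count_bar`, its signature, and the use of plain lists instead of a pandas.Series are assumptions made to keep the example self-contained. It draws the bars with `ax.bar`, then passes pre-formatted strings to `ax.bar_label` so the counts appear with thousands separators.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_count_bar(categories, counts, show_counts=True, ax=None):
    """Hypothetical stand-in for the PR's helper; draws one bar per category."""
    if ax is None:
        # Create a fresh Axes when the caller does not supply one.
        _, ax = plt.subplots()
    bars = ax.bar([str(c) for c in categories], counts)
    if show_counts:
        # Pre-format the labels so large counts read as e.g. "742,454".
        ax.bar_label(bars, labels=[f"{n:,}" for n in counts])
    ax.set_ylabel("Count")
    return ax


# Draw the same data with and without labels on caller-provided Axes.
fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
plot_count_bar(["yes", "no"], [742454, 1200], ax=left)
plot_count_bar(["yes", "no"], [742454, 1200], show_counts=False, ax=right)
```

Accepting an optional `ax` is what lets the plot sit inside a larger figure, as in the two-panel usage at the end of the sketch.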

Documentation Updates

  • README.md:
    • Expanded the section on visualize_feature to describe the new parameters (show_counts, order).
    • Updated image links to new baselines.
  • docs/source/preprocess.rst:
    • Detailed docstring updates for visualize_feature, including parameter descriptions and behavior notes.
    • Added examples for high-cardinality handling and integration with remove_na.
    • Refreshed image references for float, datetime, integer, categorical, and boolean visualizations to show count labels.

Testing Enhancements

  • Added parameterized tests in relevant test files:
    • Validate show_counts=True/False outputs.
    • Test various order values, including string options, lists, and invalid inputs (raises ValueError with a clear message).
    • Cover edge cases like high-cardinality data and different feature types.
  • Adjusted plot sizes in tests for better visual comparison.
  • Updated baseline images to match new plot formats with labels and ordering.
  • Covered _plot_count_bar behavior indirectly, through the new visualize_feature tests.

Testing and Validation

  • All tests pass with the new changes.
  • Visual inspections confirm count labels appear correctly and ordering applies as expected.
  • Backward compatibility: Setting show_counts=False and order=None reproduces the original behavior.

Related Issue

Closes #77 by adding the requested count labels and extending functionality with ordering options for better usability.

This PR improves the overall utility of the DataScienceUtils package for data exploration and visualization tasks. Feedback welcome!

- Bump pyarrow to version 22.0.0 and ruff to version 0.14.3 in pyproject.toml.
- Increment package version to 1.10.0rc2 in __init__.py.
- Adjust version test in test_version.py to reflect the new version.
- Introduced a new parameter `show_counts` to the `visualize_feature` function, allowing users to display count values on top of bars in count plots.
- Replaced the seaborn countplot with a custom matplotlib bar plot for improved flexibility and control over the visualization.
- Updated documentation to reflect the new parameter and its functionality.
- Added an `order` parameter to the `visualize_feature` function, allowing users to specify the order of categorical levels in count plots.
- Updated documentation to detail the new ordering options, including sorting by count and alphabetical order, as well as accepting explicit lists.
- Modified tests to validate the new ordering functionality for various feature types.
- Adjusted plot sizes in tests for better visualization of results.
- Changed the boolean feature visualization image in README.md and preprocess documentation to reflect the new count display format.
- Updated the test for boolean feature visualization to use the new parameter and added parameterization for testing both display options.
- Removed the old boolean visualization image as it is no longer needed.
- Replaced outdated visualization images in README.md and preprocess documentation with new images reflecting updated float, integer, datetime, and categorical feature visualizations.
- Modified tests to accommodate new visualization formats and added parameterization for object and category features.
- Removed obsolete image files that are no longer needed.
- Updated the `visualize_feature` function to sort value counts in descending order when the `order` parameter is set to "count_desc".
- Added a new test to validate the `visualize_feature` function with various ordering options, including "count_desc", "count_asc", and "alpha_asc".
- Adjusted plot size in the test for improved visualization.
- Introduced a new test to validate the behavior of the `visualize_feature` function when provided with an invalid order parameter, ensuring it raises a ValueError with an appropriate message.
- This enhances the robustness of the feature visualization by confirming that incorrect inputs are properly managed.
- Introduced a new test to validate the `visualize_feature` function when provided with a list of order parameters, enhancing the testing coverage for ordering functionality.
- Adjusted plot size in the test for improved visualization consistency.
- This update ensures that the function behaves correctly with multiple ordering configurations.
- Updated README.md and preprocess documentation to include detailed descriptions of the `visualize_feature` function's capabilities, particularly for handling high-cardinality categorical features and customizing sorting and count display options.
- Added examples demonstrating the use of the `remove_na`, `show_counts`, and `order` parameters for various feature types.
- Improved formatting and clarity in the documentation to better guide users in utilizing the visualization features effectively.
- Adjusted the plot size in the `test_visualize_feature_float_datetime_int` test to improve visualization clarity, changing the height from 8 to 11 inches.
- Updated the corresponding baseline image to reflect this change in the test output.
- Introduced a new helper function `_plot_count_bar` to streamline the creation of bar charts for categorical data, allowing for customizable ordering and optional count labels.
- Refactored the `visualize_feature` function to utilize `_plot_count_bar`, enhancing code clarity and maintainability.
- This update improves the flexibility of visualizations by consolidating bar plotting logic into a dedicated function.
@idanmoradarthas idanmoradarthas self-assigned this Nov 2, 2025
@idanmoradarthas idanmoradarthas linked an issue Nov 2, 2025 that may be closed by this pull request
@idanmoradarthas idanmoradarthas changed the title from "77 add count labels to bar plots in visualize feature method" to "Add count labels and ordering options to visualize_feature bar plots" Nov 2, 2025
@idanmoradarthas idanmoradarthas merged commit 719873a into master Nov 2, 2025
40 checks passed
@idanmoradarthas idanmoradarthas deleted the 77-add-count-labels-to-bar-plots-in-visualize_feature-method branch November 2, 2025 04:15