Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert Cardinality Requirement for Histograms #301

Merged
merged 23 commits into from
Mar 22, 2021

Conversation

jinimukh
Copy link
Member

Overview

Now that we have a new mechanism of finding bin widths, we can remove the cardinality requirement that was put into place to fix a divide-by-zero error that occurred if the range of values was 0.

Changes

  • remove cardinality requirement in univariate.py
  • take the absolute value for the default markbar in histogram.py
  • adjust tests to current output

Example Output

Now, we can see uniform distributions. Here is an example:

Screen Shot 2021-03-11 at 11 33 23 PM

@@ -53,7 +53,7 @@ def initialize_chart(self):

# Default when bin too small
if markbar < (x_range / 24):
markbar = (x_max - x_min) / 12
markbar = abs(x_max - x_min) / 12
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there ever be a case such that calling abs is necessary? By construction, shouldn't x_max be greater than or equal to x_min?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I added this as a final check for integer overflow so that at least our code doesn't error and it just means the user inputs are too extreme (positive or negative) to handle. @dorisjlee should we even worry about this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think xmax will always be larger than xmin.

@codecov
Copy link

codecov bot commented Mar 12, 2021

Codecov Report

Merging #301 (6fa19d7) into master (127806f) will increase coverage by 0.01%.
The diff coverage is 88.88%.

❗ Current head 6fa19d7 differs from pull request most recent head 853844a. Consider uploading reports for the commit 853844a to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #301      +/-   ##
==========================================
+ Coverage   80.96%   80.98%   +0.01%     
==========================================
  Files          50       50              
  Lines        3615     3618       +3     
==========================================
+ Hits         2927     2930       +3     
  Misses        688      688              
Impacted Files Coverage Δ
lux/interestingness/interestingness.py 86.48% <0.00%> (ø)
lux/action/univariate.py 90.38% <100.00%> (ø)
lux/executor/PandasExecutor.py 96.01% <100.00%> (+0.04%) ⬆️
lux/vislib/altair/Histogram.py 89.58% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 127806f...853844a. Read the comment docs.

c
for c in ldf.columns
if ldf.data_type[c] == "quantitative" and ldf.cardinality[c] > 5 and c != "Number of Records"
c for c in ldf.columns if ldf.data_type[c] == "quantitative" and c != "Number of Records"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, does ldf.cardinality[c] return the number of rows for a given column? In this case, I wonder if removing this check would lead to some irregular behavior when there are very few rows. Is there some other part in the code that checks for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. it returns the unique number of values -- we only compute histograms for dataframes > 5 rows.

@dorisjlee
Copy link
Member

Thanks @jinimukh! Could you explain in the PR what is the benefits of removing the cardinality check? In particular, what was the issue the original implementation did for columns with cardinality lower than 5? Would the charts simply not display or is there an error? Also this change seems like it was applied to both bars and histograms (for both quantitative and nominal in univariate.py)? If that's the case, we can update the description accordingly.

@jinimukh
Copy link
Member Author

@dorisjlee If we had cardinality lower than 5, the histogram/bar graph would not show. We talked about this with respect to series — if they all have the same value, I think it would be nice to show a uniform distribution. Was there an initial reason to not even compute it in the first place other than having one (or a few) less visualizations to compute?

@dorisjlee
Copy link
Member

dorisjlee commented Mar 15, 2021

Yes, in this case what you are showing i.e., the uniform distribution (as a bar chart) makes the most sense. I think the original concern was that if it was a single column with floats, i.e. Units = 4.0, it might be recognized as a quantitative column --> plotting a histogram where the range is 0 causes errors (I believe we already fixed this in #99). Could you add a few low cardinality test cases case for each of the data types (quantitative, nominal, etc.) that you modified part of the PR?

@dorisjlee dorisjlee merged commit 950eba6 into lux-org:master Mar 22, 2021
@dorisjlee
Copy link
Member

Thanks @jinimukh!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants