-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert Cardinality Requirement for Histograms #301
Conversation
lux/vislib/altair/Histogram.py
Outdated
@@ -53,7 +53,7 @@ def initialize_chart(self): | |||
|
|||
# Default when bin too small | |||
if markbar < (x_range / 24): | |||
markbar = (x_max - x_min) / 12 | |||
markbar = abs(x_max - x_min) / 12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there ever be a case such that calling abs
is necessary? By construction, shouldn't x_max
be greater than or equal to x_min
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I added this as a final check for integer overflow so that at least our code doesn't error and it just means the user inputs are too extreme (positive or negative) to handle. @dorisjlee should we even worry about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think xmax will always be larger than xmin.
Codecov Report
@@ Coverage Diff @@
## master #301 +/- ##
==========================================
+ Coverage 80.96% 80.98% +0.01%
==========================================
Files 50 50
Lines 3615 3618 +3
==========================================
+ Hits 2927 2930 +3
Misses 688 688
Continue to review full report at Codecov.
|
c | ||
for c in ldf.columns | ||
if ldf.data_type[c] == "quantitative" and ldf.cardinality[c] > 5 and c != "Number of Records" | ||
c for c in ldf.columns if ldf.data_type[c] == "quantitative" and c != "Number of Records" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, does ldf.cardinality[c]
return the number of rows for a given column? In this case, I wonder if removing this check would lead to some irregular behavior when there are very few rows. Is there some other part in the code that checks for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no. it returns the unique number of values -- we only compute histograms for dataframes > 5 rows.
Thanks @jinimukh! Could you explain in the PR what is the benefits of removing the cardinality check? In particular, what was the issue the original implementation did for columns with cardinality lower than 5? Would the charts simply not display or is there an error? Also this change seems like it was applied to both bars and histograms (for both quantitative and nominal in univariate.py)? If that's the case, we can update the description accordingly. |
@dorisjlee If we had cardinality lower than 5, the histogram/bar graph would not show. We talked about this with respect to series — if they all have the same value, I think it would be nice to show a uniform distribution. Was there an initial reason to not even compute it in the first place other than having one (or a few) less visualizations to compute? |
Yes, in this case what you are showing i.e., the uniform distribution (as a bar chart) makes the most sense. I think the original concern was that if it was a single column with floats, i.e. Units = 4.0, it might be recognized as a quantitative column --> plotting a histogram where the range is 0 causes errors (I believe we already fixed this in #99). Could you add a few low cardinality test cases case for each of the data types (quantitative, nominal, etc.) that you modified part of the PR? |
Thanks @jinimukh! |
Overview
Now that we have a new mechanism of finding bin widths, we can remove the cardinality requirement that was put into place to fix a divide-by-zero error that occurred if the range of values was 0.
Changes
univariate.py
histogram.py
Example Output
Now, we can see uniform distributions. Here is an example: