Revert Cardinality Requirement for Histograms #301

Merged 23 commits on Mar 22, 2021
lux/action/univariate.py: 3 additions & 9 deletions
```diff
@@ -46,9 +46,7 @@ def univariate(ldf, *args):
     ignore_rec_flag = False
     if data_type_constraint == "quantitative":
         possible_attributes = [
-            c
-            for c in ldf.columns
-            if ldf.data_type[c] == "quantitative" and ldf.cardinality[c] > 5 and c != "Number of Records"
+            c for c in ldf.columns if ldf.data_type[c] == "quantitative" and c != "Number of Records"
```
Member: Just to clarify, does ldf.cardinality[c] return the number of rows for a given column? If so, I wonder whether removing this check could lead to irregular behavior when there are very few rows. Is there some other part of the code that checks for this?

Member (Author): No, it returns the number of unique values; we only compute histograms for dataframes with more than 5 rows.

```diff
         ]
         intent = [lux.Clause(possible_attributes)]
         intent.extend(filter_specs)
```
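To make the reply concrete, here is a minimal sketch of the row-count vs. cardinality distinction (assuming, per the reply above, that cardinality counts unique values the way pandas' nunique does):

```python
import pandas as pd

df = pd.DataFrame({"Cylinders": [4, 4, 6, 8, 4, 6, 8, 4]})

print(len(df))                    # 8 -- the row count
print(df["Cylinders"].nunique())  # 3 -- the cardinality (unique values)

# Under the old check (cardinality > 5), a quantitative column like this
# was excluded from histogram recommendations no matter how many rows
# the dataframe contained.
```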
```diff
@@ -65,9 +63,7 @@ def univariate(ldf, *args):
         ignore_rec_flag = True
     elif data_type_constraint == "nominal":
         possible_attributes = [
-            c
-            for c in ldf.columns
-            if ldf.data_type[c] == "nominal" and ldf.cardinality[c] > 5 and c != "Number of Records"
+            c for c in ldf.columns if ldf.data_type[c] == "nominal" and c != "Number of Records"
         ]
         examples = ""
         if len(possible_attributes) >= 1:
@@ -81,9 +77,7 @@
         }
     elif data_type_constraint == "geographical":
         possible_attributes = [
-            c
-            for c in ldf.columns
-            if ldf.data_type[c] == "geographical" and ldf.cardinality[c] > 5 and c != "Number of Records"
+            c for c in ldf.columns if ldf.data_type[c] == "geographical" and c != "Number of Records"
         ]
         examples = ""
         if len(possible_attributes) >= 1:
```
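For context, the surviving attribute list is wrapped in a lux.Clause, which acts as a wildcard over the listed columns: lux enumerates one visualization per attribute. A rough usage sketch, with the dataset URL and column names borrowed from lux's example car dataset (illustrative, not part of this PR):

```python
import pandas as pd
import lux  # importing lux attaches the recommendation machinery to DataFrames

df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/car.csv")

# A Clause over several attributes means "any one of these"; lux fans it
# out into one chart per column, which is how the univariate action turns
# possible_attributes into a histogram per quantitative column.
df.intent = [lux.Clause(["MilesPerGal", "Weight", "Horsepower"])]
df._repr_html_()  # in a notebook, simply displaying df renders the recommendations
```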
lux/vislib/altair/Histogram.py: 1 addition & 1 deletion
```diff
@@ -53,7 +53,7 @@ def initialize_chart(self):

         # Default when bin too small
         if markbar < (x_range / 24):
-            markbar = (x_max - x_min) / 12
+            markbar = abs(x_max - x_min) / 12
```
Member: Would there ever be a case where calling abs is necessary? By construction, shouldn't x_max be greater than or equal to x_min?

Member (Author): I agree. I added this as a final safeguard against integer overflow, so that our code at least doesn't error; it just means the user's inputs are too extreme (positive or negative) to handle. @dorisjlee, should we even worry about this?

Member: I think x_max will always be larger than x_min.

```diff
         self.data = AltairChart.sanitize_dataframe(self.data)
         end_attr_abv = str(msr_attr.attribute) + "_end"
```
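A sketch of the overflow scenario the author alludes to, assuming the column extrema arrive as numpy int64 scalars (as pandas typically yields for integer columns): int64 subtraction wraps around rather than raising, so an extreme spread can come out negative, and abs at least keeps the bin width positive. Recent numpy versions also emit a RuntimeWarning at the wrap-around.

```python
import numpy as np

x_max = np.int64(9_000_000_000_000_000_000)   # near the int64 maximum
x_min = np.int64(-9_000_000_000_000_000_000)

# The true spread (1.8e19) exceeds the int64 range (~9.2e18), so the
# subtraction wraps around to a negative number.
spread = x_max - x_min
print(spread)            # -446744073709551616

print(spread / 12)       # negative bin width -> a broken histogram
print(abs(spread) / 12)  # abs() at least keeps the width positive
```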
tests/test_pandas_coverage.py: 1 addition & 1 deletion
```diff
@@ -257,7 +257,7 @@ def test_transform(global_var):
     df["Year"] = pd.to_datetime(df["Year"], format="%Y")
     new_df = df.iloc[:, 1:].groupby("Origin").transform(sum)
     new_df._repr_html_()
-    assert list(new_df.recommendation.keys()) == ["Correlation", "Occurrence"]
+    assert list(new_df.recommendation.keys()) == ["Correlation", "Distribution", "Occurrence"]
     assert len(new_df.cardinality) == 7
```

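The updated expectation follows directly from the revert: after a groupby-transform, each numeric column holds one aggregate per group, repeated across rows, so its cardinality equals the number of groups. A toy illustration (the dataframe below is made up; the test itself uses the car dataset, whose Origin column has three groups):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Origin": ["US", "US", "Japan", "Japan", "Europe"],
        "Horsepower": [130, 165, 95, 97, 110],
    }
)

new_df = df.groupby("Origin").transform("sum")
print(new_df["Horsepower"].nunique())  # 3 -- one group sum per Origin

# Under the old cardinality > 5 check, such columns were filtered out and
# no histogram action ran; with the check reverted they qualify again,
# so "Distribution" now appears alongside "Correlation" and "Occurrence".
```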