-
Notifications
You must be signed in to change notification settings - Fork 9
Add Nodes in Slice Before Printing Tree #181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Error found from #182 |
ilumsden
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good. Just a couple of minor changes to ensure good performance and prevent unintended modification of the Thicket data structure.
thicket/thicket.py
Outdated
| # Slices the DataFrame to simulate a single-level index | ||
| try: | ||
| # _fill_perfdata to make sure number of nodes in df == graph | ||
| slice_df = _fill_perfdata(self.dataframe) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before calling _fill_perfdata, I would filter the dataframe down to just the columns that will be involved in the printing. Otherwise, you could be filling in the entire dataframe when you only need to fill in one or two columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tldr: checking for these columns is not pretty, but there is up to 20% performance improvement
I tried this in 650007a. It is not trivial to select the columns, because there are 4 different dataframe columns as arguments to tree, default arguments may not exist, and metric_column can be a list or a str.
I collected some performance numbers below for this specific change. I tried to make up examples where _fill_perfdata would be doing significant work.
Testing how long Thicket.tree takes to return (without printing)
1 LBANN file + adding arbitrary columns to stress test (6558 nodes, 110 cols)
without change (3 trials):
- 3.34s, 3.36s, 3.45s
with 650007a (3 trials):
- 2.9s, 2.8s, 3.16s
Contrived example with two different datasets (1 LBANN + 16 AMG) + adding arbitrary columns (6575 nodes, 114 columns)
without: 3.25s
with 650007a: 2.56s
ilumsden
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
However, @pearce8, when you do your review, you should take a look at the comment @michaelmckinsey1 left regarding performance.
thicket/utils.py
Outdated
| Returns: | ||
| (DataFrame): filled DataFrame | ||
| """ | ||
| new_df = df.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now, although it may cause issues in the future if we ever start including data beyond POD. That's because DataFrame.copy with default arguments performs a shallow copy. So, any by-reference objects in new_df that are edited would be edited in df as well.
However, we currently don't deal with by-reference objects (besides the nodes, but that can be safely ignored in this case), and I can't see any scenario in the near future where we would. So, we should keep this in mind, but I don't think anything has to be done about this in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the solution you propose be our own function for deepcopying Pandas DataFrames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the solution would be new_df = df.copy(deep=True). However, don't worry about adding that in this PR. We'll only run into this is we make pretty major changes to the types of data being stored in the dataframe. I can't see that happening in the foreseeable future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you were referring to the fact that deep=True does not copy recursively. deep=True is the default so that change wouldn't do anything.
However, copy.deepcopy is apparently recursive.
| context_column, | ||
| ] | ||
| if col in self.dataframe.columns | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you really want to shorten the code (at the expense of some readability), you could do this:
df_cols = [
col
for col in [
*metric_column if isinstance(metric_column, list) else metric_column,
annotation_column,
name_column,
context_column,
]
if col in self.dataframe.columns
]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*metric_column if isinstance(metric_column, list) else metric_column isn't valid syntax here. closest I can get is:
tree_cols = (
metric_column
if isinstance(metric_column, list)
else [metric_column]
+ [
annotation_column,
name_column,
context_column,
]
)
df_cols = [col for col in tree_cols if col in self.dataframe.columns]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh. You learn something new every day. Apparently, the * unpack operator has to apply to the entire expression. You could rewrite what I have above as the following, and it should work:
df_cols = [
col
for col in [
*(metric_column if isinstance(metric_column, list) else [metric_column]),
annotation_column,
name_column,
context_column,
]
if col in self.dataframe.columns
]This works because the * operator is applied to all results of the ternary. To make sure that works correctly, I wrap the value in the else clause in square braces to force it to be a list that can be unpacked by *. Note that you have to use [] for this. If you use the list() constructor instead, the contents of your column name will be turned into a list. So, if the column name is "time", using [] produces ["time"], but using list() produces ["t", "i", "m", "e"].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works for me 7d8b397
34198e9 to
e67deb4
Compare
* Fill perfdata to be able to print tree * Undo black change * Undo black change * Make copy of df * select only necessary columns * Shorten code * Shorten code * explicitly set default value --------- Co-authored-by: Michael Richard Mckinsey <mckinsey@quartz764.llnl.gov> Co-authored-by: Michael Richard Mckinsey <mckinsey@quartz1154.llnl.gov>
This PR applies when
fill_perfdatais off in the Thicket constructor, since some profiles may be missing certain nodes. This PR inserts NaNs for missing nodes, so we can still print the tree.fill_perfdata=False_fill_perfdatafunction toutils.pyso it can be called intree