Add Nodes in Slice Before Printing Tree #181

michaelmckinsey1 · 2024-06-24T17:22:20Z

This PR applies when fill_perfdata is set to False in the Thicket constructor, in which case some profiles may be missing certain nodes. It inserts NaNs for the missing nodes so that the tree can still be printed.

* Allows tree to be printed when fill_perfdata=False
* Moves the _fill_perfdata function to utils.py so it can be called in tree
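
For readers unfamiliar with the idea, here is a minimal sketch of what "inserting NaNs for missing nodes" can look like in pandas. It is not Thicket's actual _fill_perfdata; the helper name fill_missing_rows, the (node, profile) index layout, and the sample data are assumptions made for illustration.

```python
import pandas as pd


def fill_missing_rows(df):
    """Return a new DataFrame with one row per index combination, NaN where data is absent."""
    # Collect the distinct values of every index level (assumed here: node, profile).
    level_values = [
        df.index.get_level_values(i).unique() for i in range(df.index.nlevels)
    ]
    # The Cartesian product lists every (node, profile) pair a complete table would have.
    full_index = pd.MultiIndex.from_product(level_values, names=df.index.names)
    # reindex returns a new DataFrame; rows absent from df are filled with NaN.
    return df.reindex(full_index)


# Hypothetical data: node "foo" was not recorded in profile 2.
perf = pd.DataFrame(
    {"time": [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [("main", 1), ("main", 2), ("foo", 1)], names=["node", "profile"]
    ),
)
filled = fill_missing_rows(perf)
```

Because reindex returns a new object, the original DataFrame keeps its three rows, while the filled one gains a ("foo", 2) row whose time is NaN.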

michaelmckinsey1 · 2024-06-25T21:08:32Z

Error found from #182

ilumsden

Mostly looks good. Just a couple of minor changes to ensure good performance and prevent unintended modification of the Thicket data structure.

ilumsden · 2024-06-26T22:04:14Z

thicket/thicket.py

        # Slices the DataFrame to simulate a single-level index
        try:
+            # _fill_perfdata to make sure number of nodes in df == graph
+            slice_df = _fill_perfdata(self.dataframe)


Before calling _fill_perfdata, I would filter the dataframe down to just the columns that will be involved in the printing. Otherwise, you could be filling in the entire dataframe when you only need to fill in one or two columns.

TL;DR: checking for these columns is not pretty, but it gives up to a 20% performance improvement.

I tried this in 650007a. Selecting the columns is not trivial, because tree takes four different DataFrame column arguments, the default columns may not exist in the DataFrame, and metric_column can be a list or a str.
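
The three complications listed above can be handled in one small helper. The sketch below is illustrative only and is not the code from 650007a; the function name select_tree_columns, the default values, and the sample column names are assumptions.

```python
def select_tree_columns(
    available,
    metric_column=None,
    annotation_column=None,
    name_column="name",
    context_column="file",
):
    """Return the tree-related columns that actually exist in `available`."""
    # metric_column may be a single name or a list of names; normalize to a list.
    metrics = metric_column if isinstance(metric_column, list) else [metric_column]
    candidates = metrics + [annotation_column, name_column, context_column]
    selected = []
    for col in candidates:
        # Skip unset arguments, defaults that are not present, and duplicates.
        if col is not None and col in available and col not in selected:
            selected.append(col)
    return selected


# Hypothetical DataFrame columns; "file" (a default) is not among them.
columns = ["name", "time", "flops", "extra_1", "extra_2"]
one_metric = select_tree_columns(columns, metric_column="time")
two_metrics = select_tree_columns(columns, metric_column=["time", "flops"])
```

Only the selected columns would then need to be passed to the fill step, which is where the savings come from when the DataFrame is wide.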

I collected some performance numbers below for this specific change. I tried to make up examples where _fill_perfdata would be doing significant work.

Testing how long Thicket.tree takes to return (without printing)

1 LBANN file + adding arbitrary columns to stress test (6558 nodes, 110 cols)

without change (3 trials):

3.34s, 3.36s, 3.45s

with 650007a (3 trials):

2.9s, 2.8s, 3.16s

Contrived example with two different datasets (1 LBANN + 16 AMG) + adding arbitrary columns (6575 nodes, 114 columns)

without: 3.25s
with 650007a: 2.56s
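
As a quick check of the summary figure, the relative improvement can be computed directly from the timings reported above. This is plain arithmetic on those numbers; the variable and function names are made up for the example.

```python
from statistics import mean


def relative_improvement(before, after):
    """Fraction of the original runtime that was saved."""
    return (before - after) / before


# 1 LBANN file, 3 trials each (seconds, as reported above).
lbann_before = mean([3.34, 3.36, 3.45])
lbann_after = mean([2.9, 2.8, 3.16])
lbann_gain = relative_improvement(lbann_before, lbann_after)

# 1 LBANN + 16 AMG, single measurement each (seconds).
mixed_gain = relative_improvement(3.25, 2.56)

print(f"LBANN only: {lbann_gain:.1%}, LBANN + AMG: {mixed_gain:.1%}")
```

This works out to roughly 13% for the single-dataset case and roughly 21% for the mixed case, which matches the "up to 20%" summary.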

thicket/utils.py

ilumsden

LGTM.

However, @pearce8, when you do your review, you should take a look at the comment @michaelmckinsey1 left regarding performance.

ilumsden · 2024-07-02T21:07:59Z

thicket/utils.py

+    Returns:
+        (DataFrame): filled DataFrame
+    """
+    new_df = df.copy()


This is fine for now, although it may cause issues in the future if we ever start including data beyond POD. That's because DataFrame.copy with default arguments performs a shallow copy. So, any by-reference objects in new_df that are edited would be edited in df as well.

However, we currently don't deal with by-reference objects (besides the nodes, but that can be safely ignored in this case), and I can't see any scenario in the near future where we would. So, we should keep this in mind, but I don't think anything has to be done about this in this PR.

Would the solution you propose be our own function for deepcopying Pandas DataFrames?

No, the solution would be new_df = df.copy(deep=True). However, don't worry about adding that in this PR. We'll only run into this if we make pretty major changes to the types of data being stored in the dataframe. I can't see that happening in the foreseeable future.

I thought you were referring to the fact that deep=True does not copy recursively. deep=True is the default so that change wouldn't do anything.

However, copy.deepcopy is apparently recursive.
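
To make the "deep=True does not copy recursively" point concrete, here is a small sketch. It assumes a column holding mutable Python objects (lists), which is not something Thicket stores today; the column names and the per-element workaround are illustrative assumptions, not a proposed change.

```python
import copy

import pandas as pd

# Hypothetical DataFrame with a by-reference object (a list) in one cell.
df = pd.DataFrame({"name": ["main"], "payload": [[1, 2]]})

# deep=True copies the DataFrame's data, but object cells still point at
# the same Python objects as the original.
deep = df.copy(deep=True)
shares_objects = deep.at[0, "payload"] is df.at[0, "payload"]

# One way to fully isolate such a column: deep-copy each element explicitly.
isolated = df.copy(deep=True)
isolated["payload"] = isolated["payload"].map(copy.deepcopy)
isolated.at[0, "payload"].append(3)  # does not touch df's list
```

With plain numeric or string data, which is what the thread describes, none of this matters; it only becomes relevant if mutable objects are ever stored in cells.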

ilumsden · 2024-07-02T21:09:58Z

thicket/thicket.py

+                        context_column,
+                    ]
+                    if col in self.dataframe.columns
+                ]


If you really want to shorten the code (at the expense of some readability), you could do this:

    df_cols = [
        col
        for col in [
            *metric_column if isinstance(metric_column, list) else metric_column,
            annotation_column,
            name_column,
            context_column,
        ]
        if col in self.dataframe.columns
    ]

*metric_column if isinstance(metric_column, list) else metric_column isn't valid syntax here. The closest I can get is:

    tree_cols = (
        metric_column if isinstance(metric_column, list) else [metric_column]
    ) + [
        annotation_column,
        name_column,
        context_column,
    ]
    df_cols = [col for col in tree_cols if col in self.dataframe.columns]

Huh. You learn something new every day. Apparently, the * unpack operator has to apply to the entire expression. You could rewrite what I have above as the following, and it should work:

    df_cols = [
        col
        for col in [
            *(metric_column if isinstance(metric_column, list) else [metric_column]),
            annotation_column,
            name_column,
            context_column,
        ]
        if col in self.dataframe.columns
    ]

This works because the * operator is applied to the result of the whole ternary expression. To make sure that works correctly, I wrap the value in the else clause in square brackets to force it to be a list that can be unpacked by *. Note that you have to use [] for this. If you use the list() constructor instead, the column name itself will be split into a list of characters. So, if the column name is "time", using [] produces ["time"], but using list() produces ["t", "i", "m", "e"].
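
The behavior described above is easy to check in isolation. The snippet below is a standalone sketch; the function name tree_columns and the sample column names are placeholders, not part of Thicket.

```python
def tree_columns(metric_column, annotation_column, name_column, context_column):
    """Build the candidate column list, accepting a str or a list for metric_column."""
    return [
        # The parentheses make * apply to the result of the whole ternary.
        *(metric_column if isinstance(metric_column, list) else [metric_column]),
        annotation_column,
        name_column,
        context_column,
    ]


single = tree_columns("time", "note", "name", "file")
multiple = tree_columns(["time", "flops"], "note", "name", "file")

# [] wraps the string as one element; list() iterates over its characters.
wrapped = ["time"]
split = list("time")
```

Both call styles produce a flat list of column names, and the last two lines show why the else branch uses [] rather than list().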

works for me 7d8b397

* Fill perfdata to be able to print tree
* Undo black change
* Undo black change
* Make copy of df
* select only necessary columns
* Shorten code
* Shorten code
* explicitly set default value

---------

Co-authored-by: Michael Richard Mckinsey <mckinsey@quartz764.llnl.gov>
Co-authored-by: Michael Richard Mckinsey <mckinsey@quartz1154.llnl.gov>

michaelmckinsey1 added the area-visualization, priority-normal, status-work-in-progress, and type-bug labels Jun 24, 2024

michaelmckinsey1 self-assigned this Jun 24, 2024

michaelmckinsey1 added the status-ready-for-review label and removed the status-work-in-progress label Jun 26, 2024

michaelmckinsey1 requested a review from ilumsden June 26, 2024 20:55

ilumsden requested changes Jun 26, 2024

michaelmckinsey1 requested a review from ilumsden July 2, 2024 01:47

ilumsden approved these changes Jul 2, 2024

ilumsden added the status-approved label and removed the status-ready-for-review label Jul 2, 2024

michaelmckinsey1 mentioned this pull request Jul 9, 2024

Run Unit Tests for Different Parameters #182

Merged

michaelmckinsey1 and others added 8 commits July 9, 2024 17:29

Fill perfdata to be able to print tree

04e673f

Undo black change

6a7892c

Undo black change

3962db5

Make copy of df

9bea5ab

select only necessary columns

5402fea

Shorten code

699d38b

Shorten code

5944b48

explicitly set default value

e67deb4

michaelmckinsey1 force-pushed the fix-tree_nofill branch from 34198e9 to e67deb4 Compare July 9, 2024 22:29

pearce8 approved these changes Jul 10, 2024

pearce8 merged commit 52b3c4a into llnl:develop Jul 10, 2024

michaelmckinsey1 mentioned this pull request Jul 22, 2024

Fix Indexing Issue in Tree for MultiIndex #197

Merged

slabasan added this to the 2024.2.0 milestone Sep 6, 2024
