
Take snapshots of graphframe #137

Closed · wants to merge 1 commit

Conversation

@slabasan (Collaborator) commented Apr 1, 2020

Adds a new gf.save() API to enable writing processed graphframes out to a file. This can be useful if a user doesn't want to wait for input data to be read in again, or wants to save a resulting graphframe after many operations.

One of the challenges here: given that all the readers have different formats, what is the minimum information we want to include in this file so as not to lose important data?
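As a minimal usage sketch of the intended workflow (from_hpctoolkit and drop_index_levels are existing hatchet APIs; save() is the call proposed in this PR, and the database path is a placeholder):

import hatchet as ht

# read the input data once (potentially slow), process it, then snapshot
gf = ht.GraphFrame.from_hpctoolkit("hpctoolkit-database")
gf.drop_index_levels()
gf.save(fname="hatchet-snapshot")   # writes hatchet-snapshot.json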

The current output format I'm proposing is similar to Caliper's JSON format:

{
    "data": [
        [0.0, 130.0, 0.0],
        [5.0, 20.0, 1.0],
        [5.0, 5.0, 2.0],
        [10.0, 10.0, 3.0]
    ],
    "columns": [
        "time",
        "time (inc)",
        "nid"
    ],
    "nodes": [
        {
            "name": "foo"
        },
        {
            "name": "bar",
            "parent": 0
        },
        {
            "name": "baz",
            "parent": 1
        },
        {
            "name": "grault",
            "parent": 1
        }
    ]
}

So far, I've looked at outputting a literal graphframe and an HPCToolkit graphframe. These are good to compare because they already surface the differences in the data each reader creates. The nodes are dumped by traversing the graph, outputting the frame attributes, and grabbing the _hatchet_nid of the parent (see the sketch after the lists below).
literal:

  • dataframe columns: 'name', 'time', 'time (inc)'
  • dataframe index: 'node'
  • node: 'name', 'parent'

HPCToolkit:

  • dataframe columns: 'time (inc)', 'time', 'nid', 'file', 'line', 'module', 'type'
  • dataframe index: 'node', 'rank'
  • node: 'type', 'name', 'parent'
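As a rough sketch of that node-dumping traversal, assuming hatchet-style nodes that expose frame attributes, parents, and _hatchet_nid (the helper name dump_nodes is hypothetical):

def dump_nodes(graph):
    nodes = []
    for node in graph.traverse():
        record = dict(node.frame.attrs)   # frame attributes, e.g. name, type, file, line
        if node.parents:
            # reference the parent by its _hatchet_nid
            record["parent"] = node.parents[0]._hatchet_nid
        nodes.append(record)
    return nodes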

@slabasan added the WIP label on Apr 1, 2020
@slabasan force-pushed the gf-snapshots branch 4 times, most recently from 1efb08f to 9b24d6e, on April 26, 2020
- add gf.save(), which produces a JSON snapshot file
- read in hatchet snapshot files with new reader: from_hatchet_snapshot()
- add tests for save and snapshot, confirming the snapshot files exist, and
  that the number of nodes and the column types are the same as the original
- convert unicode columns output in python2 to strings
@slabasan (Collaborator, Author) commented Apr 27, 2020

This is ready for review (resolves #114). The format follows Caliper's JSON format, but we aren't able to directly reuse everything from the Caliper reader. The user saves a graphframe with gf.save(fname="hatchet-snapshot"). This outputs a JSON file called hatchet-snapshot.json, which can be read back into hatchet with from_hatchet_snapshot("hatchet-snapshot.json").

The snapshot file appends a new column called _hnid to the data field, which is used to merge the data and node dataframes. This column is removed from the final read-in result. The hatchet snapshot reader has some logic to handle unicode and str formats in python2 and python3.
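A rough sketch of that merge, assuming the snapshot layout shown earlier with _hnid appended to the columns list (variable names are illustrative):

import json
import pandas as pd

with open("hatchet-snapshot.json") as f:
    snapshot = json.load(f)

metrics = pd.DataFrame(snapshot["data"], columns=snapshot["columns"])
node_df = pd.DataFrame(snapshot["nodes"])   # row i holds the node with _hnid == i
merged = metrics.merge(node_df, left_on="_hnid", right_index=True)
merged = merged.drop(columns=["_hnid"])     # _hnid is removed from the final result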

Tests cover outputting a snapshot file from known data (checking that the resulting file exists) and validating the read-in graphframe before and after a series of operations.
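The round-trip check might look roughly like this (pytest-style; the gf fixture and the import path for from_hatchet_snapshot are assumptions, not the PR's exact tests):

import os

def test_save_and_snapshot(gf, tmpdir):
    # from_hatchet_snapshot is the reader added in this PR (import path assumed)
    path = str(tmpdir.join("hatchet-snapshot"))
    gf.save(fname=path)
    assert os.path.exists(path + ".json")   # the snapshot file was written

    gf2 = from_hatchet_snapshot(path + ".json")
    # same number of nodes and same column types as the original
    assert sum(1 for _ in gf2.graph.traverse()) == sum(1 for _ in gf.graph.traverse())
    assert list(gf2.dataframe.dtypes) == list(gf.dataframe.dtypes)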

@slabasan linked an issue on May 7, 2020 that may be closed by this pull request
@slabasan added the WIP label on May 8, 2020
@slabasan (Collaborator, Author) commented May 8, 2020

Feedback:

  • duplicate name field in the data and nodes sections of the JSON file
  • if we assume that _hatchet_nid is the order of the entries in the nodes section, then we do not need the _hnid column in the data section (we remove _hnid anyway after merging the metadata and nodes together); see the sketch after this list
  • only the parent field is needed on each node
  • if a user does a groupby/aggregate, what are the resulting node types?
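To illustrate the ordering assumption in the second bullet (snapshot is the parsed JSON from the earlier sketch; variable names are illustrative): if entry i of nodes is the node with _hatchet_nid == i, the existing nid column already identifies each row's node positionally, so _hnid becomes redundant.

nid_col = snapshot["columns"].index("nid")
for row in snapshot["data"]:
    node = snapshot["nodes"][int(row[nid_col])]   # positional lookup by nid
    print(row, "->", node["name"])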

Something to think about: what if our snapshot file were just the dataframe? Then we could export in any of the native formats supported by pandas (e.g., CSV, HDF5, pickle). To do this, we need to append the following columns to the dataframe (see the sketch after this list):

  • parent
  • node type (e.g., statement, function)
  • hierarchy cols (i.e., if the node type is statement, then the hierarchy cols are file and line; if the node type is function, then the hierarchy col is name)
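A hypothetical sketch of that dataframe-only snapshot, with the extra columns appended and the result written through pandas' native writers (the toy data stands in for gf.dataframe):

import pandas as pd

# toy stand-in for gf.dataframe; real columns depend on the reader
df = pd.DataFrame({
    "name": ["foo", "bar", "baz", "grault"],
    "time": [0.0, 5.0, 5.0, 10.0],
})
df["parent"] = [-1, 0, 1, 1]       # parent _hatchet_nid (-1 for the root)
df["type"] = ["function"] * 4      # node type determines the hierarchy cols
df.to_csv("snapshot.csv")
# or: df.to_hdf("snapshot.h5", key="gf"); df.to_pickle("snapshot.pkl")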

Linked issue: Dump snapshot of graphframe for checkpoint