
Take snapshots of graphframe #137

Closed · wants to merge 1 commit

Conversation

@slabasan (Collaborator) commented Apr 1, 2020

Adds a new gf.save() API to enable writing processed graphframes out to a file. This can be useful if a user doesn't want to wait for input data to be read in again, or wants to save a resulting graphframe after many operations.

One of the challenges here: given that all the readers have different formats, what is the minimum information we want to include in this file so as not to lose important data?
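As a minimal usage sketch of the intended workflow (from_hpctoolkit and drop_index_levels are existing hatchet APIs; save() is the call proposed in this PR, and the database path is a placeholder):

import hatchet as ht

# read the input data once (potentially slow), process it, then snapshot
gf = ht.GraphFrame.from_hpctoolkit("hpctoolkit-database")
gf.drop_index_levels()
gf.save(fname="hatchet-snapshot")   # writes hatchet-snapshot.json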

The current output format I'm proposing is similar to Caliper's JSON format:

{
    "data": [
        [0.0, 130.0, 0.0],
        [5.0, 20.0, 1.0],
        [5.0, 5.0, 2.0],
        [10.0, 10.0, 3.0]
    ],
    "columns": [
        "time",
        "time (inc)",
        "nid"
    ],
    "nodes": [
        {
            "name": "foo"
        },
        {
            "name": "bar",
            "parent": 0
        },
        {
            "name": "baz",
            "parent": 1
        },
        {
            "name": "grault",
            "parent": 1
        }
    ]
}

So far, I've looked at outputting a literal graphframe and an HPCToolkit graphframe. These are good to compare because they already surface the differences in the data each reader creates. The nodes are dumped by traversing the graph, outputting the frame attributes, and grabbing the _hatchet_nid of the parent (see the sketch after the lists below).
literal:

  • dataframe columns: 'name', 'time', 'time (inc)'
  • dataframe index: 'node'
  • node: 'name', 'parent'

HPCToolkit:

  • dataframe columns: 'time (inc)', 'time', 'nid', 'file', 'line', 'module', 'type'
  • dataframe index: 'node', 'rank'
  • node: 'type', 'name', 'parent'
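As a rough sketch of that node-dumping traversal, assuming hatchet-style nodes that expose frame attributes, parents, and _hatchet_nid (the helper name dump_nodes is hypothetical):

def dump_nodes(graph):
    nodes = []
    for node in graph.traverse():
        record = dict(node.frame.attrs)   # frame attributes, e.g. name, type, file, line
        if node.parents:
            # reference the parent by its _hatchet_nid
            record["parent"] = node.parents[0]._hatchet_nid
        nodes.append(record)
    return nodes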

@slabasan added the WIP label on Apr 1, 2020
@slabasan force-pushed the gf-snapshots branch 4 times, most recently from 1efb08f to 9b24d6e, on April 26, 2020
- add gf.save(), which produces a JSON snapshot file
- read in hatchet snapshot files with new reader: from_hatchet_snapshot()
- add tests for save and snapshot, confirming the snapshot files exist, and
  that the number of nodes and the column types are the same as the original
- convert unicode columns output in python2 to strings
@slabasan (Collaborator, Author) commented Apr 27, 2020

This is ready for review (resolves #114). The format follows Caliper's JSON format, but we aren't able to directly reuse everything from the Caliper reader. The user saves a graphframe with gf.save(fname="hatchet-snapshot"). This outputs a JSON file called hatchet-snapshot.json, which can be read back into hatchet with from_hatchet_snapshot("hatchet-snapshot.json").

The snapshot file appends a new column called _hnid to the data field, which is used to merge the data and node dataframes. This column is removed from the final read-in result. The hatchet snapshot reader has some logic to handle unicode and str formats in python2 and python3.
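A rough sketch of that merge, assuming the snapshot layout shown earlier with _hnid appended to the columns list (variable names are illustrative):

import json
import pandas as pd

with open("hatchet-snapshot.json") as f:
    snapshot = json.load(f)

metrics = pd.DataFrame(snapshot["data"], columns=snapshot["columns"])
node_df = pd.DataFrame(snapshot["nodes"])   # row i holds the node with _hnid == i
merged = metrics.merge(node_df, left_on="_hnid", right_index=True)
merged = merged.drop(columns=["_hnid"])     # _hnid is removed from the final result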

Tests cover outputting a snapshot file from known data (checking that the resulting file exists) and validating the read-in graphframe before and after a series of operations.
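The round-trip check might look roughly like this (pytest-style; the gf fixture and the import path for from_hatchet_snapshot are assumptions, not the PR's exact tests):

import os

def test_save_and_snapshot(gf, tmpdir):
    # from_hatchet_snapshot is the reader added in this PR (import path assumed)
    path = str(tmpdir.join("hatchet-snapshot"))
    gf.save(fname=path)
    assert os.path.exists(path + ".json")   # the snapshot file was written

    gf2 = from_hatchet_snapshot(path + ".json")
    # same number of nodes and same column types as the original
    assert sum(1 for _ in gf2.graph.traverse()) == sum(1 for _ in gf.graph.traverse())
    assert list(gf2.dataframe.dtypes) == list(gf.dataframe.dtypes)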

@slabasan linked an issue on May 7, 2020 that may be closed by this pull request
@slabasan added the WIP label on May 8, 2020
@slabasan (Collaborator, Author) commented May 8, 2020

Feedback:

  • duplicate name field in the data and nodes sections of the JSON file
  • if we assume that _hatchet_nid is the order of the entries in the nodes section, then we do not need the _hnid column in the data section (we remove _hnid anyway after merging the metadata and nodes together); see the sketch after this list
  • only the parent field is needed on each node
  • if a user does a groupby/aggregate, what are the resulting node types?
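To illustrate the ordering assumption in the second bullet (snapshot is the parsed JSON from the earlier sketch; variable names are illustrative): if entry i of nodes is the node with _hatchet_nid == i, the existing nid column already identifies each row's node positionally, so _hnid becomes redundant.

nid_col = snapshot["columns"].index("nid")
for row in snapshot["data"]:
    node = snapshot["nodes"][int(row[nid_col])]   # positional lookup by nid
    print(row, "->", node["name"])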

Something to think about: what if our snapshot file were just the dataframe? Then we could export in any of the native formats supported by pandas (e.g., CSV, HDF5, pickle). To do this, we need to append the following columns to the dataframe (see the sketch after this list):

  • parent
  • node type (e.g., statement, function)
  • hierarchy cols (i.e., if the node type is statement, then the hierarchy cols are file and line; if the node type is function, then the hierarchy col is name)
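A hypothetical sketch of that dataframe-only snapshot, with the extra columns appended and the result written through pandas' native writers (the toy data stands in for gf.dataframe):

import pandas as pd

# toy stand-in for gf.dataframe; real columns depend on the reader
df = pd.DataFrame({
    "name": ["foo", "bar", "baz", "grault"],
    "time": [0.0, 5.0, 5.0, 10.0],
})
df["parent"] = [-1, 0, 1, 1]       # parent _hatchet_nid (-1 for the root)
df["type"] = ["function"] * 4      # node type determines the hierarchy cols
df.to_csv("snapshot.csv")
# or: df.to_hdf("snapshot.h5", key="gf"); df.to_pickle("snapshot.pkl")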

Linked issue: Dump snapshot of graphframe for checkpoint