
Parsing function signatures in a codebase to build a data flow graph #392

Open
ericmjl opened this issue Jan 18, 2021 · 7 comments
Labels
enhancement New feature or request

@ericmjl
Contributor

ericmjl commented Jan 18, 2021

Is your feature request related to a problem? Please describe.

This isn't related to a problem; it's an idea that occurred to me after seeing Pandera's pydantic-style schemas.

Describe the solution you'd like

If we use type annotations on our functions, and our schemas are declared using the pydantic-style classes, then this opens up the opportunity to construct data flow graphs from our codebase.

Here is a concise example:

from .schemas import Schema1, Schema2, Schema3

def function1(df1: Schema1, df2: Schema2) -> Schema3:
    # stuff happens
    return df3

As a result of parsing the function signature, we should be able to see the following graph:

(Schema1, Schema2) --> function1 --> (Schema3)

Amidst taking care of a newborn, I squeezed out some time to build a prototype, but I think I might struggle to find time to do a fully-fledged implementation. The prototype is found here, and uses only the built-in Python inspect module and NetworkX. Thus, I'm leaving the prototype up on a Gist, and raising an issue here, to see if there's anybody who wants to work together to make this happen.
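The signature-parsing idea can be sketched with just the built-in inspect module. This is a hypothetical reconstruction, not the Gist's actual code: Schema1/Schema2/Schema3 and function1 are illustrative stand-ins, and the sketch yields plain edge tuples that could then be fed to networkx.DiGraph.add_edges_from:

```python
import inspect


# Illustrative stand-ins for pandera-style schema classes.
class Schema1: ...
class Schema2: ...
class Schema3: ...


def function1(df1: Schema1, df2: Schema2) -> Schema3:
    ...


def signature_edges(func):
    """Yield (source, target) edges: input schemas -> func -> output schema."""
    sig = inspect.signature(func)
    for param in sig.parameters.values():
        if param.annotation is not inspect.Parameter.empty:
            yield (param.annotation, func)
    if sig.return_annotation is not inspect.Signature.empty:
        yield (func, sig.return_annotation)


edges = list(signature_edges(function1))
# edges: [(Schema1, function1), (Schema2, function1), (function1, Schema3)]
```

Walking a package's modules recursively and collecting these edges for every function would produce exactly the `(Schema1, Schema2) --> function1 --> (Schema3)` graph described above.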

Additional context

I think the best place for this feature might be a sibling package to pandera, as pandera is all about data validation, while the idea I have described is data flow mapping. I just decided to post here because most of the development surrounding pandera takes place here.

@ericmjl ericmjl added the enhancement New feature or request label Jan 18, 2021
@cosmicBboy
Collaborator

thanks @ericmjl, this is a great start to something I've also been thinking about :)

Amidst taking care of a newborn

Congrats 🎉🎉🎉

I squeezed out some time to build a prototype, but I think I might struggle to find time to do a fully-fledged implementation.

This is awesome, very much aligned with what I've been discussing with @jeffzi re: a sister package that provides utilities for analyzing one's ETL codebase.

For now I'll dub the project pandera-profiling a.k.a. "profile your ETL pipeline" (a slight nod to pandas-profiling). Would be happy to find another suitable name once the idea has legs.

Things are still in brain-storm phase, but we're thinking of providing this sort of functionality:

  • show all the schemas used in your codebase and which functions use them (with the graph-rendering capabilities this is pretty much what you're proposing)
  • a utility for aggregating all schema validation errors and producing a human-readable report
  • show percent of functions in the source code that have dataframe inputs and outputs that are typed with a pandera schema (the "data coverage" piece)
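
The "data coverage" piece from the last bullet could look something like this sketch; SchemaModel, UserSchema, and the example functions are all hypothetical stand-ins rather than real pandera APIs:

```python
import inspect


class SchemaModel: ...  # stand-in for a pandera-style schema base class


class UserSchema(SchemaModel): ...


def load(path: str) -> UserSchema: ...   # covered via return annotation
def clean(df: UserSchema) -> UserSchema: ...  # covered via input annotation
def untyped(df): ...                     # not covered


def schema_coverage(funcs, schema_base=SchemaModel) -> float:
    """Fraction of functions with at least one schema-typed annotation."""
    covered = 0
    for f in funcs:
        sig = inspect.signature(f)
        anns = [p.annotation for p in sig.parameters.values()]
        anns.append(sig.return_annotation)
        if any(isinstance(a, type) and issubclass(a, schema_base) for a in anns):
            covered += 1
    return covered / len(funcs)


coverage = schema_coverage([load, clean, untyped])  # 2 of 3 functions covered
```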

The idea would be that pandera-profiling would be used in concert with pytest-pandera in a similar way that people use coverage and pytest-cov, i.e. run pytest --pandera-report and it generates a bunch of artifacts (html, json, etc.) that can be viewed locally or uploaded to some external service (also TBD(eveloped))

There are further integrations needed here to realize a vision of "Data CI for DS/ML practitioners", e.g. a Github Actions integration, but I think pandera-profiling and pytest-pandera would be the first two big steps to get there.

Let me know if you have any further thoughts/ideas!

@ericmjl
Contributor Author

ericmjl commented Jan 19, 2021

@cosmicBboy this is cool! I like where things are going 😄.

By a stroke of fortune, I had an entire afternoon and evening to hack. I am going to invite both you and @jeffzi to a private repo containing a prototype of how I think parsing of signatures could be done to get graph creation capabilities. Inside there should be enough starter code to fulfill the first pointer on "showing all schemas" used in a codebase and the functions that use them. I'm particularly happy about the recursive search through modules and recursive parsing of function signatures and return annotations, as it's been a while since I last had the chance to program anything recursive lol. That said, it still feels like a hack, so there might be better ways of handling the problem.

Please feel free to raid it for anything you like!

@jeffzi
Collaborator

jeffzi commented Jan 19, 2021

Thanks @ericmjl, and congrats! This is indeed a great starting point 🏁

Using annotations makes it easier to discover schema usage (static analysis). In the long run, I think we'd also want to support plain DataFrameSchemas applied with decorators such as @check_input. That would require a tool that monitors the running program, similar to coverage.

@ericmjl
Contributor Author

ericmjl commented Jan 21, 2021

Yes, thanks @jeffzi for the term: parsing signatures is static analysis of data flow. I was struggling to describe it. Runtime analysis, though — how would you do that?

@jeffzi
Collaborator

jeffzi commented Jan 23, 2021

I don't have a detailed answer. We're still in the brainstorming stage.

That said, we'd need an external program, similar to coverage or monkeytype. The standard Python library offers some hooks for this sort of runtime monitoring.

Getting familiar with the internals of those tools would hopefully guide us in the right direction.
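
One such stdlib hook is sys.settrace, which is the same mechanism coverage and monkeytype build on. A minimal sketch (the tracer and transform function are illustrative, not from either tool):

```python
import sys

calls = []


def tracer(frame, event, arg):
    # Record every Python-level function call observed by the interpreter.
    if event == "call":
        calls.append(frame.f_code.co_name)
    return None  # no per-line tracing needed


def transform(x):
    return x + 1


sys.settrace(tracer)
transform(1)
sys.settrace(None)
print(calls)  # → ['transform']
```

A real runtime analyzer would filter frames down to the user's modules and cross-reference them against registered schemas instead of just logging names.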

@cosmicBboy
Collaborator

Re: runtime analysis, we should definitely look to other tools for inspiration, as @jeffzi mentioned. I have a local working prototype that decorates all functions/callables in a specified set of modules and finds the ones that are decorated with check_{input/output/io}. This could then be used to annotate the function signature, and from there your graph-rendering prototype @ericmjl should be pretty straightforward to construct. It's not a complete solution yet, and probably not the best one, but it does work!
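
One way such a prototype might detect decorated functions: decorators applied with functools.wraps expose a __wrapped__ attribute on the wrapper. The check_input below is a hypothetical stand-in for pandera's decorator, and the module is constructed inline for illustration:

```python
import functools
import types


def check_input(func):  # stand-in for pandera.check_input, not the real API
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@check_input
def clean(df):
    return df


def helper(x):
    return x


def find_wrapped_functions(module):
    """Return names of module-level functions that are decorator wrappers."""
    return [
        name
        for name, obj in vars(module).items()
        if isinstance(obj, types.FunctionType) and hasattr(obj, "__wrapped__")
    ]


mod = types.ModuleType("pipeline")  # hypothetical user module
mod.clean, mod.helper = clean, helper
print(find_wrapped_functions(mod))  # → ['clean']
```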

@cosmicBboy cosmicBboy added this to Backlog in Release Roadmap Feb 19, 2021
@cosmicBboy cosmicBboy added this to the Sentinel milestone Feb 25, 2021
@cosmicBboy
Collaborator

@ericmjl @jeffzi FYI I just placed this issue under the sentinel milestone, which is tracked under the corresponding project, which describes the goals of this effort.

For now I wanted to release this as a package extension, pandera[profiling], rather than a separate package, (a) because I think the functionality is closely related to pandera as a data testing tool, and (b) because I'm too lazy to set up all the CI/repo infrastructure in another git repo 🙃.
