
Parsing function signatures in a codebase to build a data flow graph #392

Open
ericmjl opened this issue Jan 18, 2021 · 7 comments
Labels
enhancement New feature or request

@ericmjl
Contributor

ericmjl commented Jan 18, 2021

Is your feature request related to a problem? Please describe.

This isn't related to a problem; it's an idea that occurred to me after seeing Pandera's pydantic-style schemas.

Describe the solution you'd like

If we use type annotations on our functions, and our schemas are declared using the pydantic-style classes, then this opens up the opportunity to construct data flow graphs from our codebase.

Here is a concise example:

from .schemas import Schema1, Schema2, Schema3

def function1(df1: Schema1, df2: Schema2) -> Schema3:
    # stuff happens
    return df3

As a result of parsing the function signature, we should be able to see the following graph:

(Schema1, Schema2) --> function1 --> (Schema3)

Amidst taking care of a newborn, I squeezed out some time to build a prototype, but I think I might struggle to find time to do a fully-fledged implementation. The prototype is found here, and uses only the built-in Python inspect module and NetworkX. Thus, I'm leaving the prototype up on a Gist, and raising an issue here, to see if there's anybody who wants to work together to make this happen.
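The signature-parsing idea can be sketched with just the built-in inspect module. This is a hypothetical reconstruction, not the Gist's actual code: Schema1/Schema2/Schema3 and function1 are illustrative stand-ins, and the sketch yields plain edge tuples that could then be fed to networkx.DiGraph.add_edges_from:

```python
import inspect


# Illustrative stand-ins for pandera-style schema classes.
class Schema1: ...
class Schema2: ...
class Schema3: ...


def function1(df1: Schema1, df2: Schema2) -> Schema3:
    ...


def signature_edges(func):
    """Yield (source, target) edges: input schemas -> func -> output schema."""
    sig = inspect.signature(func)
    for param in sig.parameters.values():
        if param.annotation is not inspect.Parameter.empty:
            yield (param.annotation, func)
    if sig.return_annotation is not inspect.Signature.empty:
        yield (func, sig.return_annotation)


edges = list(signature_edges(function1))
# edges: [(Schema1, function1), (Schema2, function1), (function1, Schema3)]
```

Walking a package's modules recursively and collecting these edges for every function would produce exactly the `(Schema1, Schema2) --> function1 --> (Schema3)` graph described above.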

Additional context

I think the best place for this feature might be a sibling package to pandera, as pandera is all about data validation, while the idea I have described is data flow mapping. I just decided to post here because most of the development surrounding pandera takes place here.

@ericmjl ericmjl added the enhancement New feature or request label Jan 18, 2021
@cosmicBboy
Collaborator

thanks @ericmjl, this is a great start to something I've also been thinking about :)

Amidst taking care of a newborn

Congrats 🎉🎉🎉

I squeezed out some time to build a prototype, but I think I might struggle to find time to do a fully-fledged implementation.

This is awesome, very much aligned with what I've been discussing with @jeffzi re: a sister package that provides utilities for analyzing one's ETL codebase.

For now I'll dub the project pandera-profiling a.k.a. "profile your ETL pipeline" (a slight nod to pandas-profiling). Would be happy to find another suitable name once the idea has legs.

Things are still in brain-storm phase, but we're thinking of providing this sort of functionality:

  • show all the schemas used in your codebase and which functions use them (with the graph-rendering capabilities this is pretty much what you're proposing)
  • a utility for aggregating all schema validation errors and producing a human-readable report
  • show percent of functions in the source code that have dataframe inputs and outputs that are typed with a pandera schema (the "data coverage" piece)
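
The "data coverage" piece from the last bullet could look something like this sketch; SchemaModel, UserSchema, and the example functions are all hypothetical stand-ins rather than real pandera APIs:

```python
import inspect


class SchemaModel: ...  # stand-in for a pandera-style schema base class


class UserSchema(SchemaModel): ...


def load(path: str) -> UserSchema: ...   # covered via return annotation
def clean(df: UserSchema) -> UserSchema: ...  # covered via input annotation
def untyped(df): ...                     # not covered


def schema_coverage(funcs, schema_base=SchemaModel) -> float:
    """Fraction of functions with at least one schema-typed annotation."""
    covered = 0
    for f in funcs:
        sig = inspect.signature(f)
        anns = [p.annotation for p in sig.parameters.values()]
        anns.append(sig.return_annotation)
        if any(isinstance(a, type) and issubclass(a, schema_base) for a in anns):
            covered += 1
    return covered / len(funcs)


coverage = schema_coverage([load, clean, untyped])  # 2 of 3 functions covered
```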

The idea would be that pandera-profiling would be used in concert with pytest-pandera in a similar way that people use coverage and pytest-cov, i.e. run pytest --pandera-report and it generates a bunch of artifacts (html, json, etc.) that can be viewed locally or uploaded to some external service (also TBD(eveloped))

There are further integrations needed here to realize a vision of "Data CI for DS/ML practitioners", e.g. a Github Actions integration, but I think pandera-profiling and pytest-pandera would be the first two big steps to get there.

Let me know if you have any further thoughts/ideas!

@ericmjl
Contributor Author

ericmjl commented Jan 19, 2021

@cosmicBboy this is cool! I like where things are going 😄.

By a stroke of fortune, I had an entire afternoon and evening to hack. I am going to invite both you and @jeffzi to a private repo containing a prototype of how I think parsing of signatures could be done to get graph creation capabilities. Inside there should be enough starter code to fulfill the first pointer on "showing all schemas" used in a codebase and the functions that use them. I'm particularly happy about the recursive search through modules and recursive parsing of function signatures and return annotations, as it's been a while since I last had the chance to program anything recursive lol. That said, it still feels like a hack, so there might be better ways of handling the problem.

Please feel free to raid it for anything you like!

@jeffzi
Collaborator

jeffzi commented Jan 19, 2021

Thanks @ericmjl, and congrats! This is indeed a great starting point 🏁

Using annotations makes it easier to discover schema usage (static analysis). In the long run, I think we'd also want to support plain DataFrameSchemas applied with decorators such as @check_input. That would require a tool that monitors the running program, similar to coverage.

@ericmjl
Contributor Author

ericmjl commented Jan 21, 2021

Yes, thanks @jeffzi for the term: parsing signatures is static analysis of data flow. I was struggling to describe it. Runtime analysis, though — how would you do that?

@jeffzi
Collaborator

jeffzi commented Jan 23, 2021

I don't have a detailed answer. We're still in the brainstorming stage.

That said, we'd need an external program, similar to coverage or monkeytype. The standard Python library offers some hooks for this sort of runtime monitoring.

Getting familiar with the internals of those tools would hopefully guide us in the right direction.
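
One such stdlib hook is sys.settrace, which is the same mechanism coverage and monkeytype build on. A minimal sketch (the tracer and transform function are illustrative, not from either tool):

```python
import sys

calls = []


def tracer(frame, event, arg):
    # Record every Python-level function call observed by the interpreter.
    if event == "call":
        calls.append(frame.f_code.co_name)
    return None  # no per-line tracing needed


def transform(x):
    return x + 1


sys.settrace(tracer)
transform(1)
sys.settrace(None)
print(calls)  # → ['transform']
```

A real runtime analyzer would filter frames down to the user's modules and cross-reference them against registered schemas instead of just logging names.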

@cosmicBboy
Collaborator

Re: runtime analysis, we should definitely look to other tools for inspiration, as @jeffzi mentioned. I have a local working prototype that decorates all functions/callables in a specified set of modules and finds the ones that are decorated with check_{input/output/io}. This could then be used to annotate the function signature, and from there your graph-rendering prototype @ericmjl should be pretty straightforward to construct. It's not a complete solution yet, and probably not the best one, but it does work!
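
One way such a prototype might detect decorated functions: decorators applied with functools.wraps expose a __wrapped__ attribute on the wrapper. The check_input below is a hypothetical stand-in for pandera's decorator, and the module is constructed inline for illustration:

```python
import functools
import types


def check_input(func):  # stand-in for pandera.check_input, not the real API
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@check_input
def clean(df):
    return df


def helper(x):
    return x


def find_wrapped_functions(module):
    """Return names of module-level functions that are decorator wrappers."""
    return [
        name
        for name, obj in vars(module).items()
        if isinstance(obj, types.FunctionType) and hasattr(obj, "__wrapped__")
    ]


mod = types.ModuleType("pipeline")  # hypothetical user module
mod.clean, mod.helper = clean, helper
print(find_wrapped_functions(mod))  # → ['clean']
```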

@cosmicBboy cosmicBboy added this to Backlog in Release Roadmap Feb 19, 2021
@cosmicBboy cosmicBboy added this to the Sentinel milestone Feb 25, 2021
@cosmicBboy
Collaborator

@ericmjl @jeffzi FYI I just placed this issue under the sentinel milestone, which is tracked under the corresponding project, which describes the goals of this effort.

For now I wanted to release this as a package extension, pandera[profiling], rather than a separate package, (a) because I think the functionality is closely related to pandera as a data testing tool, and (b) because I'm too lazy to set up all the CI/repo infrastructure in another git repo 🙃.
