Parsing function signatures in a codebase to build a data flow graph #392
Comments
thanks @ericmjl, this is a great start to something I've also been thinking about :)
Congrats 🎉🎉🎉
This is awesome, very much aligned with what I've been discussing with @jeffzi re: a sister package that provides utilities for analyzing one's ETL codebase. For now I'll dub the project Things are still in brain-storm phase, but we're thinking of providing this sort of functionality:
The idea would be that There are further integrations needed here to realize a vision of "Data CI for DS/ML practitioners", e.g. a GitHub Actions integration, but I think Let me know if you have any further thoughts/ideas!
@cosmicBboy this is cool! I like where things are going 😄. By a stroke of fortune, I had an entire afternoon and evening to hack. I am going to invite both you and @jeffzi to a private repo containing a prototype of how I think parsing of signatures could be done to get graph creation capabilities. Inside there should be enough starter code to fulfill the first pointer on "showing all schemas" used in a codebase and the functions that use them. I'm particularly happy about the recursive search through modules and recursive parsing of function signatures and return annotations, as it's been a while since I last had the chance to program anything recursive lol. That said, it still feels like a hack, so there might be better ways of handling the problem. Please feel free to raid it for anything you like!
Thanks @ericmjl and congrats! This is indeed a great starting point 🏁 Using annotations makes it easier to discover schema usage (static analysis). In the long run, I think we'd also want to support plain
Yes, thanks @jeffzi for the term: Parsing signatures is static analysis of data flow. I was struggling to describe it. Runtime analysis - how would you do it though?
I don't have a detailed answer. We're still in the brainstorming stage. That said, we'd need an external program similar to coverage or monkeytype. The Python standard library offers some hooks for:
Getting familiar with the internals of those tools would hopefully guide us in the right direction.
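To make the runtime idea a bit more concrete, here is a minimal sketch built on `sys.setprofile`, one of the standard library's tracing hooks. It records the type of value each Python function returns while a small pipeline runs. The functions `load`, `transform`, and `pipeline` and the `observed` list are hypothetical names made up for this illustration; this is not how coverage or monkeytype are implemented, only a sketch of the kind of hook such a tool could build on.

```python
import sys

# (function name, return type name) pairs observed at runtime
observed = []


def record_return(frame, event, arg):
    # For "return" events, `arg` is the value being returned by the
    # Python function whose frame is `frame`.
    if event == "return":
        observed.append((frame.f_code.co_name, type(arg).__name__))


# Hypothetical pipeline functions, used only for this illustration.
def load():
    return [1, 2, 3]


def transform(rows):
    return rows + [4]


def pipeline():
    return transform(load())


sys.setprofile(record_return)
try:
    result = pipeline()
finally:
    # Always remove the hook, even if the pipeline raises.
    sys.setprofile(None)

for name, type_name in observed:
    print(f"{name} returned {type_name}")
```

A real tool would presumably inspect the returned objects more closely (for example, checking them against known schemas) rather than only recording the type name.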
Re: runtime analysis, we should def look to other tools for inspiration as @jeffzi mentioned. I have a local working prototype that decorates all functions/callables in a specified set of modules and finds the ones that are decorated with
@ericmjl @jeffzi FYI I just placed this issue under the For now I wanted to release as package extension
Is your feature request related to a problem? Please describe.
This isn't related to a problem, it's an idea that came up in my mind after seeing Pandera's pydantic-style schemas.
Describe the solution you'd like
If we use type annotations on our functions, and our schemas are declared using the pydantic-style classes, then this opens up the opportunity to construct data flow graphs from our codebase.
Here is a concise example:
As a result of parsing the function signature, we should be able to see the following graph:
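A minimal sketch of the parsing step, not the prototype itself, might look like the following. It uses only the standard library (`inspect` and `typing.get_type_hints`) and prints edges as text instead of building a NetworkX graph. The `Schema` base class, the three schema classes, and the functions `clean` and `summarize` are all hypothetical stand-ins invented for this illustration; in a real codebase the schemas would be pandera's pydantic-style schema classes.

```python
import inspect
from typing import get_type_hints


# Hypothetical stand-ins for pydantic-style schema classes.
class Schema:
    pass


class RawSales(Schema):
    pass


class CleanSales(Schema):
    pass


class SalesSummary(Schema):
    pass


# Hypothetical functions annotated with the schemas above.
def clean(df: RawSales) -> CleanSales:
    return df


def summarize(df: CleanSales, top_n: int = 5) -> SalesSummary:
    return df


def is_schema(annotation):
    # Only class annotations that subclass Schema count as graph nodes.
    return inspect.isclass(annotation) and issubclass(annotation, Schema)


def schema_edges(functions):
    """Return (input_schema, function, output_schema) name triples."""
    edges = []
    for func in functions:
        hints = get_type_hints(func)
        output = hints.pop("return", None)
        if not is_schema(output):
            continue
        for annotation in hints.values():
            if is_schema(annotation):
                edges.append(
                    (annotation.__name__, func.__name__, output.__name__)
                )
    return edges


edges = schema_edges([clean, summarize])
for source, func_name, target in edges:
    print(f"{source} --{func_name}--> {target}")
```

Each triple could then be added to a directed graph, with schemas as nodes and the function name as an edge attribute. Non-schema parameters such as `top_n` are skipped.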
Amidst taking care of a newborn, I squeezed out some time to build a prototype, but I think I might struggle to find time to do a fully-fledged implementation. The prototype is found here, and uses only the built-in Python inspect module and NetworkX. Thus, I'm leaving the prototype up on a Gist, and raising an issue here, to see if there's anybody who wants to work together to make this happen.
Additional context
I think the best place for this feature might be a sibling package to pandera, as pandera is all about data validation, while the idea I have described is data flow mapping. I just decided to post here because most of the development surrounding pandera takes place here.