This repository has been archived by the owner on Jan 1, 2023. It is now read-only.

Provenance tracking #4

Open
prjemian opened this issue Mar 29, 2016 · 21 comments

Comments

@prjemian
Owner

Brian Toby started an email discussion on Provenance tracking in Python. He included code that would be an enhancement for this prototype project.

To contribute that code, make a pull request.

Here follows the discussion...

@prjemian
Owner Author

the initial email from Brian:

...

Attached is a short bit of code to provide a dictionary that tracks versions of 
Python packages for provenance tracking as well as a few things that we
track in GSAS-II.  I am not sure if there is an easy way to also introspect the 
.dll/.pyd files that an application uses, but it would be worthwhile to see 
... any additional ideas. 

I am not sure this needs to be a stand-alone .py file, but I am also not sure 
where it might make sense to stick this in our skeleton application. 

@prjemian
Owner Author

Pete's reply:

How commonly is this particular code used in various Python packages?
What is *its* provenance (where did you get it)?

Compare this with:
https://www.euroscipy.org/2015/schedule/presentation/16/

If I understand how this is used, the module is added to a project
and is then called upon when writing a data file, to add additional 
info to that file describing the suite of packages in use at the time.

@prjemian
Owner Author

Pete also wrote about the typical git workflow for collaborators to contribute:

In git, you would:

  • fork the prjemian/PyPrototype repository to briantoby/PyPrototype
    • easiest to do this with the GitHub web interface
    • there's a "fork" control at the top of the page
  • clone briantoby/PyPrototype from GitHub to your Mac
  • add the proposed file
    • copy that file into /src/PyPrototype/
    • commit this change on your hard disk
    • push your local repo back to GitHub: briantoby/PyPrototype
  • make a "pull request" in prjemian/PyPrototype
  • I'll look at your "PR"
  • we'll talk via GitHub's issue management service
  • end result: either I merge your PR or you close the PR

@prjemian
Owner Author

Brian's response to "from where did this code come"?

For GSAS-II we use something much cruder — with hardcoded package names 
for the things we track, but I cannot tell you how valuable it is to store this sort of 
info in a project file for when someone sends us something to debug. 

I just wrote what I sent. 

Also:

ReciPy looks very interesting, but is also a much heavier-weight package and 
my quick reading is that it only looks for certain packages that it cares about. 
I am not sure how parallel they are, but there are probably some cool things 
to [consider] from that.

@prjemian
Owner Author

Pete commented on Brian's code example:

Your code is a good example to refactor from a method into a class.  
If I correctly understood its intent. 

@prjemian
Owner Author

Brian wrote to Doga:

Doga, 

  A while back you raised the issue of provenance tracking in the context 
of next practices for package creation. I’d like to ask you to revisit this to 
better evaluate what is needed. A prototype package for APS projects has 
been created (http://pyprototype.readthedocs.org/en/latest/ and 
https://github.com/prjemian/PyPrototype). As best as I can tell, this takes care 
of the problem of interacting with Github to establish provenance of code from 
the current repository, but as you know tracking external package versions is 
also quite important. Towards that, Pete has located the following and I wrote the attached routine. 
* https://github.com/recipy/recipy
* https://www.euroscipy.org/2015/schedule/presentation/16/

   My feeling is that ReciPy is both too limited (tracked packages are hard-coded) 
and too heavy-weight for our use and I don’t see why one would use that and 
versioneer, but my code should also likely do more than it does now. I’d like to 
ask you to review package/computing environment provenance and see what 
else might be needed in an APS-tailored package. It would be great to have you 
contribute code to the prototype, but even coming up with a list of useful features 
to steal from ReciPy would be of value. On a related note, I have struck out on 
finding introspection mechanisms for profiling the versions of the most important 
.so/.dylib/.dll libraries relevant to a Python app’s results, but if you have any ideas
on that this would be useful. Even knowing the most relevant library names by 
platform could be of value. 

Brian
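
One partial approach to the library question, offered as a sketch rather than a solution: the compiled extension modules that Python itself has loaded can be listed from `sys.modules`, using the suffixes in `importlib.machinery.EXTENSION_SUFFIXES`. This assumes Python 3. It does not find the libraries that those extensions link against (a BLAS library, for example), and it does not report library versions. The function name `loaded_extension_files` is hypothetical.

```python
import sys
from importlib.machinery import EXTENSION_SUFFIXES


def loaded_extension_files():
    """Map module name to file path for every loaded compiled extension."""
    found = {}
    # Copy the items first because sys.modules may change during the loop.
    for name, module in list(sys.modules.items()):
        path = getattr(module, "__file__", None)
        # Built-in and namespace modules have no usable __file__; skip them.
        if isinstance(path, str) and path.endswith(tuple(EXTENSION_SUFFIXES)):
            found[name] = path
    return found


if __name__ == "__main__":
    for name, path in sorted(loaded_extension_files().items()):
        print(name, path)
```

The resulting paths could be stored next to the package versions, which at least records which binary files were in use.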

@prjemian
Owner Author

provenanceTracker.py

'''Code to record provenance information for a Python app
This assumes that all significant imports have been done before
routine provenanceTracker.provenanceTracker() is called.
'''

import sys
import platform

__version__ = '0.0.1'
def provenanceTracker():
    '''Provides a dict listing versions of imported packages

    :returns: dict where key is name of package and value is the
      __version__ string or for a few known outliers some other variable
      that indicates the version.
    '''
    PackageVersions = {}
    PackageVersions['Python'] = sys.version.split()[0]
    PackageVersions['Platform'] = sys.platform+'|'+platform.architecture()[0]+'|'+platform.machine()
    # copy the items first: sys.modules can change size while it is being read
    for name, pkg in list(sys.modules.items()):
        try:
            PackageVersions[name] = pkg.__version__
            continue
        except AttributeError:
            pass
        # deal with a few known idiosyncratic packages
        if name == 'Image':
            PackageVersions[name] = pkg.VERSION
        elif name == 'PIL':
            PackageVersions[name] = pkg.PILLOW_VERSION
    return PackageVersions

# test this by calling it directly
if __name__ == '__main__':
    # import a few packages so there is something to report
    import matplotlib as mpl
    import PIL
    import numpy
    try:
        import Image  # legacy PIL import; not available with Pillow
    except ImportError:
        pass
    provenance = provenanceTracker()
    for p in sorted(provenance):
        print('%s %s' % (p, provenance[p]))
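
Not every package defines `__version__`. As a complementary sketch (assuming Python 3.8 or later, which the code above predates), the standard library's `importlib.metadata` can report the version recorded in each installed distribution's metadata. Note that this lists what is installed, not only what has been imported. The function name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions():
    """Map each installed distribution name to its recorded version string."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata has no name
            versions[name] = dist.version
    return versions


if __name__ == "__main__":
    for name, version in sorted(installed_versions().items()):
        print(name, version)
```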

@prjemian
Owner Author

Brian wrote back:

On Mar 28, 2016, at 11:58 AM, Pete Jemian <jemian@anl.gov> wrote:
>
> Your code is a good example to refactor from a method into a class. 

If you get a chance, I’d like to have you explain to me why you would do this, 
so I learn more. I have always believed in writing the most simple code that 
gets the job done, so my feeling would be to stick with a simple function 
unless a class adds more features, simplifies use or maintenance. 

@prjemian
Owner Author

Pete responds to Brian:

It all depends on how it is intended to be used.
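
As an illustration of one such use, here is a hedged sketch of what a class might add over a plain function: it takes the snapshot once, keeps it, and can report what changed relative to an earlier snapshot. The class name `ProvenanceSnapshot` and its methods are hypothetical and are not part of the code posted in this thread. If the dictionary is only needed once, just before saving output, the plain function is probably enough.

```python
import sys


class ProvenanceSnapshot:
    """Record module versions once and compare against another snapshot."""

    def __init__(self):
        self.versions = self._collect()

    @staticmethod
    def _collect():
        versions = {"Python": sys.version.split()[0]}
        # Copy the items first because sys.modules may change during the loop.
        for name, module in list(sys.modules.items()):
            version = getattr(module, "__version__", None)
            if isinstance(version, str):
                versions[name] = version
        return versions

    def changed_since(self, other):
        """Return {name: (old, new)} for entries that differ from `other`."""
        names = set(self.versions) | set(other.versions)
        return {
            name: (other.versions.get(name), self.versions.get(name))
            for name in names
            if other.versions.get(name) != self.versions.get(name)
        }
```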

@prjemian
Owner Author

Doga Gursoy (welcome to the discussion) wrote:

For prototyping this is also an easy way: https://cookiecutter.readthedocs.org

Specifically: https://github.com/audreyr/cookiecutter-pypackage

@prjemian
Owner Author

Now that the discussion is up to date, I'll continue...

What is the intended purpose of the addition of provenance code in this prototype?

  • for people who package python code
  • for people who use python code for data analysis and wish to document the components of the code at the time of the analysis?
  • to document what is in the package?

Will this method be called more than once each time the python package is used?

@prjemian
Owner Author

CookieCutter looks like a very useful tool. Different, I believe, than Brian's idea of provenance tracking.

Thinking of the old lesson about doing the fishing versus teaching how to fish, the CookieCutter project and this PyPrototype project are on opposite ends.

The PyPrototype project demonstrates the layout of a prototypical Python project. It is on the end of teaching how to fish. There will be some find/replace work to change each new copy of the prototype into a useful new project. Maybe that's too much work.

CookieCutter is very much on the end where the fishing is actually done. One uses it to create a new skeleton Python project with all the right names and such (or some other metaphorical cookie shape) according to a customized template.

Consider this: the PyPrototype project shows the pattern of the end result.
We could create a CookieCutter template to recreate the steps that provide the customized project.

@briantoby
Collaborator

What is the intended purpose of the addition of provenance code in this prototype?

I see this as allowing people to recreate the code environment that gave a particular result. I am assuming that, thanks to versioneer, one knows which version of the current code one is running (perhaps that should be integrated into provenanceTracker.py), but if a result is changed by, for example, a change in numpy, how does one track or recreate that?

I envision calling the one function in provenanceTracker.py before saving output. By including the returned dictionary one would document as much as possible of the software stack.
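
A minimal sketch of that workflow, assuming the output is saved as JSON. The names `collect_versions` and `save_with_provenance` are hypothetical, and `collect_versions` is a simplified stand-in for the function in provenanceTracker.py, not a copy of it.

```python
import json
import platform
import sys


def collect_versions():
    """Simplified stand-in for provenanceTracker(): map module name to version."""
    versions = {
        "Python": platform.python_version(),
        "Platform": "|".join(
            [sys.platform, platform.architecture()[0], platform.machine()]
        ),
    }
    # Copy the items first because sys.modules may change during the loop.
    for name, module in list(sys.modules.items()):
        version = getattr(module, "__version__", None)
        if isinstance(version, str):
            versions[name] = version
    return versions


def save_with_provenance(result, path):
    """Write the result and the software stack that produced it to one JSON file."""
    record = {"result": result, "provenance": collect_versions()}
    with open(path, "w") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return record
```

Calling `save_with_provenance({"value": 1.5}, "out.json")` would write both the result and the version dictionary to `out.json`, so anyone who receives the file can see which software stack produced it.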

@dgursoy
Collaborator

dgursoy commented Mar 29, 2016

@prjemian
Owner Author

Some pep8 standards but not all of them.

For example, errors on "E221 multiple spaces before operator" are just goofy. Sometimes we humans want to line up the equal signs in a block of assignments (such as in __init__.py).

Trailing whitespace on a line (W291) is benign.
"E402 module level import not at top of file" flagged the __init__.py file again for its handling of versioneer.

Mostly, pep8 is advisory and should be taken with a healthy skepticism.
Another tool, pylint, has a differing opinion and provides a better diagnostic (IMO) of code problems. It gives a score that can be used to measure improvement against a recommendation. Some projects require a minimum score for acceptance. The __init__.py file has a bad score: -0.67/10. I can probably improve that to +5/10 (or so I hope).

@briantoby
Collaborator

I hate W291

Is there a way to configure automatic checking of only some PEP8 standards?

@dgursoy
Collaborator

dgursoy commented Mar 29, 2016

http://pep8.readthedocs.org

[pep8]
ignore = W291

@prjemian
Owner Author

conda install pylint

much more valuable feedback and coaching from this tool than pep8

prjemian added a commit that referenced this issue Mar 29, 2016
@dgursoy
Collaborator

dgursoy commented Mar 29, 2016

https://codeclimate.com is also useful and does this and some other error and readability checks for you.

@dgursoy
Collaborator

dgursoy commented Mar 31, 2016

@nicholas-aps does any project you've been involved in use provenance tracking for data processing? Any ideas or suggestions?

@prjemian
Owner Author

One project (of which I am aware) actively tracks provenance:

OK, I take that back. The Irena IgorPro macros maintained by Jan
Ilavsky for SAS data maintain a journal record as part of the IgorPro
project file. This is behind the scenes provenance logging.

Here, the provenance is recorded as data values in "wavenotes" (metadata
string connected to every IgorPro array data object) and as a data
processing activity logging notebook that Jan created within an IgorPro
notebook structure.

Irena:
http://usaxs.xray.aps.anl.gov/staff/ilavsky/irena.html

Otherwise, it has been discussed in two data standards projects:
NeXus : http://download.nexusformat.org/doc/html/search.html?q=provenance
canSAS : http://cansas-org.github.io/canSAS2012/search.html?q=provenance

The most progress made by either of these two was to assert the desire and importance
of documenting provenance and to establish a location within a NeXus
file to record it. That location would be within a NXprocess group (an
event of data processing, reconstruction, or analysis) as a NXnote group.

:NXprocess:
http://download.nexusformat.org/doc/html/classes/base_classes/NXprocess.html

:NXnote:
http://download.nexusformat.org/doc/html/classes/base_classes/NXnote.html

Pete

