get_sample_data still broken on v.1.1.x #498
Cannot replicate, but I have a guess. Could you send me your /export/home/johnh/.matplotlib.linux/sample_data/cache.pck file? |
Hey Jouni, I will email you the cache.pck file momentarily. But I want to clarify how I am using this and the undesirable behavior (to me) that I am seeing. Perhaps I am abusing the code and there is no sane way to do what I am doing, but here goes.
I have my sample_data in my local MPLCONFIGDIR. I frequently build the docs etc. with examples.download=False and point to this directory. I usually get this directory as a github checkout from matplotlib.sample_data. But sometimes I am running as a normal user, e.g. not building the docs, and I still have my MPLCONFIGDIR pointing to the same place and examples.download=True. When I run in that environment, the sample_data code removes everything that is not under its control, e.g. it wipes the github checkout clean, including the .git directory. This seems a bit heavy-handed to me. I'm including another shell session below which shows the workflow with debug verbosity. Let me know if you think this is the right behavior.
flush sample_data, get a clean github checkout for my local copy
lettuce:/export/home/johnh/.matplotlib.linux $ rm -rf sample_data/
run the image demo that pulls lena.jpg for the first time; note how it wipes the dir clean
lettuce:/export/home/johnh/matplotlib.matplotlib/examples/pylab_examples $ MPLCONFIGDIR=/export/home/johnh/.matplotlib.linux python image_demo3.test.py --verbose-debug -dagg
now rerun a second time to reproduce error
lettuce:/export/home/johnh/matplotlib.matplotlib/examples/pylab_examples $ ls /export/home/johnh/.matplotlib.linux/sample_data/axes_grid cache.pck lena.jpg screenshots testdir
I've updated with a comment on github -- here is the pck file. Thanks. On Sat, Oct 1, 2011 at 2:20 PM, Jouni K. Seppänen
I think you accidentally sent the pck file to the github email address of this issue instead of to me, and github doesn't show the attachment. In any case, which exact version of Python is this? I have Python 2.7.1 on OS X, and the httplib line numbers in the traceback don't match. Update: I installed 2.7.2, and the line numbers still don't match, and I still can't reproduce your error. Does the OpenSUSE version have some patches that are not in standard Python? Update 2: I don't see any patches in python-2.7.2-7.1.src.rpm that would explain this, but I have no idea if that is the same rpm that you have. |
About your usage: it's not how this code was intended to be used. The original idea was to be able to download example files from Sourceforge without packaging all of them in the tarball and without requiring the user to have a Subversion client. How I implemented it was to use HTTP instead of the Subversion-specific protocol, since any reasonable HTTP server has support for caching and Python has useful HTTP-related packages in its standard library. HTTP caching works by having the server send one or two optional response headers (ETag based on content, and Last-Modified based on modification date) that the client can later use to ask the server if the file has been changed. In particular, Github seems to only send ETag, not Last-Modified, and ETag is a completely opaque identifier that the server is allowed to generate in any way. Now when you get files in some other way (by pointing the cache directory at a git checkout) you don't get the cache-related HTTP headers, so the retrieval code has no way of asking the HTTP server if this file is up-to-date. That's why it decides that all the files in that directory are out-of-date. We could add a check to see if your sample_data directory is actually a git checkout, and in that case just do a |
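The revalidation scheme described above can be sketched as follows. This is a minimal illustration in modern Python 3, not the matplotlib code under discussion: the helper names (`extract_validators`, `build_revalidation_request`, `cached_copy_is_current`) and the example URL are hypothetical. It assumes the server may send either validator, both, or neither, and only sends back the ones it actually received.

```python
import urllib.request


def extract_validators(response_headers):
    """Pick the optional caching validators out of a mapping of response headers."""
    return {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }


def build_revalidation_request(url, validators):
    """Build a conditional GET that asks whether the cached copy is still current."""
    request = urllib.request.Request(url)
    # Only echo back validators the server really sent; a missing one stays absent.
    if validators.get("etag") is not None:
        request.add_header("If-None-Match", validators["etag"])
    if validators.get("last_modified") is not None:
        request.add_header("If-Modified-Since", validators["last_modified"])
    return request


def cached_copy_is_current(status_code):
    """A 304 Not Modified answer means the cached file can be reused."""
    return status_code == 304


# Hypothetical first response that carried an ETag but no Last-Modified.
validators = extract_validators({"ETag": '"abc123"'})
request = build_revalidation_request("https://example.invalid/lena.jpg", validators)
```

A file that arrived by some other route, such as a git checkout, has no stored validators at all, so the request built here would be unconditional and the cache could not tell whether the local copy is current.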
Attempts to fix issue matplotlib#498. I can't reproduce the issue myself, so I don't know if this is the real culprit, but it shouldn't do any harm.
Pull request #501 attempts to fix this, but since I can't reproduce this, I don't know if it really helps. |
I'm starting to think we should make httplib2 a dependency and use it for |
I just tested your pull request #501 and it appears to be working fine. On the issue of trying to detect a git repo, I agree this is probably not a good idea because of the additional complexity. But what do you think about not removing files in the directory? It seems like our managed files could live beside files they know nothing about, which would allow my use case to work reasonably well. Again, blowing away all the files feels a bit heavy-handed and may lead to unhappy surprises. On the issue of httplib2, I agree it should wait, and we would probably have to distribute it ourselves, which causes its own problems. Alternatively, we could consider moving the sample data back into the main tree. It's <11M at this point, and we could simply distribute it. But let's see how well things work with the new fixes and not upset the apple cart at this point. Thanks for the fixes.
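One way to picture the "leave unknown files alone" proposal is as a comparison of the two sets the debug output already prints: the files listed in cache.pck and the files found in the cache directory. The sketch below is only an illustration of that idea under stated assumptions; `plan_cleanup` and the example paths are hypothetical, and it does not reflect what the code did at the time, which per the report removed the unlisted files.

```python
def plan_cleanup(listed_in_cache, in_directory):
    """Compare the cache index with the directory contents.

    Returns (forget, leave_alone):
      forget      - index entries whose file has vanished and should be dropped
      leave_alone - files the cache knows nothing about and should not delete
    """
    forget = listed_in_cache - in_directory
    leave_alone = in_directory - listed_in_cache
    return forget, leave_alone


# Hypothetical directory state: a git checkout sharing the cache directory.
listed = {"sample_data/lena.jpg", "sample_data/old.csv"}
present = {"sample_data/lena.jpg", "sample_data/.git", "sample_data/testdir"}
forget, leave_alone = plan_cleanup(listed, present)
```

Under this scheme only entries the cache itself recorded are ever touched, so a checkout's `.git` directory would survive a run with examples.download=True.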
I removed the release_critical label since the immediate problem got fixed, but let's not close this yet since we should fix this properly on master. Not deleting files is doable as you say, but shipping the sample data as part of the matplotlib package might be the simplest solution. |
@jkseppan: Any idea where this ticket is at? Anything left to be done? |
I would like to suggest that we drop the current get_sample_data mechanism and start including the sample data as part of the downloadable matplotlib packages. The root cause of problems like #478 is that we get the files from github (or sourceforge, which we used previously) and don't really control the server side: the host can move the files elsewhere and leave a redirect behind, or perhaps only a "404 Not Found" and a human-readable explanation. The original rationale for get_sample_data was that the gallery can include new examples with new data, and you could use them with older versions of matplotlib that didn't come with that data file. In practice it doesn't seem that we get a lot of new data in the repository. There are no commits from 2012; three commits from 2011 that add data files (and some more that increment version numbers or similarly modify the infrastructure); no commits from 2010; and a lot of activity in 2009. My guess is that there is sufficient sample data there for demoing the various plot types matplotlib has, and almost all new examples can use the existing data. I seem to recall that the Debian packager wanted a self-contained package that doesn't download data during the documentation build. I guess there's some special patch for Debian to handle this.
… files from an installed sample_data directory. Include the sample data locally. Remove sample data that is no longer used.
I've attached a commit that includes all of the sample data locally and makes |
@mdboom: Looks good. I would be prepared to merge it. One concern however: how big is the sample data being added? |
@pelson, it is only 1.4 MB. |
It appears that cbook.get_sample_data is still broken, at least in some configurations. I start with an empty cache (custom MPLCONFIGDIR) and run an example that requires the sample data once, and it pulls from github. All is well. When I run the example a second time, I get a failure (traceback below).
Running in v.1.1.x branch with commit 0c7f83d on openSUSE, Python 2.7, 64-bit. Here is the --verbose-debug output:
remove the sample_data cache
johnh@lettuce:doc> rm -rf /export/home/johnh/.matplotlib.linux/sample_data
first run is OK
johnh@lettuce:doc> MPLCONFIGDIR=/export/home/johnh/.matplotlib.linux PYTHONPATH=/export/home/johnh/devlinux/lib64/python2.7/site-packages/ python ../examples/pylab_examples/image_demo3.py -dGTKAgg --verbose-debug
$HOME=/home/titan/johnh
matplotlib data path /export/home/johnh/devlinux/lib64/python2.7/site-packages/matplotlib/mpl-data
loaded rc file /export/home/johnh/matplotlib.matplotlib/doc/matplotlibrc
matplotlib version 1.1.0
verbose.level debug
interactive is False
platform is linux2
loaded modules: ['heapq', snip...]
CONFIGDIR=/export/home/johnh/.matplotlib.linux
Using fontManager instance from /export/home/johnh/.matplotlib.linux/fontList.cache
backend GTKAgg version 2.22.0
ViewVCCachedServer: files listed in cache.pck: set([])
ViewVCCachedServer: files in cache directory: set([])
ViewVCCachedServer: retrieving https://raw.github.com/matplotlib/sample_data/master/lena.jpg
ViewVCCachedServer: received response 200: OK
second run crashes
johnh@lettuce:doc> MPLCONFIGDIR=/export/home/johnh/.matplotlib.linux PYTHONPATH=/export/home/johnh/devlinux/lib64/python2.7/site-packages/ python ../examples/pylab_examples/image_demo3.py -dGTKAgg --verbose-debug
$HOME=/home/titan/johnh
matplotlib data path /export/home/johnh/devlinux/lib64/python2.7/site-packages/matplotlib/mpl-data
loaded rc file /export/home/johnh/matplotlib.matplotlib/doc/matplotlibrc
matplotlib version 1.1.0
verbose.level debug
interactive is False
platform is linux2
loaded modules: ['heapq', snip...]
CONFIGDIR=/export/home/johnh/.matplotlib.linux
Using fontManager instance from /export/home/johnh/.matplotlib.linux/fontList.cache
backend GTKAgg version 2.22.0
ViewVCCachedServer: files listed in cache.pck: set(['/export/home/johnh/.matplotlib.linux/sample_data/lena.jpg'])
ViewVCCachedServer: files in cache directory: set(['/export/home/johnh/.matplotlib.linux/sample_data/lena.jpg', '/export/home/johnh/.matplotlib.linux/sample_data/cache.pck'])
ViewVCCachedServer: retrieving https://raw.github.com/matplotlib/sample_data/master/lena.jpg
Traceback (most recent call last):
  File "../examples/pylab_examples/image_demo3.py", line 10, in <module>
    datafile = cbook.get_sample_data('lena.jpg')
  File "/export/home/johnh/devlinux/lib64/python2.7/site-packages/matplotlib/cbook.py", line 688, in get_sample_data
    return myserver.get_sample_data(fname, asfileobj=asfileobj)
  File "/export/home/johnh/devlinux/lib64/python2.7/site-packages/matplotlib/cbook.py", line 617, in get_sample_data
    response = self.opener.open(url)
  File "/usr/lib64/python2.7/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1197, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1158, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib64/python2.7/httplib.py", line 946, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python2.7/httplib.py", line 986, in _send_request
    self.putheader(hdr, value)
  File "/usr/lib64/python2.7/httplib.py", line 924, in putheader
    str = '%s: %s' % (header, '\r\n\t'.join(values))
TypeError: sequence item 0: expected string, NoneType found
johnh@lettuce:doc>
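The final frame of the traceback joins the header values with `'\r\n\t'.join(values)`, which raises as soon as one value is None. Below is a minimal sketch of that failure and one possible guard, written in Python 3 (where the message wording differs slightly from the Python 2.7 log above). `format_header` and `drop_unset` are hypothetical names, and the cached values are assumptions: the thread does not state which header was actually None in this report.

```python
def format_header(header, *values):
    # Mirrors the formatting expression in the traceback's final frame.
    return "%s: %s" % (header, "\r\n\t".join(values))


def drop_unset(headers):
    """Return a copy of headers without the entries whose value is None."""
    return {name: value for name, value in headers.items() if value is not None}


# Assumed cached validators: an ETag was stored, Last-Modified was never sent.
cached = {"If-None-Match": '"abc123"', "If-Modified-Since": None}

try:
    format_header("If-Modified-Since", cached["If-Modified-Since"])
    failed = False
except TypeError:
    # str.join refuses non-string items, which is the error in the report.
    failed = True

# Filtering first means only real string values ever reach the join.
safe = drop_unset(cached)
lines = [format_header(name, value) for name, value in safe.items()]
```

This would also be consistent with the first run succeeding: with an empty cache there are no stored validators to send, so no None value can reach the header code.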