Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache page text until next edit operation #80

Merged
merged 1 commit into from Sep 27, 2015

Conversation

AdamWill
Copy link
Member

Store the results of page.text() operations in a simple cache dict. This avoids unnecessary remote roundtrips. Cache is cleared on each successful page save operation; we might want to also have a method to clear the cache, TTL, whatever, if folks are likely to keep Page instances around a long time and need to refresh them for possible third party edits.

This also gets rid of the section attribute, which looks bogus to me. All it appears to achieve is that if you retrieve the text of a particular section, then run a 'save' operation without explicitly specifying a section, the save operation is applied to the same section. This seems a completely wrong and potentially dangerous assumption. Maybe I'm missing something, though.

This is a pretty simple-minded implementation I wouldn't necessarily expect to get merged as-is, view it as a proof-of-concept / prototype rather than something really mergeable, I think. Just wanted to provide a bit more flesh than simply a feature request. My use case for this is a case where I want to sort of assemble a set of distinct edits to a page before firing them all together, and in order to figure out all the edits, I have to poke through the existing page text for each distinct one, finding the appropriate bit to change.

@danmichaelo
Copy link
Member

Agree that the section part is dangerous and should be removed, but it should be done in a separate commit. Could you make a separate pull request for it?

Caching also seems like a valid feature to add. There should probably also be some way to set cache size limit, so you can limit memory usage when gathering data from a large number of pages without editing. Or perhaps just a flag to disable caching?

For bypassing the cache on single requests, perhaps we could just add an argument to the text() method?

def text(self, section=None, expandtemplates=False, nocache=False):

@AdamWill
Copy link
Member Author

Ah, the travis error is genuine too (PEP-8 in my stuff): I'll fix that.

Agree on the separate PR for the section stuff, will do. I'm guessing it goes back to the era where there was edit() and save() and it was assumed that you were going to call save() right after you called edit().

Honestly I'm looking for feedback from you on how you'd like the design refined/improved, I'm just a monkey with a bunch of Stack Overflow tabs :P I'm using pretty much this code in my project ATM (in my subclass of Page), but my project is fairly specific in the kinds of page reads and edits it usually makes, so I don't have an awesome feel for what would be most appropriate for other uses. I'd be fine with adding arguments for __init__ and text which would disable caching on a per-instance or per-operation basis, if that's what you think is best. Or the conservative thing would be to have the per-instance arg's default as False, and require the caller to set the arg True to use caching. WDYT?

@AdamWill AdamWill force-pushed the text-cache branch 2 times, most recently from 87c3ae1 to f148b59 Compare March 25, 2015 05:35
@AdamWill
Copy link
Member Author

There's a version which makes the caching optional (per instance) and also allows the setting to be overridden with each text() call. wdyt? untested as yet, sorry. oh, it preserves the existing section behaviour for now.

@waldyrious
Copy link
Member

ping @danmichaelo

@AdamWill
Copy link
Member Author

I think I'll try rebasing this once Dan lets me know how he'd like #81 done and we get that merged - how this should behave will be clearer then.

@danmichaelo
Copy link
Member

Just merged #81 now. Seems like I can merge this one too if you do a rebase.

Two questions:

  • Do we really need a cache instance property? It adds a little extra work to finding the default value since you don't get it from the Page.text docstring.
  • Should we keep the default value as False or change it to True? As long as the cache is cleared on save, I don't see a big problem in having True as the default.

@AdamWill
Copy link
Member Author

In answer to both questions I guess I was imagining some kind of use case where you kept the Page instance lying around for some time, especially for a high-traffic page; it might get edited by someone else during its lifetime, and the contents of the cache would no longer be valid. I suppose just requiring people in that case to remember to do page.text(cache=False) isn't too onerous, though.

I was also just being a wuss and figuring 'well, if this breaks, at least if it's not the default it doesn't hurt too many people'. :P

@AdamWill AdamWill force-pushed the text-cache branch 2 times, most recently from 6aef4ee to f5566a4 Compare August 29, 2015 23:21
@AdamWill
Copy link
Member Author

OK, I re-diffed this with a very simple approach. Things to consider:

  • Should we store the result in the cache even when called with cache=False?
  • Should we provide a method for clearing the cache?

I'm not quite sure how best to add this to the tests, advice welcome.

@danmichaelo
Copy link
Member

In answer to both questions I guess I was imagining some kind of use case where you kept the Page instance lying around for some time, especially for a high-traffic page; it might get edited by someone else during its lifetime, and the contents of the cache would no longer be valid. I suppose just requiring people in that case to remember to do page.text(cache=False) isn't too onerous, though.

I think the most important is that the docstring is clear. It should tell if caching is default or not, and explain that caching is only for the lifetime of the object..

(Btw. If someone else makes an edit after you instantiated the page object, we still have the problem that Page.revision, Page.touched, etc. still retains their old, out-dated values.)

I was also just being a wuss and figuring 'well, if this breaks, at least if it's not the default it doesn't hurt too many people'. :P

I get that :)

@danmichaelo
Copy link
Member

  • Should we store the result in the cache even when called with cache=False?

Hm, not sure, but I think I would go for no. Perhaps there might be cases where you would like to save memory?

  • Should we provide a method for clearing the cache?

Doesn't really seem necessary to me.

I'm not quite sure how best to add this to the tests, advice welcome.

Perhaps

  • test that only one (mocked) api call is made if you call .text() twice with cache=True
  • test that two (mocked) api calls are made if you call .text() twice with cache=False

? Ask if you need help getting the tests working

@AdamWill AdamWill force-pushed the text-cache branch 2 times, most recently from a8512f3 to db886a0 Compare August 30, 2015 00:03
@AdamWill
Copy link
Member Author

Sorry to be dim, but it seems like calling page.text() twice in one function in test_page.py fails so long as one of the calls is with cache=False, and I can't really figure why. It's a weird crash in revs.next:

=================================== FAILURES ===================================
______________________ TestPageApiArgs.test_get_page_text ______________________

self = <test_page.TestPageApiArgs testMethod=test_get_page_text>

    def test_get_page_text(self):
        # Check that page.text() works, and that a correct API call is made
        text = self.page.text(cache=False)
>       text = self.page.text()

tests/test_page.py:230: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py34/lib/python3.4/site-packages/mwclient/page.py:157: in text
    rev = revs.next()
.tox/py34/lib/python3.4/site-packages/mwclient/listing.py:64: in next
    return self.__next__(full)
.tox/py34/lib/python3.4/site-packages/mwclient/listing.py:60: in __next__
    return List.next(self, full=full)
.tox/py34/lib/python3.4/site-packages/mwclient/listing.py:64: in next
    return self.__next__(full)
.tox/py34/lib/python3.4/site-packages/mwclient/listing.py:45: in __next__
    item['timestamp'] = parse_timestamp(item['timestamp'])
.tox/py34/lib/python3.4/site-packages/mwclient/util.py:7: in parse_timestamp
    return time.strptime(t, '%Y-%m-%dT%H:%M:%SZ')
/usr/lib64/python3.4/_strptime.py:494: in _strptime_time
    tt = _strptime(data_string, format)[0]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

data_string = time.struct_time(tm_year=2014, tm_mon=8, tm_mday=29, tm_hour=22, tm_min=25, tm_sec=15, tm_wday=4, tm_yday=241, tm_isdst=-1)
format = '%Y-%m-%dT%H:%M:%SZ'

    def _strptime(data_string, format="%a %b %d %H:%M:%S %Y"):
        """Return a 2-tuple consisting of a time struct and an int containing
        the number of microseconds based on the input string and the
        format string."""

        for index, arg in enumerate([data_string, format]):
            if not isinstance(arg, str):
                msg = "strptime() argument {} must be str, not {}"
>               raise TypeError(msg.format(index, type(arg)))
E               TypeError: strptime() argument 0 must be str, not <class 'time.struct_time'>

/usr/lib64/python3.4/_strptime.py:306: TypeError
=============== 1 failed, 39 passed, 12 skipped in 0.35 seconds ================
ERROR: InvocationError: '/home/adamw/local/mwclient-adamwill/.tox/py34/bin/py.test -v --pep8 mwclient tests'

This does seem to be specific to the test environment, a very simple test script:

#!/bin/python

import mwclient
site = mwclient.Site(('https', 'en.wikipedia.org'), '/w/')
page = site.pages['ISO_8601']
print(int(len(page.text(cache=False))))
print page._textcache
print(int(len(page.text())))
print page._textcache

(or anything similar to that) works fine.

@danmichaelo
Copy link
Member

I've struggled quite a bit with mocking… In this case, if you do

        print(self.site.api.return_value)
        text = self.page.text()
        print(self.site.api.return_value)

you'll notice that the object is altered; the timestamp is expanded

{'query': {'pages': {'2': {'title': 'Some page', 'ns': 0, 'pageid': 2, 'revisions': [{'timestamp': '2014-08-29T22:25:15Z', '*': 'Hello world'}]}}}}
{'query': {'pages': {'2': {'title': 'Some page', 'ns': 0, 'pageid': 2, 'revisions': [{'timestamp': time.struct_time(tm_year=2014, tm_mon=8, tm_mday=29, tm_hour=22, tm_min=25, tm_sec=15, tm_wday=4, tm_yday=241, tm_isdst=-1), '*': 'Hello world'}]}}}}

To fix that, we could deep-copy the original object before the call, and re-store it after the call to self.page.text():

        response = deepcopy(self.site.api.return_value)
        text = self.page.text(…)
        self.site.api.return_value = response

This is getting quite messy though.. perhaps it can be done slightly more cleanly by stepping one step up, mocking Page.revisions(), something like this:

    def test_get_page_text_cached(self):
        self.page.revisions = mock.Mock(return_value=iter([]))
        self.page.text()
        self.page.text()
        assert self.page.revisions.call_count == 2

Store the results of page.text() operations in a simple cache
dict. This avoids unnecessary remote roundtrips. Cache is
cleared on each successful page.save() operation. cache
argument can be set to 'False' to disable use of the cache.
@AdamWill
Copy link
Member Author

So your idea works, but only if we change rev = revs.next() to rev = next(revs) in page.py. This is because the iterator created by iter in Python 3 has no next method.

next(foo) works in both Python 2 and Python 3, but it was only added to Python 2 in 2.6. If you're trying to keep mwclient compatible with 2.5 this could be a problem - are you?

I generally try to keep my stuff working with 2.6, but 2.5 seems like just too much work...

edit: I see index.rst states 2.6, so I guess this should be fine.

@danmichaelo
Copy link
Member

Yeah, keeping 2.5 compability was too hard, so I dropped that.

@danmichaelo
Copy link
Member

Looks good now 😎 Sorry for the delay!

danmichaelo added a commit that referenced this pull request Sep 27, 2015
Cache page text until next edit operation
@danmichaelo danmichaelo merged commit 970592b into mwclient:master Sep 27, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants