cache page text until next edit operation #80

AdamWill · 2015-03-23T23:12:55Z

Store the results of page.text() operations in a simple cache dict. This avoids unnecessary remote roundtrips. Cache is cleared on each successful page save operation; we might want to also have a method to clear the cache, TTL, whatever, if folks are likely to keep Page instances around a long time and need to refresh them for possible third party edits.

This also gets rid of the section attribute, which looks bogus to me. All it appears to achieve is that if you retrieve the text of a particular section, then run a 'save' operation without explicitly specifying a section, the save operation is applied to the same section. This seems a completely wrong and potentially dangerous assumption. Maybe I'm missing something, though.

This is a pretty simple-minded implementation I wouldn't necessarily expect to get merged as-is, view it as a proof-of-concept / prototype rather than something really mergeable, I think. Just wanted to provide a bit more flesh than simply a feature request. My use case for this is a case where I want to sort of assemble a set of distinct edits to a page before firing them all together, and in order to figure out all the edits, I have to poke through the existing page text for each distinct one, finding the appropriate bit to change.

danmichaelo · 2015-03-24T23:29:47Z

Agree that the section part is dangerous and should be removed, but it should be done in a separate commit. Could you make a separate pull request for it?

Caching also seems like a valid feature to add. There should probably also be some way to set cache size limit, so you can limit memory usage when gathering data from a large number of pages without editing. Or perhaps just a flag to disable caching?

For bypassing the cache on single requests, perhaps we could just add an argument to the text() method?

def text(self, section=None, expandtemplates=False, nocache=False):

AdamWill · 2015-03-25T03:44:54Z

Ah, the travis error is genuine too (PEP-8 in my stuff): I'll fix that.

Agree on the separate PR for the section stuff, will do. I'm guessing it goes back to the era where there was edit() and save() and it was assumed that you were going to call save() right after you called edit().

Honestly I'm looking for feedback from you on how you'd like the design refined/improved, I'm just a monkey with a bunch of Stack Overflow tabs :P I'm using pretty much this code in my project ATM (in my subclass of Page), but my project is fairly specific in the kinds of page reads and edits it usually makes, so I don't have an awesome feel for what would be most appropriate for other uses. I'd be fine with adding arguments for __init__ and text which would disable caching on a per-instance or per-operation basis, if that's what you think is best. Or the conservative thing would be to have the per-instance arg's default as False, and require the caller to set the arg True to use caching. WDYT?

AdamWill · 2015-03-25T05:35:50Z

There's a version which makes the caching optional (per instance) and also allows the setting to be overridden with each text() call. wdyt? untested as yet, sorry. oh, it preserves the existing section behaviour for now.

waldyrious · 2015-06-20T10:17:15Z

ping @danmichaelo

AdamWill · 2015-08-29T05:48:52Z

I think I'll try rebasing this once Dan lets me know how he'd like #81 done and we get that merged - how this should behave will be clearer then.

danmichaelo · 2015-08-29T20:16:12Z

Just merged #81 now. Seems like I can merge this one too if you do a rebase.

Two questions:

Do we really need a cache instance property? It adds a little extra work to finding the default value since you don't get it from the Page.text docstring.
Should we keep the default value as False or change it to True? As long as the cache is cleared on save, I don't see a big problem in having True as the default.

AdamWill · 2015-08-29T21:37:48Z

In answer to both questions I guess I was imagining some kind of use case where you kept the Page instance lying around for some time, especially for a high-traffic page; it might get edited by someone else during its lifetime, and the contents of the cache would no longer be valid. I suppose just requiring people in that case to remember to do page.text(cache=False) isn't too onerous, though.

I was also just being a wuss and figuring 'well, if this breaks, at least if it's not the default it doesn't hurt too many people'. :P

AdamWill · 2015-08-29T23:24:04Z

OK, I re-diffed this with a very simple approach. Things to consider:

Should we store the result in the cache even when called with cache=False?
Should we provide a method for clearing the cache?

I'm not quite sure how best to add this to the tests, advice welcome.

danmichaelo · 2015-08-29T23:25:29Z

In answer to both questions I guess I was imagining some kind of use case where you kept the Page instance lying around for some time, especially for a high-traffic page; it might get edited by someone else during its lifetime, and the contents of the cache would no longer be valid. I suppose just requiring people in that case to remember to do page.text(cache=False) isn't too onerous, though.

I think the most important is that the docstring is clear. It should tell if caching is default or not, and explain that caching is only for the lifetime of the object..

(Btw. If someone else makes an edit after you instantiated the page object, we still have the problem that Page.revision, Page.touched, etc. still retains their old, out-dated values.)

I was also just being a wuss and figuring 'well, if this breaks, at least if it's not the default it doesn't hurt too many people'. :P

I get that :)

danmichaelo · 2015-08-29T23:33:55Z

Should we store the result in the cache even when called with cache=False?

Hm, not sure, but I think I would go for no. Perhaps there might be cases where you would like to save memory?

Should we provide a method for clearing the cache?

Doesn't really seem necessary to me.

I'm not quite sure how best to add this to the tests, advice welcome.

Perhaps

test that only one (mocked) api call is made if you call .text() twice with cache=True
test that two (mocked) api calls are made if you call .text() twice with cache=False

? Ask if you need help getting the tests working

AdamWill · 2015-08-30T00:07:06Z

Sorry to be dim, but it seems like calling page.text() twice in one function in test_page.py fails so long as one of the calls is with cache=False, and I can't really figure why. It's a weird crash in revs.next:

=================================== FAILURES ===================================
______________________ TestPageApiArgs.test_get_page_text ______________________

self = <test_page.TestPageApiArgs testMethod=test_get_page_text>

    def test_get_page_text(self):
        # Check that page.text() works, and that a correct API call is made
        text = self.page.text(cache=False)
>       text = self.page.text()

tests/test_page.py:230: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py34/lib/python3.4/site-packages/mwclient/page.py:157: in text
    rev = revs.next()
.tox/py34/lib/python3.4/site-packages/mwclient/listing.py:64: in next
    return self.__next__(full)
.tox/py34/lib/python3.4/site-packages/mwclient/listing.py:60: in __next__
    return List.next(self, full=full)
.tox/py34/lib/python3.4/site-packages/mwclient/listing.py:64: in next
    return self.__next__(full)
.tox/py34/lib/python3.4/site-packages/mwclient/listing.py:45: in __next__
    item['timestamp'] = parse_timestamp(item['timestamp'])
.tox/py34/lib/python3.4/site-packages/mwclient/util.py:7: in parse_timestamp
    return time.strptime(t, '%Y-%m-%dT%H:%M:%SZ')
/usr/lib64/python3.4/_strptime.py:494: in _strptime_time
    tt = _strptime(data_string, format)[0]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

data_string = time.struct_time(tm_year=2014, tm_mon=8, tm_mday=29, tm_hour=22, tm_min=25, tm_sec=15, tm_wday=4, tm_yday=241, tm_isdst=-1)
format = '%Y-%m-%dT%H:%M:%SZ'

    def _strptime(data_string, format="%a %b %d %H:%M:%S %Y"):
        """Return a 2-tuple consisting of a time struct and an int containing
        the number of microseconds based on the input string and the
        format string."""

        for index, arg in enumerate([data_string, format]):
            if not isinstance(arg, str):
                msg = "strptime() argument {} must be str, not {}"
>               raise TypeError(msg.format(index, type(arg)))
E               TypeError: strptime() argument 0 must be str, not <class 'time.struct_time'>

/usr/lib64/python3.4/_strptime.py:306: TypeError
=============== 1 failed, 39 passed, 12 skipped in 0.35 seconds ================
ERROR: InvocationError: '/home/adamw/local/mwclient-adamwill/.tox/py34/bin/py.test -v --pep8 mwclient tests'

This does seem to be specific to the test environment, a very simple test script:

#!/bin/python

import mwclient
site = mwclient.Site(('https', 'en.wikipedia.org'), '/w/')
page = site.pages['ISO_8601']
print(int(len(page.text(cache=False))))
print page._textcache
print(int(len(page.text())))
print page._textcache

(or anything similar to that) works fine.

danmichaelo · 2015-08-30T14:50:34Z

I've struggled quite a bit with mocking… In this case, if you do

        print(self.site.api.return_value)
        text = self.page.text()
        print(self.site.api.return_value)

you'll notice that the object is altered; the timestamp is expanded

{'query': {'pages': {'2': {'title': 'Some page', 'ns': 0, 'pageid': 2, 'revisions': [{'timestamp': '2014-08-29T22:25:15Z', '*': 'Hello world'}]}}}}
{'query': {'pages': {'2': {'title': 'Some page', 'ns': 0, 'pageid': 2, 'revisions': [{'timestamp': time.struct_time(tm_year=2014, tm_mon=8, tm_mday=29, tm_hour=22, tm_min=25, tm_sec=15, tm_wday=4, tm_yday=241, tm_isdst=-1), '*': 'Hello world'}]}}}}

To fix that, we could deep-copy the original object before the call, and re-store it after the call to self.page.text():

        response = deepcopy(self.site.api.return_value)
        text = self.page.text(…)
        self.site.api.return_value = response

This is getting quite messy though.. perhaps it can be done slightly more cleanly by stepping one step up, mocking Page.revisions(), something like this:

    def test_get_page_text_cached(self):
        self.page.revisions = mock.Mock(return_value=iter([]))
        self.page.text()
        self.page.text()
        assert self.page.revisions.call_count == 2

Store the results of page.text() operations in a simple cache dict. This avoids unnecessary remote roundtrips. Cache is cleared on each successful page.save() operation. cache argument can be set to 'False' to disable use of the cache.

AdamWill · 2015-09-19T22:42:55Z

So your idea works, but only if we change rev = revs.next() to rev = next(revs) in page.py. This is because the iterator created by iter in Python 3 has no next method.

next(foo) works in both Python 2 and Python 3, but it was only added to Python 2 in 2.6. If you're trying to keep mwclient compatible with 2.5 this could be a problem - are you?

I generally try to keep my stuff working with 2.6, but 2.5 seems like just too much work...

edit: I see index.rst states 2.6, so I guess this should be fine.

danmichaelo · 2015-09-27T18:56:16Z

Yeah, keeping 2.5 compability was too hard, so I dropped that.

danmichaelo · 2015-09-27T18:58:41Z

Looks good now 😎 Sorry for the delay!

Cache page text until next edit operation

AdamWill force-pushed the text-cache branch 2 times, most recently from 87c3ae1 to f148b59 Compare March 25, 2015 05:35

AdamWill force-pushed the text-cache branch from f148b59 to 0ba4c04 Compare March 25, 2015 05:41

danmichaelo force-pushed the master branch from 20e514d to 20428a1 Compare July 18, 2015 19:57

danmichaelo mentioned this pull request Aug 29, 2015

drop Page.section attribute: it does wrong things #81

Closed

AdamWill force-pushed the text-cache branch 2 times, most recently from 6aef4ee to f5566a4 Compare August 29, 2015 23:21

AdamWill force-pushed the text-cache branch 2 times, most recently from a8512f3 to db886a0 Compare August 30, 2015 00:03

AdamWill force-pushed the text-cache branch from db886a0 to e8fc54d Compare September 19, 2015 22:40

danmichaelo added a commit that referenced this pull request Sep 27, 2015

Merge pull request #80 from AdamWill/text-cache

970592b

Cache page text until next edit operation

danmichaelo merged commit 970592b into mwclient:master Sep 27, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cache page text until next edit operation #80

cache page text until next edit operation #80

AdamWill commented Mar 23, 2015

danmichaelo commented Mar 24, 2015

AdamWill commented Mar 25, 2015

AdamWill commented Mar 25, 2015

waldyrious commented Jun 20, 2015

AdamWill commented Aug 29, 2015

danmichaelo commented Aug 29, 2015

AdamWill commented Aug 29, 2015

AdamWill commented Aug 29, 2015

danmichaelo commented Aug 29, 2015

danmichaelo commented Aug 29, 2015

AdamWill commented Aug 30, 2015

danmichaelo commented Aug 30, 2015

AdamWill commented Sep 19, 2015

danmichaelo commented Sep 27, 2015

danmichaelo commented Sep 27, 2015

cache page text until next edit operation #80

cache page text until next edit operation #80

Conversation

AdamWill commented Mar 23, 2015

danmichaelo commented Mar 24, 2015

AdamWill commented Mar 25, 2015

AdamWill commented Mar 25, 2015

waldyrious commented Jun 20, 2015

AdamWill commented Aug 29, 2015

danmichaelo commented Aug 29, 2015

AdamWill commented Aug 29, 2015

AdamWill commented Aug 29, 2015

danmichaelo commented Aug 29, 2015

danmichaelo commented Aug 29, 2015

AdamWill commented Aug 30, 2015

danmichaelo commented Aug 30, 2015

AdamWill commented Sep 19, 2015

danmichaelo commented Sep 27, 2015

danmichaelo commented Sep 27, 2015