
Add helper for downloading all submissions in a subreddit between two timestamps (Implements: #337) #554

Merged
merged 16 commits into praw-dev:master on Dec 18, 2015

Conversation

Contributor

@eleweek eleweek commented Nov 24, 2015

Hi, I've used this trick quite a lot, so I took some time and implemented a helper for praw. I hope you'll find it useful!

  1. Reddit seems to use the created attribute for "timestamp" queries, and created is a sort of broken timestamp. I find this very confusing, and I always forget to convert time.time() output to a "reddit timestamp". That's why I chose to make highest_timestamp and lowest_timestamp normal unix timestamps (even though this is inconsistent with r.search).

  2. I am not a native speaker, so sorry in advance for possible grammatical errors in the docstring.

  3. The basic algorithm is fairly simple: it slides a dynamic-length window over reddit timestamps and tries to keep 30..99 results within the window. Whenever that condition breaks, it increases or decreases the window length by a factor of 2. I never really bothered to optimize these parameters; they are good enough for me. Maybe it would be a good idea to expose them as function parameters.

  4. Naming! TBH, I am not sure if all_submissions is a good name.

Any feedback is appreciated!
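To make point 3 concrete, here is a minimal sketch of the windowing idea. This is illustrative only, not the PR's actual code: `count_results(lo, hi)` is a hypothetical stand-in for running a cloudsearch timestamp query and counting the submissions returned, and the real helper's parameters differ.

```python
# Sketch of the dynamic-window scan described in point 3 (illustrative
# only). count_results(lo, hi) stands in for a cloudsearch timestamp
# query returning the number of submissions in that range.

def iterate_windows(lowest, highest, count_results, initial_window=3600):
    """Yield (lo, hi) windows covering [lowest, highest], resizing the
    window by a factor of 2 to keep roughly 30..99 results per window."""
    window = initial_window
    lo = lowest
    while lo < highest:
        hi = min(lo + window, highest)
        n = count_results(lo, hi)
        if n > 99 and window > 1:
            window = max(window // 2, 1)  # too many results: shrink
        elif n < 30 and hi < highest:
            window *= 2                   # too few results: grow
        else:
            yield (lo, hi)
            lo = hi
```

The guard on `hi < highest` makes the final, possibly sparse window still get yielded instead of growing forever past the end of the range.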


if highest_timestamp is None:
    # convert normal unix timestamps into broken reddit timestamps
    highest_timestamp = int(time.time()) + 28800
Contributor

Slightly confused: can you clarify what you mean by 'broken' and, if it is indeed broken, you may want to file an issue on reddit/reddit

Contributor Author

Broken == looks like unix timestamp, but isn't. :) Compare "created" and "created_utc" attributes.

I doubt they are going to change the meaning of created attribute, especially given that you can get the proper timestamp from created_utc

https://www.reddit.com/r/redditdev/comments/29991t/whats_the_difference_between_created_and_created/
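To illustrate the mismatch, here is a tiny pair of conversion helpers. The 28800-second offset matches what this PR uses empirically; as discussed later in the thread, it is not a documented guarantee and could change, so treat these as a sketch rather than a stable API.

```python
# Illustration of the created / created_utc mismatch. The 28800-second
# (8-hour) offset is an empirical observation at the time of this PR,
# not a documented guarantee.

REDDIT_TIMESTAMP_OFFSET = 28800

def to_reddit_timestamp(unix_ts):
    """Convert a unix timestamp to the created-style value that
    cloudsearch timestamp queries appear to expect."""
    return int(unix_ts) + REDDIT_TIMESTAMP_OFFSET

def to_unix_timestamp(reddit_ts):
    """Inverse conversion, back to a normal unix timestamp."""
    return int(reddit_ts) - REDDIT_TIMESTAMP_OFFSET
```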

Contributor

Ah :P

If cloudsearch is still using the broken created, I'd assume that's a bug since that's just weird and confusing for devs in general.

Contributor Author

It's been reported before: https://www.reddit.com/r/bugs/comments/29ajsg/cloudsearch_syntax_timestamp_seaches_are/

It doesn't look like reddit admins care about this problem. :/

@eleweek
Contributor Author

eleweek commented Nov 25, 2015

I've updated the diff and addressed some of the feedback from @13steinj. I also realized that I hadn't actually made use of lowest_timestamp, so I fixed this (and tested the fix).

elif not isinstance(subreddit, six.string_types):
    lowest_timestamp = subreddit.created
elif subreddit != "all":
    lowest_timestamp = reddit_session.get_subreddit(subreddit).created
Contributor

You may want to change this to elif subreddit not in ("all", "contrib", "mod", "friend") (maybe even add "front" and document it so people can just search the front page; if it is "front", pass subreddit=None in the search). Other than that, looks cool :P

@eleweek
Contributor Author

eleweek commented Nov 29, 2015

Bump!

@@ -111,6 +111,104 @@ def valid_redditors(redditors, sub):
if resp['ok']]


def all_submissions(reddit_session,
                    subreddit,
                    highest_timestamp=None,
Member

Does it make sense to switch lowest and highest?

Contributor Author

Yep. I'll fix this.

@bboe
Member

bboe commented Nov 30, 2015

This addition looks awesome. I really like this helper. There are a few order-based things I asked questions about; otherwise, I think the next step is to add some tests. Given that there is a lot of code in this function, I'd really like to see 100% branch coverage. Great work!

@eleweek
Contributor Author

eleweek commented Dec 4, 2015

Hi, I've addressed some of your feedback! I am not sure how to achieve 100% branch coverage with betamax, though, since time is constantly moving forward and the recorded requests will become outdated the next second.

@13steinj
Contributor

13steinj commented Dec 5, 2015

@eleweek betamax should be caching any and all requests made. Meaning: if you, say, search via new and ensure that the first post is the newest post, and you do it on a small subreddit (like one of the api test subreddits listed in PrawTest's configure), there shouldn't be a change in what is listed. Any further times the test is run, betamax reads from the cache instead of making the request.

@eleweek
Contributor Author

eleweek commented Dec 5, 2015

Any further times the test is run, instead of making the request, betamax would be used to read from the cache.

Unfortunately, this isn't going to work: time goes forward => the time.time() return value changes => the set of requests changes.

@bboe
Member

bboe commented Dec 5, 2015

@eleweek You can mock out the calls to time so that they always return the values you want when you run the tests. It's a little tricky if you've never done it before, but it's a very useful tool to have for writing automated tests. There is some information here if you're up to it:
http://stackoverflow.com/questions/4481954/python-trying-to-mock-datetime-date-today-but-not-working

@eleweek
Contributor Author

eleweek commented Dec 5, 2015

@bboe Thanks for the pointer, I've tried to make it work, but unfortunately it leads to this output when using nose:

Ran 261 tests in -422558.602s

I'm mocking time as:

import time

def mock_time():
    return 1448923624
time.time = mock_time

@eleweek
Contributor Author

eleweek commented Dec 5, 2015

Actually, mocking time in setUp() and un-mocking it in tearDown() seems to work, although this leads to an incorrect running time being reported if you press Ctrl+C while this test case is running. Hopefully this is acceptable.
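The setUp/tearDown approach can be sketched like this; the class name and the fixed timestamp are illustrative, not the PR's actual test code.

```python
# Sketch of freezing time.time in setUp and restoring it in tearDown so
# betamax-recorded requests stay identical across runs. The class name
# and the fixed timestamp are illustrative.
import time
import unittest

FIXED_TIME = 1448923624

class TimestampSearchTest(unittest.TestCase):
    def setUp(self):
        self._real_time = time.time      # keep a reference to the original
        time.time = lambda: FIXED_TIME   # freeze the clock for this test

    def tearDown(self):
        time.time = self._real_time      # restore even if the test fails

    def test_time_is_frozen(self):
        self.assertEqual(time.time(), FIXED_TIME)
```

Because tearDown runs after each test method (even failing ones), the global clock is only wrong inside the test body itself, which matches the Ctrl+C caveat above.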

@eleweek
Contributor Author

eleweek commented Dec 5, 2015

Okay, I actually managed to fully resolve the issue with the incorrect running-time values! I've updated the PR.

@Damgaard
Contributor

Damgaard commented Dec 6, 2015

Hey

As far as I remember from way back, the created value is the time in the server's timezone. Since reddit may move servers from one timezone to another at any point, and may have servers in multiple distinct timezones, the created value should not be trusted to be a consistent offset from UTC. As such, only created_utc should ever be used, since it is the only reliable timestamp.

IIRC the created attribute is a legacy value that is only kept around because removing it would be a backwards-incompatible change that could break an unknown number of apps that expect the value to exist. Also, due to the openness of reddit's API, there is no known way to contact all consumers of the API to inform them of a backwards-incompatible change. In other words, removing created was considered such a complicated, dangerous change that it was postponed in favor of getting more pressing changes out. With that in mind, created should be considered deprecated, dangerous, and not to be used.

I'm not 100% sure on this information; it's been quite a while since I've been in the nitty-gritty of PRAW/reddit. But I wanted to pipe in, since it looks like this PR assumes a fixed relationship between created and unix time, which could change at any time and isn't even guaranteed to be a constant. This could prove to be a very serious maintenance headache years in the future if reddit ever arbitrarily decides to change the server's timezone.

@Damgaard
Contributor

Damgaard commented Dec 6, 2015

I did another quick read-through, and it looks like the server expects the timestamp arguments in the timezone of created, which is what causes the conversions between unix time and reddit time. In which case, if my memory of the above is correct, what is being done here seems to be the only way to do this from PRAW's POV. But it seems quite fragile for something that is part of PRAW and not an external script.

The proper solution would then be to update the respective endpoint at reddit to accept UTC instead of (or as an alternative to) timezone-specific time. This would embrace the deprecated nature of created and allow the server timezone to be changed arbitrarily without breaking anything. PRAW could then use created_utc safely, in the knowledge that it will be consistent forever.

@eleweek
Contributor Author

eleweek commented Dec 6, 2015

@Damgaard I do agree with you that it would be better to use created_utc rather than the deprecated and confusing created. The issue has been raised before. Alas, reddit admins aren't particularly responsive.

The proper solution would then be to update the respective endpoint at reddit to accept utc instead of timezone specific time or as an alternative.

Personally, I find reddit admins extremely unresponsive. I've reported multiple bugs (got zero response) and tried to start a discussion on slightly changing a different endpoint (again, zero response), so even though I don't mind contributing to reddit (I actually fixed a bug in reddit once), I have little desire to attempt breaking changes or significant modifications.

@eleweek
Contributor Author

eleweek commented Dec 8, 2015

Bump! Is there any chance that this could be merged without changing reddit's source?

@bboe
Member

bboe commented Dec 8, 2015

Is there any chance that this could be merged without changing reddit's source?

Yes, while I agree with @Damgaard that UTC timestamps should be used, I think something that works initially is better than nothing.

Regarding timing, it may be a few days until I can take a solid look before merging. Maybe @Damgaard or @voussoir can before then? If there's no activity before Monday, please bump again, and thanks for the contribution.

verbosity=3))

def get_submissions_data(submissions_list):
    return [(s.created_utc, s.title, s.url)
Contributor

  1. Would it be sufficient to use the item's ID / fullname instead of these attributes? While it's unlikely that two posts would collide here, these attributes aren't necessarily unique identifiers.

  2. Instead of creating and comparing two lists of IDs, maybe we can zip the search results? Like:

    def assert_equal_submissions(list1, list2):
        self.assertEqual(len(list1), len(list2))
        for (item1, item2) in zip(list1, list2):
            self.assertEqual(item1.fullname, item2.fullname)
    
    assert_equal_submissions(all_subs, all_subs_sr_object)
    assert_equal_submissions(all_subs, reversed(all_subs_reversed))
    ...
    

    I think the intention is more clear this way, let me know what you think.

Member

Two lists of submissions should be directly comparable:

[<praw.objects.Submission at 0x110755ed0>,
 <praw.objects.Submission at 0x110755fd0>,
 <praw.objects.Submission at 0x110755150>,
 <praw.objects.Submission at 0x110755210>,
 <praw.objects.Submission at 0x1106c6b10>]

In [12]: b
Out[12]: 
[<praw.objects.Submission at 0x1106ca850>,
 <praw.objects.Submission at 0x1106bb0d0>,
 <praw.objects.Submission at 0x1106bb3d0>,
 <praw.objects.Submission at 0x1106bb290>,
 <praw.objects.Submission at 0x1106bb190>]

In [13]: a == b
Out[13]: True

Note that the objects are at different addresses, but their built-in comparison enables them to compare as equal if they're the same type and have the same id.

Contributor

Ah, whoops! It's always the simplest answer that I forget. This is much better.

Contributor Author

@bboe Oh, thanks! Not sure why I didn't realize that.

@voussoir
Contributor

voussoir commented Dec 9, 2015

It's a shame the API isn't using the UTC timestamp, but it used to be much worse! I don't know when the created-created_utc offset became 28,800 globally but it must have been recent.

I haven't run any of your code, but it looks great to me, and the tests seem to agree. I might try to get a little more playful with the window_size modifiers instead of using a hard 2x multiplier, though. For quick results, we want to get as close to our search_limit as we can on every loop, and avoid dawdling in the 30-90 range if possible.

There's probably no perfect solution to this, but maybe you can increase and decrease by a certain percentage, and only get aggressive after repeated failures.
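One possible reading of that suggestion, sketched below; the specific percentage, escalation threshold, and function shape are all illustrative assumptions, not part of the PR.

```python
# Sketch of the gentler resizing suggested above: adjust the window by a
# small percentage first, and only escalate to aggressive doubling and
# halving after repeated misses. All parameter values are illustrative.

def next_window_size(window, result_count, misses,
                     target_low=30, target_high=99):
    """Return (new_window, new_miss_count) after one probe of a window
    that yielded result_count results."""
    if target_low <= result_count <= target_high:
        return window, 0                    # in range: reset miss counter
    factor = 1.25 if misses < 3 else 2.0    # get aggressive after 3 misses
    if result_count > target_high:
        return window / factor, misses + 1  # too many results: shrink
    return window * factor, misses + 1      # too few results: grow
```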

 

As for the function name, maybe just timestamp_search? I'm having a hard time coming up with something descriptive and concise, haha.

 

For bonus points, do you think this function can accept some other search parameters? Cloudsearch is kind of annoying to format, but it's like a prefix expression I guess:

if username is not None:
    query = '(and author:"%s" timestamp:%d..%d)' % (username, lower, upper)
else:
    query = 'timestamp:%d..%d' % (lower, upper)

This opens up a lot of possibilities, so if the formatting doesn't get too crazy it may be worth the effort.
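One way to generalize that formatting to arbitrary fields is sketched below; the function name, the dict-based interface, and the quoting rules are assumptions for illustration, and the PR's actual extra_cloudsearch_fields handling may differ.

```python
# Sketch of building a cloudsearch query from a timestamp range plus an
# optional dict of extra fields. Field names and quoting are illustrative;
# cloudsearch's real grammar has more operators than shown here.

def build_cloudsearch_query(lower, upper, extra_fields=None):
    timestamp_clause = 'timestamp:%d..%d' % (lower, upper)
    if not extra_fields:
        return timestamp_clause
    clauses = ['%s:"%s"' % (field, value)
               for field, value in sorted(extra_fields.items())]
    return '(and %s %s)' % (' '.join(clauses), timestamp_clause)
```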

Sorry I haven't been too active recently, this is a great contribution and I wish I had participated more.

@bboe
Member

bboe commented Dec 15, 2015

@eleweek Overall, I think we're just about ready. Perhaps we can simplify the test by directly comparing lists of submission objects, and rename the method. Then it should be good to go!

Thanks for the persistence and great work.

@eleweek
Contributor Author

eleweek commented Dec 15, 2015

@voussoir @bboe Thanks for your feedback. I was busy last week, so I didn't get around to updating the PR. I'll probably do this over the next couple of days.

@bboe
Member

bboe commented Dec 15, 2015

I'll probably do this over the next couple of days.

Awesome! Please mention us when you've done that so we know when to take a look.

params, extra filtering by cloud search fields
@eleweek
Contributor Author

eleweek commented Dec 17, 2015

@bboe @voussoir I've updated the PR! Changed the window-sliding params and added extra filtering by other cloudsearch fields.

@bboe
Member

bboe commented Dec 17, 2015

Looks awesome. Just two bits of feedback, then I think we're good to go.

@voussoir
Contributor

@eleweek I'm really glad you added the additional cloudsearch parameter support, thanks a lot! All of the changes get a thumbs up from me.

I'm still seeing extra_cloudsearch_fields={} in the latest commit, 601fb8a, though. Can you make sure that gets changed to None before we merge? Following the timestamps, it looks like the commit was made after Bryce's comment, so I think you forgot to change it.

up unittests a bit, extra_cloudsearch_fields is now None by default
@eleweek
Contributor Author

eleweek commented Dec 18, 2015

@voussoir It was late where I live, and I figured I could push a commit that fixes the CI problems and address those issues later.

I fixed it now though.

Also pinging @bboe just in case

@voussoir
Contributor

Great, thanks! Just had to double check.

@bboe, I'm happy with the PR when you are.

bboe added a commit that referenced this pull request Dec 18, 2015
Add helper for downloading all submissions in a subreddit between two timestamps (Implements: #337)
@bboe bboe merged commit cb33a4e into praw-dev:master Dec 18, 2015
@bboe
Member

bboe commented Dec 18, 2015

Awesome work @eleweek! Thanks for reviewing @voussoir, @13steinj and @Damgaard. 👏

@eleweek
Contributor Author

eleweek commented Dec 18, 2015

Thank you @bboe , @voussoir and @13steinj . This was fun :)
