Add helper for downloading all submissions in a subreddit between two timestamps (Implements: #337) #554
Conversation
```python
if highest_timestamp is None:
    # convert normal unix timestamps into broken reddit timestamps
    highest_timestamp = int(time.time()) + 28800
```
Slightly confused: can you clarify what you mean by 'broken'? And, if it is indeed broken, you may want to file an issue on reddit/reddit.
Broken == looks like a unix timestamp, but isn't. :) Compare the `created` and `created_utc` attributes.
I doubt they are going to change the meaning of the `created` attribute, especially given that you can get the proper timestamp from `created_utc`.
https://www.reddit.com/r/redditdev/comments/29991t/whats_the_difference_between_created_and_created/
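The conversion the diff hard-codes can be sketched as a small helper. The 28800-second constant is taken from this PR's diff; the helper name is hypothetical and it is an assumption that the offset stays constant:

```python
import time

# 28800 seconds is the created - created_utc offset used in this PR;
# assuming it stays constant for all submissions.
REDDIT_CREATED_OFFSET = 28800


def to_reddit_timestamp(unix_ts):
    """Convert a normal unix timestamp into the 'broken' value that
    reddit's timestamp search expects (hypothetical helper)."""
    return int(unix_ts) + REDDIT_CREATED_OFFSET


now_in_reddit_time = to_reddit_timestamp(time.time())
```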
Ah :P
If cloudsearch is still using the broken `created`, I'd assume that's a bug, since that's just weird and confusing for devs in general.
It's been reported before: https://www.reddit.com/r/bugs/comments/29ajsg/cloudsearch_syntax_timestamp_seaches_are/
It doesn't look like reddit admins care about this problem. :/
I've updated the diff and addressed some of the feedback from @13steinj. I also realized that I haven't actually made use of `lowest_timestamp` as a lower bound.
```python
elif not isinstance(subreddit, six.string_types):
    lowest_timestamp = subreddit.created
elif subreddit != "all":
    lowest_timestamp = reddit_session.get_subreddit(subreddit).created
```
You may want to change this to `elif subreddit not in ("all", "contrib", "mod", "friend")` (maybe even add `"front"` and put it in the doc so people can just search the front page, and if it is front, make `subreddit=None` be passed in the search). Other than that, looks cool :P
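The suggested check could be factored into a tiny predicate. This is only a sketch; the function name is hypothetical, and the list of pseudo-subreddits is taken from the comment above:

```python
# Pseudo-subreddits from the review comment; they are not real subreddits,
# so subreddit.created cannot be fetched for them.
PSEUDO_SUBREDDITS = ("all", "contrib", "mod", "friend")


def needs_created_lookup(subreddit_name):
    """Return True if the subreddit's `created` attribute should be
    fetched to use as the lower bound of the search window."""
    return subreddit_name not in PSEUDO_SUBREDDITS
```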
Bump!
```diff
@@ -111,6 +111,104 @@ def valid_redditors(redditors, sub):
             if resp['ok']]


 def all_submissions(reddit_session,
                     subreddit,
                     highest_timestamp=None,
```
Does it make sense to switch lowest and highest?
Yep. I'll fix this.
This addition looks awesome. I really like this helper. There are a few order-based things I asked questions about; otherwise, I think the next step is to add some tests. Given that there is a lot of code in this function, I'd really like to see 100% branch coverage. Great work!
Hi, I've addressed some of your feedback! I am not sure how to achieve 100% branch coverage with betamax, though, since time is constantly moving forward and the recorded requests will become outdated the next second.
@eleweek betamax should be caching any and all requests made. Meaning: if you, say, search via new and ensure that the first post is the newest post, and you do it on a small subreddit (like one of the api test subreddits as listed in
Unfortunately, this isn't going to work: time goes forward => `time.time()` return value changes => the set of requests changes.
@eleweek You can mock out the calls to time so that they always return the values you want them to return when you run the tests. It's a little tricky if you've never done it before, but a very useful tool to have for writing automated tests. There is some information here if you're up to it:
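For reference, a minimal, self-contained sketch of mocking `time.time` with the standard library's `unittest.mock`; the function under test here is invented purely for illustration:

```python
import time
from unittest import mock


def is_recent(created_utc, max_age=3600):
    # Invented example function: True if the item is under an hour old.
    return time.time() - created_utc < max_age


# Freeze time.time so the result never depends on the wall clock.
with mock.patch("time.time", return_value=1000000):
    assert is_recent(999500)
    assert not is_recent(990000)
```

The same `mock.patch` context (or decorator) can wrap a betamax-backed test so that recorded requests stay valid on replay.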
@bboe Thanks for the pointer. I've tried to make it work, but unfortunately it leads to this output when using
I'm mocking time as:
Actually, mocking time in
Okay, actually, I managed to fully resolve the issue with incorrect running time values! I've updated the PR.
Hey! As far as I remember from way back, the
I'm not 100% on this information; it's been quite a while since I've been in the nitty gritty of PRAW/reddit. But I wanted to pipe in, since it looks like this PR assumes a fixed relationship between
I did another quick read through and it looks like the server expects the timestamp arguments in the timezone of
The proper solution would then be to update the respective endpoint at reddit to accept UTC instead of timezone-specific time, or as an alternative. This would embrace the deprecated nature of
@Damgaard I do agree with you that it would be better to use
Personally, I find reddit admins extremely unresponsive. I've reported multiple bugs (got zero response) and tried to start a discussion on slightly changing a different endpoint (again, zero response), so even though I don't mind contributing to reddit (I actually fixed a bug in reddit once), I have little desire to try to make breaking changes or significant modifications.
Bump! Is there any chance that this could be merged without changing reddit's source?
Yes, while I agree with @Damgaard that UTC timestamps should be used, I think something that works initially is better than nothing. Regarding timing, it may be a few days until I can take a solid look before merging. Maybe @Damgaard or @voussoir can before then? If there's no activity before Monday, please bump again, and thanks for the contribution.
```python
        verbosity=3))


def get_submissions_data(submissions_list):
    return [(s.created_utc, s.title, s.url)
```
- Would it be sufficient to use the item's ID / fullname instead of these attributes? While it's unlikely that two posts would collide here, these aren't necessarily unique identifiers.
- Instead of creating and comparing two lists of IDs, maybe we can zip the search results? Like:

```python
def assert_equal_submissions(list1, list2):
    self.assertEqual(len(list1), len(list2))
    for (item1, item2) in zip(list1, list2):
        self.assertEqual(item1.fullname, item2.fullname)

assert_equal_submissions(all_subs, all_subs_sr_object)
assert_equal_submissions(all_subs, reversed(all_subs_reversed))
...
```

I think the intention is clearer this way; let me know what you think.
Two lists of submissions should be directly comparable:

```
[<praw.objects.Submission at 0x110755ed0>,
 <praw.objects.Submission at 0x110755fd0>,
 <praw.objects.Submission at 0x110755150>,
 <praw.objects.Submission at 0x110755210>,
 <praw.objects.Submission at 0x1106c6b10>]

In [12]: b
Out[12]:
[<praw.objects.Submission at 0x1106ca850>,
 <praw.objects.Submission at 0x1106bb0d0>,
 <praw.objects.Submission at 0x1106bb3d0>,
 <praw.objects.Submission at 0x1106bb290>,
 <praw.objects.Submission at 0x1106bb190>]

In [13]: a == b
Out[13]: True
```

Note that the objects are at different addresses, but their built-in comparison enables them to compare as equal if they're the same type and have the same id.
Ah, whoops! It's always the simplest answer that I forget. This is much better.
@bboe Oh thanks, not sure why I didn't realize
It's a shame the API isn't using the UTC timestamp, but it used to be much worse! I don't know when the `created`/`created_utc` offset became 28,800 globally but it must have been recent. I haven't run any of your code, but it looks great to me, and the tests seem to agree. I might try to get a little more playful with the `window_size` modifiers instead of using a hard x2 multiplier though. For quick results, we want to get as close to our `search_limit` as we can on every loop, and avoid dawdling in the 30-90 range if possible. There's probably no perfect solution to this, but maybe you can increase and decrease by a certain percentage, and only get aggressive after repeated failures.
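One way to sketch the "gentle first, aggressive after repeated failures" idea; every name and threshold here is hypothetical, not part of the PR:

```python
def adjust_window(window_size, num_results, misses,
                  low=30, high=99, gentle=1.25, aggressive=2.0):
    """Return (new_window_size, new_miss_count). Grow the window when a
    search returned too few results, shrink it when it returned too many;
    scale gently at first and only double/halve after repeated misses."""
    if low <= num_results <= high:
        return window_size, 0              # in the sweet spot: reset streak
    factor = aggressive if misses >= 2 else gentle
    if num_results < low:
        return int(window_size * factor), misses + 1       # too few: widen
    return max(1, int(window_size / factor)), misses + 1   # too many: shrink
```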
As for the function name, maybe just
For bonus points, do you think this function can accept some other search parameters? Cloudsearch is kind of annoying to format, but it's like a prefix expression I guess:
This opens up a lot of possibilities, so if the formatting doesn't get too crazy it may be worth the effort. Sorry I haven't been too active recently; this is a great contribution and I wish I had participated more.
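Queries in that prefix style can be assembled mechanically. The sketch below is an assumption about the general shape of cloudsearch syntax (an `(and clause clause ...)` wrapper around `field:'value'` terms and `timestamp:lo..hi` ranges), not code from the PR, and the helper name is invented:

```python
def build_cloudsearch_query(lowest_ts, highest_ts, extra_fields=None):
    """Hypothetical helper: combine a timestamp range with optional extra
    field filters into a single cloudsearch prefix-expression string."""
    clauses = ["timestamp:%d..%d" % (lowest_ts, highest_ts)]
    # Sort for a deterministic clause order.
    for field, value in sorted((extra_fields or {}).items()):
        clauses.append("%s:'%s'" % (field, value))
    if len(clauses) == 1:
        return clauses[0]
    return "(and %s)" % " ".join(clauses)
```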
@eleweek Overall, I think we're just about ready. Perhaps we can make a simplification to the test with respect to directly comparing lists of subreddit objects, and rename the method. Then it should be good to go! Thanks for the persistence and great work.
Awesome! Please mention us when you've done that so we know when to take a look. |
params, extra filtering by cloud search fields
Looks awesome. Just two bits of feedback, then I think we're good to go.
@eleweek I'm really glad you added the additional cloudsearch parameter support, thanks a lot! All of the changes get a thumbs up from me. I'm still seeing
up unittests a bit, extra_cloudsearch_fields is now None by default
Great, thanks! Just had to double check. @bboe, I'm happy with the PR when you are.
Hi, I've used this trick quite a lot, so I took some time and implemented a helper for praw. I hope you'll find it useful!

Reddit seems to use the `created` attribute for "timestamp" queries. `created` seems to be a sort of broken timestamp. I think this is very confusing, and I always forget to convert `time.time()` output to a "reddit timestamp". That's why I chose to make `highest_timestamp` and `lowest_timestamp` normal unix timestamps (even though it is inconsistent with `r.search`).

I am not a native speaker, so sorry in advance for possible grammatical errors in the docstring.
The basic algorithm is fairly simple: it just slides a dynamic-length window over reddit timestamps and tries to keep 30..99 results within the window. Whenever this condition breaks, it increases or decreases the window length by a factor of 2x. I never really bothered to optimize these parameters; they are good enough for me. Maybe it is a good idea to expose them as function parameters.
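The loop described above might look roughly like this. It is a sketch under the stated 30..99 / 2x parameters; `search` is a stand-in for the actual reddit query, and all names are hypothetical:

```python
def iter_submissions(search, lowest_timestamp, highest_timestamp,
                     window_size=3600, min_results=30, max_results=99):
    """Slide a dynamic-length window from highest_timestamp down to
    lowest_timestamp, yielding everything search(lo, hi) returns.
    `search` is assumed to return all submissions with lo <= ts < hi."""
    hi = highest_timestamp
    while hi > lowest_timestamp:
        lo = max(hi - window_size, lowest_timestamp)
        batch = search(lo, hi)
        if len(batch) > max_results and window_size > 1:
            window_size //= 2      # too many hits: shrink window and retry
            continue
        for submission in batch:
            yield submission
        if len(batch) < min_results:
            window_size *= 2       # too few hits: widen the next window
        hi = lo
```

With a fake `search` over a list of timestamps, the generator visits every item exactly once while the window self-adjusts.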
Naming! TBH, I am not sure if `all_submissions` is a good name.

Any feedback is appreciated!