
Add helper for downloading all submissions in a subreddit between two timestamps (Implements: #337) #554

Merged
merged 16 commits into praw-dev:master on Dec 18, 2015

Conversation

Contributor

@eleweek eleweek commented Nov 24, 2015

Hi, I've used this trick quite a lot, so I took some time and implemented a helper for praw. I hope you'll find it useful!

  1. Reddit seems to use the created attribute for "timestamp" queries, and created is a sort of broken timestamp. I find this very confusing, and I always forget to convert time.time() output to a "reddit timestamp". That's why I chose to make highest_timestamp and lowest_timestamp normal unix timestamps (even though this is inconsistent with r.search).

  2. I am not a native speaker, so sorry in advance for possible grammatical errors in the docstring.

  3. The basic algorithm is fairly simple: it slides a dynamic-length window over reddit timestamps and tries to keep 30..99 results within the window. Whenever that condition breaks, it increases or decreases the window length by a factor of 2. I never really bothered to optimize these parameters; they are good enough for me. Maybe it would be a good idea to expose them as function parameters.

  4. Naming! TBH, I am not sure if all_submissions is a good name.

Any feedback is appreciated!
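To make point 3 concrete, here is a minimal sketch of the windowing idea. This is illustrative only, not the PR's actual code: `count_results(lo, hi)` is a hypothetical stand-in for running a cloudsearch timestamp query and counting the submissions returned, and the real helper's parameters differ.

```python
# Sketch of the dynamic-window scan described in point 3 (illustrative
# only). count_results(lo, hi) stands in for a cloudsearch timestamp
# query returning the number of submissions in that range.

def iterate_windows(lowest, highest, count_results, initial_window=3600):
    """Yield (lo, hi) windows covering [lowest, highest], resizing the
    window by a factor of 2 to keep roughly 30..99 results per window."""
    window = initial_window
    lo = lowest
    while lo < highest:
        hi = min(lo + window, highest)
        n = count_results(lo, hi)
        if n > 99 and window > 1:
            window = max(window // 2, 1)  # too many results: shrink
        elif n < 30 and hi < highest:
            window *= 2                   # too few results: grow
        else:
            yield (lo, hi)
            lo = hi
```

The guard on `hi < highest` makes the final, possibly sparse window still get yielded instead of growing forever past the end of the range.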


if highest_timestamp is None:
    # convert normal unix timestamps into broken reddit timestamps
    highest_timestamp = int(time.time()) + 28800
Contributor

Slightly confused: can you clarify what you mean by 'broken' and, if it is indeed broken, you may want to file an issue on reddit/reddit

Contributor Author

Broken == looks like unix timestamp, but isn't. :) Compare "created" and "created_utc" attributes.

I doubt they are going to change the meaning of created attribute, especially given that you can get the proper timestamp from created_utc

https://www.reddit.com/r/redditdev/comments/29991t/whats_the_difference_between_created_and_created/
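To illustrate the mismatch, here is a tiny pair of conversion helpers. The 28800-second offset matches what this PR uses empirically; as discussed later in the thread, it is not a documented guarantee and could change, so treat these as a sketch rather than a stable API.

```python
# Illustration of the created / created_utc mismatch. The 28800-second
# (8-hour) offset is an empirical observation at the time of this PR,
# not a documented guarantee.

REDDIT_TIMESTAMP_OFFSET = 28800

def to_reddit_timestamp(unix_ts):
    """Convert a unix timestamp to the created-style value that
    cloudsearch timestamp queries appear to expect."""
    return int(unix_ts) + REDDIT_TIMESTAMP_OFFSET

def to_unix_timestamp(reddit_ts):
    """Inverse conversion, back to a normal unix timestamp."""
    return int(reddit_ts) - REDDIT_TIMESTAMP_OFFSET
```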

Contributor

Ah :P

If cloudsearch is still using the broken created, I'd assume that's a bug since that's just weird and confusing for devs in general.

Contributor Author

It's been reported before: https://www.reddit.com/r/bugs/comments/29ajsg/cloudsearch_syntax_timestamp_seaches_are/

It doesn't look like reddit admins care about this problem. :/

@eleweek
Contributor Author

eleweek commented Nov 25, 2015

I've updated the diff and addressed some of the feedback from @13steinj. I also realized that I hadn't actually made use of lowest_timestamp, so I fixed this (and tested the fix).

elif not isinstance(subreddit, six.string_types):
    lowest_timestamp = subreddit.created
elif subreddit != "all":
    lowest_timestamp = reddit_session.get_subreddit(subreddit).created
Contributor

You may want to change this to elif subreddit not in ("all", "contrib", "mod", "friend") (maybe even add "front" and document it so people can just search the front page; if it is "front", pass subreddit=None in the search). Other than that, looks cool :P

@eleweek
Contributor Author

eleweek commented Nov 29, 2015

Bump!

@@ -111,6 +111,104 @@ def valid_redditors(redditors, sub):
if resp['ok']]


def all_submissions(reddit_session,
                    subreddit,
                    highest_timestamp=None,
Member

Does it make sense to switch lowest and highest?

Contributor Author

Yep. I'll fix this.

@bboe
Member

bboe commented Nov 30, 2015

This addition looks awesome. I really like this helper. There are a few order-based things I asked questions about; otherwise, I think the next step is to add some tests. Given that there is a lot of code in this function, I'd really like to see 100% branch coverage. Great work!

@eleweek
Contributor Author

eleweek commented Dec 4, 2015

Hi, I've addressed some of your feedback! I am not sure how to achieve 100% branch coverage with betamax, though, since time is constantly moving forward and the recorded requests will become outdated the next second.

@13steinj
Contributor

13steinj commented Dec 5, 2015

@eleweek betamax should be caching any and all requests made. Meaning: if you, say, search via new and ensure that the first post is the newest post, and you do it on a small subreddit (like one of the api test subreddits listed in PrawTest's configure), there shouldn't be a change in what is listed. Any further times the test is run, betamax reads from the cache instead of making the request.

@eleweek
Contributor Author

eleweek commented Dec 5, 2015

Any further times the test is run, instead of making the request, betamax would be used to read from the cache.

Unfortunately, this isn't going to work: time goes forward => the time.time() return value changes => the set of requests changes.

@bboe
Member

bboe commented Dec 5, 2015

@eleweek You can mock out the calls to time so that they always return the values you want when you run the tests. It's a little tricky if you've never done it before, but it's a very useful tool to have for writing automated tests. There is some information here if you're up to it:
http://stackoverflow.com/questions/4481954/python-trying-to-mock-datetime-date-today-but-not-working

@eleweek
Contributor Author

eleweek commented Dec 5, 2015

@bboe Thanks for the pointer, I've tried to make it work, but unfortunately it leads to this output when using nose:

Ran 261 tests in -422558.602s

I'm mocking time as:

import time

def mock_time():
    return 1448923624
time.time = mock_time

@eleweek
Contributor Author

eleweek commented Dec 5, 2015

Actually, mocking time in setUp() and un-mocking it in tearDown() seems to work, although this leads to an incorrect running time being reported if you press Ctrl+C while this test case is running. Hopefully this is acceptable.
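The setUp/tearDown approach can be sketched like this; the class name and the fixed timestamp are illustrative, not the PR's actual test code.

```python
# Sketch of freezing time.time in setUp and restoring it in tearDown so
# betamax-recorded requests stay identical across runs. The class name
# and the fixed timestamp are illustrative.
import time
import unittest

FIXED_TIME = 1448923624

class TimestampSearchTest(unittest.TestCase):
    def setUp(self):
        self._real_time = time.time      # keep a reference to the original
        time.time = lambda: FIXED_TIME   # freeze the clock for this test

    def tearDown(self):
        time.time = self._real_time      # restore even if the test fails

    def test_time_is_frozen(self):
        self.assertEqual(time.time(), FIXED_TIME)
```

Because tearDown runs after each test method (even failing ones), the global clock is only wrong inside the test body itself, which matches the Ctrl+C caveat above.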

@eleweek
Contributor Author

eleweek commented Dec 5, 2015

Okay, I actually managed to fully resolve the issue with the incorrect running-time values! I've updated the PR.

@Damgaard
Contributor

Damgaard commented Dec 6, 2015

Hey

As far as I remember from way back, the created value is the time in the server's timezone. Since reddit may move servers from one timezone to another at any point, and may have servers in multiple distinct timezones, the created value should not be trusted to be a consistent offset from UTC. As such, only created_utc should ever be used, since it is the only reliable timestamp.

IIRC the created attribute is a legacy value that is only kept around because removing it would be a backwards-incompatible change that could break an unknown number of apps that expect the value to exist. Also, due to the openness of reddit's API, there is no known way to contact all consumers of the API to inform them of a backwards-incompatible change. In other words, removing created was considered such a complicated, dangerous change that it was postponed in favor of getting more pressing changes out. With that in mind, created should be considered deprecated, dangerous, and not to be used.

I'm not 100% sure on this information; it's been quite a while since I've been in the nitty-gritty of PRAW/reddit. But I wanted to pipe in, since it looks like this PR assumes a fixed relationship between created and unix time, which could change at any time and isn't even guaranteed to be a constant. This could prove to be a very serious maintenance headache years in the future if reddit ever arbitrarily decides to change the server's timezone.

@Damgaard
Contributor

Damgaard commented Dec 6, 2015

I did another quick read-through, and it looks like the server expects the timestamp arguments in the timezone of created, which is what causes the conversions between unix time and reddit time. In which case, if my memory of the above is correct, what is being done here seems to be the only way to do this from PRAW's POV. But it seems quite fragile for something that is part of PRAW and not an external script.

The proper solution would then be to update the respective endpoint at reddit to accept UTC instead of (or as an alternative to) timezone-specific time. This would embrace the deprecated nature of created and allow the server timezone to be changed arbitrarily without breaking anything. PRAW could then use created_utc safely, in the knowledge that it will be consistent forever.

@eleweek
Contributor Author

eleweek commented Dec 6, 2015

@Damgaard I do agree with you that it would be better to use created_utc rather than the deprecated and confusing created. The issue has been raised before. Alas, reddit admins aren't particularly responsive.

The proper solution would then be to update the respective endpoint at reddit to accept utc instead of timezone specific time or as an alternative.

Personally, I find reddit admins extremely unresponsive. I've reported multiple bugs (got zero response) and tried to start a discussion on slightly changing a different endpoint (again, zero response), so even though I don't mind contributing to reddit (I actually fixed a bug in reddit once), I have little desire to attempt breaking changes or significant modifications.

@eleweek
Contributor Author

eleweek commented Dec 8, 2015

Bump! Is there any chance that this could be merged without changing reddit's source?

@bboe
Member

bboe commented Dec 8, 2015

Is there any chance that this could be merged without changing reddit's source?

Yes, while I agree with @Damgaard that UTC timestamps should be used, I think something that works initially is better than nothing.

Regarding timing, it may be a few days until I can take a solid look before merging. Maybe @Damgaard or @voussoir can before then? If there's no activity before Monday, please bump again, and thanks for the contribution.

verbosity=3))

def get_submissions_data(submissions_list):
    return [(s.created_utc, s.title, s.url)
Contributor

  1. Would it be sufficient to use the item's ID / fullname instead of these attributes? While it's unlikely that two posts would collide here, these attributes aren't necessarily unique identifiers.

  2. Instead of creating and comparing two lists of IDs, maybe we can zip the search results? Like:

    def assert_equal_submissions(list1, list2):
        self.assertEqual(len(list1), len(list2))
        for (item1, item2) in zip(list1, list2):
            self.assertEqual(item1.fullname, item2.fullname)
    
    assert_equal_submissions(all_subs, all_subs_sr_object)
    assert_equal_submissions(all_subs, reversed(all_subs_reversed))
    ...
    

    I think the intention is more clear this way, let me know what you think.

Member

Two lists of submissions should be directly comparable:

[<praw.objects.Submission at 0x110755ed0>,
 <praw.objects.Submission at 0x110755fd0>,
 <praw.objects.Submission at 0x110755150>,
 <praw.objects.Submission at 0x110755210>,
 <praw.objects.Submission at 0x1106c6b10>]

In [12]: b
Out[12]: 
[<praw.objects.Submission at 0x1106ca850>,
 <praw.objects.Submission at 0x1106bb0d0>,
 <praw.objects.Submission at 0x1106bb3d0>,
 <praw.objects.Submission at 0x1106bb290>,
 <praw.objects.Submission at 0x1106bb190>]

In [13]: a == b
Out[13]: True

Note that the objects are at different addresses, but their built-in comparison enables them to compare as equal if they're the same type and have the same id.

Contributor

Ah, whoops! It's always the simplest answer that I forget. This is much better.

Contributor Author

@bboe Oh, thanks! Not sure why I didn't realize that.

@voussoir
Contributor

voussoir commented Dec 9, 2015

It's a shame the API isn't using the UTC timestamp, but it used to be much worse! I don't know when the created-created_utc offset became 28,800 globally but it must have been recent.

I haven't run any of your code, but it looks great to me, and the tests seem to agree. I might try to get a little more playful with the window_size modifiers instead of using a hard 2x multiplier, though. For quick results, we want to get as close to our search_limit as we can on every loop, and avoid dawdling in the 30-90 range if possible.

There's probably no perfect solution to this, but maybe you can increase and decrease by a certain percentage, and only get aggressive after repeated failures.
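One possible reading of that suggestion, sketched below; the specific percentage, escalation threshold, and function shape are all illustrative assumptions, not part of the PR.

```python
# Sketch of the gentler resizing suggested above: adjust the window by a
# small percentage first, and only escalate to aggressive doubling and
# halving after repeated misses. All parameter values are illustrative.

def next_window_size(window, result_count, misses,
                     target_low=30, target_high=99):
    """Return (new_window, new_miss_count) after one probe of a window
    that yielded result_count results."""
    if target_low <= result_count <= target_high:
        return window, 0                    # in range: reset miss counter
    factor = 1.25 if misses < 3 else 2.0    # get aggressive after 3 misses
    if result_count > target_high:
        return window / factor, misses + 1  # too many results: shrink
    return window * factor, misses + 1      # too few results: grow
```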

 

As for the function name, maybe just timestamp_search? I'm having a hard time coming up with something descriptive and concise, haha.

 

For bonus points, do you think this function can accept some other search parameters? Cloudsearch is kind of annoying to format, but it's like a prefix expression I guess:

if username is not None:
    query = '(and author:"%s" timestamp:%d..%d)' % (username, lower, upper)
else:
    query = 'timestamp:%d..%d' % (lower, upper)

This opens up a lot of possibilities, so if the formatting doesn't get too crazy it may be worth the effort.
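One way to generalize that formatting to arbitrary fields is sketched below; the function name, the dict-based interface, and the quoting rules are assumptions for illustration, and the PR's actual extra_cloudsearch_fields handling may differ.

```python
# Sketch of building a cloudsearch query from a timestamp range plus an
# optional dict of extra fields. Field names and quoting are illustrative;
# cloudsearch's real grammar has more operators than shown here.

def build_cloudsearch_query(lower, upper, extra_fields=None):
    timestamp_clause = 'timestamp:%d..%d' % (lower, upper)
    if not extra_fields:
        return timestamp_clause
    clauses = ['%s:"%s"' % (field, value)
               for field, value in sorted(extra_fields.items())]
    return '(and %s %s)' % (' '.join(clauses), timestamp_clause)
```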

Sorry I haven't been too active recently, this is a great contribution and I wish I had participated more.

@bboe
Member

bboe commented Dec 15, 2015

@eleweek Overall, I think we're just about ready. Perhaps we can simplify the test by directly comparing lists of submission objects, and rename the method. Then it should be good to go!

Thanks for the persistence and great work.

@eleweek
Contributor Author

eleweek commented Dec 15, 2015

@voussoir @bboe Thanks for your feedback. I was busy last week, so I didn't get around to updating the PR. I'll probably do this over the next couple of days.

@bboe
Member

bboe commented Dec 15, 2015

I'll probably do this over the next couple of days.

Awesome! Please mention us when you've done that so we know when to take a look.

params, extra filtering by cloud search fields
@eleweek
Contributor Author

eleweek commented Dec 17, 2015

@bboe @voussoir I've updated the PR! Changed the window-sliding params and added extra filtering by other cloudsearch fields.

@bboe
Member

bboe commented Dec 17, 2015

Looks awesome. Just two bits of feedback, then I think we're good to go.

@voussoir
Contributor

@eleweek I'm really glad you added the additional cloudsearch parameter support, thanks a lot! All of the changes get a thumbs up from me.

I'm still seeing extra_cloudsearch_fields={} in the latest commit, 601fb8a, though. Can you make sure that gets changed to None before we merge? Following the timestamps, it looks like the commit was made after Bryce's comment, so I think you forgot to change it.

up unittests a bit, extra_cloudsearch_fields is now None by default
@eleweek
Contributor Author

eleweek commented Dec 18, 2015

@voussoir It was late where I live, and I figured I could push a commit that fixes the CI problems and address those issues later.

I fixed it now though.

Also pinging @bboe just in case

@voussoir
Contributor

Great, thanks! Just had to double check.

@bboe, I'm happy with the PR when you are.

bboe added a commit that referenced this pull request Dec 18, 2015
Add helper for downloading all submissions in a subreddit between two timestamps (Implements: #337)
@bboe bboe merged commit cb33a4e into praw-dev:master Dec 18, 2015
@bboe
Member

bboe commented Dec 18, 2015

Awesome work @eleweek! Thanks for reviewing @voussoir, @13steinj and @Damgaard. 👏

@eleweek
Contributor Author

eleweek commented Dec 18, 2015

Thank you @bboe , @voussoir and @13steinj . This was fun :)
