
Comment stream dropping comments? #1043

Closed
bicubic opened this issue Mar 9, 2019 · 33 comments
Assignees
Labels
Auto-closed - Stale · Bug · Documentation · Stale · Verified

Comments

@bicubic

bicubic commented Mar 9, 2019

Issue Description

A simple consumer like the one below does not seem to be processing all comments.

I observed a drop in total comment throughput with the PRAW stream sometime around December 2018, and it has never really recovered.

I have tested this by manually making a number of comments and observing that some of them don't get captured by the PRAW stream.

IO on the client side is not a limiting factor.

for comment in reddit.subreddit('all').stream.comments():
    do_something(comment)

System Information

  • PRAW Version: 6.0.0
  • Python Version: 3.7
  • Operating System:
@bboe
Member

bboe commented Mar 9, 2019

@bicubic unfortunately the /r/all comment stream isn't 100% reliable. If you can find a way to increase reliability we'd definitely love to incorporate those ideas. What part of the documentation would make sense to update to state this observation?

@bicubic
Author

bicubic commented Mar 9, 2019

The stream docs I guess.

Do you have any clues as to why it's not reliable? Is it the reddit api or is it praw itself? Are you aware of the volume drop I mentioned around December?

@bboe
Member

bboe commented Mar 9, 2019

I'm not aware of any such volume drop. PRAW can grab up to 100 comments in a single request and makes requests roughly once a second (assuming you have only a single service running using your credentials). That means if Reddit ever gets more than 100 comments in a single second (more precisely, since the last request) items will be missed.

The part PRAW relies on is how quickly Reddit updates those listings. Observations show that Reddit is pretty reliable at returning results; however, anecdotal evidence suggests that for active streams (all comments) the listing isn't always perfectly up to date, since it's constantly being written to.

For better results, monitor only the communities you're interested in if you want a real-time stream. For non-real-time results, give Pushshift a try.
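
For illustration, a narrower stream might look like this (the subreddit names here are just placeholders, and credentials come from praw.ini as usual):

import praw

reddit = praw.Reddit()  # credentials loaded from praw.ini
# Several subreddits can be combined with '+' and streamed together instead
# of streaming r/all.
for comment in reddit.subreddit('learnpython+redditdev').stream.comments():
    print(comment.id, comment.subreddit)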

@bicubic
Author

bicubic commented Mar 10, 2019

That means if Reddit ever gets more than 100 comments in a single second (more precisely, since the last request) items will be missed.

That is not consistent with the failure mode I am seeing. Below is a count of comments ingested per approximately one second. The values are floats because the counting is timed by the stream processor firing, so the elapsed delta may be some fraction longer than 1 s.

Note that at no point does the throughput approach 100/s, and in fact there is a pretty clear pattern of the throughput dropping to almost 0 on some calls.

4.110971667883664
29.12902488487295
31.61212391783092
1.2129483990617538
24.822695849963917
29.510219336362287
1.9923585240590969
37.163753280067624
3.9991415594933675
28.728541629348506
15.77966205792097
40.240926468318406
2.3981417749479816
24.74006601886485
29.92579635109039
1.3119115163031925
24.50243685705319
23.09836711640005
2.15659769109899
22.514574895599996
22.387294394752487
24.32655040884775
3.1010417302469966
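
The counting mechanism is along these lines (a simplified sketch, not the exact code):

import time
import praw

reddit = praw.Reddit()  # credentials loaded from praw.ini
count = 0
start = time.monotonic()
for comment in reddit.subreddit('all').stream.comments():
    count += 1
    elapsed = time.monotonic() - start
    if elapsed >= 1.0:
        # comments ingested per (approximately) one second
        print(count / elapsed)
        count = 0
        start = time.monotonic()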

@bboe
Member

bboe commented Mar 10, 2019

Thanks for the data. If you can narrow it down more, I'd love to be wrong so that things can be fixed. You can try logging the actual requests to see if that helps shed any light on the missed comments:

https://praw.readthedocs.io/en/latest/getting_started/logging.html

Keep in mind too that comments might be going into a spam filter, in which case they won't show up in the listings. They should be re-added once approved, but I don't know where in the listings they end up.

@bicubic
Author

bicubic commented Mar 13, 2019

I'm not sure how I can help narrow it down since I have zero understanding of praw, but here is something you can easily test yourself.

https://gist.github.com/bicubic/774bf7ae25c29d78acb39d6d2b07849c

Couple of observations:

  • When I tested, the above averaged 55 comments per second, while PRAW averaged 22 for the same time period
  • Just like PRAW, the above shows that some periods return 0 new comments. I'm guessing there's a time-gated cache sitting behind that endpoint
  • The above also showed that some responses contain exactly the maximum of 100 fresh comments, so the 100-item limit is a source of some drops, but it does not explain why PRAW's throughput is so low

I don't know how praw works under the hood, but if it's calling a similar endpoint and sleeps between calls, then I can see how it would be losing fresh comments due to the 100 return limit. Let me know if I can help further.
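
The gist is essentially a polling loop against the newest-comments listing; the same idea expressed with PRAW's own listing call looks roughly like this (a sketch, not the gist itself):

import time
import praw

reddit = praw.Reddit()  # credentials loaded from praw.ini
seen = set()  # a real script would bound this set
while True:
    # Ask for the newest (up to) 100 comments on r/all each second and
    # print only the ones that haven't been seen before.
    for comment in reddit.subreddit('all').comments(limit=100):
        if comment.id not in seen:
            seen.add(comment.id)
            print(comment.id)
    time.sleep(1)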

@nmtake
Contributor

nmtake commented Mar 19, 2019

I tried this script

import logging
import sys
import praw

logging.basicConfig(
        level=logging.DEBUG, stream=sys.stdout,
        format='%(asctime)s %(message)s')

reddit = praw.Reddit(...)
for comment in reddit.subreddit('all').stream.comments():
    print(comment.id)

And here is the result log. As we can see, there are many dropped comments as @bicubic pointed out, and some of them can be retrieved later via api/info:

>>> import praw
>>> reddit = praw.Reddit(...)
>>> dropped = ['t1_eivp01' + c for c in 'pquvxyz']
>>> for comment in reddit.info(dropped):
...     print(comment.id, comment.author)
...
eivp01p InnerRisk
eivp01q Skewered_Planets
eivp01u None
eivp01v Cirkah
eivp01x Chishikii
eivp01y sowaffled
eivp01z None

I too am guessing it's a cache problem (rather than a private subreddit or unapproved comments). As the second GET /comments request in the log suggests (it returns only 9 comments, with too many holes), maybe PRAW is requesting too fast?

@Pyprohly
Contributor

My testing suggests that this might have to do with the suboptimal before param adjusting PRAW does.

I’ve modified bicubic’s script to output comment ids to a text file, and compared them to the ids output by the following script:

import os
import praw

reddit = praw.Reddit()
subreddit = reddit.subreddit('all')

with open(os.path.splitext(__file__)[0] + '.txt', 'w') as fh:
	for comment in subreddit.stream.comments(pause_after=None, skip_existing=True):
		print(comment, file=fh)

I’ve compared the two text files (bicubic_stream.txt and praw_stream.txt) by counting the number of lines that were in each but missing from the other, using the following approach.

$ timeout -s SIGINT 120 python3 praw_stream.py & timeout -s SIGINT 120 python3 bicubic_stream.py
$ # diff <(sort bicubic_stream.txt) <(sort praw_stream.txt)
$ diff --color=no -U0 <(sort bicubic_stream.txt) <(sort praw_stream.txt) | tail -n +3 | grep -c '^-'
$ diff --color=no -U0 <(sort bicubic_stream.txt) <(sort praw_stream.txt) | tail -n +3 | grep -c '^+'

Here are some results, running both scripts (in parallel) for 2 minutes, and also some 5 minute trials. The first column is the number of ids in bicubic_stream.txt that weren’t found in praw_stream.txt, and, vice versa, the second column is the number of ids found in praw_stream.txt that weren’t picked up in bicubic_stream.txt. Each line represents a new trial.

# 2 minutes
518,247
715,334
489,662
512,682
573,470
659,508
693,265
678,279
681,557
588,190

# 5 minutes
1552,1012
1236,1199
1058,1835

The results vary inconsistently between trials, but bicubic’s stream tends to win, and by a bigger margin.

Now, I’ve repeated the same tests but changed the following line:

list(function(limit=limit, params={"before": before_attribute}))

To

list(function(limit=limit, params={"before": None}))

Here are some results:

# 2 minutes
237,946
250,737
281,831
232,774
173,910
220,751
312,562
208,653

# 5 minutes
925,1587
463,1529
460,1237

PRAW’s stream now consistently beats bicubic’s stream by a very significant margin. I can’t explain why the results differ so much here. I almost feel like there’s some flaw in my testing since I’m getting such positive results, but I’m certain that the before adjusting is a factor in PRAW’s low throughput when streaming comments from r/all.

@nmtake
Contributor

nmtake commented Mar 24, 2019

I noticed this discussion about stream_generator() by @bboe and @Pyprohly. I should have read it before posting my above comment. Also, here is a similar report on /r/redditdev: My comments don't show in sub.stream.comments, and I don't know why

@bboe
Member

bboe commented Mar 24, 2019

So where do things stand? What change to PRAW, if any, would produce better results? It'd be great to pull such a change in if one exists.

@Pyprohly
Contributor

Well assuming that it is a matter of before adjusting, one way to solve things would be to just lose the before adjusting, but this would reduce the efficiency of the stream. Doing this would only benefit those who are specifically streaming comments from r/all, so this is not a good solution.

The best thing to do now would probably be to just have PRAW detect that r/all is being streamed from, and have it use a None before value each time.

This could be implemented by having stream_generator() take a before_adjusting=True parameter, and have it be set to False when r/all is the target stream.
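
A rough sketch of the idea (simplified, not PRAW’s actual stream_generator):

import time

def simple_stream(function, before_adjusting=True, limit=100):
    # function is a listing call such as subreddit.comments.
    seen = set()  # a real implementation would bound this
    before = None
    while True:
        newest = None
        # The listing returns newest first, so reverse it to yield items in
        # chronological order.
        for item in reversed(list(function(limit=limit, params={"before": before}))):
            if item.fullname not in seen:
                seen.add(item.fullname)
                newest = item.fullname
                yield item
        if before_adjusting:
            # Anchor the next request on the newest item seen so far.
            before = newest or before
        # With before_adjusting=False, before stays None and every request
        # simply asks for the newest `limit` items.
        time.sleep(1)

With before_adjusting=False the duplicates are discarded client side, which is effectively what the before=None experiments above do.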

@bboe
Member

bboe commented Mar 24, 2019

Can you say more about losing the efficiency of the stream? For slower streams, PRAW introduces longer wait limits when nothing has changed, so maybe dropping that param altogether is fine. I'd personally prefer higher accuracy over efficiency, and I suspect many PRAW users would as well.
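
(Roughly speaking, the longer waits amount to something like this sketch, not PRAW's exact code: the delay grows while nothing new arrives and resets once something does.)

import time

def poll_with_backoff(fetch, max_delay=16):
    # fetch() performs one listing request and returns the new items as a list.
    delay = 1
    while True:
        items = list(fetch())
        for item in items:
            yield item
        # Back off while nothing new arrives; reset once something does.
        delay = 1 if items else min(delay * 2, max_delay)
        time.sleep(delay)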

@nmtake
Contributor

nmtake commented Mar 24, 2019

@bboe I have no idea, because I don't know what the root cause of this problem is.

@bboe
Member

bboe commented Mar 24, 2019

Sorry for the confusion @nmtake. That question should have been addressed directly to @Pyprohly regarding this comment:

Well assuming that it is a matter of before adjusting, one way to solve things would be to just lose the before adjusting, but this would reduce the efficiency of the stream.

@Pyprohly
Contributor

I think removing the before adjusting would be an acceptable fix for now, although note that everyone who’s not streaming from r/all is going to be slightly worse off, which makes me a little uncomfortable.

But it’s not like any other reddit api library does any sort of fancy before param adjusting either.

I’ll reintroduce before adjusting with a more optimal algorithm by the time I’m through with #1025. (My new streaming implementation already tries to detect the target listing’s activity and adjusts before accordingly, and in a future edit I’ll get it to choose to keep using a None value for active listings like r/all.)

@Watchful1
Contributor

I'm guessing this isn't PRAW's fault, but rather reddit not updating the indexes powering the before parameter fast enough, causing it to fail to find items "before" the given id if that id was added to the index only a second earlier. The average number of comments being submitted to reddit each second has grown something like 30% in the last year, so I wouldn't be surprised if something on their side isn't able to keep up.

I think removing the before parameter is the right solution, and will at worst undetectably decrease performance. Worst case it fetches 100 items instead of 1, then iterates over all 100 client side and throws away 99 of them, which is a very fast operation compared to the request time.

It might be worth it to special case r/all rather than doing this globally though, since no single subreddit (or even a collection of them) is a majority of the new comments.

@bicubic
Author

bicubic commented Mar 24, 2019

None of the approaches discussed here are actually capable of capturing 100% of r/all. That should be kept in mind for any changes to PRAW. For some use cases a loss rate of 5% is just as bad as 25%.

Perhaps it's worth considering a different approach for those who do want the 100% r/all firehose and treating that as a special case.

@Watchful1
Contributor

Aside from this bug, this approach absolutely will capture 100% of r/all as long as you don't make any other requests. And if you are planning to make other requests, there's no endpoint that will let you catch up since no endpoint returns more than a hundred objects.

It could be made more robust by using an incremental id based approach like pushshift does, but that won't solve the underlying problem of reddit getting 60-70 new comments a second and the client only being able to retrieve 100 at a time.
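
For reference, an incremental-id poller along those lines might look roughly like this (a sketch of the idea, not pushshift's actual code; it asks /api/info for sequential comment fullnames and re-requests ids that haven't shown up yet):

import time
import praw

reddit = praw.Reddit()  # credentials loaded from praw.ini

def base36(n, digits='0123456789abcdefghijklmnopqrstuvwxyz'):
    out = ''
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or '0'

# Start just after the newest comment currently visible on r/all.
newest = next(iter(reddit.subreddit('all').comments(limit=1)))
cursor = int(newest.id, 36) + 1
pending = set()  # ids requested but not yet returned; a real poller would expire very old ones

while True:
    # Build a batch of up to 100 fullnames: previously missing ids first,
    # then the next block of brand-new ids.
    want = sorted(pending)[:100]
    while len(want) < 100:
        want.append(cursor)
        pending.add(cursor)
        cursor += 1
    for comment in reddit.info(['t1_' + base36(i) for i in want]):
        pending.discard(int(comment.id, 36))
        print(comment.id, comment.author)
    time.sleep(1)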

Just use pushshift. That's why it was created.

@bicubic
Author

bicubic commented Mar 24, 2019

Having run an incremental ID fetch solution for the last few days, I see multiple time periods where the throughput rate exceeds 100 messages/s.

Does that not impose a guaranteed loss on any orthodox API-based approach?

@Watchful1
Contributor

It averages out to 60-70 comments a second. As long as you keep track of which ids you have processed and which you haven't, you can just keep requesting the ones you haven't, and you'll eventually catch up. In times of peak activity you might fall minutes behind, but as long as you're only requesting comments and not doing anything else you'll be fine.

How are you running an incremental ID approach?

@bicubic
Author

bicubic commented Mar 24, 2019

How are you running an incremental ID approach?

Probably in a similar way to how pushshift does it: with a pool of workers across multiple IPs, each making up to N polls per second for 100 explicit IDs. I don't think such an approach is viable on a single node without technically violating API rate limits.

@Watchful1
Contributor

Well I'm reasonably sure that u/Stuck_In_the_Matrix only runs one request a second for pushshift. And he fetches comments and submissions in the same request. He's talked recently about adding a second requester, but he hasn't needed to yet.

I'm not really sure where you're getting multiple hundreds of distinct ids a second from; reddit just doesn't get that much content, apart from a few brief periods during major sporting events.

This might be getting a bit off topic though.

@bicubic
Author

bicubic commented Mar 24, 2019

Correct, I have a surplus of polling capacity because I'm re-polling comments to watch scores and thread progression.

To consume 100% of the firehose in real time you need to be capable of making up to two 100-item queries per second. Re-polling at a later time is not ideal, because comments can get moderated or deleted seconds after creation. Re-polling also rules out use cases like bots reacting to firehose comments in real time.

tl;dr I don't think any orthodox approach can consume the firehose at 100% in real time. This is going to become more and more of an issue as daily comment volumes continue to increase. For that reason it might be worthwhile to treat the r/all stream as a special case in PRAW.

@bboe
Member

bboe commented Apr 6, 2019

Can y'all see if the following PR improves the streaming functionality?

https://github.com/praw-dev/praw/pull/1050/files

Thanks!

@Pyprohly
Contributor

Yes, #1050 works well for me @bboe.

I’ve noticed you’ve decided to do away with limit adjusting as well. You once mentioned to me that part of the reason for the param adjusting is to avoid cached results. I’d like to know about any relevant discussions. There must have been a reason for adding all this adjusting in the first place.

@Tystgit

Tystgit commented May 11, 2019

I just want to note that I'm seeing the same issues mentioned here, and I'm eagerly awaiting a new version that includes #1050 in the hope that it causes fewer comments to be dropped.

@PythonCoderAS
Contributor

To resolve the issue, maybe it can be noted that the r/all stream may drop some comments. I think the main issue is that any stream of r/all is going to require a special class to handle dropped items and proactively add them in. A simple base-36 decoder can count the item numbers, and if it detects a gap between consecutive numbers, it inserts the missing comment or submission into the yield list.

I have confirmed that you can do base-36 conversions through int's base parameter, as shown below:

In[2]: int("fdilbrw", base=36)
Out[2]: 33469023452
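
A gap detector along those lines might look like the following (just a sketch of the idea, not an existing PRAW feature):

def find_gaps(ids, digits='0123456789abcdefghijklmnopqrstuvwxyz'):
    # ids: base-36 comment ids as they arrived from the stream.
    # Returns the base-36 ids missing between consecutive items; those could
    # then be fetched explicitly, e.g. via reddit.info(['t1_' + i for i in gaps]).
    def encode(n):
        out = ''
        while n:
            n, r = divmod(n, 36)
            out = digits[r] + out
        return out or '0'

    numbers = sorted(int(i, base=36) for i in ids)
    missing = []
    for a, b in zip(numbers, numbers[1:]):
        missing.extend(encode(n) for n in range(a + 1, b))
    return missing

print(find_gaps(['fdilbrw', 'fdilbrz']))  # ['fdilbrx', 'fdilbry']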

@PythonCoderAS PythonCoderAS self-assigned this Jan 25, 2020
@PythonCoderAS PythonCoderAS added Bug Something isn't working Documentation Documentation issue or improvement Verified Confirmed bug or issue labels Jan 25, 2020
@jarhill0
Contributor

To resolve the issue, maybe it can be noted that the r/all stream may drop some comments.

Noting a known issue is always good, but just noting the issue isn't the same as fixing it. For this reason, it would be better if we could improve the streams to avoid dropping items entirely, but this is difficult to achieve.

@danksky

danksky commented Feb 7, 2020

In my case, I'm looping through a list of subreddit titles and passing subreddit_title to create the subreddit object.

The following

subreddit = reddit.subreddit(subreddit_title)
for comment in subreddit.stream.comments():
    print(comment.id)

prints 1 comment ID per second per subreddit.

I thought it might be a rate-limiting issue, but the first subreddit I stream comments from behaves the same way as all the subreddits that follow. That doesn't rule out rate limiting, but I can't figure out what's wrong.

@PythonCoderAS
Contributor

Maybe the subreddit really does only output 1 comment/sec. Try it on r/AskReddit, which has a high comment throughput.

@Toldry

Toldry commented Dec 3, 2020

Since this issue hasn't been fully resolved yet, is there at least a method to decrease the proportion of comments that are dropped?

I've noticed that the bot I wrote doesn't process most comments.
If I had to take a guess, I'd say only around 20% of actual comments are retrieved via reddit.subreddit('all').stream.comments()

@github-actions

This issue is stale because it has been open for 20 days with no activity. Remove the Stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale Issue or pull request has been inactive for 20 days label May 20, 2021
@github-actions

This issue was closed because it has been stale for 10 days with no activity.

@github-actions github-actions bot added the Auto-closed - Stale Automatically closed due to being stale for too long label May 31, 2021