
Comment stream dropping comments? #1043

Closed
bicubic opened this issue Mar 9, 2019 · 33 comments
Assignees
Labels
Auto-closed - Stale · Bug · Documentation · Stale · Verified

Comments

@bicubic

bicubic commented Mar 9, 2019

Issue Description

A simple consumer like the one below does not seem to be processing all comments.

I observed a drop in total comment throughput with the PRAW stream sometime around December 2018, and it has never really recovered.

I have tested this by manually making a number of comments and observing that some of them don't get captured by the PRAW stream.

IO on the client side is not a limiting factor.

for comment in reddit.subreddit('all').stream.comments():
    do_something(comment)

System Information

  • PRAW Version: 6.0.0
  • Python Version: 3.7
  • Operating System:
@bboe
Member

bboe commented Mar 9, 2019

@bicubic unfortunately the /r/all comment stream isn't 100% reliable. If you can find a way to increase reliability we'd definitely love to incorporate those ideas. What part of the documentation would make sense to update to state this observation?

@bicubic
Author

bicubic commented Mar 9, 2019

The stream docs I guess.

Do you have any clues as to why it's not reliable? Is it the reddit api or is it praw itself? Are you aware of the volume drop I mentioned around December?

@bboe
Member

bboe commented Mar 9, 2019

I'm not aware of any such volume drop. PRAW can grab up to 100 comments in a single request and makes requests roughly once a second (assuming you have only a single service running using your credentials). That means if Reddit ever gets more than 100 comments in a single second (more precisely, since the last request) items will be missed.

The part PRAW relies on is how quickly Reddit updates those listings. Observations show that Reddit is pretty reliable at returning results; however, anecdotal evidence suggests that for active streams (all comments) the listing isn't always perfectly up to date, since it's constantly being written to.

For better results, monitor only the communities you're interested in if you want a real-time stream. For non-real-time results, give Pushshift a try.
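
For illustration, a narrower stream might look like this (the subreddit names here are just placeholders, and credentials come from praw.ini as usual):

import praw

reddit = praw.Reddit()  # credentials loaded from praw.ini
# Several subreddits can be combined with '+' and streamed together instead
# of streaming r/all.
for comment in reddit.subreddit('learnpython+redditdev').stream.comments():
    print(comment.id, comment.subreddit)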

@bicubic
Author

bicubic commented Mar 10, 2019

That means if Reddit ever gets more than 100 comments in a single second (more precisely, since the last request) items will be missed.

That is not consistent with the failure mode I am seeing. Below is a count of comments ingested per approximately one second. The values are floats because the counting is timed by the stream processor firing, so the elapsed delta may be some fraction longer than 1 s.

Note that at no point does the throughput approach 100/s, and in fact there is a pretty clear pattern of the throughput dropping to almost 0 on some calls.

4.110971667883664
29.12902488487295
31.61212391783092
1.2129483990617538
24.822695849963917
29.510219336362287
1.9923585240590969
37.163753280067624
3.9991415594933675
28.728541629348506
15.77966205792097
40.240926468318406
2.3981417749479816
24.74006601886485
29.92579635109039
1.3119115163031925
24.50243685705319
23.09836711640005
2.15659769109899
22.514574895599996
22.387294394752487
24.32655040884775
3.1010417302469966
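
The counting mechanism is along these lines (a simplified sketch, not the exact code):

import time
import praw

reddit = praw.Reddit()  # credentials loaded from praw.ini
count = 0
start = time.monotonic()
for comment in reddit.subreddit('all').stream.comments():
    count += 1
    elapsed = time.monotonic() - start
    if elapsed >= 1.0:
        # comments ingested per (approximately) one second
        print(count / elapsed)
        count = 0
        start = time.monotonic()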

@bboe
Member

bboe commented Mar 10, 2019

Thanks for the data. If you can narrow it down more, I'd love to be wrong so that things can be fixed. You can try logging the actual requests to see if that helps shed any light on the missed comments:

https://praw.readthedocs.io/en/latest/getting_started/logging.html

Keep in mind too that comments might be going into a spam filter, in which case they won't show up in the listings. They should be re-added once approved, but I don't know where in the listings they end up.

@bicubic
Author

bicubic commented Mar 13, 2019

I'm not sure how I can help narrow it down since I have zero understanding of praw, but here is something you can easily test yourself.

https://gist.github.com/bicubic/774bf7ae25c29d78acb39d6d2b07849c

Couple of observations:

  • When I tested, the above averaged 55 comments per second, while PRAW averaged 22 for the same time period
  • Just like PRAW, the above shows that some periods return 0 new comments. I'm guessing there's a time-gated cache sitting behind that endpoint
  • The above also showed that some responses contain exactly the maximum of 100 fresh comments, so the 100-item limit is a source of some drops, but it does not explain why PRAW's throughput is so low

I don't know how praw works under the hood, but if it's calling a similar endpoint and sleeps between calls, then I can see how it would be losing fresh comments due to the 100 return limit. Let me know if I can help further.
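
The gist is essentially a polling loop against the newest-comments listing; the same idea expressed with PRAW's own listing call looks roughly like this (a sketch, not the gist itself):

import time
import praw

reddit = praw.Reddit()  # credentials loaded from praw.ini
seen = set()  # a real script would bound this set
while True:
    # Ask for the newest (up to) 100 comments on r/all each second and
    # print only the ones that haven't been seen before.
    for comment in reddit.subreddit('all').comments(limit=100):
        if comment.id not in seen:
            seen.add(comment.id)
            print(comment.id)
    time.sleep(1)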

@nmtake
Contributor

nmtake commented Mar 19, 2019

I tried this script

import logging
import sys
import praw

logging.basicConfig(
        level=logging.DEBUG, stream=sys.stdout,
        format='%(asctime)s %(message)s')

reddit = praw.Reddit(...)
for comment in reddit.subreddit('all').stream.comments():
    print(comment.id)

And here is the result log. As we can see, there are many dropped comments as @bicubic pointed out, and some of them can be retrieved later via api/info:

>>> import praw
>>> reddit = praw.Reddit(...)
>>> dropped = ['t1_eivp01' + c for c in 'pquvxyz']
>>> for comment in reddit.info(dropped):
...     print(comment.id, comment.author)
...
eivp01p InnerRisk
eivp01q Skewered_Planets
eivp01u None
eivp01v Cirkah
eivp01x Chishikii
eivp01y sowaffled
eivp01z None

I too am guessing it's a cache problem (rather than a private subreddit or unapproved comments). As the second GET /comments request in the log suggests (it returns only 9 comments, with too many holes), maybe PRAW is requesting too fast?

@Pyprohly
Contributor

My testing suggests that this might have to do with the suboptimal before param adjusting PRAW does.

I’ve modified bicubic’s script to output comment ids to a text file, and compared them to the ids output by the following script:

import os
import praw

reddit = praw.Reddit()
subreddit = reddit.subreddit('all')

with open(os.path.splitext(__file__)[0] + '.txt', 'w') as fh:
	for comment in subreddit.stream.comments(pause_after=None, skip_existing=True):
		print(comment, file=fh)

I’ve compared the two text files (bicubic_stream.txt and praw_stream.txt) by counting the number of lines that were in each but missing from the other, using the following approach.

$ timeout -s SIGINT 120 python3 praw_stream.py & timeout -s SIGINT 120 python3 bicubic_stream.py
$ # diff <(sort bicubic_stream.txt) <(sort praw_stream.txt)
$ diff --color=no -U0 <(sort bicubic_stream.txt) <(sort praw_stream.txt) | tail -n +3 | grep -c '^-'
$ diff --color=no -U0 <(sort bicubic_stream.txt) <(sort praw_stream.txt) | tail -n +3 | grep -c '^+'

Here are some results, running both scripts (in parallel) for 2 minutes, and also some 5 minute trials. The first column is the number of ids in bicubic_stream.txt that weren’t found in praw_stream.txt, and, vice versa, the second column is the number of ids found in praw_stream.txt that weren’t picked up in bicubic_stream.txt. Each line represents a new trial.

# 2 minutes
518,247
715,334
489,662
512,682
573,470
659,508
693,265
678,279
681,557
588,190

# 5 minutes
1552,1012
1236,1199
1058,1835

The results vary inconsistently between trials, but bicubic’s stream tends to win, and by a bigger margin.

Now, I’ve repeated the same tests but changed the following line:

list(function(limit=limit, params={"before": before_attribute}))

To

list(function(limit=limit, params={"before": None}))

Here are some results:

# 2 minutes
237,946
250,737
281,831
232,774
173,910
220,751
312,562
208,653

# 5 minutes
925,1587
463,1529
460,1237

PRAW’s stream now consistently beats bicubic’s stream by a very significant margin. I can’t explain why the results differ so much here. I almost feel like there’s some flaw in my testing since I’m getting such positive results, but I’m certain that the before adjusting is a factor in PRAW’s low throughput when streaming comments from r/all.

@nmtake
Contributor

nmtake commented Mar 24, 2019

I noticed this discussion about stream_generator() by @bboe and @Pyprohly. I should have read it before posting my above comment. Also, here is a similar report on /r/redditdev: My comments don't show in sub.stream.comments, and I don't know why

@bboe
Member

bboe commented Mar 24, 2019

So where do things stand? What change to PRAW, if any, would produce better results? It'd be great to pull such a change in if one exists.

@Pyprohly
Contributor

Well assuming that it is a matter of before adjusting, one way to solve things would be to just lose the before adjusting, but this would reduce the efficiency of the stream. Doing this would only benefit those who are specifically streaming comments from r/all, so this is not a good solution.

The best thing to do now would probably be to just have PRAW detect that r/all is being streamed from, and have it use a None before value each time.

This could be implemented by having stream_generator() take a before_adjusting=True parameter, and have it be set to False when r/all is the target stream.
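
A rough sketch of the idea (simplified, not PRAW’s actual stream_generator):

import time

def simple_stream(function, before_adjusting=True, limit=100):
    # function is a listing call such as subreddit.comments.
    seen = set()  # a real implementation would bound this
    before = None
    while True:
        newest = None
        # The listing returns newest first, so reverse it to yield items in
        # chronological order.
        for item in reversed(list(function(limit=limit, params={"before": before}))):
            if item.fullname not in seen:
                seen.add(item.fullname)
                newest = item.fullname
                yield item
        if before_adjusting:
            # Anchor the next request on the newest item seen so far.
            before = newest or before
        # With before_adjusting=False, before stays None and every request
        # simply asks for the newest `limit` items.
        time.sleep(1)

With before_adjusting=False the duplicates are discarded client side, which is effectively what the before=None experiments above do.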

@bboe
Member

bboe commented Mar 24, 2019

Can you say more about losing the efficiency of the stream? For slower streams, PRAW introduces longer wait limits when nothing has changed, so maybe dropping that param altogether is fine. I'd personally prefer higher accuracy over efficiency, and I suspect many PRAW users would as well.
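
(Roughly speaking, the longer waits amount to something like this sketch, not PRAW's exact code: the delay grows while nothing new arrives and resets once something does.)

import time

def poll_with_backoff(fetch, max_delay=16):
    # fetch() performs one listing request and returns the new items as a list.
    delay = 1
    while True:
        items = list(fetch())
        for item in items:
            yield item
        # Back off while nothing new arrives; reset once something does.
        delay = 1 if items else min(delay * 2, max_delay)
        time.sleep(delay)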

@nmtake
Contributor

nmtake commented Mar 24, 2019

@bboe I have no idea, because I don't know what the root cause of this problem is.

@bboe
Member

bboe commented Mar 24, 2019

Sorry for the confusion @nmtake. That question should have been addressed directly to @Pyprohly regarding this comment:

Well assuming that it is a matter of before adjusting, one way to solve things would be to just lose the before adjusting, but this would reduce the efficiency of the stream.

@Pyprohly
Contributor

I think removing the before adjusting would be an acceptable fix for now, although note that everyone who’s not streaming from r/all is going to be slightly worse off, which makes me a little uncomfortable.

But it’s not like any other reddit api library does any sort of fancy before param adjusting either.

I’ll reintroduce before adjusting with a more optimal algorithm by the time I’m through with #1025. (My new streaming implementation already tries to detect the target listing’s activity and adjusts before accordingly, and in a future edit I’ll get it to choose to keep using a None value for active listings like r/all.)

@Watchful1
Contributor

I'm guessing this isn't PRAW's fault, but rather reddit not updating the indexes powering the before parameter fast enough, causing it to fail to find items "before" the given id if that id was added to the index only a second earlier. The average number of comments being submitted to reddit each second has grown something like 30% in the last year, so I wouldn't be surprised if something on their side isn't able to keep up.

I think removing the before parameter is the right solution, and will at worst undetectably decrease performance. Worst case it fetches 100 items instead of 1, then iterates over all 100 client side and throws away 99 of them, which is a very fast operation compared to the request time.

It might be worth it to special case r/all rather than doing this globally though, since no single subreddit (or even a collection of them) is a majority of the new comments.

@bicubic
Author

bicubic commented Mar 24, 2019

None of the approaches discussed here are actually capable of capturing 100% of r/all. That should be kept in mind for any changes to PRAW. For some use cases a loss rate of 5% is just as bad as 25%.

Perhaps it's worth considering a different approach for those who do want the 100% r/all firehose and treating that as a special case.

@Watchful1
Contributor

Aside from this bug, this approach absolutely will capture 100% of r/all as long as you don't make any other requests. And if you are planning to make other requests, there's no endpoint that will let you catch up since no endpoint returns more than a hundred objects.

It could be made more robust by using an incremental id based approach like pushshift does, but that won't solve the underlying problem of reddit getting 60-70 new comments a second and the client only being able to retrieve 100 at a time.
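
For reference, an incremental-id poller along those lines might look roughly like this (a sketch of the idea, not pushshift's actual code; it asks /api/info for sequential comment fullnames and re-requests ids that haven't shown up yet):

import time
import praw

reddit = praw.Reddit()  # credentials loaded from praw.ini

def base36(n, digits='0123456789abcdefghijklmnopqrstuvwxyz'):
    out = ''
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or '0'

# Start just after the newest comment currently visible on r/all.
newest = next(iter(reddit.subreddit('all').comments(limit=1)))
cursor = int(newest.id, 36) + 1
pending = set()  # ids requested but not yet returned; a real poller would expire very old ones

while True:
    # Build a batch of up to 100 fullnames: previously missing ids first,
    # then the next block of brand-new ids.
    want = sorted(pending)[:100]
    while len(want) < 100:
        want.append(cursor)
        pending.add(cursor)
        cursor += 1
    for comment in reddit.info(['t1_' + base36(i) for i in want]):
        pending.discard(int(comment.id, 36))
        print(comment.id, comment.author)
    time.sleep(1)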

Just use pushshift. That's why it was created.

@bicubic
Author

bicubic commented Mar 24, 2019

Having run an incremental ID fetch solution for the last few days, I see multiple time periods where the throughput rate exceeds 100 messages/s.

Does that not impose a guaranteed loss on any orthodox API-based approach?

@Watchful1
Contributor

It averages out to 60-70 comments a second. As long as you keep track of which ids you have processed and which you haven't, you can just keep requesting the ones you haven't, and you'll eventually catch up. In times of peak activity you might fall minutes behind, but as long as you're only requesting comments and not doing anything else you'll be fine.

How are you running an incremental ID approach?

@bicubic
Author

bicubic commented Mar 24, 2019

How are you running an incremental ID approach?

Probably in a similar way to how pushshift does it: with a pool of workers across multiple IPs, each making up to N polls per second for 100 explicit IDs. I don't think such an approach is viable on a single node without technically violating API rate limits.

@Watchful1
Contributor

Well I'm reasonably sure that u/Stuck_In_the_Matrix only runs one request a second for pushshift. And he fetches comments and submissions in the same request. He's talked recently about adding a second requester, but he hasn't needed to yet.

I'm not really sure where you're getting multiple hundreds of distinct ids a second from; reddit just doesn't get that much content, apart from a few brief periods during major sporting events.

This might be getting a bit off topic though.

@bicubic
Author

bicubic commented Mar 24, 2019

Correct, I have a surplus of polling capacity because I'm re-polling comments to watch scores and thread progression.

To consume 100% of the firehose in real time you need to be capable of making up to two 100-item queries per second. Re-polling at a later time is not ideal, because comments can get moderated or deleted seconds after creation. Re-polling also rules out use cases like bots reacting to firehose comments in real time.

tl;dr I don't think any orthodox approach can consume the firehose at 100% in real time. This is going to become more and more of an issue as daily comment volumes continue to increase. For that reason it might be worthwhile to treat the r/all stream as a special case in PRAW.

@bboe
Member

bboe commented Apr 6, 2019

Can y'all see if the following PR improves the streaming functionality?

https://github.com/praw-dev/praw/pull/1050/files

Thanks!

@Pyprohly
Contributor

Yes, #1050 works well for me @bboe.

I’ve noticed you’ve decided to do away with limit adjusting as well. You once mentioned to me that part of the reason for the param adjusting is to avoid cached results. I’d like to know about any relevant discussions. There must have been a reason for adding all this adjusting in the first place.

@Tystgit

Tystgit commented May 11, 2019

I just want to note that I'm seeing the same issues mentioned here, and I'm eagerly awaiting a new version that includes #1050 in the hope that it causes fewer comments to be dropped.

@PythonCoderAS
Contributor

To resolve the issue, maybe it can be noted that the r/all stream may drop some comments. I think the main issue is that any stream of r/all is going to require a special class to handle dropped items and proactively add them in. A simple base-36 decoder can count the item numbers, and if it detects a gap between consecutive numbers, it inserts the missing comment or submission into the yield list.

I have confirmed that you can do base-36 conversions through int's base parameter, as shown below:

In[2]: int("fdilbrw", base=36)
Out[2]: 33469023452
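
A gap detector along those lines might look like the following (just a sketch of the idea, not an existing PRAW feature):

def find_gaps(ids, digits='0123456789abcdefghijklmnopqrstuvwxyz'):
    # ids: base-36 comment ids as they arrived from the stream.
    # Returns the base-36 ids missing between consecutive items; those could
    # then be fetched explicitly, e.g. via reddit.info(['t1_' + i for i in gaps]).
    def encode(n):
        out = ''
        while n:
            n, r = divmod(n, 36)
            out = digits[r] + out
        return out or '0'

    numbers = sorted(int(i, base=36) for i in ids)
    missing = []
    for a, b in zip(numbers, numbers[1:]):
        missing.extend(encode(n) for n in range(a + 1, b))
    return missing

print(find_gaps(['fdilbrw', 'fdilbrz']))  # ['fdilbrx', 'fdilbry']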

@PythonCoderAS PythonCoderAS self-assigned this Jan 25, 2020
@PythonCoderAS PythonCoderAS added Bug Something isn't working Documentation Documentation issue or improvement Verified Confirmed bug or issue labels Jan 25, 2020
@jarhill0
Contributor

To resolve the issue, maybe it can be noted that the r/all stream may drop some comments.

Noting a known issue is always good, but just noting the issue isn't the same as fixing it. For this reason, it would be better if we could improve the streams to avoid dropping items entirely, but this is difficult to achieve.

@danksky

danksky commented Feb 7, 2020

In my case, I'm looping through a list of subreddit titles and passing subreddit_title to create the subreddit object.

The following

subreddit = reddit.subreddit(subreddit_title)
for comment in subreddit.stream.comments():
    print(comment.id)

prints 1 comment ID per second per subreddit.

I thought it might be a rate-limiting issue, but the first subreddit I stream comments from behaves the same way as all the subreddits that follow. That doesn't rule out rate limiting, but I can't figure out what's wrong.

@PythonCoderAS
Contributor

Maybe the subreddit really does only output 1 comment/sec. Try it on r/AskReddit, which has a high comment throughput.

@Toldry

Toldry commented Dec 3, 2020

Since this issue hasn't been fully resolved yet, is there at least a method to decrease the proportion of comments that are dropped?

I've noticed that the bot I wrote doesn't process most comments.
If I had to take a guess, I'd say only around 20% of actual comments are retrieved via reddit.subreddit('all').stream.comments()

@github-actions

This issue is stale because it has been open for 20 days with no activity. Remove the Stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale Issue or pull request has been inactive for 20 days label May 20, 2021
@github-actions

This issue was closed because it has been stale for 10 days with no activity.

@github-actions github-actions bot added the Auto-closed - Stale Automatically closed due to being stale for too long label May 31, 2021