Strategy for missed posts? #12

Closed
ChrisPalmerNZ opened this issue Apr 17, 2021 · 10 comments

Comments

@ChrisPalmerNZ

ChrisPalmerNZ commented Apr 17, 2021

Hi Matt,

Thanks for pmaw - it's a very nice library you've created!

This is not a problem with pmaw, but with my understanding of how to use it.

I ran search_comments with the same before and after parameters in both psaw and pmaw. It was so much faster in pmaw, I was amazed! But the success rate varied from 93% to 83%, and in the end I had 40,630 comments using pmaw compared to 40,762 using psaw.

What is the best strategy for retrieving the comments that were missed? Should I assemble a list of submission ids from a search_submissions call with the same parameters, picking the submissions whose num_comments is greater than the number of comments I retrieved for them (or that don't appear in the retrieved comments at all), and then pass those ids to search_submission_comment_ids? Or can I use safe_exit and re-run the process to see if I can get more? Or something else?

Perhaps it would be best to use search_submission_comment_ids from the get-go? I have found that searching by id with psaw is much slower than just using a date range, and as I have a range I didn't bother with it in this case. Is it slower to use than search_comments?
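
The first idea above (finding submissions that report more comments than were retrieved) could be sketched roughly as follows. This assumes the submissions and comments are plain dicts carrying the Pushshift fields `id`, `num_comments`, and `link_id`; the helper name `submissions_with_missing_comments` is hypothetical and not part of psaw or pmaw.

```python
from collections import Counter


def submissions_with_missing_comments(submissions, comments):
    """Return ids of submissions that report more comments than were retrieved."""
    # link_id is the parent submission's fullname, e.g. "t3_abc123";
    # drop the "t3_" prefix so it can be matched against submission ids.
    retrieved = Counter(c["link_id"].split("_", 1)[-1] for c in comments)
    return [
        s["id"]
        for s in submissions
        if s.get("num_comments", 0) > retrieved.get(s["id"], 0)
    ]


# Toy data standing in for real search results (assumption, not real output).
submissions = [
    {"id": "abc", "num_comments": 2},
    {"id": "def", "num_comments": 1},
    {"id": "ghi", "num_comments": 0},
]
comments = [
    {"id": "c1", "link_id": "t3_abc"},
    {"id": "c2", "link_id": "t3_def"},
]

print(submissions_with_missing_comments(submissions, comments))  # ['abc']
```

The resulting ids are only candidates for a follow-up search by submission id, since a stored num_comments value may not match what is actually retrievable.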

Cheers
Chris

@mattpodolak
Owner

Hi @ChrisPalmerNZ, the success rate metric that is printed reflects the share of requests that succeeded; the rest were rejected due to rate-limiting from Pushshift. Any failed request is retried automatically.

That's interesting that there was a different number of comments. Can you share the query that you ran? Were there any shards down while you ran the query?

The safe_exit feature is intended for long-running queries which may be interrupted during execution, as it allows you to resume using cached requests and responses. I may be able to recommend a strategy for missed posts based on the parameters you are using.

@ChrisPalmerNZ
Author

ChrisPalmerNZ commented Apr 19, 2021

Hi @mattpodolak

Thanks for replying, sorry it's taken me a while to get back to you.

I used the same subreddit, start and end times, and query for both libraries. The parameters were subreddit='CovidVaccinated', before=1618581599, and after=1606262347.

And the query (<api> signifies that I used either psaw or pmaw):

   <api>.search_comments(
       after=after,
       before=before,
       subreddit=subreddit,
       fields=["id", "subreddit", "link_id", "parent_id", "is_submitter", "author",
               "author_fullname", "body", "score", "created_utc", "permalink"],
       limit=None
   )

I didn't check the shards. Should I call api.metadata_.get('shards') to check them?

I got this from psaw - but eventually I got 40,762 comments:

D:\anaconda3\envs\pytorch_1_4\lib\site-packages\psaw\PushshiftAPI.py:192: UserWarning: Got non 200 code 502
  warnings.warn("Got non 200 code %s" % response.status_code)
D:\anaconda3\envs\pytorch_1_4\lib\site-packages\psaw\PushshiftAPI.py:192: UserWarning: Got non 200 code 522
  warnings.warn("Got non 200 code %s" % response.status_code)

I got this from pmaw, and got 40,630 comments:

40730 results available in Pushshift
Checkpoint:: Success Rate: 93.00% - Requests: 100 - Batches: 10 - Items Remaining: 31889
Checkpoint:: Success Rate: 90.50% - Requests: 200 - Batches: 20 - Items Remaining: 23382
Checkpoint:: Success Rate: 88.00% - Requests: 300 - Batches: 30 - Items Remaining: 16397
Checkpoint:: Success Rate: 84.50% - Requests: 400 - Batches: 40 - Items Remaining: 10293
Checkpoint:: Success Rate: 83.20% - Requests: 500 - Batches: 50 - Items Remaining: 4528
Total:: Success Rate: 83.02% - Requests: 583 - Batches: 59 - Items Remaining: 1
Checkpoint:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: -99
Total:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: 0

@mattpodolak
Owner

I'm currently troubleshooting what happened to those 100 missing comments.

The number of comments returned by psaw is likely incorrect. I ran your query directly against Pushshift, and the metadata indicates that there are only 40730 total results available.

https://api.pushshift.io/reddit/search/comment/?after=1606262347&before=1618581599&subreddit=CovidVaccinated&metadata=true

"metadata": {
        "after": 1606262347,
        "agg_size": 100,
        "api_version": "3.0",
        "before": 1618581599,
        "es_query": {
            "query": {
                "bool": {
                    "filter": {
                        "bool": {
                            "must": [
                                {
                                    "terms": {
                                        "subreddit": [
                                            "covidvaccinated"
                                        ]
                                    }
                                },
                                {
                                    "range": {
                                        "created_utc": {
                                            "gt": 1606262347
                                        }
                                    }
                                },
                                {
                                    "range": {
                                        "created_utc": {
                                            "lt": 1618581599
                                        }
                                    }
                                }
                            ],
                            "should": []
                        }
                    },
                    "must_not": []
                }
            },
            "size": 25,
            "sort": {
                "created_utc": "asc"
            }
        },
        "execution_time_milliseconds": 38.79,
        "index": "rc_delta3",
        "metadata": "true",
        "ranges": [
            {
                "range": {
                    "created_utc": {
                        "gt": 1606262347
                    }
                }
            },
            {
                "range": {
                    "created_utc": {
                        "lt": 1618581599
                    }
                }
            }
        ],
        "results_returned": 25,
        "shards": {
            "failed": 0,
            "skipped": 0,
            "successful": 4,
            "total": 4
        },
        "size": 25,
        "sort": "asc",
        "sort_type": "created_utc",
        "subreddit": [
            "CovidVaccinated"
        ],
        "timed_out": false,
        "total_results": 40730
    }
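
As a rough sketch of how this metadata could be used, the snippet below builds the same query URL and pulls the shard status and total_results out of a metadata dict shaped like the one above. The helper names `build_query_url` and `summarize_metadata` are hypothetical, and no request is actually sent; the metadata dict is a hand-copied subset of the response shown here.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_query_url(subreddit, after, before):
    """Build a comment-search URL that asks Pushshift to include metadata."""
    params = {
        "after": after,
        "before": before,
        "subreddit": subreddit,
        "metadata": "true",
    }
    return BASE_URL + "?" + urlencode(params)


def summarize_metadata(metadata):
    """Return (total_results, shards_ok); shards_ok is None if no shard info."""
    shards = metadata.get("shards")
    if not shards:
        return metadata.get("total_results"), None
    shards_ok = (
        shards.get("failed", 0) == 0
        and shards.get("successful") == shards.get("total")
    )
    return metadata.get("total_results"), shards_ok


# Subset of the metadata block above, copied by hand.
metadata = {
    "shards": {"failed": 0, "skipped": 0, "successful": 4, "total": 4},
    "total_results": 40730,
}

url = build_query_url("CovidVaccinated", 1606262347, 1618581599)
total, shards_ok = summarize_metadata(metadata)
print(url)
print(total, shards_ok)
```

Comparing total_results against the number of unique ids actually retrieved gives a quick check on whether anything was missed.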

@mattpodolak
Owner

mattpodolak commented Apr 20, 2021

Additional update: I ran your query with both psaw and pmaw.

psaw returned 40726 comments with unique ids, while pmaw returned 40729 comments with unique ids.

I am currently investigating why 1 comment was missed. I'll release an update sometime this week once I discover the root cause.
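
A comparison by unique id like the one described above might look something like this. It is only a sketch: it assumes both result sets are lists of dicts with an `id` field, and `compare_results` is a hypothetical helper, not something provided by either library.

```python
def compare_results(results_a, results_b):
    """Compare two result lists by comment id."""
    ids_a = [r["id"] for r in results_a]
    ids_b = [r["id"] for r in results_b]
    set_a, set_b = set(ids_a), set(ids_b)
    return {
        "unique_a": len(set_a),
        "unique_b": len(set_b),
        # Raw length minus unique count gives the number of duplicate rows.
        "duplicates_a": len(ids_a) - len(set_a),
        "duplicates_b": len(ids_b) - len(set_b),
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
    }


# Toy data: the second list repeats "c2" and lacks "c3".
first = [{"id": "c1"}, {"id": "c2"}, {"id": "c3"}]
second = [{"id": "c1"}, {"id": "c2"}, {"id": "c2"}]

report = compare_results(first, second)
print(report)
```

Counting unique ids rather than raw list length matters here, because a list with duplicates can match the expected total while still missing results.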

@ChrisPalmerNZ
Author

ChrisPalmerNZ commented Apr 20, 2021

Thanks for doing this Matthew. I ran the pmaw query a day after the psaw one, and I noticed at that time that it said that fewer (40,730) posts were available than what psaw returned. I measured the number of posts from both libraries by the length of the data, rather than any reporting by the library. I am not currently in front of my PC, but I have saved the data so when I get home tonight I will look at it to see if there were any duplicates returned that might explain the higher psaw number.

@ChrisPalmerNZ
Author

Hi Matthew
I looked at my data and realized that I had transposed the last 2 digits of the psaw count - it was 40,726 rather than 40,762, which agrees with what you reported for psaw. However, my count from pmaw was lower, at 40,630. Perhaps I was unlucky and there were shards down - can you advise me how I should have tested for that? I am happy to email the IDs to you if that helps your inquiry.

@mattpodolak
Owner

mattpodolak commented Apr 21, 2021

Usually, if shards are down a warning should be printed in both pmaw and psaw.

I'm not too sure why there were 100 missing results as I was unable to re-create this, so it could be data inconsistency with Pushshift. I have in the past partially lost pmaw results when exploring the data directly using the generator before storing in a CSV.
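
To illustrate the generator pitfall mentioned above, here is a minimal sketch using a plain Python generator as a stand-in for a results object (an assumption; `fake_results` is hypothetical and this is not pmaw's actual response class). Anything consumed while exploring is no longer there when the rest is stored.

```python
def fake_results():
    # Stand-in for a one-shot results generator: each item is yielded once.
    for i in range(5):
        yield {"id": f"c{i}"}


results = fake_results()
preview = [next(results) for _ in range(2)]  # peeking consumes two items
remaining = list(results)                    # only three items are left to store

# Safer pattern: materialise everything first, then inspect the list freely.
safe = list(fake_results())

print(len(preview), len(remaining), len(safe))  # 2 3 5
```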

I would treat the number of available items reported by pmaw ("40730 results available in Pushshift") as a baseline for the number of results to expect from a query, and you can re-run accordingly.

Based on the logs you provided, it appears that pmaw re-ran the query after 1 result was not found. This has been fixed for the next version, which will be released.

Total:: Success Rate: 83.02% - Requests: 583 - Batches: 59 - Items Remaining: 1 # finished the query
Checkpoint:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: -99 # re-tries the query to get the missing item
Total:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: 0 
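
The "use the reported total as a baseline and re-run" suggestion could be sketched as below. This is an illustration under stated assumptions: `collect_until_complete` is a hypothetical helper, `run_query` is any zero-argument callable that returns an iterable of dicts with an `id` field, and the toy `runs` data stands in for real queries.

```python
def collect_until_complete(run_query, expected_total, max_runs=3):
    """Re-run a query, merging results by id, until the baseline is reached."""
    seen = {}
    for _ in range(max_runs):
        for item in run_query():
            # Keep the first copy of each id so re-runs never add duplicates.
            seen.setdefault(item["id"], item)
        if len(seen) >= expected_total:
            break
    return list(seen.values())


# Toy stand-in: the first run misses one item, the second run returns all three.
runs = iter([
    [{"id": "a"}, {"id": "b"}],
    [{"id": "a"}, {"id": "b"}, {"id": "c"}],
])

merged = collect_until_complete(lambda: next(runs), expected_total=3)
print([m["id"] for m in merged])  # ['a', 'b', 'c']
```

In practice `run_query` would wrap the actual search call, and `expected_total` would be the "results available" figure printed at the start of the run. The `max_runs` cap keeps the loop from running indefinitely if the baseline is never reached.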

@mattpodolak
Owner

mattpodolak commented Apr 21, 2021

Two problems discovered thanks to this issue:

  1. Duplicate results can be included if a query with before and after misses result(s) - fixed in v1.0.5
  2. Missing results can occur for queries that specify before and after - will be tracked in #13, "Queries that specify before and after can return a different number of results than reported as available by Pushshift"

@ChrisPalmerNZ
Author

ChrisPalmerNZ commented Apr 21, 2021

Thanks for all of that Matt - I'm glad, and very impressed, that my issue got your devoted attention and that it led to an improvement - it's a great product! BTW, last night I re-ran the query and got all 40,730 results. Also, I am familiar with how generators work and I unpacked the results straight to CSV, so that wasn't the issue here...

@mattpodolak
Owner

No problem, thanks for reporting the issue. It's worth noting that the 40,730 results that pmaw returned to you likely include a single duplicate.

I'm still working on figuring out the root cause, but v1.0.5 will not add duplicate results.
