Hi community,

With Mako's invaluable help, I was able to write some Python to retrieve submissions from Reddit across various subreddits.

Below are two functions that can 1) retrieve submissions from any given subreddit and 2) remove any duplicate submissions that may unintentionally sneak into your data.

I'm sharing this resource here for anyone else who may be using Reddit/PushshiftAPI or is just curious about the process!

---

In [1]:
import json
from pmaw import PushshiftAPI
api = PushshiftAPI()

In [101]:
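# today (May 10th, 2023) in epoch time: where retrieval starts
may_10th_epoch = 1683765853
#august_27_2021 = 1630025053
# April 30th, 2022 (00:00 UTC) in epoch time: where retrieval stops
april_30_2022 = 1651276800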

Above I listed the relevant dates for my project: the date you want to start from (for me, today, aka 5/10/23) and the date you want to stop your retrieval. I wanted one year of submissions. Due to the Pushshift API drama, the most recent date I could access was 4/30/23, so I set the one-year-prior mark at 4/30/22.
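If you'd rather compute these epoch values yourself, here is a minimal sketch using Python's standard datetime module. The helper names to_epoch and from_epoch are my own, and I'm assuming you want each date read as midnight UTC; adjust if you care about a local time zone.

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    """Return the epoch time (whole seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

def from_epoch(epoch_seconds):
    """Return a timezone-aware UTC datetime for an epoch time."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(to_epoch(2022, 4, 30))   # 1651276800, the april_30_2022 value above
print(from_epoch(1651276800))  # 2022-04-30 00:00:00+00:00
```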

In [132]:
def get_submissions(subreddit):
    before_time = may_10th_epoch

    post_list = []

    counter = 0
    while True:
        posts = api.search_submissions(subreddit = subreddit,
                                   limit = 100, sort = "created_utc",
                                   order = "desc",
                                   before = before_time)

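        new_post_list = [post for post in posts]

        # Stop as soon as a call returns no posts. This check must come before
        # new_post_list[-1] below, which raises IndexError on an empty list.
        if len(new_post_list) == 0:
            break

        post_list.extend(new_post_list)

        # The oldest post in this batch becomes the cutoff for the next call.
        before_time = new_post_list[-1]["created_utc"]
        print(f"api call #{counter + 1} finished. start from epoch time:", before_time)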

        #if len(post_list) >= 500:
        #    break
        if before_time < april_30_2022:
            break

        counter += 1
    return post_list

The function defined here does the bulk of the work! 

It requests 100 posts at a time, sorted in descending time order - basically 'today' backwards. The max limit I could achieve was 1000, but then I ran into other funky duplication bugs, so I'm sticking with 100. After each call, the 'before_time' variable is updated to the timestamp of the oldest post in the batch, so the next call picks up where the last one left off and the loop keeps moving backwards in time. Each batch of up to 100 new submissions is added to post_list. There's also a printed note to tell you how many calls have finished and which timestamp the next one starts from.

Finally, the loop is set to break at my chosen end time (one year of submissions), but that is easily adjustable to your preferred stop date OR stop sample. If you want a certain number of posts instead of posts from a duration of time, uncomment the len(post_list) check and use it instead of the date one. Note that the stop-date check runs after a batch has been added, so the final batch can include some posts from before the stop date.
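To see the paging logic on its own, without touching the API, here is a rough sketch that runs the same kind of loop against fake data. Everything named here (paginate_backwards, fake_fetch, fake_posts, max_posts) is made up for illustration; fake_fetch only stands in for the real API call and assumes each page comes back newest first.

```python
def paginate_backwards(fetch_page, start_before, stop_at, max_posts=None):
    """Collect posts newest-to-oldest by moving a 'before' cutoff back in time.

    fetch_page(before) is a hypothetical stand-in for the API call: it must
    return a list of dicts with a "created_utc" key, newest first, all
    strictly older than `before`.
    """
    collected = []
    before = start_before
    while True:
        batch = fetch_page(before)
        if not batch:                      # nothing left to retrieve
            break
        collected.extend(batch)
        before = batch[-1]["created_utc"]  # oldest post seen so far
        if before < stop_at:               # reached the stop date
            break
        if max_posts is not None and len(collected) >= max_posts:
            break                          # reached the stop sample size
    return collected


# Fake data standing in for a subreddit: one post every 10 seconds.
fake_posts = [{"id": str(t), "created_utc": t} for t in range(1000, 0, -10)]

def fake_fetch(before, page_size=25):
    """Return up to page_size fake posts older than `before`, newest first."""
    older = [p for p in fake_posts if p["created_utc"] < before]
    return older[:page_size]

by_date = paginate_backwards(fake_fetch, start_before=1001, stop_at=500)
by_count = paginate_backwards(fake_fetch, start_before=1001, stop_at=0, max_posts=50)
print(len(by_date), len(by_count))  # 75 50
```

The date-based run returns 75 posts rather than 50 because the last page it fetched straddles the stop time, which is the same behavior as the stop-date check in get_submissions. If you need a hard cutoff, filter on created_utc afterwards.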

In [134]:
def dedup_list(inposts):
    dedup_post_list = []
    for post in inposts:
        if post in dedup_post_list:
            continue
        else:
            dedup_post_list.append(post)
    print(f"dropped {len(inposts) - len(dedup_post_list)} duplicates")
    return dedup_post_list

This function removes any duplicated submissions, which I unfortunately could not avoid otherwise. "Dedup" is short for de-duplicate. It checks each incoming post against the ones already kept and only adds it if an identical post isn't already there, so each post appears just once in the output.
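One thing to be aware of: dedup_list compares whole post records, so it gets slow on large lists, and two copies of the same submission that differ in any field (say, an updated score) both survive. A possible alternative, sketched below, is to track the IDs already seen in a set. This assumes every post dict has an "id" key, which you should confirm against your own data; the function name dedup_by_id and the sample posts are mine.

```python
def dedup_by_id(inposts, key="id"):
    """Keep the first occurrence of each post, judged by its ID field."""
    seen_ids = set()
    dedup_post_list = []
    for post in inposts:
        post_id = post[key]
        if post_id in seen_ids:
            continue
        seen_ids.add(post_id)
        dedup_post_list.append(post)
    return dedup_post_list

# Made-up sample: the third record is the first post again with a new score.
sample = [
    {"id": "abc", "created_utc": 3, "score": 1},
    {"id": "def", "created_utc": 2, "score": 5},
    {"id": "abc", "created_utc": 3, "score": 2},
]
print([post["id"] for post in dedup_by_id(sample)])  # ['abc', 'def']
```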

In [131]:
post_list = get_submissions("kindness")

api call #0 finished. start from epoch time: 1677266233
api call #1 finished. start from epoch time: 1671438860
api call #2 finished. start from epoch time: 1667400411
api call #3 finished. start from epoch time: 1663293857
api call #4 finished. start from epoch time: 1657885245
api call #5 finished. start from epoch time: 1653001375
api call #6 finished. start from epoch time: 1649996106


Here is the actual function call for getting subreddit submissions. 

Where I have inserted "kindness", you'll just input your subreddit of interest within quotes, for example "camping" or "science". The output text lets you know how many calls have finished and the time that the next batch of submissions starts from (it is written in epoch time, which you can easily change to standard time with an online converter). It is important to double check here that the numbers actually decrease with each call, since the posts should be coming back in descending time order.
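If you don't want to eyeball the numbers, a quick check like the sketch below can confirm they only go down. The list is copied from the output above; the helper name is_strictly_decreasing is just something I made up.

```python
from datetime import datetime, timezone

def is_strictly_decreasing(values):
    """Return True if every value is smaller than the one before it."""
    return all(earlier > later for earlier, later in zip(values, values[1:]))

# Epoch times copied from the printed output of get_submissions("kindness").
logged = [1677266233, 1671438860, 1667400411, 1663293857,
          1657885245, 1653001375, 1649996106]

print(is_strictly_decreasing(logged))  # True

# Print each value next to its UTC calendar date for a sanity check.
for epoch in logged:
    print(epoch, datetime.fromtimestamp(epoch, tz=timezone.utc).date())
```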

In [135]:
dedup_post_list = dedup_list(post_list)

dropped 23 duplicates


Here is the call to the dedup function, aka de-duplicating the submission data collected!

It will print how many duplicate submissions were dropped.

In [136]:
with open("kindness_post_data-20230510.jsonl", 'w') as output_file:
    for post in dedup_post_list:
        line_string = json.dumps(post)
        print(line_string, file = output_file)

Here's the last step of this part - writing the submission data to a file!

You can name it whatever you like - just don't forget the '.jsonl' part!
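When you come back to the data later, each line of the '.jsonl' file is one JSON object, so you can read it back line by line. Here's a small sketch of that; load_jsonl is my own helper name, and the demo uses a throwaway temp file with made-up posts rather than the real data file.

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Read a .jsonl file and return a list with one dict per non-empty line."""
    posts = []
    with open(path, "r") as input_file:
        for line in input_file:
            line = line.strip()
            if line:
                posts.append(json.loads(line))
    return posts

# Round trip on a temporary file, written the same way as the cell above.
sample_posts = [{"id": "abc", "title": "hello"}, {"id": "def", "title": "world"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")
    with open(path, "w") as output_file:
        for post in sample_posts:
            print(json.dumps(post), file=output_file)
    reloaded = load_jsonl(path)

print(reloaded == sample_posts)  # True
```

For the real file, you'd call load_jsonl("kindness_post_data-20230510.jsonl") or whatever name you chose.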

---

bike rack / junkyard:

aka not necessary for the data collection and processing, but I don't want to delete it yet

In [103]:
new_list = []
for x in post_list:
    new_list.append(x["created_utc"])
new_list

[1682898517,
 1682871106,
 1682787989,
 1682748752,
 1682600043,
 1682594443,
 1682592601,
 1682567064,
 1682439403,
 1682420135,
 1682333780,
 1682262759,
 1682246392,
 1681845619,
 1681831267,
 1681815407,
 1681788056,
 1681661356,
 1681645049,
 1681605139,
 1681589034,
 1681584358,
 1681550535,
 1681503762,
 1681481978,
 1681388295,
 1681286504,
 1681248663,
 1681177890,
 1681141250,
 1680994660,
 1680984834,
 1680811491,
 1680794220,
 1680793036,
 1680788924,
 1680782408,
 1680753944,
 1680721112,
 1680714623,
 1680702898,
 1680662178,
 1680624585,
 1682898517,
 1682871106,
 1682787989,
 1682748752,
 1682600043,
 1682594443,
 1682592601,
 1682567064,
 1682439403,
 1682420135,
 1682333780,
 1682262759,
 1682246392,
 1681845619,
 1681831267,
 1681815407,
 1681788056,
 1681661356,
 1681645049,
 1681605139,
 1681589034,
 1681584358,
 1681550535,
 1681503762,
 1681481978,
 1681388295,
 1681286504,
 1681248663,
 1681177890,
 1681141250,
 1680994660,
 1680984834,
 1680811491,
 1680794220,

In [107]:
seen_timestamp = {}

for timestamp in new_list:
    key = str(timestamp)
    if key in seen_timestamp:
        seen_timestamp[key] += 1
        print(key)
    else:
        seen_timestamp[key] = 1
print(seen_timestamp.values())

1682898517
1682871106
1682787989
1682748752
1682600043
1682594443
1682592601
1682567064
1682439403
1682420135
1682333780
1682262759
1682246392
1681845619
1681831267
1681815407
1681788056
1681661356
1681645049
1681605139
1681589034
1681584358
1681550535
1681503762
1681481978
1681388295
1681286504
1681248663
1681177890
1681141250
1680994660
1680984834
1680811491
1680794220
1680793036
1680788924
1680782408
1680753944
1680721112
1680714623
1680702898
1680662178
1680624585
1654739425
1654739367
1654686868
1654458590
1654445257
1654423784
1654404540
1654385611
1654307339
1654307051
1654204541
1654204203
1654196879
1654009143
1653852230
1653829598
1653822399
1653817042
1653735035
1653698935
1653662824
1653638652
1653629298
1653629004
1653612878
1653590248
1653582983
1653568640
1653558222
1653471698
1653365583
1653354909
1653331630
1653331049
1653311208
1653264659
1653156254
1653151903
1653130839
1653125006
1653105107
1653077812
1653001375
1652933169
1652854929
1652831651
1652830445
1652829814