# Introduction
This is a Jupyter notebook, designed to test out the PRAW library and get an idea of how to use it for reddit scraping. It requires the `praw` library, which can be installed directly (e.g. `pip install praw`) or in a conda environment.

# Set up a reddit instance
First, we need to set up a reddit instance. This requires setting up a Reddit application. Head over to [the authorized applications page](https://www.reddit.com/prefs/apps) and create a new script at the bottom. It is required to add a title, description and redirect uri (use `http://localhost:8080` here). Once your apps has been created, you will have the required credentials to start using the Reddit API. The `user_agent` is the name of the application; the `client_id` is the line of gibberish next to the icon of the application; and the `client_secret` is the `secret` gibberish. 

In [1]:
import praw
reddit = praw.Reddit(
    user_agent="IsItMould?",
    client_id="EXRL43UtmsdPymRwqZErkg",
    client_secret="JrkEkIWSzI_srcUvJeh24Q__qvS3VQ",
)

# Get and filter data from subreddit
Next, let's try to obtain some data from our subreddit of interest: `r/kombucha`. First, lets print the attributes and methods for the latest post. Then we will get the last 5 posts by date and output some basic information like the post id, timestamp and title.

## Available attributes
Let's have a look at the available attributes and methods for the objects that are returned by `praw`.

In [105]:
# Get the attributes for a subreddit
print(dir(reddit.subreddit('kombucha')))

# Get the attributes for a post
last_posts = reddit.subreddit('kombucha').new(limit=1)
for post in last_posts:
    print(vars(post))

['MESSAGE_PREFIX', 'STR_FIELD', 'VALID_TIME_FILTERS', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_convert_to_fancypants', '_create_or_update', '_fetch', '_fetch_data', '_fetch_info', '_fetched', '_kind', '_parse_xml_response', '_path', '_reddit', '_reset_attributes', '_safely_add_arguments', '_submission_class', '_submit_media', '_subreddit_collections_class', '_subreddit_list', '_upload_inline_media', '_upload_media', '_url_parts', '_validate_gallery', '_validate_inline_media', '_validate_time_filter', 'banned', 'collections', 'comments', 'contributor', 'controversial', 'display_name', 'emoji', 'filters', 'flair', 'fullname', 'gilded', 'hot', 'message', 'mod', 'moderator', 

## Return the last posts
Seems that the last posts can be returned using `.new()`. We can then extract useful information like the post id, the timestamp and the post title.

In [48]:
# Import datetime
import datetime

# Grab the last 5 posts
last_posts = reddit.subreddit('kombucha').new(limit=5)

# Function to determine date and time posted
# courtesty of https://www.reddit.com/r/learnprogramming/comments/37kr5n/praw_is_it_possible_to_get_post_time_and_date/
def get_date(submission):
	time = submission.created
	return datetime.datetime.fromtimestamp(time)

# Print post id, date and title
print("Last 5 posts from the kombucha reddit:\n")
print("id\ttimestamp\ttitle")
for post in last_posts:
    attrs = (post.id, get_date(post), post.title)
    print("%s\t%s\t%s" % attrs)

Last 5 posts from the kombucha reddit:

id	timestamp	title
t7rlvt	2022-03-06 06:07:52	Scoby hotel question
t7qbhc	2022-03-06 04:48:15	Kratom as a singular base in 1f
t7o1p7	2022-03-06 02:38:07	Is this normal??
t7kec5	2022-03-05 23:23:52	Third batch, four days in, but what is that dark area within the scoby — and what are these dark thread-like things in the liquid? Newbie here, still learning.
t7k2i9	2022-03-05 23:07:08	5 Days into F1 (second batch), do these white bumps look anything like mold to you?


## Filter by flair
To create our scraper, we are looking for posts that contain images of pellicles. Specifically, we want images of pellicles that have been classified as either "mold!" or not "not mold" by their flair. It seems to be possible to filter posts by flair.

In [61]:
# Obtain the last 5 mold posts
mold_posts = reddit.subreddit('kombucha').search(query='flair:"mold!"', sort='new', limit = 5, syntax='lucene')
print("Last 5 mold posts from the kombucha reddit:\n")
print("id\ttimestamp\ttitle")
for post in mold_posts:
    attrs = (post.id, get_date(post), post.title)
    print("%s\t%s\t%s" % attrs)

Last 5 mold posts from the kombucha reddit:

id	timestamp	title
t6a39m	2022-03-04 05:03:15	Mold?
t4k01s	2022-03-01 23:18:08	hey guys, that looks normal? or it's kind of mold?
t3i3pd	2022-02-28 16:26:36	Any idea if this is normal pellicle formation, or mold?
t2cb8t	2022-02-27 02:17:09	My first time making kombucha. This does look like mold? Should I throw it out?
t11wwj	2022-02-25 12:34:45	Ignored this for… a year? Definitely has a dried leathery top that extends about an inch down. Should I (try) take it out and cut off the dried part and then add more sweet tea?


In [68]:
# Obtain the last 5 not mold posts
not_mold_posts = reddit.subreddit('kombucha').search(query='flair:"not mold"', sort='new', limit = 5, syntax='lucene')
print("\nLast 5 not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\ttitle")
for post in not_mold_posts:
    attrs = (post.id, get_date(post), post.title)
    print("%s\t%s\t%s" % attrs)


Last 5 not mold posts from the kombucha reddit:

id	timestamp	title
t6a39m	2022-03-04 05:03:15	Mold?
t3i3pd	2022-02-28 16:26:36	Any idea if this is normal pellicle formation, or mold?
t2cb8t	2022-02-27 02:17:09	My first time making kombucha. This does look like mold? Should I throw it out?
t11wwj	2022-02-25 12:34:45	Ignored this for… a year? Definitely has a dried leathery top that extends about an inch down. Should I (try) take it out and cut off the dried part and then add more sweet tea?
t0n4vh	2022-02-24 23:20:59	First time making kombucha. Is this mold?


That didn't seem to work - the flairs are different, yet the posts overlap. Altering the syntax between `lucene`, `cloudsearch` and `plain` syntaxes also didn't do the trick. On the reddit website - `flair:"not mold"` returns just the `not mold` posts. As an alternative, can we get all `mold` posts and then obtain by their flair?

In [86]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')
print("\nLast 5 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\ttitle")
for post in mold_posts:
    attrs = (post.id, get_date(post), post.link_flair_text, post.title)
    print("%s\t%s\t%s\t%s" % attrs)


Last 5 mold or not mold posts from the kombucha reddit:

id	timestamp	flair	title
t6a39m	2022-03-04 05:03:15	not mold	Mold?
t4k01s	2022-03-01 23:18:08	mold!	hey guys, that looks normal? or it's kind of mold?
t3i3pd	2022-02-28 16:26:36	not mold	Any idea if this is normal pellicle formation, or mold?
t2cb8t	2022-02-27 02:17:09	not mold	My first time making kombucha. This does look like mold? Should I throw it out?
t11wwj	2022-02-25 12:34:45	not mold	Ignored this for… a year? Definitely has a dried leathery top that extends about an inch down. Should I (try) take it out and cut off the dried part and then add more sweet tea?


That worked! These posts match the results on the reddit website, and this seems sufficient to scrape the posts (and hopefully images) we need for our classifier.

## Image from post
Next, we need to obtain the URLs to images that have been attached to these posts. First, let's try to get these from the post URL.

In [89]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')
print("\nLast 5 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\turl")
for post in mold_posts:
    attrs = (post.id, get_date(post), post.link_flair_text, post.url)
    print("%s\t%s\t%s\t%s" % attrs)


Last 5 mold or not mold posts from the kombucha reddit:

id	timestamp	flair	url
t6a39m	2022-03-04 05:03:15	not mold	https://www.reddit.com/gallery/t6a39m
t4k01s	2022-03-01 23:18:08	mold!	https://www.reddit.com/gallery/t4k01s
t3i3pd	2022-02-28 16:26:36	not mold	https://www.reddit.com/gallery/t3i3pd
t2cb8t	2022-02-27 02:17:09	not mold	https://i.redd.it/sry5fwcf1ak81.jpg
t11wwj	2022-02-25 12:34:45	not mold	https://www.reddit.com/gallery/t11wwj


It seems that in some posts, the URL points directly to an image, whereas in others, it points to a gallery. Perhaps the `media_metadata` allows us to access all images within the gallery?

In [107]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')
print("\nLast 5 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\turl")
for post in mold_posts:
    #print(dir(post))
    #break
    attrs = (post.id, get_date(post), post.link_flair_text, post.media_metadata)
    print("%s\t%s\t%s\t%s\n\n" % attrs)


Last 5 mold or not mold posts from the kombucha reddit:

id	timestamp	flair	url
t6a39m	2022-03-04 05:03:15	not mold	{'5ik11nnmjal81': {'status': 'valid', 'e': 'Image', 'm': 'image/jpg', 'p': [{'y': 192, 'x': 108, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=108&crop=smart&auto=webp&s=a9a3d6ad8eebfc9cd31e14e400b1f691d7d5372f'}, {'y': 384, 'x': 216, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=216&crop=smart&auto=webp&s=4de944ecc985afad908ca87a1d0cd4ed4840d78b'}, {'y': 568, 'x': 320, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=320&crop=smart&auto=webp&s=de608d065ee8d9249989fff3f0d5bef50803501b'}, {'y': 1137, 'x': 640, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=640&crop=smart&auto=webp&s=66884928f830099b8ba674e953a4ff63d6d82b77'}, {'y': 1706, 'x': 960, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=960&crop=smart&auto=webp&s=27453143cb7b5dc12b48543c99c914eea151d657'}, {'y': 1920, 'x': 1080, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?widt

AttributeError: 'Submission' object has no attribute 'media_metadata'

This returns an error, because one of the submissions doesn't have `media_metadata`. It seems that this is the post that pointed directly to an URL. Can we avoid this problem by checking if the URL point to an image?

In [109]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')
print("\nLast 5 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\turl")
for post in mold_posts:
    if post.url.endswith(('.jpg', '.png', '.gif', '.jpeg')):
        attrs = (post.id, get_date(post), post.link_flair_text, post.url)
    else:
        attrs = (post.id, get_date(post), post.link_flair_text, post.media_metadata)
    print("%s\t%s\t%s\t%s\n\n" % attrs)


Last 5 mold or not mold posts from the kombucha reddit:

id	timestamp	flair	url
t6a39m	2022-03-04 05:03:15	not mold	{'5ik11nnmjal81': {'status': 'valid', 'e': 'Image', 'm': 'image/jpg', 'p': [{'y': 192, 'x': 108, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=108&crop=smart&auto=webp&s=a9a3d6ad8eebfc9cd31e14e400b1f691d7d5372f'}, {'y': 384, 'x': 216, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=216&crop=smart&auto=webp&s=4de944ecc985afad908ca87a1d0cd4ed4840d78b'}, {'y': 568, 'x': 320, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=320&crop=smart&auto=webp&s=de608d065ee8d9249989fff3f0d5bef50803501b'}, {'y': 1137, 'x': 640, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=640&crop=smart&auto=webp&s=66884928f830099b8ba674e953a4ff63d6d82b77'}, {'y': 1706, 'x': 960, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?width=960&crop=smart&auto=webp&s=27453143cb7b5dc12b48543c99c914eea151d657'}, {'y': 1920, 'x': 1080, 'u': 'https://preview.redd.it/5ik11nnmjal81.jpg?widt

Success! Although it's probably safest to still check whether the `media_metadata` attribute exists once we develop our true scraper.

## Parse the `media_metadata`
Now that we know that the image URLs can be stored in the `media_metadata` attribute, it's time to get the image URLs out. It seems that the the `media_metadata` attribute is a dict, which contains one name:value pair per image, and that for each image, many links to different images sizes are created in the `p` name:value pair. Another link is stored in the `s` key:value pair, which appears not to be scaled, so I assume that this is the original submission. For now, let's select these URLs, although for our classifier purposes, they are much too large, and we may choose to select a resized image (e.g. `x = 1080`).

In [175]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 10, 
    syntax='lucene')
print("\nLast 5 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\turl")
for post in mold_posts:
    if post.url.endswith(('.jpg', '.png', '.gif', '.jpeg')):
        attrs = (post.id, get_date(post), post.link_flair_text, post.url)
        print("%s\t%s\t%s\t%s" % attrs)
    else:
        attrs = (post.id, get_date(post), post.link_flair_text)
        try:
            media_metadata = post.media_metadata
            for id in media_metadata.keys():
                url = media_metadata[id]["s"]["u"]
                sub_attrs = attrs + (url,)
                print("%s\t%s\t%s\t%s" % sub_attrs)
        except AttributeError:
            # Most likely a video - skip
            continue


Last 5 mold or not mold posts from the kombucha reddit:

id	timestamp	flair	url
t6a39m	2022-03-04 05:03:15	not mold	https://preview.redd.it/5ik11nnmjal81.jpg?width=2268&format=pjpg&auto=webp&s=bc64f6eeb7a9a510df20a3fcb9c93ae882a8f0fe
t6a39m	2022-03-04 05:03:15	not mold	https://preview.redd.it/hdkxlonmjal81.jpg?width=2268&format=pjpg&auto=webp&s=1890eced03d9199751ca813fac984138367ce3e5
t4k01s	2022-03-01 23:18:08	mold!	https://preview.redd.it/eh1lhbe8kuk81.jpg?width=3024&format=pjpg&auto=webp&s=b6c10d625d5cd71b2e1239897535fe29bd607555
t4k01s	2022-03-01 23:18:08	mold!	https://preview.redd.it/hp1gmbe8kuk81.jpg?width=3024&format=pjpg&auto=webp&s=cee7d91729704155afdba4eb641f9d2d785c561b
t3i3pd	2022-02-28 16:26:36	not mold	https://preview.redd.it/brdgv5rwdlk81.jpg?width=3024&format=pjpg&auto=webp&s=7cec490c2602a8f1e0db0bf374da6deb0d026fca
t3i3pd	2022-02-28 16:26:36	not mold	https://preview.redd.it/bysx86rwdlk81.jpg?width=3024&format=pjpg&auto=webp&s=6f5a7182eb7ce0d7a2fa6e36da5634dd23ad1b6b
t

# Download and store images
Now we have URLs to a bunch of images (sometimes multiple from the same post) and their classification. Let's see if we can download these images using `urllib` requests.

In [171]:
import urllib.request
import os.path
import time

# Downloads an image to path
def download_img(url, path, name):
    img_path = os.path.join(path, name)
    with open(img_path, "wb") as f:
        f.write(urllib.request.urlopen(url).read())

# Simplifies the classification
def get_class(flair):
    cl = "mold_1" if flair == "mold!" else "not_mold_0"
    return(cl)

# Define path
path = "/Users/guus/Downloads"

# Get the last 5 mold/not mold posts
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')

# Loop through posts and save the image
for post in mold_posts:
    if post.url.endswith(('.jpg', '.png', '.gif', '.jpeg')):
        url = post.url
        id = post.id
        cl = get_class(post.link_flair_text)
        img_name, img_type = os.path.splitext(url)
        img_name = os.path.basename(img_name)
        name = id + "_" + img_name + "_" + cl + img_type
        download_img(url, path, name)
        
        # 1 request per 2 second allowed (apparently)
        time.sleep(2)

    else:
        attrs = (post.id, get_date(post), post.link_flair_text)
        try:
            media_metadata = post.media_metadata
            for id in media_metadata.keys():
                url = media_metadata[id]["s"]["u"]
                post_id = post.id
                cl = get_class(post.link_flair_text)
                img_type = media_metadata[id]["m"].replace("image/", "")
                img_name = id
                name = post_id + "_" + img_name + "_" + cl + img_type
                download_img(url, path, name)
                
                # 1 request per 2second allowed (apparently)
                time.sleep(2)
        except AttributeError:
            continue


AttributeError: 'Submission' object has no attribute 'media_metadata'

# Number of available datapoints
Now that we know that we can download images with a classification, let's inspect how large out dataset will be.

In [167]:
# Get the last 5 mold/not mold posts
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = None, 
    time_filter = "all",
    syntax='lucene')

n = 0
print("id\ttimestamp\tflair\ttitle")
for post in mold_posts:
    attrs = (post.id, get_date(post), post.link_flair_text, post.title)
    print("%s\t%s\t%s\t%s" % attrs)
    n += 1
print("Number of posts:\t" + str(n))

id	timestamp	flair	title
t6a39m	2022-03-04 05:03:15	not mold	Mold?
t4k01s	2022-03-01 23:18:08	mold!	hey guys, that looks normal? or it's kind of mold?
t3i3pd	2022-02-28 16:26:36	not mold	Any idea if this is normal pellicle formation, or mold?
t2cb8t	2022-02-27 02:17:09	not mold	My first time making kombucha. This does look like mold? Should I throw it out?
t11wwj	2022-02-25 12:34:45	not mold	Ignored this for… a year? Definitely has a dried leathery top that extends about an inch down. Should I (try) take it out and cut off the dried part and then add more sweet tea?
t0n4vh	2022-02-24 23:20:59	not mold	First time making kombucha. Is this mold?
t05i3u	2022-02-24 08:58:56	mold!	is this mold?
sz7n44	2022-02-23 04:58:16	not mold	Not sure what to do next.
sz3x3e	2022-02-23 02:05:50	mold!	It’s definitely mold 😭
syzoqz	2022-02-22 23:01:48	not mold	Is this mold?
swjncj	2022-02-19 21:54:24	mold!	Looks like a couple mold spots, is the black mold too?
sw522h	2022-02-19 09:13:23	mold!	RIP Scoby
svp

There appears to be a limit on the number of posts returned: only 248 posts were returned, but the oldest date is not very far in the past.

# Moving from PRAW to Pushshift
Pushshift is a copy of all reddit posts and comments + an API intended for big queries. This means that it can be used to query further back than PRAW allows. Below is an API call that queries the kombucha subreddit for the 5 latest posts. The output is a JSON object with all the familiar attributes.

```
https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=5
```

However, there are a few issues that need to be overcome when using Pushshift:

* Pushshift doesn't have any method for filtering by flair, though flairs are available in the output. This means we'll need to device a different way of findings mould related posts;
* Pushshift stores posts as they were submitted. This means that any later updated to the post might not be included in the database. Thus, any flair update to `mold!` or `not mold` will not have been included, and they are likely to still be labeled as `what's wrong!?` or `question`.

Since we're looking for mold, we can use the query `mold` to find any posts that mention mould. However, these is change of missing many posts. Alternatively, we could scrape all posts, and then look for `.jpg` in the post link or metadata like we did above. We then use PRAW to obtain the flair for those submission IDs.

In a way, only having the original flair from Pushshift is an advantage, because some users initially label their question about mould as `mold!`, despite not knowing whether it is mould. Those labels could be wrong if not updated by the user. By filtering those submissions out, we improve our final data set.

In [222]:
import json

# Query last 500 mold results
q = 'https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=500&q=mold'
response = urllib.request.urlopen(q)
data = response.read()
result = json.loads(data)
ids = list()
for x in result["data"]:
    ids.append(x["id"])
    date = datetime.datetime.fromtimestamp(x["created_utc"])
    print("%s\t%s\t%s" % (x["id"], date, x["title"]))

print(len(ids))


ta2o6p	2022-03-09 08:41:49	Rotten Smelly Kombucha
t9wuxe	2022-03-09 02:59:31	This is my first SCOBY 2 weeks in. Is this healthy looking? I don’t know if this is mold or not.
t9tgjo	2022-03-09 00:06:19	Is this mold ?. Never seen such bubbles and patches in my 2F. I used elderberry syrup which has clean ingredients as well
t9j0h5	2022-03-08 16:16:51	Is it mold? This is the 2nd time it happens, 2 days into F2. Only happen to grapefruit ginger so far though.
t96umb	2022-03-08 03:58:53	First ever batch! Yay! Tastes great, but a little mold on the top?
t9073r	2022-03-07 22:37:09	First brew after a couple of months and I get mold
t8qoir	2022-03-07 15:41:51	Could there be mold in my scoby hotel?
t86y7v	2022-03-06 20:58:40	Another one: is this mold?
t7jwd6	2022-03-05 22:58:59	How does this look? Mold? Been fermentating since 2/26/22
t7bypl	2022-03-05 16:37:18	First time brewer, had this 1F going for 2.5 weeks (chilly house at 64 F) - mandatory “is it mold or yeast or ???” post
t79yiu	2022-03-05

In [223]:
# Get the flair text via PRAW
full_ids = [i if i.startswith('t3_') else f't3_{i}' for i in ids]
posts = reddit.info(full_ids)
for post in posts:
    print(post.link_flair_text)

what's wrong!?
what's wrong!?
mold!
what's wrong!?
question
mold!
what's wrong!?
not mold
question
question
what's wrong!?
what's wrong!?
not mold
what's wrong!?
what's wrong!?
question
what's wrong!?
what's wrong!?
mold!
what's wrong!?
not mold
what's wrong!?
question
question
question
question
what's wrong!?
what's wrong!?
what's wrong!?
what's wrong!?
not mold
question
what's wrong!?
what's wrong!?
what's wrong!?
question
not mold
what's wrong!?
mold!
what's wrong!?
what's wrong!?
question
question
mold!
what's wrong!?
not mold
what's wrong!?
what's wrong!?
what's wrong!?
what's wrong!?
what's wrong!?
question
mold!
what's wrong!?
what's wrong!?
not mold
question
question
what's wrong!?
what's wrong!?
what's wrong!?
what's wrong!?
what's wrong!?
what's wrong!?
question
what's wrong!?
what's wrong!?
what's wrong!?
what's wrong!?
what's wrong!?
mold!
what's wrong!?
what's wrong!?
mold!
what's wrong!?
what's wrong!?
question
mold!
not mold
question
what's wrong!?
what's wrong!?
what's 

That seems to work. We first fetched the posts using Pushshift, and then batch queried PRAW using the post ids to get the current flairs. The number of returned items seems to be limited to 100, so we could fetch all the data in batches of 100 using the `before` / `after` filter. Next question: can we fetch posts from further back? to test this, let's sort the posts ascending, rather than descending - this should get us the first ever image posts in the subreddit. Because flairs may not have been in use back then, let's just print the titles.

In [215]:
q = 'https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=asc&sort_type=created_utc&size=5&q=".jpg"'
response = urllib.request.urlopen(q)
data = response.read()
result = json.loads(data)
ids = list()
for x in result["data"]:
    ids.append(x["id"])
    print("%s\t%s" % (x["id"], x["title"]))


ubql2	Made my first kombucha, just bottled, pics!  
x9jar	Hey Reddit,   I'm brewing my kombucha in 5 gallon food grade plastic buckets.  What do you think?
227d3r	Any Boston area Kombuchittors?
24ms6k	Vinegar smell, is this normal?
2dconc	Keeping hothouse brew going in winter


In [213]:
# Get the post title through PRAW
full_ids = [i if i.startswith('t3_') else f't3_{i}' for i in ids]
posts = reddit.info(full_ids)
for post in posts:
    print(post.title)

Made my first kombucha, just bottled, pics!  
Hey Reddit,   I'm brewing my kombucha in 5 gallon food grade plastic buckets.  What do you think?
Any Boston area Kombuchittors?
Vinegar smell, is this normal?
Keeping hothouse brew going in winter


That also seems to work. Now we can limit the fields to the essential ones to reduce query time and data transfer.

In [230]:
q = "https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=5&fields=id,media_metadata,created_utc,url,link_flair_text,is_gallery,retrieved_on,title"
response = urllib.request.urlopen(q)
data = response.read()
result = json.loads(data)
ids = list()
dates = list()
for x in result["data"]:
    ids.append(x["id"])
    dates.append(x["created_utc"])
    print("%s\t%s\t%s" % (x["id"], x["created_utc"], x["title"]))

ta4k3h	1646820086	First brew PH stuck at 3.5
ta49vo	1646818775	Tips for making Jackfruit Kombucha?
ta2o6p	1646811709	Rotten Smelly Kombucha
ta0dw0	1646802944	Moving Kombucha
ta047p	1646802010	r/Kombucha Weekly Weird Brews and Experiments (March 09, 2022)


Now let's see if we can fetch the last 5 posts before a specific post. We select the timestamp from the third post above, and fetch the 5 posts before it.

In [231]:
q = "https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=5&fields=id,media_metadata,created_utc,url,link_flair_text,retrieved_on,title&before=%s" % dates[2]
response = urllib.request.urlopen(q)
data = response.read()
result = json.loads(data)
ids = list()
for x in result["data"]:
    ids.append(x["id"])
    print("%s\t%s" % (x["id"], x["title"]))

ta0dw0	Moving Kombucha
ta047p	r/Kombucha Weekly Weird Brews and Experiments (March 09, 2022)
t9zadx	Did I kill my starter by overdilution?
t9y6u6	Scoby Ambush
t9y561	SCOBY AMBUSH


This means that we can iteratively fetch 100 posts, get the timestamp of the last post, and then fetch the 100 posts before that, which allow us to scrape all posts in a subreddit. PSAW is a wrapper around pushshift that does exactly that: it paginates by timestamp in blocks of 100, but makes sure that not too many requests are sent and that any timeout is caught. This means that we can fetch the entire history of a subreddit in a single command. Let's try it

In [237]:
from psaw import PushshiftAPI
api = PushshiftAPI()

# Get all posts made between march 8 2022 and now
start = int(datetime.datetime(2022, 3, 8).timestamp())
res = api.search_submissions(
    after = start,
    subreddit = 'kombucha',
    sort_type = 'desc',
    filter = ['id', 'media_metadata', 'created_utc', 'url', 'link_flair_text', 'retrieved_on' , 'title'])
posts = list(res)
print(len(posts))


KeyboardInterrupt: 

That seems to run forever, even for a short timespan. It may be better to implement this function manually.

In [271]:
# We expect 100 hits per query
limit = 100
n_hits = limit

# Get the current timestamp
now = datetime.datetime.today()
now_ts = datetime.datetime.timestamp(now)
ts = int(now_ts)

# Get a date at which to stop fetching posts
end_str = "03/07/2022"
end = datetime.datetime.strptime(end_str, "%m/%d/%Y")
end_ts = int(datetime.datetime.timestamp(date))

# Structure to store posts
posts = []

# Rudimentary function to call the API
def call_ps(limit, ts, end_ts):
    q = "https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=%s&fields=id,media_metadata,created_utc,url,link_flair_text,retrieved_on,title&before=%s&after=%s" % (limit, ts, end_ts)
    response = urllib.request.urlopen(q)
    data = response.read()
    result = json.loads(data)["data"]
    return(result)

# Fetch results until the end date is hit
while (ts >= end_ts and n_hits == limit):
    result = call_ps(limit, ts, end_ts)
    posts = posts + result
    print("Number of posts fetched: " + str(len(posts)))
    n_hits = len(result)
    ts = int(result[n_hits - 1]["created_utc"])
    time.sleep(1)
    
    

Number of posts fetched: 100
Number of posts fetched: 200
Number of posts fetched: 300
Number of posts fetched: 400
Number of posts fetched: 500
Number of posts fetched: 600
Number of posts fetched: 607
