# Introduction
This is a Jupyter notebook, designed to test out the PRAW library and get an idea of how to use it for reddit scraping. It requires the `praw` library, which can be installed directly (e.g. `pip install praw`) or in a conda environment.

# Set up a reddit instance
First, we need to set up a reddit instance. This requires setting up a Reddit application. Head over to [the authorized applications page](https://www.reddit.com/prefs/apps) and create a new script at the bottom. You must provide a title, a description and a redirect URI (use `http://localhost:8080` here). Once your app has been created, you will have the required credentials to start using the Reddit API. The `user_agent` is the name of the application; the `client_id` is the line of gibberish next to the icon of the application; and the `client_secret` is the `secret` gibberish.

In [1]:
import praw
reddit = praw.Reddit(
    user_agent="IsItMould?",
    client_id="EXRL43UtmsdPymRwqZErkg",
    client_secret="JrkEkIWSzI_srcUvJeh24Q__qvS3VQ",
)
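
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of an alternative is to read them from environment variables. The helper name `load_reddit_credentials` and the variable names `REDDIT_USER_AGENT`, `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` are assumptions made for this example, not something PRAW requires.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_KEYS = {
    "user_agent": "REDDIT_USER_AGENT",
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
}

def load_reddit_credentials(env=None):
    """Return the keyword arguments for praw.Reddit, read from the environment."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_KEYS.items()}

# Usage, once the variables are set in the shell that starts Jupyter:
# reddit = praw.Reddit(**load_reddit_credentials())
```

The function accepts any mapping, which makes it easy to try out with a plain dict before touching the real environment.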

# Get and filter data from subreddit
Next, let's try to obtain some data from our subreddit of interest: `r/kombucha`. First, let's print the attributes and methods for the latest post. Then we will get the last 5 posts by date and output some basic information like the post id, timestamp and title.

## Available attributes
Let's have a look at the available attributes and methods for the objects that are returned by `praw`.

In [2]:
# Get the attributes for a subreddit
print(dir(reddit.subreddit('kombucha')))

# Get the attributes for a post
last_posts = reddit.subreddit('kombucha').new(limit=1)
for post in last_posts:
    print(vars(post))

['MESSAGE_PREFIX', 'STR_FIELD', 'VALID_TIME_FILTERS', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_convert_to_fancypants', '_create_or_update', '_fetch', '_fetch_data', '_fetch_info', '_fetched', '_kind', '_parse_xml_response', '_path', '_reddit', '_reset_attributes', '_safely_add_arguments', '_submission_class', '_submit_media', '_subreddit_collections_class', '_subreddit_list', '_upload_inline_media', '_upload_media', '_url_parts', '_validate_gallery', '_validate_inline_media', '_validate_time_filter', 'banned', 'collections', 'comments', 'contributor', 'controversial', 'display_name', 'emoji', 'filters', 'flair', 'fullname', 'gilded', 'hot', 'message', 'mod', 'moderator', 

## Return the last posts
Seems that the last posts can be returned using `.new()`. We can then extract useful information like the post id, the timestamp and the post title.

In [3]:
# Import datetime
import datetime

# Grab the last 5 posts
last_posts = reddit.subreddit('kombucha').new(limit=5)

# Function to determine date and time posted
# courtesy of https://www.reddit.com/r/learnprogramming/comments/37kr5n/praw_is_it_possible_to_get_post_time_and_date/
def get_date(submission):
    # created_utc is the Unix timestamp of the submission;
    # fromtimestamp() renders it in the local timezone
    timestamp = submission.created_utc
    return datetime.datetime.fromtimestamp(timestamp)

# Print post id, date and title
print("Last 5 posts from the kombucha reddit:\n")
print("id\ttimestamp\ttitle")
for post in last_posts:
    attrs = (post.id, get_date(post), post.title)
    print("%s\t%s\t%s" % attrs)

Last 5 posts from the kombucha reddit:

id	timestamp	title
ta7nxf	2022-03-09 14:22:03	My kombucha smells rotten 😔
ta4k3h	2022-03-09 11:01:26	First brew PH stuck at 3.5
ta49vo	2022-03-09 10:39:35	Tips for making Jackfruit Kombucha?
ta0dw0	2022-03-09 06:15:44	Moving Kombucha
ta047p	2022-03-09 06:00:10	r/Kombucha Weekly Weird Brews and Experiments (March 09, 2022)
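
The timestamps above are rendered in the local timezone of the machine running the notebook, because `fromtimestamp` without a `tz` argument converts to local time. If reproducible timestamps matter, a timezone-aware variant is one option. This is a small sketch; the name `get_date_utc` is made up for this example and it takes the raw Unix timestamp rather than a submission object.

```python
import datetime

def get_date_utc(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.datetime.fromtimestamp(created_utc, tz=datetime.timezone.utc)

# The Unix epoch itself, as a sanity check
print(get_date_utc(0).isoformat())  # 1970-01-01T00:00:00+00:00
```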


## Filter by flair
To create our scraper, we are looking for posts that contain images of pellicles. Specifically, we want images of pellicles that have been classified as either "mold!" or "not mold" by their flair. It seems to be possible to filter posts by flair.

In [None]:
# Obtain the last 5 mold posts
mold_posts = reddit.subreddit('kombucha').search(query='flair:"mold!"', sort='new', limit = 5, syntax='lucene')
print("Last 5 mold posts from the kombucha reddit:\n")
print("id\ttimestamp\ttitle")
for post in mold_posts:
    attrs = (post.id, get_date(post), post.title)
    print("%s\t%s\t%s" % attrs)

In [None]:
# Obtain the last 5 not mold posts
not_mold_posts = reddit.subreddit('kombucha').search(query='flair:"not mold"', sort='new', limit = 5, syntax='lucene')
print("\nLast 5 not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\ttitle")
for post in not_mold_posts:
    attrs = (post.id, get_date(post), post.title)
    print("%s\t%s\t%s" % attrs)

That didn't seem to work: the flairs are different, yet the posts overlap. Switching between the `lucene`, `cloudsearch` and `plain` syntaxes also didn't do the trick. On the reddit website, `flair:"not mold"` returns just the `not mold` posts. As an alternative, can we get all `mold` posts and then filter by their flair?

In [None]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')
print("\nLast 5 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\ttitle")
for post in mold_posts:
    attrs = (post.id, get_date(post), post.link_flair_text, post.title)
    print("%s\t%s\t%s\t%s" % attrs)

That worked! These posts match the results on the reddit website, and this seems sufficient to scrape the posts (and hopefully images) we need for our classifier.
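
Since the wildcard search returns both flairs, the split into `mold!` and `not mold` can happen on our side. Below is a minimal sketch of such a filter. The function name `filter_by_flair` is hypothetical, and the demo uses `SimpleNamespace` stand-ins rather than real PRAW submissions, so only the `link_flair_text` attribute used above is assumed.

```python
from types import SimpleNamespace

def filter_by_flair(posts, wanted):
    """Keep only posts whose flair text is in `wanted` (case-insensitive)."""
    wanted = {w.lower() for w in wanted}
    kept = []
    for post in posts:
        # Posts without a flair have link_flair_text set to None
        flair = getattr(post, "link_flair_text", None)
        if flair is not None and flair.lower() in wanted:
            kept.append(post)
    return kept

# Stand-in objects instead of real submissions
demo_posts = [
    SimpleNamespace(id="a", link_flair_text="mold!"),
    SimpleNamespace(id="b", link_flair_text="not mold"),
    SimpleNamespace(id="c", link_flair_text=None),
]
print([p.id for p in filter_by_flair(demo_posts, ["mold!"])])  # ['a']
```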

## Image from post
Next, we need to obtain the URLs to images that have been attached to these posts. First, let's try to get these from the post URL.

In [None]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')
print("\nLast 5 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\turl")
for post in mold_posts:
    attrs = (post.id, get_date(post), post.link_flair_text, post.url)
    print("%s\t%s\t%s\t%s" % attrs)

It seems that in some posts, the URL points directly to an image, whereas in others, it points to a gallery. Perhaps the `media_metadata` allows us to access all images within the gallery?

In [None]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')
print("\nLast 5 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\turl")
for post in mold_posts:
    #print(dir(post))
    #break
    attrs = (post.id, get_date(post), post.link_flair_text, post.media_metadata)
    print("%s\t%s\t%s\t%s\n\n" % attrs)

This returns an error, because one of the submissions doesn't have `media_metadata`. It seems that this is the post whose URL pointed directly to an image. Can we avoid this problem by checking whether the URL points to an image?

In [None]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')
print("\nLast 5 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\turl")
for post in mold_posts:
    if post.url.endswith(('.jpg', '.png', '.gif', '.jpeg')):
        attrs = (post.id, get_date(post), post.link_flair_text, post.url)
    else:
        attrs = (post.id, get_date(post), post.link_flair_text, post.media_metadata)
    print("%s\t%s\t%s\t%s\n\n" % attrs)

Success! Although it's probably safest to still check whether the `media_metadata` attribute exists once we develop our true scraper.
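
One way to make that check explicit is to look the attribute up with `getattr` and a default, so a missing `media_metadata` never raises. The sketch below is only an illustration: `describe_media` is a made-up helper and the demo objects are `SimpleNamespace` stand-ins for submissions.

```python
from types import SimpleNamespace

IMAGE_EXTENSIONS = ('.jpg', '.png', '.gif', '.jpeg')

def describe_media(post):
    """Return ("image", url), ("gallery", metadata) or ("other", None)."""
    if post.url.endswith(IMAGE_EXTENSIONS):
        return ("image", post.url)
    # getattr with a default avoids an AttributeError when the
    # attribute is absent (for example on video posts)
    metadata = getattr(post, "media_metadata", None)
    if metadata:
        return ("gallery", metadata)
    return ("other", None)

direct = SimpleNamespace(url="https://i.redd.it/4o54mmxtl6m81.jpg")
video = SimpleNamespace(url="https://v.redd.it/example")
print(describe_media(direct)[0], describe_media(video)[0])  # image other
```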

## Parse the `media_metadata`
Now that we know that the image URLs can be stored in the `media_metadata` attribute, it's time to get the image URLs out. It seems that the `media_metadata` attribute is a dict, which contains one key:value pair per image, and that for each image, many links to different image sizes are stored under the `p` key. Another link is stored under the `s` key, which appears not to be scaled, so I assume that this is the original submission. For now, let's select these URLs, although for our classifier purposes they are much too large, and we may choose to select a resized image (e.g. `x = 1080`).

In [None]:
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 10, 
    syntax='lucene')
print("\nLast 10 mold or not mold posts from the kombucha reddit:\n")
print("id\ttimestamp\tflair\turl")
for post in mold_posts:
    if post.url.endswith(('.jpg', '.png', '.gif', '.jpeg')):
        attrs = (post.id, get_date(post), post.link_flair_text, post.url)
        print("%s\t%s\t%s\t%s" % attrs)
    else:
        attrs = (post.id, get_date(post), post.link_flair_text)
        try:
            media_metadata = post.media_metadata
            for id in media_metadata.keys():
                url = media_metadata[id]["s"]["u"]
                sub_attrs = attrs + (url,)
                print("%s\t%s\t%s\t%s" % sub_attrs)
        except AttributeError:
            # Most likely a video - skip
            continue
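
Selecting a resized image instead of the original could look roughly like the sketch below. It assumes that each entry in the `p` list is a dict with a width under `x` and a link under `u`, mirroring the layout of the `s` entry used above. The helper name `pick_image_url` and the example URLs are invented for illustration.

```python
def pick_image_url(item, max_width=1080):
    """Return the URL of the widest preview that is not wider than max_width.

    Falls back to the source image under "s" when no preview qualifies,
    and to None when there is no usable URL at all.
    """
    previews = [
        p for p in item.get("p", [])
        if "u" in p and p.get("x", 0) <= max_width
    ]
    if previews:
        return max(previews, key=lambda p: p["x"])["u"]
    return item.get("s", {}).get("u")

# Made-up metadata entry in the assumed layout
demo_item = {
    "p": [
        {"x": 108, "y": 81, "u": "https://example.com/108.jpg"},
        {"x": 640, "y": 480, "u": "https://example.com/640.jpg"},
        {"x": 1080, "y": 810, "u": "https://example.com/1080.jpg"},
    ],
    "s": {"x": 3072, "y": 2304, "u": "https://example.com/source.jpg"},
}
print(pick_image_url(demo_item))  # https://example.com/1080.jpg
```

Falling back to the source image keeps every post in the dataset at the cost of occasionally downloading a large file; returning `None` instead would be the stricter choice.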

# Download and store images
Now we have URLs to a bunch of images (sometimes multiple from the same post) and their classification. Let's see if we can download these images using `urllib` requests.

In [4]:
import urllib.request
import os.path
import time

# Downloads an image to path
def download_img(url, path, name):
    img_path = os.path.join(path, name)
    with open(img_path, "wb") as f:
        f.write(urllib.request.urlopen(url).read())

# Simplifies the classification
def get_class(flair):
    cl = "mold_1" if flair == "mold!" else "not_mold_0"
    return(cl)

# Define path
path = "/Users/guus/Downloads"

# Get the last 5 mold/not mold posts
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = 5, 
    syntax='lucene')

# Loop through posts and save the images
for post in mold_posts:
    cl = get_class(post.link_flair_text)
    if post.url.endswith(('.jpg', '.png', '.gif', '.jpeg')):
        # Direct link to a single image
        img_name, img_type = os.path.splitext(os.path.basename(post.url))
        name = post.id + "_" + img_name + "_" + cl + img_type
        download_img(post.url, path, name)

        # 1 request per 2 seconds allowed (apparently)
        time.sleep(2)

    else:
        media_metadata = getattr(post, "media_metadata", None)
        if not media_metadata:
            # No gallery metadata (most likely a video) - skip
            continue
        for img_id, item in media_metadata.items():
            if "s" not in item:
                # Image was not processed, so no source URL is available
                continue
            # The MIME type looks like "image/jpg"; prepend the dot that
            # os.path.splitext keeps in the branch above
            img_type = "." + item["m"].replace("image/", "")
            name = post.id + "_" + img_id + "_" + cl + img_type
            download_img(item["s"]["u"], path, name)

            # 1 request per 2 seconds allowed (apparently)
            time.sleep(2)
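
The file names above are assembled from the URL, which gets awkward once a URL carries a query string such as `?width=3072`. A small sketch of a more defensive variant parses the URL first. `build_image_name` is a hypothetical helper, not part of the notebook's code.

```python
import os.path
from urllib.parse import urlparse

def build_image_name(post_id, url, label):
    """Build '<post_id>_<image name>_<label><extension>' from an image URL."""
    # Only the path component is used, so '?width=...' and similar are dropped
    path = urlparse(url).path
    stem, ext = os.path.splitext(os.path.basename(path))
    return f"{post_id}_{stem}_{label}{ext}"

print(build_image_name("t9js4w", "https://i.redd.it/4o54mmxtl6m81.jpg", "mold_1"))
# t9js4w_4o54mmxtl6m81_mold_1.jpg
```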


# Number of available datapoints
Now that we know that we can download images with a classification, let's inspect how large our dataset will be.

In [None]:
# Get all available mold/not mold posts (no limit)
mold_posts = reddit.subreddit('kombucha').search(
    query='flair:"*mold*"', 
    sort='new', 
    limit = None, 
    time_filter = "all",
    syntax='lucene')

n = 0
print("id\ttimestamp\tflair\ttitle")
for post in mold_posts:
    attrs = (post.id, get_date(post), post.link_flair_text, post.title)
    print("%s\t%s\t%s\t%s" % attrs)
    n += 1
print("Number of posts:\t" + str(n))

There appears to be a limit on the number of posts returned: only 248 posts were returned, but the oldest date is not very far in the past.

# Moving from PRAW to Pushshift
Pushshift is a copy of all reddit posts and comments, plus an API intended for big queries. This means that it can be used to query further back than PRAW allows. Below is an API call that queries the kombucha subreddit for the 5 latest posts. The output is a JSON object with all the familiar attributes.

```
https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=5
```

However, there are a few issues that need to be overcome when using Pushshift:

* Pushshift doesn't have any method for filtering by flair, though flairs are available in the output. This means we'll need to devise a different way of finding mould-related posts;
* Pushshift stores posts as they were submitted. This means that any later update to the post might not be included in the database. Thus, any flair update to `mold!` or `not mold` will not have been included, and those posts are likely to still be labeled as `what's wrong!?` or `question`.

Since we're looking for mold, we can use the query `mold` to find any posts that mention mould. However, there is a chance of missing many posts. Alternatively, we could scrape all posts, and then look for `.jpg` in the post link or metadata like we did above. We then use PRAW to obtain the flair for those submission IDs.

In a way, only having the original flair from Pushshift is an advantage, because some users initially label their question about mould as `mold!`, despite not knowing whether it is mould. Those labels could be wrong if not updated by the user. By filtering those submissions out, we improve our final data set.
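
Pushshift query URLs like the one shown earlier in this section can also be assembled from a dict of parameters rather than written out by hand, which makes it harder to mistype a parameter. This is a minimal sketch using `urllib.parse.urlencode`; the helper name `build_query` is an assumption for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_query(subreddit, size=5, **params):
    """Return a Pushshift submission search URL, newest posts first."""
    query = {
        "subreddit": subreddit,
        "sort": "desc",
        "sort_type": "created_utc",
        "size": size,
    }
    # Extra parameters such as q="mold" or before=<timestamp> are appended
    query.update(params)
    return BASE_URL + "?" + urlencode(query)

print(build_query("kombucha", size=5))
```

Note that `urlencode` percent-encodes special characters, so a comma-separated `fields` value will appear with `%2C` in the resulting URL.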

In [5]:
import json

# Query last 500 mold results
q = 'https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=500&q=mold'
response = urllib.request.urlopen(q)
data = response.read()
result = json.loads(data)
ids = list()
for x in result["data"]:
    ids.append(x["id"])
    date = datetime.datetime.fromtimestamp(x["created_utc"])
    print("%s\t%s\t%s" % (x["id"], date, x["title"]))

print(len(ids))


KeyboardInterrupt: 

In [None]:
# Get the flair text via PRAW
full_ids = [i if i.startswith('t3_') else f't3_{i}' for i in ids]
posts = reddit.info(full_ids)
for post in posts:
    print(post.link_flair_text)

That seems to work. We first fetched the posts using Pushshift, and then batch queried PRAW using the post ids to get the current flairs. The number of returned items seems to be limited to 100, so we could fetch all the data in batches of 100 using the `before` / `after` filter. Next question: can we fetch posts from further back? To test this, let's sort the posts ascending, rather than descending. This should get us the first ever image posts in the subreddit. Because flairs may not have been in use back then, let's just print the titles.

In [None]:
q = 'https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=asc&sort_type=created_utc&size=5&q=".jpg"'
response = urllib.request.urlopen(q)
data = response.read()
result = json.loads(data)
ids = list()
for x in result["data"]:
    ids.append(x["id"])
    print("%s\t%s" % (x["id"], x["title"]))


In [None]:
# Get the post title through PRAW
full_ids = [i if i.startswith('t3_') else f't3_{i}' for i in ids]
posts = reddit.info(full_ids)
for post in posts:
    print(post.title)

That also seems to work. Now we can limit the fields to the essential ones to reduce query time and data transfer.

In [None]:
q = "https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=5&fields=id,media_metadata,created_utc,url,link_flair_text,is_gallery,retrieved_on,title"
response = urllib.request.urlopen(q)
data = response.read()
result = json.loads(data)
ids = list()
dates = list()
for x in result["data"]:
    ids.append(x["id"])
    dates.append(x["created_utc"])
    print("%s\t%s\t%s" % (x["id"], x["created_utc"], x["title"]))

Now let's see if we can fetch the last 5 posts before a specific post. We select the timestamp from the third post above, and fetch the 5 posts before it.

In [None]:
q = "https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=5&fields=id,media_metadata,created_utc,url,link_flair_text,retrieved_on,title&before=%s" % dates[2]
response = urllib.request.urlopen(q)
data = response.read()
result = json.loads(data)
ids = list()
for x in result["data"]:
    ids.append(x["id"])
    print("%s\t%s" % (x["id"], x["title"]))

This means that we can iteratively fetch 100 posts, get the timestamp of the last post, and then fetch the 100 posts before that, which allows us to scrape all posts in a subreddit. PSAW is a wrapper around Pushshift that does exactly that: it paginates by timestamp in blocks of 100, but makes sure that not too many requests are sent and that any timeout is caught. This means that we can fetch the entire history of a subreddit in a single command. Let's try it.

In [None]:
from psaw import PushshiftAPI
api = PushshiftAPI()

# Get all posts made between March 8, 2022 and now
start = int(datetime.datetime(2022, 3, 8).timestamp())
res = api.search_submissions(
    after = start,
    subreddit = 'kombucha',
    # As in the raw queries above: direction in `sort`, field in `sort_type`
    sort = 'desc',
    sort_type = 'created_utc',
    filter = ['id', 'media_metadata', 'created_utc', 'url', 'link_flair_text', 'retrieved_on' , 'title'])
posts = list(res)
print(len(posts))


That seems to run forever, even for a short timespan. It may be better to implement this function manually.

In [41]:
# We expect 100 hits per query
limit = 100
n_hits = limit

# Get the current timestamp
now = datetime.datetime.today()
now_ts = datetime.datetime.timestamp(now)
ts = int(now_ts)

# Get a date at which to stop fetching posts
end_str = "03/01/2022"
end = datetime.datetime.strptime(end_str, "%m/%d/%Y")
end_ts = int(datetime.datetime.timestamp(end))

# Structure to store submissions
posts = []

# Rudimentary function to call the API
def fetch_ps_submission(
    ts, end_ts, limit = 100):
    q = "https://api.pushshift.io/reddit/search/submission/?subreddit=kombucha&sort=desc&sort_type=created_utc&size=%s&fields=id,media_metadata,created_utc,url,link_flair_text,retrieved_on,title&before=%s&after=%s" % (limit, ts, end_ts)
    response = urllib.request.urlopen(q)
    data = response.read()
    result = json.loads(data)["data"]
    return(result)

# Fetch results until the end date is reached or results run out
while (ts >= end_ts and n_hits == limit):
    result = fetch_ps_submission(ts, end_ts)
    n_hits = len(result)
    if n_hits == 0:
        # Nothing left to fetch; avoids indexing into an empty list
        break
    posts = posts + result
    print("Number of posts fetched: " + str(len(posts)))
    ts = int(result[n_hits - 1]["created_utc"])
    time.sleep(1)

Number of posts fetched: 100
Number of posts fetched: 164
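
The same pagination logic can be written as a reusable generator that takes the fetching function as an argument, which makes it possible to try it out without network access. This is a sketch under assumptions: `paginate` and `fake_fetch` are invented names, and the stand-in treats `before` and `after` as exclusive bounds, which may not match the real API exactly.

```python
def paginate(fetch, start_ts, end_ts, limit=100):
    """Yield posts newest-first by repeatedly asking for posts before a timestamp.

    `fetch(before, after, limit)` must return a list of dicts that have a
    "created_utc" key and are sorted newest first.
    """
    before = start_ts
    while before > end_ts:
        batch = fetch(before, end_ts, limit)
        if not batch:
            break
        yield from batch
        if len(batch) < limit:
            # A short batch means the results ran out
            break
        # Continue from the oldest post in this batch
        before = int(batch[-1]["created_utc"])

# Stand-in data: one fake post per second, newest first
FAKE_POSTS = [{"id": str(t), "created_utc": t} for t in range(250, 0, -1)]

def fake_fetch(before, after, limit):
    """Mimic the API on FAKE_POSTS, with exclusive before/after bounds."""
    hits = [p for p in FAKE_POSTS if after < p["created_utc"] < before]
    return hits[:limit]

print(len(list(paginate(fake_fetch, 251, 0))))  # 250
```

With the real API, `fetch` would wrap the HTTP request and a pause between calls, as in the loop above.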


Now that we know how to fetch all submissions up to a specific date, we can store the submissions based on the presence of `.jp(e)g` in the post URL or `media_metadata`.

In [42]:
class Submission:

    # Base attributes imported from pushshift
    def __init__(self, ps_out):

        # PushShift parameters
        self.id = ps_out["id"]
        self.url = ps_out["url"]
        self.created = ps_out["created_utc"]
        self.retrieved = ps_out["retrieved_on"]
        self.title = ps_out["title"]
        self.media_meta = ps_out["media_metadata"] if "media_metadata" in ps_out.keys() else None
        self.original_flair = ps_out["link_flair_text"] if "link_flair_text" in ps_out.keys() else None
    
        # Inferred parameters
        self.has_image = False
        self.img_urls = []
        self.dl_imgs = False

        # PRAW parameters
        self.current_flair = None

    # Method to check for images
    def check_image(self):
        if self.url.endswith(('.jpg', '.jpeg')):
            self.has_image = True
            self.img_urls.append(self.url)
        elif self.media_meta is not None:
            self.has_image = True
            for id in self.media_meta.keys():
                try:
                    url = self.media_meta[id]["s"]["u"]
                    self.img_urls.append(url)
                except KeyError:
                    # Not all images are processed and therefore lack the "s" key
                    continue
    
    # Method to check if post has a relevant flair for downloading
    def check_download(self, flairs):
        if self.current_flair in flairs:
            self.dl_imgs = True

    # Method to extract additional PRAW parameters
    def get_praw(self, praw_subm):
        self.media_meta = praw_subm.media_metadata if "media_metadata" in vars(praw_subm) else None
        self.current_flair = praw_subm.link_flair_text if "link_flair_text" in vars(praw_subm) else None
        self.img_urls = []

In [43]:
# Store all submissions into a dictionary of objects
submissions = dict()
for p in posts:
    sub = Submission(p)
    sub.check_image()
    submissions[sub.id] = sub

print(submissions["ta4k3h"].img_urls)
print(submissions["t9js4w"].img_urls)


['https://preview.redd.it/d8oy2jf30cm81.jpg?width=3072&amp;format=pjpg&amp;auto=webp&amp;s=69d2aa30768fe421003a4115917d0f697265dd01', 'https://preview.redd.it/f4ylvdl30cm81.jpg?width=3072&amp;format=pjpg&amp;auto=webp&amp;s=151003e4448303a32e37809c02e3bd6d7b18785f']
['https://i.redd.it/4o54mmxtl6m81.jpg']


Above I printed two URLs from the `submission.media_metadata`, and one from a `submission.url`. The ones from the `media_metadata` do not appear to be working, so we will also have to fetch those through PRAW. Let's create the PRAW fetching part:
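
One possible explanation, judging from the output above, is that the ampersands in the Pushshift URLs are HTML-escaped as `&amp;`. If that is the cause, unescaping the URL may be enough to make it usable. The sketch below is an untested assumption rather than a verified fix, and `unescape_url` is a made-up helper name.

```python
import html

def unescape_url(url):
    """Turn HTML entities such as '&amp;' back into plain characters."""
    return html.unescape(url)

escaped = ("https://preview.redd.it/d8oy2jf30cm81.jpg?width=3072&amp;format=pjpg"
           "&amp;auto=webp&amp;s=69d2aa30768fe421003a4115917d0f697265dd01")
print(unescape_url(escaped))
```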

In [44]:
# Get the ids of posts with images
ids = submissions.keys()
img_ids = []
for id in ids:
    if submissions[id].has_image:
        img_ids.append(id)
full_ids = [i if i.startswith('t3_') else f't3_{i}' for i in img_ids]

# Split the ids into chunks
# courtesy of https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))
ids_list = chunks(full_ids, 50)

# Loop through chunks and obtain submissions through PRAW
for x in ids_list:
    posts = reddit.info(x)
    for post in posts:
        submissions[post.id].get_praw(post)
        submissions[post.id].check_image()

print(submissions["ta4k3h"].img_urls)
print(submissions["t9js4w"].img_urls)
print(submissions["t9js4w"].original_flair)
print(submissions["t9js4w"].current_flair)

['https://preview.redd.it/f4ylvdl30cm81.jpg?width=3072&format=pjpg&auto=webp&s=151003e4448303a32e37809c02e3bd6d7b18785f', 'https://preview.redd.it/d8oy2jf30cm81.jpg?width=3072&format=pjpg&auto=webp&s=69d2aa30768fe421003a4115917d0f697265dd01']
['https://i.redd.it/4o54mmxtl6m81.jpg']
beautiful booch
beautiful booch


Success! We now have the original and current flairs, as well as working URLs to images. Now all that's left is to subset the data to posts that included a mold related flair and download the images. Let's also write the data to a table.

In [49]:
# Check if a mould flair is present and whether it has changed
flairs = ['mold!', 'kahm!', 'not mold', 'pellicle']
for sub in submissions.values():
    sub.check_download(flairs)
    if sub.dl_imgs:
        print(sub.original_flair, sub.current_flair, sub.url)

mold! mold! https://www.reddit.com/gallery/t9qzee
mold! mold! https://i.redd.it/dzvsztbe61m81.jpg
kahm! kahm! https://i.redd.it/ekg4osqg21m81.jpg
what's wrong!? not mold https://www.reddit.com/gallery/t86y7v
question not mold https://www.reddit.com/gallery/t6a39m
mold! mold! https://www.reddit.com/gallery/t4k01s


It seems that out of the recent posts, most started with a mold flair already assigned. Upon inspecting the images, they also appear to be correctly assigned. For the moment, let's not exclude those posts and first scrape all the data. We can always exclude those images later when we are certain that there are enough images for our classifier.
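
Writing the data to a table, as mentioned earlier, could be done with the standard `csv` module. The sketch below writes tab-separated rows from plain dicts. The column selection and the name `write_table` are assumptions, and the demo row uses a placeholder timestamp.

```python
import csv
import io

FIELDS = ["id", "created", "original_flair", "current_flair", "url"]

def write_table(rows, handle):
    """Write dict rows as a tab-separated table, ignoring keys not in FIELDS."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Demo with an in-memory buffer; the "created" value is a placeholder
demo_rows = [{
    "id": "t9qzee",
    "created": 0,
    "original_flair": "mold!",
    "current_flair": "mold!",
    "url": "https://www.reddit.com/gallery/t9qzee",
    "img_urls": [],  # extra key, not written
}]
buffer = io.StringIO()
write_table(demo_rows, buffer)
print(buffer.getvalue())
```

With the `Submission` objects from this notebook, the rows could be produced with `vars(sub)` for each value in `submissions`, and the handle would be a file opened with `newline=""`.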