# Scraping the *r/funny* subreddit with PRAW
## 1. Introduction
This notebook is the first step of my project: collecting the data I need to fine-tune my LLM. I use the `PRAW` library to scrape posts and collect the following fields:
- `post.id`: Used as the partition key
- `post.title`: Title of the post
- `post.selftext`: The body of the post
- `post.url`: Link to the image or video
- `post.created_utc`: Creation time of the post (Unix timestamp, UTC)
- `post.score`: Number of upvotes
- `post.num_comments`: Number of comments
- `top_comment.body`: Text of the top comment
- `top_comment.score`: Number of upvotes for the top comment
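
As a rough sketch of how these fields fit together, the helper below flattens a post and its top comment into a single record. The attribute names (`submissionId`, `createdUtc`, and so on) are the ones used for the DynamoDB item later in this notebook; the `build_item` helper itself is hypothetical, and the stand-in objects with made-up values only mimic the PRAW attributes listed above.

```python
from types import SimpleNamespace


def build_item(post, top_comment=None):
    """Flatten a post and its top comment into one dict (hypothetical helper)."""
    return {
        "submissionId": post.id,
        "title": post.title,
        "body": post.selftext,
        "url": post.url,
        # created_utc is a Unix timestamp (float); store it as an integer
        "createdUtc": int(post.created_utc),
        "score": post.score,
        "numComments": post.num_comments,
        # Fall back to placeholder values when the post has no usable comment
        "topComment": top_comment.body if top_comment else "N/A",
        "topCommentScore": top_comment.score if top_comment else 0,
    }


# Stand-in objects with made-up values, used only to show the record shape
example_post = SimpleNamespace(
    id="abc123",
    title="Example title",
    selftext="",
    url="https://example.com/a.jpg",
    created_utc=1700000000.0,
    score=10,
    num_comments=2,
)
example_comment = SimpleNamespace(body="Example comment", score=5)
example_item = build_item(example_post, example_comment)
```

Keeping the record construction in one place would also avoid repeating the same dictionary literal in every scraping loop.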

Because BLIP is an image-to-text model, **I am limiting the scope of `post.url` values to `.png` and `.jpg` files.** GIFs and videos won't be processed; instead, their descriptions will be inferred using few-shot learning with GPT-4.
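
One way to express this scope rule is a small check on the URL's file extension. This is only a sketch and not the check used in the scraping code of this notebook, which is a plain `str.endswith` on the full URL. The names `image_extension` and `SUPPORTED_EXTENSIONS` are hypothetical, and parsing the URL path first is an assumption made here so that query strings and uppercase extensions are handled.

```python
import os
from urllib.parse import urlparse

# Only these extensions are in scope for image captioning
SUPPORTED_EXTENSIONS = (".jpg", ".png")


def image_extension(url):
    """Return the lowercase extension if the URL path ends in a supported
    image type, otherwise None (hypothetical helper)."""
    path = urlparse(url).path  # drops any query string or fragment
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in SUPPORTED_EXTENSIONS else None


jpg_result = image_extension("https://i.redd.it/tojcmbvjwk601.jpg")
gifv_result = image_extension("http://i.imgur.com/OOFRJvr.gifv")
video_result = image_extension("https://v.redd.it/o72vkte6dnta1")
```

Note that URLs ending in `.jpeg` fall outside the stated scope and return `None` here as well.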

In [1]:
!pip install praw -q

In [2]:
import time
import json
import praw
import random
import pandas as pd
import requests
from io import BytesIO

import boto3
from botocore.exceptions import ClientError

## 2. Initialize Clients

In [3]:
def get_secret():
    """Get secret from AWS Secrets Manager"""

    secret_name = "reddit_scraper"
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']

    return json.loads(secret)


In [4]:
# Initialize Reddit Client
secret = get_secret()
client_id = secret['client_id']
client_secret = secret['client_secret']
password = secret['user_password']
user_agent = secret['user_agent']
username = secret['username']

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    password=password,
    user_agent=user_agent,
    username=username,
)

In [5]:
# Initialize s3
s3 = boto3.client('s3')
bucket_name = 'sagemaker-us-east-1-513033806411'

# Initialize dynamodb
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('funny-reddit-posts')  

## 3. Scraping Reddit
First, I define some helper functions. I store every post in DynamoDB, using `submissionId` as the unique partition key. I then download the image from each image post into my S3 bucket and record its S3 location on the corresponding DynamoDB item.
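
The link between the two stores can be sketched as follows: the post ID determines the S3 object key, and the resulting S3 URI is stored on the DynamoDB item. The `reddit/funny/posts` prefix and the `imageS3Url` attribute match what this notebook uses; the helper names and `example-bucket` are placeholders for illustration.

```python
def s3_key_for(post_id, extension, prefix="reddit/funny/posts"):
    """Build the S3 object key for a post's image (hypothetical helper)."""
    return f"{prefix}/{post_id}{extension}"


def with_image_reference(item, bucket, extension):
    """Return a copy of the item with the S3 URI of its image attached."""
    key = s3_key_for(item["submissionId"], extension)
    return {**item, "imageS3Url": f"s3://{bucket}/{key}"}


# Placeholder item and bucket name, used only to show the resulting shape
original_item = {"submissionId": "abc123", "title": "Example title"}
example = with_image_reference(original_item, "example-bucket", ".jpg")
```

Because the key is derived from the partition key, the image for any stored post can be located without an extra lookup.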

In [6]:
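def write_to_dynamodb(item):
    """Insert a single post item into the DynamoDB table"""
    # Read the ID from the item itself rather than relying on a global `post`
    post_id = item['submissionId']
    try:
        table.put_item(Item=item)
        print(f"Successfully inserted post: {post_id}")
    except Exception as e:
        print(f"Error inserting post: {post_id}, Error: {str(e)}")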
        
    
def download_image(image_url, bucket_name, s3_path):
    """Download an image and upload it to S3; return the S3 URL, or None on failure"""
    try:
        # A timeout keeps one unresponsive host from stalling the whole scrape
        response = requests.get(image_url, timeout=30)
        if response.status_code == 200:
            # Upload directly from the BytesIO object to avoid saving locally
            s3.upload_fileobj(BytesIO(response.content), bucket_name, s3_path)
            print(f"Uploaded {image_url} to s3://{bucket_name}/{s3_path}")
            return f"s3://{bucket_name}/{s3_path}"
        else:
            print(f"Failed to download image {image_url}")
            return None
    except Exception as e:
        print(f"Exception during download/upload: {str(e)}")
        return None

Next, I define functions to iterate through the posts, storing the relevant data in DynamoDB and the images in S3.

In [7]:
def get_top_comment(post):
    """Retrieve the most upvoted comment from a post"""
    
    # Remove MoreComments objects for a flat comment list
    post.comments.replace_more(limit=0)
    # Filter out stickied and distinguished (e.g. moderator) comments
    top_comments = [comment for comment in post.comments if not comment.stickied and not comment.distinguished]
    if top_comments:
        # Sort the comments based on score in descending order and take the top one
        top_comment = sorted(top_comments, key=lambda x: x.score, reverse=True)[0]
        print(f"Top Comment: {top_comment.body} (Score: {top_comment.score})\n")
        return top_comment
    else:
        print("No comments found.\n")
        return None
    

def scrape_posts(top_posts):
    """Iterate through the top posts in the subreddit generator
       and store them in DynamoDB/S3"""

    for post in top_posts:

        print(f"Title: {post.title}")
        print(f"URL: {post.url}")
        # Check if the post has a body (selftext), and print it if it does
        if post.selftext:
            print(f"Body: {post.selftext}\n")
        else:
            print("No body for this post.\n")

        # Get the top comment
        top_comment = get_top_comment(post)

        item = {
            'submissionId': post.id,  # Use post ID as the partition key
            'title': post.title,
            'body': post.selftext,
            'url': post.url,
            'createdUtc': int(post.created_utc),
            'score': post.score,
            'numComments': post.num_comments,
            'topComment': top_comment.body if top_comment else "N/A",
            'topCommentScore': top_comment.score if top_comment else 0
        }

        # If the post URL points to an image (.jpg or .png)
        if post.url.endswith(('.jpg', '.png')):
            s3_path = f"reddit/funny/posts/{post.id}{post.url[-4:]}"
            image_s3_url = download_image(post.url, bucket_name, s3_path)

            # Add the S3 URL to the item before storing it in DynamoDB
            item['imageS3Url'] = image_s3_url or "N/A"

        # Insert the item into DynamoDB
        write_to_dynamodb(item)

        time.sleep(0.6)  # Wait 0.6 seconds before processing the next post
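
The selection logic in `get_top_comment` can be checked offline with stand-in objects instead of live PRAW comments. The sketch below assumes only the three attributes that the filter and sort read (`stickied`, `distinguished`, `score`); `pick_top_comment` is a hypothetical name and the comment values are made up. It uses `max` rather than sorting the whole list, which picks the same comment.

```python
from types import SimpleNamespace


def pick_top_comment(comments):
    """Return the highest-scoring comment that is neither stickied nor
    distinguished, or None if no comment qualifies (hypothetical helper)."""
    eligible = [c for c in comments if not c.stickied and not c.distinguished]
    return max(eligible, key=lambda c: c.score) if eligible else None


# Made-up comments: one pinned moderator comment and two ordinary ones
comments = [
    SimpleNamespace(body="Pinned rules reminder", score=900,
                    stickied=True, distinguished="moderator"),
    SimpleNamespace(body="First joke", score=120,
                    stickied=False, distinguished=None),
    SimpleNamespace(body="Better joke", score=450,
                    stickied=False, distinguished=None),
]
top = pick_top_comment(comments)
```

The pinned comment has the highest score but is excluded, so the result is the best ordinary comment.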




### a) Get Top Posts of All Time
In this cell, I collect the top posts of all time from r/funny. The Reddit API limits the amount returned, so this will give ~1,000 posts. The output below is truncated.

In [10]:
# Define the subreddit you want to scrape
subreddit = reddit.subreddit('funny')

# Fetch the top posts
top_posts = subreddit.top(time_filter = "all", limit = None)  # Adjust the limit as needed

# scrape top posts of all time
scrape_posts(top_posts)

Title: My cab driver tonight was so excited to share with me that he’d made the cover of the calendar. I told him I’d help let the world see
URL: https://i.redd.it/tojcmbvjwk601.jpg
No body for this post.

Top Comment: He's not just on the cover but also [Mr. December](https://i.imgur.com/J4wQbxf.png)


Here's the [whole calendar](https://nyctaxicalendar.com/) which features plenty of shirtless NYC cab dudes.
 (Score: 26712)

Successfully inserted post: 7mjw12
Uploaded https://i.redd.it/tojcmbvjwk601.jpg to s3://sagemaker-us-east-1-513033806411/reddit/funny/posts/7mjw12.jpg

Title: Guardians of the Front Page
URL: http://i.imgur.com/OOFRJvr.gifv
No body for this post.

Top Comment: Can't wait to upvote this 17 different times later this week. (Score: 26644)

Successfully inserted post: 5gn8ru

Title: Gas station worker takes precautionary measures after customer refused to put out his cigarette
URL: https://gfycat.com/ResponsibleJadedAmericancurl
No body for this post.

Top Comment: I 

### b) Get Top Posts of the Past Year
Next, I collect the top posts from the past year.
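
The yearly listing can overlap with the all-time listing collected above. Since DynamoDB's `put_item` replaces any existing item with the same key, re-inserting a post is harmless, but it still costs a comment fetch and an image download. A possible way to skip repeats is sketched below; `skip_seen` and `seen_ids` are hypothetical names and are not part of the cells in this notebook, and the stand-in posts only carry an `id`.

```python
from types import SimpleNamespace


def skip_seen(posts, seen_ids):
    """Yield only posts whose id has not been seen yet, recording each
    new id in seen_ids as it goes (hypothetical helper)."""
    for post in posts:
        if post.id in seen_ids:
            continue
        seen_ids.add(post.id)
        yield post


# Made-up listings that share one post ("b2")
seen_ids = set()
all_time = [SimpleNamespace(id="a1"), SimpleNamespace(id="b2")]
past_year = [SimpleNamespace(id="b2"), SimpleNamespace(id="c3")]

first_pass = [p.id for p in skip_seen(all_time, seen_ids)]
second_pass = [p.id for p in skip_seen(past_year, seen_ids)]
```

With the real listings, the same generator could wrap the input, for example `scrape_posts(skip_seen(top_posts, seen_ids))`.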

In [11]:
# Fetch the top posts
top_posts = subreddit.top(time_filter = "year", limit = None)  # Adjust the limit as needed

# Scrape top posts of the past year
scrape_posts(top_posts)

Title: Adam Sandler and Jennifer Aniston are shocked by the size of an Australian reporter
URL: https://v.redd.it/o72vkte6dnta1
No body for this post.

Top Comment: "Put your hat on", that's hilarious (Score: 13782)

Successfully inserted post: 12knt5j

Title: My hometown just unveiled a 9/11 memorial at the fireman's museum. Think they could have used another set of eyes on this one...
URL: https://i.imgur.com/Y8BzrdR.jpg
No body for this post.

Top Comment: That's...unfortunate. (Score: 18403)

Successfully inserted post: 11qhab2

...


### c) Get Top Posts of the Past Month
Finally, I narrow the scope to the past month.

In [12]:
# Fetch the top posts
top_posts = subreddit.top(time_filter = "month", limit = None)  # Adjust the limit as needed

# Scrape top posts of the past month
scrape_posts(top_posts)

Title: my favorite video so far this year
URL: https://v.redd.it/38ktcvgthubc1
No body for this post.

Top Comment: My man is just playing the hand he was dealt. (Score: 11975)

Successfully inserted post: 19470zf

Title: London, UK
URL: https://i.redd.it/fa536c2dxfcc1.png
No body for this post.

Top Comment: We love the british tourist in amsterdam.... such good behaviour😐 (Score: 5902)

Successfully inserted post: 196l27v
Uploaded https://i.redd.it/fa536c2dxfcc1.png to s3://sagemaker-us-east-1-513033806411/reddit/funny/posts/196l27v.png

Title: My coworker was asked to cut the cake today at work.
URL: https://i.redd.it/dhs2xa401bfc1.jpeg
No body for this post.

Top Comment: That is a person ensuring they never get asked to cut a cake again. (Score: 18277)

Successfully inserted post: 1adm9h7

...


### d) Bypass API Limits using Random Search
The target is 2,500 posts and I am a few hundred short of that, so to collect more top posts I search the subreddit for random words drawn from a corpus of text and keep the top results. This works around the cap of roughly 1,000 posts that the Reddit API places on each listing. To get a good spread, I collect at most three posts per search word. I also require a minimum of 25k upvotes to ensure post quality.
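
The list of search words is not defined in the cells shown here. One possible way to build such a list, purely as an illustration, is to sample distinct words from any block of text. The `sample_search_words` helper, its parameters, and the sample sentence are all hypothetical.

```python
import random
import string


def sample_search_words(corpus_text, n_words, min_length=4, seed=None):
    """Return up to n_words distinct, capitalized words sampled at random
    from corpus_text (hypothetical helper)."""
    # Strip surrounding punctuation from each whitespace-separated token
    cleaned = (w.strip(string.punctuation) for w in corpus_text.split())
    # Keep alphabetic words of a minimum length; sort for reproducibility
    candidates = sorted(
        {w.capitalize() for w in cleaned if w.isalpha() and len(w) >= min_length}
    )
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_words, len(candidates)))


# Made-up corpus, used only to demonstrate the sampling
corpus = "The dog went on an adventure, then the dog came home for dinner."
example_words = sample_search_words(corpus, n_words=3, seed=0)
```

Very short words are dropped because they tend to match almost every post, which would bias the sample toward the same few results.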

In [13]:
# keep track of posts already collected
post_ids = []

# Define the subreddit you want to scrape
subreddit = reddit.subreddit('funny')

# Iterate through the random search words
# (`search_words` is assumed to be a list of words defined in a cell not shown here)
for word in search_words:
    print(f"Searching for '{word}':")
    # Conduct search
    search_results = subreddit.search(query=word, sort='top', time_filter="all", limit=3)
    for post in search_results:
        print(f"Title: {post.title} | Score: {post.score}")

        # store submission if > 25k upvotes
        if post.score > 25000:

            # don't reprocess already seen posts
            if post.id in post_ids:
                continue
            else:
                # get top comment
                top_comment = get_top_comment(post)  
                item = {
                'submissionId': post.id,  # Use post ID as the partition key
                'title': post.title,
                'body': post.selftext,
                'url': post.url,
                'createdUtc': int(post.created_utc),
                'score': post.score,
                'numComments': post.num_comments,
                'topComment': top_comment.body if top_comment else "N/A",
                'topCommentScore': top_comment.score if top_comment else 0
                }

                # add to processed posts list
                post_ids.append(post.id)

                # If the post URL points to an image (.jpg or .png)
                if post.url.endswith(('.jpg', '.png')):
                    s3_path = f"reddit/funny/posts/{post.id}{post.url[-4:]}"  
                    image_s3_url = download_image(post.url, bucket_name, s3_path)

                    # Add the S3 URL to your item before storing to DynamoDB
                    item['imageS3Url'] = image_s3_url or "N/A"
                    
                # Insert the item into DynamoDB
                write_to_dynamodb(item)

        # Pause between posts to respect Reddit's rate limits
        time.sleep(0.6)

Searching for 'Adventure':
Title: Our dog who ran off on an adventure for 7.5 hours ringing our doorbell at 3 am to let us know she’s home | Score: 153599
Top Comment: Dogs can be the most inconsiderate roommates. (Score: 12632)

Successfully inserted post: o9kpkg

Title: The Adventures of Asian Superman | Score: 90298
Top Comment: Perry could learn a thing or two about cultural sensitivity from [Lois Lane](https://www.comicbookdaily.com/wp-content/uploads/2009/10/Lois-Lane-106.jpg). (Score: 5696)

Successfully inserted post: jqwv4h
Uploaded https://i.redd.it/u2c6so4xl7y51.png to s3://sagemaker-us-east-1-513033806411/reddit/funny/posts/jqwv4h.png

Title: What an adventure! | Score: 73435
Top Comment: Meanwhile, thousands of documents on floor 4 go unstapled.  You monster. (Score: 10858)

Successfully inserted post: 6586ox
Uploaded https://i.redd.it/6zdmyk0cudry.jpg to s3://sagemaker-us-east-1-513033806411/reddit/funny/posts/6586ox.jpg


## 4. Conclusion
This concludes the first section of this project. I now have ~2,700 posts stored in DynamoDB for easy access, along with the images in S3. In the next notebook, I will begin processing these images to generate captions to complete the dataset for fine-tuning.