# THIS NOTEBOOK IS FOR PREPARING, CHECKING AND CLEANING THE DATASET

Import statements

In [1]:
import praw
import json
import os
from dotenv import load_dotenv
load_dotenv()

True

Build the dataset by fetching the top 100 posts of a subreddit (in this case r/PHbuildapc) through the Reddit API with PRAW. Only top-level comments are kept for each post.

In [7]:

reddit = praw.Reddit(
    client_id = os.getenv('CLIENT_ID'),
    client_secret = os.getenv('CLIENT_SECRET'),
    user_agent = "retldr"
)

data = []
subreddit = reddit.subreddit("PHbuildapc") # subreddit of choice 

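for submission in subreddit.top(limit=100): # only fetch the top 100 posts
    post = {
        "title": submission.title,
        "selftext": submission.selftext, # empty string for link posts
        # keep only the top-level comments that are already loaded;
        # the isinstance check skips MoreComments placeholders
        "comments": [
            comment.body for comment in submission.comments
            if isinstance(comment, praw.models.Comment)
        ]
    }
    data.append(post)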

# save dataset to json
with open("reddit_data.json", "w") as file:
    json.dump(data, file, indent=4)
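
The dictionary built inside the loop can be factored into a small helper, which makes the comment filtering testable without network access or credentials. This is a minimal sketch rather than the notebook's actual code: `submission_to_record` is a hypothetical name, and it assumes that the "load more comments" placeholders PRAW mixes into `submission.comments` have no `body` attribute, so a duck-typed check can stand in for the `isinstance` test.

```python
from types import SimpleNamespace


def submission_to_record(submission):
    """Turn a submission-like object into the dict layout used in the dataset."""
    return {
        "title": submission.title,
        "selftext": submission.selftext,
        # assumption: only real comments expose a `body` attribute
        "comments": [
            comment.body
            for comment in submission.comments
            if hasattr(comment, "body")
        ],
    }


# Stand-in object for a quick offline check (not a real PRAW submission).
fake_submission = SimpleNamespace(
    title="Budget build check",
    selftext="Is this parts list okay?",
    comments=[SimpleNamespace(body="Looks fine."), SimpleNamespace(count=12)],
)
record = submission_to_record(fake_submission)
```

With real data the helper would be called as `data.append(submission_to_record(submission))` inside the loop.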


Open the dataset (json) file

In [2]:
with open('reddit_data.json', 'r') as file: # read-only is enough here
    data = json.load(file)
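
Since this notebook is also meant for checking the dataset, a quick structural check right after loading can catch malformed entries early. The following is a minimal sketch under stated assumptions: `find_malformed_posts` and `EXPECTED_KEYS` are hypothetical names, and the expected keys are simply the ones written by the scraping cell.

```python
EXPECTED_KEYS = {"title", "selftext", "comments"}


def find_malformed_posts(posts):
    """Return the list positions of entries that miss an expected key
    or whose "comments" field is not a list."""
    bad = []
    for position, post in enumerate(posts):
        # the second test only runs when all expected keys are present
        if not EXPECTED_KEYS.issubset(post) or not isinstance(post["comments"], list):
            bad.append(position)
    return bad


# Small made-up sample: entry 1 lacks "comments", entry 2 has the wrong type.
sample = [
    {"title": "a", "selftext": "", "comments": []},
    {"title": "b", "selftext": ""},
    {"title": "c", "selftext": "", "comments": "not a list"},
]
malformed = find_malformed_posts(sample)
```

On the loaded dataset, `find_malformed_posts(data)` should return an empty list if every entry has the expected shape.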

Check how many summaries have been added to the dataset

In [4]:
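summary_counter = 0
post_w_summary = []
for position, post in enumerate(data):
    if "summary" in post:
        summary_counter += 1
        # fall back to the list position if the "index" key has not been added yet
        post_w_summary.append(post.get("index", position))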

print(f"{summary_counter} posts have summaries added to it")
print(f"posts with index {post_w_summary} have summaries added to it")

6 posts have summaries added to it
posts with index [0, 1, 2, 3, 4, 5] have summaries added to it
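
It can also be useful to see which posts still need a summary. A minimal sketch, assuming an empty summary string should count as missing; `split_by_summary` is a hypothetical helper and not part of the original notebook.

```python
def split_by_summary(posts):
    """Return two lists of post indices: posts with a non-empty summary
    and posts without one."""
    with_summary, without_summary = [], []
    for position, post in enumerate(posts):
        # use the stored index when present, otherwise the list position
        key = post.get("index", position)
        if post.get("summary"):
            with_summary.append(key)
        else:
            without_summary.append(key)
    return with_summary, without_summary


# Made-up entries for illustration only.
sample_posts = [
    {"index": 0, "summary": "short text"},
    {"index": 1},
    {"index": 2, "summary": ""},
]
done, todo = split_by_summary(sample_posts)
```

Calling `split_by_summary(data)` on the loaded dataset gives the remaining work as the second list.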


Add index numbers to the dataset

In [20]:
# enumerate yields the running position, so no manual counter is needed
for index_num, post in enumerate(data):
    post['index'] = index_num

with open('reddit_data.json', 'w') as file:
    json.dump(data, file, indent=4)

print("index numbers added and saved to json file.")

index numbers added and saved to json file.
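
Because later cells look posts up by their `index` value, it may be worth confirming that every stored index matches the post's position in the list. A minimal sketch; `indices_are_consistent` is a hypothetical helper name.

```python
def indices_are_consistent(posts):
    """Return True when every post has an "index" equal to its list position."""
    return all(
        post.get("index") == position
        for position, post in enumerate(posts)
    )


# Made-up examples: one consistent list, one with a duplicated index.
good_posts = [{"index": 0}, {"index": 1}, {"index": 2}]
bad_posts = [{"index": 0}, {"index": 0}]
good_ok = indices_are_consistent(good_posts)
bad_ok = indices_are_consistent(bad_posts)
```

If `indices_are_consistent(data)` is false, for example after entries were removed by hand, rerunning the indexing cell renumbers the posts.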


Attempt to automate adding summaries to each post entry in the dataset

In [7]:
index = 7
summary = "A Reddit post advises budget-conscious PC builders to avoid overthinking 'future-proofing.' It emphasizes focusing on practical needs rather than chasing the latest tech. Key takeaways include: Prioritize PSU: Invest in a reliable PSU (Gold-rated, 500-750W) as it can transition to future builds. While Bronze-rated PSUs are also viable, choosing a reputable brand is crucial. CPU/GPU Advice: Mid-range options like Ryzen 5/7 or Intel i5 paired with a mid-tier GPU (e.g., RTX 3060) are sufficient for most games at 1080p settings. Avoid overpaying for CPUs, motherboards, or overclocking features unless you have specific needs. AM4 vs. AM5 Debate: AM4 remains cost-effective and viable for years, contrary to claims about its obsolescence. Realistic Expectations: Most gamers don't need ultra settings or high-end builds to enjoy games. Focus on maximizing value within your budget. Community Insights: Comments share relatable experiences, highlighting common pitfalls like overthinking builds or falling for 'future-proofing' marketing. Many emphasize the satisfaction of well-balanced, budget-friendly builds. The post encourages gamers to build within their means, enjoy the present, and plan for upgrades only when necessary."

for i in data:
    if i["index"] == index:
        i["summary"] = summary
        break # index values are unique, so stop at the first match

with open('reddit_data.json', 'w') as file:
    json.dump(data, file, indent=4)
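
The cell above silently does nothing when no post has the requested index, and an interrupted write could leave a truncated JSON file. The following is a minimal sketch of one way to guard against both, not the notebook's actual method: `set_summary` and `save_dataset` are hypothetical helpers, and the demo writes to a temporary directory instead of `reddit_data.json`.

```python
import json
import os
import tempfile


def set_summary(posts, index, summary):
    """Attach a summary to the post with the given index, or raise KeyError."""
    for post in posts:
        if post.get("index") == index:
            post["summary"] = summary
            return post
    raise KeyError(f"no post with index {index}")


def save_dataset(posts, path):
    """Write to a temporary file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        json.dump(posts, tmp_file, indent=4)
    os.replace(tmp_path, path)  # replaces the target in a single step


# Demo on made-up data, written to a throwaway directory.
demo_posts = [{"index": 0, "title": "a"}, {"index": 1, "title": "b"}]
set_summary(demo_posts, 1, "short summary")
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "reddit_data_demo.json")
    save_dataset(demo_posts, demo_path)
    with open(demo_path) as demo_file:
        reloaded = json.load(demo_file)
```

In the notebook this would be used as `set_summary(data, index, summary)` followed by `save_dataset(data, 'reddit_data.json')`.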