# **Pipeline to Create a Poem Dataset from Reddit's r/OCPoetry Subreddit**

Create a Reddit account and register an app (script type) in order to use the Reddit API. Once that is done, note down the client ID (shown under "personal use script") and the secret key. Poems will be scraped from the r/OCPoetry subreddit.

This is an academic project. To comply with the Reddit API terms of service and guidelines, the dataset itself will not be posted. However, you can follow along by creating your own credentials.

**Workflow**

1. **Data Collection:** Use PRAW and the Reddit API to scrape 5k poems from r/OCPoetry, saving fields such as title, body, and author to CSV/JSON
2. **Preprocessing:** Remove extra spaces and special characters, fix formatting, and normalize (lowercase)
3. **Label Generation:** Use the OpenAI API to generate symbolic labels for each poem, then parse the response and add the labels as a new column of the dataset
   - Using GPT 4.5
4. **Dataset Finalization:** Structure the data with poem_text and labels columns, converting the labels to a format suitable for training if needed
5. **Model Training:** Choose a pretrained transformer model and fine-tune it as a multi-label classifier on the created dataset (70/15/15 split)
   - We need ground truth, so we will use GPT 4.5 to generate the labels and then fine-tune with GPT 3.5
6. **Evaluation:** Compute evaluation metrics (precision, recall, F1, Hamming loss) and run human evaluations comparing the output of GPT 4.5 and GPT 3.5
7. **Deployment:** Input an unseen poem, output predicted symbolic labels/themes
8. **Iteration If Time:** Incorporate Wikidata SPARQL for entity linking and label enrichment
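
The metrics named in step 6 can be computed directly from multi-hot label vectors. The sketch below is a minimal pure-Python illustration, assuming each poem's labels are encoded as a 0/1 list over a fixed label vocabulary. The function names and the example vectors are hypothetical, and a library such as scikit-learn would likely be used in practice.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all label slots."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example over a three-label vocabulary, two poems.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
print(hamming_loss(y_true, y_pred))
print(micro_prf(y_true, y_pred))
```

Micro-averaging pools every label slot before computing the ratios, so frequent labels dominate; macro-averaging per label would weight rare themes more heavily.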

**TOS**

- Create a Reddit app to get a client_id and client_secret, and choose a descriptive user_agent
- Stay under the rate limit of 60 requests per minute
- Throttle requests accordingly to avoid temporary bans (no spamming)
- Do not scrape personal information
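
To make the throttling rule concrete, here is a minimal sketch of a helper that spaces calls so they never exceed a per-minute budget. The class name `MinIntervalThrottle` and its parameters are hypothetical and not part of PRAW. PRAW is documented to pace its own requests, so a helper like this mainly matters for code that calls the API by other means.

```python
import time


class MinIntervalThrottle:
    """Spaces calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `throttle.wait()` immediately before each request keeps the request rate at or below the chosen limit.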

**Set up Environment**

In [1]:
pip install praw pandas tqdm

Collecting praw
  Downloading praw-7.8.1-py3-none-any.whl.metadata (9.4 kB)
Collecting tqdm
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting prawcore<3,>=2.4 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update_checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.8.1-py3-none-any.whl (189 kB)
Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Installing collected packages: tqdm, update_checker, prawcore, praw

Successfully installed praw-7.8.1 prawcore-2.4.0 tqdm-4.67.1 update_checker-0.18.0
Note: you may need to restart the kernel to use updated packages.


**Import Packages**

In [1]:
import praw
import csv
import time
from prawcore.exceptions import RequestException, ResponseException, ServerError

**Set Up Credentials**

In [2]:
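import os

# Read the credentials from environment variables instead of hard-coding them,
# so the keys never end up in a shared notebook.
# The variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are our own choice.
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='symbolism-project-llm by /u/Mxrchives')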

# Choose the subreddit
subreddit = reddit.subreddit('OCPoetry')
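
Because the client ID and secret should stay out of any shared copy of the notebook, one option is to load them from environment variables and fail early when one is missing. The helper below is a sketch under that assumption; the function name `load_reddit_credentials` and the variable names are hypothetical choices, not PRAW conventions.

```python
import os

# Hypothetical environment variable names; pick any names you like.
REQUIRED_VARS = ('REDDIT_CLIENT_ID', 'REDDIT_CLIENT_SECRET', 'REDDIT_USER_AGENT')


def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {
        'client_id': env['REDDIT_CLIENT_ID'],
        'client_secret': env['REDDIT_CLIENT_SECRET'],
        'user_agent': env['REDDIT_USER_AGENT'],
    }
```

The result can be unpacked into the constructor, as in `praw.Reddit(**load_reddit_credentials())`.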

**Test with 10 posts into a .csv first**

In [3]:
with open('ocpoem_posts.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    # Write header
    writer.writerow(['title', 'author', 'poem_text'])

    # 10 newest posts
    for submission in subreddit.new(limit=10):
        title = submission.title
        author = str(submission.author)
        poem_text = submission.selftext.replace('\n', ' ')  # Replace newlines with spaces so that it's easier for cleaning later
        writer.writerow([title, author, poem_text])
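
Replacing newlines with spaces discards the line breaks of each poem. If those breaks turn out to matter later, an alternative is to leave them in: Python's `csv` module quotes fields that contain newlines, so they survive a write and read round trip. A minimal sketch using an in-memory buffer and a made-up sample poem:

```python
import csv
import io

poem = 'first line\nsecond line'  # made-up sample text with a line break

# newline='' lets the csv module control line endings itself.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['title', 'author', 'poem_text'])
writer.writerow(['Sample', 'someone', poem])

# Read the data back and confirm the line break is still there.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1][2] == poem)
```

The same applies to a real file as long as it is opened with `newline=''`, as the cells in this notebook already do.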

In [7]:
# Successful!

**Get 1K samples safely (make sure to not spam)**

In [4]:
with open('ocpoetry_posts.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['title', 'author', 'poem_text'])

    count = 0
    try:
        for submission in subreddit.new(limit=1000):
            # Write post info
            writer.writerow([
                submission.title,
                submission.author.name if submission.author else 'deleted',
                submission.selftext.replace('\n', ' ')  # clean newlines
            ])
            count += 1

            # pause every 50 posts for 2 seconds to be safe!!!
            if count % 50 == 0:
                print(f'Fetched {count} posts, sleeping to respect rate limits...')
                time.sleep(2)

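    except (RequestException, ResponseException, ServerError) as e:
        # The listing cannot be resumed from this point, so report the error
        # and keep the rows that were already written.
        print(f'Request error after {count} posts: {e}')
        print('Stopping early; rows fetched so far are saved in ocpoetry_posts.csv.')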

Fetched 50 posts, sleeping to respect rate limits...
Fetched 100 posts, sleeping to respect rate limits...
Fetched 150 posts, sleeping to respect rate limits...
Fetched 200 posts, sleeping to respect rate limits...
Fetched 250 posts, sleeping to respect rate limits...
Fetched 300 posts, sleeping to respect rate limits...
Fetched 350 posts, sleeping to respect rate limits...
Fetched 400 posts, sleeping to respect rate limits...
Fetched 450 posts, sleeping to respect rate limits...
Fetched 500 posts, sleeping to respect rate limits...
Fetched 550 posts, sleeping to respect rate limits...
Fetched 600 posts, sleeping to respect rate limits...
Fetched 650 posts, sleeping to respect rate limits...
Fetched 700 posts, sleeping to respect rate limits...
Fetched 750 posts, sleeping to respect rate limits...
Fetched 800 posts, sleeping to respect rate limits...
Fetched 850 posts, sleeping to respect rate limits...
Fetched 900 posts, sleeping to respect rate limits...
Fetched 950 posts, sleeping to respect rate limits...
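
If a request fails partway through, one option is to re-run the whole fetch with a growing delay between attempts. The sketch below assumes a hypothetical `fetch` callable that wraps the loop above; `retry_with_backoff` and its parameters are illustrative names, not PRAW functions.

```python
import time


def retry_with_backoff(fetch, retry_on=(Exception,), attempts=3,
                       base_delay=10, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on as error:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** attempt
            print(f'Attempt {attempt + 1} failed ({error}); retrying in {delay} s')
            sleep(delay)
```

With PRAW, `retry_on` could be set to `(RequestException, ResponseException, ServerError)` so that only network and server errors trigger another attempt.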

In [None]:
# Successful! Retrieving data took only 1 minute.

**Note:** 990 posts were retrieved rather than 1,000, which is still enough for our data. The `limit` argument to `.new()` is an upper bound, not a guarantee, so the listing can come back short, for example when posts have been deleted. We cannot fetch the 10 missing posts separately to reach 1k, because the Reddit API does not support fetching arbitrary posts from a listing; it returns the newest posts in descending order.
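
To see how many of the retrieved rows are usable, the CSV can be summarized before cleaning. This is a minimal sketch on made-up sample rows; the function name `summarize_posts` is hypothetical, and treating `[removed]` and `[deleted]` as placeholder bodies is an assumption about how Reddit marks missing post text.

```python
import csv
import io

# Assumed markers for posts whose body is no longer available.
PLACEHOLDERS = {'', '[removed]', '[deleted]'}


def summarize_posts(rows):
    """Count total rows and rows whose poem_text is empty or a placeholder."""
    rows = list(rows)
    unusable = sum(1 for row in rows if row['poem_text'].strip() in PLACEHOLDERS)
    return {'total': len(rows), 'unusable': unusable, 'usable': len(rows) - unusable}


# Made-up sample standing in for ocpoetry_posts.csv.
sample_csv = io.StringIO(
    'title,author,poem_text\n'
    'A,user1,some poem text\n'
    'B,deleted,[removed]\n'
    'C,user2,\n'
)
summary = summarize_posts(csv.DictReader(sample_csv))
print(summary)
```

For the real file, the same function can be given `csv.DictReader(open('ocpoetry_posts.csv', newline='', encoding='utf-8'))`.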

In [None]:
# Next, we will take the collected ocpoetry_posts.csv into the data cleaning process.
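
As a preview of that cleaning step, here is a minimal sketch of the operations listed under Preprocessing in the workflow: normalizing Unicode, dropping special characters, collapsing whitespace, and lowercasing. The function name `clean_poem` and the set of punctuation kept are assumptions for illustration, not the final cleaning rules.

```python
import re
import unicodedata


def clean_poem(text):
    """Normalize Unicode, drop special characters, collapse whitespace, lowercase."""
    text = unicodedata.normalize('NFKC', text)
    # Keep letters, digits, underscores, whitespace, and basic punctuation only.
    text = re.sub(r"[^\w\s.,;:!?'-]", ' ', text)
    # Collapse runs of spaces, tabs, and newlines into a single space.
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()


print(clean_poem('Hello,   WORLD!! **bold** #tag'))
```

Lowercasing and stripping markup are lossy, so it may be worth keeping the raw text in a separate column alongside the cleaned version.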