# **Pipeline to Clean the Curated Poem Dataset from Reddit's r/OCPoetry Subreddit**

This subreddit's rules tell users to critique and give feedback on two other poems before posting their own. Because of this, the body text of a post often contains links and other comments alongside the poem. After taking a look at our raw dataset (ocpoetry_posts.csv), these are the things I see that need cleaning or pre-processing:

- Normalize whitespace and remove line breaks
- Apostrophes have been replaced with garbled symbols (encoding artifacts) that need to be cleaned up
- URL links need to be removed
- Bracketed numbers need to be removed (no bracketed references like [1])
- No numbers
    - Find out what to do with years, because some poems have '96, referring to 1996
- No semicolons or any symbols (:;/"'#*&!?()[]-+, etc.) or any punctuation
- Take out the word "feedback" in any of the body texts
- Take out titles in the body text?
    - Titles will look like this: * * Title * *
- No languages other than English
- Normalize everything (all lowercase, no punctuation)
- Remove empty entries and glitched ones that have two words or fewer, or fewer than 35 characters
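
One possible answer to the open question about years like '96 is to expand apostrophe-abbreviated years before deciding how to handle digits. The sketch below is an assumption, not part of the original pipeline: the helper name `expand_year_abbreviations` and the `pivot` cutoff are hypothetical, and it expects apostrophes to have been normalized to the straight `'` first.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Matches an apostrophe followed by exactly two digits, as long as the
# apostrophe is not glued to a preceding letter or digit.
YEAR_ABBREV = re.compile(r"(?<![A-Za-z0-9])'(\d{2})(?!\d)")

def expand_year_abbreviations(text, pivot=30):
    """Rewrite '96 as 1996 and '07 as 2007.

    Assumption: two-digit values above `pivot` belong to the 1900s and the
    rest to the 2000s. The cutoff is a guess and should be tuned to the data.
    """
    def _expand(match):
        two_digits = int(match.group(1))
        century = 1900 if two_digits > pivot else 2000
        return str(century + two_digits)
    return YEAR_ABBREV.sub(_expand, text)

print(expand_year_abbreviations("back in '96 we met"))
```

Contractions such as "don't" are left alone because the pattern requires two digits right after the apostrophe.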

**Load in Dependencies**

In [6]:
# Language detector

!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ---------------------------------------- 0.0/981.5 kB ? eta -:--:--
     -------------------------------------- 981.5/981.5 kB 9.2 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py): started
  Building wheel for langdetect (setup.py): finished with status 'done'
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993364 sha256=452e5c0ee7bacc08325d17afe9c5f497f7f03deba4a9d6f525155261e6f173c0
  Stored in directory: c:\users\marielle\appdata\local\pip\cache\wheels\0a\f2\b2\e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


  DEPRECATION: Building 'langdetect' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'langdetect'. Discussion can be found at https://github.com/pypa/pip/issues/6334


In [21]:
import pandas as pd
import re
import html
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
import unicodedata

In [53]:
df = pd.read_csv("C:/Users/Marielle/OneDrive/Desktop/LLM Project/ocpoetry_posts.csv")

In [54]:
df.head(5)

| | title | author | poem_text |
|---|---|---|---|
| 0 | A Puddle's Whisper | PappaSmurfXIV | You fall into a puddle of resentment and tear... |
| 1 | The Last Light | Potential-Walrus-885 | 7 months have come and gone Time to pack my b... |
| 2 | While You Were in Japan | Dogdydaycare_ | You were already waking up as I was crying mys... |
| 3 | Lost Girl | QueenofGreen79 | Nobody knows about the girl who sits alone in ... |
| 4 | Night Shift | Cunt-huffer | Night Shift Steel doors gasp rust‑b... |


In [55]:
df.shape # 989 poems

(989, 3)

**Cleaning and Normalizing the Dataset**

In [56]:
# Functions

def clean_text(text):
    if pd.isna(text):
        return ""
        
    # Normalize to lowercase
    text = text.lower()
    
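    # Fix common mojibake (UTF-8 text that was decoded as cp1252) first.
    # This must run before the curly-quote replacements below, because some
    # of these sequences end in a curly quote character.
    text = text.replace("â€™", "'").replace("â€˜", "'")
    text = text.replace("â€“", "-").replace("â€”", "-")
    text = text.replace("â€¢", "*").replace("â€¦", "...")
    # The bare "â€" prefix goes last so it cannot break the longer sequences above
    text = text.replace("â€œ", '"').replace("â€", '"')
    
    # Replace genuine curly apostrophes and quotes with straight ones
    text = text.replace("’", "'").replace("‘", "'")
    text = text.replace("“", '"').replace("”", '"')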
    
    # Normalize unicode characters and remove any non-ASCII characters
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    
    # Remove URLs (http, https, www)
    text = re.sub(r"http\S+|www\.\S+", "", text)
    
    # Remove bracketed numbers like [1], [23]
    text = re.sub(r"\[\d+\]", "", text)
    
    # Remove titles formatted like **title**
    text = re.sub(r"\*\*[^*]+\*\*", "", text)
    
    # Remove the word 'feedback' anywhere
    text = re.sub(r"\bfeedback\b", "", text)
    
    # Remove the word 'link' anywhere
    text = re.sub(r"\blink\b", "", text)
    
    # Remove unwanted stray symbols but keep punctuation that is useful for LLMs:
    # apostrophes, commas, periods, question marks, exclamation points, semicolons, dashes, quotes
    
    # Remove other symbols: # * & ( ) [ ] + / \ < > = ~ | ^ _ ` :
    text = re.sub(r"[#*&()\[\]+/\\<>=~|^_`:]", "", text)
    
    # Remove newlines and normalize whitespace
    text = text.replace("\n", " ").replace("\r", " ")
    text = re.sub(r"\s+", " ", text).strip()
    
    return text
    
def is_english(text):
    try:
        return detect(text) == "en"
    except LangDetectException:
        return False
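
The artifact replacements in `clean_text` are order-sensitive: a bare `â€` prefix will clobber any longer sequence that starts with it, and some of the garbled dash sequences end in a curly quote. A minimal table-driven sketch of one way to make the order explicit is shown here. The names `MOJIBAKE_FIXES`, `CURLY_FIXES`, and `fix_quotes_and_dashes` are hypothetical, and the table only covers the sequences the notebook already handles.

```python
# Illustrative table; it covers only the sequences the notebook already handles.
MOJIBAKE_FIXES = {
    "â€": '"',      # bare prefix, left over when the third character is lost
    "â€™": "'",
    "â€˜": "'",
    "â€œ": '"',
    "â€“": "-",
    "â€”": "-",
    "â€¢": "*",
    "â€¦": "...",
}

CURLY_FIXES = {"’": "'", "‘": "'", "“": '"', "”": '"'}

def fix_quotes_and_dashes(text):
    # Longest sequences first, so the bare prefix cannot clobber them.
    for bad in sorted(MOJIBAKE_FIXES, key=len, reverse=True):
        text = text.replace(bad, MOJIBAKE_FIXES[bad])
    # Curly quotes afterwards, because some garbled sequences end in one.
    for bad, good in CURLY_FIXES.items():
        text = text.replace(bad, good)
    return text

print(fix_quotes_and_dashes("donâ€™t waitâ€¦"))
```

Keeping the pairs in one table also makes it easier to add new sequences later without re-checking the order by hand.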

In [57]:
# Application

# Clean and filter
df['cleaned'] = df['poem_text'].apply(clean_text)
df = df[df['cleaned'].apply(is_english)]

# Remove entries with fewer than 35 characters or with two or fewer words
df = df[df['cleaned'].str.len() >= 35]
df = df[df['cleaned'].str.split().apply(len) > 2]

# Reset index
df = df.reset_index(drop=True)
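
To see which step accounts for the rows that disappear between the raw and cleaned data, it can help to count how many entries each filter removes. This is a minimal sketch in plain Python, assuming the cleaned texts are available as a list. `filter_with_report` and the check names are hypothetical, and the language check is left out so the snippet runs without langdetect.

```python
def filter_with_report(texts, checks):
    """Apply named predicates in order and count how many rows each one drops."""
    kept = list(texts)
    dropped = {}
    for name, keep in checks:
        survivors = [t for t in kept if keep(t)]
        dropped[name] = len(kept) - len(survivors)
        kept = survivors
    return kept, dropped

# Same thresholds as the notebook: at least 35 characters, more than 2 words.
checks = [
    ("min_35_chars", lambda t: len(t) >= 35),
    ("more_than_2_words", lambda t: len(t.split()) > 2),
]

sample = [
    "",
    "too short",
    "a" * 40,
    "this line is long enough to pass both of the checks",
]
kept, dropped = filter_with_report(sample, checks)
print(kept, dropped)
```

An `("is_english", is_english)` entry could be added to `checks` in the same way to report the rows removed by language detection.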

In [58]:
# Take a look at clean data

df.head(5)

| | title | author | poem_text | cleaned |
|---|---|---|---|---|
| 0 | A Puddle's Whisper | PappaSmurfXIV | You fall into a puddle of resentment and tear... | you fall into a puddle of resentment and tears... |
| 1 | The Last Light | Potential-Walrus-885 | 7 months have come and gone Time to pack my b... | 7 months have come and gone time to pack my ba... |
| 2 | While You Were in Japan | Dogdydaycare_ | You were already waking up as I was crying mys... | you were already waking up as i was crying mys... |
| 3 | Lost Girl | QueenofGreen79 | Nobody knows about the girl who sits alone in ... | nobody knows about the girl who sits alone in ... |
| 4 | Night Shift | Cunt-huffer | Night Shift Steel doors gasp rust‑b... | night shift steel doors gasp rustburned breath... |
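
Row 4 shows a side effect of the ASCII folding step: what appears to be a non-breaking hyphen in "rust‑b..." is dropped, so the two words fuse into "rustburned". If that is not wanted, one option is to map Unicode hyphens and dashes to `-` before folding. The sketch below is an assumption rather than the notebook's method; `DASH_MAP` and `ascii_fold_keep_dashes` are hypothetical names.

```python
import unicodedata

# Hypothetical pre-pass, not in the original notebook: map Unicode hyphens
# and dashes to "-" so the ASCII fold does not silently delete them.
DASH_MAP = str.maketrans({
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})

def ascii_fold(text):
    """The notebook's folding step: decompose, then drop anything non-ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def ascii_fold_keep_dashes(text):
    """Same fold, but with hyphen-like characters converted to '-' first."""
    return ascii_fold(text.translate(DASH_MAP))

print(ascii_fold("rust\u2011burned"))
print(ascii_fold_keep_dashes("rust\u2011burned"))
```

Accented letters still fold to their base letter in both versions; only the hyphen handling differs.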


In [59]:
df.shape # Now 981 poems

(981, 4)

In [60]:
# Save cleaned data frame

df.to_csv(r"C:\Users\Marielle\OneDrive\Desktop\LLM Project\ocpoetry_cleaned.csv", index=False)

**Note:** After reviewing the cleaned data, please manually delete row 979 because it is not in English, which leaves 980 poems. Once that is done, we are good to move on to labeling with GPT 4.0!