### Task 2: Embedding Preparation – Stage 1: Data Cleaning

This notebook implements the first stage of our embedding pipeline: preparing raw Reddit text for downstream modeling. In summary:

1. **Objective**: Ensure that each comment or post contains valid text and remove noise before generating embeddings.

2. **Workflow Steps**:
   - Keep only rows whose `selftext` is non-null, non-empty, and not a placeholder (`[removed]` or `[deleted]`).
   - Replace URLs and Markdown syntax characters with spaces, then collapse extra whitespace.
   - Lowercase the text and trim surrounding spaces.

3. **Execution Environment**: Runs on the Midway cluster in an interactive `sinteractive` session, using a PySpark `SparkSession` in local mode.

4. **Full Pipeline and Design Details**: See the project README for an end-to-end overview of our large-scale computing design, from data ingestion through embedding generation.  

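The validity rule in the workflow can be checked on individual strings before running it over the full dataset. The following is a minimal plain-Python sketch, not the notebook's Spark code: `has_valid_text` and `PLACEHOLDERS` are hypothetical names, and `strip(" ")` is used on the assumption that Spark's default `trim` removes only space characters.

```python
from typing import Optional

PLACEHOLDERS = {"[removed]", "[deleted]"}  # Reddit placeholder bodies


def has_valid_text(selftext: Optional[str]) -> bool:
    """Return True if the text is non-null, non-empty, and not a placeholder."""
    if selftext is None:
        return False
    # strip(" ") removes only space characters (assumed to match Spark's default trim)
    trimmed = selftext.strip(" ")
    if trimmed == "":
        return False
    return trimmed.lower() not in PLACEHOLDERS


samples = [None, "", "   ", "[removed]", " [Deleted] ", "GME to the moon"]
print([has_valid_text(s) for s in samples])
```

Under this assumption, a body made up only of newlines or tabs would still count as valid, because only spaces are trimmed. Such rows would end up with an empty cleaned string after whitespace is collapsed.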

#### 2.1.0 Start the Spark session

In [None]:
import os, sys
os.environ["PYSPARK_PYTHON"]        = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

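N_CORE_LOCAL   = 32      # worker threads for local mode
SHUFFLE_PART   = 256     # value for spark.sql.shuffle.partitions
DRIVER_MEMORY  = "12g"   # value for spark.driver.memory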

spark = (SparkSession.builder
         .appName("WSB-FinBERT-local32")
         .master(f"local[{N_CORE_LOCAL}]")
         .config("spark.sql.shuffle.partitions", SHUFFLE_PART)
         .config("spark.driver.memory", DRIVER_MEMORY)
         .getOrCreate())

print("✅ Spark master =", spark.sparkContext.master,
      "| shuffle =", SHUFFLE_PART)


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


25/05/28 12:43:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
✅ Spark master = local[32] | shuffle = 256

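The core count above is hard-coded, so it can drift from what the interactive session was actually allocated. One possible way to derive it is sketched below, assuming that `sinteractive` starts a Slurm job and that Slurm exports `SLURM_CPUS_PER_TASK` (which it does only when CPUs per task were requested). `resolve_core_count` is a hypothetical helper, not part of the notebook.

```python
import os


def resolve_core_count(default: int = 32) -> int:
    """Pick the local[N] core count from the Slurm allocation if one is visible."""
    raw = os.environ.get("SLURM_CPUS_PER_TASK", "")
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    # No usable Slurm value: fall back to the default, capped by what the node reports
    return min(default, os.cpu_count() or default)


n_cores = resolve_core_count()
master_url = f"local[{n_cores}]"  # string that could be passed to .master(...)
print(master_url)
```

The resulting string has the same form as the `local[32]` master used in the cell above.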

In [None]:
# Stop the Spark session to release resources (run this only when finished)
spark.stop()

#### 2.1.1 Examine column information of the input data

In [2]:
# path to the parquet directory
PATH_STAGE2 = "/scratch/midway3/zhengzhiyu6689/macs30123/project/reddit/stage02_with_ticker"

# read the whole dataset
df = spark.read.parquet(PATH_STAGE2)

# print schema (column names + types)
df.printSchema()

# optional: show a small sample to eyeball values
df.show(5, truncate=False)



root
 |-- id: string (nullable = true)
 |-- created_utc: long (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- title: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- is_comment: boolean (nullable = true)
 |-- ticker: string (nullable = true)

+------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------+------+
|id    |created_utc|subreddit     |title                                                                                                                                                                                                                                                                          |selftext |is_comment|ticker|
+------+-----------+--------------+-------------

#### 2.1.2 Check the text column for null or empty values

In [None]:
from pyspark.sql import functions as F

# Change this path to the Parquet directory you want to validate
SRC_PATH = "/scratch/midway3/zhengzhiyu6689/macs30123/project/reddit/stage02_with_ticker"

# Read the Parquet dataset from the specified source path
df = spark.read.parquet(SRC_PATH)

# Keep only records whose 'selftext' is non-null, non-empty, and not a "[removed]" or "[deleted]" placeholder
non_empty = df.filter(
    (F.col("selftext").isNotNull()) &
    (F.trim("selftext") != "") &
    (~F.lower(F.trim("selftext")).isin("[removed]", "[deleted]"))
)

# Count how many rows have non-empty, valid 'selftext' content
print("Number of non-empty text rows =", non_empty.count())

# Display a sample of 20 rows showing whether it's a comment, the text body, and the extracted ticker
non_empty.select("is_comment", "selftext", "ticker").show(20, truncate=120)




Number of non-empty text rows = 21745905
+----------+------------------------------------------------------------------------------------------------------------------------+------+
|is_comment|                                                                                                                selftext|ticker|
+----------+------------------------------------------------------------------------------------------------------------------------+------+
|     false|Most people cannot trade == most people are HOLDING!\n\nI think these dips are all a farce to scare people off tomorr...|   GME|
|     false|I’m trying to jump in on the holy GME from Australia. Market in the USA has opened 3 minutes ago. And its saying I am...|   GME|
|     false|&amp;#x200B;\n\n[I sold some of my GME calls back when it was 75$ Friday so I decided to hop back in with the profits...|   GME|
|     false|I’ve had enough with robinhoods bullshit crashing , they forced a margin call on me the other day so my positions wen...|   

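The sample rows include a double-escaped HTML entity (`&amp;#x200B;`, a zero-width space). None of the patterns used later in this notebook target it. If decoding it were wanted, one option is sketched below in plain Python. This is an illustration only, not a step of the notebook; `decode_entities` and `ZERO_WIDTH` are hypothetical names.

```python
import html

# Translation table that deletes common zero-width characters
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def decode_entities(text: str, max_rounds: int = 3) -> str:
    """Unescape HTML entities repeatedly (for double-escaped input), then drop zero-width characters."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to unescape
        text = decoded
    return text.translate(ZERO_WIDTH)


print(repr(decode_entities("&amp;#x200B;\n\nI sold some of my GME calls")))
```

The loop is bounded by `max_rounds` so that unusual input cannot cause unbounded work. Two rounds are enough for the entity shown in the sample.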
#### 2.1.3 Preliminary text cleaning

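The cleaning rules in this step can be tried on a single string with Python's `re` module before running the full job. The sketch below reuses the three patterns from the Spark cell in this section and applies them in the same order. `clean_text` and the sample string are illustrative. Python's `\s` also matches Unicode whitespace, whereas Java's default `\s` (used by Spark) is ASCII-only, so results can differ on unusual whitespace.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MD_RE = re.compile(r"[*_`>~\[\]\(\)]")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Apply the cleaning steps in the same order as the Spark cell."""
    text = URL_RE.sub(" ", raw).lower()  # replace URLs with a space, then lowercase
    text = MD_RE.sub(" ", text)          # Markdown punctuation -> space
    text = SPACE_RE.sub(" ", text)       # collapse whitespace runs (including newlines)
    return text.strip(" ")               # trim surrounding spaces


sample = "**GME** is up!\n\nSee [chart](https://example.com/gme.png) and www.example.com now"
print(clean_text(sample))
```

For this sample the result is `gme is up! see chart and now`. Two side effects are worth knowing: underscores inside words are replaced as well (`a_b` becomes `a b`), and a body that consists only of a URL is reduced to an empty string.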
In [None]:
from pyspark.sql import functions as F, types as T

# Source and destination directories on Midway filesystem
SRC = "/scratch/midway3/zhengzhiyu6689/macs30123/project/reddit/stage02_with_ticker"
OUT = "/scratch/midway3/zhengzhiyu6689/macs30123/project/reddit/stage03_clean_with_ticker"

# Read the previously saved Parquet data including extracted tickers
df = spark.read.parquet(SRC)

# Keep only rows whose 'selftext' is non-null, non-empty, and not a "[removed]" or "[deleted]" placeholder
non_empty = df.filter(
    (F.col("selftext").isNotNull()) &
    (F.trim("selftext") != "") &
    (~F.lower(F.trim("selftext")).isin("[removed]", "[deleted]"))
)

# Regular expressions for URLs, Markdown characters, and whitespace
URL_RE   = r'https?://\S+|www\.\S+'
MD_RE    = r'[*_`>~\[\]\(\)]'
SPACE_RE = r'\s+'

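# Clean the 'selftext' column into 'clean_text':
# 1. Replace URLs with a space, then lowercase the text
# 2. Replace Markdown punctuation with a space
# 3. Collapse runs of whitespace (including newlines) into a single space
# 4. Trim leading and trailing spaces (done in the final select)
# The 'ticker' column is retained for downstream analysis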
clean = (
    non_empty
    .withColumn(
        "clean_text",
        F.lower(F.regexp_replace("selftext", URL_RE, " "))
    )
    .withColumn(
        "clean_text",
        F.regexp_replace("clean_text", MD_RE, " ")
    )
    .withColumn(
        "clean_text",
        F.regexp_replace("clean_text", SPACE_RE, " ")
    )
    .select(
        "id",
        "created_utc",
        "subreddit",
        "ticker",                      # preserve extracted ticker column
        F.trim("clean_text").alias("clean_text")
    )
)

# Repartition the cleaned DataFrame for parallel write (200 partitions)
(clean.repartition(200)
      .write.mode("overwrite")
      .parquet(OUT))

print("✓ stage03_clean_with_ticker written to:", OUT)




25/05/28 10:39:41 WARN MemoryManager: Total allocation exceeds 95.00% (5,876,770,333 bytes) of heap memory
Scaling row group sizes to 99.51% for 44 writers

✓ stage03_clean_with_ticker : /scratch/midway3/zhengzhiyu6689/macs30123/project/reddit/stage03_clean_with_ticker

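The write log reports Parquet's `MemoryManager` scaling row groups, for example to 99.51% for 44 writers. One plausible reading is that each open writer requests one row-group buffer and the total is capped at the reported byte figure. This assumes the default 128 MiB row-group size (`parquet.block.size`). The sketch below reproduces the logged percentage under that assumption; `row_group_scale` and `max_unscaled_writers` are hypothetical helper names.

```python
ROW_GROUP_BYTES = 128 * 1024 * 1024  # assumed default parquet.block.size (128 MiB)
POOL_BYTES = 5_876_770_333           # byte figure reported in the warning


def row_group_scale(n_writers: int) -> float:
    """Fraction each writer's row-group buffer is scaled to when the pool is oversubscribed."""
    requested = n_writers * ROW_GROUP_BYTES
    return min(1.0, POOL_BYTES / requested)


def max_unscaled_writers() -> int:
    """Largest number of concurrent writers that fits in the pool without scaling."""
    return POOL_BYTES // ROW_GROUP_BYTES


print(f"{row_group_scale(44):.2%}")  # matches the logged 99.51% for 44 writers
print(max_unscaled_writers())
```

Under these assumptions the warnings are informational: the write still completes, but with smaller row groups. Fewer concurrent writers or a larger heap would likely avoid the scaling.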