# 02_data_cleaning.ipynb

**Goals:**  
1. Load raw CSVs for Fold 4–7  
2. Combine into one DataFrame  
3. Clean text (lowercase; remove URLs, punctuation, and digits)  
4. Parse timestamps into datetime  
5. Save processed CSV for feature engineering

In [3]:
import os
import pandas as pd
import re

# Ensure output directory exists
os.makedirs("data/processed", exist_ok=True)

In [7]:
# Paths to raw files
files = [
    "/Users/shreychaudhary/Documents/Samsung_Fold7_Sales_Prediction/data/raw/fold4_reddit.csv",
    "/Users/shreychaudhary/Documents/Samsung_Fold7_Sales_Prediction/data/raw/fold5_reddit.csv",
    "/Users/shreychaudhary/Documents/Samsung_Fold7_Sales_Prediction/data/raw/fold6_reddit.csv",
    "/Users/shreychaudhary/Documents/Samsung_Fold7_Sales_Prediction/data/raw/fold7_reddit.csv",
]

# Load each into a DataFrame and concatenate
dfs = [pd.read_csv(path) for path in files]
df = pd.concat(dfs, ignore_index=True)
print("Combined shape:", df.shape)
df.head(2)

Combined shape: (1884, 8)


|   | Version | Subreddit | Title | Body | Score | Comments | Timestamp | URL |
|---|---|---|---|---|---|---|---|---|
| 0 | Fold 4 | samsung | How much battery have you lost over the year(s)? | My Galaxy Note 9 has 50% degradation haha. (It... | 17 | 31 | 1753390000.0 | https://www.reddit.com/r/samsung/comments/1m8f... |
| 1 | Fold 4 | samsung | I really wanna get samsung Z flip 7 | Okay so Ive been an Iphone user for more than ... | 74 | 66 | 1752759000.0 | https://www.reddit.com/r/samsung/comments/1m27... |
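The absolute paths in the loading cell tie the notebook to one machine. A minimal sketch of a more portable loader follows, assuming the four files keep the `fold<N>_reddit.csv` naming seen above; the `load_folds` helper and its `raw_dir` argument are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_folds(raw_dir, versions=(4, 5, 6, 7)):
    """Hypothetical helper: read fold<N>_reddit.csv for each version and stack them."""
    raw_dir = Path(raw_dir)
    frames = []
    for v in versions:
        path = raw_dir / f"fold{v}_reddit.csv"
        if not path.exists():
            # Fail early with the exact path instead of a generic read error
            raise FileNotFoundError(f"Missing raw file: {path}")
        frames.append(pd.read_csv(path))
    # ignore_index=True renumbers rows 0..n-1, as in the cell above
    return pd.concat(frames, ignore_index=True)
```

Calling `load_folds("data/raw")` from the project root would then stand in for the hard-coded list, provided the notebook is started from that directory.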


In [10]:
# Some posts have no title or body; replace missing values with empty strings
df["Title"] = df["Title"].fillna("")
df["Body"]  = df["Body"].fillna("")

# Create unified Text column
df["Text"] = df["Title"] + " " + df["Body"]


In [12]:
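def clean_text(s):
    """Lowercase, strip URLs, and keep only the letters a-z separated by single spaces."""
    s = s.lower()                           # lowercase
    s = re.sub(r"http\S+", "", s)           # remove URLs (http:// and https://)
    s = re.sub(r"[^a-z\s]", "", s)          # keep letters & whitespace only (digits are dropped too)
    s = re.sub(r"\s+", " ", s).strip()      # collapse whitespace
    return s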

df["Clean_Text"] = df["Text"].apply(clean_text)
df[["Text", "Clean_Text"]].head(2)

|   | Text | Clean_Text |
|---|---|---|
| 0 | How much battery have you lost over the year(s... | how much battery have you lost over the years ... |
| 1 | I really wanna get samsung Z flip 7 Okay so Iv... | i really wanna get samsung z flip okay so ive ... |
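The preview shows that `clean_text` turns "flip 7" into "flip", because the `[^a-z\s]` pattern removes digits along with punctuation. If model numbers matter downstream, one possible variant keeps digits. This is only a sketch; the name `clean_text_keep_digits` and the sample string are made up for illustration.

```python
import re


def clean_text_keep_digits(s):
    """Variant of clean_text that also keeps the digits 0-9 (hypothetical name)."""
    s = s.lower()                           # lowercase
    s = re.sub(r"http\S+", "", s)           # remove URLs
    s = re.sub(r"[^a-z0-9\s]", "", s)       # keep letters, digits & whitespace
    s = re.sub(r"\s+", " ", s).strip()      # collapse whitespace
    return s


# Illustrative input, not taken from the dataset
sample = "I really wanna get samsung Z flip 7! https://example.com/post"
print(clean_text_keep_digits(sample))
# prints: i really wanna get samsung z flip 7
```

The trade-off is that scores, prices, and years also survive cleaning, which may or may not be wanted for feature engineering.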


In [14]:
# Convert Unix timestamp (seconds since the epoch, UTC) to datetime
df["Date"] = pd.to_datetime(df["Timestamp"], unit="s")
# Optional: extract just the calendar date (Python date objects, object dtype)
df["DateOnly"] = df["Date"].dt.date

# Drop unused columns
df = df.drop(columns=["Title", "Body", "Timestamp"])
df.head(2)

|   | Version | Subreddit | Score | Comments | URL | Text | Clean_Text | Date | DateOnly |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Fold 4 | samsung | 17 | 31 | https://www.reddit.com/r/samsung/comments/1m8f... | How much battery have you lost over the year(s... | how much battery have you lost over the years ... | 2025-07-24 20:43:52 | 2025-07-24 |
| 1 | Fold 4 | samsung | 74 | 66 | https://www.reddit.com/r/samsung/comments/1m27... | I really wanna get samsung Z flip 7 Okay so Iv... | i really wanna get samsung z flip okay so ive ... | 2025-07-17 13:30:00 | 2025-07-17 |
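`pd.to_datetime(..., unit="s")` returns timezone-naive values that are implicitly UTC, and `.dt.date` yields plain Python `date` objects. The sketch below shows one possible alternative that makes the timezone explicit and keeps a datetime dtype for the day bucket. The single input value is the second row's timestamp as displayed in the preview; the variable names are illustrative.

```python
import pandas as pd

# Second row's Timestamp as shown in the preview above
ts = pd.Series([1752759000.0])

naive = pd.to_datetime(ts, unit="s")             # timezone-naive, implicitly UTC
aware = pd.to_datetime(ts, unit="s", utc=True)   # timezone-aware, explicitly UTC

# normalize() floors to midnight but keeps a datetime64 dtype,
# which is convenient for later groupby or resample calls
day = aware.dt.normalize()

print(naive.iloc[0])   # 2025-07-17 13:30:00
print(aware.iloc[0])   # 2025-07-17 13:30:00+00:00
print(day.iloc[0])     # 2025-07-17 00:00:00+00:00
```

Whether the extra timezone information is worth carrying depends on whether later steps compare these posts against events recorded in local time.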


In [20]:
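# Save to processed folder (create the target directory first so to_csv cannot hit a missing path)
out_path = "/Users/shreychaudhary/Documents/Samsung_Fold7_Sales_Prediction/data/processed/reddit_clean.csv"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
df.to_csv(out_path, index=False)
print("Saved cleaned data:", df.shape)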

Saved cleaned data: (1884, 9)
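CSV does not store dtypes, so the next notebook will read `Date` back as text unless told otherwise, and any row whose `Clean_Text` is empty will come back as NaN. A small round-trip sketch under those assumptions, using an in-memory buffer and a two-row stand-in frame (`df_demo`, invented for illustration) rather than the real file:

```python
import io

import pandas as pd

# Tiny stand-in for the cleaned frame; values are illustrative only
df_demo = pd.DataFrame(
    {
        "Clean_Text": ["how much battery", ""],
        "Date": pd.to_datetime([1752759000, 1752759060], unit="s"),
    }
)

# Write to and read from memory instead of disk
buf = io.StringIO()
df_demo.to_csv(buf, index=False)
buf.seek(0)

# parse_dates restores the datetime dtype on load
reloaded = pd.read_csv(buf, parse_dates=["Date"])
# Empty strings are read back as NaN by default, so restore them
reloaded["Clean_Text"] = reloaded["Clean_Text"].fillna("")

print(reloaded.dtypes)
```

The same two steps (`parse_dates` and `fillna("")`) would apply when loading `reddit_clean.csv` for feature engineering.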
