# 01_data_collection.ipynb

**Goal:** Scrape Reddit posts for Galaxy Fold models (Fold 4–7) and save raw CSVs for downstream processing.

We’ll use PRAW (Python Reddit API Wrapper) to collect:
- Post title & body
- Engagement metrics (score, comments)
- Metadata (timestamp, URL)
- A “Version” tag so we know which Fold each post refers to

In [3]:
# Install dependencies (run once)
!pip install praw pandas --quiet

In [5]:
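import os

import praw
import pandas as pd

# Reddit API credentials: read from environment variables so secrets are not stored in the notebook.
# The names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are an assumed convention; set them before running.
client_id     = os.environ["REDDIT_CLIENT_ID"]
client_secret = os.environ["REDDIT_CLIENT_SECRET"]
user_agent    = "FoldSalesPredictor by /u/iofficialshrey"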

# Authentication
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent
)

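# True when no username/password is supplied. This check makes no request to Reddit,
# so invalid credentials only surface on the first real API call.
print("Read-only mode:", reddit.read_only)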


Read-only mode: True


In [7]:
# 1. Subreddits to search
subreddits = ["samsung", "galaxyfold", "Android"]

# 2. Reusable scraper function
def scrape_reddit(version_query, version_label, subreddits, limit=200):
    """
    Scrape up to `limit` posts per subreddit matching `version_query`.
    Returns a DataFrame with columns:
      Version, Subreddit, Title, Body, Score, Comments, Timestamp, URL
    """
    records = []
    for sub in subreddits:
        for post in reddit.subreddit(sub).search(version_query, sort="new", limit=limit):
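            records.append({
                "Version": version_label,
                "Subreddit": sub,
                "Title": post.title,
                "Body": post.selftext,          # empty string for link posts
                "Score": post.score,
                "Comments": post.num_comments,
                "Timestamp": post.created_utc,  # Unix epoch seconds (UTC)
                "URL": post.url,                # link target; the Reddit thread URL only for self posts
            })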
    return pd.DataFrame(records)
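
The `Timestamp` column stores `post.created_utc`, which is Unix epoch seconds in UTC. As a minimal sketch of how such a value could be turned into a readable, timezone-aware datetime, consider the helper below. The name `epoch_to_utc` is hypothetical and not part of the original notebook, and the sample value is only illustrative.

```python
from datetime import datetime, timezone

def epoch_to_utc(epoch_seconds):
    """Convert Unix epoch seconds (as stored in Timestamp) to an aware UTC datetime."""
    return datetime.fromtimestamp(float(epoch_seconds), tz=timezone.utc)

# Illustrative epoch value, not tied to any specific post
print(epoch_to_utc(1753390000.0).isoformat())
```

For a whole DataFrame column, `pd.to_datetime(df["Timestamp"], unit="s", utc=True)` performs the same conversion in one call.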

In [13]:
# Scrape Fold 4
df_fold4 = scrape_reddit("Fold 4 OR Galaxy Fold 4", "Fold 4", subreddits, limit=200)
df_fold4.to_csv("/Users/shreychaudhary/Documents/Samsung_Fold7_Sales_Prediction/data/raw/fold4_reddit.csv", index=False)
print("✅ Fold 4 data saved:", df_fold4.shape)

✅ Fold 4 data saved: (519, 8)


In [15]:
# Scrape Fold 5
df_fold5 = scrape_reddit("Fold 5 OR Galaxy Fold 5", "Fold 5", subreddits, limit=200)
df_fold5.to_csv("/Users/shreychaudhary/Documents/Samsung_Fold7_Sales_Prediction/data/raw/fold5_reddit.csv", index=False)
print("✅ Fold 5 data saved:", df_fold5.shape)


✅ Fold 5 data saved: (493, 8)


In [17]:
# Scrape Fold 6
df_fold6 = scrape_reddit("Fold 6 OR Galaxy Fold 6", "Fold 6", subreddits, limit=200)
df_fold6.to_csv("/Users/shreychaudhary/Documents/Samsung_Fold7_Sales_Prediction/data/raw/fold6_reddit.csv", index=False)
print("✅ Fold 6 data saved:", df_fold6.shape)


✅ Fold 6 data saved: (485, 8)


In [19]:
# Scrape Fold 7
df_fold7 = scrape_reddit("Fold 7 OR Galaxy Fold 7", "Fold 7", subreddits, limit=200)
df_fold7.to_csv("/Users/shreychaudhary/Documents/Samsung_Fold7_Sales_Prediction/data/raw/fold7_reddit.csv", index=False)
print("✅ Fold 7 data saved:", df_fold7.shape)


✅ Fold 7 data saved: (387, 8)
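
The four scrape cells above differ only in the model number, so they could be collapsed into a loop. The sketch below is one possible refactor, not the notebook's actual code: `scrape_and_save` is a hypothetical helper, the scraper is passed in as `scrape_fn` (for example `scrape_reddit`), and the output directory is an assumed argument rather than the absolute path used above.

```python
from pathlib import Path

def scrape_and_save(scrape_fn, versions, subreddits, out_dir, limit=200):
    """Run scrape_fn once per version and write one CSV per version into out_dir.

    scrape_fn must accept (query, label, subreddits, limit=...) and return an
    object with a pandas-style to_csv(path, index=False) method.
    Returns a dict mapping each version label to the path that was written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    written = {}
    for n in versions:
        label = f"Fold {n}"
        query = f"Fold {n} OR Galaxy Fold {n}"  # same query pattern as the cells above
        df = scrape_fn(query, label, subreddits, limit=limit)
        path = out_dir / f"fold{n}_reddit.csv"
        df.to_csv(path, index=False)
        written[label] = path
    return written
```

A call such as `scrape_and_save(scrape_reddit, [4, 5, 6, 7], subreddits, "data/raw")` would then produce the same four files, relative to the working directory instead of a user-specific absolute path.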


In [21]:
# Combine into a single DataFrame
df_all = pd.concat([df_fold4, df_fold5, df_fold6, df_fold7], ignore_index=True)
print(df_all.Version.value_counts())
df_all.head()

Version
Fold 4    519
Fold 5    493
Fold 6    485
Fold 7    387
Name: count, dtype: int64


| | Version | Subreddit | Title | Body | Score | Comments | Timestamp | URL |
|---|---|---|---|---|---|---|---|---|
| 0 | Fold 4 | samsung | How much battery have you lost over the year(s)? | My Galaxy Note 9 has 50% degradation haha. (It... | 17 | 31 | 1753390000.0 | https://www.reddit.com/r/samsung/comments/1m8f... |
| 1 | Fold 4 | samsung | I really wanna get samsung Z flip 7 | Okay so Ive been an Iphone user for more than ... | 74 | 66 | 1752759000.0 | https://www.reddit.com/r/samsung/comments/1m27... |
| 2 | Fold 4 | samsung | Samsung could dominate foldables today — they’... | Samsung has the best displays, multitasking UI... | 18 | 23 | 1752579000.0 | https://www.reddit.com/r/samsung/comments/1m0f... |
| 3 | Fold 4 | samsung | Improvements for the S26 Series and onward | As many people have noted, Samsung has been st... | 28 | 40 | 1749941000.0 | https://www.reddit.com/r/samsung/comments/1lbl... |
| 4 | Fold 4 | samsung | Switching phones | Hey folks! I’m currently using an iPhone 14 Pr... | 3 | 2 | 1747724000.0 | https://www.reddit.com/r/samsung/comments/1kqy... |
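
Several rows in this preview are labeled `Fold 4` even though their visible title and body text do not name that model, which suggests the search query matches loosely. A possible downstream check is sketched below. The helper `mentions_fold` and its regular expression are assumptions introduced here, not part of the original pipeline, and whether non-matching rows should be dropped is a judgment call.

```python
import re

def mentions_fold(text, number):
    """Return True if text names the given Fold model, e.g. 'Fold 4', 'fold4', 'Z Fold 4'.

    Word boundaries keep 'Fold 4' from matching 'Fold 40' or 'unfold 4'.
    """
    pattern = rf"\bfold\s*{int(number)}\b"
    return re.search(pattern, text or "", flags=re.IGNORECASE) is not None

print(mentions_fold("Upgrading from my Z Fold 4 next week", 4))
print(mentions_fold("I really wanna get samsung Z flip 7", 7))
```

On the combined frame this could be applied row by row, for instance `df_all.apply(lambda r: mentions_fold(f"{r.Title} {r.Body}", r.Version.split()[-1]), axis=1)`, to produce a boolean relevance mask.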
