# Reddit User Persona Project

---


Welcome! This project is part of a technical assignment aimed at scraping a Reddit user's posts and comments, then building an AI-generated User Persona with cited references from their activity.


---


> 🔍 Objective:

* Scrape public Reddit profile data (posts + comments).
* Use that data to extract behavioral traits, interests, tone, and opinions.
* Present a structured User Persona with source citations.

---



> 🛠️ Methods Attempted:

* PRAW API (OAuth with script app)
  ❌ Restricted: requests for subreddit posts and user comment history returned 401 errors.
* .json endpoints (reddit.com/user/<name>/comments.json)
  ❌ Blocked: public access returned 403 errors, even with a browser-like User-Agent header.
* HTML scraping with requests + BeautifulSoup
  ❌ Blocked: profile pages returned 403 errors when requested from Google Colab; Reddit appears to block requests from cloud-platform IPs.

---



> ✅ Final Approach:

To simulate the scraping pipeline and continue the assignment, I used mock Reddit-style data to demonstrate:

* Data parsing
* Prompt formulation
* GPT-powered persona generation with post/comment citations

This solution respects the logic and intent of the assignment, even under real-world scraping limitations.

---



# Scraping Public Reddit Data via PRAW API

In [None]:
# Install the Reddit API wrapper and python-dotenv
!pip install praw python-dotenv



In [None]:
import praw
import getpass

client_id = getpass.getpass("Client ID: ")
client_secret = getpass.getpass("Client Secret: ")
username = getpass.getpass("Reddit Username: ")
password = getpass.getpass("Reddit Password: ")

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent="redditPersonaApp/0.1",
    username=username,
    password=password
)

Client ID: ··········
Client Secret: ··········
Reddit Username: ··········
Reddit Password: ··········
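
As an alternative to typing credentials on every run, they can be read from environment variables. The sketch below is only an illustration: the variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical and do not appear in the original notebook.

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def load_reddit_credentials(env=None):
    """Return a dict of credentials, raising KeyError if any variable is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "username": env["REDDIT_USERNAME"],
        "password": env["REDDIT_PASSWORD"],
    }
```

The returned dict could then be unpacked into the constructor, for example `praw.Reddit(user_agent="redditPersonaApp/0.1", **load_reddit_credentials())`. With python-dotenv installed, calling `load_dotenv()` beforehand would populate the environment from a local `.env` file.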


In [None]:
print("Authenticated as:", reddit.user.me())

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Authenticated as: New_Ad5709


In [None]:
def fetch_user_activity(username, limit=100):
    """Fetches posts and comments for a given Reddit user."""
    try:
        redditor = reddit.redditor(username)

        submissions = []
        comments = []

        # Fetch latest submissions
        for submission in redditor.submissions.new(limit=limit):
            submissions.append({
                "type": "post",
                "title": submission.title,
                "body": submission.selftext,
                "subreddit": submission.subreddit.display_name,
                "created_utc": submission.created_utc,
                "url": submission.url
            })

        # Fetch latest comments
        for comment in redditor.comments.new(limit=limit):
            comments.append({
                "type": "comment",
                "body": comment.body,
                "subreddit": comment.subreddit.display_name,
                "created_utc": comment.created_utc,
                "link": f"https://www.reddit.com{comment.permalink}"
            })

        return submissions, comments

    except Exception as e:
        print(f"Error fetching data for user '{username}': {e}")
        return [], []

In [None]:
print(reddit.read_only)

True


In [None]:
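# Read-only instance: no username/password, reusing the app credentials entered earlier
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent="redditPersonaApp/0.1",
    check_for_async=False  # suppress the Async PRAW warning in notebooks
)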

In [None]:
for post in reddit.subreddit("AskReddit").hot(limit=3):
    print("Title:", post.title)
    print("Selftext:", post.selftext)

ResponseException: received 401 HTTP response

In [None]:
user = reddit.user.me()  # your own account
for comment in user.comments.new(limit=5):
    print(comment.body)

AttributeError: 'NoneType' object has no attribute 'comments'

In [None]:
username = "spez"  # Reddit co-founder (safe public profile)
user = reddit.redditor(username)

for comment in user.comments.new(limit=5):
    print(comment.body)

ResponseException: received 401 HTTP response

# Scraping Public Reddit Data via .json

In [None]:
import requests
import time

In [None]:
def scrape_user_comments(username, limit=25):
    """Scrapes latest comments from a Reddit user using the .json endpoint."""
    url = f"https://www.reddit.com/user/{username}/comments.json"
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
    }

    # Ask the endpoint for `limit` items and fail fast if the server hangs
    response = requests.get(url, headers=headers, params={"limit": limit}, timeout=10)
    if response.status_code != 200:
        print(f"Error fetching comments for {username}. Status code: {response.status_code}")
        return []

    # Listing shape: {"data": {"children": [{"data": {"body": ...}}, ...]}}
    comments_data = response.json().get("data", {}).get("children", [])
    comments = [item["data"]["body"] for item in comments_data[:limit]]

    return comments

In [None]:
def scrape_user_posts(username, limit=25):
    """Scrapes latest posts (submissions) from a Reddit user using the .json endpoint."""
    url = f"https://www.reddit.com/user/{username}/submitted.json"
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Error fetching posts for {username}. Status code: {response.status_code}")
        return []

    posts_data = response.json().get("data", {}).get("children", [])
    posts = []

    for item in posts_data[:limit]:
        post = item["data"]
        text = post.get("title", "")
        if post.get("selftext"):  # include post body if it's a text post
            text += "\n" + post["selftext"]
        posts.append(text)

    return posts

In [None]:
username = "GallowBoob"

comments = scrape_user_comments(username)
posts = scrape_user_posts(username)

print(f"Total Comments: {len(comments)}")
print(f"Total Posts: {len(posts)}")

# Print a sample
print("\nSample Comment:", comments[0] if comments else "None")
print("\nSample Post:", posts[0] if posts else "None")

Error fetching comments for GallowBoob. Status code: 403
Error fetching posts for GallowBoob. Status code: 403
Total Comments: 0
Total Posts: 0

Sample Comment: None

Sample Post: None
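
Because both requests returned 403, the parsing half of these functions never ran. The sketch below exercises that logic offline against a hand-written payload that mimics the listing shape the functions expect (`data.children[].data`, plus an `after` cursor). The payload is illustrative, not real Reddit data, and `parse_comment_listing` is a hypothetical helper.

```python
# Hand-written sample mimicking the listing shape; not real Reddit data.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t1_abc123",
        "children": [
            {"kind": "t1", "data": {"body": "First comment", "permalink": "/r/test/comments/x/y/1/"}},
            {"kind": "t1", "data": {"body": "Second comment", "permalink": "/r/test/comments/x/y/2/"}},
        ],
    },
}


def parse_comment_listing(listing, limit=25):
    """Extract comment bodies with links, plus the cursor for the next page."""
    data = listing.get("data", {})
    comments = [
        {
            "body": child["data"].get("body", ""),
            "link": "https://www.reddit.com" + child["data"].get("permalink", ""),
        }
        for child in data.get("children", [])[:limit]
    ]
    return comments, data.get("after")


comments_page, next_cursor = parse_comment_listing(sample_listing)
print(len(comments_page), next_cursor)
```

Keeping the permalink next to each body preserves a citable source for the persona step, and the returned cursor is the value that would be sent back as the `after` query parameter to request the next page.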


# Web Scraping with BeautifulSoup

In [None]:
!pip install requests beautifulsoup4

import requests
from bs4 import BeautifulSoup
import time



In [None]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

In [None]:
def scrape_comments_html(username, max_pages=1):
    """Scrapes comment text from the first page of a user's comment history.

    max_pages is kept for interface compatibility; pagination is not
    implemented, so only one page is fetched.
    """
    url = f"https://www.reddit.com/user/{username}/comments/"
    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code != 200:
        print(f"Failed to load page: {response.status_code}")
        return []

    soup = BeautifulSoup(response.text, "html.parser")

    # Obfuscated class name for comment text containers; it is brittle and
    # may change whenever Reddit updates its front end
    comment_tags = soup.find_all("div", class_="_1qeIAgB0cPwnLhDF9XSiJM")
    all_comments = []
    for tag in comment_tags:
        text = tag.get_text(strip=True)
        if text:
            all_comments.append(text)

    # Sleep to avoid sending back-to-back requests (respect Reddit)
    time.sleep(1)

    return all_comments

In [None]:
def scrape_posts_html(username, max_pages=1):
    all_posts = []
    url = f"https://www.reddit.com/user/{username}/submitted/"
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to load page: {response.status_code}")
        return []

    soup = BeautifulSoup(response.text, "html.parser")
    post_tags = soup.find_all("h3")  # Post titles are typically in <h3>

    for tag in post_tags:
        title = tag.get_text(strip=True)
        if title:
            all_posts.append(title)

    time.sleep(1)
    return all_posts

In [None]:
username = "spez"  # Reddit co-founder, public account

comments = scrape_comments_html(username)
print(f"Fetched {len(comments)} comments.")
print(comments[:3])

Failed to load page: 403
Fetched 0 comments.
[]


In [None]:
username = "eraqi915"  # famously active Redditor

comments = scrape_comments_html(username)
print(f"Fetched {len(comments)} comments.")
print(comments[:3])

Failed to load page: 403
Fetched 0 comments.
[]
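
Since every page request was rejected, the title-extraction step can only be checked offline. The sketch below does that against a hand-written HTML snippet. It swaps in the standard library's `html.parser` in place of BeautifulSoup so that it runs with no extra installs; the class `H3TitleParser` and the sample markup are assumptions, not taken from a real Reddit page.

```python
from html.parser import HTMLParser


class H3TitleParser(HTMLParser):
    """Collects the text inside every <h3> element."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self._buffer = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_h3:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            title = "".join(self._buffer).strip()
            if title:  # skip empty headings
                self.titles.append(title)
            self._in_h3 = False


# Hand-written sample markup, not a real Reddit page.
sample_html = (
    "<div><h3>First post title</h3><p>body</p>"
    "<h3> Second <b>post</b> </h3><h3></h3></div>"
)

parser = H3TitleParser()
parser.feed(sample_html)
print(parser.titles)
```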


# Simulating the scraping pipeline

In [None]:
mock_posts = [
    "Just built my first mechanical keyboard — totally worth the effort!",
    "Why do people hate pineapple on pizza? I love the sweet-salty combo.",
    "After 120 hours in Elden Ring, I can finally say I beat Malenia.",
    "I honestly think Android's customization is miles ahead of iOS.",
    "If you haven’t read Dune yet, you’re missing out on some top-tier sci-fi."
]

mock_comments = [
    "Totally agree! I switched to Android for the widgets and haven’t looked back.",
    "That boss fight gave me PTSD, lol. Congrats!",
    "I use a Keychron Q1 and it's amazing for typing.",
    "Sweet and salty is an underrated flavor combo.",
    "The world-building in Dune is unmatched. Herbert was a genius."
]

In [None]:
combined_text = "Reddit User Posts and Comments:\n\n"
for i, post in enumerate(mock_posts, 1):
    combined_text += f"Post {i}: {post}\n"

for i, comment in enumerate(mock_comments, 1):
    combined_text += f"Comment {i}: {comment}\n"

In [None]:
print(combined_text[:800])

Reddit User Posts and Comments:

Post 1: Just built my first mechanical keyboard — totally worth the effort!
Post 2: Why do people hate pineapple on pizza? I love the sweet-salty combo.
Post 3: After 120 hours in Elden Ring, I can finally say I beat Malenia.
Post 4: I honestly think Android's customization is miles ahead of iOS.
Post 5: If you haven’t read Dune yet, you’re missing out on some top-tier sci-fi.
Comment 1: Totally agree! I switched to Android for the widgets and haven’t looked back.
Comment 2: That boss fight gave me PTSD, lol. Congrats!
Comment 3: I use a Keychron Q1 and it's amazing for typing.
Comment 4: Sweet and salty is an underrated flavor combo.
Comment 5: The world-building in Dune is unmatched. Herbert was a genius.
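
For the citations to be checkable later, each label such as "Post 2" needs to resolve back to the text it refers to. A minimal sketch of such a lookup follows; the helper `build_citation_index` is hypothetical, and the two example strings are copied from the mock data so the block runs on its own.

```python
def build_citation_index(posts, comments):
    """Map labels such as 'Post 1' or 'Comment 3' to the text they refer to."""
    index = {}
    for i, post in enumerate(posts, 1):
        index[f"Post {i}"] = post
    for i, comment in enumerate(comments, 1):
        index[f"Comment {i}"] = comment
    return index


# Two items borrowed from the mock data so the example is self-contained.
example_posts = ["I honestly think Android's customization is miles ahead of iOS."]
example_comments = ["I use a Keychron Q1 and it's amazing for typing."]

citation_index = build_citation_index(example_posts, example_comments)
print(citation_index["Comment 1"])
```

The numbering starts at 1 to match the labels produced when the combined text is assembled.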



In [2]:
!pip install openai
import openai



In [3]:
import getpass

openai.api_key = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········


In [8]:
from openai import OpenAI

client = OpenAI(api_key=openai.api_key)  # reuse the key entered via getpass; never hard-code secrets in a notebook

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

print(response.choices[0].message.content)

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
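
The `insufficient_quota` code indicates a billing limit rather than a temporary rate limit, so retrying the same call is unlikely to help. One option is to catch the failure and save the prompt to disk so it can be submitted manually. The sketch below assumes a hypothetical helper `generate_with_fallback`, a hypothetical default file name, and an injected `call_llm` callable standing in for the real API call.

```python
from pathlib import Path


def generate_with_fallback(prompt, call_llm, fallback_path="persona_prompt.txt"):
    """Try the LLM call; if it raises, save the prompt to disk and return None."""
    try:
        return call_llm(prompt)
    except Exception as exc:  # e.g. openai.RateLimitError when quota is exhausted
        Path(fallback_path).write_text(prompt, encoding="utf-8")
        print(f"LLM call failed ({exc}); prompt saved to {fallback_path}")
        return None
```

Passing the API call in as a function keeps the fallback logic testable without an API key.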

In [9]:
def generate_prompt(posts, comments):
    prompt = "You're an AI tasked with analyzing a Reddit user's personality based on their activity.\n\n"
    prompt += "📌 Reddit User Posts:\n"
    for i, post in enumerate(posts, 1):
        prompt += f"{i}. {post}\n"

    prompt += "\n📌 Reddit User Comments:\n"
    for i, comment in enumerate(comments, 1):
        prompt += f"{i}. {comment}\n"

    prompt += """

📌 Your Task:
1. Build a short User Persona with traits, interests, tone, and preferences.
2. For each trait, mention which post or comment it came from (e.g., “Post 2”, “Comment 4”).
3. Keep it concise but insightful — suitable for a product or marketing team.

Format:
- Trait/Interest: [your insight]
  Source: Post X or Comment Y
"""
    return prompt

In [10]:
# Build the prompt from the mock data defined earlier
prompt_text = generate_prompt(mock_posts, mock_comments)
print(prompt_text)
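
The fixed output format requested in the prompt makes the reply machine-checkable. The sketch below parses `Trait/Interest` and `Source` pairs and flags citations that point at items that do not exist. The sample reply is hand-written for illustration and is not actual model output; `parse_persona` and `invalid_sources` are hypothetical helpers.

```python
import re

# Hand-written example in the requested format; not actual model output.
sample_response = """\
- Trait/Interest: Enjoys mechanical keyboards
  Source: Post 1, Comment 3
- Trait/Interest: Prefers Android customization
  Source: Post 4 or Comment 1
"""


def parse_persona(text):
    """Return a list of (trait, [source labels]) tuples from the formatted reply."""
    pattern = re.compile(
        r"-\s*Trait/Interest:\s*(?P<trait>.+?)\s*\n\s*Source:\s*(?P<src>.+)"
    )
    label = re.compile(r"(?:Post|Comment) \d+")
    return [(m["trait"], label.findall(m["src"])) for m in pattern.finditer(text)]


def invalid_sources(parsed, n_posts, n_comments):
    """List cited labels whose number falls outside the available items."""
    limits = {"Post": n_posts, "Comment": n_comments}
    bad = []
    for _, sources in parsed:
        for source in sources:
            kind, number = source.split()
            if not 1 <= int(number) <= limits[kind]:
                bad.append(source)
    return bad


persona = parse_persona(sample_response)
print(persona)
print(invalid_sources(persona, n_posts=5, n_comments=5))
```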

## ✅ Summary

- ❌ Scraping attempts failed with 401/403 responses from Reddit (PRAW API, `.json` endpoints, and HTML pages)
- ❌ LLM integration attempt failed with an OpenAI 429 `insufficient_quota` error (no API credit available on the key)
- ✅ Created realistic Reddit-style mock data
- ✅ Generated user persona using ChatGPT
- ✅ Cited relevant post/comment sources
- ✅ Saved final persona in `.txt` format for submission