# Collect and Aggregate Reddit Posts: Autism & Healthcare Experiences

This notebook will:
1. **Collect** Reddit posts/comments about autism healthcare using PRAW.
2. **Deduplicate** and **clean** the data.
3. **Aggregate** comments under their respective posts.
4. **Save** the final aggregated JSON file.

### Key Libraries:
- [PRAW](https://praw.readthedocs.io/en/latest/) for Reddit API
- [tqdm](https://tqdm.github.io/) for progress bars
- [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) for parallelised data fetching

In [39]:
# ------------------------------------------------------------
# Imports
# ------------------------------------------------------------
import praw
import prawcore
import json
from datetime import datetime, timezone
import time
import random
from tqdm import tqdm
import concurrent.futures
from collections import defaultdict, Counter
import os
import matplotlib.pyplot as plt
import statistics
import numpy as np
import pandas as pd
import seaborn as sns

print("Libraries imported successfully.")

Libraries imported successfully.


## Reddit API Credentials

Replace the placeholders below with your own values:
- `client_id`
- `client_secret`
- `user_agent`
- `refresh_token`
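To keep secrets out of the notebook itself, the placeholders could be read from environment variables instead of being typed into a cell. The following is a minimal sketch under that assumption; the function name `load_reddit_credentials` and the `REDDIT_*` variable names are hypothetical and not part of the original notebook.

```python
import os

def load_reddit_credentials(env=os.environ):
    """Read the four PRAW settings from environment variables.

    The REDDIT_* variable names are assumptions made for this sketch.
    """
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
        "refresh_token": "REDDIT_REFRESH_TOKEN",
    }
    # Fail early with a clear message if anything is unset or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary uses the same keyword names as the `praw.Reddit(...)` call in this notebook, so it could be passed as `praw.Reddit(**load_reddit_credentials())`.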

In [2]:
client_id = 'YOUR_CLIENT_ID'
client_secret = 'YOUR_CLIENT_SECRET'
user_agent = 'YOUR_USER_AGENT'
refresh_token = 'YOUR_REFRESH_TOKEN'

# Initialise PRAW using the refresh token
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent,
    refresh_token=refresh_token
)

print("Reddit instance created successfully")

Reddit instance created successfully


## Subreddit Configuration & Keywords

This notebook includes:
- A dictionary of subreddits (`subreddits_info`) and their subscriber counts.
- A keyword list capturing diverse aspects of autism healthcare.
- Reasonably low thresholds for submission/comment scores to broaden the data collection.
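To get a rough feel for how many searches a configuration like this implies, the number of (subreddit, keyword, sort order) combinations can be counted before running anything. The sketch below uses shortened, hypothetical lists purely for illustration; `count_search_queries` is not a function from the notebook.

```python
from itertools import product

# Hypothetical, shortened inputs for illustration only
example_subreddits = ["autism", "aspergers", "autismUK"]
example_keywords = ["healthcare", "diagnosis"]
example_sort_orders = ["relevance", "new", "top", "comments"]

def count_search_queries(subreddits, keywords, sort_orders):
    """Number of (subreddit, keyword, sort order) searches a full sweep would issue."""
    return sum(1 for _ in product(subreddits, keywords, sort_orders))

n_queries = count_search_queries(example_subreddits, example_keywords, example_sort_orders)
print(n_queries)  # 3 * 2 * 4 = 24
```

Each matching submission also triggers a comment fetch, so the real request count is higher than this figure; the count is only a lower bound on API traffic.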

In [36]:
# Subreddit subscriber counts & years they went active
subreddits_info = {
    'autism':           {'subs': 324486, 'year_active': 2019},
    'AutismInWomen':    {'subs': 113095, 'year_active': 2022},
    'aspergers':        {'subs': 148902, 'year_active': 2019},
    'AutisticAdults':   {'subs': 51258,  'year_active': 2022},
    'aspiememes':       {'subs': 202954, 'year_active': 2019},
    'Autism_Parenting': {'subs': 29028,  'year_active': 2022},
    'aspergirls':       {'subs': 80347,  'year_active': 2021},
    'AskAutism':        {'subs': 5949,   'year_active': 2021},
    'AutisticWithADHD': {'subs': 32842,  'year_active': 2022},
    'neurodiversity':   {'subs': 67833,  'year_active': 2021},
    'autismUK':         {'subs': 3926,   'year_active': 2019},
}

# Healthcare and autism-related search keywords
keywords = [
    "healthcare", "treatment", "therapy",
    "access to care", "insurance", "coverage", "waiting list",
    "diagnosis", "misdiagnosis", "quality of service", "bad experience",
    "good experience", "stigma", "barriers", "disparities", "lack of resources",
    "ABA", "occupational therapy", "speech therapy", "behavioural therapy",
    "mental health", "psychiatrist", "psychologist", "comorbid conditions",
    "transition planning", "IEP", "emotional support", "inclusion", "burnout",
    "paediatrician", "dentist", "mental health services", "appointment",
    # Comma added after "sensory overload"; without it Python silently joins
    # the string with "hospital" into a single keyword
    "prescription", "medication", "electroshock", "waiting room", "sensory overload",
    "hospital", "NHS", "health inequality", "ADHD", "masking",
    "CBT", "cognitive behavioural therapy", "vaccine", "vaccines cause autism"
]

# Search parameters
time_filter = 'all'
sort_orders = ['relevance', 'new', 'top', 'comments']
results_per_query = 100
target_total_posts = 10000

# Score thresholds
MIN_SUBMISSION_SCORE = 10
MIN_COMMENT_SCORE = 2

print("Subreddit info, keywords, and thresholds set.")

Subreddit info, keywords, and thresholds set.


## Helper Functions

These functions handle Reddit data collection and concurrency with minimal debug output:

- **`get_max_results_per_subreddit(subs)`**  
  - Determines the maximum number of posts to collect from a subreddit based on its subscriber count.
    

- **`is_relevant(submission, keyword)`**  
  - Checks whether a submission references the given keyword in both the title and selftext (excluding “the good doctor”) and meets a minimum length requirement.
    

- **`search_subreddit(subreddit_name, keywords, results_per_query, max_results, pbar)`**  
  - Searches a single subreddit across multiple keywords and sort orders.
  - Updates a **TQDM progress bar** (`pbar`) each time a new post is collected.
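  - Retries a query up to 5 times when PRAW raises a `RequestException` (a network-level failure); any other error, including the HTTP 429 responses seen in the run log, is printed and the query is skipped.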
  - Sleeps briefly between search attempts to avoid spamming the API.
    

- **`search_reddit(subreddits_info, keywords, results_per_query, target_total_posts)`**  
  - Uses `ThreadPoolExecutor` for **concurrent** searching of multiple subreddits.
  - Displays a **plain-text** TQDM progress bar (no Jupyter widgets) that increments **once per new post**.
  - Stops gathering results once `target_total_posts` is reached; any excess is trimmed afterwards by random sampling in the data collection cell.
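The retry behaviour described above can be illustrated in isolation. The sketch below is not the notebook's implementation: it swaps the fixed sleep for exponential backoff with jitter, and `call_with_retries` and its parameters are hypothetical names. The broad `except Exception` is only there to keep the example self-contained; real use would narrow it to the API's error types.

```python
import random
import time

def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry.

    Re-raises the last error once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:  # assumption: narrow this in real code
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"[RETRY] attempt {attempt + 1} failed ({exc}); sleeping {delay:.1f}s")
            sleep(delay)
```

Passing `sleep` as a parameter makes the helper easy to exercise without real waiting, for example by handing in `list.append` to record the delays.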


In [7]:
def get_max_results_per_subreddit(subs):
    """
    Determine the maximum number of results to fetch based on subreddit subscriber count.
    Args:
        subs (int): Number of subscribers in the subreddit
    Returns:
        int: Maximum number of results to fetch (1000-3000 based on subscriber count)
    """
    if subs > 200000:
        return 3000
    elif subs > 100000:
        return 2000
    elif subs > 50000:
        return 1500
    else:
        return 1000

def is_relevant(submission, keyword):
    """
    Filter submissions based on content relevance and quality criteria.
    Args:
        submission: PRAW submission object containing post data
        keyword (str): Search term to match in content
    Returns:
        bool: True if submission meets all relevance criteria:
            - Contains keyword in both title and body
            - Meets minimum length requirements
            - Not related to the programme "The Good Doctor"
    """
    title_lower = submission.title.lower()
    body_lower = submission.selftext.lower()

    # "good doctor" also matches "the good doctor", so one check per field suffices
    if "good doctor" in title_lower or "good doctor" in body_lower:
        return False

    title_contains_kw = (keyword.lower() in title_lower)
    body_contains_kw  = (keyword.lower() in body_lower)
    sufficient_length = (len(submission.title) > 10 and len(submission.selftext) > 50)

    return (title_contains_kw and body_contains_kw and sufficient_length)

def search_subreddit(subreddit_name, keywords, results_per_query, max_results, pbar):
    """
    Collect and filter relevant posts and comments from a specific subreddit.
    Args:
        subreddit_name (str): Name of the subreddit to search
        keywords (list): List of keywords to search for
        results_per_query (int): Number of results to request per API call
        max_results (int): Maximum total results to collect for this subreddit
        pbar (tqdm): Progress bar object for tracking collection progress
    Returns:
        list: Collection of dictionaries containing filtered post/comment data
    Note:
        - Utilises multiple sort orders to maximise result diversity
        - Implements rate limiting and error handling
        - Updates progress bar for each collected post
    """
    posts = []
    subreddit = reddit.subreddit(subreddit_name)

    for keyword in keywords:
        for sort_order in sort_orders:
            attempts = 0
            while attempts < 5:  # Try up to 5 times for each query
                try:
                    # Fetch submissions matching keyword with specified parameters
                    submissions = subreddit.search(
                        keyword,
                        limit=results_per_query,
                        time_filter=time_filter,
                        sort=sort_order
                    )
                    # Process each submission
                    for submission in submissions:
                        # Filter by score and relevance
                        if (submission.score > MIN_SUBMISSION_SCORE 
                                and is_relevant(submission, keyword)):

                            # Convert submission timestamp to datetime
                            post_date = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
                            
                            # Drop unexpanded "MoreComments" placeholders; .list() below
                            # then flattens every loaded comment, including nested replies
                            submission.comments.replace_more(limit=0)

                            # Process each comment in the submission
                            for comment in submission.comments.list():
                                # Filter comments by keyword, score, and length
                                if (keyword.lower().strip('"') in comment.body.lower()
                                    and comment.score > MIN_COMMENT_SCORE
                                    and len(comment.body) > 30):
                                    posts.append({
                                        'subreddit': subreddit_name,
                                        'keyword': keyword,
                                        'title': submission.title,
                                        'selftext': submission.selftext,
                                        'comment': comment.body,
                                        'permalink': comment.permalink,
                                        'submission_score': submission.score,
                                        'comment_score': comment.score,
                                        'created_utc': post_date.strftime('%Y-%m-%d %H:%M:%S')
                                    })
                                    # Update the progress bar for each new post
                                    pbar.update(1)

                                    # If hit max for this subreddit, break
                                    if len(posts) >= max_results:
                                        break

                            if len(posts) >= max_results:
                                break

                    # A small sleep to reduce rate-limit issues
                    time.sleep(1)
                    break  # break from attempts on success

                except prawcore.exceptions.RequestException as e:
                    attempts += 1
                    print(f"[ERROR] Request failed: {e}")
                    # RequestException does not define `response`, so read headers defensively
                    headers = getattr(getattr(e, 'response', None), 'headers', None) or {}
                    if 'X-Ratelimit-Reset' in headers:
                        wait_time = int(headers['X-Ratelimit-Reset'])
                        print(f"[RATE LIMIT] Retrying in {wait_time}s...")
                        time.sleep(wait_time)
                    else:
                        fallback_wait = 5
                        print(f"[RATE LIMIT] No header. Sleeping {fallback_wait}s...")
                        time.sleep(fallback_wait)

                except Exception as e:
                    print(f"[ERROR] Unexpected: {e}")
                    break

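            if len(posts) >= max_results:
                break
        # Also stop iterating over keywords once the cap is reached;
        # otherwise each remaining keyword keeps issuing searches
        if len(posts) >= max_results:
            break
    return posts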

def search_reddit(subreddits_info, keywords, results_per_query, target_total_posts):
    """
    Coordinate parallel searches across multiple subreddits with progress tracking.
    Args:
        subreddits_info (dict): Dictionary of subreddit information including subscriber counts
        keywords (list): List of keywords to search for
        results_per_query (int): Number of results to request per API call
        target_total_posts (int): Total number of posts to collect across all subreddits
    Returns:
        list: Combined collection of post/comment data from all searched subreddits
    Note:
        - Utilises ThreadPoolExecutor for parallel processing
        - Limits concurrent threads to 10 to prevent API rate limits
        - Shows progress with TQDM progress bar
    """
    all_posts = []
    # ascii=True -> plain-text progress bar
    pbar = tqdm(total=target_total_posts, desc="Collecting posts", ascii=True)

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        future_to_subreddit = {
            executor.submit(
                search_subreddit,
                name,
                keywords,
                results_per_query,
                get_max_results_per_subreddit(info['subs']),
                pbar
            ): name
            for name, info in subreddits_info.items()
        }
        for future in concurrent.futures.as_completed(future_to_subreddit):
            subreddit_name = future_to_subreddit[future]
            try:
                # Extend all_posts with what we got from that subreddit
                subreddit_posts = future.result()
                all_posts.extend(subreddit_posts)

                # If reached or exceeded the target, stop
                if len(all_posts) >= target_total_posts:
                    break
            except Exception as e:
                print(f"[ERROR] Problem with {subreddit_name}: {e}")

    pbar.close()
    return all_posts

print("Helper functions defined.")

Helper functions defined.


## Data Collection, Deduplication, & Cleaning

1. Collect posts from all subreddits.
2. If total exceeds the target, randomly sample.
3. Save raw data, then remove duplicates using `(title, comment)` pairs.
4. Save the deduplicated JSON.
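Because sampling happens before deduplication in the steps above, the final dataset can end up well below the target (the logged run went from 10000 sampled records to 5906 unique ones). The sketch below shows the reverse order on toy data, as one possible alternative rather than the notebook's method; `dedupe_then_sample` and the toy records are hypothetical.

```python
import random

def dedupe_then_sample(records, target, seed=None):
    """Drop repeated (title, comment) pairs, then sample down to `target` if needed."""
    seen = set()
    unique = []
    for record in records:
        key = (record["title"], record["comment"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    # Only sample when there are more unique records than requested
    if len(unique) > target:
        unique = random.Random(seed).sample(unique, target)
    return unique

toy = [
    {"title": "Waiting list", "comment": "Two years for me."},
    {"title": "Waiting list", "comment": "Two years for me."},  # same pair, found via another keyword
    {"title": "Waiting list", "comment": "Private was faster."},
]
print(len(dedupe_then_sample(toy, target=10)))  # 2
```

Duplicates arise naturally here because the same comment can match several keywords and sort orders, so deduplicating first keeps the sample size closer to the intended target.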

In [6]:
# Data processing pipeline
base_folder = r"C:\Users\laure\Desktop\dissertation_notebook"
data_folder = os.path.join(base_folder, "Data")
os.makedirs(data_folder, exist_ok=True)

# File paths
raw_json_path = os.path.join(data_folder, "reddit_raw_data.json")
no_dupes_json_path = raw_json_path
aggregated_path = os.path.join(data_folder, "aggregated_raw_reddit_data.json")

# 1) Collect data
posts = search_reddit(subreddits_info, keywords, results_per_query, target_total_posts)

# 2) If exceeded target, randomly sample
if len(posts) > target_total_posts:
    posts = random.sample(posts, target_total_posts)

# 3) Save raw data
with open(raw_json_path, 'w', encoding='utf-8') as f:
    json.dump(posts, f, ensure_ascii=False, indent=4)
print(f"Raw data saved at: {raw_json_path}")

# 4) Remove duplicates
unique_posts = []
seen_pairs = set()
for post in posts:
    pair = (post['title'], post['comment'])
    if pair not in seen_pairs:
        unique_posts.append(post)
        seen_pairs.add(pair)

print(f"Collected {len(posts)} total records (pre-dedup).")
print(f"Unique submission-comment combos: {len(unique_posts)}")

# 5) Save no dupes data
with open(no_dupes_json_path, 'w', encoding='utf-8') as f:
    json.dump(unique_posts, f, ensure_ascii=False, indent=4)
print(f"Cleaned data saved at: {no_dupes_json_path}")

Collecting posts:  32%|###################1                                       | 3247/10000 [06:09<16:06,  6.99it/s]

[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response[ERROR] Unexpected: received 429 HTTP response



Collecting posts:  33%|###################1                                       | 3253/10000 [06:20<16:05,  6.99it/s]

[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexp

Collecting posts:  97%|#########################################################  | 9665/10000 [16:14<00:47,  7.02it/s]

[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response[ERROR] Unexpected: received 429 HTTP response

[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexp

Collecting posts: 12904it [26:16,  6.19it/s]                                                                           

[ERROR] Unexpected: received 429 HTTP response


Collecting posts: 12906it [26:18,  3.90it/s]

[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response
[ERROR] Unexpected: received 429 HTTP response


Collecting posts: 14512it [33:59,  7.11it/s]


Raw data saved at: C:\Users\laure\Desktop\dissertation_notebook\Data\reddit_raw_data.json
Collected 10000 total records (pre-dedup).
Unique submission-comment combos: 5906
Cleaned data saved at: C:\Users\laure\Desktop\dissertation_notebook\Data\reddit_raw_data.json


## ID Assignment Process

This step creates unique identifiers for Reddit posts and comments:
1. **Post IDs**: Format `P_N`, where N is an incrementing counter
2. **Comment IDs**: Format `C_P_N_X`, where:
    - P_N matches the parent post ID
    - X is a truncated hash of the comment text

This enables:
- Tracking unique posts via titles
- Linking comments to parent posts
- Consistent reference system for analysis
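One caveat for the comment ID scheme: Python's built-in `hash()` is salted per interpreter process for strings (unless `PYTHONHASHSEED` is fixed), so IDs derived from it differ between runs. A minimal sketch of a run-independent variant, assuming a `hashlib` digest is an acceptable substitute, is shown below; `make_comment_id` and `digest_chars` are hypothetical names.

```python
import hashlib

def make_comment_id(post_id, comment_text, digest_chars=8):
    """Build C_<post_id>_<hash> from a SHA-1 digest, which is identical on every run."""
    digest = hashlib.sha1(comment_text.encode("utf-8")).hexdigest()[:digest_chars]
    return f"C_{post_id}_{digest}"

print(make_comment_id("P_1", "Example comment text"))
```

A longer hexadecimal suffix also makes collisions between two comments under the same post less likely than a four-digit remainder would.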

In [None]:
# Load the existing raw data
with open(raw_json_path, 'r', encoding='utf-8') as f:
    posts = json.load(f)

# Add IDs to posts and comments
post_ids = {}  # Keep track of post IDs assigned

# Process each post/comment pair
for post in posts:
    title = post['title']  # Use title as key to identify unique posts
    
    # generate a new post ID
    if title not in post_ids:
        post_ids[title] = f"P_{len(post_ids) + 1}"
    
    # Add IDs to the post
    post['post_id'] = post_ids[title]
    post['comment_id'] = f"C_{post['post_id']}_{hash(post['comment']) % 10000}"  # Add comment ID

# Save the updated data back
with open(raw_json_path, 'w', encoding='utf-8') as f:
    json.dump(posts, f, ensure_ascii=False, indent=4)

print(f"Added IDs to {len(posts)} posts and their comments")
print(f"Number of unique posts: {len(post_ids)}")

## Aggregation Step

Data is aggregated so each post has a list of all associated comments. The result is saved in a new JSON file.
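The idea of the aggregation can be shown on a few toy records. This is a simplified sketch, not the notebook's cell: it groups on `post_id` (an assumption made here for clarity), keeps only a handful of fields, and uses the hypothetical name `aggregate_by_post`.

```python
def aggregate_by_post(records):
    """Group flat post/comment records into one entry per post_id."""
    grouped = {}
    for record in records:
        # Create the post entry the first time this post_id is seen
        entry = grouped.setdefault(record["post_id"], {
            "post_id": record["post_id"],
            "title": record["title"],
            "comments": [],
        })
        entry["comments"].append({
            "comment_id": record["comment_id"],
            "comment": record["comment"],
        })
    return list(grouped.values())

toy_records = [
    {"post_id": "P_1", "title": "GP appointment", "comment_id": "C_P_1_a", "comment": "First"},
    {"post_id": "P_1", "title": "GP appointment", "comment_id": "C_P_1_b", "comment": "Second"},
    {"post_id": "P_2", "title": "Dentist visit", "comment_id": "C_P_2_a", "comment": "Third"},
]
result = aggregate_by_post(toy_records)
print([(p["post_id"], len(p["comments"])) for p in result])  # [('P_1', 2), ('P_2', 1)]
```

Each input row is one post/comment pair, so post-level fields repeat across rows; the grouping stores them once and collects the comments in a list.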

In [None]:
# 1) Define base folder and locate the "Data" subfolder (created in an earlier cell)
base_folder = r"C:\Users\laure\Desktop\dissertation_notebook"
data_folder = os.path.join(base_folder, "Data")

# 2) Load the cleaned data
with open(no_dupes_json_path, 'r', encoding='utf-8') as file:
    cleaned_data = json.load(file)

aggregated_data = defaultdict(lambda: {
    "post_id": "",
    "subreddit": "",
    "keyword": "",
    "title": "",
    "selftext": "",
    "submission_score": 0,
    "comments": [],
    "permalink": "",
    "created_utc": ""
})

# 3) Aggregate cleaned data
for post in cleaned_data:
    utc_time = post['created_utc']
    # Only set post details once
    if not aggregated_data[utc_time]["title"]:
        aggregated_data[utc_time].update({
            "post_id": post['post_id'],  #Include post ID
            "subreddit": post['subreddit'],
            "keyword": post['keyword'],
            "title": post['title'],
            "selftext": post['selftext'],
            "submission_score": post['submission_score'],
            "permalink": post['permalink'],
            "created_utc": post['created_utc']
        })
    # Append the comment with its ID
    aggregated_data[utc_time]["comments"].append({
        "comment_id": post["comment_id"],  #Include comment ID
        "comment": post["comment"],
        "comment_score": post["comment_score"],
        "comment_permalink": post["permalink"]
    })

# 4) Convert dict to a list
final_result = list(aggregated_data.values())

# 5) Save the aggregated data to a file in the Data folder
with open(aggregated_path, 'w', encoding='utf-8') as output_file:
    json.dump(final_result, output_file, ensure_ascii=False, indent=4)

print(f"Aggregation complete. Output saved to {aggregated_path}")

Aggregation complete. Output saved to C:\Users\laure\Desktop\dissertation_notebook\Data\aggregated_raw_reddit_data.json


## Conclusion

This notebook successfully:
1. **Collected** Reddit data from 11 autism-related subreddits using the PRAW API
2. **Filtered** and **cleaned** posts/comments based on healthcare keywords and quality metrics
3. **Removed duplicates** and **aggregated** comments under their respective posts
4. **Saved** processed data in JSON format for further analysis

The resulting dataset captures autism healthcare experiences as discussed on Reddit and is ready for further cleaning before topic modelling.

## References

**Reference:**  
PRAW Development Team (2024) *PRAW: Python Reddit API Wrapper v7.8.1* [computer program].  
Available from: [https://praw.readthedocs.io/en/stable/](https://praw.readthedocs.io/en/stable/) [Accessed 6 May 2024].

**Git Repo:**  
- [PRAW GitHub](https://github.com/praw-dev/praw)

**Reference:**  
tqdm contributors (2024) *tqdm: A Fast, Extensible Progress Bar for Python and CLI v4.67.1* [computer program].  
Available from: [https://tqdm.github.io/](https://tqdm.github.io/) [Accessed 8 May 2024].

**Git Repo:**  
- [tqdm GitHub](https://github.com/tqdm/tqdm)

**Reference:**  
Hunter, J. D. (2007) *Matplotlib: A 2D Graphics Environment* [computer program]. *Computing in Science & Engineering*, 9(3), pp. 90–95.  
Available from: [https://matplotlib.org/](https://matplotlib.org/) [Accessed 6 May 2024].

**Git Repo:**  
- [Matplotlib GitHub](https://github.com/matplotlib/matplotlib)

**Reference:**  
Harris, C. R., Millman, K. J., van der Walt, S. J. et al. (2020) *Array programming with NumPy* [computer program]. *Nature*, 585, pp. 357–362.  
Available from: [https://numpy.org/](https://numpy.org/) [Accessed 6 May 2024].

**Git Repo:**  
- [NumPy GitHub](https://github.com/numpy/numpy)

**Reference:**  
Pandas Development Team (2024) *pandas: Powerful data structures for data analysis v2.2.3* [computer program].  
Available from: [https://pandas.pydata.org/](https://pandas.pydata.org/) [Accessed 11 May 2024].

**Git Repo:**  
- [Pandas GitHub](https://github.com/pandas-dev/pandas)

**Reference:**  
Waskom, M. (2024) *seaborn: Statistical data visualization v0.13.2* [computer program].  
Available from: [https://joss.theoj.org/papers/10.21105/joss.03021](https://joss.theoj.org/papers/10.21105/joss.03021) [Accessed 11 May 2024].

**Git Repo:**  
- [Seaborn GitHub](https://github.com/mwaskom/seaborn)