# Data Collection

## Articles

- [Oklahoma schools rank 50th in the nation in latest education quality study](https://www.oklahoman.com/story/news/education/2025/07/24/oklahoma-schools-ranked-nearly-the-worst-in-the-nation-in-new-study/85310196007/)
- [Public School Rankings by State 2025](https://worldpopulationreview.com/state-rankings/public-school-rankings-by-state)
- [Worst School Districts by State 2025](https://worldpopulationreview.com/state-rankings/worst-school-districts-by-state)
- [2025 Best School Districts in Massachusetts](https://www.niche.com/k12/search/best-school-districts/s/massachusetts/)
- [2025 Best School Districts in California](https://www.niche.com/k12/search/best-school-districts/s/california/)

### Palo Alto, CA

- [Henry M. Gunn High School](https://gunn.pausd.org/)
  - [Website](https://gunn.pausd.org/)
  - [Niche.com](https://www.niche.com/k12/henry-m-gunn-high-school-palo-alto-ca/)
  - [Yelp.com](https://www.yelp.com/biz/henry-m-gunn-high-school-palo-alto-2?osq=henry+m+gunn+high+school)
- [Palo Alto High School](https://www.paly.net/)
  - [Website](https://www.paly.net/)
  - [Niche.com](https://www.niche.com/k12/palo-alto-high-school-palo-alto-ca/)
  - [Yelp.com](https://www.yelp.com/biz/palo-alto-high-school-palo-alto?osq=palo+alto+high+school)
- [Hope Technology School - Private](https://hopetechschool.org/)
  - [Website](https://hopetechschool.org/)
  - [Yelp.com](https://www.yelp.com/biz/hope-technology-school-palo-alto-6)
- [Palo Alto Middle College High School](https://mc.pausd.org/)
  - [Website](https://mc.pausd.org/)
  - [Yelp.com](https://www.yelp.com/biz/palo-alto-middle-college-high-school-los-altos-hills?osq=middle+college+high+school)

### Oklahoma City, OK

- [Boulevard Academy](https://boulevardacademy.edmondschools.net/o/boulevardacademy)
  - [Website](https://boulevardacademy.edmondschools.net/o/boulevardacademy)
- [Memorial High School](https://memorial.edmondschools.net/o/memorial)
  - [Website](https://memorial.edmondschools.net/o/memorial)
  - [Niche.com](https://www.niche.com/k12/memorial-high-school-edmond-ok/)
  - [Yelp.com](https://www.yelp.com/biz/memorial-high-school-tulsa?osq=memorial+high+school)
- [North High School](https://north.edmondschools.net/o/north)
- [Santa Fe High School](https://santafe.edmondschools.net/o/santafe)


## Import libraries


In [20]:
import pandas as pd
import praw
import re
import sys
import time

from datetime import datetime
from pathlib import Path

In [21]:
# set dataset folder location
dataset_folder = Path("../datasets")

## Connect to Reddit API

**Note:** PRAW reads the default credentials from a `praw.ini` file, which must
exist before the Reddit API can be used. Create it by copying
`praw.ini.template` and filling in your own credentials.
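
Before connecting, it can help to confirm that `praw.ini` contains the section and keys PRAW will look up. The following is a minimal sketch using only the standard library. The helper name `missing_praw_keys` and the choice of required keys (`client_id`, `client_secret`, `user_agent`) are assumptions for illustration and are not part of this notebook. The section name `default` matches the site name passed to `praw.Reddit("default")`.

```python
import configparser
from pathlib import Path

# Keys assumed to be required for a script-type Reddit app (assumption).
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")


def missing_praw_keys(ini_path, site="default"):
    """Return the required keys that are absent or empty in `ini_path`."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(Path(ini_path))  # a missing file is ignored, leaving no sections
    if not parser.has_section(site):
        return list(REQUIRED_KEYS)
    section = parser[site]
    return [key for key in REQUIRED_KEYS if not section.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_praw_keys("praw.ini")
    if missing:
        print(f"praw.ini is missing: {', '.join(missing)}")
    else:
        print("praw.ini looks complete.")
```

An empty list means every assumed key is present and non-empty. It does not prove that the credentials themselves are valid.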


In [22]:
# default credentials stored in `praw.ini` file
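try:
    reddit = praw.Reddit("default")
    # `reddit.user.me()` is None when the instance is read-only
    # (no username/password configured), so "None" here is not an error.
    print(f"Authenticated as: {reddit.user.me()}")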
except Exception as e:
    print("[ERROR] Error initializing Reddit instance.")
    print("Check that the `praw.ini` is configured correctly. ")
    print(f"Details: {e}")
    sys.exit(1)

# Check for connection
# for post in reddit.subreddit("all").hot(limit=5):
#     print(post.title, post.score, post.id, post.url)

Authenticated as: None


## Reddit Query Functions


In [33]:
def word_count(s: str) -> int:
    return len(re.findall(r"\w+", s or ""))


def comments_to_corpus(submission, min_score=6, min_words=21):
    """
    Extract and filter comments from a Reddit submission.

    Iterates through all comments in a submission and keeps only those that meet
    both a minimum score threshold and a minimum word count. Each qualifying
    comment is returned in two parallel forms:

    - "nested": [["comment1"], ["comment2"], ...]
    - "flat":   ["comment1", "comment2", ...]

    Parameters
    ----------
    submission : praw.models.Submission
        Reddit submission object from which to collect comments.
    min_score : int, optional, default=6
        Minimum upvote score required for a comment to be included.
    min_words : int, optional, default=21
        Minimum number of words required for a comment to be included.

    Returns
    -------
    dict
        Dictionary with two keys:
        - "nested": list of list of str
        - "flat":   list of str

    Notes
    -----
    - Comments with `[deleted]` or `[removed]` text are skipped.
    - Comments with no body text or missing score are excluded.
    - `submission.comments.replace_more(limit=0)` removes all unloaded
      `MoreComments` placeholders without fetching them, so only the comments
      already loaded with the submission are filtered.
    """
    submission.comments.replace_more(limit=0)
    nested, flat = [], []

    for c in submission.comments.list():
        body = c.body
        if not isinstance(body, str):
            continue
        body = body.strip()

        if body.lower() in ("[deleted]", "[removed]"):
            continue
        if c.score is None:
            continue

        if c.score >= min_score and word_count(body) >= min_words:
            nested.append([body])
            flat.append(body)

    return {"nested": nested, "flat": flat}


def build_df_for_query(
    subreddit_name: str,
    query: str,
    limit=20,
    sort="relevance",
    time_filter="all",
    min_score=6,
    min_words=21,
):
    """
    Query Reddit submissions and build a DataFrame of post titles and filtered comments.

    This function searches a specified subreddit for submissions matching a query
    and collects comments from each submission that meet the given thresholds
    (minimum score and minimum word count). The comments are returned both as a
    nested form (list of single-element lists) and as a flat form (list of strings).
    The result is a DataFrame where each row corresponds to one submission.

    Parameters
    ----------
    subreddit_name : str
        Name of the subreddit to search (without the "r/").
    query : str
        Search query string to use within the subreddit.
    limit : int, optional, default=20
        Maximum number of submissions to retrieve.
    sort : {"relevance", "hot", "top", "new", "comments"}, optional, default="relevance"
        Sorting method for the search results.
    time_filter : {"all", "day", "hour", "month", "week", "year"}, optional, default="all"
        Restrict search results to a specific time window.
    min_score : int, optional, default=6
        Minimum upvote score required for a comment to be included.
    min_words : int, optional, default=21
        Minimum number of words required for a comment to be included.

    Returns
    -------
    pandas.DataFrame
        DataFrame with one row per submission and the following columns:

        - ``source`` : str
            Constant value "reddit".
        - ``query`` : str
            The original search query string.
        - ``topic`` : str
            Title of the submission.
        - ``corpus`` : list of list of str
            Nested list of comments (each comment wrapped in a list).
        - ``flat_corpus`` : list of str
            Flat list of comments.
    """
    rows = []
    sr = reddit.subreddit(subreddit_name)

    for subm in sr.search(query, sort=sort, time_filter=time_filter, limit=limit):
        try:
            corpus = comments_to_corpus(subm, min_score=min_score, min_words=min_words)

            if corpus["nested"]:  # only keep submissions with qualifying comments
                rows.append(
                    {
                        "source": "reddit",
                        "query": query,
                        "topic": (subm.title or "").strip(),
                        "corpus": corpus["nested"],
                        "flat_corpus": corpus["flat"],
                    }
                )

        except Exception as e:
            print(f"[warn] {subm.id}: {e}")
            time.sleep(0.2)  # brief pause after a failure before moving on

    return pd.DataFrame(rows)
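
The filter applied by `comments_to_corpus` can be exercised without network access. The sketch below restates the keep/skip rule for a single comment and runs it on stand-in objects. `keep_comment` and the `SimpleNamespace` stubs are hypothetical and exist only for illustration; real PRAW comments carry many more attributes.

```python
import re
from types import SimpleNamespace


def word_count(s):
    return len(re.findall(r"\w+", s or ""))


def keep_comment(comment, min_score=6, min_words=21):
    """Mirror the filter in `comments_to_corpus` for one comment-like object."""
    body = comment.body
    if not isinstance(body, str):
        return False
    body = body.strip()
    if body.lower() in ("[deleted]", "[removed]"):
        return False
    if comment.score is None:
        return False
    return comment.score >= min_score and word_count(body) >= min_words


# Stand-in comments (hypothetical data, not real Reddit output).
comments = [
    SimpleNamespace(body="word " * 25, score=10),  # kept: long and well scored
    SimpleNamespace(body="word " * 25, score=2),   # dropped: score too low
    SimpleNamespace(body="too short", score=50),   # dropped: too few words
    SimpleNamespace(body="[deleted]", score=50),   # dropped: placeholder text
]

flat = [c.body.strip() for c in comments if keep_comment(c)]
nested = [[body] for body in flat]  # same comments, each wrapped in a list
print(len(flat), len(nested))  # prints: 1 1
```

The two output forms always have the same length, because `nested` is just `flat` with each comment wrapped in a single-element list.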

### Variables for Reddit Query


In [37]:
MIN_WORD = 10  # default = 21
MIN_SCORE = 5  # default = 6
LIMIT = 150  # default = 20

subreddit_var = "all"
query = '"Palo Alto" (school OR schools OR district OR education OR homework OR teacher OR teachers OR students)'
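
The same query shape can be reused for other locations, such as Oklahoma City. A minimal sketch, assuming a hypothetical helper `build_query` that quotes the place name and OR-joins the keywords. The helper and the `KEYWORDS` list name are illustrative and not part of this notebook.

```python
def build_query(place, keywords):
    """Quote `place` and OR-join `keywords` inside parentheses."""
    terms = " OR ".join(keywords)
    return f'"{place}" ({terms})'


# Keywords copied from the query above.
KEYWORDS = [
    "school", "schools", "district", "education",
    "homework", "teacher", "teachers", "students",
]

palo_alto_query = build_query("Palo Alto", KEYWORDS)
okc_query = build_query("Oklahoma City", KEYWORDS)
print(palo_alto_query)
```

Keeping the keyword list in one place means the queries for each city differ only in the quoted place name, which makes the resulting datasets easier to compare.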

In [38]:
# ---- Example usage ----
df = build_df_for_query(
    subreddit_name=subreddit_var,
    query=query,
    limit=LIMIT,
    sort="relevance",
    time_filter="all",
    min_words=MIN_WORD,
    min_score=MIN_SCORE,
)

print(df.shape)
df.head(5)

(78, 5)


|   | source | query | topic | corpus | flat_corpus |
|---|--------|-------|-------|--------|-------------|
| 0 | reddit | "Palo Alto" (school OR schools OR district OR ... | Mark Zuckerberg and his wife shut down their s... | [[It's almost as if schools should be funded b... | [It's almost as if schools should be funded by... |
| 1 | reddit | "Palo Alto" (school OR schools OR district OR ... | Mark Zuckerberg and his wife shut down their s... | [[Delete your Facebook. Delete IG. Delete Wh... | [Delete your Facebook. Delete IG. Delete Wha... |
| 2 | reddit | "Palo Alto" (school OR schools OR district OR ... | Bay Area teen rejected by 16 colleges, hired b... | [[Quote from another comment on this topic fro... | [Quote from another comment on this topic from... |
| 3 | reddit | "Palo Alto" (school OR schools OR district OR ... | To be a philanthropist. Mark Zuckerberg and hi... | [[Nothing says "cutting-edge wave of the futur... | [Nothing says "cutting-edge wave of the future... |
| 4 | reddit | "Palo Alto" (school OR schools OR district OR ... | Are you kicking kids out by 18? (Or if you wer... | [[Fuck no. I am doing my best to ensure that t... | [Fuck no. I am doing my best to ensure that th... |


## Save Datasets


In [39]:
# create timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

In [40]:
filename = f"{timestamp}_reddit_posts.pkl"
df.to_pickle(dataset_folder / filename)
print(f"Saved as {filename}. ")

Saved as 20251001_173707_reddit_posts.pkl. 
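
The save step assumes the dataset folder already exists. A minimal sketch of a helper that builds the same timestamped filename and creates the folder if it is missing. `timestamped_path` and its parameters are hypothetical names introduced for illustration.

```python
from datetime import datetime
from pathlib import Path


def timestamped_path(folder, suffix="reddit_posts.pkl", now=None):
    """Return folder/<YYYYmmdd_HHMMSS>_<suffix>, creating the folder if needed."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return folder / f"{stamp}_{suffix}"
```

With this helper the save becomes `df.to_pickle(timestamped_path(dataset_folder))`, and the file can later be loaded with `pd.read_pickle`. Pickle keeps the list-valued `corpus` and `flat_corpus` columns as Python lists, whereas a CSV round trip would turn them into strings.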
