# Data Collection

## Articles

- [Oklahoma schools rank 50th in the nation in latest education quality study](https://www.oklahoman.com/story/news/education/2025/07/24/oklahoma-schools-ranked-nearly-the-worst-in-the-nation-in-new-study/85310196007/)
- [Public School Rankings by State 2025](https://worldpopulationreview.com/state-rankings/public-school-rankings-by-state)
- [Worst School Districts by State 2025](https://worldpopulationreview.com/state-rankings/worst-school-districts-by-state)
- [2025 Best School Districts in Massachusetts](https://www.niche.com/k12/search/best-school-districts/s/massachusetts/)
- [2025 Best School Districts in California](https://www.niche.com/k12/search/best-school-districts/s/california/)

### Palo Alto, CA

- [Henry M. Gunn High School](https://gunn.pausd.org/)
  - [Website](https://gunn.pausd.org/)
  - [Niche.com](https://www.niche.com/k12/henry-m-gunn-high-school-palo-alto-ca/)
  - [Yelp.com](https://www.yelp.com/biz/henry-m-gunn-high-school-palo-alto-2?osq=henry+m+gunn+high+school)
- [Palo Alto High School](https://www.paly.net/)
  - [Website](https://www.paly.net/)
  - [Niche.com](https://www.niche.com/k12/palo-alto-high-school-palo-alto-ca/)
  - [Yelp.com](https://www.yelp.com/biz/palo-alto-high-school-palo-alto?osq=palo+alto+high+school)
- [Hope Technology School - Private](https://hopetechschool.org/)
  - [Website](https://hopetechschool.org/)
  - [Yelp.com](https://www.yelp.com/biz/hope-technology-school-palo-alto-6)
- [Palo Alto Middle College High School](https://mc.pausd.org/)
  - [Website](https://mc.pausd.org/)
  - [Yelp.com](https://www.yelp.com/biz/palo-alto-middle-college-high-school-los-altos-hills?osq=middle+college+high+school)

### Oklahoma City, OK

- [Boulevard Academy](https://boulevardacademy.edmondschools.net/o/boulevardacademy)
  - [Website](https://boulevardacademy.edmondschools.net/o/boulevardacademy)
- [Memorial High School](https://memorial.edmondschools.net/o/memorial)
  - [Website](https://memorial.edmondschools.net/o/memorial)
  - [Niche.com](https://www.niche.com/k12/memorial-high-school-edmond-ok/)
  - [Yelp.com](https://www.yelp.com/biz/memorial-high-school-tulsa?osq=memorial+high+school)
- [North High School](https://north.edmondschools.net/o/north)
- [Santa Fe High School](https://santafe.edmondschools.net/o/santafe)


In [2]:
import pandas as pd
import praw
import re
import sys
import time

## Connect to Reddit API


**Note:** Default credentials are stored in the `praw.ini` file. This file must
be created for the Reddit API to work. Use the `praw.ini.template` as a
reference to create the `praw.ini` file.
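
Before connecting, it can help to confirm that the file has the expected entries. The sketch below is one way to do that with the standard library. It assumes the site section is named `default` (matching the `praw.Reddit("default")` call in the next cell) and that the required keys are `client_id`, `client_secret`, and `user_agent`; the project's `praw.ini.template` may list different or additional keys. The helper name `missing_praw_keys` is hypothetical.

```python
import configparser

# Assumed keys; the project's `praw.ini.template` may list different ones.
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")


def missing_praw_keys(ini_text: str, site: str = "default") -> list:
    """Return the required keys that are absent or empty in the given site section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(site):
        # No such section at all: every required key is missing.
        return list(REQUIRED_KEYS)
    return [k for k in REQUIRED_KEYS if not parser.get(site, k, fallback="").strip()]


# Placeholder values only; real credentials come from the Reddit app settings.
example_ini = """
[default]
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
user_agent =
"""

print(missing_praw_keys(example_ini))
```

To check the real file, the same function can be called on its contents, for example `missing_praw_keys(open("praw.ini").read())`.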


In [3]:
# default credentials stored in `praw.ini` file
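try:
    reddit = praw.Reddit("default")
    # `reddit.user.me()` returns None when PRAW runs in read-only mode
    # (no username/password configured in `praw.ini`).
    print(f"Authenticated as: {reddit.user.me()}")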
except Exception as e:
    print("[ERROR] Error initializing Reddit instance.")
    print("Check that the `praw.ini` is configured correctly. ")
    print(f"Details: {e}")
    sys.exit(1)

# Check for connection
# for post in subreddit.hot(limit=5):
#     print(post.title, post.score, post.id, post.url)

Authenticated as: None


## Reddit Query Functions


In [4]:
def word_count(s: str) -> int:
    return len(re.findall(r"\w+", s or ""))


def comments_to_corpus(submission, min_score=6, min_words=21, as_tokens=False):
    """
    Extract and filter comments from a Reddit submission.

    Iterates through all comments in a submission and keeps only those that meet
    both a minimum score threshold and a minimum word count. The resulting comments
    are returned either as strings or as tokenized word lists.

    Parameters
    ----------
    submission : praw.models.Submission
        Reddit submission object from which to collect comments.
    min_score : int, optional, default=6
        Minimum upvote score required for a comment to be included.
    min_words : int, optional, default=21
        Minimum number of words required for a comment to be included.
    as_tokens : bool, optional, default=False
        If True, each comment is split into a list of words (tokens).
        If False, comments are returned as strings.

    Returns
    -------
    list of str or list of list of str
        A list of comments that passed the filters. Each comment is a string
        if `as_tokens=False`, or a list of tokens if `as_tokens=True`.

    Notes
    -----
    - Comments with `[deleted]` or `[removed]` text are skipped.
    - Comments with no body text or missing score are excluded.
    - `submission.comments.replace_more(limit=0)` removes the unexpanded
      "load more comments" placeholders without fetching them, so only the
      comments already loaded with the submission are filtered.

    Examples
    --------
    >>> submission = reddit.submission(id="abc123")
    >>> comments_to_corpus(submission, min_score=10, min_words=30)
    ["This is a long qualifying comment ...", "Another example comment ..."]

    >>> comments_to_corpus(submission, min_score=10, min_words=30, as_tokens=True)
    [["This", "is", "a", "long", "qualifying", "comment", ...],
     ["Another", "example", "comment", ...]]
    """
    submission.comments.replace_more(limit=0)
    keep = []

    for c in submission.comments.list():
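        body = c.body
        if not isinstance(body, str):
            continue
        body = body.strip()

        # skip placeholders left behind by deleted or removed comments
        if body in ("[deleted]", "[removed]"):
            continue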

        if c.score is None:
            continue

        if c.score >= min_score and word_count(body) >= min_words:
            if as_tokens:
                keep.append(body.split())  # list of words
            else:
                keep.append(body)
    return keep


def build_df_for_query(
    subreddit_name: str,
    query: str,
    limit=20,
    sort="relevance",
    time_filter="all",
    min_score=6,
    min_words=21,
    as_tokens=False,
):
    """
    Query Reddit submissions and build a DataFrame of post titles and filtered comments.

    This function searches a specified subreddit for submissions matching a query
    and collects comments from each submission that meet the given thresholds
    (minimum score and minimum word count). The collected comments can be returned
    either as raw strings or as tokenized word lists. The result is a DataFrame
    where each row corresponds to one submission.

    Parameters
    ----------
    subreddit_name : str
        Name of the subreddit to search (without the "r/").
    query : str
        Search query string to use within the subreddit.
    limit : int, optional, default=20
        Maximum number of submissions to retrieve.
    sort : {"relevance", "hot", "top", "new", "comments"}, optional, default="relevance"
        Sorting method for the search results.
    time_filter : {"all", "day", "hour", "month", "week", "year"}, optional, default="all"
        Restrict search results to a specific time window.
    min_score : int, optional, default=6
        Minimum upvote score required for a comment to be included.
    min_words : int, optional, default=21
        Minimum number of words required for a comment to be included.
    as_tokens : bool, optional, default=False
        If True, each comment is split into a list of words (tokens).
        If False, comments are kept as strings.

    Returns
    -------
    pandas.DataFrame
        DataFrame with one row per submission and the following columns:

        - ``source`` : str
            Constant value "reddit".
        - ``query`` : str
            The original search query string.
        - ``topic`` : str
            Title of the submission.
        - ``corpus`` : list of str or list of list of str
            List of comments that passed the filters, either as strings or
            token lists depending on `as_tokens`.

    Notes
    -----
    - Deleted or removed comments are excluded.
    - Submissions without qualifying comments are skipped.
    - Be mindful of Reddit API rate limits; the function sleeps briefly
      after encountering exceptions to remain polite to the API.

    Examples
    --------
    >>> df = build_df_for_query("AskReddit", "school homework", limit=5)
    >>> df[["topic", "corpus"]].head()
    """
    rows = []
    sr = reddit.subreddit(subreddit_name)

    for subm in sr.search(query, sort=sort, time_filter=time_filter, limit=limit):
        try:
            corpus = comments_to_corpus(
                subm, min_score=min_score, min_words=min_words, as_tokens=as_tokens
            )

            # keep only if corpus is non-empty (handles list[str] and list[list[str]])
            has_text = False
            if isinstance(corpus, list):
                if len(corpus) > 0:
                    if all(isinstance(c, list) for c in corpus):  # list of token lists
                        has_text = any(len(c) > 0 for c in corpus)
                    else:  # list of strings
                        has_text = any(isinstance(c, str) and c.strip() for c in corpus)

            if has_text:
                rows.append(
                    {
                        "source": "reddit",
                        "query": query,
                        "topic": (subm.title or "").strip(),
                        "corpus": corpus,
                    }
                )

        except Exception as e:
            print(f"[warn] {subm.id}: {e}")
            time.sleep(0.2)  # be polite to the API

    return pd.DataFrame(rows)
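

The score and word-count thresholds can be checked without calling the Reddit API. The following is a minimal offline sketch, not the notebook's actual pipeline: `filter_comments` is a hypothetical stand-in for the filtering step of `comments_to_corpus`, and the comments are made-up `SimpleNamespace` objects that model only the `.body` and `.score` attributes.

```python
import re
from types import SimpleNamespace


def word_count(s: str) -> int:
    # Same regex-based counter as in the cell above.
    return len(re.findall(r"\w+", s or ""))


def filter_comments(comments, min_score=6, min_words=21):
    """Keep comment bodies that meet both the score and word-count thresholds."""
    kept = []
    for c in comments:
        body = (c.body or "").strip()
        if body in ("[deleted]", "[removed]"):
            continue
        if c.score is None:
            continue
        if c.score >= min_score and word_count(body) >= min_words:
            kept.append(body)
    return kept


# Hypothetical comments, not real Reddit data.
fake_comments = [
    SimpleNamespace(body="Great teachers and a strong music program here.", score=12),
    SimpleNamespace(body="Agreed.", score=40),  # too short
    SimpleNamespace(body="The counseling office was helpful during applications.", score=2),  # score too low
    SimpleNamespace(body="[deleted]", score=15),  # placeholder text
]

kept = filter_comments(fake_comments, min_score=10, min_words=5)
print(kept)
```

Only the first comment passes both thresholds here, which makes it easy to see how raising `min_score` or `min_words` shrinks the corpus.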

### Variables for Reddit Query

In [7]:
MIN_WORD = 5  # default = 21
MIN_SCORE = 10  # default = 6
LIMIT = 5  # default = 20

subreddit_var = "paloalto"
query = "palo alto unified school district high school"

In [8]:
# ---- Example usage ----
df = build_df_for_query(
    subreddit_name=subreddit_var,
    query=query,
    limit=LIMIT,
    sort="relevance",
    time_filter="all",
    min_words=MIN_WORD,
    min_score=MIN_SCORE,
)

print(df.shape)
df.head(10)

(4, 4)


|   | source | query | topic | corpus |
|---|--------|-------|-------|--------|
| 0 | reddit | palo alto unified school district high school | Palo Alto high school (or Paly) | [current paly student here! paly is pretty saf... |
| 1 | reddit | palo alto unified school district high school | supporting the family of Ash who our Palo Alto... | [This family is grieving, and if you have noth... |
| 2 | reddit | palo alto unified school district high school | Why is it so hard to find a decent high-end ap... | [Had similar issues years back, they literally... |
| 3 | reddit | palo alto unified school district high school | Congressman slams Palo Alto school district ov... | [What business does Khanna have in commenting ... |
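
Because each row holds a list of comments in `corpus`, later analysis may be easier with one record per comment. The sketch below shows that reshaping on plain dictionaries so it runs without Reddit credentials. The rows are hypothetical placeholders in the same shape as those built by `build_df_for_query`, and `explode_rows` is an illustrative helper, not part of the notebook.

```python
# Hypothetical rows with the same keys as the DataFrame columns above;
# the comment texts are placeholders, not real Reddit data.
rows = [
    {"source": "reddit", "query": "q", "topic": "Post A",
     "corpus": ["first comment", "second comment"]},
    {"source": "reddit", "query": "q", "topic": "Post B",
     "corpus": ["third comment"]},
]


def explode_rows(rows):
    """Return one record per comment, repeating the submission-level fields."""
    flat = []
    for row in rows:
        for comment in row["corpus"]:
            # Copy the row and replace the comment list with a single comment.
            flat.append({**row, "corpus": comment})
    return flat


flat = explode_rows(rows)
print(len(flat))
print([r["topic"] for r in flat])
```

On the DataFrame itself, `df.explode("corpus")` should give a comparable one-comment-per-row layout.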
