# Document Chunking and Minimal Cleaning

This notebook performs the following tasks:

1. **Import** necessary libraries and set up basic logging.  
2. **Define** helper functions to:  
   - Clean raw text minimally (removing links, HTML tags, etc.)  
   - Optionally *chunk* any document exceeding a specified token threshold into smaller units  
3. **Implement** a main pipeline that:  
   - Loads the original data (`aggregated_raw_reddit_data.json`)  
   - Cleans and chunks the data based on parameters (`token_thresh` and `max_sents`)  
   - Filters out empty results  
   - Saves the output as `bertopic_ready_data.json`

This approach ensures that very long posts or comments are not too large for subsequent topic-modelling frameworks like **BERTopic** or **LDA**, while short texts remain unaltered.

In [2]:
# ----------------------------------------------------------------------------------------
# 1) Imports and Setup
# ----------------------------------------------------------------------------------------

import json
import re
import html
import logging
from pathlib import Path
from typing import Dict
import os

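import nltk
from nltk.tokenize import sent_tokenize

# sent_tokenize needs the Punkt sentence models; the download is skipped if already present
nltk.download("punkt_tab")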

logging.basicConfig(
    level=logging.WARNING,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# File paths for data folder
base_folder = r"C:\Users\laure\Desktop\dissertation_notebook"
data_folder = os.path.join(base_folder, "Data")
os.makedirs(data_folder, exist_ok=True)

# Input/output paths
aggregated_path = os.path.join(data_folder, "aggregated_raw_reddit_data.json")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\laure\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
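
The setup cell hard-codes a Windows path, so the notebook only runs unchanged on one machine. As a hedged alternative, the sketch below resolves the same `Data` folder and file names relative to a configurable base directory. The function name `resolve_paths` and the environment variable `NOTEBOOK_BASE_DIR` are hypothetical and are not defined anywhere in the notebook.

```python
import os
import tempfile
from pathlib import Path


def resolve_paths(base_folder=None) -> dict:
    """Build the data-folder paths from an explicit base, an env var, or the working directory."""
    # NOTEBOOK_BASE_DIR is an assumed variable name used only for this illustration.
    base = Path(base_folder or os.environ.get("NOTEBOOK_BASE_DIR", Path.cwd()))
    data_folder = base / "Data"
    data_folder.mkdir(parents=True, exist_ok=True)
    return {
        "data_folder": data_folder,
        "input": data_folder / "aggregated_raw_reddit_data.json",
        "output": data_folder / "bertopic_ready_data.json",
    }


# Usage: a temporary directory stands in for the real project folder.
with tempfile.TemporaryDirectory() as tmp:
    paths = resolve_paths(tmp)
    print(paths["input"].name)
    print(paths["output"].name)
```

Because `pathlib.Path` objects are accepted by `open()`, the rest of the pipeline could consume these paths without further changes.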


## 2) Helper Functions

- **`strip_html_tags(doc)`**: Removes raw HTML tags (e.g., `<p>`, `<div>`).  
- **`remove_standalone_links(doc)`**: Strips away standalone links such as `https://something.com`.  
- **`minimal_clean(doc)`**: Performs a series of minimal text cleaning steps (e.g., lowercasing, unescaping HTML entities, removing Markdown links).  
- **`maybe_chunk_document(doc, max_sents)`**: Splits a document into multiple chunks of up to `max_sents` sentences if chunking is enabled.  
- **`process_post(post, max_sents, token_thresh)`**: Applies the above cleaning logic to a single post, including chunking if needed, and processes comments if required.
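
To make the cleaning steps concrete, here is a condensed sketch that applies the same regular expressions to one sample string. The function name `clean_demo` and the sample text are illustrative only. The sketch unwraps Markdown links before removing bare URLs, because the reverse order would strip the URL out of `[text](url)` and leave `[text](` behind.

```python
import html
import re


def clean_demo(doc: str) -> str:
    """Condensed illustration of the cleaning steps (not the notebook's exact code)."""
    doc = html.unescape(doc.lower())                       # lowercase, decode entities such as &amp;
    doc = re.sub(r"<[^>]+>", "", doc)                      # strip HTML tags
    doc = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r"\1", doc)   # [text](url) -> text
    doc = re.sub(r"(https?://\S+)|(www\.\S+)", "", doc)    # drop bare URLs
    doc = re.sub(r"(\*\*|\*|__|_|`)", "", doc)             # drop emphasis/code markers
    doc = doc.replace('"', "")                             # drop double quotes
    return re.sub(r"\s+", " ", doc).strip()                # collapse whitespace


sample = (
    'I <b>LOVE</b> this &amp; **that**: '
    'see [my post](https://example.com/p) or www.example.org "now"'
)
print(clean_demo(sample))
# i love this & that: see my post or now
```

Punctuation other than double quotes survives, which matters later because sentence splitting relies on it.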

In [4]:
def strip_html_tags(doc: str) -> str:
    """Remove any <tag> HTML markup."""
    return re.sub(r"<[^>]+>", "", doc)

def remove_standalone_links(doc: str) -> str:
    """Remove raw links such as https://example.com or www.example.com."""
    link_pattern = r"(https?://\S+)|(www\.\S+)"
    return re.sub(link_pattern, "", doc)

def minimal_clean(doc: str) -> str:
    """
    Perform minimal text cleaning:
      1. Lowercase text
      2. Unescape HTML entities
      3. Strip real HTML tags
      4. Unwrap markdown links [text](url) -> text
      5. Remove standalone links
      6. Remove bold/italic/code markers
      7. Remove double quotes
      8. Collapse extra spaces
    """
    if not doc:
        return ""

    # 1) Lowercase
    doc = doc.lower()

    # 2) Convert HTML entities
    doc = html.unescape(doc)

    # 3) Strip real HTML tags
    doc = strip_html_tags(doc)

    # 4) Unwrap markdown links: [text](url) -> text
    #    This must run before raw-link removal, which would otherwise consume
    #    the URL together with the closing ")" and leave "[text](" behind.
    doc = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r"\1", doc)

    # 5) Remove raw links
    doc = remove_standalone_links(doc)

    # 6) Remove bold/italic/code markers: **, *, __, _, `
    doc = re.sub(r"(\*\*|\*|__|_|`)", "", doc)

    # 7) Remove double quotes
    doc = doc.replace('"', "")

    # 8) Collapse extra spaces
    doc = re.sub(r"\s+", " ", doc).strip()

    return doc

def maybe_chunk_document(doc: str, max_sents: int) -> list[str]:
    """
    Split a doc into chunks of up to `max_sents` sentences each.
    If max_sents <= 0, return the doc as a single-item list.
    """
    if not doc:
        return []

    if max_sents <= 0:
        return [doc]

    sents = sent_tokenize(doc)
    chunks = []
    for i in range(0, len(sents), max_sents):
        snippet = " ".join(sents[i : i + max_sents]).strip()
        if snippet:
            chunks.append(snippet)
    return chunks

def process_post(post: Dict, max_sents: int = 5, token_thresh: int = 200) -> Dict:
    """
    1) Keep metadata: post_id (falling back to _id), subreddit, keyword, submission_score.
    2) Keep raw fields: title, selftext, comment.
    3) If the cleaned main text has more than token_thresh tokens, split it into
       chunks of max_sents sentences and keep the first chunk; otherwise keep it whole.
    4) Apply the same logic to each comment.
    5) Preserve post_id and comment_ids throughout processing.
    """

    # ---------------------
    # 1) Basic metadata
    # ---------------------
    post_id = post.get("post_id", post.get("_id", ""))
    if isinstance(post_id, dict) and "$oid" in post_id:
        post_id = post_id["$oid"]
    
    if not post_id:
        logger.warning("Post found without post_id")

    subreddit = post.get("subreddit", "")
    keyword = post.get("keyword", "")
    submission_score = post.get("submission_score", None)

    # ---------------------
    # 2) Raw text fields
    # ---------------------
    raw_title = post.get("title", "") or ""
    raw_selftext = post.get("selftext", "") or ""

    # Combine them
    combined_text = f"{raw_title} {raw_selftext}".strip()
    # Minimal clean
    cleaned_main = minimal_clean(combined_text)
    # Count tokens
    main_tokens = cleaned_main.split()

    # If the doc is "long" (above token_thresh), chunk it and keep only the first chunk
    if len(main_tokens) > token_thresh and max_sents > 0:
        main_chunks = maybe_chunk_document(cleaned_main, max_sents)
        combined_processed = main_chunks[0] if main_chunks else ""
    else:
        combined_processed = cleaned_main

    # ---------------------
    # 3) Process comments
    # ---------------------
    processed_comments = []
    for c in post.get("comments", []):
        comment_id = c.get("comment_id", "")
        if not comment_id:
            logger.warning(f"Comment found without comment_id in post {post_id}")

        c_raw = c.get("comment", "") or ""
        c_clean = minimal_clean(c_raw)
        c_tokens = c_clean.split()

        if len(c_tokens) > token_thresh and max_sents > 0:
            c_chunks = maybe_chunk_document(c_clean, max_sents)
            c_processed = c_chunks[0] if c_chunks else ""
        else:
            c_processed = c_clean

        processed_comments.append({
            "comment_id": comment_id,
            "comment": c_raw,
            "comment_processed": c_processed
        })

    return {
        # Metadata
        "post_id": str(post_id),
        "subreddit": subreddit,
        "keyword": keyword,
        "submission_score": submission_score,

        # Raw fields
        "title": raw_title,
        "selftext": raw_selftext,

        # Potentially chunked result
        "combined_processed": combined_processed,

        # Comments with preserved IDs
        "comments": processed_comments
    }
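
Note that `process_post` stores only the first chunk (`main_chunks[0]`, `c_chunks[0]`) of a long text, so later sentences are dropped. If every chunk should be retained as its own document, a variant along the following lines could be used. This is a minimal sketch, not the notebook's method: `chunk_if_long` is a hypothetical helper, and `split_sentences` is a naive regex stand-in for `sent_tokenize` so that the example runs without NLTK data.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive stand-in for nltk's sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_if_long(text: str, max_sents: int = 5, token_thresh: int = 200) -> list[str]:
    """Return [text] for short documents, or every chunk of up to max_sents sentences for long ones."""
    if not text:
        return []
    # Same length test as the notebook: whitespace-separated tokens.
    if max_sents <= 0 or len(text.split()) <= token_thresh:
        return [text]
    sents = split_sentences(text)
    return [" ".join(sents[i:i + max_sents]) for i in range(0, len(sents), max_sents)]


# Twelve five-token sentences (60 tokens) against a deliberately low threshold of 20.
long_doc = " ".join(f"this is sentence number {i}." for i in range(12))
chunks = chunk_if_long(long_doc, max_sents=5, token_thresh=20)
print(len(chunks))   # 3 chunks holding 5, 5 and 2 sentences
print(chunk_if_long("a short text.", max_sents=5, token_thresh=20))  # ['a short text.']
```

Keeping all chunks would change the output schema (a list of strings instead of a single `combined_processed` string), so any downstream notebook would need to be adapted accordingly.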

## 3) Main Pipeline

The main function performs the following steps:

1. **Loads** documents from `aggregated_raw_reddit_data.json`  
2. **Applies** cleaning and optional chunking  
3. **Filters** out empty results  
4. **Saves** the final data as `bertopic_ready_data.json`, along with basic length statistics
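
The filtering rule in step 3 keeps a post when either its own text or at least one of its comments is non-empty after cleaning. A compact sketch of that rule, using a hypothetical helper `has_text` and made-up records that follow the output schema of `process_post`:

```python
def has_text(doc: dict) -> bool:
    """True if the post body or at least one comment still has text after cleaning."""
    if doc.get("combined_processed", "").strip():
        return True
    return any(c.get("comment_processed", "").strip() for c in doc.get("comments", []))


docs = [
    {"post_id": "a", "combined_processed": "some text", "comments": []},
    {"post_id": "b", "combined_processed": "", "comments": [{"comment_processed": "a reply"}]},
    {"post_id": "c", "combined_processed": "  ", "comments": [{"comment_processed": ""}]},
]
kept = [d for d in docs if has_text(d)]
print([d["post_id"] for d in kept])  # ['a', 'b']
```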

In [6]:
def main():
    """
    Clean every post, chunk only "long" docs above `token_thresh` (short docs are
    kept as-is), drop empty results, and store the rest in a final JSON file
    that follows the schema expected by the LDA/BERTopic notebooks.
    """
    input_path = Path(aggregated_path)
    output_path = Path(os.path.join(data_folder, "bertopic_ready_data.json"))
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # Chunk only docs with more than 200 tokens, in chunks of 5 sentences
    token_thresh = 200
    max_sents = 5

    with open(input_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    processed_data = []
    for post in data:
        processed_post = process_post(post, max_sents=max_sents, token_thresh=token_thresh)
        processed_data.append(processed_post)

    # Filter out if both combined_processed is empty AND all comment_processed are empty
    final_list = []
    for doc in processed_data:
        if doc["combined_processed"].strip():
            final_list.append(doc)
        else:
            any_comment_nonempty = any(c["comment_processed"].strip() for c in doc["comments"])
            if any_comment_nonempty:
                final_list.append(doc)

    processed_data = final_list

    # Save
    with open(output_path, "w", encoding="utf-8") as out_f:
        json.dump(processed_data, out_f, indent=2, ensure_ascii=False)

    # Basic doc length stats (just for combined_processed)
    doc_lengths = [len(d["combined_processed"].split()) for d in processed_data]
    if doc_lengths:
        avg_len = sum(doc_lengths) / len(doc_lengths)
        sorted_lens = sorted(doc_lengths)
        median_len = sorted_lens[len(sorted_lens)//2]
        min_len = min(doc_lengths)
        max_len = max(doc_lengths)
        print(f"Total posts, comments not counted: {len(doc_lengths)}")
        print(f"Average token-length (combined_processed): {avg_len:.2f}")
        print(f"Median token-length: {median_len}")
        print(f"Min token-length: {min_len}, Max token-length: {max_len}")
    else:
        print("No documents after filtering!")

    print(f"\nSaved cleaned data to: {output_path}")

if __name__ == "__main__":
    main()

Total posts, comments not counted: 575
Average token-length (combined_processed): 110.84
Median token-length: 106
Min token-length: 4, Max token-length: 298

Saved cleaned data to: C:\Users\laure\Desktop\dissertation_notebook\Data\bertopic_ready_data.json


## Conclusion

This notebook demonstrates a **minimal** text cleaning strategy (lowercasing and removing links, HTML tags, and Markdown markers) and **optionally** splits lengthy texts into smaller chunks based on sentence count. These steps keep long documents to a manageable size for subsequent analyses such as **BERTopic** or **LDA**.

Basic statistics on token lengths are also provided to clarify the size distribution of the final documents. The `bertopic_ready_data.json` file can now be used directly in the next notebook or script for advanced topic modelling.

## References

**Reference:**  
Bird, S., Klein, E., and Loper, E. (2009) *Natural Language Processing with Python*. O'Reilly Media Inc.  
Available from: [https://www.nltk.org/](https://www.nltk.org/) [Accessed 12 January 2025].

**Git Repo:**  
- [NLTK GitHub](https://github.com/nltk/nltk)