# LDA Preprocessing

In this notebook, we:
1. **Import** needed libraries and define patterns/constants.
2. **Define** preprocessing classes and functions (cleaning artifacts, tokenising, bigram building, etc.).
3. **Load** raw data and **build** the bigram model.
4. **Process** dataset posts (title, selftext, comments) into fields ready for LDA or BERT-based approaches.
5. **Output** the final `lda_ready_data.json`.

## Key Libraries

- **transformers** (Hugging Face for tokenisation):  
  [Hugging Face Transformers docs](https://huggingface.co/docs/transformers)

- **nltk** (natural language toolkit):  
  [NLTK documentation](https://www.nltk.org/)

- **gensim** (topic modelling, phrases):  
  [Gensim documentation](https://radimrehurek.com/gensim/)

## Imports & Setting Up Logging

In [38]:
import re               # Regular expressions for text patterns
import html             # Handling HTML entities
import json             # Reading/writing JSON files
import unicodedata      # Unicode normalisation
import logging          # Logging setup
from typing import Dict, List
from dataclasses import dataclass
from pathlib import Path
import string           # Punctuation constants, string operations
import os               # Operating system utilities

# Transformers (Hugging Face) for tokenisation
from transformers import AutoTokenizer

# NLTK for stopwords, tokenisation, POS tagging, lemmatisation
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

# Gensim for bigram/phrase models (topic modelling)
from gensim.models.phrases import Phrases, Phraser

# Collections for counters/frequency distributions
from collections import Counter

# --------------------------------------------
# Logging Set Up
# --------------------------------------------

# Configure basic logging to show warnings and above.
logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

# Indicate successful import and logger setup
print("Libraries imported successfully and logging set up.")

Libraries imported successfully and logging set up.


## Defining Patterns, Stopwords & Global Variables

In this section, we:
1. Specify various regular expression (regex) patterns for cleaning and detecting potential artefacts.
2. Define both default and custom stopwords, as well as non-relevant bigrams.
3. Initialise our lemmatiser, a placeholder for the bigram model, and store a set of part-of-speech tags we allow.

In [21]:
# ----------------------------------------------------------------------------------------
# Pattern Definitions
# ----------------------------------------------------------------------------------------
ARTIFACT_PATTERNS = {
    # Matches HTML entities like &#xAB;, &#123;, etc.
    'html_entities': r'&#x?[0-9a-fA-F]+;',
    # Removes zero-width characters such as \u200B, \uFEFF
    'zero_width': r'[\u200B\uFEFF]',
    # Detects Markdown-style links, e.g. [text](URL)
    'markdown_links': r'\[([^\]]+)\]\(([^)]+)\)',
    # Matches common Markdown formatting symbols (*, **, _, __, `)
    'markdown_format': r'(\*\*|\*|__|_|`)',
    # Matches quoted lines that often start with ">"
    'quote_blocks': r'>\s*(.+?)(\n|$)',
    # Matches “smart quotes” and other similar characters
    'smart_quotes': r'[’‘´`“”]',
    # Consolidates multiple whitespace characters into one
    'whitespace': r'\s+',
    # Looks for encoding artefacts such as �
    'encoding_artifacts': r'[�]',
    # Removes extra block quote symbols (e.g. >, >>, >>>)
    'blockquotes_extra': r'^>+\s*|\s*>+\s*',
}

# Patterns that look suspicious or might need further checks.
SUSPICIOUS_PATTERNS = [
    r'&#\w+;',
    r'\[.*?\]\(.*?\)',
    r'[“”‘’´`]',
]

# ----------------------------------------------------------------------------------------
# Stopwords and Custom Stopwords
# ----------------------------------------------------------------------------------------
# NLTK stopwords for the English language
STOPWORDS = set(stopwords.words('english'))

# A user-defined set of common or domain-specific words
# that we do not want to treat as meaningful.
CUSTOM_STOPWORDS = {
    'im', 'ive', 'dont', 'like', 'just', 'know', 'thing', 'really', 'get', 'got',
    'would', 'could', 'also', 'one', 'even', 'much', 'still', 'thats', 'well',
    'cant', 'going', 'make', 'time', 'this', 'http', 'www', 'com',
    'good_doctor', 'thank_you', '40_hours', 'does_anyone', 'for_example', 'non_verbal',
    # Domain-specific if too frequent/unhelpful:
    'autism', 'autistic', 'people', 'child', 'children', 'parent', 'kid', 'kids',
    'year', 'years',
}

# Merge the above sets of stopwords
STOPWORDS = STOPWORDS.union(CUSTOM_STOPWORDS)

# A set of bigrams irrelevant to our analysis
NON_RELEVANT_BIGRAMS = {
    "tik_tok", "vice_versa", "grain_salt",
    "sliding_scale", "advantages_disadvantages", "et_al",
    "blah_blah", "mutually_exclusive", "daddy_daddy"
}

# Incorporate the non-relevant bigrams into the custom stopwords set
CUSTOM_STOPWORDS.update(NON_RELEVANT_BIGRAMS)

# Rebuild the full STOPWORDS set
STOPWORDS = STOPWORDS.union(CUSTOM_STOPWORDS)

# Initialises the WordNet Lemmatiser for word normalisation
lemmatizer = WordNetLemmatizer()

# Placeholder for a bigram model that will be built later using Gensim's Phrases
bigram_model = None

# A set of allowed part-of-speech tags used for filtering (nouns, adjectives, adverbs, verbs)
ALLOWED_POS = {
    'NN', 'NNS', 'NNP', 'NNPS',
    'JJ', 'JJR', 'JJS',
    'RB', 'RBR', 'RBS',
    'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'
}

# Tracks the distribution of part-of-speech tags encountered
pos_distribution = Counter()

print("Pattern definitions, stopwords, and global variables have been initialised.")

Pattern definitions, stopwords, and global variables have been initialised.
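
To make the effect of a few of these patterns concrete, the sketch below applies three of them to a single string. The patterns are copied from `ARTIFACT_PATTERNS`; the sample text and URL are invented for illustration. It also shows a side effect worth keeping in mind: `markdown_format` removes every underscore, so identifiers such as `snake_case` lose theirs.

```python
import re

# Patterns copied verbatim from ARTIFACT_PATTERNS above.
ZERO_WIDTH = r'[\u200B\uFEFF]'
MARKDOWN_LINKS = r'\[([^\]]+)\]\(([^)]+)\)'
MARKDOWN_FORMAT = r'(\*\*|\*|__|_|`)'

# Hypothetical input, invented for illustration only.
sample = "See the **NHS** [guide](https://example.org/a_b) for\u200b details"

text = re.sub(ZERO_WIDTH, '', sample)        # drop the zero-width space
text = re.sub(MARKDOWN_LINKS, r'\1', text)   # keep the link text, drop the URL
text = re.sub(MARKDOWN_FORMAT, '', text)     # strip emphasis markers
print(text)   # See the NHS guide for details

# Side effect: underscores inside ordinary words are stripped as well.
snake = re.sub(MARKDOWN_FORMAT, '', "snake_case")
print(snake)  # snakecase
```

Resolving links before stripping formatting symbols, as the cleaning functions in this notebook do, means the underscore in the URL never reaches the output.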


## Utility Functions for Cleaning & Tokenising

Below are helper functions used to cleanse text of various artefacts, normalise spacing/punctuation, and perform tokenisation. This includes removal of suspicious patterns, lemmatisation, stopword filtering, and optional bigram detection.

In [19]:
def clean_artefacts(text: str) -> str:
    """
    Removes or replaces various artefacts (HTML entities, Markdown formatting, zero-width 
    characters, etc.) from the provided text.
    """
    # Convert HTML entities to their standard characters
    text = html.unescape(text)
    
    # Remove quote block markers (e.g., > Quote)
    text = re.sub(ARTIFACT_PATTERNS['quote_blocks'], r'\1 ', text)
    
    # Remove HTML entities and zero-width characters
    text = re.sub(ARTIFACT_PATTERNS['html_entities'], '', text)
    text = re.sub(ARTIFACT_PATTERNS['zero_width'], '', text)
    
    # Replace Markdown-style links with just the link text
    text = re.sub(ARTIFACT_PATTERNS['markdown_links'], r'\1', text)
    
    # Remove common Markdown formatting symbols (*, **, _, __, `)
    text = re.sub(ARTIFACT_PATTERNS['markdown_format'], '', text)
    
    # Replace stray encoding character
    text = text.replace("�", "'")
    
    # Convert “smart quotes” to normal apostrophes
    text = re.sub(ARTIFACT_PATTERNS['smart_quotes'], "'", text)
    text = text.replace('"', "'")
    
    # Consolidate multiple whitespace characters into a single space
    text = re.sub(ARTIFACT_PATTERNS['whitespace'], ' ', text)
    
    return text.strip()


def check_artefacts(text: str) -> bool:
    """
    Checks if the text still contains any suspicious patterns.
    """
    # If any pattern is found, we return False
    return not any(re.search(pattern, text) for pattern in SUSPICIOUS_PATTERNS)


def clean_parenthetical(text: str, acronyms=None) -> str:
    """
    Removes parenthetical content (e.g., '(some text)') unless it contains 
    a recognised acronym to preserve meaningful information.
    """
    if not acronyms:
        # If no acronyms are provided, remove all parenthetical content.
        return re.sub(r'\([^)]*\)', '', text)

    def contains_acronym(content: str) -> bool:
        # Checks if a given snippet contains any of the specified acronyms.
        return any(acr in content.upper() for acr in acronyms)

    parts = []
    idx = 0
    # Find all parenthetical substrings
    for match in re.finditer(r'\(([^()]+)\)', text):
        start, end = match.span()
        content = match.group(1)
        
        # Add the segment before the match
        parts.append(text[idx:start])
        
        # If acronym is present, keep the parenthetical segment
        if contains_acronym(content):
            parts.append(f"({content})")
        
        # Move past this parenthetical
        idx = end
    
    # Add any remaining text after the final parenthetical
    parts.append(text[idx:])
    
    # Rebuild the string and normalise whitespace
    final = ' '.join(filter(None, parts))
    final = re.sub(r'\s+', ' ', final)
    return final.strip()


def minimal_tokenise(text: str) -> List[str]:
    """
    A simplified tokenisation function for building the bigram model. 
    Splits text on whitespace, removes punctuation, and filters out stopwords.

    """
    if text is None:
        text = ''
    text = text.lower()
    tokens = text.split()
    
    cleaned = []
    for token in tokens:
        # Strip punctuation from the start/end of the token
        t = token.strip(string.punctuation)
        
        # Check if the token is alphabetic and not in stopwords
        if t and t.isalpha() and t not in STOPWORDS:
            cleaned.append(t)
    return cleaned


def tokenise_lemmatise_stopwords_bigrams(text: str) -> str:
    """
    Full tokenisation pipeline: lowercasing, punctuation removal, stopword filtering,
    lemmatisation, POS-based filtering, optional bigram detection, and a final
    stopword/alphabetic check.
    """
    if not text:
        return ""
    
    # Initial split
    tokens = text.split()

    # Clean punctuation and normalise case
    cleaned_tokens = []
    for token in tokens:
        token = token.lower().strip(string.punctuation)
        # Skip empty or purely punctuation tokens
        if not token or all(ch in string.punctuation for ch in token):
            continue
        cleaned_tokens.append(token)

    # Lemmatise and filter stopwords
    filtered = []
    for token in cleaned_tokens:
        if token not in STOPWORDS and len(token) > 1:
            lemma = lemmatizer.lemmatize(token)
            if lemma not in STOPWORDS:
                filtered.append(lemma)

    # Capture the distribution of POS tags
    tagged = pos_tag(filtered)
    global pos_distribution
    pos_distribution.update(pos for _, pos in tagged)

    # Keep only allowed POS tags (e.g., nouns, verbs, adjectives, etc.)
    pos_filtered_tokens = [word for (word, pos) in tagged if pos in ALLOWED_POS]

    # Apply bigram model if it exists
    if bigram_model:
        bigram_tokens = bigram_model[pos_filtered_tokens]
    else:
        bigram_tokens = pos_filtered_tokens

    # Final pass to remove stopword bigrams (e.g. NON_RELEVANT_BIGRAMS) and tokens
    # that contain stopwords or are non-alphabetic
    final_tokens = []
    for token in bigram_tokens:
        if '_' in token:
            # Drop the bigram if it is listed as a stopword in its own right
            if token in STOPWORDS:
                continue
            # Otherwise split it and check each part
            parts = token.split('_')
            if any(p in STOPWORDS or not p.isalpha() for p in parts):
                continue
        else:
            # For single tokens, check if it's in stopwords or not purely alphabetic
            if token in STOPWORDS or not token.isalpha():
                continue
        
        final_tokens.append(token)

    return ' '.join(final_tokens)


def clean_punctuation_spacing(text: str) -> str:
    """
    Tidies up punctuation spacing: condenses multiple periods, fixes spacing around 
    commas, apostrophes, and similar issues.

    """
    if not text:
        return ""
    # Convert multiple '.' to '...'
    text = re.sub(r'\.{2,}', '...', text)
    # Ensure one space after punctuation
    text = re.sub(r'\s*([.,!?;:])\s*', r'\1 ', text)
    # Insert a space after sentence-ending punctuation if it's missing
    text = re.sub(r'([.!?])(?=\w)', r'\1 ', text)
    # Handle contractions like "'s" or "'t" by removing extra spaces
    text = re.sub(r"\s*'\s*s\b", "'s", text)
    text = re.sub(r"\s*'\s*t\b", "'t", text)
    # Consolidate extra spaces
    text = re.sub(r'\s{2,}', ' ', text)
    return text.strip()


def normalise_whitespace(text: str) -> str:
    """
    Ensures consistent single spacing for whitespace in the text.
    """
    if not text:
        return ""
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

print("Utility functions for cleaning and tokenising have been defined.")

Utility functions for cleaning and tokenising have been defined.
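
As a quick illustration of how the minimal tokeniser behaves, the sketch below re-implements its logic in a self-contained form. The function name `minimal_tokenise_sketch`, the three-word `TOY_STOPWORDS` set, and the sample sentence are assumptions made for this example; the real function uses the full `STOPWORDS` set defined earlier. Note that `str.strip(string.punctuation)` only trims the ends of a token, so tokens with internal apostrophes or hyphens fail the `isalpha()` check and are dropped.

```python
import string

# Hypothetical stand-in for the notebook's much larger STOPWORDS set.
TOY_STOPWORDS = {"the", "is", "and"}

def minimal_tokenise_sketch(text):
    """Lowercase, split on whitespace, trim edge punctuation, keep alphabetic non-stopwords."""
    cleaned = []
    for token in (text or "").lower().split():
        t = token.strip(string.punctuation)  # trims leading/trailing punctuation only
        if t and t.isalpha() and t not in TOY_STOPWORDS:
            cleaned.append(t)
    return cleaned

tokens = minimal_tokenise_sketch("The GP's advice (ABA-based) is helpful, and don't worry!")
print(tokens)  # ['advice', 'helpful', 'worry']
```

Here `gp's`, `aba-based`, and `don't` are discarded because they still contain non-alphabetic characters after trimming.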


## Preprocessing Configuration & Classes

This section defines:
1. A data class (`PreprocessingConfig`) containing constraints and acronyms for text processing.
2. A text preprocessor class (`TextPreprocessor`) that orchestrates cleaning, tokenisation, acronym handling, bigram application, and truncation.
3. A Reddit data processor class (`RedditDataProcessor`) that uses the text preprocessor on Reddit posts, comments, etc.

In [27]:
from dataclasses import dataclass

@dataclass
class PreprocessingConfig:
    """
    Holds general configuration values for text preprocessing,
    such as character/word limits and acronyms to preserve.
    """
    MIN_CHARS: int = 50
    MAX_CHARS: int = 50000
    MIN_WORDS: int = 10
    MIN_SENTENCES: int = 2
    MAX_TOKENS: int = 512

    # A set of acronyms to preserve inside parentheses or text
    ACRONYMS = {
        'ABA', 'ASD', 'NHS', 'GP', 'ADHD', 'IEP', 'OT', 'SLP', 'PT',
        'CBT', 'DBT', 'PDA', 'SPD', 'DLD', 'AAC', 'PECS'
    }


class TextPreprocessor:
    """
    A class that handles the main text preprocessing pipeline:
    - Cleaning artefacts
    - Handling acronyms
    - Tokenising and lemmatising
    - Applying bigram models
    - Truncating text if it exceeds a token limit
    """

    def __init__(self, model_name="bert-base-uncased"):
        # Initialise the config and tokenizer
        self.config = PreprocessingConfig()
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            clean_up_tokenization_spaces=False
        )
        self.logger = logging.getLogger(__name__)

    def preprocess_text(self, text: str) -> str:
        """
        Main entry point for cleaning, acronym handling, tokenisation, bigrams, etc.
        """
        if not text:
            return ""

        # Basic checks on length (characters, words, sentences)
        if len(text) < self.config.MIN_CHARS or len(text) > self.config.MAX_CHARS:
            return ""
        if len(text.split()) < self.config.MIN_WORDS:
            return ""
        sentences = re.split(r'[.!?]+', text)
        if len([s for s in sentences if s.strip()]) < self.config.MIN_SENTENCES:
            return ""

        # Normalise whitespace
        text = normalise_whitespace(text)
        
        # Clean pipeline-specific artefacts
        text = self._clean_artifacts(text)
        
        # Remove certain parenthetical content unless acronyms are detected
        text = clean_parenthetical(text, acronyms=self.config.ACRONYMS)
        
        # Perform tokenisation, lemmatisation, stopword filtering, and bigram handling
        text = tokenise_lemmatise_stopwords_bigrams(text)
        
        # Tidy punctuation spacing one more time
        text = clean_punctuation_spacing(text)
        
        # Normalise whitespace again in case of new changes
        text = normalise_whitespace(text)

        # Check if any suspicious patterns remain
        if not check_artefacts(text):
            self.logger.warning("Remaining artefacts detected in processed text: %s", text[:200])

        # Truncate tokens if exceeding maximum allowed
        text = self._truncate_tokens(text)

        return text.strip()

    def _clean_artifacts(self, text: str) -> str:
        """
        Additional cleaning of artefacts, such as block quotes, zero-width characters,
        HTML entities, Markdown links, etc.
        """
        # Convert to a standard Unicode form
        text = unicodedata.normalize('NFKC', text)
        
        # Replace stray encoding characters
        text = text.replace('�', "'")
        
        # Unescape any remaining HTML entities
        text = html.unescape(text)

        # Remove or substitute patterns as defined in the global ARTIFACT_PATTERNS
        text = re.sub(ARTIFACT_PATTERNS['quote_blocks'], r'\1', text, flags=re.MULTILINE)
        text = re.sub(ARTIFACT_PATTERNS['blockquotes_extra'], '', text, flags=re.MULTILINE)
        text = re.sub(ARTIFACT_PATTERNS['zero_width'], '', text)
        text = re.sub(ARTIFACT_PATTERNS['html_entities'], '', text)
        text = re.sub(ARTIFACT_PATTERNS['markdown_links'], r'\1', text)
        text = re.sub(ARTIFACT_PATTERNS['markdown_format'], '', text)

        # Replace remaining “smart quotes” with normal apostrophes
        text = re.sub(r"[“”‘’´`]", "'", text)
        text = text.replace('"', "'")

        # Consolidate multiple whitespace occurrences
        text = re.sub(ARTIFACT_PATTERNS['whitespace'], ' ', text)
        return text.strip()

    def _truncate_tokens(self, text: str) -> str:
        """
        Checks the token length using the model's tokenizer. If it exceeds MAX_TOKENS,
        truncate and optionally warn the user up to 5 times.
        """
        if not text:
            return ""
        
        # Encode the text for the model
        encoded = self.tokenizer.encode(text, add_special_tokens=True)
        if len(encoded) > self.config.MAX_TOKENS:
            # Check or create a counter attribute on the TextPreprocessor instance
            if not hasattr(self, "_trunc_warn_count"):
                self._trunc_warn_count = 0

            # Log a warning only for the first 5 truncations
            if self._trunc_warn_count < 5:
                self.logger.warning(
                    f"Text exceeded max tokens ({len(encoded)} > {self.config.MAX_TOKENS}). Truncating..."
                )
                self._trunc_warn_count += 1

            # Keep space for special tokens (e.g., CLS, SEP)
            max_len = self.config.MAX_TOKENS - 2
            truncated_enc = encoded[:max_len]
            truncated = self.tokenizer.decode(
                truncated_enc,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=False
            )

            # Attempt to end on a sentence boundary for smooth truncation
            sents = re.split(r'(?<=[.!?]) +', truncated)
            if len(sents) > 1:
                truncated = ' '.join(sents[:-1])
            return truncated.strip()
        
        return text.strip()


class RedditDataProcessor:
    """
    A class that processes Reddit-like data structures (posts, comments), 
    leveraging TextPreprocessor for thorough text cleaning and tokenisation.
    """

    def __init__(self, model_name="bert-base-uncased"):
        # Instantiate a TextPreprocessor with a specified model name
        self.text_processor = TextPreprocessor(model_name=model_name)

    def process_post(self, post: Dict) -> Dict:
        """
        Process a single post to produce processed versions of title, selftext, and combined text,
        and apply the same to each comment if present. Preserves post_id and comment_id fields.
        """
        processed = post.copy()

        # Preserve the post_id
        post_id = processed.get('post_id', '')
        if not post_id:
            logger.warning("Post found without post_id")

        raw_title = processed.get('title', '')
        raw_selftext = processed.get('selftext', '')

        # Preprocess title, selftext, and a combined version
        processed_title = self.text_processor.preprocess_text(raw_title)
        processed_selftext = self.text_processor.preprocess_text(raw_selftext)
        combined_raw = f"{raw_title} {raw_selftext}"
        processed_combined = self.text_processor.preprocess_text(combined_raw)

        processed['post_id'] = post_id  # Ensure post_id is preserved
        processed['title_processed'] = processed_title
        processed['selftext_processed'] = processed_selftext
        processed['combined_processed'] = processed_combined

        # If comments exist, process each one while preserving comment_ids
        if 'comments' in processed and isinstance(processed['comments'], list):
            for comment_dict in processed['comments']:
                # Preserve the comment_id
                comment_id = comment_dict.get('comment_id', '')
                if not comment_id:
                    logger.warning(f"Comment found without comment_id in post {post_id}")
                
                raw_comment = comment_dict.get('comment', '')
                comment_dict['comment_id'] = comment_id  # Ensure comment_id is preserved
                comment_dict['comment_processed'] = self.text_processor.preprocess_text(raw_comment)

        return processed

    def process_dataset(self, data: List[Dict]) -> List[Dict]:
        """
        Applies process_post to each element in a dataset (list of dictionaries).
        """
        return [self.process_post(post) for post in data]

print("Preprocessing configuration and classes have been defined.")


Preprocessing configuration and classes have been defined.
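
One consequence of the length checks at the top of `preprocess_text` is easy to miss: any text shorter than `MIN_CHARS`, with fewer than `MIN_WORDS` words, or with fewer than `MIN_SENTENCES` sentences is returned as an empty string, which will often include a post's title on its own. The sketch below isolates those checks under the default `PreprocessingConfig` values; the helper name `passes_length_gate` and the sample strings are hypothetical.

```python
import re

# Default thresholds from PreprocessingConfig.
MIN_CHARS, MAX_CHARS, MIN_WORDS, MIN_SENTENCES = 50, 50000, 10, 2

def passes_length_gate(text):
    """Return True if preprocess_text would continue past its length checks."""
    if not text:
        return False
    if len(text) < MIN_CHARS or len(text) > MAX_CHARS:
        return False
    if len(text.split()) < MIN_WORDS:
        return False
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    return len(sentences) >= MIN_SENTENCES

# Hypothetical inputs, invented for illustration.
title = "Advice on sensory-friendly headphones?"
body = "My son started school this year. He finds the noise in the classroom very hard."
one_sentence = "This is one long sentence with well over ten words but no second sentence at all"

print(passes_length_gate(title))         # False: fewer than 50 characters
print(passes_length_gate(body))          # True
print(passes_length_gate(one_sentence))  # False: only one sentence
```

If short titles should be retained, one option would be to rely on the combined title-plus-selftext field, which `process_post` already produces.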


## Main Function: Build Bigram Model, Process Data, and Save Output

This section orchestrates the entire flow:
1. Loads the aggregated raw Reddit data from JSON.
2. Builds a bigram model from minimal-tokenised text.
3. Processes the dataset using the `RedditDataProcessor`.
4. Prints statistics (top bigrams, top POS tags).
5. Saves the processed data to a final JSON output file (`lda_ready_data.json`).
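
For orientation, the sketch below shows the input shape that the code in this section reads, together with a small generator that yields the same raw texts used to build the bigram corpus (title plus selftext per post, then each comment). The field names come from the `.get(...)` calls in this notebook; the helper name `iter_raw_texts` and all values are made up for the example.

```python
# Hypothetical record; only the field names are taken from the notebook.
sample_post = {
    "post_id": "abc123",
    "title": "Noise at school",
    "selftext": "My son finds the classroom loud.",
    "comments": [
        {"comment_id": "c1", "comment": "Ear defenders helped us."},
    ],
}

def iter_raw_texts(posts):
    """Yield one combined string per post, then one string per comment."""
    for post in posts:
        yield f"{post.get('title', '')} {post.get('selftext', '')}"
        for comment in post.get("comments", []):
            yield comment.get("comment", "")

texts = list(iter_raw_texts([sample_post]))
print(texts)
```

Each yielded string corresponds to one "sentence" passed to the phrase model after minimal tokenisation.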

In [30]:
# File paths for your data folder
base_folder = r"C:\Users\laure\Desktop\dissertation_notebook"
data_folder = os.path.join(base_folder, "Data")
os.makedirs(data_folder, exist_ok=True)

aggregated_path = os.path.join(data_folder, "aggregated_raw_reddit_data.json")

def main():
    """
    Main entry point for building bigram model, processing Reddit data, and saving the results.

    Steps:
    1. Reads the aggregated raw data from JSON.
    2. Performs minimal tokenisation on titles, selftexts, and comments to build a bigram model.
    3. Prints the top 20 bigrams found in the corpus, along with a POS distribution.
    4. Uses RedditDataProcessor for thorough text preprocessing (artefact cleaning, 
       tokenisation, lemmatisation, bigram detection, etc.).
    5. Saves the final processed data to 'lda_ready_data.json'.
    Now preserves post_id and comment_id fields throughout processing.
    """
    # Use aggregated_path as the input to the pipeline
    input_path = Path(aggregated_path)
    
    # Output path in the same Data folder
    output_path = Path(os.path.join(data_folder, "lda_ready_data.json"))
    output_path.parent.mkdir(parents=True, exist_ok=True)

    try:
        # Load raw aggregated data from JSON
        with open(input_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        # Build bigram model from minimal-tokenised data
        all_texts = []
        for post in data:
            # Get IDs for logging purposes
            post_id = post.get('post_id', 'unknown')
            
            raw_title = post.get('title', '')
            raw_selftext = post.get('selftext', '')
            combined_raw = f"{raw_title} {raw_selftext}"

            # Tokenise the combined post text using minimal_tokenise
            post_tokens = minimal_tokenise(combined_raw)
            if post_tokens:
                all_texts.append(post_tokens)

            # Tokenise each comment if present
            for c in post.get('comments', []):
                comment_id = c.get('comment_id', 'unknown')
                c_text = c.get('comment', '')
                c_tokens = minimal_tokenise(c_text)
                if c_tokens:
                    all_texts.append(c_tokens)

        # Create a global bigram model using Gensim Phrases & Phraser
        global bigram_model
        phrases = Phrases(all_texts, min_count=5, threshold=20)
        bigram_model = Phraser(phrases)

        # Display top 20 bigrams with their scores
        bigram_freqs = bigram_model.phrasegrams
        sorted_bigrams = sorted(bigram_freqs.items(), key=lambda x: x[1], reverse=True)
        print("Top 20 Bigrams:")
        for i, (bg, score) in enumerate(sorted_bigrams[:20]):
            print(f"{i+1}. {bg} - Score: {score}")

        # Process the dataset using our RedditDataProcessor
        processor = RedditDataProcessor(model_name="bert-base-uncased")
        processed_data = processor.process_dataset(data)

        # Display top 20 part-of-speech tags encountered
        global pos_distribution
        print("\nPOS Distribution (Top 20):")
        for tag, count in pos_distribution.most_common(20):
            print(f"{tag}: {count}")

        # Write out the final JSON result
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(processed_data, f, indent=2, ensure_ascii=False)

        logger.info("Successfully processed posts with threshold checks, POS tagging, bigrams, and re-ordered fields.")
    except Exception as e:
        logger.error(f"Processing failed: {e}")
        raise

if __name__ == "__main__":
    main()

print("Bigram model successfully built, dataset processed, and output saved.")

Top 20 Bigrams:
1. tik_tok - Score: 12281.454545454546
2. vice_versa - Score: 11258.0
3. grain_salt - Score: 8405.973333333333
4. facial_expressions - Score: 6754.8
5. sqs_informant - Score: 6433.142857142857
6. sliding_scale - Score: 6433.142857142857
7. advantages_disadvantages - Score: 6332.625
8. et_al - Score: 6259.274131274131
9. steph_jones - Score: 5514.122448979591
10. suicidal_ideation - Score: 5081.9811912225705
11. pros_cons - Score: 4946.306636155607
12. rabbit_hole - Score: 4912.581818181818
13. trial_error - Score: 4890.352941176471
14. status_quo - Score: 4824.857142857142
15. blah_blah - Score: 4796.307692307692
16. cancelling_headphones - Score: 4658.482758620689
17. mutually_exclusive - Score: 4503.2
18. daddy_daddy - Score: 4221.75
19. operant_conditioning - Score: 4045.84375
20. political_climate - Score: 3752.6666666666665


Token indices sequence length is longer than the specified maximum sequence length for this model (553 > 512). Running this sequence through the model will result in indexing errors



POS Distribution (Top 20):
NN: 167588
JJ: 88004
RB: 31871
VBG: 23928
VBD: 22437
VBP: 15925
VBN: 12850
VB: 11634
IN: 7483
NNS: 6173
CD: 3791
VBZ: 2365
MD: 2211
JJR: 1928
RBR: 1582
JJS: 1349
DT: 1343
FW: 528
PRP: 521
CC: 486
Bigram model successfully built, dataset processed, and output saved.
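
The bigram scores above are in the thousands, far beyond the `threshold=20` passed to `Phrases`. A rough sketch of why, assuming Gensim's default scorer (documented as `(bigram_count - min_count) / worda_count / wordb_count * len_vocab`): pairs of rare words that almost always appear together score very high, while pairs of frequent words score low even when they co-occur often. The function below restates that formula in plain Python; all counts are hypothetical and are not taken from this corpus.

```python
def default_phrase_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Plain-Python restatement of the default scoring formula (assumed, see lead-in)."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

# Hypothetical counts, chosen only to contrast the two cases.
rare_pair = default_phrase_score(
    worda_count=12, wordb_count=11, bigram_count=11, len_vocab=50_000, min_count=5
)
common_pair = default_phrase_score(
    worda_count=4_000, wordb_count=3_000, bigram_count=40, len_vocab=50_000, min_count=5
)

THRESHOLD = 20  # same value as passed to Phrases in main()
print(rare_pair > THRESHOLD)    # True: accepted as a phrase
print(common_pair > THRESHOLD)  # False: rejected
```

This is consistent with the top of the list being dominated by fixed expressions such as `vice_versa` and `et_al`, which is why several of them are listed in `NON_RELEVANT_BIGRAMS`.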


## Conclusion

This notebook implements an end-to-end pipeline that prepares Reddit data for topic modelling and other NLP tasks. It defines cleaning patterns and stopwords, builds a bigram model with Gensim, and then cleans, lemmatises, POS-filters, and tokenises each post's title, selftext, and comments. The result, `lda_ready_data.json`, keeps the original fields and IDs alongside the processed text fields, which can then be used for subsequent topic or text analysis.

## References

**Reference:**  
Wolf, T., Debut, L., Sanh, V., et al. (2024) *Transformers v4.47.1* [computer program].  
Available from: [https://huggingface.co/docs/transformers](https://huggingface.co/docs/transformers) [Accessed 25 May 2024].

**Git Repo:**  
- [Transformers GitHub](https://github.com/huggingface/transformers)

**Reference:**  
Devlin, J., Chang, M. W., Lee, K., and Toutanova, K. (2018) *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*.  
Available from: [https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805) [Accessed 28 May 2024].

**Model:**  
- [BERT-Base, Uncased](https://huggingface.co/bert-base-uncased)

**Reference:**  
Bird, S., Klein, E., and Loper, E. (2009) *Natural Language Processing with Python*. O'Reilly Media Inc.  
Available from: [https://www.nltk.org/](https://www.nltk.org/) [Accessed 28 May 2024].

**Git Repo:**  
- [NLTK GitHub](https://github.com/nltk/nltk)

**Reference:**  
Řehůřek, R., and Sojka, P. (2010) *Software Framework for Topic Modelling with Large Corpora*. In *Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks*.  
Available from: [https://radimrehurek.com/gensim/](https://radimrehurek.com/gensim/) [Accessed 3 June 2024].

**Git Repo:**  
- [Gensim GitHub](https://github.com/RaRe-Technologies/gensim)

**Reference:**  
Pandas Development Team (2024) *pandas: Powerful data structures for data analysis v2.2.3* [computer program].  
Available from: [https://pandas.pydata.org/](https://pandas.pydata.org/) [Accessed 11 May 2024].

**Git Repo:**  
- [Pandas GitHub](https://github.com/pandas-dev/pandas)
