# Stratified Sampler Demo

This notebook demonstrates the `StratifiedSampler` class from the `stratified_sampler.py` module. Each example pairs executable code with its output and is adapted from `test_stratified_sampler.py`.

## Feature Summary

The `StratifiedSampler` class is designed for text data sampling and analysis, offering the following features:

- **Flexible Sampling**: Initialize with a fraction (e.g., 0.8 for 80% of data) or a fixed number of samples, with validation for valid inputs.
- **N-Gram Filtering**: Filters and sorts sentences by n-grams to select diverse or representative text based on starting n-grams.
- **Stratified Sampling**: Ensures samples maintain the distribution of categories (e.g., linguistic features) using stratified splitting.
- **Unique String Extraction**: Retrieves unique text entries while preserving stratification.
- **Linguistic Categorization**: Labels data by token-type ratio (TTR), sentence length, n-gram diversity, and starting n-grams, using quantile-based classification.
- **Performance Tracking**: Includes progress bars (`tqdm`) and timing decorators (`time_it`) that report progress and run time during processing.
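
As a rough illustration of the token-type ratio (TTR) named in the list above: TTR is the number of distinct tokens divided by the total number of tokens. The sketch below assumes whitespace tokenization and lowercasing. The helper name `type_token_ratio` is hypothetical and is not taken from `stratified_sampler.py`, whose own tokenizer may behave differently.

```python
def type_token_ratio(sentence: str) -> float:
    """Return distinct tokens / total tokens (0.0 for empty input)."""
    # Assumption: tokens are whitespace-separated and case-insensitive.
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


print(type_token_ratio("The quick brown fox"))  # 1.0 (no repeated word)
print(type_token_ratio("the cat saw the dog"))  # 0.8 ("the" appears twice)
```

A sentence with no repeated words scores 1.0, so short sentences tend to cluster at the top of the range.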

### When to Use

- **Text Data Sampling**: Ideal for selecting representative subsets of text data (e.g., sentences) while maintaining diversity or category distribution.
- **Linguistic Analysis**: Useful for categorizing text based on features like TTR or n-gram patterns in NLP tasks.
- **Data Preprocessing**: Suitable for preparing balanced datasets for machine learning, especially in text classification or translation.
- **Small to Medium Datasets**: Works well when processing text datasets that fit in memory and require stratification.

### When Not to Use

- **Non-Text Data**: Not designed for numerical or structured data without text components.
- **Large-Scale Data**: May be inefficient for very large datasets due to in-memory n-gram processing and sorting.
- **Simple Random Sampling**: Overkill if stratification or linguistic features are not needed; use simpler methods instead.
- **Real-Time Applications**: Not optimized for low-latency or streaming data processing.

## Initializing StratifiedSampler with Float num_samples

This example shows how to initialize a `StratifiedSampler` with a float `num_samples` (e.g., 0.5 for 50% of the data). Instead of mocking `get_unique_words`, the example passes the input data directly to show how the sample size is calculated.

In [3]:
import os

# Set the current working directory to the project directory
os.chdir('/Users/jethroestrada/Desktop/External_Projects/Jet_Projects/JetScripts/test/xturing-jet-examples/')
print(f"Current working directory: {os.getcwd()}")

Current working directory: /Users/jethroestrada/Desktop/External_Projects/Jet_Projects/JetScripts/test/xturing-jet-examples


In [4]:
from helpers.stratified_sampler import StratifiedSampler

sample_data = [
    "The quick brown fox",
    "A fast red dog runs",
    "Slow green turtle walks"
]

sampler = StratifiedSampler(sample_data, num_samples=0.5)
print(f"Number of samples: {sampler.num_samples}")
print(f"Data: {sampler.data}")

Number of samples: 1
Data: ['A fast red dog runs', 'Slow green turtle walks', 'The quick brown fox']
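
The reported sample size of 1 for a fraction of 0.5 over three sentences is consistent with truncating the product to an integer. The sketch below shows one way such a conversion could work. The function name `resolve_num_samples` and the validation rules are assumptions for illustration, not the class's actual implementation.

```python
def resolve_num_samples(num_samples, data_size):
    """Turn a fraction or an absolute count into an integer sample size."""
    if isinstance(num_samples, bool) or not isinstance(num_samples, (int, float)):
        raise TypeError("num_samples must be an int or a float")
    if isinstance(num_samples, float):
        # Assumption: a float must be a fraction strictly between 0 and 1.
        if not 0.0 < num_samples < 1.0:
            raise ValueError("a float num_samples must lie between 0 and 1")
        return int(data_size * num_samples)  # truncates toward zero
    # Assumption: an int must be positive and is capped at the data size.
    if num_samples <= 0:
        raise ValueError("an int num_samples must be positive")
    return min(num_samples, data_size)


print(resolve_num_samples(0.5, 3))  # 1
print(resolve_num_samples(2, 3))    # 2
```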


## Filtering Strings with N-Grams

This example demonstrates the `filter_strings` method, which filters sentences based on n-grams. Instead of mocking `filter_and_sort_sentences_by_ngrams`, the cell swaps in a simplified stand-in that returns the first `top_n` sentences.

In [5]:
from helpers.stratified_sampler import StratifiedSampler

sample_data = [
    "The quick brown fox",
    "A fast red dog runs",
    "Slow green turtle walks"
]

# Simplified replacement for filter_and_sort_sentences_by_ngrams
def simple_filter(sentences, n, top_n, is_start_ngrams=True):
    return sentences[:top_n]

# Override the function for this example
import helpers.stratified_sampler
helpers.stratified_sampler.filter_and_sort_sentences_by_ngrams = simple_filter

sampler = StratifiedSampler(sample_data, num_samples=2)
result = sampler.filter_strings(n=2, top_n=2)
print(f"Filtered strings: {result}")

Filtered strings: ['A fast red dog runs', 'Slow green turtle walks']
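
Because the stand-in above only slices the list, it does not show what filtering by starting n-grams might look like. The sketch below is one possible approach, not the real `filter_and_sort_sentences_by_ngrams`: it groups sentences by their first `n` words and then takes one sentence per group in turn, so that no single opening dominates the result. The function name `pick_by_starting_ngram` is hypothetical.

```python
from collections import defaultdict


def pick_by_starting_ngram(sentences, n=2, top_n=2):
    """Group sentences by their first n words, then pick round-robin."""
    groups = defaultdict(list)
    for sentence in sentences:
        key = " ".join(sentence.lower().split()[:n])
        groups[key].append(sentence)

    picked = []
    queues = list(groups.values())  # groups keep first-seen order
    # Take one sentence from each group per pass until top_n is reached.
    while len(picked) < top_n and any(queues):
        for queue in queues:
            if queue and len(picked) < top_n:
                picked.append(queue.pop(0))
    return picked


sentences = [
    "The quick brown fox",
    "The quick red fox",
    "A fast red dog runs",
]
print(pick_by_starting_ngram(sentences, n=2, top_n=2))
# ['The quick brown fox', 'A fast red dog runs']
```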


## Getting Stratified Samples

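This example shows the `get_samples` method for stratified sampling. Instead of mocking `train_test_split`, the cell replaces `get_samples` on the instance with a function that returns a hand-picked selection, so the output shows those hand-picked records rather than the result of the real stratification logic.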

In [11]:
from helpers.stratified_sampler import StratifiedSampler, ProcessedData
from jet.transformers.formatters import format_json

sample_processed_data = []
for source, target, categories, score in [
    ("The quick brown fox", "jumps", ["ttr_q1", "q1"], 0.9),
    ("A fast red dog runs", "barks", ["ttr_q2", "q2"], 0.8),
    ("Slow green turtle walks", "crawls", ["ttr_q1", "q1"], 0.7)
]:
    item = ProcessedData()
    item.source = source
    item.target = target
    item.category_values = categories
    item.score = score
    sample_processed_data.append(item)

# Simplified manual selection instead of train_test_split
selected_samples = [
    ("The quick brown fox", "jumps"),
    ("A fast red dog runs", "barks")
]

sampler = StratifiedSampler(sample_processed_data, num_samples=2)
# Override get_samples to use manual selection
def simple_get_samples(self):
    score_map = {(item.source, item.target): item.score for item in self.data}
    stratified_samples = [
        {"source": ft[0], "target": ft[1], "score": score_map[ft]}
        for ft in selected_samples
    ]
    return stratified_samples

sampler.get_samples = simple_get_samples.__get__(sampler, StratifiedSampler)
result = sampler.get_samples()
print(f"Stratified samples:\n{format_json(result)}")

Stratified samples:
[
  {
    "source": "The quick brown fox",
    "target": "jumps",
    "score": 0.9
  },
  {
    "source": "A fast red dog runs",
    "target": "barks",
    "score": 0.8
  }
]
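
For comparison with the manual selection above, the following sketch shows the general idea of stratified sampling using only the standard library: each category contributes items roughly in proportion to its share of the input. The notebook text indicates that the real method relies on `train_test_split`. The function `stratified_sample` here is a hypothetical stand-in and may return slightly more or fewer than `k` items because of rounding.

```python
import random
from collections import defaultdict


def stratified_sample(items, labels, k, seed=0):
    """Pick about k items, keeping each label's share close to the input's."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    total = len(items)
    sample = []
    for members in by_label.values():
        # Proportional allocation, with at least one item per stratum.
        quota = max(1, round(k * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


items = ["a1", "a2", "a3", "a4", "b1", "b2"]
labels = ["ttr_q1"] * 4 + ["ttr_q2"] * 2
picked = stratified_sample(items, labels, k=3)
print(picked)  # two items from ttr_q1 and one from ttr_q2
```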


## Getting Unique Strings

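This example demonstrates the `get_unique_strings` method. Instead of mocking `get_words`, `n_gram_frequency`, and `quantile`, the cell swaps in simplified functions. It also replaces `get_unique_strings` on the instance, so the output is simply the first `num_samples` entries of `sampler.data`.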

In [8]:
from helpers.stratified_sampler import StratifiedSampler, ProcessedDataString

sample_data = [
    "The quick brown fox",
    "A fast red dog runs",
    "Slow green turtle walks"
]

# Simplified replacements
def simple_get_words(sentence):
    return sentence.split()

def simple_n_gram_frequency(sentence, n=2):
    words = sentence.split()
    return {f"{words[i]} {words[i+1]}": 1 for i in range(len(words)-1)}

def simple_quantile(values, quantiles):
    return [min(values) + i for i in range(2)]

# Override functions
import helpers.stratified_sampler
helpers.stratified_sampler.get_words = simple_get_words
helpers.stratified_sampler.n_gram_frequency = simple_n_gram_frequency
import numpy
# Caution: this replaces numpy.quantile for the whole kernel session
numpy.quantile = simple_quantile

sampler = StratifiedSampler(sample_data, num_samples=2)
# Simplified get_unique_strings
def simple_get_unique_strings(self):
    return self.data[:self.num_samples]

sampler.get_unique_strings = simple_get_unique_strings.__get__(sampler, StratifiedSampler)
result = sampler.get_unique_strings()
print(f"Unique strings: {result}")

Unique strings: ['A fast red dog runs', 'Slow green turtle walks']
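
The cells in this notebook assign replacement functions directly onto modules, and those assignments stay in effect for the rest of the kernel session. A scoped alternative is `unittest.mock.patch.object`, which restores the original attribute when the `with` block exits. The sketch below uses `math.sqrt` as a stand-in target so that it runs without the project installed. The same pattern would apply to a module-level function such as `get_words`, assuming the module looks the name up at call time.

```python
import math
from unittest import mock


def fake_sqrt(x):
    """Stand-in that makes the patch easy to spot."""
    return -1.0


# The replacement is active only inside the with block.
with mock.patch.object(math, "sqrt", fake_sqrt):
    inside = math.sqrt(4)

outside = math.sqrt(4)  # the original function is back
print(inside, outside)  # -1.0 2.0
```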


## Loading Data with Labels

This example shows the `load_data_with_labels` method, categorizing data by linguistic features. Simplified functions replace mocks for `get_words`, `n_gram_frequency`, and `quantile`.

In [None]:
from helpers.stratified_sampler import StratifiedSampler
from jet.transformers.formatters import format_json

sample_data = [
    "The quick brown fox",
    "A fast red dog runs",
    "Slow green turtle walks"
]

# Simplified replacements
def simple_get_words(sentence):
    return sentence.split()

def simple_n_gram_frequency(sentence, n=2):
    words = sentence.split()
    return {f"{words[i]} {words[i+1]}": 1 for i in range(len(words)-1)}

def simple_quantile(values, quantiles):
    return [min(values) + i for i in range(2)]

# Override functions
import helpers.stratified_sampler
helpers.stratified_sampler.get_words = simple_get_words
helpers.stratified_sampler.n_gram_frequency = simple_n_gram_frequency
import numpy
numpy.quantile = simple_quantile

sampler = StratifiedSampler(sample_data, num_samples=2)
result = sampler.load_data_with_labels(max_q=2)
print(f"Processed data: {format_json([item.__dict__ for item in result])}")

TTR Class Distribution: {'ttr_q2': 1, 'ttr_q1': 2}
Sentence Length Distribution: {'q2': 1, 'q1': 2}
N-Gram Diversity Distribution: {'ngram_q2': 1, 'ngram_q1': 2}
Starting N-Gram Distribution: {'q1': 3}
[0m[1m[38;5;213ma_with_labels:[0m [1m[38;5;45m1s
load_data_with_labels[0m [1m[38;5;15mtook[0m [1m[38;5;40m1s
[0m
Processed data: [
  {
    "source": "A fast red dog runs",
    "category_values": [
      "ttr_q2",
      "q2",
      "ngram_q2",
      "q1"
    ]
  },
  {
    "source": "Slow green turtle walks",
    "category_values": [
      "ttr_q1",
      "q1",
      "ngram_q1",
      "q1"
    ]
  },
  {
    "source": "The quick brown fox",
    "category_values": [
      "ttr_q1",
      "q1",
      "ngram_q1",
      "q1"
    ]
  }
]
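
The labels above (`q1`, `q2`, `ttr_q1`, and so on) come from quantile-based classification. The sketch below shows one way to assign such labels with the standard library: compute the cut points between buckets, then find the bucket each value falls into. The function `quantile_labels` is hypothetical. Because the notebook patches `numpy.quantile`, the real method presumably computes its thresholds with NumPy, so exact boundaries may differ.

```python
from bisect import bisect_left
from statistics import quantiles


def quantile_labels(values, max_q=2, prefix="q"):
    """Label each value with the 1-based quantile bucket it falls into."""
    if max_q < 2 or len(values) < 2:
        return [f"{prefix}1" for _ in values]
    # quantiles() returns the max_q - 1 cut points between buckets.
    cuts = quantiles(values, n=max_q, method="inclusive")
    # Assumption: a value equal to a cut point goes in the lower bucket.
    return [f"{prefix}{bisect_left(cuts, v) + 1}" for v in values]


lengths = [4, 5, 4]  # word counts of three short sentences
print(quantile_labels(lengths, max_q=2))  # ['q1', 'q2', 'q1']
```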
