# AsTRiQue Participant Showcase

This notebook is a showcase of [AsTRiQue](https://github.com/prokophanzl/AsTRiQue)'s workflow with a human participant.

### 📊 Dataset
The showcase uses data from Bořil (YEAR), who investigated the categorization of Czech sibilants /s/ vs. /z/ and /ʃ/ vs. /ʒ/ as a function of two acoustic parameters: voicing (quantified as the percentage of the segment exhibiting periodic vocal fold vibration) and segmental duration (in ms). For the purposes of this showcase, /s/ and /ʃ/ were pooled into one category, as were /z/ and /ʒ/.
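
As a rough illustration of how the four sibilants collapse into two response categories, the sketch below maps each one onto the binary labels used later in this notebook. The keys of `POOLING` and the helper name `pool_label` are assumptions made for illustration; only the `'s'`/`'z'` to `0`/`1` mapping (`LABEL_MAPPING`) appears in the notebook's own code, and the dataset's actual coding may differ.

```python
# Hypothetical grouping of the four sibilants into two response categories.
# These keys are illustrative; the dataset's actual coding may differ.
POOLING = {'s': 's', 'ʃ': 's', 'z': 'z', 'ʒ': 'z'}

# Binary label mapping used by the notebook's code (LABEL_MAPPING).
LABEL_MAPPING = {'s': 0, 'z': 1}


def pool_label(sibilant):
    """Return the binary label (0 or 1) for one of the four sibilants."""
    return LABEL_MAPPING[POOLING[sibilant]]


labels = [pool_label(c) for c in ['s', 'ʃ', 'z', 'ʒ']]
print(labels)
```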

### 🔄 Customization Tips
* See how `MODEL_CERTAINTY_CUTOFF` affects the number of samples collected and prediction quality
* See how `CLEANSER_FREQUENCY` affects fatigue (by preventing long stretches of ambiguous stimuli)
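
To make the effect of `MODEL_CERTAINTY_CUTOFF` more concrete, here is a minimal sketch of the stopping rule as the code further down applies it: a stimulus's uncertainty is `1 - max(p, 1 - p)`, and sampling stops once every unanswered stimulus has an uncertainty at or below `1 - MODEL_CERTAINTY_CUTOFF` and the minimum iteration count has been reached. The function names and the example probabilities are made up for illustration.

```python
def uncertainty(p):
    """Uncertainty of a binary prediction with class-1 probability p."""
    return 1 - max(p, 1 - p)


def should_stop(probs, cutoff, iteration, min_iterations):
    """True when every remaining stimulus is predicted with enough certainty."""
    all_certain = all(uncertainty(p) <= 1 - cutoff for p in probs)
    return all_certain and iteration >= min_iterations


# Hypothetical class-1 probabilities for the unanswered stimuli.
probs = [0.02, 0.97, 0.90, 0.01]

print(should_stop(probs, cutoff=0.85, iteration=30, min_iterations=30))
print(should_stop(probs, cutoff=0.95, iteration=30, min_iterations=30))
```

With these numbers the first call returns `True` and the second returns `False`, because the stimulus predicted at 0.90 is certain enough for a cutoff of 0.85 but not for 0.95. A higher cutoff therefore tends to keep the participant answering for longer.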


TODO: Bořil citation


## Prerequisites

First, you'll need to clone the AsTRiQue Git repository and install dependencies.

In [None]:
!git clone https://github.com/prokophanzl/AsTRiQue/

!pip install numpy pandas matplotlib scikit-learn playsound

## Config

You can tweak the model's configuration here.

| constant | description |
| - | - |
| `INIT_RANDOM_SAMPLES` | Number of randomly selected stimuli presented before active learning starts. |
| `MIN_ITERATIONS` | The minimum number of iterations the model will go through before it is allowed to stop. The count includes the initial random samples. |
| `CLEANSER_FREQUENCY` | If this value is set to `n > 0`, every `n`th active-learning iteration will select a high-certainty sample (instead of a low-certainty one) to prevent participant fatigue. Set to `0` to disable. |
| `MODEL_CERTAINTY_CUTOFF` | Once the model reaches this prediction certainty for all stimuli the oracle (participant) has not answered yet, and at least `MIN_ITERATIONS` iterations have run, it will stop. |

In [None]:
# ======================
# CONFIG (tweak this and see what happens!)
# ======================

INIT_RANDOM_SAMPLES = 10      # initial random samples to collect
MIN_ITERATIONS = 30           # minimum number of iterations
CLEANSER_FREQUENCY = 8        # insert a high-certainty sample every nth iteration to prevent participant fatigue; 0 to disable
MODEL_CERTAINTY_CUTOFF = 0.95 # stopping certainty threshold
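
The interplay between `INIT_RANDOM_SAMPLES` and `CLEANSER_FREQUENCY` can be sketched without audio or a trained model. The snippet below mirrors the schedule used in the code further down, where iteration counting starts at `INIT_RANDOM_SAMPLES` and every `CLEANSER_FREQUENCY`th active-learning iteration picks the most certain stimulus instead of the most uncertain one. The helper names `is_cleanser` and `select_index`, and the example probabilities, are illustrative assumptions.

```python
def is_cleanser(iteration, init_random_samples, cleanser_frequency):
    """True if this active-learning iteration should present an easy stimulus."""
    if cleanser_frequency <= 0:
        return False
    return (iteration - init_random_samples + 1) % cleanser_frequency == 0


def select_index(probs, cleanser):
    """Index of the most certain (cleanser) or most uncertain stimulus."""
    uncertainties = [1 - max(p, 1 - p) for p in probs]
    target = min(uncertainties) if cleanser else max(uncertainties)
    # First match only; the notebook picks randomly among tied stimuli.
    return uncertainties.index(target)


INIT_RANDOM_SAMPLES = 10
CLEANSER_FREQUENCY = 8

# Which of iterations 10..33 act as cleansers under these settings.
cleansers = [i for i in range(10, 34)
             if is_cleanser(i, INIT_RANDOM_SAMPLES, CLEANSER_FREQUENCY)]
print(cleansers)

# Hypothetical class-1 probabilities for four unanswered stimuli.
probs = [0.55, 0.99, 0.20, 0.48]
print(select_index(probs, cleanser=False))
print(select_index(probs, cleanser=True))
```

With the default values, iterations 17, 25, and 33 are cleansers. For the example probabilities, the regular step picks index 3 (closest to 0.5) and the cleanser step picks index 1 (closest to 0 or 1).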

## Run the model

Once you've set everything up, you can run the experiment!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)
from playsound import playsound
import os
import time

# ======================
# CONSTANTS
# ======================

PREDICTOR1 = 'voicing'                # first predictor column name
PREDICTOR2 = 'duration'               # second predictor column name
TARGET = 'answer'                     # target column name
FILENAME_COL = 'filename'             # filename column name
LABEL_MAPPING = {'s': 0, 'z': 1}      # binary output label mapping
DATA_PATH = 'AsTRiQue/data/data.csv'  # sound info data file path
AUDIO_FOLDER = 'AsTRiQue/data/audio'  # audio file directory
PROCESSED_PATH = 'data_processed.csv' # processed data file path; leave blank to disable


# ======================
# DATA LOADING
# ======================

data = pd.read_csv(DATA_PATH)

# remove rows with missing audio files; take an explicit copy to avoid pandas' SettingWithCopyWarning on later assignments
data = data[data[FILENAME_COL].apply(lambda f: os.path.exists(os.path.join(AUDIO_FOLDER, f)))].copy()

# initialize columns
if 'answered' not in data.columns:
    data['answered'] = False
if TARGET not in data.columns:
    data[TARGET] = np.nan
data['used_for_training'] = False


# ======================
# HELPER FUNCTIONS
# ======================

def get_human_response(filename, wait_for_enter=False):
    """Plays audio and waits for valid human response ('s' or 'z'). Replays on invalid input."""
    filepath = os.path.join(AUDIO_FOLDER, filename)
    if not os.path.exists(filepath):
        print(f"Missing file: {filepath}. Skipping.")
        return None

    while True:
        if wait_for_enter:
            input(f"\nReady to hear the sound '{filename}'? Press Enter to play...")

        try:
            playsound(filepath)
        except Exception as e:
            print(f"Error playing sound: {e}")
            return None

        response = input("Enter your response ('s' or 'z'): ").strip().lower()
        if response in LABEL_MAPPING:
            return LABEL_MAPPING[response]
        else:
            print("Invalid input. Please enter 's' or 'z'. Replaying sound...")
            wait_for_enter = False  # skip Enter prompt on replays

def calculate_uncertainty(probs):
    """Returns uncertainty in [0, 0.5]: 0 for a fully confident prediction, 0.5 for a coin flip."""
    return 1 - np.maximum(probs, 1 - probs)

def plot_results(answered_data, unanswered_data, model):
    """Plots answered and evaluation samples over the model's predicted probability surface."""
    plt.figure(figsize = (10, 6))

    if answered_data[TARGET].dtype == 'object':
        answered_data = answered_data.copy()
        answered_data[TARGET] = answered_data[TARGET].map(LABEL_MAPPING)

    if not answered_data.empty:
        scatter = plt.scatter(
            answered_data[PREDICTOR1],
            answered_data[PREDICTOR2],
            c = answered_data[TARGET],
            cmap = 'coolwarm',
            edgecolors = 'k',
            label = 'Answered (training)',
            vmin = 0, vmax = 1
        )

    if not unanswered_data.empty:
        plt.scatter(
            unanswered_data[PREDICTOR1],
            unanswered_data[PREDICTOR2],
            c = 'gray', alpha = 0.5, label = 'Evaluation samples'
        )

    x_min, x_max = data[PREDICTOR1].min() - 1, data[PREDICTOR1].max() + 1
    y_min, y_max = data[PREDICTOR2].min() - 1, data[PREDICTOR2].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))

    grid_points = pd.DataFrame(
        np.c_[xx.ravel(), yy.ravel()],
        columns = [PREDICTOR1, PREDICTOR2]
    )
    Z = model.predict_proba(grid_points)[:, 1].reshape(xx.shape)

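    contour = plt.contourf(xx, yy, Z, alpha = 0.3, levels = 20, cmap = 'coolwarm')
    # attach the colorbar to the probability surface; `scatter` is undefined when no samples were answered
    plt.colorbar(contour, label = 'Predicted Probability')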
    plt.xlabel(PREDICTOR1)
    plt.ylabel(PREDICTOR2)
    plt.title('Human Experiment Results')
    plt.legend()
    plt.show()


# ======================
# EXPERIMENT EXECUTION
# ======================

start_time = time.time()
print("Starting initial random sampling...")

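def collect_sample(mark_training = True, first = False):
    """Presents one random unanswered stimulus and stores the response. Returns False if no stimulus is left or no response was collected."""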
    unanswered = data[~data['answered']]
    if unanswered.empty:
        return False

    sample = unanswered.sample(1)
    filename = sample[FILENAME_COL].values[0]
    answer = get_human_response(filename, wait_for_enter = first)
    if answer is not None:
        data.loc[data[FILENAME_COL] == filename, TARGET] = answer
        data.loc[data[FILENAME_COL] == filename, 'answered'] = True
        if mark_training:
            data.loc[data[FILENAME_COL] == filename, 'used_for_training'] = True
        return True
    return False

# initial random samples
samples_collected = 0
while samples_collected < INIT_RANDOM_SAMPLES:
    if not collect_sample(first = (samples_collected == 0)):
        break
    samples_collected += 1

# ensure at least two classes
answered_data = data[data['answered']]
unique_classes = answered_data[TARGET].dropna().unique()
while len(unique_classes) < 2 and not data[~data['answered']].empty:
    if not collect_sample():
        break
    answered_data = data[data['answered']]
    unique_classes = answered_data[TARGET].dropna().unique()

if len(unique_classes) < 2:
    # raise instead of calling exit(), which would shut down the notebook kernel
    raise SystemExit("Only one class after initial sampling. Exiting.")

print("\nStarting active learning phase...")
iteration = INIT_RANDOM_SAMPLES

while True:
    answered_data = data[data['answered']]
    X_train = answered_data[[PREDICTOR1, PREDICTOR2]]
    y_train = answered_data[TARGET]

    if len(y_train.unique()) < 2:
        print("Not enough class diversity to train.")
        break

    model = LogisticRegression()
    model.fit(X_train, y_train)

    unanswered_data = data[~data['answered']]
    if unanswered_data.empty:
        print("All samples answered.")
        break

    X_unanswered = unanswered_data[[PREDICTOR1, PREDICTOR2]].copy()
    probs = model.predict_proba(X_unanswered)[:, 1]
    uncertainties = calculate_uncertainty(probs)

    if np.all(uncertainties <= (1 - MODEL_CERTAINTY_CUTOFF)) and iteration >= MIN_ITERATIONS:
        print(f"\nCertainty threshold {MODEL_CERTAINTY_CUTOFF} met after {iteration} iterations.")
        break

    if CLEANSER_FREQUENCY > 0 and (iteration - INIT_RANDOM_SAMPLES + 1) % CLEANSER_FREQUENCY == 0:
        # most certain sample
        min_uncertainty = uncertainties.min()
        candidates = unanswered_data.iloc[(uncertainties == min_uncertainty).nonzero()[0]]
        selected_sample = candidates.sample(1)
        selected_filename = selected_sample[FILENAME_COL].values[0]
        print(f"Iteration {iteration}: CLEANSER - selected most certain sample '{selected_filename}' (uncertainty: {min_uncertainty:.3f})")
    else:
        # most uncertain sample
        max_uncertainty = uncertainties.max()
        candidates = unanswered_data.iloc[(uncertainties == max_uncertainty).nonzero()[0]]
        selected_sample = candidates.sample(1)
        selected_filename = selected_sample[FILENAME_COL].values[0]
        print(f"Iteration {iteration}: selected most uncertain sample '{selected_filename}' (uncertainty: {max_uncertainty:.3f})")

    # play audio and get response
    response = get_human_response(selected_filename)
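    if response is None:
        print(f"Skipping '{selected_filename}' due to playback error.")
        # drop the unplayable stimulus so it is not selected again on the next pass (avoids an endless loop)
        data = data[data[FILENAME_COL] != selected_filename].copy()
        continue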

    # store response
    data.loc[data[FILENAME_COL] == selected_filename, TARGET] = response
    data.loc[data[FILENAME_COL] == selected_filename, 'answered'] = True
    data.loc[data[FILENAME_COL] == selected_filename, 'used_for_training'] = True

    iteration += 1

# evaluation-only phase (do not mark these as used_for_training)
for filename in data[~data['answered']][FILENAME_COL]:
    answer = get_human_response(filename)
    if answer is not None:
        data.loc[data[FILENAME_COL] == filename, TARGET] = answer
        data.loc[data[FILENAME_COL] == filename, 'answered'] = True


# ======================
# FINAL MODEL & RESULTS
# ======================

runtime = time.time() - start_time
train_mask = data['used_for_training']
eval_mask = ~data['used_for_training'] & data[TARGET].notna()

print("\n=== Experiment Summary ===")
print(f"Runtime: {runtime:.2f}s")
print(f"Training samples:   {train_mask.sum()}")
print(f"Evaluation samples: {eval_mask.sum()}")
print(f"Total answered:     {data['answered'].sum()}/{len(data)}")

# train model on training set
final_model = LogisticRegression()
X_train = data[train_mask][[PREDICTOR1, PREDICTOR2]]
y_train = data[train_mask][TARGET]
final_model.fit(X_train, y_train)

# predict all
data['prediction'] = final_model.predict(data[[PREDICTOR1, PREDICTOR2]])
data['certainty'] = final_model.predict_proba(data[[PREDICTOR1, PREDICTOR2]]).max(axis=1)

# evaluation results
if eval_mask.sum() > 0:
    y_eval = data[eval_mask][TARGET]
    y_pred_eval = data[eval_mask]['prediction']

    print("\n=== Evaluation on Held-Out Samples ===")
    print(f"Accuracy:  {accuracy_score(y_eval, y_pred_eval):.3f}")
    print(f"Precision: {precision_score(y_eval, y_pred_eval):.3f}")
    print(f"Recall:    {recall_score(y_eval, y_pred_eval):.3f}")
    print(f"F1 Score:  {f1_score(y_eval, y_pred_eval):.3f}")
    print("Confusion Matrix:")
    print(confusion_matrix(y_eval, y_pred_eval))
else:
    print("\nNo evaluation samples collected. Only training data used.")

# plot results
plot_results(data[data['used_for_training']], data[eval_mask], final_model)

# save processed data with predictions
if PROCESSED_PATH:
    data.to_csv(PROCESSED_PATH, index = False)
    print(f"\nProcessed data with predictions saved to {PROCESSED_PATH}")
else:
    print("\nProcessed data with predictions not saved - PROCESSED_PATH is empty")
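
Once the run has finished, the saved file can be inspected outside the notebook. The following is a minimal sketch, using only the standard library, that recomputes held-out accuracy from the `answer`, `prediction`, and `used_for_training` columns written above. It assumes pandas' default CSV output (booleans as `True`/`False`, missing answers as empty fields); the function name `held_out_accuracy` and the inline sample rows are made up for illustration.

```python
import csv
import io


def held_out_accuracy(csv_file):
    """Accuracy on rows that were answered but not used for training."""
    correct = total = 0
    for row in csv.DictReader(csv_file):
        if row['used_for_training'] == 'True' or row['answer'] == '':
            continue  # skip training rows and unanswered stimuli
        total += 1
        correct += float(row['answer']) == float(row['prediction'])
    return correct / total if total else None


# Made-up rows in the same column layout as the processed file.
sample = io.StringIO(
    "filename,answer,used_for_training,prediction\n"
    "a.wav,1.0,True,1.0\n"
    "b.wav,0.0,False,0.0\n"
    "c.wav,1.0,False,0.0\n"
    "d.wav,,False,1.0\n"
)
print(held_out_accuracy(sample))

# With a real run, pass an open file instead, for example:
# open('data_processed.csv', newline='', encoding='utf-8')
```

For the made-up rows, one of the two held-out answers matches the prediction, so the function returns 0.5. The training row and the unanswered row are ignored.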