# 📘 M2C2 DataKit Notebook: Universal Loading, Assurance, and Scoring

This notebook demonstrates a full analytic pipeline using the `m2c2-datakit` package. It uses the `LASSIE` class to load, validate, score, and optionally export data from multiple source types.

---

## 🎯 Purpose

Enable researchers to plug in data from varied sources (e.g., MongoDB, UAS, MetricWire, CSV bundles) and apply a consistent pipeline for:

- Input validation

- Scoring via predefined rules

- Inspection and summarization

- Tidy export and codebook generation

---

## Inspired by:

<img src="https://m.media-amazon.com/images/M/MV5BNDNkZDk0ODktYjc0My00MzY4LWE3NzgtNjU5NmMzZDA3YTA1XkEyXkFqcGc@._V1_FMjpg_UX1000_.jpg" alt="Inspiration for Package, Lassie Movie" width="100"/>

## 🧠 L.A.S.S.I.E. Pipeline Summary

| Step | Method           | Purpose                                                                 |
|------|------------------|-------------------------------------------------------------------------|
| L    | `load()`         | Load raw data from a supported source (e.g., MongoDB, UAS, MetricWire). |
| A    | `assure()`       | Validate that required columns exist before processing.                 |
| S    | `score()`        | Apply scoring logic based on predefined or custom rules.                |
| S    | `summarize()`    | Aggregate scored data by participant, session, or custom groups.        |
| I    | `inspect()`      | Visualize distributions or pairwise plots for quality checks.           |
| E    | `export()`       | Save scored and summarized data to tidy files and optionally metadata.  |
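
As a rough illustration of what the "A" step checks, the sketch below validates required columns on plain Python rows. The function `assure_columns` and the sample rows are hypothetical and are not part of the `m2c2-datakit` API; the package's own `assure()` may work differently.

```python
def assure_columns(rows, required_columns):
    """Raise ValueError if any row lacks one of the required columns.

    Hypothetical helper for illustration only; not the datakit implementation.
    """
    required = set(required_columns)
    for i, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            raise ValueError(f"row {i} is missing required columns: {sorted(missing)}")
    return rows


# Made-up example rows
good_rows = [
    {"participant_id": "p1", "response_time_ms": 812},
    {"participant_id": "p2", "response_time_ms": 1040},
]
assure_columns(good_rows, ["participant_id"])  # passes silently
```

Failing early at this step keeps scoring code from raising less readable key errors further down the pipeline.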

---

## Verify Data

## 🛠️ Setup for Developers of this Package

## Setup Environment to run Notebook

In [2]:
!make dev-install
%pip install --upgrade --no-cache-dir m2c2_datakit

rm -rf dist build *.egg-info .venv
uv venv .venv --python=/opt/homebrew/bin/python3.10
Using CPython 3.10.16 interpreter at: /opt/homebrew/opt/python@3.10/bin/python3.10
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
uv pip install --python=.venv/bin/python -r requirements.txt --no-cache-dir m2c2_datakit
Resolved 92 packages in 2.44s

In [1]:
import os
import re
import glob
import json
import pandas as pd
from dotenv import load_dotenv
import m2c2_datakit as m2c2
m2c2.core.get_package_version()

ModuleNotFoundError: No module named 'm2c2_datakit'
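
A `ModuleNotFoundError` like the one above can occur when the notebook kernel runs a different interpreter than the one the package was installed into. That is an assumption about this run, not a confirmed diagnosis. The sketch below uses only the standard library to show which interpreter the kernel is using and whether it can find the module; `is_installed` is a hypothetical helper name.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Compare this path with the environment the install step targeted (e.g. .venv).
print("Kernel interpreter:", sys.executable)
print("m2c2_datakit importable:", is_installed("m2c2_datakit"))
```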

## Configure output folder and summary functions

## Step 1: Load Data

In [None]:
output_folder = "tidy"

summary_func_map = {
    "Symbol Search": m2c2.tasks.symbol_search.summarize,
    "Grid Memory": m2c2.tasks.grid_memory.summarize,
}

# ^ this also means that a user could specify their own summarize functions as needed!

# Data from REBOOT Study (UCF and PSU) was manually merged so we have two csvs to load
source_map = {
    "Symbol Search": "~/Documents/GitHub/datakit/data/reboot/m2c2kit_manualmerge_symbol_search_all_ts-20250402_151939.csv",
    "Grid Memory": "~/Documents/GitHub/datakit/data/reboot/m2c2kit_manualmerge_grid_memory_all_ts-20250402_151940.csv"
}

# Load both CSVs, check required columns, score, and summarize per task
mcsv = m2c2.core.pipeline.LASSIE().load(source_name="multicsv", source_map=source_map)
mcsv.assure(required_columns=['participant_id'])
mcsv.score()
mcsv.summarize(summary_func_map = summary_func_map)
ss = mcsv.grouped_summary.get("Symbol Search")
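
The comment about user-supplied summarize functions can be illustrated with a plain-Python sketch of the same dispatch pattern: a dictionary maps task names to functions, and each task's rows go to its own summarizer. The signature that `m2c2-datakit` expects for summarize functions is not shown in this notebook, so `summarize_median_rt`, its arguments, and the sample trials below are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def summarize_median_rt(rows, group_key="participant_id", rt_key="response_time_ms"):
    """Hypothetical summarizer: trial count and median response time per participant."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row[rt_key])
    return {
        pid: {"n_trials": len(rts), "median_rt": median(rts)}
        for pid, rts in grouped.items()
    }


# Same shape as summary_func_map: task name -> summarize function
custom_func_map = {"Symbol Search": summarize_median_rt}

# Made-up trial-level rows
trials = {
    "Symbol Search": [
        {"participant_id": "p1", "response_time_ms": 800},
        {"participant_id": "p1", "response_time_ms": 1200},
        {"participant_id": "p1", "response_time_ms": 900},
        {"participant_id": "p2", "response_time_ms": 1000},
        {"participant_id": "p2", "response_time_ms": 1100},
    ]
}

summaries = {task: custom_func_map[task](rows) for task, rows in trials.items()}
print(summaries["Symbol Search"])
```

Keeping the mapping separate from the pipeline call means a task's summary logic can be swapped without touching the loading or scoring steps.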


In [None]:
%pip install pymc arviz


In [None]:
import pymc as pm
import arviz as az
import numpy as np
import pandas as pd

def flag_implausible_rts(df, participant_col='participant_id', rt_col='median_response_time_filtered',
                         correct_col='n_trials_correct', incorrect_col='n_trials_incorrect',
                         z_thresh=0.025, accuracy_thresh=0.6, samples=1000, tune=1000, chains=2):
    """
    Hierarchical Bayesian model (PyMC >= 4, imported as `pymc`) that flags response times
    falling outside each observation's posterior predictive interval.

    Parameters:
        df (pd.DataFrame): Input dataframe.
        participant_col (str): Column containing participant IDs.
        rt_col (str): Column with response times.
        correct_col (str): Column with correct trial counts.
        incorrect_col (str): Column with incorrect trial counts.
        z_thresh (float): Tail probability cut from each side of the posterior predictive
            interval (default 0.025, i.e. a 95% interval).
        accuracy_thresh (float): Accuracy rate below which an outlying RT is flagged as implausible.
        samples (int): Number of posterior samples.
        tune (int): Number of tuning steps.
        chains (int): Number of chains.

    Returns:
        pd.DataFrame: Original dataframe with added columns:
            - accuracy_rate
            - rt_outlier (boolean)
            - implausible_rt (boolean)
    """
    df = df.copy()
    df = df.dropna(subset=[participant_col, rt_col, correct_col, incorrect_col])

    # Encode participants
    df['participant_idx'] = df[participant_col].astype('category').cat.codes
    n_participants = df['participant_idx'].nunique()

    rt_obs = df[rt_col].values
    participant_ids = df['participant_idx'].values
    accuracy_rate = df[correct_col] / (df[correct_col] + df[incorrect_col] + 1e-6)
    df['accuracy_rate'] = accuracy_rate

    with pm.Model() as model:
        # Hyperpriors
        mu_group = pm.Normal('mu_group', mu=1500, sigma=500)
        sigma_group = pm.HalfNormal('sigma_group', sigma=200)

        # Individual-level priors
        mu_individual = pm.Normal('mu_individual', mu=mu_group, sigma=sigma_group, shape=n_participants)
        sigma_individual = pm.HalfNormal('sigma_individual', sigma=100, shape=n_participants)

        # Likelihood
        rt = pm.Normal('rt', mu=mu_individual[participant_ids],
                       sigma=sigma_individual[participant_ids],
                       observed=rt_obs)

        # Sample posterior
        trace = pm.sample(draws=samples, tune=tune, chains=chains, target_accept=0.9)

        # Posterior predictive
        ppc = pm.sample_posterior_predictive(trace, model=model, var_names=["rt"])

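    # Get bounds from posterior predictive.
    # PyMC >= 4 returns an InferenceData object here, so pull the array out of the
    # posterior_predictive group and flatten (chain, draw, n_obs) to (chain * draw, n_obs).
    ppc_rt = ppc.posterior_predictive["rt"].values.reshape(-1, len(rt_obs))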
    lower_bounds = np.percentile(ppc_rt, 100 * z_thresh, axis=0)
    upper_bounds = np.percentile(ppc_rt, 100 * (1 - z_thresh), axis=0)

    # Flag outliers
    df['rt_outlier'] = (rt_obs < lower_bounds) | (rt_obs > upper_bounds)
    df['implausible_rt'] = df['rt_outlier'] & (df['accuracy_rate'] < accuracy_thresh)

    return df

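# `a` must be a DataFrame containing the default columns used above: participant_id,
# median_response_time_filtered, n_trials_correct, and n_trials_incorrect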
df_flagged = flag_implausible_rts(a)
df_flagged[df_flagged['implausible_rt']]
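
The flagging rule at the end of `flag_implausible_rts` can be examined on its own, without fitting a model. The sketch below applies the same two conditions (RT outside the predictive interval, and accuracy below the threshold) to one observation at a time. The helper names `percentile` and `flag_rt` are hypothetical, and the "predictive draws" are a made-up evenly spaced list standing in for real posterior predictive samples.

```python
def percentile(values, q):
    """Quantile for q in [0, 1] using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def flag_rt(rt, predictive_draws, n_correct, n_incorrect,
            z_thresh=0.025, accuracy_thresh=0.6):
    """Apply the two-part rule: outside the interval AND low accuracy."""
    lower = percentile(predictive_draws, z_thresh)
    upper = percentile(predictive_draws, 1 - z_thresh)
    accuracy = n_correct / (n_correct + n_incorrect + 1e-6)
    rt_outlier = rt < lower or rt > upper
    return {
        "lower": lower,
        "upper": upper,
        "accuracy_rate": accuracy,
        "rt_outlier": rt_outlier,
        "implausible_rt": rt_outlier and accuracy < accuracy_thresh,
    }


# Made-up stand-in for posterior predictive draws, in milliseconds
draws = list(range(1000, 2001))

fast_guesser = flag_rt(400, draws, n_correct=5, n_incorrect=15)   # fast and inaccurate
fast_expert = flag_rt(400, draws, n_correct=18, n_incorrect=2)    # fast but accurate
typical = flag_rt(1500, draws, n_correct=10, n_incorrect=10)      # inaccurate, normal speed

for name, result in [("fast_guesser", fast_guesser),
                     ("fast_expert", fast_expert),
                     ("typical", typical)]:
    print(name, result["rt_outlier"], result["implausible_rt"])
```

Only the first case trips both conditions. The rule is two-sided, so an unusually slow response paired with low accuracy would be flagged as well.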
