# 📘 M2C2 DataKit Notebook: Universal Loading, Assurance, and Scoring

This notebook demonstrates a full analytic pipeline using the `m2c2-datakit` package. It uses the `LASSIE` class to load, validate, score, and optionally export data from multiple source types.

---

## 🎯 Purpose

Enable researchers to plug in data from varied sources (e.g., MongoDB, UAS, MetricWire, CSV bundles) and apply a consistent pipeline for:

- Input validation

- Scoring via predefined rules

- Inspection and summarization

- Tidy export and codebook generation

---

## Inspired by:

<img src="https://m.media-amazon.com/images/M/MV5BNDNkZDk0ODktYjc0My00MzY4LWE3NzgtNjU5NmMzZDA3YTA1XkEyXkFqcGc@._V1_FMjpg_UX1000_.jpg" alt="Inspiration for Package, Lassie Movie" width="100"/>

## 🧠 L.A.S.S.I.E. Pipeline Summary

| Step | Method           | Purpose                                                                 |
|------|------------------|-------------------------------------------------------------------------|
| L    | `load()`         | Load raw data from a supported source (e.g., MongoDB, UAS, MetricWire). |
| A    | `assure()`       | Validate that required columns exist before processing.                 |
| S    | `score()`        | Apply scoring logic based on predefined or custom rules.                |
| S    | `summarize()`    | Aggregate scored data by participant, session, or custom groups.        |
| I    | `inspect()`      | Visualize distributions or pairwise plots for quality checks.           |
| E    | `export()`       | Save scored and summarized data to tidy files, optionally with metadata. |
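
The steps in the table are chained on a single object. As a rough illustration of that pattern only, here is a toy stand-in written with pandas. `MiniPipeline`, its methods, and the `response`/`target` columns are hypothetical and are not the `LASSIE` implementation; the sketch covers just three of the steps.

```python
import pandas as pd


class MiniPipeline:
    """Toy stand-in for a chainable pipeline; NOT the LASSIE implementation."""

    def __init__(self):
        self.df = None
        self.summary = None

    def load(self, records):
        # Build a DataFrame from a list of dicts and return self for chaining.
        self.df = pd.DataFrame(records)
        return self

    def score(self):
        # Hypothetical scoring rule: a trial is correct when response == target.
        self.df["correct"] = self.df["response"] == self.df["target"]
        return self

    def summarize(self, grouping):
        # Aggregate scored trials per group using named aggregation.
        self.summary = (
            self.df.groupby(grouping)["correct"]
            .agg(n_trials="count", n_correct="sum")
            .reset_index()
        )
        return self


records = [
    {"participant_id": "p1", "response": 1, "target": 1},
    {"participant_id": "p1", "response": 0, "target": 1},
    {"participant_id": "p2", "response": 1, "target": 1},
]

pipe = MiniPipeline().load(records).score().summarize(["participant_id"])
print(pipe.summary)
```

Because every method returns `self`, each step can also be called on its own line, which is the style the example steps in this notebook use.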

---


## 📦 Supported Sources

M2C2kit tasks can be deployed through several integrations, listed below. Each integration has its own loader class, which reads the raw data and converts it into a format that the `m2c2_datakit` package can process. You are responsible for ensuring that your data is in the format each loader class expects.

We anticipate adding loaders that download data directly via API.

| Source Type  | Loader Class         | Key Arguments                           | Notes                                             |
|--------------|----------------------|-----------------------------------------|---------------------------------------------------|
| `mongodb`    | `MongoDBImporter`    | `source_path` (URL to JSON)             | Expects flat or nested JSON documents.            |
| `multicsv`   | `MultiCSVImporter`   | `source_map` (dict of CSV paths)        | Each activity type is its own file.               |
| `metricwire` | `MetricWireImporter` | `source_path` (glob pattern or default) | Processes JSON files from an unzipped export.     |
| `qualtrics`  | `QualtricsImporter`  | `source_path` (URL to CSV)              | Each activity trial saves its data to a new column. |
| `uas`        | `UASImporter`        | `source_path` (URL to pseudo-JSON)      | Parses newline-delimited JSON.                    |
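
For readers unfamiliar with newline-delimited JSON (one JSON object per line), the sketch below shows one way such a file could be parsed into a flat table with the standard library and pandas. It is an illustration under assumptions, not the `UASImporter` code: `read_ndjson` and the sample fields are hypothetical.

```python
import io
import json

import pandas as pd


def read_ndjson(handle):
    """Parse newline-delimited JSON: one JSON object per non-empty line."""
    rows = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"Line {line_number} is not valid JSON: {err}") from err
    # json_normalize flattens nested objects into dotted column names.
    return pd.json_normalize(rows)


# Hypothetical records; real export fields will differ.
sample = io.StringIO(
    '{"participant_id": "p1", "trial": {"index": 0, "rt_ms": 512}}\n'
    "\n"
    '{"participant_id": "p2", "trial": {"index": 0, "rt_ms": 430}}\n'
)
df = read_ndjson(sample)
print(df.columns.tolist())
```

Reporting the failing line number makes it easier to locate a malformed record in a large export.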


---

## 🚀 Example Pipeline Steps

### Step 1: Load Data

```python
import m2c2_datakit as m2c2

# Map each activity name to its CSV file (one file per activity type)
source_map = {
    "Symbol Search": "data/reboot/m2c2kit_manualmerge_symbol_search_all_ts-20250402_151939.csv",
    "Grid Memory": "data/reboot/m2c2kit_manualmerge_grid_memory_all_ts-20250402_151940.csv",
}

# Each source gets its own LASSIE instance
mcsv = m2c2.core.pipeline.LASSIE().load(source_name="multicsv", source_map=source_map)
mw = m2c2.core.pipeline.LASSIE().load(source_name="metricwire", source_path="data/metricwire/unzipped/*/*/*.json")
mdb = m2c2.core.pipeline.LASSIE().load(source_name="mongodb", source_path="data/production-mongo-export/data_exported_120424_1010am.json")
# Replace <INSERT KEY HERE> with your UAS export key
uas = m2c2.core.pipeline.LASSIE().load(source_name="UAS", source_path="https://uas.usc.edu/survey/uas/m2c2_ess/admin/export_m2c2.php?k=<INSERT KEY HERE>")
```

---

### Step 2: Verify Data

```python
mcsv.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
mw.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION_METRICWIRE)
mdb.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
uas.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
```
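
Conceptually, a required-column check compares the columns a step needs against the columns actually present. The sketch below shows that idea in plain pandas; `find_missing_columns` and the example column names are hypothetical, and the package's own `assure()` may behave differently.

```python
import pandas as pd


def find_missing_columns(df, required_columns):
    """Return required column names absent from df, in the order given."""
    present = set(df.columns)
    return [col for col in required_columns if col not in present]


# Hypothetical frame and grouping columns, for illustration only.
trials = pd.DataFrame({"participant_id": ["p1"], "session_id": ["s1"]})
required = ["participant_id", "session_id", "session_uuid"]

missing = find_missing_columns(trials, required)
print(missing)
```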
---

### Step 3: Score Data

```python
mcsv.score()
mw.score()
mdb.score()
uas.score()
```

### Step 4: Inspect Data

```python
mcsv.inspect()
mw.inspect()
mdb.inspect()
uas.inspect()
```
---

### Step 5: Summarize Data

```python
mcsv.summarize(grouping=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
mw.summarize(grouping=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION_METRICWIRE)
mdb.summarize(grouping=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
uas.summarize(grouping=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
```

---

### Step 6: Export Data

```python
output_folder = "tidy"
mcsv.export(file_basename="multicsv_export", directory=output_folder)
mw.export(file_basename="metricwire", directory=output_folder)
mdb.export(file_basename="mongodb_export", directory=output_folder)
uas.export(file_basename="uas_export", directory=output_folder)

```

#### Oh yeah, and export the codebook too!

```python
mcsv.export_codebook(filename="codebook_multicsv.md", directory=output_folder)
mw.export_codebook(filename="codebook_metricwire.md", directory=output_folder)
mdb.export_codebook(filename="codebook_mongo.md", directory=output_folder)
uas.export_codebook(filename="codebook_uas.md", directory=output_folder)
```

## 🛠️ Setup for Developers of this Package

## Setup Environment to run Notebook

In [1]:
!make dev-install

rm -rf dist build *.egg-info .venv
uv venv .venv --python=/opt/homebrew/bin/python3.10
Using CPython 3.10.16 interpreter at: /opt/homebrew/opt/python@3.10/bin/python3.10
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
uv pip install --python=.venv/bin/python -r requirements.txt
Resolved 74 packages in 762ms

In [2]:
import os
import re
import glob
import json
import pandas as pd
from dotenv import load_dotenv
import m2c2_datakit as m2c2

In [3]:
output_folder = "tidy"

summary_func_map = {
    "Symbol Search": m2c2.tasks.symbol_search.summarize,
    "Grid Memory": m2c2.tasks.grid_memory.summarize,
}

# ^ this also means that a user could specify their own summarize functions as needed!
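
As a rough sketch of what a user-supplied summarize function could look like, the example below assumes such a function receives one group's scored trials as a DataFrame and returns a pandas Series of summary values. `summarize_accuracy` and the `correct` and `response_time_ms` columns are hypothetical; check the package's task modules for the columns your data actually contains.

```python
import pandas as pd


def summarize_accuracy(x, trials_expected=20):
    """Hypothetical custom summary for one group of scored trials."""
    n_trials = len(x)
    return pd.Series(
        {
            "n_trials": n_trials,
            "flag_trials_match_expected": n_trials == trials_expected,
            "n_trials_correct": int(x["correct"].sum()),
            "median_response_time": x["response_time_ms"].median(),
        }
    )


# Toy scored data; the column names here are illustrative assumptions.
scored = pd.DataFrame(
    {
        "participant_id": ["p1", "p1", "p2"],
        "session_id": ["s1", "s1", "s1"],
        "correct": [True, False, True],
        "response_time_ms": [500.0, 700.0, 450.0],
    }
)

# Select only the columns the function needs, then apply it per group.
summary = (
    scored.groupby(["participant_id", "session_id"])[["correct", "response_time_ms"]]
    .apply(summarize_accuracy, trials_expected=2)
    .reset_index()
)
print(summary)
```

Returning a Series with named entries gives one row per group and one column per summary value after `reset_index()`.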

## Step 1: Load Data

In [4]:
# Define source_folder relative to current working directory
source_folder_mw = os.path.abspath(os.path.join(os.pardir, 'datakit/data/metricwire/unzipped'))
source_path_mw = f"{source_folder_mw}/*/*/*.json"

source_folder_mdb = os.path.abspath(os.path.join(os.pardir, "datakit/data/production-mongo-export"))
source_path_mdb = f"{source_folder_mdb}/data_exported_120424_1010am.json"

source_folder_qualtrics = os.path.abspath(os.path.join(os.pardir, "datakit/data/qualtrics"))
source_path_qualtrics = f"{source_folder_qualtrics}/Qualtrics+QSF+M2C2Kit+-+Grid+Memory+&+Color+Shapes_April+17,+2025_18.05.csv"

# Data from REBOOT Study (UCF and PSU) was manually merged so we have two csvs to load
source_map = {
    "Symbol Search": "~/Documents/GitHub/datakit/data/reboot/m2c2kit_manualmerge_symbol_search_all_ts-20250402_151939.csv",
    "Grid Memory": "~/Documents/GitHub/datakit/data/reboot/m2c2kit_manualmerge_grid_memory_all_ts-20250402_151940.csv"
}
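
Before loading, it can help to confirm that a glob pattern of this shape matches the files you expect. The sketch below builds a throwaway directory tree and checks a `*/*/*.json` pattern against it; the folder and file names are hypothetical and unrelated to any real export.

```python
import glob
import os
import tempfile

# Build a temporary tree that mimics a nested export layout (hypothetical names).
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "study_a", "participant_1")
    os.makedirs(nested)
    for name in ("session_1.json", "session_2.json", "notes.txt"):
        with open(os.path.join(nested, name), "w") as fh:
            fh.write("{}")

    # Same shape as the pattern above: two directory levels, then JSON files.
    pattern = os.path.join(root, "*", "*", "*.json")
    matches = sorted(glob.glob(pattern))
    matched_names = [os.path.basename(p) for p in matches]

print(matched_names)
```

Note that `glob.glob` does not expand `~`, so a pattern that starts with `~` should be passed through `os.path.expanduser` first.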

In [5]:
# Data from Metricwire
mw = m2c2.core.pipeline.LASSIE().load(source_name="metricwire", source_path=source_path_mw)

# Data from demo M2C2 study on PSU production server
mdb = m2c2.core.pipeline.LASSIE().load(source_name="mongodb", source_path=source_path_mdb)

# Data from REBOOT Study (UCF and PSU) was manually merged so we have two csvs to load
mcsv = m2c2.core.pipeline.LASSIE().load(source_name="multicsv", source_map=source_map)

# Data from Qualtrics
qualtrics = m2c2.core.pipeline.LASSIE().load(source_name="qualtrics", source_path=source_path_qualtrics)

# Data from UAS
load_dotenv()
UAS_LATEST_KEY = os.getenv("UAS_LATEST_KEY")
#uas = m2c2.core.pipeline.LASSIE().load(source_name="UAS", source_path= f"https://uas.usc.edu/survey/uas/m2c2_ess/admin/export_m2c2.php?k={UAS_LATEST_KEY}")
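
If the key is missing from the environment, `os.getenv` returns `None`, and an f-string would silently place the text `None` in the URL. One possible guard is sketched below; `build_export_url`, the `DEMO_EXPORT_KEY` variable, and the example URL are hypothetical and only serve to make the example runnable.

```python
import os
from urllib.parse import urlencode


def build_export_url(base_url, key_env_var):
    """Return base_url with the key from the environment appended as ?k=..."""
    key = os.getenv(key_env_var)
    if not key:
        # Fail here rather than sending a request with "k=None" in the URL.
        raise RuntimeError(f"Environment variable {key_env_var} is not set")
    # urlencode escapes characters that are not safe in a query string.
    return f"{base_url}?{urlencode({'k': key})}"


# Hypothetical variable name and URL, set here only so the example runs.
os.environ["DEMO_EXPORT_KEY"] = "abc 123"
url = build_export_url("https://example.org/export.php", "DEMO_EXPORT_KEY")
print(url)
```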



## Step 2: Verify Data

In [None]:
mdb.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
mcsv.assure(required_columns=['participant_id'])
mw.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION_METRICWIRE)
qualtrics.assure(required_columns=['ResponseId'])
#uas.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)


## Step 3: Score Data

In [6]:
mdb.score()
mcsv.score()
mw = mw.score()
qualtrics.score()
#uas.score()

[WARN] No scoring functions defined for activity 'Color Dots', skipping.
[WARN] No scoring functions defined for activity 'Color Match', skipping.
[WARN] No scoring functions defined for activity 'Color Shapes', skipping.
[WARN] No scoring functions defined for activity 'Color Squares', skipping.
[WARN] No scoring functions defined for activity 'Digit Span', skipping.
[WARN] No scoring functions defined for activity 'Digit Span (Audio)', skipping.
[WARN] No scoring functions defined for activity 'Drawing', skipping.
[WARN] No scoring functions defined for activity 'Face Naming Task', skipping.
[WARN] No scoring functions defined for activity 'Go No Go', skipping.
[WARN] No scoring functions defined for activity 'Go No Go Fade', skipping.
[WARN] No scoring functions defined for activity 'Grid Forage', skipping.
[WARN] No scoring functions defined for activity 'JOLO', skipping.
[WARN] No scoring functions defined for activity 'Motion', skipping.
[WARN] No scoring functions defined for ac

<m2c2_datakit.core.pipeline.LASSIE at 0x10f3d1780>

## Step 4: Summarize Data

In [None]:
# mcsv.summarize(grouping = ["participant_id", 'session_id', 'session_uuid'],
#                summary_func_map = summary_func_map,
#                activity_col = "activity_name")

In [10]:
mdb.summarize(summary_func_map = summary_func_map)

<m2c2_datakit.core.pipeline.LASSIE at 0x10d2fe560>

In [27]:
mdb.grouped

{'Color Dots':                participant_id                            session_id group  \
 0    650af3a6adf93fcb20605f34  9f89dacc-510f-48c0-b533-3234197c0314  None   
 2    650af3a6adf93fcb20605f34  9f89dacc-510f-48c0-b533-3234197c0314  None   
 5    650af3a6adf93fcb20605f34  9f89dacc-510f-48c0-b533-3234197c0314  None   
 9    650af3a6adf93fcb20605f34  9f89dacc-510f-48c0-b533-3234197c0314  None   
 14   650af3a6adf93fcb20605f34  9f89dacc-510f-48c0-b533-3234197c0314  None   
 ..                        ...                                   ...   ...   
 765  64d4f4567ce16ae0f145437d  3c83c949-03ad-49b3-8794-45e1c7cfdaba  None   
 782  64d4f4567ce16ae0f145437d  3c83c949-03ad-49b3-8794-45e1c7cfdaba  None   
 800  64d4f4567ce16ae0f145437d  3c83c949-03ad-49b3-8794-45e1c7cfdaba  None   
 819  64d4f4567ce16ae0f145437d  3c83c949-03ad-49b3-8794-45e1c7cfdaba  None   
 839  64d4f4567ce16ae0f145437d  3c83c949-03ad-49b3-8794-45e1c7cfdaba  None   
 
      wave activity_id         study_id         

In [22]:
summary_func_map

{'Symbol Search': <function m2c2_datakit.tasks.symbol_search.summarize(x, trials_expected=20, rt_outlier_low=100, rt_outlier_high=10000)>,
 'Grid Memory': <function m2c2_datakit.tasks.grid_memory.summarize(x, trials_expected=4)>}

In [43]:
summary_func_map = {
    "Symbol Search": m2c2.tasks.symbol_search.summarize,
    # "Grid Memory": m2c2.tasks.grid_memory.summarize,
}

summary_results = {}

# Summarize each activity that has a summary function registered above
for group_key in mcsv.grouped:
    summary_func = summary_func_map.get(group_key)

    if summary_func is None:
        print(f"No summary function found for group key: {group_key}")
        continue

    # Scored trials for this activity, if any
    cur_df = mcsv.grouped_scored.get(group_key)
    if cur_df is None:
        print(f"No scored dataframe found for group key: {group_key}")
        continue

    # One summary row per participant and session
    summary_results[group_key] = (
        cur_df
        .groupby(['participant_id', 'session_id', 'session_uuid'])
        .apply(summary_func)
        .reset_index()
    )


No summary function found for group key: Grid Memory


In [41]:
summary_results.get("Symbol Search")

Unnamed: 0,participant_id,activity_begin_iso8601_timestamp,n_trials,flag_trials_match_expected,flag_trials_lt_expected,flag_trials_gt_expected,n_trials_total,n_trials_lure,n_trials_normal,n_trials_correct,n_trials_incorrect,median_response_time_filtered,median_response_time_overall,median_response_time_correct,median_response_time_incorrect
0,61089502bfeffd3bc72ef066,2023-11-16T16:58:07Z,12,False,True,False,12,240,240,445,35,1122.0,1122.00,1122.00,955.00
1,62a76562a0e0443ef0adff79,2024-03-05T18:28:10Z,12,False,True,False,12,12,12,12,12,363.2,363.20,358.60,368.55
2,63ebd08f97b0db20efc6fe7a,2024-08-08T21:33:58Z,12,False,True,False,12,54,54,54,54,322.8,258.50,265.55,257.95
3,63ee65fa97b0db20efceb32d,2023-11-13T18:47:27Z,12,False,True,False,12,42,42,63,21,1089.0,1012.00,1121.00,452.00
4,64624f8bd957b73d388a4dbe,2023-08-30T13:08:42Z,12,False,True,False,12,358,349,678,29,2264.0,2269.00,2304.50,1657.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
211,67bf562be4c0bf049db00589,2025-02-27T14:34:42Z,12,False,True,False,12,288,288,570,6,2205.8,2221.35,2224.10,1257.60
212,67db3fade01125cbfcf8a4c5,2025-03-20T14:18:38Z,12,False,True,False,12,180,180,331,29,1403.0,1403.00,1421.00,1149.00
213,|*participantUserId*|,2024-03-05T18:39:05Z,12,False,True,False,12,6,6,11,1,2011.5,2011.50,2137.00,1324.00
214,|Example|,2024-02-06T00:32:20Z,12,False,True,False,12,1294,1294,2390,198,2272.5,2273.00,2339.50,1268.50


In [14]:
display(mdb.flat_summary)

0     2024-11-07T14:09:41.388Z
1                            4
2                         True
3                        False
4                        False
5                     0.504031
6                          0.0
7                          0.0
8                     2.236068
9                     8.064495
10                    0.813849
11                    1.269003
12                    1.333333
13                         0.0
14                    2.981424
15                   20.304053
16                    0.980295
17                     3.80701
18                         4.0
19                         0.0
20                    8.944272
21                   60.912159
22                    2.940886
23    2024-11-07T14:11:22.424Z
24                          20
25                        True
26                       False
27                       False
28                          20
29                          40
30                          40
31                          76
32      

In [13]:
display(mdb.grouped_summary.get("Grid Memory"))

activity_begin_iso8601_timestamp          2024-11-07T14:09:41.388Z
n_trials                                                         4
flag_trials_match_expected                                    True
flag_trials_lt_expected                                      False
flag_trials_gt_expected                                      False
metric_error_distance_hausdorff_mean                      0.504031
metric_error_distance_hausdorff_median                         0.0
metric_error_distance_hausdorff_min                            0.0
metric_error_distance_hausdorff_max                       2.236068
metric_error_distance_hausdorff_sum                       8.064495
metric_error_distance_hausdorff_std                       0.813849
metric_error_distance_mean_mean                           1.269003
metric_error_distance_mean_median                         1.333333
metric_error_distance_mean_min                                 0.0
metric_error_distance_mean_max                            2.98

## Step 5: Inspect Data (plots)

In [None]:
#mdb.inspect()
#mcsv.inspect()
#mw.inspect()
#qualtrics.inspect()
#uas.inspect()

## Step 6: Export Data and Codebooks

In [None]:
mw.export(file_basename="export_metricwire", directory=output_folder)
mw.export_codebook(filename="codebook_metricwire.md", directory=output_folder)

mdb.export(file_basename="export_mongodb", directory=output_folder)
mdb.export_codebook(filename="codebook_mongo.md", directory=output_folder)

mcsv.export(file_basename="export_multicsv", directory=output_folder)
mcsv.export_codebook(filename="codebook_multicsv.md", directory=output_folder)

#uas.export(file_basename="export_uas", directory=output_folder)
#uas.export_codebook(filename="codebook_uas.md", directory=output_folder)

#qualtrics.export(file_basename="export_qualtrics", directory=output_folder)
#qualtrics.export_codebook(filename="codebook_qualtrics.md", directory=output_folder)

## Step X: See What's Inside

In [None]:
mw.whats_inside()
mdb.whats_inside()
mcsv.whats_inside()
#uas.whats_inside()
#qualtrics.whats_inside()