# 📘 M2C2 DataKit Notebook: Universal Loading, Assurance, and Scoring with `LASSIE`

This notebook demonstrates a full analytic pipeline using the `m2c2-datakit` package. It uses the `LASSIE` class to load, validate, score, and optionally export data from multiple source types.

---

## 🎯 Purpose

Enable researchers to plug in data from varied sources (e.g., MongoDB, UAS, MetricWire, CSV bundles) and apply a consistent pipeline for:

- Input validation

- Scoring via predefined rules

- Inspection and summarization

- Tidy export and codebook generation

---

## Inspired by:

<img src="https://m.media-amazon.com/images/M/MV5BNDNkZDk0ODktYjc0My00MzY4LWE3NzgtNjU5NmMzZDA3YTA1XkEyXkFqcGc@._V1_FMjpg_UX1000_.jpg" alt="Inspiration for Package, Lassie Movie" width="100"/>

## 🧠 L.A.S.S.I.E. Pipeline Summary

| Step | Method           | Purpose                                                                 |
|------|------------------|-------------------------------------------------------------------------|
| L    | `load()`         | Load raw data from a supported source (e.g., MongoDB, UAS, MetricWire). |
| A    | `assure()`       | Validate that required columns exist before processing.                 |
| S    | `score()`        | Apply scoring logic based on predefined or custom rules.                |
| S    | `summarize()`    | Aggregate scored data by participant, session, or custom groups.        |
| I    | `inspect()`      | Visualize distributions or pairwise plots for quality checks.           |
| E    | `export()`       | Save scored and summarized data to tidy files, optionally with metadata. |
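
To make the `assure()` and `summarize()` rows more concrete, here is a minimal standalone sketch in plain Python. It does not call `m2c2-datakit`. The function names `find_missing_columns` and `summarize_mean` and the sample column names are hypothetical, chosen only to illustrate validating columns before aggregating by group.

```python
from collections import defaultdict
from statistics import mean


def find_missing_columns(rows, required_columns):
    """Return the required columns that are absent from at least one row."""
    missing = set()
    for row in rows:
        missing.update(col for col in required_columns if col not in row)
    return sorted(missing)


def summarize_mean(rows, group_cols, value_col):
    """Average value_col within each combination of group_cols."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        groups[key].append(row[value_col])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical trial-level records (not real study data).
rows = [
    {"participant_id": "p1", "session_id": "s1", "response_time_ms": 500},
    {"participant_id": "p1", "session_id": "s1", "response_time_ms": 700},
    {"participant_id": "p2", "session_id": "s1", "response_time_ms": 900},
]

# "Assure": stop early if a grouping column is missing.
missing = find_missing_columns(rows, ["participant_id", "session_id"])
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# "Summarize": one mean per participant and session.
summary = summarize_mean(rows, ["participant_id", "session_id"], "response_time_ms")
print(summary)
```

Checking columns first means a missing grouping key surfaces as one clear error rather than a `KeyError` partway through aggregation.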

---


## 📦 Supported Sources

| Source Type   | Loader Class          | Key Arguments                            | Notes                                 |
|---------------|------------------------|-------------------------------------------|----------------------------------------|
| `mongodb`     | `MongoDBImporter`      | `source_path` (JSON)                      | Expects flat or nested JSON documents. |
| `uas`         | `UASImporter`          | `source_path` (URL)                       | Parses newline-delimited JSON.         |
| `metricwire`  | `MetricWireImporter`   | `source_path` (glob pattern or default)   | Processes JSON files from unzipped export. |
| `multicsv`    | `MultiCSVImporter`     | `source_map` (dict of CSV paths)          | Each activity type is its own file.    |
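
The `uas` row mentions newline-delimited JSON, where each non-empty line is one JSON document. The sketch below shows one way such text could be parsed with the standard library. The helper name `parse_ndjson` and the sample records are hypothetical, and this is not necessarily how `UASImporter` is implemented.

```python
import json


def parse_ndjson(text):
    """Parse newline-delimited JSON, skipping blank lines."""
    records = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err
    return records


# Hypothetical payload: two records separated by a blank line.
sample = (
    '{"participant_id": "p1", "activity_name": "Grid Memory"}\n'
    "\n"
    '{"participant_id": "p2", "activity_name": "Symbol Search"}\n'
)
records = parse_ndjson(sample)
print(len(records))
```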

---

## 🚀 Example Pipeline Steps

### Load, Assure, Score, Inspect, and Export

```python
import m2c2_datakit as m2c2

mw = m2c2.core.pipeline.LASSIE().load(source_name="metricwire", source_path="data/metricwire/unzipped/*/*/*.json")
mw.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION_METRICWIRE)
mw_scored = mw.score()
mw.inspect()
mw.export(file_basename="metricwire", directory="tidy/metricwire_scored")
mw.export_codebook(filename="codebook_metricwire.md", directory="tidy/metricwire_scored")

# -----------------------------------------------------------------------------------------------------

mdb = m2c2.core.pipeline.LASSIE().load(source_name="mongodb", source_path="data/production-mongo-export/data_exported_120424_1010am.json")
mdb.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
mdb.score()
mdb.inspect()
mdb.export(file_basename="mongodb_export", directory="tidy/mongodb_scored")
mdb.export_codebook(filename="codebook_mongo.md", directory="tidy/mongodb_scored")

# -----------------------------------------------------------------------------------------------------

uas = m2c2.core.pipeline.LASSIE().load(source_name="UAS", source_path="https://uas.usc.edu/survey/uas/m2c2_ess/admin/export_m2c2.php?k=<INSERT KEY HERE>")
uas.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
uas.score()
uas.inspect()
uas.export(file_basename="uas_export", directory="tidy/uas_scored")
uas.export_codebook(filename="codebook_uas.md", directory="tidy/uas_scored")

# -----------------------------------------------------------------------------------------------------

source_map = {
    "Symbol Search": "data/reboot/m2c2kit_manualmerge_symbol_search_all_ts-20250402_151939.csv",
    "Grid Memory": "data/reboot/m2c2kit_manualmerge_grid_memory_all_ts-20250402_151940.csv"
}

mcsv = m2c2.core.pipeline.LASSIE().load(source_name="multicsv", source_map=source_map)

mcsv.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
mcsv.score()
mcsv.inspect()
mcsv.export(file_basename="multicsv_export", directory="tidy/multicsv_scored")
mcsv.export_codebook(filename="codebook_multicsv.md", directory="tidy/multicsv_scored")
```

## 🛠️ Setup for Developers of this Package

In [None]:
!make clean
!make dev-install

## Run Tests

In [None]:
import pandas as pd
import json
import re

In [None]:
import m2c2_datakit as m2c2
m2c2.core.utils.get_filename_timestamp()

In [None]:
# Data from Metricwire
mw = m2c2.core.pipeline.LASSIE().load(source_name="metricwire", source_path="data/metricwire/unzipped/*/*/*.json")
mw.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION_METRICWIRE)
mw_scored = mw.score()
mw.inspect()
mw.export(file_basename="metricwire", directory="tidy/metricwire_scored")
mw.export_codebook(filename="codebook_metricwire.md", directory="tidy/metricwire_scored")
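
The `source_path` in the MetricWire cell above is a glob pattern. As a small illustration of how a pattern of the form `unzipped/*/*/*.json` selects files, the sketch below builds a throwaway directory tree and matches it with the standard-library `glob` module. The folder and file names are hypothetical, and whether `MetricWireImporter` uses `glob` internally is an assumption.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout; a real MetricWire export may be organized differently.
    study_dir = os.path.join(root, "unzipped", "study_a")
    nested_dir = os.path.join(study_dir, "participant_1")
    os.makedirs(nested_dir)

    # Two JSON files and one non-JSON file exactly two levels below "unzipped".
    for name in ("a.json", "b.json", "notes.txt"):
        with open(os.path.join(nested_dir, name), "w", encoding="utf-8") as fh:
            fh.write("{}")

    # A JSON file only one level below "unzipped" (too shallow for the pattern).
    with open(os.path.join(study_dir, "shallow.json"), "w", encoding="utf-8") as fh:
        fh.write("{}")

    pattern = os.path.join(root, "unzipped", "*", "*", "*.json")
    matches = sorted(os.path.basename(path) for path in glob.glob(pattern))

print(matches)
```

Each `*` matches a single path component, so files at other depths are not picked up.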

In [None]:
# Data from demo M2C2 study on PSU production server
mdb = m2c2.core.pipeline.LASSIE().load(source_name="mongodb", source_path="data/production-mongo-export/data_exported_120424_1010am.json")
mdb.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
mdb.score()
mdb.inspect()
mdb.export(file_basename="mongodb_export", directory="tidy/mongodb_scored")
mdb.export_codebook(filename="codebook_mongo.md", directory="tidy/mongodb_scored")

In [None]:
# Data from REBOOT Study (UCF and PSU) was manually merged so we have two csvs to load
source_map = {
    "Symbol Search": "data/reboot/m2c2kit_manualmerge_symbol_search_all_ts-20250402_151939.csv",
    "Grid Memory": "data/reboot/m2c2kit_manualmerge_grid_memory_all_ts-20250402_151940.csv"
}

mcsv = m2c2.core.pipeline.LASSIE().load(source_name="multicsv", source_map=source_map)
mcsv.assure(required_columns=['participant_id'])
mcsv.score()
mcsv.inspect()
mcsv.export(file_basename="multicsv_export", directory="tidy/multicsv_scored")
mcsv.export_codebook(filename="codebook_multicsv.md", directory="tidy/multicsv_scored")
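
The `source_map` above pairs each activity name with one CSV file. A rough sketch of how such a mapping could be combined into a single set of rows is shown below, using only the standard library. The helper `load_source_map`, the `activity_name` tag it adds, and the demo columns are assumptions for illustration, not the behavior of `MultiCSVImporter`.

```python
import csv
import os
import tempfile


def load_source_map(source_map):
    """Read each CSV and tag its rows with the activity name from the map key."""
    rows = []
    for activity_name, path in source_map.items():
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                row["activity_name"] = activity_name
                rows.append(row)
    return rows


with tempfile.TemporaryDirectory() as tmp:
    # Write two tiny demo files so the example is self-contained.
    demo_map = {}
    for activity in ("Symbol Search", "Grid Memory"):
        path = os.path.join(tmp, activity.replace(" ", "_") + ".csv")
        with open(path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(["participant_id", "trial_index"])
            writer.writerow(["p1", "0"])
        demo_map[activity] = path

    combined = load_source_map(demo_map)

print([row["activity_name"] for row in combined])
```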

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()
UAS_LATEST_KEY = os.getenv("UAS_LATEST_KEY")

# Data from UAS
uas = m2c2.core.pipeline.LASSIE().load(source_name="UAS", source_path=f"https://uas.usc.edu/survey/uas/m2c2_ess/admin/export_m2c2.php?k={UAS_LATEST_KEY}")
uas.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
uas.score()
uas.inspect()
uas.export(file_basename="uas_export", directory="tidy/uas_scored")
uas.export_codebook(filename="codebook_uas.md", directory="tidy/uas_scored")
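
If `UAS_LATEST_KEY` is not set, `os.getenv` returns `None` and the f-string in the cell above would request `...?k=None`. One possible guard is sketched below. The helper name `build_export_url` and its `environ` parameter are hypothetical and are not part of `m2c2-datakit`.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://uas.usc.edu/survey/uas/m2c2_ess/admin/export_m2c2.php"


def build_export_url(env_var="UAS_LATEST_KEY", environ=None):
    """Build the export URL, failing loudly when the key is missing."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    # urlencode escapes any characters that are not URL-safe.
    return f"{BASE_URL}?{urlencode({'k': key})}"


# Demo with an in-memory mapping instead of the real environment.
print(build_export_url(environ={"UAS_LATEST_KEY": "abc123"}))
```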

In [7]:
# Qualtrics (coming soon)

# === Step 3 (defined first so later steps can call it): Define JSON Parser ===
def parse_trial_data(json_str):
    try:
        return json.loads(json_str)
    except (json.JSONDecodeError, TypeError):
        return {}

### Split dataframe / JSON by task
# This is essentially the format that the API would return - a bucket of JSON data per task.
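def parse_json_to_dfs(df, activity_name_col="activity_name"):
    """
    Split a DataFrame into a dict of DataFrames, one per activity.

    Keys are the unique values of `activity_name_col`; each value holds
    that activity's rows with a fresh 0-based index.
    """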
    # Group the DataFrame by the 'activity_name' column
    grouped = df.groupby(activity_name_col)

    # Split into separate DataFrames for each group
    grouped_dataframes = {name: group.reset_index(drop=True) for name, group in grouped}
    return grouped_dataframes

# === Step 1: Load Qualtrics File ===
file_path = "data/qualtrics/Qualtrics+QSF+M2C2Kit+-+Grid+Memory+&+Color+Shapes_April+17,+2025_18.05.csv"

# Use 2nd row as headers and skip the 3rd metadata row
df = pd.read_csv(file_path, header=1, skiprows=[2])

# Rename for easier access and reset index for row mapping
df = df.rename(columns={"Response ID": "ResponseId"}).reset_index()  # 'index' will help us join later

# === Step 2: Identify M2C2 Trial Data Columns ===
pattern = re.compile(r"(M2C2_ASSESSMENT_\d+)_TRIAL_DATA_(\d+)")
trial_columns = [col for col in df.columns if pattern.match(col)]

# Group columns by assessment
assessment_map = {}
for col in trial_columns:
    match = pattern.match(col)
    if match:
        assessment_name, _ = match.groups()
        assessment_map.setdefault(assessment_name, []).append(col)

# === Step 4: Long-Format Parsing Across All Assessments ===
long_format_data = []

for assessment_name, columns in assessment_map.items():
    melted = df.melt(id_vars=["index", "ResponseId", 'M2C2_ASSESSMENT_ORDER', 'M2C2_AUTO_ADVANCE', 'M2C2_LANGUAGE'],
                     value_vars=columns,
                     var_name="trial_label",
                     value_name="trial_data_json")
    
    melted["assessment"] = assessment_name
    melted["trial_number"] = melted["trial_label"].str.extract(r"TRIAL_DATA_(\d+)")
    melted['trial_number_int'] = pd.to_numeric(melted['trial_number'], errors='coerce').astype('Int64')

    melted = melted.dropna(subset=["trial_data_json"])

    parsed = melted["trial_data_json"].apply(parse_trial_data)
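    # Pass a plain list so the result gets a fresh 0..n-1 index regardless of
    # the gaps left in `melted` by dropna().
    expanded = pd.json_normalize(parsed.tolist())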

    combined = pd.concat([
        melted.drop(columns=["trial_data_json"]).reset_index(drop=True),
        expanded.reset_index(drop=True)
    ], axis=1)

    long_format_data.append(combined)

# === Step 5: Final Combined Long Format Dataset ===
final_df = pd.concat(long_format_data, ignore_index=True)

final_df = final_df.drop('index', axis=1)

# === Optional: Save or Preview ===
# final_df.to_csv("parsed_trials_long_format.csv", index=False)
display(final_df.head())

# df, grouped_dataframes, validation, activity_names

grouped_dataframes = parse_json_to_dfs(final_df, activity_name_col="activity_id")
grouped_dataframes
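
The column-matching and JSON-parsing steps in the cell above can also be exercised without pandas. The sketch below reuses the same regular expression and the same fallback-to-empty-dict parser on a few made-up column names and values. The helper `group_trial_columns` and the sample inputs are illustrative only.

```python
import json
import re

PATTERN = re.compile(r"(M2C2_ASSESSMENT_\d+)_TRIAL_DATA_(\d+)")


def group_trial_columns(columns):
    """Map each assessment prefix to its trial columns, ordered by trial number."""
    grouped = {}
    for col in columns:
        match = PATTERN.match(col)
        if match:
            assessment, trial = match.groups()
            grouped.setdefault(assessment, []).append((int(trial), col))
    return {name: [col for _, col in sorted(pairs)] for name, pairs in grouped.items()}


def parse_trial_data(json_str):
    """Return the parsed JSON, or an empty dict for missing or malformed input."""
    try:
        return json.loads(json_str)
    except (json.JSONDecodeError, TypeError):
        return {}


# Made-up column names following the naming scheme used in the cell above.
columns = [
    "ResponseId",
    "M2C2_ASSESSMENT_00_TRIAL_DATA_01",
    "M2C2_ASSESSMENT_00_TRIAL_DATA_00",
    "M2C2_ASSESSMENT_01_TRIAL_DATA_00",
]
assessment_map = group_trial_columns(columns)
print(assessment_map)
print(parse_trial_data('{"trial_index": 0}'), parse_trial_data(None))
```

Returning `{}` for unparseable cells keeps the row in the long-format table with empty trial fields rather than aborting the whole parse.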

|   | ResponseId | M2C2_ASSESSMENT_ORDER | M2C2_AUTO_ADVANCE | M2C2_LANGUAGE | trial_label | assessment | trial_number | trial_number_int | document_uuid | study_id | ... | device_metadata.screen.height | device_metadata.screen.orientation.type | device_metadata.screen.orientation.angle | device_metadata.screen.pixelDepth | device_metadata.screen.width | device_metadata.webGlRenderer | present_shapes | response_shapes | user_response | user_response_correct |
|---|------------|-----------------------|-------------------|---------------|-------------|------------|--------------|------------------|---------------|----------|-----|-------------------------------|------------------------------------------|-------------------------------------------|-----------------------------------|------------------------------|-------------------------------|----------------|-----------------|---------------|-----------------------|
| 0 | R_1E6dnY8uUGgnpO6 | GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS | True | en-US | M2C2_ASSESSMENT_00_TRIAL_DATA_00 | M2C2_ASSESSMENT_00 | 0 | 0 | f6304d7d-dd87-4143-aa09-c5cb4d505676 |  | ... | 892 | portrait-primary | 0 | 24 | 412 | ARM, Mali-G710 |  |  |  |  |
| 1 | R_72bY7cA1XNZjWQy | GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS | True | en-US | M2C2_ASSESSMENT_00_TRIAL_DATA_00 | M2C2_ASSESSMENT_00 | 0 | 0 | 1e551ce5-0b02-4843-a92e-8f20b7c0cba4 |  | ... | 892 | portrait-primary | 0 | 24 | 412 | ARM, Mali-G710 |  |  |  |  |
| 2 | R_1E6dnY8uUGgnpO6 | GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS | True | en-US | M2C2_ASSESSMENT_00_TRIAL_DATA_01 | M2C2_ASSESSMENT_00 | 1 | 1 | 7f1610b5-2200-4d97-947b-a58acdc54d07 |  | ... | 892 | portrait-primary | 0 | 24 | 412 | ARM, Mali-G710 |  |  |  |  |
| 3 | R_72bY7cA1XNZjWQy | GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS | True | en-US | M2C2_ASSESSMENT_00_TRIAL_DATA_01 | M2C2_ASSESSMENT_00 | 1 | 1 | 01d05ab3-ae0f-446a-b2c6-b52c62265a93 |  | ... | 892 | portrait-primary | 0 | 24 | 412 | ARM, Mali-G710 |  |  |  |  |
| 4 | R_1E6dnY8uUGgnpO6 | GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS | True | en-US | M2C2_ASSESSMENT_00_TRIAL_DATA_02 | M2C2_ASSESSMENT_00 | 2 | 2 | 628c2265-f277-448b-b219-a686d020c8c5 |  | ... | 892 | portrait-primary | 0 | 24 | 412 | ARM, Mali-G710 |  |  |  |  |


{'color-shapes':            ResponseId                              M2C2_ASSESSMENT_ORDER  \
 0   R_1E6dnY8uUGgnpO6  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 1   R_72bY7cA1XNZjWQy  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 2   R_1E6dnY8uUGgnpO6  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 3   R_72bY7cA1XNZjWQy  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 4   R_1E6dnY8uUGgnpO6  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 5   R_72bY7cA1XNZjWQy  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 6   R_1E6dnY8uUGgnpO6  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 7   R_72bY7cA1XNZjWQy  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 8   R_1E6dnY8uUGgnpO6  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 9   R_72bY7cA1XNZjWQy  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 10  R_1E6dnY8uUGgnpO6  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEARCH,COLOR_DOTS   
 11  R_72bY7cA1XNZjWQy  GRID_MEMORY,COLOR_SHAPES,SYMBOL_SEAR