# 📘 M2C2 DataKit Notebook: Universal Loading, Assurance, and Scoring with `LASSIE`

This notebook demonstrates a full analytic pipeline using the `m2c2-datakit` package. It uses the `LASSIE` class to load, validate, score, and optionally export data from multiple source types.

---

## 🎯 Purpose

Enable researchers to plug in data from varied sources (e.g., MongoDB, UAS, MetricWire, CSV bundles) and apply a consistent pipeline for:

- Input validation

- Scoring via predefined rules

- Inspection and summarization

- Tidy export and codebook generation

---

## Inspired by:

<img src="https://m.media-amazon.com/images/M/MV5BNDNkZDk0ODktYjc0My00MzY4LWE3NzgtNjU5NmMzZDA3YTA1XkEyXkFqcGc@._V1_FMjpg_UX1000_.jpg" alt="Inspiration for Package, Lassie Movie" width="100"/>

## 🧠 L.A.S.S.I.E. Pipeline Summary

| Step | Method           | Purpose                                                                 |
|------|------------------|-------------------------------------------------------------------------|
| L    | `load()`         | Load raw data from a supported source (e.g., MongoDB, UAS, MetricWire). |
| A    | `assure()`       | Validate that required columns exist before processing.                 |
| S    | `score()`        | Apply scoring logic based on predefined or custom rules.                |
| S    | `summarize()`    | Aggregate scored data by participant, session, or custom groups.        |
| I    | `inspect()`      | Visualize distributions or pairwise plots for quality checks.           |
| E    | `export()`       | Save scored and summarized data to tidy files and optionally metadata.  |

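The ideas behind the `assure()` and `summarize()` rows can be pictured with the standard library alone. The sketch below is not the `m2c2-datakit` implementation: the helper names `check_required_columns` and `summarize_by`, the column names, and the toy trials are all hypothetical and exist only to show what "required columns exist" and "aggregate by participant and session" mean in practice.

```python
from collections import defaultdict
from statistics import mean


def check_required_columns(rows, required_columns):
    """Raise ValueError if any row lacks a required column.

    Hypothetical stand-in for the idea behind assure(); not the package's code.
    """
    for i, row in enumerate(rows):
        missing = [col for col in required_columns if col not in row]
        if missing:
            raise ValueError(f"Row {i} is missing required columns: {missing}")


def summarize_by(rows, group_cols, value_col):
    """Return {group_key: mean of value_col} for each combination of group_cols.

    Hypothetical stand-in for the idea behind summarize(); not the package's code.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        groups[key].append(row[value_col])
    return {key: mean(values) for key, values in groups.items()}


# Toy scored trials; every column name here is illustrative only
trials = [
    {"participant_id": "p1", "session_id": "s1", "response_time_ms": 500},
    {"participant_id": "p1", "session_id": "s1", "response_time_ms": 700},
    {"participant_id": "p2", "session_id": "s1", "response_time_ms": 650},
]

# Validate first, then aggregate: the same order the pipeline table describes
check_required_columns(trials, ["participant_id", "session_id"])
summary = summarize_by(trials, ["participant_id", "session_id"], "response_time_ms")
```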
---


## 📦 Supported Sources

| Source Type   | Loader Class          | Key Arguments                            | Notes                                 |
|---------------|------------------------|-------------------------------------------|----------------------------------------|
| `mongodb`     | `MongoDBImporter`      | `source_path` (JSON)                      | Expects flat or nested JSON documents. |
| `uas`         | `UASImporter`          | `source_path` (URL)                       | Parses newline-delimited JSON.         |
| `metricwire`  | `MetricWireImporter`   | `source_path` (glob pattern or default)   | Processes JSON files from unzipped export. |
| `multicsv`    | `MultiCSVImporter`     | `source_map` (dict of CSV paths)          | Each activity type is its own file.    |

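Two of the entries above rely on general techniques that can be checked by hand before calling `load()`: newline-delimited JSON (one JSON document per line) and glob patterns. The following is a minimal standard-library sketch of both, assuming nothing about how the importers work internally. The helper names `parse_ndjson` and `preview_glob`, the sample payload, and the temporary directory layout are hypothetical.

```python
import glob
import json
import os
import tempfile


def parse_ndjson(text):
    """Parse newline-delimited JSON: one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def preview_glob(pattern):
    """Return the sorted list of paths that a glob pattern matches."""
    return sorted(glob.glob(pattern))


# Hypothetical payload; a real UAS export will have different fields
sample = '{"participant_id": "p1", "score": 3}\n\n{"participant_id": "p2", "score": 5}\n'
records = parse_ndjson(sample)

# Build a tiny two-level directory tree (hypothetical names) and confirm that a
# "*/*/*.json" pattern finds the file nested two directories deep
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "study", "participant")
    os.makedirs(nested)
    with open(os.path.join(nested, "a.json"), "w") as fh:
        fh.write("{}")
    matched = preview_glob(os.path.join(root, "*", "*", "*.json"))
    matched_names = [os.path.basename(path) for path in matched]
```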
---

## 🚀 Example Pipeline Steps

### Load, Assure, Score, Inspect, and Export by Source

Each block below runs the full pipeline for one source type.

```python
import m2c2_datakit as m2c2

mw = m2c2.core.pipeline.LASSIE().load(source_name="metricwire", source_path="data/metricwire/unzipped/*/*/*.json")
mw.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION_METRICWIRE)
mw_scored = mw.score()
mw.inspect()
mw.export(file_basename="metricwire", directory="tidy/metricwire_scored")
mw.export_codebook(filename="codebook_metricwire.md", directory="tidy/metricwire_scored")

# -----------------------------------------------------------------------------------------------------

mdb = m2c2.core.pipeline.LASSIE().load(source_name="mongodb", source_path="data/production-mongo-export/data_exported_120424_1010am.json")
mdb.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
mdb.score()
mdb.inspect()
mdb.export(file_basename="mongodb_export", directory="tidy/mongodb_scored")
mdb.export_codebook(filename="codebook_mongo.md", directory="tidy/mongodb_scored")

# -----------------------------------------------------------------------------------------------------

uas = m2c2.core.pipeline.LASSIE().load(source_name="UAS", source_path= "https://uas.usc.edu/survey/uas/m2c2_ess/admin/export_m2c2.php?k=<INSERT KEY HERE>")
uas.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
uas.score()
uas.inspect()
uas.export(file_basename="uas_export", directory="tidy/uas_scored")
uas.export_codebook(filename="codebook_uas.md", directory="tidy/uas_scored")

# -----------------------------------------------------------------------------------------------------

source_map = {
    "Symbol Search": "data/reboot/m2c2kit_manualmerge_symbol_search_all_ts-20250402_151939.csv",
    "Grid Memory": "data/reboot/m2c2kit_manualmerge_grid_memory_all_ts-20250402_151940.csv"
}

mcsv = m2c2.core.pipeline.LASSIE().load(source_name="multicsv", source_map=source_map)

mcsv.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
mcsv.score()
mcsv.inspect()
mcsv.export(file_basename="multicsv_export", directory="tidy/multicsv_scored")
mcsv.export_codebook(filename="codebook_multicsv.md", directory="tidy/multicsv_scored")
```

## 🛠️ Setup for Developers of this Package

```python
# Notebook shell commands; in a terminal, run them without the leading "!"
!make clean
!make dev-install
```

## Run Tests

```python
import pandas as pd
import json
import re
```

```python
import m2c2_datakit as m2c2
m2c2.core.utils.get_filename_timestamp()
```

```python
# Data from MetricWire
mw = m2c2.core.pipeline.LASSIE().load(source_name="metricwire", source_path="data/metricwire/unzipped/*/*/*.json")
mw.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION_METRICWIRE)
mw_scored = mw.score()
mw.inspect()
mw.export(file_basename="metricwire", directory="tidy/metricwire_scored")
mw.export_codebook(filename="codebook_metricwire.md", directory="tidy/metricwire_scored")
```

```python
# Data from demo M2C2 study on PSU production server
mdb = m2c2.core.pipeline.LASSIE().load(source_name="mongodb", source_path="data/production-mongo-export/data_exported_120424_1010am.json")
mdb.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
mdb.score()
mdb.inspect()
mdb.export(file_basename="mongodb_export", directory="tidy/mongodb_scored")
mdb.export_codebook(filename="codebook_mongo.md", directory="tidy/mongodb_scored")
```

```python
# Data from the REBOOT Study (UCF and PSU) was manually merged, so there are two CSVs to load
source_map = {
    "Symbol Search": "data/reboot/m2c2kit_manualmerge_symbol_search_all_ts-20250402_151939.csv",
    "Grid Memory": "data/reboot/m2c2kit_manualmerge_grid_memory_all_ts-20250402_151940.csv"
}

mcsv = m2c2.core.pipeline.LASSIE().load(source_name="multicsv", source_map=source_map)
mcsv.assure(required_columns=['participant_id'])
mcsv.score()
mcsv.inspect()
mcsv.export(file_basename="multicsv_export", directory="tidy/multicsv_scored")
mcsv.export_codebook(filename="codebook_multicsv.md", directory="tidy/multicsv_scored")
```

```python
import os
from dotenv import load_dotenv

# Read the UAS export key from a local .env file instead of hard-coding it
load_dotenv()
UAS_LATEST_KEY = os.getenv("UAS_LATEST_KEY")
if not UAS_LATEST_KEY:
    # Without this check, the URL below would be built with "k=None"
    raise RuntimeError("UAS_LATEST_KEY is not set; add it to your .env file")

# Data from UAS
uas = m2c2.core.pipeline.LASSIE().load(source_name="UAS", source_path=f"https://uas.usc.edu/survey/uas/m2c2_ess/admin/export_m2c2.php?k={UAS_LATEST_KEY}")
uas.assure(required_columns=m2c2.core.config.settings.STANDARD_GROUPING_FOR_AGGREGATION)
uas.score()
uas.inspect()
uas.export(file_basename="uas_export", directory="tidy/uas_scored")
uas.export_codebook(filename="codebook_uas.md", directory="tidy/uas_scored")
```