<a href="https://colab.research.google.com/github/mDOT-Center/mResearch-toolkit/blob/main/mDOT_mResearch_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction and Overview

The mResearch project has produced many components, and this tutorial brings most of them together through a unified stress classification example.

For the purposes of this example, we utilize a publicly available dataset, [WESAD](https://archive.ics.uci.edu/ml/datasets/WESAD+%28Wearable+Stress+and+Affect+Detection%29), which contains the necessary ECG and PPG data along with labeled stress and non-stress time blocks.  We will use the [Cerebral Cortex](https://github.com/MD2Korg/CerebralCortex-Kernel/) platform to perform the initial data ingestion and signal cleaning, and as a data storage system for the other components. We will also demonstrate Cerebral Cortex's computation and algorithm capabilities, along with some basic data visualizations to help you understand the data.

Next, the [mFlow](https://github.com/mlds-lab/mFlow) system will be used to run a range of experimental workflows that focus on the train/test or cross-validation performance of several stress classifiers when run against the WESAD data.

To further augment the capabilities of the mResearch project, the [BioGen LINK REQUIRED]() component allows signals to be generated (e.g., ECG from PPG or PPG from ECG).  We will use it to transform the existing WESAD data and generate new signals.  The mFlow system will then be used to compare the performance of the modeling system as well as the performance of the BioGen data generators.



# Tools



## Cerebral Cortex

Cerebral Cortex is MD2K's big data cloud tool designed to support population-scale data analysis, visualization, model development, and intervention design for mobile-sensor data. It provides the ability to do machine learning model development on population scale datasets and provides interoperable interfaces for aggregation of diverse data sources.

### Additional Information

1. Getting Started
  * [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MD2Korg/CerebralCortex-Kernel/blob/master/examples/datastream_operation.ipynb) Workflow showing how to perform a basic operation (e.g., read, write, analyze data).

2. How to use builtin CC algorithms
  * [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MD2Korg/CerebralCortex-Kernel/blob/master/examples/cc_algorithms.ipynb) Working example on how to use CC builtin algorithms. GPS clustering algorithm is used in this example.

3. How to import data and run analysis on it using CC
  * [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MD2Korg/CerebralCortex-Kernel/blob/master/examples/import_and_analyse_data.ipynb) This notebook shows how you can import your data into CC and use CC to analyse the data.

4. Complete Computational Pipeline (ECG Data to Stress Probabilities)
  * [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MD2Korg/CerebralCortex-Kernel/blob/master/examples/CC3-Stress-From-ECG.ipynb) This notebook contains a complete computational pipeline for calculating stress probabilities from ECG raw data using CC-kernel builtin algorithms and feature computational capabilities. 

## BioGen Simulator

A hierarchical generative model for biological signals (PPG, ECG, etc.) that keeps the physiological characteristics intact.

## Synthetic Data Generation


### Additional Information
  * [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1AtLMjPT_hCosX4sr403qA8AWA-XxL5AZ)

## mFlow

### Additional Information
#### Experiment Design Examples
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-BasicTrainTest.ipynb) Workflow showing how to perform a basic train-test experiment.

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-ComparingModels.ipynb) Workflow comparing multiple models under a basic train-test experiment design.

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-BasicCV.ipynb) Workflow showing how to perform a cross-validation assessment experimental workflow comparing two models.  

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-BasicLOSO.ipynb) Workflow showing how to perform a basic leave-one-subject-out experiment.

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-BasicWithin.ipynb) Workflow showing how to perform a within-subject train-test split experiment. 

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-ComparingPersonalization.ipynb) Workflow comparing the leave-one-subject-out and within-subject experimental designs.

#### Data Preprocessing Comparisons

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-ComparingNormalization.ipynb) Workflow comparing the use of normalization to no normalization in a basic train-test experiment design.

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-ComparingImputers.ipynb) Workflow comparing mean imputation to zero-imputation in a train-test experiment design.

#### Hyper-parameter Optimization Examples

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-ComparingModels-NestedCV.ipynb) Workflow comparing default hyper-parameters to optimized hyper-parameters for the same models in a basic train-test experiment design.

#### Workflow Execution Examples

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/ExtraSensory-CompareBackends.ipynb) Experiment comparing workflow scheduling backends.

#### Cerebral Cortex Integration Examples

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/WESAD-ECG-RR.ipynb) Workflow using Cerebral Cortex feature extractors to get ECG with data quality indicators as well as RR intervals from ECG data. 

* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlds-lab/mFlow/blob/master/Examples/WESAD-BasicTrainTest.ipynb) Basic train/test experimental workflow comparing multiple models using Cerebral Cortex feature extractors. 


# Data

We use the [WESAD](https://archive.ics.uci.edu/ml/datasets/WESAD+%28Wearable+Stress+and+Affect+Detection%29) dataset to demonstrate Cerebral Cortex Kernel capabilities. WESAD is a publicly available dataset for wearable stress and affect detection. This multimodal dataset features physiological and motion data, recorded from both a wrist-worn and a chest-worn device, of 15 subjects during a lab study. The following sensor modalities are included: blood volume pulse, electrocardiogram, electrodermal activity, electromyogram, respiration, body temperature, and three-axis acceleration. Moreover, the dataset bridges the gap between previous lab studies on stress and emotions by containing three different affective states (neutral, stress, amusement). In addition, the dataset contains self-reports of the subjects, which were obtained using several established questionnaires.
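
For a stress vs. non-stress classification task, the three affective states have to be reduced to a binary label. The following is a minimal sketch of one way to do that, assuming the integer condition codes described in the WESAD readme (1 = baseline, 2 = stress, 3 = amusement, with other codes marking transient or unused segments). The function name `to_binary_stress` is hypothetical and is not part of Cerebral Cortex or mFlow; check the codes against your copy of the dataset before relying on them.

```python
# Condition codes are an assumption taken from the WESAD readme;
# verify them against the dataset you download.
BASELINE, STRESS, AMUSEMENT = 1, 2, 3


def to_binary_stress(labels):
    """Map WESAD condition codes to 1 (stress) or 0 (non-stress).

    Samples whose code is not baseline, stress, or amusement are dropped.
    Returns a list of (sample_index, binary_label) pairs for the kept samples.
    """
    kept = []
    for index, code in enumerate(labels):
        if code == STRESS:
            kept.append((index, 1))
        elif code in (BASELINE, AMUSEMENT):
            kept.append((index, 0))
    return kept


# Toy label sequence for illustration only (not real WESAD data).
example_labels = [0, 1, 1, 2, 2, 3, 4, 0]
print(to_binary_stress(example_labels))
```

Treating amusement as non-stress is a design choice made here for illustration; a three-class setup would keep the codes separate instead.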

# Setting up the environment

Colab does not include the runtime environment needed to run this tutorial. The following commands download and install the required tools, frameworks, and datasets.


In [None]:
import importlib.util, sys, os  # importlib.util must be imported explicitly for find_spec
from os.path import expanduser
sys.path.insert(0, os.path.abspath('..'))

DOWNLOAD_USER_DATA=True
ALL_USERS=False  # only takes effect if DOWNLOAD_USER_DATA=True
IN_COLAB = 'google.colab' in sys.modules
MD2K_JUPYTER_NOTEBOOK = "MD2K_JUPYTER_NOTEBOOK" in os.environ
# Always define the flag so the environment check further down cannot raise NameError.
IN_JUPYTER_NOTEBOOK = get_ipython().__class__.__name__ == "ZMQInteractiveShell"
JAVA_HOME_DEFINED = "JAVA_HOME" in os.environ
SPARK_HOME_DEFINED = "SPARK_HOME" in os.environ
PYSPARK_PYTHON_DEFINED = "PYSPARK_PYTHON" in os.environ
PYSPARK_DRIVER_PYTHON_DEFINED = "PYSPARK_DRIVER_PYTHON" in os.environ
HAVE_CEREBRALCORTEX_KERNEL = importlib.util.find_spec("cerebralcortex") is not None
SPARK_VERSION = "3.1.2"
SPARK_URL = "https://archive.apache.org/dist/spark/spark-"+SPARK_VERSION+"/spark-"+SPARK_VERSION+"-bin-hadoop2.7.tgz"
SPARK_FILE_NAME = "spark-"+SPARK_VERSION+"-bin-hadoop2.7.tgz"
CEREBRALCORTEX_KERNEL_VERSION = "3.3.14"

DATA_PATH = expanduser("~")
if not DATA_PATH.endswith("/"):  # append a trailing slash only when it is missing
    DATA_PATH+="/"
USER_DATA_PATH = DATA_PATH+"cc_data/"

if MD2K_JUPYTER_NOTEBOOK:
    print("Java, Spark, and CerebralCortex-Kernel are installed and paths are already setup.")
else:

    SPARK_PATH = DATA_PATH+"spark-"+SPARK_VERSION+"-bin-hadoop2.7/"
    

    if(not HAVE_CEREBRALCORTEX_KERNEL):
        print("Installing CerebralCortex-Kernel")
        !pip -q install cerebralcortex-kernel==$CEREBRALCORTEX_KERNEL_VERSION
    else:
        print("CerebralCortex-Kernel is already installed.")

    if not JAVA_HOME_DEFINED:
        if not os.path.exists("/usr/lib/jvm/java-8-openjdk-amd64/") and not os.path.exists("/usr/lib/jvm/java-11-openjdk-amd64/"):
            print("\nInstalling/Configuring Java")
            !sudo apt update
            !sudo apt-get install -y openjdk-8-jdk-headless
            os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/"
        elif os.path.exists("/usr/lib/jvm/java-8-openjdk-amd64/"):
            print("\nSetting up Java path")
            os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/"
        elif  os.path.exists("/usr/lib/jvm/java-11-openjdk-amd64/"):
            print("\nSetting up Java path")
            os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
    else:
        print("JAVA is already installed.")

    if (IN_COLAB or IN_JUPYTER_NOTEBOOK) and not MD2K_JUPYTER_NOTEBOOK:
        if SPARK_HOME_DEFINED:
            print("SPARK is already installed.")
        elif not os.path.exists(SPARK_PATH):
            print("\nSetting up Apache Spark ", SPARK_VERSION)
            !pip -q install findspark
            import pyspark
            spark_installation_path = os.path.dirname(pyspark.__file__)
            import findspark
            findspark.init(spark_installation_path)
            if not os.getenv("PYSPARK_PYTHON"):
                os.environ["PYSPARK_PYTHON"] = os.popen('which python3').read().replace("\n","")
            if not os.getenv("PYSPARK_DRIVER_PYTHON"):
                os.environ["PYSPARK_DRIVER_PYTHON"] = os.popen('which python3').read().replace("\n","")
        else:
            print("SPARK is already installed.")
    else:
        raise SystemExit("Please check your environment configuration at: https://github.com/MD2Korg/CerebralCortex-Kernel/")

if DOWNLOAD_USER_DATA:
    if not os.path.exists(USER_DATA_PATH):
        if ALL_USERS:
            print("\nDownloading all users' data.")
            !rm -rf $USER_DATA_PATH
            !wget -q http://mhealth.md2k.org/images/datasets/cc_data.tar.bz2 && tar -xf cc_data.tar.bz2 -C $DATA_PATH && rm cc_data.tar.bz2
        else:
            print("\nDownloading a user's data.")
            !rm -rf $USER_DATA_PATH
            !wget -q http://mhealth.md2k.org/images/datasets/s2_data.tar.bz2 && tar -xf s2_data.tar.bz2 -C $DATA_PATH && rm s2_data.tar.bz2
    else:
        print("Data already exist. Please remove folder", USER_DATA_PATH, "if you want to download the data again")

CerebralCortex-Kernel is already installed.
JAVA is already installed.
SPARK is already installed.
Data already exist. Please remove folder /root/cc_data/ if you want to download the data again
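
After the setup cell finishes, it can be useful to confirm that the environment variables it manages are actually populated. The sketch below is an optional sanity check, not part of the toolkit: the helper `missing_env_vars` and the list `CHECKED_VARS` are hypothetical names, and which variables matter depends on the branch the setup cell took in your environment.

```python
import os

# Variables the setup cell may set; treating all of them as required
# is an assumption made for this illustration.
CHECKED_VARS = ("JAVA_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")


def missing_env_vars(names=CHECKED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All checked environment variables are set.")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real process environment.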


# Load and explore data

Cerebral Cortex will be used to load the WESAD dataset into this notebook.  The following cells demonstrate how CC can be used to load, process, and visualize the WESAD data.

# mFlow Experiments

mFlow is a workflow execution framework that allows you to run repeatable experiments.  

Details...
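
To make one of the experiment designs concrete, the sketch below shows the idea behind a leave-one-subject-out split in plain Python. It does not use the mFlow API; the function `leave_one_subject_out` and the subject identifiers are hypothetical and serve only to illustrate how the train and test indices are formed for each held-out subject.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids holds one subject identifier per data instance.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy example: five instances from three made-up subjects.
subjects = ["S2", "S2", "S3", "S3", "S4"]
for held_out, train, test in leave_one_subject_out(subjects):
    print(held_out, "train:", train, "test:", test)
```

Because every instance of the held-out subject lands in the test set, the resulting score reflects performance on people the model has never seen, which is the property that distinguishes this design from a within-subject split.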

# Synthetic Data Generation
We use the BioGen simulator to generate synthetic data based on the raw signals in the WESAD dataset.  These signals will be read from Cerebral Cortex, and the generated signals will be stored back in the system so that mFlow can access them.

# mFlow experiments with synthetic data

mFlow makes it easy to rerun the same experiments to evaluate how synthetic data generators affect the outcomes of stress classifier models.  We compare and contrast performance within the WESAD dataset by switching the inputs mFlow uses between the actual and synthetic data.
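
The comparison boils down to holding the evaluation fixed and changing only the input source. The sketch below illustrates that pattern in plain Python under stated assumptions: it does not use the mFlow or BioGen APIs, the helper names `f1_score` and `compare_inputs` are hypothetical, and the prediction lists are made-up placeholders rather than results from any experiment.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score for the positive class, computed from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def compare_inputs(y_true, predictions_by_input):
    """Score every input source with the same metric and ground truth."""
    return {name: f1_score(y_true, preds)
            for name, preds in predictions_by_input.items()}


# Placeholder values for illustration only; these are not experimental results.
y_true = [1, 1, 0, 0, 1, 0]
predictions = {
    "actual": [1, 1, 0, 0, 1, 0],
    "synthetic": [1, 0, 0, 1, 1, 0],
}
for name, score in compare_inputs(y_true, predictions).items():
    print(f"{name}: F1 = {score:.3f}")
```

Keeping the metric and ground truth in one place means any difference between the two scores can be attributed to the input data rather than to the evaluation code.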
