## Abstract

First data on passive data collection using smartwatches and GPS from the PREACT study.

## Introduction

Treatment personalization is widely discussed as a way to counteract insufficient response rates in psychotherapy. In the search for criteria that allow informed treatment selection or adaptation, ambulatory assessment data (e.g. EMA, passive sensing) are a key component, because they capture processes outside therapy sessions at high temporal and/or spatial resolution.

PREACT is a multicenter prospective-longitudinal study investigating different predictors of non-response (e.g. EEG, fMRI) in around 500 patients undergoing cognitive behavioral therapy for internalizing disorders (https://forschungsgruppe5187.de/de).

## Methods
Patients can enroll in therapy-accompanying ambulatory assessment. They are provided with a customized study app and a state-of-the-art smartwatch that collects passive data such as GPS and heart rate for up to 365 days. In parallel, three 14-day EMA phases (pre-, mid- and post-therapy) cover transdiagnostic (e.g. emotion regulation), contextual and therapy-related aspects.

Here, we present first results on data compliance and quality for the passive sensing data as well as the EMA assessments. The results are based on data downloaded on **15.04.2024**.
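
The notebook below works with a precomputed `data_coverage_per` column whose exact definition is not shown here. As a minimal sketch, assuming coverage means the share of potential study days on which at least one record exists, it could be derived as follows. The function name `coverage_percent` and the toy values are hypothetical; the column names `customer`, `relative_day` and `potential_days_coverage` are taken from the cells further down.

```python
import pandas as pd


def coverage_percent(df, id_col="customer", day_col="relative_day",
                     potential_col="potential_days_coverage"):
    """Percentage of potential days with at least one record, per participant.

    Assumption: this is one plausible definition of coverage, not necessarily
    the one used to build `data_coverage_per` in the study data.
    """
    days_with_data = df.groupby(id_col)[day_col].nunique()
    potential_days = df.groupby(id_col)[potential_col].first()
    return (days_with_data / potential_days * 100).rename("data_coverage_per")


# Toy data (made up for illustration): participant "a" has records on
# 2 of 4 potential days, participant "b" on 1 of 4.
toy = pd.DataFrame({
    "customer": ["a", "a", "a", "b"],
    "relative_day": [1, 1, 3, 2],
    "potential_days_coverage": [4, 4, 4, 4],
})
coverage = coverage_percent(toy)
print(coverage)
```

Counting unique days rather than rows keeps participants with many records on a single day from inflating their coverage.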

## Results



In [1]:
import os
import glob
import pickle
from IPython.display import Markdown


import pandas as pd
import datetime as dt
from datetime import date, datetime
import numpy as np
from scipy.stats import kruskal

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from config import datapath, proj_sheet
import help_functions 

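# Date of the data export (DDMMYYYY); used in the pickle file names below
today = "15042024"

# Load the EMA, GPS, passive-sensing and monitoring data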

with open(datapath + f'/ema_data_{today}.pkl', 'rb') as file:
    df_active = pickle.load(file)

with open(datapath + f'/gps_data_{today}.pkl', 'rb') as file:
    df_gps = pickle.load(file)
    
with open(datapath + f'/passive_data_{today}.pkl', 'rb') as file:
    df_passive = pickle.load(file)

with open(datapath + f'/monitoring_data_{today}.pkl', 'rb') as file:
    df_monitoring = pickle.load(file)


### Demographics

In [2]:
# Variables to summarize in the demographics table
continuous_vars = ['age', 'bsi_gsi', 'ses', 'costs_ema_burden', 'costs_ema_min']
categorical_vars = ['gender_description', 'scid_cv_description', 'ema_smartphone_description', 'employability_description',
                   'prior_treatment_description', 'ema_wear_exp', 'ema_special_event_description']


# Create the table
demographic_table = help_functions.format_demographics(df_monitoring, continuous_vars, categorical_vars)

# Convert to Markdown and display
markdown_output = demographic_table.to_markdown()
display(Markdown(markdown_output))

| Variable                                                      | Overall        | Min   | Max    | Missing (%)   |
|:--------------------------------------------------------------|:---------------|:------|:-------|:--------------|
| **age, mean (SD)**                                            | 32.45 (10.88)  | 19.00 | 65.00  | 13 (6.31%)    |
| **bsi_gsi, mean (SD)**                                        | 1.30 (0.57)    | 0.32  | 3.68   | 25 (12.14%)   |
| **ses, mean (SD)**                                            | 2.32 (1.03)    | 1.00  | 5.00   | 13 (6.31%)    |
| **costs_ema_burden, mean (SD)**                               | 2.64 (1.09)    | 1.00  | 5.00   | 91 (44.17%)   |
| **costs_ema_min, mean (SD)**                                  | 90.47 (178.43) | 1.00  | 999.00 | 100 (48.54%)  |
| **gender_description**                                        |                |       |        | 13 (6.31%)    |
| female, n (%)                                                 | 118 (61.14%)   |       |        |               |
| male, n (%)                                                   | 69 (35.75%)    |       |        |               |
| not specified, n (%)                                          | 3 (1.55%)      |       |        |               |
| diverse, n (%)                                                | 2 (1.04%)      |       |        |               |
| no gender, n (%)                                              | 1 (0.52%)      |       |        |               |
| **scid_cv_description**                                       |                |       |        | 14 (6.80%)    |
| Depressive Disorder, n (%)                                    | 92 (47.92%)    |       |        |               |
| Social Anxiety Disorder, n (%)                                | 32 (16.67%)    |       |        |               |
| Obsessive-Compulsive Disorder, n (%)                          | 27 (14.06%)    |       |        |               |
| Generalized Anxiety Disorder, n (%)                           | 15 (7.81%)     |       |        |               |
| Agoraphobia and/or Panic Disorder, n (%)                      | 11 (5.73%)     |       |        |               |
| Post-Traumatic Stress Disorder, n (%)                         | 11 (5.73%)     |       |        |               |
| Specific Phobia, n (%)                                        | 4 (2.08%)      |       |        |               |
| **ema_smartphone_description**                                |                |       |        | 0 (0.00%)     |
| iPhone, n (%)                                                 | 109 (52.91%)   |       |        |               |
| Android, n (%)                                                | 97 (47.09%)    |       |        |               |
| **employability_description**                                 |                |       |        | 13 (6.31%)    |
| employable, n (%)                                             | 161 (83.42%)   |       |        |               |
| unemployable (on sick leave), n (%)                           | 22 (11.4%)     |       |        |               |
| other, n (%)                                                  | 5 (2.59%)      |       |        |               |
| on disability pension, n (%)                                  | 5 (2.59%)      |       |        |               |
| **prior_treatment_description**                               |                |       |        | 13 (6.31%)    |
| no prior treatment, n (%)                                     | 76 (39.38%)    |       |        |               |
| outpatient psychotherapy, n (%)                               | 65 (33.68%)    |       |        |               |
| both, n (%)                                                   | 30 (15.54%)    |       |        |               |
| inpatient or partial inpatient treatment/psychotherapy, n (%) | 15 (7.77%)     |       |        |               |
| yes, n (%)                                                    | 7 (3.63%)      |       |        |               |
| **ema_wear_exp**                                              |                |       |        | 20 (9.71%)    |
| 0.0, n (%)                                                    | 109 (58.6%)    |       |        |               |
| 1.0, n (%)                                                    | 77 (41.4%)     |       |        |               |
| **ema_special_event_description**                             |                |       |        | 0 (0.00%)     |
| usual, n (%)                                                  | 150 (72.82%)   |       |        |               |
| special event, n (%)                                          | 56 (27.18%)    |       |        |               |
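
The table above comes from `help_functions.format_demographics`, whose implementation is not part of this notebook. As a rough sketch of how the rows for continuous variables might be produced, the hypothetical helper `summarize_continuous` below computes mean (SD), minimum, maximum and the share of missing values with plain pandas. It is an illustration under that assumption, not the project's actual helper, and the toy values are made up.

```python
import pandas as pd


def summarize_continuous(df, columns):
    """Build 'mean (SD) / min / max / missing' rows for continuous variables."""
    rows = []
    for col in columns:
        values = df[col]
        n_missing = int(values.isna().sum())
        rows.append({
            "Variable": f"{col}, mean (SD)",
            # pandas skips NaN by default and uses the sample SD (ddof=1)
            "Overall": f"{values.mean():.2f} ({values.std():.2f})",
            "Min": f"{values.min():.2f}",
            "Max": f"{values.max():.2f}",
            "Missing (%)": f"{n_missing} ({n_missing / len(values) * 100:.2f}%)",
        })
    return pd.DataFrame(rows).set_index("Variable")


# Toy data (made up for illustration)
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0, float("nan")]})
summary = summarize_continuous(toy, ["age"])
print(summary)
```

Note that the missing percentage here is relative to all rows, which matches the way the table reports, for example, 13 missing values as 6.31%.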

In [3]:
# Pie chart for 'scid_cv_description'
scid_counts = df_monitoring['scid_cv_description'].value_counts().reset_index()
scid_counts.columns = ['SCID_CV_Category', 'Counts']  # Renaming columns for clarity in Plotly

fig = px.pie(scid_counts, values='Counts', names='SCID_CV_Category',
             title='Distribution of SCID Primary Categories',
             color_discrete_sequence=px.colors.qualitative.Set3,
             labels={'SCID_CV_Category': 'SCID Category'})  # Labels for hover data

fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()

### Study status

In [4]:
# Creating a subplot layout
fig = make_subplots(rows=1, cols=2, subplot_titles=('Histogram of Study Versions', 'Histogram of Status'))
total_participants = df_monitoring.customer.nunique()

# Histogram for 'study_version'
study_version_data = df_monitoring['study_version'].value_counts().reset_index()
study_version_data.columns = ['Study Version', 'Counts']
fig.add_trace(
    go.Bar(
        x=study_version_data['Study Version'],
        y=study_version_data['Counts'],
        marker=dict(color='lightseagreen'),
        text=study_version_data['Counts'],
        textposition='auto',
    ),
    row=1, col=1
)

# Define active statuses
active_statuses = ['Erhebung_1_aktiv', 'Post_Erhebung_1', 'Post_Erhebung_2', 'Erhebung_2_aktiv']

# Histogram for 'status' with conditional coloring
status_data = df_monitoring['status'].value_counts().reset_index()
status_data.columns = ['Status', 'Counts']

# Generate colors based on condition
colors = ['lightseagreen' if status in active_statuses else 'gray' for status in status_data['Status']]

fig.add_trace(
    go.Bar(
        x=status_data['Status'],
        y=status_data['Counts'],
        marker=dict(color=colors),
        text=status_data['Counts'],
        textposition='auto',
    ),
    row=1, col=2
)

# Update layout settings
fig.update_layout(
    title_text= f"Study version and study status for N= {total_participants} patients",
    showlegend=False,
    height=500, width=1000
)
fig.show()

### GPS

In [5]:
df_gps_merged = df_gps.merge(df_monitoring, on = "customer", how="inner")
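
An inner merge silently drops participants that appear in only one of the two tables. A small check like the one below can show how many IDs are affected. It is a sketch with a hypothetical helper name (`merge_report`) and toy IDs; it relies only on the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd


def merge_report(left, right, key="customer"):
    """Count IDs found in both tables, only in the left, or only in the right."""
    left_ids = pd.DataFrame({key: left[key].unique()})
    right_ids = pd.DataFrame({key: right[key].unique()})
    merged = left_ids.merge(right_ids, on=key, how="outer", indicator=True)
    counts = merged["_merge"].value_counts()
    return {str(label): int(n) for label, n in counts.items()}


# Toy data (made up): ID 1 has sensor data but no monitoring entry,
# ID 4 has a monitoring entry but no sensor data.
toy_sensor = pd.DataFrame({"customer": [1, 1, 2, 3]})
toy_monitoring = pd.DataFrame({"customer": [2, 3, 4]})
report = merge_report(toy_sensor, toy_monitoring)
print(report)
```

Applied to `df_gps` and `df_monitoring`, a non-zero `left_only` count would point to GPS records without a matching monitoring entry.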

In [6]:
df_gps_coverage = df_gps_merged[["customer", "data_coverage_per", "bsi_gsi", "age", "ema_smartphone_description",
                                "scid_cv_description", "gender_description"]].drop_duplicates(subset=["customer"])

In [7]:
# GPS coverage by smartphone type (one row per participant)
fig = px.violin(df_gps_coverage, 
                y="data_coverage_per", 
                x="ema_smartphone_description", 
                box=True,  # Shows a box plot inside the violin
                title="GPS Coverage per Smartphone")

fig.show()

In [8]:
# GPS coverage by primary SCID diagnosis (one row per participant)
fig = px.violin(df_gps_coverage, 
                y="data_coverage_per", 
                x="scid_cv_description", 
                points='all',
                box=True,  # Shows a box plot inside the violin
                title="GPS Coverage per SCID Diagnosis")

fig.show()
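
The notebook imports `kruskal` from `scipy.stats` but does not show a test on these groups. As a hedged sketch, a Kruskal-Wallis test of coverage across diagnostic groups could look like the following. The helper `kruskal_by_group`, the `min_n` threshold and the toy values are assumptions made for illustration, not results from the study.

```python
import pandas as pd
from scipy.stats import kruskal


def kruskal_by_group(df, value_col, group_col, min_n=2):
    """Kruskal-Wallis H test of `value_col` across the levels of `group_col`.

    Groups with fewer than `min_n` non-missing values are skipped
    (an assumption of this sketch, chosen to avoid near-empty groups).
    """
    samples = [g[value_col].dropna().to_numpy() for _, g in df.groupby(group_col)]
    samples = [s for s in samples if len(s) >= min_n]
    if len(samples) < 2:
        raise ValueError("Need at least two groups with enough observations.")
    statistic, p_value = kruskal(*samples)
    return statistic, p_value, len(samples)


# Toy data (made up for illustration)
toy = pd.DataFrame({
    "data_coverage_per": [10, 20, 30, 40, 50, 60, 70, 80, 90],
    "scid_cv_description": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
})
statistic, p_value, n_groups = kruskal_by_group(
    toy, "data_coverage_per", "scid_cv_description")
print(f"H = {statistic:.2f}, p = {p_value:.4f}, groups = {n_groups}")
```

With small groups such as Specific Phobia (n = 4 in the demographics table), the result of such a test would need cautious interpretation.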

In [9]:
fig = px.scatter(df_gps_coverage, 
                 x='bsi_gsi', 
                 y='data_coverage_per',
                 title='Correlation between data coverage and Global Severity Index',
                 labels={'bsi_gsi': 'BSI_GSI', 'data_coverage_per': 'Data Coverage (%)'},
                 trendline="ols")  # Adds a trend line using ordinary least squares regression

fig.show()
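
The OLS trend line in the scatter plot gives a visual impression only. A rank correlation on complete cases is one way to quantify the association, sketched below with `scipy.stats.spearmanr`. The helper name `spearman_complete_cases` and the toy values are hypothetical; choosing Spearman over Pearson is an assumption, made because coverage percentages are bounded and often skewed.

```python
import pandas as pd
from scipy.stats import spearmanr


def spearman_complete_cases(df, x, y):
    """Spearman rank correlation using only rows where both values are present."""
    complete = df[[x, y]].dropna()
    rho, p_value = spearmanr(complete[x], complete[y])
    return rho, p_value, len(complete)


# Toy data (made up): the last participant has no BSI score and is dropped.
toy = pd.DataFrame({
    "bsi_gsi": [0.5, 1.0, 1.5, 2.0, float("nan")],
    "data_coverage_per": [90.0, 80.0, 70.0, 60.0, 50.0],
})
rho, p_value, n_used = spearman_complete_cases(toy, "bsi_gsi", "data_coverage_per")
print(f"rho = {rho:.2f}, n = {n_used}")
```

Reporting `n_used` alongside the coefficient matters here because `bsi_gsi` is missing for about 12% of participants.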

### Passive data

In [10]:
df_pd_merged = df_passive.merge(df_monitoring, on = "customer", how="inner")

In [11]:
df_pd_coverage = df_pd_merged[["customer", "data_coverage_per", "bsi_gsi", "age", "ema_smartphone_description",
                                "scid_cv_description", "gender_description"]].drop_duplicates(subset=["customer"])

In [12]:
# Passive data coverage by smartphone type (one row per participant)
fig = px.violin(df_pd_coverage, 
                y="data_coverage_per", 
                x="ema_smartphone_description", 
                box=True,  # Shows a box plot inside the violin
                title="Passive Data Coverage per Smartphone")

fig.show()

In [13]:
# Passive data coverage by primary SCID diagnosis (one row per participant)
fig = px.violin(df_pd_coverage, 
                y="data_coverage_per", 
                x="scid_cv_description", 
                box=True,  # Shows a box plot inside the violin
                points='all',
                title="Passive Data Coverage per SCID Diagnosis")

#fig.show()

### EMA data

In [14]:
df_active.columns

Index(['customer', 'sessionRun', 'quest_create', 'quest_complete', 'study',
       'ema_id', 'study_version', 'for_id', 'status', 'ema_start_date',
       'ema_smartphone', 'ema_wear_exp', 'ema_special_event', 'age', 'gender',
       'marital_status', 'partnership', 'graduation', 'profession',
       'years_of_education', 'employability', 'ses', 'somatic_problems',
       'scid_cv_prim_cat', 'bsi_somatization', 'prior_treatment',
       'bsi_compulsivity', 'bsi_insecurity', 'bsi_depression', 'bsi_anxiety',
       'bsi_aggression', 'bsi_phobia', 'bsi_paranoia', 'bsi_psychotizism',
       'bsi_additional', 'bsi_gs', 'bsi_gsi', 'gender_description',
       'scid_cv_description', 'marital_status_description',
       'employability_description', 'graduation_description',
       'profession_description', 'prior_treatment_description',
       'ema_smartphone_description', 'ema_special_event_description',
       'costs_ema_min', 'costs_ema_burden', 'costs_ema_text',
       'quest_create_day', 

In [15]:
df_gps.relative_day.max()

187

### Data coverage across time

In [16]:
# GPS: keep participants who are active or have completed ('Abgeschlossen') ...
df_gps_merged_active = df_gps_merged.loc[df_gps_merged.status.isin(['Erhebung_1_aktiv', 'Post_Erhebung_1', 'Post_Erhebung_2','Erhebung_2_aktiv', 'Abgeschlossen'])]
# ... and who take part in the long study version
df_gps_merged_active = df_gps_merged_active.loc[df_gps_merged_active.study_version.isin(['Lang', 'Lang(Wechsel)'])]

In [17]:
# Smartwatch: same status and study-version filter as for the GPS data
df_pd_merged_active = df_pd_merged.loc[df_pd_merged.status.isin(['Erhebung_1_aktiv', 'Post_Erhebung_1', 'Post_Erhebung_2','Erhebung_2_aktiv', 'Abgeschlossen'])]
df_pd_merged_active = df_pd_merged_active.loc[df_pd_merged_active.study_version.isin(['Lang', 'Lang(Wechsel)'])]

In [18]:
# Restrict both datasets to participants with at least 60 potential days of coverage
df_gps_merged_active = df_gps_merged_active.loc[df_gps_merged_active.potential_days_coverage >=60]
df_pd_merged_active = df_pd_merged_active.loc[df_pd_merged_active.potential_days_coverage >=60]

In [19]:
# Calculate the total number of unique customers in each DataFrame
total_gps_customers = df_gps_merged_active['customer'].nunique()
total_smartwatch_customers = df_pd_merged_active['customer'].nunique()

# Calculate the number of unique customers per day for GPS and smartwatch data
gps_customer_data = df_gps_merged_active.groupby('relative_day')['customer'].nunique().reset_index()
smartwatch_customer_data = df_pd_merged_active.groupby('relative_day')['customer'].nunique().reset_index()

# Calculate percentages
gps_customer_data['percentage_of_customers'] = (gps_customer_data['customer'] / total_gps_customers) * 100
smartwatch_customer_data['percentage_of_customers'] = (smartwatch_customer_data['customer'] / total_smartwatch_customers) * 100

# Create a DataFrame with all days from 1 to 60
all_days = pd.DataFrame({'relative_day': range(1, 61)})

# Merge and fill missing days with 0 for both datasets
complete_gps_data = all_days.merge(gps_customer_data, on='relative_day', how='left').fillna(0)
complete_smartwatch_data = all_days.merge(smartwatch_customer_data, on='relative_day', how='left').fillna(0)

# Merge both datasets into a single DataFrame for plotting
plot_data = complete_gps_data[['relative_day', 'percentage_of_customers']].rename(columns={'percentage_of_customers': 'gps_percentage'})
plot_data = plot_data.merge(complete_smartwatch_data[['relative_day', 'percentage_of_customers']], on='relative_day', how='left').fillna(0).rename(columns={'percentage_of_customers': 'smartwatch_percentage'})

# Create the line plot
fig = px.line(plot_data,
              x='relative_day',
              y=['gps_percentage', 'smartwatch_percentage'],
              title=f'Coverage of GPS and Smartwatch data. N(GPS)={total_gps_customers}, N(Smartwatch)={total_smartwatch_customers}',
              labels={'value': 'Percentage of available patients', 'variable': 'Data Source', 'relative_day': 'Relative Day'},
              markers=True)

# Add a vertical line at relative day 15
fig.add_vline(x=15, line_width=1, line_dash="dash", line_color="grey")

# Update legend titles
fig.update_layout(legend_title_text='Data Source')
fig.update_traces(mode='lines+markers')

fig.show()
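
The per-day percentage logic in the cell above is repeated for each data source. As a sketch of how it could be factored out, the hypothetical helper `daily_percentage` below counts unique participants per day, fills days without any data with 0 and divides by the number of participants in the input. The column names follow the notebook; the toy data are made up.

```python
import pandas as pd


def daily_percentage(df, max_day, id_col="customer", day_col="relative_day"):
    """Percentage of participants with at least one record on each day 1..max_day."""
    total = df[id_col].nunique()
    per_day = df.groupby(day_col)[id_col].nunique()
    # Days on which nobody contributed data are missing from the groupby result
    per_day = per_day.reindex(range(1, max_day + 1), fill_value=0)
    return (per_day / total * 100).rename("percentage_of_customers")


# Toy data (made up): "a" has data on days 1 and 2, "b" only on day 1.
toy = pd.DataFrame({
    "customer": ["a", "a", "b"],
    "relative_day": [1, 2, 1],
})
percentages = daily_percentage(toy, max_day=3)
print(percentages)
```

Calling it once with `df_gps_merged_active` and once with `df_pd_merged_active` would yield the two series that the line plot compares.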


In [20]:
# Calculate the total number of unique customers in each category from the updated DataFrame
total_customers_by_category = df_gps_merged_active.groupby('ema_smartphone_description')['customer'].nunique()

# Calculate the number of unique customers per day for each category from the updated DataFrame
category_customer_data = df_gps_merged_active.groupby(['relative_day', 'ema_smartphone_description'])['customer'].nunique().reset_index()

# Calculate percentages for each category
category_customer_data['percentage_of_customers'] = category_customer_data.apply(
    lambda row: (row['customer'] / total_customers_by_category[row['ema_smartphone_description']]) * 100, axis=1)

# Create a DataFrame with all days from 1 to 100 for merging
all_days = pd.DataFrame({'relative_day': range(1, 101)})

# Merge to ensure every day is represented in each category, fill missing values with 0
frames = []  # Collect one completed frame per smartphone category
for category in category_customer_data['ema_smartphone_description'].unique():
    temp_data = category_customer_data[category_customer_data['ema_smartphone_description'] == category]
    merged_data = all_days.merge(temp_data[['relative_day', 'percentage_of_customers']], on='relative_day', how='left').fillna(0)
    merged_data['ema_smartphone_description'] = category
    frames.append(merged_data)
plot_data = pd.concat(frames)  # Concatenate once instead of growing a DataFrame in the loop

# Create the line plot
fig = px.line(plot_data,
              x='relative_day',
              y='percentage_of_customers',
              color='ema_smartphone_description',
              title='GPS coverage by Smartphone Type',
              labels={'percentage_of_customers': '% Patients', 'relative_day': 'Relative Day'},
              markers=True)

# Update the layout to adjust legend and traces
fig.update_layout(legend_title_text='Smartphone Type')
fig.update_traces(mode='lines+markers')

fig.show()