# Appendix C: Use Case Dataset and Discussion

C.1 Use Case Demographics

The pilot-study dataset used as an example in the main text comes from a prospective case-control study involving 1584 male and female participants aged between 35 and 79 years who underwent the 6-Minute Walk Test (6MWT) indoors with an Apple Watch Series 3 (Apple Inc., CA, USA) using its Indoor Walk mode. All participants were enrolled via the Courtois Cardiovascular Signature (CCVS) Program, a clinical research program that aims to personalize the management of cardiovascular health and optimize cardiovascular patient outcomes using analytic approaches centered on digital health and artificial intelligence, genetics, and microbiomics, as well as environmental factors. The program involves 4000 participants in a longitudinal prospective case-control study over a 10-year period, through which information on cardiovascular events, diagnoses, and medication is collected in the CCVS biorepository. Participants in this study were first approached by their physician at their cardiovascular clinic visit, and their informed consent to the program was obtained.

Participants were separated into three cohorts (Appendix Table 5): patients with heart failure (HF, n=86), patients with other cardiovascular conditions (CVD non-HF, n=172), and healthy participants (healthy, n=1326). Outliers were excluded using the 1.5 × IQR criterion (values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR), yielding final sample sizes of n=74 (HF), n=161 (CVD non-HF), and n=1263 (healthy). Heart failure was defined as clinical signs and/or symptoms caused by a structural and/or functional cardiac abnormality, as defined by Bozkurt et al.,28 diagnosed at the cardiovascular clinic, together with N-terminal pro-B-type natriuretic peptide (NT-proBNP) levels above 125 pg/mL or objective evidence of cardiogenic pulmonary or systemic congestion. The HF cohort comprised patients with well-compensated heart failure who received regular cardiology care and whose quality of life was optimized in spite of their condition. Additional exclusion criteria were pregnancy, general MRI contraindications, and no understanding of English or French.
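The outlier rule above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: the text does not state which variable the criterion was applied to or which quartile convention was used, so the function name `iqr_filter`, the "inclusive" quartile method, and the sample values are all assumptions.

```python
from statistics import quantiles


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: quartiles use the "inclusive" interpolation method;
    # other conventions give slightly different fences.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical 6MWD values in metres (not study data).
distances = [410, 425, 430, 445, 450, 460, 470, 480, 495, 900]
kept = iqr_filter(distances)
```

With these illustrative values, only the 900 m entry falls outside the fences and is dropped.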

Based on American Thoracic Society guidelines, participants were asked to complete a 6MWT as part of their standard of care. The test consisted of asking participants to walk back and forth between two cones along a corridor (28 m lap length) for six minutes while wearing the Apple Watch on their dominant hand; time and walking distance were recorded manually. Patients were allowed to slow down, stop, or rest during the task if necessary. Fatigue and shortness of breath (SOB) severity were collected using the modified Borg scale, along with oxygen saturation (SpO2), heart rate (HR), and systolic and diastolic blood pressure (SBP, DBP). Participant height was entered into the phone paired with the Apple Watch prior to the 6MWT for calibration.

C.2 Correlation and Agreement Plots

This section includes the plots (Appendix Figure 3 and Appendix Figure 4) illustrating correlation and agreement, reflecting the Pearson correlation coefficient (ρ) and Bland-Altman (BA) analysis, respectively. The correlation between digital and manual measurements of the 6-minute walk distance (6MWD) is illustrated in Appendix Figure 3 for each cohort. The manually measured 6MWT distance is represented on the x axis, while the Apple Watch's digital 6MWD is represented on the y axis. Each scatter plot displays the linear regression line and the corresponding ρ value.

C.3 Use Case Interpretation

C.3.1 Interpreting Metrics

First, the Pearson correlation coefficient (ρ) was computed to quantify the strength of the relationship between the Apple Watch's digital measurements and the coordinator's manual 6MWT measurements of distance. The ρ values indicated a moderately strong correlation between the manual and digital 6MWD across all cohorts. Nevertheless, the concordance correlation coefficient (CCC) indicated poor levels of agreement for every group. The discrepancy between CCC and ρ values in each cohort highlights the difference in their respective functions and the nuance between correlation and agreement, showing that strong correlation is not synonymous with satisfactory agreement.
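The gap between correlation and agreement can be reproduced with a small sketch. It assumes the CCC referred to here is Lin's concordance correlation coefficient computed with 1/n moments; the function name `pearson_and_ccc` and the data are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_and_ccc(x, y):
    """Return (Pearson rho, Lin's CCC) for paired measurements."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Population (1/n) variances and covariance.
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    rho = sxy / sqrt(sxx * syy)
    # The squared mean shift in the denominator penalizes systematic bias.
    ccc = 2 * sxy / (sxx + syy + (mx - my) ** 2)
    return rho, ccc


# Hypothetical data: the digital device reads a constant 50 m too high.
manual = [300.0, 350.0, 400.0, 450.0, 500.0]
digital = [m + 50.0 for m in manual]
rho, ccc = pearson_and_ccc(manual, digital)
```

In this illustrative case the points lie on a perfect straight line, so ρ is 1, while the constant 50 m offset pulls the CCC down to 0.8.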

Next, the intraclass correlation coefficient (ICC), which gives insight into measurement replicability and captures both correlation and agreement in a single metric, yielded moderate reliability levels when computed with the two-way mixed-effects, single-rater, absolute-agreement model.
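As a rough sketch of how such an ICC can be obtained from the ANOVA mean squares (MSR, MSC, MSE), the snippet below uses the single-measurement, absolute-agreement form commonly attributed to McGraw and Wong. It is assumed, not confirmed, that this matches the equation selected in Appendix Table 3; the function name `icc_a1` and the data are hypothetical.

```python
def icc_a1(ratings):
    """Single-measurement, absolute-agreement ICC from a two-way layout.

    `ratings` holds one row per participant and one column per method.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA (no replication) sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical pairs (manual, digital) with a constant 50 m offset.
pairs = [[m, m + 50.0] for m in (300.0, 350.0, 400.0, 450.0, 500.0)]
icc = icc_a1(pairs)
```

Because the absolute-agreement form keeps the between-method mean square (MSC) in the denominator, a systematic offset lowers the ICC even when the two methods rank participants identically.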

The mean absolute percentage error (MAPE) was then computed to quantify forecasting accuracy and interpreted using Nelson et al.’s threshold,43 also used by Boudreaux et al.44 and Vetrovsky et al.45 in the digital health technology (DHT) literature, which sets a MAPE ≤ 10% as the validity criterion. Of note, other thresholds exist in the literature, such as Fokkema et al.’s 5% MAPE validity threshold,46 but are used less frequently. In this study, MAPE values fell below the 10% threshold in every cohort, indicating satisfactory systematic agreement. However, the values were marginal and approached the 10% cut-off point (HF: 9.2%, CVD: 9.8%, and Healthy: 9.5%), which may warrant critical appraisal and careful interpretation.
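For reference, the MAPE computation can be sketched as follows, taking the manual measurement as the reference value. The function name `mape` and the numbers are illustrative only.

```python
def mape(digital, manual):
    """Mean absolute percentage error of digital vs. manual values, in %."""
    errors = [abs(d - m) / abs(m) for d, m in zip(digital, manual)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical paired 6MWD values in metres (not study data).
manual = [400.0, 500.0, 250.0]
digital = [440.0, 475.0, 260.0]
value = mape(digital, manual)

# Validity criterion discussed above: MAPE <= 10 %.
is_valid = value <= 10.0
```

Here the individual errors are 10%, 5%, and 4%, so the MAPE is about 6.3% and the illustrative device would meet the 10% criterion.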

C.3.2 MCID Selection

Although DHT reliability assessment is based on a group of diverse metrics constituting a comprehensive statistical approach, Bland-Altman analysis, which constitutes the main component of agreement assessment, must ultimately yield a dichotomous output on the systematic reliability of a device in clinical research (i.e., reliable or not reliable). This requires setting a clinically relevant threshold appropriate to the specific test and patient population: the minimal clinically important difference (MCID).

This use case investigates the reliability of a device in measuring 6MWD in HF patients, so we selected an MCID appropriate to the task. Of note, the significance of the absolute MCID (in m) depends on the patient's 6MWD: a 50 m difference is less significant in a patient with a 600 m 6MWD than in a patient with a 250 m 6MWD, because differences matter more to HF patients with advanced disease progression and severe symptoms. Hence, the relative MCID (in %), coupled with the relative difference (percentage) variation of the Bland-Altman plot,26 represents a more universal accuracy threshold independent of disease severity, thus providing more relevant insight into device accuracy. Therefore, we set 10% as the clinically relevant threshold for the 6MWD, as it has been proposed as the relevant MCID by the Canadian Heart Failure Society,28 as well as the HFC-ARC Panel in the Journal of the American College of Cardiology (JACC).3
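To make the arithmetic concrete, the same 50 m absolute difference can be expressed relative to each patient's 6MWD and compared with the 10% relative MCID:

$$
\frac{50\ \text{m}}{600\ \text{m}} \approx 8.3\% < 10\%, \qquad \frac{50\ \text{m}}{250\ \text{m}} = 20\% > 10\%
$$

The same absolute difference therefore falls below the relative threshold for the first patient and above it for the second.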

C.3.3 Bland-Altman Analysis

After computing the relevant reliability metrics and determining the appropriate MCID, Bland-Altman analysis is used to evaluate the systematic bias between digital and manual measurements by comparing it to the relative MCID of 10% for HF patients in the case of the Apple Watch study. In other words, digital measurements exhibiting a discrepancy exceeding 10% relative to their manual counterpart were categorized as clinically unreliable Apple Watch measurements.

The Bland-Altman plots for the HF, CVD non-HF, and healthy cohorts provided in Appendix Figure 4 highlight systematic overestimation of 6MWD by the Apple Watch, as illustrated by the mean bias line lying above the perfect agreement line (y = 0%) in each cohort. Bland-Altman analysis yielded mean biases (95% CIs) of +5.9% (3.7, 8.1), +6.4% (4.8, 8.0), and +6.3% (5.8, 6.8) in the HF, CVD non-HF, and healthy cohorts, respectively. Hence, in every group, the mean bias and its 95% CI fall within the clinical threshold of ± 10%.
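The quantities reported above (mean bias, its 95% CI, and the limits of agreement on the relative scale) can be sketched as below. Several choices here are assumptions rather than the study's documented method: differences are expressed relative to the manual measurement, the CI uses a normal approximation instead of a t quantile, and the function name `bland_altman_relative` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def bland_altman_relative(digital, manual, mcid_pct=10.0):
    """Relative Bland-Altman summary, differences in % of the manual value."""
    # Assumption: the relative difference is taken against the manual value.
    diffs = [100.0 * (d - m) / m for d, m in zip(digital, manual)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    z = NormalDist().inv_cdf(0.975)  # about 1.96 (normal approximation)
    half_width = z * sd / sqrt(len(diffs))
    ci = (bias - half_width, bias + half_width)
    loa = (bias - z * sd, bias + z * sd)
    return {
        "bias": bias,
        "ci95": ci,
        "loa": loa,
        # True when the bias and its whole CI lie inside +/- MCID.
        "bias_within_mcid": -mcid_pct <= ci[0] and ci[1] <= mcid_pct,
    }


# Hypothetical paired 6MWD values in metres (not study data).
manual = [400.0, 500.0, 250.0, 320.0]
digital = [424.0, 525.0, 270.0, 336.0]
summary = bland_altman_relative(digital, manual)
```

For small cohorts a t-based interval would be wider than this normal approximation, so the sketch is best read as an outline of the computation rather than a replacement for a statistics package.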

Bland-Altman plots also provide the opportunity to assess the spread of the data beyond simply interpreting the limits of agreement (LoAs); looking for patterns in the spread of the dataset allows additional conclusions to be drawn about the tested device. For example, if absolute differences between digital and manual measurements (y axis) increase with the average values (x axis) on a standard BA plot while relative differences (y axis, %) stay constant on the corresponding relative BA plot, this could indicate that the device accumulates consistent small errors throughout the walk, thereby making agreement and reliability a function of the traveled distance (6MWD). Sevrukov et al. provide an example where the absolute difference between measurements appears to be an increasing function of the average.47 Although the lack of such a pattern suggests a different explanation for the suboptimal accuracy of the Apple Watch, it is always important to look for patterns in the spread of the dataset.
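One simple way to probe for the pattern described above is to regress the differences on the pairwise averages and compare the slopes on the absolute and relative scales. The sketch below is only one possible check, with a hypothetical helper `ols_slope` and simulated data in which the device overestimates every distance by 6%.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Simulated data (not study data): a constant 6 % overestimation.
manual = [300.0, 350.0, 400.0, 450.0, 500.0]
digital = [1.06 * m for m in manual]

averages = [(d + m) / 2 for d, m in zip(digital, manual)]
abs_diffs = [d - m for d, m in zip(digital, manual)]
rel_diffs = [100.0 * (d - m) / m for d, m in zip(digital, manual)]

slope_abs = ols_slope(averages, abs_diffs)  # grows with distance
slope_rel = ols_slope(averages, rel_diffs)  # essentially flat
```

A clearly positive slope on the absolute scale combined with a near-zero slope on the relative scale is the signature of an error that is proportional to the distance walked.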

Of note, Bland-Altman analysis can also be used to evaluate the replicability of multiple (i.e., two or more) DHT measurements, provided the measurements are taken within a short enough timeframe to avoid results being confounded by changes in the patient’s disease state. As an example, one could record 6MWD measurements on four consecutive days and assess whether the measurements from different days are in agreement with each other to evaluate the replicability of the results.27

List of Appendix Tables

Appendix Table 1: Concordance Coefficients and Metrics Used in DHT Analytical Validation

**Appendix Table 1**

![Appendix Table 1](Appendix_C_Table_1.png)

Appendix Table 2: Formulas for Coefficients and Metrics Used in DHT Analytical Validation

**Appendix Table 2**

![Appendix Table 2](Appendix_C_Table_2.png)

Where:

d = Digital measurements

m = Manual measurements

n = Number of participants

s₁ = Standard deviation of manual measurements

s₂ = Standard deviation of digital measurements

s = Standard deviation of the differences (d – m)

m = Mean of measurements

y = Ordered measurement samples

a = Coefficients for best linear unbiased estimate

SW = Shapiro-Wilk test

Appendix Table 3: ICC Equation Selection Guide

**Appendix Table 3**

![Appendix Table 3](Appendix_C_Table_3.png)

Where:

MSR: Mean Square for Rows

MSC: Mean Square for Columns

MSE: Mean Square for Error

MSW: Mean Square Within Groups

The above-mentioned variables are derived from an analysis of variance (ANOVA).

Appendix Table 4: Criteria and Thresholds Selected for the Interpretation of Reliability Metrics

*The MCID criterion is specific to 6MWD for HF patients (see the MCID selection discussion).

**Appendix Table 4**

![Appendix Table 4](Appendix_C_Table_4.png)

*Of note, the only statistical approach whose interpretation is based on clinically relevant thresholds is Bland-Altman analysis, for which both the mean bias and the limits of agreement are compared to the MCID. Hence, BA is the only component of the framework with variable thresholds depending on the clinical question and the relevant literature.

Although MAPE is a widely adopted goodness-of-fit measure, there is currently no standardized threshold for determining the validity of DHT measurements with it. Nevertheless, some studies in the DHT literature43–45 use the MAPE ≤ 10% threshold as the validity criterion, labeling any value that differs from its corresponding gold standard measurement by more than 10% as an erroneous measurement.

Appendix Table 5: Demographics of the Apple Watch Use Case Study

**Appendix Table 5**

![Appendix Table 5](Appendix_C_Table_5.png)

List of Appendix Figures

**Appendix Figure 1**

![Appendix Figure 1](Appendix_C_Figure_1.png)

Appendix Figure 1: Example correlation between temperature units, illustrating the nuance between correlation and agreement.

**Appendix Figure 2**

![Appendix Figure 2](Appendix_C_Figure_2.png)

Appendix Figure 2: DHT selection strategy for stakeholders (sponsors)

**Appendix Figure 3**

![Appendix Figure 3](Appendix_C_Figure_3.png)

$y_{\mathrm{HF}} = 1.0540x$; $y_{\mathrm{CVD}} = 1.0607x$; $y_{\mathrm{Healthy}} = 1.0607x$

Appendix Figure 3: Correlation between digital (Apple Watch) and manual 6MWD measurements (in m) in HF, CVD non-HF, and healthy cohorts.

**Appendix Figure 4**

![Appendix Figure 4](Appendix_C_Figure_4.png)

Appendix Figure 4: Bland-Altman Analysis of Measurement Differences in HF, CVD non-HF, and Healthy Cohorts