# The Feasibility and Equity of Geolocation Data for Lapse Prediction in AUD

Claire Punturieri  
John J. Curtin  
October 15, 2024

## Introduction

About 1 in 10 adults in the United States met diagnostic criteria for alcohol use disorder (AUD) in 2022 ([“Highlights for the 2022 National Survey on Drug Use and Health,” n.d.](#ref-Highlights2022National)). While some individuals will experience natural recovery (i.e., improvement without intervention) ([Tucker, Chandler, and Witkiewitz 2020](#ref-tuckerEpidemiologyRecoveryAlcohol2020)), for others AUD will present as a chronic, relapsing disorder marked by periods of recovery interspersed with returns to harmful use ([“The Science of Drug Use and Addiction: The Basics NIDA Archives” n.d.](#ref-ScienceDrugUse); [Brandon, Vidrine, and Litvin 2007](#ref-brandonRelapseRelapsePrevention2007a)). For such individuals, continued monitoring may help maintain recovery goals and identify precipitants to lapses (i.e., single instances of goal-inconsistent use that may lead to relapse) ([Witkiewitz and Marlatt 2004](#ref-witkiewitzRelapsePreventionAlcohol2004a)). One sustainable and scalable way to provide this continuous monitoring to the individuals who need it most is to develop algorithms that predict lapses from personal sensing data using machine learning.

Personal sensing data are data derived from sensors embedded in technology that is ubiquitous in daily life, such as smartphones, smartwatches, or other wearables ([Mohr, Zhang, and Schueller 2017](#ref-mohrPersonalSensingUnderstanding2017a)). Because these devices are already integrated into day-to-day life, one benefit of porting these data to clinical use is that they can be collected unobtrusively and continuously. Importantly, these data do not require individuals to change their behavior or routines in any way. Moreover, when these data are paired with machine learning models, statistical patterns connecting antecedents of lapse (e.g., changes in mood, difficulty with close social connections, proximity to risky locations) to true lapse events can be uncovered. This is crucial for several reasons: 1) even when someone anticipates an oncoming lapse, it may be difficult to pinpoint the specific driving forces behind it; 2) these precipitating factors vary greatly both between and within people; and 3) uncovering these factors may help relieve some of the cognitive burden of recovery (e.g., constant monitoring of potential environmental risk factors).

### Geolocation Data for Risk Monitoring

Recovery and return to use are dynamic processes. Factors that contribute both to maintenance of recovery and to return to use vary from person to person and from moment to moment. A shift in social supports (e.g., a move, a break-up) may precede a lapse for one individual but not another. Time spent in locations where alcohol is available (e.g., bars, restaurants, concert venues) may precede a given lapse for another individual, but will not necessarily precede future lapses in that same individual. To capture this fluidity, the data used within continuous risk monitoring systems should provide a correspondingly fine level of granularity. One promising data source is geolocation data.

Geolocation data consist of latitude and longitude coordinates and can be sampled at regular intervals using applications on smartphones with little to no input from the user beyond initial set-up. Many smartphones and smartwatches automatically collect these data by default. This fact, paired with increasing rates of smartphone ownership, suggests that there is high potential for these data to be feasibly harnessed for use in a risk-monitoring system ([Areàn, Hoa Ly, and Andersson 2016](#ref-areanMobileTechnologyMental2016)). Location-related factors, such as environmental cues or one’s perceived riskiness of a setting, have been shown to play an important role in lapse ([Janak and Chaudhri 2010](#ref-janakPotentEffectEnvironmental2010); [Maureen A. Walton et al. 2003](#ref-waltonIndividualSocialEnvironmental2003); [M. A. Walton, Reischl, and Ramanthan 1995](#ref-waltonSocialSettingsAddiction1995)). This link with lapse risk has translated into the integration of coping skills that target substance-associated contexts in several treatment strategies like mindfulness-based relapse prevention ([LeCocq et al. 2020](#ref-lecocqConsideringDrugAssociatedContexts2020)). These findings not only underscore the potential wealth of information about relapse risk that an individual’s location can provide, but also demonstrate that location information has already been integrated into treatment. Furthermore, geolocation data have been specifically identified as being of particular use in understanding both the precipitants to harmful substance use and its effective treatment ([Stahler, Mennis, and Baron 2013](#ref-stahlerGeospatialTechnologyExposome2013)).

Within the substance use literature, geolocation data have historically been used to examine risky locations, such as the influence of neighborhood characteristics on use ([Epstein et al. 2014](#ref-epsteinRealtimeTrackingNeighborhood2014a); [Kwan et al. 2019](#ref-kwanUncertaintiesGeographicContext2019)) and individual physical proximity to locations of potential or past harmful use such as bars (either estimated using geofencing or user-defined) ([Attwood et al. 2017](#ref-attwoodUsingMobileHealth2017); [Carreiro et al. 2021](#ref-carreiroRealizeAnalyzeEngage2021); [Gonzalez and Dulin 2015](#ref-gonzalezComparisonSmartphoneApp2015); [Gustafson et al. 2014](#ref-gustafsonSmartphoneApplicationSupport2014a); [Naughton et al. 2016](#ref-naughtonContextSensingMobilePhone2016)). Several of the applications implemented in these studies enable real-time notifications about locations to their users (e.g., a pop-up message on a smartphone which reads *“You are entering a high-risk zone”*).

Affective scientists, on the other hand, have focused more closely on factors relating to mood. Geolocation data have been used to estimate loneliness and isolation ([Doryab et al. 2019](#ref-doryabIdentifyingBehavioralPhenotypes2019a)), to demonstrate increases in positive affect from seeking out novel environments ([Heller et al. 2020](#ref-hellerAssociationRealworldExperiential2020)), and to quantify depressive symptoms ([Raugh et al. 2020](#ref-raughGeolocationDigitalPhenotyping2020)). Moreover, these data have been harnessed not only to measure mood symptoms, but also to predict their emergence (for review, see [Shin and Bae 2023](#ref-shinSystematicReviewLocation2023)).

An integration across these subfields can be in part accomplished by enriching geolocation data with brief, intermittent surveys probing specific information about frequently visited locations. For example, some of the more nuanced facets captured within location are associations with others (or lack thereof, e.g., social isolation), associations with previous drinking behaviors (e.g., whether or not alcohol is present), and associations with affect (i.e., negative versus positive emotions tied to a given location).

### Model Evaluation

Data selection, however, is only one component of the successful development of a continuous risk monitoring algorithm. Following algorithm development, it is imperative that these models be rigorously evaluated using performance metrics and eventually tested using independent observations (i.e., using data from individuals who were not used in model development). This workflow enables researchers to anticipate how well a model can be expected to generalize to new populations and is key when aiming to develop algorithms for real-world healthcare implementation. While performance metrics like model accuracy have been standard reporting practice for years, recent literature has begun to urge researchers to also include assessments of how *fair* a model is ([Rajkomar et al. 2018](#ref-rajkomarEnsuringFairnessMachine2018a); [Wawira Gichoya et al. 2021](#ref-wawiragichoyaEquityEssenceCall2021)). A fair algorithm is one with no preference in performance with respect to inherent or acquired characteristics (e.g., gender, race, socioeconomic status; [X. Wang, Zhang, and Zhu 2022](#ref-wangBriefReviewAlgorithmic2022)). In the context of a continuous risk monitoring algorithm for AUD, this would mean that lapse predictions are reasonably accurate and do not favor or disadvantage any particular group.

The motivating factors behind this call to action are clear. In the broader context of health-related data, historical patterns of health care inequities will almost certainly be embedded within data used to train algorithms. These inequities may unintentionally be carried forward in perpetuity by machine learning models if not critically examined. Without examining algorithmic fairness prior to deployment in the real world, monitoring algorithms run the risk of providing sub-optimal mental health care to individuals who already face disadvantages.

For example, having a limited number of observations within underrepresented groups means that our models will not have as wide a range of individuals to learn from for making predictions of lapse as compared to white, non-Hispanic participants. Performance of these models for racialized minority individuals may therefore be less accurate as a result, particularly without the use of resampling techniques to amend these imbalances ([Japkowicz 2000](#ref-japkowiczClassImbalanceProblem2000); [A. Wang, Ramaswamy, and Russakovsky 2022](#ref-wangIntersectionalityMachineLearning2022)).

Fairness is also a particular concern in the context of AUD, where the literature has historically been built upon research conducted with predominantly white, male participants. Despite the call to action brought forth by the NIH through their *Guidelines on Inclusion of Women and Minorities in Research*, recent work has highlighted that seminal research in the field on medications for the treatment of AUD has failed to consistently report participant demographics ([Schick, Spillane, and Hostetler 2020](#ref-schickCallActionSystematic2020)). This lack of reporting makes it difficult to assess whether and how this lack of representation is being corrected. By the very nature of its historically limited participant pool, AUD research and its theory have been developed from a particular perspective using a particular group of individuals. This means that the variables that researchers decide are important to measure and input into models, informed by knowledge of AUD theory, will inherently be biased and may favor these groups. Therefore, researchers also cannot assume that balanced classes are enough to compensate for biases brought on by the broader societal context. Both of these facts motivate examining algorithmic fairness when developing continuous risk monitoring systems.

### The Current Study

For these continuous risk monitoring systems to be implemented in the real world, the underlying models must be both developed and rigorously evaluated on standard performance metrics and algorithmic fairness. To this end, this study used geolocation data collected from smartphones and corresponding self-reported contextual information for frequently visited locations to build a machine learning model to predict next-day alcohol use lapse among individuals with a diagnosis of AUD and a recovery goal of abstinence. Model features were engineered from raw geolocation data, in both context-dependent and context-agnostic forms, and from change in these data over the previous 6, 12, 24, 48, 72, and 168 hours.

Here we characterize the performance of this prediction model in validation sets. We also evaluate feature importance and model fairness. <!--will need to circle back to this and expand--> This study constitutes a preliminary evaluation of a model designed to predict lapse back to alcohol use using minimally burdensome data that have the potential to be integrated within a continuous risk monitoring platform.

## Methods

### Participants

One hundred forty-six individuals in early recovery (1-8 weeks of abstinence) from AUD were recruited from the Madison area between 2017 and 2019 to take part in a three-month study on how mobile health technology can provide recovery support (R01 AA024391). Recruitment approaches included social media platforms (e.g., Facebook), television and radio advertisements, and clinic referrals. Prospective participants completed a phone screen to assess match with eligibility criteria (<a href="#tbl-elig" class="quarto-xref">Table 1</a>). Participants were excluded if they exhibited severe symptoms of paranoia or psychosis (a score \>= 2.24 on the SCL-90 psychosis scale or a score \>= 2.82 on the SCL-90 paranoia scale administered at screening).

| Eligibility Criteria                                                     |
|--------------------------------------------------------------------------|
| \>= 18 years of age                                                      |
| Ability to read and write in English                                     |
| Diagnosis of moderate or severe AUD (\>= 4 self-reported DSM-5 symptoms) |
| Abstinent from alcohol for 1-8 weeks                                     |
| Willing to use only one smartphone\*\* while on study                    |

Table 1: Eligibility criteria for study enrollment. \*\*Personal or study-provided.

### Procedure

Participants enrolled in a three-month study consisting of five in-person visits, daily surveys, and continuous passive monitoring of geolocation data. Following screening and enrollment visits in which participants consented to participate, learned how to manage location sharing (i.e., turn off location sharing when desired), and reported frequently visited locations, participants completed three follow-up visits one month apart. At each visit, participants were asked questions about frequently visited (\>2 times during the course of the previous month) locations. Participants were debriefed at the third and final follow-up visit. Participants were expected to provide continuous geolocation data while on study. Other personal sensing data streams (EMA, cellular communications, sleep quality, and audio check-ins) were collected as part of the parent grant’s aims (R01 AA024391).

### Geolocation data

To enable collection of geolocation data, participants downloaded either the Moves app or the FollowMee app during the intake visit. Moves was bought out and subsequently deprecated while the study was ongoing (July 2018), and data collection continued using FollowMee until the end of the study. Both apps continuously tracked location via GPS and WiFi positioning technology.

Data were then processed to filter out duplicated points, implausibly fast movement speeds (\>100 mph), sudden positional jumps, and periods of long duration suggesting sampling errors (\>24 hours with no movement or \>2 hours with a positional jump of more than 0.31 miles or 500 meters). Data points were classified as “in transit” when spacing between individual positions suggested a movement speed of greater than 4 mph per NIH health guidelines ([“Physical Activity Guidelines for Americans, 2nd Edition,” n.d.](#ref-PhysicalActivityGuidelines)).
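
The speed-based rules above can be sketched in base R as follows. This is a minimal illustration rather than the study's processing code: the `haversine_m()` helper, the toy `gps` trace, and the column names are all hypothetical, and only the 100 mph and 4 mph thresholds come from the text.

```r
# Great-circle distance in meters between coordinate pairs (haversine formula).
haversine_m <- function(lat1, lon1, lat2, lon2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

# Hypothetical toy trace for one participant (time in seconds since first fix).
gps <- data.frame(
  time_s = c(0, 600, 1200, 1260),
  lat    = c(43.0731, 43.0731, 43.0900, 43.5000),
  lon    = c(-89.4012, -89.4012, -89.4012, -89.4012)
)

# Speed implied by each point relative to the previous point.
n <- nrow(gps)
dist_m <- c(NA, haversine_m(gps$lat[-n], gps$lon[-n], gps$lat[-1], gps$lon[-1]))
dur_h  <- c(NA, diff(gps$time_s)) / 3600
gps$speed_mph <- (dist_m / 1609.344) / dur_h

# Flag implausible speeds (> 100 mph) and label plausible movement above 4 mph
# as "in transit". The first point has no predecessor and is left unflagged.
gps$implausible <- !is.na(gps$speed_mph) & gps$speed_mph > 100
gps$in_transit  <- !is.na(gps$speed_mph) & !gps$implausible & gps$speed_mph > 4

print(gps)
```

In this toy trace the third point implies a speed of roughly 7 mph and is labeled in transit, while the fourth implies a jump of about 28 miles in one minute and is flagged as implausible. In practice, speeds would presumably be recomputed after flagged points are dropped.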

### Contextual information

Contextual information for frequently visited locations (\>2 times in the previous month) was obtained during an interview at each follow-up visit (at month 1, 2, and 3; <a href="#tbl-context" class="quarto-xref">Table 2</a>). Participants were considered to be at a known contextual location if they were within 0.031 miles (50 meters) of a reported frequently visited location.
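
As a rough sketch of the 50-meter matching rule, the base R snippet below assigns each GPS point to the nearest reported location when that location is within the cutoff. The `places` and `points` data, the `match_place()` helper, and the choice to use the nearest location when several qualify are illustrative assumptions, not details taken from the study.

```r
# Great-circle distance in meters (haversine formula).
haversine_m <- function(lat1, lon1, lat2, lon2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

# Hypothetical frequently visited locations reported by one participant.
places <- data.frame(
  place = c("home", "bar"),
  lat   = c(43.0731, 43.0760),
  lon   = c(-89.4012, -89.3900),
  risk  = c("No risk", "High risk")
)

# Hypothetical GPS points to label.
points <- data.frame(
  lat = c(43.07312, 43.0760, 43.1000),
  lon = c(-89.40118, -89.3903, -89.4500)
)

# Return the nearest reported place if it lies within max_dist_m, else NA.
match_place <- function(lat, lon, places, max_dist_m = 50) {
  d <- haversine_m(lat, lon, places$lat, places$lon)
  nearest <- which.min(d)
  if (d[nearest] <= max_dist_m) places$place[nearest] else NA_character_
}

points$place <- mapply(match_place, points$lat, points$lon,
                       MoreArgs = list(places = places))
print(points)
```

Once a point is matched, the survey responses for that place (e.g., the `risk` column here) can be joined on to give the point its context.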

| Question                                                                                                      | Responses                                                                                                                                                                                                                                                                                      |
|------------------------|------------------------------------------------|
| Address                                                                                                       |                                                                                                                                                                                                                                                                                                |
| Type of place                                                                                                 | Work, School, Volunteer, Health care, Home of a friend, Home of a family member, Liquor store, Errands (e.g., grocery store, post office), Coffee shop or cafe, Restaurant, Park, Bar, Gym or fitness center, AA or recovery meeting, Religious location (e.g., church, mosque, temple), Other |
| Have you drank alcohol here before?                                                                           | No, Yes                                                                                                                                                                                                                                                                                        |
| Is alcohol available here?                                                                                    | No, Yes                                                                                                                                                                                                                                                                                        |
| How would you describe your experiences here?                                                                 | Pleasant, Unpleasant, Mixed, Neutral                                                                                                                                                                                                                                                           |
| Does being at this location put you at any risk to begin drinking?                                            | No risk, Low risk, Medium risk, High risk                                                                                                                                                                                                                                                      |
| Did the participant identify this place as a risky location they are trying to avoid now that they are sober? | No, Yes                                                                                                                                                                                                                                                                                        |

Table 2: Location information collected from frequently visited locations.

### Participant characteristics

Participants completed a baseline measure of demographics and other constructs relevant to lapse at the screening visit, which was used for fairness assessments (<a href="#tbl-demo-1" class="quarto-xref">Table 3</a>).

| Variable     | Measure                                                           |
|--------------------------|----------------------------------------------|
| Demographics | Age                                                               |
|              | Sex                                                               |
|              | Race                                                              |
|              | Ethnicity                                                         |
|              | Employment                                                        |
|              | Income                                                            |
|              | Marital Status                                                    |
| Alcohol      | Alcohol Use History                                               |
|              | DSM-5 Checklist for AUD                                           |
|              | Young Adult Alcohol Problems Test                                 |
|              | WHO-The Alcohol, Smoking and Substance Involvement Screening Test |

Table 3: Demographic and relevant alcohol use history variables sampled at screening visit.

### Lapses

Alcohol lapses were the outcome variable in this study and provided the labels for model training, for testing model performance, and for testing issues of algorithmic fairness across our predefined subgroups. Future lapse occurrence (here conceptualized as next-day lapse) was predicted in 24-hour windows, beginning at 4:00am on a participant’s second day of participation to ensure one full day of data collection for the first window, and on every subsequent day on study thereafter. *Lapse* and *no lapse* occurrences were identified from the daily survey question, *“Have you drank any alcohol that you have not yet reported?”*. Participants who responded *yes* to this question were then asked to report the date and hour of the start and the end of the drinking episode. Prediction windows in which a drinking episode was reported were labeled *lapse*. Prediction windows were labeled *no lapse* if no alcohol use was reported within that window.
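
The windowing scheme can be illustrated with a short base R sketch. The dates, the number of study days, and the rule of assigning a lapse to the window that contains its reported onset are assumptions made for the example. Times are handled in UTC purely to sidestep daylight-saving edge cases, whereas real data would need the participant's local time zone.

```r
# Hypothetical participant: study start date and reported lapse onsets.
tz <- "UTC"
study_start  <- as.Date("2018-03-01")
n_days       <- 7
lapse_onsets <- as.POSIXct(c("2018-03-04 21:30:00", "2018-03-06 02:15:00"), tz = tz)

# The first window opens at 4:00am on the second study day, followed by one
# 24-hour window per subsequent day.
window_dates <- study_start + seq_len(n_days - 1)
window_start <- as.POSIXct(paste(window_dates, "04:00:00"), tz = tz)
window_end   <- as.POSIXct(paste(window_dates + 1, "04:00:00"), tz = tz)
windows <- data.frame(window_start, window_end)

# A window is labeled "lapse" if any reported onset falls in [start, end).
has_lapse <- vapply(seq_len(nrow(windows)), function(i) {
  any(lapse_onsets >= windows$window_start[i] &
        lapse_onsets < windows$window_end[i])
}, logical(1))
windows$label <- ifelse(has_lapse, "lapse", "no lapse")

print(windows)
```

Note that the second onset (2:15am on March 6) falls before the 4:00am boundary, so it is assigned to the window that opened on March 5.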

### Feature engineering

Feature engineering is the process of creating variables (or *“features”*) from unprocessed data. Here, it was used to transform the raw geolocation data collected prior to each prediction window into model inputs.

Features from geolocation data were generated that utilized both contextual information collected from monthly surveys (e.g., location valence, perceived riskiness) as well as features that were independent of further individual input (e.g., location variance, time spent out of the home in the evening). <!--list out all features? maybe make table?--> All features were calculated both as raw and change features based on previous geolocation data (i.e., change from past 6, 12, 24, 48, 72, and 168 hour periods) in order to capture individual variation.
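
To make the raw versus change distinction concrete, the sketch below computes one hypothetical feature (hours per day at self-reported high-risk locations) over each lookback period. The text does not spell out how change features were defined, so the definition here (raw value minus the value over the longest, 168-hour period) is only one plausible reading. The `visits` data and the simplification of counting a visit in full when it ended inside the period are likewise illustrative.

```r
# Hypothetical visit log for one participant: hours before the prediction
# window start at which each visit ended, its duration, and whether the place
# was self-reported as high risk.
visits <- data.frame(
  hours_before = c(2, 10, 30, 80, 150),
  duration_h   = c(1.5, 2.0, 1.0, 3.0, 0.5),
  high_risk    = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)
periods_h <- c(6, 12, 24, 48, 72, 168)

# Raw feature: hours per day at high-risk locations within each period.
# Simplification: a visit is counted in full if it ended inside the period.
raw_rate <- vapply(periods_h, function(p) {
  in_period <- visits$hours_before <= p & visits$high_risk
  sum(visits$duration_h[in_period]) / (p / 24)
}, numeric(1))

# Change feature (assumed definition): raw rate minus the rate over the
# longest lookback period, used here as the participant's recent baseline.
baseline <- raw_rate[periods_h == max(periods_h)]
features <- data.frame(
  period_h = periods_h,
  raw      = raw_rate,
  change   = raw_rate - baseline
)
print(features)
```

Expressing features relative to a person's own baseline is one way to capture individual variation, since the same raw value can be typical for one participant and unusual for another.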

Imputation of missing data and removal of zero-variance features were additional general processing steps undertaken during feature engineering.

### Algorithm development & performance

Several configurations of the XGBoost machine learning algorithm were considered which varied across a relevant and appropriate range of model-specific hyperparameters (mtry, tree depth, learning rate) as well as resampling techniques (up-sampling of the positive class, lapse, and down-sampling of the negative class, no lapse).
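
The two resampling strategies can be sketched in base R as below. The `train` data and the `down_sample()` and `up_sample()` helpers are hypothetical stand-ins, not the routines used to fit the models. They match class counts exactly, whereas real configurations may target other ratios.

```r
set.seed(101)

# Hypothetical training data with a 10% lapse rate.
train <- data.frame(
  id = 1:100,
  y  = factor(c(rep("lapse", 10), rep("no lapse", 90)))
)

# Down-sampling: randomly drop majority-class rows until classes are balanced.
down_sample <- function(data, outcome) {
  counts <- table(data[[outcome]])
  n_min <- min(counts)
  keep <- unlist(lapply(names(counts), function(lvl) {
    rows <- which(data[[outcome]] == lvl)
    if (length(rows) > n_min) rows[sample.int(length(rows), n_min)] else rows
  }))
  data[sort(keep), , drop = FALSE]
}

# Up-sampling: resample minority-class rows with replacement until balanced.
up_sample <- function(data, outcome) {
  counts <- table(data[[outcome]])
  n_max <- max(counts)
  keep <- unlist(lapply(names(counts), function(lvl) {
    rows <- which(data[[outcome]] == lvl)
    n_extra <- n_max - length(rows)
    if (n_extra > 0) {
      c(rows, rows[sample.int(length(rows), n_extra, replace = TRUE)])
    } else {
      rows
    }
  }))
  data[keep, , drop = FALSE]
}

down <- down_sample(train, "y")
up   <- up_sample(train, "y")
print(table(down$y))
print(table(up$y))
```

Resampling of this kind is normally applied only to the data used for fitting, so that held-out folds keep the natural class distribution.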

Models were trained and assessed using participant-grouped, nested *k*-fold cross-validation. Grouped cross-validation assigns all data from a participant as either held-in or held-out to avoid bias introduced when predicting a participant’s data from their own data. Nested cross-validation uses two nested loops for dividing and holding out folds: an outer loop, where held-out folds serve as test sets for model evaluation; and inner loops, where held-out folds serve as validation sets for model selection. Importantly, these sets are independent, maintaining separation between data used to train the models, select the best models, and evaluate those best models. Therefore, nested cross-validation removes optimization bias from the evaluation of model performance in the test sets and can yield lower variance performance estimates than single test set approaches ([Jonathan, Krzanowski, and McCarthy 2000](#ref-jonathanUseCrossvalidationAssess2000)).
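
The following base R sketch shows how participant-grouped, nested splits can be constructed. The fold counts, the toy `obs` data, and the `assign_folds()` helper are illustrative assumptions, since the text does not state the value of *k* or the software used to create the splits.

```r
set.seed(102)

# Hypothetical data: 12 participants with 5 daily observations each.
obs <- data.frame(subid = rep(1:12, each = 5), day = rep(1:5, times = 12))

# Assign whole participants (not individual rows) to k folds.
assign_folds <- function(ids, k) {
  ids <- unique(ids)
  shuffled <- ids[sample.int(length(ids))]
  setNames(rep_len(seq_len(k), length(shuffled)), shuffled)
}

k_outer <- 3
k_inner <- 2
outer_fold <- assign_folds(obs$subid, k_outer)

splits <- list()
for (o in seq_len(k_outer)) {
  # Outer loop: held-out participants form the test set.
  test_ids    <- as.integer(names(outer_fold)[outer_fold == o])
  held_in_ids <- setdiff(unique(obs$subid), test_ids)

  # Inner loop: re-split only the held-in participants for model selection.
  inner_fold <- assign_folds(held_in_ids, k_inner)
  for (i in seq_len(k_inner)) {
    val_ids   <- as.integer(names(inner_fold)[inner_fold == i])
    train_ids <- setdiff(held_in_ids, val_ids)
    splits[[length(splits) + 1]] <- list(
      outer = o, inner = i,
      train = train_ids, validation = val_ids, test = test_ids
    )
  }
}

# Rows for any split are recovered by filtering on participant ID.
first <- splits[[1]]
train_rows <- obs[obs$subid %in% first$train, ]
print(length(splits))
```

Because every split is defined by participant ID, no participant contributes rows to more than one of the training, validation, and test sets within a given split.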

The primary performance metric for model selection and evaluation was area under the Receiver Operating Characteristic Curve (auROC) ([Kuhn and Johnson 2018](#ref-kuhnAppliedPredictiveModeling2018)). auROC indexes the probability that the model will predict a higher score for a randomly selected positive case (lapse) relative to a randomly selected negative case (no lapse). This metric was selected because it 1) combines sensitivity and specificity, which are both important characteristics for clinical implementation; 2) is an aggregate metric across all decision thresholds, which is important because optimal decision thresholds may differ across settings and goals; and 3) is unaffected by class imbalance, which is important for comparing models with differing prediction window widths and levels of class imbalance. The best model configuration was selected using median auROC across all validation sets. Several secondary performance metrics, including sensitivity, specificity, balanced accuracy, positive predictive value (PPV), and negative predictive value (NPV), were also assessed.
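
The probabilistic interpretation of auROC and the threshold-based secondary metrics can be illustrated with a small base R example. The `auroc()` and `threshold_metrics()` helpers, the toy predictions, and the 0.5 decision threshold are assumptions made for illustration.

```r
# auROC as the proportion of (lapse, no-lapse) pairs in which the lapse case
# receives the higher predicted probability; ties count as one half.
auroc <- function(prob, truth, positive = "lapse") {
  pos <- prob[truth == positive]
  neg <- prob[truth != positive]
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}

# Secondary metrics at a single decision threshold.
threshold_metrics <- function(prob, truth, threshold = 0.5, positive = "lapse") {
  pred_pos <- prob >= threshold
  is_pos   <- truth == positive
  tp <- sum(pred_pos & is_pos)
  fp <- sum(pred_pos & !is_pos)
  fn <- sum(!pred_pos & is_pos)
  tn <- sum(!pred_pos & !is_pos)
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  c(sens = sens, spec = spec, bal_acc = (sens + spec) / 2,
    ppv = tp / (tp + fp), npv = tn / (tn + fn))
}

# Hypothetical predictions for six prediction windows.
truth <- c("lapse", "lapse", "no lapse", "no lapse", "no lapse", "no lapse")
prob  <- c(0.80, 0.40, 0.60, 0.30, 0.20, 0.10)

auc     <- auroc(prob, truth)
metrics <- threshold_metrics(prob, truth)
print(auc)      # 0.875: 7 of the 8 lapse/no-lapse pairs are ranked correctly
print(metrics)
```

Changing the threshold alters every value returned by `threshold_metrics()` but leaves `auroc()` untouched, which is the sense in which auROC aggregates across decision thresholds.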

SHAP (SHapley Additive exPlanations) values were computed as interpretability metrics to identify the relative importance of different features in the final algorithm. SHAP values measure the contribution of each feature to an algorithm’s individual predictions ([Lundberg and Lee 2017](#ref-lundbergUnifiedApproachInterpreting2017)). SHAP values possess several useful properties including: *Additivity* (SHAP values for each feature can be computed independently and summed); *Efficiency* (for each observation, the sum of SHAP values across features must add up to the difference between the model’s prediction for that observation and the average, or expected, prediction); *Symmetry* (SHAP values for two features should be equal if the two features contribute equally to all possible coalitions); and *Dummy* (a feature that does not change the predicted value in any coalition will have a SHAP value of 0). Highly important features represent relevant, actionable potential antecedents to lapse (and therefore points of intervention) that will be relevant in the future development of a continuous risk monitoring system. However, these are descriptive analyses because standard errors or other indices of uncertainty for importance scores are not available for SHAP values.
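
The additive bookkeeping behind these properties can be checked by computing exact Shapley values for a toy model. This is a didactic sketch, not how SHAP values are computed for a fitted XGBoost model: the prediction function `f()`, the observation `x`, and the single all-zero `baseline` are hypothetical. Using one reference observation in place of an expectation over background data means the attributions here sum to the prediction minus the baseline prediction.

```r
# Toy prediction function with an interaction; x[4] is deliberately unused.
f <- function(x) 0.1 + 0.5 * x[1] + 0.2 * x[2] * x[3]

x        <- c(1, 2, 3, 5)   # observation to explain
baseline <- c(0, 0, 0, 0)   # reference observation (assumption)

# Value of a coalition S: prediction with features in S taken from x and all
# remaining features taken from the baseline.
value <- function(S) {
  z <- baseline
  z[S] <- x[S]
  f(z)
}

# Exact Shapley values: weighted average of each feature's marginal
# contribution over all coalitions of the other features.
shapley <- function(p) {
  phi <- numeric(p)
  for (j in seq_len(p)) {
    others <- setdiff(seq_len(p), j)
    for (mask in 0:(2^length(others) - 1)) {
      in_S <- (mask %/% 2^(seq_along(others) - 1)) %% 2 == 1
      S <- others[in_S]
      w <- factorial(length(S)) * factorial(p - length(S) - 1) / factorial(p)
      phi[j] <- phi[j] + w * (value(c(S, j)) - value(S))
    }
  }
  phi
}

phi <- shapley(length(x))
print(phi)
print(c(sum_phi = sum(phi), pred_minus_baseline = f(x) - f(baseline)))
```

In this example, features 2 and 3 enter the model only through their product and receive equal attributions (symmetry), the unused fourth feature receives zero (dummy), and the attributions sum to the difference between the prediction and the baseline prediction (efficiency).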

Finally, a Bayesian hierarchical generalized linear model was used to estimate the posterior probability distributions and 95% Bayesian credible intervals (CIs) for auROC.

### Algorithmic fairness

Subgroups were defined on the basis of individual characteristics (termed “sensitive attributes” in the machine learning fairness literature) that are specifically associated with treatment disparities in AUD.

A Bayesian hierarchical generalized linear model was used to estimate the posterior probability distributions and 95% Bayesian CIs for auROC across four dichotomous subgroup comparisons: race (white versus non-white), age (younger than 55 versus 55 or older), income (below versus above the federal poverty line (citation)), and sex (male versus female).
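
Before any modeling, a subgroup comparison of this kind starts from auROC computed separately within each level of an attribute. The sketch below shows only that descriptive first step on hypothetical held-out predictions. It does not implement the Bayesian hierarchical model, and the `preds` data and `auroc()` helper are illustrative.

```r
# auROC as the proportion of (lapse, no-lapse) pairs ranked correctly.
auroc <- function(prob, is_lapse) {
  pos <- prob[is_lapse]
  neg <- prob[!is_lapse]
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}

# Hypothetical held-out predictions with one dichotomous attribute.
preds <- data.frame(
  prob     = c(0.9, 0.7, 0.4, 0.2, 0.1, 0.6, 0.5, 0.8, 0.3, 0.2),
  is_lapse = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  sex      = rep(c("male", "female"), each = 5)
)

# auROC within each subgroup, and the difference between subgroups.
by_group <- sapply(split(preds, preds$sex),
                   function(d) auroc(d$prob, d$is_lapse))
gap <- by_group[["male"]] - by_group[["female"]]

print(by_group)
print(gap)
```

With real data, per-fold subgroup estimates like these would be the inputs to a model that quantifies uncertainty around the difference, since small subgroups can produce large gaps by chance alone.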

## Results

In [None]:
library(tidyverse)

In [None]:
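# Study dates for participants with usable geolocation data
# (path_gps and path_shared are assumed to be defined earlier in the notebook)
study_dates <- read_csv(here::here(path_gps, "study_dates.csv"),
                          show_col_types = FALSE)

# Unique participant IDs retained for analysis
subids_dates <- study_dates |>
    pull(subid) |>
    unique()

# Screening data for retained participants: recode the 11 DSM-5 items to 0/1
# and sum them into a symptom count per participant
screen <- read_csv(file.path(path_shared, "screen.csv"),
                   col_types = cols()) |>
  filter(subid %in% subids_dates) |>
  mutate(across(dsm5_1:dsm5_11, ~ recode(., "No" = 0, "Yes" = 1))) |>
  rowwise() |>
  mutate(dsm5_total = sum(c(dsm5_1, dsm5_2, dsm5_3, dsm5_4, dsm5_5, dsm5_6, dsm5_7,
                              dsm5_8, dsm5_9, dsm5_10, dsm5_11))) |>
  ungroup()

# Daily lapse labels, dropping records flagged for exclusion
lapses <- read_csv(file.path(path_shared, "lapses_day.csv"), col_types = cols()) |>
  filter(exclude == FALSE)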

n_total <- 146

dem <- screen |>
  summarise(mean = as.character(round(mean(dem_1, na.rm = TRUE), 1)),
            SD = as.character(round(sd(dem_1, na.rm = TRUE), 1)),
            min = as.character(min(dem_1, na.rm = TRUE)),
            max = as.character(max(dem_1, na.rm = TRUE))) |>
  mutate(var = "Age",
         n = as.numeric(""),
         perc = as.numeric("")) |>
  select(var, n, perc, everything()) |>
  full_join(screen |>
  select(var = dem_2) |>
  group_by(var) |>
  summarise(n = n()) |>
  mutate(perc = (n / sum(n)) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = dem_3) |>
  mutate(var = fct_relevel(factor(var,
                         c("American Indian/Alaska Native", "Asian", "Black/African American",
                           "White/Caucasian", "Other/Multiracial")))) |>
  group_by(var) |>
  summarise(n = n()) |>
  mutate(perc = (n / sum(n)) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = dem_4) |>
  mutate(var = case_when(var == "No, I am not of Hispanic, Latino, or Spanish origin" ~ "No",
                         TRUE ~ "Yes"),
         var = fct_relevel(factor(var, c("Yes", "No")))) |>
  group_by(var) |>
  summarise(n = n()) |>
  mutate(perc = (n / sum(n)) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = dem_5) |>
  mutate(var = fct_relevel(factor(var,
                         c("Less than high school or GED degree", "High school or GED",
                           "Some college", "2-Year degree", "College degree", "Advanced degree")))) |>
  group_by(var) |>
  summarise(n = n()) |>
  mutate(perc = (n / sum(n)) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = dem_6, dem_6_1) |>
  mutate(var = case_when(dem_6_1 == "Full-time" ~ "Employed full-time",
                         dem_6_1 == "Part-time" ~ "Employed part-time",
                         TRUE ~ var)) |>
  mutate(var = fct_relevel(factor(var,
                         c("Employed full-time", "Employed part-time", "Full-time student",
                           "Homemaker", "Disabled", "Retired", "Unemployed",
                           "Temporarily laid off, sick leave, or maternity leave",
                           "Other, not otherwise specified")))) |>
  group_by(var) |>
  summarise(n = n()) |>
  mutate(perc = (n / sum(n)) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  summarise(mean = format(round(mean(dem_7, na.rm = TRUE), 0), big.mark = ","),
            SD = format(round(sd(dem_7, na.rm = TRUE), 0), big.mark = ","),
            min = format(round(min(dem_7, na.rm = TRUE), 0), big.mark = ","),
            max = format(round(max(dem_7, na.rm = TRUE), 0), scientific = FALSE, big.mark = ",")) |>
  mutate(var = "Personal Income",
        n = as.numeric(""),
        perc = as.numeric(""),
        mean = str_c("$", as.character(mean)),
        SD = str_c("$", as.character(SD)),
        min = str_c("$", as.character(min)),
        max = as.character(max)) |>
  select(var, n, perc, everything()), by = c("var", "n", "perc", "mean", "SD", "min", "max")) |>
  full_join(screen |>
  select(var = dem_8) |>
  mutate(var = case_when(var == "Never Married" ~ "Never married",
                         TRUE ~ var)) |>
  mutate(var = fct_relevel(factor(var,
                         c("Never married", "Married", "Divorced", "Separated",
                           "Widowed")))) |>
  group_by(var) |>
  summarise(n = n()) |>
  mutate(perc = (n / sum(n)) * 100), by = c("var", "n", "perc"))

auh <- screen |>
  summarise(mean = mean(auh_1, na.rm = TRUE),
            SD = sd(auh_1, na.rm = TRUE),
            min = min(auh_1, na.rm = TRUE),
            max = max(auh_1, na.rm = TRUE)) |>
  mutate(var = "Age of first drink",
        n = as.numeric(""),
        perc = as.numeric("")) |>
  select(var, n, perc, everything()) |>
  full_join(screen |>
  summarise(mean = mean(auh_2, na.rm = TRUE),
            SD = sd(auh_2, na.rm = TRUE),
            min = min(auh_2, na.rm = TRUE),
            max = max(auh_2, na.rm = TRUE)) |>
  mutate(var = "Age of regular drinking",
        n = as.numeric(""),
        perc = as.numeric("")) |>
  select(var, n, perc, everything()), by = c("var", "n", "perc", "mean", "SD",
                                             "min", "max")) |>
  full_join(screen |>
  summarise(mean = mean(auh_3, na.rm = TRUE),
            SD = sd(auh_3, na.rm = TRUE),
            min = min(auh_3, na.rm = TRUE),
            max = max(auh_3, na.rm = TRUE)) |>
  mutate(var = "Age at which drinking became problematic",
        n = as.numeric(""),
        perc = as.numeric("")) |>
  select(var, n, perc, everything()), by = c("var", "n", "perc", "mean", "SD",
                                             "min", "max")) |>
  full_join(screen |>
  summarise(mean = mean(auh_4, na.rm = TRUE),
            SD = sd(auh_4, na.rm = TRUE),
            min = min(auh_4, na.rm = TRUE),
            max = max(auh_4, na.rm = TRUE)) |>
  mutate(var = "Age of first quit attempt",
        n = as.numeric(""),
        perc = as.numeric("")) |>
  select(var, n, perc, everything()), by = c("var", "n", "perc", "mean", "SD",
                                             "min", "max")) |>
  full_join(screen |>
  # filter out 2 people with 100 and 365 reported quit attempts - will make footnote in table
  filter(auh_5 < 100) |>
  summarise(mean = mean(auh_5, na.rm = TRUE),
            SD = sd(auh_5, na.rm = TRUE),
            min = min(auh_5, na.rm = TRUE),
            max = max(auh_5, na.rm = TRUE)) |>
  mutate(var = "Number of Quit Attempts*",
        n = as.numeric(""),
        perc = as.numeric("")) |>
  select(var, n, perc, everything()), by = c("var", "n", "perc", "mean", "SD",
                                             "min", "max")) |>
  full_join(screen |>
  select(var = auh_6_1) |>
  mutate(var = case_when(var == "Long-Term Residential Treatment (more than 6 months)" ~ "Long-term residential (6+ months)",
                         TRUE ~ var)) |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = auh_6_2) |>
  mutate(var = case_when(var == "Short-Term Residential Treatment (less than 6 months)" ~ "Short-term residential (< 6 months)",
                         TRUE ~ var)) |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = auh_6_3) |>
  mutate(var = case_when(var == "Outpatient Treatment" ~ "Outpatient",
                         TRUE ~ var)) |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = auh_6_4) |>
  mutate(var = case_when(var == "Individual Counseling" ~ "Individual counseling",
                         TRUE ~ var)) |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = auh_6_5) |>
  mutate(var = case_when(var == "Group Counseling" ~ "Group counseling",
                         TRUE ~ var)) |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = auh_6_6) |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = auh_6_7) |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = auh_7) |>
  mutate(var = fct_relevel(factor(var, c("Yes", "No")))) |>
  group_by(var) |>
  summarise(n = n()) |>
  mutate(perc = (n / sum(n)) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  summarise(mean = mean(dsm5_total, na.rm = TRUE),
            SD = sd(dsm5_total, na.rm = TRUE),
            min = min(dsm5_total, na.rm = TRUE),
            max = max(dsm5_total, na.rm = TRUE)) |>
  mutate(var = "DSM-5 Alcohol Use Disorder Symptom Count",
        n = as.numeric(""),
        perc = as.numeric("")) |>
  select(var, n, perc, everything()), by = c("var", "n", "perc", "mean", "SD",
                                             "min", "max")) |>
  full_join(screen |>
  select(var = assist_2_1) |>
  filter(var != "Never" & !is.na(var)) |>
  mutate(var = "Tobacco products (cigarettes, chewing tobacco, cigars, etc.)") |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = assist_2_2) |>
  filter(var != "Never" & !is.na(var)) |>
  mutate(var = "Cannabis (marijuana, pot, grass, hash, etc.)") |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = assist_2_3) |>
  filter(var != "Never" & !is.na(var)) |>
  mutate(var = "Cocaine (coke, crack, etc.)") |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = assist_2_4) |>
  filter(var != "Never" & !is.na(var)) |>
  mutate(var = "Amphetamine type stimulants (speed, diet pills, ecstasy, etc.)") |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = assist_2_5) |>
  filter(var != "Never" & !is.na(var)) |>
  mutate(var = "Inhalants (nitrous, glue, petrol, paint thinner, etc.)") |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = assist_2_6) |>
  filter(var != "Never" & !is.na(var)) |>
  mutate(var = "Sedatives or sleeping pills (Valium, Serepax, Rohypnol, etc.)") |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = assist_2_7) |>
  filter(var != "Never" & !is.na(var)) |>
  mutate(var = "Hallucinogens (LSD, acid, mushrooms, PCP, Special K, etc.)") |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc")) |>
  full_join(screen |>
  select(var = assist_2_8) |>
  filter(var != "Never" & !is.na(var)) |>
  mutate(var = "Opioids (heroin, morphine, methadone, codeine, etc.)") |>
  group_by(var) |>
  drop_na() |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100), by = c("var", "n", "perc"))

lapses_per_subid <- screen |>
  select(subid) |>
  left_join(lapses |>
  tabyl(subid) |>
  select(-percent), by = "subid") |>
  mutate(n = if_else(is.na(n), 0, n),
         lapse = if_else(n > 0, "yes", "no"))

lapse_info <- lapses_per_subid |>
  group_by(lapse) |>
  rename(var = lapse) |>
  mutate(var = factor(var, levels = c("yes", "no"), labels = c("Yes", "No"))) |>
  summarise(n = n()) |>
  mutate(perc = (n / n_total) * 100,
         mean = NA_real_,
         SD = NA_real_,
         min = NA_real_,
         max = NA_real_) |>
  full_join(lapses_per_subid |>
  summarise(mean = mean(n),
            SD = sd(n),
            min = min(n),
            max = max(n)) |>
  mutate(var = "Number of reported lapses"),
  by = c("var", "mean", "SD", "min", "max"))

### Demographics

A total of 192 individuals were eligible to participate in the study, of whom 191 consented to participate and 169 enrolled. Fifteen participants were excluded prior to the first monthly follow-up visit. One participant was excluded for not maintaining a recovery goal of abstinence during their time on study. Two participants were excluded due to evidence of low compliance and careless responding. A further five individuals were excluded because they provided insufficient geolocation data (due to software incompatibility and/or very limited mobility), yielding a final sample size of 146.

The average age of the final sample was 40.9 years (SD = 12 years, range = 21-72 years). There was an approximately equal number of men (n = 74, 50.7%) and women (n = 72, 49.3%). The majority of the sample was White/Caucasian (n = 127, 87.0%) and non-Hispanic (n = 142, 97%). The mean income of participants was \$34,408 (SD = \$32,259, range = \$0-\$200,000). Participants self-reported a mean of 8.9 DSM-5 symptoms of AUD (range = 4-11). A detailed breakdown of participant characteristics is presented in <a href="#fig-demo-2" class="quarto-xref">Figure 1</a>.


<figure id="fig-demo-2">
<img src="attachment:objects/table.png" />
<figcaption>Figure 1: Demographics and clinical characteristics (N = 146).</figcaption>
</figure>

In [None]:
# 
# options(knitr.kable.NA = "—")
# #options(knitr.table.format = "markdown")
# 
# 
# footnote_table_dem_a <- "N = 146"
# 
# footnote_table_dem_b <- "Two participants reported 100 or more quit attempts. We removed these outliers prior"
# 
# footnote_table_dem_c <- "to calculating the mean (M), standard deviation (SD), and range."
# 
# dem  |>
#   bind_rows(auh |>
#               mutate(across(mean:max, ~round(.x, 1))) |>
#               mutate(across(mean:max, ~as.character(.x)))) |>
#   bind_rows(lapse_info |>
#               mutate(across(mean:max, ~round(.x, 1))) |>
#               mutate(across(mean:max, ~as.character(.x)))) |>
#   mutate(range = str_c(min, "-", max)) |>
#   select(-c(min, max)) |>
#   kbl(longtable = TRUE,
#       booktabs = TRUE,
#       col.names = c("", "N", "%", "M", "SD", "Range"),
#       align = c("l", "c", "c", "c", "c", "c"),
#       digits = 1,
#       caption = "Demographics and clinical characteristics") |>
#   kable_styling(position = "l") |>
#   row_spec(row = 0, align = "c", italic = TRUE) |>
#   column_spec(column = 1, width = "18em") |>
#   pack_rows("Sex", 2, 3, bold = FALSE) |>
#   pack_rows("Race", 4, 8, bold = FALSE) |>
#   pack_rows("Hispanic, Latino, or Spanish Origin", 9, 10, bold = FALSE) |>
#   pack_rows("Education", 11, 16, bold = FALSE) |>
#   pack_rows("Employment", 17, 25, bold = FALSE) |>
#   pack_rows("Marital Status", 27, 31, bold = FALSE) |>
#   pack_rows("Alcohol Use Disorder Milestones", 32, 35, bold = FALSE) |>
#   pack_rows("Lifetime History of Treatment (Can choose more than 1)", 37, 43, bold = FALSE) |>
#   pack_rows("Received Medication for Alcohol Use Disorder", 44, 45, bold = FALSE) |>
#   pack_rows("Current (Past 3 Month) Drug Use", 47, 54, bold = FALSE) |>
#   pack_rows("Reported 1 or More Lapse During Study Period", 55, 56, bold = FALSE) |>
#   kableExtra::footnote(general = c(footnote_table_dem_a, footnote_table_dem_b, footnote_table_dem_c), escape=FALSE) |> 
#   save_kable(file = "objects/table.png")

### Model Evaluation

We selected the best performing XGBoost model and evaluated it on our validation set. Selecting and evaluating a model on the same data may introduce a slight optimism bias in our performance estimates, though we believe this is largely offset by our use of 10 x 30 cross-validation (which averages model performance across 300 folds). We report validation set performance because model development is still in progress, and as such it would not have been appropriate to examine independent test set performance at this stage.

The model achieved fair performance, with a median auROC of 0.712 across the 300 folds. <a href="#fig-auroc-histogram" class="quarto-xref">Figure 2</a> displays a histogram of the distribution of model performance across all folds. A receiver operating characteristic curve, based on predicted lapse probabilities aggregated across all 300 folds of the validation set, is displayed in <a href="#fig-auroc-plot" class="quarto-xref">Figure 3</a>.
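As a rough illustration of how a per-fold summary of this kind can be computed, the base R sketch below estimates the auROC in each held-out fold with the rank-based (Mann-Whitney) formula and then takes the median across folds. The data are simulated, and `auroc_rank()`, `fold_auroc`, and the simulation settings are hypothetical names and values chosen for illustration; this is not the study's modeling pipeline.

``` r
# Rank-based (Mann-Whitney) estimate of the area under the ROC curve.
# `prob`: predicted lapse probabilities; `label`: 1 = lapse, 0 = no lapse.
auroc_rank <- function(prob, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  r <- rank(prob)  # average ranks, so tied predictions count as half
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(123)
n_folds <- 300  # 10 repeats x 30 folds

# Simulate one held-out fold per iteration (assumed sizes: 20 lapses, 180 non-lapses).
fold_auroc <- vapply(seq_len(n_folds), function(i) {
  label <- sample(rep(c(0, 1), times = c(180, 20)))
  prob <- plogis(rnorm(200, mean = -2 + 0.8 * label))
  auroc_rank(prob, label)
}, numeric(1))

# Median performance across all held-out folds
median(fold_auroc)
```

Summarizing with the median rather than the mean keeps a handful of unusually easy or hard folds from dominating the estimate.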

Posterior probability distributions for the auROCs for our best performing validation set model were used to formally characterize model performance. The median auROC was 0.714 (95% CI \[0.70-0.73\]), indicating that there is a probability \> .95 that our model is performing above chance (i.e., auROC \> .5; <a href="#fig-pp" class="quarto-xref">Figure 4</a>).
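The quantities reported above (posterior median, 95% interval, and the probability of above-chance performance) are all simple summaries of posterior draws. The base R sketch below shows the arithmetic on simulated draws; `posterior_draws` is a hypothetical stand-in generated with `rnorm()`, not output from the fitted model.

``` r
set.seed(123)

# Stand-in for posterior draws of the auROC (simulated; NOT the study's posterior).
posterior_draws <- rnorm(4000, mean = 0.714, sd = 0.008)

summary_auroc <- c(
  median = median(posterior_draws),
  lower = unname(quantile(posterior_draws, 0.025)),  # lower bound of 95% interval
  upper = unname(quantile(posterior_draws, 0.975)),  # upper bound of 95% interval
  # Share of draws above chance, i.e., the posterior probability that auROC > .5
  p_above_chance = mean(posterior_draws > 0.5)
)

round(summary_auroc, 3)
```

Under this reading, a statement such as "probability \> .95 that the model performs above chance" corresponds to more than 95% of posterior draws falling above 0.5.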

Finally, we calibrated the model to improve our trust in its predictions. Results of model calibration are displayed in <a href="#fig-calibration" class="quarto-xref">Figure 5</a>, showing that the model *over*predicts lapse probability even after calibration. In other words, the lapse probabilities predicted by our model are higher than the observed rate of lapse in our sample.
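To make the idea of overprediction concrete, the base R sketch below bins simulated predicted probabilities into 10 equal-width bins and compares the mean prediction in each bin with the observed event rate. Everything here is an assumption made for illustration: the data are simulated so that the true risk is 70% of the predicted risk, and `pred`, `obs`, and `calib` are hypothetical names.

``` r
set.seed(123)
n <- 5000

# Hypothetical predicted lapse probabilities
pred <- runif(n)
# Simulated outcomes from a model that overpredicts: true risk = 0.7 * predicted risk
obs <- rbinom(n, 1, 0.7 * pred)

# Assign each prediction to one of 10 equal-width bins (0-10%, 10-20%, ...)
bins <- cut(pred, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

calib <- data.frame(
  bin_midpoint = seq(0.05, 0.95, by = 0.1),
  mean_pred = as.numeric(tapply(pred, bins, mean)),     # average prediction per bin
  observed_rate = as.numeric(tapply(obs, bins, mean))   # observed event rate per bin
)

calib
```

In a well-calibrated model, `observed_rate` tracks `mean_pred` along the y = x line; in this simulated case the observed rates fall below it, which is the pattern described as overprediction.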

``` r
# auROCs |> 
#   ggplot(aes(x = auROC)) +
#   geom_histogram(bins = 10, fill = c("#af1f21")) +
#   geom_vline(xintercept = median(auROCs$auROC), color = c("#f29c96"), lwd = 1, linetype="longdash") +
#   labs(x = "auROC", y = "Frequency")
probs |> 
  ggplot(aes(x = roc_auc)) +
  geom_histogram(bins = 15, color = c("#af1f21"), fill = "white") +
  #geom_step(bins = 10, fill = c("#af1f21")) +
  #stat_bin(geom="step", bins = 10, color = c("#af1f21"), lwd = 1) +
  geom_vline(xintercept = median(probs$roc_auc), color = c("darkblue"), lwd = 1, linetype="dashed") +
  scale_y_continuous(expand = c(0,0)) +
  labs(x = "auROC", y = "Frequency")
```

<figure id="fig-auroc-histogram">
<img src="attachment:index_files/figure-ipynb/notebooks-auROC_distribution_posterior-fig-auroc-histogram-output-1.png" />
<figcaption>Figure 2: Area under the receiver operating characteristic (auROC) curves for each of 300 (10 x 30) cross validation splits. The dashed line represents the median auROC across all 300 splits.</figcaption>
</figure>

``` r
roc_data <- probs |> 
  roc_curve(prob_logi, truth = label)
  
plot_roc <- function(df, line_colors){
  df |> 
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
    geom_path(linewidth = 1.25) +
    geom_abline(lty = 3) +
    coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
    labs(x = "False Positive Rate",
        y = "True Positive Rate") +
  scale_color_manual(values = line_colors)
}

roc_data |>
  mutate(sensitivity = round(sensitivity, 4),
         specificity = round(specificity, 4)) |>
  group_by(sensitivity, specificity) |> 
  summarise(.threshold = mean(.threshold)) |> 
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = .threshold)) +
  #ggplot(aes(x = specificity, y = sensitivity, color = .threshold)) +
  geom_path(linewidth = 1) +
  geom_abline(lty = 3) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(x = "False Positive Rate",
       y = "True Positive Rate") +
  scale_x_continuous(breaks = seq(0,1,.25),
    labels = sprintf("%.2f", seq(0,1,.25))) + # to flip axis
  scale_color_gradient(low="blue", high = "red", name = "Threshold") +
  theme(axis.text = element_text(size = rel(1.50)), 
        axis.title = element_text(size = rel(1.75)))
```


<figure id="fig-auroc-plot">
<img src="attachment:index_files/figure-ipynb/notebooks-auROC_plot-fig-auroc-plot-output-2.png" />
<figcaption>Figure 3: Area under the receiver operating characteristic (auROC) curve for overall validation set performance across all possible classification thresholds.</figcaption>
</figure>

``` r
pp_tidy <- pp |> 
  tidy(seed = 123)

q <- c(0.025, 0.5, 0.975)

ci_day <- pp_tidy |> 
  summarize(median = quantile(posterior, probs = q[2]),
            lower = quantile(posterior, probs = q[1]), 
            upper = quantile(posterior, probs = q[3])) |> 
  mutate(y = 30)

pp_tidy |> 
  ggplot(aes(x = posterior)) + 
  geom_density(color = c("#af1f21"), fill = "white", alpha = 1, lwd = .8) +
  #geom_segment(mapping = aes(y = y, yend = y, x = lower, xend = upper), color = c("#af1f21"),
                #data = ci_day, lwd = 1) +
  geom_errorbar(aes(y = ci_day$y, xmin = ci_day$lower, xmax = ci_day$upper), color = c("darkblue"), lwd = 1) +
  geom_vline(xintercept = ci_day$median, color = c("darkblue"), lwd = 1, linetype="dashed") +
  geom_vline(xintercept = .5, color = "darkblue", lwd = 1, linetype="dotted") +
  scale_x_continuous(limits=c(0.49,.76)) +
  scale_y_continuous(expand = c(0,0)) +
  ylab("Posterior Probability Density") +
  xlab("Area Under ROC Curve")
```

<figure id="fig-pp">
<img src="attachment:index_files/figure-ipynb/notebooks-auROC_distribution_posterior-fig-pp-output-1.png" />
<figcaption>Figure 4: Posterior probability distribution of model performance with a 95% credible interval. The dashed line represents median auROC across the sampling distribution, while the dotted line represents chance performance (auROC = 0.50).</figcaption>
</figure>

``` r
cols <- c("prob_raw" = "#FF9898FF", "prob_logi" = "#A91E45FF")

probs |>
  mutate(.pred_lapse = .pred_Lapse) |>
  filter(method == "prob_raw" | method == "prob_logi") |> 
  cal_plot_breaks(truth = label, 
                  estimate = .pred_lapse,
                  .by = method) +
  scale_color_manual(values = cols,
                     aesthetics = c("color", "fill")) +
  ylab("Observed Lapse Rate") +
  xlab("Predicted Lapse Probability (Bin Midpoint)") +
  facet_grid(~factor(method, levels=c('prob_raw','prob_logi'),
                     labels = c("Raw (Uncalibrated) Probability",
                                "Logistic (Calibrated) Probability"))) +
  scale_y_continuous(breaks = seq(0,1, by = .1),
                     limits = seq(0,1)) +
  scale_x_continuous(breaks = seq(0,1, by = .1),
                     limits = seq(0,1)) +
  theme_classic() +
  theme(legend.position="none")
```


<figure id="fig-calibration">
<img src="attachment:index_files/figure-ipynb/notebooks-calibration_plot-fig-calibration-output-2.png" />
<figcaption>Figure 5: Comparison between raw (uncalibrated) and logistic (calibrated) probabilities. Predicted lapse probability represents the predicted probabilities derived from the model, whereas observed lapse rate reflects the true rate of lapses in the data. The dashed y = x line represents perfect calibration, where predicted probabilities match observed rates. Each point is plotted at the midpoint of its bin; bins are 10 percentage points wide (e.g., 5% is the midpoint of the 0-10% bin).</figcaption>
</figure>

### Feature Importance

Global importance (mean absolute Shapley values) for feature categories is shown in <a href="#fig-shaps-group" class="quarto-xref">Figure 6</a>. Three aggregated feature categories were identified as being particularly important in contributing to model predictions: time spent at risky locations, time spent at different types of location, and time spent at locations with varying levels of alcohol availability. Other aggregated feature groups, both context-supplemented and without, did not appear to be strong and unique global contributors to model predictions.

``` r
shaps_grp |>
  group_by(variable_grp) |>
  summarize(mean_value = (mean(abs(value)))) |> 
  mutate(group = reorder(variable_grp, mean_value)) |> #, sum)) |>
  #mutate(window = fct(window, levels = c("week", "day", "hour"))) |> 
  ggplot() +
  geom_bar(aes(x = group, y = mean_value), stat = "identity", fill = "#af1f21") +
  ylab("Mean |SHAP| value (in Log-Odds)") +
  xlab("") +
  coord_flip()
```

<figure id="fig-shaps-group">
<img src="attachment:index_files/figure-ipynb/notebooks-shaps-fig-shaps-group-output-1.png" />
<figcaption>Figure 6: Grouped SHAP values displaying relative feature importance calculated using mean absolute values. Larger log-odds values indicate greater contribution to predictions in the model.</figcaption>
</figure>

### Algorithmic Fairness

<a href="#fig-fairness-subgroups" class="quarto-xref">Figure 7</a> shows differences in model performance by race (*N* white = 127, *N* non-white = 19), sex (*N* male = 74, *N* female = 72), age (*N* younger than 55 = 126, *N* 55 or older = 20), and income (*N* below federal poverty line = 48, *N* above federal poverty line = 98). All group comparisons were reliably different (probability \> .95) across models, such that identities with higher assumed privilege were associated with better model performance. White, non-Hispanic participants demonstrated 0.055 greater model performance than Hispanic and/or non-white participants (range = 0.026-0.083, probability = 1.000). Male participants demonstrated 0.036 greater model performance than female participants (range = 0.012-0.060, probability = 0.999). Younger participants demonstrated 0.106 greater model performance than older participants (range = 0.080-0.133, probability = 1.000). Finally, participants above the poverty line demonstrated 0.056 greater model performance than those below the poverty line (range = 0.033-0.079, probability = 1.000).
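A subgroup comparison of this kind can be summarized by differencing posterior draws for the two groups. The base R sketch below is a minimal illustration on simulated draws: `auroc_group_a` and `auroc_group_b` are hypothetical stand-ins with made-up means and spreads, not estimates from the study.

``` r
set.seed(123)

# Simulated stand-ins for posterior performance draws in two subgroups (NOT study output).
auroc_group_a <- rnorm(4000, mean = 0.72, sd = 0.010)
auroc_group_b <- rnorm(4000, mean = 0.66, sd = 0.012)

# Draw-by-draw difference in performance between the two subgroups
diff_draws <- auroc_group_a - auroc_group_b

contrast <- c(
  median_diff = median(diff_draws),
  lower = unname(quantile(diff_draws, 0.025)),
  upper = unname(quantile(diff_draws, 0.975)),
  # Posterior probability that group A outperforms group B
  p_a_greater = mean(diff_draws > 0)
)

round(contrast, 3)
```

Under this approach, a difference is treated as reliable when the share of draws favoring one group exceeds the chosen threshold (here, .95).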

``` r
cowplot::plot_grid(fig_race, fig_sex, fig_age, fig_income, align="v")
```

<figure id="fig-fairness-subgroups">
<img src="attachment:index_files/figure-ipynb/notebooks-fairness-fig-fairness-subgroups-output-1.png" />
<figcaption>Figure 7: 95% credible intervals across posterior probability distributions by subgroup at differential levels of privilege.</figcaption>
</figure>

## Discussion

### Model Performance

-   Our day-level model of lapse prediction using geolocation data performs adequately. Models with an auROC of around .70 are generally considered to have “fair” performance. This suggests that, while there is still substantial room for improvement, geolocation data can predict alcohol lapses in the next day with fair sensitivity and specificity.

-   Even after calibration, predicted lapse probabilities exceed the observed lapse rate in our sample, meaning that the model overpredicts the occurrence of lapses.

-   This is not necessarily an issue if we are trying to quantify relative risk to individuals using a risk monitoring system.

-   The feature categories with the highest mean absolute Shapley values were time spent at risky locations, time spent at different types of location, and time spent at locations with varying levels of alcohol availability.

-   These features were all generated utilizing additional context supplied by participants after a given location was identified as frequently visited (\> 2x in the previous month).

-   It should be noted that these features have potential to be generated without user feedback (i.e., using consumer and other publicly available data to identify establishments that sell alcohol, etc.)

-   This could potentially reduce burden further by not requiring individual input

-   However, self-classifying locations as risky might be encoding nuance that could not be feasibly obtained using public data. For example, a location might be labeled as risky from user input because it is a person-specific triggering location (e.g., scene of a traumatic event).

-   Interestingly, location valence (i.e., the emotion tied to a given location) has the fourth-highest mean absolute Shapley value, yet appears to contribute minimally to model predictions. This may be because participants were asked retrospectively about these locations at one-month follow-up visits, and so our measures of the emotional quality of a location may be too distal to be meaningful (particularly when compared to daily EMA).

-   Locations to avoid in recovery and previous drinking locations contributed little to model predictions. Could poor participant insight explain their low predictive ability?

-   Transitory movement, location variance, and time spent out of the home in the evening also contributed little. These features lack individual input from participants and are not easily tied to meaningful lapse risk factors.

### Model Fairness

### Future directions

-   Baseline model?
-   Add in more affective features
-   Add in risk-terrain modeling features
-   Add in other important features that could contribute to movement patterns like day of the week and weather
-   Test final model
-   Further calibration
-   Break down three top performing features into their subcomponents (i.e., high risk locations, medium risk locations, and low risk locations; yes there is alcohol available here or no there is not alcohol available here) to obtain a more nuanced understanding of model performance.

### Conclusion

This study demonstrates that it is feasible to predict lapse with a fair level of accuracy using geolocation data, suggesting that geolocation data are a viable supplement for risk prediction monitoring systems. However, model performance was reliably lower for participants who were non-white and/or Hispanic, female, older, or below the poverty line. Moving forward, additional risk-relevant features will be added to the model in an effort to improve prediction, and the final model will be evaluated.

## References

Areàn, Patricia A., Kien Hoa Ly, and Gerhard Andersson. 2016. “[Mobile Technology for Mental Health Assessment](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4969703).” *Dialogues in Clinical Neuroscience* 18 (2): 163–69.

Attwood, Sophie, Hannah Parke, John Larsen, and Katie L. Morton. 2017. “Using a Mobile Health Application to Reduce Alcohol Consumption: A Mixed-Methods Evaluation of the Drinkaware Track & Calculate Units Application.” *BMC Public Health* 17 (1): 394. <https://doi.org/10.1186/s12889-017-4358-9>.

Brandon, Thomas H., Jennifer Irvin Vidrine, and Erika B. Litvin. 2007. “Relapse and Relapse Prevention.” *Annual Review of Clinical Psychology* 3: 257–84. <https://doi.org/10.1146/annurev.clinpsy.3.022806.091455>.

Carreiro, Stephanie, Melissa Taylor, Sloke Shrestha, Megan Reinhardt, Nicole Gilbertson, and Premananda Indic. 2021. “Realize, Analyze, Engage (RAE): A Digital Tool to Support Recovery from Substance Use Disorder.” *Journal of Psychiatry and Brain Science* 6: e210002. <https://doi.org/10.20900/jpbs.20210002>.

Doryab, Afsaneh, Daniella K. Villalba, Prerna Chikersal, Janine M. Dutcher, Michael Tumminia, Xinwen Liu, Sheldon Cohen, et al. 2019. “Identifying Behavioral Phenotypes of Loneliness and Social Isolation with Passive Sensing: Statistical Analysis, Data Mining and Machine Learning of Smartphone and Fitbit Data.” *JMIR mHealth and uHealth* 7 (7): e13209. <https://doi.org/10.2196/13209>.

Epstein, David H., Matthew Tyburski, Ian M. Craig, Karran A. Phillips, Michelle L. Jobes, Massoud Vahabzadeh, Mustapha Mezghanni, Jia-Ling Lin, C. Debra M. Furr-Holden, and Kenzie L. Preston. 2014. “Real-Time Tracking of Neighborhood Surroundings and Mood in Urban Drug Misusers: Application of a New Method to Study Behavior in Its Geographical Context.” *Drug and Alcohol Dependence* 134 (January): 22–29. <https://doi.org/10.1016/j.drugalcdep.2013.09.007>.

Gonzalez, Vivian M., and Patrick L. Dulin. 2015. “Comparison of a Smartphone App for Alcohol Use Disorders with an <span class="nocase">Internet-based</span> Intervention Plus Bibliotherapy: A Pilot Study.” *Journal of Consulting and Clinical Psychology* 83 (2): 335–45. <https://doi.org/10.1037/a0038620>.

Gustafson, David H., Fiona M. McTavish, Ming-Yuan Chih, Amy K. Atwood, Roberta A. Johnson, Michael G. Boyle, Michael S. Levy, et al. 2014. “A Smartphone Application to Support Recovery From Alcoholism: A Randomized Clinical Trial.” *JAMA Psychiatry* 71 (5): 566. <https://doi.org/10.1001/jamapsychiatry.2013.4642>.

Heller, Aaron S., Tracey C. Shi, C. E. Chiemeka Ezie, Travis R. Reneau, Lara M. Baez, Conor J. Gibbons, and Catherine A. Hartley. 2020. “Association Between Real-World Experiential Diversity and Positive Affect Relates to Hippocampal-Striatal Functional Connectivity.” *Nature Neuroscience* 23 (7): 800–804. <https://doi.org/10.1038/s41593-020-0636-4>.

“Highlights for the 2022 National Survey on Drug Use and Health.” n.d.

Janak, Patricia H., and Nadia Chaudhri. 2010. “The Potent Effect of Environmental Context on Relapse to Alcohol-Seeking After Extinction.” *The Open Addiction Journal* 3 (January): 76–87. <https://doi.org/10.2174/1874941001003010076>.

Japkowicz, Nathalie. 2000. “The Class Imbalance Problem: Significance and Strategies.” In *Proc. Of the Int’l Conf. On Artificial Intelligence*, 56:111–17.

Jonathan, P., W. J. Krzanowski, and W. V. McCarthy. 2000. “On the Use of Cross-Validation to Assess Performance in Multivariate Prediction.” *Statistics and Computing* 10 (3): 209–29. <https://doi.org/10.1023/A:1008987426876>.

Kuhn, Max, and Kjell Johnson. 2018. *Applied Predictive Modeling*. 1st ed. 2013, Corr. 2nd printing 2018 edition. New York: Springer. <https://doi.org/10.1007/978-1-4614-6849-3>.

Kwan, Mei-Po, Jue Wang, Matthew Tyburski, David H. Epstein, William J. Kowalczyk, and Kenzie L. Preston. 2019. “Uncertainties in the Geographic Context of Health Behaviors: A Study of Substance Users’ Exposure to Psychosocial Stress Using GPS Data.” *International Journal of Geographical Information Science* 33 (6): 1176–95. <https://doi.org/10.1080/13658816.2018.1503276>.

LeCocq, Mandy Rita, Patrick A. Randall, Joyce Besheer, and Nadia Chaudhri. 2020. “Considering Drug-Associated Contexts in Substance Use Disorders and Treatment Development.” *Neurotherapeutics: The Journal of the American Society for Experimental NeuroTherapeutics* 17 (1): 43–54. <https://doi.org/10.1007/s13311-019-00824-2>.

Lundberg, Scott M., and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In *Proceedings of the 31st International Conference on Neural Information Processing Systems*, 4768–77. NIPS’17. Red Hook, NY, USA: Curran Associates Inc.

Mohr, David C., Mi Zhang, and Stephen M. Schueller. 2017. “Personal Sensing: Understanding Mental Health Using Ubiquitous Sensors and Machine Learning.” *Annual Review of Clinical Psychology* 13 (1): 23–47. <https://doi.org/10.1146/annurev-clinpsy-032816-044949>.

Naughton, Felix, Sarah Hopewell, Neal Lathia, Rik Schalbroeck, Chloë Brown, Cecilia Mascolo, Andy McEwen, and Stephen Sutton. 2016. “A Context-Sensing Mobile Phone App (Q Sense) for Smoking Cessation: A Mixed-Methods Study.” *JMIR mHealth and uHealth* 4 (3): e106. <https://doi.org/10.2196/mhealth.5787>.

“Physical Activity Guidelines for Americans, 2nd Edition.” n.d.

Rajkomar, Alvin, Michaela Hardt, Michael D. Howell, Greg Corrado, and Marshall H. Chin. 2018. “Ensuring Fairness in Machine Learning to Advance Health Equity.” *Annals of Internal Medicine* 169 (12): 866–72. <https://doi.org/10.7326/M18-1990>.

Raugh, Ian M., Sydney H. James, Cristina M. Gonzalez, Hannah C. Chapman, Alex S. Cohen, Brian Kirkpatrick, and Gregory P. Strauss. 2020. “Geolocation as a Digital Phenotyping Measure of Negative Symptoms and Functional Outcome.” *Schizophrenia Bulletin* 46 (6): 1596–1607. <https://doi.org/10.1093/schbul/sbaa121>.

Schick, Melissa R., Nichea S. Spillane, and Katherine L. Hostetler. 2020. “A Call to Action: A Systematic Review Examining the Failure to Include Females and Members of Minoritized Racial/Ethnic Groups in Clinical Trials of Pharmacological Treatments for Alcohol Use Disorder.” *Alcoholism: Clinical and Experimental Research* 44 (10): 1933–51. <https://doi.org/10.1111/acer.14440>.

Shin, Jaeeun, and Sung Man Bae. 2023. “A Systematic Review of Location Data for Depression Prediction.” *International Journal of Environmental Research and Public Health* 20 (11): 5984. <https://doi.org/10.3390/ijerph20115984>.

Stahler, Gerald J., Jeremy Mennis, and David A. Baron. 2013. “Geospatial Technology and the ‘Exposome’: New Perspectives on Addiction.” *American Journal of Public Health* 103 (8): 1354–56. <https://doi.org/10.2105/AJPH.2013.301306>.

“The Science of Drug Use and Addiction: The Basics NIDA Archives.” n.d. <https://archives.nida.nih.gov/publications/media-guide/science-drug-use-addiction-basics>. Accessed October 7, 2024.

Tucker, Jalie A., Susan D. Chandler, and Katie Witkiewitz. 2020. “Epidemiology of Recovery From Alcohol Use Disorder.” *Alcohol Research: Current Reviews* 40 (3): 02. <https://doi.org/10.35946/arcr.v40.3.02>.

Walton, M. A., T. M. Reischl, and C. S. Ramanathan. 1995. “Social Settings and Addiction Relapse.” *Journal of Substance Abuse* 7 (2): 223–33. <https://doi.org/10.1016/0899-3289(95)90006-3>.

Walton, Maureen A., Frederic C. Blow, C. Raymond Bingham, and Stephen T. Chermack. 2003. “Individual and Social/Environmental Predictors of Alcohol and Drug Use 2 Years Following Substance Abuse Treatment.” *Addictive Behaviors* 28 (4): 627–42. <https://doi.org/10.1016/s0306-4603(01)00284-2>.

Wang, Angelina, Vikram V. Ramaswamy, and Olga Russakovsky. 2022. “Towards Intersectionality in Machine Learning: Including More Identities, Handling Underrepresentation, and Performing Evaluation.” In *Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency*, 336–49. FAccT ’22. New York, NY, USA: Association for Computing Machinery. <https://doi.org/10.1145/3531146.3533101>.

Wang, Xiaomeng, Yishi Zhang, and Ruilin Zhu. 2022. “A Brief Review on Algorithmic Fairness.” *Management System Engineering* 1 (1): 7. <https://doi.org/10.1007/s44176-022-00006-z>.

Wawira Gichoya, Judy, Liam G. McCoy, Leo Anthony Celi, and Marzyeh Ghassemi. 2021. “Equity in Essence: A Call for Operationalising Fairness in Machine Learning for Healthcare.” *BMJ Health & Care Informatics* 28 (1): e100289. <https://doi.org/10.1136/bmjhci-2020-100289>.

Witkiewitz, Katie, and G. Alan Marlatt. 2004. “Relapse Prevention for Alcohol and Drug Problems: That Was Zen, This Is Tao.” *The American Psychologist* 59 (4): 224–35. <https://doi.org/10.1037/0003-066X.59.4.224>.