Responsibility Stickiness
===

To further investigate the validity of the responsibility model, we tested the expectation that an author who mentions a responsibility in one update is more likely to mention that responsibility in other updates authored in the same week. For each responsibility $R$, we fit a Poisson regression that predicts the number of other updates in a week containing $R$ from whether a randomly selected update from that week contains $R$. We consider only weeks with at least 2 updates and use the number of other updates authored that week as the exposure, additionally controlling for the baseline rate at which updates on the site are predicted to contain $R$. Incidence rate ratios and associated 95\% confidence intervals are shown in Table X. When an update is predicted to contain a responsibility, other updates in that week are predicted to contain $R$ at 1.XX times the rate observed when the update is predicted not to contain $R$. These results provide additional evidence that the responsibility predictions correlate with reality.

\begin{tabular}{@{}llll@{}}
\toprule
     & Contains $R$? & Baseline rate of $R$   & $G^2$ (df) \\ \midrule
CS   & $0.00\pm0.00$ & $0.00\pm0.00$ & XXX (YYY)  \\
SM   & $0.00\pm0.00$ & $0.00\pm0.00$ & XXX (YYY)  \\
CP   & $0.00\pm0.00$ & $0.00\pm0.00$ & XXX (YYY)  \\
FM   & $0.00\pm0.00$ & $0.00\pm0.00$ & XXX (YYY)  \\
GB   & $0.00\pm0.00$ & $0.00\pm0.00$ & XXX (YYY)  \\
BC   & $0.00\pm0.00$ & $0.00\pm0.00$ & XXX (YYY)  \\
Mean & 1.XX          & 2.YY          & ---        \\ \bottomrule
\end{tabular}

The intuition is that when a patient mentions a responsibility, they are likely to discuss that responsibility again soon. Finding evidence of this establishes additional convergent validity between the responsibility model and our expectations.

The proposed approach is a Poisson regression that estimates the rate at which updates published in a week are predicted to contain a responsibility.
Specifically, for each site, we bucket the site's updates into weeks and include only weeks with at least 2 authored journal updates. Then, we randomly select one of the updates authored in each week as the seed update and ask: "Is the probability of the other updates containing this responsibility higher if the seed update contains the responsibility?" The only additional confounders we control for are the baseline proportion of updates on this site that contain the responsibility in question and, potentially, the amount of time elapsed since the start of the site.

```
Poisson regression model: response and variables for responsibility A
y  = number of non-seed updates in the week that contain A
x1 = whether the seed update contains A (0 or 1)
x2 = proportion of updates on this site containing A, i.e. the baseline occurrence of A on this site (not including this week)
x3 = week rank (captures longer-term positive or negative trends in the baseline occurrence of A; ideally it should not matter whether we include this variable, but we may need to include it)
offset = log(# updates in week - 1), i.e. the log of the exposure, where the exposure is the number of non-seed updates published on this site during this week
```
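
Written as an equation, the specification above corresponds to a log-linear model of the following form. The symbols $\beta_0, \dots, \beta_3$ and $n$ (the number of updates in the week) are notation introduced here for illustration and do not appear elsewhere in this notebook:

$$
\log \mathbb{E}[y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \log(n - 1)
$$

Under this form, $e^{\beta_1}$ is the multiplicative change in the expected rate of non-seed updates containing A when the seed update contains A, with $x_2$ and $x_3$ held fixed.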

We compute and interpret incidence rate ratios.

Incidence rate ratio interpretation: the rate ratio compares weeks whose seed update contains A to weeks whose seed update does not contain A, holding the other variables in the model constant. For example, a rate ratio of 1.5 means that when the seed update contains A, the other updates in the same week contain A at 1.5 times the rate seen when the seed update does not contain A.
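
As a rough illustration of how such a rate ratio and its confidence interval could be computed, the sketch below fits a Poisson regression with a log-exposure offset to synthetic data. Everything in it is an assumption made for illustration: `fit_poisson` is a hypothetical helper that uses Newton's method in NumPy, and the data and the true rate ratio of 1.5 are made up rather than taken from this analysis. In practice, a GLM routine from a statistics package would typically be used instead.

```python
import numpy as np

def fit_poisson(X, y, exposure, n_iter=100, tol=1e-10):
    """Hypothetical helper: Poisson regression with a log(exposure) offset.

    Fits log E[y] = X @ beta + log(exposure) by Newton's method and returns
    the coefficients and their asymptotic standard errors.
    """
    offset = np.log(exposure)
    beta = np.zeros(X.shape[1])
    # Start the intercept (column 0) at the overall log rate for stability.
    beta[0] = np.log(y.sum() / exposure.sum())
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)     # fitted means
        grad = X.T @ (y - mu)              # score vector
        info = X.T @ (X * mu[:, None])     # Fisher information matrix
        step = np.linalg.solve(info, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    mu = np.exp(X @ beta + offset)
    cov = np.linalg.inv(X.T @ (X * mu[:, None]))
    return beta, np.sqrt(np.diag(cov))

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
n = 20000
seed_contains = rng.integers(0, 2, size=n)        # x1
site_proportion = rng.uniform(0.1, 0.9, size=n)   # x2
nonseed_total = rng.integers(1, 6, size=n)        # exposure
true_beta = np.array([-1.5, np.log(1.5), 1.0])    # assumed true rate ratio: 1.5
X = np.column_stack([np.ones(n), seed_contains, site_proportion])
y = rng.poisson(np.exp(X @ true_beta) * nonseed_total)

beta, se = fit_poisson(X, y, nonseed_total)
irr = np.exp(beta[1])
ci_low = np.exp(beta[1] - 1.96 * se[1])
ci_high = np.exp(beta[1] + 1.96 * se[1])
print(f"IRR for seed_contains: {irr:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Exponentiating the coefficient and the endpoints of its Wald interval gives the rate ratio and its 95% confidence interval on the multiplicative scale.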


In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [28]:
import pandas as pd
import numpy as np
import sklearn
import sklearn.metrics
import os
from tqdm import tqdm

import matplotlib.pyplot as plt
import matplotlib.dates as md
import matplotlib

from collections import Counter

In [3]:
import sys
sys.path.append("../annotation_data")

In [10]:
import responsibility as responsibility_utils
import db as db_utils

In [37]:
resp_subset = responsibility_utils.high_irr_responsibility_labels

In [25]:
working_dir = "/home/srivbane/shared/caringbridge/data/projects/qual-health-journeys/patient_responsibilities"

In [5]:
# load the test predictions
vw_working_dir = "/home/srivbane/shared/caringbridge/data/projects/qual-health-journeys/classification/responsibilities/vw"
all_preds_filepath = os.path.join(vw_working_dir, "vw_all_preds.pkl")
df = pd.read_pickle(all_preds_filepath)
len(df)

157389

In [6]:
pred_label_cols = [col for col in df.columns if col.endswith("_pred_label")]
pred_label_cols

['coordinating_support_pred_label',
 'sharing_medical_info_pred_label',
 'compliance_pred_label',
 'financial_management_pred_label',
 'giving_back_pred_label',
 'behavior_changes_pred_label']

In [9]:
df[pred_label_cols].head(n=5)

Unnamed: 0,coordinating_support_pred_label,sharing_medical_info_pred_label,compliance_pred_label,financial_management_pred_label,giving_back_pred_label,behavior_changes_pred_label
0,1,1,1,0,1,0
1,1,1,1,0,0,0
2,1,1,1,1,1,1
3,1,1,1,1,1,1
4,1,1,1,1,1,1


In [20]:
def get_created_at(site_id, journal_oid):
    """Return the created_at value for a single journal update."""
    site_id = int(site_id)
    # Open the connection before the try block so `db` is always bound in `finally`.
    db = db_utils.get_journal_info_db()
    try:
        cursor = db.execute("""SELECT created_at
                                    FROM journalMetadata
                                    WHERE site_id = ? AND journal_oid = ?
                                    """,
                            (site_id, journal_oid))
        journals = cursor.fetchall()
        assert journals is not None, "No journal with this site_id/journal_oid"
        # Expect exactly one matching row per (site_id, journal_oid) pair.
        assert len(journals) == 1, f"{len(journals)} journals with this site_id/journal_oid"

        j = journals[0]
        return j['created_at']
    finally:
        db.close()

In [18]:
def get_created_at_for_row(row):
    return get_created_at(row.site_id, row.journal_oid)

In [22]:
df['created_at'] = df.apply(get_created_at_for_row, axis=1)

In [23]:
journal_cols = ['site_id', 'journal_oid', 'created_at']

In [24]:
df[journal_cols + pred_label_cols].head()

Unnamed: 0,site_id,journal_oid,created_at,coordinating_support_pred_label,sharing_medical_info_pred_label,compliance_pred_label,financial_management_pred_label,giving_back_pred_label,behavior_changes_pred_label
0,106710,51be14196ca0041935009526,1232061600000,1,1,1,0,1,0
1,106710,51be14196ca0041935009660,1232385900000,1,1,1,0,0,0
2,106710,51be14196ca00419350098b4,1233079260000,1,1,1,1,1,1
3,106710,51be14196ca00419350099e0,1233675660000,1,1,1,1,1,1
4,106710,51be14196ca0041935009a62,1233890580000,1,1,1,1,1,1


In [26]:
journal_preds = df[journal_cols + pred_label_cols].copy()
journal_preds.reset_index(inplace=True)

In [27]:
journal_preds_filepath = os.path.join(working_dir, "journal_preds.csv")
journal_preds.to_csv(journal_preds_filepath)
print("Finished.")

Finished.


In [32]:
one_week = 1000 * 60 * 60 * 24 * 7  # one week in ms

In [33]:
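def get_week_labels(group):
    # Assumes all rows in `group` come from the same site.
    first_journal = np.min(group.created_at)
    last_journal = np.max(group.created_at)
    # Build week-long bin edges starting at the site's first update.
    bins = [first_journal]
    while bins[-1] < last_journal:
        bins.append(bins[-1] + one_week)

    # right=False makes each bin [edge, edge + one_week), so the first update
    # lands in week 1 together with the other updates from that week. With
    # right=True it would be isolated in its own bin 0.
    bin_labels = np.digitize(group.created_at, bins, right=False)
    return bin_labels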
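
To make the week bucketing concrete, here is a small self-contained sketch on made-up timestamps. It assigns each update a zero-based week number by integer division of the time elapsed since the earliest update, which is one way to express week boundaries. The helper name `week_index` and the sample timestamps are assumptions for illustration.

```python
import numpy as np

ONE_DAY_MS = 1000 * 60 * 60 * 24
ONE_WEEK_MS = 7 * ONE_DAY_MS

def week_index(created_at_ms):
    """Zero-based week number of each timestamp, relative to the earliest one."""
    created_at_ms = np.asarray(created_at_ms, dtype=np.int64)
    return (created_at_ms - created_at_ms.min()) // ONE_WEEK_MS

start = 1232061600000  # arbitrary starting timestamp in milliseconds
created_at = [
    start,                    # day 0
    start + 1 * ONE_DAY_MS,   # day 1
    start + 7 * ONE_DAY_MS,   # day 7
    start + 8 * ONE_DAY_MS,   # day 8
    start + 15 * ONE_DAY_MS,  # day 15
]
weeks = week_index(created_at)
print(weeks)  # [0 0 1 1 2]
```

Updates posted on days 0 and 1 share a week, as do those on days 7 and 8, so both of those weeks would pass the "at least 2 updates" filter while the last week would not.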

In [35]:
journal_preds['week'] = -1
for site_id, group in tqdm(journal_preds.groupby(by='site_id', sort=False), total=len(set(journal_preds.site_id))):
    # assign all week labels
    journal_preds.loc[group.index, 'week'] = get_week_labels(group)



In [71]:
poisson_regression_vars_filepath = os.path.join(working_dir, 'regression_vars.csv')

with open(poisson_regression_vars_filepath, 'w') as outfile:
    header = "site_id,resp,nonseed_contains_count,seed_contains,site_proportion,week_rank,nonseed_total_count"
    outfile.write(header + "\n")
    for site_id, journals in tqdm(journal_preds.groupby(by='site_id', sort=False), total=journal_preds.site_id.nunique()):
        # Proportion of updates on this site that contain each responsibility.
        # Note: computed over all of the site's updates, including the current week.
        preds = journal_preds.loc[journals.index, pred_label_cols]
        resp_site_proportions = np.sum(preds, axis=0) / preds.shape[0]
        # Relabel by responsibility name; assumes pred_label_cols and resp_subset share the same order.
        resp_site_proportions.index = resp_subset

        for week, group in journals.groupby(by='week', sort=False):
            # Only weeks with at least 2 updates have a non-seed update to count.
            if len(group) < 2:
                continue
            # Choose a seed journal uniformly at random (unseeded, so the choice differs between runs).
            num_candidates = len(group)
            seed_index = np.random.randint(0, num_candidates)

            # Does the seed journal contain each responsibility?
            seed_contains_resp = journal_preds.loc[group.index[seed_index], pred_label_cols]
            seed_contains_resp.index = resp_subset

            # Responsibility counts for the non-seed journals.
            non_seed_journals = group.drop(group.index[seed_index])
            preds = journal_preds.loc[non_seed_journals.index, pred_label_cols]
            nonseed_resp_counts = np.sum(preds, axis=0)
            nonseed_resp_counts.index = resp_subset

            # Exposure: the number of non-seed journals this week (the log is taken when fitting the model).
            offset = num_candidates - 1

            # Write one row per responsibility for this site-week.
            for resp in resp_subset:
                site_proportion = resp_site_proportions[resp]
                seed_contains = seed_contains_resp[resp]
                nonseed_count = int(nonseed_resp_counts[resp])
                week_rank = week
                line = ",".join((str(site_id),
                                 str(resp), str(nonseed_count), str(seed_contains),
                                 str(site_proportion), str(week_rank), str(offset)))
                outfile.write(line + "\n")
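
The seed-selection bookkeeping can be illustrated on a single toy week. The sketch below is not the pipeline above: it uses a made-up frame with one hypothetical label column, `resp_A`, and a fixed random seed (an assumption added here so the selection is reproducible).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# One toy week with four updates; resp_A is a made-up 0/1 prediction column.
week = pd.DataFrame({"journal": ["a", "b", "c", "d"], "resp_A": [1, 0, 1, 1]})

# Pick the seed update uniformly at random among the week's updates.
seed_pos = int(rng.integers(0, len(week)))
seed_contains = int(week["resp_A"].iloc[seed_pos])

# Every other update in the week is a non-seed update.
non_seed = week.drop(week.index[seed_pos])
nonseed_contains_count = int(non_seed["resp_A"].sum())  # response y
nonseed_total_count = len(non_seed)                     # exposure

print(seed_contains, nonseed_contains_count, nonseed_total_count)
```

Whichever update is chosen as the seed, the exposure is always one less than the number of updates in the week, and the seed's own label is excluded from the response count.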





In [72]:
resp_subset

['coordinating_support',
 'sharing_medical_info',
 'compliance',
 'financial_management',
 'giving_back',
 'behavior_changes']