Site Level Author Type Discussion
===

Notebook for discussing annotator disagreements in the initial training set.

This stuff didn't make the paper.

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
import os
import re
import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split

from collections import Counter
import sqlite3
from nltk import word_tokenize
from html.parser import HTMLParser
from tqdm import tqdm
import random
import pickle
import json

from datetime import datetime
from pprint import pprint

import matplotlib.pyplot as plt
import matplotlib.dates as md
import matplotlib
import pylab as pl
from IPython.display import display, HTML

In [4]:
import sys
sys.path.append("/home/srivbane/levon003/repos/qual-health-journeys/annotation_data")
import journal as journal_utils
import db as db_utils

In [5]:
working_dir = "/home/srivbane/shared/caringbridge/data/projects/sna-social-support/data_selection"
os.makedirs(working_dir, exist_ok=True)

In [77]:
annotation_web_client_database = "/home/srivbane/shared/caringbridge/data/projects/qual-health-journeys/instance/cbAnnotator.sqlite"


def get_annotation_db():
    db = sqlite3.connect(
            annotation_web_client_database,
            detect_types=sqlite3.PARSE_DECLTYPES
        )
    db.row_factory = sqlite3.Row
    return db


def get_author_annotations():
    # open the connection before the try block so `db` is always bound in `finally`
    db = get_annotation_db()
    try:
        # single quotes mark a SQL string literal; double quotes denote identifiers
        cursor = db.execute(
            """SELECT id, site_id, journal_oid, username, data
                FROM journalAnnotation
                WHERE annotation_type = 'journal_author_type'
                GROUP BY site_id, journal_oid, username, id""")
        journal_author_annotations = cursor.fetchall()
        annotation_strings = [{'id': a['id'],
                               'site_id': a['site_id'],
                               'journal_oid': a['journal_oid'],
                               'username': a['username'],
                               'data': a['data']}
                              for a in journal_author_annotations]
        df = pd.DataFrame(annotation_strings)
        # here, we drop duplicates, taking the highest-id value for each site/journal/username group
        filtered_df = df.sort_values(by=['site_id', 'journal_oid', 'username', 'id']).groupby(by=['site_id', 'journal_oid', 'username']).tail(1)
        return filtered_df
    finally:
        db.close()


# Test extraction of annotations
get_author_annotations().head()

      data     id               journal_oid  site_id  username
7352    cg  29706  51bdf3e56ca0048f4e00ceca        1  levon003
7351    cg  29705  51bdf3e56ca0048f4e00ced4        1  levon003
7353    cg  29707  51bdf3e56ca0048f4e00d2c8        1  levon003
7442   unk  29844  51bdf72d6ca004d458006a20    13741  levon003
7410     p  29812  51bdf72d6ca004d458006a7c    13741  levon003
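
The "keep the highest-id row per site/journal/username group" step can also be expressed directly in SQL. The following is a minimal sketch only, run against an in-memory database: the column types and the sample rows are assumptions made up for illustration, and only the table and column names come from the query above.

```python
import sqlite3

# In-memory stand-in for the annotation database; schema types and rows are hypothetical.
db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.execute("""CREATE TABLE journalAnnotation (
    id INTEGER PRIMARY KEY,
    site_id INTEGER,
    journal_oid TEXT,
    username TEXT,
    annotation_type TEXT,
    data TEXT)""")
rows = [
    (1, 1, "a", "levon003", "journal_author_type", "p"),
    (2, 1, "a", "levon003", "journal_author_type", "cg"),  # later revision of id 1
    (3, 1, "a", "mill6273", "journal_author_type", "cg"),
    (4, 1, "b", "levon003", "journal_author_type", "unk"),
    (5, 1, "b", "levon003", "other_type", "x"),            # different annotation type
]
db.executemany("INSERT INTO journalAnnotation VALUES (?, ?, ?, ?, ?, ?)", rows)

# Find the largest id in each (site_id, journal_oid, username) group,
# then join back to fetch the full row for that id.
latest = db.execute("""
    SELECT a.id, a.site_id, a.journal_oid, a.username, a.data
    FROM journalAnnotation AS a
    JOIN (SELECT MAX(id) AS max_id
          FROM journalAnnotation
          WHERE annotation_type = 'journal_author_type'
          GROUP BY site_id, journal_oid, username) AS m
      ON a.id = m.max_id
    ORDER BY a.id""").fetchall()
latest_pairs = [(r["id"], r["data"]) for r in latest]
db.close()

print(latest_pairs)
```

Doing the deduplication in pandas, as the notebook does, keeps the raw rows available for inspection; the SQL form is just an alternative if only the latest annotations are needed.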


In [78]:
df = get_author_annotations()
len(df)

8019

In [79]:
# verify lack of duplicate entries
for key, group in df.groupby(by=['site_id', 'journal_oid', 'username']):
    if len(group) > 1:
        print(key)
        print(group)
        assert False

In [65]:
Counter(df.data).most_common()

[('p', 6467),
 ('cg', 1434),
 ('unk', 86),
 ('pcg', 20),
 ('all_cg', 10),
 ('all_p', 2)]

In [66]:
df.head()

      data     id               journal_oid  site_id  username
7352    cg  29706  51bdf3e56ca0048f4e00ceca        1  levon003
7351    cg  29705  51bdf3e56ca0048f4e00ced4        1  levon003
7353    cg  29707  51bdf3e56ca0048f4e00d2c8        1  levon003
7442   unk  29844  51bdf72d6ca004d458006a20    13741  levon003
7410     p  29812  51bdf72d6ca004d458006a7c    13741  levon003


In [74]:
def get_webclient_url(site_id, journal_oid=None):
    url = f"127.0.0.1:5000/siteId/{site_id}"
    if journal_oid is not None:
        url += f"#{journal_oid}"
    return url

### Load sites in the training set

In [69]:
annotation_assignment_dir = "/home/srivbane/shared/caringbridge/data/projects/qual-health-journeys/instance/annotation_data/assignments/levon003"
fname = "sna_author_type_train_n10.txt"
train_site_ids = []
with open(os.path.join(annotation_assignment_dir, fname), 'r') as infile:
    infile.readline()  # strip the header line
    for line in infile:
        line = line.strip()
        if line.startswith('#') or line == "":
            continue
        site_id = int(line)
        train_site_ids.append(site_id)
len(train_site_ids)

10

In [70]:
len(df[df.journal_oid == 'site'])

24

In [71]:
train_df = df[df.site_id.isin(train_site_ids)]
len(train_df)

880

In [75]:
print(f"{'Site Id':>9} {'Hannah':>10} {'Zach':>10} {'Agree?':>15}")
print("="*80)
for site_id in train_site_ids:
    site_df = train_df[train_df.site_id == site_id]
    mill_df = site_df[site_df.username == 'mill6273']
    levon003_df = site_df[site_df.username == 'levon003']
    mill_site_annotation = mill_df[mill_df.journal_oid == 'site'].iloc[0]['data']
    levon003_site_annotation = levon003_df[levon003_df.journal_oid == 'site'].iloc[0]['data']
    is_agreement = mill_site_annotation == levon003_site_annotation
    agreement_str = "-" if is_agreement else "DISAGREEMENT"
    print(f"{site_id:>9} {mill_site_annotation:>10} {levon003_site_annotation:>10} {agreement_str:>15}    {get_webclient_url(site_id)}")

  Site Id     Hannah       Zach          Agree?
    64076     all_cg         cg    DISAGREEMENT    127.0.0.1:5000/siteId/64076
   378016     all_cg         cg    DISAGREEMENT    127.0.0.1:5000/siteId/378016
  1041742         cg     all_cg    DISAGREEMENT    127.0.0.1:5000/siteId/1041742
   594127     all_cg     all_cg               -    127.0.0.1:5000/siteId/594127
    93134         cg     all_cg    DISAGREEMENT    127.0.0.1:5000/siteId/93134
   848164         cg     all_cg    DISAGREEMENT    127.0.0.1:5000/siteId/848164
  1115733        unk        unk               -    127.0.0.1:5000/siteId/1115733
   170125         cg     all_cg    DISAGREEMENT    127.0.0.1:5000/siteId/170125
   705855         cg     all_cg    DISAGREEMENT    127.0.0.1:5000/siteId/705855
   609641          p      all_p    DISAGREEMENT    127.0.0.1:5000/siteId/609641
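
Every site-level disagreement in the table above pairs a label with its `all_` counterpart (`cg` vs `all_cg`, `p` vs `all_p`). The sketch below recomputes agreement after stripping that prefix. It assumes the `all_` prefix marks a variant of the same author type, which this notebook does not define, and the `collapse` helper is hypothetical. The labels are copied from the table.

```python
# Site-level labels copied from the table above: (site_id, Hannah, Zach).
site_labels = [
    (64076, "all_cg", "cg"),
    (378016, "all_cg", "cg"),
    (1041742, "cg", "all_cg"),
    (594127, "all_cg", "all_cg"),
    (93134, "cg", "all_cg"),
    (848164, "cg", "all_cg"),
    (1115733, "unk", "unk"),
    (170125, "cg", "all_cg"),
    (705855, "cg", "all_cg"),
    (609641, "p", "all_p"),
]


def collapse(label):
    """Strip the 'all_' prefix (assumed to mark a variant of the same author type)."""
    return label[len("all_"):] if label.startswith("all_") else label


# Count sites where the two annotators agree, before and after collapsing.
exact = sum(h == z for _, h, z in site_labels)
collapsed = sum(collapse(h) == collapse(z) for _, h, z in site_labels)

print(f"Exact agreement:     {exact}/{len(site_labels)}")
print(f"Collapsed agreement: {collapsed}/{len(site_labels)}")
```

Under that assumption the two annotators agree on the author type for every training site, so the site-level disagreement is entirely about when to use the `all_` variant.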


In [76]:
for site_id in train_site_ids:
    site_df = train_df[(train_df.site_id == site_id)&(train_df.journal_oid != 'site')]
    mill_df = site_df[site_df.username == 'mill6273']
    levon_df = site_df[site_df.username == 'levon003']
    m = pd.merge(levon_df, mill_df, on='journal_oid', suffixes=("_levon", "_mill"))
    is_agreement_list = m.data_levon == m.data_mill
    print(f"Agreement for site {site_id}: {np.sum(is_agreement_list) / len(m) * 100:.3f}%")
    for journal_oid, mill_annotation, levon_annotation in zip(m.journal_oid, m.data_mill, m.data_levon):
        is_agreement = mill_annotation == levon_annotation
        if is_agreement:
            continue
        #agreement_str = "-" if is_agreement else "DIS"
        print(f"{journal_oid:>20} {mill_annotation:>10} {levon_annotation:>10}  {get_webclient_url(site_id, journal_oid)}")
    print()
    print()

Agreement for site 64076: 97.368%
51be07406ca004300d008cbc         cg        unk  127.0.0.1:5000/siteId/64076#51be07406ca004300d008cbc


Agreement for site 378016: 100.000%


Agreement for site 1041742: 99.091%
55ce152acb16b4f1728ab1e5        unk         cg  127.0.0.1:5000/siteId/1041742#55ce152acb16b4f1728ab1e5


Agreement for site 594127: 100.000%


Agreement for site 93134: 94.949%
51be10256ca004c52800cee1        pcg         cg  127.0.0.1:5000/siteId/93134#51be10256ca004c52800cee1
51be10256ca004c52800d166        pcg         cg  127.0.0.1:5000/siteId/93134#51be10256ca004c52800d166
51be10256ca004c52800d441         cg        unk  127.0.0.1:5000/siteId/93134#51be10256ca004c52800d441
51be10256ca004c52800d44f         cg        unk  127.0.0.1:5000/siteId/93134#51be10256ca004c52800d44f
51be10266ca004c52800d573        pcg         cg  127.0.0.1:5000/siteId/93134#51be10266ca004c52800d573


Agreement for site 848164: 100.000%


Agreement for site 1115733: 100.000%


Agreement for site 170125: 100.000%
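
The per-site percentages above are raw agreement, which does not correct for agreement expected by chance when one label dominates. Cohen's kappa is one common correction. The following is a minimal sketch, not part of the original analysis: `cohen_kappa` is a hypothetical helper and the label lists are made-up toy data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of each annotator's label rate.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only; these are not the notebook's annotations.
levon = ["cg", "cg", "p", "p"]
mill = ["cg", "p", "p", "p"]
kappa = cohen_kappa(levon, mill)
print(f"kappa = {kappa:.3f}")
```

In the notebook, the merged columns `m.data_levon` and `m.data_mill` could be passed in as lists to get a per-site value.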


#### Notes from discussion:

General guidance: No mindreading. Even if you suspect that the patient couldn't have written the post, tag it as authored by the patient unless the caregiver clearly states that they are the true author.

CG/P vs "Unknown": To count as the author, the caregiver or patient should probably contribute more than a sentence, and that text should communicate something generally informative (and not just say "Patient X asked me to pass these words on:").

Nearly all 3rd-party text should be marked Unknown, unless the post is authored by a 3rd-party caregiver. The "one sentence rule" above applies here: if there is substantive content in addition to the 3rd-party content (e.g. an informational quote, song lyrics, a Bible verse), then tag the post according to the author type of the substantive content.

P/CG: "Guest" text from patients or caregivers that is being shared in a single update needs to be written with the INTENTION of being shared as a health-related (or not necessarily health-related) update. For example, quoting a person's words is not enough unless that person passed those words on to be shared.


#### Raw notes:

Who chose the words?

No mindreading
See: http://127.0.0.1:5000/siteId/64076#51be07406ca004300d0089b4
and explicate this further

How much caregiver presence is enough to indicate authorship? Probably more than a sentence, and it should indicate something more informative than just that they are passing on the patient's words.

Nearly all 3rd-party text should be marked Unknown, unless the post is authored by a 3rd-party caregiver.
	

P/CG: "Guest" text from patients or caregivers that is being shared in an update needs to be written with the INTENTION of being shared as a health-related (or not necessarily health-related) update.

Does it matter if the patient is physically able to author posts?
I argue no.

Observation: Sites about very young kids seem different?
Birth and early childhood illness issues seem different from other things, because the "caregiver" is essentially the person who needs support!


Is "guest post" a type?


Characterize: how sites START, how sites END, and what's in between.
