Interaction Extraction
===

This notebook extracts all interactions on CaringBridge into filtered CSV files and a combined, time-sorted data-frame.

It processes:
 - Journal updates
 - Comments (standalone and from guestbooks)
 - Guestbooks
 - Amps (from updates, comments, and guestbooks)

The fields in the resulting data-frame:

 - user_id
 - site_id
 - interaction_type (update, comment, guestbook, amp)
 - interaction_oid
 - parent_type (update, comment, guestbook; journal updates are written as `journal`)
 - parent_id (depending on parent, either a journal_oid, a comment_oid, or a gb_oid; stored in the `parent_oid` column)
 - created_at
 - updated_at

Other keys:
- ancestor_type (either an update or a guestbook) - do we need this for any reason? I think not
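
As a rough illustration of the row layout (the fields above plus the ancestor keys), here is a minimal sketch of how one interaction maps to one CSV line. The `Interaction` class and `amp_interaction` helper are hypothetical names introduced only for this example, and the ids are made up; the `oid|user_id` form of an amp's interaction_oid and the literal `None` for missing parents follow what the extraction cells write.

```python
from typing import NamedTuple, Optional


class Interaction(NamedTuple):
    """One row of the interactions table (hypothetical helper, ten columns)."""
    user_id: int
    site_id: int
    interaction_type: str
    interaction_oid: str
    parent_type: Optional[str]
    parent_oid: Optional[str]
    ancestor_type: Optional[str]
    ancestor_oid: Optional[str]
    created_at: int
    updated_at: int

    def to_csv_line(self) -> str:
        # Missing values are written as the literal string "None".
        return ",".join(str(value) for value in self) + "\n"


def amp_interaction(amp_user_id, site_id, parent_type, parent_oid,
                    ancestor_type, ancestor_oid, created_at, updated_at):
    """Build an amp row; amps have no oid of their own, so one is
    composed from the parent oid and the amping user's id."""
    return Interaction(amp_user_id, site_id, "amp", f"{parent_oid}|{amp_user_id}",
                       parent_type, parent_oid, ancestor_type, ancestor_oid,
                       created_at, updated_at)


# Made-up ids and timestamps, for illustration only.
guestbook_row = Interaction(42, 7, "guestbook", "abc123", None, None, None, None,
                            1500000000000, 1500000000001)
amp_row = amp_interaction(99, 7, "guestbook", "abc123", "guestbook", "abc123",
                          1500000000000, 1500000000001)
```

Note that an amp inherits created_at and updated_at from its parent, since the raw data only lists the amping user ids.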

Question: do we need to do this for journal updates, or do we essentially already have that data?
I think we essentially have it, but I'm generating it anyway so that the file can be streamed straightforwardly.

The pseudocode process for using this data looks like:
 - Load interaction
 - Update new eligible authors based on time elapsed
   - Special case: if no time elapsed (because this interaction occurred at the literal same timestamp as the last), do nothing and re-use e.g. eligible author pools without recomputing them
   - Identify any user/site pairs with an "initial join" time (i.e. their 3rd published update on some site)
   - Do this by just slowly chewing through a stack sorted by user/site "initial join" time
   - Update site and author maps: add new authors to sites, and sites to authors.  Q: is this something that depends on FIRST update on a site, or THIRD update? In the CSCW project, we used existing + future links based on author type (being a patient). Here, seems like we just want to use current authorship i.e. THIRD update.
 - Update activity map
   - Add any intervening journal updates as activity
   - Remove any "timedout" users
 - Identify user/site sources and targets
 - Negative sample alt author(s) for each user/site source/target combination
   - Note: need a network of authors to sites they've interacted with (independent from the author-author network)
 - Generate activity features for all implicated
 - Generate network features for all implicated
 - Retrieve other cached features (such as from journal texts)
 - Store triples in database
 - Update network connections
   - Use site maps: add link to authors that are in the current site map. see Q above.
   - Point of confusion: u1 is active on s1 and u2 is active on s2 at t, u3 interacts with s2 at t+1, u1 is active on s2 at t+2. 
   What should happen? at t+1, u3->u2 exists. At t+2, u3->u1 exists.  
   How to identify that this edge should exist?  Well, could just look at in-bound edges of u2 if u1 becomes active on u2's site, then duplicate those edges to refer to u1.  But, the problem is that not all connections to u2 are because of s2.  
   Solution: store (u1,s1) pairs as nodes in the network. Problem: u1 has interacted with a site before any eligible person exists on that site, or u1 interacts before having any eligible sites themselves. Possibility: create "dummy" nodes, e.g. (u1,s1)->(*,s2) and (u1,*)->(u2,s2) respectively, that are special nodes when user info is not yet available. 
   Second thought: this is clearly absurd.  In particular, we KNOW when a user becomes active on a site (in the user_site_df).  
   So thought 1 is two separate networks: user->site, site->user
   Thought 2: just a user->site network. At time t, can just resolve each site to 0 or more users (if it resolves to 0 users, then we have updated the network with the new edge but can't generate any new sites. If it resolves to 2+ users, then we generate one new interaction for each user.)
   But this is clearly a disaster; to determine something like weakly-connected component, you need to resolve all the sites into nodes. In other words, this is just a bipartite graph by another face.
   Thought 3: user->user network.  But need a separate store of user->site links, just as a list or dict or whatever, that can be used when a user becomes active on a site; we retrieve all existing user->site links, then construct the corresponding user->user links in the graph.  (This idea seems really reasonable to me!)
   Basically, do two activities:
   1. When interaction u1->s2 happens, put {s2: u1} in the interaction dictionary (site_id -> set of interacting user_ids), get the current eligible users of s2 (d[s2], where d is site_id -> set(user_id)), and add edges (u1, uX) for each uX in d[s2].
   2. When u3 becomes active on s2, d[s2].add(u3), then for each user uX in the interaction dictionary entry for s2, create edge (uX, u3).
   It's okay to have user nodes in the graph that aren't yet active! (TODO check this assumption)
   
 - Add to activity map with the interaction
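
The two activities under "Thought 3" can be sketched as follows. This is only a sketch under assumptions, not the implementation used later: the class name `InteractionNetwork` and its attributes are hypothetical, edges are kept in a plain set rather than a graph library, and self-loops (a user interacting with a site they author) are skipped, which is my assumption rather than something decided above.

```python
from collections import defaultdict


class InteractionNetwork:
    """User->user edges, plus two site-keyed stores used to create them."""

    def __init__(self):
        self.site_authors = defaultdict(set)      # site_id -> users currently active on the site
        self.site_interactors = defaultdict(set)  # site_id -> users who have interacted with the site
        self.edges = set()                        # directed (source_user, target_user) pairs

    def _add_edges(self, candidate_edges):
        # Keep only edges that are new and are not self-loops (assumption).
        new_edges = {(src, dst) for src, dst in candidate_edges
                     if src != dst and (src, dst) not in self.edges}
        self.edges |= new_edges
        return new_edges

    def add_interaction(self, user_id, site_id):
        """Activity 1: user_id interacts with site_id."""
        self.site_interactors[site_id].add(user_id)
        return self._add_edges(
            (user_id, author) for author in self.site_authors[site_id])

    def add_author(self, user_id, site_id):
        """Activity 2: user_id becomes active on site_id."""
        self.site_authors[site_id].add(user_id)
        return self._add_edges(
            (interactor, user_id) for interactor in self.site_interactors[site_id])


# Replay of the "point of confusion" scenario from the notes.
net = InteractionNetwork()
net.add_author("u1", "s1")                  # t: u1 active on s1
net.add_author("u2", "s2")                  # t: u2 active on s2
edges_t1 = net.add_interaction("u3", "s2")  # t+1: u3 interacts with s2
edges_t2 = net.add_author("u1", "s2")       # t+2: u1 becomes active on s2
```

In this replay the t+1 step yields the edge u3->u2 and the t+2 step yields u3->u1, without touching u2's other in-bound edges. Users such as u3 appear in edges before they are active on any site, which matches the unchecked assumption noted above.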

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import os
import re
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
import sqlite3
from nltk import word_tokenize
from tqdm import tqdm
import random
import pickle
import json

from datetime import datetime
from dateutil.relativedelta import relativedelta
import pytz
from pprint import pprint

import matplotlib.pyplot as plt
import matplotlib.dates as md
import matplotlib
import pylab as pl
from IPython.display import display, HTML

In [None]:
from pathlib import Path
git_root_dir = !git rev-parse --show-toplevel
git_root_dir = Path(git_root_dir[0].strip())
git_root_dir

In [None]:
import sys
caringbridge_core_path = "/home/lana/levon003/repos/caringbridge_core"
sys.path.append(caringbridge_core_path)

In [None]:
import cbcore.data.paths as paths
import cbcore.data.dates as dates
import cbcore.data.utils as utils

In [None]:
raw_data_dir = paths.raw_data_2019_filepath
raw_data_dir

In [None]:
working_dir = "/home/lana/shared/caringbridge/data/projects/recsys-peer-match/model_data"
assert os.path.exists(working_dir)
working_dir

In [None]:
# load the list of valid user/site pairs
s = datetime.now()
model_data_dir = '/home/lana/shared/caringbridge/data/projects/recsys-peer-match/model_data'
user_site_df = pd.read_csv(os.path.join(model_data_dir, 'user_site_df.csv'))
valid_user_ids = set(user_site_df.user_id)
valid_site_ids = set(user_site_df.site_id)
print(f"Read {len(user_site_df)} rows ({len(valid_user_ids)} unique users, {len(valid_site_ids)} unique sites) in {datetime.now() - s}.")
user_site_df.head()

In [None]:
guestbook_filepath = os.path.join(raw_data_dir, 'guestbook_scrubbed.json')
output_filepath = os.path.join(working_dir, "guestbook_filtered.csv")
both_valid_count = 0
neither_valid_count = 0
author_valid_count = 0
site_valid_count = 0
with open(output_filepath, 'w') as outfile:
    with open(guestbook_filepath, encoding='utf-8') as infile:
        processed_count = 0
        for i, line in tqdm(enumerate(infile), total=82858710):
            if i < 4002:
                continue
            try:
                gb = json.loads(line)
            except json.JSONDecodeError:
                # skip lines that are not valid JSON
                continue
            gb_oid = gb['_id']['$oid']
            site_id = utils.extract_long(gb['siteId'])
            user_id = utils.extract_long(gb['userId'])
            created_at = dates.get_date_from_json_value(gb['createdAt']) if 'createdAt' in gb else 0
            updated_at = dates.get_date_from_json_value(gb['updatedAt']) if 'updatedAt' in gb else 0
            
            if 'amps' in gb and isinstance(gb['amps'], list):
                # we write out any amps as separate lines
                for amp in gb['amps']:
                    amp_user_id = utils.extract_long(amp)
                    is_user_valid = amp_user_id in valid_user_ids
                    is_site_valid = site_id in valid_site_ids
                    if is_user_valid and is_site_valid:
                        outfile.write(f"{amp_user_id},{site_id},amp,{gb_oid}|{amp_user_id},guestbook,{gb_oid},guestbook,{gb_oid},{created_at},{updated_at}\n")
                        both_valid_count += 1
                    elif is_user_valid and not is_site_valid:
                        author_valid_count += 1
                    elif not is_user_valid and is_site_valid:
                        site_valid_count += 1
                    else:
                        neither_valid_count += 1
            is_user_valid = user_id in valid_user_ids
            is_site_valid = site_id in valid_site_ids
            if is_user_valid and is_site_valid:
                # columns: user_id, site_id, interaction_type, interaction_oid, parent_type, parent_id, ancestor_type, ancestor_id, created_at, updated_at
                outfile.write(f"{user_id},{site_id},guestbook,{gb_oid},None,None,None,None,{created_at},{updated_at}\n")
                both_valid_count += 1
            elif is_user_valid and not is_site_valid:
                author_valid_count += 1
            elif not is_user_valid and is_site_valid:
                site_valid_count += 1
            else:
                neither_valid_count += 1
            processed_count += 1
processed_count, both_valid_count, neither_valid_count, author_valid_count, site_valid_count

In [None]:
comments_filepath = os.path.join(raw_data_dir, 'comment_scrubbed.json')
output_filepath = os.path.join(working_dir, "comment_filtered.csv")
both_valid_count = 0
neither_valid_count = 0
author_valid_count = 0
site_valid_count = 0
with open(output_filepath, 'w') as outfile:
    with open(comments_filepath, encoding='utf-8') as infile:
        for line in tqdm(infile, total=31052715):
            comment = json.loads(line)
            comment_oid = comment['_id']['$oid']
            parent_type = comment['parentType']  # either 'journal' or 'comment'
            parent_oid = comment['parentId']
            journal_oid = comment['ancestorId']  # ancestorType is never guestbook; we seemingly don't have any of the guestbook comment data
            site_id = utils.extract_long(comment['siteId'])
            is_site_valid = site_id in valid_site_ids
            user_id = utils.extract_long(comment['userId'])
            created_at = dates.get_date_from_json_value(comment['createdAt'])
            updated_at = dates.get_date_from_json_value(comment['updatedAt'])
            
            if 'amps' in comment and isinstance(comment['amps'], list):
                # we write out any amps as separate lines
                for amp in comment['amps']:
                    amp_user_id = utils.extract_long(amp)
                    is_user_valid = amp_user_id in valid_user_ids
                    if is_user_valid and is_site_valid:
                        outfile.write(f"{amp_user_id},{site_id},amp,{comment_oid}|{amp_user_id},comment,{comment_oid},journal,{journal_oid},{created_at},{updated_at}\n")
                        both_valid_count += 1
                    elif is_user_valid and not is_site_valid:
                        author_valid_count += 1
                    elif not is_user_valid and is_site_valid:
                        site_valid_count += 1
                    else:
                        neither_valid_count += 1
            
            is_user_valid = user_id in valid_user_ids
            if is_user_valid and is_site_valid:
                # columns: user_id, site_id, interaction_type, interaction_oid, parent_type, parent_id, ancestor_type, ancestor_id, created_at, updated_at
                outfile.write(f"{user_id},{site_id},comment,{comment_oid},{parent_type},{parent_oid},journal,{journal_oid},{created_at},{updated_at}\n")
                both_valid_count += 1
            elif is_user_valid and not is_site_valid:
                author_valid_count += 1
            elif not is_user_valid and is_site_valid:
                site_valid_count += 1
            else:
                neither_valid_count += 1
both_valid_count, neither_valid_count, author_valid_count, site_valid_count

In [None]:
journal_filepath = os.path.join(raw_data_dir, 'journal.json')
output_filepath = os.path.join(working_dir, "journal_filtered.csv")
both_valid_count = 0
neither_valid_count = 0
author_valid_count = 0
site_valid_count = 0
with open(output_filepath, 'w') as outfile:
    with open(journal_filepath, encoding='utf-8') as infile:
        for line in tqdm(infile, total=19137078):
            journal = json.loads(line)
            
            if 'amps' not in journal:
                continue
            amps = journal['amps']
            if not isinstance(amps, list):
                continue
                
            journal_oid = journal['_id']['$oid']
            site_id = utils.extract_long(journal['siteId'])
            is_site_valid = site_id in valid_site_ids
            user_id = utils.extract_long(journal['userId'])
            
            created_at = dates.get_date_from_json_value(journal['createdAt'])
            updated_at = dates.get_date_from_json_value(journal['updatedAt'])
            
            for amp in amps:
                amp_user_id = utils.extract_long(amp)
                is_user_valid = amp_user_id in valid_user_ids
                if is_user_valid and is_site_valid:
                    outfile.write(f"{amp_user_id},{site_id},amp,{journal_oid}|{amp_user_id},journal,{journal_oid},journal,{journal_oid},{created_at},{updated_at}\n")
                    both_valid_count += 1
                elif is_user_valid and not is_site_valid:
                    author_valid_count += 1
                elif not is_user_valid and is_site_valid:
                    site_valid_count += 1
                else:
                    neither_valid_count += 1
            
            #is_user_valid = user_id in valid_user_ids
            #if is_user_valid and is_site_valid:
            #    # columns: user_id, site_id, interaction_type, interaction_oid, parent_type, parent_id, ancestor_type, ancestor_id, created_at, updated_at
            #    outfile.write(f"{user_id},{site_id},journal,{journal_oid},None,None,None,None,{created_at},{updated_at}\n")
            #    both_valid_count += 1
            #elif is_user_valid and not is_site_valid:
            #    author_valid_count += 1
            #elif not is_user_valid and is_site_valid:
            #    site_valid_count += 1
            #else:
            #    neither_valid_count += 1
both_valid_count, neither_valid_count, author_valid_count, site_valid_count

In [None]:
cols = ['user_id', 'site_id', 'interaction_type', 'interaction_oid', 'parent_type', 'parent_oid', 'ancestor_type', 'ancestor_oid', 'created_at', 'updated_at']
s = datetime.now()
gb_filepath = os.path.join(working_dir, "guestbook_filtered.csv")
gb_df = pd.read_csv(gb_filepath, header=None, names=cols)
print(datetime.now() - s)

s = datetime.now()
comment_filepath = os.path.join(working_dir, "comment_filtered.csv")
comment_df = pd.read_csv(comment_filepath, header=None, names=cols)
print(datetime.now() - s)

s = datetime.now()
journal_filepath = os.path.join(working_dir, "journal_filtered.csv")
journal_df = pd.read_csv(journal_filepath, header=None, names=cols)
print(datetime.now() - s)

len(gb_df), len(comment_df), len(journal_df)

In [None]:
ints_df = pd.concat([gb_df, comment_df, journal_df], sort=False)
ints_df.reset_index(drop=True, inplace=True)
print(len(ints_df))
ints_df.head()

In [None]:
s = datetime.now()
ints_df = ints_df.sort_values(by='created_at')
print(datetime.now() - s)

In [None]:
len(ints_df)

In [None]:
s = datetime.now()
ints_df.reset_index(drop=True).to_feather(os.path.join(working_dir, 'ints_df.feather'))
print(datetime.now() - s)
s = datetime.now()
ints_df.to_csv(os.path.join(working_dir, 'ints_df.csv'), index=False)
print(datetime.now() - s)

In [None]:
# read interactions dataframe
s = datetime.now()
model_data_dir = '/home/lana/shared/caringbridge/data/projects/recsys-peer-match/model_data'
ints_df = pd.read_feather(os.path.join(model_data_dir, 'ints_df.feather'))
print(f"Read {len(ints_df)} rows ({len(set(ints_df.user_id))} unique users) in {datetime.now() - s}.")
ints_df.head()

In [None]:
ints_df[['interaction_type', 'parent_type', 'ancestor_type']].value_counts()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

bins = []
year = 2005
month = 0
while year != 2020:
    if month == 12:
        year += 1
        month = 1
    else:
        month += 1
    bins.append(datetime.fromisoformat(f"{year}-{month:02}-01").replace(tzinfo=pytz.UTC).timestamp())

total_counts, bin_edges = np.histogram(ints_df[ints_df.interaction_type == 'amp'].created_at / 1000, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Amps')
total_counts, bin_edges = np.histogram(ints_df[ints_df.interaction_type == 'guestbook'].created_at / 1000, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Guestbooks')
total_counts, bin_edges = np.histogram(ints_df[ints_df.interaction_type == 'comment'].created_at / 1000, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Comments')
ax.set_yscale('log')

plt.legend()
plt.axvline(datetime.fromisoformat("2014-01-01").replace(tzinfo=pytz.UTC).timestamp(), color='black', alpha=0.8, linestyle='--', linewidth=1)
plt.axvline(datetime.fromisoformat("2019-01-01").replace(tzinfo=pytz.UTC).timestamp(), color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Interactions per month")
plt.title(f"{len(ints_df):,} interactions from {len(set(ints_df.user_id)):,} unique users on {len(set(ints_df.site_id)):,} unique sites")

newline = '\n'
xticks = [datetime.fromisoformat(f"{2005 + i}-01-01").replace(tzinfo=pytz.UTC).timestamp() for i in range((2020 - 2005) + 2)]
plt.xticks(
    xticks, 
    [f"{datetime.utcfromtimestamp(be).strftime('%Y')}" for i, be in enumerate(xticks)])
          
plt.show()

In [None]:
start_timestamp = datetime.fromisoformat("2014-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000
end_timestamp = datetime.fromisoformat("2019-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000
sdf = ints_df[(ints_df.created_at >= start_timestamp)&(ints_df.created_at <= end_timestamp)]
len(sdf)

In [None]:
sdf[['interaction_type', 'parent_type', 'ancestor_type']].value_counts()

In [None]:
# load the journal dataframe with the index
# this is all the new journal data
s = datetime.now()
journal_metadata_dir = "/home/lana/shared/caringbridge/data/derived/journal_metadata"
journal_metadata_filepath = os.path.join(journal_metadata_dir, "journal_metadata.df")
journal_df = pd.read_feather(journal_metadata_filepath)
print(datetime.now() - s)
len(journal_df)

In [None]:
journal_df.sample(n=10)

In [None]:
datetime.utcfromtimestamp(journal_df.created_at.quantile(0.0001) / 1000).isoformat(),\
datetime.utcfromtimestamp(journal_df.created_at.quantile(0.999999) / 1000).isoformat()

In [None]:
# journal updates over time, by month

start_date = "2002-04-01"
end_date = "2019-03-01"
sdate = datetime.fromisoformat(start_date)
edate = datetime.fromisoformat(end_date)
delta = relativedelta(edate, sdate)
bins = []
for i in range((delta.years*12) + delta.months + 1):
    day = sdate + relativedelta(months=i)
    bins.append(day.timestamp())

fig, ax = plt.subplots(1, 1, figsize=(10, 4))

total_counts, bin_edges = np.histogram(journal_df.created_at / 1000, bins=bins)
ax.axhline(0, color='gray', alpha=0.4)
ax.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2)

# 5 year analysis period of relative normality, 2014-2019
ax.axvline(datetime.fromisoformat("2014-01-01").timestamp(), color='gray', linestyle='--', alpha=0.4)
ax.axvline(datetime.fromisoformat("2019-01-01").timestamp(), color='gray', linestyle='--', alpha=0.4)


use_autoloc = True
locs = bins
if use_autoloc:
    locs = ax.get_xticks()
labels = []
for xtick in locs:
    label = f"{datetime.utcfromtimestamp(xtick).strftime('%b%y')}"
    labels.append(label)
ax.set_xticks(locs)
ax.set_xticklabels(labels)

ax.set_yscale('log')
    
plt.show()

## Visualizing createdAt of guestbooks

`new_guestbook_createdAt.txt` created via `cut -f4 -d, new_guestbook_metadata_raw.csv > new_guestbook_createdAt.txt`
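
For reference, a rough Python equivalent of that `cut` call, in case the column needs to be pulled without the shell. The function name `cut_field` is hypothetical and the sample lines are made up (the real column layout of `new_guestbook_metadata_raw.csv` is not shown here); like `cut`, this does a plain split on commas with no CSV quoting rules.

```python
def cut_field(lines, field=4, delimiter=","):
    """Yield the 1-indexed `field` of each line, roughly like `cut -f4 -d,`.

    Lines without the delimiter are passed through whole and lines with
    too few fields yield an empty string, as `cut` does without `-s`.
    """
    for line in lines:
        line = line.rstrip("\n")
        if delimiter not in line:
            yield line
            continue
        parts = line.split(delimiter)
        yield parts[field - 1] if len(parts) >= field else ""


# Made-up sample lines, for illustration only.
sample_lines = [
    "abc123,7,42,1500000000000,1500000000001\n",
    "no delimiter here\n",
    "abc124,8\n",
]
created_at_values = list(cut_field(sample_lines))
```

The non-numeric and empty values this produces are the kind of lines the loading cell counts as errors.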

In [None]:
ca_arr = np.zeros(82854708)
with open(os.path.join(working_dir, "new_guestbook_createdAt.txt"), 'r') as infile:
    error_count = 0
    for i, line in tqdm(enumerate(infile), total=82854708):
        try:
            ca_arr[i] = int(line.strip())
        except:
            error_count += 1
            continue
error_count

In [None]:
ca_arr = ca_arr / 1000
ca_arr[:10]

In [None]:
np.min(ca_arr)

In [None]:
print(ca_arr.shape)
ca_arr = ca_arr[ca_arr > 0]
print(ca_arr.shape)

In [None]:
ca_arr_old = np.zeros(82980359)
with open(os.path.join(working_dir, "old_guestbook_createdAt.txt"), 'r') as infile:
    error_count = 0
    for i, line in tqdm(enumerate(infile), total=82980359):
        try:
            ca_arr_old[i] = int(line.strip())
        except:
            error_count += 1
            continue
error_count

In [None]:
ca_arr_old = ca_arr_old / 1000

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

bins = []
year = 2005
month = 0
while year != 2020:
    if month == 12:
        year += 1
        month = 1
    else:
        month += 1
    bins.append(datetime.fromisoformat(f"{year}-{month:02}-01").timestamp())

total_counts, bin_edges = np.histogram(ca_arr, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Guestbooks (2019 data)')

total_counts, bin_edges = np.histogram(ca_arr_old, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Guestbooks (2016 data)')

plt.axvline(datetime.fromisoformat("2016-06-01").timestamp(), color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Guestbook count")

newline = '\n'
xticks = [datetime.fromisoformat(f"{2005 + i}-01-01").timestamp() for i in range((2020 - 2005) + 2)]
plt.xticks(
    xticks, 
    [f"{datetime.utcfromtimestamp(be).strftime('%Y')}" for i, be in enumerate(xticks)])
     
#plt.tight_layout(pad=0)
#plt.margins(0,0)
#plt.savefig(os.path.join(figures_dir, 'initiation_types_timeline.pdf'), dpi=200, pad_inches=0)
     
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

bins = []
year = 2005
month = 0
while year != 2020:
    if month == 12:
        year += 1
        month = 1
    else:
        month += 1
    bins.append(datetime.fromisoformat(f"{year}-{month:02}-01").timestamp())

total_counts, bin_edges = np.histogram(ca_arr, bins=bins)
total_counts_old, bin_edges = np.histogram(ca_arr_old, bins=bins)
plt.plot(bin_edges[:-1], total_counts - total_counts_old, linestyle='-', linewidth=2, label='Guestbooks (2019 - 2016 data)')

plt.axvline(datetime.fromisoformat("2016-06-01").timestamp(), color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Guestbook count")

newline = '\n'
xticks = [datetime.fromisoformat(f"{2005 + i}-01-01").timestamp() for i in range((2020 - 2005) + 2)]
plt.xticks(
    xticks, 
    [f"{datetime.utcfromtimestamp(be).strftime('%Y')}" for i, be in enumerate(xticks)])
     
#plt.tight_layout(pad=0)
#plt.margins(0,0)
#plt.savefig(os.path.join(figures_dir, 'initiation_types_timeline.pdf'), dpi=200, pad_inches=0)
     
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

bins = []
year = 2016
month = 0
while year != 2020:
    if month == 12:
        year += 1
        month = 1
    else:
        month += 1
    bins.append(datetime.fromisoformat(f"{year}-{month:02}-01").timestamp())

total_counts, bin_edges = np.histogram(ca_arr, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Guestbooks (2019 data)')

total_counts, bin_edges = np.histogram(ca_arr_old, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Guestbooks (2016 data)')

plt.axvline(datetime.fromisoformat("2016-06-01").timestamp(), color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Guestbook count")

newline = '\n'
xticks = [datetime.fromisoformat(f"{2016 + i}-01-01").timestamp() for i in range((2020 - 2016) + 2)]
plt.xticks(
    xticks, 
    [f"{datetime.utcfromtimestamp(be).strftime('%Y')}" for i, be in enumerate(xticks)])
     
#plt.tight_layout(pad=0)
#plt.margins(0,0)
#plt.savefig(os.path.join(figures_dir, 'initiation_types_timeline.pdf'), dpi=200, pad_inches=0)
     
plt.show()

In [None]:
# TODO look for match on guestbook_oid, site_id, and created_at