Interaction Extraction
===

This notebook processes all interactions on CaringBridge into flat files.

It processes:
 - Journal updates
 - Comments (standalone and from guestbooks)
 - Guestbooks
 - Amps (from updates, comments, and guestbooks)

The fields in the resulting data-frame:

 - user_id
 - site_id
 - interaction_type (update, comment, guestbook, amp)
 - interaction_oid
 - parent_type (update, comment, guestbook)
 - parent_oid (depending on the parent, either a journal_oid, a comment_oid, or a gb_oid)
 - created_at
 - updated_at

Other keys:
- ancestor_type (either an update or a guestbook) - do we need this for any reason? I think not

Question: do we need to do this for journal updates, or do we essentially already have that data?
I think we essentially have it. Still, I'm generating it anyway so that the file can be streamed straightforwardly.

The pseudocode process for using this data looks like:
 - Load interaction
 - Update new eligible authors based on time elapsed
   - Special case: if no time elapsed (because this interaction occurred at the literal same timestamp as the last), do nothing and re-use e.g. eligible author pools without recomputing them
   - Identify any user/site pairs with an "initial join" time (i.e. their 3rd published update on some site)
   - Do this by just slowly chewing through a stack sorted by user/site "initial join" time
   - Update site and author maps: add new authors to sites, and sites to authors. Q: does this depend on the FIRST update on a site, or the THIRD update? In the CSCW project, we used existing + future links based on author type (being a patient). Here, it seems like we just want to use current authorship, i.e. the THIRD update.
 - Update activity map
   - Add any intervening journal updates as activity
   - Remove any "timedout" users
 - Identify user/site sources and targets
 - Negative sample alt author(s) for each user/site source/target combination
   - Note: need a network of authors to sites they've interacted with (independent from the author-author network)
 - Generate activity features for all implicated
 - Generate network features for all implicated
 - Retrieve other cached features (such as from journal texts)
 - Store triples in database
 - Update network connections
   - Use site maps: add links to authors that are in the current site map. See Q above.
   - Point of confusion: u1 is active on s1 and u2 is active on s2 at t, u3 interacts with s2 at t+1, and u1 becomes active on s2 at t+2. 
   What should happen? At t+1, u3->u2 exists. At t+2, u3->u1 exists.  
   How to identify that this edge should exist? Well, we could just look at the in-bound edges of u2 when u1 becomes active on u2's site, then duplicate those edges to refer to u1. But the problem is that not all connections to u2 are because of s2.  
   Solution: store (u1,s1) pairs as nodes in the network. Problem: u1 may have interacted with a site before any eligible person exists on that site, or u1 may interact before having any eligible sites themselves. Possibility: create "dummy" nodes, e.g. (u1,s1)->(*,s2) and (u1,*)->(u2,s2) respectively, that are special nodes used when user info is not yet available. 
   Second thought: this is clearly absurd. In particular, we KNOW when a user becomes active on a site (in the user_site_df).  
   So thought 1 is two separate networks: user->site and site->user.
   Thought 2: just a user->site network. At time t, we can just resolve each site to 0 or more users. (If it resolves to 0 users, then we have updated the network with the new edge but can't generate any new sites. If it resolves to 2+ users, then we generate one new interaction for each user.)
   But this is clearly a disaster; to determine something like a weakly-connected component, you need to resolve all the sites into nodes. In other words, this is just a bipartite graph by another name.
   Thought 3: a user->user network. But we need a separate store of user->site links, just as a list or dict or whatever, that can be used when a user becomes active on a site; we retrieve all existing user->site links, then construct the corresponding user->user links in the graph. (This idea seems really reasonable to me!)
   Basically, do two activities:
   1. When interaction u1->s2 happens, put {s2: u1} in the edge dictionary, get the current eligible users of s2 (d[s2], where d is site_id->set(user_id)), and add edges (u1, uX) for each uX in d[s2].
   2. When u3 becomes active on s2, do d[s2].add(u3), then for each user uX in the edge dictionary entry for s2, create edge (uX, u3).
   It's okay to have user nodes in the graph that aren't yet active! (TODO check this assumption)
   
 - Add to activity map with the interaction
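
The two activities from "Thought 3" can be sketched with two dictionaries and an edge set. This is only a minimal illustration under stated assumptions: the class name `InteractionNetwork`, its method names, and the choice to skip self-loops are all hypothetical and are not taken from the project code.

```python
from collections import defaultdict


class InteractionNetwork:
    """Hypothetical sketch of the user->user network bookkeeping described above."""

    def __init__(self):
        # site_id -> set of user_ids currently eligible (active) on that site
        self.site_authors = defaultdict(set)
        # site_id -> set of user_ids who have interacted with that site
        self.site_interactors = defaultdict(set)
        # directed user->user edges stored as (source_user_id, target_user_id)
        self.edges = set()

    def record_interaction(self, user_id, site_id):
        """Activity 1: user_id interacts with site_id. Returns the newly added edges."""
        self.site_interactors[site_id].add(user_id)
        # Assumption: an author interacting with their own site creates no self-loop.
        candidates = {(user_id, a) for a in self.site_authors[site_id] if a != user_id}
        added = candidates - self.edges
        self.edges |= added
        return added

    def activate_author(self, user_id, site_id):
        """Activity 2: user_id becomes active on site_id. Returns the newly added edges."""
        self.site_authors[site_id].add(user_id)
        candidates = {(src, user_id) for src in self.site_interactors[site_id] if src != user_id}
        added = candidates - self.edges
        self.edges |= added
        return added


# The "point of confusion" scenario: u3 interacts with s2 before u1 is active there.
net = InteractionNetwork()
net.activate_author("u1", "s1")
net.activate_author("u2", "s2")
print(net.record_interaction("u3", "s2"))  # edge created at t+1
print(net.activate_author("u1", "s2"))     # edge created at t+2
```

Keeping the site->interactor store outside the graph means the graph itself only ever contains user nodes, which is the property that makes measures such as weakly-connected components straightforward.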

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import os
import re
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
import sqlite3
from nltk import word_tokenize
from tqdm import tqdm
import random
import pickle
import json

from datetime import datetime
from dateutil.relativedelta import relativedelta
import pytz
from pprint import pprint

import matplotlib.pyplot as plt
import matplotlib.dates as md
import matplotlib
matplotlib.rcParams['figure.dpi'] = 120
matplotlib.rcParams['font.family'] = "serif"

import pylab as pl
from IPython.display import display, HTML

In [None]:
from pathlib import Path
git_root_dir = !git rev-parse --show-toplevel
git_root_dir = Path(git_root_dir[0].strip())
git_root_dir

In [None]:
import sys
caringbridge_core_path = "/home/lana/levon003/repos/caringbridge_core"
sys.path.append(caringbridge_core_path)

In [None]:
import cbcore.data.paths as paths
import cbcore.data.dates as dates
import cbcore.data.utils as utils

In [None]:
raw_data_dir = paths.raw_data_filepath
raw_data_dir

In [None]:
interactions_dir = os.path.join(paths.derived_data_filepath, 'interactions')
interactions_dir

In [None]:
!du -h {interactions_dir}/*

In [None]:
!wc -l /home/lana/shared/caringbridge/data/derived/interactions/reaction.csv

In [None]:
working_dir = "/home/lana/shared/caringbridge/data/projects/recsys-peer-match/model_data"
assert os.path.exists(working_dir)
working_dir

In [None]:
# load the list of valid user/site pairs
s = datetime.now()
model_data_dir = '/home/lana/shared/caringbridge/data/projects/recsys-peer-match/model_data'
user_site_df = pd.read_csv(os.path.join(model_data_dir, 'user_site_df.csv'))
valid_user_ids = set(user_site_df.user_id)
valid_site_ids = set(user_site_df.site_id)
print(f"Read {len(user_site_df)} rows ({len(valid_user_ids)} unique users, {len(valid_site_ids)} unique sites) in {datetime.now() - s}.")
user_site_df.head()

In [None]:
output_filepath = os.path.join(working_dir, 'ints_filtered.csv')
with open(output_filepath, 'w') as outfile:
    for filename in ['reaction.csv', 'amps.csv', 'comment.csv', 'guestbook.csv']:
        input_filepath = os.path.join(interactions_dir, filename)
        both_valid_count = 0
        neither_valid_count = 0
        author_valid_count = 0
        site_valid_count = 0
        with open(input_filepath, 'r') as infile:
            for line in tqdm(infile, desc=filename):
                # columns: user_id, site_id, interaction_type, interaction_oid, parent_type, parent_oid, ancestor_type, ancestor_oid, created_at, updated_at
                tokens = line.strip().split(",")
                user_id = int(tokens[0])
                site_id = int(tokens[1])
                
                is_user_valid = user_id in valid_user_ids
                is_site_valid = site_id in valid_site_ids
                if is_user_valid and is_site_valid:
                    outfile.write(line)
                    both_valid_count += 1
                elif is_user_valid and not is_site_valid:
                    author_valid_count += 1
                elif not is_user_valid and is_site_valid:
                    site_valid_count += 1
                else:
                    neither_valid_count += 1
        print(f"{filename:>15} ; Both valid = {both_valid_count:>8} ; Author valid = {author_valid_count:>8} ; Site valid = {site_valid_count:>8} ; Neither valid = {neither_valid_count:>8}")

In [None]:
!du -h {output_filepath}

In [None]:
cols = ['user_id', 'site_id', 'interaction_type', 'interaction_oid', 'parent_type', 'parent_oid', 'ancestor_type', 'ancestor_oid', 'created_at', 'updated_at']
s = datetime.now()
ints_filepath = output_filepath
ints_df = pd.read_csv(ints_filepath, header=None, names=cols)
print(datetime.now() - s)
len(ints_df)

In [None]:
ints_df.head()

In [None]:
s = datetime.now()
ints_df = ints_df.sort_values(by='created_at')
print(datetime.now() - s)

In [None]:
s = datetime.now()
ints_df.reset_index(drop=True).to_feather(os.path.join(working_dir, 'ints_df.feather'))
print(datetime.now() - s)
#s = datetime.now()
#ints_df.to_csv(os.path.join(working_dir, 'ints_df.csv'), index=False)
#print(datetime.now() - s)

In [None]:
# read interactions dataframe
s = datetime.now()
model_data_dir = '/home/lana/shared/caringbridge/data/projects/recsys-peer-match/model_data'
ints_df = pd.read_feather(os.path.join(model_data_dir, 'ints_df.feather'))
print(f"Read {len(ints_df)} rows ({len(set(ints_df.user_id))} unique users) in {datetime.now() - s}.")
ints_df.head()

In [None]:
pd.DataFrame(ints_df.interaction_type.value_counts().rename('interaction_type_total'))

In [None]:
pd.DataFrame(ints_df.parent_type.value_counts(dropna=False).rename('parent_type_total'))

In [None]:
#pd.DataFrame(ints_df[['interaction_type', 'parent_type', 'ancestor_type']].value_counts().rename('count'))
#ints_df[['interaction_type', 'parent_type', 'ancestor_type']].value_counts(dropna=False)
# this doesn't work due to NoneType objects...
pd.crosstab(ints_df.interaction_type, [ints_df.parent_type, ints_df.ancestor_type], dropna=False)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

start_time = datetime.strptime('2005-01-01', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
curr_time = start_time
end_time = datetime.utcfromtimestamp(np.max(ints_df.created_at) / 1000).replace(tzinfo=pytz.UTC) #datetime.strptime('2021-07-15', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(months=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')

total_counts, bin_edges = np.histogram(ints_df[ints_df.interaction_type.str.startswith('amp')].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Amps')
total_counts, bin_edges = np.histogram(ints_df[ints_df.interaction_type == 'guestbook'].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Guestbooks')
total_counts, bin_edges = np.histogram(ints_df[ints_df.interaction_type == 'comment'].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Comments')
ax.set_yscale('log')

plt.legend()
plt.axvline(datetime.fromisoformat(f"2014-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)
#plt.axvline(datetime.fromisoformat(f"2019-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Interactions per month")
plt.title(f"{len(ints_df):,} interactions from {len(set(ints_df.user_id)):,} unique users on {len(set(ints_df.site_id)):,} unique sites")

x_dates = [start_time + relativedelta(years=i) for i in range(18)]
ax.set_xticks([d.timestamp() * 1000 for d in x_dates])
ax.set_xticklabels([f"Jan\n" + d.strftime('%Y')[2:] for i, d in enumerate(x_dates)])

#newline = '\n'
#xticks = [datetime.fromisoformat(f"{2005 + i}-01-01").replace(tzinfo=pytz.UTC).timestamp() for i in range((2020 - 2005) + 2)]
#plt.xticks(
#    xticks, 
#    [f"{datetime.utcfromtimestamp(be).strftime('%Y')}" for i, be in enumerate(xticks)])
          
plt.show()
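
The month-by-month bin loop above recurs in several cells of this notebook. A small helper could make the bin edges reusable; the sketch below is a hypothetical refactoring (the name `month_bin_edges_ms` is not from the project), it uses only the standard library, and it assumes `start` falls on the first day of a month. Unlike the loop above, it also appends the closing edge, as the daily-bin cell later in the notebook does.

```python
from datetime import datetime, timezone


def month_bin_edges_ms(start, end):
    """Return millisecond timestamps at month boundaries from start up to the first boundary >= end.

    Assumes start is timezone-aware and falls on day 1 of a month.
    """
    edges = []
    year, month = start.year, start.month
    curr = start
    while curr < end:
        edges.append(int(curr.timestamp() * 1000))
        month += 1
        if month > 12:
            year, month = year + 1, 1
        curr = curr.replace(year=year, month=month)
    # closing edge, so the final (possibly partial) month is still covered by a bin
    edges.append(int(curr.timestamp() * 1000))
    return edges


example_edges = month_bin_edges_ms(
    datetime(2005, 1, 1, tzinfo=timezone.utc),
    datetime(2005, 4, 1, tzinfo=timezone.utc),
)
print(len(example_edges), example_edges[0])
```

The returned list can be passed directly as `bins=` to `np.histogram`.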

In [None]:
start_timestamp = datetime.fromisoformat(f"2014-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000
#end_timestamp = datetime.fromisoformat(f"2021-09-01").replace(tzinfo=pytz.UTC).timestamp() * 1000
#sdf = ints_df[(ints_df.created_at >= start_timestamp)&(ints_df.created_at <= end_timestamp)]
sdf = ints_df[ints_df.created_at >= start_timestamp]
len(sdf), len(ints_df)

In [None]:
sdf[['interaction_type', 'parent_type', 'ancestor_type']].value_counts()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

start_time = datetime.strptime('2021-05-01', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
sdf = ints_df[ints_df.created_at >= start_time.timestamp() * 1000]

curr_time = start_time
end_time = datetime.strptime('2021-12-01', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(days=1)
bins.append(int(curr_time.timestamp() * 1000))
print(f'{len(bins)} bins from {start_time} to {end_time}')

total_counts, bin_edges = np.histogram(sdf[sdf.interaction_type.str.startswith('amp')].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Amps')
total_counts, bin_edges = np.histogram(sdf[sdf.interaction_type == 'guestbook'].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Guestbooks')
total_counts, bin_edges = np.histogram(sdf[sdf.interaction_type == 'comment'].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Comments')
ax.set_yscale('log')

plt.legend()
#plt.axvline(datetime.fromisoformat(f"2014-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)
#plt.axvline(datetime.fromisoformat(f"2019-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Interactions per day")
plt.title(f"{len(sdf):,} interactions from {len(set(sdf.user_id)):,} unique users on {len(set(sdf.site_id)):,} unique sites")

curr_time = start_time
tick_bins = []
while curr_time < end_time:
    tick_bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(months=1)
tick_bins.append(int(curr_time.timestamp() * 1000))

ax.set_xticks(tick_bins)
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, y: f"{datetime.utcfromtimestamp(x / 1000).strftime('%m-%d')}"))

#ax.set_xticklabels([f"Jan\n" + d.strftime('%Y')[2:] for i, d in enumerate(x_dates)])

#newline = '\n'
#xticks = [datetime.fromisoformat(f"{2005 + i}-01-01").replace(tzinfo=pytz.UTC).timestamp() for i in range((2020 - 2005) + 2)]
#plt.xticks(
#    xticks, 
#    [f"{datetime.utcfromtimestamp(be).strftime('%Y')}" for i, be in enumerate(xticks)])
          
plt.show()

In [None]:
# load the journal dataframe with the index
# this is all the new journal data
s = datetime.now()
journal_metadata_dir = "/home/lana/shared/caringbridge/data/derived/journal_metadata"
journal_metadata_filepath = os.path.join(journal_metadata_dir, "journal_metadata.feather")
journal_df = pd.read_feather(journal_metadata_filepath)
print(datetime.now() - s)
len(journal_df)

In [None]:
journal_df.sample(n=10)

In [None]:
datetime.utcfromtimestamp(journal_df.created_at.quantile(0.0001) / 1000).isoformat(),\
datetime.utcfromtimestamp(journal_df.created_at.quantile(0.999999) / 1000).isoformat()

In [None]:
# journal updates over time, by month

start_time = datetime.strptime('2005-01-01', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
curr_time = start_time
end_time = datetime.strptime('2021-09-01', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(months=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

total_counts, bin_edges = np.histogram(journal_df.created_at, bins=bins)
ax.axhline(0, color='gray', alpha=0.4)
ax.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2)

# start of analysis period
ax.axvline(datetime.fromisoformat("2014-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='gray', linestyle='--', alpha=0.4)

use_autoloc = True
locs = bins
if use_autoloc:
    locs = ax.get_xticks()
labels = []
for xtick in locs:
    label = f"{datetime.utcfromtimestamp(xtick / 1000).strftime('%b %Y')}"
    labels.append(label)
ax.set_xticks(locs)
ax.set_xticklabels(labels)

ax.set_yscale('log')
    
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

start_time = datetime.strptime('2005-01-01', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
curr_time = start_time
end_time = datetime.utcfromtimestamp(np.max(ints_df.created_at) / 1000).replace(tzinfo=pytz.UTC)
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(months=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')

total_counts, bin_edges = np.histogram(ints_df[ints_df.interaction_type.str.startswith('amp')].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Amps')
total_counts, bin_edges = np.histogram(ints_df[ints_df.interaction_type == 'guestbook'].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Guestbooks')
total_counts, bin_edges = np.histogram(ints_df[ints_df.interaction_type == 'comment'].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Comments')

total_counts, bin_edges = np.histogram(journal_df.created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Journals')

ax.set_yscale('log')

plt.legend()
plt.axvline(datetime.fromisoformat(f"2014-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)
#plt.axvline(datetime.fromisoformat(f"2019-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Interactions per month")
plt.title(f"{len(ints_df):,} interactions from {len(set(ints_df.user_id)):,} unique users on {len(set(ints_df.site_id)):,} unique sites\n{len(journal_df):,} journals from {len(set(journal_df.user_id)):,} unique users on {len(set(journal_df.site_id)):,} unique sites")

x_dates = [start_time + relativedelta(years=i) for i in range(18)]
ax.set_xticks([d.timestamp() * 1000 for d in x_dates])
ax.set_xticklabels([f"Jan\n" + d.strftime('%Y')[2:] for i, d in enumerate(x_dates)])
          
plt.show()

In [None]:
# For journals, createdAt vs publishedAt
# this is the fraction of journals where creation date == publication date
np.sum(journal_df.published_at == journal_df.created_at) / len(journal_df)

## Investigating timestamp issues

In [None]:
ints_df.head()

In [None]:
journal_df.head()

In [None]:
cdf = ints_df[(ints_df.interaction_type == 'comment')&(ints_df.parent_type == 'journal')]
len(cdf)

In [None]:
for q in [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]:
    quantile = np.quantile(cdf.created_at, q)
    dt = datetime.utcfromtimestamp(int(quantile) / 1000).isoformat()
    print(f"{q:.3f} {quantile:.3f} {dt}")
    

In [None]:
sdf = journal_df[journal_df.published_at.notna()]
for q in [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]:
    quantile = np.quantile(sdf.published_at, q)
    dt = datetime.utcfromtimestamp(int(quantile) / 1000).isoformat()
    print(f"{q:.3f} {quantile:.3f} {dt}")
    

In [None]:
journal_df.head()

In [None]:
cdf.head()

In [None]:
sdf = journal_df[journal_df.created_at.notna()]
# NOTE: despite its name, this index maps journal_oid -> created_at (not published_at)
published_at_index = sdf.set_index('journal_oid').created_at
len(published_at_index)

In [None]:
cdf = ints_df[(ints_df.interaction_type == 'comment')&(ints_df.parent_type == 'journal')]
print(len(cdf))
# copy so that the column assignments in later cells don't trigger SettingWithCopyWarning
cdf = cdf[cdf.parent_oid.isin(published_at_index.index)].copy()
len(cdf)

In [None]:
cdf['journal_published_at'] = cdf.parent_oid.map(lambda journal_oid: published_at_index.at[journal_oid])

In [None]:
cdf['time_to_comment'] = cdf.created_at - cdf.journal_published_at
np.min(cdf.time_to_comment)

In [None]:
for q in [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]:
    quantile = np.quantile(cdf.time_to_comment, q)
    dt = quantile / 1000 / 60 / 60
    print(f"{q:.3f} {quantile:.3f} {dt:.3f}hrs")

In [None]:
def get_seconds_since_midnight(ts):
    """Return the seconds elapsed since midnight UTC for a millisecond Unix timestamp."""
    # build the datetime directly in UTC rather than converting from the machine's local time
    dt = datetime.fromtimestamp(int(ts) / 1000, tz=pytz.UTC)
    seconds_since_midnight = dt.second + dt.minute * 60 + dt.hour * 60 * 60
    return seconds_since_midnight

get_seconds_since_midnight(1637262305000)

In [None]:
seconds_since_midnight = cdf.created_at.sample(n=100000).map(get_seconds_since_midnight)

In [None]:
# 25 edges so that np.histogram yields 24 hourly bins (otherwise hour 23 is dropped)
hours = np.arange(0, 24 * 60 * 60 + 1, 60 * 60)

fig, ax = plt.subplots(1, 1, figsize=(6, 6))

for int_type in ['reaction', 'comment', 'guestbook', 'journal']:
    if int_type == 'guestbook':
        created_at = ints_df[ints_df.interaction_type == int_type].created_at.sample(n=1000000)
    elif int_type == 'journal':
        created_at = journal_df[journal_df.published_at.notna()].published_at.sample(n=1000000)
    elif int_type == 'reaction':
        created_at = ints_df[(ints_df.interaction_type.str.startswith('amp_'))&(ints_df.parent_type == 'journal')].created_at
    else:
        created_at = ints_df[(ints_df.interaction_type == int_type)&(ints_df.parent_type == 'journal')].created_at.sample(n=1000000)
    seconds_since_midnight = created_at.map(get_seconds_since_midnight)
    counts, _ = np.histogram(seconds_since_midnight, hours)
    pcts = counts / np.sum(counts)
    ax.plot(hours[1:], pcts, label=int_type)

ax.legend()
ax.set_xticks(hours)
ax.set_xticklabels([f"{hour / 60 / 60:.0f}" for hour in hours])

ax.set_xlabel("Hour of day (UTC)")
ax.set_ylabel("% of ints during this hour")

plt.show()

In [None]:
import bson.objectid

In [None]:
def get_generated_at_from_uuid(uuid):
    generated_at = int(bson.objectid.ObjectId(uuid).generation_time.timestamp())
    return generated_at
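
For reference, a MongoDB ObjectId stores its creation time as a big-endian count of seconds since the Unix epoch in its first 4 bytes, i.e. the first 8 hex characters of its string form. The sketch below reads that value without `bson`; the function name `objectid_generation_seconds` and the example id are made up for illustration.

```python
def objectid_generation_seconds(oid_hex):
    """Return the Unix time (seconds) encoded in the first 4 bytes of an ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError(f"expected a 24-character hex string, got {len(oid_hex)} characters")
    # first 8 hex characters == first 4 bytes == big-endian seconds since the epoch
    return int(oid_hex[:8], 16)


# Made-up ObjectId whose timestamp bytes encode 1,500,000,000 seconds since the epoch.
example_oid = format(1_500_000_000, "08x") + "0" * 16
print(example_oid, objectid_generation_seconds(example_oid))
```

This should agree with `bson.objectid.ObjectId(...).generation_time.timestamp()` for well-formed ids, and it can be handy as a sanity check on a sample.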

In [None]:
sdf = journal_df.sample(n=100000)
generated_at_list = []
for row in sdf.itertuples():
    generated_at = get_generated_at_from_uuid(row.journal_oid)
    generated_at_list.append(generated_at)
generated_at = pd.Series(data=generated_at_list, name='generated_at')
generated_at.shape

In [None]:
generated_at

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

# utcfromtimestamp avoids labelling a local-time datetime as UTC
start_time = datetime.utcfromtimestamp(generated_at.min()).replace(tzinfo=pytz.UTC)
curr_time = start_time
end_time = datetime.utcfromtimestamp(generated_at.max()).replace(tzinfo=pytz.UTC)
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(months=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')

total_counts, bin_edges = np.histogram(generated_at * 1000, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Generated At')

total_counts, bin_edges = np.histogram(sdf.created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Created At')

total_counts, bin_edges = np.histogram(sdf.published_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Published At')

ax.set_yscale('log')

plt.legend()
#plt.axvline(datetime.fromisoformat(f"2014-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)
#plt.axvline(datetime.fromisoformat(f"2019-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Journals per month")

#x_dates = [start_time + relativedelta(years=i) for i in range(25)]
#ax.set_xticks([d.timestamp() * 1000 for d in x_dates])
#ax.set_xticklabels([f"Jan\n" + d.strftime('%Y')[2:] for i, d in enumerate(x_dates)])
          
plt.show()

In [None]:
generated_at.map(lambda ts: datetime.fromtimestamp(int(ts / 60 / 60) * 60 * 60).isoformat()).value_counts().head(20)

In [None]:
diffs = (np.array(generated_at) * 1000) - np.array(sdf.created_at)
plt.hist(diffs / 1000 / 60 / 60 / 24, bins=np.arange(-20, 20), log=True)
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

# utcfromtimestamp avoids labelling a local-time datetime as UTC
start_time = datetime.utcfromtimestamp(generated_at.min()).replace(tzinfo=pytz.UTC)
curr_time = start_time
end_time = datetime.utcfromtimestamp(generated_at.max()).replace(tzinfo=pytz.UTC)
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(months=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')

total_counts, bin_edges = np.histogram(generated_at * 1000, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Generated At')

total_counts, bin_edges = np.histogram(sdf.created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Created At')

total_counts, bin_edges = np.histogram(sdf.published_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Published At')

ax.set_yscale('log')

plt.legend()
#plt.axvline(datetime.fromisoformat(f"2014-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)
#plt.axvline(datetime.fromisoformat(f"2019-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Journals per month")

#x_dates = [start_time + relativedelta(years=i) for i in range(25)]
#ax.set_xticks([d.timestamp() * 1000 for d in x_dates])
#ax.set_xticklabels([f"Jan\n" + d.strftime('%Y')[2:] for i, d in enumerate(x_dates)])
          
plt.show()

### Comment dates analysis

In [None]:
import gzip
import json
import time

# NOTE: naive.timestamp() interprets the parsed string in this machine's local timezone,
# so the fixed 5h/6h offsets below appear to assume the notebook runs in US Central time.
input_filepath = os.path.join(raw_data_dir, 'comment_scrubbed.json.gz')
with gzip.open(input_filepath, 'rt', encoding='utf-8') as infile, open('comment_dates.csv', 'w') as outfile:
    for line in tqdm(infile, total=48494407):
        comment = json.loads(line)
        comment_oid = comment['_id']['$oid']
        time_str = comment['createdAt']['$date']
        
        if isinstance(time_str, str) and "Z" in time_str:
            if time_str.startswith("-"):  # e.g. -0001
                date_type = 'invalid_date'
                created_at = None
            else:
                naive = datetime.strptime(time_str, "%Y-%m-%dT%H:%M:%S.%fZ")
                if bool(time.localtime(naive.timestamp()).tm_isdst):
                    local = (naive.timestamp() * 1000) - 18000000  # 5 hours in ms
                    date_type = 'dst_datestring'
                else:
                    local = (naive.timestamp() * 1000) - 21600000  # 6 hours in ms
                    date_type = 'nondst_datestring'
                created_at = int(local)
        else:
            # already a millisecond timestamp
            created_at = int(time_str)
            date_type = 'int'
        # invalid dates are written with the literal string 'None' as created_at
        outfile.write(comment_oid + ',' + date_type + ',' + str(created_at) + '\n')
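
Because `naive.timestamp()` depends on the local timezone of the machine, the cell above may give different results elsewhere. If the intent is simply to read the `Z`-suffixed string as UTC (an assumption about the intent, not something the cell states), a machine-independent sketch could look like this. The helper name `parse_utc_date_string_ms` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_utc_date_string_ms(time_str):
    """Parse a 'YYYY-MM-DDTHH:MM:SS.fffZ' string as UTC and return epoch milliseconds.

    Returns None for negative-year strings (e.g. '-0001-...'), mirroring 'invalid_date' above.
    """
    if time_str.startswith("-"):
        return None
    parsed = datetime.strptime(time_str, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    # integer division of timedeltas avoids float rounding on the millisecond value
    return (parsed - EPOCH) // timedelta(milliseconds=1)


print(parse_utc_date_string_ms("2018-03-29T03:14:41.000Z"))
```

Any Central-time correction could then be applied as an explicit, separate step rather than being folded into the parse.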

In [None]:
!head comment_dates.csv

In [None]:
from collections import defaultdict
type_counts = defaultdict(int)
with open('comment_dates.csv', 'r') as infile:
    for line in tqdm(infile, total=48494407):
        tokens = line.split(",")
        if len(tokens) == 3:
            date_type = tokens[1]
            type_counts[date_type] += 1


In [None]:
type_counts

In [None]:

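def generate_comment_dates():
    """Yield (date_type, created_at, generated_at) for each parseable row of comment_dates.csv."""
    with open('comment_dates.csv', 'r') as infile:
        for line in tqdm(infile, total=48494407):
            tokens = line.strip().split(",")
            # skip malformed rows and invalid_date rows, whose created_at was written as 'None'
            if len(tokens) == 3 and tokens[2] != 'None':
                generated_at = get_generated_at_from_uuid(tokens[0])
                date_type = tokens[1]
                created_at = int(tokens[2])
                yield date_type, created_at, generated_at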

            
df = pd.DataFrame(generate_comment_dates(), columns=['date_type','created_at', 'generated_at'])

In [None]:
df.head()

In [None]:
df['generated_at'] = df.generated_at * 1000

In [None]:
df['timestamp_match'] = df.created_at == df.generated_at

In [None]:
df['timestamp_diff'] = df.created_at - df.generated_at
df.timestamp_diff.value_counts().head(20)

In [None]:
pd.crosstab(df.timestamp_match, df.date_type)

In [None]:
sdf = df[~df.timestamp_match]
diffs = sdf.created_at - sdf.generated_at
plt.hist(diffs / 1000 / 60 / 60, bins=np.arange(-20, 20, step=0.2))
plt.show()

In [None]:
diffs.value_counts()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

# utcfromtimestamp avoids labelling a local-time datetime as UTC
start_time = datetime.utcfromtimestamp(df.generated_at.min() / 1000).replace(tzinfo=pytz.UTC)
curr_time = start_time
end_time = datetime.utcfromtimestamp(df.generated_at.max() / 1000).replace(tzinfo=pytz.UTC)
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(months=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')

index = (df.timestamp_diff >= -1000)&(df.timestamp_diff <= 1000)
total_counts, bin_edges = np.histogram(df[index].generated_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'CreatedAt/UUID match ({np.sum(index) / len(index):.2%})')

index = df.timestamp_diff.isin([-18000000, -21600000, -17999000, -21599000])
total_counts, bin_edges = np.histogram(df[index].generated_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'UTC/CST offset ({np.sum(index) / len(index):.2%})')

index = ((df.timestamp_diff < -1000)|(df.timestamp_diff > 1000))&(~df.timestamp_diff.isin([-18000000, -21600000, -17999000, -21599000]))
total_counts, bin_edges = np.histogram(df[index].generated_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'Other offset ({np.sum(index) / len(index):.2%})')

ax.set_yscale('log')

plt.legend()
#plt.axvline(datetime.fromisoformat(f"2014-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)
#plt.axvline(datetime.fromisoformat(f"2019-01-01").replace(tzinfo=pytz.UTC).timestamp() * 1000, color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("Comments per month")

x_dates = [start_time + relativedelta(years=i) for i in range((len(bins) // 12) + 1)]
ax.set_xticks([d.timestamp() * 1000 for d in x_dates])
# start_time comes from the data, so label each tick with its actual month rather than a fixed "Jan"
ax.set_xticklabels([d.strftime("%b\n'%y") for d in x_dates])
          
plt.show()

In [None]:
# looks like 1-hour offsets are somewhat common
index = ((df.timestamp_diff < -1000)|(df.timestamp_diff > 1000))&(~df.timestamp_diff.isin([-18000000, -21600000, -17999000, -21599000]))
other_diffs = df[index].timestamp_diff / 1000 / 60 / 60
plt.hist(other_diffs, log=True, bins=np.linspace(other_diffs.min(), other_diffs.max()))
plt.xlabel("Difference between creation and generation timestamp (hours)")
plt.show()
other_diffs.value_counts().head()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

start_time = datetime.fromisoformat(f"2018-02-01").replace(tzinfo=pytz.UTC)
curr_time = start_time
end_time = datetime.fromisoformat(f"2019-02-15").replace(tzinfo=pytz.UTC)
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(days=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')

index = (df.timestamp_diff >= -1000)&(df.timestamp_diff <= 1000)
total_counts, bin_edges = np.histogram(df[index].generated_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'CreatedAt/UUID match ({np.sum(index) / len(index):.2%})')
match_counts = total_counts

index = df.timestamp_diff.isin([-18000000, -21600000, -17999000, -21599000])
total_counts, bin_edges = np.histogram(df[index].generated_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'UTC/CST offset ({np.sum(index) / len(index):.2%})')
utc_offset_counts = total_counts

index = ((df.timestamp_diff < -1000)|(df.timestamp_diff > 1000))&(~df.timestamp_diff.isin([-18000000, -21600000, -17999000, -21599000]))
total_counts, bin_edges = np.histogram(df[index].generated_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'Other offset ({np.sum(index) / len(index):.2%})')

ax.set_yscale('log')

plt.legend()

plt.ylabel("Comments per day")  # bins in this cell are 1 day wide
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, y: datetime.utcfromtimestamp(x / 1000).replace(tzinfo=pytz.timezone('US/Central')).strftime("%Y\n%m/%d")))

plt.show()

In [None]:
# I visually inspected the index where match_counts spiked, in order to identify the day
match_counts[56]

In [None]:
# then, I figured out which day it was
datetime.utcfromtimestamp(bin_edges[56] / 1000).isoformat()
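
The spike index above was found by eye. As a rough cross-check, here is a minimal sketch (not the method used in this notebook) that picks the bin with the largest day-over-day increase. `find_largest_jump` is a hypothetical helper name, and the sketch assumes the switchover shows up as the single biggest jump in `match_counts`.

```python
def find_largest_jump(counts):
    """Return the index i (>= 1) that maximizes counts[i] - counts[i - 1]."""
    if len(counts) < 2:
        raise ValueError("need at least two bins to compare")
    return max(range(1, len(counts)), key=lambda i: counts[i] - counts[i - 1])


# toy daily counts: the jump happens at index 3
toy_counts = [2, 3, 2, 40, 38, 41]
spike_index = find_largest_jump(toy_counts)
print(spike_index)  # 3
```

With the real data this would be called as `find_largest_jump(match_counts)`; the result should still be checked against the plot.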

In [None]:
sdf = df.sort_values(by='generated_at')

In [None]:
# then, I manually identified the "switchover" point
sdf[sdf.generated_at >= bin_edges[56] + 1000 * 60 * 60 * 3.23].head(30)

In [None]:
# based on visual inspection, the switch-over generated_at timestamp is Thursday, March 29, 2018 3:14:41 AM UTC
comments_correct_generated_at_timestamp = 1522293281000
comments_correct_created_at_timestamp = 1522275282000
comments_correct_generated_at_timestamp - comments_correct_created_at_timestamp
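
The correction checked in the verification cell amounts to shifting every comment timestamp at or before the switch-over forward by five hours. Here is a minimal sketch of that rule as a standalone function. `correct_comment_created_at` is a hypothetical name; the constants are the ones used in this notebook.

```python
# Constants taken from this notebook (epoch milliseconds).
COMMENTS_CORRECT_CREATED_AT_MS = 1522275282000  # switch-over created_at; values at or before it are shifted
UTC_OFFSET_CORRECTION_MS = 18000000  # 5 hours


def correct_comment_created_at(created_at_ms):
    """Shift pre-switchover comment timestamps forward by 5 hours; leave later ones alone."""
    if created_at_ms <= COMMENTS_CORRECT_CREATED_AT_MS:
        return created_at_ms + UTC_OFFSET_CORRECTION_MS
    return created_at_ms
```

Applied to a pandas Series, this would look like `created_at.map(correct_comment_created_at)`.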

In [None]:
# verify that the fix works...
# 25 edges -> 24 hourly bins; with only 24 edges np.histogram would drop the 23:00-24:00 hour
hours = np.arange(0, 24 * 60 * 60 + 1, 60 * 60)

fig, ax = plt.subplots(1, 1, figsize=(6, 6))

for int_type in ['reaction', 'comment', 'guestbook', 'journal']:
    if int_type == 'guestbook':
        created_at = ints_df[ints_df.interaction_type == int_type].created_at.sample(n=1000000)
    elif int_type == 'journal':
        created_at = journal_df[journal_df.published_at.notna()].published_at.sample(n=1000000)
    elif int_type == 'reaction':
        created_at = ints_df[(ints_df.interaction_type.str.startswith('amp_'))&(ints_df.parent_type == 'journal')].created_at
    else:
        created_at = ints_df[(ints_df.interaction_type == int_type)&(ints_df.parent_type == 'journal')].created_at.sample(n=1000000)
        created_at[created_at <= comments_correct_created_at_timestamp] += 18000000
    seconds_since_midnight = created_at.map(get_seconds_since_midnight)
    counts, _ = np.histogram(seconds_since_midnight, hours)
    pcts = counts / np.sum(counts)
    ax.plot(hours[1:], pcts, label=int_type)

ax.legend()
ax.set_xticks(hours)
ax.set_xticklabels([f"{hour / 60 / 60:.0f}" for hour in hours])

ax.set_xlabel("Hour of day (UTC)")
ax.set_ylabel("Fraction of ints during this hour")

plt.show()
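
`get_seconds_since_midnight` is defined outside this section. Here is a minimal sketch of what such a helper might look like, assuming `created_at` is an epoch timestamp in milliseconds and that "midnight" means midnight UTC, as the x-axis label suggests. The name `seconds_since_midnight_utc` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_since_midnight_utc(timestamp_ms):
    """Seconds elapsed since 00:00:00 UTC on the day containing timestamp_ms."""
    return (int(timestamp_ms) // 1000) % SECONDS_PER_DAY


# 2018-03-29 03:14:41 UTC, the switch-over generated_at timestamp identified above
print(seconds_since_midnight_utc(1522293281000))  # 3 * 3600 + 14 * 60 + 41 = 11681
```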

In [None]:
import gzip
import json
import time
input_filepath = os.path.join(paths.raw_data_dir, 'guestbook_scrubbed.json.gz')
with gzip.open(input_filepath, 'rt', encoding='utf-8') as infile, open('gb_dates.csv', 'w') as outfile:
    for i, line in tqdm(enumerate(infile), total=82680259):
        if i < 4002:
            continue
        gb = json.loads(line)
        if '_id' in gb and '$oid' in gb['_id']:
            guestbook_oid = gb['_id']['$oid']
        else:
            continue
        if 'createdAt' not in gb:
            continue
        created_at = gb['createdAt']
        time_str = created_at['$date']
        
        if isinstance(time_str, str) and "Z" in time_str:
            if time_str.startswith("-"):  # e.g. -0001
                date_type = 'invalid_date'
                created_at = None
            else:
                naive = datetime.strptime(time_str, "%Y-%m-%dT%H:%M:%S.%fZ")
                # NOTE: naive.timestamp() and time.localtime() both use this machine's local timezone,
                # so the DST flag and the resulting epoch depend on where the notebook is run
                if bool(time.localtime(naive.timestamp()).tm_isdst):
                    local = (naive.timestamp() * 1000) - 18000000  # 5 hours in ms
                    date_type = 'dst_datestring'
                else:
                    local = (naive.timestamp() * 1000) - 21600000  # 6 hours in ms
                    date_type = 'nondst_datestring'
                created_at = int(local)
        else:
            created_at = int(time_str)
            date_type = 'int'
        outfile.write(guestbook_oid + ',' + date_type + ',' + str(created_at) + '\n')

In [None]:
from collections import defaultdict
type_counts = defaultdict(int)
with open('gb_dates.csv', 'r') as infile:
    for line in tqdm(infile, total=82686297):
        tokens = line.split(",")
        if len(tokens) == 3:
            date_type = tokens[1]
            type_counts[date_type] += 1
type_counts

In [None]:
def generate_gb_dates(limit=10000000):
    with open('gb_dates_shuffled.csv', 'r') as infile:
        i = 0
        for line in tqdm(infile, total=min(82686297, limit)):
            tokens = line.strip().split(",")
            if len(tokens) == 3 and tokens[2] != 'None':  # skip 'invalid_date' rows, whose created_at was written as None
                i += 1
                generated_at = get_generated_at_from_uuid(tokens[0])
                date_type = tokens[1]
                created_at = int(tokens[2])
                yield date_type, created_at, generated_at
                if i >= limit:
                    return

            
df = pd.DataFrame(generate_gb_dates(), columns=['date_type','created_at', 'generated_at'])
len(df)
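
`get_generated_at_from_uuid` is defined outside this section. The identifiers passed to it are Mongo `$oid` values, so one plausible implementation reads the creation time that MongoDB embeds in every ObjectId: the first 4 bytes (8 hex characters) are a big-endian count of seconds since the Unix epoch. Here is a minimal sketch under that assumption. `oid_to_epoch_seconds` is a hypothetical name and may differ from the actual helper.

```python
def oid_to_epoch_seconds(oid_hex):
    """Return the embedded creation time (seconds since epoch) of a 24-char hex ObjectId."""
    if len(oid_hex) != 24:
        raise ValueError(f"expected 24 hex characters, got {len(oid_hex)}")
    return int(oid_hex[:8], 16)


# 0x5abc5a21 == 1522293281 (2018-03-29 03:14:41 UTC); the trailing 16 characters are arbitrary here
print(oid_to_epoch_seconds("5abc5a210000000000000000"))
```

The result is in seconds, which is consistent with the `* 1000` conversion to milliseconds applied to `generated_at` in a later cell.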

In [None]:
df.head()

In [None]:
df['generated_at'] = df.generated_at * 1000
df['timestamp_match'] = df.created_at == df.generated_at
df['timestamp_diff'] = df.created_at - df.generated_at
df.timestamp_diff.value_counts().head(20)
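
The plotting cells repeatedly split rows into three groups by `timestamp_diff`. Here is a minimal sketch that names those groups in one place. `classify_timestamp_diff` and the label strings are hypothetical; the tolerance and offset values are the ones used in the masks.

```python
# Offsets (ms) treated as a UTC/Central mismatch: 5 h and 6 h, plus variants that are 1 s shorter.
UTC_CST_OFFSETS_MS = frozenset([-18000000, -21600000, -17999000, -21599000])
MATCH_TOLERANCE_MS = 1000


def classify_timestamp_diff(diff_ms):
    """Label a created_at - generated_at difference (ms) as match, UTC/CST offset, or other."""
    if -MATCH_TOLERANCE_MS <= diff_ms <= MATCH_TOLERANCE_MS:
        return 'match'
    if diff_ms in UTC_CST_OFFSETS_MS:
        return 'utc_cst_offset'
    return 'other_offset'
```

On the data-frame this could be used as `df.timestamp_diff.map(classify_timestamp_diff).value_counts()`.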

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

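# utcfromtimestamp (not fromtimestamp) so the UTC label matches the actual instant, regardless of the machine's timezone
start_time = datetime.utcfromtimestamp(df.generated_at.min() / 1000).replace(tzinfo=pytz.UTC)
curr_time = start_time
end_time = datetime.utcfromtimestamp(df.generated_at.max() / 1000).replace(tzinfo=pytz.UTC)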
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(months=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')

index = (df.timestamp_diff >= -1000)&(df.timestamp_diff <= 1000)
total_counts, bin_edges = np.histogram(df[index].generated_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'CreatedAt/UUID match ({np.sum(index) / len(index):.2%})')

index = df.timestamp_diff.isin([-18000000, -21600000, -17999000, -21599000])
total_counts, bin_edges = np.histogram(df[index].generated_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'UTC/CST offset ({np.sum(index) / len(index):.2%})')

index = ((df.timestamp_diff < -1000)|(df.timestamp_diff > 1000))&(~df.timestamp_diff.isin([-18000000, -21600000, -17999000, -21599000]))
total_counts, bin_edges = np.histogram(df[index].generated_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'Other offset ({np.sum(index) / len(index):.2%})')

ax.set_yscale('log')

plt.legend()
plt.ylabel("Guestbooks per month")

x_dates = [start_time + relativedelta(years=i) for i in range((len(bins) // 12) + 1)]
ax.set_xticks([d.timestamp() * 1000 for d in x_dates])
ax.set_xticklabels([d.strftime("%b\n'%y") for d in x_dates])  # start_time need not fall in January
          
plt.show()

In [None]:
# same as above, but plotting the created_at dates
fig, ax = plt.subplots(1, 1, figsize=(10,4))

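# utcfromtimestamp (not fromtimestamp) so the UTC label matches the actual instant, regardless of the machine's timezone
start_time = datetime.utcfromtimestamp(df.created_at.min() / 1000).replace(tzinfo=pytz.UTC)
curr_time = start_time
end_time = datetime.utcfromtimestamp(df.created_at.max() / 1000).replace(tzinfo=pytz.UTC)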
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(months=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')

index = (df.timestamp_diff >= -1000)&(df.timestamp_diff <= 1000)
total_counts, bin_edges = np.histogram(df[index].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'CreatedAt/UUID match ({np.sum(index) / len(index):.2%})')

index = df.timestamp_diff.isin([-18000000, -21600000, -17999000, -21599000])
total_counts, bin_edges = np.histogram(df[index].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'UTC/CST offset ({np.sum(index) / len(index):.2%})')

index = ((df.timestamp_diff < -1000)|(df.timestamp_diff > 1000))&(~df.timestamp_diff.isin([-18000000, -21600000, -17999000, -21599000]))
total_counts, bin_edges = np.histogram(df[index].created_at, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label=f'Other offset ({np.sum(index) / len(index):.2%})')

ax.set_yscale('log')

plt.legend()
plt.ylabel("Guestbooks per month")

x_dates = [start_time + relativedelta(years=i) for i in range((len(bins) // 12) + 1)]
ax.set_xticks([d.timestamp() * 1000 for d in x_dates])
ax.set_xticklabels([d.strftime("%b\n'%y") for d in x_dates])  # start_time need not fall in January
          
plt.show()

In [None]:
# TODO investigate journals more closely as well, to verify expected patterns between journals and guestbooks...

## Investigating new amps


In [None]:
# are amps getting double-counted?
# amps are not getting double-counted, since amp_hearts are only happening on photos
ints_df[ints_df.interaction_type == 'amp_heart'].parent_type.value_counts()

In [None]:
reactions_launch_date = datetime.strptime('2021-02-01', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
sdf = ints_df[(ints_df.parent_type == 'journal')&(ints_df.interaction_type.str.startswith('amp'))&(ints_df.created_at >= int(reactions_launch_date.timestamp() * 1000))]
print(len(sdf))
amps_by_journal = sdf.groupby(by='parent_oid')
amps_by_journal_df = amps_by_journal.agg({'site_id': len, 'user_id': 'nunique', 'interaction_type': 'nunique'}).rename(columns={'site_id': 'amp_count', 'user_id': 'user_count', 'interaction_type': 'amp_type_count'})
len(amps_by_journal_df)

In [None]:
# no-op if the previous cell ran as written, since it already applies this rename
amps_by_journal_df = amps_by_journal_df.rename(columns={'site_id': 'amp_count', 'user_id': 'user_count', 'interaction_type': 'amp_type_count'})

In [None]:
# load the site data
s = datetime.now()
site_metadata_dir = "/home/lana/shared/caringbridge/data/derived/site_metadata"
site_metadata_filepath = os.path.join(site_metadata_dir, "site_metadata.feather")
site_df = pd.read_feather(site_metadata_filepath)
print(f"Read {len(site_df)} site_df rows in {datetime.now() - s}.")
site_df.head()

In [None]:
pd.crosstab((amps_by_journal_df.amp_count > 1).rename('>1 amp?'), amps_by_journal_df.amp_type_count)

In [None]:
# number of journal updates where the total amp count differs from the number of unique amping users
# this occurs due to a CaringBridge bug where a single user is able to leave both an original heart and a new reaction type
np.sum(amps_by_journal_df.amp_count != amps_by_journal_df.user_count), np.sum(amps_by_journal_df.amp_count > amps_by_journal_df.user_count)
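
To see which users account for the gap between `amp_count` and `user_count`, one option is to count amps per (journal, user) pair and keep the pairs that occur more than once. Here is a minimal sketch in plain Python. `find_multi_amp_pairs` is a hypothetical helper and the toy identifiers are made up.

```python
from collections import Counter


def find_multi_amp_pairs(pairs):
    """Given (parent_oid, user_id) pairs, return {pair: count} for pairs seen more than once."""
    counts = Counter(pairs)
    return {pair: n for pair, n in counts.items() if n > 1}


toy_pairs = [
    ('journal_a', 1),  # original amp
    ('journal_a', 1),  # new reaction by the same user on the same update
    ('journal_a', 2),
    ('journal_b', 1),
]
print(find_multi_amp_pairs(toy_pairs))  # {('journal_a', 1): 2}
```

On the filtered frame this could be called as `find_multi_amp_pairs(zip(sdf.parent_oid, sdf.user_id))`.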

In [None]:
for journal_oid in amps_by_journal_df[amps_by_journal_df.amp_count != amps_by_journal_df.user_count].sample(n=5, random_state=0).index:
    row = journal_df[journal_df.journal_oid == journal_oid].iloc[0]
    journal_publish_date = datetime.utcfromtimestamp(row.published_at / 1000)
    site_id = row.site_id
    row = site_df[site_df.site_id == site_id].iloc[0]
    print(row.site_id, row['name'], row.title, datetime.utcfromtimestamp(row.created_at / 1000), row.privacy, journal_publish_date)

In [None]:
reactions_launch_date = datetime.strptime('2020-11-01', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
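# .copy() so that adding the created_at_day column below does not write to a slice of ints_df
sdf = ints_df[(ints_df.interaction_type.str.startswith('amp'))&(ints_df.created_at >= int(reactions_launch_date.timestamp() * 1000))].copy()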
sdf['created_at_day'] = sdf.created_at.map(lambda ts: datetime.utcfromtimestamp(ts / 1000).replace(tzinfo=pytz.UTC).strftime('%Y-%m-%d'))
len(sdf)

In [None]:
real_reactions_launch_date = datetime.utcfromtimestamp(np.min(sdf[sdf.interaction_type != 'amp'].created_at) / 1000).replace(tzinfo=pytz.UTC)
str(real_reactions_launch_date)

In [None]:
# launch started on January 27, 2021
sdf[sdf.interaction_type != 'amp'].created_at_day.value_counts().sort_index().head(10)

In [None]:
sdf.interaction_type.value_counts()

In [None]:
int_type = sdf[sdf.created_at >= int(real_reactions_launch_date.timestamp() * 1000)].interaction_type
# note: %M is minutes (%m would repeat the month)
pd.concat([int_type.value_counts(normalize=False).rename(f'counts since {real_reactions_launch_date.strftime("%Y-%m-%d %H:%M")}'), int_type.value_counts(normalize=True).rename('% of total')], axis=1)

In [None]:
# plot reactions over time
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

may12 = datetime.strptime('2021-05-12', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
median_daily_pre = np.median(sdf[sdf.created_at_day <= '2021-05-12'].groupby('created_at_day').nunique().interaction_oid)
median_daily_post = np.median(sdf[sdf.created_at_day > '2021-05-12'].groupby('created_at_day').nunique().interaction_oid)

ax = axes[0]
start_time = reactions_launch_date
curr_time = start_time
end_time = datetime.utcfromtimestamp(np.max(sdf.created_at) / 1000).replace(tzinfo=pytz.UTC)
bins = []
while curr_time < end_time:
    bins.append(int(curr_time.timestamp() * 1000))
    curr_time += relativedelta(days=1)
print(f'{len(bins)} bins from {start_time} to {end_time}')

counts, bin_edges = np.histogram(sdf.created_at, bins=bins)
ax.plot(bin_edges[:-1], counts, label="All reactions")
day_totals = counts

counts, bin_edges = np.histogram(sdf[sdf.interaction_type != 'amp'].created_at, bins=bins)
ax.plot(bin_edges[:-1], counts, label="New reactions")

#ax.axvline(
#    may12.timestamp() * 1000,
#    linestyle='--', color='gray', alpha=0.8, label='May 12, 2021'
#)

ax.axvline(
    real_reactions_launch_date.timestamp() * 1000,
    linestyle='--', color='gray', alpha=0.8, label='New Reactions Launch'
)

#ax.hlines(median_daily_pre, start_time.timestamp() * 1000, may12.timestamp() * 1000, linestyle='dotted', color='black', label=f'Pre-May-12th median ({median_daily_pre} per day)', zorder=100)
#ax.hlines(median_daily_post, may12.timestamp() * 1000, end_time.timestamp() * 1000, linestyle='dashdot', color='black', label=f'Post-May-12th median ({median_daily_post} per day)', zorder=100)

ax.set_ylabel(f"Reactions per day")
ax.set_xlabel("Date")
ax.set_title(f"Reactions (n={len(sdf):,}) in Mongo snapshot from July 15, 2021")

#ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, y: datetime.utcfromtimestamp(x / 1000).replace(tzinfo=pytz.timezone('US/Central')).strftime("%Y\n%m %d").replace(" 0", " ")))
#start = datetime.strptime('2005-01-01', '%Y-%m-%d').replace(tzinfo=pytz.UTC)
#x_dates = [start + relativedelta(years=i) for i in range(18)]
#ax.set_xticks([d.timestamp() * 1000 for d in x_dates])
#nl = '\n'
#ax.set_xticklabels([f"{nl if i % 2 == 1 else ''}'" + d.strftime('%Y')[2:] for i, d in enumerate(x_dates)])
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, y: datetime.utcfromtimestamp(x / 1000).replace(tzinfo=pytz.timezone('US/Central')).strftime("%b %d").replace(" 0", " ")))
ax.legend(loc='lower right')

ax = axes[1]

for interaction_type, type_repr in zip(['amp', 'amp_folded_hands', 'amp_heart', 'amp_happy', 'amp_sad'], ['Original Amp', 'Folded Hands', 'Heart', 'Happy Face', 'Sad Face']):
    if interaction_type == 'amp':
        continue  # only the new reaction types are plotted in this panel
    counts, bin_edges = np.histogram(sdf[sdf.interaction_type == interaction_type].created_at, bins=bins)
    pcts = counts / day_totals
    ax.plot(bin_edges[:-1], pcts, label=type_repr)

#ax.axvline(
#    may12.timestamp() * 1000,
#    linestyle='--', color='gray', alpha=0.8, label='May 12, 2021'
#)

ax.axvline(
    real_reactions_launch_date.timestamp() * 1000,
    linestyle='--', color='gray', alpha=0.8, label='New Reactions Launch'
)

ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, y: datetime.utcfromtimestamp(x / 1000).replace(tzinfo=pytz.timezone('US/Central')).strftime("%b %d").replace(" 0", " ")))
ax.legend()
ax.set_ylabel("Fraction of total daily reactions")
ax.set_xlabel("Date")
ax.set_title(f"Reactions by type")

plt.show()