Valid Author Selection
===

This process generates three outcomes:
 - List of authors
 - List of valid authors
 - List of valid sites

Note on requirements: all authored updates must be non-trivial.

Requirements for authors:
 - 1+ update
 - Never authored an update on a spam site

Requirements for valid authors:
 - 3+ updates authored on SOME site
 - Never authored an update on a spam site

Requirements for valid sites:
 - 3+ updates from a valid author
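
As a rough illustration of how the three requirement lists above interact, here is a minimal sketch on made-up data. The `updates` list, the `spam_sites` set, and the `MIN_UPDATES` name are hypothetical and exist only for this example; the real selection below works on `journal_df` and `site_df`.

```python
from collections import Counter

# Hypothetical toy data: one (user_id, site_id) tuple per non-trivial update.
updates = [
    (1, 10), (1, 10), (1, 10),   # user 1: three updates on site 10
    (2, 10),                     # user 2: one update on site 10
    (3, 20), (3, 20), (3, 20),   # user 3: three updates on site 20 ...
    (3, 99),                     # ... plus one update on a spam site
    (4, 30), (4, 30),            # user 4: two updates on site 30
]
spam_sites = {99}  # assumed to be known in advance
MIN_UPDATES = 3    # threshold from the requirement lists

pair_counts = Counter(updates)

# Publishing on a spam site disqualifies the author everywhere.
spam_authors = {user for user, site in pair_counts if site in spam_sites}

# Authors: 1+ update and never on a spam site.
authors = {user for user, _ in pair_counts} - spam_authors

# Valid (author, site) pairs: 3+ updates by a non-spam author on that site.
valid_pairs = {
    pair for pair, n in pair_counts.items()
    if n >= MIN_UPDATES and pair[0] in authors
}
valid_authors = {user for user, _ in valid_pairs}
valid_sites = {site for _, site in valid_pairs}

print(sorted(authors), sorted(valid_authors), sorted(valid_sites))
```

In this toy data, user 3 meets the update threshold on site 20 but is excluded everywhere because of the single spam-site update, so site 20 does not become a valid site.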
 
Data to record about authors:
 - Update count
 - Unique site count
 - Most updates on one site
 - First to last update time
 - Is valid
 - Primary site (if valid)
 - Number of eligible sites
 - List of eligible sites
 
Note that what we actually have are valid author/site pairs. Data about them:
 - user_id
 - site_id
 - total_updates
 - user_total_updates
 - first_update_timestamp
 - user_first_update_timestamp
 - user_third_update_timestamp
 - user_valid_site_count  `# total valid sites i.e. how many user/site pairs contain this user`
 - site_valid_user_count  `# total valid users i.e. how many user/site pairs contain this site`
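
The two count columns at the end of this list can be derived directly from the pair table. A minimal sketch, assuming a hypothetical `pairs` dataframe with one row per valid user/site pair:

```python
import pandas as pd

# Hypothetical valid (user, site) pairs; column names follow the list above.
pairs = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "site_id": [10, 20, 10, 30],
})

# How many valid pairs contain this user, and how many contain this site.
pairs["user_valid_site_count"] = pairs.groupby("user_id")["site_id"].transform("nunique")
pairs["site_valid_user_count"] = pairs.groupby("site_id")["user_id"].transform("nunique")
print(pairs)
```

Using `transform` keeps the result aligned with the original rows, so no merge back onto the pair table is needed in this small case.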
 
Note: the sna-social-support project required "Most updates on one site" >= 2 and 24 hours apart.

**Question: should we generate recommendations for authors who have published 1 update on a site with 3+ updates?**
The intuition behind recommendation based on author type, etc. is that your writings reveal who you want by encoding information about who you are. So, I think we want to compute embeddings for authors based on the first three updates *by that author*.  
But this is weird, since we are really recommending sites to authors. A site is focused on a particular person, but multiple people could be eligible "authors" for that site, so we could end up computing embedding/interaction features for 2+ authors of the same site. So does each author have a "primary" site that we use as the basis for recommendation? Or do we keep (user,site) pairs in all cases and, when an author SEEKS a recommendation, select the pair for their most recent/largest/most relevant site as the basis for feature extraction?

One question: depending on the definition, how often would an author's primary site actually shift? Maybe we just ALWAYS use the first site where a person hit 3+ journal updates, and then disregard all other sites? How many sites would that omit? (And how inaccurately would we portray those users by using the "older" site for features?)
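
One way to make the "first site where they hit 3+ updates" rule concrete is to pick, per user, the pair with the earliest third-update timestamp. This is only a sketch of that one candidate definition, on a hypothetical `pairs` table; it is not a decision about which definition to adopt.

```python
import pandas as pd

# Hypothetical valid (user, site) pairs with the time of the user's third update.
pairs = pd.DataFrame({
    "user_id": [1, 1, 2],
    "site_id": [10, 20, 10],
    "user_third_update_timestamp": [300, 150, 400],
})

# For each user, keep the pair whose third update came earliest.
first_idx = pairs.groupby("user_id")["user_third_update_timestamp"].idxmin()
primary = pairs.loc[first_idx].reset_index(drop=True)

# Pairs that this rule would disregard.
omitted_pairs = len(pairs) - len(primary)
print(primary)
print(f"{omitted_pairs} of {len(pairs)} pairs would be disregarded")
```

Counting the disregarded pairs this way gives a direct answer to "how many sites would that omit?" once it is run on the real pair table.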


Specifically for embedding feature extraction, it seems like one needs to do so for all (user,site) pairs where user is a valid author and site is an eligible site for that author.
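
For feature extraction over all valid pairs, the inputs would be the first three updates of each (user,site) pair. A minimal sketch of that selection, assuming hypothetical `journals` and `valid_pairs` dataframes (the `update_id` column is invented for the example):

```python
import pandas as pd

# Hypothetical journal updates; update_id is an invented label for readability.
journals = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 2, 2],
    "site_id":    [10, 10, 10, 10, 10, 10, 10],
    "created_at": [40, 10, 30, 20, 5, 7, 6],
    "update_id":  list("abcdefg"),
})
# Hypothetical valid pairs: only user 1 on site 10.
valid_pairs = pd.DataFrame({"user_id": [1], "site_id": [10]})

keys = ["user_id", "site_id"]

# The inner merge keeps only updates that belong to a valid pair.
kept = journals.merge(valid_pairs, on=keys, how="inner")

# Oldest first, then take the first three updates within each pair.
first_three = kept.sort_values("created_at").groupby(keys, sort=False).head(3)
print(first_three)
```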

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import os
import re
import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split

from collections import Counter
import sqlite3
from nltk import word_tokenize
from html.parser import HTMLParser
from tqdm import tqdm
import random
import pickle
import json

from datetime import datetime
import pytz
from pprint import pprint

import matplotlib.pyplot as plt
import matplotlib.dates as md
import matplotlib
import pylab as pl
from IPython.display import display, HTML  # importing display from IPython.core.display is deprecated

In [None]:
# this imports a number of utility functions related to data annotation & the web client
import sys
sys.path.append("/home/lana/levon003/repos/qual-health-journeys/annotation_data")
import journal as journal_utils
import db as db_utils

In [None]:
# put all derived data in the model_data folder
working_dir = "/home/lana/shared/caringbridge/data/projects/recsys-peer-match/model_data"
os.makedirs(working_dir, exist_ok=True)
working_dir

In [None]:
# load the site metadata dataframe
# this is created in caringbridge_core from the new data
site_metadata_working_dir = "/home/lana/shared/caringbridge/data/derived/site_metadata"
s = datetime.now()
site_metadata_filepath = os.path.join(site_metadata_working_dir, "site_metadata.feather")
site_info_df = pd.read_feather(site_metadata_filepath)
print(datetime.now() - s)
len(site_info_df)

In [None]:
site_info_df.head()

In [None]:
#if np.sum(site_info_df.site_id.duplicated()) > 0:
site_info_df[site_info_df.site_id.duplicated(keep=False)]

In [None]:
# drop any duplicates
site_info_df = site_info_df.drop_duplicates(subset='site_id', keep='first')
len(site_info_df)

In [None]:
datetime.utcfromtimestamp(site_info_df.created_at.quantile(0.0001) / 1000).isoformat(),\
datetime.utcfromtimestamp(site_info_df.created_at.max() / 1000).isoformat()

In [None]:
# load the journal dataframe with the index
# this is all the new journal data
s = datetime.now()
journal_metadata_dir = "/home/lana/shared/caringbridge/data/derived/journal_metadata"
journal_metadata_filepath = os.path.join(journal_metadata_dir, "journal_metadata.feather")
journal_df = pd.read_feather(journal_metadata_filepath)
print(datetime.now() - s)
len(journal_df)

In [None]:
journal_df.head()

In [None]:
datetime.utcfromtimestamp(journal_df.created_at.quantile(0.0001) / 1000).isoformat(),\
datetime.utcfromtimestamp(journal_df.created_at.quantile(0.999999) / 1000).isoformat()

In [None]:
# the vast majority of sites with journals also have site-level metadata
# these 16 missing sites might be related to e.g. incomplete deletions on the part of CaringBridge
# or, more likely for new journals, the site collection was snapshotted before the journal collection, and new sites were created in the intervening period
len(set(journal_df[~journal_df.is_deleted].site_id) - set(site_info_df.site_id))

In [None]:
# trim out journal updates that are trivial (short or machine-generated) or deleted
print(len(journal_df))
journal_df = journal_df[(journal_df.is_nontrivial)&(~journal_df.is_deleted)]
print(len(journal_df))

In [None]:
# trim out journal updates with invalid dates
# (which includes journals without a published_at date)
invalid_start_date = datetime.fromisoformat('2005-01-01').replace(tzinfo=pytz.UTC)
invalid_end_date = datetime.fromisoformat('2022-01-01').replace(tzinfo=pytz.UTC)
print(f"Keeping journals between {invalid_start_date.isoformat()} and {invalid_end_date.isoformat()}.")
invalid_start_timestamp = invalid_start_date.timestamp() * 1000
invalid_end_timestamp = invalid_end_date.timestamp() * 1000
print(len(journal_df), np.sum(journal_df.published_at.isna()))
journal_df = journal_df[(journal_df.published_at >= invalid_start_timestamp)&(journal_df.published_at <= invalid_end_timestamp)]
print(len(journal_df), np.sum(journal_df.published_at.isna()))

In [None]:
# build a dataframe where each site has a list of user_ids on that site and the total number of non-trivial journals
site_proportions = []
for site_id, group in tqdm(journal_df.groupby(by='site_id', sort=False)):
    total_journals = len(group)
    user_ids = set(group.user_id)
    site_proportion = {
        'site_id': site_id,
        'user_ids': user_ids,
        'total_journals': total_journals
    }
    site_proportions.append(site_proportion)
len(site_proportions)

In [None]:
site_proportions_df = pd.DataFrame(site_proportions)
len(site_proportions_df)

In [None]:
site_proportions_df.head()

In [None]:
# merge the dataframes so that we have more detailed site-level info
s = datetime.now()
site_df = pd.merge(site_info_df, site_proportions_df, on='site_id', validate='one_to_one')
print(datetime.now() - s)
len(site_df), len(site_df) / len(site_info_df), len(site_df) / len(site_proportions_df)

In [None]:
site_df.head()

In [None]:
site_df.dtypes

In [None]:
print(f"{np.sum(site_df.total_journals == site_df.numJournals) / len(site_df) * 100:.2f}% of sites ({len(site_df)} / {len(site_info_df)} total sites) have a correct 'numJournals' entry.")
site_df[site_df.total_journals != site_df.numJournals].sample(n=10)[['site_id', 'name', 'title', 'privacy', 'publish_date', 'created_at', 'numJournals', 'total_journals', 'isDeactivated']]

In [None]:
sdf = site_df[~site_df.isDeactivated]
print(f"{np.sum(sdf.total_journals == sdf.numJournals) / len(sdf) * 100:.2f}% of non-deactivated sites ({len(sdf)} / {len(site_df)} sites with 1+ updates) have a correct 'numJournals' entry.")

plt.hist(sdf.numJournals - sdf.total_journals, bins=np.linspace(-100, 100), log=True)
plt.xlabel("Additional journals in total count but not in subset")
plt.ylabel("Site count")
plt.show()

In [None]:
bins = np.linspace(site_info_df.created_at.quantile(0.0001), site_info_df.created_at.max())
totals, bin_edges = np.histogram(sdf.created_at, bins=bins)

counts, bin_edges = np.histogram(sdf[sdf.numJournals != sdf.total_journals].created_at, bins=bins)
pcts = counts / totals * 100  # convert to a percentage to match the y-axis label


fig, ax = plt.subplots(1, 1, figsize=(7,6))
ax.plot(bin_edges[:-1], pcts)
ax.set_title("Count mismatches are related to original site creation time")
ax.set_xlabel("Date")
ax.set_ylabel("% of sites with mismatch between official and actual Journal counts")
# x is a millisecond timestamp; label each tick with its UTC date
ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, y: datetime.utcfromtimestamp(x / 1000).strftime("%Y\n%m/%d")))
plt.tight_layout()
plt.show()

## User filtering

In [None]:
user_site_counts = journal_df[['user_id', 'site_id']].value_counts()
user_site_df = user_site_counts.to_frame(name='update_count').reset_index()
print(len(user_site_df))
user_site_df.head()

In [None]:
invalid_users = []
for spam_users in site_df[site_df.isSpam == 1].user_ids:
    for spam_user in spam_users:
        invalid_users.append(spam_user)
print(len(invalid_users))

# Manual removal of users who are invalid for other reasons
invalid_users.extend([
    0,  # Test user
    15159562,  # Test user account run by CaringBridge Customer Experience team
    46,  # Seems to be run at least in part by CaringBridge team for testing
    13896060,  # Seems like another customer care rep
    594,  # Seems like a customer care rep, but also seems like it may include some legitimate sites? (see e.g. site 559205)
    7393709, #Junk and test posts
    25036137, #Repeated test text
    8192483, #Mostly test posts, but one genuine post about patient
    17956362, #Test posts
    16648084, #Test posts (and some good poetry)
    31761432, # Doctor's ad
    32764680, # Payday lending ad
    30457719, # 3D visualization company ad
    32538830, # Car supplies ad
    32757690, # Fashion ad
    32757739, # Clothing brand ad
    1043681, # Leasing furniture ad
    28132146, # Farm company ad
    31477721, # Lenders ad
    31879875, # Payday lender ad
    31799168, # Credit company ad
    32428328, # SEO ad
    31684805, # Various ads
    30165532, # Various ads about black magic
    31833912, # Job hunting spam
    32753111, # Arabic text (possibly spam)
    32732132 # Turkish text (spam)
])
print(len(invalid_users))

In [None]:
user_site_df = user_site_df[~user_site_df.user_id.isin(invalid_users)].reset_index(drop=True)
len(user_site_df), len(set(user_site_df.user_id)), len(set(user_site_df.site_id))

In [None]:
# remove any sites that are deactivated
# note that we do this as a separate step, rather than removing all users who have published on deactivated sites, since we want to remove deleted sites but don't consider it to be "author poison" the way publishing on a spam site is
deactivated_sites = set(site_df[site_df.isDeactivated].site_id)
print(len(deactivated_sites))
user_site_df = user_site_df[~user_site_df.site_id.isin(deactivated_sites)].reset_index(drop=True)
len(user_site_df), len(set(user_site_df.user_id)), len(set(user_site_df.site_id))
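
The difference between the two removal rules is easiest to see on a toy table. The following sketch uses a hypothetical `pairs` dataframe and invented site IDs: a spam site removes every pair of its authors, while a deactivated site removes only its own pairs.

```python
import pandas as pd

# Hypothetical user/site pairs; site 99 is spam, site 50 is deactivated.
pairs = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "site_id": [10, 99, 20, 50, 20],
})
spam_sites = {99}
deactivated_sites = {50}

# Spam is "author poison": drop every pair belonging to a spam-site author.
spam_users = set(pairs.loc[pairs.site_id.isin(spam_sites), "user_id"])
pairs = pairs[~pairs.user_id.isin(spam_users)]

# Deactivation removes the site only; the author's other pairs survive.
pairs = pairs[~pairs.site_id.isin(deactivated_sites)].reset_index(drop=True)
print(pairs)
```

User 1 loses site 10 as well as the spam site, whereas user 2 keeps site 20 after the deactivated site 50 is dropped.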

In [None]:
user_site_df.head()

In [None]:
s = datetime.now()
user_site_df.to_feather(os.path.join(working_dir, 'user_site_df.feather'))
user_site_df.to_csv(os.path.join(working_dir, 'user_site_df.csv'), index=False)
print(datetime.now() - s)

### Old Process

In [None]:
user_site_counts = journal_df[['user_id', 'site_id']].value_counts()
user_site_df = user_site_counts[user_site_counts >= 3].to_frame(name='update_count')
user_site_df.head()

In [None]:
#user_site_df.index.to_frame(index=False)
user_site_df = user_site_df.reset_index()
user_site_df

In [None]:
# 12,813 authors with multiple sites
user_eligible_site_count = user_site_df['user_id'].value_counts()

fig, ax = plt.subplots(1, 1, figsize=(6, 6))
bins = np.arange(1, 20)
ax.hist(user_eligible_site_count, bins=bins, log=True)

ax.axvline(2, linestyle='--', color='black')
ax.text(0.12, 0.97, f'{np.sum(user_eligible_site_count > 1):,} ({np.sum(user_eligible_site_count > 1) / len(user_eligible_site_count) * 100:.2f}%) authors have > 1 eligible site', transform=ax.transAxes, va='top', ha='left')

ax.set_xlabel("Number of sites with 3+ journal updates")
ax.set_ylabel("User count")
ax.set_title("Some authors have 3+ non-trivial updates on multiple sites")

ax.set_xticks(bins)

plt.show()

In [None]:
# user_site_df contains only users with 3+ journal updates on at least one site
valid_users = set(user_site_df.user_id)
len(valid_users)

In [None]:
0 in valid_users

In [None]:
valid_users.discard(0)  # discard() does not raise if user 0 is already absent

In [None]:
removed_for_spam = 0
for spam_users in tqdm(site_df[~site_df.isSpam.isna()].user_ids):
    for spam_user in spam_users:
        if spam_user in valid_users:
            valid_users.remove(spam_user)
            removed_for_spam += 1
removed_for_spam, len(valid_users)

In [None]:
# Manual removal of users who are invalid for other reasons
invalid_users = [
    0,  # Test user
    15159562,  # Test user account run by CaringBridge Customer Experience team
    46,  # Seems to be run at least in part by CaringBridge team for testing
    13896060,  # Seems like another customer care rep
    594,  # Seems like a customer care rep, but also seems like it may include some legitimate sites? (see e.g. site 559205)
    7393709, #Junk and test posts
    25036137, #Repeated test text
    8192483, #Mostly test posts, but one genuine post about patient
    17956362, #Test posts
    16648084, #Test posts (and some good poetry)
    31761432, # Doctor's ad
    32764680, # Payday lending ad
    30457719, # 3D visualization company ad
    32538830, # Car supplies ad
    32757690, # Fashion ad
    32757739, # Clothing brand ad
    1043681, # Leasing furniture ad
    28132146, # Farm company ad
    31477721, # Lenders ad
    31879875, # Payday lender ad
    31799168, # Credit company ad
    32428328, # SEO ad
    31684805, # Various ads
    30165532, # Various ads about black magic
    31833912, # Job hunting spam
    32753111, # Arabic text (possibly spam)
    32732132 # Turkish text (spam)
]

In [None]:
removed_manually = 0
for invalid_user in invalid_users:
    if invalid_user in valid_users:
        valid_users.remove(invalid_user)
        removed_manually += 1
removed_manually, len(valid_users)

In [None]:
# what percent of users remain?
len(valid_users) / len(set(journal_df.user_id))

In [None]:
# write out valid users to text file
with open(os.path.join(working_dir, "valid_user_ids.txt"), 'w') as outfile:
    for user_id in valid_users:
        outfile.write(str(user_id) + "\n")
print("Finished.")

In [None]:
# read valid users
valid_users = set()
with open(os.path.join(working_dir, "valid_user_ids.txt"), 'r') as infile:
    for line in infile:
        user_id = line.strip()
        if user_id == "":
            continue
        else:
            valid_users.add(int(user_id))
len(valid_users)

## Site analysis & filtering

In [None]:
print(len(user_site_df))
user_site_df = user_site_df[user_site_df.user_id.isin(valid_users)]
print(len(user_site_df))
valid_sites = set(user_site_df.site_id)
len(valid_sites)

In [None]:
user_site_df.sample(n=12)

In [None]:
valid_site_ids = valid_sites

In [None]:
len(valid_site_ids), len(valid_site_ids) / len(set(site_df.site_id))

In [None]:
# are there any spam sites still in the sample?
# no, as expected
len(site_df[(~site_df.isSpam.isna())&(site_df.site_id.isin(valid_site_ids))])

In [None]:
# how many sites included in this sample have only a single (substantive) update?
# None! We have changed the criteria compared to the CSCW paper
total_single_update_sites = len(site_df[(site_df.site_id.isin(valid_site_ids))&(site_df.total_journals == 1)])
total_single_update_sites, total_single_update_sites / len(valid_site_ids)

In [None]:
single_update_site_ids = np.array(site_df[(site_df.site_id.isin(valid_site_ids))&(site_df.total_journals == 1)].site_id)
len(single_update_site_ids)

In [None]:
# write out valid sites to text file
with open(os.path.join(working_dir, "valid_site_ids.txt"), 'w') as outfile:
    for site_id in valid_site_ids:
        outfile.write(str(site_id) + "\n")
print("Finished.")

In [None]:
# read valid sites
valid_site_ids = set()
with open(os.path.join(working_dir, "valid_site_ids.txt"), 'r') as infile:
    for line in infile:
        site_id = line.strip()
        if site_id == "":
            continue
        else:
            valid_site_ids.add(int(site_id))
len(valid_site_ids)

In [None]:
plt.hist(site_df[site_df.site_id.isin(valid_site_ids)].visits, log=True, bins=range(500))
plt.title("Distribution of selected sites by number of visits")
plt.show()

### Create valid user/site dataframe

 - user_id
 - site_id
 - total_updates
 - user_total_updates
 - first_update_timestamp
 - user_first_update_timestamp
 - user_third_update_timestamp
 - user_valid_site_count  `# total valid sites i.e. how many user/site pairs contain this user`
 - site_valid_user_count  `# total valid users i.e. how many user/site pairs contain this site`

In [None]:
user_site_df = user_site_counts[user_site_counts >= 3].to_frame(name='update_count').reset_index()
user_site_df = user_site_df[user_site_df.user_id.isin(valid_users)]
user_site_df.head()

In [None]:
# add total_updates column
site_counts = journal_df.site_id.value_counts().to_frame(name='total_updates').rename_axis(index='site_id').reset_index()
user_site_df = pd.merge(user_site_df, site_counts, how='left', on='site_id')
user_site_df = user_site_df.rename(columns={'update_count': 'user_total_updates'})
user_site_df.head()

In [None]:
user_valid_site_count = user_site_df.groupby('user_id').site_id.nunique()

fig, ax = plt.subplots(1, 1, figsize=(6,6))
bins=np.arange(1, 12)
ax.hist(user_valid_site_count, log=True, bins=bins)
ax.set_xticks(bins)
ax.axvline(2, linestyle='--', color='black')
ax.text(0.15, 0.97, f'{np.sum(user_valid_site_count > 1):,} ({np.sum(user_valid_site_count > 1) / len(user_valid_site_count) * 100:.2f}%) authors have > 1 eligible site', transform=ax.transAxes, va='top', ha='left')
ax.set_xlabel("Number of sites with 3+ journal updates")
ax.set_ylabel("User count")
ax.set_title("Some authors have 3+ non-trivial updates on multiple sites")
plt.show()

In [None]:
site_valid_user_count = user_site_df.groupby('site_id').user_id.nunique()
fig, ax = plt.subplots(1, 1, figsize=(6,6))
bins=np.arange(1, 12)
ax.hist(site_valid_user_count, log=True, bins=bins)
ax.set_xticks(bins)
ax.axvline(2, linestyle='--', color='black')
ax.text(0.15, 0.97, f'{np.sum(site_valid_user_count > 1):,} ({np.sum(site_valid_user_count > 1) / len(site_valid_user_count) * 100:.2f}%) valid sites have > 1 eligible author', transform=ax.transAxes, va='top', ha='left')
ax.set_xlabel("Number of valid authors with 3+ updates on site")
ax.set_ylabel("Site count")
ax.set_title("Some sites have multiple authors with 3+ updates")
plt.show()

In [None]:
user_valid_site_count = user_valid_site_count.to_frame('user_valid_site_count').reset_index()
site_valid_user_count = site_valid_user_count.to_frame('site_valid_user_count').reset_index()
user_site_df = pd.merge(user_site_df, user_valid_site_count, how='left', on='user_id')
user_site_df = pd.merge(user_site_df, site_valid_user_count, how='left', on='site_id')
user_site_df.head()

In [None]:
s = datetime.now()
user_first_update_timestamp_dict = {}
user_third_update_timestamp_dict = {}
#df[["A","B"]].apply(tuple, 1).isin(AB_col)
valid_tuples = user_site_df[['user_id', 'site_id']].apply(tuple, 1)
filtered_journals = journal_df[journal_df.user_id.isin(valid_users)]
filtered_journal_tuples = filtered_journals[['user_id', 'site_id']].apply(tuple, 1)
filtered_journals = filtered_journals[filtered_journal_tuples.isin(valid_tuples)]
print(f"Starting groupby after {datetime.now() - s}.")
for key, group in tqdm(filtered_journals[['user_id', 'site_id', 'created_at']].groupby(['user_id', 'site_id'])):
    created_at = group.created_at.sort_values(ascending=True)
    user_first_update_timestamp_dict[key] = created_at.iloc[0]
    user_third_update_timestamp_dict[key] = created_at.iloc[2]
print(f"Finished groupby after {datetime.now() - s}.")
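
The first and third update timestamps per pair can also be computed without a Python-level loop. This is a sketch of one vectorized alternative on a hypothetical `journals` table; it assumes, as above, that every retained pair has at least three updates.

```python
import pandas as pd

# Hypothetical journal updates for two (user, site) pairs.
journals = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 2, 2],
    "site_id":    [10, 10, 10, 10, 20, 20, 20],
    "created_at": [400, 100, 300, 200, 50, 70, 60],
})
keys = ["user_id", "site_id"]

ordered = journals.sort_values("created_at")
# 0-based position of each update within its pair, oldest first.
rank = ordered.groupby(keys).cumcount()

first = ordered[rank == 0].set_index(keys)["created_at"].rename("user_first_update_timestamp")
third = ordered[rank == 2].set_index(keys)["created_at"].rename("user_third_update_timestamp")

# One row per pair with both timestamps, ready to merge onto the pair table.
timestamps = pd.concat([first, third], axis=1).reset_index()
print(timestamps)
```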

In [None]:
user_first_update_timestamp_list = []
user_third_update_timestamp_list = []
for user_id, site_id in zip(user_site_df.user_id, user_site_df.site_id):
    user_first_update_timestamp = user_first_update_timestamp_dict[(user_id, site_id)]
    user_first_update_timestamp_list.append(user_first_update_timestamp)
    user_third_update_timestamp = user_third_update_timestamp_dict[(user_id, site_id)]
    user_third_update_timestamp_list.append(user_third_update_timestamp)
user_site_df = user_site_df.assign(user_first_update_timestamp=user_first_update_timestamp_list, user_third_update_timestamp=user_third_update_timestamp_list)
user_site_df.head()

In [None]:
site_first_update_df = journal_df[['site_id', 'created_at']].groupby('site_id').min().reset_index().rename(columns={'created_at': 'first_update_timestamp'})
user_site_df = pd.merge(user_site_df, site_first_update_df, how='left', on='site_id')
user_site_df.head()

In [None]:
# user's first update should always be >= the first update on the site
assert np.sum(user_site_df.first_update_timestamp <= user_site_df.user_first_update_timestamp) == len(user_site_df)

In [None]:
# this number should be very large
# (in fact, the two should be equal on any site where total_updates == user_total_updates)
np.sum(user_site_df.first_update_timestamp == user_site_df.user_first_update_timestamp)

In [None]:
sdf = user_site_df[user_site_df.total_updates == user_site_df.user_total_updates]
print(f"{len(sdf)} ({len(sdf) / len(user_site_df)*100:.2f}%) user/site pairs are on single-author sites.")
assert np.all(sdf.first_update_timestamp == sdf.user_first_update_timestamp)
assert np.all(sdf.site_valid_user_count == 1)

In [None]:
s = datetime.now()
user_site_df.to_feather(os.path.join(working_dir, 'user_site_df.feather'))
user_site_df.to_csv(os.path.join(working_dir, 'user_site_df.csv'), index=False)
print(datetime.now() - s)

### Investigate valid authors

In [None]:
# load the list of valid user/site pairs
s = datetime.now()
model_data_dir = '/home/lana/shared/caringbridge/data/projects/recsys-peer-match/model_data'
user_site_df = pd.read_csv(os.path.join(model_data_dir, 'user_site_df.csv'))
valid_user_ids = set(user_site_df.user_id)
print(f"Read {len(user_site_df)} rows ({len(valid_user_ids)} unique users) in {datetime.now() - s}.")
user_site_df.head()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,4))

bins = []
year = 2005
month = 0
while year != 2020:
    if month == 12:
        year += 1
        month = 1
    else:
        month += 1
    # pin bin edges to UTC so they do not depend on the machine's local timezone
    bins.append(datetime.fromisoformat(f"{year}-{month:02}-01").replace(tzinfo=pytz.UTC).timestamp())

total_counts, bin_edges = np.histogram(user_site_df.user_third_update_timestamp / 1000, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='All valid user/site pairs')
total_counts, bin_edges = np.histogram(user_site_df[user_site_df.user_valid_site_count == 1].user_third_update_timestamp / 1000, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Single-site authors')
total_counts, bin_edges = np.histogram(user_site_df[user_site_df.site_valid_user_count == 1].user_third_update_timestamp / 1000, bins=bins)
plt.plot(bin_edges[:-1], total_counts, linestyle='-', linewidth=2, label='Single-author sites')
plt.legend()
plt.axvline(datetime.fromisoformat("2014-01-01").replace(tzinfo=pytz.UTC).timestamp(), color='black', alpha=0.8, linestyle='--', linewidth=1)
plt.axvline(datetime.fromisoformat("2019-01-01").replace(tzinfo=pytz.UTC).timestamp(), color='black', alpha=0.8, linestyle='--', linewidth=1)

plt.ylabel("New user/site pairs")
plt.title(f"{len(user_site_df):,} valid user/site pairs containing {len(set(user_site_df.user_id)):,} unique users and {len(set(user_site_df.site_id)):,} unique sites")

# one tick per year, also in UTC so the labels line up with the bin edges
xticks = [datetime.fromisoformat(f"{2005 + i}-01-01").replace(tzinfo=pytz.UTC).timestamp() for i in range((2020 - 2005) + 2)]
plt.xticks(
    xticks, 
    [datetime.utcfromtimestamp(tick).strftime('%Y') for tick in xticks])
          
plt.show()
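
Monthly bin edges like those used for this histogram can also be produced by a small helper. The sketch below uses only the standard library and pins every edge to UTC; the function name `month_start_edges` is hypothetical.

```python
from datetime import datetime, timezone

def month_start_edges(start_year, end_year):
    """Return UTC epoch-second timestamps for the first day of each month,
    from January of start_year through January of end_year (inclusive)."""
    edges = []
    for year in range(start_year, end_year):
        for month in range(1, 13):
            edges.append(datetime(year, month, 1, tzinfo=timezone.utc).timestamp())
    # Closing edge so the last month of end_year - 1 gets a complete bin.
    edges.append(datetime(end_year, 1, 1, tzinfo=timezone.utc).timestamp())
    return edges

bins = month_start_edges(2005, 2020)
print(len(bins), bins[0], bins[-1])
```

Because the edges are timezone-aware, the same values come out regardless of the local timezone of the machine running the notebook.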

In [None]:
# users with the largest number of valid sites
user_df = user_site_df.drop_duplicates(subset='user_id')
user_df[['user_id', 'user_valid_site_count']].sort_values(by='user_valid_site_count', ascending=False).head(n=10)

In [None]:
# sites with the largest number of valid users
site_df = user_site_df.drop_duplicates(subset='site_id')
site_df[['site_id', 'site_valid_user_count']].sort_values(by='site_valid_user_count', ascending=False).head(n=10)

In [None]:
# user/site pairs with the largest number of updates on a single site
user_site_df.sort_values(by='user_total_updates', ascending=False).head(n=10)