---
title: "Beyond twitter"
subtitle: "Exploring `bluesky.social` for digital disease detection and prototyping a data extraction pipeline for ILI surveillance"
author: "Heiner Atze, MSc, PhD"
institute: Digital Epidemiology 2025, Hasselt University
date: "April 10, 2025"
format:
  revealjs: 
    theme: default
    reference-location: margin
    preview-links: true
    toc: false
    toc-title: Outline
    incremental: false
    scrollable: true

  beamer:
    toc: true
    toc-title: Outline
    theme: Hannover
    slide-level: 2
    aspectratio: 169
    pdf-engine: tectonic
    incremental: false

jupyter: digepi
execute: 
  cache: false 
  echo: false
  output: false

bibliography: "../../dig_epi.bib"
---

In [None]:
#| label: imports
#| echo: false
#| output: false
from google.oauth2 import service_account
import pandas as pd
import matplotlib.pyplot as plt
import pandas_gbq
import sys
import os
sys.path.append(os.path.abspath("../"))
from analysis.bq_queries import get_post_count_ili_sql, get_llm_ili_sql
from analysis.feature_eng import *
from analysis.model_evaluation import *
credentials = service_account.Credentials.from_service_account_file(
    '../.gc_creds/digepizcde-71333237bf40.json')

In [None]:
#| label: sql_raw_post counts
#| output: false

who_subset = 'flunet'
lang = 'fr'
country_code = 'FRA'

ili_kws = [
    'grippe',  'rhume', 'fievre', 'courbature'
    # "Grippe", 'grippe', 'Schnupfen', 'Fieber', 'Muskelschmerzen'
]
ili_kws_sql = [f"'{x}'" for x in ili_kws]

In [None]:
control_kws = ['travail', 'voiture', 'demain', 'sommeil']
# control_kws = ['Auto', 'morgen', 'Arbeit', 'arbeiten', 'schlafen', 'Schlaf']
control_kws_sql = [f"'{x}'" for x in control_kws]

In [None]:
post_count_ili_sql ="SELECT * FROM `digepizcde.bsky_ili.bsky_ili_fr`"

In [None]:
#| echo: false
#| output: false
post_count_ili_df = pandas_gbq.read_gbq(
   post_count_ili_sql, credentials=credentials
).set_index('date')
post_count_ili_df.index = pd.to_datetime(post_count_ili_df.index)

In [None]:
post_count_ili_df['year'] = post_count_ili_df.index.year.astype("category")
post_count_ili_df['month'] = post_count_ili_df.index.month.astype("category")
post_count_ili_df['week'] = post_count_ili_df.index.isocalendar().week.astype("category")
post_count_ili_df['season'] = post_count_ili_df['month'].apply(assign_season).astype("category")

In [None]:
lags = 2
weeks_ahead = 1
X = post_count_ili_df.drop([
    'ili_case', 'ari_case', 'ili_incidence', 'ari_incidence',
    'norm_post_count', 'rest_posts'], axis = 1)
lagdfs = []

for l in range(1, lags+1):
    lagdf = X.shift(l)
    lagdf.columns = [f"{c}_lag{l}" for c in lagdf.columns]
    lagdfs.append(lagdf)

X = pd.concat([X, *lagdfs], axis = 1).dropna().iloc[:-weeks_ahead,:]

In [None]:
y = post_count_ili_df['ili_incidence'].iloc[lags+weeks_ahead:]
y = y.divide(y.max())

In [None]:
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

In [None]:
ts_cv = TimeSeriesSplit(
    n_splits=5,
    gap=0,
    max_train_size=100,
    test_size=10,
)

In [None]:
gbrt = HistGradientBoostingRegressor(categorical_features="from_dtype", random_state=42)
categorical_columns = X.columns[X.dtypes == "category"]
print("Categorical features:", categorical_columns.tolist())

In [None]:
# evaluate(gbrt, X, y, cv=ts_cv, model_prop="n_iter_")
# Fit on the full series so that the predict() call in the next cell works
gbrt.fit(X, y)

In [None]:
ypred = pd.Series(gbrt.predict(X), index = y.index) 

In [None]:
y.plot()
ypred.plot()

In [None]:
month_splines = periodic_spline_transformer(12,6) \
    .fit_transform(post_count_ili_df[['month']])

# Create a dataframe for the splines
month_splines_df = pd.DataFrame(
    month_splines, 
    index=post_count_ili_df.index,
     columns=[f'month_spline_{i}' for i in range(month_splines.shape[1])])

# Concatenate the splines with the original dataframe
post_count_ili_df = pd.concat([post_count_ili_df, month_splines_df], axis=1)

In [None]:
week_splines = periodic_spline_transformer(54, 27) \
    .fit_transform(post_count_ili_df[['week']])

# Create a dataframe for the splines
week_splines_df = pd.DataFrame(
    week_splines, 
    index=post_count_ili_df.index,
     columns=[f'week_spline_{i}' for i in range(week_splines.shape[1])])

# Concatenate the splines with the original dataframe
post_count_ili_df = pd.concat([post_count_ili_df, week_splines_df], axis=1)

In [None]:
llm_ili_sql = get_llm_ili_sql(
    ili_kws, lang, country_code
)

In [None]:
#| echo: false
#| output: false
llm_ili_df = pandas_gbq.read_gbq(
    llm_ili_sql, credentials=credentials 
).set_index('date')
llm_ili_df.index = pd.to_datetime(llm_ili_df.index)

# Introduction

## `bluesky`: general aspects

::: {.columns}

:::: {.column}

- microblogging platform 
- similar to `twitter` in user experience
- decentralized 
- open source

::::

:::: {.column}

![](./figures/bluesky_logo.png)

::::

:::

## Decentralization and Democratization of content algorithms [^longnote]

- Decentralized User Identifier (DID)
  - immutable, associated with a human-readable user handle
- Personal Data Servers (PDSs)

- DIDs and affiliated content are portable between PDSs
- Users can choose, prioritize and develop feed generators and content labelers

[^longnote]: @balduf2024looking

## Development of user activity [^ref1]

::: {.columns}

::: {.column}

- current estimate: ca. 33 million active users
- user base expanded in bursts after key events:
  - 2022: acquisition of `twitter` by Elon Musk
  - 2024: ban of `X` in Brazil, presidential election in the US

:::

:::: {.column}

![](./figures/bsky_guardian.png)

::::

::: 

[^ref1]: @explodingtopicsBlueskyUser, @balduf2024looking

## Literature addressing `bluesky`

- Google Scholar search: "bluesky" AND "social" since 2022
- 43 articles

- main topics: 
  - decentralized social network architecture
  - user migration from `X` to `bluesky` in 2024
  - network structure and dynamics

- no results for 
  - "bluesky" AND "disease"
  - "bluesky" AND "epidemiology"

# Exploration of bluesky data

## bluesky API

- publicly accessible for free
- extensive documentation at https://docs.bsky.app/docs/category/http-reference

## `searchPosts` API method

- [API documentation](https://docs.bsky.app/docs/api/app-bsky-feed-search-posts)

- selected parameters:
  - `q`: search query
  - `since`, `until`: defining search period  

- deterministic search
- allows exhaustive sampling
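
Because the search is deterministic and paginated, exhaustive sampling amounts to following the `cursor` returned with each page until none is left. The following is a minimal sketch, assuming the response carries `posts` and `cursor` fields as described in the API documentation. `build_search_params`, `iter_posts` and the injected `fetch_page` callable are hypothetical names, not part of the SDK, and `fake_fetch` only stands in for the real HTTP call.

```python
from typing import Callable, Iterator, Optional


def build_search_params(q: str, since: str, until: str,
                        lang: Optional[str] = None, limit: int = 100,
                        cursor: Optional[str] = None) -> dict:
    """Assemble query parameters for app.bsky.feed.searchPosts."""
    params = {"q": q, "since": since, "until": until, "limit": limit}
    if lang is not None:
        params["lang"] = lang
    if cursor is not None:
        params["cursor"] = cursor
    return params


def iter_posts(fetch_page: Callable[[dict], dict], **query) -> Iterator[dict]:
    """Yield every post of a query by following the response cursor."""
    cursor = None
    while True:
        page = fetch_page(build_search_params(cursor=cursor, **query))
        yield from page.get("posts", [])
        cursor = page.get("cursor")
        if not cursor:  # no cursor means the last page was reached
            break


def fake_fetch(params: dict) -> dict:
    """Stand-in for the HTTP request: two canned pages keyed by cursor."""
    pages = {
        None: {"posts": [{"uri": "at://a/1"}, {"uri": "at://a/2"}], "cursor": "2"},
        "2": {"posts": [{"uri": "at://a/3"}]},
    }
    return pages[params.get("cursor")]


uris = [
    post["uri"]
    for post in iter_posts(fake_fetch, q="grippe", since="2024-01-01",
                           until="2024-01-08", lang="fr")
]
```

In real use, `fetch_page` would wrap a GET request to the `app.bsky.feed.searchPosts` endpoint or the corresponding SDK client call. Host and authentication are left out here on purpose.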

## `getProfiles`

- retrieves author profile information
- for reference, not used in this project

## Post metadata

- defined in the [SDK documentation](https://atproto.blue/en/latest/atproto/atproto_client.models.app.bsky.feed.defs.html#atproto_client.models.app.bsky.feed.defs.PostView)

- fields (selection):
  - `uri`: unique post identifier
  - `author`: contains the `did`, which can be used to retrieve the user profile
  - `record`: contains the text and time information of the message
  - `embed`: any embedded media (images, other posts, etc.)

- in contrast to former `twitter` post metadata, no geographic information
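
For downstream processing, the nested post metadata can be reduced to a flat row. The following is a minimal sketch on a plain dictionary shaped like the JSON response. `flatten_post`, the output column names and the example values are hypothetical and chosen for illustration only.

```python
def flatten_post(post: dict) -> dict:
    """Keep only the fields needed for text processing and time series."""
    record = post.get("record", {})
    return {
        "uri": post["uri"],                      # unique post identifier
        "author_did": post["author"]["did"],     # key to the user profile
        "text": record.get("text", ""),          # message text
        "created_at": record.get("createdAt"),   # timestamp of the message
        "has_embed": "embed" in post,            # True if media is attached
    }


# Hypothetical example post, not real data
example_post = {
    "uri": "at://did:plc:example/app.bsky.feed.post/abc123",
    "author": {"did": "did:plc:example", "handle": "example.bsky.social"},
    "record": {"text": "J'ai la grippe", "createdAt": "2024-01-05T10:00:00Z"},
}

row = flatten_post(example_post)
```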

## User information

- `Feedgens`
- `Labelers`

- no geo information

# Project

## Outline

**`bluesky` post data for digital disease surveillance**

. . .

**Implementation of a continuous surveillance pipeline**

# Methods

# Data extraction

## Symptom-related message extraction

- focused on French `bluesky` posts (data volume constraint)
- extraction using list of keywords
  - grippe (*flu, influenza*)
  - rhume (*common cold*)
  - fievre (*fever*)
  - courbature (*muscle pain*)

- extraction of 
  - complete message data for further language processing
  - 

## Basal network activity

- probing of the basal network activity using keywords
  - travail (*work*)
  - demain (*tomorrow*)
  - voiture (*car*)
  - sommeil (*sleep*)

- post counts aggregated by day
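
The daily aggregation can be sketched with `pandas` as follows. The toy posts, the column names `created_at` and `keyword`, and the summed `control_posts` series are assumptions for illustration and do not reproduce the project's actual query.

```python
import pandas as pd

# Toy control-keyword posts (hypothetical data)
posts = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2024-01-01T08:00:00Z", "2024-01-01T12:30:00Z",
         "2024-01-01T21:15:00Z", "2024-01-02T09:45:00Z"],
        utc=True,
    ),
    "keyword": ["travail", "demain", "travail", "sommeil"],
})

# One row per day, one column per keyword, zero where a keyword is absent
daily_counts = (
    posts.assign(date=posts["created_at"].dt.floor("D"))
    .groupby(["date", "keyword"])
    .size()
    .unstack(fill_value=0)
)

# Total control activity per day across all control keywords
control_posts = daily_counts.sum(axis=1)
```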

## Case data

- data downloaded from `WHO FluMart`
  - FluID: ILI case data
  - FluNet: virological data

## Data processing for time series extraction

- Normalization of ILI post counts by basal network activity
- 

- LLM
- [ECDC case definition](https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32018D0945&from=EN#page=24)
  - LLM vs. random post selection
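
One plausible form of the normalization step is a simple ratio of symptom posts to control posts per time step. The sketch below assumes exactly that. The numbers are made up, `control_posts` is a hypothetical column name, and the actual definition of `norm_post_count` in the pipeline may differ.

```python
import pandas as pd

# Toy weekly counts (hypothetical data)
weekly = pd.DataFrame(
    {"grippe_posts": [12, 30, 45], "control_posts": [4000, 5000, 4500]},
    index=pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"]),
)

# Weeks without control posts yield NaN instead of a division by zero
denominator = weekly["control_posts"].where(weekly["control_posts"] > 0)
weekly["norm_post_count"] = weekly["grippe_posts"] / denominator
```

Dividing by the control counts is meant to remove the growth of the platform itself from the signal, so that a rising user base is not mistaken for rising ILI activity.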

# Results

## Raw post counts

::: {.columns}

::: {.column}


In [None]:
#| output: true
ax1 = post_count_ili_df.plot( y = 'grippe_posts', color = 'C0')
ax1.set_ylabel("bsky post counts", color = 'C0')
ax1.set_xlabel("week start date")

ax2 = ax1.twinx()
post_count_ili_df.plot(y = 'ili_case', ax = ax2, color = "C1")
ax2.set_ylabel("ILI case count", color = 'C1')

:::

::: {.column}

### Correlation


In [None]:
#| output: true
post_count_ili_df[['grippe_posts', 'rest_posts', 'ili_case']].corr().round(3)

:::

::: 

## Normalized post counts

::: {.columns}

::: {.column}


In [None]:
#| output: true
ax1 = post_count_ili_df.plot( y = 'norm_post_count', color = 'C0')
ax1.set_ylabel("normalized bsky post counts", color = 'C0')
ax1.set_xlabel("week start date")

ax2 = ax1.twinx()
post_count_ili_df.plot(y = 'ili_incidence', ax = ax2, color = "C1")
ax2.set_ylabel("ILI incidence", color = 'C1')

:::

::: {.column}

### Correlation


In [None]:
#| output: true
post_count_ili_df[['norm_post_count', 'rest_posts', 'ili_case']].corr().round(3)

:::

::: 

**It is not as simple as that .... :/**
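
A same-week correlation can miss a signal that leads or trails the case counts. One way to probe this is to correlate the cases with the post signal shifted by a few steps. The following is a sketch on toy data. `lagged_correlation` is a hypothetical helper and the series are constructed so that the posts lead the cases by two steps.

```python
import pandas as pd


def lagged_correlation(signal: pd.Series, target: pd.Series, max_lag: int) -> pd.Series:
    """Pearson correlation of target with the signal shifted by 0..max_lag steps."""
    return pd.Series(
        {lag: target.corr(signal.shift(lag)) for lag in range(max_lag + 1)},
        name="correlation",
    )


# Toy data: the post signal is the case curve moved two steps earlier
cases = pd.Series([1.0, 2.0, 4.0, 8.0, 5.0, 3.0, 2.0, 1.0])
posts = cases.shift(-2)

corr_by_lag = lagged_correlation(posts, cases, max_lag=3)
best_lag = corr_by_lag.idxmax()  # lag with the highest correlation
```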

## LLM annotated post counts, raw

::: {.columns}

::: {.column}


In [None]:
#| output: true
ax1 = llm_ili_df.plot( y = 'post_count', color = 'C0')
ax1.set_ylabel("bsky post counts", color = 'C0')
ax1.set_xlabel("week start date")

ax2 = ax1.twinx()
llm_ili_df.plot(y = 'ili_case', ax = ax2, color = "C1")
ax2.set_ylabel("ILI case count", color = 'C1')

:::

::: {.column}

### Correlation


In [None]:
#| output: true
llm_ili_df.corr().round(3)

:::

::: 

## Bibliography {#refs}