# PubPulse laboratory

First, a little setup to use the database (output hidden, because it's noisy):

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import time

# Make the parent directory importable so project packages (e.g. mastodon_agent) resolve
sys.path.insert(0, os.path.dirname(os.getcwd()))

import numpy as np
import pandas as pd
import psycopg2
from IPython.display import HTML, display
from ipywidgets import IntProgress
from pgvector.psycopg2 import register_vector
from tqdm.notebook import trange, tqdm

pd.set_option('display.max_colwidth', 200)

In [3]:
from dotenv import load_dotenv
load_dotenv()

from mastodon_agent.config import config

config.debug = True
config.database_url = os.environ["DATABASE_URL"]
config.embeddings_api_url = 'http://127.0.0.1:8674/predictions/my_model'
config.ml_api_url = 'http://127.0.0.1:8673'

In [4]:
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine(config.database_url)

In [5]:
%reload_ext sql

In [6]:
import torch
import math
# True if the MPS backend can be used on this machine (requires macOS 12.3 or later)
print(torch.backends.mps.is_available())
# True if the current PyTorch installation was built with MPS support
print(torch.backends.mps.is_built())

True
True


How many statuses have we ingested so far?

In [7]:
%sql SELECT count(*) FROM statuses

1 rows affected.


count
160780


Let's take a look at the latest posts ingested:

In [8]:
%%sql
SELECT
    url,
    ingested_at,
    status->>'created_at' as created_at,
    status->'account'->>'acct' as acct
FROM statuses
ORDER BY ingested_at DESC
LIMIT 5;

 * postgresql://postgres:***@localhost:55432/example
5 rows affected.


url,ingested_at,created_at,acct
https://m.aqr.af/@roboaqraf/112333135265961838,2024-04-25 17:53:37.266352+00:00,2024-04-25T17:53:35+00:00,roboaqraf@m.aqr.af
https://pixelfed.social/p/rohitshetty/689169498187403488,2024-04-25 17:53:36.286362+00:00,2024-04-25T17:53:32+00:00,rohitshetty@pixelfed.social
https://newsie.social/@jasonbeets/112333135238884785,2024-04-25 17:53:36.102498+00:00,2024-04-25T17:53:35+00:00,jasonbeets@newsie.social
https://toot.kif.rocks/@ruhrscholz/112333135063897895,2024-04-25 17:53:36.090614+00:00,2024-04-25T17:53:32+00:00,ruhrscholz@kif.rocks
https://furry.engineer/@jackemled/112333135244971627,2024-04-25 17:53:35.907547+00:00,2024-04-25T17:53:35+00:00,jackemled@furry.engineer


Try fetching the latest posts using Python:

In [9]:
from collections import namedtuple

from sqlalchemy.sql import text

with engine.connect() as conn:
    stmt = text("""
        SELECT
            ingested_at,
            status->>'created_at' as created_at,
            url,
            status->'account'->>'acct' as acct,
            status->>'content' as content
        FROM statuses
        ORDER BY ingested_at DESC
        LIMIT 10;
    """)
    result = conn.execute(stmt)

    # Read the rows while the connection is still open
    Record = namedtuple('Record', result.keys())
    records = [Record(*r) for r in result.fetchall()]

texts = [r.content for r in records if r.content]

df = pd.DataFrame(records)
HTML(df.to_html(render_links=True, escape=False))

Unnamed: 0,ingested_at,created_at,url,acct,content
0,2024-04-25 17:53:37.266352+00:00,2024-04-25T17:53:35+00:00,https://m.aqr.af/@roboaqraf/112333135265961838,roboaqraf@m.aqr.af,さあ宇宙でエンジン再点火できてもできなくて泣いとる #bot
1,2024-04-25 17:53:36.286362+00:00,2024-04-25T17:53:32+00:00,https://pixelfed.social/p/rohitshetty/689169498187403488,rohitshetty@pixelfed.social,2023.11.26 \nPost Apocalyptic Halsoor. Networks down. Game Over. \n \nCamera: Pentax P30T \nLens: Pentax SMC fa 1:4-5.6 35-80mm \nExposed at: EI 400 \nAperture: f/16 \nShutterspeed: 1/30 \n \nFilm stock: Kentmere 400 \nDeveloper: D76 (1+1) for 12 minutes at 20C. \n \n#filmisnotdead #believeinfilm #kentmere400 #pentax #35mm #35mmdiary #bangalore #ulsoor
2,2024-04-25 17:53:36.102498+00:00,2024-04-25T17:53:35+00:00,https://newsie.social/@jasonbeets/112333135238884785,jasonbeets@newsie.social,"I find the concept of a ""non-disclosure agreement"" to be morally dubious. I would entertain making the practice illegal, or at least limit it to specific circumstances. Non-disclosure agreements to hide a politician's extra-marital affairs from the public should be made illegal, obviously. 4/x"
3,2024-04-25 17:53:36.090614+00:00,2024-04-25T17:53:32+00:00,https://toot.kif.rocks/@ruhrscholz/112333135063897895,ruhrscholz@kif.rocks,Just disabled video thumbnails on my Plex instance. Watching it (very slowly) delete around 400gb via NFS...
4,2024-04-25 17:53:35.907547+00:00,2024-04-25T17:53:35+00:00,https://furry.engineer/@jackemled/112333135244971627,jackemled@furry.engineer,A big company's Artificial Intelligence assistant will never be any match for its Natural Stupidity CEO.
5,2024-04-25 17:53:35.651891+00:00,2024-04-25T17:52:17+00:00,https://newsmast.social/@bbcworldnewsrss/112333130148494230,bbcworldnewsrss@newsmast.social,Kid Cudi cancels tour after breaking foot #breakingnews #news
6,2024-04-25 17:53:35.341476+00:00,2024-04-25T17:53:33.932000+00:00,https://possum.city/notes/9sj3wwrwimd601hv,leah@possum.city,meow
7,2024-04-25 17:53:34.303232+00:00,2024-04-25T17:53:30+00:00,https://social.edist.ro/@Outersider/112333134937505140,Outersider@social.edist.ro,"All Cats Are Beautiful, ACAB"
8,2024-04-25 17:53:34.147147+00:00,2024-04-25T17:53:29+00:00,https://mstdn.social/@silverscreenpod/112333134864003076,silverscreenpod@mstdn.social,"Going live NOW! D.K the all-knowing leads Mike, Tobi & Glenn through the virtual geek paradise of the Oasis, to review the ups, downs and MANY nerdy Easter Eggs in Spielberg's adaptation of Ernest Cline's novel 'Ready Player One'. Crank up ""Blue Monday"", dress like Buckaroo Banzai and join us in The Distracted Globe. Sixers are not invited! https://youtu.be/QNJrv7pR7Ck"
9,2024-04-25 17:53:33.203948+00:00,2024-04-25T17:53:32.954000+00:00,https://mastodon.social/@stigmabase/112333135082237284,stigmabase,"[BISL] Progressive' Rwanda is safe for gay migrants, UK minister Michael Tomlinson insists https://uk.pairsonnalites.org/2024/04/progressive-rwanda-is-safe-for-gay_0694055202.html?utm_source=dlvr.it&utm_medium=mastodon"


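Set up pgvector: enable the extension and (re)create a scratch `embeddings` table holding 384-dimensional vectors:

In [10]: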
import os
import psycopg2

conn = psycopg2.connect(config.database_url)

cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("DROP TABLE IF EXISTS embeddings")
cur.execute("""
CREATE TABLE IF NOT EXISTS embeddings(
    id INTEGER,
    url character varying NOT NULL UNIQUE,
    embedding vector(384)
)
""")

conn.commit()

Let's load up a local embedding model:

In [11]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Try comparing a few different ways to access an embedding model:

In [12]:
import requests 

chunks = [
    "Have you the like of pie!",
    "I like pie",
    "Etiam non feugiat sapien. Vestibulum accumsan elit massa, at volutpat augue lacinia lacinia.",
    "Lorem ipsum dolor sit amet consectetur adipiscing elit Aliquam mattis arcu sit amet ex convallis ac varius lacus vehicula",
]

local_api_resp = requests.post(
    f"{config.ml_api_url}/embeddings",
    json = { "inputs": chunks }
)
embeddings_from_local_api = local_api_resp.json()

response = requests.post(
    "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2",
    headers={"Authorization": f"Bearer {config.hf_token}"},
    json={
        "inputs": chunks,
        "options":{"wait_for_model":True}
    }
)
embeddings_from_hf = response.json()

embeddings_from_model = embedding_model.encode(chunks)

pd.DataFrame([
    embeddings_from_local_api[0],
    embeddings_from_hf[0],
    embeddings_from_model[0],
])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
0,-0.015373,0.033664,0.040362,0.013719,-0.059501,0.02957,0.069842,-0.041759,0.016168,-0.008661,...,0.114981,-0.006349,0.086057,0.03734,-9.5e-05,0.106242,0.134751,-0.037046,0.053696,-0.025273
1,-0.015373,0.033664,0.040362,0.013719,-0.059501,0.02957,0.069842,-0.041759,0.016168,-0.008661,...,0.114981,-0.006349,0.086057,0.03734,-9.5e-05,0.106242,0.134751,-0.037046,0.053696,-0.025273
2,-0.015373,0.033664,0.040362,0.013719,-0.059501,0.02957,0.069842,-0.041759,0.016168,-0.008661,...,0.114981,-0.006349,0.086057,0.03734,-9.5e-05,0.106242,0.134751,-0.037046,0.053696,-0.025273
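
The three rows above look identical to six decimal places. A minimal sketch of checking that agreement numerically rather than by eye follows. The helper names `max_abs_diff` and `all_sources_agree`, the tolerance, and the toy vectors are assumptions for illustration; in the notebook the inputs would be `embeddings_from_local_api[0]`, `embeddings_from_hf[0]` and `embeddings_from_model[0]`.

```python
from itertools import combinations


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def all_sources_agree(vectors_by_source, tol=1e-5):
    """Return True if every pair of sources differs by at most `tol` (assumed tolerance)."""
    return all(
        max_abs_diff(vectors_by_source[s1], vectors_by_source[s2]) <= tol
        for s1, s2 in combinations(vectors_by_source, 2)
    )


# Toy 3-dimensional stand-ins for the real 384-dimensional embeddings
toy = {
    "local_api": [-0.015373, 0.033664, 0.040362],
    "hf": [-0.015373, 0.033664, 0.040362],
    "model": [-0.0153731, 0.0336639, 0.0403621],
}
print(all_sources_agree(toy))
```

Small differences are expected when the same model runs on different hardware or serializes floats through JSON, so a tolerance-based check is usually more useful than exact equality.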


How many statuses have we ingested since the newest generated embedding?

In [13]:
%%sql
SELECT count(url)
FROM statuses
WHERE ingested_at > (SELECT created_at FROM status_embeddings ORDER BY created_at DESC LIMIT 1);

 * postgresql://postgres:***@localhost:55432/example
1 rows affected.


count
1197
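
One property of this "newer than the newest embedding" query is worth noting: if `status_embeddings` is empty, the subquery yields NULL and the comparison matches no rows, so nothing would be reported as pending. The sketch below illustrates this with an in-memory SQLite database rather than the notebook's PostgreSQL instance; the table contents and the `COALESCE` fallback are assumptions for illustration. In PostgreSQL a fallback such as `'-infinity'::timestamptz` could play the role of the empty string used here.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE statuses (url TEXT PRIMARY KEY, ingested_at TEXT);
    CREATE TABLE status_embeddings (url TEXT PRIMARY KEY, created_at TEXT);
    INSERT INTO statuses VALUES
        ('u1', '2024-04-25 17:00:00'),
        ('u2', '2024-04-25 17:30:00'),
        ('u3', '2024-04-25 17:50:00');
""")

# Same shape as the notebook query: compare against the newest embedding
PENDING = """
    SELECT count(url) FROM statuses
    WHERE ingested_at > (SELECT created_at FROM status_embeddings
                         ORDER BY created_at DESC LIMIT 1)
"""
# Hypothetical variant that falls back to "everything" when no embedding exists
PENDING_WITH_FALLBACK = """
    SELECT count(url) FROM statuses
    WHERE ingested_at > COALESCE(
        (SELECT max(created_at) FROM status_embeddings), '')
"""

empty_plain = db.execute(PENDING).fetchone()[0]
empty_fallback = db.execute(PENDING_WITH_FALLBACK).fetchone()[0]

# Once one embedding exists, the plain query counts only newer statuses
db.execute("INSERT INTO status_embeddings VALUES ('u1', '2024-04-25 17:10:00')")
after_plain = db.execute(PENDING).fetchone()[0]

print(empty_plain, empty_fallback, after_plain)  # → 0 3 2
```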


Catch up with embeddings - use our local ML API to generate embeddings for statuses newer than the newest embedding:

In [14]:
conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn)

cur = conn.cursor()
cur.execute("""
    SELECT
        url,
        status->>'content' as content
    FROM statuses
    WHERE ingested_at > (
        SELECT created_at
        FROM status_embeddings
        ORDER BY created_at DESC
        LIMIT 1
    )
    LIMIT 5000
""")

CHUNK_SIZE = 100
chunks = []

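def embed_statuses_chunk_2():
    global chunks
    if not chunks:
        # Nothing buffered, e.g. the final flush when no rows are left over
        return
    urls = [c[0] for c in chunks]

    api_resp = requests.post(
        f"{config.ml_api_url}/embeddings",
        json = {
            "inputs": [c[1] for c in chunks]
        }
    )
    # Fail loudly on an HTTP error instead of trying to parse an error body
    api_resp.raise_for_status()
    embeddings = api_resp.json()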

    chunks = []

    with conn:
        with conn.cursor() as cur:
            for idx in range(0, len(urls)):
                url = urls[idx]
                embedding = embeddings[idx]
                cur.execute(
                    """
                        INSERT INTO status_embeddings (url, embedding) VALUES (%s, %s)
                          ON CONFLICT (url) DO UPDATE SET embedding = EXCLUDED.embedding;            
                    """,
                    (url, embedding)
                )

for row in tqdm(cur, total=cur.rowcount):
    chunks.append((row[0], row[1]))
    if len(chunks) >= CHUNK_SIZE:    
        embed_statuses_chunk_2()

embed_statuses_chunk_2()

  0%|          | 0/1197 [00:00<?, ?it/s]
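
The cell above buffers rows in a global list and flushes every `CHUNK_SIZE` rows, plus once more at the end. A minimal sketch of the same batching pattern as a generator, which avoids the global state, is shown below. The name `batched` and the example rows are assumptions for illustration, not part of the notebook.

```python
def batched(rows, size):
    """Yield lists of at most `size` rows; the last list may be shorter."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:  # skip the trailing flush when nothing is left over
        yield batch


# Hypothetical (url, content) rows standing in for the cursor results
rows = [(f"https://example.invalid/{i}", f"status {i}") for i in range(250)]
sizes = [len(b) for b in batched(rows, 100)]
print(sizes)  # → [100, 100, 50]
```

With a helper like this, the loop body would reduce to embedding and upserting each yielded batch, and an empty result set simply produces no batches.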

Catch up with embeddings - use an in-kernel model to generate embeddings for statuses newer than the newest embedding:

In [15]:
conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn)

cur = conn.cursor()
cur.execute("""
    SELECT
        url,
        status->>'content' as content
    FROM statuses
    WHERE ingested_at > (
        SELECT created_at
        FROM status_embeddings
        ORDER BY created_at DESC
        LIMIT 1
    )
    LIMIT 5000
""")

CHUNK_SIZE = 100
chunks = []

def embed_statuses_chunk():
    global chunks
    if not chunks:
        # Nothing buffered, e.g. when the query returned no new statuses
        return
    urls = [c[0] for c in chunks]
    embeddings = embedding_model.encode([c[1] for c in chunks])
    chunks = []

    with conn:
        with conn.cursor() as cur:
            for idx in range(0, len(urls)):
                url = urls[idx]
                embedding = embeddings[idx]
                cur.execute(
                    """
                        INSERT INTO status_embeddings (url, embedding) VALUES (%s, %s)
                          ON CONFLICT (url) DO UPDATE SET embedding = EXCLUDED.embedding;            
                    """,
                    (url, embedding)
                )

for row in tqdm(cur, total=cur.rowcount):
    chunks.append((row[0], row[1]))
    if len(chunks) >= CHUNK_SIZE:    
        embed_statuses_chunk()

embed_statuses_chunk()

0it [00:00, ?it/s]
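
Both catch-up cells write with `INSERT ... ON CONFLICT (url) DO UPDATE`, so re-running them overwrites an existing embedding instead of failing on the unique `url`. The sketch below demonstrates that upsert behavior with an in-memory SQLite database, which supports the same clause. The table layout, the text-encoded vectors, and the use of `executemany` are assumptions for illustration; the notebook itself stores pgvector values in PostgreSQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE status_embeddings (url TEXT PRIMARY KEY, embedding TEXT)")

UPSERT = """
    INSERT INTO status_embeddings (url, embedding) VALUES (?, ?)
    ON CONFLICT (url) DO UPDATE SET embedding = excluded.embedding
"""

# First batch inserts two new rows; `with db` commits on success
with db:
    db.executemany(UPSERT, [("u1", "[0.1, 0.2]"), ("u2", "[0.3, 0.4]")])

# Second batch hits the existing url and replaces its embedding
with db:
    db.executemany(UPSERT, [("u1", "[0.9, 0.9]")])

rows = db.execute(
    "SELECT url, embedding FROM status_embeddings ORDER BY url"
).fetchall()
print(rows)  # → [('u1', '[0.9, 0.9]'), ('u2', '[0.3, 0.4]')]
```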

How many embeddings are stored now?

In [16]:
%sql SELECT count(embedding) FROM status_embeddings

 * postgresql://postgres:***@localhost:55432/example
1 rows affected.


count
134590


When was the newest embedding created?

In [19]:
%%sql
SELECT created_at
        FROM status_embeddings
        ORDER BY created_at DESC
        LIMIT 1

 * postgresql://postgres:***@localhost:55432/example
1 rows affected.


created_at
2024-04-25 17:55:36.248821+00:00


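Finally, a semantic search: embed a query phrase and fetch the 25 statuses from the last six hours whose embeddings are closest to it (the results are then listed by ingestion time, not by distance):

In [18]: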
embeddings = embedding_model.encode([
    """large language models can be used in peace and with goodwill"""
])


conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn)

cur = conn.cursor()
cur.execute(
    """
    SELECT
        ingested_at,
        url,
        status->'account'->>'acct' as acct,
        status->>'content' as content
    FROM statuses
    WHERE url in (
        SELECT url
        FROM status_embeddings
        WHERE created_at > now() - INTERVAL '6 hours'
        ORDER BY embedding <-> %s
        LIMIT 25
    )
    ORDER BY ingested_at DESC
    LIMIT 25
    """,
    (np.array(embeddings[0]),)
)
rows = cur.fetchall()

df = pd.DataFrame(rows, columns=("ingested_at", "url", "acct", "content"))
HTML(df.to_html(render_links=True, escape=False))

Unnamed: 0,ingested_at,url,acct,content
0,2024-04-25 17:51:30.070380+00:00,https://rss-parrot.net/u/lithub.com/status/1713448386088276336,lithub.com@rss-parrot.net,"More than a third of translators think they’ve already lost work to AI.lithub.com/more-than-a-third-of-translators-think-theyve-already-lost-work-to-aiThat’s according to a recently released survey by the Society of Authors, which heard from over 800 of their members about how they’re feeling about emergent technologies and their impact on their creative work. The Society, a UK-based trade organization…"
1,2024-04-25 17:40:48.147787+00:00,https://techpolicy.social/@rreisman/112333084866497713,rreisman@techpolicy.social,"@StanfordCyber New Logics for Governing Human Discourse in the Online Era - part of Freedom of Though Project @CIGIonline. How 1) user agency (Freedom of Impression), 2) restoring our traditional social mediation ecosystem, & 3) systems of social trust synergize. 2/3 https://www.cigionline.org/publications/new-logics-for-governing-human-discourse-in-the-online-era/"
2,2024-04-25 17:29:40.861965+00:00,https://toolsforthought.social/@boris/112333038790668691,boris@toolsforthought.social,A 2021 thread by @heyellieday on “proof of fidelity” instead of proof of human: we want non-human actors like squads or communities to be able to act as agents.  #identity https://x.com/heyellieday/status/1420666954067677187
3,2024-04-25 17:27:58.220852+00:00,https://newsmast.social/@southchinamorningpostrss/112333034304239821,southchinamorningpostrss@newsmast.social,"China’s hi-tech progress reshaping global politics as US and allies seek to build ‘balancing coalition’, study says #breakingnews #news"
4,2024-04-25 17:18:45.661961+00:00,https://newsmast.social/@politicsnewsrss/112332985953347757,politicsnewsrss@newsmast.social,Can we really trust AI to channel the public’s voice for ministers? | Seth Lazar #politics https://www.theguardian.com/commentisfree/2024/apr/25/ai-public-voice-ministers-large-language-model-chatgpt
5,2024-04-25 17:18:19.160385+00:00,https://www.threads.net/@victorsothervector/post/C6MUHAQRWqc,victorsothervector@threads.net,It's so good when you recommend a technical book & hear back on how useful it was! I suggested 'Natural Language Processing with Transformers' to an MLE & they said it's been so helpful in getting a firmer idea of these ML models! It's written by 3 Hugging Face folks (including @thomwolf) Definitely worth it in getting a better technical grasp of transformers in NLP! I found it super useful when it was first published in 2022 (I think there's a color version now) & it's still great today!
6,2024-04-25 17:06:16.102161+00:00,https://mastodon.social/@wearenew_public/112332949098258998,wearenew_public,"AI seems like it’s here to stay, but we must still help nurture online spaces as public resources for connection and shared knowledge.AI developers must protect the web from: ❌ Content pollution from Large Language Models ❌ AI generated misinformation ❌ Corporate manipulation of LLM training dataTo build a better digital sphere, we need to: ✅ support human moderators ✅ protect creative content and copyrights ✅ foster relationships between creators and audienceshttps://www.theatlantic.com/technology/archive/2024/04/generative-ai-search-llmo/678154/"
7,2024-04-25 17:03:21.525501+00:00,https://mastodon.social/@wearenew_public/112332937664921651,wearenew_public,"AI seems like it’s here to stay, but we must still help nurture online spaces as public resources for connection and shared knowledge.AI developers must protect the web from: ❌Content pollution from Large Language Models ❌AI generated misinformation ❌Corporate manipulation of LLM training dataTo build a better digital sphere, we need to: ✅ support human moderators ✅ protect creative content and copyrights ✅ foster relationships between creators and audienceshttps://www.theatlantic.com/technology/archive/2024/04/generative-ai-search-llmo/678154/"
8,2024-04-25 17:01:48.128779+00:00,https://mastodon.social/@wearenew_public/112332931590451810,wearenew_public,"AI seems like it’s here to stay, but we must still help nurture online spaces as public resources for connection and shared knowledge.AI developers must protect the web from: ❌Content pollution from Large Language Models ❌AI generated misinformation ❌Corporate manipulation of LLM training dataTo build a better digital sphere, we need to: ✅ support human moderators ✅ protect creative content and copyrights ✅ foster relationships between creators and audienceshttps://www.theatlantic.com/technology/archive/2024/04/generative-ai-search-llmo/678154/"
9,2024-04-25 17:01:31.851390+00:00,https://mastodon.social/@umasslinguistics/112332930510929320,umasslinguistics,Do you speak a ‘big’ global language? Here’s what my tiny language can teach you https://www.theguardian.com/commentisfree/2024/apr/24/language-speak-big-slovene-english-german
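
The search above ranks rows with pgvector's `<->` operator, which is Euclidean (L2) distance. A minimal pure-Python sketch of the same ranking on toy data follows. The function names and the two-dimensional vectors are assumptions for illustration; the real embeddings have 384 dimensions and are ranked inside PostgreSQL.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the quantity pgvector's <-> operator computes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, candidates, k):
    """Return the k (url, distance) pairs closest to `query`, nearest first."""
    scored = [(url, l2_distance(query, vec)) for url, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy unit-length vectors keyed by hypothetical URLs
candidates = {
    "url-a": [1.0, 0.0],
    "url-b": [0.0, 1.0],
    "url-c": [0.6, 0.8],
}
query = [0.8, 0.6]

for url, dist in nearest(query, candidates, 2):
    print(url, round(dist, 4))
```

When all vectors are unit-normalized, ordering by L2 distance gives the same ranking as ordering by cosine distance (pgvector's `<=>`), so the choice of operator matters mainly for unnormalized embeddings.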
