# PubPulse laboratory

First, a little setup to use the database (output hidden, because it's noisy):

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
import os, sys
sys.path.insert(0, os.path.dirname(os.getcwd()))

from ipywidgets import IntProgress
from IPython.display import display, HTML
import time
import numpy as np
import pandas as pd
import psycopg2
from pgvector.psycopg2 import register_vector
from tqdm.notebook import trange, tqdm

pd.set_option('display.max_colwidth', 200)

In [6]:
from dotenv import load_dotenv
load_dotenv()

from mastodon_agent.config import config

config.debug = True
config.database_url = os.environ["DATABASE_URL"]
config.embeddings_api_url = 'http://127.0.0.1:8674/predictions/my_model'
config.ml_api_url = 'http://127.0.0.1:8673'

print(config)

Config(log_level='INFO', log_filename='', log_maxbytes=100000, log_backup_count=10, debug=True, database_url='postgresql://postgres:8675309jenny@localhost:55432/example', api_base_url='https://mastodon.social', client_key='j9do9mn6qeCGEl33-NVKhKoDzSL_nSv2aDhpKf--tH8', client_secret='k6sWxLw7MRCkUcZtvqkCo87MrCNW2dL7crov5zodKuc', access_token='nBPE0k-ZLVvz3YA8595LT7XDRXvVPo5o7xIq9X8Q8Tc', hf_token='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', embeddings_api_url='http://127.0.0.1:8674/predictions/my_model', ml_api_url='http://127.0.0.1:8673', user_agent='PubPulse 0.1', debug_requests=False)


In [7]:
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine(config.database_url)

In [8]:
%reload_ext sql

In [9]:
import torch
import math
# True when MPS can be used (requires macOS 12.3+ and an MPS-enabled device)
print(torch.backends.mps.is_available())
# True when the current PyTorch installation was built with MPS support
print(torch.backends.mps.is_built())

True
True


How many statuses have we ingested so far?

In [10]:
%sql SELECT count(*) FROM statuses

1 rows affected.


count
149402


Let's take a look at the latest posts ingested:

In [11]:
%%sql
SELECT
    url,
    ingested_at,
    status->>'created_at' as created_at,
    status->'account'->>'acct' as acct
FROM statuses
ORDER BY ingested_at DESC
LIMIT 5;

 * postgresql://postgres:***@localhost:55432/example
5 rows affected.


url,ingested_at,created_at,acct
https://mastodon.social/@channelnewsasia/112332829161852577,2024-04-25 16:35:45.191479+00:00,2024-04-25T16:35:44.976000+00:00,channelnewsasia
https://twitter.com/JapersRink/status/1783535058994831388,2024-04-25 16:35:44.884070+00:00,2024-04-25T16:34:29+00:00,JapersRink@sportsbots.xyz
https://misskey.art/notes/9sj14rmz7x,2024-04-25 16:35:42.600844+00:00,2024-04-25T16:35:41.675000+00:00,mutsu_karita@misskey.art
https://mstdn.ca/@northernarlo/112332828797795010,2024-04-25 16:35:42.110228+00:00,2024-04-25T16:35:39+00:00,northernarlo@mstdn.ca
https://mastodon.online/@bloomberg/112332828896834715,2024-04-25 16:35:41.835075+00:00,2024-04-25T16:35:40+00:00,bloomberg@mastodon.online


Try fetching the latest posts using Python:

In [16]:
from sqlalchemy.sql import text

with engine.connect() as conn:
    stmt = text("""
        SELECT
            ingested_at,
            status->>'created_at' as created_at,
            url,
            status->'account'->>'acct' as acct,
            status->>'content' as content
        FROM statuses
        ORDER BY ingested_at DESC
        LIMIT 10;
    """)
    result = conn.execute(stmt)
    # Read the column names and rows while the connection is still open
    keys = list(result.keys())
    rows = result.fetchall()

from collections import namedtuple

Record = namedtuple('Record', keys)
records = [Record(*r) for r in rows]

texts = [r.content for r in records if r.content]

df = pd.DataFrame(records)
HTML(df.to_html(render_links=True, escape=False))

Unnamed: 0,ingested_at,created_at,url,acct,content
0,2024-04-25 16:37:34.043317+00:00,2024-04-25T16:37:26+00:00,https://social.tulsa.ok.us/@blogoklahoma/112332835811049421,blogoklahoma@social.tulsa.ok.us,"I'm almost tempted to ignore today's news and just leave running in the background either the Doctor Who Classic, The Carol Burnett Show, or MLB (classic games) channels. Ha!"
1,2024-04-25 16:37:33.451185+00:00,2024-04-25T16:37:29+00:00,https://mstdn.social/@kingu/112332835987557643,kingu@mstdn.social,#NowPlaying https://www.youtube.com/watch?v=zHXdZ3WqGI0
2,2024-04-25 16:37:32.982260+00:00,2024-04-25T16:37:28+00:00,https://monads.online/@velexiraptor/112332835937453491,velexiraptor@monads.online,"during its mobile adolescence, the maggotgun aggressively hunts organic prey and scavenges extant corpses to store fat in its legs for its transition to adulthood. you can identify a successful maggotgun by the number of shed exoskeletal casings in its nest - the more there are, the more its grown, and the more stable it remains during the recoil from its peristaltic salvos"
3,2024-04-25 16:37:32.588059+00:00,2024-04-25T16:37:31+00:00,https://rubber.social/@PsiDrone298/112332836160577584,PsiDrone298@rubber.social,Might splurge and have some small air bubbles slowly rising in it.
4,2024-04-25 16:37:32.123644+00:00,2024-04-25T16:36:15+00:00,https://twitter.com/NERevolution/status/1783535505805647915,NERevolution@sportsbots.xyz,"On this week's episode of Revolution All In, we hear from Dylan Borrero on his road to recovery. It's been a full year since Dylan suffered the injury that sidelined him on April 29, 2023. He's fought through pain and frustration to find motivation and gratitude. 💙❤️🎥⬇️"
5,2024-04-25 16:37:31.733640+00:00,2024-04-25T16:37:27+00:00,https://pixelfed.social/p/chrisjohnsen/689150353076035308,chrisjohnsen@pixelfed.social,"The bodhi tree at Sumathipala Nahimi Meditation Center in Kanduboda, Sri Lanka. \n \n#bodhitree #sumathipala #sumathipalanahimi #sumatipala #kanduboda #delgoda #srilanka"
6,2024-04-25 16:37:30.933938+00:00,2024-04-25T16:37:30+00:00,https://octodon.social/@kaye/112332836068539662,kaye@octodon.social,"I'm no prescriptivist, but ""i18n"" and ""a11y"" (for ""internationalisation"" and ""accessibility"") cause me to go into fight-or-flight mode"
7,2024-04-25 16:37:30.488923+00:00,2024-04-25T16:37:07+00:00,https://twitter.com/JasonLaCanfora/status/1783535720939856011,JasonLaCanfora@sportsbots.xyz,Interested to see the results https://x.com/IA1057TheFan/s…
8,2024-04-25 16:37:30.066951+00:00,2024-04-25T16:35:58+00:00,https://twitter.com/hayyyshayyy/status/1783535432052940894,hayyyshayyy@sportsbots.xyz,love this for Okposo https://x.com/ColbyDGuy/stat…
9,2024-04-25 16:37:29.942097+00:00,2024-04-25T16:37:28+00:00,https://botsin.space/@JohnMastodon/112332835913907587,JohnMastodon@botsin.space,"Please consider these lines from my diary:John Mastodon was dining in the Arctic. An old man questioned ""Where does a tomato begin, and where does it end?"" John whispered: ""It is you, Grasshopper.""Ignore the words of #JohnMastodon at your peril."


In [10]:
import os
import psycopg2

conn = psycopg2.connect(config.database_url)

cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("DROP TABLE IF EXISTS embeddings")
cur.execute("""
CREATE TABLE IF NOT EXISTS embeddings(
    id INTEGER,
    url character varying NOT NULL UNIQUE,
    embedding vector(384)
)
""")

conn.commit()

Let's load up a local embedding model:

In [19]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Try comparing a few different ways to access an embedding model:

In [24]:
import requests 

chunks = [
    "Have you the like of pie!",
    "I like pie",
    "Etiam non feugiat sapien. Vestibulum accumsan elit massa, at volutpat augue lacinia lacinia.",
    "Lorem ipsum dolor sit amet consectetur adipiscing elit Aliquam mattis arcu sit amet ex convallis ac varius lacus vehicula",
]

local_api_resp = requests.post(
    f"{config.ml_api_url}/embeddings",
    json = { "inputs": chunks }
)
embeddings_from_local_api = local_api_resp.json()

response = requests.post(
    f"https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2",
    headers={"Authorization": f"Bearer {config.hf_token}"},
    json={
        "inputs": chunks,
        "options":{"wait_for_model":True}
    }
)
embeddings_from_hf = response.json()

embeddings_from_model = embedding_model.encode(chunks)

pd.DataFrame([
    embeddings_from_local_api[0],
    embeddings_from_hf[0],
    embeddings_from_model[0],
])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
0,-0.025914,0.034922,0.04454,0.021113,-0.057222,0.014686,0.104046,-0.021375,0.039638,-0.018193,...,0.106572,-0.009632,0.052651,0.043245,0.008505,0.077057,0.121036,-0.018788,0.062502,-0.046634
1,-0.015373,0.033664,0.040362,0.013719,-0.059501,0.02957,0.069842,-0.041759,0.016168,-0.008661,...,0.114981,-0.006349,0.086057,0.03734,-9.5e-05,0.106242,0.134751,-0.037046,0.053696,-0.025273
2,-0.015373,0.033664,0.040362,0.013719,-0.059501,0.02957,0.069842,-0.041759,0.016168,-0.008661,...,0.114981,-0.006349,0.086057,0.03734,-9.5e-05,0.106242,0.134751,-0.037046,0.053696,-0.025273
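
In the columns shown above, the Hugging Face API and the in-kernel model return identical values, while the local ML API differs slightly. One way to quantify that difference is the cosine similarity between rows. The following is a minimal sketch using only the standard library; `cosine_similarity` is a hypothetical helper (not part of PubPulse), and the toy inputs are just the first three components of each row from the table, so the numbers it prints are not the similarity of the full 384-dimensional vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy input: the first three components of each row in the table above.
local_api = [-0.025914, 0.034922, 0.04454]
hf_api = [-0.015373, 0.033664, 0.040362]
in_kernel = [-0.015373, 0.033664, 0.040362]

# Identical vectors score (numerically very close to) 1.0.
print(cosine_similarity(hf_api, in_kernel))
# Slightly different vectors score a little below 1.0.
print(cosine_similarity(local_api, hf_api))
```

In the notebook itself, the same function could be applied to `embeddings_from_local_api[0]`, `embeddings_from_hf[0]`, and `embeddings_from_model[0]` to compare the full vectors.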


How many statuses have we ingested since the newest embedding was generated?

In [57]:
%%sql
SELECT count(url)
FROM statuses
WHERE ingested_at > (SELECT created_at FROM status_embeddings ORDER BY created_at DESC LIMIT 1);

 * postgresql://postgres:***@localhost:55432/example
1 rows affected.


count
194


Catch up with embeddings - use an in-kernel model to generate embeddings for statuses newer than the newest embedding:

In [112]:
conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn)

cur = conn.cursor()
cur.execute("""
    SELECT
        url,
        status->>'content' as content
    FROM statuses
    WHERE ingested_at > (
        SELECT created_at
        FROM status_embeddings
        ORDER BY created_at DESC
        LIMIT 1
    )
    LIMIT 5000
""")

CHUNK_SIZE = 100
chunks = []

def embed_statuses_chunk():
    global chunks
    urls = [c[0] for c in chunks]
    embeddings = embedding_model.encode([c[1] for c in chunks])
    chunks = []

    with conn:
        with conn.cursor() as cur:
            for idx in range(0, len(urls)):
                url = urls[idx]
                embedding = embeddings[idx]
                cur.execute(
                    """
                        INSERT INTO status_embeddings (url, embedding) VALUES (%s, %s)
                          ON CONFLICT (url) DO UPDATE SET embedding = EXCLUDED.embedding;            
                    """,
                    (url, embedding)
                )

for row in tqdm(cur, total=cur.rowcount):
    chunks.append((row[0], row[1]))
    if len(chunks) >= CHUNK_SIZE:    
        embed_statuses_chunk()

embed_statuses_chunk()

  0%|          | 0/50000 [00:00<?, ?it/s]
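
The cell above accumulates rows in a module-level `chunks` list, flushes it every `CHUNK_SIZE` rows, and calls the function once more for the remainder. The same batching pattern can be sketched without a global by using a generator. This is an illustration only, not the notebook's code: `batched_rows` is a hypothetical helper and the rows are toy data standing in for a database cursor.

```python
from itertools import islice

def batched_rows(rows, size):
    """Yield lists of up to `size` items from any iterable, including a final short batch."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Toy stand-in for (url, content) rows coming from a database cursor.
rows = [(f"https://example.test/{i}", f"post {i}") for i in range(7)]

batches = list(batched_rows(rows, 3))
for batch in batches:
    urls = [url for url, _ in batch]
    contents = [content for _, content in batch]
    # In the notebook, this is where the contents would be embedded and upserted.
    print(len(batch), urls[0])
```

Because the generator never yields an empty batch, the final flush needs no special case, and an empty result set simply produces no work.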

Catch up with embeddings - use our local ML API to generate embeddings for statuses newer than the newest embedding:

In [55]:
conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn)

cur = conn.cursor()
cur.execute("""
    SELECT
        url,
        status->>'content' as content
    FROM statuses
    WHERE ingested_at > (
        SELECT created_at
        FROM status_embeddings
        ORDER BY created_at DESC
        LIMIT 1
    )
    LIMIT 5000
""")

CHUNK_SIZE = 100
chunks = []

def embed_statuses_chunk_2():
    global chunks
    urls = [c[0] for c in chunks]

    api_resp = requests.post(
        f"{config.ml_api_url}/embeddings",
        json = {
            "inputs": [c[1] for c in chunks]
        }
    )
    embeddings = api_resp.json()

    chunks = []

    with conn:
        with conn.cursor() as cur:
            for idx in range(0, len(urls)):
                url = urls[idx]
                embedding = embeddings[idx]
                cur.execute(
                    """
                        INSERT INTO status_embeddings (url, embedding) VALUES (%s, %s)
                          ON CONFLICT (url) DO UPDATE SET embedding = EXCLUDED.embedding;            
                    """,
                    (url, embedding)
                )

for row in tqdm(cur, total=cur.rowcount):
    chunks.append((row[0], row[1]))
    if len(chunks) >= CHUNK_SIZE:    
        embed_statuses_chunk_2()

embed_statuses_chunk_2()

  0%|          | 0/7 [00:00<?, ?it/s]

In [56]:
%sql SELECT count(embedding) FROM status_embeddings

 * postgresql://postgres:***@localhost:55432/example
1 rows affected.


count
130964


In [44]:
embeddings = embedding_model.encode([
    """video games are not worth playing"""
])


conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn)

cur = conn.cursor()
cur.execute(
    """
    SELECT
        ingested_at,
        url,
        status->'account'->>'acct' as acct,
        status->>'content' as content
    FROM statuses
    WHERE url in (
        SELECT url
        FROM status_embeddings
        WHERE created_at > now() - INTERVAL '6 hours'
        ORDER BY embedding <-> %s
        LIMIT 25
    )
    ORDER BY ingested_at DESC
    LIMIT 25
    """,
    (np.array(embeddings[0]),)
)
rows = cur.fetchall()

df = pd.DataFrame(rows, columns=("ingested_at", "url", "acct", "content"))
HTML(df.to_html(render_links=True, escape=False))

Unnamed: 0,ingested_at,url,acct,content
0,2024-04-25 14:30:14.479216+00:00,https://rss-parrot.net/u/setsideb.com/status/1713448386088271194,setsideb.com@rss-parrot.net,"Thrilling Tales of Old Video Games, on Princess Peach Showtimesetsideb.com/thrilling-tales-of-old-video-games-on-princess-peach-showtimeThe article notes how few games in Nintendo’s many series star Peach. There’s really only been one headline game for her before, 2006’s Super Princess Peach, which was really easy. Showtime isn’t bad, but the article notes it’s more like a collection of…"
1,2024-04-25 14:30:05.989287+00:00,https://mastodon.social/@NoSoloBot/112332335066480205,NoSoloBot,"Theatrhythm Final Fantasy🏢 indieszero Co., Ltd., Square Enix 1st Production Department 📅 2012 🖥 3DS, iOS#videogames"
2,2024-04-25 14:28:13.935002+00:00,https://transfem.social/notes/9siwktb9ne6h0lap,akioogaki@transfem.social,Gaming before work at my friend's house!
3,2024-04-25 14:09:50.299168+00:00,https://labyrinth.zone/objects/2e3cb38d-c697-47dc-8da1-3a64bff8d745,slipoke@labyrinth.zone,"what’s up i’m slipoke (or just slip). i’m a wetdry world import and this is my first akkoma instance. i’ve been on fedi for about a year (give or take)i play Bungie games and flightsims and counter-strike and like every racing game on the PSXim an enby with a Dream to be the World’s First Federated Cognitohazard.i write code (C/C++ and sometimes Rust/Python) and fuck with computers a lot. i like messing with embedded/generally weak/locked down machines and getting them to do evil things. they are neat i make music quite a lot and sometimes even post it herewhat you can expect:lots of posting about games you’ve never considered playing that you will pretend you didn’t see on your tllots of posting about me breaking things I ownlots of posting about me attempting to fix the things i ownlots of posting about me getting mad at the people who make the things i ownrandom thoughts about things i think are neategregious, evil, heinous shitposting. the terrible kind. oh my god. you cant be prepared for the slopgood luck :Cat_girls_Emoji_022:"
4,2024-04-25 14:08:20.698019+00:00,https://lemmy.ml/post/14880667,gary_host_laptop@lemmy.ml,Torn - Online RPG game - Free text based gamehttps://lemmy.ml/post/14880667
5,2024-04-25 14:01:07.054568+00:00,https://wrestling.social/@setsideb/112332217024431357,setsideb@wrestling.social,"Thrilling Tales of Old Video Games, on Princess Peach Showtime The article notes how few games in Nintendo's many series star Peach. There's really only been one headline game for her before, 2006's Super Princess Peach, which was really easy. Showtime isn't bad, but the article notes it's https://setsideb.com/thrilling-tales-of-old-video-games-on-princess-peach-showtime/ #niche #lisasimpson #niche #nintendo #peach #princesspeachshowtime #superprincesspeach #thrillingtalesofoldvideogames"
6,2024-04-25 12:29:20.327727+00:00,https://kind.social/@Bright5park/112331860112753927,Bright5park@kind.social,"In Gaming News, Nintendo has decided to retroactively declare 2024 as the ""Year of our Legal Team"". To celebrate, that beloved developer has decided to take action against user-generated content featuring Nintendo's Intellectual Property in the game Garry's Mod after 20 years. (https://store.steampowered.com/news/app/4000/view/4200245595694413052?l=english) When asked about the delay, Nintendo's Legal Team commented that they were preoccupied with trying to find a non-lethal way to remove illegal copys of Nintendo games from their users' memories."
7,2024-04-25 12:29:14.770712+00:00,https://mstdn.games/@OutofPrintArchive/112331859534229796,OutofPrintArchive@mstdn.games,Review for R-Type 2 on Game Boy from CVG 134 - January 1993 (UK)This magazine can be downloaded here: https://www.outofprintarchive.com/catalogue/computerandvideogames7.html#retrogaming #Nintendo #gameboy
8,2024-04-25 11:34:31.319563+00:00,https://mastodon.social/@Theeo123/112331644633019679,Theeo123,"https://www.howtogeek.com/best-linux-game-stores-for-native-linux-games/Here, How-To-Geek provides a list of several stores that sell Linux Native games. Some of these you probably already know about. Others may be new to you. You can click through for more info on each of the following:- Steam - GOG (Galaxy of Games) - Humble Bundle - itch.io - Game Jolt - ArchWiki's ""List of Games"" Page - Flathub - Snap Store #Linux #Gaming #LinuxGaming #GamingonLinux #VideoGames"
9,2024-04-25 11:33:36.607087+00:00,https://kind.social/@Marmalade/112331641076974982,Marmalade@kind.social,"i haven't coded in a literal decade and wouldn't know where to starti have wants for the game that contradict each other so what i feel i want is impossible, that's disappointing i'm going to visit a place tomorrow that might be able to help with thisa place for people unable to work but to still have somewhere to be, and they deal in a media i have a fear of actually speaking about my game thoughts aloud i've noticed so we'll see"
