Polis datasets are publicly available at [github.com/compdemocracy/openData](https://github.com/compdemocracy/openData). We download these datasets and read the CSV files with the [Polars](https://docs.pola.rs/) DataFrame library. With the data loaded in our Python environment, we use [Sentence Transformers](https://www.sbert.net/) to compute an embedding for each comment and store the embeddings alongside the original data in the DataFrame. Finally, we save the DataFrame to a Parquet file for further analysis.


## Import Packages and Set Up Environment


In [1]:
%load_ext autoreload
%autoreload 2

import sys
import os
import polars as pl

from argmap.dataModel import Summary, Comments

from dotenv import load_dotenv

load_dotenv()

os.getenv("EMBED_MODEL_ID")

'answerdotai/ModernBERT-base'

## Initialize Embedding Model

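The embedding model is selected with the `EMBED_MODEL_ID` environment variable; the run recorded in this notebook uses `answerdotai/ModernBERT-base`. We also considered the following models from the [Hugging Face Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard):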

- [intfloat/e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) - 4096 dimensions, requires 14.5 GB RAM
- [WhereIsAI/UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) - 1024 dimensions, requires 1.5 GB RAM
- [Salesforce/SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral) - consistently scores near the top of the leaderboard, untested here
- [OpenAI/text-embedding-3-large](https://openai.com/blog/new-embedding-models-and-api-updates) - hosted by OpenAI, not open source
- [OpenAI/text-embedding-ada-002](https://openai.com/blog/new-and-improved-embedding-model) - hosted by OpenAI, not open source
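
The dimension counts above translate directly into storage cost once embeddings are saved alongside the comments. The sketch below works through that arithmetic. It assumes 4-byte float32 values (the usual default output of Sentence Transformers' `encode`) and a hypothetical corpus of 1,000 comments. The helper name `embedding_bytes` is made up for illustration, and the figures ignore Parquet compression and metadata.

```python
FLOAT32_BYTES = 4  # assumption: embeddings are stored as 32-bit floats


def embedding_bytes(num_comments: int, dimensions: int) -> int:
    """Raw size in bytes of num_comments embeddings of the given width."""
    return num_comments * dimensions * FLOAT32_BYTES


NUM_COMMENTS = 1_000  # hypothetical corpus size, for illustration only

for name, dims in [
    ("intfloat/e5-mistral-7b-instruct", 4096),
    ("WhereIsAI/UAE-Large-V1", 1024),
]:
    mib = embedding_bytes(NUM_COMMENTS, dims) / 2**20
    print(f"{name}: {dims} dimensions -> {mib:.1f} MiB per {NUM_COMMENTS} comments")
```

Under these assumptions, a 4096-dimension model needs four times the storage of a 1024-dimension one for the same comments.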


### Verify GPU Availability


In [2]:
from argmap.helpers import printTorchDeviceVersion

printTorchDeviceVersion()

Python: 3.11.11 | packaged by conda-forge | (main, Dec  5 2024, 14:21:42) [Clang 18.1.8 ]
PyTorch: 2.5.1
No CUDA support. Using CPU.
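
`printTorchDeviceVersion` comes from the project's own helpers, and its implementation is not shown here. As a rough idea of how such a check can be written, the sketch below picks a device with PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. The function name `pick_device` is hypothetical, and it falls back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate, so report CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(f"Selected device: {pick_device()}")
```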


### Load Model


In [4]:
from sentence_transformers import SentenceTransformer
from argmap.helpers import ensureCUDAMemory, printCUDAMemory, loadEmbeddingModel

if os.getenv("EMBED_MODEL_ID") is None:
    print("EMBED_MODEL_ID environment variable is required.")
    sys.exit(3)

model = loadEmbeddingModel()

No sentence-transformers model found with name answerdotai/ModernBERT-base. Creating a new one with mean pooling.


2024-12-26 17:45:51.968672 Initializing embedding model: answerdotai/ModernBERT-base on cpu...


Running on CPU - no CUDA memory to report


2024-12-26 17:45:53.117035 Embedding model initialized.
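
The first log line above indicates that the checkpoint has no Sentence Transformers pooling configuration, so the library wraps the transformer with mean pooling: token vectors are averaged, skipping padding, to give one vector per comment. Here is a minimal pure-Python sketch of that idea. It uses toy numbers and a hypothetical `mean_pool` helper, not the library's actual implementation.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dims = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


# Toy example: three real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [3.0, 4.0]
```

Masking matters because shorter comments in a batch are padded to the length of the longest one. Without the mask, the padding vectors would pull the average toward zero.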


## Calculate and Store Embeddings


In [5]:
def calculate_embeddings(comments, model, show_progress_bar=False):
    """Encode the text of every comment and return one embedding per comment."""
    documents = comments.df.get_column("commentText").to_list()
    embeddings = model.encode(documents, show_progress_bar=show_progress_bar)
    return embeddings

In [7]:
from argmap.dataModel import Summary, Comments

DATASETS = [
    "american-assembly.bowling-green",
    "march-on.operation-marchin-orders",
    "scoop-hivemind.biodiversity",
    "scoop-hivemind.freshwater",
    "scoop-hivemind.taxes",
    "scoop-hivemind.ubi",
    "scoop-hivemind.affordable-housing",
    "ssis.land-bank-farmland.2rumnecbeh.2021-08-01",
]

for dataset in DATASETS:

    summary = Summary(dataset)
    comments = Comments(dataset)

    if os.path.exists(comments.filename):
        comments.load_from_parquet()
        print(f"{dataset}: Loaded {comments.df.height} comments from Parquet DataFrame.")
    else:
        comments.load_from_csv()
        print(f"{dataset}: Loaded {comments.df.height} comments from original dataset CSV.")

    print(f"Topic: {summary.get('topic')}")

    embeddings = calculate_embeddings(comments, model, show_progress_bar=True)
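    # EMBED_MODEL_ID is not defined as a Python variable in the cells shown, so read it from the environment.
    comments.addColumns(pl.Series(embeddings).alias(f"embedding-{os.getenv('EMBED_MODEL_ID')}"))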
    comments.save_to_parquet()
    print(f"{dataset}: Saved {comments.df.height} comments with embeddings to Parquet DataFrame.")
    print()

american-assembly.bowling-green: Loaded 896 comments from original dataset CSV.
Topic: Improving Bowling Green / Warren County


Batches:   0%|          | 0/28 [00:00<?, ?it/s]

american-assembly.bowling-green: Saved 896 comments with embeddings to Parquet DataFrame.

march-on.operation-marchin-orders: Loaded 2162 comments from original dataset CSV.
Topic: Operation Marching Orders


Batches:   0%|          | 0/68 [00:00<?, ?it/s]

march-on.operation-marchin-orders: Saved 2162 comments with embeddings to Parquet DataFrame.

scoop-hivemind.biodiversity: Loaded 316 comments from original dataset CSV.
Topic: Protecting and Restoring NZ’s Biodiversity


Batches:   0%|          | 0/10 [00:00<?, ?it/s]

scoop-hivemind.biodiversity: Saved 316 comments with embeddings to Parquet DataFrame.

scoop-hivemind.freshwater: Loaded 80 comments from original dataset CSV.
Topic: HiveMind - Freshwater Quality in NZ


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

scoop-hivemind.freshwater: Saved 80 comments with embeddings to Parquet DataFrame.

scoop-hivemind.taxes: Loaded 148 comments from original dataset CSV.
Topic: Tax HiveMind Window


Batches:   0%|          | 0/5 [00:00<?, ?it/s]

scoop-hivemind.taxes: Saved 148 comments with embeddings to Parquet DataFrame.

scoop-hivemind.ubi: Loaded 71 comments from original dataset CSV.
Topic: A Universal Basic Income for Aotearoa NZ?


Batches:   0%|          | 0/3 [00:00<?, ?it/s]

scoop-hivemind.ubi: Saved 71 comments with embeddings to Parquet DataFrame.

scoop-hivemind.affordable-housing: Loaded 165 comments from original dataset CSV.
Topic: ScoopNZ Hivemind on affordable housing


Batches:   0%|          | 0/6 [00:00<?, ?it/s]

scoop-hivemind.affordable-housing: Saved 165 comments with embeddings to Parquet DataFrame.

ssis.land-bank-farmland.2rumnecbeh.2021-08-01: Loaded 297 comments from original dataset CSV.
Topic: JOIN THE DISCUSSION BELOW: Land use and conservation in the San Juan Islands


Batches:   0%|          | 0/10 [00:00<?, ?it/s]

ssis.land-bank-farmland.2rumnecbeh.2021-08-01: Saved 297 comments with embeddings to Parquet DataFrame.
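
The batch totals in the progress bars above are consistent with the default `batch_size` of 32 in `encode`: each total equals the comment count divided by 32, rounded up. The check below reproduces that arithmetic from the counts printed above. It assumes the default batch size was in effect, since the call in `calculate_embeddings` does not override it.

```python
import math

BATCH_SIZE = 32  # assumption: the SentenceTransformer.encode default was used

# dataset -> (comments loaded, batch total shown in the progress bar)
observed = {
    "american-assembly.bowling-green": (896, 28),
    "march-on.operation-marchin-orders": (2162, 68),
    "scoop-hivemind.biodiversity": (316, 10),
    "scoop-hivemind.freshwater": (80, 3),
    "scoop-hivemind.taxes": (148, 5),
    "scoop-hivemind.ubi": (71, 3),
    "scoop-hivemind.affordable-housing": (165, 6),
    "ssis.land-bank-farmland.2rumnecbeh.2021-08-01": (297, 10),
}

for dataset, (num_comments, shown) in observed.items():
    expected = math.ceil(num_comments / BATCH_SIZE)
    status = "matches" if expected == shown else "differs from"
    print(f"{dataset}: {num_comments} comments -> {expected} batches ({status} the progress bar)")
```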

