This repository contains analysis artifacts used to evaluate and audit the results reported in:
Ramos, R. M., Brom, P. C., Souza, J. G. M., Weigang, L., Di Oliveira, V., Reis, S. A., Salm Junior, J. F., Freitas, V., Kimura, H., Cajueiro, D. O., Luiz da Silva, G. and Celestino, V. R. R.
“Collective Intelligence with Large Language Models for the Review of Public Service Descriptions on Gov.br.”
DOI: 10.5220/0013831100003985
In: Proceedings of the 21st International Conference on Web Information Systems and Technologies (WEBIST 2025), pages 301–312
ISBN: 978-989-758-772-6; ISSN: 2184-3252
Proceedings Copyright © 2025 by SCITEPRESS – Science and Technology Publications, Lda.
The paper is published under the Creative Commons license CC BY-NC-ND 4.0 (see the publication venue for the official license terms).
The analyses in this repository focus on:
- Per-document named-entity preservation as a count-based outcome: for a document with denominator `m_d` (reference entities) and preserved entities `k_d`, the preservation proportion is `Y_d = k_d / m_d`.
- Worst-case sampling guarantees for audit sizing (Wilks/order-1 style “exposure” guarantees).
- Bayesian A/B distributional comparison between two extraction/review methods (paired by document), using a binomial likelihood with a paired hierarchical logistic-normal structure.
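The paired hierarchical structure can be illustrated with a small generative simulation. All hyperparameter values below are hypothetical placeholders chosen for illustration, not the fitted posterior:

```python
import numpy as np

rng = np.random.default_rng(42)
n_docs = 300

# Hypothetical logit-scale hyperparameters for methods A and B.
mu = np.array([1.0, 1.5])
sigma = np.array([0.8, 0.8])
rho = 0.6  # within-document correlation between the two methods
cov = np.array([[sigma[0]**2, rho * sigma[0] * sigma[1]],
                [rho * sigma[0] * sigma[1], sigma[1]**2]])

# Document-level logits, paired via a bivariate normal prior.
logits = rng.multivariate_normal(mu, cov, size=n_docs)
theta = 1.0 / (1.0 + np.exp(-logits))  # inverse-logit, per method

# Binomial observation model: k_d ~ Binomial(m_d, theta_d) per method.
m = rng.integers(5, 40, size=n_docs)   # denominators m_d
k = rng.binomial(m[:, None], theta)    # shape (n_docs, 2)
y = k / m[:, None]                     # preservation proportions Y_d
```

Simulating from the prior like this is also a convenient sanity check that the pairing induces correlated `Y_d` values across methods.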
- `inference.ipynb`
  Main Python notebook. It:
  - Loads the A/B comparison sheet `compare_results/entidades_comparacao_langextract_regex.xlsx`.
  - Converts ratio columns to discrete binomial counts `k/m` (using rational approximation with bounded denominators).
  - Fits a paired hierarchical binomial + bivariate logistic-normal model.
  - Reports distribution-level estimands (mean/quantile shifts, tail-risk, superiority probability) and additional distribution-difference metrics (e.g., KS, quantile-W1).
  - Runs Wilks / worst-case sampling checks, including “math vs. empirical” verification via Monte Carlo resampling on the observed data.
- `inference_prior_sample.R`
  R script for sample-size planning based on the paper’s audit methodology. It:
  - Reads `data/entidades_extraidas.xlsx` (entity totals by document).
  - Computes per-document denominators `m_d` and exports `data/m_per_document.csv`.
  - Generates planning tables:
    - `data/sample_size_plan_direct.csv` for direct assumptions on tail probabilities `p_tail = P(Y < gamma)`.
    - `data/sample_size_plan_theta.csv` mapping an entity-level preservation rate `theta` to a document-level tail risk via `Binomial(m, theta)`.
- `data/entidades_extraidas.xlsx`
  Per-document entity counts (reference denominators). Columns:
  - `id`: document identifier (matches the text filenames in `dataset/` and the A/B sheet in `compare_results/`).
  - Entity-type counts: `institutions`, `dates`, `deadlines`, `costs`, `locations`, `urls`, `emails`, `phones`, `laws`, `ceps`, `addresses`.
- `data/m_per_document.csv`
  Single-column file `m` with `m_d = sum(entity-type counts)` per document (exported by `inference_prior_sample.R`).
- `data/total.csv`
  Single-column file `total`. In this repo, it duplicates `data/m_per_document.csv` (same values, different column name).
- `data/sample_size_plan_direct.csv`
  Sample-size planning table for the direct tail-probability approach.
- `data/sample_size_plan_theta.csv`
  Sample-size planning table for the theta-to-tail mapping approach.
- `compare_results/entidades_comparacao_langextract_regex.xlsx`
  A/B evaluation sheet with 301 paired documents. Columns:
  - `id`: document identifier.
  - `langextract`: preservation ratio for method A.
  - `REGEX`: preservation ratio for method B.
- `compare_results/abtest_posterior_predictive.png`
  Exported plot with posterior predictive comparisons from the Bayesian A/B analysis.
- `dataset/original.zip`
  ZIP archive containing original public service descriptions as plain text files under `original/` (e.g., `original/10022.txt`).
- `dataset/newtext.zip`
  ZIP archive containing revised/new versions as plain text files under `newtext/` (e.g., `newtext/10022.txt`).
  The filenames (IDs) align with `data/entidades_extraidas.xlsx` and `compare_results/entidades_comparacao_langextract_regex.xlsx`.
- `ci_stats.Rproj`
  RStudio project file (convenience for running the R script).
- `.venv/`
  A local Python virtual environment directory (not required if you manage your own environment).
- `.Rproj.user/`, `.Rhistory`
  Local RStudio/R session artifacts (not required for reproducibility; may be machine/user specific).
- `.gitignore`
  Git ignore rules.
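The `m_d` export described above is just the row sum of the entity-type columns. A minimal pandas sketch, using toy rows in place of the real `data/entidades_extraidas.xlsx` contents:

```python
import pandas as pd

# Toy rows mimicking the column layout of data/entidades_extraidas.xlsx.
df = pd.DataFrame({
    "id": ["10022", "10023"],
    "institutions": [3, 1], "dates": [2, 0], "deadlines": [1, 2],
    "costs": [0, 1], "locations": [2, 2], "urls": [1, 0],
    "emails": [0, 1], "phones": [1, 1], "laws": [2, 0],
    "ceps": [0, 0], "addresses": [1, 1],
})

# m_d = sum of all entity-type counts (every column except `id`).
m = df.drop(columns="id").sum(axis=1)
m_df = m.to_frame(name="m")  # same single-column shape as data/m_per_document.csv
```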
Open the notebook in Jupyter (or VS Code) using a Python environment that provides at least:

- `pandas`
- `numpy`
- `scipy`
- `matplotlib`
- an Excel engine (e.g., `openpyxl`)

Then run cells top-to-bottom.
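The notebook's ratio-to-count conversion (recovering integer `k/m` from a stored preservation ratio) can be sketched with the standard library; the denominator bound of 50 here is an assumption for illustration, not necessarily the notebook's actual setting:

```python
from fractions import Fraction

def ratio_to_counts(y, max_den=50):
    """Recover integer counts (k, m) from a stored ratio y,
    using rational approximation with a bounded denominator."""
    frac = Fraction(y).limit_denominator(max_den)
    return frac.numerator, frac.denominator

# e.g. a preservation ratio of 0.9375 recovers k=15, m=16
k, m = ratio_to_counts(0.9375)
```

`limit_denominator` returns the closest fraction whose denominator does not exceed the bound, which also absorbs small floating-point noise in ratios stored in the spreadsheet.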
In R (or RStudio), install dependencies and run:

```r
install.packages("readxl")
source("inference_prior_sample.R")
```

Outputs are written to `data/` (see “Repository structure” above).
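Although the planning script is written in R, the theta-to-tail mapping behind `data/sample_size_plan_theta.csv` can be illustrated in Python. The idea: if `K ~ Binomial(m, theta)` and `Y = K/m`, then `P(Y < gamma)` is a binomial CDF evaluated just below `gamma * m` (the specific numbers below are illustrative):

```python
from math import ceil, comb

def tail_prob(m, theta, gamma):
    """P(Y < gamma) for Y = K/m with K ~ Binomial(m, theta)."""
    k_max = ceil(gamma * m) - 1  # K/m < gamma  <=>  K <= k_max
    return sum(comb(m, k) * theta**k * (1 - theta)**(m - k)
               for k in range(k_max + 1))

# e.g. with m=20 entities and per-entity preservation rate 0.95,
# the chance a document falls below gamma=0.8 is small:
p = tail_prob(m=20, theta=0.95, gamma=0.8)  # ~ 0.0026
```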
- Observation model (counts, not continuous scores): preservation is treated as `k_d ~ Binomial(m_d, theta_d)` rather than as a continuous score, preserving the denominator information and uncertainty.
- Paired hierarchical A/B: document-level heterogeneity and correlation are captured by a bivariate normal prior on the logit scale.
- Distribution-level conclusions: the notebook reports not only the mean lift but also quantile shifts and tail-risk reductions, plus distribution-difference metrics that summarize how the entire distribution moves between A and B.
- Worst-case sampling (Wilks): the notebook includes checks and planning formulas for selecting audit sizes that guarantee a high probability of observing rare but relevant events under i.i.d. sampling assumptions.
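A common first-order form of this worst-case bound: if each sampled document independently falls in the tail with probability `p_tail`, then `P(at least one tail event in n draws) = 1 - (1 - p_tail)^n`, so the smallest adequate audit size is `n = ceil(log(1 - conf) / log(1 - p_tail))`. A sketch (this may differ in detail from the repository's exact planning formulas):

```python
from math import ceil, log

def audit_size(p_tail, conf=0.95):
    """Smallest n with P(at least one tail event in n i.i.d. draws) >= conf,
    where each draw falls in the tail with probability p_tail."""
    return ceil(log(1 - conf) / log(1 - p_tail))

n = audit_size(p_tail=0.05, conf=0.95)  # -> 59
```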
- Repository code and files: licensed under the terms in `LICENSE` (Apache License 2.0), unless otherwise indicated.
- Paper: published under CC BY-NC-ND 4.0 (per the conference proceedings). This repository is an evaluation/audit companion to the paper; it does not change the publication’s licensing terms.
If you use this repository as part of your work, please cite the paper (DOI: 10.5220/0013831100003985).