In [1]:
library(jsonlite)
library(dsl)

dsl v0.1.0 successfully loaded. See ?dsl for help. Note this is an early alpha release and backwards compatability may not be maintained.



# Introduction

Let:
* $N$ be the total number of samples in the dataset
* $n$ be the number of samples selected for expert annotation
* $M_{true}$ be a logistic regression trained with expert annotations for all $N$ samples
* $M_{sub}$ be a logistic regression trained with expert annotations on only the subset of $n$ samples selected for expert annotation
* $M_{dsl}$ be a logistic regression trained with DSL using predicted annotations for all $N$ samples and expert annotations for a subset of $n$ samples

As part of some experiments to try to measure the effective sample size of different datasets when using DSL (along the lines of [this paper](https://osf.io/preprints/socarxiv/j3bnt_v3)), we tried to compare the performance of $M_{sub}$ and $M_{dsl}$ when varying $n$. To this end, we measured the RMSE of the coefficients of both models compared to the coefficients of $M_{true}$. We expected that $M_{dsl}$ would always have a smaller RMSE than $M_{sub}$ and that in both cases the RMSE would go to 0 as $\frac{n}{N} \to 1$.
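
The comparison between $M_{sub}$ and $M_{true}$ described above can be sketched on simulated data using only base R. This is a minimal illustration, not the experiment code: the toy data-generating process, the helper `coef_rmse`, and the subset sizes are assumptions made for the example, and DSL itself is left out.

```r
# Toy illustration with simulated data (not the datasets discussed here).
set.seed(1)
N <- 5000
x <- rnorm(N)
y <- rbinom(N, 1, plogis(-0.5 + 1.2 * x))
toy <- data.frame(x = x, y = y)

# RMSE between two coefficient vectors of equal length (hypothetical helper).
coef_rmse <- function(est, ref) sqrt(mean((est - ref)^2))

# Reference model fitted on all N expert labels.
M_true <- glm(y ~ x, data = toy, family = binomial)

# Fit M_sub on a random subset of size n and compare it to M_true.
rmse_sub <- sapply(c(500, 2000, N), function(n) {
    idx <- sample(N, n)
    M_sub <- glm(y ~ x, data = toy[idx, ], family = binomial)
    coef_rmse(coef(M_sub), coef(M_true))
})
print(rmse_sub)
```

For $n = N$ the subset is just a permutation of the full data, so the last entry of `rmse_sub` should be zero up to numerical tolerance.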

However, for one of the datasets we tested, we observed a different trend: the RMSE for $M_{dsl}$ became larger than for $M_{sub}$ as we increased $n$, and it never converged to zero:

![rmse.pdf](rmse.png)

Our question is: **why is this happening?** So far we haven't been able to figure it out. Interestingly, these effects do not manifest in our other datasets (we included a second dataset based on Amazon reviews for which DSL behaves as expected).

Below we provide an example of this phenomenon for the extreme case where $n = N = 10000$, where we expect that $M_{sub} = M_{dsl} = M_{true}$.
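
For the general case $n < N$, the inputs for $M_{sub}$ and $M_{dsl}$ could be built by masking the expert label outside the selected subset. The sketch below uses made-up values and a hypothetical column `y_obs`; it assumes that `dsl()` treats `NA` in `predicted_var` as unlabeled when no `labeled` argument is given, which should be checked against `?dsl`.

```r
# Toy data frame with column names similar to this notebook (values are made up).
set.seed(0)
toy <- data.frame(
    x1 = rnorm(10),
    y = rbinom(10, 1, 0.5),
    y_hat = rbinom(10, 1, 0.5)
)

# Keep expert labels for n randomly chosen rows; mark the rest as unlabeled.
n <- 4
labeled_idx <- sample(nrow(toy), n)
toy$y_obs <- NA_integer_
toy$y_obs[labeled_idx] <- toy$y[labeled_idx]

# M_sub would be fitted on the n labeled rows only.
sub <- toy[!is.na(toy$y_obs), ]

# M_dsl would receive all rows, with predicted_var = "y_obs" and
# prediction = "y_hat" (assumption: NA in predicted_var means unlabeled).
print(toy)
```

In the extreme case $n = N$ no label is masked, so `y_obs` coincides with `y`.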

# The Data

The data is a small subset of the [misinfo-general dataset](https://huggingface.co/datasets/ioverho/misinfo-general) of newspaper articles. We trimmed the dataset down to articles from two sources: The Guardian and The Sun. The goal of the logistic regression is to predict whether the article comes from The Guardian or The Sun based on certain independent variables of the text, such as the text length in words, the length of the title, and so forth. We balanced the dataset to have 5000 samples for each class. Predictions for the outcome label were obtained by training a classifier on the DistilBERT embeddings of the articles and their true labels.

In [2]:
data <- fromJSON("misinfo.json")
head(data)

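| | x1 | x2 | x3 | x4 | y | y_hat |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1595 | 235 | 98 | 106 | 0 | 0 |
| 2 | 1562 | 237 | 59 | 78 | 0 | 1 |
| 3 | 2574 | 410 | 69 | 95 | 0 | 0 |
| 4 | 1828 | 289 | 37 | 102 | 0 | 0 |
| 5 | 1995 | 323 | 88 | 118 | 0 | 0 |
| 6 | 1852 | 300 | 115 | 119 | 0 | 0 |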


In [3]:
cat("Number of samples:", nrow(data), "\n")
cat("Class balance:", sum(data$y) / nrow(data), "\n")

Number of samples: 10000 
Class balance: 0.5 


# Training

In [19]:
SEED <- 0
run <- function(data) {
    set.seed(SEED)

    # Train model with gold labels for all samples
    M_true <- glm(
        y ~ x1 + x2 + x3 + x4,
        data = data,
        family = binomial
    )
    # summary() is not auto-printed inside a function, so print it explicitly
    print(summary(M_true))

    # Train model with DSL (every row has an expert label in y, so n = N)
    M_dsl <- dsl(
        model = "logit",
        formula = y ~ x1 + x2 + x3 + x4,
        predicted_var = "y",
        prediction = "y_hat",
        data = data,
        seed = SEED
    )
    summary(M_dsl)

    # Compare the two models
    true_coeffs <- M_true$coefficients
    dsl_coeffs <- M_dsl$coefficients
    print(dsl_coeffs)
    print(true_coeffs)
    cat("RMSE:", sqrt(mean((dsl_coeffs - true_coeffs)^2)), "\n")

    # Return the coefficients so that later cells can reuse them
    invisible(list(true_coeffs = true_coeffs, dsl_coeffs = dsl_coeffs))
}

# Results

Since we actually have true labels for all samples when doing DSL, we expect the coefficients of both logistic regressions to be very similar, and therefore we expect both the standardised bias and the RMSE to be close to 0. However, we do not observe this:

In [None]:
res <- run(data)

In [9]:
cat("RMSE:", sqrt(mean((res$dsl_coeffs - res$true_coeffs)^2)), "\n")

RMSE: 0.1990291 
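
A single RMSE over all coefficients does not show which coefficient drives the gap, and the predictors here live on very different scales (for example, `x1` is in the thousands). One way to look closer is a per-coefficient table. The sketch below uses hypothetical coefficient vectors chosen purely for illustration; they are not results from this notebook.

```r
# Hypothetical coefficient vectors, for illustration only
# (NOT results from this notebook).
true_coeffs <- c("(Intercept)" = 1.50, x1 = 0.0010, x2 = -0.0040)
dsl_coeffs  <- c("(Intercept)" = 1.10, x1 = 0.0011, x2 = -0.0042)

# Per-coefficient absolute and relative differences.
diff_table <- data.frame(
    true = true_coeffs,
    dsl = dsl_coeffs,
    abs_diff = abs(dsl_coeffs - true_coeffs),
    rel_diff = abs(dsl_coeffs - true_coeffs) / abs(true_coeffs)
)
print(diff_table)

# The aggregate RMSE, computed as in the cell above.
rmse <- sqrt(mean((dsl_coeffs - true_coeffs)^2))
cat("RMSE:", rmse, "\n")
```

In this made-up example the intercept accounts for almost all of the RMSE, even though its relative difference is of the same order as that of the slopes.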


# Mean-centering the data

Let's repeat the experiment, but this time mean-center the data first (so that each variable has mean 0) before doing the rest. According to Claude, this could result in more sensitive intercepts ... https://claude.ai/chat/a64d67e3-fe1d-4579-8a92-48c8c2f53d85

In [12]:
data

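| | x1 | x2 | x3 | x4 | y | y_hat |
|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 1595 | 235 | 98 | 106 | 0 | 0 |
| 2 | 1562 | 237 | 59 | 78 | 0 | 1 |
| 3 | 2574 | 410 | 69 | 95 | 0 | 0 |
| 4 | 1828 | 289 | 37 | 102 | 0 | 0 |
| 5 | 1995 | 323 | 88 | 118 | 0 | 0 |
| 6 | 1852 | 300 | 115 | 119 | 0 | 0 |
| 7 | 1511 | 237 | 38 | 71 | 0 | 0 |
| 8 | 2415 | 374 | 110 | 120 | 0 | 0 |
| 9 | 4114 | 614 | 125 | 104 | 0 | 0 |
| 10 | 1842 | 301 | 49 | 119 | 0 | 0 |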


In [17]:
scale(data, center = TRUE, scale = FALSE)

| | x1 | x2 | x3 | x4 | y | y_hat |
|---|---|---|---|---|---|---|
| 1 | -2318.5942 | -361.5261 | 1.3997 | 19.9455 | -0.5 | -0.5038 |
| 2 | -2351.5942 | -359.5261 | -37.6003 | -8.0545 | -0.5 | 0.4962 |
| 3 | -1339.5942 | -186.5261 | -27.6003 | 8.9455 | -0.5 | -0.5038 |
| 4 | -2085.5942 | -307.5261 | -59.6003 | 15.9455 | -0.5 | -0.5038 |
| 5 | -1918.5942 | -273.5261 | -8.6003 | 31.9455 | -0.5 | -0.5038 |
| 6 | -2061.5942 | -296.5261 | 18.3997 | 32.9455 | -0.5 | -0.5038 |
| 7 | -2402.5942 | -359.5261 | -58.6003 | -15.0545 | -0.5 | -0.5038 |
| 8 | -1498.5942 | -222.5261 | 13.3997 | 33.9455 | -0.5 | -0.5038 |
| 9 | 200.4058 | 17.4739 | 28.3997 | 17.9455 | -0.5 | -0.5038 |
| 10 | -2071.5942 | -295.5261 | -47.6003 | 32.9455 | -0.5 | -0.5038 |
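
Note that calling `scale()` on the whole data frame also shifts `y` and `y_hat` (they show up as -0.5 and -0.5038 above), so they are no longer 0/1 values, which a binomial `glm` requires for the outcome. A minimal sketch that centers only the predictors is given below; the toy values and the `predictors` vector are assumptions for illustration.

```r
# Toy data with column names similar to this notebook (values are made up).
toy <- data.frame(
    x1 = c(10, 20, 30, 40),
    x2 = c(1, 3, 5, 7),
    y = c(0L, 1L, 0L, 1L),
    y_hat = c(0L, 1L, 1L, 1L)
)

# Center only the predictor columns; leave the binary outcome and its
# prediction untouched.
predictors <- c("x1", "x2")
centered <- toy
centered[predictors] <- lapply(toy[predictors], function(v) v - mean(v))

print(centered)
```

With the real data, `predictors` would be `c("x1", "x2", "x3", "x4")`.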
