simIReff

Provides tools for the stochastic simulation of effectiveness scores to mitigate data-related limitations of Information Retrieval evaluation research. These tools include:

Fitting of continuous and discrete distributions to model system effectiveness.
Plotting of effectiveness distributions.
Selection of the distributions that best fit the given data.
Transformation of distributions towards a prespecified expected value.
Proxy for fitting copula models based on these distributions.
Simulation of new evaluation data from these distributions and copula models.

For details, please refer to Julián Urbano and Thomas Nagler, "Stochastic Simulation of Test Collections: Evaluation Scores", ACM SIGIR, 2018.

Installation

You may install the stable release from CRAN

install.packages("simIReff")

or the latest development version from GitHub

devtools::install_github("julian-urbano/simIReff", ref = "develop")

Usage

Fit a marginal AP distribution and simulate new data

x <- web2010ap[,10] # sample AP scores of a system
e <- effContFitAndSelect(x, method = "BIC") # fit and select based on BIC
plot(e) # plot pdf, cdf and quantile function
e$mean # expected value
y <- reff(50, e) # simulation of 50 new topics

and transform the distribution to have a pre-specified expected value.

e2 <- effTransform(e, mean = .14) # transform for expected value of .14
plot(e2)
e2$mean # check the result

Build a copula model of two systems

d <- web2010ap[,2:3] # sample AP scores
e1 <- effCont_norm(d[,1]) # force the first margin to follow a truncated gaussian
e2 <- effCont_bks(d[,2]) # force the second margin to follow a beta kernel-smoothed distribution
cop <- effcopFit(d, list(e1, e2)) # copula
y <- reffcop(1000, cop) # simulation of 1000 new topics
c(e1$mean, e2$mean) # expected means
colMeans(y) # observed means

and modify the model so both systems have the same distribution

cop2 <- cop # copy the model
cop2$margins[[2]] <- e1 # modify 2nd margin
y <- reffcop(1000, cop2) # simulation of 1000 new topics
colMeans(y) # observed means

Automatically fit a Gaussian copula to many systems,

d <- web2010p20[,1:20] # sample P@20 data from 20 systems
effs <- effDiscFitAndSelect(d, support("p20")) # fit and select margins
cop <- effcopFit(d, effs, family_set = "gaussian") # fit copula
y <- reffcop(1000, cop) # simulate 1000 new topics

compare observed vs. expected mean,

E <- sapply(effs, function(e) e$mean)
E.hat <- colMeans(y)
plot(E, E.hat)
abline(0:1)

compare observed vs. expected variance,

Var <- sapply(effs, function(e) e$var)
Var.hat <- apply(y, 2, var)
plot(Var, Var.hat)
abline(0:1)

and compare original vs. simulated distributions.

o <- order(colMeans(d))
boxplot(d[,o])
points(colMeans(d)[o], col = "red", pch = 4) # plot means
boxplot(y[,o])
points(colMeans(y)[o], col = "red", pch = 4) # plot means

License

simIReff is released under the terms of the MIT License.

When using this package, please cite the paper above:

@inproceedings{urbano2018simulation,
  author = {Urbano, Juli\'{a}n and Nagler, Thomas},
  booktitle = {International ACM SIGIR Conference on Research and Development in Information Retrieval},
  title = {{Stochastic Simulation of Test Collections: Evaluation Scores}},
  pages = {695--704},
  year = {2018}
}
