simIReff

Provides tools for the stochastic simulation of effectiveness scores to mitigate data-related limitations of Information Retrieval evaluation research. These tools include:

  • Fitting of continuous and discrete distributions to model system effectiveness.
  • Plotting of effectiveness distributions.
  • Selection of the distribution that best fits the given data.
  • Transformation of distributions towards a prespecified expected value.
  • Proxy for fitting copula models based on these distributions.
  • Simulation of new evaluation data from these distributions and copula models.

For details, please refer to Julián Urbano and Thomas Nagler, "Stochastic Simulation of Test Collections: Evaluation Scores", ACM SIGIR, 2018.

Installation

You may install the stable release from CRAN

install.packages("simIReff")

or the latest development version from GitHub

devtools::install_github("julian-urbano/simIReff", ref = "develop")

Usage

Fit a marginal AP distribution and simulate new data

x <- web2010ap[,10] # sample AP scores of a system
e <- effContFitAndSelect(x, method = "BIC") # fit and select based on BIC
plot(e) # plot pdf, cdf and quantile function
e$mean # expected value
y <- reff(50, e) # simulation of 50 new topics

and transform the distribution to have a pre-specified expected value.

e2 <- effTransform(e, mean = .14) # transform for expected value of .14
plot(e2)
e2$mean # check the result
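
As a quick sanity check on the transformation, one could simulate a large number of topics from the transformed distribution and compare the sample mean with the target. The sketch below repeats the steps above so it runs on its own. It assumes the package is installed, and that deff, peff and qeff take the evaluation points first and the fitted object second, in the same way reff does; the seed and sample size are arbitrary choices.

```r
library(simIReff)

x  <- web2010ap[, 10]                         # sample AP scores of a system
e  <- effContFitAndSelect(x, method = "BIC")  # fit and select based on BIC
e2 <- effTransform(e, mean = .14)             # transform for expected value of .14

set.seed(1)                                   # arbitrary seed, for reproducibility only
y2 <- reff(10000, e2)                         # simulate many topics from the transformed model
c(target = .14, model = e2$mean, simulated = mean(y2))

# Evaluate the transformed distribution at a few points
# (assumed argument order: points first, fitted object second)
deff(c(.1, .5), e2)                           # density
peff(.14, e2)                                 # cumulative probability
qeff(c(.25, .5, .75), e2)                     # quartiles
```

The simulated mean should approach e2$mean as the number of simulated topics grows.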

Build a copula model of two systems

d <- web2010ap[,2:3] # sample AP scores
e1 <- effCont_norm(d[,1]) # force the first margin to follow a truncated gaussian
e2 <- effCont_bks(d[,2]) # force the second margin to follow a beta kernel-smoothed distribution
cop <- effcopFit(d, list(e1, e2)) # copula
y <- reffcop(1000, cop) # simulation of 1000 new topics
c(e1$mean, e2$mean) # expected means
colMeans(y) # observed means

and modify the model so both systems have the same distribution

cop2 <- cop # copy the model
cop2$margins[[2]] <- e1 # modify 2nd margin
y <- reffcop(1000, cop2) # simulation of 1000 new topics
colMeans(y) # observed means
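
For intuition about what simulating from margins plus a copula involves, the following base-R sketch draws from a bivariate Gaussian copula by hand. It is only a conceptual illustration and not the package's implementation: the correlation rho and the two beta margins are made-up values, not fitted to any data.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only

n   <- 5000                        # number of simulated topics
rho <- 0.7                         # hypothetical dependence between the two systems
R   <- matrix(c(1, rho, rho, 1), 2, 2)

# 1. Correlated standard normals via the Cholesky factor of R
z <- matrix(rnorm(n * 2), n, 2) %*% chol(R)

# 2. Map to dependent uniforms: this is a sample from the Gaussian copula
u <- pnorm(z)

# 3. Apply the quantile function of each margin (hypothetical beta margins)
y <- cbind(sys1 = qbeta(u[, 1], 2, 8),   # expected value 2 / (2 + 8) = 0.2
           sys2 = qbeta(u[, 2], 3, 7))   # expected value 3 / (3 + 7) = 0.3

colMeans(y)                              # close to the expected values of the margins
cor(y, method = "spearman")[1, 2]        # rank correlation induced by the copula
```

Because the marginal quantile functions are monotone, the rank correlation between the two columns is determined by the copula alone, which is why a margin can be swapped without touching the dependence structure.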

Automatically build a Gaussian copula for many systems,

d <- web2010p20[,1:20] # sample P@20 data from 20 systems
effs <- effDiscFitAndSelect(d, support("p20")) # fit and select margins
cop <- effcopFit(d, effs, family_set = "gaussian") # fit copula
y <- reffcop(1000, cop) # simulate 1000 new topics

compare observed vs. expected mean,

E <- sapply(effs, function(e) e$mean)
E.hat <- colMeans(y)
plot(E, E.hat)
abline(0:1)

compare observed vs. expected variance,

Var <- sapply(effs, function(e) e$var)
Var.hat <- apply(y, 2, var)
plot(Var, Var.hat)
abline(0:1)

and compare original vs. simulated distributions.

o <- order(colMeans(d))
boxplot(d[,o])
points(colMeans(d)[o], col = "red", pch = 4) # plot means
boxplot(y[,o])
points(colMeans(y)[o], col = "red", pch = 4) # plot means
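
Beyond means, variances and marginal distributions, one might also want to check that the dependence between systems is reproduced. A possible sketch, repeating the model fit above so it runs on its own, compares pairwise Kendall correlations in the original and simulated data. The choice of Kendall's tau and the seed are assumptions of this example, not something prescribed by the package.

```r
library(simIReff)

d    <- web2010p20[, 1:20]                          # sample P@20 data from 20 systems
effs <- effDiscFitAndSelect(d, support("p20"))      # fit and select margins
cop  <- effcopFit(d, effs, family_set = "gaussian") # fit copula

set.seed(1)                                         # arbitrary seed, for reproducibility only
y <- reffcop(1000, cop)                             # simulate 1000 new topics

ut    <- upper.tri(diag(ncol(d)))                   # one entry per pair of systems
K     <- cor(d, method = "kendall")[ut]             # original pairwise correlations
K.hat <- cor(y, method = "kendall")[ut]             # simulated pairwise correlations
plot(K, K.hat)
abline(0:1)
```

Points close to the diagonal suggest that the copula preserves the pairwise dependence observed in the original scores.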

License

simIReff is released under the terms of the MIT License.

When using this archive, please cite the above paper:

@inproceedings{urbano2018simulation,
  author = {Urbano, Juli\'{a}n and Nagler, Thomas},
  booktitle = {International ACM SIGIR Conference on Research and Development in Information Retrieval},
  title = {{Stochastic Simulation of Test Collections: Evaluation Scores}},
  pages = {695--704},
  year = {2018}
}
