
lum-ai/ie-benchmarks


ie-benchmarks

Speed benchmarks for information extraction systems.

Data

We provide a random sample of 5K and 10K articles from the English Wikipedia for use in benchmarking.

These articles were processed using the ie-benchmarks branch of wikiparse, which in turn uses the org.clulab.processor.CluProcessor (v7.51) text annotator for sentence segmentation, tokenization, part-of-speech tagging, lemmatization, chunking, named entity recognition, and dependency parsing.

As of this writing, the following link will retrieve the most recently completed dump of the English Wikipedia:

A select number of recent dumps can be found at https://dumps.wikimedia.org/enwiki/.

The sample datasets released here were generated from the June 2, 2014 dump of the English Wikipedia.

5K article sample

# Download a random sample of 5K EN Wikipedia articles
curl https://public.lum.ai/ie-benchmarks/parsed-documents/wikipedia/en/5K.tar.gz --output 5K.tar.gz
# unpack the archive
tar xvzf 5K.tar.gz

10K article sample

# Download a random sample of 10K EN Wikipedia articles
curl https://public.lum.ai/ie-benchmarks/parsed-documents/wikipedia/en/10K.tar.gz --output 10K.tar.gz
# unpack the archive
tar xvzf 10K.tar.gz
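
If you want both samples, the two download steps above can be folded into one small script. The following is a minimal sketch, not part of the repository: `sample_url` and `fetch_sample` are hypothetical helper names, and the only details taken from this README are the base URL and the `5K`/`10K` archive names.

```shell
#!/usr/bin/env bash
# Base URL copied from the download commands in this README.
BASE_URL="https://public.lum.ai/ie-benchmarks/parsed-documents/wikipedia/en"

# Hypothetical helper: print the archive URL for a sample name (5K or 10K).
sample_url() {
  case "$1" in
    5K|10K) printf '%s/%s.tar.gz\n' "$BASE_URL" "$1" ;;
    *) echo "unknown sample: $1 (expected 5K or 10K)" >&2; return 1 ;;
  esac
}

# Hypothetical helper: download one sample and unpack it in the current directory.
fetch_sample() {
  local url
  url="$(sample_url "$1")" || return 1
  # --fail makes curl exit non-zero on an HTTP error instead of saving the error page.
  curl --fail "$url" --output "$1.tar.gz" && tar xzf "$1.tar.gz"
}

# Example (not run automatically):
#   fetch_sample 5K && fetch_sample 10K
```

Rejecting unknown sample names up front avoids a wasted request for an archive that this README does not list.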

Odinson

Building an Odinson index

5K article sample

sbt "odinson/runMain ai.lum.benchmarks.odinson.IndexDocuments -i 5K -o 5k-index"

10K article sample

sbt "odinson/runMain ai.lum.benchmarks.odinson.IndexDocuments -i 10K -o 10k-index"

Benchmarking

sbt "odinson/runMain ai.lum.benchmarks.odinson.BenchmarkQueries -i 5k-index -q queries/odinson/president.txt -n 1000 -o output/5k/odinson"
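
To benchmark both indexes built earlier, the command above can be parameterized by index size. This is a sketch under assumptions: `odinson_cmd` is a hypothetical helper, and the `output/10k/odinson` path simply mirrors the 5K example rather than anything stated in this README. The flags (`-i`, `-q`, `-n`, `-o`) and the query file are taken from the command above. By default the loop only prints the commands; set `DRY_RUN=0` to run them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the sbt task string for one index size (5k or 10k).
# The second argument is the value passed to -n (defaults to 1000, as in the README).
odinson_cmd() {
  local size="$1" n="${2:-1000}"
  printf 'odinson/runMain ai.lum.benchmarks.odinson.BenchmarkQueries -i %s-index -q queries/odinson/president.txt -n %s -o output/%s/odinson\n' \
    "$size" "$n" "$size"
}

DRY_RUN="${DRY_RUN:-1}"
for size in 5k 10k; do
  cmd="$(odinson_cmd "$size")"
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command instead of running it.
    echo "sbt \"$cmd\""
  else
    sbt "$cmd"
  fi
done
```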

Odin

Benchmarking

sbt "odin/runMain ai.lum.benchmarks.odin.BenchmarkQueries -d 5K -g queries/odin/system.yml -n 1000 -o output/5k/odin"
