Holds machine readable test data for the initial SCAPE Characterisation Components

opf-attic/cc-benchmark-tests

SCAPE Characterisation Components Benchmarking Data

Holds machine readable test data for the initial SCAPE Characterisation Components

Overview

Data Release Type

  • [] a one-off release of a single dataset
  • [*] a one-off release of a set of related datasets
  • [] ongoing release of a series of related datasets
  • [] a service or API for accessing open data

Dataset Type

  • [] human-readable documents
  • [*] statistical data, such as counts, averages and percentages
  • [] geographic information, such as points and boundaries
  • [] other kinds of structured data

Description

A record of Tika identification results and benchmark timings, tested on the GovDocs corpora. There are eight result files, two for each of the four Tika versions evaluated: 1.0, 1.1, 1.2 and 1.3.

The following four files contain the evaluation results obtained when calling Tika with a file input stream only, i.e. no file name:

  • Tika_1_0.csv - Evaluation results when using Tika v1.0
  • Tika_1_1.csv - Evaluation results when using Tika v1.1
  • Tika_1_2.csv - Evaluation results when using Tika v1.2
  • Tika_1_3.csv - Evaluation results when using Tika v1.3

The following four files contain the evaluation results obtained when calling Tika with a file input stream and also a file name:

  • Tika_1_0_with_filename.csv - Evaluation results when using Tika v1.0
  • Tika_1_1_with_filename.csv - Evaluation results when using Tika v1.1
  • Tika_1_2_with_filename.csv - Evaluation results when using Tika v1.2
  • Tika_1_3_with_filename.csv - Evaluation results when using Tika v1.3

The data is in CSV format. Each record consists of five fields:

  1. the name of the input file
  2. the MIME type returned by Tika
  3. the MIME type(s) from the GovDocs ground truth (separated by semicolons)
  4. time in milliseconds taken to perform Tika identification
  5. boolean OK/FAIL indicating whether Tika successfully identified the file type
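
As an illustration of the record layout above, the following minimal Python sketch converts one five-field row into a typed record. The `TikaRecord` class, the `parse_record` function and the sample row are hypothetical and are not part of the dataset; the sketch assumes the fields appear in the order listed and that the status field holds the literal text OK or FAIL.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TikaRecord:
    """One result row (hypothetical helper type, not part of the dataset)."""
    file_name: str          # field 1: name of the input file
    tika_mime: str          # field 2: MIME type returned by Tika
    truth_mimes: List[str]  # field 3: ground-truth MIME type(s)
    millis: float           # field 4: identification time in milliseconds
    ok: bool                # field 5: True for OK, False otherwise


def parse_record(row):
    """Convert one five-field CSV row (a list of strings) into a TikaRecord."""
    if len(row) != 5:
        raise ValueError(f"expected 5 fields, got {len(row)}")
    name, tika_mime, truth, millis, status = (field.strip() for field in row)
    # The ground-truth field may hold several types separated by semicolons.
    truth_mimes = [m.strip() for m in truth.split(";") if m.strip()]
    return TikaRecord(name, tika_mime, truth_mimes, float(millis),
                      status.upper() == "OK")


# Made-up row for illustration only; it is not taken from the result files.
example = parse_record(
    ["000001.pdf", "application/pdf", "application/pdf;text/plain", "12", "OK"]
)
print(example.tika_mime in example.truth_mimes)
```

Splitting the third field on semicolons makes it straightforward to check whether the type reported by Tika is among the accepted ground-truth types.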

Copyright & License

All content and data are © 2013 SCAPE Project and licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Open Data

All data in this dataset is released as Open Data. It is freely available for anyone to use, re-use and re-distribute, subject only to the requirements set out in the license. These requirements stipulate that you must attribute our data in your work AND share the resulting works under a similar license.

This dataset has been self-assessed to be of "3-Star" quality (see 5 Star Data) and compatible with the "Pilot Level" requirements of the Open Data Institute's Open Data Certificate.

Personally Identifiable Data

This dataset does not contain any personally identifiable data other than that relating to the curators, maintainers and publishers of the dataset and related information.

Using the Data

The CSV (text/csv) files can be opened in Excel and other spreadsheet applications.
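
The files can also be processed programmatically. The sketch below is one possible way to summarise a result file with Python's standard `csv` module; the `summarise` function and the sample rows are hypothetical. It assumes comma-delimited rows with the five fields in the documented order, and it skips any row whose timing field is not numeric (for example a header row, if one is present).

```python
import csv
import io


def summarise(lines):
    """Return file count, OK rate and mean time (ms) for an iterable of CSV lines."""
    total = 0
    ok = 0
    elapsed = 0.0
    for row in csv.reader(lines):
        if len(row) != 5:
            continue  # skip blank or malformed rows
        try:
            millis = float(row[3])
        except ValueError:
            continue  # skip rows without a numeric timing, e.g. a header
        total += 1
        elapsed += millis
        if row[4].strip().upper() == "OK":
            ok += 1
    if total == 0:
        return {"files": 0, "ok_rate": 0.0, "mean_ms": 0.0}
    return {"files": total, "ok_rate": ok / total, "mean_ms": elapsed / total}


# Made-up rows for illustration only; they are not taken from the result files.
sample = io.StringIO(
    "000001.pdf,application/pdf,application/pdf,5,OK\n"
    "000002.doc,text/plain,application/msword,15,FAIL\n"
)
result = summarise(sample)
print(result)
```

To run this against a real file, pass an open file object instead of the in-memory sample, for instance `with open("Tika_1_3.csv", newline="") as f: summarise(f)`.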

Both CSV and JSON are machine-readable formats. More information about each can be found at the following locations:

  • CSV: http://en.wikipedia.org/wiki/Comma-separated_values
  • JSON: http://en.wikipedia.org/wiki/JSON

Findability

  • [] Data can be found within 3 clicks of the organisation's home page
  • [] Is the data listed somewhere, alongside data from the wider sector

Applicability

This dataset does not contain any time-sensitive information.

Quality

Guarantees

This data is made available on an experimental basis but should remain available until July 2014.

References

  • [] Do the data formats use vocabularies?
  • [*] Are there any codes used within the data?

MIME Types are used within the dataset, e.g. text/csv. A list of IANA MIME types is available at http://www.iana.org/assignments/media-types.

The file name field refers to the name of the file within the GovDocs dataset, available at http://digitalcorpora.org/corpora/files.

Support

contact: carl@openplanetsfoundation.org

We're currently not supporting a mailing list or forum for discussion of this dataset. Problems can be logged at the project's issue page: https://github.com/openplanets/cc-benchmark-tests/issues

Services

Links to tools for working with this data:
