ClueWeb Annotations

Tools for processing annotations on the Clueweb datasets.

This project includes tools for the four steps of the process:

  1. Convert the WARC files into the format required to do the annotations.
  2. Convert the annotations into the published format.
  3. Group the annotations by the source WARC file.
  4. Test the annotations in the original documents.

Tools

Convert the annotations into the published format

The original annotations are generated in plain text files that contain the output for several documents from ClueWeb. We convert them to another format that makes them easier to process.

The following is an example of the annotation in the original format:

clueweb09-en0000-00-01198.html
VOC     41856   41859   1.4636309e-05   /m/03zbcv
US      42670   42672   0.0054688235    /m/09c7w0
Birch   42722   42727   8.6088057e-06   /m/0hpx4
MDF     42751   42754   5.1136312e-06   /m/01pd_h

clueweb09-en0000-00-01500.html
Illinois Appellate Court        7750    7774    0.00019704235   /m/02z5_ft
Lisa Madigan    8996    9008    0.00038186592   /m/04q09d

Each block contains the annotations for a single document: the first line is the document ID (WARC-TREC-ID) and each of the following lines is an annotation identified in that document. Each annotation has the following format (tab separated):

text        start-pos  end-pos  probability   freebase-id

text is the matched text in the document. It is not necessarily exactly the same string as in the source, since it may have been transformed: additional whitespace removed or characters normalized.

start-pos The byte offset at which the identified entity starts.

end-pos The byte offset at which the identified entity ends.

probability A confidence score assigned to the annotation.

freebase-id The identifier of the entity in Freebase; it is also known as the mid (machine id). This identifier is used to form the URL of the entity in Freebase:

http://www.freebase.com/m/XXXXX
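
A minimal Python sketch (not part of this repository) that parses a file in this original format into per-document annotation lists; the field names follow the description above:

import sys
from collections import defaultdict

def parse_original(lines):
    # Blocks are separated by blank lines; the first line of a block is
    # the document ID, each following line is a tab-separated annotation.
    annotations = defaultdict(list)
    doc_id = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            doc_id = None              # a blank line ends the current block
        elif doc_id is None:
            doc_id = line.strip()      # e.g. clueweb09-en0000-00-01198.html
        else:
            text, start, end, prob, mid = line.split("\t")
            annotations[doc_id].append(
                (text, int(start), int(end), float(prob), mid))
    return annotations

if __name__ == "__main__":
    for doc, anns in parse_original(sys.stdin).items():
        print(doc, len(anns))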

Group by source WARC file

The original annotations are included in several plain text files that are not necessarily aligned to the original WARC files. That is, the text files do not follow any naming convention that allows us to quickly identify the source WARC file for the annotations included in them. To simplify the tests and the format conversion, we group the annotations by their source WARC file.

The script scripts/group-annotations.py takes as input several annotation files (in .tsv format, optionally gzipped), identifies the WARC file that contains each annotated page, and creates one output file per WARC file with the annotations for the pages in it. This allows us to later process the annotations and the WARC files in parallel.
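
A simplified sketch of this grouping step is shown below. It assumes annotation lines whose first column is the WARC-TREC-ID (as in the output format described next), and that, as for ClueWeb09 IDs such as clueweb09-en0000-00-01198, the segment and file number embedded in the ID identify the source WARC file; the actual logic lives in scripts/group-annotations.py.

import gzip
import sys

def warc_key(trec_id):
    # clueweb09-en0000-00-01198 -> "en0000-00": segment and file number,
    # which (for ClueWeb09) identify the source WARC file en0000/00.warc.gz.
    _, segment, file_no, _ = trec_id.replace(".html", "").split("-")
    return segment + "-" + file_no

def group(paths):
    outputs = {}
    for path in paths:
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            for line in f:
                key = warc_key(line.split("\t", 1)[0])  # first column: document ID
                if key not in outputs:
                    outputs[key] = open(key + ".tsv", "w")
                outputs[key].write(line)
    for out in outputs.values():
        out.close()

if __name__ == "__main__":
    group(sys.argv[1:])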

Output format

The annotations are converted to a format that is easier to process. The format adds two columns at the beginning of each line:

  • Document identifier (WARC-TREC-ID)
  • Original encoding: name of the encoding used to process the entry. The positions of the annotations correspond to byte offsets after processing the file with this encoding.

The following is an example of the format:

clueweb09-en0000-00-01198   UTF-8   VOC     41856   41859   1.4636309e-05   /m/03zbcv
clueweb09-en0000-00-01198   UTF-8   US      42670   42672   0.0054688235    /m/09c7w0

Note that the first field is the actual value of WARC-TREC-ID, without the .html extension.
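
Lines in this converted format can be consumed as in the following sketch (an illustration only; the column names follow the description above):

import csv
import sys

# Columns of the converted format, in order.
FIELDS = ("trec_id", "encoding", "text", "start", "end", "probability", "freebase_id")

def read_converted(stream):
    # Yield one dict per annotation line of the converted .tsv format.
    for row in csv.reader(stream, delimiter="\t"):
        rec = dict(zip(FIELDS, row))
        rec["start"] = int(rec["start"])
        rec["end"] = int(rec["end"])
        rec["probability"] = float(rec["probability"])
        yield rec

if __name__ == "__main__":
    for rec in read_converted(sys.stdin):
        print(rec["trec_id"], rec["encoding"], rec["freebase_id"])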

Converting the original annotations

The program lemur.cw.ann.FormatAnnotations allows you to convert the original annotations into the output format. It supports both formats used for the annotations:

  • Plain text files using the format described above
  • .tar.gz archives that contain plain text files in the same format

The program is a command line utility used as follows:

lemur.cw.ann.ConvertOriginal tar|text [input1 [input2 ...]] > output.tsv

Where inputN is an input file, either plain text or .tar.gz, depending on the selected mode. If no input files are specified, the standard input is used.

The annotations are written to the standard output.

Test the annotations in the original documents

This command allows you to test that the annotations match the original documents. For each annotation, the program reads the content of the document and compares the text mentioned in the annotation with the actual text found at the annotated offsets. The comparison is approximate: whitespace and case differences are ignored.

If the annotations do not match, they are compared after reading the document using different encodings in the following order:

  • UTF-8
  • ISO-8859-1 (HTTP's default)
  • Encoding identified using the ICU library

If there is no match, an error is printed to the standard output. Otherwise (i.e. if the texts match), no output is generated.
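
The core comparison can be sketched roughly as follows; the offsets are assumed to be byte offsets into the raw document content, and the ICU-based detection used by the real tool is omitted here:

import re

# Candidate encodings tried in order; the actual tool additionally tries
# an encoding detected with the ICU library.
ENCODINGS = ("utf-8", "iso-8859-1")

def normalize(s):
    # Approximate comparison: ignore whitespace and case differences.
    return re.sub(r"\s+", "", s).lower()

def annotation_matches(doc_bytes, text, start, end):
    # True if the annotated text is found at [start:end) of the document
    # under at least one candidate encoding.
    for enc in ENCODINGS:
        snippet = doc_bytes[start:end].decode(enc, errors="replace")
        if normalize(snippet) == normalize(text):
            return True
    return False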

Command line usage:

lemur.cw.ann.DetectEncoding file.warc.gz annotations.tsv[.gz] [debug]

Use 'true' as the last argument to print the annotations that cannot be matched to the standard error.

Convert the WARC files into .tar.gz files

The class ProcessWarc allows you to convert a WARC file into a .tar.gz archive, with one file entry added for each response record. The content of each file entry corresponds to the body of the WARC record, with some processing:

  • Each character of the HTTP headers is replaced by whitespace.
  • The body is written as it is.

The file entries have the same size (in bytes) as the actual WARC records.

Each file entry is named using the WARC header WARC-TREC-ID with the extension .html appended to it.
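
A rough Python equivalent of this conversion, using the warcio and tarfile libraries (the actual tool is the Java class lemur.nopol.ProcessWarc, and the HTTP header length reconstructed here is only approximate):

import io
import tarfile
from warcio.archiveiterator import ArchiveIterator

def warc_to_targz(warc_path, tar_path):
    # One .html entry per response record; the HTTP headers are replaced
    # by an equal number of spaces so that byte offsets into the entry
    # line up with offsets into the original record.
    with open(warc_path, "rb") as stream, tarfile.open(tar_path, "w:gz") as tar:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            trec_id = record.rec_headers.get_header("WARC-TREC-ID")
            headers = record.http_headers.to_bytes()    # approximate reconstruction
            body = record.content_stream().read()
            data = b" " * len(headers) + body
            info = tarfile.TarInfo(name=trec_id + ".html")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

# Example: warc_to_targz("file.warc.gz", "output.tar.gz")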

Command line usage:

The program has two modes: reading individual files or reading an entire directory.

Individual files:

lemur.nopol.ProcessWarc file file.warc.gz output.tar.gz

Directories:

lemur.nopol.ProcessWarc dir input-directory output-directory
