Extracts plain text, language identification and more metadata from WARC records
Clone this repo along with submodules:
git clone --recurse-submodules https://github.com/bitextor/warc2text.git
Or:
git clone https://github.com/bitextor/warc2text.git
git submodule update --init --recursive
On Debian/Ubuntu/Mint:
apt-get install uchardet libuchardet-dev libzip-dev
On Mac:
brew install uchardet libzip
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/your/prefix/path ..
# cmake .. -DCMAKE_BUILD_TYPE=Debug # for debug
# cmake .. -DICU_ROOT_DIR=$(brew --prefix icu4c)/lib # for macOS
make -j
make install
Note: for WARCs with many languages you might hit the open file limit quite quickly. It is therefore advised to increase it, e.g. ulimit -n 8192.
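When warc2text is started from a script rather than an interactive shell, the same limit can be raised for the launching process, because child processes inherit it. The following is a minimal Python sketch, assuming a Unix-like system; raise_open_file_limit is a hypothetical helper name and is not part of warc2text.

```python
import resource


def raise_open_file_limit(target=8192):
    """Raise the soft open-file limit towards `target`; return the new soft limit.

    Hypothetical helper for illustration only. The soft limit can never
    exceed the hard limit, so `target` is capped at the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only raise the limit; leave it alone if it is already high enough.
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling this helper before starting warc2text (for example with subprocess.run) lets the child process inherit the higher limit, much like running ulimit -n 8192 in the shell first.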
warc2text -o <output_folder> [ -f <output_files> ] [ --pdfpass <output_warc> ]
[ --paragraph-identification ] [ --tag-filters <filters_file> ] <warc_file>...
- --output / -o: output folder
- --files / -f: list of output files separated by commas (and without .gz); text and url are always written, while mime and html are optional
- --pdfpass: WARC file where PDF records will be stored
- --paragraph-identification: print the paragraph identifier for each sentence extracted from the HTML
- --classifier: classifier to use: cld2 or fasttext
- --fasttext-model: path to the FastText model for the fasttext classifier
- --tag-filters: file containing filters that are used to eliminate matching documents
- --invert-tag-filters: output only documents that match the filter
- --url-filters: file containing regular expressions that match URLs of documents to eliminate
- --verbose / -v: print progress and filtering information
- --silent / -s: print only warnings and errors

Tag Filter format is the following:
tag <tab> attribute <tab> regexp
For example,
meta <tab> name <tab> translation-stats
will remove documents that contain <meta name="translation-stats" ... >
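To make the tag filter semantics concrete, here is a minimal Python sketch of how one filter line could be read and checked against an HTML document. It is an illustration only, not warc2text's implementation: parse_tag_filter and document_matches are hypothetical names, and it assumes the regular expression may match anywhere inside the attribute value.

```python
import re
from html.parser import HTMLParser


def parse_tag_filter(line):
    """Split one 'tag <tab> attribute <tab> regexp' line into its three parts."""
    tag, attribute, pattern = line.rstrip("\n").split("\t")
    return tag.lower(), attribute.lower(), re.compile(pattern)


class _TagFilterMatcher(HTMLParser):
    """Records whether any start tag in the document matches any filter."""

    def __init__(self, filters):
        super().__init__()
        self.filters = filters
        self.matched = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        for f_tag, f_attr, f_regex in self.filters:
            if tag != f_tag:
                continue
            for name, value in attrs:
                # Assumption: the regexp is searched within the attribute value.
                if name == f_attr and value is not None and f_regex.search(value):
                    self.matched = True


def document_matches(html, filters):
    """Return True if the document contains a tag matched by any filter."""
    matcher = _TagFilterMatcher(filters)
    matcher.feed(html)
    matcher.close()
    return matcher.matched
```

With the example filter above, a document for which document_matches returns True would be removed by --tag-filters, or would be the only kind kept when --invert-tag-filters is given.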
URL Filter format is a single regular expression per line.
Lines beginning with # and empty lines are ignored. Any invalid filter will raise a warning message, but will not prevent other filters from being read.
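The rules for reading a URL filter file can be sketched in Python as follows. This is a hedged illustration of the behaviour described above, not the actual warc2text code: load_url_filters and url_is_filtered are hypothetical names, Python's re syntax may differ from the regular expression dialect warc2text uses, and matching anywhere in the URL is an assumption.

```python
import re
import warnings


def load_url_filters(lines):
    """Compile one regular expression per line, skipping comments and empty lines."""
    filters = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # comment lines and empty lines are ignored
        try:
            filters.append(re.compile(line))
        except re.error as err:
            # An invalid filter only produces a warning; reading continues.
            warnings.warn(f"line {number}: invalid filter {line!r}: {err}")
    return filters


def url_is_filtered(url, filters):
    """Return True if any filter matches the URL (assumed: match anywhere)."""
    return any(f.search(url) for f in filters)
```

In this sketch the filters can be loaded from an open file object, since iterating over a text file yields one line at a time.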
HTML Tokenizer by c-smile
HTML entities decoder by Christoph Gärtner
Charset detection using uchardet
Zip support for open document format using libzip
All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.