warc2text

Extracts plain text, language identification, and other metadata from WARC records.

Download

Clone this repo along with submodules:

git clone --recurse-submodules https://github.com/bitextor/warc2text.git

Or:

git clone https://github.com/bitextor/warc2text.git
git submodule update --init --recursive

Install dependencies

On Debian/Ubuntu/Mint:

apt-get install uchardet libuchardet-dev libzip-dev

On Mac:

brew install uchardet libzip

Compile

mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/your/prefix/path ..
# cmake .. -DCMAKE_BUILD_TYPE=Debug # for debug
# cmake .. -DICU_ROOT_DIR=$(brew --prefix icu4c)/lib # for macOS
make -j
make install

Usage

Note: WARCs that contain many languages can hit the open file limit quickly. It is therefore advisable to raise the limit before running, e.g. ulimit -n 8192.
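The note above can be sketched as a small pre-flight check. This is only an illustration: the variable names are made up here, and whether the limit can actually be raised depends on the hard limit configured on your system.

```shell
# Read the current soft and hard limits on open file descriptors.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
echo "open files: soft=$soft hard=$hard"

# Target taken from the note above; adjust as needed.
target=8192

# Raise the soft limit only if it is numeric and below the target.
# The call fails when the hard limit is lower than the target, so report that instead of aborting.
if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$target" ]; then
    ulimit -Sn "$target" 2>/dev/null \
        || echo "could not raise the limit to $target (hard limit: $hard)" >&2
fi
```

The new limit applies only to the current shell and the processes it starts, so run warc2text from the same session.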

warc2text -o <output_folder> [ -f <output_files> ] [ --pdfpass <output_warc> ]
          [ --paragraph-identification ] [ --tag-filters <filters_file> ] <warc_file>...

--output/-o output folder
--files/-f list of output files separated by commas (and without .gz); text and url are always written, while mime and html are optional
--pdfpass WARC file where PDF records will be stored
--paragraph-identification print the paragraph identifier for each sentence extracted from the HTML
--classifier classifier to use: cld2 or fasttext
--fasttext-model path to the FastText model used by the fasttext classifier
--tag-filters file containing filters that are used to eliminate matching documents
--invert-tag-filters output only documents that match the filter
--url-filters file containing regular expressions that match urls of documents to eliminate
--verbose/-v print progress and filtering information
--silent/-s print only warnings and errors
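As an illustration of the options above, a minimal invocation might look like the following sketch. The WARC file name and output folder are hypothetical placeholders, and the guard around the call is only there so the snippet does nothing harmful when warc2text or the input file is missing.

```shell
# Hypothetical input and output locations; replace with your own.
warc=example.warc.gz
out_dir=out

# Output files to write, comma-separated and without .gz.
# text and url are always written; mime is one of the optional ones.
files=text,url,mime

# Create the output folder up front; harmless if it already exists.
mkdir -p "$out_dir"

# Run only when the binary is on PATH and the input exists.
if command -v warc2text >/dev/null 2>&1 && [ -f "$warc" ]; then
    warc2text -o "$out_dir" -f "$files" --verbose "$warc"
else
    echo "warc2text or $warc not found, skipping" >&2
fi
```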

The tag filter format is: tag <tab> attribute <tab> regexp

For example, meta <tab> name <tab> translation-stats will remove documents that contain <meta name="translation-stats" ... >

The URL filter format is one regular expression per line.

Lines beginning with # and empty lines are ignored. Any invalid filter will raise a warning message, but will not prevent other filters from being read.
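The two filter formats can be sketched by generating small example files. The file names and the URL pattern are invented for illustration, and the pattern sticks to constructs common to most regular expression dialects, since the exact dialect is not stated here.

```shell
# Hypothetical file names; any paths work.
tag_filters=tag-filters.tsv
url_filters=url-filters.txt

# Tag filters: tag <tab> attribute <tab> regexp, one filter per line.
# printf expands \t into a real tab character.
{
    printf '# drop pages that declare translation statistics\n'
    printf 'meta\tname\ttranslation-stats\n'
} > "$tag_filters"

# URL filters: one regular expression per line.
# The empty line and the comment line are there to show that both are allowed.
{
    printf '# drop everything under a hypothetical mirror path\n'
    printf '\n'
    printf 'example\\.com/mirror/\n'
} > "$url_filters"

# Count the lines that are neither comments nor empty, i.e. the actual filters.
tag_count=$(grep -v -e '^#' -e '^$' "$tag_filters" | wc -l | tr -d ' ')
url_count=$(grep -v -e '^#' -e '^$' "$url_filters" | wc -l | tr -d ' ')
echo "tag filters: $tag_count, url filters: $url_count"
```

The resulting files would then be passed as --tag-filters tag-filters.tsv and --url-filters url-filters.txt.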

Included dependencies

HTML Tokenizer by c-smile

HTML entities decoder by Christoph Gärtner

Charset detection using uchardet

Zip support for open document format using libzip

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.
