This is the implementation of our paper Boilerplate Removal using a Neural Sequence Labeling Model.
This code is tested with Python 3.7.5 and
- tensorflow==2.1.0
- numpy==1.17.3
- tqdm==4.39.0
- nltk==3.4.5
- beautifulsoup4==4.8.1
- html5lib==1.0.1
- scikit-learn==0.21.3
The datasets are available for download here:
usage: preprocess.py [-h] [-s SPLIT_DIR] [-w NUM_WORDS] [-t NUM_TAGS]
[--save SAVE]
DIRS [DIRS ...]
positional arguments:
DIRS A list of directories containing the HTML files
optional arguments:
-h, --help show this help message and exit
-s SPLIT_DIR, --split_dir SPLIT_DIR
Directory that contains train-/dev-/testset split
-w NUM_WORDS, --num_words NUM_WORDS
Only use the top-k words
-t NUM_TAGS, --num_tags NUM_TAGS
Only use the top-l HTML tags
--save SAVE Where to save the results
After downloading and extracting one of the zip files above, preprocess your dataset, for example:
python3 net/preprocess.py googletrends-2017/prepared_html/ -s googletrends-2017/50-30-100-split/ -w 1000 -t 50 --save googletrends_data
The training script takes care of both training and evaluating on dev- and testset:
usage: train.py [-h] [-l NUM_LAYERS] [-u HIDDEN_UNITS] [-d DROPOUT]
[-s DENSE_SIZE] [-e EPOCHS] [-b BATCH_SIZE]
[--interval INTERVAL] [--working_dir WORKING_DIR]
DATA_DIR
positional arguments:
DATA_DIR Directory of files produced by the preprocessing
script
optional arguments:
-h, --help show this help message and exit
-l NUM_LAYERS, --num_layers NUM_LAYERS
The number of RNN layers
-u HIDDEN_UNITS, --hidden_units HIDDEN_UNITS
The number of hidden LSTM units
-d DROPOUT, --dropout DROPOUT
The dropout percentage
-s DENSE_SIZE, --dense_size DENSE_SIZE
Size of the dense layer
-e EPOCHS, --epochs EPOCHS
The number of epochs
-b BATCH_SIZE, --batch_size BATCH_SIZE
The batch size
--interval INTERVAL Calculate metrics and save the model after this many
epochs
--working_dir WORKING_DIR
Where to save checkpoints and logs
For example, the model can be trained like this:
python3 net/train.py googletrends_data --working_dir googletrends_train
In order to reproduce the paper results, use the following hyperparameters:
-s googletrends-2017/50-30-100-split -w 1000 -t 50
(preprocessing)-l 2 -u 256 -d 0.5 -s 256 -e 50 -b 16 --interval 1
(training)
Select the checkpoint with the highest F1 score on the validation set.
@inproceedings{10.1145/3366424.3383547,
author = {Leonhardt, Jurek and Anand, Avishek and Khosla, Megha},
title = {Boilerplate Removal Using a Neural Sequence Labeling Model},
year = {2020},
isbn = {9781450370240},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3366424.3383547},
doi = {10.1145/3366424.3383547},
booktitle = {Companion Proceedings of the Web Conference 2020},
pages = {226–229},
numpages = {4},
location = {Taipei, Taiwan},
series = {WWW ’20}
}