
michaelshekasta/boilernet

 
 


BoilerNet

This is the implementation of our paper Boilerplate Removal using a Neural Sequence Labeling Model.

Requirements

This code was tested with Python 3.7.5 and the following packages:

  • tensorflow==2.1.0
  • numpy==1.17.3
  • tqdm==4.39.0
  • nltk==3.4.5
  • beautifulsoup4==4.8.1
  • html5lib==1.0.1
  • scikit-learn==0.21.3
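
To check an existing environment against the pins above, a small script along these lines may help. It is a minimal sketch and not part of the repository: the names `PINNED`, `installed_versions`, and `find_mismatches` are hypothetical, and it assumes `importlib.metadata` is available (Python 3.8+; on 3.7 the `importlib_metadata` backport provides the same API).

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7: needs `pip install importlib_metadata`
    import importlib_metadata as metadata

# Versions listed in the Requirements section.
PINNED = {
    "tensorflow": "2.1.0",
    "numpy": "1.17.3",
    "tqdm": "4.39.0",
    "nltk": "3.4.5",
    "beautifulsoup4": "4.8.1",
    "html5lib": "1.0.1",
    "scikit-learn": "0.21.3",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(pinned, installed):
    """Return {name: (expected, found)} for packages that differ from the pins."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }


if __name__ == "__main__":
    report = find_mismatches(PINNED, installed_versions(PINNED))
    for name, (expected, found) in report.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty report means the environment matches the tested versions exactly; other versions may still work, but they are not what the code was tested with.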

Usage

The datasets are available for download here:

Preprocessing

usage: preprocess.py [-h] [-s SPLIT_DIR] [-w NUM_WORDS] [-t NUM_TAGS]
                     [--save SAVE]
                     DIRS [DIRS ...]

positional arguments:
  DIRS                  A list of directories containing the HTML files

optional arguments:
  -h, --help            show this help message and exit
  -s SPLIT_DIR, --split_dir SPLIT_DIR
                        Directory that contains train-/dev-/testset split
  -w NUM_WORDS, --num_words NUM_WORDS
                        Only use the top-k words
  -t NUM_TAGS, --num_tags NUM_TAGS
                        Only use the top-l HTML tags
  --save SAVE           Where to save the results

After downloading and extracting one of the zip files above, preprocess your dataset, for example:

python3 net/preprocess.py googletrends-2017/prepared_html/ -s googletrends-2017/50-30-100-split/ -w 1000 -t 50 --save googletrends_data
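
The `-w` and `-t` options cap the vocabulary at the most frequent words and HTML tags. The snippet below illustrates what such a top-k cutoff typically looks like. It is not taken from `preprocess.py`: the function names `top_k_vocab` and `encode`, and the choice to reserve index 0 for everything outside the vocabulary, are assumptions made for this example.

```python
from collections import Counter


def top_k_vocab(sequences, k):
    """Map the k most frequent items to indices 1..k; index 0 is left for the rest."""
    counts = Counter(item for seq in sequences for item in seq)
    return {item: idx for idx, (item, _) in enumerate(counts.most_common(k), start=1)}


def encode(sequence, vocab):
    """Replace each item by its index, using 0 for items outside the vocabulary."""
    return [vocab.get(item, 0) for item in sequence]


# Toy corpus: "the" occurs 3 times, "sat" twice, every other word once.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
vocab = top_k_vocab(docs, k=2)
print(vocab)                                  # {'the': 1, 'sat': 2}
print(encode(["the", "cat", "sat"], vocab))   # [1, 0, 2]
```

The same idea applies to HTML tags with `-t`: only the most frequent tags get their own index, and rare ones share a single fallback.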

Training

The training script handles both training and evaluation on the dev and test sets:

usage: train.py [-h] [-l NUM_LAYERS] [-u HIDDEN_UNITS] [-d DROPOUT]
                [-s DENSE_SIZE] [-e EPOCHS] [-b BATCH_SIZE]
                [--interval INTERVAL] [--working_dir WORKING_DIR]
                DATA_DIR

positional arguments:
  DATA_DIR              Directory of files produced by the preprocessing
                        script

optional arguments:
  -h, --help            show this help message and exit
  -l NUM_LAYERS, --num_layers NUM_LAYERS
                        The number of RNN layers
  -u HIDDEN_UNITS, --hidden_units HIDDEN_UNITS
                        The number of hidden LSTM units
  -d DROPOUT, --dropout DROPOUT
                        The dropout percentage
  -s DENSE_SIZE, --dense_size DENSE_SIZE
                        Size of the dense layer
  -e EPOCHS, --epochs EPOCHS
                        The number of epochs
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        The batch size
  --interval INTERVAL   Calculate metrics and save the model after this many
                        epochs
  --working_dir WORKING_DIR
                        Where to save checkpoints and logs

For example, the model can be trained like this:

python3 net/train.py googletrends_data --working_dir googletrends_train

Hyperparameters

To reproduce the results reported in the paper, use the following hyperparameters:

  • -s googletrends-2017/50-30-100-split -w 1000 -t 50 (preprocessing)
  • -l 2 -u 256 -d 0.5 -s 256 -e 50 -b 16 --interval 1 (training)
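
The two bullet points above can be combined with the earlier example commands into a small driver script. This is a sketch only: the directory names are the ones used in the examples in this README, the function names `preprocess_cmd` and `train_cmd` are hypothetical, and the commands are merely printed, so nothing runs unless you pass them to `subprocess.run` yourself.

```python
import shlex


def preprocess_cmd(html_dir, split_dir, out_dir):
    """Build the preprocessing command with the paper settings (-w 1000 -t 50)."""
    return ["python3", "net/preprocess.py", html_dir,
            "-s", split_dir, "-w", "1000", "-t", "50", "--save", out_dir]


def train_cmd(data_dir, working_dir):
    """Build the training command with the paper hyperparameters."""
    return ["python3", "net/train.py", data_dir,
            "-l", "2", "-u", "256", "-d", "0.5", "-s", "256",
            "-e", "50", "-b", "16", "--interval", "1",
            "--working_dir", working_dir]


commands = [
    preprocess_cmd("googletrends-2017/prepared_html/",
                   "googletrends-2017/50-30-100-split/",
                   "googletrends_data"),
    train_cmd("googletrends_data", "googletrends_train"),
]

# Print shell-safe command lines for inspection or copy-pasting.
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Note that `-s` means the split directory in `preprocess.py` but the dense layer size in `train.py`, which is why the two commands are built separately.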

Select the checkpoint with the highest F1 score on the validation set.
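
Picking the best checkpoint amounts to an argmax over the per-epoch validation scores. The sketch below assumes you have already collected those scores into `(epoch, f1)` pairs, for instance from the logs in the working directory. The log format is not described here, so the input data and the name `best_epoch` are purely illustrative.

```python
def best_epoch(scores):
    """Return the (epoch, f1) pair with the highest F1; ties go to the earliest epoch."""
    if not scores:
        raise ValueError("no validation scores given")
    return max(scores, key=lambda pair: (pair[1], -pair[0]))


# Made-up numbers, for illustration only.
dev_f1 = [(1, 0.71), (2, 0.78), (3, 0.78), (4, 0.75)]
epoch, f1 = best_epoch(dev_f1)
print(f"use the checkpoint from epoch {epoch} (dev F1 = {f1:.2f})")
```

Breaking ties in favor of the earlier epoch is a design choice of this example, not something the paper or the training script prescribes.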

Citation

@inproceedings{10.1145/3366424.3383547,
author = {Leonhardt, Jurek and Anand, Avishek and Khosla, Megha},
title = {Boilerplate Removal Using a Neural Sequence Labeling Model},
year = {2020},
isbn = {9781450370240},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3366424.3383547},
doi = {10.1145/3366424.3383547},
booktitle = {Companion Proceedings of the Web Conference 2020},
pages = {226–229},
numpages = {4},
location = {Taipei, Taiwan},
series = {WWW ’20}
}
