ppke-nlpg/boilerplateResults

Evaluation results of boilerplate removal tools on two datasets: CleanEval and CleanPortalEval

CleanEval source:

CleanPortalEval source: https://github.com/ppke-nlpg/CleanPortalEval

Evaluation script by Stefan Evert (http://www.lrec-conf.org/proceedings/lrec2008)

Tested algorithms

  • boilerpipe
  • bte
  • goldminer
  • goldminer+onion
  • justext
  • justext+onion

Contents

  • cleanEvalResults: results of the boilerplate removal algorithms on the CleanEval dataset
  • cleanPortalEvalResults: results of the boilerplate removal algorithms on the CleanPortalEval dataset

Reference

If you use the tool, please cite the following paper: More effective boilerplate removal - the GoldMiner algorithm

http://www.gelbukh.com/polibits/2013_48/More%20Effective%20Boilerplate%20Removal%20-%20the%20GoldMiner%20Algorithm.pdf

@article{endredy_more_2013,
	title = {More {Effective} {Boilerplate} {Removal} - the {GoldMiner} {Algorithm}},
	issn = {1870-9044},
	url = {http://polibits.gelbukh.com/2013_48},
	language = {eng},
	number = {48},
	journal = {Polibits - Research journal on Computer science and computer engineering with applications},
	author = {Endr{\'e}dy, Istv{\'a}n and Nov{\'a}k, Attila},
	year = {2013},
	keywords = {boilerplate removal, Corpus building, the web as corpus},
	pages = {79--83}
}
