This repo contains the data and code used in this CIKM 2015 paper: http://dl.acm.org/citation.cfm?id=2806613
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
README.md
calculate_readability.py
check_num_words.py
correlations.py
distrib.tar.gz
lucene_html.out
readability_scores_boilerpipe_cont.tar.gz
readability_scores_boilerpipe_ncont.tar.gz
readability_scores_bs4_cont.tar.gz
readability_scores_bs4_ncont.tar.gz
readability_scores_justext_cont.tar.gz
readability_scores_justext_ncont.tar.gz
unpack_dat.py

README.md

Source Code and data used for CIKM Readability 2015 short paper

You can obtain the text at:

  1. ACM

  2. Research Gate

Dependences

The code used in this work takes advantage of the ReadabilityCalculator python module, that can be downloaded using pip:

$ pip install ReadabilityCalculator

Data:

readability_scores*.tar.gz

It contains the readability scores for every document from CLEF eHealth 2014/2015 dataset.

distrib.tar.gz

It contains the distribution of words and sentences for each preprocessing variant for the documents in CLEF eHealth 2014/2015 dataset.

lucene_html.out

It is the lucene result list based on a default VSM search using the topics from CLEF eHealth 2014.


Code:

check_num_words.py

Script that creates table 2 from the paper.

unpack_dat.py

Script to unpack the original '.dat' files from CLEF and preprocess them using any of the boilerplate removal options.

calculate_readability.py

Python script used to create the files in readability_score.tar.gz

correlations.py

Calculates the correlations between the ranking list generated by different readability measure for the same Lucene based initial ranked list.