This project contains two components:

- `benchmark_generator`: code for creating benchmark data for StackOverflow-TechNotes, an IT-domain-specific IR benchmark built from Stack Overflow public forum posts and the IBM Technotes corpus.
- `benchmark_evaluator`: code for simulating DFS experiments.
For details about this dataset, please see:

```bibtex
@inproceedings{dfs-2021-internlp,
  title = {Dynamic Facet Selection by Maximizing Graded Relevance},
  author = {Glass, Michael and Chowdhury, Md Faisal Mahbub and Deng, Yu and Mahindru, Ruchi and Fauceglia, Nicolas Rodolfo and Gliozzo, Alfio and Mihindukulasooriya, Nandana},
  booktitle = {Proceedings of the 1st Workshop on Interactive Learning for Natural Language Processing (InterNLP 2021)},
  year = {2021},
  publisher = {Association for Computational Linguistics}
}
```
First, download the TechQA technote corpus (which contains `full_technote_collection.txt.bz2`) from https://leaderboard.techqa.us-east.containers.appdomain.cloud/. This requires a free registration.

The script below downloads the forum post histories from archive.org. The downloaded files are `stackoverflow.com-Posts.7z` (15.1 GB) and `stackoverflow.com-PostHistory.7z` (26.5 GB). Run the script in the same directory where you downloaded `full_technote_collection.txt.bz2`; it will produce the dataset with a train and test split. You must have 7z installed (`sudo apt-get install p7zip-full`):

```shell
sh benchmark_generator/run_generator.sh
```
Use the following script to run the experiments. Before running it, set the correct parameter values for the input/output file and directory paths:

```shell
sh run_experiment.sh
```
If you already have a `.tsv` file generated with the results, then run:

```shell
python -m benchmark_evaluator.util.results_reader YOUR_TSV_FILE_PATH
```

Each line of the `.tsv` file should have the following tab-separated format:

```
Query ID \t Query \t BaseLine_TIP_RANK \t DFS_TIP_RANK \t Selected_Facets (in order)
```
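As an illustration, a results file in this format can be loaded with a few lines of Python. This is a sketch, not part of the repository: the helper name `read_results` is made up, and the two rank columns are assumed to hold plain integers.

```python
import csv

def read_results(tsv_path):
    """Parse a DFS results .tsv into a list of dicts.

    Expected tab-separated columns, per the format above:
    Query ID, Query, BaseLine_TIP_RANK, DFS_TIP_RANK, Selected_Facets (in order).
    """
    rows = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for record in csv.reader(f, delimiter="\t"):
            query_id, query, baseline_rank, dfs_rank, facets = record[:5]
            rows.append({
                "query_id": query_id,
                "query": query,
                # assumption: ranks are written as integers
                "baseline_tip_rank": int(baseline_rank),
                "dfs_tip_rank": int(dfs_rank),
                # facets are recorded in selection order
                "selected_facets": facets,
            })
    return rows
```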
The following output files will be written inside the `resources\tmp` folder:

- `summary_hits.csv`: contains summary results for all queries
- `result_ranks.tsv`: contains results for each individual query
- `summary_per_iteration.csv`: stats calculated by restricting the number of iterations, i.e. the maximum number of facet selections by the oracle
- `bottom_queries_summary.csv`: (view at BOTTOM) stats calculated by restricting to those queries for which the baseline (ES) rank of the desired document (in the returned results) is greater than a threshold rank
- `top_queries_summary.csv`: (view at TOP) stats calculated by restricting to those queries for which the baseline (ES) rank of the desired document (in the returned results) is at k or lower
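To make the rank-based stats concrete, a hits@k comparison over per-query ranks can be sketched as below. This is illustrative only: the rank values are made up, and the evaluator's actual summary metrics may differ.

```python
def hits_at_k(ranks, k):
    """Fraction of queries whose desired document ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical per-query ranks of the desired document
baseline_ranks = [12, 3, 48, 1, 7]   # baseline (ES) ranks
dfs_ranks = [3, 1, 9, 1, 2]          # DFS ranks for the same queries

for k in (1, 5, 10):
    print(f"hits@{k}: baseline={hits_at_k(baseline_ranks, k):.2f}, "
          f"DFS={hits_at_k(dfs_ranks, k):.2f}")
```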