TREC 2019 Fair Ranking Track

config contains the configuration files
semanticscholar contains the raw data files from https://api.semanticscholar.org/corpus/. The file extensions were changed to json.gz with the ./src/etl/rename.py script
elasticsearch contains the index data for the semanticscholar corpus
src contains the modules
training contains the provided training docs, the group definitions, a Python script to generate query sequences, and the training corpus for feature engineering
evaluation contains the provided evaluation files and a Python script to validate a submission
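
The renaming step for the semanticscholar folder could look roughly like the sketch below. It is not the content of ./src/etl/rename.py: the function names are hypothetical, and it assumes the downloaded files end in .gz and only need the suffix swapped.

```python
from pathlib import Path


def json_gz_name(name: str, old_suffix: str = ".gz", new_suffix: str = ".json.gz") -> str:
    """Return the renamed file name, or the unchanged name if no rename is needed."""
    # Assumption: raw files end in ".gz"; already renamed files are left alone.
    if name.endswith(new_suffix) or not name.endswith(old_suffix):
        return name
    return name[: -len(old_suffix)] + new_suffix


def rename_corpus_files(folder: str) -> int:
    """Rename every matching file in `folder`; return the number of renamed files."""
    renamed = 0
    for path in sorted(Path(folder).iterdir()):
        target = path.with_name(json_gz_name(path.name))
        if path.is_file() and target != path:
            path.rename(target)
            renamed += 1
    return renamed
```

Calling rename_corpus_files("./semanticscholar") twice is harmless, because files that already end in .json.gz are skipped.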

requirements

training, evaluation and submission files from the fair-trec website
pandas
fairsearchdeltr
elasticsearch python client
elasticsearch docker instance

build the ltr module for elasticsearch in docker

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.0.1
docker build -t elasticsearch-ltr ./config

start the elasticsearch db

docker run -d --rm --name es \
-p 9200:9200 -p 9300:9300 \
-e "discovery.type=single-node" \
-e "http.cors.enabled=true" \
-e "http.cors.allow-origin=*" \
-e "http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization" \
-e "http.cors.allow-credentials=true" \
-e "ES_JAVA_OPTS=-Xms2g -Xmx2g" \
-v `pwd`/elasticsearch:/usr/share/elasticsearch/data \
elasticsearch-ltr
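
The container needs a few seconds before it accepts requests. A minimal sketch for waiting on it from Python, assuming the port mapping above (9200 on localhost) and using the standard _cluster/health endpoint; the function names are hypothetical and not part of this repository.

```python
import json
import time
import urllib.request


def fetch_health(url: str = "http://localhost:9200/_cluster/health") -> dict:
    """Fetch the cluster health document from a running elasticsearch node."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def is_ready(health: dict) -> bool:
    """A single-node cluster often reports yellow, so accept yellow and green."""
    return health.get("status") in ("yellow", "green")


def wait_until_ready(fetch=fetch_health, retries: int = 30, delay: float = 2.0) -> bool:
    """Poll until the node reports a usable status or the retries run out."""
    for _ in range(retries):
        try:
            if is_ready(fetch()):
                return True
        except OSError:
            pass  # connection refused while the container is still starting
        time.sleep(delay)
    return False
```

wait_until_ready() returns True as soon as the node answers with a yellow or green status, and False if it never does within the given retries.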

push the data to es and remove missing ids

python ./src/etl/data_to_es.py
python ./src/etl/remove_missing_ids.py
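
As an illustration of the import step, the sketch below turns the lines of a json.gz corpus file into bulk actions for the elasticsearch python client. It is not the code of data_to_es.py. The index name "semanticscholar", the "id" field of each record and the function names are assumptions.

```python
import gzip
import json
from typing import Iterable, Iterator


def actions_from_lines(lines: Iterable[str], index: str = "semanticscholar") -> Iterator[dict]:
    """Yield one bulk action per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        # Assumption: every record carries its document id in the "id" field.
        yield {"_index": index, "_id": doc["id"], "_source": doc}


def bulk_actions(path: str, index: str = "semanticscholar") -> Iterator[dict]:
    """Stream the actions of one gzipped corpus file without loading it fully."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from actions_from_lines(handle, index)


def index_file(path: str, host: str = "http://localhost:9200") -> int:
    """Send all actions of one file to elasticsearch; return the number indexed."""
    # Imported here so the generators above work without the client installed.
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(host)
    indexed, _errors = helpers.bulk(client, bulk_actions(path), raise_on_error=False)
    return indexed
```

Streaming the actions through a generator keeps memory use flat, which matters for a corpus of this size.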

example

random run

python -m src.runs.random
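
The idea behind a random baseline can be sketched in a few lines. This is not the code of src.runs.random: the function name and the input format (a mapping from query id to candidate doc ids) are assumptions.

```python
import random
from typing import Dict, List, Optional


def random_run(queries: Dict[str, List[str]], seed: Optional[int] = None) -> Dict[str, List[str]]:
    """Return a randomly shuffled copy of every candidate list."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    run = {}
    for qid, doc_ids in queries.items():
        ranking = list(doc_ids)  # copy so the input stays untouched
        rng.shuffle(ranking)
        run[qid] = ranking
    return run
```

Each ranking contains exactly the documents of its input list, only in a different order.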

lambdamart

python -m src.runs.lambdamart

programmatic access

see the example scripts in src/runs

some stats

the corpus contains 46 947 044 unique documents
the training sample contains 4641 documents (4490 unique docs) and 652 queries
the cleaned training sample contains 557 queries, because some doc_ids are missing from the corpus (see ./src/etl/remove_missing_ids.py)
3863 docs from the training sample are included in the corpus
the length of each ranking ranges from 2 to 26 docs with an average of 7 docs
on average, around 50.94% of the docs per query are not relevant
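
Numbers like the ones above can be recomputed with a sketch along these lines. The in-memory format (query id mapped to a list of doc id and relevance pairs) and the function names are assumptions. Whether remove_missing_ids.py drops whole queries or only single documents is not stated here; the sketch drops whole queries.

```python
from statistics import mean
from typing import Dict, List, Set, Tuple

Ranking = List[Tuple[str, int]]  # (doc_id, relevance) pairs of one query


def drop_missing(queries: Dict[str, Ranking], corpus_ids: Set[str]) -> Dict[str, Ranking]:
    """Keep only queries whose documents are all present in the corpus."""
    return {
        qid: docs
        for qid, docs in queries.items()
        if all(doc_id in corpus_ids for doc_id, _ in docs)
    }


def ranking_stats(queries: Dict[str, Ranking]) -> dict:
    """Summarize ranking lengths and the mean share of non-relevant docs."""
    rankings = [docs for docs in queries.values() if docs]  # skip empty rankings
    lengths = [len(docs) for docs in rankings]
    shares = [sum(1 for _, rel in docs if rel == 0) / len(docs) for docs in rankings]
    return {
        "queries": len(rankings),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "mean_non_relevant_share": mean(shares),
    }
```

ranking_stats expects at least one non-empty ranking, otherwise min() and mean() raise an error.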

modules

runs

contains the final run scripts that build on all other modules

etl

imports the raw data and maps it to the elasticsearch index

interface

processes the input training and group files (inputhandler)
layer between program modules and elasticsearch (corpus)
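
To give an idea of what such a layer does, the sketch below builds an elasticsearch ids query for the documents of one ranking and restores the requested order from the response hits. It is not the corpus module itself; the function names and the optional field filter are assumptions.

```python
from typing import Iterable, List


def ids_query(doc_ids: Iterable[str], fields: Iterable[str] = ()) -> dict:
    """Build a search body that fetches the given documents by id."""
    ids = list(dict.fromkeys(doc_ids))  # de-duplicate while keeping the order
    body = {"size": len(ids), "query": {"ids": {"values": ids}}}
    fields = list(fields)
    if fields:
        body["_source"] = fields  # restrict the returned fields
    return body


def order_hits(doc_ids: Iterable[str], hits: List[dict]) -> List[dict]:
    """Return the hit sources in the order of `doc_ids`, skipping missing ids."""
    by_id = {hit["_id"]: hit["_source"] for hit in hits}
    return [by_id[doc_id] for doc_id in doc_ids if doc_id in by_id]
```

With the python client, the body would be passed to a search call and order_hits would receive response["hits"]["hits"]. Ids that are missing from the corpus are dropped from the result.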

reranker

contains the learning to rank model that reranks the document sets, trained with the DELTR algorithm
provides implementations of the evaluation measures (evaluation.py)
provides a module to generate features from the corpus (features.py)
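
For orientation, an exposure-based measure can be sketched as below. This is not the code of evaluation.py and not necessarily the official track measure. It assumes a cascade-style browsing model in which a user continues with probability gamma and stops at a document with probability stop times its relevance; both default values and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def exposures(relevances: Sequence[float], gamma: float = 0.5, stop: float = 0.7) -> List[float]:
    """Probability that a user reaches each position of the ranking."""
    result = []
    reach = 1.0  # the first position is always seen
    for rel in relevances:
        result.append(reach)
        # The user moves on only if they continue and did not stop at this doc.
        reach *= gamma * (1.0 - stop * rel)
    return result


def group_exposure_share(
    groups: Sequence[str],
    relevances: Sequence[float],
    gamma: float = 0.5,
    stop: float = 0.7,
) -> Dict[str, float]:
    """Share of the total exposure that each group receives in one ranking."""
    totals = defaultdict(float)
    for group, exposure in zip(groups, exposures(relevances, gamma, stop)):
        totals[group] += exposure
    overall = sum(totals.values())
    return {group: value / overall for group, value in totals.items()}
```

Comparing these shares with each group's share of relevant documents gives a rough picture of how fairly a ranking distributes attention.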

utils

contains modules for command line args, logger file initialization and IO functionality

test

contains test files and scripts
