In [74]:
import pandas as pd

The file 'queries_gender_annotated.csv' needs preprocessing: its separator is ',' and the second column contains free text that may itself contain ','. pandas therefore cannot parse the file reliably.
Solution: wrap the text column in double quotes ("").
Download the files first and put them into the subfolder 'code_own\data'. The compressed files must also be unpacked directly into the data folder:
- curl -O https://raw.githubusercontent.com/navid-rekabsaz/GenderBias_IR/master/resources/queries_gender_annotated.csv
- curl -O https://raw.githubusercontent.com/navid-rekabsaz/GenderBias_IR/master/resources/wordlist_genderspecific.txt
- curl -O https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz
- curl -O https://msmarco.blob.core.windows.net/msmarcoranking/qrels.dev.tsv
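The unpacking step can also be done from Python. The following is a minimal sketch using the standard-library `tarfile` module; the helper name `unpack_archive` and the default target folder are assumptions made for illustration, not part of the original notebook.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, target_dir="data"):
    """Extract a .tar.gz archive into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target)
    return names


# Hypothetical usage, assuming the archive was downloaded into 'data':
# unpack_archive("data/queries.tar.gz", "data")
```

Only unpack archives from a source you trust, since `extractall` writes whatever paths the archive contains.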

In [84]:
PATH_ANNOTATED_DATASET = 'data/queries_gender_annotated.csv'
PATH_TARGET_MODIFIED_ANNOTATED_DATASET = 'data/queries_gender_annotated_modified.csv'

PATH_QUERIES = 'data/queries.dev.tsv'
PATH_QRELS = 'data/qrels.dev.tsv'
PATH_TARGET_MSMACRO = 'data/msmacro.tsv'

In [76]:
def transform_annotated_dataset(source, target):
    """Rewrite the annotated CSV so that the text column is quoted."""
    # queries_gender_annotated.csv cannot be read without errors because the
    # text column may contain ','; only the first and last field are unambiguous
    result_query_gender_annotated = []
    with open(source, 'r') as f:
        for line in f:
            # remove the trailing newline and split by comma
            split = line.rstrip('\n').split(',')
            # everything between the first and the last element is the query text
            text = ','.join(split[1:-1])
            # escape embedded double quotes so the quoted field stays valid CSV
            text = text.replace('"', '""')
            # combine elements and put text in quotes
            new_line = split[0] + ',"' + text + '",' + split[-1] + '\n'
            result_query_gender_annotated.append(new_line)

    # add headers to file
    result_query_gender_annotated.insert(0, 'qid,query,annotation\n')

    # write to file
    with open(target, 'w') as f:
        f.writelines(result_query_gender_annotated)

transform_annotated_dataset(PATH_ANNOTATED_DATASET, PATH_TARGET_MODIFIED_ANNOTATED_DATASET)
df_gender = pd.read_csv(PATH_TARGET_MODIFIED_ANNOTATED_DATASET, header=0, sep=',') 
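As an alternative sketch of the same idea, the first and last fields can be cut off with `split(',', 1)` and `rsplit(',', 1)`, and the quoting can be left to the standard-library `csv` module. The helper name `split_annotated_line` and the two sample rows (including their annotation labels) are made up for illustration.

```python
import csv
import io


def split_annotated_line(line):
    """Split 'qid,text,annotation' where only the text may contain commas."""
    qid, rest = line.rstrip("\n").split(",", 1)
    text, annotation = rest.rsplit(",", 1)
    return qid, text, annotation


# Made-up sample rows; the first one has a comma inside the text column
sample = [
    "1,what is a cat, really,neutral\n",
    "2,who is she,female\n",
]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(["qid", "query", "annotation"])
for line in sample:
    writer.writerow(split_annotated_line(line))

# Reading the result back yields exactly three columns per row
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Letting `csv.writer` do the quoting also takes care of double quotes inside the text, which manual string concatenation has to escape by hand.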

The file 'msmacro.csv' isn't provided either, so we need to assemble the data ourselves.
The MS MARCO dataset is split across multiple files, so we have to combine several of them to obtain the final file.

The authors of our paper state that they used the queries of the *dev* set that have at least one human-judged relevant document. These human relevance judgements are contained in the *qrels.dev.tsv* file.

According to the source website of the qrels dataset, the qrels file is in the *TREC qrels format* [https://github.com/microsoft/msmarco/blob/master/TREC-Deep-Learning-2019.md; https://microsoft.github.io/msmarco/].

This format has 4 columns (TOPIC, ITERATION, DOCUMENT, RELEVANCY) [https://trec.nist.gov/data/qrels_eng/]. In our case, TOPIC corresponds to the query id (qid); together with RELEVANCY, it is the only column relevant to us.
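To make the four-column layout concrete, here is a small sketch that parses a few made-up qrels rows from an in-memory string. The ids are invented; the column names mirror the ones used in this notebook.

```python
import io

import pandas as pd

# Three made-up rows in TREC qrels layout: TOPIC, ITERATION, DOCUMENT, RELEVANCY
sample_qrels = "100\t0\t7001\t1\n100\t0\t7002\t1\n200\t0\t7003\t1\n"

df_sample = pd.read_csv(
    io.StringIO(sample_qrels),
    sep="\t",
    header=None,
    names=["qid", "iter", "docid", "rel"],
)

# Query ids (TOPIC) that have at least one positive relevance judgement
judged_qids = set(df_sample.loc[df_sample["rel"] > 0, "qid"].tolist())
```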


In [77]:
#1. Read the qrels file and the queries file
df_qrels = pd.read_csv(PATH_QRELS, sep='\t', header=None)
df_queries = pd.read_csv(PATH_QUERIES, sep='\t', header=None)

#set column names
df_qrels.columns = ['qid', 'iter', 'docid', 'rel']
df_queries.columns = ['qid', 'query']

In [78]:
#check for duplicates in column qid
print("Duplicates in qid: ", df_qrels.duplicated('qid').sum())


#check for irrelevant rows
print("Number of irrelevant rows: ", df_qrels[df_qrels['rel'] != 1]['rel'].count())


Duplicates in qid:  3695
Number of irrelevant rows:  0


As we can see, some qids occur more than once (a query can have several judged documents), and every row is marked as relevant. Therefore, we keep each qid only once and also remove the queries that are contained in the *annotated gender dataset*.
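The difference between a repeated qid and a fully duplicated row can be seen on a toy example. The frame and the set of annotated qids below are invented for illustration only.

```python
import pandas as pd

# Toy qrels: query 100 has two judged documents, so its qid appears twice
qrels = pd.DataFrame(
    {"qid": [100, 100, 200, 300], "docid": [1, 2, 3, 4], "rel": [1, 1, 1, 1]}
)
annotated_qids = {300}  # hypothetical qids from the annotated gender dataset

# Number of rows whose qid was already seen earlier in the frame
repeated_qid_rows = int(qrels.duplicated("qid").sum())

# Number of rows that are identical in every column
identical_rows = int(qrels.duplicated().sum())

# Unique judged qids minus the annotated ones
remaining_qids = set(qrels["qid"].tolist()) - annotated_qids
```

Here `duplicated("qid")` flags the second row of query 100, while no row is a complete duplicate of another.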


In [87]:
#remove duplicates
relevant_qid = set(df_qrels['qid'].unique())


#remove qids that are in df_gender
relevant_qid = relevant_qid - set(df_gender['qid'].unique())

#select only queries with relevant qid
df_queries_relevant = df_queries[df_queries['qid'].isin(relevant_qid)]

print("Number of relevant queries: ", df_queries_relevant['qid'].count())

df_queries_relevant.to_csv(PATH_TARGET_MSMACRO, index=False, header=True, sep='\t')


Number of relevant queries:  51828


As we can see, we get **51828** relevant queries. The paper we were assigned, however, reports 51,827 relevant queries (one fewer). We cannot explain this difference of a single row.
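One way to narrow down such an off-by-one difference is to compare the id sets of the three inputs. The helper below is a hypothetical diagnostic sketch, not something the paper or the original notebook provides; it assumes each frame has a `qid` column as defined above and makes no claim about which check, if any, explains the gap.

```python
import pandas as pd


def diagnose_counts(df_qrels, df_queries, df_gender):
    """Return simple consistency counts between the three qid sources."""
    qrels_qids = set(df_qrels["qid"].tolist())
    query_qids = set(df_queries["qid"].tolist())
    gender_qids = set(df_gender["qid"].tolist())
    return {
        # judged qids for which the queries file has no text
        "judged_without_query_text": len(qrels_qids - query_qids),
        # qids listed more than once in the queries file (would inflate row counts)
        "duplicate_qids_in_queries": int(df_queries["qid"].duplicated().sum()),
        # annotated qids that have no relevance judgement
        "annotated_not_judged": len(gender_qids - qrels_qids),
    }


# Hypothetical usage with the frames from this notebook:
# print(diagnose_counts(df_qrels, df_queries, df_gender))
```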