# Special data processing

### Two things to do before we start:
1. Point environment variable `COLLECT_ROOT` to the collection root.
2. Change directory to the location of the installed scripts and binaries.

In [1]:
%env COLLECT_ROOT=/home/leo/flexneuart_collections

env: COLLECT_ROOT=/home/leo/flexneuart_collections


In [2]:
cd /home/leo/flexneuart_scripts/

/home/leo/flexneuart_scripts


### QA data: weak supervision with answer-based QRELs

For QA collections, the set of relevant passages is obtained by retrieving the top-K passages with a candidate provider and checking whether each passage contains an answer as a substring. The Facebook Wikipedia DPR data is shipped with relevance information obtained in this way. However, not all collections are. Furthermore, such judgments depend heavily on the quality of the candidate generator. Ideally, when multiple retrieval systems are used and compared, the sets of relevant documents (from their respective top-K sets) need to be combined (i.e., **pooled**). Our framework supports this functionality. To this end, each query entry in a JSONL file needs to have a special field `answer_list`, e.g.:


```
{
    "DOCNO": "dev_official_0",
    "text": "sing love reba",
    "text_raw": "who sings does he love me with reba",
    "answer_list": [
        "Linda Davis"
    ]
}
```

Then the respective set of QRELs can be generated using the following command:

In [3]:
!data_convert/create_answ_based_qrels.sh  \
    wikipedia_dpr_nq_sample \
    bitext \
    text_raw \
    qrels_generated_from_bitext_queries.txt

Using collection root: /home/leo/flexneuart_collections
Collection directory:      /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample
Data directory:            /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/input_data/bitext
Output file:               /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/qrels_generated_from_bitext_queries.txt
Candidate provider options: -u lucene_index/ 
# of candidate documents:  1000
Field name:                text_raw
Forward index directory:   forward_index/
Query file name prefix:    /home/leo/flexneuart_collections/wikipedia_dpr_nq_sample/input_data/bitext/QuestionFields
# of threads:              8
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.AnswerBasedQRELGenerator - Candidate provider type: lucene URI: lucene_index/ config: null
[main] INFO edu.cmu.lti.oaqa.flexneuart.apps.AnswerBasedQRELGenerator - Number of threads: 8
[main] INFO edu.cmu.lti.oaqa.flexneuart.resources.ResourceManager - Resource manager initialization. Re
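
The idea behind answer-based QRELs and pooling can be illustrated with a minimal Python sketch. It is not the framework's implementation: the function names, the case-insensitive matching, the TREC-style output lines, and the toy passages and runs are all assumptions made for illustration.

```python
from collections import defaultdict


def contains_answer(passage_text, answer_list):
    """True if any answer occurs in the passage as a substring.

    Case-insensitive matching is an assumption of this sketch.
    """
    text = passage_text.lower()
    return any(answer.lower() in text for answer in answer_list)


def pooled_answer_qrels(queries, runs, passages, top_k=1000):
    """Pool the top-k candidates of several systems and judge them by answer match.

    queries:  dict query id -> list of answer strings (the "answer_list" field)
    runs:     list of dicts (one per system): query id -> ranked passage ids
    passages: dict passage id -> passage text
    """
    qrels = defaultdict(set)
    for query_id, answer_list in queries.items():
        pool = set()
        for run in runs:
            pool.update(run.get(query_id, [])[:top_k])
        for passage_id in pool:
            if contains_answer(passages[passage_id], answer_list):
                qrels[query_id].add(passage_id)
    return qrels


def to_trec_lines(qrels):
    """Format judgments as TREC-style lines: <query id> 0 <doc id> <grade>."""
    return [f"{qid} 0 {pid} 1" for qid in sorted(qrels) for pid in sorted(qrels[qid])]


# Toy data (hypothetical passages and runs, for illustration only).
queries = {"dev_official_0": ["Linda Davis"]}
passages = {
    "p1": "Does He Love You is a duet by Reba McEntire and Linda Davis.",
    "p2": "Reba McEntire is an American country singer.",
    "p3": "linda davis is an American country music singer.",
}
run_a = {"dev_official_0": ["p1", "p2"]}
run_b = {"dev_official_0": ["p2", "p3"]}

qrels = pooled_answer_qrels(queries, [run_a, run_b], passages)
print("\n".join(to_trec_lines(qrels)))
```

Neither toy run finds both relevant passages on its own, but the pooled set contains both, which is why pooling matters when several retrieval systems are compared.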

### Generating parallel corpus (bitext) without explicitly paired data

A set of queries paired with short relevant passages can be used to train a lexical translation model (IBM Model 1), whose fusion with BM25 can be quite effective as a ranking model. In the case of QA data, such a corpus can be easily created by pairing questions with sentences containing an answer. This is what we do when we process the Wikipedia DPR corpus. However, such a pairing generally does not exist for more generic ad hoc retrieval collections. It is still possible to create paired data of reasonable quality by splitting a relevant passage into multiple short chunks and pairing each chunk with the respective query. This works especially well for short passages or short information snippets such as titles, URLs, or headings.
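
As a rough illustration of chunk-based pairing, the following Python sketch splits a relevant passage into whitespace-tokenized chunks and pairs each chunk with the query. The function names and the `max_ratio` parameter (chunk length relative to query length) are hypothetical and are not meant to describe the arguments or behavior of the framework's scripts.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def make_bitext(query_text, passage_text, max_ratio=2):
    """Pair the query with every chunk of a relevant passage.

    Each chunk has at most max_ratio * (number of query tokens) tokens, so that
    both sides of a pair have comparable lengths. max_ratio is a hypothetical
    parameter of this sketch.
    """
    query_tokens = query_text.split()
    chunk_size = max(1, max_ratio * len(query_tokens))
    chunks = split_into_chunks(passage_text.split(), chunk_size)
    return [(query_text, " ".join(chunk)) for chunk in chunks]


# Toy example: a 3-token query and a 14-token passage give chunks of 6, 6, and 2 tokens.
pairs = make_bitext(
    "sing love reba",
    "does he love you is a duet recorded by reba mcentire and linda davis",
)
for query_side, passage_side in pairs:
    print(query_side, "|||", passage_side)
```

Each resulting pair plays the role of a "sentence pair" in a parallel corpus, with the query on one side and a passage chunk on the other.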

Here is an example of creating such an artificial bitext corpus:

In [None]:
!./giza/export_bitext_plain.sh \
    wikipedia_dpr_nq_sample \
    text_bert_tok text_bert_tok \
    2 \
    -bitext_out_subdir bitext_generated

Then we can train Model 1 as follows:

In [None]:
!./giza/create_tran.sh wikipedia_dpr_nq_sample text_bert_tok \
   -bitext_subdir bitext_generated \
   -model1_subdir giza_generated

We now need to prune the translation table and store it in a special format:

In [None]:
!min_tran_prob=0.001 ; top_word_qty=100000 ; echo $min_tran_prob ; \
./giza/filter_tran_table_and_voc.sh \
    wikipedia_dpr_nq_sample \
    text_bert_tok \
    $min_tran_prob \
    $top_word_qty \
    -model1_subdir giza_generated
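
Conceptually, pruning a translation table can look like the Python sketch below. It assumes that `min_tran_prob` is a lower bound on the translation probability and that `top_word_qty` limits the vocabulary to the most frequent words; the actual filtering logic and the output format of the script may differ. The function name and the toy data are hypothetical.

```python
from collections import Counter


def prune_translation_table(tran_table, word_freq, min_tran_prob, top_word_qty):
    """Keep only sufficiently probable entries between frequent words.

    tran_table: dict (source word, target word) -> translation probability
    word_freq:  dict word -> corpus frequency
    """
    # Vocabulary restricted to the top_word_qty most frequent words.
    kept_words = {word for word, _ in Counter(word_freq).most_common(top_word_qty)}
    return {
        (src, tgt): prob
        for (src, tgt), prob in tran_table.items()
        if prob >= min_tran_prob and src in kept_words and tgt in kept_words
    }


# Toy data (hypothetical probabilities and frequencies).
word_freq = {"sing": 50, "sings": 40, "love": 30, "zyzzyva": 1}
tran_table = {
    ("sing", "sings"): 0.4,
    ("sing", "love"): 0.0005,   # dropped: probability below the threshold
    ("sing", "zyzzyva"): 0.2,   # dropped: target word is not frequent enough
    ("love", "love"): 0.9,
}

pruned = prune_translation_table(tran_table, word_freq, min_tran_prob=0.001, top_word_qty=3)
print(pruned)
```

Pruning of this kind trades a small loss in coverage for a much smaller table, which keeps the Model 1 scores cheaper to compute at query time.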