# Data preparation/downloading/processing

### First, we create the root collection directory and point the environment variable `COLLECT_ROOT` to this directory

In [1]:
!mkdir -p /home/leo/flexneuart_collections

In [2]:
%env COLLECT_ROOT=/home/leo/flexneuart_collections

env: COLLECT_ROOT=/home/leo/flexneuart_collections


In [34]:
!bash -c "echo $COLLECT_ROOT"

/home/leo/flexneuart_collections
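
The same setup can also be done from Python instead of shell magics. The following is only a minimal sketch using the standard library; the helper name `setup_collect_root` is hypothetical and not part of FlexNeuART:

```python
import os


def setup_collect_root(path):
    """Create the collection root (like `mkdir -p`) and export COLLECT_ROOT."""
    # exist_ok=True mirrors `mkdir -p`: no error if the directory already exists.
    os.makedirs(path, exist_ok=True)
    # Processes started afterwards from this Python session inherit this value.
    os.environ["COLLECT_ROOT"] = path
    return path
```

For example, `setup_collect_root("/home/leo/flexneuart_collections")` would reproduce the two cells above.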


## Downloading preprocessed data

This notebook works with a sub-sample of the Natural Questions collection (__Wikipedia DPR__) prepared by [Karpukhin et al.](https://github.com/facebookresearch/DPR). This subset includes all the questions from __Wikipedia DPR__, but only a sample of passages (about one million).

The generation of this subset is briefly described below, but for your convenience we provide an archive with already processed data.

Change the directory, then download and unpack the data:

In [35]:
cd /home/leo/flexneuart_collections

/home/leo/flexneuart_collections


In [22]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_conf_2021-09-15.tar.bz2

--2021-09-16 15:02:07--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_conf_2021-09-15.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2657 (2.6K) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_conf_2021-09-15.tar.bz2’


2021-09-16 15:02:07 (113 MB/s) - ‘wikipedia_dpr_nq_sample_conf_2021-09-15.tar.bz2’ saved [2657/2657]



In [31]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2

--2021-09-16 19:13:14--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 414972906 (396M) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2’


2021-09-16 19:15:47 (2.60 MB/s) - ‘wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2’ saved [414972906/414972906]



In [7]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2

--2021-09-15 23:57:25--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2722927168 (2.5G) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2’


2021-09-16 00:40:55 (1019 KB/s) - ‘wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2’ saved [2722927168/2722927168]



In [25]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2  

--2021-09-16 15:33:41--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43284754 (41M) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2’


2021-09-16 15:34:18 (1.12 MB/s) - ‘wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2’ saved [43284754/43284754]



In [37]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2  

--2021-09-16 19:53:30--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 55835230 (53M) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2.1’


2021-09-16 19:53:54 (2.23 MB/s) - ‘wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2.1’ saved [55835230/55835230]
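
The five archives downloaded above follow one naming pattern, so their URLs can be generated rather than typed by hand. This is a small sketch based only on the file names visible in the `wget` logs; the names `BASE_URL`, `DATE_TAG`, `PARTS`, `archive_name`, and `archive_url` are illustrative and not part of FlexNeuART:

```python
BASE_URL = "http://boytsov.info/datasets"
DATE_TAG = "2021-09-15"
# Archive parts as they appear in the download commands above.
PARTS = ["conf", "models", "data", "bitext", "embed"]


def archive_name(part):
    """File name of one archive, e.g., for part='conf'."""
    return f"wikipedia_dpr_nq_sample_{part}_{DATE_TAG}.tar.bz2"


def archive_url(part):
    """Full download URL of one archive."""
    return f"{BASE_URL}/{archive_name(part)}"


if __name__ == "__main__":
    for part in PARTS:
        print(archive_url(part))
```

Note that `wget` appends a numeric suffix such as `.1` when a file with the same name already exists, which is what happened with the `embed` archive in the log above. In that case, the most recent download is the file with the suffix.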



In [14]:
!tar jxvf wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2

wikipedia_dpr_nq_sample/
wikipedia_dpr_nq_sample/input_data/
wikipedia_dpr_nq_sample/input_data/train_fusion/
wikipedia_dpr_nq_sample/input_data/train_fusion/QuestionFields.jsonl
wikipedia_dpr_nq_sample/input_data/train_fusion/qrels.txt
wikipedia_dpr_nq_sample/input_data/train_fusion/QuestionFields.bin
wikipedia_dpr_nq_sample/input_data/dev/
wikipedia_dpr_nq_sample/input_data/dev/QuestionFields.jsonl
wikipedia_dpr_nq_sample/input_data/dev/qrels.txt
wikipedia_dpr_nq_sample/input_data/dev/QuestionFields.bin
wikipedia_dpr_nq_sample/input_data/bitext/
wikipedia_dpr_nq_sample/input_data/bitext/QuestionFields.jsonl
wikipedia_dpr_nq_sample/input_data/bitext/qrels.txt
wikipedia_dpr_nq_sample/input_data/pass_sample/
wikipedia_dpr_nq_sample/input_data/pass_sample/AnswerFields.bin
wikipedia_dpr_nq_sample/input_data/pass_sample/AnswerFields.jsonl.gz
wikipedia_dpr_nq_sample/input_data/dev_official/
wikipedia_dpr_nq_sample/input_data/dev_official/QuestionFields.jsonl
wikipedia_dpr_nq_sample/input_da

In [23]:
!tar jxvf wikipedia_dpr_nq_sample_conf_2021-09-15.tar.bz2

wikipedia_dpr_nq_sample/
wikipedia_dpr_nq_sample/model_conf/
wikipedia_dpr_nq_sample/model_conf/vanilla_bert.json
wikipedia_dpr_nq_sample/model_conf/vanilla_bert_with_scores.json
wikipedia_dpr_nq_sample/derived_data/
wikipedia_dpr_nq_sample/derived_data/ir_models/
wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/
wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/todays_experiment/
wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/todays_experiment/0/
wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/todays_experiment/0/vanilla_bert.json
wikipedia_dpr_nq_sample/exper_desc.best/
wikipedia_dpr_nq_sample/exper_desc.best/extractors/
wikipedia_dpr_nq_sample/exper_desc.best/extractors/avgembed.json
wikipedia_dpr_nq_sample/exper_desc.best/extractors/cedr8080.json
wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25_ance_exported_sparse.json
wikipedia_dpr_nq_sample/exper_desc.best/extractors/bm25=text+model1=text_bert_tok+lambda=0.3+pro

In [26]:
!tar jxvf wikipedia_dpr_nq_sample_bitext_2021-09-15.tar.bz2  

wikipedia_dpr_nq_sample/derived_data/bitext/
wikipedia_dpr_nq_sample/derived_data/bitext/answer_text_unlemm
wikipedia_dpr_nq_sample/derived_data/bitext/question_text_bert_tok
wikipedia_dpr_nq_sample/derived_data/bitext/answer_text_bert_tok
wikipedia_dpr_nq_sample/derived_data/bitext/question_text_unlemm


In [32]:
!tar jxvf wikipedia_dpr_nq_sample_models_2021-09-15.tar.bz2

wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/model.best


In [38]:
!tar jxvf wikipedia_dpr_nq_sample_embed_2021-09-15.tar.bz2  

wikipedia_dpr_nq_sample/derived_data/embeddings/
wikipedia_dpr_nq_sample/derived_data/embeddings/glove/
wikipedia_dpr_nq_sample/derived_data/embeddings/glove/glove.6B.50d.txt.bz2
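
If `tar` is not available, the archives can probably be unpacked with Python's standard `tarfile` module instead. This is a hedged sketch, not the procedure used to produce the output above; the helper name `unpack_archive` is hypothetical:

```python
import tarfile


def unpack_archive(archive_path, dest_dir="."):
    """Extract a .tar.bz2 archive into dest_dir and return the member names."""
    # Mode "r:bz2" reads a bzip2-compressed tar file, like the `j` flag of tar.
    with tarfile.open(archive_path, "r:bz2") as tar:
        names = tar.getnames()
        # Only extract archives from a source you trust.
        tar.extractall(dest_dir)
    return names
```

Calling `unpack_archive("wikipedia_dpr_nq_sample_data_2021-09-15.tar.bz2")` from the collection root should then create the `wikipedia_dpr_nq_sample` subdirectory, as `tar jxvf` does.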


#### For all the following experiments we use scripts installed via `flexneuart_install_extra.sh`. They must be called from their respective installation directory:

In [19]:
cd /home/leo/flexneuart_scripts/

/home/leo/flexneuart_scripts


#### Carry out a basic sanity check:

In [20]:
!report/get_basic_collect_stat.sh wikipedia_dpr_nq_sample

Using collection root: /home/leo/flexneuart_collections
Checking input sub-directory: bitext
Checking input sub-directory: dev
Checking input sub-directory: dev_official
Checking input sub-directory: pass_sample
Found indexable data file: pass_sample/AnswerFields.jsonl.gz
Checking input sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
getIndexQueryDataDirs return value:  pass_sample AnswerFields.jsonl.gz bitext,dev,dev_official,train_fusion
Using data file: AnswerFields.jsonl.gz
Index dirs: pass_sample
Query dirs: bitext dev dev_official train_fusion
Queries/questions:
bitext 53880
dev 2500
dev_official 6515
train_fusion 2500
Documents/passages/answers:
pass_sample 774392
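
The counts above can be cross-checked by hand. The sketch below assumes that each non-empty line of a `QuestionFields.jsonl` or `AnswerFields.jsonl.gz` file holds exactly one entry; this is an assumption about the file layout, and `count_entries` is a hypothetical helper rather than a FlexNeuART function:

```python
import gzip


def count_entries(path):
    """Count non-empty lines in a JSONL file, transparently handling .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    # "rt" opens in text mode for both plain and gzip-compressed files.
    with opener(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

Under this assumption, applying it to `wikipedia_dpr_nq_sample/input_data/dev/QuestionFields.jsonl` under `COLLECT_ROOT` should give the `dev` number reported by the script.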


## Preprocessing in more detail: this is for information purposes only because the downloaded data is already preprocessed

The download and conversion scripts can be found in the directory `data_convert/wikipedia_dpr`.

In [None]:
!mkdir -p $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw

### Converting passages and queries

In [None]:
!data_convert/wikipedia_dpr/download_dpr_passages.sh $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw

In [None]:
!data_convert/wikipedia_dpr/download_dpr_queries.sh nq $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw

### Randomly split the training set into new training and development sets. This script also converts the data into the FlexNeuART format

In [None]:
!data_convert/wikipedia_dpr/split_and_convert_dpr_queries.sh \
    wikipedia_dpr_nq_sample \
    collections/wikipedia_dpr_nq_sample/input_raw/ \
    nq \
    -partition_sizes ,5000,2500 
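
To illustrate what a random split with a size specification such as `,5000,2500` might look like, here is a small self-contained sketch. It assumes that an empty entry means "all remaining items"; this interpretation, as well as the function names `parse_partition_sizes` and `random_split`, are assumptions made for illustration and do not describe the actual script:

```python
import random


def parse_partition_sizes(spec, total):
    """Turn a spec such as ",5000,2500" into concrete partition sizes.

    Assumption: at most one entry is empty, and it receives the leftover items.
    """
    raw = [s.strip() for s in spec.split(",")]
    fixed = sum(int(s) for s in raw if s)
    n_empty = sum(1 for s in raw if not s)
    if n_empty > 1 or fixed > total:
        raise ValueError(f"Cannot fit partition spec {spec!r} into {total} items")
    return [int(s) if s else total - fixed for s in raw]


def random_split(items, spec, seed=0):
    """Shuffle items with a fixed seed and cut them into consecutive partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    parts, start = [], 0
    for size in parse_partition_sizes(spec, len(items)):
        parts.append(items[start:start + size])
        start += size
    return parts
```

With 10000 items, `random_split(range(10000), ",5000,2500")` would yield three disjoint partitions, the first one holding whatever is left after the two fixed-size ones are filled.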

### The split & convert script produces outputs of two types:
1. The set of questions in JSONL format. These questions are divided into several subsets:

In [None]:
!ls $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_data

The `bitext` and `train_fusion` subsets are intended for training models. The difference is that `train_fusion` is a smaller subset that can be used to create fusion models, whereas `bitext` can be used to train, e.g., neural models.

2. For the queries from the `bitext` set, the conversion script creates parallel data (bitext) where questions are aligned with the respective answer-bearing sentences. We create three parallel corpora that correspond to three ways to lemmatize and tokenize the input: lemmas, original tokens with stopwords removed, and BERT-tokenized text. They are stored in the `derived_data/bitext` subdirectory:

In [None]:
!ls $COLLECT_ROOT/wikipedia_dpr_nq_sample/derived_data/bitext
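
Parallel corpora are commonly stored as two line-aligned text files, where line *i* of the question file corresponds to line *i* of the answer file. Assuming this layout also holds for pairs such as `question_text_unlemm` and `answer_text_unlemm` (not verified here), the pairs could be read as follows; `read_bitext` is a hypothetical helper:

```python
def read_bitext(question_path, answer_path):
    """Yield (question, answer) pairs from two line-aligned text files."""
    with open(question_path, encoding="utf-8") as fq, \
         open(answer_path, encoding="utf-8") as fa:
        # zip stops at the end of the shorter file.
        for q, a in zip(fq, fa):
            yield q.rstrip("\n"), a.rstrip("\n")
```

Printing the first few pairs is a quick way to check by eye that questions and answer-bearing sentences are indeed aligned.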

### Embedding documents and queries (ANCE, Sentence BERT)

1. We already __ship__ data with documents and queries (except for the bitext part) embedded using an [ANCE Wikipedia model](https://github.com/microsoft/ANCE). This is done using the scripts in the `data_convert/biencoder/ance` directory.
2. A much more diverse set of embeddings (provided by [Sentence BERT](https://www.sbert.net/)) is available if you use the script `data_convert/biencoder/sbert/embed.py`.
3. To generate ANCE embeddings, one first needs to download the models using the script `data_convert/biencoder/ance/download_ance_models.sh`.
4. Then, one can embed documents using a command like this one:

```
data_convert/biencoder/ance/embed.py \
    --input $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_raw/psgs_w100.tsv.gz \
    --output $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_data/pass_sample/AnswerFields.bin \
    --field_name dense  \
    --model_dir <model download directory> \
    --data_type dpr_nq \
    --doc_ids collections/wikipedia_dpr_nq_sample/input_raw/nq_selected_psg_ids.npy
```

5. ... and queries using a command like this one (note we specify __the binary field name__):

```
for part in train_fusion dev dev_official ; do \
    data_convert/biencoder/ance/embed.py \
        --input $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_data/$part/QuestionFields.jsonl \
        --output $COLLECT_ROOT/wikipedia_dpr_nq_sample/input_data/$part/QuestionFields.bin \
        --field_name dense  \
        --model_dir <model download directory> \
        --data_type dpr_nq 
done
```