# Data preparation/downloading/processing

### First, move to the top-level directory of the repository.

In [1]:
cd ../..

/home/leo/SourceTreeGit/FlexNeuART.refact2021


## Downloading preprocessed data

This notebook works with a sub-sample of the Natural Questions collection (__Wikipedia DPR__) prepared by [Karpukhin et al.](https://github.com/facebookresearch/DPR). This subset includes all the questions, but only about one million Wikipedia passages. How the subset was generated is briefly described below; for convenience, we provide an archive with the already-processed data:

In [2]:
!wget boytsov.info/datasets/wikipedia_dpr_nq_sample_2021-07-16.tar.bz2

--2021-07-26 18:27:40--  http://boytsov.info/datasets/wikipedia_dpr_nq_sample_2021-07-16.tar.bz2
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3237388795 (3.0G) [application/x-bzip2]
Saving to: ‘wikipedia_dpr_nq_sample_2021-07-16.tar.bz2’


2021-07-26 18:44:22 (3.08 MB/s) - ‘wikipedia_dpr_nq_sample_2021-07-16.tar.bz2’ saved [3237388795/3237388795]



In [3]:
!tar jxvf wikipedia_dpr_nq_sample_2021-07-16.tar.bz2

wikipedia_dpr_nq_sample/derived_data/ir_models/vanilla_bert/model.best
wikipedia_dpr_nq_sample/derived_data/embeddings/
wikipedia_dpr_nq_sample/derived_data/embeddings/glove/
wikipedia_dpr_nq_sample/derived_data/embeddings/glove/glove.6B.50d.txt.bz2
wikipedia_dpr_nq_sample/input_data/
wikipedia_dpr_nq_sample/input_data/train_fusion/
wikipedia_dpr_nq_sample/input_data/train_fusion/QuestionFields.jsonl
wikipedia_dpr_nq_sample/input_data/train_fusion/qrels.txt
wikipedia_dpr_nq_sample/input_data/train_fusion/QuestionFields.bin
wikipedia_dpr_nq_sample/input_data/dev/
wikipedia_dpr_nq_sample/input_data/dev/QuestionFields.jsonl
wikipedia_dpr_nq_sample/input_data/dev/qrels.txt
wikipedia_dpr_nq_sample/input_data/dev/QuestionFields.bin
wikipedia_dpr_nq_sample/input_data/bitext/
wikipedia_dpr_nq_sample/input_data/bitext/QuestionFields.jsonl
wikipedia_dpr_nq_sample/input_data/bitext/qrels.txt
wikipedia_dpr_nq_sample/input_data/pass_sample/
wikipedia_dpr_nq_sample/input_data/pass_sample/AnswerField

In [4]:
!mkdir collections

In [4]:
!mv wikipedia_dpr_nq_sample collections

#### Carry out a basic sanity check:

In [5]:
!scripts/report/get_basic_collect_stat.sh wikipedia_dpr_nq_sample

Using collection root: collections
Checking input sub-directory: bitext
Checking input sub-directory: dev
Checking input sub-directory: dev_official
Checking input sub-directory: pass_sample
Found indexable data file: pass_sample/AnswerFields.jsonl.gz
Checking input sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
getIndexQueryDataDirs return value:  pass_sample AnswerFields.jsonl.gz bitext,dev,dev_official,train_fusion
Using data file: AnswerFields.jsonl.gz
Index dirs: pass_sample
Query dirs: bitext dev dev_official train_fusion
Queries/questions:
bitext 53880
dev 2500
dev_official 6515
train_fusion 2500
Documents/passages/answers:
pass_sample 774392
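
The counts reported above can also be cross-checked directly from Python. The following is a minimal sketch, assuming only that each query or passage file stores one JSON object per line (optionally gzip-compressed). The helper names `count_jsonl_entries` and `report_counts` are hypothetical and are not part of FlexNeuART:

```python
import gzip
import json
from pathlib import Path


def count_jsonl_entries(path):
    """Count the non-empty lines of a JSONL file; every line must parse as JSON."""
    path = Path(path)
    # Transparently handle gzip-compressed files such as AnswerFields.jsonl.gz.
    opener = gzip.open if path.suffix == ".gz" else open
    count = 0
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raises ValueError if the line is malformed
            count += 1
    return count


def report_counts(input_data_dir,
                  file_names=("QuestionFields.jsonl", "AnswerFields.jsonl.gz")):
    """Return {sub-directory: {file name: entry count}} for the files that exist."""
    stats = {}
    for sub_dir in sorted(Path(input_data_dir).iterdir()):
        if not sub_dir.is_dir():
            continue
        for name in file_names:
            data_file = sub_dir / name
            if data_file.exists():
                stats.setdefault(sub_dir.name, {})[name] = count_jsonl_entries(data_file)
    return stats
```

For example, `report_counts("collections/wikipedia_dpr_nq_sample/input_data")` should produce per-directory counts that can be compared with the output of the shell script.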


## Preprocessing in more detail (for information only: the downloaded data is already preprocessed)

The download and conversion scripts can be found in the directory `scripts/data_convert/wikipedia_dpr`.

In [None]:
!mkdir -p collections/wikipedia_dpr_nq_sample/input_raw

### Converting passages and queries

In [None]:
!scripts/data_convert/wikipedia_dpr/download_dpr_passages.sh collections/wikipedia_dpr_nq_sample/input_raw

In [None]:
!scripts/data_convert/wikipedia_dpr/download_dpr_queries.sh nq collections/wikipedia_dpr_nq_sample/input_raw

### Randomly split the training set into new training and development sets. This script also converts the data into the FlexNeuART format

In [None]:
!scripts/data_convert/wikipedia_dpr/split_and_convert_dpr_queries.sh \
    wikipedia_dpr_nq_sample \
    collections/wikipedia_dpr_nq_sample/input_raw/ \
    nq \
    -partition_sizes ,5000,2500 
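
To illustrate the idea of a random split driven by a size specification such as `,5000,2500`, here is a minimal Python sketch. It is not the logic of `split_and_convert_dpr_queries.sh`: the function name `split_into_partitions` and the `seed` parameter are hypothetical, and the interpretation that an empty entry receives all remaining items is an assumption.

```python
import random


def split_into_partitions(items, partition_sizes, seed=0):
    """Shuffle items and cut them into consecutive partitions.

    partition_sizes is a comma-separated string such as ",5000,2500".
    ASSUMPTION: at most one entry may be empty, and that partition receives
    all items not claimed by the explicitly sized partitions.
    """
    sizes = [int(s) if s.strip() else None for s in partition_sizes.split(",")]
    if sizes.count(None) > 1:
        raise ValueError("at most one partition size may be left empty")

    shuffled = list(items)
    fixed_total = sum(s for s in sizes if s is not None)
    if fixed_total > len(shuffled):
        raise ValueError("partition sizes exceed the number of items")

    # The empty entry (if any) absorbs whatever the fixed sizes leave over.
    remainder = len(shuffled) - fixed_total
    sizes = [remainder if s is None else s for s in sizes]

    # A dedicated generator keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)

    partitions, start = [], 0
    for size in sizes:
        partitions.append(shuffled[start:start + size])
        start += size
    return partitions
```

With 10,000 input items and the specification `,5000,2500`, this sketch yields three disjoint partitions of 2,500, 5,000, and 2,500 items.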

### The split & convert script produces outputs of two types:
1. The set of questions in JSONL format. These questions are divided into several subsets:

In [None]:
!ls collections/wikipedia_dpr_nq_sample/input_data

The `bitext` and `train_fusion` subsets are both intended for training models. The difference is that `train_fusion` is a smaller subset that can be used to create fusion models, whereas `bitext` can be used to train, e.g., neural models.

2. Parallel data (bitext). For the queries from the `bitext` set, the conversion script aligns questions with their respective answer-bearing sentences. We create three parallel corpora that correspond to three ways of lemmatizing and tokenizing the input: lemmas and original tokens (both with stopwords removed), and BERT-tokenized text. They are stored in the `derived_data/bitext` subdirectory:

In [None]:
!ls collections/wikipedia_dpr_nq_sample/derived_data/bitext

### Embedding documents and queries

1. The data we __ship__ already includes documents and queries (except for the `bitext` part) embedded using an [ANCE Wikipedia model](https://github.com/microsoft/ANCE). The embeddings were produced with the scripts in the `scripts/data_convert/ance` directory.
2. First, one needs to download the models using the script `scripts/data_convert/ance/download_ance_models.sh`.
3. Then, one can embed documents using a command like this one:

```
scripts/data_convert/ance/embed.py \
    --input collections/wikipedia_dpr_nq_sample/input_raw/psgs_w100.tsv.gz \
    --output collections/wikipedia_dpr_nq_sample/input_data/pass_sample/AnswerFields.bin \
    --field_name dense  \
    --model_dir <model download directory> \
    --data_type dpr_nq \
    --doc_ids collections/wikipedia_dpr_nq_sample/input_raw/nq_selected_psg_ids.npy
```

4. Queries can be embedded using a command like this one (note that we specify __the binary field name__):

```
for part in train_fusion dev dev_official ; do
    scripts/data_convert/ance/embed.py \
        --input "collections/wikipedia_dpr_nq_sample/input_data/$part/QuestionFields.jsonl" \
        --output "collections/wikipedia_dpr_nq_sample/input_data/$part/QuestionFields.bin" \
        --field_name dense \
        --model_dir <model download directory> \
        --data_type dpr_nq
done
```