Credits: the provided initial code is an adaptation of the [Starter code for Stanford CS224n default final project on SQuAD 2.0](https://github.com/chrischute/squad), which is shared under the MIT License.

This notebook performs the initial preprocessing of the SberQuAD dataset and gives you a starting point for this assignment. If it looks too complex or too expensive in time or compute resources, you may stick to homework05 instead.

### 1. Preprocessing
This code is a slightly modified version of the code from `setup.py`. If you want to work with the original SQuAD dataset, follow the instructions in the https://github.com/chrischute/squad repository.

In [1]:
"""Train a model on SQuAD.

Author:
    Chris Chute (chute@stanford.edu)
"""

import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.optim.lr_scheduler as sched
import torch.utils.data as data
import util

from args import get_train_args
from collections import OrderedDict
from json import dumps
from models import BiDAF
from tensorboardX import SummaryWriter
from tqdm import tqdm
from ujson import load as json_load
from util import collate_fn, SQuAD

In [2]:
from pathlib import Path
Path("./data").mkdir(parents=True, exist_ok=True)
Path("./save").mkdir(parents=True, exist_ok=True)

Downloading the SberQuAD data

In [2]:
!wget http://files.deeppavlov.ai/datasets/sber_squad_clean-v1.1.tar.gz -nc -O ./data/sber_squad_clean-v1.1.tar.gz

--2020-06-16 12:58:06--  http://files.deeppavlov.ai/datasets/sber_squad_clean-v1.1.tar.gz
Resolving files.deeppavlov.ai (files.deeppavlov.ai)... 93.175.29.74
Connecting to files.deeppavlov.ai (files.deeppavlov.ai)|93.175.29.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22766184 (22M) [application/octet-stream]
Saving to: ‘./data/sber_squad_clean-v1.1.tar.gz’


2020-06-16 12:58:09 (5.84 MB/s) - ‘./data/sber_squad_clean-v1.1.tar.gz’ saved [22766184/22766184]



In [3]:
! tar -xzvf ./data/sber_squad_clean-v1.1.tar.gz
! mv train-v1.1.json data
! mv dev-v1.1.json data

train-v1.1.json
dev-v1.1.json
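
To sanity-check the extracted files before preprocessing, a minimal sketch like the one below counts articles, paragraphs and questions. It assumes the files follow the SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas`); the helper names `load_squad` and `count_squad_items` are hypothetical and are not part of the starter code.

```python
import json


def load_squad(path):
    """Read a SQuAD-style JSON file into a Python dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def count_squad_items(dataset):
    """Return (articles, paragraphs, questions) for a SQuAD-style dict."""
    articles = dataset["data"]
    paragraphs = [p for a in articles for p in a["paragraphs"]]
    questions = [q for p in paragraphs for q in p["qas"]]
    return len(articles), len(paragraphs), len(questions)
```

For example, `count_squad_items(load_squad("./data/train-v1.1.json"))` should report a question count that matches the number printed by the preprocessing step.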


Downloading the word vectors

In [9]:
! wget http://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec -nc -O ./data/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec

--2020-06-16 13:00:35--  http://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec
Resolving files.deeppavlov.ai (files.deeppavlov.ai)... 93.175.29.74
Connecting to files.deeppavlov.ai (files.deeppavlov.ai)|93.175.29.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4110986108 (3.8G) [application/octet-stream]
Saving to: ‘./data/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec’

ft_native_300_ru_wi   0%[                    ]  13.53M  4.11MB/s    eta 16m 56s^C


And finally the preprocessing for the SberQuAD dataset:

In [3]:
train_file = './data/train-v1.1.json'
dev_file = './data/dev-v1.1.json'
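# Despite the variable name (kept from the SQuAD starter code), these are fastText vectors, not GloVe
glove_file = './data/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec'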
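
The preprocessing call further down passes `vec_size=300` and `num_vectors=1560132` by hand. Assuming the `.vec` file uses the usual fastText text format, whose first line holds the vector count and the dimensionality, these two values can be read from the file instead. This is only a sketch: `read_vec_header` is a hypothetical helper, and the header assumption should be checked against the downloaded file.

```python
def read_vec_header(path):
    """Return (num_vectors, dim) from the first line of a .vec text file.

    Assumes the first line looks like "<count> <dim>"; raises otherwise.
    """
    with open(path, encoding="utf-8") as fh:
        first = fh.readline().split()
    if len(first) != 2:
        raise ValueError("no '<count> <dim>' header found on the first line")
    return int(first[0]), int(first[1])
```

With the file from this notebook the call would be `read_vec_header(glove_file)`.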

In [4]:
from setup import *

In [5]:
# Uncomment this cell if needed
# !pip install pymorphy2

In [6]:
nlp = spacy.blank("ru")

The following cell may take a while (usually 10 minutes or less).

In [7]:
# Process training set and use it to decide on the word/character vocabularies
word_counter, char_counter = Counter(), Counter()
train_examples, train_eval = process_file(train_file, "train", word_counter, char_counter, nlp)
word_emb_mat, word2idx_dict = get_embedding(
    word_counter, 'word', emb_file=glove_file, vec_size=300, num_vectors=1560132)
char_emb_mat, char2idx_dict = get_embedding(
    char_counter, 'char', emb_file=None, vec_size=64)


dev_examples, dev_eval = process_file(dev_file, "dev", word_counter, char_counter, nlp)

Pre-processing train examples...


100%|██████████| 1/1 [01:48<00:00, 108.28s/it]
  0%|          | 903/1560132 [00:00<02:52, 9023.68it/s]

45328 questions in total
Pre-processing word vectors...


100%|██████████| 1560132/1560132 [02:40<00:00, 9696.36it/s]


135451 / 156143 tokens have corresponding word embedding vector
Pre-processing char vectors...


  0%|          | 0/1 [00:00<?, ?it/s]

701 tokens have corresponding char embedding vector
Pre-processing dev examples...


100%|██████████| 1/1 [00:13<00:00, 13.72s/it]

5036 questions in total
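
The log above reports that 135451 of 156143 tokens received a pretrained vector, which is roughly 87% of the vocabulary; the remaining 20692 tokens are out of vocabulary. To see which frequent tokens are missing, a sketch like the following may help. `embedding_coverage` is a hypothetical helper, and it assumes you have a set of the tokens that actually received a vector.

```python
from collections import Counter


def embedding_coverage(word_counter, embedded_words, top_k=10):
    """Return (covered fraction of the vocabulary, most frequent OOV tokens)."""
    # Keep only the tokens that have no pretrained vector
    oov = Counter({w: c for w, c in word_counter.items() if w not in embedded_words})
    covered = len(word_counter) - len(oov)
    fraction = covered / len(word_counter) if word_counter else 0.0
    return fraction, oov.most_common(top_k)
```

One possible call is `embedding_coverage(word_counter, set(word2idx_dict))`, under the assumption that `word2idx_dict` holds the tokens with a vector plus a few special entries.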





Now we have the preprocessed data:

In [9]:
train_record_file = './data/train.npz'
dev_record_file = './data/dev.npz'

In [10]:
from args import add_common_args, get_setup_args

In [14]:
# Retrieving the default arguments for the preprocessing script
_args = get_setup_args(bypass=True)

In [15]:
_args

Namespace(ans_limit=30, answer_file='./data/answer.json', char2idx_file='./data/char2idx.json', char_dim=64, char_emb_file='./data/char_emb.json', char_limit=16, dev_eval_file='./data/dev_eval.json', dev_meta_file='./data/dev_meta.json', dev_record_file='./data/dev.npz', dev_url='https://github.com/chrischute/squad/data/dev-v2.0.json', glove_dim=300, glove_num_vecs=2196017, glove_url='http://nlp.stanford.edu/data/glove.840B.300d.zip', include_test_examples=True, para_limit=400, ques_limit=50, test_eval_file='./data/test_eval.json', test_meta_file='./data/test_meta.json', test_para_limit=1000, test_ques_limit=100, test_record_file='./data/test.npz', test_url='https://github.com/chrischute/squad/data/test-v2.0.json', train_eval_file='./data/train_eval.json', train_record_file='./data/train.npz', train_url='https://github.com/chrischute/squad/data/train-v2.0.json', word2idx_file='./data/word2idx.json', word_emb_file='./data/word_emb.json')

In [13]:
build_features(_args, train_examples, "train", train_record_file, word2idx_dict, char2idx_dict)
dev_meta = build_features(_args, dev_examples, "dev", dev_record_file, word2idx_dict, char2idx_dict)


293it [00:00, 2917.97it/s]

Converting train examples to indices...


45328it [00:14, 3226.18it/s]
312it [00:00, 3115.78it/s]

Built 45213 / 45328 instances of features in total
Converting dev examples to indices...


5036it [00:01, 3237.44it/s]


Built 5022 / 5036 instances of features in total
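
According to the log, 115 training examples and 14 dev examples were not converted, presumably because they exceed the length limits stored in `_args` (for instance `para_limit=400`). To check what ended up in the record files, the sketch below lists every array in an `.npz` archive with its shape and dtype. `describe_npz` is a hypothetical helper; it makes no assumption about the array names that `build_features` writes.

```python
import numpy as np


def describe_npz(path):
    """Map each array name stored in an .npz archive to (shape, dtype)."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

For example, `describe_npz(train_record_file)` should show a leading dimension equal to the number of built instances for each array.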


In [16]:
save(_args.word_emb_file, word_emb_mat, message="word embedding")
save(_args.char_emb_file, char_emb_mat, message="char embedding")
save(_args.train_eval_file, train_eval, message="train eval")
save(_args.dev_eval_file, dev_eval, message="dev eval")
save(_args.word2idx_file, word2idx_dict, message="word dictionary")
save(_args.char2idx_file, char2idx_dict, message="char dictionary")
save(_args.dev_meta_file, dev_meta, message="dev meta")


Saving word embedding...
Saving char embedding...
Saving train eval...
Saving dev eval...
Saving word dictionary...
Saving char dictionary...
Saving dev meta...


### 2. The experiment

Now you are almost ready to go. You may follow these steps to begin (or just start your experiments here).

1. Try running the `train.py` script from the console (or via `!`); the default command-line arguments are fine to start with. It will train the BiDAF model on the preprocessed data. Set the `--use_squad_v2` flag to False (SberQuAD is similar to SQuAD v1.1).

Example code (be careful with the path and the names of the variables):
```
python train.py --name first_run_on_sberquad --use_squad_v2 False
```

2. After it finishes (this might take 1 to 3 hours depending on the hardware), evaluate your model on the `dev` set and measure the quality.
Example code (be careful with the path and the names of the variables):
```
python test.py --split dev --load_path ./save/train/first_run_on_sberquad-02/best.pth.tar --name best_evaluation_experiment
```
The result should be similar to the following:
```
>>> Dev NLL: 04.94, F1: 49.22, EM: 30.53, AvNA: 97.68
```

For comparison, [DeepPavlov's RuBERT](http://docs.deeppavlov.ai/en/master/features/models/squad.html) achieves $EM = 66.30\pm0.24$ and $F1 = 84.60\pm0.11$.
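
To make the EM and F1 numbers above concrete, here is a simplified sketch of how the two metrics are typically computed for a single prediction: EM checks whether the normalized strings match, and F1 is the harmonic mean of token-level precision and recall. The normalization here (lowercasing and splitting on non-word characters) is an assumption for illustration; the official SQuAD script and the starter code's own evaluation differ in details, and they take the maximum over several gold answers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of word characters (simplified normalization)."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(tokenize(prediction) == tokenize(gold))


def f1_score(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = tokenize(prediction), tokenize(gold)
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match, one empty answer as a miss
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction with one extra token, such as `f1_score("в Москве", "Москве")`, has precision 1/2 and recall 1, so it still earns partial F1 credit while its EM is 0.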

#### Here comes your quest: try to improve the quality of this QA system. 

This is a very creative assignment. It is all about experimenting and trying different approaches (and it takes a lot of computation). But if you wish to stick to some numbers, try to increase F1 by at least $5$ points.

Here are some ideas that might help you on your way:
* Try adapting the optimization hyperparameters or the network structure to the Russian language (the baseline is designed for the English SQuAD dataset).
* Incorporating additional information about the data (such as PoS tags) might be a good idea.
* __Distilling the knowledge from a pre-trained RuBERT__ (e.g. try to use the predictions of the model we've discussed on `week10` as soft targets).
* Or anything else.
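
As one possible starting point for the distillation idea, the sketch below blends a soft-target term (KL divergence to the teacher's distribution) with the usual hard-label negative log-likelihood. It rests on several assumptions that you should verify: the student outputs log-probabilities over context positions, the teacher's probabilities have already been aligned to the same word-level positions (RuBERT works on subword tokens, and that alignment is not shown), and the function name and the `alpha` weight are hypothetical.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_log_p, teacher_p, gold_idx, alpha=0.5):
    """Blend soft-target KL divergence with hard-label NLL.

    student_log_p: (batch, seq_len) log-probabilities from the student
    teacher_p:     (batch, seq_len) probabilities from the teacher
    gold_idx:      (batch,) gold start (or end) positions
    alpha:         weight of the soft-target term, between 0 and 1
    """
    # F.kl_div expects log-probabilities as input and probabilities as target
    soft = F.kl_div(student_log_p, teacher_p, reduction="batchmean")
    hard = F.nll_loss(student_log_p, gold_idx)
    return alpha * soft + (1.0 - alpha) * hard
```

In a span-extraction model the loss would be applied twice, once to the start distribution and once to the end distribution, and the two values summed.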


And, first of all, read the initial code carefully.


Good luck! Feel free to share your results :)