Credits: the provided initial code is an adaptation of the [Starter code for Stanford CS224n default final project on SQuAD 2.0](https://github.com/chrischute/squad), which is shared under the MIT License.

This notebook performs the initial preprocessing for the SberQuAD dataset and gives you a starting point for this assignment. If it looks too complex or too expensive in time and resources, you may stick to homework05 instead.

### 1. Preprocessing
This code is a slightly modified version of the code from `setup.py`. If you want to work with the SQuAD dataset, follow the original instructions from the https://github.com/chrischute/squad repository.

In [None]:
# If running on Colab, uncomment the following lines 

# !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/homeworks/homework04/args.py -nc
# !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/homeworks/homework04/layers.py -nc
# !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/homeworks/homework04/models.py -nc
# !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/homeworks/homework04/setup.py -nc
# !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/homeworks/homework04/test.py -nc
# !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/homeworks/homework04/train.py -nc
# !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/homeworks/homework04/util.py -nc

In [None]:
# If running on Colab, uncomment the following lines 

# !pip install ujson
# !pip install tensorboardX
# !pip install pymorphy2==0.8

In [None]:
"""Train a model on SQuAD.

Author:
    Chris Chute (chute@stanford.edu)
"""

import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.optim.lr_scheduler as sched
import torch.utils.data as data
import util

from args import get_train_args
from collections import OrderedDict
from json import dumps
from models import BiDAF
from tensorboardX import SummaryWriter
from tqdm import tqdm
from ujson import load as json_load
from util import collate_fn, SQuAD

In [None]:
from pathlib import Path
Path("./data").mkdir(parents=True, exist_ok=True)
Path("./save").mkdir(parents=True, exist_ok=True)

Downloading the SberQuAD data

In [None]:
!wget http://files.deeppavlov.ai/datasets/sber_squad_clean-v1.1.tar.gz -nc -O ./data/sber_squad_clean-v1.1.tar.gz

In [None]:
! tar -xzvf ./data/sber_squad_clean-v1.1.tar.gz
! mv train-v1.1.json data
! mv dev-v1.1.json data

Downloading the word vectors (this may take a while)

In [None]:
! wget http://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec -nc -O ./data/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec

And finally the preprocessing for the SberQuAD dataset:

In [None]:
train_file = './data/train-v1.1.json'
dev_file = './data/dev-v1.1.json'
# Despite the variable name, this file holds fastText vectors, not GloVe
glove_file = './data/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec'

In [None]:
from setup import *

In [None]:
# Uncomment this cell if needed
# !pip install pymorphy2

In [None]:
nlp = spacy.blank("ru")

The following cell may take a while (usually 10 minutes or less).

In [None]:
# Process training set and use it to decide on the word/character vocabularies
word_counter, char_counter = Counter(), Counter()
train_examples, train_eval = process_file(train_file, "train", word_counter, char_counter, nlp)
word_emb_mat, word2idx_dict = get_embedding(
    word_counter, 'word', emb_file=glove_file, vec_size=300, num_vectors=1560132)
char_emb_mat, char2idx_dict = get_embedding(
    char_counter, 'char', emb_file=None, vec_size=64)


dev_examples, dev_eval = process_file(dev_file, "dev", word_counter, char_counter, nlp)

Now we have the preprocessed data:

In [None]:
train_record_file = './data/train.npz'
dev_record_file = './data/dev.npz'

In [None]:
from args import add_common_args, get_setup_args

In [None]:
# Retrieving the default arguments for the preprocessing script
_args = get_setup_args(bypass=True)

In [None]:
_args

In [None]:
build_features(_args, train_examples, "train", train_record_file, word2idx_dict, char2idx_dict)
dev_meta = build_features(_args, dev_examples, "dev", dev_record_file, word2idx_dict, char2idx_dict)


In [None]:
save(_args.word_emb_file, word_emb_mat, message="word embedding")
save(_args.char_emb_file, char_emb_mat, message="char embedding")
save(_args.train_eval_file, train_eval, message="train eval")
save(_args.dev_eval_file, dev_eval, message="dev eval")
save(_args.word2idx_file, word2idx_dict, message="word dictionary")
save(_args.char2idx_file, char2idx_dict, message="char dictionary")
save(_args.dev_meta_file, dev_meta, message="dev meta")
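
Before training, it may be worth checking what the preprocessing step actually wrote to disk. The sketch below lists every array stored in an `.npz` record file together with its shape and dtype. The helper name `summarize_npz` is hypothetical (it is not part of the starter code), and no particular array names are assumed; the function simply reports whatever the file contains.

```python
from pathlib import Path

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype_name)} for every array in an .npz file."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


# Paths match the record files defined earlier in this notebook.
for record_file in ("./data/train.npz", "./data/dev.npz"):
    if Path(record_file).exists():
        print(record_file)
        for name, (shape, dtype) in summarize_npz(record_file).items():
            print(f"  {name:20s} shape={shape} dtype={dtype}")
```

If the number of rows differs between arrays in the same file, or the files are missing, the preprocessing cells above most likely did not finish.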


### 2. The experiment

Now you are almost ready to go. You may follow these steps to begin (or just start your experiments here).

1. Try running the `train.py` script from the console (or via `!`); the default command-line arguments are fine to start with. It will train the BiDAF model on the preprocessed data. Set the `--use_squad_v2` flag to False (SberQuAD is similar to SQuAD v1.1).

Example code (be careful with the path and the names of the variables):
```
python train.py --name first_run_on_sberquad --use_squad_v2 False
```

2. After it finishes (this might take 1-3 hours depending on the hardware), evaluate your model on the `dev` set and measure the quality.
Example code (be careful with the path and the names of the variables):
```
python test.py --split dev --load_path ./save/train/first_run_on_sberquad-02/best.pth.tar --name best_evaluation_experiment
```
The result should be similar to the following:
```
>>> Dev NLL: 02.47, F1: 75.62, EM: 55.73, AvNA: 99.42
```
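
To make the F1 and EM numbers above easier to interpret, here is a minimal sketch of how SQuAD-style exact match and token-level F1 are typically computed for a single prediction. It assumes lowercasing and whitespace tokenization only; the evaluation code shipped with the starter project may normalize answers differently, so treat this as an illustration rather than the scoring used by `test.py`. The function names `exact_match` and `token_f1` are hypothetical.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both answers.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction with one extra token: no exact match, but partial F1 credit.
print(exact_match("в 1961 году", "1961 году"))
print(token_f1("в 1961 году", "1961 году"))
```

Dataset-level scores are usually obtained by averaging these per-question values and reporting them as percentages, which is why F1 is always at least as high as EM.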

For comparison, [DeepPavlov's RuBERT](http://docs.deeppavlov.ai/en/master/features/models/squad.html) achieves $F1 = 84.60\pm0.11$ and $EM = 66.30\pm0.24$.

#### Here comes your quest: try to improve the quality of this QA system. 

This is a very creative assignment. It is all about experimenting and trying different approaches (with a lot of computation). But if you want a concrete target, try to increase F1 by at least $5$ points.

Here are some ideas that might help you on your way:
* Try adapting the optimization hyperparameters/network structure to the Russian language (the baseline is designed for the English SQuAD dataset).
* Incorporating additional information about the data (such as PoS tags) might be a good idea.
* __Distilling the knowledge from a pre-trained RuBERT__ (e.g. try to use the predictions of the model we've discussed on `week10` as soft targets).
* Or anything else.
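
As an illustration of the distillation idea from the list above, here is a minimal PyTorch sketch of a loss that blends ordinary cross-entropy on the gold answer position with a KL-divergence term towards the teacher's temperature-softened distribution. Everything here is an assumption made for illustration: the function name `distillation_loss`, the `temperature` and `alpha` defaults, and the premise that teacher scores have already been aligned to the student's word-level positions (a BERT-style teacher works on subword tokens, so this alignment is a separate step not shown here). The same loss would be applied once to the start-position scores and once to the end-position scores.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, hard_targets,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target KL term.

    student_logits, teacher_logits: (batch, context_len) unnormalized scores.
    hard_targets: (batch,) gold position indices.
    """
    hard_loss = F.cross_entropy(student_logits, hard_targets)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps the soft-term gradients on a comparable scale
    # when the temperature changes.
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    return alpha * hard_loss + (1.0 - alpha) * soft_loss


# Toy usage with random scores for a batch of 4 contexts of length 10.
torch.manual_seed(0)
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
gold = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, gold)
loss.backward()
print(loss.item())
```

If the student model returns log-probabilities instead of raw scores, they can be passed in as `student_logits` unchanged, since softmax is invariant to adding a per-row constant.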


And, first of all, read the initial code carefully.


Good luck! Feel free to share your results :)