# Training your own model

This notebook will walk you through training your own model using [DeCLUTR](https://github.com/JohnGiorgi/DeCLUTR).

## 🔧 Install the prerequisites

In [None]:
# !pip install git+https://github.com/JohnGiorgi/DeCLUTR.git

# Alternatively, from the root of a local DeCLUTR clone, run "pip install --editable ."

## 📖 Preparing a dataset


A dataset is simply a file containing one item of text (a document, a scientific paper, etc.) per line. For demonstration purposes, we have provided a script that will download the [WikiText-103](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/) dataset and format it for training with our method.

The only "gotcha" is that each piece of text needs to be long enough so that we can sample spans from it. In general, you should collect documents of a minimum length according to the following:

```python
min_length = num_anchors * max_span_len * 2
```

In our paper, we set `num_anchors=2` and `max_span_len=512`, so we require documents of `min_length=2048`. We simply need to provide this value as an argument when running the script:

In [None]:
import os

train_data_path = "./data/wiki_text/wikitext-103/train.txt"

# run this to download and preprocess data

min_length = 2048

!python ../scripts/preprocess_wikitext_103.py $train_data_path --min-length $min_length --max-instances 100

Let's confirm that our dataset looks as expected.

In [None]:
!wc -l $train_data_path  # Approximately 17.8K lines for the full dataset; at most 100 when --max-instances 100 is set

In [None]:
# !head -n 1 $train_data_path  # This should be a single Wikipedia entry
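
When preparing your own corpus instead of WikiText-103, the length rule above can be checked directly. The following is a minimal sketch, assuming length is counted in whitespace-separated tokens (the preprocessing script may count tokens differently). The helper names `required_min_length` and `filter_long_documents` are hypothetical and not part of DeCLUTR.

```python
from typing import Iterable, List


def required_min_length(num_anchors: int, max_span_len: int) -> int:
    """Minimum document length implied by num_anchors * max_span_len * 2."""
    return num_anchors * max_span_len * 2


def filter_long_documents(documents: Iterable[str], min_length: int) -> List[str]:
    """Keep documents with at least `min_length` whitespace-separated tokens.

    Assumption: whitespace tokenization stands in for whatever tokenizer
    is actually used at training time.
    """
    kept = []
    for doc in documents:
        # Collapse newlines and repeated spaces so each document fits on one line.
        one_line = " ".join(doc.split())
        if len(one_line.split()) >= min_length:
            kept.append(one_line)
    return kept


paper_min_length = required_min_length(num_anchors=2, max_span_len=512)
print(paper_min_length)  # 2048
```

Writing the kept documents to a file, one per line, gives the dataset format described at the start of this section.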

### A look at the sampling technique

This will help give an idea of what the sampled anchor and positive spans look like.

In [None]:
import torch
from declutr.common.contrastive_utils import sample_anchor_positive_pairs
from declutr.losses import NTXentLoss
from transformers import AutoTokenizer

In [None]:
text = "this is just an example sentence to test out some sampling and loss calculation from DeCLUTR. We want to see exactly how it works in order to implement it for our own use case"
len_text = len(text.split())

In [None]:
# just go with one anchor for now

num_anchors = 1
max_span_len = int((len_text / 2) / num_anchors)

min_span_len = 5
num_positives = 5
sampling_strat = "adjacent"

In [None]:
anchor_spans, positive_spans = sample_anchor_positive_pairs(
    text=text,
    num_anchors=num_anchors,
    num_positives=num_positives,
    max_span_len=max_span_len,
    min_span_len=min_span_len,
    sampling_strategy=sampling_strat,
)

In [None]:
anchor_spans

In [None]:
positive_spans
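
To build intuition for what an "adjacent" strategy could look like, here is a deliberately simplified toy sampler written from scratch. It is not DeCLUTR's implementation and does not reproduce `sample_anchor_positive_pairs`; the function `toy_adjacent_pair` and its uniform length sampling are assumptions made purely for illustration.

```python
import random
from typing import Tuple


def toy_adjacent_pair(
    text: str, min_span_len: int, max_span_len: int, rng: random.Random
) -> Tuple[str, str]:
    """Sample one anchor span and one positive span that directly follows it.

    Span lengths are counted in whitespace-separated tokens.
    """
    tokens = text.split()
    if max_span_len < min_span_len:
        raise ValueError("max_span_len must be >= min_span_len")
    if len(tokens) < 2 * min_span_len:
        raise ValueError("text is too short to hold an anchor and a positive span")

    # Anchor length: leave at least min_span_len tokens for the positive span.
    anchor_len = rng.randint(min_span_len, min(max_span_len, len(tokens) - min_span_len))
    # Anchor start: the anchor plus a minimal positive span must fit in the text.
    anchor_start = rng.randint(0, len(tokens) - anchor_len - min_span_len)
    anchor_end = anchor_start + anchor_len

    # The positive span starts exactly where the anchor ends ("adjacent").
    remaining = len(tokens) - anchor_end
    positive_len = rng.randint(min_span_len, min(max_span_len, remaining))
    positive_end = anchor_end + positive_len

    anchor = " ".join(tokens[anchor_start:anchor_end])
    positive = " ".join(tokens[anchor_end:positive_end])
    return anchor, positive


rng = random.Random(0)
example = "this is just an example sentence to test out some sampling and loss calculation"
anchor, positive = toy_adjacent_pair(example, min_span_len=3, max_span_len=6, rng=rng)
print("anchor:  ", anchor)
print("positive:", positive)
```

Comparing this output with `anchor_spans` and `positive_spans` from the cells above is a quick way to see where the real sampler differs from the toy.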

In [None]:
# test loss function
anchor_emb = torch.rand(64).unsqueeze(0)
pos_emb = torch.rand(64).unsqueeze(0)
neg_emb = torch.rand(64).unsqueeze(0)

In [None]:
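# Stack the anchor and positive embeddings into a single (2, 64) tensor for inspection
anchor_pos_embs = torch.cat((anchor_emb, pos_emb))
# get_embeddings_and_label is called on the class itself, so no instance is created here
loss_func = NTXentLoss
embs, labels = loss_func.get_embeddings_and_label(anchor_emb, pos_emb)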
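
For intuition about what an NT-Xent (normalized temperature-scaled cross entropy) loss measures, the following dependency-free sketch computes it for a single anchor. It is not the `NTXentLoss` class used above, which works on batches of tensors; the function name `nt_xent_single` and the temperature value are illustrative assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_single(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """Loss for one anchor: -log(exp(s_pos / t) / sum_k exp(s_k / t)).

    The sum in the denominator runs over the positive and all negatives.
    """
    pos_logit = cosine_similarity(anchor, positive) / temperature
    logits = [pos_logit] + [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_denominator - pos_logit


anchor_vec = [1.0, 0.0]
close = nt_xent_single(anchor_vec, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
far = nt_xent_single(anchor_vec, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
print(close < far)  # True
```

The loss is small when the positive is more similar to the anchor than the negatives are, and grows when a negative is the closer match.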

## 🏃 Training the model

Once you have collected the dataset, you can easily initiate a training session with the `allennlp train` command. An experiment is configured using a [Jsonnet](https://jsonnet.org/) config file. Let's take a look at the config for the DeCLUTR-small model presented in [our paper](https://arxiv.org/abs/2006.03659):

In [None]:
# with open("../training_config/declutr_small.jsonnet", "r") as f:
#     print(f.read())


The only thing to configure is the path to the training set (`train_data_path`), which can be passed to `allennlp train` via the `--overrides` argument (but you can also provide it in your config file directly, if you prefer):

In [None]:
# overrides = (
#     f"{{'train_data_path': '{train_data_path}', "
#     # lower the batch size to be able to train on Colab GPUs
#     "'data_loader.batch_size': 2, "
#     # training examples / batch size. Not required, but gives us a more informative progress bar during training
#     "'data_loader.batches_per_epoch': None}"
# )


overrides = (
    f"{{'train_data_path': '{train_data_path}', "
    # lower the batch size to be able to train on Colab GPUs
    "'data_loader.batch_size': 4,}"
)

In [None]:
overrides
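
Hand-writing the overrides string means escaping braces and quotes manually. As an alternative, the sketch below builds the overrides with `json.dumps` and assembles the `allennlp train` flags used in this notebook as an argument list, which sidesteps shell quoting. The helper `build_train_command` is a hypothetical name introduced here, not part of AllenNLP or DeCLUTR.

```python
import json
import shlex
from typing import Any, Dict, List


def build_train_command(
    config_path: str, serialization_dir: str, override_values: Dict[str, Any]
) -> List[str]:
    """Return an argument list for `allennlp train` with JSON-encoded overrides."""
    return [
        "allennlp", "train", config_path,
        "--serialization-dir", serialization_dir,
        "--overrides", json.dumps(override_values),
        "--include-package", "declutr",
        "-f",
    ]


command = build_train_command(
    config_path="../training_config/declutr_small_v2.jsonnet",
    serialization_dir="./saved_models/declutr/wiki/output",
    override_values={
        "train_data_path": "./data/wiki_text/wikitext-103/train.txt",
        # lower the batch size to be able to train on Colab GPUs
        "data_loader.batch_size": 4,
    },
)
# Print a copy-pasteable, safely quoted version of the command.
print(shlex.join(command))
```

The list can be handed to `subprocess.run(command, check=True)` instead of using the `!` shell syntax. Note that `shlex.join` requires Python 3.8 or newer.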

In [None]:
!allennlp train "../training_config/declutr_small_v2.jsonnet" \
    --serialization-dir "./saved_models/declutr/wiki/output" \
    --overrides "$overrides" \
    --include-package "declutr" \
    -f
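
Once training finishes, it can be worth confirming that the serialization directory holds what the export step expects. The sketch below assumes the usual AllenNLP layout, in which `config.json` is written when training starts and `model.tar.gz` when it completes; `check_training_outputs` is a hypothetical helper, and the file list is an assumption that may need adjusting for your AllenNLP version.

```python
from pathlib import Path
from typing import Dict

# Assumption: files AllenNLP typically writes to a serialization directory.
EXPECTED_FILES = ("config.json", "model.tar.gz")


def check_training_outputs(serialization_dir: str) -> Dict[str, bool]:
    """Map each expected file name to whether it exists in the directory."""
    root = Path(serialization_dir)
    return {name: (root / name).is_file() for name in EXPECTED_FILES}


status = check_training_outputs("./saved_models/declutr/wiki/output")
for name, present in status.items():
    print(f"{name}: {'found' if present else 'missing'}")
```

A missing `model.tar.gz` usually means training was interrupted before the final archive was written.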

### 🤗 Exporting a trained model to Hugging Face Transformers

We have provided a simple script to export a trained model so that it can be loaded with [Hugging Face Transformers](https://github.com/huggingface/transformers):

In [None]:
archive_file = "./saved_models/declutr/wiki/output/"
save_directory = "./saved_models/declutr/wiki/output/transformers_format/"

!python ../scripts/save_pretrained_hf.py --archive_file $archive_file --save_directory $save_directory

In [None]:
# !python ../scripts/save_pretrained_hf.py --help

The model, saved to `save_directory`, can then be loaded using the Hugging Face Transformers library.

> See the [embedding notebook](https://colab.research.google.com/github/JohnGiorgi/DeCLUTR/blob/master/notebooks/embedding.ipynb) for more details on using trained models.

In [None]:
from transformers import AutoModel, AutoTokenizer

# Load the exported tokenizer and encoder from the Transformers-format directory
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory)

In [None]:
model

> If you would like to upload your model to the Hugging Face model repository, follow the instructions [here](https://huggingface.co/transformers/model_sharing.html).

## ♻️ Conclusion

That's it! In this notebook, we covered how to collect data for training the model, and specifically how _long_ that text needs to be. We then briefly covered configuring and running a training session. Please see [our paper](https://arxiv.org/abs/2006.03659) and [repo](https://github.com/JohnGiorgi/DeCLUTR) for more details, and don't hesitate to open an issue if you have any trouble!