# Coreference resolution for Slovene language

You can find the source code of this project in the GitHub [repository](https://github.com/matejklemen/slovene-coreference-resolution).

This notebook explains how to use the provided source code. It can also be **run in Google Colab**, which is the preferred environment.

Our work includes four different models:

- the baseline model (including evaluation of trivial models),
- non-contextual model (with word2vec embeddings),
- contextual model with ELMo embeddings,
- contextual model with BERT embeddings.

Note that if you want to run only one of the models, you **do not need to run all the cells in this notebook**.
For example, if you are only interested in the contextual model with ELMo, you do not need the pre-trained BERT or word2vec embeddings.

Contents of the notebook are as follows:

1. fetching the source code of our work,

2. obtaining the datasets,

3. obtaining the pre-trained data,

4. running the models (evaluation of pre-trained models or training new models).

**Note:** if you are running this in Google Colab, do not forget to set Runtime type to GPU by navigating to *Runtime* menu -> *Change runtime type* -> make sure that *GPU* is selected in the dropdown.
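
One way to confirm that the runtime actually exposes a GPU is to query `nvidia-smi`, which Colab GPU runtimes normally provide. The helper below is only a minimal sketch: the function name `gpu_visible` is our own and is not part of the repository.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi -L` lists at least one GPU."""
    # On CPU-only runtimes the nvidia-smi binary is usually absent.
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    # `nvidia-smi -L` prints one line per device, each starting with "GPU <index>:".
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available:", gpu_visible())
```

If this prints `False`, change the runtime type as described above and rerun the cell.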

# 1. Fetching source code

First of all, fetch the source code from the repository and install the needed requirements with pip.

In [0]:
%cd /content
!git clone https://github.com/matejklemen/slovene-coreference-resolution
%cd slovene-coreference-resolution/
!pip install -r requirements.txt -q

# 2. Obtaining datasets

In our work we used the coref149 dataset (a coreference-annotated subset of ssj500k) and the SentiCoref dataset.

You do not have to fetch both; fetch only the one you want to use for training and/or evaluating the models.


## Coref149 & ssj500k datasets

As mentioned, coref149 is a subset of documents from the ssj500k dataset, but with additional annotations of entities and coreferences. Thus we fetch both datasets.

In [0]:
%cd /content/slovene-coreference-resolution/data

!wget https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1182/coref149_v1.0.zip
!unzip -q coref149_v1.0.zip -d coref149
%rm coref149_v1.0.zip

In [0]:
%cd /content/slovene-coreference-resolution/data

!wget https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1210/ssj500k-sl.TEI.zip
!unzip -q ssj500k-sl.TEI.zip
%rm ssj500k-sl.TEI.zip

Since we only need a subset of ssj500k, the repository contains a Python script that produces a "reduced" ssj500k dataset, containing only the documents that also appear in coref149.

In [0]:
%cd /content/slovene-coreference-resolution

# Trim SSJ500k dataset to decrease size/improve loading speed
!python src/trim_ssj.py --coref149_dir=data/coref149 \
    --ssj500k_path=data/ssj500k-sl.TEI/ssj500k-sl.body.xml \
    --target_path=data/ssj500k-sl.TEI/ssj500k-sl.body.reduced.xml
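
To illustrate the idea behind trimming, the sketch below keeps only those elements of a toy XML corpus whose identifier is in a given set. It is not the code of `src/trim_ssj.py`: the element name `doc`, the `id` attribute and the function `keep_documents` are hypothetical, and the real ssj500k TEI file uses its own element names and identifiers.

```python
import xml.etree.ElementTree as ET


def keep_documents(xml_text, wanted_ids, tag="doc", id_attr="id"):
    """Parse `xml_text` and drop every top-level `tag` element whose id is not wanted."""
    root = ET.fromstring(xml_text)
    # Iterate over a copy of the children, because we remove elements while looping.
    for child in list(root):
        if child.tag == tag and child.get(id_attr) not in wanted_ids:
            root.remove(child)
    return root


# Toy corpus (made up for illustration, not the ssj500k format).
toy_corpus = """<corpus>
  <doc id="a.1">first</doc>
  <doc id="a.2">second</doc>
  <doc id="b.7">third</doc>
</corpus>"""

reduced = keep_documents(toy_corpus, wanted_ids={"a.1", "b.7"})
kept_ids = [d.get("id") for d in reduced.findall("doc")]
print(kept_ids)
```

Filtering once and saving the smaller file means later runs parse only the documents they need, which is the stated purpose of the reduced dataset.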

## SentiCoref

In [0]:
%cd /content/slovene-coreference-resolution/data

!wget https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1285/SentiCoref_1.0.zip
!unzip -q SentiCoref_1.0.zip -d senticoref1_0
%rm SentiCoref_1.0.zip
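
Before moving on, it can help to check that the downloads ended up where the commands above put them. The following is a small sketch that assumes the default paths used in this notebook; `missing_paths` is a hypothetical helper, and you should trim the `EXPECTED` list to the datasets you actually fetched.

```python
from pathlib import Path

# Default data directory used by the cells in this notebook.
DATA_DIR = Path("/content/slovene-coreference-resolution/data")

# Paths (relative to DATA_DIR) produced by the download and trimming cells.
EXPECTED = [
    "coref149",
    "ssj500k-sl.TEI/ssj500k-sl.body.reduced.xml",
    "senticoref1_0",
]


def missing_paths(data_dir, expected):
    """Return the entries of `expected` that do not exist under `data_dir`."""
    data_dir = Path(data_dir)
    return [rel for rel in expected if not (data_dir / rel).exists()]


missing = missing_paths(DATA_DIR, EXPECTED)
print("Missing:", missing if missing else "nothing")
```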

# 3. Obtaining pre-trained data

With the datasets in place, the next step is to obtain the pre-trained data. This includes:

- pre-trained models we made as part of our work (these can be evaluated immediately, without training),
- pre-trained base data (word2vec embeddings, ELMo embeddings and BERT) that the models build on.

### Pre-trained models (from our work)

This section is relevant only **if you want to evaluate pre-trained models** we prepared in our work. If you are interested in training models from scratch, skip to the next section.

Pre-trained models are available at the following Google Drive link:

https://drive.google.com/open?id=15xKYqSy5WgedFIPGP-HZz7YVmZa6oKg_ (~100 MB zip of all the pre-trained models).

The following cells download the zip, extract it into `/content/pretrained-models/`, and then copy the models to the locations where our source code expects them:

- baseline models go into `<repository root>/src/baseline_model/<model_name>`
- non-contextual models go into `<repository root>/src/noncontextual_model/<model_name>`
- contextual models (ELMo emb.) go into `<repository root>/src/contextual_model_elmo/<model_name>`
- contextual models (BERT emb.) go into `<repository root>/src/contextual_model_bert/<model_name>`.

If you train your own models, they will also be saved in the locations listed above.

In [0]:
%mkdir -p /content/pretrained-models/
%cd /content/pretrained-models/

!gdown --id "15xKYqSy5WgedFIPGP-HZz7YVmZa6oKg_"
!unzip -q slo-coref-pretrained.zip
%rm slo-coref-pretrained.zip
%ls

In [0]:
# Copy pre-trained baseline models
%mkdir -p /content/slovene-coreference-resolution/src/baseline_model/
%cp -R baseline_coref149_0.05/ /content/slovene-coreference-resolution/src/baseline_model/
%cp -R baseline_senticoref_0.005/ /content/slovene-coreference-resolution/src/baseline_model/

In [0]:
# Copy pre-trained non-contextual models
%mkdir -p /content/slovene-coreference-resolution/src/noncontextual_model/
%cp -R nc_w2v100_unfrozen_hs1024_senticoref_0.001_dr0.4/ /content/slovene-coreference-resolution/src/noncontextual_model/
%cp -R nc_w2v100_unfrozen_hs512_coref149_0.001_dr0.0/ /content/slovene-coreference-resolution/src/noncontextual_model/

In [0]:
# Copy pre-trained contextual (ELMo) models
%mkdir -p /content/slovene-coreference-resolution/src/contextual_model_elmo/
%cp -R elmo_coref149_lr0.005_fchs64_dr0.4/ /content/slovene-coreference-resolution/src/contextual_model_elmo/
%cp -R elmo_senticoref_lr0.001_hs512_fchs512_dr0.6/ /content/slovene-coreference-resolution/src/contextual_model_elmo/

In [0]:
# Copy pre-trained contextual (BERT) models
%mkdir -p /content/slovene-coreference-resolution/src/contextual_model_bert/
%cp -R bert_coref149_lr0.001_fchs64_dr0.4/ /content/slovene-coreference-resolution/src/contextual_model_bert/
%cp -R bert_senticoref_lr0.001_fchs1024_seg256_dr0.2/ /content/slovene-coreference-resolution/src/contextual_model_bert/
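
The four copy cells above follow one pattern, so they could also be expressed as a single loop. The sketch below is one possible way to do that; it assumes that the name prefixes seen in the archive (`baseline_`, `nc_`, `elmo_`, `bert_`) reliably identify the model type, which holds for the models listed here but is our assumption rather than a rule stated by the repository. The helpers `destination_for` and `copy_pretrained` are hypothetical.

```python
import shutil
from pathlib import Path

# Prefix of the model directory name -> subdirectory of <repository root>/src.
PREFIX_TO_DIR = {
    "baseline_": "baseline_model",
    "nc_": "noncontextual_model",
    "elmo_": "contextual_model_elmo",
    "bert_": "contextual_model_bert",
}


def destination_for(model_name, repo_root):
    """Return the directory a pre-trained model should be copied to."""
    for prefix, subdir in PREFIX_TO_DIR.items():
        if model_name.startswith(prefix):
            return Path(repo_root) / "src" / subdir / model_name
    raise ValueError(f"Unrecognized model name prefix: {model_name}")


def copy_pretrained(extracted_dir, repo_root):
    """Copy every recognized model directory; return the list of targets."""
    copied = []
    for model_dir in sorted(Path(extracted_dir).iterdir()):
        if not model_dir.is_dir():
            continue
        try:
            target = destination_for(model_dir.name, repo_root)
        except ValueError:
            continue  # skip directories that do not look like model directories
        target.parent.mkdir(parents=True, exist_ok=True)
        # dirs_exist_ok requires Python 3.8 or newer.
        shutil.copytree(model_dir, target, dirs_exist_ok=True)
        copied.append(target)
    return copied


EXTRACTED = Path("/content/pretrained-models")
REPO_ROOT = Path("/content/slovene-coreference-resolution")
if EXTRACTED.is_dir():
    for target in copy_pretrained(EXTRACTED, REPO_ROOT):
        print("Copied to", target)
```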

### Pre-trained base data

You **must obtain** the data for each model you intend to run, whether you train or evaluate it.

#### Slovene word2vec embeddings

Needed only for **non-contextual model (word2vec embeddings)**.

In [0]:
%cd /content/slovene-coreference-resolution/data

!wget http://vectors.nlpl.eu/repository/20/67.zip
!unzip -q 67.zip
%rm 67.zip

#### Pre-trained Slovene ELMo embeddings

Needed only for **contextual model with ELMo embeddings**.

In [0]:
%cd /content/slovene-coreference-resolution/data

# Download pretrained Slovene ELMo
!wget https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1277/slovenian-elmo.tar.gz
!tar -xvzf slovenian-elmo.tar.gz
!mv slovenian slovenian-elmo
%rm slovenian-elmo.tar.gz

#### Pre-trained Slovene BERT

Needed only for training **contextual model with BERT embeddings**.

It is available at the following Google Drive link:

https://drive.google.com/open?id=1IC9SidISzE8xAfqoZJ3lJtp3XVS1nNQQ

In [0]:
%cd /content/slovene-coreference-resolution/data

!gdown --id "1IC9SidISzE8xAfqoZJ3lJtp3XVS1nNQQ"
!unzip -q slo-hr-en-bert-pytorch.zip
%rm slo-hr-en-bert-pytorch.zip

import os
os.environ['CUSTOM_PRETRAINED_BERT_DIR'] = "/content/slovene-coreference-resolution/data/slo-hr-en-bert-pytorch"
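
Commands started with `!python ...` run as child processes of the notebook kernel, so they inherit variables set through `os.environ`. The variable is lost when the runtime restarts, though, so it is worth checking before running the BERT cells. The helper below is a minimal sketch with a hypothetical name (`bert_dir_status`); it only inspects the variable and does not reflect how the repository's scripts read it.

```python
import os
from pathlib import Path


def bert_dir_status(env=os.environ, var="CUSTOM_PRETRAINED_BERT_DIR"):
    """Report whether the BERT directory variable is set and points to a directory."""
    value = env.get(var)
    if not value:
        return "unset"
    return "ok" if Path(value).is_dir() else "missing directory"


print("CUSTOM_PRETRAINED_BERT_DIR:", bert_dir_status())
```

If the status is anything other than `ok`, rerun the cell above.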

# 4. Running the models (training/evaluation)

With all the necessary data obtained, you can now actually run the models.
Below are cells for each model that run evaluation on pre-trained models.

#### Training your own models

If you want to train your own models, you can take these cells as a template.
When training, you should define a new unique model name. Running the script with **a model name that already exists will only load the model and perform evaluation**.

When loading an existing model, the parameters you pass must match the parameters the model was saved with. (The cells for the pre-trained models below are already set up this way.)
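
One way to keep the model name and the passed parameters in sync is to build the command from a single dictionary. The sketch below assumes you run it from the `src` folder; the helpers `build_command` and `already_trained` and the model name `my_baseline_coref149_0.01` are hypothetical, while the flags are the ones used in the cells of this notebook.

```python
import shlex
from pathlib import Path


def build_command(script, model_name, options, switches=()):
    """Build the argument list for one of the training/evaluation scripts."""
    cmd = ["python", script, f"--model_name={model_name}"]
    cmd += [f"--{key}={value}" for key, value in options.items()]
    cmd += [f"--{name}" for name in switches]  # flags without a value
    return cmd


def already_trained(model_root, model_name):
    """True if a model directory with this name exists, so the script would only evaluate."""
    return (Path(model_root) / model_name).is_dir()


cmd = build_command(
    "baseline.py",
    "my_baseline_coref149_0.01",  # hypothetical new model name
    {"learning_rate": 0.01, "dataset": "coref149", "num_epochs": 20},
    switches=["fixed_split"],
)
print(shlex.join(cmd))  # shlex.join requires Python 3.8 or newer
print("Will only evaluate:", already_trained("baseline_model", "my_baseline_coref149_0.01"))
```

The printed command can be pasted into a cell after `!`, or executed with `subprocess.run(cmd)`.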

**Note:** You should be in the `/content/slovene-coreference-resolution/src` folder when executing the cells below.

In [0]:
%cd /content/slovene-coreference-resolution/src

## Baseline model

In [0]:
# Baseline evaluation of pre-trained model (coref149 dataset)
!python baseline.py \
    --model_name="baseline_coref149_0.05" \
    --learning_rate="0.05" \
    --dataset="coref149" \
    --num_epochs="20" \
    --fixed_split

In [0]:
# Baseline evaluation of pre-trained model (senticoref dataset)
!python baseline.py \
    --model_name="baseline_senticoref_0.005" \
    --dataset="senticoref" \
    --learning_rate="0.005" \
    --num_epochs="50" \
    --fixed_split

## Non-contextual model

In [0]:
# Non-contextual model evaluation of pre-trained model (coref149 dataset)
!python noncontextual_model.py \
    --model_name="nc_w2v100_unfrozen_hs512_coref149_0.001_dr0.0" \
    --fc_hidden_size="512" \
    --dropout="0.0" \
    --learning_rate="0.001" \
    --dataset="coref149" \
    --embedding_size="100" \
    --use_pretrained_embs="word2vec" \
    --freeze_pretrained \
    --fixed_split

In [0]:
# Non-contextual model evaluation of pre-trained model (senticoref dataset)
!python noncontextual_model.py \
    --model_name="nc_w2v100_unfrozen_hs1024_senticoref_0.001_dr0.4" \
    --fc_hidden_size="1024" \
    --dropout="0.4" \
    --learning_rate="0.001" \
    --dataset="senticoref" \
    --embedding_size="100" \
    --use_pretrained_embs="word2vec" \
    --freeze_pretrained \
    --fixed_split

## Contextual model with ELMo

**Note:** In Google Colab, do not forget to set the runtime type to GPU for faster evaluation.

In [0]:
# Contextual (ELMo) model evaluation of pre-trained model (coref149 dataset)
!python contextual_model_elmo.py \
    --model_name="elmo_coref149_lr0.005_fchs64_dr0.4" \
    --dataset="coref149" \
    --fc_hidden_size="64" \
    --dropout="0.4" \
    --learning_rate="0.005" \
    --num_epochs="20" \
    --freeze_pretrained \
    --fixed_split

In [0]:
# Contextual (ELMo) model evaluation of pre-trained model (senticoref dataset)
!python contextual_model_elmo.py \
    --model_name="elmo_senticoref_lr0.001_hs512_fchs512_dr0.6" \
    --dataset="senticoref" \
    --hidden_size="512" \
    --fc_hidden_size="512" \
    --dropout="0.6" \
    --learning_rate="0.001" \
    --num_epochs="20" \
    --freeze_pretrained \
    --fixed_split

## Contextual model with BERT

**Note:** In Google Colab, do not forget to set the runtime type to GPU for faster evaluation.

In [0]:
# Contextual (BERT) model evaluation of pre-trained model (coref149 dataset)
!python contextual_model_bert.py \
    --model_name="bert_coref149_lr0.001_fchs64_dr0.4" \
    --dataset="coref149" \
    --fc_hidden_size="64" \
    --dropout="0.4" \
    --learning_rate="0.001" \
    --num_epochs="20" \
    --freeze_pretrained \
    --fixed_split

In [0]:
# Contextual (BERT) model evaluation of pre-trained model (senticoref dataset)
!python contextual_model_bert.py \
    --model_name="bert_senticoref_lr0.001_fchs1024_seg256_dr0.2" \
    --dataset="senticoref" \
    --fc_hidden_size="1024" \
    --dropout="0.2" \
    --learning_rate="0.001" \
    --num_epochs="20" \
    --freeze_pretrained \
    --fixed_split