# ***ACL22 MLC Paper - Baselines***

This repository contains the instructions needed to reproduce the state-of-the-art models that we use as baselines in our paper. Note that the repositories listed below are maintained by their original authors, so if any problem arises with their code, please open an issue in the corresponding repository.

Source code:

1. Layered: [https://github.com/meizhiju/layered-bilstm-crf](https://github.com/meizhiju/layered-bilstm-crf)
2. Exhaustive: [https://github.com/csJd/deep_exhaustive_model](https://github.com/csJd/deep_exhaustive_model)
3. Boundary: [https://github.com/thecharm/boundary-aware-nested-ner](https://github.com/thecharm/boundary-aware-nested-ner)
4. Recursive-CRF: [https://github.com/yahshibu/nested-ner-tacl2020-flair](https://github.com/yahshibu/nested-ner-tacl2020-flair)

In [None]:
# First, we set up the working environment in Google Drive. If you are working locally, this step is not necessary, but make sure that you are using a GPU.
from google.colab import drive
drive.mount('/content/gdrive')
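
If you are working locally, one quick way to confirm that a GPU is visible is to query `nvidia-smi`. The following is a minimal sketch and not part of any of the baseline repositories; the helper name `gpu_visible` is hypothetical, and it assumes an NVIDIA driver that ships the `nvidia-smi` tool.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # "nvidia-smi -L" prints one line per detected GPU.
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
        )
    except OSError:
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

In Colab, a `False` result suggests that the runtime type has not been set to GPU.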

In [None]:
# We will clone the repositories in the "MyDrive" folder.
%cd gdrive/MyDrive/

In [None]:
# We create a folder where we will clone each repository. The -p flag makes this step safe to re-run if the folder already exists.
!mkdir -p mlc-paper-baselines

# ***Layered model***

In [None]:
# We advance to the folder where we will save the baselines.
%cd mlc-paper-baselines/

In [None]:
# Clone the project from the official repository. If you have already cloned it, skip this step.
!git clone https://github.com/meizhiju/layered-bilstm-crf.git

In [None]:
# Navigate to the "src" folder. This folder contains the main scripts with which we will train and test this model.
%cd layered-bilstm-crf/src/

On the left side of Google Colab, you can navigate to the folder containing the project (MyDrive > mlc-paper-baselines > layered-bilstm-crf > src). To execute this model, we had to make the following changes to the source code:

1. **Data loading**: In the folder named "dataset", put the files train.data, dev.data and test.data. Make sure that you follow the format explained in the repository. 

2. **Embeddings**: Create a folder named "embeddings" inside the src folder, and put the embeddings described in the repository.

3. **Hyperparameters**: Change the hyperparameters and directories in the config file (add ../src/embeddings/embedding_name to the path_pre_emb parameter). If you are working with the Chilean Waiting List corpus, remember that its clinical embeddings have a dimension of 300. If you are working with a GPU, set the 'main' key to 0, and if you are training, set the 'mode' key to 'train'. As an example, for the Chilean corpus:


```
- word_embedding_dim: 300 
- char_embedding_dim: 25
- dropout_ratio: 0.3
- lr_param: 0.001
- threshold: 5 
- decay_rate: 0
- batch_size: 16
- tags: 7 
- epochs: 20
- replace_digit: false
- lowercase: false
- use_singletons: false
```




4. **Code fixes**: Go to the layered_model.py script in the model folder. Due to an error generated by the class called "Evaluator", you must make the following two changes in that class:

```
1. Before (line 397): def __init__(self, iterator, target, device): 
   Now: def __init__(self, iterator, target, device=cuda.cupy):

2. Before (line 398): super(Evaluator, self).__init__(iterator=iterator, target=target, device=device) 
   Now: super(Evaluator, self).__init__(iterator=iterator, target=target)
```
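
For step 3, it may help to confirm the dimension of the embedding file before setting word_embedding_dim. The sketch below assumes that the embeddings are stored as plain text in word2vec style (an optional `<vocab_size> <dim>` header followed by one `<word> <values...>` line per word); the function name `embedding_dim` and the demo file are hypothetical.

```python
import os
import tempfile


def embedding_dim(vec_path: str) -> int:
    """Infer the vector dimension of a word2vec-style text file."""
    with open(vec_path, encoding="utf-8") as handle:
        first = handle.readline().split()
    if len(first) == 2 and all(token.isdigit() for token in first):
        # Header line of the form "<vocab_size> <dim>".
        return int(first[1])
    # No header: the first line is "<word> <v1> ... <vn>".
    return len(first) - 1


# Demo on a tiny throwaway file (assumption: real files follow the same layout).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vec")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("2 3\nhola 0.1 0.2 0.3\nmundo 0.4 0.5 0.6\n")
    print(embedding_dim(demo_path))  # prints 3
```

For the Chilean Waiting List embeddings, the returned value should be 300.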



In [None]:
# We install the repository dependencies.
!pip install chainer
!pip install texttable
!pip install 'cupy-cuda101>=7.7.0,<8.0.0'

In [None]:
# Next, we train the Layered model.
!python train.py

To evaluate, change the mode from train to test in the configuration file and use the "path_model" key to specify the path to the best model, which is located in the "result" folder. The output file will be written to the "evaluation" folder, and the task-specific metrics can then be calculated from it (see the metrics Jupyter notebook). In addition, to keep the predictions file, comment out the following lines in utils.py, which is in the "model" folder:



```
1. Before (line 202): os.remove(output_path)
   Now: #os.remove(output_path)

2. Before (line 203): os.remove(scores_path)
   Now: #os.remove(scores_path)
```



In [None]:
# We now run the model that performed best on the validation set on our test data. The file with the predictions will be located in the "evaluation" folder with the extension '.scores'.
!python test.py
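
To pick up the predictions file programmatically, something like the following can be used. It is only a sketch: the helper name `find_score_files` is hypothetical, and it assumes that the "evaluation" folder is reachable from the current working directory.

```python
from pathlib import Path
from typing import List


def find_score_files(evaluation_dir: str = "evaluation") -> List[Path]:
    """Return the '.scores' files in a folder, most recently modified first."""
    folder = Path(evaluation_dir)
    if not folder.is_dir():
        return []
    return sorted(
        folder.glob("*.scores"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )


for score_file in find_score_files():
    print(score_file)
```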

# ***Exhaustive model***

In [None]:
# If you were in the layered folder
#%cd ..
#%cd ..

In [None]:
# We advance to the folder where we will save the baselines.
%cd mlc-paper-baselines/

In [None]:
# Clone the project from the official repository. If you have already cloned it, skip this step.
!git clone https://github.com/csJd/deep_exhaustive_model.git

In [None]:
# In the folder called "data", we will create a folder for each dataset. We also create a folder for the models.
%cd deep_exhaustive_model/data
!mkdir genia
!mkdir wl
!mkdir germ
!mkdir model
%cd ..
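
Since mkdir reports an error when a folder already exists, an alternative that can be re-run safely is to create the folders from Python. This is a sketch under the assumption that the real call is made from the repository root; the helper name `ensure_dirs` is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def ensure_dirs(base: str, names: Iterable[str]) -> List[Path]:
    """Create base/name for every name, ignoring folders that already exist."""
    paths = []
    for name in names:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


# Demo in a throwaway directory; calling it twice shows that re-running is safe.
with tempfile.TemporaryDirectory() as tmp:
    ensure_dirs(tmp, ["genia", "wl", "germ", "model"])
    created = ensure_dirs(tmp, ["genia", "wl", "germ", "model"])
    print(sorted(path.name for path in created))
```

In the notebook, the equivalent call from the deep_exhaustive_model folder would be `ensure_dirs("data", ["genia", "wl", "germ", "model"])`.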

On the left side of Google Colab, you can navigate to the folder containing the project (MyDrive > mlc-paper-baselines > deep_exhaustive_model). To execute this model, we had to make the following changes to the source code:

The steps below use the Chilean Waiting List dataset as an example.

1. **Data loading**: In the folder named "data", put the files wl.train.iob2, wl.dev.iob2 and wl.test.iob2 in the wl folder, following the format explained in the repository.

2. **Embeddings**: Put the embeddings described in the repository in the "embedding" folder, which is inside the "data" folder.

3. **Hyperparameters**: Change the hyperparameters and directories in the "train.py" script. If you are working with the Chilean Waiting List corpus, remember that its clinical embeddings have a dimension of 300 (line 90). As an example, for the Chilean corpus:



```
- embedding_dim: 300 
- char_feat_dim: 25
- learning_rate: 0.001
- clip_norm: 5 
- batch_size: 16
- epochs: 30
- hidden_size: 128
```

4. **Code fixes**: You must make the following changes in lines 24 to 28.

```
EMBD_URL = from_project_root("data/embedding/cwlce.vec")
VOCAB_URL = from_project_root("data/wl/vocab.json")
TRAIN_URL = from_project_root("data/wl/wl.train.iob2")
DEV_URL = from_project_root("data/wl/wl.dev.iob2")
TEST_URL = from_project_root("data/wl/wl.test.iob2")
```

In the "dataset.py" script, you have to change the "data_urls" variable (line 272) to:

```
data_urls = [from_project_root("data/wl/wl.train.iob2"),
                 from_project_root("data/wl/wl.dev.iob2"),
                 from_project_root("data/wl/wl.test.iob2")]
```

Then add the pre-trained embedding path (line 275):
```
prepare_vocab(data_urls, from_project_root("data/embedding/cwlce.vec"), update=True, min_count=1)
```

Finally, in "model.py", replace the CharLSTM class (line 101) with the following:


```
class CharLSTM(nn.Module):

    def __init__(self, n_chars, embedding_size, hidden_size, lstm_layers=1, bidirectional=True):
        super().__init__()
        self.n_chars = n_chars
        self.embedding_size = embedding_size
        self.n_hidden = hidden_size * (1 + bidirectional)

        self.embedding = nn.Embedding(n_chars, embedding_size, padding_idx=0)

        self.lstm = nn.LSTM(
            input_size=embedding_size,
            hidden_size=hidden_size,
            bidirectional=bidirectional,
            num_layers=lstm_layers,
            batch_first=True,
        )

    def sent_forward(self, words, lengths, indices):
        sent_len = words.shape[0]
        # words shape: (sent_len, max_word_len)

        embedded = self.embedding(words)
        # in_data shape: (sent_len, max_word_len, embedding_dim)

        packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu().numpy(), batch_first=True)
        _, (hn, _) = self.lstm(packed)
        # shape of hn:  (n_layers * n_directions, sent_len, hidden_size)

        hn = hn.permute(1, 0, 2).contiguous().view(sent_len, -1)
        hn2 = hn.clone()
        # shape of hn:  (sent_len, n_layers * n_directions * hidden_size) = (sent_len, 2*hidden_size)

        # shape of indices: (sent_len, max_word_len)
        hn[indices] = hn2  # unsort hn
        # unsorted = hn.new_empty(hn.size())
        # unsorted.scatter_(dim=0, index=indices.unsqueeze(-1).expand_as(hn), src=hn)
        return hn

    def forward(self, sentence_words, sentence_word_lengths, sentence_word_indices):
        # sentence_words [batch_size, *sent_len, max_word_len]
        # sentence_word_lengths [batch_size, *sent_len]
        # sentence_word_indices [batch_size, *sent_len, max_word_len]

        batch_size = len(sentence_words)
        batch_char_feat = torch.nn.utils.rnn.pad_sequence(
            [self.sent_forward(sentence_words[i], sentence_word_lengths[i], sentence_word_indices[i])
             for i in range(batch_size)], batch_first=True)

        return batch_char_feat
        # (batch_size, sent_len, 2 * hidden_size)
```
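
The `hn[indices] = hn2` line restores the original word order after the words were sorted for `pack_padded_sequence`, which by default expects sequences ordered by decreasing length. The pure-Python sketch below illustrates that round trip on a list of strings instead of tensors. It assumes that `indices[k]` holds the original position of the k-th sorted word; the function names are hypothetical and not part of the repository.

```python
def sort_by_length(words):
    """Sort words by decreasing length; also return their original positions."""
    order = sorted(range(len(words)), key=lambda i: len(words[i]), reverse=True)
    return [words[i] for i in order], order


def unsort(sorted_items, order):
    """Put the k-th sorted item back at its original position order[k]."""
    restored = [None] * len(sorted_items)
    for k, original_pos in enumerate(order):
        restored[original_pos] = sorted_items[k]
    return restored


words = ["de", "hipertension", "a", "control"]
sorted_words, order = sort_by_length(words)
print(sorted_words)  # longest word first
print(unsort(sorted_words, order) == words)
```

Writing into a separate list here plays the role that `hn.clone()` presumably plays in the class above: the values are read from a copy, so no row is overwritten before it has been moved.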






In [None]:
# Now, we create the dataset pickle files.
!python dataset.py
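
If dataset.py stops because it cannot find a file, a check like the following can help locate the problem. It is a sketch that assumes the Chilean Waiting List paths used in the steps above and that it is run from the repository root; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path
from typing import List, Sequence

# Paths taken from the Chilean Waiting List example; adjust them for other corpora.
REQUIRED_FILES = (
    "data/embedding/cwlce.vec",
    "data/wl/wl.train.iob2",
    "data/wl/wl.dev.iob2",
    "data/wl/wl.test.iob2",
)


def missing_files(project_root: str, required: Sequence[str] = REQUIRED_FILES) -> List[str]:
    """Return the required files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).is_file()]


# An empty list means that every expected file is in place.
print(missing_files("."))
```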

In [None]:
# And we train the Exhaustive model.
!python train.py

To evaluate and generate the prediction file, in the eval.py script, uncomment the following line (line 144): 

```
# predict_on_iob2(model, test_url)
```

Change the model_url to the best model found (line 139):

```
model_url = from_project_root("data/model/model.pt")
```

And, change the test_url (line 143):


```
test_url = from_project_root("data/wl/wl.test.iob2")
```

In [None]:
!python eval.py

The output file, wl.test.pred.txt, will be located in the corpus folder (wl in this example). The task-specific metrics can be calculated from it with our metrics notebook. Note that this output does not consider multilabel entities. Therefore, the Chilean Waiting List predictions must be compared with the original data to obtain the objective metric. This is done in the metrics notebook.

# ***Boundary model***

In [None]:
# If you were in the exhaustive folder
# %cd ..
# %cd ..

In [None]:
# We advance to the folder where we will save the baselines.
%cd mlc-paper-baselines/

In [None]:
# Clone the project from the unofficial repository. If you have already cloned it, skip this step.
!git clone https://github.com/thecharm/boundary-aware-nested-ner

In [None]:
%cd boundary-aware-nested-ner/Our_boundary-aware_model/

In [None]:
# In the folder called "data", we will create a folder for each dataset. We also create a folder for the models and the embeddings folder.
%cd data
!mkdir wl
!mkdir germ
!mkdir model
!mkdir embedding
%cd ..

On the left side of Google Colab, you can navigate to the folder containing the project (MyDrive > mlc-paper-baselines > boundary-aware-nested-ner > Our_boundary-aware_model). To execute this model, we had to make the following changes to the source code:

The steps below use the Chilean Waiting List dataset as an example.

1. **Data loading**: In the "data" folder, place the files wl.train.iob2, wl.dev.iob2 and wl.test.iob2 in the wl folder, following the format explained in the repository. (As an example, we will use the case of the waiting list).

2. **Embeddings**: Put the embeddings described in the repository in the "embedding" folder, which is inside the "data" folder.

3. **Hyperparameters**: Change the hyperparameters and directories in the "train.py" script. If you are working with the Chilean Waiting List corpus, remember that these clinical embeddings have a dimension of 300 (line 95). 



```
MAX_REGION = 10
EARLY_STOP = 5
LR = 0.005
BATCH_SIZE = 16 
MAX_GRAD_NORM = 5
N_TAGS = 8
TAG_WEIGHTS = [1, 1, 1, 1, 1, 1, 1, 1]
FREEZE_WV = False
LOG_PER_BATCH = 10
hidden_size=128,
lstm_layers=3
```


4. **Code fixes**

Lines 34-38

```
PRETRAINED_URL = from_project_root("data/embedding/cwlce.vec")
EMBED_URL = from_project_root("data/wl/embeddings.npy")
VOCAB_URL = from_project_root("data/wl/vocab.json")
TRAIN_URL = from_project_root("data/wl/wl.train.iob2")
DEV_URL = from_project_root("data/wl/wl.dev.iob2")
TEST_URL = from_project_root("data/wl/wl.test.iob2")
```

Change tags and weights
```
N_TAGS = 8  # line 29
TAG_WEIGHTS = [1, 1, 1, 1, 1, 1, 1, 1]  # line 30
```

And delete the following lines:
```
import pdb  # line 21
pdb.set_trace()  # line 22
```

In the "dataset.py" script, change data_urls (line 286) to:

```
data_urls = [from_project_root("data/wl/wl.train.iob2"),
                 from_project_root("data/wl/wl.dev.iob2"),
                 from_project_root("data/wl/wl.test.iob2")]
```

Then add the pre-trained embedding path (line 289):
```
prepare_vocab(data_urls, from_project_root("data/embedding/cwlce.vec"), update=True, min_count=1)
```

Change these three variables (lines 15-17):
```
LABEL_IDS = {"neither": 0, "Disease": 1, "Finding": 2, "Medication": 3, "Procedure": 4, "Family_Member": 5,"Body_Part": 6, "Abbreviation": 7}
PRETRAINED_URL = from_project_root("data/embedding/cwlce.vec")
LABEL_LIST = {"O", "Disease", "Finding", "Medication", "Procedure", "Family_Member","Body_Part", "Abbreviation"}
```

Finally, in "model.py", replace the CharLSTM class with the following:

```
class CharLSTM(nn.Module):

    def __init__(self, n_chars, embedding_size, hidden_size, lstm_layers=1, bidirectional=True):
        super().__init__()
        self.n_chars = n_chars
        self.embedding_size = embedding_size
        self.n_hidden = hidden_size * (1 + bidirectional)

        self.embedding = nn.Embedding(n_chars, embedding_size, padding_idx=0)

        self.lstm = nn.LSTM(
            input_size=embedding_size,
            hidden_size=hidden_size,
            bidirectional=bidirectional,
            num_layers=lstm_layers,
            batch_first=True,
        )

    def sent_forward(self, words, lengths, indices):
        sent_len = words.shape[0]
        # words shape: (sent_len, max_word_len)

        embedded = self.embedding(words)
        # in_data shape: (sent_len, max_word_len, embedding_dim)

        packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu().numpy(), batch_first=True)
        _, (hn, _) = self.lstm(packed)
        # shape of hn:  (n_layers * n_directions, sent_len, hidden_size)

        hn = hn.permute(1, 0, 2).contiguous().view(sent_len, -1)
        hn2 = hn.clone()
        # shape of hn:  (sent_len, n_layers * n_directions * hidden_size) = (sent_len, 2*hidden_size)

        # shape of indices: (sent_len, max_word_len)
        hn[indices] = hn2  # unsort hn
        # unsorted = hn.new_empty(hn.size())
        # unsorted.scatter_(dim=0, index=indices.unsqueeze(-1).expand_as(hn), src=hn)
        return hn

    def forward(self, sentence_words, sentence_word_lengths, sentence_word_indices):
        # sentence_words [batch_size, *sent_len, max_word_len]
        # sentence_word_lengths [batch_size, *sent_len]
        # sentence_word_indices [batch_size, *sent_len, max_word_len]

        batch_size = len(sentence_words)
        batch_char_feat = torch.nn.utils.rnn.pad_sequence(
            [self.sent_forward(sentence_words[i], sentence_word_lengths[i], sentence_word_indices[i])
             for i in range(batch_size)], batch_first=True)

        return batch_char_feat
        # (batch_size, sent_len, 2 * hidden_size)
```
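
The label settings changed above (N_TAGS, TAG_WEIGHTS and LABEL_IDS) appear to describe the same set of eight classes, so it seems reasonable to check that they agree before training. The sketch below is one such check; the function `check_label_config` is hypothetical and not part of the repository, and the requirement that the ids run from 0 to N_TAGS - 1 is our assumption.

```python
LABEL_IDS = {"neither": 0, "Disease": 1, "Finding": 2, "Medication": 3,
             "Procedure": 4, "Family_Member": 5, "Body_Part": 6, "Abbreviation": 7}
N_TAGS = 8
TAG_WEIGHTS = [1, 1, 1, 1, 1, 1, 1, 1]


def check_label_config(label_ids, n_tags, tag_weights):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    if len(label_ids) != n_tags:
        problems.append(f"{len(label_ids)} labels but N_TAGS = {n_tags}")
    if len(tag_weights) != n_tags:
        problems.append(f"{len(tag_weights)} weights but N_TAGS = {n_tags}")
    # Assumption: ids are expected to be exactly 0 .. len(label_ids) - 1.
    if sorted(label_ids.values()) != list(range(len(label_ids))):
        problems.append("label ids are not contiguous from 0")
    return problems


print(check_label_config(LABEL_IDS, N_TAGS, TAG_WEIGHTS))  # prints []
```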

In [None]:
!python train.py

Chilean Waiting List: Replace eval.py with the version we provide in the supplementary material (eval-2.py). This edited file writes the predictions to a text file named boundary_result.txt, which can be used to calculate the task-specific metrics. This step is essential for the waiting list dataset, since it contains multilabel entities that the model cannot capture.

Genia and Germeval: Simply take the functions from the metrics notebook and call them in eval.py; this is already done in the eval-3.py file.


In both cases, change the following lines according to your dataset (only for the testing step). The best saved model can be found in the "model" folder.

```
model_url = from_project_root("data/model/end2end_model_epochnumber_score.pt")
test_url = from_project_root("data/wl/wl.test.iob2")
```

In [None]:
!python eval.py

# ***Recursive-CRF model***

In [None]:
%cd mlc-paper-baselines/

In [None]:
# Clone the project from the unofficial repository. If you have already cloned it, skip this step.
!git clone https://github.com/yahshibu/nested-ner-tacl2020-flair.git

In [None]:
%cd nested-ner-tacl2020-flair/

In [None]:
# First we create the folder where we are going to place the pre-trained word embeddings.
!mkdir embeddings

In [None]:
!pip install adabound
!pip install flair

**Code fixes**

- Add the files gen_data_for_wl.py and gen_data_for_germ.py to the folder.

- Replace the reader.py and embeddings.py files with the ones provided.

- In crf.py, if you are using a GPU, fix the device mismatch by adding the following line before line 300: indices_3 = indices_3.cuda()

All these changes were necessary; otherwise, we could not run the code from the repository.

In [None]:
# We train, and then the file with the predictions will be saved in the "dumps" folder.
!python train.py

In [None]:
%cd ..

In [None]:
# If you have already cloned the repository, skip this step.
!git clone https://github.com/yahshibu/nested-ner-tacl2020-flair.git

In [None]:
%cd nested-ner-tacl2020-flair/

In [None]:
!python gen_data_for_genia.py
!python gen_data_for_germ.py
!python gen_data_for_wl.py

In [None]:
#%cd embeddings/

In [None]:
# If you don't have the embeddings in binary format, uncomment the following lines to convert them with gensim.
#from gensim.models.keyedvectors import KeyedVectors

#model = KeyedVectors.load_word2vec_format('cwlce.vec', binary=False)
#model.save_word2vec_format('cwlce.bin', binary=True)

In [None]:
#%cd ..

In [None]:
# We train and the predictions will be saved in the dumps folder.
!python train.py