# Building a custom vocabulary for BERT
https://github.com/kwonmha/bert-vocab-builder

To achieve good results on domain-specific datasets, BERT should be pre-trained on text from that domain so that it better understands the terminology. The vocabulary of the German BERT-Base model by deepset.ai contains 'only' 30000 entries (of which 3001 are unused), and some of the most frequent terms from medical texts are absent. For example:

| German Word | English Translation  |
|-------------|----------------------|
|Pneumothorax | pneumothorax         |
|Erguss       | effusion             |
|Infiltrat    | infiltrate           |
|Dystelektase | dystelectasis        |
| ...         | ...                  |
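
A check like this can be scripted. The following is a minimal sketch, assuming a BERT-style vocabulary file with one token per line and placeholder entries that start with `[unused`. The function names and the tiny sample vocabulary are hypothetical and only serve as illustration.

```python
from typing import Iterable, List


def load_vocab(lines: Iterable[str]) -> List[str]:
    """Read one token per line, dropping newlines and empty lines."""
    return [line.rstrip("\n") for line in lines if line.strip()]


def count_unused(vocab: List[str]) -> int:
    """Count placeholder entries (assumed to start with '[unused')."""
    return sum(1 for token in vocab if token.startswith("[unused"))


def missing_terms(vocab: List[str], terms: Iterable[str]) -> List[str]:
    """Return the terms that do not occur as whole tokens in the vocabulary."""
    known = set(vocab)
    return [term for term in terms if term not in known]


# Made-up stand-in for a real vocab.txt
sample_vocab = load_vocab(["[PAD]\n", "[unused1]\n", "[unused2]\n", "Lunge\n", "##itis\n"])
unused = count_unused(sample_vocab)
missing = missing_terms(sample_vocab, ["Pneumothorax", "Erguss", "Lunge"])
print(unused, missing)
```

With a real file, the lines could come from `open(path, encoding="utf-8")`. A term reported as missing is still tokenized by BERT, but only as a sequence of smaller word pieces.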


Google Research does not provide tools to create a custom vocabulary; however, [this GitHub repository](https://github.com/kwonmha/bert-vocab-builder) by [kwonmha](https://github.com/kwonmha) does. To use the scripts, they were downloaded into the folder `bert-vocab-builder`.

The vocabulary can be built with the following bash command:

```bash
python subword_builder.py \
--corpus_filepattern {corpus_for_vocab} \
--output_filename {name_of_vocab} \
--min_count {minimum_subtoken_counts}
```

To define a reasonable minimum subtoken count, we proceeded as follows:
In a [previous notebook](https://github.com/kbressem/bert-for-radiology/blob/master/pretraining/sentencizing.ipynb), the frequency of each word across all text reports was counted and saved to a .json file. This shows how often specific words occur and allows a reasonable threshold to be chosen.
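
As an illustration of that counting step, here is a minimal sketch. It assumes plain whitespace tokenization (the linked notebook may tokenize differently) and stores the result as a list of `[word, count]` pairs under the key `__individual count__`, which is the key read from the .json file further down. The function name `count_words` and the example reports are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def count_words(reports: Iterable[str]) -> List[Tuple[str, int]]:
    """Count whitespace-separated words over all reports, most frequent first."""
    counter = Counter()
    for report in reports:
        counter.update(report.split())
    return counter.most_common()


# Made-up report snippets; real reports cannot be shared
reports = ["kein Pneumothorax kein Erguss", "kein Infiltrat"]
pairs = count_words(reports)

# Serialize in the assumed layout: {"__individual count__": [[word, count], ...]}
payload = json.dumps({"__individual count__": pairs}, ensure_ascii=False)
print(payload)
```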

## Initializing the environment

```bash
conda create --name=bert-vocab tensorflow
conda activate bert-vocab
conda install ipykernel spacy
ipython kernel install --user --name=bert-vocab
```

## Importing the .json file
Since we work with very sensitive data, neither the original text nor the .json file can be uploaded: a small risk remains that a patient name is mentioned somewhere in a report text.

```python
import json

# Load the word counts produced by the previous notebook
with open('../data/word-count-report-dump.json') as json_file:
    wordcount = json.load(json_file)
```

```python
def sortSecond(val):
    # Sort key: the count is the second element of each [word, count] pair
    return val[1]

# Most frequent words first
wordcount['__individual count__'].sort(key=sortSecond, reverse=True)
```

```python
GREATER_THAN = 1000

# Keep only the words that occur more often than the threshold
wordcount_greater = []
for i in wordcount['__individual count__']:
    if i[1] > GREATER_THAN:
        wordcount_greater.append(i)
```

```python
# Least frequent of the remaining words first
wordcount_greater.sort(key=sortSecond, reverse=False)
len(wordcount_greater)
```

```
23783
```

I would suggest setting `--min_count` to 5000.
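
To compare candidate thresholds side by side, the filtering step can be wrapped in a small helper. This is a minimal sketch with made-up counts. The function name `words_above` is hypothetical, and word frequencies are only a proxy here, since `--min_count` applies to subtoken counts.

```python
from typing import Dict, Sequence


def words_above(pairs: Sequence[Sequence], thresholds: Sequence[int]) -> Dict[int, int]:
    """For each threshold, count the [word, count] pairs whose count exceeds it."""
    return {t: sum(1 for _, count in pairs if count > t) for t in thresholds}


# Made-up counts in the same [word, count] layout as the .json file
toy_counts = [["kein", 12000], ["Erguss", 6000], ["Infiltrat", 1500], ["Dystelektase", 40]]
summary = words_above(toy_counts, [1000, 5000])
print(summary)
```

On the real data, `wordcount['__individual count__']` would take the place of `toy_counts`.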

## Generation of a custom vocabulary

Installing spaCy together with TensorFlow automatically downgrades TensorFlow to version 1.13.1. Although the code to create the custom vocabulary is based on TensorFlow 1.11, it currently works:

```bash
python subword_builder.py \
    --corpus_filepattern '../../data/report-dump.raw' \
    --output_filename '../../pretraining/vocab-bert.txt' \
    --min_count 5000
```