# Preprocessing Report Texts with spaCy
link to [spaCy](https://spacy.io/models/de#de_trf_bertbasecased_lg)

To pre-train BERT, the report texts first need to be pre-processed. The starting point was a large number of individual text files, one per report. Using R, the texts were cleaned (whitespace stripping), unusable reports were excluded, and the remaining texts were collected in a single CSV file. The structure of the CSV file is as follows:

| FILE_ADDRESS_TEXT_REPORT | TEXT |
|--------------------------|------|
| path/to/text/report/1234_51324_8571_341_.txt | Klinik, Fragestellung, Indikation: Beispieltext für Thorax erbeten. Befund und Beurteilung: Keine Voraufnahmen. Keine Stauung, kein Erguss, kein Pneu, keine Infiltrate. Knöcherner Thorax unauffällig |
| path/to/text/report/61246_523424_85245_62345_.txt | Anamnese: Weiterer Beispieltext erbeten: Befund: Keine Voraufnahmen. Hier ist ein weiterer Beispieltext. Unauffälliger Befund |
| ...  | ... |
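
The cleaning itself was done in R and is not reproduced here. Purely as an illustration, a comparable step could be sketched in Python with the standard library. The helper names `collect_reports` and `to_csv` are hypothetical, and treating a report as unusable when nothing remains after whitespace stripping is an assumption, since the actual exclusion criteria are not stated above.

```python
import csv
import io
import re

COLUMNS = ["FILE_ADDRESS_TEXT_REPORT", "TEXT"]


def collect_reports(reports):
    """Collapse whitespace and drop reports that end up empty.

    `reports` maps a file path to its raw text. The "empty means
    unusable" rule is an assumed criterion, not the original one.
    """
    rows = []
    for path, raw_text in reports.items():
        text = re.sub(r"\s+", " ", raw_text).strip()
        if text:
            rows.append({COLUMNS[0]: path, COLUMNS[1]: text})
    return rows


def to_csv(rows):
    """Serialize the rows with the two columns shown in the table above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


example = {
    "path/to/a.txt": "  Befund:   Keine\nVoraufnahmen.  ",
    "path/to/b.txt": "   \n ",  # nothing left after stripping, so it is dropped
}
rows = collect_reports(example)
print(to_csv(rows))
```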

BERT requires a specific text format for pre-training, which can be created with scripts from [Google](https://github.com/google-research/bert/blob/master/create_pretraining_data.py). 
However, for the scripts to work, even the raw input data needs a specific format. The google-research Git repository states:  

> "The input is a plain text file, with one sentence per line. (It is important that these be actual sentences for the "next sentence prediction" task). Documents are delimited by empty lines."

Our text data is not in this format and therefore needs to be pre-processed, as described below. 
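
The target format quoted above can be illustrated with a small sketch. The function name `to_pretraining_format` is hypothetical, and the sentences are shortened examples. It only shows the layout: one sentence per line, with an empty line between documents.

```python
def to_pretraining_format(documents):
    """Join documents into one string: one sentence per line,
    with an empty line between consecutive documents."""
    blocks = ["\n".join(sentences) for sentences in documents if sentences]
    return "\n\n".join(blocks) + "\n"


documents = [
    ["Keine Voraufnahmen.", "Keine Stauung, kein Erguss."],
    ["Unauffälliger Befund."],
]
formatted = to_pretraining_format(documents)
print(formatted)
```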


## Notebook summary
As different computers and operating systems were used, the code for setting up the environment is always provided as the first step.  

Two functions were defined: the `sentencizer` function, which splits each report text into sentences, and the `fix_wrong_splits` function, which repairs wrong splits produced by `sentencizer`. 

An example of wrong splitting is provided below: 

Original text:

```python
['Thorax Bedside vom 12.01.2016 Klinik, Fragestellung, Indikation: Z.n. Drainagenanlage. Frage nach Drainagenlage. Pneu? Befund und Beurteilung: Keine Voraufnahmen. 1. Kein Pneumothorax. 2. Drainage Regelrecht. 3. Zunehmende Infiltrate links. Darüber hinaus keine Befundänderung'] 
```    
    
This is split into: 

```python
['Thorax Bedside vom 12.01.2016 Klinik, Fragestellung, Indikation: Z.n.',
 'Drainagenanlage.',
 'Frage nach Drainagenlage.',
 'Pneu?',
 'Befund und Beurteilung: Keine Voraufnahmen.',
 '1.',
 'Kein Pneumothorax.',
 '2.',
 'Drainage Regelrecht.',
 '3.',
 'Zunehmende Infiltrate links.',
 'Darüber hinaus keine Befundänderung'] 
```    
As can be seen, this is not optimal. After using `fix_wrong_splits`, it will instead be converted into: 

```python
['Thorax Bedside vom 12.01.2016 Klinik, Fragestellung, Indikation: Z.n. Drainagenanlage.',
 'Frage nach Drainagenlage.',
 'Pneu? Befund und Beurteilung: Keine Voraufnahmen.',
 '1. Kein Pneumothorax.',
 '2. Drainage Regelrecht.',
 '3. Zunehmende Infiltrate links.',
 'Darüber hinaus keine Befundänderung']
```

This still leaves some splits unfixed when they occur too close together, because a merged sentence is not checked again, but it greatly improves the overall result.  
Running the notebook took approximately 10 hours.

## Initializing the environment

```bash
conda create --name=text-preprocessing spacy
conda activate text-preprocessing
conda install ipykernel pandas
ipython kernel install --user --name=spacy
```

## Import packages
`spacy` - workhorse for sentencizing  
`pandas` - for importing the csv file  
`time` - for monitoring time of sentencizing  

```python
import spacy
from spacy.lang.de import German
import pandas as pd
import time
```

```python
nlp = German()
# spaCy v2 API; in spaCy v3 the equivalent call is nlp.add_pipe('sentencizer')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
```

```python
texts = pd.read_csv('../data/cleaned-text-dump.csv', low_memory=False)
```

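```python
def sentencizer(raw_text, nlp):
    """Split one report text into a list of stripped sentences."""
    doc = nlp(raw_text)
    # sent.text is available in spaCy v2 and v3 (sent.string was removed in v3)
    sentences = [sent.text.strip() for sent in doc.sents]
    return sentences
```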
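
To see why abbreviations and enumerations end up as separate sentences, a rough standard-library approximation of punctuation-based splitting can help. This is not spaCy's implementation. The function `naive_split` and its regular expression are assumptions made only for illustration.

```python
import re


def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


parts = naive_split("Indikation: Z.n. Drainagenanlage. Pneu? 1. Kein Pneumothorax.")
print(parts)
```

Any rule that looks only at punctuation treats the period in `Z.n.` or `1.` as a sentence boundary, which is the kind of split that `fix_wrong_splits` repairs afterwards.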

## Fixing wrong splits
Sentences that end with one of a hardcoded list of abbreviations are glued to the following sentence in an `if` branch. An `elif` branch then checks whether a sentence is very short (fewer than 10 characters, e.g. _'1.'_ ) and, if so, also glues it to the next sentence.  
As the length of the sentence list varies with the number of fixes, a while-loop was used instead of a for-loop. 

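```python
def fix_wrong_splits(sentences):
    """Merge sentences that were split after an abbreviation or are very short."""
    # Endings after which the sentencizer tends to split too early.
    abbreviations = ('Z.n.', 'V.a.', 'v.a.', 'Vd.a.', 'i.v.', ' re.',
                     ' li.', 'und 4.', 'bds.', 'Bds.', 'Pat.',
                     'i.p.', 'i.P.', 'b.w.', 'i.e.L.', ' pect.',
                     'Ggfs.', 'ggf.', 'Ggf.', 'z.B.', 'a.e.',
                     'I.', 'II.', 'III.', 'IV.', 'V.', 'VI.', 'VII.',
                     'VIII.', 'IX.', 'X.', 'XI.', 'XII.')
    i = 0
    # The list shrinks with every merge, so its length is re-evaluated on each pass.
    # With this bound, the last sentence is never glued to its predecessor.
    while i < len(sentences) - 2:
        if sentences[i].endswith(abbreviations):
            sentences[i:i+2] = [' '.join(sentences[i:i+2])]
        elif len(sentences[i]) < 10:
            # Very short fragments such as '1.' belong to the next sentence.
            sentences[i:i+2] = [' '.join(sentences[i:i+2])]
        i += 1
    return sentences
```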
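
The effect of splits that follow each other too closely can be reproduced with a reduced sketch. `merge_once` mirrors the control flow of `fix_wrong_splits` with a shortened abbreviation list. `merge_with_recheck` is a hypothetical variant, not part of the notebook, that stays on a merged sentence and checks it again.

```python
ABBREVIATIONS = ("Z.n.", "Pat.")  # reduced list, for illustration only


def needs_merge(sentence):
    """True if the sentence ends with an abbreviation or is very short."""
    return sentence.endswith(ABBREVIATIONS) or len(sentence) < 10


def merge_once(sentences):
    """Same control flow as fix_wrong_splits: always advance after a check."""
    sentences = list(sentences)  # work on a copy
    i = 0
    while i < len(sentences) - 2:
        if needs_merge(sentences[i]):
            sentences[i:i + 2] = [" ".join(sentences[i:i + 2])]
        i += 1
    return sentences


def merge_with_recheck(sentences):
    """Hypothetical variant: advance only when no merge happened."""
    sentences = list(sentences)
    i = 0
    while i < len(sentences) - 2:
        if needs_merge(sentences[i]):
            sentences[i:i + 2] = [" ".join(sentences[i:i + 2])]
        else:
            i += 1
    return sentences


split = ["Pat.", "Z.n.", "Sturz vor zwei Tagen.", "Keine Fraktur.", "Kein Erguss."]
print(merge_once(split))
print(merge_with_recheck(split))
```

In the first result `'Pat. Z.n.'` remains separated from the sentence it belongs to, because the merged string is not inspected a second time. The variant joins all three fragments, at the cost of possibly producing longer merged sentences.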

```python
# Report indices at which progress is logged (every 10,000th report).
loggingstep = {i * 10000 for i in range(1000)}
```

We used the standard sentencizer from spaCy, as it performs similarly to other natural language processing models, such as `de_trf_bertbasecased_lg`. If more complex text processing is required, e.g. tokenization, the `de_trf_bertbasecased_lg` model could be used, which can be installed via: 

```bash
conda activate text-preprocessing
pip install spacy-transformers
python -m spacy download de_trf_bertbasecased_lg
```
However, using `de_trf_bertbasecased_lg` only for sentencizing is extremely slow (approx. 10-100 times slower), which is why it was not used in this notebook. 

```python
tic = time.perf_counter()  # time.clock() was removed in Python 3.8
with open('../data/report-dump.txt', 'a+', encoding='utf-8') as file:
    for i in range(len(texts)):
        text = texts.TEXT[i]
        sentences = sentencizer(text, nlp)
        sentences = fix_wrong_splits(sentences)
        # One sentence per line, followed by an empty line after each report.
        for sent in sentences:
            file.write(sent + '\n')
        file.write('\n')
        if i in loggingstep:
            toc = time.perf_counter()
            print(f'dumped the {i}th report. {toc - tic} seconds passed.')
toc = time.perf_counter()
```
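
As an alternative to a precomputed collection of logging steps, progress can also be reported with a modulo check. The following is a minimal sketch, assuming a hypothetical helper `process_with_progress` and a placeholder `handle` callback that stands in for sentencizing and writing one report.

```python
import time


def process_with_progress(items, handle, every=10000):
    """Call `handle` on each item and report progress every `every` items.

    Returns the indices at which a progress message was printed.
    """
    start = time.perf_counter()
    logged = []
    for i, item in enumerate(items):
        handle(item)
        if i % every == 0:
            elapsed = time.perf_counter() - start
            logged.append(i)
            print(f"processed item {i}; {elapsed:.1f} seconds passed.")
    return logged


# Placeholder workload: 25 items, nothing is done per item.
logged = process_with_progress(range(25), lambda item: None, every=10)
```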

All of the steps above can be executed by running the `run-sentencizing.py` file:

```bash
python run-sentencizing.py
```

## Summary statistics
The goal is to extract the total number of words and a word-frequency list. To split each string into words, `str.split()` can be used, but it only splits on whitespace, so punctuation such as colons, periods and brackets stays attached to the words.
A tokenizer would be a more robust method, but it is very slow and therefore probably not worth it.
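
The difference between whitespace splitting and a punctuation-aware split can be seen on a single line. The regular expression `\w+` is used here only as an assumed middle ground between `str.split()` and a full tokenizer. It was not used in the notebook.

```python
import re
from collections import Counter

line = "Kein Erguss, kein Pneu. Kein Infiltrat (links)."

# Whitespace splitting keeps punctuation attached to the words.
whitespace_tokens = line.split()

# \w+ keeps only runs of letters, digits and underscores.
regex_tokens = re.findall(r"\w+", line)

print(Counter(whitespace_tokens))
print(Counter(regex_tokens))
```

With whitespace splitting, `'Erguss,'` and `'Erguss'` would be counted as different words. The regular expression avoids this, but it also breaks abbreviations such as `Z.n.` into the two tokens `Z` and `n`.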

```python
# Count all words
with open('../data/report-dump.txt', 'r', encoding='utf-8-sig') as file:
    n = len(file.read().split())
```

```python
# Count lines (including the empty lines that separate reports)
lines = 0
with open('../data/report-dump.txt', 'r', encoding='utf-8-sig') as file:
    for line in file:
        lines += 1
lines
```

Output: `54068691`
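
Because the dump is large, words, sentence lines and documents can also be counted in a single pass without reading the whole file into memory. The following is a minimal sketch. The function name `corpus_statistics` is hypothetical, and the sample text stands in for the real file.

```python
import io


def corpus_statistics(lines):
    """Count words, sentence lines and documents in one pass.

    `lines` is any iterable of text lines, for example an open file.
    A document ends at an empty line, as in the format written above.
    """
    words = sentences = documents = 0
    in_document = False
    for line in lines:
        tokens = line.split()
        if tokens:
            words += len(tokens)
            sentences += 1
            in_document = True
        elif in_document:
            documents += 1
            in_document = False
    if in_document:  # last document without a trailing empty line
        documents += 1
    return {"words": words, "sentences": sentences, "documents": documents}


sample = io.StringIO("Keine Voraufnahmen.\nKein Erguss.\n\nUnauffälliger Befund.\n\n")
stats = corpus_statistics(sample)
print(stats)
```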

```python
# Count individual words of the file
from collections import Counter

with open('../data/report-dump.txt', 'r', encoding='utf-8-sig') as file:
    wordcount = Counter(file.read().split())
```

```python
# Collect the overall count and the per-word counts in one dictionary
counts = {
    '__Overall count__': [['overall', n]],
    '__individual count__': list(wordcount.items()),
}
```

```python
import json

with open('../statistics/word-count-report-dump.json', 'w') as outfile:
    json.dump(counts, outfile)
```
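
The saved statistics can later be read back and sorted by frequency. The sketch below uses a tiny stand-in string with invented counts instead of the real `word-count-report-dump.json`, but follows the same key layout.

```python
import json

# Tiny stand-in for the file contents; the counts are invented.
serialized = json.dumps({
    "__Overall count__": [["overall", 7]],
    "__individual count__": [["Kein", 3], ["Erguss", 2], ["Pneu", 2]],
})

loaded = json.loads(serialized)
overall = loaded["__Overall count__"][0][1]

# Sort the [word, count] pairs by count, most frequent first.
frequencies = sorted(loaded["__individual count__"],
                     key=lambda pair: pair[1], reverse=True)

for word, count in frequencies[:2]:
    print(f"{word}\t{count}\t{count / overall:.3f}")
```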