<h1 style="background-color:#0071BD;color:white;text-align:center;padding-top:0.8em;padding-bottom: 0.8em">
  LDA Spike 1 - Cleaning
</h1>

This notebook "cleans" the text files containing answers with the help of the Natural Language Processing Library [spaCy](https://spacy.io/). By default the text files are expected to be found in the folder `Corpus` and the cleaned files are written into the folder `Cleaned`. We want to keep only useful information in the files and remove any "noise". We decided to do the following:

  * Replace all words by their lemmata ('sang', 'singe', 'singt' --> 'singen').
  * Keep the capitalization for nouns and proper nouns but otherwise change to lower case.
  * Keep only verbs, nouns, proper nouns and adjectives.
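
The three rules above can be tried out without loading a language model. The following is only a minimal sketch: the `(lemma, tag)` pairs are hand-assigned stand-ins for what a tagger might return, and the names `KEEP_CASE`, `KEEP_WORD` and `clean` are hypothetical helpers, not part of spaCy or of this notebook.

```python
# Hand-written (lemma, part-of-speech) pairs standing in for tagger output.
# These tags are an assumption for illustration, not real spaCy results.
KEEP_CASE = {'NOUN', 'PROPN'}
KEEP_WORD = {'ADJ', 'NOUN', 'PROPN', 'VERB'}

def clean(tagged_lemmata):
    """Keep only content words; lower-case everything except (proper) nouns."""
    kept = [
        lemma if pos in KEEP_CASE else lemma.lower()
        for lemma, pos in tagged_lemmata
        if pos in KEEP_WORD
    ]
    return ' '.join(kept)

tagged = [
    ('der', 'DET'), ('Kuh', 'NOUN'), ('rennen', 'VERB'), ('bis', 'SCONJ'),
    ('ich', 'PRON'), ('fallen', 'VERB'), (',', 'PUNCT'), ('in', 'ADP'),
    ('der', 'DET'), ('Vertiefung', 'NOUN'), ('.', 'PUNCT'),
]
print(clean(tagged))  # Kuh rennen fallen Vertiefung
```

Separating the rule from the tagger this way makes it easy to check the filter on its own before running it over the whole corpus.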

The randomly picked example below illustrates the effect of these transformations. There is still much room for improvement: you may try other NLP libraries or even skip this step altogether.

<font color="darkred">__This notebook writes to and reads from your file system.__ By default, all directories used are located within `~/TextData/AbgeordnetenWatch`, where `~` stands for whatever your operating system considers your home directory. To change this configuration, either change the default values in the second code cell below or edit [LDA Spike - Configuration.ipynb](./LDA%20Spike%20-%20Configuration.ipynb) and run it before you run this notebook.</font>

This notebook operates on text files. In our case we retrieved these texts from www.abgeordnetenwatch.de guided by data that was made available under the [Open Database License (ODbL) v1.0](https://opendatacommons.org/licenses/odbl/1.0/).

<p style="background-color:#66A5D1;padding-top:0.2em;padding-bottom: 0.2em" />

In [1]:
from pathlib import Path

import random as rnd
import time

import spacy

In [2]:
# Read stored values of configuration parameters or set a default

%store -r project_name
if 'project_name' not in globals():
    project_name = 'AbgeordnetenWatch'

%store -r text_data_dir
if 'text_data_dir' not in globals():
    text_data_dir = Path.home() / 'TextData'

In [3]:
update_only_missing_texts = True

corpus_dir  = text_data_dir / project_name / 'Corpus'
cleaned_dir = text_data_dir / project_name / 'Cleaned'

assert corpus_dir.exists(),                      'Directory should exist.'
assert corpus_dir.is_dir(),                      'Directory should be a directory.'
assert next(corpus_dir.iterdir(), None) is not None, 'Directory should not be empty.'

cleaned_dir.mkdir(parents=True, exist_ok=True) # Creates a local directory!

## NLP Configuration and Initialization

In [4]:
notaword_pos = ['SPACE', 'PUNCT']
keepcase_pos = ['NOUN', 'PROPN']
keepword_pos = ['ADJ', 'NOUN', 'PROPN', 'VERB']

In [5]:
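# The 'de' shortcut works with spaCy 2.x; newer releases expect a full
# model name such as 'de_core_news_sm'.
german = spacy.load('de')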

In [6]:
def cleaned_text(text):
    text_model = german(text)
    lemmata = [token.lemma_ if token.pos_ in keepcase_pos else token.lemma_.lower() 
                   for token in text_model if token.pos_ in keepword_pos]
    return ' '.join(lemmata)

In [7]:
text = 'Die Kuh rannte bis sie fiel, in die Vertiefung.'
print(text, '-->', cleaned_text(text))

Die Kuh rannte bis sie fiel, in die Vertiefung. --> Kuh rennen fallen Vertiefung


## Load all files

In [8]:
answer_filenames = []
answer_texts = []

files = sorted(corpus_dir.glob('*A*.txt'))

for file in files:
    answer_filenames.append(file.name)
    answer_texts.append(file.read_text().strip())

files = None

## Random Example Text

In [9]:
min_len = 300
max_len = 600
example_text = ''

while (len(example_text) < min_len or len(example_text) > max_len):
    # randrange excludes the upper bound; randint(0, n) could return the invalid index n.
    example = rnd.randrange(len(answer_filenames))
    example_text = answer_texts[example]

print(answer_filenames[example])
print()
print(example_text)

dr-gregor-gysi_die-linke_Q0053_2018-05-06_A01_2018-05-22_demokratie-und-bürgerrechte.txt

Sehr geehrte Frau  N.N. ,
an den elf Tagen, an denen ich nicht im Bundestag war, war ich keineswegs krank. Ich fehlte deshalb entschuldigt, weil ich Mandatspflichten, also Pflichten als Bundestagsabgeordneter außerhalb Berlins wahrnahm. So sprach ich zum Beispiel vor Gewerkschaftern, an Universitäten oder vor Unternehmern.
Mit freundlichen Grüßen
Gregor Gysi
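
The selection loop above keeps drawing until a text of suitable length turns up, so it never terminates if the corpus contains no such text. One possible alternative, sketched here with a hypothetical helper `pick_example` and made-up sample data, is to collect all candidates first and then choose among them.

```python
import random

def pick_example(texts, min_len=300, max_len=600, rng=random):
    """Return the index of a random text within the length limits, or None."""
    candidates = [
        index for index, text in enumerate(texts)
        if min_len <= len(text) <= max_len
    ]
    if not candidates:
        return None  # nothing fits; the caller decides how to proceed
    return rng.choice(candidates)

# Made-up texts: only the second one lies within the limits.
texts = ['x' * 100, 'y' * 450, 'z' * 900]
print(pick_example(texts))  # 1
```

Returning `None` instead of looping forever leaves it to the caller to relax the limits or to skip the example.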


In [10]:
# Create a model of the text. We use POS-Tagging to filter the words:
# https://spacy.io/api/annotation#pos-tagging

text_model = german(example_text)

### Lemmatized words with part of speech tags

In [11]:
for token in text_model:
    if token.pos_ in notaword_pos: 
        print(token, end='') 
    else: 
        print(token.lemma_, token.pos_, end=' ')

Sehr ADV geehrt ADJ Frau NOUN  N.N. PROPN ,
an ADP der DET elf NUM Tag NOUN ,an ADP der PRON ich PRON nicht PART im ADP Bundestag NOUN sein AUX ,sein AUX ich PRON keineswegs ADV kranken ADJ .Ich PRON fehlen VERB deshalb ADV entschuldigen VERB ,weil SCONJ ich PRON Mandatspflichten NOUN ,also ADV Pflicht NOUN als ADP Bundestagsabgeordneter NOUN außerhalb ADP Berlin PROPN wahrnehmen VERB .So ADV sprechen VERB ich PRON zum ADP Beispiel NOUN vor ADP Gewerkschafter NOUN ,an ADP Universität NOUN oder CONJ vor ADP Unternehmer NOUN .
Mit ADP freundlich ADJ Gruß NOUN 
Gregor PROPN Gysi PROPN 

### Words by part of speech

In [12]:
parts_of_speech = {}

for token in text_model:
    pos = token.pos_
    if pos in notaword_pos: continue
    words = parts_of_speech.setdefault(pos, set())
    if pos in keepcase_pos: words.add(token.text)
    else: words.add(token.text.lower())

for key in sorted(parts_of_speech.keys()):
    words = sorted(parts_of_speech[key])
    print('{:5}: {}'.format(key, ', '.join(words)))

ADJ  : freundlichen, geehrte, krank
ADP  : als, an, außerhalb, im, mit, vor, zum
ADV  : also, deshalb, keineswegs, sehr, so
AUX  : war
CONJ : oder
DET  : den
NOUN : Beispiel, Bundestag, Bundestagsabgeordneter, Frau, Gewerkschaftern, Grüßen, Mandatspflichten, Pflichten, Tagen, Universitäten, Unternehmern
NUM  : elf
PART : nicht
PRON : denen, ich
PROPN: Berlins, Gregor, Gysi, N.N.
SCONJ: weil
VERB : entschuldigt, fehlte, sprach, wahrnahm


### Lemmatizations

In [13]:
lemmatizations = sorted(set(
    token.text + ' -> ' + token.lemma_ 
    for token in text_model if token.text != token.lemma_
))
print(', '.join(lemmatizations))

Berlins -> Berlin, Gewerkschaftern -> Gewerkschafter, Grüßen -> Gruß, Pflichten -> Pflicht, Tagen -> Tag, Universitäten -> Universität, Unternehmern -> Unternehmer, den -> der, denen -> der, entschuldigt -> entschuldigen, fehlte -> fehlen, freundlichen -> freundlich, geehrte -> geehrt, krank -> kranken, sprach -> sprechen, wahrnahm -> wahrnehmen, war -> sein


### Filtered by part of speech

In [14]:
for token in text_model:
    if token.pos_ in keepword_pos: 
        print(token.lemma_, end=' ')

geehrt Frau N.N. Tag Bundestag kranken fehlen entschuldigen Mandatspflichten Pflicht Bundestagsabgeordneter Berlin wahrnehmen sprechen Beispiel Gewerkschafter Universität Unternehmer freundlich Gruß Gregor Gysi 

### Cleaned Example Text

In [15]:
print(30 * '-' + ' Original text: ' + 30 * '-')
print(example_text)
print(30 * '-' + ' Cleaned text: ' + 30 * '-')
print(cleaned_text(example_text))

------------------------------ Original text: ------------------------------
Sehr geehrte Frau  N.N. ,
an den elf Tagen, an denen ich nicht im Bundestag war, war ich keineswegs krank. Ich fehlte deshalb entschuldigt, weil ich Mandatspflichten, also Pflichten als Bundestagsabgeordneter außerhalb Berlins wahrnahm. So sprach ich zum Beispiel vor Gewerkschaftern, an Universitäten oder vor Unternehmern.
Mit freundlichen Grüßen
Gregor Gysi
------------------------------ Cleaned text: ------------------------------
geehrt Frau N.N. Tag Bundestag kranken fehlen entschuldigen Mandatspflichten Pflicht Bundestagsabgeordneter Berlin wahrnehmen sprechen Beispiel Gewerkschafter Universität Unternehmer freundlich Gruß Gregor Gysi


## Write all cleaned files

In [16]:
nlp_start_time = time.perf_counter()

success = []
failure = []
   
for filename, answer_text in zip(answer_filenames, answer_texts):

    target_file = cleaned_dir / filename
    if update_only_missing_texts and target_file.exists(): continue
        
    try:
        target_file.write_text(cleaned_text(answer_text))
        success.append(filename)

    except Exception as exception:
        failure.append((filename, exception))

    finally:
        print('\r{} files successfully processed. {} files failed.'.format(len(success), len(failure)), end='')

nlp_end_time = time.perf_counter()
print('\nParsing the text as natural language and cleaning took {:.2f}s'.format(nlp_end_time - nlp_start_time))        


Parsing the text as natural language and cleaning took 1.84s
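
The `update_only_missing_texts` switch makes the loop above incremental: files that already exist in the target folder are skipped. The following self-contained sketch shows that behaviour in a temporary directory. The helper `write_cleaned` and the sample file names are hypothetical, and `str.lower` merely stands in for the real cleaning function.

```python
import tempfile
from pathlib import Path

def write_cleaned(texts_by_name, target_dir, clean, only_missing=True):
    """Write cleaned texts; optionally skip targets that already exist."""
    written, skipped = [], []
    for name, text in texts_by_name.items():
        target = target_dir / name
        if only_missing and target.exists():
            skipped.append(name)
            continue
        target.write_text(clean(text), encoding='utf-8')
        written.append(name)
    return written, skipped

with tempfile.TemporaryDirectory() as tmp:
    target_dir = Path(tmp)
    texts = {'a.txt': 'Hallo Welt', 'b.txt': 'Guten Tag'}
    first = write_cleaned(texts, target_dir, str.lower)   # writes both files
    second = write_cleaned(texts, target_dir, str.lower)  # skips both files

print(first)
print(second)
```

A second run therefore costs almost nothing, which probably explains very short run times when most files have been cleaned before. Set the switch to `False` after changing the cleaning rules, otherwise stale files remain in place.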


In [17]:
for filename, exception in failure:
    print('Exception while processing "{}" was:'.format(filename))
    print(exception)

# A for/else branch would also run after failures, so test the list explicitly.
if not failure:
    print('No exception during preprocessing :-)')

No exception during preprocessing :-)


<table style="width:100%">
  <tr>
      <td colspan="1" style="text-align:left;background-color:#0071BD;color:white">
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">
            <img alt="Creative Commons License" style="border-width:0;float:left;padding-right:10pt"
                 src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" />
        </a>
        &copy; T. Dong, D. Speicher<br/>
        Licensed under a 
        <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/" style="color:white">
            CC BY-NC 4.0
        </a>.
      </td>
      <td colspan="2" style="text-align:left;background-color:#66A5D1">
          <b>Acknowledgments:</b>
          This material was prepared within the project
          <a href="http://www.b-it-center.de/b-it-programmes/teaching-material/p3ml/" style="color:black">
              P3ML
          </a> 
          which is funded by the Ministry of Education and Research of Germany (BMBF)
          under grant number 01/S17064. The authors gratefully acknowledge this support.
      </td>
  </tr>
</table>