# NLTK Example

*Install*:
```sh
pip install nltk
```

In [None]:
import nltk

text = u'Gerd suchte ca. 5 min. die 3 Freunde bzw. Kollegen. Sie warteten am 1. Mai in Berlin/ West: am Zoo.'

## Sentence Splitting

In [None]:
from nltk.tokenize import sent_tokenize

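# download the Punkt sentence tokenizer models
# (recent NLTK releases may look for the 'punkt_tab' resource instead;
#  if sent_tokenize raises a LookupError, try nltk.download('punkt_tab'))
nltk.download('punkt')

# loading the German Punkt tokenizer explicitly appears to be unnecessary,
# since sent_tokenize(..., language='german') selects it:
#tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')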

In [None]:
# split the text into sentences
# note: the German Punkt model can mis-split after abbreviations such as "ca." and "bzw."

sentences = sent_tokenize(text, language='german')
#sentences = tokenizer.tokenize(text)

for i, s in enumerate(sentences):
    print(i+1, '-->', s)
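
One way to work around wrong splits after abbreviations is to post-process the tokenizer output. The following is a minimal sketch in plain Python, not part of NLTK: the helper `merge_abbreviation_splits`, the abbreviation set, and the `fragments` list (a made-up example of mis-split output, not actual NLTK output) are all assumptions for illustration.

```python
def merge_abbreviation_splits(sentences, abbreviations):
    """Re-join fragments that were split right after a known abbreviation."""
    merged = []
    for sentence in sentences:
        previous_tokens = merged[-1].split() if merged else []
        if previous_tokens and previous_tokens[-1].lower() in abbreviations:
            # the previous fragment ends with an abbreviation, so it was
            # probably not a real sentence end: glue the two together
            merged[-1] = merged[-1] + ' ' + sentence
        else:
            merged.append(sentence)
    return merged


# hypothetical abbreviation list; extend it for your own texts
GERMAN_ABBREVIATIONS = {'ca.', 'min.', 'bzw.'}

# made-up example of mis-split output (assumption, not real tokenizer output)
fragments = [
    'Gerd suchte ca.',
    '5 min.',
    'die 3 Freunde bzw.',
    'Kollegen.',
    'Sie warteten am 1. Mai in Berlin/ West: am Zoo.',
]

repaired = merge_abbreviation_splits(fragments, GERMAN_ABBREVIATIONS)
for i, s in enumerate(repaired):
    print(i+1, '-->', s)
```

Note that this heuristic also merges sentences that genuinely end with one of the listed abbreviations, so the list should be kept short and specific.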

## Part of Speech Tagging

NLTK's POS tagging function `pos_tag()` does not ship a German model; its pre-trained models cover English (and Russian) only.

To POS-tag German text, you need to either
* use a different tool, such as spaCy, pattern, or Stanford CoreNLP, or
* train a POS tagger on an annotated German corpus (see https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/). However, there is no publicly available German corpus whose license officially permits commercial use.
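
To illustrate what "training a tagger on an annotated corpus" means, here is a minimal sketch of a most-frequent-tag baseline in plain Python. It is not the method from the linked article. The two training sentences are hand-written toy data with STTS-style tags (an assumption for illustration), and the function names are hypothetical. A usable tagger needs a real annotated corpus and a stronger model.

```python
from collections import Counter, defaultdict

# toy training data, hand-written for illustration only (STTS-style tags)
TRAIN = [
    [('Gerd', 'NE'), ('suchte', 'VVFIN'), ('die', 'ART'), ('Freunde', 'NN'), ('.', '$.')],
    [('Sie', 'PPER'), ('warteten', 'VVFIN'), ('am', 'APPRART'), ('Zoo', 'NN'), ('.', '$.')],
]


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for every word seen in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, model, default='NN'):
    """Tag known words with their learned tag, unknown words with a default."""
    return [(tok, model.get(tok, default)) for tok in tokens]


model = train_unigram_tagger(TRAIN)
tagged = tag_tokens(['Sie', 'suchte', 'Kollegen', '.'], model)
print(' '.join('/'.join(p) for p in tagged))
```

Falling back to `NN` for unknown words is a common baseline choice, since unseen German tokens are often nouns. It is only a rough guess.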

In [None]:
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger')

**Incorrect POS tagging: German text tagged by the English model (with English POS tags):**

In [None]:
def pos2string(tagged):
    # render (word, tag) pairs as "word/tag", separated by spaces
    return ' '.join('/'.join(p) for p in tagged)

# tag each word in every sentence
for i, s in enumerate(sentences):
    tagged = nltk.pos_tag(word_tokenize(s, language='german'))
    print(i+1, '-->', pos2string(tagged))

In [None]:
def pos_filter(tagged, prefix='NN'):
    # keep the words whose tag starts with the given prefix ('NN' = nouns, 'VB' = verbs)
    return [word for word, tag in tagged if tag.startswith(prefix)]

print('Nouns:')
for i, s in enumerate(sentences):
    print(i+1, '-->', pos_filter(nltk.pos_tag(word_tokenize(s, language='german')), 'NN'))

print('Verbs:')
for i, s in enumerate(sentences):
    print(i+1, '-->', pos_filter(nltk.pos_tag(word_tokenize(s, language='german')), 'VB'))

## Named Entity Recognition

NLTK uses a maximum entropy NE chunker that was trained on an English corpus and requires POS tags as input (see also http://mattshomepage.com/articles/2016/May/23/nltk_nec/).

To run NER on German text, you need to either
* use a different tool, such as spaCy, pattern, or Stanford CoreNLP, or
* train an NER tagger on an annotated German corpus. However, there is no publicly available German corpus whose license officially permits commercial use.


In [None]:
from nltk import ne_chunk, pos_tag

nltk.download('maxent_ne_chunker')
nltk.download('words')

**Incorrect NER tagging: German text processed by the English models:**

In [None]:
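def ner_tag(chunked):
    # plain (word, tag) tuples lie outside any entity ('O'); subtrees are entities, possibly multi-word
    return [c[0] + '/O' if isinstance(c, tuple)
            else '_'.join(w for w, _ in c.leaves()) + '/' + c.label()
            for c in chunked]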

for i, s in enumerate(sentences):
    chunked = ne_chunk(pos_tag(word_tokenize(s, language='german')))
    print(i+1, '-->', ' '.join(ner_tag(chunked)))