<a href="https://colab.research.google.com/github/onlyabhilash/spark-nlp-german/blob/main/pretrained_german_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP (pinned to the version this notebook was run with)
! pip install --ignore-installed spark-nlp==2.5.0

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


### German model specs

| Feature   | Description |
|:----------|:------------|
| **Lemma** | Trained with the **Lemmatizer** annotator on the **lemmatization-lists** dataset by `Michal Měchura` |
| **POS**   | Trained with the **PerceptronApproach** annotator on the [Universal Dependencies HDT treebank](https://universaldependencies.org/treebanks/de_hdt/index.html) |
| **NER**   | Trained with the **NerDLApproach** annotator (**Char CNNs - BiLSTM - CRF**) and **GloVe embeddings** on the **WikiNER** corpus; identifies `PER`, `LOC`, `ORG` and `MISC` entities |

In [None]:
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline

from pyspark.ml import Pipeline
from pyspark.sql.types import StringType

In [None]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.5.0
Apache Spark version:  2.4.4


In [None]:
dfTest = spark.createDataFrame([
    "Die Anfänge der EU gehen auf die 1950er-Jahre zurück, als zunächst sechs Staaten die Europäische Wirtschaftsgemeinschaft (EWG) gründeten.",
    "Angela[1] Dorothea Merkel (* 17. Juli 1954 in Hamburg als Angela Dorothea Kasner) ist eine deutsche Politikerin (CDU)."
], StringType()).toDF("text")

### Pretrained Pipelines in German
#### explain_document_md (glove_6B_300)

In [None]:
pipeline_exdo_md = PretrainedPipeline('explain_document_md', 'de')

explain_document_md download started this may take some time.
Approx size to download 449.1 MB
[OK!]


In [None]:
pipeline_exdo_md.transform(dfTest).show(2, truncate=10)

+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|      text|  document|  sentence|     token|     lemma|       pos|embeddings|       ner|  entities|
+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|Die Anf...|[[docum...|[[docum...|[[token...|[[token...|[[pos, ...|[[word_...|[[named...|[[chunk...|
|Angela[...|[[docum...|[[docum...|[[token...|[[token...|[[pos, ...|[[word_...|[[named...|[[chunk...|
+----------+----------+----------+----------+----------+----------+----------+----------+----------+



In [None]:
# Build the transformed DataFrame once and reuse it for each column
result_exdo_md = pipeline_exdo_md.transform(dfTest)
result_exdo_md.select("lemma.result").show(2, truncate=70)
result_exdo_md.select("pos.result").show(2, truncate=70)
result_exdo_md.select("entities.result").show(2, truncate=70)

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[Die, Anfang, der, EU, gehen, auf, der, 1950er-Jahre, zurück, ,, al...|
|[Angela[1], Dorothea, Merkel, (*, 17, ., Juli, 1954, in, Hamburg, a...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[DET, NOUN, DET, PROPN, VERB, ADP, DET, NOUN, ADP, PUNCT, ADP, ADV,...|
|[PROPN, PROPN, PROPN, X, NUM, PUNCT, NOUN, NUM, ADP, PROPN, ADP, PR...|
+----------------------------------------------------------------------+
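The `lemma.result` and `pos.result` arrays are easier to read side by side. The following is a minimal plain-Python sketch, assuming both arrays are parallel to the token array (one lemma and one tag per token). The values are copied from the untruncated part of the first row above; the tokens come from the first test sentence, and `align` is a hypothetical helper, not part of Spark NLP.

```python
# Values copied from the visible part of the first output row above.
tokens = ["Die", "Anfänge", "der", "EU", "gehen", "auf", "die", "1950er-Jahre", "zurück"]
lemmas = ["Die", "Anfang", "der", "EU", "gehen", "auf", "der", "1950er-Jahre", "zurück"]
pos_tags = ["DET", "NOUN", "DET", "PROPN", "VERB", "ADP", "DET", "NOUN", "ADP"]


def align(tokens, lemmas, pos_tags):
    """Zip the three parallel arrays into one (token, lemma, tag) record per token."""
    if not (len(tokens) == len(lemmas) == len(pos_tags)):
        raise ValueError("annotation arrays must have the same length")
    return list(zip(tokens, lemmas, pos_tags))


aligned = align(tokens, lemmas, pos_tags)

# Tokens whose surface form differs from the lemma.
changed = [(tok, lem) for tok, lem, _ in aligned if tok != lem]

for tok, lem, tag in aligned:
    print(f"{tok:<14}{lem:<14}{tag}")
print("Changed by the lemmatizer:", changed)
```

In this prefix only two tokens are rewritten: the plural noun `Anfänge` and the article `die`.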


#### explain_document_lg (glove_840B_300)

In [None]:
pipeline_exdo_lg = PretrainedPipeline('explain_document_lg', 'de')

explain_document_lg download started this may take some time.
Approx size to download 2.3 GB
[OK!]


In [None]:
pipeline_exdo_lg.transform(dfTest).show(2, truncate=10)

+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|      text|  document|  sentence|     token|     lemma|       pos|embeddings|       ner|  entities|
+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|Die Anf...|[[docum...|[[docum...|[[token...|[[token...|[[pos, ...|[[word_...|[[named...|[[chunk...|
|Angela[...|[[docum...|[[docum...|[[token...|[[token...|[[pos, ...|[[word_...|[[named...|[[chunk...|
+----------+----------+----------+----------+----------+----------+----------+----------+----------+



In [None]:
# Build the transformed DataFrame once and reuse it for each column
result_exdo_lg = pipeline_exdo_lg.transform(dfTest)
result_exdo_lg.select("lemma.result").show(2, truncate=70)
result_exdo_lg.select("pos.result").show(2, truncate=70)
result_exdo_lg.select("entities.result").show(2, truncate=70)

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[Die, Anfang, der, EU, gehen, auf, der, 1950er-Jahre, zurück, ,, al...|
|[Angela[1], Dorothea, Merkel, (*, 17, ., Juli, 1954, in, Hamburg, a...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[DET, NOUN, DET, PROPN, VERB, ADP, DET, NOUN, ADP, PUNCT, ADP, ADV,...|
|[PROPN, PROPN, PROPN, X, NUM, PUNCT, NOUN, NUM, ADP, PROPN, ADP, PR...|
+----------------------------------------------------------------------+


#### entity_recognizer_md (glove_6B_300)

In [None]:
pipeline_entre_md = PretrainedPipeline('entity_recognizer_md', 'de')

entity_recognizer_md download started this may take some time.
Approx size to download 440 MB
[OK!]


In [None]:
pipeline_entre_md.transform(dfTest).show(2, truncate=10)

+----------+----------+----------+----------+----------+----------+----------+
|      text|  document|  sentence|     token|embeddings|       ner|  entities|
+----------+----------+----------+----------+----------+----------+----------+
|Die Anf...|[[docum...|[[docum...|[[token...|[[word_...|[[named...|[[chunk...|
|Angela[...|[[docum...|[[docum...|[[token...|[[word_...|[[named...|[[chunk...|
+----------+----------+----------+----------+----------+----------+----------+



In [None]:
# Build the transformed DataFrame once and reuse it for each column
result_entre_md = pipeline_entre_md.transform(dfTest)
result_entre_md.select("token.result").show(2, truncate=70)
result_entre_md.select("ner.result").show(2, truncate=70)
result_entre_md.select("entities.result").show(2, truncate=70)

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[Die, Anfänge, der, EU, gehen, auf, die, 1950er-Jahre, zurück, ,, a...|
|[Angela[1], Dorothea, Merkel, (*, 17, ., Juli, 1954, in, Hamburg, a...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[O, O, O, I-ORG, O, O, O, O, O, O, O, O, O, I-LOC, O, I-ORG, I-ORG,...|
|[I-LOC, I-PER, I-PER, O, O, O, O, O, O, I-LOC, O, I-PER, I-PER, I-P...|
+----------------------------------------------------------------------+
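The `ner` column holds one tag per token, while the `entities` column holds merged chunks. The sketch below shows one simplified way such tags can be grouped into chunks: consecutive tokens with the same entity type are merged. It is an illustration in plain Python, not the implementation the pipeline uses; `group_entities` is a hypothetical helper, and joining tokens with a space is an assumption. The inputs are the first four tokens and tags of each row shown above.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens that share an entity type into (text, type) chunks."""
    chunks = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        # "O" marks a token outside any entity; otherwise keep the part after "I-" / "B-".
        entity_type = None if tag == "O" else tag.split("-", 1)[1]
        starts_new = tag.startswith("B-") or entity_type != current_type
        # Close the open chunk when the entity ends or a different one begins.
        if current_tokens and (entity_type is None or starts_new):
            chunks.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if entity_type is not None:
            current_tokens.append(token)
            current_type = entity_type
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_type))
    return chunks


# First four tokens and tags of each row in the output above.
row1 = group_entities(["Die", "Anfänge", "der", "EU"], ["O", "O", "O", "I-ORG"])
row2 = group_entities(["Angela[1]", "Dorothea", "Merkel", "(*"],
                      ["I-LOC", "I-PER", "I-PER", "O"])
print(row1)
print(row2)
```

With these tags, `Dorothea Merkel` becomes a single `PER` chunk, while `Angela[1]` ends up as a separate `LOC` chunk because the model tagged it differently.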


#### entity_recognizer_lg (glove_840B_300)

In [None]:
pipeline_entre_lg = PretrainedPipeline('entity_recognizer_lg', 'de')

entity_recognizer_lg download started this may take some time.
Approx size to download 2.3 GB
[OK!]


In [None]:
pipeline_entre_lg.transform(dfTest).show(2, truncate=10)

+----------+----------+----------+----------+----------+----------+----------+
|      text|  document|  sentence|     token|embeddings|       ner|  entities|
+----------+----------+----------+----------+----------+----------+----------+
|Die Anf...|[[docum...|[[docum...|[[token...|[[word_...|[[named...|[[chunk...|
|Angela[...|[[docum...|[[docum...|[[token...|[[word_...|[[named...|[[chunk...|
+----------+----------+----------+----------+----------+----------+----------+



In [None]:
# Build the transformed DataFrame once and reuse it for each column
result_entre_lg = pipeline_entre_lg.transform(dfTest)
result_entre_lg.select("token.result").show(2, truncate=70)
result_entre_lg.select("ner.result").show(2, truncate=70)
result_entre_lg.select("entities.result").show(2, truncate=70)

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[Die, Anfänge, der, EU, gehen, auf, die, 1950er-Jahre, zurück, ,, a...|
|[Angela[1], Dorothea, Merkel, (*, 17, ., Juli, 1954, in, Hamburg, a...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[O, O, O, I-ORG, O, O, O, O, O, O, O, O, O, I-LOC, O, I-ORG, I-ORG,...|
|[O, I-PER, I-PER, O, O, O, O, O, O, I-LOC, O, I-PER, I-PER, I-PER, ...|
+----------------------------------------------------------------------+
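The `md` and `lg` pipelines do not tag the second sentence identically. A small plain-Python comparison makes the difference explicit. The lists below contain only the first ten tokens and tags, copied from the untruncated part of the `entity_recognizer_md` and `entity_recognizer_lg` outputs in this notebook; `tag_differences` is a hypothetical helper written for this illustration.

```python
# First ten tokens of the second test sentence and their tags from both pipelines.
tokens = ["Angela[1]", "Dorothea", "Merkel", "(*", "17", ".", "Juli", "1954", "in", "Hamburg"]
tags_md = ["I-LOC", "I-PER", "I-PER", "O", "O", "O", "O", "O", "O", "I-LOC"]
tags_lg = ["O", "I-PER", "I-PER", "O", "O", "O", "O", "O", "O", "I-LOC"]


def tag_differences(tokens, tags_a, tags_b):
    """Return (index, token, tag_a, tag_b) for every position where the tags disagree."""
    return [
        (i, tok, a, b)
        for i, (tok, a, b) in enumerate(zip(tokens, tags_a, tags_b))
        if a != b
    ]


differences = tag_differences(tokens, tags_md, tags_lg)
print(differences)
```

Within this prefix the only disagreement is the first token: the `md` pipeline tags `Angela[1]` as `I-LOC`, the `lg` pipeline as `O`.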


### Pretrained Models in German

In [None]:
document = DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

lemma = LemmatizerModel.pretrained('lemma', 'de')\
    .setInputCols(['token'])\
    .setOutputCol('lemma')

pos = PerceptronModel.pretrained('pos_ud_hdt', 'de') \
    .setInputCols(['sentence', 'token'])\
    .setOutputCol('pos')

embeddings = WordEmbeddingsModel.pretrained('glove_6B_300', 'xx')\
    .setInputCols(['sentence', 'token'])\
    .setOutputCol('embeddings')

ner_model = NerDLModel.pretrained('wikiner_6B_300', 'de')\
    .setInputCols(['sentence', 'token', 'embeddings'])\
    .setOutputCol('ner')


# Stages are listed in dependency order: each one reads columns produced earlier
prediction_pipeline = Pipeline(stages=[
    document,
    sentence,
    token,
    lemma,
    pos,
    embeddings,
    ner_model
])

lemma download started this may take some time.
Approximate size to download 4 MB
[OK!]
pos_ud_hdt download started this may take some time.
Approximate size to download 5 MB
[OK!]
glove_6B_300 download started this may take some time.
Approximate size to download 426.2 MB
[OK!]
wikiner_6B_300 download started this may take some time.
Approximate size to download 14.1 MB
[OK!]
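Each stage in the pipeline above reads columns that an earlier stage must already have written, so the order of `stages` matters. The following plain-Python sketch checks that wiring without starting Spark. The column names are copied from the `setInputCol`/`setInputCols`/`setOutputCol` calls in the cell above; `check_wiring` is a hypothetical helper, not a Spark or Spark NLP function.

```python
# (stage name, input columns, output column), copied from the pipeline cell above.
stages = [
    ("document", ["text"], "document"),
    ("sentence", ["document"], "sentence"),
    ("token", ["sentence"], "token"),
    ("lemma", ["token"], "lemma"),
    ("pos", ["sentence", "token"], "pos"),
    ("embeddings", ["sentence", "token"], "embeddings"),
    ("ner_model", ["sentence", "token", "embeddings"], "ner"),
]


def check_wiring(stages, initial_columns):
    """Return (stage, missing_column) pairs for inputs that no earlier stage produced."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        available.add(output)
    return problems


# The DataFrame starts with a single "text" column.
problems = check_wiring(stages, ["text"])
print(problems)  # an empty list means every input is available when it is needed

# Moving the embeddings stage after the NER model breaks the wiring.
misordered = stages[:5] + [stages[6], stages[5]]
print(check_wiring(misordered, ["text"]))
```

The same check can be rerun after adding or reordering stages, before paying the cost of downloading models and fitting the pipeline.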


In [None]:
prediction = prediction_pipeline.fit(dfTest).transform(dfTest)

In [None]:

prediction.select("lemma.result").show(2, truncate=70)
prediction.select("pos.result").show(2, truncate=70)
prediction.select("ner.result").show(2, truncate=70)

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[Die, Anfang, der, EU, gehen, auf, der, 1950er-Jahre, zurück, ,, al...|
|[Angela[1], Dorothea, Merkel, (*, 17, ., Juli, 1954, in, Hamburg, a...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[DET, NOUN, DET, PROPN, VERB, ADP, DET, NOUN, ADP, PUNCT, ADP, ADV,...|
|[PROPN, PROPN, PROPN, X, NUM, PUNCT, NOUN, NUM, ADP, PROPN, ADP, PR...|
+----------------------------------------------------------------------+
