<a href="https://colab.research.google.com/github/murat-gunay/NLP/blob/master/Turkish_NER_Training_SparkNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turkish NER Training with Spark NLP & NLU

## Initializing `pyspark`, `jdk`, `spark-nlp` and Environment

In [None]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP (pinned to match the 2.6.2 jar loaded in the Spark session below)
! pip install --ignore-installed spark-nlp==2.6.2

openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)
Collecting pyspark==2.4.4
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130389 sha256=8dc3b903b009d74585d0b6a7bb5d6f2b041304fc60ed4cb5fb4825dbc016072e
  Stored in directory: /root/.cache/pip/wheels/ab/09/4d/0d18423005

## Starting Spark Session

In [None]:
import sparknlp

In [None]:
from pyspark.sql import SparkSession

- The Spark session is built by hand so that the driver memory, the Kryo serializer buffer, and the Spark NLP package version can be set explicitly.

In [None]:
spark = SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "20G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.driver.maxResultSize", "20G") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.2").getOrCreate()

In [None]:
print("Version of SparkNLP:", sparknlp.version())
print("Version of Spark :", spark.version)

Version of SparkNLP: 2.6.2
Version of Spark : 2.4.4


## Downloading & Importing Necessary Libraries

In [None]:
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [None]:
import pandas as pd
from sparknlp.training import CoNLL
import pyspark.sql.functions as F

In [None]:
! pip install nlu  > /dev/null

In [None]:
import nlu

## Loading the Labeled Raw Dataset and Starting Preprocessing

In [None]:
with open("Turkish_NER_Tagged.txt", "r", encoding="utf-8") as file:
    text = file.read()
# I cannot share the labeled data.

In [None]:
text[0:500]

'Müzik Şenliği \'ne hazırlanın  <b_enamexTYPE="ORGANIZATION">POZİTİF ve Açık Radyo<e_enamex>  işbirliğiyle düzenlenecek olan  <b_enamexTYPE="LOCATION">İstanbul<e_enamex>  Müzik Şenliği 2 , müzikseverlere Aralık ayında merhaba demeye hazırlanıyor\nTek çatı altında dokuz ayrı salonda gerçekleştirilecek Şenlik kapsamında doksanın üzerinde etkinlik yer alacak\nHalk müziğinden caza , klasik batı müziğinden klasik Türk müziğine , rock \'tan etnik müziğe uzanan konserlerin yanı sıra , atölye çalışmaları , p'

- As the sample shows, the corpus is already tokenized, with tokens separated by spaces.
- The labels are written as HTML-style tags. To capture each tag together with its labeled tokens, we remove the space inside the opening tag, as follows:
   - `<b_enamex TYPE=` → `<b_enamexTYPE=`
- The CoNLL format needs sentence boundaries, so we split the document into sentences by reading the file line by line.
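The tag layout described above can also be illustrated without any tokenizer. The sketch below is not the notebook's method: `line_to_pairs` is a hypothetical helper that assumes plain whitespace tokenization and only the three tag types seen in the sample, and it accepts the opening tag with or without the space.

```python
import re

# Hypothetical helper, not part of the notebook: convert one tagged line
# into (token, label) pairs using plain whitespace tokenization.
TAG_RE = re.compile(r'<b_enamex\s*TYPE="(\w+)">(.*?)<e_enamex>')
SHORT = {"LOCATION": "LOC", "ORGANIZATION": "ORG", "PERSON": "PER"}

def line_to_pairs(line):
    pairs = []
    pos = 0
    for match in TAG_RE.finditer(line):
        # Everything before the entity is outside any tag.
        pairs += [(tok, "O") for tok in line[pos:match.start()].split()]
        ent = SHORT.get(match.group(1), match.group(1))
        # First token of the entity gets "B-", the rest get "I-".
        for i, tok in enumerate(match.group(2).split()):
            pairs.append((tok, ("B-" if i == 0 else "I-") + ent))
        pos = match.end()
    # Remaining text after the last entity.
    pairs += [(tok, "O") for tok in line[pos:].split()]
    return pairs

sample = 'olan  <b_enamex TYPE="LOCATION">İstanbul<e_enamex>  Müzik Şenliği'
print(line_to_pairs(sample))
```

Whitespace splitting is cruder than the NLU tokenizer used in the next cells, so token counts from this sketch will not necessarily match the DataFrame below.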

In [None]:
with open("Turkish_NER_Tagged.txt", "r", encoding="utf-8") as file:
    text_ln = file.readlines()

In [None]:
len(text_ln)  # We have 27,556 sentences (lines)

27556

In [None]:
text = [i.replace('<b_enamex TYPE=', '<b_enamexTYPE=') for i in text_ln]

In [None]:
text[0:10]

['Müzik Şenliği \'ne hazırlanın  <b_enamexTYPE="ORGANIZATION">POZİTİF ve Açık Radyo<e_enamex>  işbirliğiyle düzenlenecek olan  <b_enamexTYPE="LOCATION">İstanbul<e_enamex>  Müzik Şenliği 2 , müzikseverlere Aralık ayında merhaba demeye hazırlanıyor\n',
 'Tek çatı altında dokuz ayrı salonda gerçekleştirilecek Şenlik kapsamında doksanın üzerinde etkinlik yer alacak\n',
 "Halk müziğinden caza , klasik batı müziğinden klasik Türk müziğine , rock 'tan etnik müziğe uzanan konserlerin yanı sıra , atölye çalışmaları , panel ve söyleşiler , çocuk etkinlikleri , CD ve kitap satışı gibi etkinlikler Şenlik 'i destekleyecek\n",
 'Geçtiğimiz yıl ilki büyük heyecan yaratan , müzik ile ilgili her kesimden insanı tek bir çatı altında , keyifli bir ortamda buluşturmayı , müziği ve müzisyeni ön plana çıkarmayı , Türk müziğinin binbir tınısını dünyaya yayabilmek için gerekli ortamı yaratabilmeyi amaçlayan Şenlik yine  <b_enamexTYPE="LOCATION">Askeri Müze ve Kültür Sitesi<e_enamex> \'nde ağırlayacak konuklar

- We will use the NLU library to create the CoNLL file for training our model.

In [None]:
pipe = nlu.load("tokenize")

In [None]:
df = pipe.predict(text, output_level="token")
# We need token columns in our DataFrame, so we set "token" for output level

In [None]:
df  # Great! We have 513216 tokens in "token" column and 
    # "origin_index" represents the sentences.

origin_index,token
0,Müzik
0,Şenliği
0,'
0,ne
0,hazırlanın
...,...
27555,YANLIŞ
27555,giden
27555,bir
27555,şeyler


In [None]:
df["sent_id"] = df.index  # We create a new column named "sent_id" to use later.

In [None]:
df.reset_index(drop = True, inplace = True)

In [None]:
df

Unnamed: 0,token,sent_id
0,Müzik,0
1,Şenliği,0
2,',0
3,ne,0
4,hazırlanın,0
...,...,...
513211,YANLIŞ,27555
513212,giden,27555
513213,bir,27555
513214,şeyler,27555


In [None]:
df["ner"] = "O"  # We create a new column to collect the ner labels

- Let's extract the NER labels from the tags and collect them in the "ner" column.

In [None]:
# Opening tag -> entity type
tag_types = {'<b_enamexTYPE="LOCATION">': "LOC",
             '<b_enamexTYPE="ORGANIZATION">': "ORG",
             '<b_enamexTYPE="PERSON">': "PER"}

# First pass: assign temporary labels.
#   B-<TYPE>  : the token opens and closes an entity (single-token entity)
#   BB-<TYPE> : the token opens a multi-token entity
#   I-XXX     : the token closes a multi-token entity (type resolved in the next pass)
for idx, tok in enumerate(df.token.values):
    closes = '<e_enamex>' in tok
    for tag, ent in tag_types.items():
        if tag in tok:
            # df.at writes in place; chained df.ner.iloc[...] = ... may write to a copy
            df.at[idx, "ner"] = ("B-" if closes else "BB-") + ent
    if ('<b_enamexTYPE=' not in tok) and closes:
        df.at[idx, "ner"] = "I-XXX"

In [None]:
df

Unnamed: 0,token,sent_id,ner
0,Müzik,0,O
1,Şenliği,0,O
2,',0,O
3,ne,0,O
4,hazırlanın,0,O
...,...,...,...
513211,YANLIŞ,27555,O
513212,giden,27555,O
513213,bir,27555,O
513214,şeyler,27555,O


In [None]:
# Second pass: every token after a "BB-<TYPE>" marker, up to and including
# the closing "I-XXX" token, becomes "I-<TYPE>".
labels = df["ner"].tolist()
current = None  # entity type of the open multi-token span, if any

for i, label in enumerate(labels):
    if current is None:
        if label.startswith("BB-"):
            current = label[3:]      # "BB-PER" -> "PER"
    else:
        labels[i] = "I-" + current
        if label == "I-XXX":         # closing token reached
            current = None

# Assign the whole column at once instead of chained df.ner.iloc[...] writes
df["ner"] = labels
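To make the temporary labels easier to follow, here is a small trace of both passes on a handful of tokens from the first sentence. It is a compact re-implementation for illustration only: `first_pass` and `second_pass` are hypothetical names, and the token list assumes (as the cells above imply) that the tokenizer keeps each tag attached to its neighbouring word.

```python
TYPES = {"LOCATION": "LOC", "ORGANIZATION": "ORG", "PERSON": "PER"}

def first_pass(tokens):
    """Assign O / B-TYPE / BB-TYPE / I-XXX from the tags inside each token."""
    labels = []
    for tok in tokens:
        opens = '<b_enamexTYPE="' in tok
        closes = '<e_enamex>' in tok
        if opens:
            ent = TYPES[tok.split('"')[1]]   # text between the first pair of quotes
            labels.append(("B-" if closes else "BB-") + ent)
        elif closes:
            labels.append("I-XXX")
        else:
            labels.append("O")
    return labels

def second_pass(labels):
    """Fill I-TYPE between BB-TYPE and I-XXX, then turn BB- into B-."""
    out = list(labels)
    current = None
    for i, label in enumerate(out):
        if current is None:
            if label.startswith("BB-"):
                current = label[3:]
        else:
            out[i] = "I-" + current
            if label == "I-XXX":
                current = None
    return [label.replace("BB-", "B-") for label in out]

tokens = ["hazırlanın", '<b_enamexTYPE="ORGANIZATION">POZİTİF', "ve", "Açık",
          "Radyo<e_enamex>", "işbirliğiyle",
          '<b_enamexTYPE="LOCATION">İstanbul<e_enamex>']
temp = first_pass(tokens)
print(temp)               # ['O', 'BB-ORG', 'O', 'O', 'I-XXX', 'O', 'B-LOC']
print(second_pass(temp))  # ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-LOC']
```

The final labels for "POZİTİF ve Açık Radyo" agree with the CoNLL preview shown later in the notebook.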

In [None]:
df

Unnamed: 0,token,sent_id,ner
0,Müzik,0,O
1,Şenliği,0,O
2,',0,O
3,ne,0,O
4,hazırlanın,0,O
...,...,...,...
513211,YANLIŞ,27555,O
513212,giden,27555,O
513213,bir,27555,O
513214,şeyler,27555,O


- Now we strip the label tags from the "token" column and turn the temporary "BB-" prefix in the "ner" column into "B-".

In [None]:
rep_dic = {'<b_enamexTYPE="LOCATION">':"", '<b_enamexTYPE="ORGANIZATION">':"", '<b_enamexTYPE="PERSON">':"", '<e_enamex>':""}

In [None]:
df.replace({"token" : rep_dic, "ner" : {"BB-" : "B-"}}, regex=True, inplace=True)

In [None]:
df["pos1"] = "NN"
df["pos2"] = "NN"

In [None]:
df = df[["token", "pos1", "pos2", "ner", "sent_id"]]

In [None]:
df.head()

Unnamed: 0,token,pos1,pos2,ner,sent_id
0,Müzik,NN,NN,O,0
1,Şenliği,NN,NN,O,0
2,',NN,NN,O,0
3,ne,NN,NN,O,0
4,hazırlanın,NN,NN,O,0


- Now we insert an empty line before each sentence to match the CoNLL file format.

In [None]:
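import numpy as np  # np.nan is used below; numpy was not imported earlier

# One all-NaN row; written to CSV it becomes the blank separator line
df_blank = pd.DataFrame({"token":[np.nan], "pos1":[np.nan], "pos2":[np.nan], \
                         "ner" : [np.nan], "sent_id":[np.nan]})

# Collect a blank row plus the rows of each sentence, then concatenate once.
# (Calling pd.concat inside the loop re-copies the growing frame every time.)
parts = []
for _, df_part in df.groupby("sent_id", sort=True):
    parts.extend([df_blank, df_part])

df_dummy = pd.concat(parts, ignore_index=True)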

In [None]:
df_conll = df_dummy[["token","pos1","pos2","ner"]]

In [None]:
df_conll.columns = ["-DOCSTART-", "-X-", "-X-", "O"]

In [None]:
df_conll.head(55)

Unnamed: 0,-DOCSTART-,-X-,-X-.1,O
0,,,,
1,Müzik,NN,NN,O
2,Şenliği,NN,NN,O
3,',NN,NN,O
4,ne,NN,NN,O
5,hazırlanın,NN,NN,O
6,POZİTİF,NN,NN,B-ORG
7,ve,NN,NN,I-ORG
8,Açık,NN,NN,I-ORG
9,Radyo,NN,NN,I-ORG


In [None]:
df_conll.to_csv("Turkish_NER.conll", index=False, sep=" ")

# The CoNLL file is ready to use.
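Before training, it can help to sanity-check the written file. The sketch below is an optional addition, not part of the original workflow: `conll_sentences` is a hypothetical helper that groups lines into sentences, treats whitespace-only lines as separators, and assumes four space-separated fields per token line, matching the columns written above.

```python
def conll_sentences(lines):
    """Group CoNLL lines into sentences of (token, label) pairs."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:                      # blank or whitespace-only separator
            if current:
                sentences.append(current)
                current = []
        elif fields[0] == "-DOCSTART-":     # header line
            continue
        else:
            if len(fields) != 4:
                raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
            current.append((fields[0], fields[3]))
    if current:
        sentences.append(current)
    return sentences

# Small in-memory sample in the same layout as the generated file
sample_lines = ["-DOCSTART- -X- -X- O\n", "   \n",
                "Müzik NN NN O\n", "Şenliği NN NN O\n", "   \n",
                "İstanbul NN NN B-LOC\n"]
print(len(conll_sentences(sample_lines)))

# Usage on the real file (assumes Turkish_NER.conll exists in the working directory):
# with open("Turkish_NER.conll", encoding="utf-8") as f:
#     print(len(conll_sentences(f)))
```

If the sentence count differs from the number of non-empty input lines, the blank-line separators are worth inspecting before handing the file to `CoNLL().readDataset`.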

## Training Turkish Named Entity Recognition (NER) Model with GloVe

- The CoNLL file is ready. Let's train our model using the multilingual GloVe embeddings (`glove_840B_300`).

In [None]:
data = CoNLL().readDataset(spark, "Turkish_NER.conll")

In [None]:
data.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Müzik Şenliği ' n...|[[document, 0, 16...|[[document, 0, 16...|[[token, 0, 4, Mü...|[[pos, 0, 4, NN, ...|[[named_entity, 0...|
|Tek çatı altında ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 2, Te...|[[pos, 0, 2, NN, ...|[[named_entity, 0...|
|Halk müziğinden c...|[[document, 0, 24...|[[document, 0, 24...|[[token, 0, 3, Ha...|[[pos, 0, 3, NN, ...|[[named_entity, 0...|
|Geçtiğimiz yıl il...|[[document, 0, 34...|[[document, 0, 34...|[[token, 0, 9, Ge...|[[pos, 0, 9, NN, ...|[[named_entity, 0...|
|Bu yıl farklı ola...|[[document, 0, 32...|[[document, 0, 32...|[[token, 0, 1, Bu...|[[pos, 0, 1, NN, ..

In [None]:
train_data, test_data = data.randomSplit([0.8, 0.2], seed = 100)

In [None]:
train_data.select(F.explode(F.arrays_zip('token.result','label.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth")).groupBy('ground_truth').count().orderBy('count', ascending=False).show(100,truncate=False)

+------------+------+
|ground_truth|count |
+------------+------+
|O           |367645|
|B-PER       |13017 |
|B-LOC       |8732  |
|B-ORG       |8076  |
|I-PER       |6190  |
|I-ORG       |5913  |
|I-LOC       |1298  |
+------------+------+



In [None]:
glove_embeddings = WordEmbeddingsModel().pretrained('glove_840B_300','xx')\
  .setInputCols(["document",'token'])\
  .setOutputCol("embeddings")\
  .setCaseSensitive(True)

glove_840B_300 download started this may take some time.
Approximate size to download 2.3 GB
[OK!]


In [None]:
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(2)\
  .setLr(0.002)\
  .setPo(0.005)\
  .setBatchSize(16)\
  .setRandomSeed(0)\
  .setVerbose(1)\
  .setValidationSplit(0.2)\
  .setEvaluationLogExtended(True) \
  .setEnableOutputLogs(True)\
  .setIncludeConfidence(True)\
  .setOutputLogsPath('ner_logs')
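The `setLr` and `setPo` values interact: as I understand the Spark NLP parameter description, `po` is a learning-rate decay coefficient and the effective rate is `lr / (1 + po * epoch)`. The sketch below only illustrates that formula with the values from the cell above; `effective_lr` is a hypothetical helper, and whether the library counts epochs from 0 or 1 is an assumption here.

```python
def effective_lr(lr, po, epoch):
    """Decayed learning rate, assuming lr / (1 + po * epoch)."""
    return lr / (1 + po * epoch)

# Values from the NerDLApproach configuration above; epochs counted from 0 (assumption)
for epoch in range(2):
    print(epoch, effective_lr(0.002, 0.005, epoch))
```

With only two epochs and such a small `po`, the decay is negligible; it would matter more for longer training runs.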

- Creating a NER training pipeline

In [None]:
ner_pipeline_glove = Pipeline(stages=[
          glove_embeddings,
          nerTagger
 ])

In [None]:
ner_model_glove = ner_pipeline_glove.fit(train_data)

In [None]:
ner_model_glove.stages[1].write().overwrite().save("Tr_NER_Glove_20201019")

In [None]:
test_data = glove_embeddings.transform(test_data)

predictions = ner_model_glove.transform(test_data)

from sklearn.metrics import classification_report

preds_df = predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).toPandas()

print (classification_report(preds_df['ground_truth'], preds_df['prediction'], digits=4))

              precision    recall  f1-score   support

       B-LOC     0.9103    0.8885    0.8993      3257
       B-ORG     0.8668    0.9048    0.8854      2993
       B-PER     0.9570    0.8717    0.9123      4722
       I-LOC     0.7296    0.7732    0.7508       485
       I-ORG     0.8068    0.8979    0.8499      2097
       I-PER     0.9502    0.9119    0.9306      2258
           O     0.9919    0.9933    0.9926    136668

    accuracy                         0.9824    152480
   macro avg     0.8875    0.8916    0.8887    152480
weighted avg     0.9827    0.9824    0.9824    152480
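The gap between the macro average (about 0.89) and the weighted average (about 0.98) comes from the dominant "O" class. The short calculation below recomputes both from the per-class F1 scores and supports printed in the report; the last figure, a macro average over the six entity labels only, is an extra number added here for illustration and is not produced by `classification_report`.

```python
# (f1-score, support) per label, copied from the report above
report = {
    "B-LOC": (0.8993, 3257),
    "B-ORG": (0.8854, 2993),
    "B-PER": (0.9123, 4722),
    "I-LOC": (0.7508, 485),
    "I-ORG": (0.8499, 2097),
    "I-PER": (0.9306, 2258),
    "O":     (0.9926, 136668),
}

total = sum(n for _, n in report.values())

# Macro average: every label counts equally
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: labels count in proportion to their support
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total

# Macro average over entity labels only (excluding "O")
entity_f1 = [f1 for label, (f1, _) in report.items() if label != "O"]
entity_macro_f1 = sum(entity_f1) / len(entity_f1)

print(f"O share of tokens : {report['O'][1] / total:.4f}")
print(f"macro F1          : {macro_f1:.4f}")
print(f"weighted F1       : {weighted_f1:.4f}")
print(f"entity-only macro : {entity_macro_f1:.4f}")
```

Since roughly nine out of ten test tokens are "O", the weighted average and the accuracy mostly reflect that class; the per-entity rows are the more informative ones.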



## Let's Make Some Predictions

- First, we create the prediction pipeline.

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

glove_embeddings = WordEmbeddingsModel().pretrained('glove_840B_300','xx')\
  .setInputCols(["sentence",'token'])\
  .setOutputCol("embeddings")\
  .setCaseSensitive(True)
    
loaded_ner_model = NerDLModel.load("Tr_NER_Glove_20201019")\
 .setInputCols(["sentence", "token", "embeddings"])\
 .setOutputCol("ner")

converter = NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        glove_embeddings,
        loaded_ner_model,
        converter])

glove_840B_300 download started this may take some time.
Approximate size to download 2.3 GB
[OK!]


In [None]:
empty_data = spark.createDataFrame([['']]).toDF("text")

prediction_model = ner_prediction_pipeline.fit(empty_data)

In [None]:
prediction_model.transform(empty_data)

- We use `LightPipeline` to display the result of prediction easily. 

In [None]:
from sparknlp.base import LightPipeline

In [None]:
light_model = LightPipeline(prediction_model)

- We will use `ner_highlighter` for better visualization.

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/utils/ner_highlighter.py

In [None]:
import ner_highlighter

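## Samples of the results

- Each of the four cells below assigns a different sample to `text`. Run one of them, then run the annotation and highlighting cells to see its entities.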

In [None]:
text = "İlk defa Güneydoğu Anadolu bölgesinde düzenlenecek olan Uluslararası Nöroloji Kongresi başkanlığını, bu yıl Esra Günay üstlenecek."

In [None]:
text = "Güzellikler Partisi İl Başkanlığı toplantısı bu yıl Ankara Çankaya merkezde icra edildi. Toplantı başkanlığını Ali Veli yürüttü."

In [None]:
text = "Türkiye Futbol Federasyonu başkanı Nihat Özdemir, Kocaeli Gölcük'e doğru yola çıktı."

In [None]:
text = "Oğlum Kerem çok iyi bir çocuktur. İstanbul Kadıköy sahilinde yaşamaktadır. Kerem, Power FM Radyosu'nda çalışmaktadır."

In [None]:
result = light_model.fullAnnotate(text)[0]

In [None]:
ner_highlighter.chunk_highlighter(result, entity_column="ner_span")
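Besides the highlighted view, the recognized chunks can be read as plain tuples. The sketch below assumes each annotation in `result["ner_span"]` exposes `.result` (the chunk text) and `.metadata["entity"]` (the label), as Spark NLP annotations from `NerConverter` do; `chunks_with_labels` is a hypothetical helper, and the stand-in objects exist only so the snippet runs without a Spark session.

```python
from types import SimpleNamespace

def chunks_with_labels(annotations):
    """Return (chunk text, entity label) pairs from NER chunk annotations."""
    return [(ann.result, ann.metadata["entity"]) for ann in annotations]

# Stand-in annotations for demonstration only; labels here are made up,
# not model output.
fake_spans = [
    SimpleNamespace(result="Kerem", metadata={"entity": "PER"}),
    SimpleNamespace(result="İstanbul Kadıköy", metadata={"entity": "LOC"}),
]
print(chunks_with_labels(fake_spans))

# With the real pipeline output:
# chunks_with_labels(result["ner_span"])
```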

---

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=16ljQioOeX0IfcwGfMiJmP5LwXT22faIw" class="img-fluid" alt="sql"></p>