
## Deep Learning NER

In the following example, we walk through training an LSTM NER model and using it for prediction. This annotator is implemented on top of TensorFlow.

The annotator takes a series of word embedding vectors, a CoNLL training dataset, and a validation dataset. We include our own predefined TensorFlow graphs, but all layers are trained during the `fit()` stage.

DL NER computes several Bi-LSTM layers to extract entities automatically, and it leverages batch-based distributed calls to native TensorFlow libraries during prediction.

### Spark `2.4` and Spark NLP `1.8.2`

#### 1. Call necessary imports and set the resource folder path.

In [None]:
import os
import sys
sys.path.append('../../')

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

import time
import zipfile
# Set the location of the resource directory
resource_path = "../../../src/test/resources/"

#### 2. Download CoNLL 2003 data if not present

In [None]:
# Download CoNLL 2003 Dataset
import os
from pathlib import Path
import urllib.request
url = "https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/"
file_train="eng.train"
file_testa= "eng.testa"
file_testb= "eng.testb"
# https://github.com/patverga/torch-ner-nlp-from-scratch/tree/master/data/conll2003
if not Path(file_train).is_file():   
    print("Downloading "+file_train)
    urllib.request.urlretrieve(url+file_train, file_train)
if not Path(file_testa).is_file():
    print("Downloading "+file_testa)
    urllib.request.urlretrieve(url+file_testa, file_testa)

if not Path(file_testb).is_file():
    print("Downloading "+file_testb)
    urllib.request.urlretrieve(url+file_testb, file_testb)
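
To make the file layout concrete, here is a minimal sketch of how a CoNLL 2003 style file can be read in plain Python: one token per line with its tags in whitespace-separated columns, the NER tag last, blank lines between sentences, and `-DOCSTART-` lines between documents. The helper name `read_conll_sentences` and the sample lines are assumptions made for illustration; this is not how Spark NLP parses the file internally.

```python
from typing import Iterable, Iterator, List, Tuple


def read_conll_sentences(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield each sentence as a list of (token, ner_tag) pairs.

    Hypothetical helper: assumes the token is the first column and the
    NER tag is the last column of every non-empty line.
    """
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        sentence.append((fields[0], fields[-1]))
    if sentence:
        yield sentence


# Illustrative sample in the four-column layout (token, POS, chunk, NER).
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = list(read_conll_sentences(sample))
print(len(sentences))
print(sentences[0])
```

To inspect the downloaded data instead of the sample, the same function could be called with `open(file_train, encoding="utf-8")`.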

#### 3. Download GloVe embeddings and unzip, if not present

In [None]:
# Download GloVe word embeddings
file = "glove.6B.zip"
if not Path("glove.6B.zip").is_file():
    url = "http://nlp.stanford.edu/data/glove.6B.zip"
    print("Start downloading GloVe word embeddings. It will take some time, please wait...")
    urllib.request.urlretrieve(url, "glove.6B.zip")
    print("Downloading finished")
else:
    print("GloVe data present.")
    
if not Path("glove.6B.100d.txt").is_file():
    zip_ref = zipfile.ZipFile(file, 'r')
    zip_ref.extractall("./")
    zip_ref.close()
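
As a quick sanity check on the extracted embeddings, the sketch below parses the GloVe text layout, where each line holds a word followed by its vector components separated by spaces. The helper name `parse_glove_lines` and the three-dimensional sample are assumptions for illustration only; the real `glove.6B.100d.txt` file should yield 100 components per word.

```python
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Map each word to its vector (hypothetical helper for inspection)."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Tiny made-up sample; the values are not real GloVe vectors.
sample = ["the 0.1 -0.2 0.3", "germany 0.5 0.25 -1.0"]
vectors = parse_glove_lines(sample)

# Every vector should have the same dimensionality.
dims = {len(vector) for vector in vectors.values()}
print(dims)
```

For the real file, the lines could be supplied with `open("glove.6B.100d.txt", encoding="utf-8")`; loading all of it into a dictionary takes noticeable memory, so reading only the first few lines is usually enough for a check.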

#### 4. Create the spark session

In [None]:
spark = SparkSession.builder \
    .appName("DL-NER")\
    .master("local[*]")\
    .config("spark.driver.memory","8G")\
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.8.2")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

#### 5. Load the CoNLL training dataset into a DataFrame

In [None]:
from sparknlp.training import CoNLL

conll = CoNLL(
    documentCol="document",
    sentenceCol="sentence",
    tokenCol="token",
    posCol="pos"
)

training_data = conll.readDataset(spark, './eng.train')
training_data.show()

#### 6. Create the annotator components with appropriate parameters, in the right order, and put everything in a Pipeline. The Finisher outputs the sentence, token, NER, and NER span columns.

In [None]:
glove = WordEmbeddings()\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("glove")\
  .setEmbeddingsSource("glove.6B.100d.txt", 100, 2)  # file extracted in step 3; 100 dimensions

nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "glove"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(1)\
  .setRandomSeed(0)\
  .setVerbose(0)

converter = NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_span")
    
finisher = Finisher() \
    .setInputCols(["sentence", "token", "ner", "ner_span"]) \
    .setIncludeMetadata(True)

ner_pipeline = Pipeline(
    stages = [
    glove,
    nerTagger,
    converter,
    finisher
  ])


#### 7. Train the pipeline. (This will take some time)

In [None]:
start = time.time()
print("Start fitting")
ner_model = ner_pipeline.fit(training_data)
print("Fitting has ended")
print(time.time() - start)  # elapsed time in seconds

#### 8. Let's predict with the model

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        ner_model
    ]
)

In [None]:
prediction_data = spark.createDataFrame([["Germany is a nice place"]]).toDF("text")
prediction_data.show()

In [None]:
prediction_model = prediction_pipeline.fit(prediction_data)
prediction_model.transform(prediction_data).show()

In [None]:
# We can be fast!

lp = LightPipeline(prediction_model)
result = lp.annotate("International Business Machines Corporation (IBM) is an American multinational information technology company headquartered in Armonk.")
list(zip(result['token'], result['ner']))
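
The token and tag pairs above are per-token labels. As a rough illustration of how such pairs can be merged into entity spans, which is conceptually the job of the `NerConverter` stage, here is a minimal plain-Python sketch. The function name `group_entities` and the sample tags are assumptions for illustration; this is not Spark NLP's implementation, and the tags are not actual model output.

```python
from typing import List, Optional, Tuple


def group_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens with the same entity label into spans.

    Hypothetical helper: accepts tags such as "I-ORG", "B-ORG", or "O".
    A "B-" prefix or a change of label starts a new span.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in pairs:
        if tag == "O":
            prefix, label = "O", None
        else:
            prefix, _, label = tag.partition("-")

        starts_new = label is not None and (prefix == "B" or label != current_label)
        if label is None or starts_new:
            # Close the span collected so far, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if label is not None:
            current_tokens.append(token)
            current_label = label

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Made-up tags for illustration only.
pairs = [
    ("IBM", "I-ORG"), ("is", "O"), ("headquartered", "O"),
    ("in", "O"), ("New", "B-LOC"), ("York", "I-LOC"), (".", "O"),
]
entities = group_entities(pairs)
print(entities)
```

With real predictions, `pairs` could be built from `zip(result['token'], result['ner'])`.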

#### 9. Save both the pipeline and the trained model to disk

In [None]:
prediction_pipeline.write().overwrite().save("./prediction_dl_pipeline")
prediction_model.write().overwrite().save("./prediction_dl_model")

#### 10. Load both again by deserializing them from disk

In [None]:
from pyspark.ml import PipelineModel, Pipeline

loaded_prediction_pipeline = Pipeline.read().load("./prediction_dl_pipeline")
loaded_prediction_model = PipelineModel.read().load("./prediction_dl_model")

In [None]:
loaded_prediction_model.transform(prediction_data).show()