<img src="https://nlp.johnsnowlabs.com/assets/images/logo.png" width="180" height="50" style="float: left;">

## CRF Named Entity Recognition
In the following example, we walk through training a Conditional Random Fields (CRF) NER model and running prediction with it.

This annotator needs labeled training data: the user must either provide a labeled dataset during the `fit()` stage or use external CoNLL 2003 resources to train. It may optionally use an external word embeddings set and a list of additional entities.

The CRF annotator also requires part-of-speech tags, so we add those in the same Pipeline. We could also use our special RecursivePipeline, which tells Spark NLP's NER CRF approach to use the same pipeline for tagging external resources.
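
For orientation, the CoNLL 2003 English files hold one token per line with four space-separated columns (word, part-of-speech tag, chunk tag, NER tag), a blank line between sentences, and `-DOCSTART-` lines between documents. The sketch below parses a small inline sample in plain Python. The helper name `parse_conll` and the abbreviated sample are illustrative assumptions, and this is not how Spark NLP's own `CoNLL` reader is implemented.

```python
from typing import List, Tuple

# One token row: (word, part-of-speech tag, chunk tag, NER tag)
Token = Tuple[str, str, str, str]

# Abbreviated, illustrative sample in the CoNLL 2003 column layout
SAMPLE = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""


def parse_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-style text into sentences of (word, pos, chunk, ner) rows."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both end the current sentence
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
for sent in sentences:
    # Show only the word and its NER label
    print([(word, ner) for word, _pos, _chunk, ner in sent])
```

The part-of-speech column is what lets the CRF annotator receive POS tags together with the NER labels when the data comes from a CoNLL file.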



### Spark `2.4` and Spark NLP `2.0.0`

#### 1. Call necessary imports and set the resource path to read local data files

In [None]:
import os
import sys
sys.path.append('../../')

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

import time
import zipfile
# Location of the resource directory
resource_path = "../../../src/test/resources/"


#### 2. Download training dataset if not already there

In [None]:
# Download CoNLL 2003 Dataset
import os
from pathlib import Path
import urllib.request

if not Path("eng.train").is_file():
    print("File not found, downloading...")
    url = "https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train"
    urllib.request.urlretrieve(url, 'eng.train')
else:
    print("File already present.")


#### 3. Download GloVe word embeddings if not already there

In [None]:
# Download GloVe word embeddings
file = "glove.6B.zip"
if not Path(file).is_file():
    url = "http://nlp.stanford.edu/data/glove.6B.zip"
    print("Start downloading GloVe word embeddings. It will take some time, please wait...")
    urllib.request.urlretrieve(url, file)
    print("Downloading finished")
else:
    print("File already present.")

# Unzip only if the 100-dimensional vectors have not been extracted yet
if not Path("glove.6B.100d.txt").is_file():
    print("Unzipping the files now.")
    with zipfile.ZipFile(file, 'r') as zip_ref:
        zip_ref.extractall("./")
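
Steps 2 and 3 follow the same download-if-missing pattern, so it can be factored into small helpers. The following is a minimal sketch using only the standard library; the function names `fetch_if_missing` and `extract_member_if_missing` are hypothetical and not part of Spark NLP.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_if_missing(url: str, dest: str) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the file was already there.
    """
    target = Path(dest)
    if target.is_file():
        return False
    # Download under a temporary name first, so an interrupted transfer
    # does not leave a partial file that later looks complete.
    partial = target.with_name(target.name + ".part")
    urllib.request.urlretrieve(url, str(partial))
    partial.replace(target)
    return True


def extract_member_if_missing(archive: str, member: str, out_dir: str = ".") -> bool:
    """Extract a single member from a zip archive unless it is already on disk."""
    if (Path(out_dir) / member).is_file():
        return False
    with zipfile.ZipFile(archive, "r") as zf:
        zf.extract(member, out_dir)
    return True
```

With these helpers, the GloVe step could reduce to `fetch_if_missing("http://nlp.stanford.edu/data/glove.6B.zip", "glove.6B.zip")` followed by `extract_member_if_missing("glove.6B.zip", "glove.6B.100d.txt")`, which unpacks only the one file the pipeline reads.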

#### 4. Load SparkSession if not already there

In [None]:
spark = SparkSession.builder \
    .appName("CRF_NER")\
    .master("local[1]")\
    .config("spark.driver.memory","8G")\
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.0")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

#### 5. Create annotator components in the right order, with their training Params. Finisher will output only NER. Put all in pipeline.

In [None]:
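# Word embeddings lookup backed by the 100-dimensional GloVe vectors extracted above
glove = WordEmbeddingsLookup()\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("glove")\
  .setEmbeddingsSource("glove.6B.100d.txt", 100, 2)

# CRF NER trainer; it expects "pos" annotations and a "label" column in the input data
nerTagger = NerCrfApproach()\
  .setInputCols(["sentence", "token", "pos", "glove"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMinEpochs(1)\
  .setMaxEpochs(1)\
  .setLossEps(1e-3)\
  .setL2(1)\
  .setC0(1250000)\
  .setRandomSeed(0)\
  .setVerbose(0)

# Finisher outputs only the NER results, along with their metadata
finisher = Finisher() \
    .setInputCols(["ner"]) \
    .setIncludeMetadata(True)

# Stages run in the order listed
pipeline = Pipeline(
    stages = [
        glove,
        nerTagger,
        finisher
    ])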

#### 6. Load the CoNLL 2003 dataset. It is used for both training and prediction.

In [None]:
from sparknlp.dataset import CoNLL
conll = CoNLL()
data = conll.readDataset('eng.train')
data.show()

#### 7. Train the model. The NER CRF approach learns from the labeled CoNLL dataset.

In [None]:
start = time.time()
print("Start fitting")
model = pipeline.fit(data)
print("Fitting has ended")
print(time.time() - start)

#### 8. Run the prediction

In [None]:
ner_data = model.transform(data)
ner_data.show(5)
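
The NER output is one tag per token, in the same `I-`/`B-`/`O` scheme as the CoNLL labels. To read results as whole entities, adjacent tokens with the same type can be grouped. The sketch below works on plain Python lists and assumes you have already collected the tokens and predicted tags for one sentence from the DataFrame; `group_entities` is a hypothetical helper, not a Spark NLP API.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token IOB tags into (entity text, entity type) pairs.

    An "I-X" tag extends the current entity only if that entity has type X;
    a "B-X" tag, a different type, or "O" closes the current entity.
    """
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and kind == label
        # Close the open entity when the current token does not extend it
        if not continues and words:
            entities.append((" ".join(words), label))
            words, label = [], None
        # Start or extend an entity for any non-"O" tag
        if tag != "O" and kind:
            words.append(token)
            label = kind
    if words:
        entities.append((" ".join(words), label))
    return entities


# Illustrative input, not taken from an actual model run
tokens = ["Peter", "Blackburn", "visited", "the", "EU", "."]
tags = ["I-PER", "I-PER", "O", "O", "I-ORG", "O"]
print(group_entities(tokens, tags))
```

Treating `B-` as an explicit boundary keeps two adjacent entities of the same type separate, which is the case the `B-` prefix exists for in this tagging scheme.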

#### 9. Save the model and pipeline to disk after training

In [None]:
model.write().overwrite().save("./pip_wo_embedd/")

#### 10. Load the saved model and the pipeline

In [None]:
from pyspark.ml import PipelineModel, Pipeline

sameModel = PipelineModel.read().load("./pip_wo_embedd/")

sameModel.transform(data).show(5)