<a href="https://colab.research.google.com/github/nabinkhadka/spark-nlp-workshop/blob/master/NER_Training_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customize training pipeline in Annotation Lab
This notebook is a template that shows how to define a custom training pipeline for Annotation Lab.

Annotation Lab supports changes to the training pipeline only.

## IMPORTANT:
Do not modify any cells under a header marked **(X)**.

Make a copy of this notebook, update only the training pipeline section, download it to your computer, and upload it to Annotation Lab from the Project Setup page.



# Setup Environment (X)

In [None]:
import sparknlp_jsl
# Set up the environment for running Spark NLP.
# https://nlp.johnsnowlabs.com/docs/en/quickstart
# https://github.com/JohnSnowLabs/spark-nlp-workshop
spark = sparknlp_jsl.start("{secret.code}")

# Prepare dataset (X)
- Save completions to disk
- Convert the Annotation Lab completions to a CoNLL DataFrame

In [2]:
# Depending on the project type (NER/Classification/Assertion),
# completions (annotated data) are exported as a JSON file
def save_data_alab_completions():
    # Annotation Lab already implements this
    pass

saved_path = save_data_alab_completions()

In [3]:
# Create a CoNLL DataFrame to use as the training dataset
def create_conll_df(path):
    # Fill here
    pass

train_dataset = create_conll_df(saved_path)
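
Annotation Lab performs the conversion behind `create_conll_df` itself, so nothing here needs to be filled in. To make the target format easier to picture, the following is a minimal sketch of turning character-level entity spans into CoNLL-style lines with IOB tags. The helper name `spans_to_conll`, the `(start, end, label)` span shape, and the whitespace tokenization are all assumptions made for illustration; they are not the Annotation Lab export schema or its actual implementation.

```python
def spans_to_conll(text, spans):
    """Convert character-level entity spans to CoNLL-style lines.

    `spans` is a list of (start, end, label) tuples with `end` exclusive.
    This shape is hypothetical, chosen only for the illustration.
    """
    lines = ["-DOCSTART- -X- -X- O", ""]
    pos = 0          # search offset so repeated tokens map to the right position
    prev_hit = None  # index of the span that covered the previous token
    for token in text.split():  # naive whitespace tokenization (assumption)
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        # Find the first span that overlaps this token, if any
        hit = next(
            (i for i, (s, e, _) in enumerate(spans) if start < e and end > s),
            None,
        )
        if hit is None:
            tag = "O"
        else:
            # B- opens an entity, I- continues the same span
            prefix = "I" if hit == prev_hit else "B"
            tag = f"{prefix}-{spans[hit][2]}"
        prev_hit = hit
        # POS and chunk columns are filled with the -X- placeholder
        lines.append(f"{token} -X- -X- {tag}")
    lines.append("")  # blank line closes the sentence
    return "\n".join(lines)


example = spans_to_conll("Patient denies chest pain today", [(15, 25, "PROBLEM")])
print(example)
```

A file written in this layout is the kind of input that `CoNLL().readDataset(spark, path)` from `sparknlp.training` reads into a DataFrame with `sentence`, `token`, and `label` columns, which matches the input and label columns used by the training pipeline in this notebook.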

# Define Training Pipeline (Update Me)

In [5]:
# Imports needed for setting up the custom pipeline
from pyspark.ml import Pipeline  # explicit import; not every sparknlp.base version re-exports it
from sparknlp.base import *
from sparknlp_jsl.annotator import *
from sparknlp.annotator import WordEmbeddingsModel

In [None]:
embeddings = (
    WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(["sentence", "token"])
    .setOutputCol("embeddings")
)

In [None]:
ner_tagger = (
    NerDLApproach()
    .setInputCols(["sentence", "token", "embeddings"])
    .setOutputCol("ner")
    .setLabelColumn("label")
    .setBatchSize(64)
    .setDropout(0.5)
    .setPo(0.005)
    .setLr(0.001)
    .setMaxEpochs(20)
    .setValidationSplit(0.2)
    .setRandomSeed(0)
    .setVerbose(1)
    .setEvaluationLogExtended(True)
)
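
The `setLr` and `setPo` values in the cell above interact: `po` is the learning rate decay coefficient, and the Spark NLP documentation describes the effective rate as `lr / (1 + po * epoch)`. The sketch below only tabulates that formula for the values used here so the effect of changing `po` is easier to judge. The helper name `effective_learning_rates` is hypothetical, and counting epochs from 0 is an assumption.

```python
def effective_learning_rates(lr, po, max_epochs):
    """Per-epoch learning rate under the decay lr / (1 + po * epoch).

    Epochs are counted from 0 here, which is an assumption.
    """
    return [lr / (1 + po * epoch) for epoch in range(max_epochs)]


# Values taken from the NerDLApproach configuration above
rates = effective_learning_rates(lr=0.001, po=0.005, max_epochs=20)
print(f"first epoch: {rates[0]:.6f}, last epoch: {rates[-1]:.6f}")
```

With `po` this small the rate drops by less than 10% over 20 epochs; a larger `po` makes later epochs take noticeably smaller steps.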

In [None]:
training_pipeline = Pipeline(stages=[embeddings, ner_tagger])

# Start training (X)

In [None]:
# Fit the pipeline on the training dataset to get the trained model
trained_model = training_pipeline.fit(train_dataset)
# After this, Annotation Lab saves the models, training logs, etc.