## Train spaCy to Autodetect New Entities (NER)

https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/

When a pre-trained model does not recognise the keywords and tags you need, it can be updated with new training examples. The steps below walk through that process.


In [1]:
import pandas as pd
df = pd.read_csv("../input/2020-07-16_Tier-2-5-sponsor-guidance_Jul-2020_v1.0_section.csv", index_col=0); df

Unnamed: 0,section_title,raw_text
0,Contents,\nTier 2 and 5: Guidance for Sponsors - Adden...
1,About this guidance,This guidance is for organisations who want to...
2,Contacts,If you think that the guidance has factual err...
3,Clearance and publication,Below is information on when this version of t...
4,Glossary,Phrase \nWe or Us \nYou or Your \nMeaning \nTh...
...,...,...
245,How to apply for a licence,"To apply, you must complete the online applica..."
246,If your application is approved,If your application for a sponsor licence is a...
247,Assigning a CoS for RQF level 3 to 5 jobs,"If your application is approved, you will not ..."
248,If your application for a licence is refused,If we are not satisfied that you can offer gen...


## Simple cleaning

In [2]:
import re

def first_level_cleaning(doc):
    """
    Cleans the document of unnecessary characters.
    """
    doc = str(doc)                               # in case a numeric value is present
    doc = re.sub(r"[\r\t\n]+", "", doc)          # remove \r, \t and \n
    doc = re.sub(r"[^\x00-\x7F]+", " ", doc)     # replace non-ASCII chars with a space
    doc = re.sub(r" +", " ", doc)                # collapse duplicate spaces
    doc = doc.strip()                            # strip leading/trailing spaces
    return doc

def first_level_pdf_cleaning(doc):
    """
    Cleans a PDF-extracted document of unnecessary characters.
    """
    doc = str(doc)                               # in case a numeric value is present
    doc = re.sub(r"[\r\t\n]+", "", doc)          # remove \r, \t and \n
    doc = re.sub(r"[^\x00-\x7F]+", " ", doc)     # replace non-ASCII chars with a space
    doc = re.sub(r" +", " ", doc)                # collapse duplicate spaces
    #doc = re.sub(r"(Page)\s\d+\s\w+\s\d+\s(Tiers 2 and 5: guidance for sponsors - version 07\/20)", "", doc) # @Eyzo
    doc = doc.strip()                            # strip leading/trailing spaces
    return doc
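
One side effect of the cleaning step is worth noting: replacing runs of `\r`, `\t` and `\n` with an empty string glues together words that were separated only by a line break. Below is a minimal sketch of a variant that substitutes a space instead. The name `clean_keep_word_breaks` is hypothetical and not part of the original notebook.

```python
import re

def clean_keep_word_breaks(doc):
    """Variant of the cleaning step that keeps words separated across line breaks."""
    doc = str(doc)
    doc = re.sub(r"[\r\t\n]+", " ", doc)       # control whitespace -> single space
    doc = re.sub(r"[^\x00-\x7F]+", " ", doc)   # non-ASCII runs -> space
    doc = re.sub(r" +", " ", doc)              # collapse repeated spaces
    return doc.strip()

sample = "Phrase \nWe or Us\nYou or Your"
print(clean_keep_word_breaks(sample))
```

With the empty-string replacement, the same sample would come out with `Us` and `You` fused into `UsYou`, which can later change what the tokenizer and the NER component see.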

In [3]:
"""
apply first level cleaning to the PDF raw_texts
"""
df["raw_text"] = df["raw_text"].apply(first_level_pdf_cleaning)

In [4]:
df.raw_text[1]

"This guidance is for organisations who want to apply for a sponsor licence to sponsor migrants under Tier 2 and/or Tier 5 of the points-based system. It tells you what we expect if you are a licence holder, the processes you must follow when sponsoring a migrant and how to meet all of the duties and responsibilities associated with being a licensed sponsor. The guidance is subject to change and you should check the dates to make sure you have the latest version. A new points-based immigration system will come into effect from 1 January 2021. The future system will apply to both European Economic Area (EEA) nationals and non-EEA nationals. You should refer to Annex 9 of this guidance if you intend to apply for a licence to sponsor workers under the new system. Separate guidance exists on GOV.UK for UK education providers who wish to apply for and hold a licence to sponsor international students to come to the UK under Tier 4 to study. You can find the appendices mentioned in this guida


### Updating the Named Entity Recognizer

In [5]:
# Load a spaCy model and check that it has an NER component
import spacy
nlp = spacy.load('en_core_web_sm')

print(nlp.pipe_names)

['tagger', 'parser', 'ner']


In [6]:
doc = nlp(df.raw_text[1])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities [('2', 'CARDINAL'), ('5', 'CARDINAL'), ('1 January 2021', 'DATE'), ('European', 'NORP'), ('Economic Area', 'LOC'), ('EEA', 'ORG'), ('Annex 9', 'ORG'), ('UK', 'GPE'), ('UK', 'GPE')]


In [7]:
# Getting the pipeline component
ner=nlp.get_pipe("ner")

In [8]:
"""
See the list of built-in named entity labels -> https://spacy.io/api/annotation

TODO: try shorter sentences (?) and provide more samples for the model training
"""
TRAIN_DATA = [
("Tier 1 is the points-based system.", {"entities": [(0,7, "LAW")]}),
("Tier 5 is the points-based system.", {"entities": [(0,7, "LAW")]}),
("Tier 2, Tier 4 or Tier 5 are the points-based system.", {"entities": [(0,7, "LAW")]}),
("Tier 2, Tier 4 or Tier 5 are the points-based system.", {"entities": [(8,15, "LAW")]}),
("Tier 2, Tier 4 or Tier 5 are the points-based system.", {"entities": [(18,26, "LAW")]}),
("Tier 2 and/or 5 are the points-based system.", {"entities": [(0,16, "LAW")]}),
("This offer will form part of the Tier 2 (General) regime for skilled workers", {"entities": [(33,41, "LAW")]}),
("The Health and Care Visa is available for Tier 2 (General).", {"entities": [(42,46, "LAW")]}),
("Further information on your employers responsibilities as a sponsor can be found in the Tier 2 and 5 Sponsor Guidance.", {"entities": [(88,102, "LAW")]}),
("Further information can be found in the Tier 2 and 5 Sponsor Guidance.", {"entities": [(40,54, "LAW")]}),
("As a sponsor, further information can be found in the Tier 2 and 5 Sponsor Guidance.", {"entities": [(54,68, "LAW")]}),
("This guidance is to sponsor migrants under Tier 2 and/or Tier 5 of the points-based system.", {"entities": [(43,46, "LAW")]}),
("This guidance is to sponsor migrants under Tier 2 and/or Tier 5 of the points-based system.", {"entities": [(57,61, "LAW")]}),
("This guidance is for organisations to sponsor migrants under Tier 2 and/or Tier 5.", {"entities": [(61,65, "LAW")]}),
("This guidance is for organisations to sponsor migrants under Tier 2 and/or Tier 5.", {"entities": [(75,79, "LAW")]}),
("For organisations who want to apply for a sponsor licence should read Tier 2 and/or Tier 5.", {"entities": [(70,74, "LAW")]}),
("For organisations who want to apply for a sponsor licence should read Tier 2 and/or Tier 5.", {"entities": [(84,88, "LAW")]}),
("Separate guidance exists for UK education providers to sponsor international students to come to the UK under Tier 4.", {"entities": [(110,117, "LAW")]}),
("For UK education providers who hold a licence to sponsor international students to come to the UK under Tier 4 to study.", {"entities": [(104,108, "LAW")]}),
("For UK education providers should read Tier 4 to sponsor international students.", {"entities": [(39,43, "LAW")]}),
("This guidance is for organisations who want to apply for a sponsor licence to sponsor migrants under Tier 2 and/or Tier 5.", {"entities": [(102, 108, "LAW")]}),
("This guidance is for organisations who want to apply for a sponsor licence to sponsor migrants under Tier 2 and/or Tier 5.", {"entities": [(116, 127, "LAW")]}),
("Separate guidance exists on GOV.UK for UK education providers who wish to apply for and hold a licence to sponsor international students to come to the UK under Tier 4 to study.", {"entities": [(162, 168, "LAW")]}),
("Tiers 2 and 5 of the points-based system are the primary immigration routes for non-European Economic Area (EEA) migrants who wish to work in the UK.", {"entities": [(0, 13, "LAW")]}),
("These migrants must be sponsored by an organisation or company that holds a Tier 2 and/or Tier 5 licence.", {"entities": [(77, 83, "LAW")]}),
("These migrants must be sponsored by an organisation or company that holds a Tier 2 and/or Tier 5 licence.", {"entities": [(91, 97, "LAW")]}),
("With this trust comes a responsibility to act in accordance with the Immigration Rules and all parts of the Tiers 2 and 5: guidance for sponsors.", {"entities": [(109, 122, "LAW")]}),
]
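
Character offsets are easy to get wrong by hand. For example, `"Tier 1"` is six characters long, so `(0, 7)` also covers the trailing space, and `(43, 46)` in the "sponsor migrants under Tier 2" sentence covers only `"Tie"`. The sketch below is a rough character-level check, not spaCy's own token alignment. The helper names `check_spans` and `find_span` are hypothetical.

```python
def check_spans(train_data):
    """Return (text, start, end, covered) for spans with stray whitespace or partial words."""
    problems = []
    for text, annotations in train_data:
        for start, end, _label in annotations["entities"]:
            covered = text[start:end]
            has_stray_space = covered != covered.strip()
            cuts_word_left = start > 0 and text[start - 1].isalnum() and covered[:1].isalnum()
            cuts_word_right = end < len(text) and text[end].isalnum() and covered[-1:].isalnum()
            if not covered or has_stray_space or cuts_word_left or cuts_word_right:
                problems.append((text, start, end, covered))
    return problems

def find_span(text, phrase, label):
    """Compute (start, end, label) for the first occurrence of phrase; raises if absent."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

sample = [
    ("Tier 1 is the points-based system.", {"entities": [(0, 7, "LAW")]}),
    ("This guidance is to sponsor migrants under Tier 2 and/or Tier 5 of the points-based system.",
     {"entities": [(43, 46, "LAW")]}),
]
for text, start, end, covered in check_spans(sample):
    print(repr(covered), (start, end))
print(find_span(sample[0][0], "Tier 1", "LAW"))
```

Deriving offsets with `find_span` rather than typing them avoids off-by-one spans. A span that does not line up with token boundaries is one likely source of the `GoldParse` warnings that appear during training.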

In [9]:
# Adding labels to the `ner`
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

In [10]:
# Disable the pipeline components you don't need to change
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [11]:
# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path

In [12]:
# TRAINING THE MODEL
with nlp.disable_pipes(*unaffected_pipes):
    # Training for 30 iterations
    for iteration in range(30):
        # shuffling examples before every iteration
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts,  # batch of texts
                       annotations,  # batch of annotations
                       drop=0.5,  # dropout - make it harder to memorise data
                       losses=losses)
            print("Losses", losses)

Losses {'ner': 21.887302372371778}


  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)


Losses {'ner': 52.857186913257465}
Losses {'ner': 71.00240720476722}
Losses {'ner': 84.06975274014258}
Losses {'ner': 96.83729724742443}


  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)


Losses {'ner': 113.71747561506402}
Losses {'ner': 124.24912075980089}
Losses {'ner': 13.308412173733814}
Losses {'ner': 29.85288134662551}
Losses {'ner': 44.142425013153115}
Losses {'ner': 58.965734785277164}
Losses {'ner': 86.94225114639266}
Losses {'ner': 99.57422224347647}
Losses {'ner': 109.38287449853851}
Losses {'ner': 8.748390611920001}
Losses {'ner': 22.890550127649874}
Losses {'ner': 36.1327738280192}
Losses {'ner': 41.3862133497218}
Losses {'ner': 50.24958363003856}
Losses {'ner': 57.17838934832611}
Losses {'ner': 79.08588255613603}
Losses {'ner': 16.45012913730949}
Losses {'ner': 27.849731308372213}
Losses {'ner': 44.89363362332057}
Losses {'ner': 50.49208768214294}
Losses {'ner': 57.618470570563204}
Losses {'ner': 64.60587316765623}
Losses {'ner': 71.99818597204325}
Losses {'ner': 6.540698558126223}
Losses {'ner': 14.237029584957327}
Losses {'ner': 31.35576104828533}
Losses {'ner': 34.30297849680528}
Losses {'ner': 52.74918207762266}
Losses {'ner': 62.227266968987806}
...
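
The log above climbs for seven lines and then drops back. That pattern is consistent with `losses` being reset once per iteration and accumulated over seven batches. Below is a plain-Python sketch of how the batch sizes evolve, assuming `compounding` yields a geometric series capped at `stop` and `minibatch` truncates each size to an integer. `compounding_sketch` and `minibatch_sketch` are illustrative stand-ins, not spaCy's implementations.

```python
import itertools

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound

def minibatch_sketch(items, sizes):
    """Split items into consecutive batches whose lengths come from the sizes generator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:
            break
        yield batch

examples = list(range(27))  # 27 placeholder items, one per entry in TRAIN_DATA
batches = list(minibatch_sketch(examples, compounding_sketch(4.0, 32.0, 1.001)))
print([len(b) for b in batches])
```

With a compound rate of 1.001 the size barely moves within one pass, so 27 examples split into six batches of four and one of three. Printing `losses` once per iteration, after the inner loop, would give a single comparable number per pass.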

In [13]:
doc = nlp(df.raw_text[1])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities [('Tier', 'LAW')]
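
One plausible contributor to this weak result (an assumption, not verified against this run) is that the same sentence appears several times in `TRAIN_DATA`, each time with only one of its spans annotated, so the other "Tier" mentions in that copy look like non-entities. A minimal sketch of merging such entries into one entry per sentence follows. `merge_annotations` is a hypothetical helper, and the sample offsets were recomputed for illustration.

```python
from collections import defaultdict

def merge_annotations(train_data):
    """Combine entries that share the same text into one entry listing all their spans."""
    merged = defaultdict(set)
    order = []  # remember first-seen order of the texts
    for text, annotations in train_data:
        if text not in merged:
            order.append(text)
        merged[text].update(tuple(span) for span in annotations["entities"])
    return [(text, {"entities": sorted(merged[text])}) for text in order]

text = "Tier 2, Tier 4 or Tier 5 are the points-based system."
sample = [
    (text, {"entities": [(0, 6, "LAW")]}),
    (text, {"entities": [(8, 14, "LAW")]}),
    (text, {"entities": [(18, 24, "LAW")]}),
]
print(merge_annotations(sample))
```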


### Training a completely new entity type in spaCy

In [14]:
# Import and load the spacy model
import spacy
nlp=spacy.load("en_core_web_sm") 

# Getting the ner component
ner=nlp.get_pipe('ner')

In [15]:
doc = nlp(df.raw_text[1])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities [('2', 'CARDINAL'), ('5', 'CARDINAL'), ('1 January 2021', 'DATE'), ('European', 'NORP'), ('Economic Area', 'LOC'), ('EEA', 'ORG'), ('Annex 9', 'ORG'), ('UK', 'GPE'), ('UK', 'GPE')]


In [16]:
# New label to add
LABEL = "STATUS"
"""
TODO : Shorter sentences (?), and provide more samples for the model training
"""
TRAIN_DATA = [
("Tier 1 is the points-based system.", {"entities": [(0,7, "STATUS")]}),
("Tier 5 is the points-based system.", {"entities": [(0,7, "STATUS")]}),
("Tier 2, Tier 4 or Tier 5 are the points-based system.", {"entities": [(0,7, "STATUS")]}),
("Tier 2, Tier 4 or Tier 5 are the points-based system.", {"entities": [(8,15, "STATUS")]}),
("Tier 2, Tier 4 or Tier 5 are the points-based system.", {"entities": [(18,26, "STATUS")]}),
("Tier 2 and/or 5 are the points-based system.", {"entities": [(0,16, "STATUS")]}),
("This offer will form part of the Tier 2 (General) regime for skilled workers", {"entities": [(33,41, "STATUS")]}),
("The Health and Care Visa is available for Tier 2 (General).", {"entities": [(42,46, "STATUS")]}),
("Further information on your employers responsibilities as a sponsor can be found in the Tier 2 and 5 Sponsor Guidance.", {"entities": [(88,102, "STATUS")]}),
("Further information can be found in the Tier 2 and 5 Sponsor Guidance.", {"entities": [(40,54, "STATUS")]}),
("As a sponsor, further information can be found in the Tier 2 and 5 Sponsor Guidance.", {"entities": [(54,68, "STATUS")]}),
("This guidance is to sponsor migrants under Tier 2 and/or Tier 5 of the points-based system.", {"entities": [(43,46, "STATUS")]}),
("This guidance is to sponsor migrants under Tier 2 and/or Tier 5 of the points-based system.", {"entities": [(57,61, "STATUS")]}),
("This guidance is for organisations to sponsor migrants under Tier 2 and/or Tier 5.", {"entities": [(61,65, "STATUS")]}),
("This guidance is for organisations to sponsor migrants under Tier 2 and/or Tier 5.", {"entities": [(75,79, "STATUS")]}),
("For organisations who want to apply for a sponsor licence should read Tier 2 and/or Tier 5.", {"entities": [(70,74, "STATUS")]}),
("For organisations who want to apply for a sponsor licence should read Tier 2 and/or Tier 5.", {"entities": [(84,88, "STATUS")]}),
("Separate guidance exists for UK education providers to sponsor international students to come to the UK under Tier 4.", {"entities": [(110,117, "STATUS")]}),
("For UK education providers who hold a licence to sponsor international students to come to the UK under Tier 4 to study.", {"entities": [(104,108, "STATUS")]}),
("For UK education providers should read Tier 4 to sponsor international students.", {"entities": [(39,43, "STATUS")]}),
("This guidance is for organisations who want to apply for a sponsor licence to sponsor migrants under Tier 2 and/or Tier 5.", {"entities": [(102, 108, "STATUS")]}),
("This guidance is for organisations who want to apply for a sponsor licence to sponsor migrants under Tier 2 and/or Tier 5.", {"entities": [(116, 127, "STATUS")]}),
("Separate guidance exists on GOV.UK for UK education providers who wish to apply for and hold a licence to sponsor international students to come to the UK under Tier 4 to study.", {"entities": [(162, 168, "STATUS")]}),
("Tiers 2 and 5 of the points-based system are the primary immigration routes for non-European Economic Area (EEA) migrants who wish to work in the UK.", {"entities": [(0, 13, "STATUS")]}),
("These migrants must be sponsored by an organisation or company that holds a Tier 2 and/or Tier 5 licence.", {"entities": [(77, 83, "STATUS")]}),
("These migrants must be sponsored by an organisation or company that holds a Tier 2 and/or Tier 5 licence.", {"entities": [(91, 97, "STATUS")]}),
("With this trust comes a responsibility to act in accordance with the Immigration Rules and all parts of the Tiers 2 and 5: guidance for sponsors.", {"entities": [(109, 122, "STATUS")]}),
]

In [17]:
# Add the new label to ner
ner.add_label(LABEL)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [18]:
# Importing requirements
from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes):
    sizes = compounding(1.0, 4.0, 1.001)
    # Training for 30 iterations
    for itn in range(30):
        # shuffle examples before training
        random.shuffle(TRAIN_DATA)
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=sizes)
        # dictionary to store losses
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            # Calling update() on each batch
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)


Losses {'ner': 5.577948105847099}
Losses {'ner': 20.937150565462716}
Losses {'ner': 59.21721561943307}
Losses {'ner': 64.37208173520776}


  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)


Losses {'ner': 77.80416535840934}
Losses {'ner': 94.3820334194947}
Losses {'ner': 101.44842720795138}
Losses {'ner': 114.50626193631658}
Losses {'ner': 124.7722750901244}
Losses {'ner': 146.72324225463464}
Losses {'ner': 164.42968554988153}
Losses {'ner': 170.82439990268824}


  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)


Losses {'ner': 192.6631929921056}
Losses {'ner': 203.71737181512168}
Losses {'ner': 219.98760290609448}
Losses {'ner': 234.21420354246834}


  gold = GoldParse(doc, **gold)


Losses {'ner': 241.58438963824705}
Losses {'ner': 270.1146766089771}
Losses {'ner': 294.6643658073081}
Losses {'ner': 303.34812188603433}


  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)


Losses {'ner': 314.5150367822329}
Losses {'ner': 333.1143033169396}
Losses {'ner': 335.60715806954494}
Losses {'ner': 346.582786798669}
Losses {'ner': 361.4059538738392}


  gold = GoldParse(doc, **gold)
  gold = GoldParse(doc, **gold)


Losses {'ner': 373.7176912696622}
Losses {'ner': 387.91189083949115}
Losses {'ner': 14.366648097988218}
Losses {'ner': 38.07036127734682}
Losses {'ner': 51.81678832385819}
Losses {'ner': 59.21601493958019}
Losses {'ner': 78.20888422232906}
Losses {'ner': 87.44217640261718}
Losses {'ner': 99.75185645658144}
Losses {'ner': 120.29510440783974}
Losses {'ner': 140.14569326579635}
Losses {'ner': 147.92076169482948}
Losses {'ner': 185.0736892164457}
Losses {'ner': 199.14036785175608}
Losses {'ner': 215.9197133883606}
Losses {'ner': 223.54267282445448}
Losses {'ner': 243.00672666192054}
Losses {'ner': 258.8253556976467}
Losses {'ner': 274.5658085218072}
Losses {'ner': 281.8566516745393}
Losses {'ner': 293.0222832808993}
Losses {'ner': 299.6082093597704}
Losses {'ner': 310.69650304865877}
Losses {'ner': 324.3371023576599}
Losses {'ner': 334.8801829904676}
Losses {'ner': 347.9677360331532}
Losses {'ner': 359.05283624857634}
Losses {'ner': 370.22062975050153}
Losses {'ner': 375.49236431857366}
...

Losses {'ner': 250.69962588810813}
Losses {'ner': 258.1811111834504}
Losses {'ner': 267.4795157509179}
Losses {'ner': 2.406989870779398}
Losses {'ner': 7.686838289443585}
Losses {'ner': 18.73782403154928}
Losses {'ner': 30.24329721036065}
Losses {'ner': 37.34613706502296}
Losses {'ner': 45.675540536137305}
Losses {'ner': 48.3486952523848}
Losses {'ner': 61.40346914640338}
Losses {'ner': 73.91909215068097}
Losses {'ner': 96.58787732377766}
Losses {'ner': 101.97037772425335}
Losses {'ner': 112.25662143212628}
Losses {'ner': 122.01088769749856}
Losses {'ner': 143.08184223891237}
Losses {'ner': 157.77675589189982}
Losses {'ner': 166.26775053997642}
Losses {'ner': 177.2307996526846}
Losses {'ner': 201.34646020207617}
Losses {'ner': 214.83522489720087}
Losses {'ner': 225.62464202725857}
Losses {'ner': 237.8233585317579}
Losses {'ner': 247.79273856362158}
Losses {'ner': 259.45752738114277}
Losses {'ner': 266.6923648709384}
Losses {'ner': 298.1541381213068}
Losses {'ner': 309.7233926627272}
...

Losses {'ner': 235.66901444012728}
Losses {'ner': 238.54099476160445}
Losses {'ner': 254.96693944381514}
Losses {'ner': 273.45496521032146}
Losses {'ner': 288.0246342268183}
Losses {'ner': 14.748048703651875}
Losses {'ner': 21.274054341061856}
Losses {'ner': 37.13810742885107}
Losses {'ner': 49.426688941719476}
Losses {'ner': 58.20426802994916}
Losses {'ner': 82.85340249352157}
Losses {'ner': 91.15290721622296}
Losses {'ner': 113.77333523076959}
Losses {'ner': 127.95092553761788}
Losses {'ner': 131.46829144400544}
Losses {'ner': 138.68692649970762}
Losses {'ner': 150.7261826521717}
Losses {'ner': 160.98874889081344}
Losses {'ner': 172.88097041472793}
Losses {'ner': 183.57107981666923}
Losses {'ner': 193.4408528675558}
Losses {'ner': 198.26939389307518}
Losses {'ner': 200.83886499668006}
Losses {'ner': 213.46750149095897}
Losses {'ner': 233.20846149354475}
Losses {'ner': 239.10902937140781}
Losses {'ner': 249.0065732152434}
Losses {'ner': 257.6124668047996}
Losses {'ner': 270.5394781388

In [19]:
doc = nlp(df.raw_text[1])
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities []


## Flashtext to get character offsets

To train the model, we need sentences that contain the **keywords** (Tiers, Annex, Level, ...) we want to **tag** (LAW, ORG, ...), together with the character offsets of each keyword.

Flashtext can locate the keywords and report those offsets, which can then be used to build the NER training examples.

In [20]:
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Tier 1', ('PRODUCT', 'Tier 1'))
keyword_processor.add_keyword('Tier 2', ('PRODUCT', 'Tier 2'))
keyword_processor.add_keyword('Tier 3', ('PRODUCT', 'Tier 3'))
keyword_processor.add_keyword('Tier 4', ('PRODUCT', 'Tier 4'))
keyword_processor.add_keyword('Tier 5', ('PRODUCT', 'Tier 5'))
keyword_processor.add_keyword('non-EEA', ('NORP', 'non-EEA'))
keyword_processor.add_keyword('EEA', ('NORP', 'EEA'))
keyword_processor.add_keyword('European Economic Area', ('NORP', ''))
# `doc` is still a spacy.tokens.doc.Doc here, so convert it to str before matching
keywords_extracted = keyword_processor.extract_keywords(str(doc), span_info=True)
keywords_extracted

[(('PRODUCT', 'Tier 2'), 101, 107),
 (('PRODUCT', 'Tier 5'), 115, 121),
 (('NORP', ''), 585, 607),
 (('NORP', 'EEA'), 609, 612),
 (('NORP', 'non-EEA'), 628, 635),
 (('PRODUCT', 'Tier 4'), 931, 937)]
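
The matches above carry the label inside the first tuple element, while spaCy-style annotations expect `(start, end, label)` triples. Below is a small sketch of the conversion, using the output of the previous cell as literal data. `to_entity_spans` is a hypothetical helper name.

```python
# Output of extract_keywords(..., span_info=True), copied from the cell above
keywords_extracted = [
    (("PRODUCT", "Tier 2"), 101, 107),
    (("PRODUCT", "Tier 5"), 115, 121),
    (("NORP", ""), 585, 607),
    (("NORP", "EEA"), 609, 612),
    (("NORP", "non-EEA"), 628, 635),
    (("PRODUCT", "Tier 4"), 931, 937),
]

def to_entity_spans(matches):
    """Turn ((label, name), start, end) matches into sorted (start, end, label) triples."""
    return sorted((start, end, label) for (label, _name), start, end in matches)

entities = to_entity_spans(keywords_extracted)
print(entities[:2])
```

These offsets are relative to the whole document, so they would need shifting once the text is split into sentences.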

## Sentence Tokenizer

Once we have the character offsets, training entries can be populated in the following form:

("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]})

Each entry needs an example text plus the character offsets and label of every entity it contains.
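
A minimal sketch of building one such entry per sentence, with offsets relative to the sentence itself. It uses a plain regular-expression search as a stand-in for flashtext. The `KEYWORDS` mapping and the `build_entry` helper are hypothetical and only mirror the labels used above.

```python
import re

# Hypothetical keyword -> label mapping, mirroring the labels used with flashtext above
KEYWORDS = {"Tier 2": "PRODUCT", "Tier 4": "PRODUCT", "Tier 5": "PRODUCT", "EEA": "NORP"}

def build_entry(sentence, keywords):
    """Return (sentence, {"entities": [...]}) with offsets relative to the sentence."""
    spans = []
    for phrase, label in keywords.items():
        # Only match the phrase when it is not embedded in a longer word
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        for match in re.finditer(pattern, sentence):
            spans.append((match.start(), match.end(), label))
    return (sentence, {"entities": sorted(spans)})

sentence = "This guidance is to sponsor migrants under Tier 2 and/or Tier 5."
print(build_entry(sentence, KEYWORDS))
```

Unlike flashtext, this stand-in does not prefer the longest keyword, so `EEA` would also match inside `non-EEA`. Overlapping spans would have to be resolved before training.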

In [21]:
doc = df.raw_text[1]
# sentence tokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
abbreviation = ['f', 'fr', 'k', 'u.k.', 'gov.uk.', 'fig']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.train(str(doc))
doc = tokenizer.tokenize((str(doc)))
doc

['This guidance is for organisations who want to apply for a sponsor licence to sponsor migrants under Tier 2 and/or Tier 5 of the points-based system.',
 'It tells you what we expect if you are a licence holder, the processes you must follow when sponsoring a migrant and how to meet all of the duties and responsibilities associated with being a licensed sponsor.',
 'The guidance is subject to change and you should check the dates to make sure you have the latest version.',
 'A new points-based immigration system will come into effect from 1 January 2021.',
 'The future system will apply to both European Economic Area (EEA) nationals and non-EEA nationals.',
 'You should refer to Annex 9 of this guidance if you intend to apply for a licence to sponsor workers under the new system.',
 'Separate guidance exists on GOV.UK for UK education providers who wish to apply for and hold a licence to sponsor international students to come to the UK under Tier 4 to study.',
 "You can find the appen