<a href="https://colab.research.google.com/github/ludwigwittgenstein2/Research/blob/master/Name_Entity_Recognition_SPACY_%26_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Title: Name Entity Recognition using SPACY and BERT model 
##by Rick Rejeleene

Name Entity Recognition (NER) is an Information Extraction task in Natural Language Processing.

In this, we approach NER using SPACY and Transformer based Model, BERT

#Objective 

>a. BERT NER 

>b. SPACY NER


>We compare both with metrics. 

>c. Results/Publish

Paper link: https://docs.google.com/document/d/1cidVb2qUS_9KrUWrzngaxXPADxBz09matYFjx1RhPwY/edit

## What dataset are we using for this task? 

> U.S Census dataset 

# Link for Dataset: 

https://drive.google.com/drive/folders/1ozzGFfKH2LFz7Gc2RZm6d4s6SYVw0jGb?usp=sharing


##How to do Name Entity Recognition on Custom Dataset using SPACY3? 

1. Dataset 
2. Spacy Library
3. Training

First We require to annotate our custom dataset: 

a. NER Annotator 

https://tecoholic.github.io/ner-annotator/

b. We require our dataset to be in txt file to be uploaded

c. In the NER annotator, apply labels that you desire 

d. Use Export Annotation to export your annotated dataset



# What next? 

a. Once you have your training dataset

b. Use Spacy3




#Objective: SPACY NER on custom dataset

In [30]:
! pip install -U spacy -q
!pip install torch
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [34]:
import torch
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

## Load custom dataset 

> Upload annotation json dataset

> Re-name to training_data.json

> Run code

In [32]:
import json
f = open('training_data.json')
TRAIN_DATA = json.load(f)

In [33]:
for text, annot in tqdm(TRAIN_DATA['annotations']): 
    doc = nlp.make_doc(text) 
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents 
    db.add(doc)

db.to_disk("./training_data.spacy") # save the docbin object

100%|██████████| 50/50 [00:00<00:00, 676.25it/s]


In [35]:
class NerDataset(torch.utils.data.Dataset):
  """
  Custom dataset implementation to get (text,labels) tuples
  Inputs:
   - df : dataframe with columns [tags, sentence]
  """
  
  def __init__(self, df):
    if not isinstance(df, pd.DataFrame):
      raise TypeError('Input should be a dataframe')
    
    if "tags" not in df.columns or "sentence" not in df.columns:
      raise ValueError("Dataframe should contain 'tags' and 'sentence' columns")

     
    
    tags_list = [i.split() for i in df["tags"].values.tolist()]
    texts = df["sentence"].values.tolist()

    self.texts = [tokenizer(text, padding = "max_length", truncation = True, return_tensors = "pt") for text in texts]
    self.labels = [match_tokens_labels(text, tags) for text,tags in zip(self.texts, tags_list)]

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    batch_text = self.texts[idx]
    batch_labels = self.labels[idx]

    return batch_text, torch.LongTensor(batch_labels)

In [36]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

[38;5;3m⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.[0m
[38;5;4mℹ Generated config template specific for your use case[0m
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [37]:
!python -m spacy train config.cfg --output ./ --paths.train ./training_data.spacy --paths.dev ./training_data.spacy

[38;5;4mℹ Saving to output directory: .[0m
[38;5;4mℹ Using CPU[0m
[38;5;4mℹ To switch to GPU 0, use the option: --gpu-id 0[0m
[1m
[2022-11-21 21:27:35,687] [INFO] Set up nlp object from config
INFO:spacy:Set up nlp object from config
[2022-11-21 21:27:35,708] [INFO] Pipeline: ['tok2vec', 'ner']
INFO:spacy:Pipeline: ['tok2vec', 'ner']
[2022-11-21 21:27:35,716] [INFO] Created vocabulary
INFO:spacy:Created vocabulary
[2022-11-21 21:27:35,720] [INFO] Finished initializing nlp object
INFO:spacy:Finished initializing nlp object
[2022-11-21 21:27:36,085] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
INFO:spacy:Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     53.67    0.00    0.00    

In [38]:
nlp_ner = spacy.load("/content/model-best")

## Test out addresss from dataset using SPACY

>uncomment and add any input for checking

In [41]:
#doc = nlp_ner('Mr. Breonia Adam , PhD, QC, "	"513 LANDWYCK LN, FLOWER MOUND ,TX 75028') # input sample text
doc = nlp_ner('Mr. Jasper Bauer , Jr, IQCP, 11096 GINGERWOOD WAY, RANCHO CORDOVA ,CA 95670')

In [13]:
#doc = nlp_ner('MS. Lilly Alldredge , Jr, CLIA, 8243 WORMWOOD RD, JACKSONVILLE ,FL 32210')

In [15]:
#doc = nlp_ner ('Mr. Madelyn Adams , Sr, CLIA, 102 HILLVIEW DR, CLOVERDALE ,CA 95425')

#Results from SPACY NER

In [39]:
#doc = nlp_ner ('Miss Shazia Aman, Sr, Dr, 105 West HillCrest Rock Drive, Little Rock, AR, 72004')

In [42]:
spacy.displacy.render(doc, style="ent", jupyter=True) # display in Jupyter

#Name Entity Recognition using Transformer based models

In [43]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [44]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-PREFIX",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-GIVEN",  # Miscellaneous entity
    "B-FAMILY",   # Beginning of a person's name right after another person's name
    "I-NAME",   # Person's name
    "B-STREET",   # Beginning of an organisation right after another organisation
    "I-CITY",   # Organisation
    "B-STATE",   # Beginning of a location right after another location
    "I-ZIP"    # Location
]

sequence = "Mr. Breonia Adam , PhD, QC, "	"513 LANDWYCK LN, FLOWER MOUND ,TX 75028"

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

"""
MR -> Prefix Title
BREONIA -> Given Name
ADAM -> Family Name
PHD QC -> Suffix Title
513 -> Street Number
LANDWYCK -> Street Name
LN -> Street Suffix
FLOWER MOUND -> City Name
TX -> State Name
75028 -> Zip Code

"""



[('[CLS]', 'O'), ('Mr', 'O'), ('.', 'O'), ('B', 'I-NAME'), ('##re', 'I-NAME'), ('##onia', 'I-NAME'), ('Adam', 'I-NAME'), (',', 'O'), ('PhD', 'O'), (',', 'O'), ('Q', 'O'), ('##C', 'I-ZIP'), (',', 'O'), ('51', 'O'), ('##3', 'O'), ('LA', 'I-ZIP'), ('##ND', 'I-CITY'), ('##W', 'I-CITY'), ('##Y', 'I-CITY'), ('##C', 'I-CITY'), ('##K', 'I-CITY'), ('L', 'O'), ('##N', 'O'), (',', 'O'), ('FL', 'O'), ('##OW', 'O'), ('##ER', 'O'), ('M', 'O'), ('##O', 'O'), ('##UN', 'O'), ('##D', 'O'), (',', 'O'), ('TX', 'I-ZIP'), ('750', 'O'), ('##28', 'O'), ('[SEP]', 'O')]


'\nMR -> Prefix Title\nBREONIA -> Given Name\nADAM -> Family Name\nPHD QC -> Suffix Title\n513 -> Street Number\nLANDWYCK -> Street Name\nLN -> Street Suffix\nFLOWER MOUND -> City Name\nTX -> State Name\n75028 -> Zip Code\n\n'

In [46]:
f = open('/content/Cleaned_Simple_50.txt', 'r')
content = f.read()
print(content)

Number	Name	Address
2	"Dean. Alex Abadi , Jr, QA, "	"14646 RANCHERO RD, HESPERIA ,CA 92345"
3	"Dr. Lyla Abbatiello , Jr, CLIA, "	"517 N MOUNTAIN AVE, # 202, UPLAND ,CA 91786"
4	"Mr. Kylia Abbott , Jr, QC, "	"2731 PONKAN MEADOW DR, APOPKA ,FL 32712"
5	"Mrs. Bryce Abbott , Jr, QA, "	"237 SAINT TROPEZ LN, LINCOLN ,CA 95648"
6	"Ms. Daniela Abella , PhD, IQCP,"	"20054 CHICKEN DR, DECATUR ,AR 72722"
7	"Mr. Jacob Abraham , Jr, CLIA, "	"3006 E 17TH ST, # 6, OAKLAND ,CA 94601"
8	"Mr. Reily Acevedo , PhD, QC,"	"15123 BROOKHURST ST, APT 357, WESTMINSTER ,CA 92683"
9	"Col. Micaela Acevedo , PhD, QA, "	"1724 HORIZON HEIGHTS CIR, EL CAJON ,CA 92019"
10	"Mr. Breonia Adam , PhD, QC, "	"513 LANDWYCK LN, FLOWER MOUND ,TX 75028"
11	"Mr. Miranda Adam , Jr, QA,"	"3503 E CORNELL ST, LUBBOCK ,TX 79403"
12	"Mr. Madelyn Adams , Sr, CLIA"	"102 HILLVIEW DR, CLOVERDALE ,CA 95425"
13	"Mr. Jase Adams , PhD, QC, "	"1940 MARKET ST, FERNDALE ,CA 95536"
14	"Col. Jessica Adams , PhD, QA, "	"3120 11TH ST SW, LEHIGH ACRES

In [None]:
#use if you want to close file
f.close()

In [47]:
from transformers import pipeline
import pandas as pd
f = open('/content/Cleaned_Simple_50.txt', 'r')
content = f.read()

text = content

#text = 'Mr. Breonia Adam , PhD, QC, "	"513 LANDWYCK LN, FLOWER MOUND ,TX 75028'

tagger = pipeline(task='ner', aggregation_strategy='simple', grouped_entities=True)
named_ents = tagger(text)
stored_data = pd.DataFrame(named_ents)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
  "`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to"


In [48]:
print(stored_data)

   entity_group     score          word  start   end
0           PER  0.597720          Dean     23    27
1           PER  0.978966    Alex Abadi     29    39
2           PER  0.919571            Jr     42    44
3           LOC  0.654975            RA     59    61
4           ORG  0.598351    ##NCHERO R     61    69
..          ...       ...           ...    ...   ...
71          ORG  0.615709             L    988   989
72          LOC  0.381800          ##IG    991   993
73          ORG  0.581082          ACRE    995   999
74          PER  0.996841  Samuel Adams   1022  1034
75          PER  0.843554            Sr   1037  1039

[76 rows x 5 columns]


In [49]:
#output to a CSV File
stored_data.to_csv('TransformerNER.csv', sep='\t')

# Draft Transformer Code

In [None]:
!pip install spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install spacy_conll 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy_conll
  Downloading spacy_conll-3.2.0-py3-none-any.whl (21 kB)
Installing collected packages: spacy-conll
Successfully installed spacy-conll-3.2.0


In [None]:
!pip install spacy-stanza

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from spacy_conll import init_parser


# Initialise English parser, already including the ConllFormatter as a pipeline component.
# Indicate that we want to get the CoNLL headers in the string output.
# `use_gpu` and `verbose` are specific to stanza. These keywords arguments are passed onto their Pipeline() initialisation
nlp = init_parser("en",
                  "stanza",
                  parser_opts={"use_gpu": True, "verbose": False}, include_headers=True)
# Parse a given string
doc = nlp("Mr. Jasper Bauer , Jr, IQCP, 11096 GINGERWOOD WAY, RANCHO CORDOVA ,CA 95670")

# Get the CoNLL representation of the whole document, including headers
conll = doc._.conll_str
print(conll)


# sent_id = 1
# text = Mr. Jasper Bauer , Jr, IQCP, 11096 GINGERWOOD WAY, RANCHO CORDOVA ,CA 95670
1	Mr.	Mr.	PROPN	NNP	Number=Sing	0	root	_	_
2	Jasper	Jasper	PROPN	NNP	Number=Sing	1	flat	_	_
3	Bauer	Bauer	PROPN	NNP	Number=Sing	1	flat	_	_
4	,	,	PUNCT	,	_	1	punct	_	_
5	Jr	Jr	PROPN	NNP	Number=Sing	1	list	_	SpaceAfter=No
6	,	,	PUNCT	,	_	1	punct	_	_
7	IQCP	IQCP	PROPN	NNP	Number=Sing	1	list	_	SpaceAfter=No
8	,	,	PUNCT	,	_	1	punct	_	_
9	11096	11096	NUM	CD	NumForm=Digit|NumType=Card	7	list	_	_
10	GINGERWOOD	gingerwood	NOUN	NN	Number=Sing	11	compound	_	_
11	WAY	way	NOUN	NN	Number=Sing	1	list	_	SpaceAfter=No
12	,	,	PUNCT	,	_	1	punct	_	_
13	RANCHO	RANCHO	PROPN	NNP	Number=Sing	14	compound	_	_
14	CORDOVA	CORDOVA	PROPN	NNP	Number=Sing	1	list	_	_
15	,	,	PUNCT	,	_	16	punct	_	SpaceAfter=No
16	CA	CA	PROPN	NNP	Number=Sing	1	list	_	_
17	95670	95670	NUM	CD	NumForm=Digit|NumType=Card	1	list	_	SpaceAfter=No



In [None]:
import spacy
from spacy.training import offsets_to_biluo_tags
from spacy.tokens import Span

print("SPACY HELPER MODEL")
base_model = "en_core_web_sm"
nlp = spacy.load(base_model)
to_disable= ['parser', 'lemmatizer', 'ner']
_ = [nlp.remove_pipe(item) for item in to_disable]
print("Base model used: ", base_model)
print("Removed components: ", to_disable)
print("Enabled components: ", nlp.pipe_names)

# Assume text is already available as sentences...
# so no need for spaCy `sentencizer` or similar
print("\nDEMO SPACY DOC LIST BUILDING...", end="")
doc1 = nlp("iPhone X is coming.")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
doc2 = nlp("Space X is nice.")
doc2.ents = [Span(doc1, 0, 2, label="BRAND")]
docs = [doc1, doc2]
print("DONE!")

print("\nCoNLL 2003 CONVERSION:\n")
results = []
for doc in docs:
    # Preliminary: whole sentence
    whole_sentence = doc.text
    # 1st item (CoNLL 2003): word
    words = [token.text for token in doc]
    # 2nd item (CoNLL 2003): POS
    pos = [token.tag_ for token in doc]
    # 3rd item (CoNLL 2003): syntactic chunk tag
    # sct = pos  # Redundant, so will be left out
    # 4th item (CoNLL 2003): named entities
    spacy_entities = [
        (ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]
    biluo_entities = offsets_to_biluo_tags(doc, spacy_entities)
    results.append((whole_sentence, words, pos, biluo_entities))

for result in results:
    print(
        "\nDOC TEXT (NOT included in CoNLL 2003, just for demo): ",
        result[0], "\n"
    )
    print("-DOCSTART- -X- -X- O")
    for w,x,y,z in zip(result[1], result[2], result[2], result[3]):
        print(w,x,y,z)

SPACY HELPER MODEL
Base model used:  en_core_web_sm
Removed components:  ['parser', 'lemmatizer', 'ner']
Enabled components:  ['tok2vec', 'tagger', 'attribute_ruler']

DEMO SPACY DOC LIST BUILDING...DONE!

CoNLL 2003 CONVERSION:


DOC TEXT (NOT included in CoNLL 2003, just for demo):  iPhone X is coming. 

-DOCSTART- -X- -X- O
iPhone NNP NNP B-GADGET
X NNP NNP L-GADGET
is VBZ VBZ O
coming VBG VBG O
. . . O

DOC TEXT (NOT included in CoNLL 2003, just for demo):  Space X is nice. 

-DOCSTART- -X- -X- O
Space NN NN B-BRAND
X NNP NNP L-BRAND
is VBZ VBZ O
nice JJ JJ O
. . . O


In [None]:
def print_entities(pipeline, text):
    
    # Create a document 
    document = pipeline(text)
    
    # Entity text & label extraction
    for entity in document.ents:
        print(entity.text + '->', entity.label_)
        
        
def visualize_entities(pipeline, text):
    
    # Create a document 
    document = pipeline(text)
        
    # Show entities in pretty manner
    displacy.render(document, jupyter=True, style='ent')

In [None]:

import spacy
from spacy import displacy

# Load English large model
nlp_sm = spacy.load("en_core_web_lg")

In [None]:
short_text = '“Amy Schneider, an engineering manager from Oakland, California, became the first woman and the fourth person on “Jeopardy!” to earn more than $1 million in winnings on Friday’s episode”.'

In [None]:
long_text = 'Good news for consumers, undoubtedly, and good news also for investors. Apple’s recent results, covering the three months to December 31 2016, saw the company’s chief financial officer Luca Maestri announce: ‘We returned nearly $15 billion to investors through share re-purchases and dividends during the quarter.’ The quarterly dividend itself was 57 cents a share, identical to the dividend for the previous three quarters and up on the 52 cents paid for each of the four quarters before that.Business is brisk at Apple. On January 31, Tim Cook, Apple’s chief executive, said of the last three months of 2016: ‘We’re thrilled to report that our holiday quarter results generated Apple’s highest quarterly revenue ever, and broke multiple records along the way. We sold more iPhones than ever before and set all-time revenue records for iPhone, Services, Mac and Apple Watch'

In [None]:
addresss = "Dean. Alex Abadi , Jr, QA, "	"14646 RANCHERO RD, HESPERIA ,CA 92345"

In [None]:
print_entities(nlp_sm, short_text)

Amy Schneider-> PERSON
Oakland-> GPE
California-> GPE
first-> ORDINAL
fourth-> ORDINAL
Jeopardy-> WORK_OF_ART
more than $1 million-> MONEY
Friday-> DATE


In [None]:
print_entities(nlp_sm, addresss)

Dean-> PERSON
Alex Abadi-> PERSON
Jr-> PERSON


In [None]:
visualize_entities(nlp_sm, addresss)

In [None]:
visualize_entities(nlp_sm, short_text)

In [None]:
print_entities(nlp_sm, long_text)

Apple-> ORG
the three months to December 31 2016-> DATE
Luca Maestri-> PERSON
nearly $15 billion-> MONEY
the quarter-> DATE
quarterly-> DATE
57 cents-> MONEY
the previous three quarters-> DATE
the 52 cents-> MONEY
the four quarters-> DATE
Apple-> ORG
January 31-> DATE
Tim Cook-> PERSON
Apple-> ORG
the last three months of 2016-> DATE
our holiday quarter-> DATE
Apple-> ORG
quarterly-> DATE
iPhones-> ORG
iPhone, Services-> ORG
Mac-> ORG
Apple Watch-> ORG


In [None]:
visualize_entities(nlp_sm, short_text)

In [None]:
# Load the spacy transformer (roberta-base) model
roberta_nlp = spacy.load("en_core_web_trf")

OSError: ignored

In [None]:
print_entities(roberta_nlp, short_text)

In [None]:
visualize_entities(roberta_nlp, short_text)

In [None]:
!pip install spacy-transformers

In [None]:
print_entities(roberta_nlp, long_text)

In [None]:
print_entities(roberta_nlp, addresss)

In [None]:
visualize_entities(roberta_nlp, short_text)

In [None]:
visualize_entities(roberta_nlp, addresss)

In [None]:
visualize_entities(roberta_nlp, long_text)

In [2]:
!pip install flair


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting flair
  Downloading flair-0.11.3-py3-none-any.whl (401 kB)
[K     |████████████████████████████████| 401 kB 6.5 MB/s 
[?25hCollecting mpld3==0.3
  Downloading mpld3-0.3.tar.gz (788 kB)
[K     |████████████████████████████████| 788 kB 46.3 MB/s 
[?25hCollecting sentencepiece==0.1.95
  Downloading sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 55.2 MB/s 
[?25hCollecting bpemb>=0.3.2
  Downloading bpemb-0.3.4-py3-none-any.whl (19 kB)
Collecting deprecated>=1.2.4
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting konoha<5.0.0,>=4.0.0
  Downloading konoha-4.6.5-py3-none-any.whl (20 kB)
Collecting conllu>=4.0
  Downloading conllu-4.5.2-py2.py3-none-any.whl (16 kB)
Collecting wikipedia-api
  Downloading Wikipedia-API-0.5.4.tar.gz (18 kB)
Collecting huggingface-hub
  Downloading huggingface_

In [3]:
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')



Downloading:   0%|          | 0.00/432M [00:00<?, ?B/s]

2022-11-21 20:46:33,537 loading file /root/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
2022-11-21 20:46:35,955 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


In [7]:
!pip install flair.data

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[31mERROR: Could not find a version that satisfies the requirement flair.data (from versions: none)[0m
[31mERROR: No matching distribution found for flair.data[0m


In [11]:
from flair import data.TaggedCorpus

SyntaxError: ignored

In [10]:
from flair

SyntaxError: ignored

In [6]:
from flair.data import TaggedCorpus
from flair.data import Sentence
from flair.data_fetcher import  NLPTaskDataFetcher, NLPTask
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, CharLMEmbeddings, CharacterEmbeddings
from segtok.segmenter import split_single
from typing import List
import torch

# Convert string to sentence, this will auto s
test_sentence = Sentence("Egg fried rice is delicious .")
tags = ["B-Food", "I-Food","L-Food", "O", "O", "O"]

for x in range(len(test_sentence)):
    test_sentence[x].add_tag('ner', tags[x])

test_sentence.to_tagged_string()
# Output
# 'Egg <B-Food> fried <I-Food> rice <L-Food> is delicious .'

# Let's assume you have already create lists which contain tagged sentences 
corpus = TaggedCorpus(sentences, testa, testb)

# Make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")
print(tag_dictionary.idx2item)

stacked_embeddings = StackedEmbeddings([
                                        WordEmbeddings('glove'), 
                                        FlairEmbeddings('news-forward'), 
                                        FlairEmbeddings('news-backward'),
                                       ])

# Initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=stacked_embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type="ner",
                                        use_crf=True)

# initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# Start training
trainer.train('model',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=80)

ImportError: ignored