## Development of a Benchmark Corpus to Support Entity Recognition in Job Descriptions

Detecting and extracting entities is an important task in many real-world information extraction applications, such as text classification, search algorithms, and content recommendation.
In recruiting, companies today benefit from automated systems that acquire up-to-date information about job roles and detailed candidate profiles in terms of skills, qualifications, and experience.
However, the development of entity recognition (ER) for these tasks suffers severely from the lack of publicly available datasets.
Most of the available datasets consist of general news text (newspaper articles, etc.).

To develop better job-matching tools, we need to address this gap.
First, we define the relevant entity types: skills, education, and experience.
We start from the publicly available trainrev1 dataset (https://www.kaggle.com/datasets/airiddha/trainrev1), which contains UK job descriptions.
We build our dataset on top of it, annotating each description with the three labels: skills, education, and experience.
To do so, we use existing entity extraction tools that label each of the three entity types independently.
Next, we present a benchmark between two models, LM-LSTM-CRF and Flair.

In addition, the dataset we created is publicly available and can be used with other models.
The source code can be found at (link to source code).


### Instructions before starting

In [2]:
## Before we start
# As you can see, the models were taken from GitHub repos, with minor tuning to make them work.
# Every machine is its own case: I ran this on a MacBook M1, so small adjustments were needed for Python and the models to run smoothly. Everything is listed below.

# Uncomment the 2 cp commands and follow the instructions!

# Replace <<Downloaded_glove.6b.100d.txt_diretory>> with the path of your downloaded GloVe embeddings.
#!cp -r <<Downloaded_glove.6b.100d.txt_diretory>> embeddings/glove.6B.100d.txt

# Download the dataset from https://www.kaggle.com/datasets/airiddha/trainrev1
# Replace <<Downloaded_trainrev1_dataset>> with the path of your downloaded dataset file Train_rev1_2.csv.
# The command below places it at dataset/Train_rev1_2.csv.
#!cp -Rf <<Downloaded_trainrev1_dataset>> dataset/Train_rev1_2.csv
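
Before running the later cells, it can help to confirm that both files are where the notebook expects them. The following is a minimal sketch, not part of the original pipeline: the helper `missing_files` is hypothetical, and the two paths are assumptions taken from the dataset path and the training command used further down.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed locations, taken from the dataset path and the training command in this notebook.
REQUIRED_FILES = ["embeddings/glove.6B.100d.txt", "dataset/Train_rev1_2.csv"]

def missing_files(paths: Iterable[str], root: str = ".") -> List[str]:
    """Return the paths that do not exist as regular files under root."""
    return [p for p in paths if not (Path(root) / p).is_file()]

missing = missing_files(REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are in place.")
```

If a file is reported as missing, adjust the copy commands above (or the list in the sketch) to match your local layout.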


## Dataset creation

### First we need to understand how to extract experience, qualification/education & skills from the dataset.
So, we need to create an extractor for each of the 3 labels we want to annotate.
###### NOTE - We already ran this code overnight to produce a better IOB result file, so use the sample we already created: dataset/sample_dataset.conll

### Skills
We used ***SkillNer***; its description can be found here: https://github.com/AnasAito/SkillNER

***SkillNer*** is an NLP module that automatically extracts skills and certifications from unstructured job postings, texts, and applicants' resumes.
SkillNer uses the EMSI database (an open source skill database) as a knowledge base linker to prevent skill duplications.

In [3]:
import os
from typing import List

os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
import spacy
from skillNer.general_params import SKILL_DB
from skillNer.skill_extractor_class import SkillExtractor
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_lg")
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

def extract_skills(job_description: str) -> List[str]:
    try:
        annotations = skill_extractor.annotate(job_description)
        full_matches = [x['doc_node_value'] for x in annotations['results']['full_matches']]
        ngram_scored = [x['doc_node_value'] for x in annotations['results']['ngram_scored']]
        return full_matches + ngram_scored
    except Exception as e:
        print(f"Exception {e} on {job_description}, moving forward")
        return []



loading full_matcher ...
loading abv_matcher ...
loading full_uni_matcher ...
loading low_form_matcher ...
loading token_matcher ...


### Education

We used regex matching to find any mention of education.
We listed all the education options that appear in job descriptions.

***NOTE - This is not the best practice. The better practice is to build a machine-learning education extractor from text. Even a machine-learning model can fail to recognize an education term it has not seen before,
so it must be a well-trained, up-to-date model.***

In [4]:
import os

os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
import spacy
import re
from nltk.corpus import stopwords

# load pre-trained model
nlp = spacy.load('en_core_web_sm')

# Grab all general stop words
STOPWORDS = set(stopwords.words('english'))

# Education Degrees
EDUCATION = [
            'BE','B.E.', 'B.E', 'BS', 'B.S','C.A.','c.a.','B.Com','B. Com','M. Com', 'M.Com','M. Com .',
            'ME', 'M.E', 'M.E.', 'MS', 'M.S',
            'BTECH', 'B.TECH', 'M.TECH', 'MTECH',
            'PHD', 'phd', 'ph.d', 'Ph.D.','MBA','mba','graduate', 'post-graduate','5 year integrated masters','masters',
            'SSC', 'HSC', 'CBSE', 'ICSE', 'X', 'XII',
            'Hospital','University','Institute','School','School','Academy', 'Bachelor', 'High School', 'College', ''
        ]

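# Normalized lookup keys: upper-case, letters and digits only, so that
# 'Ph.D.', 'ph.d' and 'PHD' all map to the same key. Multi-word entries
# only match when they appear in the text as a single token.
EDUCATION_KEYS = {re.sub(r'[^A-Z0-9]', '', entry.upper()) for entry in EDUCATION} - {''}

def extract_education(job_description: str):
    nlp_text = nlp(job_description)
    # Sentence Tokenizer
    sentences = [sent.text.strip() for sent in nlp_text.sents]
    edu = []
    for text in sentences:
        for tex in text.split():
            # Strip punctuation, then normalize the token the same way as the keys
            tex = re.sub(r'[?$.!,|]', r'', tex)
            key = re.sub(r'[^A-Z0-9]', '', tex.upper())
            if key in EDUCATION_KEYS and tex.lower() not in STOPWORDS:
                edu.append(tex)
    return edu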




### Experience

We used a regex to find any mention of experience in the text (e.g. a number followed by "years").

In [6]:
import re

rx = re.compile(r"(\d+(?:-\d+)?\+?)\s*(years?)", re.I)

def extract_experience(job_description: str):
    # Return the first experience mention (e.g. "3-5 years") as a one-item list,
    # or an empty list when the description has none.
    exp_temp = rx.search(string=job_description)
    if exp_temp:
        return [' '.join(exp_temp.groups())]
    return []
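
To see what the experience pattern captures, here is a small self-contained sketch that applies the same regex to a few made-up sentences. The sample sentences and the helper `first_experience` are illustrative assumptions, not taken from the dataset.

```python
import re

# Same pattern as above: a number or range, an optional "+", then "year"/"years".
rx = re.compile(r"(\d+(?:-\d+)?\+?)\s*(years?)", re.I)

# Made-up example sentences (not from the dataset).
samples = [
    "minimum 3 years experience in sales",
    "2-4 years of commercial java development",
    "5+ Years in a similar role",
    "recent graduate welcome",
]

def first_experience(text):
    """Return the first experience mention as '<number> years', or None if absent."""
    match = rx.search(text)
    return " ".join(match.groups()) if match else None

results = [first_experience(sample) for sample in samples]
for sample, found in zip(samples, results):
    print(f"{found!r:12} <- {sample}")
```

Note that `search` returns only the first match, so a description that mentions several durations yields a single experience item.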

### Dataset generation

Now that we have all the extractors, we can run them over the dataset file to create our training dataset in IOB format.
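
As a reminder of the target format: in IOB tagging, the first token of an entity gets a `B-` label, the following tokens of the same entity get `I-` labels, and every other token gets `O`. The sketch below illustrates this on a toy sentence; the helper `iob_tags` is hypothetical and much simpler than the generation code in the next cell.

```python
from typing import List

def iob_tags(tokens: List[str], entity: List[str], label: str) -> List[str]:
    """Tag every occurrence of `entity` in `tokens` with B-/I- labels; others get O."""
    tags = ["O"] * len(tokens)
    n = len(entity)
    i = 0
    while n and i <= len(tokens) - n:
        if tokens[i:i + n] == entity:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{label}"
            i += n  # continue after the matched entity
        else:
            i += 1
    return tags

tokens = ["experience", "with", "machine", "learning", "required"]
tags = iob_tags(tokens, ["machine", "learning"], "SKILL")
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The dataset file additionally carries a POS tag and a chunk tag per token, giving four tab-separated columns per row.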


In [13]:
import csv
from typing import List

import nltk

pos_chunk_mapping = {
    'NN': 'B-NP',
    'NNS': 'B-NP',
    'NNP': 'B-NP',
    'NNPS': 'B-NP',
    'VB': 'B-VP',
    'VBD': 'B-VP',
    'VBG': 'B-VP',
    'VBN': 'B-VP',
    'VBP': 'B-VP',
    'VBZ': 'B-VP',
    'IN': 'B-PP',
    'TO': 'B-VP',
    'JJ': 'B-ADJP',
    'JJR': 'B-ADJP',
    'JJS': 'B-ADJP',
    'RB': 'B-ADVP',
    'RBR': 'B-ADVP',
    'RBS': 'B-ADVP',
    'DT': 'B-NP',
    'CD': 'B-NP',
    'PRP': 'B-NP',
    'PRP$': 'B-NP',
    'POS': 'B-NP',
    'EX': 'B-NP',
    'WDT': 'B-NP',
    'WP': 'B-NP',
    'WP$': 'B-NP',
    'WRB': 'B-ADVP',
    'MD': 'B-VP',
    'RP': 'B-PRT',
    'CC': 'B-CONJP',
    'PDT': 'B-ADJP',
    'FW': 'B-ADJP',
    'UH': 'B-INTJ',
    'SYM': 'B-SYM',
    'LS': 'B-LST',
    'ADD': 'B-NP',
    'GW': 'B-NP',
    'AFX': 'B-NP',
    'HYPH': 'B-NP',
    'NFP': 'B-NP',
    'XX': 'B-NP',
    'BES': 'B-VP',
    'HVS': 'B-VP',
    'I-NP': 'I-NP',
    'I-VP': 'I-VP',
    'I-PP': 'I-PP',
    'I-ADJP': 'I-ADJP',
    'I-ADVP': 'I-ADVP',
    'I-PRT': 'I-PRT',
    'I-CONJP': 'I-CONJP',
    'I-INTJ': 'I-INTJ',
    'I-SYM': 'I-SYM',
    'I-LST': 'I-LST',
    'O': 'O'
}


dataset = "dataset/Train_rev1_2.csv"
MAX_SIZE = 2

def extract_full_item(start: int, tokens: List[str], items: List[str]) -> str:
    # Return the first item whose words match the tokens beginning at position `start`.
    # Only items of at most MAX_SIZE words are considered; "" means no match.
    for item in items:
        splitted_item = item.split(" ")
        if len(splitted_item) <= MAX_SIZE and tokens[start:start + len(splitted_item)] == splitted_item:
            return item
    return ""

def append_to_coll_data(extracted_item: str, start: int, pos_tags, label: str) -> List[str]:
    # Emit one CoNLL row per word of the matched item:
    # B-<label> for the first word, I-<label> for the following words.
    conll_data = []
    for (kdx, extracted_item_chunk) in enumerate(extracted_item.split(" ")):
        pos = pos_tags[start + kdx][1]
        chunk_tag = pos_chunk_mapping.get(pos, 'O')
        prefix = "B" if kdx == 0 else "I"
        conll_data.append(f"{extracted_item_chunk}\t{pos}\t{chunk_tag}\t{prefix}-{label}")
    return conll_data

def handle_token_labels(exp_label: str,
                        skill_label: str,
                        edu_label: str,
                        tokens: List[str],
                        start: int,
                        pos_tags,
                        skills_items: List[str],
                        edu_items: List[str],
                        exp_items: List[str]) -> List[str]:
    extracted_skills_item = extract_full_item(start, tokens, skills_items)
    extracted_edus_item = extract_full_item(start, tokens, edu_items)
    extracted_exps_item = extract_full_item(start, tokens, exp_items)
    # When several extractors match at the same position, the priority is
    # experience, then skill, then education.
    if extracted_exps_item != "":
        return append_to_coll_data(extracted_exps_item, start, pos_tags, exp_label)
    if extracted_skills_item != "":
        return append_to_coll_data(extracted_skills_item, start, pos_tags, skill_label)
    if extracted_edus_item != "":
        return append_to_coll_data(extracted_edus_item, start, pos_tags, edu_label)
    # No entity starts here: emit the token with the outside label 'O'.
    pos = pos_tags[start][1]
    return [f"{tokens[start]}\t{pos}\t{pos_chunk_mapping.get(pos, 'O')}\tO"]

def loop_on_chunks(chunk: List[List[str]]) -> List[str]:
    conll_data = []
    for (idx, row) in enumerate(chunk):
        print(f"on line: {idx}")
        full_description_lower = row[2].lower()  # Full description
        full_description_lower_splitted: List[str] = full_description_lower.split(".")
        lines_tokens = [nltk.word_tokenize(line) for line in full_description_lower_splitted]
        for (line, tokens) in zip(full_description_lower_splitted, lines_tokens):
            skills: List[str] = extract_skills(job_description=line) or []
            educations: List[str] = extract_education(job_description=line) or []
            experiences: List[str] = extract_experience(job_description=line) or []
            pos_tags = nltk.pos_tag(tokens)
            jdx = 0
            while jdx < len(tokens):
                conll_data_result = handle_token_labels(exp_label="EXP",
                                                        skill_label="SKILL",
                                                        edu_label="EDU",
                                                        tokens=tokens,
                                                        start=jdx,
                                                        pos_tags=pos_tags,
                                                        skills_items=skills,
                                                        edu_items=educations,
                                                        exp_items=experiences)
                # Put every row on its own line, including the rows of a multi-word entity.
                conll_data += ['\n' + '\n'.join(conll_data_result)]
                # Skip past all tokens consumed by the emitted rows.
                jdx += len(conll_data_result)
            # End the last row; together with the next leading newline this
            # leaves a blank line between sentences.
            conll_data += ['\n']
    return conll_data

def generate_rows(output_path):
    N = 20  # number of CSV rows processed per chunk
    with open(dataset, 'r', newline='') as file:
        csvreader = list(csv.reader(file))
    csvreader.pop(0)  # drop the CSV header row
    # Step through all rows, including a final chunk that may be shorter than N
    chunks: List[List[List[str]]] = [csvreader[x:x + N] for x in range(0, len(csvreader), N)]
    with open(output_path, 'w') as output_file:
        output_file.write('-DOCSTART- -X- -X- O')
        output_file.write('\n')
        for (idx, chunk) in enumerate(chunks):
            lines = loop_on_chunks(chunk=chunk)
            output_file.writelines(lines)
            # Flush after each chunk so finished chunks reach the file even if the run is interrupted
            output_file.flush()
            print(f"Finished chunk number: {idx}/{len(chunks)}")

generate_rows('dataset/dataset_2.conll')

on line: 0
...
on line: 19
Finished chunk number: 0/12239
on line: 0
...
on line: 19
Finished chunk number: 1/12239
on line: 0
...
on line: 16


KeyboardInterrupt: 

### Separate Dataset Into Train, Dev, Test

After creating sample_dataset.conll,
we need to separate it into train, dev, and test sets.

In [17]:
def split_dataset(dataset_filepath, train_filepath, dev_filepath, test_filepath):
    with open(dataset_filepath, 'r') as file:
        dataset = file.readlines()

    # write_to_file adds its own header, so drop the one from the source file
    if dataset and dataset[0].startswith("-DOCSTART-"):
        dataset = dataset[1:]

    train_ratio = 0.7
    dev_ratio = 0.15
    # The remaining 0.15 of the lines form the test set

    total_samples = len(dataset)
    train_end = int(total_samples * train_ratio)
    dev_end = int(total_samples * (train_ratio + dev_ratio))

    train_set = dataset[:train_end]
    dev_set = dataset[train_end:dev_end]
    test_set = dataset[dev_end:]

    write_to_file(train_filepath, train_set)
    write_to_file(dev_filepath, dev_set)
    write_to_file(test_filepath, test_set)

def write_to_file(file_name, data):
    # Every split file starts with the CoNLL document header
    with open(file_name, 'w') as file:
        file.write("-DOCSTART- -X- -X- O")
        file.write("\n")
        file.writelines(data)

dataset_filepath = "dataset/sample_dataset.conll"
train_filepath = "dataset/split_dataset/conllpp_train.txt"
dev_filepath = "dataset/split_dataset/conllpp_dev.txt"
test_filepath = "dataset/split_dataset/conllpp_test.txt"
split_dataset(dataset_filepath, train_filepath, dev_filepath, test_filepath)
print("Data has been split!")

Data has been split!
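
The split above cuts the file by line count, so the boundary between two files can fall in the middle of a sentence. A possible alternative, shown here only as a sketch, is to group the lines into sentences first (blank lines are the boundaries) and split the list of sentences instead. The helper names and the integer percentage parameters are assumptions made for this illustration; this is not the split used for the results below.

```python
from typing import List, Tuple

def split_sentences(lines: List[str]) -> List[List[str]]:
    """Group CoNLL lines into sentences, using blank lines as boundaries."""
    sentences, current = [], []
    for line in lines:
        if line.strip() == "":
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("-DOCSTART-"):
            current.append(line)
    if current:
        sentences.append(current)
    return sentences

def split_by_sentence(sentences: List[List[str]],
                      train_pct: int = 70,
                      dev_pct: int = 15) -> Tuple[list, list, list]:
    """Split whole sentences into train/dev/test; the rest after train and dev is test."""
    train_end = len(sentences) * train_pct // 100
    dev_end = len(sentences) * (train_pct + dev_pct) // 100
    return sentences[:train_end], sentences[train_end:dev_end], sentences[dev_end:]

# Toy input: a header followed by 20 two-token sentences.
lines = ["-DOCSTART- -X- -X- O\n", "\n"]
for i in range(20):
    lines += [f"tok{i}a\tNN\tB-NP\tO\n", f"tok{i}b\tNN\tB-NP\tO\n", "\n"]

sentences = split_sentences(lines)
train, dev, test = split_by_sentence(sentences)
print(len(train), len(dev), len(test))
```

When writing such splits back to disk, each sentence would need a blank line after it so that the sentence boundaries are preserved.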


## Models

### LM-LSTM-CRF
Github link: https://github.com/LiyuanLucasLiu/LM-LSTM-CRF/blob/master
Readme: https://github.com/LiyuanLucasLiu/LM-LSTM-CRF/blob/master/README.md

#### NLP Progress
- LM-LSTM-CRF (Liu et al., 2018)	
- F1: 91.24	

#### Description
![image.png](attachment:image.png)
As visualized above, they use conditional random field (CRF) to capture label dependencies, and adopt a hierarchical LSTM to leverage both char-level and word-level inputs. The char-level structure is further guided by a language model, while pre-trained word embeddings are leveraged in word-level. The language model and the sequence labeling model are trained at the same time, and both make predictions at word-level. Highway networks are used to transform the output of char-level LSTM into different semantic spaces, and thus mediating these two tasks and allowing language model to empower sequence labeling.

***NOTE - Please follow the instructions in the README file for the specific library versions.***

***NOTE - Download the embeddings file glove.6B.100d.txt and place it at ./embeddings/glove.6B.100d.txt.
Then run the code below to train the model and get the F1 score.***

In [51]:
## Run the model
## NOTE - if you get an error in the set_device function from torch:
##        1. Remove the set_device call from the code in the repo;
##           we do not use a GPU here because the dataset is small.
##        2. Remove .cuda() from every element that uses it (e.g. model.cuda() -> model).
!python3 models/LM-LSTM-CRF/train_wc.py --train_file dataset/split_dataset/conllpp_train.txt --dev_file dataset/split_dataset/conllpp_dev.txt --test_file dataset/split_dataset/conllpp_test.txt --caseless --fine_tune --high_way --co_train --least_iters 100 --emb_file embeddings/glove.6B.100d.txt --checkpoint models/LM-LSTM-CRF/checkpoint


setting:
Namespace(batch_size=10, caseless=True, char_dim=30, char_hidden=300, char_layers=1, checkpoint='models/LM-LSTM-CRF/checkpoint', clip_grad=5.0, co_train=True, dev_file='dataset/split_dataset/conllpp_dev.txt', drop_out=0.55, emb_file='embeddings/glove.6B.100d.txt', epoch=200, eva_matrix='fa', fine_tune=False, gpu=0, high_way=True, highway_layers=1, lambda0=1, least_iters=100, load_check_point='', load_opt=False, lr=0.015, lr_decay=0.05, mini_count=5, momentum=0.9, patience=15, rand_embedding=False, shrink_embedding=False, small_crf=True, start_epoch=0, test_file='dataset/split_dataset/conllpp_test.txt', train_file='dataset/split_dataset/conllpp_train.txt', unk='unk', update='sgd', word_dim=100, word_hidden=300, word_layers=1)
loading corpus
constructing coding table
feature size: '3394'
loading embedding
embedding size: '400133'
constructing dataset
building model
  tg_energy = tg_energy.masked_select(mask).sum()
  partition.masked_scatter_(mask_idx, cur_partition.masked_select

TEST : total : test_f1: 0.8533 test_rec: 0.7958 test_pre: 0.9199 test_acc: 0.9686 | 

TEST : O : test_f1: 0.9848 test_rec: 0.9946 test_pre: 0.9752 test_acc: 0.0000 | {'I-SKILL': 128, 'B-SKILL': 234}

TEST : B-SKILL : test_f1: 0.9079 test_rec: 0.8807 test_pre: 0.9368 test_acc: 0.0000 | {'O': 751, 'I-SKILL': 133}

TEST : I-SKILL : test_f1: 0.4795 test_rec: 0.3616 test_pre: 0.7113 test_acc: 0.0000 | {'O': 930, 'B-SKILL': 205}

TEST : B-EDU : test_f1: 0.1053 test_rec: 0.0556 test_pre: 1.0000 test_acc: 0.0000 | {'O': 16, 'B-SKILL': 1}

(loss: 2.1198, epoch: 3, dev F1 = 0.8705, dev acc = 0.9738, F1 on test = 0.8533, acc on test= 0.9686), saving...
epoch: 3	 in 200 take: 9412.377148866653 s
DEV : total : dev_f1: 0.8737 dev_rec: 0.8210 dev_pre: 0.9336 dev_acc: 0.9749 | 

DEV : O : dev_f1: 0.9877 dev_rec: 0.9959 dev_pre: 0.9797 dev_acc: 0.0000 | {'I-SKILL': 114, 'B-SKILL': 165}

DEV : B-SKILL : dev_f1: 0.9281 dev_rec: 0.9037 dev_pre: 0.9539 dev_acc: 0.0000 | {'O': 583, 'I-SKILL': 97}

DEV : I-S

DEV : total : dev_f1: 0.9065 dev_rec: 0.8737 dev_pre: 0.9420 dev_acc: 0.9806 | 

DEV : O : dev_f1: 0.9905 dev_rec: 0.9955 dev_pre: 0.9856 dev_acc: 0.0000 | {'I-SKILL': 124, 'B-SKILL': 178, 'B-EDU': 1}

DEV : B-SKILL : dev_f1: 0.9531 dev_rec: 0.9503 dev_pre: 0.9560 dev_acc: 0.0000 | {'O': 295, 'I-SKILL': 56}

DEV : I-SKILL : dev_f1: 0.6095 dev_rec: 0.4875 dev_pre: 0.8129 dev_acc: 0.0000 | {'B-SKILL': 130, 'O': 691, 'B-EDU': 1}

DEV : B-EDU : dev_f1: 0.8235 dev_rec: 0.8750 dev_pre: 0.7778 dev_acc: 0.0000 | {'B-SKILL': 1}

TEST : total : test_f1: 0.8955 test_rec: 0.8612 test_pre: 0.9326 test_acc: 0.9769 | 

TEST : O : test_f1: 0.9890 test_rec: 0.9948 test_pre: 0.9832 test_acc: 0.0000 | {'I-SKILL': 102, 'B-SKILL': 244, 'B-EDU': 3}

TEST : B-SKILL : test_f1: 0.9428 test_rec: 0.9459 test_pre: 0.9397 test_acc: 0.0000 | {'O': 334, 'I-SKILL': 67}

TEST : I-SKILL : test_f1: 0.5668 test_rec: 0.4331 test_pre: 0.8200 test_acc: 0.0000 | {'O': 803, 'B-SKILL': 205}

TEST : B-EDU : test_f1: 0.7647 test

DEV : total : dev_f1: 0.9153 dev_rec: 0.8895 dev_pre: 0.9427 dev_acc: 0.9823 | 

DEV : O : dev_f1: 0.9913 dev_rec: 0.9952 dev_pre: 0.9874 dev_acc: 0.0000 | {'I-SKILL': 159, 'B-SKILL': 164, 'B-EDU': 1}

DEV : B-SKILL : dev_f1: 0.9582 dev_rec: 0.9576 dev_pre: 0.9589 dev_acc: 0.0000 | {'O': 263, 'I-SKILL': 35, 'B-EDU': 1}

DEV : I-SKILL : dev_f1: 0.6572 dev_rec: 0.5486 dev_pre: 0.8194 dev_acc: 0.0000 | {'B-SKILL': 126, 'O': 597, 'B-EDU': 1}

DEV : B-EDU : dev_f1: 0.8421 dev_rec: 1.0000 dev_pre: 0.7273 dev_acc: 0.0000 | {}

TEST : total : test_f1: 0.9051 test_rec: 0.8776 test_pre: 0.9344 test_acc: 0.9792 | 

TEST : O : test_f1: 0.9901 test_rec: 0.9948 test_pre: 0.9854 test_acc: 0.0000 | {'I-SKILL': 137, 'B-SKILL': 211, 'B-EDU': 1}

TEST : B-SKILL : test_f1: 0.9496 test_rec: 0.9526 test_pre: 0.9465 test_acc: 0.0000 | {'O': 295, 'I-SKILL': 56}

TEST : I-SKILL : test_f1: 0.6255 test_rec: 0.5045 test_pre: 0.8229 test_acc: 0.0000 | {'O': 693, 'B-SKILL': 188}

TEST : B-EDU : test_f1: 0.8824 test

DEV : total : dev_f1: 0.9212 dev_rec: 0.9008 dev_pre: 0.9424 dev_acc: 0.9836 | 

DEV : O : dev_f1: 0.9918 dev_rec: 0.9949 dev_pre: 0.9888 dev_acc: 0.0000 | {'I-SKILL': 206, 'B-SKILL': 140, 'B-EDU': 1}

DEV : B-SKILL : dev_f1: 0.9604 dev_rec: 0.9545 dev_pre: 0.9664 dev_acc: 0.0000 | {'O': 268, 'I-SKILL': 53}

DEV : I-SKILL : dev_f1: 0.7067 dev_rec: 0.6347 dev_pre: 0.7972 dev_acc: 0.0000 | {'B-SKILL': 94, 'O': 491, 'B-EDU': 1}

DEV : B-EDU : dev_f1: 0.8889 dev_rec: 1.0000 dev_pre: 0.8000 dev_acc: 0.0000 | {}

TEST : total : test_f1: 0.9103 test_rec: 0.8879 test_pre: 0.9340 test_acc: 0.9805 | 

TEST : O : test_f1: 0.9906 test_rec: 0.9945 test_pre: 0.9867 test_acc: 0.0000 | {'I-SKILL': 187, 'B-SKILL': 181, 'B-EDU': 1}

TEST : B-SKILL : test_f1: 0.9530 test_rec: 0.9509 test_pre: 0.9551 test_acc: 0.0000 | {'O': 291, 'I-SKILL': 73}

TEST : I-SKILL : test_f1: 0.6688 test_rec: 0.5759 test_pre: 0.7975 test_acc: 0.0000 | {'O': 604, 'B-SKILL': 150}

TEST : B-EDU : test_f1: 0.8824 test_rec: 0.8333 

##### LM-LSTM-CRF F1 score is approximately 91 (test_f1: 0.9103 in the last test evaluation logged above)
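
The F1 values in the log can be sanity-checked from the logged precision and recall, since F1 is their harmonic mean. The sketch below uses the standard formula, which is assumed (not confirmed from the repository code) to be what the training script reports. The numbers are copied from the last TEST evaluation shown above, and because they are rounded to four decimals the recomputed values can differ in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, logged F1) copied from the last TEST evaluation in the log above.
logged = {
    "total": (0.9340, 0.8879, 0.9103),
    "B-SKILL": (0.9551, 0.9509, 0.9530),
    "I-SKILL": (0.7975, 0.5759, 0.6688),
}

for label, (precision, recall, logged_f1) in logged.items():
    computed = f1_score(precision, recall)
    print(f"{label:8} computed F1 = {computed:.4f}, logged F1 = {logged_f1:.4f}")
```

The gap between B-SKILL and I-SKILL comes mostly from recall: continuation tokens of multi-word skills are missed far more often than first tokens.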


### FlairNLP
Github link: https://github.com/flairNLP/flair
Readme: https://github.com/flairNLP/flair/blob/master/README.md

#### NLP Progress
- Flair embeddings (Akbik et al., 2018)
- F1: 93.09
- Description: Contextual String Embeddings for Sequence Labeling

#### Description

A powerful NLP library. Flair allows you to apply state-of-the-art natural language processing (NLP) models, such as named entity recognition (NER), sentiment analysis, part-of-speech tagging (PoS), special support for biomedical data, sense disambiguation and classification, with support for a rapidly growing number of languages.

A text embedding library. Flair has simple interfaces that allow you to use and combine different word and document embeddings, including the proposed Flair embeddings and various transformers.

A PyTorch NLP framework.

In [None]:
# Install requirements
!pip3 install flair

In [1]:
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings

columns = {0 : 'text', 3: 'ner'}

# 1. get the corpus
corpus: Corpus = ColumnCorpus("dataset/split_dataset", columns,
 train_file = 'conllpp_train.txt',
 test_file = 'conllpp_test.txt',
 dev_file = 'conllpp_dev.txt')

# 2. what label do we want to predict?
tag_type = 'ner'

# 3. make the label dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary)

# 4. initialize embeddings
embedding_types = [

    # GloVe embeddings
    WordEmbeddings('glove'),

    # contextual string embeddings, forward
    FlairEmbeddings('news-forward'),

    # contextual string embeddings, backward
    FlairEmbeddings('news-backward'),
]

embeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type)

# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)

# 7. run training
trainer.train('resources/taggers/ner-english',
              train_with_dev=True,
              max_epochs=15)


2023-06-03 10:35:10,187 Reading data from dataset/split_dataset
2023-06-03 10:35:10,187 Train: dataset/split_dataset/conllpp_train.txt
2023-06-03 10:35:10,187 Dev: dataset/split_dataset/conllpp_dev.txt
2023-06-03 10:35:10,187 Test: dataset/split_dataset/conllpp_test.txt
Corpus: 14999 train + 3086 dev + 3096 test sentences
<flair.data.Dictionary object at 0x1077d0f70>
2023-06-03 10:35:15,399 ----------------------------------------------------------------------------------------------------
2023-06-03 10:35:15,399 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, inplace=False)
        (encoder): Embedding(300, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=300, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.05, in

  word_embedding = torch.FloatTensor(word_embedding)


2023-06-03 10:35:30,898 epoch 1 - iter 0/566 - loss 166.34434509 - samples/sec: 115.72
2023-06-03 10:44:16,001 epoch 1 - iter 56/566 - loss 21.31661368 - samples/sec: 3.41
2023-06-03 10:52:22,960 epoch 1 - iter 112/566 - loss 16.29596274 - samples/sec: 3.68
2023-06-03 11:02:04,450 epoch 1 - iter 168/566 - loss 14.16415755 - samples/sec: 3.08
2023-06-03 11:10:54,881 epoch 1 - iter 224/566 - loss 12.98463648 - samples/sec: 3.38
2023-06-03 11:19:05,705 epoch 1 - iter 280/566 - loss 11.99075442 - samples/sec: 3.65
2023-06-03 11:27:33,767 epoch 1 - iter 336/566 - loss 11.22239269 - samples/sec: 3.53
2023-06-03 11:35:41,562 epoch 1 - iter 392/566 - loss 10.63254863 - samples/sec: 3.67
2023-06-03 11:44:23,673 epoch 1 - iter 448/566 - loss 10.14448355 - samples/sec: 3.43
2023-06-03 11:51:52,684 epoch 1 - iter 504/566 - loss 9.73219645 - samples/sec: 3.99
2023-06-03 12:01:26,362 epoch 1 - iter 560/566 - loss 9.44994843 - samples/sec: 3.12
2023-06-03 12:01:53,881 --------------------------------

2023-06-03 18:05:19,550 epoch 7 - iter 280/566 - loss 3.11962576 - samples/sec: 7.07
2023-06-03 18:09:26,668 epoch 7 - iter 336/566 - loss 3.15338746 - samples/sec: 7.25
2023-06-03 18:13:46,588 epoch 7 - iter 392/566 - loss 3.12498777 - samples/sec: 6.90
2023-06-03 18:17:58,401 epoch 7 - iter 448/566 - loss 3.15919635 - samples/sec: 7.12
2023-06-03 18:21:53,540 epoch 7 - iter 504/566 - loss 3.13143735 - samples/sec: 7.62
2023-06-03 18:26:14,769 epoch 7 - iter 560/566 - loss 3.10467968 - samples/sec: 6.86
2023-06-03 18:26:33,011 ----------------------------------------------------------------------------------------------------
2023-06-03 18:26:33,013 EPOCH 7 done: loss 3.1016 - lr 0.1000
2023-06-03 18:26:33,013 BAD EPOCHS (no improvement): 0
2023-06-03 18:26:33,014 ----------------------------------------------------------------------------------------------------
2023-06-03 18:26:35,973 epoch 8 - iter 0/566 - loss 2.14779186 - samples/sec: 606.17
2023-06-03 19:52:10,907 epoch 8 - iter

2023-06-03 23:59:15,081 epoch 13 - iter 560/566 - loss 2.39789231 - samples/sec: 7.06
2023-06-03 23:59:28,321 ----------------------------------------------------------------------------------------------------
2023-06-03 23:59:28,321 EPOCH 13 done: loss 2.3944 - lr 0.1000
2023-06-03 23:59:28,321 BAD EPOCHS (no improvement): 0
2023-06-03 23:59:28,323 ----------------------------------------------------------------------------------------------------
2023-06-03 23:59:33,268 epoch 14 - iter 0/566 - loss 2.45629406 - samples/sec: 362.53
2023-06-04 00:03:22,931 epoch 14 - iter 56/566 - loss 2.21783257 - samples/sec: 7.80
2023-06-04 00:08:02,785 epoch 14 - iter 112/566 - loss 2.15376104 - samples/sec: 6.40
2023-06-04 00:11:53,876 epoch 14 - iter 168/566 - loss 2.26971875 - samples/sec: 7.76
2023-06-04 00:15:56,764 epoch 14 - iter 224/566 - loss 2.24642156 - samples/sec: 7.38
2023-06-04 00:19:52,061 epoch 14 - iter 280/566 - loss 2.26373967 - samples/sec: 7.62
2023-06-04 00:23:53,009 epoch 1

2023-06-04 01:37:19,716 ----------------------------------------------------------------------------------------------------


{'test_score': 0.8843,
 'dev_score_history': [],
 'train_loss_history': [9.41505411741169,
  5.404614973194607,
  4.421184256725513,
  3.9064887288181183,
  3.566614944395665,
  3.2917389246263267,
  3.1015725256908064,
  2.9049165091961098,
  2.790922972107102,
  2.6592265501249805,
  2.576242336917787,
  2.478924518233896,
  2.3944408998893767,
  2.3174353961177934,
  2.243378801922916],
 'dev_loss_history': []}

### Calculate results and make benchmark

![image.png](attachment:image.png)

As the comparison shows, LM-LSTM-CRF performs better than Flair on the job-description dataset in IOB format: its last logged test F1 is 0.9103, against a test score of 0.8843 for Flair.
