<H3 style="text-align:center;">🍀 IT skill NER Project using spaCy 🍀</H3>

<b style="font-size :20px;">Step 1 :</b>
<span style="font-size :18px;">Download the necessary modules and libraries 🌾</span>

<span style="font-size: 16px;">1. Check Python's version</span>

In [207]:
! python --version

Python 3.9.13


<span style="font-size: 16px;">2. Install PyTorch for CUDA 11.6</span>

In [None]:
! pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116


In [1]:
import torch
print("Version :", torch.__version__)
print("Is available :", torch.cuda.is_available())
print("Device count :", torch.cuda.device_count())
print("Current device :", torch.cuda.current_device())
print("Info device 0 :", torch.cuda.device(0))
print("Get device 0 name :", torch.cuda.get_device_name(0))


Version : 2.0.0+cu118
Is available : True
Device count : 1
Current device : 0
Info device 0 : <torch.cuda.device object at 0x7fa9d045ebb0>
Get device 0 name : NVIDIA GeForce RTX 3090


<span style="font-size: 16px;">3. Install spaCy with the extras for your CUDA version and transformers.</span>


In [None]:
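# "! export" runs in a throwaway subshell and does not allow spaces around "=",
# so set the variable for the kernel with %env instead.
%env CUDA_PATH=/opt/nvidia/cuda
# Quote the extras so the shell does not split or expand the brackets.
! pip install -U "spacy[cuda116,transformers]"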

In [None]:
! conda install -c conda-forge cupy

<span style="font-size: 16px;">4. Download spaCy's trained transformer-based pipeline: en_core_web_trf</span>


In [None]:
!python -m spacy download en_core_web_trf


<b style="font-size :20px;">Step 2 :</b><br>
<span style="font-size :18px;">Use EntityRuler with a skill dictionary to assign labels automatically and produce training data for the IT NER model</span>

<span style="font-size: 16px;">1. Import pandas, spaCy, and the spaCy API classes</span><br>
<span style="font-size: 16px;">spaCy API: tokens.Doc, tokens.Token, tokens.Span, tokens.DocBin</span>

In [3]:
import pandas as pd
import spacy
print(spacy.__version__)
from spacy.tokens import Doc, DocBin, Span, Token
from tqdm import tqdm
import re

3.5.2


<span style="font-size: 16px;">2. Use the GPU to train the IT skill NER model with spaCy</span>

In [4]:
# Allocate data and perform operations on GPU. Will raise an error if no GPU is available.
# Use the GPU, with memory allocations directed via PyTorch.
# This prevents out-of-memory errors that would otherwise occur from competing
# memory pools.
from thinc.api import set_gpu_allocator, require_gpu, set_active_gpu
#set_gpu_allocator("pytorch")
gpu = require_gpu(0)
#set_active_gpu(0)
gpu


True

<span style="font-size: 16px;">3. Load the en_core_web_trf model and add the "entity_ruler" component</span>

In [3]:
nlp = spacy.load("en_core_web_trf")
nlp.pipe_names

['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [42]:
skill_pattern_path = "./data/EntityRuler_Patterns.jsonl"

if "entity_ruler" in nlp.pipe_names:
    ruler = nlp.get_pipe("entity_ruler")
else:
    ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.from_disk(skill_pattern_path)
nlp.pipe_names


['transformer',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'entity_ruler',
 'ner']
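
The contents of EntityRuler_Patterns.jsonl are not shown in this notebook. As a rough sketch of the file format only: spaCy's EntityRuler reads one JSON object per line, each with a "label" and a "pattern" (a plain string for an exact phrase, or a list of token-attribute dicts). The two patterns and the helper name `write_patterns` below are made-up examples, not the project's actual skill dictionary.

```python
import json
from pathlib import Path

# Hypothetical patterns for illustration; the real dictionary is not shown here.
example_patterns = [
    {"label": "HARD-SKILL", "pattern": "Python"},
    {"label": "HARD-SKILL", "pattern": [{"LOWER": "machine"}, {"LOWER": "learning"}]},
]


def write_patterns(patterns, path):
    """Write patterns as JSON Lines: one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for pattern in patterns:
            f.write(json.dumps(pattern) + "\n")
```

A file written this way, for example with `write_patterns(example_patterns, "./data/my_patterns.jsonl")`, should be loadable with `ruler.from_disk(...)` in the same way as the cell above.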

In [25]:
from pathlib import Path
def create_config(model_name: str, component_to_update: str, output_path: Path):
    nlp = spacy.load(model_name)

    # create a new config as a copy of the loaded pipeline's config
    config = nlp.config.copy()

    # revert most training settings to the current defaults
    default_config = spacy.blank(nlp.lang).config
    config["corpora"] = default_config["corpora"]
    config["training"]["logger"] = default_config["training"]["logger"]

    # copy tokenizer and vocab settings from the base model, which includes
    # lookups (lexeme_norm) and vectors, so they don't need to be copied or
    # initialized separately
    config["initialize"]["before_init"] = {
        "@callbacks": "spacy.copy_from_base_model.v1",
        "tokenizer": model_name,
        "vocab": model_name,
    }
    config["initialize"]["lookups"] = None
    config["initialize"]["vectors"] = None

    # source all components from the loaded pipeline and freeze all except the
    # component to update; replace the listener for the component that is
    # being updated so that it can be updated independently
    config["training"]["frozen_components"] = []
    for pipe_name in nlp.component_names:
        if pipe_name != component_to_update:
            config["components"][pipe_name] = {"source": model_name}
            config["training"]["frozen_components"].append(pipe_name)
        else:
            config["components"][pipe_name] = {
                "source": model_name,
                "replace_listeners": ["model.tok2vec"],
            }

    # save the config
    config.to_disk(output_path)

In [26]:
create_config("en_core_web_trf","ner",'./ner_config.cfg')

<b style="font-size :20px;">Step 3 :</b><br>
<span style="font-size :18px;">POS tagging, noun phrases, and skill combination rules</span>

In [None]:
index_ex = ["Senior","Junior","Trainee","Internship","Fresher","Sr","Jr"]
skill_phrase = ["software","algorithm","library", "model", "tool","module","platform","method","equipment","component","testing","engine",
                "management", "development", "methodology", "certify", "certification","programming","analysis","application","analysis","research",
                "technology", "technical", "technique", "language", "infrastructure", "design", "system","systems","network","networks","web","code","process"]
word_ing = ["overseeing","developing","transforming","building","implementing","analyzing","executing","writing","deploying","troubleshooting",
                "managing","configuring","designing","maintaining","evaluating","running","monitoring","solving","resolving","leveraging","standardizing",
                "operating","establishing","integrating","optimizing","coding","learning","updating","securing","fixing","training","testing",
                "utilizing","recommending","producing","defining","ng","mining","creating","supporting"]
word_after_remove = ["year","years","experience","knowledge","excellent"]
word_ing_remove = ["including","seeking","using","looking","following"]

In [5]:
def ing_rules(doc:Doc, idx: int) -> int:
    """
    Input:  doc: Doc
            idx: int -> index of the -ing verb (from the word_ing list) directly in front of the noun phrase
    Return: number of tokens, counted backwards from and including idx, that form one chain
            of -ing words joined by separators such as "and", "or" or ","; 0 if idx is 0
    """
    t = 1
    remove = ["including","seeking","using","looking"]
    if (idx-t) < 0:
        return 0
    while True :
        if (idx-t) > 0 and re.search("ing$",doc[idx-t].text) and doc[idx-t].text not in remove: 
            t += 1
        elif (idx-t-1) > 0 and doc[idx-t].text in ["and","or",",","/","&","and/or"] and re.search("ing$",doc[idx-t-1].text):
            t += 2
        elif (idx-t-2) > 0 and doc[idx-t].text in ["and","or"] and doc[idx-t-1].text == "," and re.search("ing$",doc[idx-t-2].text):
            t += 3
        else : 
            return t
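
To make the return value of `ing_rules` easier to follow, here is the same scan transcribed onto a plain list of strings instead of a spaCy `Doc`. The name `ing_run_length` and the sample sentence are assumptions for illustration; the branching mirrors the function above.

```python
import re

SEPARATORS = ["and", "or", ",", "/", "&", "and/or"]
SKIPPED = ["including", "seeking", "using", "looking"]


def ing_run_length(tokens: list, idx: int) -> int:
    """Count the tokens, ending at tokens[idx], that form one chain of -ing words."""
    t = 1
    if idx - t < 0:
        return 0
    while True:
        if idx - t > 0 and re.search("ing$", tokens[idx - t]) and tokens[idx - t] not in SKIPPED:
            t += 1  # another -ing word directly before
        elif idx - t - 1 > 0 and tokens[idx - t] in SEPARATORS and re.search("ing$", tokens[idx - t - 1]):
            t += 2  # separator plus the -ing word before it
        elif (idx - t - 2 > 0 and tokens[idx - t] in ["and", "or"]
              and tokens[idx - t - 1] == "," and re.search("ing$", tokens[idx - t - 2])):
            t += 3  # ", and" / ", or" plus the -ing word before it
        else:
            return t


tokens = ["for", "designing", ",", "and", "building", "software"]
t = ing_run_length(tokens, 4)        # idx 4 is "building"
chain = tokens[4 - t + 1 : 4 + 1]    # the chain starts at idx - t + 1
```

With this input the chain covers "designing , and building", which is why `get_list_ents` starts its span at `doc[chunk[0].i - t]`.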

In [6]:
def check_token_to_add(token: Token, 
                         skill_phrase: list, 
                         word_ing: list, 
                         index_ex: list, 
                         token_in_noun_phrase: list) -> bool:
    if token.pos_ in ["NOUN","PROPN"] :
        return True
    if token.ent_type_ == "HARD-SKILL" :
        return True
    if token.lemma_ in skill_phrase :
        return True
    if token.lower_ in word_ing :
        return True
    if token.text in index_ex :
        return True
    if token in token_in_noun_phrase:
        return True
    return False

In [7]:
def check_token_break(token: Token) -> bool:
    if token.pos_ in ["PRON","DET","ADV","CCONJ","NUM"]:
        return True
    elif (not re.search("[a-zA-Z]",token.text)) :
        return True
    elif token.lower_ in ["strong","existing","experience","knowledge","excellent","new","other","that","full","e.g.","good","necessary","and","or","&","year","years"]:
        return True
    return False

In [8]:
def rule_dash(doc:Doc, index : int)-> bool:
    if 0 < index < len(doc) - 1 and doc[index].text == "-":
        if doc.text[doc[index].idx-1] != " " and doc.text[doc[index].idx+1] != " " :
            return True
    return False
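
`rule_dash` treats a hyphen as part of a compound only when there is no space on either side of it in the raw text. The same check on a plain string might look like the sketch below; `is_tight_hyphen` is a hypothetical helper that works on character positions rather than token indices.

```python
def is_tight_hyphen(text: str, pos: int) -> bool:
    """True if text[pos] is '-' with a non-space character directly on both sides."""
    if not 0 < pos < len(text) - 1 or text[pos] != "-":
        return False
    return text[pos - 1] != " " and text[pos + 1] != " "


compound = "front-end developer"
title = "Engineer - Remote"
```

Here the hyphen in `compound` would count as a joining dash, while the one in `title` would not.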

In [9]:
def check_if_roman_numeral(numeral):
    numeral = {c for c in numeral}
    validRomanNumerals = {c for c in "XVI"}
    return not numeral - validRomanNumerals
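
Note that `check_if_roman_numeral` accepts any string made only of the characters X, V and I, so inputs such as "VX" or "IIII" pass, and so does the empty string. If stricter matching is wanted, one possible variant (an assumption, not part of the original rules) validates numerals from I to XXXIX with a regular expression.

```python
import re

# Tens (up to XXX) followed by units (I..IX); covers 1 to 39 only.
_ROMAN_1_TO_39 = re.compile(r"X{0,3}(IX|IV|V?I{0,3})")


def is_small_roman_numeral(text: str) -> bool:
    """True for well-formed uppercase Roman numerals between I and XXXIX."""
    return bool(text) and _ROMAN_1_TO_39.fullmatch(text) is not None
```

The pronoun "I" still matches, so a caller may want to combine this check with the token's part of speech.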

In [10]:
def remove_unnecessary_word_in_chunk(doc: Doc, chunk: Span, word_ing: bool = True) -> Span:
    """
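    Remove unnecessary leading words from a chunk.
    Leading tokens are dropped while check_token_break() is true: POS in
    ["PRON","DET","ADV","CCONJ","NUM"], tokens without letters, and some exception words.

    Input:  chunk: Span -> one noun phrase from doc.noun_chunks
    Return: Span with the unnecessary leading words removed, or None if nothing useful remains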
    """
    len_chunk = len(chunk)
    start_char = chunk.start_char
    pre = 0
    while True :
        if len_chunk == 0 :
            break
        elif check_token_break(chunk[pre]):
            if chunk[pre].i < (len(doc)-1) and doc.text[start_char + len(chunk[pre].text)] != " ":
                start_char = start_char + len(chunk[pre].text)
            else:
                start_char = start_char + len(chunk[pre].text) + 1
            pre += 1
            len_chunk -= 1 
        else:
            break
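    if len_chunk == 0 :
        return None
    span = doc.char_span(start_char,chunk.end_char)
    # char_span returns None when the offsets do not align with token boundaries
    if span is None :
        return None
    if word_ing :
        if len_chunk == 1 and span[0].pos_ not in ["NOUN","PROPN"] :
            return None
    else:
        # "[A-Z]" is a character class; the previous pattern "A-Z" only matched the literal text "A-Z"
        if len_chunk == 1 and (not re.search("[A-Z]",span.text)) :
            return None
    return span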

In [11]:
def list_tokens_in_noun_chunks(doc: Doc)-> list:
    """
    Remove unnecessary words in each chunk
    Then, get all tokens in nouns_chunk

    Input:  doc: Doc
    Return: list -> list contain all tokens in nouns_chunk
    """
    token_list = []
    for chunk in doc.noun_chunks:
        span = remove_unnecessary_word_in_chunk(doc,chunk)
        if span != None :
            for token in span :
                token_list.append(token)
    return token_list

In [None]:
def remove_element_duplicate(list_entities:list) -> list:
    # remove duplicate elements in list
    list_entities = list(set(list_entities))
    list_entities = sorted(list_entities, key=lambda a: a[0])

    # if len(list_ents)  <= 1 then no need to check
    if len(list_entities) > 1:
        list_remove_ent = []
        # iterate each element in array list_ents
        for tuple_ent in list_entities:
            # check each element in the array against all other elements in the array
            for check in list_entities:
                if check != tuple_ent:
                    if tuple_ent[0] >= check[0] and tuple_ent[1] <= check[1]:
                        list_remove_ent.append(tuple_ent)
                        break
                
        list_entities = [ent for ent in list_entities if ent not in list_remove_ent]

    if len(list_entities) > 1:
        list_remove_ent = []
        list_add_ent = []
        # iterate each element in array list_ents
        for tuple_ent in list_entities:
            # check each element in the array against all other elements in the array
            for check in list_entities:
                if check != tuple_ent:
                    if tuple_ent[0] <= check[0] and check[0] <= tuple_ent[1] and tuple_ent[1] <= check[1]:
                        tuple_ent_new = (tuple_ent[0], check[1], "HARD-SKILL")
                        list_add_ent.append(tuple_ent_new)
                        list_remove_ent.append(tuple_ent)
                        list_remove_ent.append(check)

        list_entities = [ent for ent in list_entities if ent not in list_remove_ent]
        for ent in list_add_ent:
            list_entities.append(ent)
    return list_entities
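
`remove_element_duplicate` first drops spans contained in another span and then merges overlapping pairs. A common alternative for this kind of clean-up is a single sort-and-sweep pass, sketched below. This is a different technique from the function above, `merge_spans` is a hypothetical name, and merged spans are relabelled with the `label` argument.

```python
def merge_spans(spans, label="HARD-SKILL"):
    """Merge (start, end, label) tuples that overlap or touch, in one sorted pass."""
    merged = []
    for start, end, span_label in sorted(set(spans)):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it.
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            merged.append((start, end, span_label))
    return merged


spans = [(8, 15, "HARD-SKILL"), (0, 10, "HARD-SKILL"),
         (2, 5, "HARD-SKILL"), (20, 25, "HARD-SKILL")]
result = merge_spans(spans)
```

Because the spans are sorted by start offset, contained spans and chains of overlapping spans are both handled in the same pass.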

In [None]:
def add_other_ents(doc:Doc, list_entities:list, is_comma_rule:bool) -> list:
    list_label = ["ORG", "LANGUAGE", "GPE","TIME", "MONEY"]
    if is_comma_rule :
        list_label.append("HARD-SKILL")
    for ent in doc.ents:
        if ent.label_ in list_label:
            check = True
            for start_char, end_char, _ in list_entities:
                for i in range(ent.start_char, ent.end_char):
                    if i >= start_char and i <= end_char:
                        check = False
                        break
            if check:
                tuplee = (ent.start_char, ent.end_char, ent.label_)
                list_entities.append(tuplee)
    list_entities = sorted(list_entities, key=lambda a: a[0])
    return list_entities

In [None]:
def repeat_pre_token(doc:Doc,token:Token,token_noun_chunks:list) -> int:
    i = 0
    if rule_dash(doc,index=token.i-1):
        i = 3
    elif rule_dash(doc,index=token.i):
        i = 2
    elif re.search("ing$",token.text) \
        and token.pos_ == "VERB" \
        and token.text not in word_ing_remove:
        i = ing_rules(doc = doc, idx = token.i)
    elif check_token_to_add(token,skill_phrase,word_ing,index_ex,token_noun_chunks) \
        and token.lower_ not in word_after_remove:
        i = 1
    return i

In [None]:
def repeat_after_token(doc:Doc,token:Token,token_noun_chunks:list) -> int:
    # Number of tokens, starting at `token`, to append to the span; 0 stops the scan.
    i = 0
    if rule_dash(doc,index=token.i+1):
        i = 3
    elif rule_dash(doc,index=token.i):
        i = 2
    elif token.text == "(" \
        and token.i + 2 < len(doc) \
        and re.search("[A-Z]",doc[token.i + 1].text) \
        and doc[token.i + 1].lower_ not in ["required","preferred"] \
        and doc[token.i + 2].text == ")" :
        i = 3
    elif check_if_roman_numeral(token.text):
        i = 1
    elif (token.like_num and (not re.search("[a-zA-Z]",token.text))):
        if (token.i+1)<len(doc) and doc[token.i+1].text == "+" :
            i = 2
        else:
            i = 1
    elif check_token_to_add(token,skill_phrase,word_ing,index_ex,token_noun_chunks)\
        and token.lower_ not in word_after_remove:
        i = 1
    # return on every path; previously only the last branch returned a value
    return i

In [15]:
def get_list_ents(doc:Doc) -> list:
    list_entities = []

    # rule _ing (*)
    for chunk in doc.noun_chunks:
        span = None
        if doc[chunk[0].i-1].lower_ in word_ing :
            span = remove_unnecessary_word_in_chunk(doc,chunk,word_ing=True)
            if span != None :
                t = ing_rules(doc = doc, idx = chunk[0].i-1)
                span = doc.char_span(doc[chunk[0].i-t].idx, chunk.end_char)
        elif doc[chunk[0].i-1].lower_ in word_ing_remove:
            span = remove_unnecessary_word_in_chunk(doc,chunk,word_ing=False)
        if span != None :
            tuplee = (span.start_char, span.end_char, "HARD-SKILL")
            list_entities.append(tuplee)
    
    token_noun_chunks = list_tokens_in_noun_chunks(doc)
    for token in doc:
        if  token.ent_type_ == "HARD-SKILL" \
            or token.lemma_.lower() in skill_phrase \
            or token.text in index_ex : 
            pre = 1
            after = 1
            while True:
                if token.i - pre < 0:
                    break
                number = repeat_pre_token(doc,doc[token.i - pre],token_noun_chunks)
                if number != 0 and doc[token.i - pre].text != ".":
                    pre += number
                else:
                    break
            while True:
                if token.i + after >= len(doc):
                    break
                number = repeat_after_token(doc,doc[token.i + after],token_noun_chunks)
                if number != 0 and doc[token.i + after].text != ".":
                    after += number
                else:
                    break
                
            span = Span(doc, token.i - pre + 1, token.i + after, label="HARD-SKILL")
            if len(span) == 1 and span[0].ent_type_ != "HARD-SKILL":
                pass
            else:
                tuplee = (span.start_char, span.end_char, "HARD-SKILL")
                list_entities.append(tuplee)
            
    list_entities = remove_element_duplicate(list_entities)
    list_entities = add_other_ents(doc,list_entities,is_comma_rule=False)
    return list_entities

In [None]:
def append_ents_into_doc(doc:Doc,list_ents:list):
    ents_new = []
    for start,end,label in list_ents:
        span = doc.char_span(start,end,label)
        # char_span returns None when the offsets do not align with token boundaries
        if span is not None:
            ents_new.append(span)
    doc.ents = ents_new
    return doc

<b style="font-size :20px;">Step 4 :</b><br>
<span style="font-size :18px;">Read the dataset JobDescription.csv</span>

In [12]:
df = pd.read_csv("./data/JobDescription.csv", sep="\t")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28334 entries, 0 to 28333
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   JD      28334 non-null  object
dtypes: object(1)
memory usage: 221.5+ KB


In [16]:
df.head(10)

Unnamed: 0,JD
0,Machine Learning / AI Internship. Summary: As ...
1,Junior AI/ML Engineer. We are seeking a dedica...
2,Artificial Intelligence (AI) / Machine Learnin...
3,Machine Learning Engineer. Botkeeper is an aut...
4,AI/ML engineer (Artificial intelligence / Mach...
5,Artificial Intelligence Developer. Stillwater ...
6,Vision Machine Learning Engineer - US Remote. ...
7,Machine Learning Engineer - US Remote. Descrip...
8,Machine Learning Researcher. A successful star...
9,"Python, Digita: Machine Learning, Digita: Arti..."


In [13]:
import random
list_JDs = list(df["JD"])
random.shuffle(list_JDs)
list_JDs[0]

"AI/ML Systems Lead. Full Job Description: The Position. Who We Are: Our Strategic Analytics & Intelligence (SAI) team isn't just deciphering data. We're here to help solve the world's most complex healthcare challenges and improve the lives of patients. With a mix of competitive intelligence, market research, data science, advanced analytics, access, and forecasting, SAI unlocks key insights for our internal partners that ultimately benefit healthcare providers and patients. Even if you've never worked in biotech, you'll establish yourself as an expert alongside other specialists. Plus, you can gain new experiences across marketing disciplines, therapeutic areas, and commercial operations. The entire time, you'll be surrounded by a diverse and inclusive team that aims to reflect the world we serve. The data science team within SAI exists to help the CMG (Commercial, Medical and Government) organization achieve its vision by unlocking value from data quicker and more effectively. As a 

In [15]:
sum(len(job) for job in list_JDs)/len(list_JDs)

3010.3756617491354
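
Since `random.shuffle` is called above without a seed, the train/dev split made later changes on every run. If a reproducible split is wanted, a seeded shuffle is one option. This is a minimal sketch: `shuffled_split` and the seed value 42 are assumptions, not part of the original notebook.

```python
import random


def shuffled_split(items, n_train, seed=42):
    """Shuffle a copy of items with a fixed seed and split it into (train, dev)."""
    rng = random.Random(seed)   # private generator, leaves the global RNG untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


train_part, dev_part = shuffled_split(range(10), n_train=8)
```

Applied to this notebook, the call would look like `shuffled_split(df["JD"], n_train=25500)`.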

In [None]:

for jd in list_JDs[:1000]:
    doc = nlp(jd)
    
           

<b style="font-size :20px;">Step 5 :</b><br>
<span style="font-size :18px;">Testing rules</span>

<b style="font-size :18px;">Testing the _ing rule</b><br>

In [None]:
for jd in list_JDs[:1000]:
    doc = nlp(jd)
    # rule _ing
    for chunk in doc.noun_chunks:
        span = None
        if doc[chunk[0].i-1].lower_ in word_ing :
            span = remove_unnecessary_word_in_chunk(doc,chunk,word_ing=True)
            if span != None :
                t = ing_rules(doc = doc, idx = chunk[0].i-1)
                span = doc.char_span(doc[chunk[0].i-t].idx, chunk.end_char)
        elif doc[chunk[0].i-1].lower_ in ["using","seeking","looking"]:
            span = remove_unnecessary_word_in_chunk(doc,chunk,word_ing=False)
        if span != None :
            print(span)

<b style="font-size :18px;">Testing the dash rule</b><br>

In [203]:
def testing(is_rule_dash=False,is_rule_parenthesis=False,is_rule_link_num=False):
    for jd in list_JDs[:1000]:
        doc = nlp(jd)
        token_noun_chunks = list_tokens_in_noun_chunks(doc)
        for token in doc:
            check_rule_dash = False
            check_rule_parenthesis = False
            check_link_num = False
            if  token.ent_type_ == "HARD-SKILL" \
                or token.lemma_.lower() in skill_phrase \
                or token.text in index_ex : 
                pre = 1
                after = 1
                while True:
                    if token.i - pre >= 0:
                        pre_token = doc[token.i - pre]
                        if rule_dash(doc,index=token.i-pre-1):
                            pre += 3
                            check_rule_dash = True
                        elif rule_dash(doc,index=token.i-pre):
                            pre += 2
                            check_rule_dash = True
                        elif re.search("ing$",pre_token.text) \
                            and pre_token.pos_ == "VERB" \
                            and pre_token.text not in ["including","seeking","using"]:
                            pre += ing_rules(doc = doc, idx = pre_token.i)
                        elif (pre_token.pos_ in ["NOUN","PROPN"] \
                            or check_token_to_add(pre_token,skill_phrase,word_ing,index_ex,token_noun_chunks)) \
                            and pre_token.lower_ not in word_after_remove:
                            pre += 1
                        else:
                            break
                    else:
                        break

                while True:
                    if token.i + after < len(doc):
                        after_token = doc[token.i + after]
                        if rule_dash(doc,index=token.i+after+1):
                            after += 3
                            check_rule_dash = True
                        elif rule_dash(doc,index=token.i+after):
                            after += 2
                            check_rule_dash = True
                        elif after_token.text == "(":
                            if  re.search("[A-Z]",doc[token.i + after + 1].text) and \
                                doc[token.i + after + 1].lower_ not in ["required","preferred"] and \
                                doc[token.i + after + 2].text == ")" :
                                after += 3
                                check_rule_parenthesis = True
                            else:
                                break
                        elif check_if_roman_numeral(after_token.text):
                            after += 1
                            check_link_num = True
                        elif (after_token.like_num and (not re.search("[a-zA-Z]",after_token.text))):
                            if (token.i+after+1)<len(doc) and doc[token.i+after+1].text == "+" :
                                after += 1
                            after += 1
                            check_link_num = True
                        elif (after_token.pos_ in ["NOUN","PROPN"] \
                            or check_token_to_add(after_token,skill_phrase,word_ing,index_ex,token_noun_chunks))\
                                and after_token.lower_ not in word_after_remove:
                            after += 1
                        else:
                            break
                    else:
                        break
                span = Span(doc, token.i - pre + 1, token.i + after, label="HARD-SKILL")
                if len(span) == 1 and span[0].ent_type_ != "HARD-SKILL":
                    pass
                else:
                    if is_rule_dash and check_rule_dash:
                        print(span)
                    elif is_rule_parenthesis and check_rule_parenthesis:
                        print(span)
                    elif is_rule_link_num and check_link_num:
                        print(span)

In [None]:
testing(is_rule_dash=True)

<b style="font-size :18px;">Testing the parenthesis rule</b><br>

In [None]:
testing(is_rule_parenthesis=True)

<b style="font-size :18px;">Testing the number rule</b><br>

In [None]:
testing(is_rule_link_num=True)

<b style="font-size :20px;">Step 6 :</b><br>
<span style="font-size :18px;">Testing accuracy on a job description</span>

In [None]:
doc = nlp(list_JDs[0])
spacy.displacy.render(doc, style="ent", jupyter=True)

In [None]:
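list_ents = []
for start,end,label in get_list_ents(doc):
    span = doc.char_span(start,end,label)
    # skip offsets that do not align with token boundaries (char_span returns None)
    if span is not None:
        list_ents.append(span)
doc.ents = list_ents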
spacy.displacy.render(doc, style="ent", jupyter=True)

In [None]:
doc = nlp(list_JDs[0])
print(doc)
list_ents = get_list_ents(doc)
print("Total : "+str(len(list_ents))+" ents")
print("-"*70)
for start, end, label in list_ents:
    print("{:45} | {:5} | {:5} | {:8}".format(
        doc.char_span(start, end).text, str(start), str(end), label))


<b style="font-size :20px;">Step 7 :</b><br>
<span style="font-size :18px;">Creating train.spacy and dev.spacy</span>

In [19]:
TRAINING_DATA = list()
for job_des in tqdm(list_JDs[:25500]):
    try:
        doc = nlp(job_des)
        list_ents = get_list_ents(doc)
        TRAINING_DATA.append((doc.text, {"entities": list_ents}))
    except Exception:
        print("bug")


 20%|█▉        | 5042/25500 [12:53<1:06:42,  5.11it/s]

Time Out


 20%|█▉        | 5043/25500 [13:17<36:05:39,  6.35s/it]

Time Out


 33%|███▎      | 8398/25500 [21:54<39:07,  7.29it/s]   

Time Out


 33%|███▎      | 8400/25500 [22:18<19:29:30,  4.10s/it]

Time Out


 57%|█████▋    | 14633/25500 [38:31<20:49,  8.70it/s]  

Time Out
Time Out


 57%|█████▋    | 14635/25500 [38:57<16:08:48,  5.35s/it]

Time Out


 59%|█████▊    | 14950/25500 [39:45<19:01,  9.24it/s]   

Time Out


 59%|█████▊    | 14950/25500 [40:01<19:01,  9.24it/s]

Time Out


 59%|█████▊    | 14954/25500 [40:21<10:38:05,  3.63s/it]

Time Out


 61%|██████▏   | 15646/25500 [42:05<18:59,  8.65it/s]   

Time Out


 63%|██████▎   | 16048/25500 [43:34<5:50:01,  2.22s/it]

Time Out


 83%|████████▎ | 21078/25500 [56:30<2:32:52,  2.07s/it]

Time Out


 85%|████████▌ | 21723/25500 [58:00<07:18,  8.61it/s]  

bug


 87%|████████▋ | 22277/25500 [59:33<1:51:20,  2.07s/it]

Time Out


100%|██████████| 25500/25500 [1:07:33<00:00,  6.29it/s]


In [21]:
DEV_DATA = list()
for job_des in tqdm(list_JDs[25500:]):
    try:
        doc = nlp(job_des)
        list_ents = get_list_ents(doc)
        DEV_DATA.append((doc.text, {"entities": list_ents}))
    except Exception:
        print("bug")

  4%|▍         | 122/2834 [00:28<1:27:52,  1.94s/it]

Time Out


 63%|██████▎   | 1795/2834 [04:46<35:11,  2.03s/it] 

Time Out


100%|██████████| 2834/2834 [07:17<00:00,  6.48it/s]
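
Both loops above print only "bug" when a document fails, so the failing job description and the reason are lost. One way to keep that information is sketched below. `annotate_all` and its `annotate` callback are hypothetical; in this notebook the callback would be something like `lambda text: get_list_ents(nlp(text))`.

```python
def annotate_all(texts, annotate):
    """Build (text, {"entities": ...}) records and collect failures instead of hiding them."""
    records, failures = [], []
    for index, text in enumerate(texts):
        try:
            records.append((text, {"entities": annotate(text)}))
        except Exception as exc:  # keep going, but remember what went wrong
            failures.append((index, repr(exc)))
    return records, failures


def fake_annotate(text):
    # Stand-in for the real pipeline, used only to demonstrate the failure path.
    if text == "bad":
        raise ValueError("boom")
    return [(0, len(text), "X")]


records, failures = annotate_all(["ok", "bad"], fake_annotate)
```

The `failures` list can then be inspected after the run to see which inputs need attention.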


In [20]:
from spacy.training import Example
db = DocBin()
for text, entities in tqdm(TRAINING_DATA):
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, entities)
    db.add(example.reference)

db.to_disk("./config/train.spacy") 

100%|██████████| 25499/25499 [00:32<00:00, 776.45it/s]


In [22]:
from spacy.training import Example
db = DocBin()
for text, entities in tqdm(DEV_DATA):
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, entities)
    db.add(example.reference)

db.to_disk("./config/dev.spacy")  # save the docbin object

100%|██████████| 2834/2834 [00:03<00:00, 778.01it/s]


<b style="font-size :20px;">Step 8 :</b><br>
<span style="font-size :18px;">Training the IT_skill_NER model with train.spacy and dev.spacy</span>

In [None]:
!python -m spacy init fill-config ./config/base_config.cfg ./config/config.cfg

In [None]:
!python -m spacy debug config ./config/config.cfg

In [None]:
!python -m spacy debug data ./config/config.cfg

In [None]:
!python -m spacy init labels ./config/config.cfg ./config/labels

In [None]:
!python -m spacy train ./config/config.cfg --output ./IT_skill_NER --gpu-id 0

<b style="font-size :20px;">Step 9 :</b><br>
<span style="font-size :18px;">Load the IT_skill_NER model</span>

In [5]:
nlp = spacy.load("./IT_skill_NER/model-best")
nlp.pipe_names

['transformer', 'ner']

In [46]:
nlp = spacy.load("en_core_web_trf")

In [6]:
ner = nlp.get_pipe("ner")
ner.labels

('GPE', 'HARD-SKILL', 'LANGUAGE', 'MONEY', 'ORG', 'TIME')

In [7]:
doc = nlp(list_JDs[17])
spacy.displacy.render(doc, style="ent", jupyter=True)

NameError: name 'list_JDs' is not defined

In [None]:
[token.dep_ for token in doc[:10]]

<b style="font-size :20px;">Step 10 :</b><br>
<span style="font-size :18px;">Comma rule</span>

In [8]:
doc = nlp("Computer science, software engineering, various operating systems, information security fundamentals or general IT, and procedures.")

In [10]:
def comma_rule_token_len_1_2(doc:Doc, list_entities:list,count:int, start:int, end:int)-> tuple:
    is_break = False
    if start == end :
        return list_entities,is_break
    if count == 1 and doc[start].lower_ not in word_ing_remove:
        span = Span(doc,start, end, label="HARD-SKILL")
        # "[A-Z]" is a character class; "A-Z" would only match the literal text "A-Z"
        if re.search("[A-Z]",span.text):
            tuplee = (span.start_char,span.end_char,"HARD-SKILL")
            list_entities.append(tuplee)
    elif count == 2:
        if doc[start].lower_ not in word_ing_remove:
            span = Span(doc,start, end, label="HARD-SKILL")
            tuplee = (span.start_char,span.end_char,"HARD-SKILL")
            list_entities.append(tuplee)
        else :
            if start+1 != end:
                span = Span(doc,start+1, end, label="HARD-SKILL")
                tuplee = (span.start_char,span.end_char,"HARD-SKILL")
                list_entities.append(tuplee)
            is_break = True
    return list_entities,is_break

In [11]:
def comma_rule(doc:Doc):
    doc = append_ents_into_doc(doc,get_list_ents(doc))
    word = ["and","or",",","/","&","and/or"]
    token_noun_chunks = list_tokens_in_noun_chunks(doc)
    list_entities = [] 
    for ent in doc.ents :
        if ent.label_ == "HARD-SKILL":
            if ent.start > 0 and doc[ent.start-1].text in word:
                count = 0
                step_pre = 0
                step = 0
                check_exit = False
                while True :
                    if ent.start-step < 0 :
                        break
                    if doc[ent.start-step].text in word or count == 8 or doc[ent.start-step].text == ".":
                        if count != 0 :
                            start = ent.start-step+1
                            end = ent.start-step_pre
                            if count <= 2 :
                                list_entities,is_break = comma_rule_token_len_1_2(doc,list_entities,count,start,end)
                                if is_break:
                                    break
                            else :
                                i = end - 1
                                while i >= start :
                                    number = repeat_pre_token(doc,doc[i],token_noun_chunks)
                                    if number != 0:
                                        i -= number
                                    else :
                                        if (i+1) != end :
                                            span = Span(doc,(i+1), end, label="HARD-SKILL")
                                            if len(span) == 1 :
                                                if re.search("[A-Z]",span.text):
                                                    tuplee = (span.start_char,span.end_char,"HARD-SKILL")
                                                    list_entities.append(tuplee)
                                            else:
                                                tuplee = (span.start_char,span.end_char,"HARD-SKILL")
                                                list_entities.append(tuplee)
                                        check_exit = True
                                        break 
                            if count == 8 or doc[ent.start-step].text == "." or check_exit:
                                break   
                        step_pre = step
                        count = 0
                    else :
                        count += 1
                    step += 1
                    
            if ent.end < len(doc) and doc[ent.end].text in word:
                count = 0
                step_pre = 0
                step = 0
                check_exit = False
                while True :
                    if ent.end+step >= len(doc) :
                        break
                    if doc[ent.end+step].text in word or count == 8 or doc[ent.end+step].text == ".":
                        if count != 0 :
                            start = ent.end+step_pre+1
                            end = ent.end+step
                            if count <= 2 :
                                list_entities,is_break = comma_rule_token_len_1_2(doc,list_entities,count,start,end)
                                if is_break:
                                    break
                            else :
                                i = start
                                while i < end :
                                    number = repeat_after_token(doc,doc[i],token_noun_chunks)
                                    if number != 0 :
                                        i += number
                                    else:
                                        if i != start:
                                            span = Span(doc,start, i, label="HARD-SKILL")
                                            if len(span) == 1 :
                                                if re.search("[A-Z]",span.text):
                                                    tuplee = (span.start_char,span.end_char,"HARD-SKILL")
                                                    list_entities.append(tuplee)
                                            else:
                                                tuplee = (span.start_char,span.end_char,"HARD-SKILL")
                                                list_entities.append(tuplee)
                                        check_exit = True
                                        break 
        
                            if count == 8 or doc[ent.end+step].text == "." or check_exit:
                                break   
                        step_pre = step
                        count = 0
                    else :
                        count += 1
                    step += 1

    list_entities = remove_element_duplicate(list_entities)
    list_entities = add_other_ents(doc,list_entities,is_comma_rule=True)
    return list_entities
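
The forward half of `comma_rule` can be pictured as splitting the tokens after a known skill into candidate segments at each separator, stopping at a period or once a segment reaches 8 tokens. The sketch below shows only that segmentation on plain strings. `following_segments` is a hypothetical helper, and the real function additionally filters each segment with `repeat_after_token` and `comma_rule_token_len_1_2`.

```python
SEPARATORS = {"and", "or", ",", "/", "&", "and/or"}


def following_segments(tokens, start, max_len=8):
    """Split tokens[start:] into candidate segments delimited by separators."""
    segments = []
    current = []
    for tok in tokens[start:]:
        if tok == "." or tok in SEPARATORS:
            if current:
                segments.append(current)
                current = []
            if tok == ".":
                break  # sentence end: stop scanning
        else:
            current.append(tok)
            if len(current) == max_len:
                segments.append(current)
                break  # overly long segment: treat it as the end of the list
    return segments


tokens = ["Computer", "science", ",", "software", "engineering", ",",
          "various", "operating", "systems", ",", "information", "security",
          "fundamentals", "or", "general", "IT", ",", "and", "procedures", "."]
segments = following_segments(tokens, 2)  # scan starts right after "Computer science"
```

For the example sentence used in this step, each segment is one item of the enumeration that follows "Computer science".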

In [12]:
doc = append_ents_into_doc(doc,comma_rule(doc))

NameError: name 'append_ents_into_doc' is not defined

In [9]:
spacy.displacy.render(doc, style="ent", jupyter=True)