# Named Entity Recognition Model
In this Python notebook, a named entity recognition (NER) model is built from job postings scraped from LinkedIn.
The steps are:
1. Data Preprocessing
2. Data Annotating
3. Data Preparation
4. Data Modelling
5. Evaluation
6. Predict on Test Data
7. Visualize data using Tableau

## Imports

In [22]:
%reload_ext autoreload
%autoreload 2
import os
import sys
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())
sys.path.append(os.getenv("PROJECT_DIR"))
import pandas as pd
import re
import spacy
import numpy as np
import json
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from tqdm import tqdm
from src.data import DataLoader
from src.data import DataPreprocessor
from src.models import RuleBasedNER

## 1. Data Preprocessing

In [3]:
# Load data
df = DataLoader.load_data("../data/raw/linkedin_jobs_train.csv")
df = df.drop_duplicates().reset_index(drop=True)

# Preprocess Job Descriptions
texts = []
data_preprocessor = DataPreprocessor(model_name="en_core_web_sm")
for i, row in tqdm(df.iterrows()):
    texts.append(data_preprocessor.preprocess_job_desc(row['job_description'])+"\n")

# Save the preprocessed job descriptions, one per line, to jobs_train.txt
with open("../data/interim/jobs_train.txt", "w") as f:
    f.writelines(texts)

214it [00:03, 67.81it/s]
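
Each description is written as a single line of `jobs_train.txt` and read back with `readlines()` in the next step, so the preprocessed text must not contain line breaks. What `preprocess_job_desc` does internally is not shown in this notebook; the snippet below is only a hypothetical sketch of the whitespace-flattening part, and `to_single_line` is an illustrative name rather than part of `DataPreprocessor`.

```python
import re


def to_single_line(text: str) -> str:
    """Collapse every run of whitespace (including newlines and tabs) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw description with mixed line endings
raw = "Requirements:\n- Python\r\n- SQL\n\nNice to have:\tTableau"
line = to_single_line(raw)
print(line)
```

With text flattened this way, one line in the file corresponds to exactly one job posting.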


## 2. Data Annotating
To reduce manual effort, we first run rule-based entity recognition. The idea is to pre-annotate skills that are easy to identify, such as `python`, `machine learning`, `r`, `mysql`, etc.

In [31]:
# # Run this code just once
# # Create rule based entity recognition
# rule_based_ner = RuleBasedNER()

# # Access the hard skills and soft skills
# with open("../data/external/hardskills_dictionary.txt", 'r') as f:
#     hard_skills = f.readlines()
# with open("../data/external/softskills_dictionary.txt", 'r') as f:
#     soft_skills = f.readlines()
    
# # Add entity rule
# rule_based_ner.add_rule(hard_skills, 'HARD SKILL')
# rule_based_ner.add_rule(soft_skills, 'SOFT SKILL')

# # Save the rules for reuse
# rule_based_ner.to_disk("../data/interim/patterns.jsonl")
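
The internals of `RuleBasedNER` are not shown in this notebook. As a rough sketch of how dictionary-based tagging can be done with spaCy's built-in `entity_ruler` component, the example below turns each dictionary phrase into a case-insensitive token pattern. The helper `phrase_to_pattern` and the short skill lists are assumptions made for illustration; the real dictionaries live in the `hardskills_dictionary.txt` and `softskills_dictionary.txt` files.

```python
import spacy

# Blank English pipeline: tokenizer only, no pretrained model required
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")


def phrase_to_pattern(phrase: str, label: str) -> dict:
    """Build a case-insensitive token pattern from one dictionary phrase."""
    tokens = [{"LOWER": tok.text.lower()} for tok in nlp.make_doc(phrase.strip())]
    return {"label": label, "pattern": tokens}


# Illustrative entries; readlines() keeps the trailing newline, hence the strip() above
hard_skills = ["python\n", "machine learning\n", "mysql\n"]
soft_skills = ["communication\n", "teamwork\n"]

ruler.add_patterns(
    [phrase_to_pattern(s, "HARD SKILL") for s in hard_skills if s.strip()]
    + [phrase_to_pattern(s, "SOFT SKILL") for s in soft_skills if s.strip()]
)

doc = nlp("Strong Python and machine learning background; good communication.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Matching on the `LOWER` attribute means `Python` and `python` are both tagged, which suits job descriptions with inconsistent capitalization.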

In [32]:
# Load the rule-based pipeline with the saved patterns
nlp = RuleBasedNER(pattern_path="../data/interim/patterns.jsonl").nlp

# Tag entities with the rule-based pipeline
with open("../data/interim/jobs_train.txt", "r") as f:
    jobs = f.readlines()

pre_annotations_dict = {
    "classes": ["HARD SKILL", "SOFT SKILL"],
    "annotations": []
}

for job in tqdm(jobs):
    text = job.strip() + "\r"  # append a carriage return (\r)
    doc = nlp(text)
    ents_dict = {"entities": [[ent.start_char, ent.end_char, ent.label_] for ent in doc.ents]}
    pre_annotations_dict['annotations'].append([text, ents_dict])

# Save annotations
with open("../data/interim/pre_annotations.json" , 'w') as f:
    json.dump(pre_annotations_dict, f)

100%|██████████| 214/214 [00:03<00:00, 69.43it/s]


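Annotate the remaining unlabeled entities manually with the NER Annotator tool: https://tecoholic.github.io/ner-annotator/. The next step reads the finished annotations from `../data/interim/annotations.json`.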

## 3. Data Preparation 

Split the data into a training set, a development set and a test set (roughly 70% / 15% / 15%).

In [3]:
# Prepare Train and Dev
# Open the annotations
with open("../data/interim/annotations.json", "r") as f:
    annotations_json = json.load(f)

# Compute split sizes: 70% train, 15% dev; the test set takes the remainder
total_size = len(annotations_json['annotations'])
train_len, dev_len = int(total_size * 0.7), int(total_size * 0.15)
data = [(aj[0], aj[1]) for aj in annotations_json['annotations']]

# Shuffle the data with a fixed seed for reproducibility
np.random.seed(42)
np.random.shuffle(data)

# Prepare Train Data
train_data = data[:train_len] 
db = DataPreprocessor.convert_to_doc_bin(train_data)
db.to_disk("../data/processed/train.spacy")

# Prepare Dev Data
dev_data = data[train_len : train_len + dev_len] 
db = DataPreprocessor.convert_to_doc_bin(dev_data)
db.to_disk("../data/processed/dev.spacy")

# Prepare Test Data
test_data = data[train_len + dev_len:] 
db = DataPreprocessor.convert_to_doc_bin(test_data)
db.to_disk("../data/processed/test.spacy")
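
`DataPreprocessor.convert_to_doc_bin` is a project helper whose body is not shown here. A minimal sketch of what such a conversion typically looks like with spaCy is given below, assuming each item is a `(text, {"entities": [[start, end, label], ...]})` pair with character offsets. The function name `to_doc_bin` and the sample sentence are illustrative only.

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans


def to_doc_bin(data, nlp):
    """Convert (text, {"entities": [[start, end, label], ...]}) pairs to a DocBin."""
    db = DocBin()
    for text, annotation in data:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotation["entities"]:
            # char_span returns None when the offsets cannot be aligned to tokens
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        # doc.ents rejects overlapping spans, so keep only non-overlapping ones
        doc.ents = filter_spans(spans)
        db.add(doc)
    return db


nlp = spacy.blank("en")
text = "Experience with Python and machine learning."
start = text.index("Python")
sample = [(text, {"entities": [[start, start + len("Python"), "HARD SKILL"]]})]

db = to_doc_bin(sample, nlp)
docs = list(db.get_docs(nlp.vocab))
ents = [(ent.text, ent.label_) for ent in docs[0].ents]
print(ents)
```

Skipping spans that do not align with token boundaries avoids errors at conversion time, at the cost of silently dropping those annotations, so it can be worth counting how many are discarded.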

## 4. Data Modelling (From Scratch)
Steps
1. Download a base configuration from the spaCy website that matches the NER use case, then fill it in to create the full training configuration.
2. Train the model using that configuration along with the training data and development data.

In [None]:
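# 1. Fill the base configuration with default values for all remaining settings
!python -m spacy init fill-config ../config/base_config.cfg ../config/config.cfg
# 2. Train on the training set and evaluate on the development set; output goes to ../models
!python -m spacy train ../config/config.cfg --output ../models --paths.train ../data/processed/train.spacy --paths.dev ../data/processed/dev.spacy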

## 5. Model Evaluation
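Pick the best model: spaCy saves the pipeline that scores best on the development set as `model-best` in the output directory, and that is the model loaded in the next section.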

![image.png](../reports/figures/evaluation.png)

## 6. Predict on Test Data & Evaluate

In [74]:
from spacy.tokens import DocBin
from spacy.training import Example
from spacy.scorer import Scorer
# Load the test data; a vocab is needed to deserialize the stored docs
test_data = DocBin().from_disk("../data/processed/test.spacy")
nlp = spacy.load("en_core_web_sm")

# Load the best trained model
best_model = spacy.load("../models/model-best")
hard_skills = []
soft_skills = []

# Build Example objects pairing each predicted doc with its reference doc
examples = []
for test_doc in test_data.get_docs(nlp.vocab):
    pred_doc = best_model(test_doc.text)
    examples.append(Example(pred_doc, test_doc))  # for evaluation purposes
    for ent in pred_doc.ents:
        # Normalize the entity text before collecting it
        text = ent.text
        text = re.sub(" - ", "-", text)
        text = re.sub(" / ", "/", text)
        text = re.sub(" ", "_", text)
        if ent.label_ == "HARD SKILL":
            hard_skills.append(text)
        elif ent.label_ == "SOFT SKILL":
            text = re.sub("-", "_", text)
            soft_skills.append(text)
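
The normalization applied to each predicted entity in the loop above can be pulled out into a small function, which makes it easier to check on a few inputs. The sketch below mirrors the same substitutions in the same order; the function name `normalize_skill` and the sample strings are illustrative, not part of the project code.

```python
import re


def normalize_skill(text: str, label: str) -> str:
    """Apply the notebook's substitutions to one entity string."""
    text = re.sub(" - ", "-", text)   # "scikit - learn" -> "scikit-learn"
    text = re.sub(" / ", "/", text)   # "ci / cd" -> "ci/cd"
    text = re.sub(" ", "_", text)     # remaining spaces become underscores
    if label == "SOFT SKILL":
        text = re.sub("-", "_", text)  # soft skills also replace hyphens
    return text


samples = [
    ("machine learning", "HARD SKILL"),
    ("scikit - learn", "HARD SKILL"),
    ("ci / cd", "HARD SKILL"),
    ("problem - solving", "SOFT SKILL"),
]
normalized = [normalize_skill(text, label) for text, label in samples]
print(normalized)
```

Order matters here: the spaced hyphen and slash are tightened first, so they are not turned into underscores by the later space replacement.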

In [67]:
scores = Scorer.score_spans(examples, 'ents')
print(f"""
Precision: {scores['ents_p']:.2f}
Recall: {scores['ents_r']:.2f}
F1-Score: {scores['ents_f']:.2f}
""")

for k, v in scores['ents_per_type'].items():
    print(f"""
Precision for {k} entity: {v['p']:.2f}
Recall for {k}: {v['r']:.2f}
F1-Score for {k}: {v['f']:.2f}
    """)


Precision: 0.71
Recall: 0.68
F1-Score: 0.70


Precision for DEGREE entity: 0.83
Recall for DEGREE: 0.84
F1-Score for DEGREE: 0.83
    

Precision for SOFT SKILL entity: 0.66
Recall for SOFT SKILL: 0.59
F1-Score for SOFT SKILL: 0.63
    

Precision for HARD SKILL entity: 0.71
Recall for HARD SKILL: 0.68
F1-Score for HARD SKILL: 0.69
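
To make the reported numbers easier to interpret, here is a minimal sketch of how span-level precision, recall and F1 can be computed, assuming exact-match scoring in which a predicted span counts as correct only if its boundaries and label both match a reference span. The function `span_prf` and the toy spans are assumptions for illustration, not spaCy's own implementation.

```python
def span_prf(gold, pred):
    """Return (precision, recall, f1) for two collections of (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)  # spans that match on boundaries and label
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: 3 reference spans, 4 predicted spans, 2 exact matches
gold = {(16, 22, "HARD SKILL"), (27, 43, "HARD SKILL"), (50, 63, "SOFT SKILL")}
pred = {
    (16, 22, "HARD SKILL"),
    (27, 34, "HARD SKILL"),   # boundary error, counts as wrong
    (50, 63, "SOFT SKILL"),
    (70, 75, "HARD SKILL"),   # spurious prediction
}
p, r, f = span_prf(gold, pred)
print(f"Precision: {p:.2f}  Recall: {r:.2f}  F1-Score: {f:.2f}")
```

Under this scheme a partially correct span, such as `machine` instead of `machine learning`, lowers both precision and recall, which helps explain why multi-word skills are harder to score well on.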
    


In [77]:
# Save the extracted skills as CSV files for the Tableau dashboard
hard_skill_df = pd.DataFrame({"hard_skill": hard_skills})
soft_skill_df = pd.DataFrame({"soft_skill": soft_skills})
hard_skill_df.to_csv("../reports/hard_skill.csv", index=False)
soft_skill_df.to_csv("../reports/soft_skill.csv", index=False)

## 7. Visualize in Tableau
Link: https://public.tableau.com/views/DataScienceJobSkillsPlatform/Frontpage?:language=en-US&:display_count=n&:origin=viz_share_link