# AASD 4005 - Adv. Mathematical Concepts for Machine Learning

Details:
- Version: 1.0. 
- Date: 2022-11-20

Group: 

 

## Resume Parsing and Entity Recognition (NER) using Transformers and Pytorch

We will use **RoBERTa** model for NER.
- **Named-entity recognition (NER)** (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. (Wikipedia: https://en.wikipedia.org/wiki/Named-entity_recognition)
- **RoBERTa stands for Robustly Optimized BERT Pre-training Approach**. It was presented by researchers at Facebook and Washington University. The goal of this paper was to optimize the training of BERT architecture in order to take lesser time during pre-training. (GeekforGeeks: https://www.geeksforgeeks.org/overview-of-roberta-model/)
- This code is an implementation based on the approach explained in you tube: "Resume (CV) Parsing using Spacy 3 | NER Training in Spacy v3". https://www.youtube.com/watch?v=WpaioLNsoGI
- This implementation intends to feed the resume classification Final project of the course "Advanced Mathematical Concepts for Machine Learning" at George Brown College.
- It was run in Google colab using Premium GPU runtime.
## Architecture:
Defined in **"config.cfg" file**:
- **Tokenizer:** "components.ner.model.tok2vec".
- **NER:** "spacy.TransitionBasedParser.v2". 
- **Transformer Model:** "spacy-transformers.TransformerModel.v3".
- **Optimizer:** "Adam.v1"

### Installing Tranformers

In [1]:
!pip install spacy_transformers
!pip install -U spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy_transformers
  Downloading spacy_transformers-1.1.8-py2.py3-none-any.whl (53 kB)
[K     |████████████████████████████████| 53 kB 2.0 MB/s 
Collecting transformers<4.22.0,>=3.4.0
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
[K     |████████████████████████████████| 4.7 MB 82.4 MB/s 
[?25hCollecting spacy-alignments<1.0.0,>=0.7.2
  Downloading spacy_alignments-0.8.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 74.3 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 75.2 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
[K     |████████████████████████████████| 182 kB 61.9 MB/s 
Ins

### Importing libraries and checking for mandatory GPU

In [2]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm 
import json

In [3]:
# Check Spacy version
spacy.__version__

'3.4.3'

In [4]:
# Check GPU to use (GPU is MANDATORY for training the model; otherwise it will thrown an error when starting the training)
!nvidia-smi

Mon Nov 21 00:02:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    49W / 400W |    658MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Setting up the environment to train the RoBERTa Model

In [5]:
# Mounting Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Cloning the resume data to train the model as well as the base configuration file 
!git clone https://github.com/laxmimerit/CV-Parsing-using-Spacy-3.git
# move git clone repository to a directory inside the mounted drive

Cloning into 'CV-Parsing-using-Spacy-3'...
remote: Enumerating objects: 82, done.[K
remote: Counting objects: 100% (82/82), done.[K
remote: Compressing objects: 100% (78/78), done.[K
remote: Total 82 (delta 16), reused 5 (delta 0), pack-reused 0[K
Unpacking objects: 100% (82/82), done.


In [7]:
#  Fetching the data to train the model
# cv_data = json.load(open('/content/CV-Parsing-using-Spacy-3/data/training/train_data.json', 'r'))
cv_data = json.load(open('/content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/data/training/train_data.json', 'r'))

In [8]:
#  200 resumes manually annotated using "Label Studio" https://labelstud.io/
len(cv_data )

200

In [9]:
# Example of data
cv_data [0]

['Govardhana K Senior Software Engineer  Bengaluru, Karnataka, Karnataka - Email me on Indeed: indeed.com/r/Govardhana-K/ b2de315d95905b68  Total IT experience 5 Years 6 Months Cloud Lending Solutions INC 4 Month • Salesforce Developer Oracle 5 Years 2 Month • Core Java Developer Languages Core Java, Go Lang Oracle PL-SQL programming, Sales Force Developer with APEX.  Designations & Promotions  Willing to relocate: Anywhere  WORK EXPERIENCE  Senior Software Engineer  Cloud Lending Solutions -  Bangalore, Karnataka -  January 2018 to Present  Present  Senior Consultant  Oracle -  Bangalore, Karnataka -  November 2016 to December 2017  Staff Consultant  Oracle -  Bangalore, Karnataka -  January 2014 to October 2016  Associate Consultant  Oracle -  Bangalore, Karnataka -  November 2012 to December 2013  EDUCATION  B.E in Computer Science Engineering  Adithya Institute of Technology -  Tamil Nadu  September 2008 to June 2012  https://www.indeed.com/r/Govardhana-K/b2de315d95905b68?isid=rex-

# Configuration for the model
The recommended way to train your spaCy pipelines is via the spacy train command on the command line. It only needs a single config.cfg configuration file that includes all settings and hyperparameters. You can optionally overwrite settings on the command line, and load in a Python file to register custom functions and architectures. This quickstart widget helps you generate a starter config with the recommended settings for your specific use case. It’s also available in spaCy as the init config command. https://spacy.io/usage/training#basics

## Training configuration files 
Training config files include all settings and hyperparameters for training your pipeline. Instead of providing lots of arguments on the command line, you only need to pass your config.cfg file to spacy train. Under the hood, the training config uses the configuration system provided by our machine learning library Thinc. This also makes it easy to integrate custom models and architectures, written in your framework of choice. Some of the main advantages and features of spaCy’s training config are:

- **Structured sections**. The config is grouped into sections, and nested sections are defined using the . notation. For example, [components.ner] defines the settings for the pipeline’s named entity recognizer. The config can be loaded as a Python dict.
- **References** to registered functions. Sections can refer to registered functions like model architectures, optimizers or schedules and define arguments that are passed into them. You can also register your own functions to define custom architectures or methods, reference them in your config and tweak their parameters.
- **Interpolation**. If you have hyperparameters or other settings used by multiple components, define them once and reference them as variables.
Reproducibility with no hidden defaults. The config file is the “single source of truth” and includes all settings.
- **Automated checks and validation**. When you load a config, spaCy checks if the settings are complete and if all values have the correct types. This lets you catch potential mistakes early. In your custom architectures, you can use Python type hints to tell the config which types of data to expect. https://spacy.io/usage/training#config

In [10]:
# Creating "config.cfg" file from "base_config.cfg"
!python -m spacy init fill-config /content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/data/training/base_config.cfg /content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/data/training/config.cfg

[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
/content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/data/training/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [11]:
!cat /content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/data/training/config.cfg

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1

## Preparing Training Data

Training data for NLP projects comes in many different formats. For some common formats such as CoNLL, spaCy provides converters you can use from the command line. In other cases you’ll have to prepare the training data yourself.

When converting training data for use in spaCy, the main thing is to create Doc objects just like the results you want as output from the pipeline. For example, if you’re creating an NER pipeline, loading your annotations and setting them as the .ents property on a Doc is all you need to worry about. On disk the annotations will be saved as a DocBin in the .spacy format, but the details of that are handled automatically.

(https://spacy.io/usage/training#training-data)

In [12]:
def get_spacy_doc (file, data):
  nlp = spacy.blank('en')
  db = DocBin() # docbin object

  for text, annot in tqdm(data):
    doc =nlp.make_doc(text)
    annot = annot ['entities']

    ents = []
    entity_indices = []

    # skip overlapping annotated entitites
    for start, end, label in annot:
      skip_entity = False

      for idx in range(start, end):
        if idx in entity_indices:
          skip_entity = True
          break
        if skip_entity == True:
          continue

        entity_indices  = entity_indices + list(range(start, end))

        try:
          span = doc.char_span(start, end, label=label, alignment_mode='strict')    
        except:
          continue
        
        if span == None:
          err_data = str([start, end]) +  "   " + str(text) + '\n'
          file.write(err_data)
        else:
          ents.append(span)
      
      try:
        doc.ents = ents
        db.add(doc)
      except:
        pass

  return db   

In [13]:
# Generating training and testing data
from sklearn.model_selection import train_test_split
train, test = train_test_split (cv_data, test_size=0.3)


In [14]:
len(train), len(test)

(140, 60)

In [15]:
file =open('error.txt', 'w')  # Potencial errors in CVs entity annotation are saved in 'error.txt'

db = get_spacy_doc(file, train)
db.to_disk('train_data.spacy')

db = get_spacy_doc(file, test)
db.to_disk('test_data.spacy')

file.close()

100%|██████████| 140/140 [00:05<00:00, 24.68it/s]
100%|██████████| 60/60 [00:01<00:00, 32.16it/s]


In [16]:
db.tokens  # data stored into db.tokens

[array([[12496960360823244079, 15552771500443605288,                    0,
         ...,                    0,                    0,
                            0],
        [12045089575812823300,   616671652016297689,                    0,
         ...,                    0,                    0,
                            0],
        [11009051309222302246, 10044989871309213945,                    0,
         ...,                    0,                    0,
                            0],
        ...,
        [ 1535274886070011314, 12064335484295216706,                    0,
         ...,                    0,                    0,
                            0],
        [ 9236818973772229989,  9236818973772229989,                    0,
         ...,                    0,                    0,
                            0],
        [ 8930149225908759990,  8930149225908759990,                    0,
         ...,                    0,                    0,
                            0

In [17]:
len(db.tokens)

728

In [18]:
%%time
# !python -m spacy train /content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/config.cfg --output ./output --paths.train ./train_data.spacy --paths.dev ./test_data.spacy --gpu-id 0
!python -m spacy train /content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/data/training/config.cfg --output /content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/output --paths.train ./train_data.spacy --paths.dev ./test_data.spacy --gpu-id 0


[38;5;4mℹ Saving to output directory:
/content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/output[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2022-11-21 00:03:32,200] [INFO] Set up nlp object from config
INFO:spacy:Set up nlp object from config
[2022-11-21 00:03:32,212] [INFO] Pipeline: ['transformer', 'ner']
INFO:spacy:Pipeline: ['transformer', 'ner']
[2022-11-21 00:03:32,217] [INFO] Created vocabulary
INFO:spacy:Created vocabulary
[2022-11-21 00:03:32,218] [INFO] Finished initializing nlp object
INFO:spacy:Finished initializing nlp object
Downloading config.json: 100% 481/481 [00:00<00:00, 471kB/s]
Downloading vocab.json: 100% 878k/878k [00:01<00:00, 666kB/s]
Downloading merges.txt: 100% 446k/446k [00:01<00:00, 413kB/s]
Downloading tokenizer.json: 100% 1.29M/1.29M [00:01<00:00, 1.22MB/s]
Downloading pytorch_model.bin: 100% 478M/478M [00:06<00:00, 73.2MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.

## Quick Model Testing

In [19]:
# Loading the best trained model
nlp = spacy.load('/content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/output/model-best')

In [20]:
# Short example
doc = nlp('My name is John Smith. I worked at IBM. I have 15 years of experience')
for ent in doc.ents:
    print (ent.text, "   ->>>>   ", ent.label_)
    


## Preparing to read a real PDF resume
## using PyMuPDF:
With PyMuPDF you can access files with extensions like “.pdf”, “.xps”, “.oxps”, “.cbz”, “.fb2”, “.mobi” or “.epub”. In addition, about 10 popular image formats can also be opened and handled like documents. https://pymupdf.readthedocs.io/en/latest/

In [21]:
# Installing the library
!pip install PyMuPDF 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyMuPDF
  Downloading PyMuPDF-1.21.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB)
[K     |████████████████████████████████| 14.0 MB 20.4 MB/s 
[?25hInstalling collected packages: PyMuPDF
Successfully installed PyMuPDF-1.21.0


In [22]:
import sys, fitz

In [23]:
# Checking with a real uploaded Resume
fname = '/content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/data/ResumeAdvMath.pdf'
doc =fitz.open(fname)

In [24]:
# doc =[page.getText() for page in doc]

In [25]:
doc

Document('/content/drive/MyDrive/KaggleCompetitions/CV-Parsing-using-Spacy-3/data/ResumeAdvMath.pdf')

In [26]:
# Generating the text file 
text =''
for page in doc:
  text =text + str(page.get_text())
text = text.strip()

In [27]:
# text.split()

In [28]:
text

"Juan Perez \nARTIFICIAL INTELLIGENCE \nMBA - CYBERSECURITY \nELECTRONICS ENGINEER \n              \n \nMobile: +1 182-647-9981 \nE-mail: juanperez1981@georgebrown.ca\n \nSUMMARY OF SKILLS:  \nMachine Learning, Neural Networks, Deep Learning, NLP, TensorFlow, Python, C++, SQL, Tableau, AWS, Azure, Cloud, \nCybersecurity, Kali Linux, Leadership, Technical Sales, English (Advanced), French (Intermediate), Spanish (Native).  \n \nCERTIFICATIONS:  \n• \nAWS Certified Machine Learning Specialty (2022)  \n• \nAWS Certified Cloud Solutions Architect Associate (2020)  \n• \nAzure Fundamentals (2022) \n• \nHCIA Huawei Certified ICT Associate - Cloud Services (2021) \n \nEDUCATION: \nGeorge Brown College  \n \n \n \n \n \n \n \nToronto, Canada \nApplied A.I. Solutions Development Program (Postgraduate) \n \n  \n \nAugust 2023 \n \nCentennial College  \n \n \n \n \n \n \n \nToronto, Canada \nGraduate Certificate in Cybersecurity (with Honours)  \n  \n \n \nMay 2022 \n \nTexas A&M University, Mays

In [29]:
# Entity Recognition of PDF Resume
doc = nlp(text)
for ent in doc.ents:
    print (ent.text, "    ->>>>   ", ent.label_)

Machine Learning, Neural Networks, Deep Learning, NLP, TensorFlow, Python, C++, SQL, Tableau, AWS, Azure, Cloud, 
Cybersecurity, Kali Linux, Leadership, Technical Sales, English (Advanced), French (Intermediate), Spanish (Native)     ->>>>    Skills
George Brown College     ->>>>    College Name
Toronto     ->>>>    Location
Centennial College     ->>>>    College Name
Toronto     ->>>>    Location
Texas A&M University, Mays School of Business     ->>>>    College Name
Centennial College     ->>>>    College Name
George Brown College     ->>>>    College Name
George Brown College     ->>>>    College Name
Centennial College     ->>>>    College Name
Texas A&M University     ->>>>    College Name
Lima     ->>>>    Location
