<a href="https://colab.research.google.com/github/hafizhry/NER-resume/blob/main/Resume_Projects.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [45]:
'''
requirements.txt
Python 3.7.11
spaCy 2.1.4
PyMuPDF 1.18.15
'''

In [32]:
!pip install spacy==2.1.4



In [33]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [34]:
!pip install PyMuPDF==1.18.15



In [35]:
import spacy
import random
import json
import re
import sys, fitz
import os
from google.colab import files
from spacy.util import minibatch, compounding
#from spacy.gold import GoldParse
from spacy.lang.en import English

In [36]:
def open_dataset(file_path):
  '''
    Open the JSON dataset and convert it to a list of resumes with their
    NER entities. The function receives the path to the dataset's JSON file
    (one JSON record per line) and returns the dataset as a list.
    Each item has the structure (text, {'entities': [(start_point, end_point, 'label')]})

    All 220 items in the dataset have been manually labeled.
    The labels are divided into the following 10 categories:
    Name
    College Name
    Degree
    Graduation Year
    Years of Experience
    Companies worked at
    Designation
    Skills
    Location
    Email Address
    
    Source : https://www.kaggle.com/dataturks/resume-entities-for-ner
  '''
  dataset = []
  lines = []
  with open(file_path, encoding='utf-8') as f:
    lines = f.readlines() 
    print('Number of data: ' + str(len(lines)))
    for line in lines:
      data = json.loads(line) # loads the JSON files
      text = data['content'] # assign the content of the resume to list (as text)
      entities = []
      for annotation in data['annotation']: # loop for assigning each label, start and end points of the NER label to the list (as entities)
        point = annotation['points'][0]
        labels = annotation['label']
        if not isinstance(labels, list):
          labels = [labels] # wrap a single label; list(labels) would split a string into characters
        for label in labels:
          entities.append((point['start'], point['end']+1, label))
      
      dataset.append((text, {'entities':entities}))
  return dataset
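
The conversion that open_dataset performs can be checked on a single record. The sketch below uses a made-up record (the name, text, and offsets are hypothetical) in the shape the function reads, and assumes, as the `+1` in the function suggests, that the dataset stores the end offset inclusively.

```python
import json

# Hypothetical one-line record in the shape open_dataset reads.
line = json.dumps({
    "content": "Jane Doe\nData Analyst",
    "annotation": [
        {"label": ["Name"], "points": [{"start": 0, "end": 7}]},
        {"label": ["Designation"], "points": [{"start": 9, "end": 20}]},
    ],
})

record = json.loads(line)
text = record["content"]
entities = []
for annotation in record["annotation"]:
    point = annotation["points"][0]
    for label in annotation["label"]:
        # Assumed: 'end' is inclusive in the source data, so add 1 for slicing.
        entities.append((point["start"], point["end"] + 1, label))

# Each (start, end) pair should slice out exactly the labeled text.
for start, end, label in entities:
    print(label, repr(text[start:end]))
```

Slicing the text with each span is a quick way to confirm that the offsets line up before training.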

In [49]:
data = open_dataset('/content/Entity Recognition in Resumes.json')

Number of data: 220


In [50]:
data[0]


("Abhishek Jha\nApplication Development Associate - Accenture\n\nBengaluru, Karnataka - Email me on Indeed: indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a\n\n• To work for an organization which provides me the opportunity to improve my skills\nand knowledge for my individual and company's growth in best possible ways.\n\nWilling to relocate to: Bangalore, Karnataka\n\nWORK EXPERIENCE\n\nApplication Development Associate\n\nAccenture -\n\nNovember 2017 to Present\n\nRole: Currently working on Chat-bot. Developing Backend Oracle PeopleSoft Queries\nfor the Bot which will be triggered based on given input. Also, Training the bot for different possible\nutterances (Both positive and negative), which will be given as\ninput by the user.\n\nEDUCATION\n\nB.E in Information science and engineering\n\nB.v.b college of engineering and technology -  Hubli, Karnataka\n\nAugust 2013 to June 2017\n\n12th in Mathematics\n\nWoodbine modern school\n\nApril 2011 to March 2013\n\n10th\n\nKendriya Vidyalaya\n

In [39]:
def cleaning_dataset(data):
  '''
    Clean the dataset. This function receives the dataset as a list and
    returns a list as well. It replaces newline characters in the resume
    text with spaces, lowercases the text, and trims leading and trailing
    whitespace from each entity span by adjusting its start and end points.

    This function is inspired by https://www.kaggle.com/mohamedtaha7/ner-on-resumes-using-spacy 
  '''
  invalid_span_tokens = re.compile(r'\s')
  cleaned_data = []
  for text, annotations in data:
      entities = annotations['entities']
      valid_entities = []
      cleaned_text = ' '.join(text.split('\n')) # replacing the '\n' element to whitespaces
      cleaned_text = cleaned_text.lower() # lower case the text data
      for start, end, label in entities:
          valid_start = start
          valid_end = end
          # this loop validates the start and end point of the entities to the text dataset 
          while valid_start < len(text) and invalid_span_tokens.match( 
                  text[valid_start]):
              valid_start += 1
          while valid_end > 1 and invalid_span_tokens.match(
                  text[valid_end - 1]):
              valid_end -= 1
          valid_entities.append([valid_start, valid_end, label])
      cleaned_data.append([cleaned_text, {'entities': valid_entities}])
  return cleaned_data
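
The span trimming inside cleaning_dataset can be tried in isolation. This is a minimal sketch with a hypothetical helper, `trim_span`, and a made-up sample string; it bounds the loops by the span itself rather than by the text length, which is a small variation on the loops above.

```python
import re

WHITESPACE = re.compile(r"\s")

def trim_span(text, start, end):
    """Shrink the half-open span [start, end) until neither edge is whitespace."""
    while start < end and WHITESPACE.match(text[start]):
        start += 1
    while end > start and WHITESPACE.match(text[end - 1]):
        end -= 1
    return start, end

# Hypothetical annotation that includes a leading and a trailing space.
text = "Skills: Python, SQL \nEducation"
start, end = trim_span(text, 7, 20)
print(repr(text[start:end]))
```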

In [51]:
cleaned_data = cleaning_dataset(data)

In [41]:
def train_data_spacy(train_data):
  '''
    Train an NER model with spaCy. The function receives the training data
    as a list. Each item consists of a resume text and the entities
    (annotations) that label spans of that text. The trained pipeline is
    saved to the 'models' directory.
  '''
  nlp = English()
  if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
  else: 
    ner = nlp.get_pipe('ner')

  for _, annotation in train_data:
      for annot in annotation['entities']:
        ner.add_label(annot[2])
  
  other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
  with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    n_iter = 50
    
    for itn in range(n_iter):
      print('Iteration ' + str(itn))
      random.shuffle(train_data)
      losses = {}
      for batch in spacy.util.minibatch(train_data, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch] 
        try:
          nlp.update(
                texts, 
                annotations,
                drop=0.3,
                sgd=optimizer,
                losses=losses)
        except Exception:
          pass # skip batches that spaCy fails to process
      print("Losses", losses)
    
  nlp.to_disk('models')
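
The training loop skips any batch that raises an exception. One plausible cause with hand-labeled data is overlapping entity spans, which spaCy's NER cannot represent; this is an assumption, not something the notebook confirms. A minimal sketch of a pre-filter, using a hypothetical helper `drop_overlapping` that keeps the longest spans:

```python
def drop_overlapping(entities):
    """Keep longer spans first and discard any span that overlaps a kept one."""
    kept = []
    # Sort by descending length, then by start offset.
    for start, end, label in sorted(entities, key=lambda e: (e[0] - e[1], e[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Hypothetical spans: the second one lies inside the first.
entities = [(0, 12, "Name"), (5, 12, "Designation"), (20, 30, "Skills")]
print(drop_overlapping(entities))
```

Keeping the longest span is only one possible policy; which label to keep when two overlap is a judgment call for the dataset at hand.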

In [42]:
train_data_spacy(cleaned_data)

Iteration 0
Losses {'ner': 31600.52429742074}
Iteration 1
Losses {'ner': 21137.05883979446}
Iteration 2
Losses {'ner': 18011.905975277114}
Iteration 3
Losses {'ner': 17350.174877381356}
Iteration 4
Losses {'ner': 15705.336480165304}
Iteration 5
Losses {'ner': 14178.736576130508}
Iteration 6
Losses {'ner': 12968.93688420851}
Iteration 7
Losses {'ner': 11808.980309726967}
Iteration 8
Losses {'ner': 10396.981286817761}
Iteration 9
Losses {'ner': 11289.080773398444}
Iteration 10
Losses {'ner': 10991.040005158824}
Iteration 11
Losses {'ner': 10396.361354378258}
Iteration 12
Losses {'ner': 10179.293207149705}
Iteration 13
Losses {'ner': 9042.717718382433}
Iteration 14
Losses {'ner': 9761.41774862583}
Iteration 15
Losses {'ner': 9401.479164277207}
Iteration 16
Losses {'ner': 8367.054381859714}
Iteration 17
Losses {'ner': 8708.115090990948}
Iteration 18
Losses {'ner': 7719.056371388364}
Iteration 19
Losses {'ner': 10943.083217549494}
Iteration 20
Losses {'ner': 9058.893553562737}
Iteration 21


In [52]:
def upload_resume():
  '''
  Let the user upload a resume and test the trained model on it.
  Only resumes in PDF format are accepted.
  '''
  base_dir = '/content'
  uploaded = files.upload()
  for fn in uploaded.keys():
    print(os.path.join(base_dir, fn))
    doc = fitz.open(os.path.join(base_dir, fn))
  text = ''
  for page in doc:
    text = text + str(page.getText())
  tx = ' '.join(text.split('\n')) # join once, after all pages have been read
  
  nlp_model = spacy.load('models') # load the previous trained model
  test = nlp_model(tx) # predict the input resume based on the model
  for ent in test.ents:
    print(f'{ent.label_.upper():{20}}- {ent.text}')
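
cleaning_dataset lowercases the training text and replaces newlines with spaces, while upload_resume only replaces newlines. Applying the same normalization at prediction time may bring the input closer to what the model saw during training; this is an assumption that has not been tested here. A minimal sketch with a hypothetical helper, `normalize_like_training`:

```python
def normalize_like_training(raw_text):
    # The same two steps cleaning_dataset applies: newlines to spaces, then lowercase.
    return ' '.join(raw_text.split('\n')).lower()

# Hypothetical extracted resume text.
sample = "Jane Doe\nData Analyst\nPython, SQL"
print(normalize_like_training(sample))
```

Note that the predicted entity text would then be printed in lowercase as well.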

In [53]:
upload_resume()

Saving Yusuf, Hafizh Rahmatdianto_Resume_OVO.pdf to Yusuf, Hafizh Rahmatdianto_Resume_OVO (1).pdf
/content/Yusuf, Hafizh Rahmatdianto_Resume_OVO.pdf
NAME                - Hafizh Rahmatdianto
DESIGNATION         - Yusuf
SKILLS              - programming, database manipulation, analytical thinking, and creative problem solving. Experienced at creating predictive models  for health-related problems. Strong attention to detail and possessing a significant ability to work in team environments, having led  student union organization during college period. Looking for a position related to data analytics or business intelligence.
DEGREE              - Bachelor of Biomedical Engineering
SKILLS              - Programming
SKILLS              - Language
SKILLS              - Databases  (MySQL)
COMPANIES WORKED AT - Microsoft
