<a href="https://colab.research.google.com/github/hrishikeshmalkar/Resume_parsing_with_spaCy_sparknlp/blob/main/Resume_parser_spaCy_Spark_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Resume Parser using spaCy and Spark-NLP**
#### --- By Hrishikesh Malkar

##### Importing Required Libraries

In [1]:
import pandas as pd
import spacy
import pickle
import random
import re
import json
import logging

##### Reading data


In [2]:
df = pd.read_json('Entity Recognition in Resumes.json', lines=True)
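For orientation, each line of the file appears to hold one JSON object with `content`, `annotation`, and `extras` fields, as the column names and the conversion code in this notebook suggest. The sketch below parses a single made-up record with the standard `json` module; the resume text, label, and offsets are invented for illustration and are not taken from the dataset.

```python
import json

# Hypothetical one-line record; the real file holds one such object per line.
sample_line = json.dumps({
    "content": "Jane Doe\nData Engineer",
    "annotation": [
        {"label": ["Name"],
         "points": [{"start": 0, "end": 7, "text": "Jane Doe"}]},
    ],
    "extras": None,
})

record = json.loads(sample_line)
for annotation in record["annotation"]:
    # Each annotation carries a list of labels and a list of character spans.
    point = annotation["points"][0]
    print(annotation["label"], point["start"], point["end"], repr(point["text"]))
```

Passing `lines=True` to `pd.read_json` tells pandas to treat the file as this kind of newline-delimited JSON.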

##### Preview of Data

In [3]:
df.head(5)

Unnamed: 0,content,annotation,extras
0,Abhishek Jha\nApplication Development Associat...,"[{'label': ['Skills'], 'points': [{'start': 12...",
1,Afreen Jamadar\nActive member of IIIT Committe...,"[{'label': ['Email Address'], 'points': [{'sta...",
2,"Akhil Yadav Polemaina\nHyderabad, Telangana - ...","[{'label': ['Skills'], 'points': [{'start': 37...",
3,Alok Khandai\nOperational Analyst (SQL DBA) En...,"[{'label': ['Skills'], 'points': [{'start': 80...",
4,Ananya Chavan\nlecturer - oracle tutorials\n\n...,"[{'label': ['Degree'], 'points': [{'start': 20...",


##### Removing '\n' (newline) characters from the data

In [4]:
# Vectorized replacement; assigning the whole column avoids the chained
# indexing that triggers pandas' SettingWithCopyWarning.
df["content"] = df["content"].str.replace("\n", " ", regex=False)


##### Preview after removing new line characters

In [5]:
df.head()

Unnamed: 0,content,annotation,extras
0,Abhishek Jha Application Development Associate...,"[{'label': ['Skills'], 'points': [{'start': 12...",
1,Afreen Jamadar Active member of IIIT Committee...,"[{'label': ['Email Address'], 'points': [{'sta...",
2,"Akhil Yadav Polemaina Hyderabad, Telangana - E...","[{'label': ['Skills'], 'points': [{'start': 37...",
3,Alok Khandai Operational Analyst (SQL DBA) Eng...,"[{'label': ['Skills'], 'points': [{'start': 80...",
4,Ananya Chavan lecturer - oracle tutorials Mum...,"[{'label': ['Degree'], 'points': [{'start': 20...",
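Replacing each newline with a single space, instead of deleting it, keeps every string the same length, so the character offsets stored in the `annotation` column still point at the same text. A minimal sketch with a made-up string and a hypothetical span:

```python
# Made-up resume text; the span below is hypothetical and end-exclusive.
text = "Jane Doe\nData Engineer\n\nPython, SQL"
start, end = 9, 22  # covers "Data Engineer"

flattened = text.replace("\n", " ")

# Same length, so the span selects the same characters before and after.
print(len(text) == len(flattened))
print(repr(text[start:end]), repr(flattened[start:end]))
```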


##### Function for converting the annotated JSON file into the required format (i.e. Dataturks to spaCy)

In [6]:
def convert_to_spacy(JSON_FilePath):
    try:
        training_data = []
        lines=[]
        with open(JSON_FilePath,encoding="utf8") as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['content'].replace("\n", " ")
            entities = []
            data_annotations = data['annotation']
            if data_annotations is not None:
                for annotation in data_annotations:
                    #only a single point in text annotation.
                    point = annotation['points'][0]
                    labels = annotation['label']
                    # handle both list of labels or a single label.
                    if not isinstance(labels, list):
                        labels = [labels]

                    for label in labels:
                        point_start = point['start']
                        point_end = point['end']
                        point_text = point['text']
                        
                        lstrip_diff = len(point_text) - len(point_text.lstrip())
                        rstrip_diff = len(point_text) - len(point_text.rstrip())
                        if lstrip_diff != 0:
                            point_start = point_start + lstrip_diff
                        if rstrip_diff != 0:
                            point_end = point_end - rstrip_diff
                        entities.append((point_start, point_end + 1 , label))
            training_data.append((text, {"entities" : entities}))
        return training_data
    except Exception as e:
        logging.exception("Unable to process " + JSON_FilePath + "\n" + "error = " + str(e))
        return None
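The offset arithmetic in `convert_to_spacy` can be checked in isolation. The sketch below repeats the same steps for a single point: it shifts the span inward past leading and trailing whitespace, then adds 1 because the function treats the stored `end` as inclusive while Python slicing is end-exclusive. The helper name `to_spacy_span` and the sample text are hypothetical.

```python
def to_spacy_span(point):
    """Convert one annotation point (inclusive end) to (start, end_exclusive)."""
    text = point["text"]
    # Number of whitespace characters padding the annotated text on each side.
    lstrip_diff = len(text) - len(text.lstrip())
    rstrip_diff = len(text) - len(text.rstrip())
    start = point["start"] + lstrip_diff
    end = point["end"] - rstrip_diff
    return start, end + 1


# Made-up content; the annotated text includes two spaces on each side.
content = "Name:  Jane Doe  Role: Engineer"
point = {"start": 5, "end": 16, "text": content[5:17]}

start, end = to_spacy_span(point)
print((start, end), repr(content[start:end]))
```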

##### Function for Removing leading & trailing white spaces from entity spans.


> **Args: data (list)**: The data to clean, in spaCy training format.

> **Returns: list**: The cleaned data, with whitespace trimmed from both ends of every entity span.

In [7]:
def trim_entity_spans(data: list) -> list:
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])
    return cleaned_data      
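To see what the trimming does to a single span, here is a condensed sketch of the same two loops. It uses `str.isspace` as a stand-in for the `\s` regex, which is an approximation, and the helper name `trim_span` and the sample text are hypothetical.

```python
def trim_span(text, start, end):
    """Move start forward and end backward until neither touches whitespace."""
    while start < len(text) and text[start].isspace():
        start += 1
    while end > 1 and text[end - 1].isspace():
        end -= 1
    return start, end


# Made-up text; the original span (7, 22) includes padding on both sides.
text = "Skills:  Python, SQL  Education"
start, end = trim_span(text, 7, 22)
print(repr(text[7:22]), "->", repr(text[start:end]))
```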

##### Converting our Data in Required Format

In [8]:
conv_data=convert_to_spacy('Entity Recognition in Resumes.json')

##### Preview of sample cleaned data in our required format

In [9]:
data = trim_entity_spans(conv_data)
data[0]

["Abhishek Jha Application Development Associate - Accenture  Bengaluru, Karnataka - Email me on Indeed: indeed.com/r/Abhishek-Jha/10e7a8cb732bc43a  • To work for an organization which provides me the opportunity to improve my skills and knowledge for my individual and company's growth in best possible ways.  Willing to relocate to: Bangalore, Karnataka  WORK EXPERIENCE  Application Development Associate  Accenture -  November 2017 to Present  Role: Currently working on Chat-bot. Developing Backend Oracle PeopleSoft Queries for the Bot which will be triggered based on given input. Also, Training the bot for different possible utterances (Both positive and negative), which will be given as input by the user.  EDUCATION  B.E in Information science and engineering  B.v.b college of engineering and technology -  Hubli, Karnataka  August 2013 to June 2017  12th in Mathematics  Woodbine modern school  April 2011 to March 2013  10th  Kendriya Vidyalaya  April 2001 to March 2011  SKILLS  C (Le

##### Creating a blank NLP model with spaCy

In [10]:
nlp = spacy.blank('en')

##### Function for training our model

In [11]:
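def train_model(train_data):
    # Note: this function uses the spaCy v2 training API
    # (create_pipe, begin_training, update with raw texts and annotation dicts).
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)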
    
        for _, annotation in train_data:
            for ent in annotation['entities']:
                ner.add_label(ent[2])
            
    
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(50):
            print("Starting iteration " + str(itn))
            random.shuffle(train_data)
            losses = {}
            for text, annotations in train_data:
                try:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,  # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                except Exception:
                    # Skip examples that spaCy cannot process.
                    pass

            print(losses)

##### Model Training

In [12]:
train_model(data)

Starting iteration 0
{'ner': 16465.25536238796}
Starting iteration 1
{'ner': 15781.170626723966}
Starting iteration 2
{'ner': 11655.217217006186}
Starting iteration 3
{'ner': 10356.949478027476}
Starting iteration 4
{'ner': 9056.847333362526}
Starting iteration 5
{'ner': 8167.247733932967}
Starting iteration 6
{'ner': 7052.855012010317}
Starting iteration 7
{'ner': 7046.052151583607}
Starting iteration 8
{'ner': 7772.087388741171}
Starting iteration 9
{'ner': 5940.375102305628}
Starting iteration 10
{'ner': 6444.602800411875}
Starting iteration 11
{'ner': 6142.710391694642}
Starting iteration 12
{'ner': 5186.33132863332}
Starting iteration 13
{'ner': 5560.764955562939}
Starting iteration 14
{'ner': 6170.0963053789765}
Starting iteration 15
{'ner': 4828.416610520465}
Starting iteration 16
{'ner': 5565.6608713211435}
Starting iteration 17
{'ner': 5056.673536251201}
Starting iteration 18
{'ner': 5082.950300696691}
Starting iteration 19
{'ner': 4034.710897199544}
{'ne
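The `try`/`except` inside the training loop hides why some resumes fail to update. One plausible cause, which is an assumption here and not something the notebook confirms, is overlapping entity spans, which spaCy's NER does not accept. The sketch below shows one way to filter them before training; the helper name `drop_overlapping` and the sample spans are hypothetical, and keeping the longest span is only one of several reasonable policies.

```python
def drop_overlapping(entities):
    """Keep longer spans first and skip any span that overlaps a kept one."""
    kept = []
    # Sort by descending length, then by start offset.
    for start, end, label in sorted(entities, key=lambda e: (e[0] - e[1], e[0])):
        overlaps = any(start < k_end and end > k_start for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, label))
    return sorted(kept)


# Made-up spans: the second one lies inside the first.
entities = [(0, 11, "Name"), (5, 11, "Skills"), (12, 22, "Designation")]
print(drop_overlapping(entities))
```

Logging the exception instead of passing silently would show whether this is the actual failure mode.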

##### Saving the model to disk

In [13]:
nlp.to_disk('/content/my_nlp_model')

##### Loading Saved Model with the help of spaCy

In [14]:
nlp_model = spacy.load('/content/my_nlp_model')  # same path used by nlp.to_disk above

##### Model Testing

In [15]:
data[1][0]

"Nazish Alam Consultant - SAP ABAP  Ghaziabad, Uttar Pradesh - Email me on Indeed: indeed.com/r/Nazish-Alam/ b06dbac9d6236221  Willing to relocate to: Delhi, Delhi - Noida, Uttar Pradesh - Lucknow, Uttar Pradesh  WORK EXPERIENCE  Consultant  SAP ABAP -  Noida, Uttar Pradesh -  November 2016 to Present  Credence Systems, Noida  Credence Systems is IT Infrastructure Management Company, offers end-to-end solutions. Combining deep domain expertise with new technologies and a cost effective on-site/ offshore model. Helping companies integrate key business processes, improving their operational efficiencies and extracting, better business value from their investment.  PROJECT UNDERTAKEN Client ECC Version Role and Responsibilities Welspun Group Plate & Coil Mills Division SAP ECC 6.0  Consultant  SAP ABAP -  January 2016 to Present  Reports: • Designed technical program specifications based on business requirements. • Generated basic lists and Interactive Reports for information in the MM/SD

##### Function for cleaning text data

In [16]:
def clean_data(text):
    # Remove anything enclosed in parentheses or square brackets.
    return re.sub(r"[\(\[].*?[\)\]]", "", text)
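A quick check of what the regular expression removes, using a made-up string. Note that the pattern does not require the opening and closing characters to match, so `(` can be closed by `]`, and the surrounding spaces are left in place.

```python
import re


def clean_data(text):
    # Same pattern as above: drop text enclosed in () or [].
    return re.sub(r"[\(\[].*?[\)\]]", "", text)


sample = "SAP ABAP (ECC 6.0) consultant [2016]"  # made-up input
cleaned = clean_data(sample)
print(repr(cleaned))

# Mismatched delimiters are removed as well.
mixed = clean_data("a (b] c")
print(repr(mixed))
```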

In [17]:
data1=clean_data(data[1][0])
doc = nlp_model(data1)

##### Visualizing entities with displaCy

In [18]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

##### Getting Entities

In [19]:
for ent in doc.ents:
    print(f'{ent.label_.upper():{30}}- {ent.text}')

NAME                          - Nazish Alam
DESIGNATION                   - Consultant
COMPANIES WORKED AT           - SAP ABAP
LOCATION                      - Ghaziabad
EMAIL ADDRESS                 - indeed.com/r/Nazish-Alam/ b06dbac9d6236221
DESIGNATION                   - Consultant
COMPANIES WORKED AT           - SAP ABAP
DESIGNATION                   - Consultant
COMPANIES WORKED AT           - SAP ABAP
SKILLS                        - SAP , ABAP , ADBC , C++ , DATA MODELING
SKILLS                        - • Trained on SAP S4 HANA. • Having knowledge of Code Push down, CDS view and it's consumption in ABAP. • Data Modeling, creation of different type of views. • AMDP. • ADBC connectivity. • Familiar with SQL, DDL, DML syntaxes. • Work on Windows 7, Windows XP, Windows 8, Windows 10 OS, can work on C, C++
GRADUATION YEAR               - 2015
DEGREE                        - Master of Computer Application
COLLEGE NAME                  - UPTU. India
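The flat listing above repeats some label and text pairs. For downstream use, a parser would usually collect the entities into one record per resume. A minimal sketch, assuming the entities arrive as `(label, text)` pairs such as `(ent.label_.upper(), ent.text)`; the helper name `group_entities` is hypothetical and the sample pairs are copied from the first rows of the output.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group entity texts by label, dropping exact duplicates but keeping order."""
    grouped = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


pairs = [
    ("NAME", "Nazish Alam"),
    ("DESIGNATION", "Consultant"),
    ("COMPANIES WORKED AT", "SAP ABAP"),
    ("LOCATION", "Ghaziabad"),
    ("DESIGNATION", "Consultant"),
    ("COMPANIES WORKED AT", "SAP ABAP"),
]
print(group_entities(pairs))
```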


# **Spark-NLP**

##### Installing dependencies

In [20]:
!pip install spark-nlp==3.0.0
!pip install pyspark



##### Importing Required Libraries

In [21]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

##### Starting a Spark NLP session

In [22]:
spark = sparknlp.start()

##### Preview of Sample Data

In [23]:
data[1][0]

"Nazish Alam Consultant - SAP ABAP  Ghaziabad, Uttar Pradesh - Email me on Indeed: indeed.com/r/Nazish-Alam/ b06dbac9d6236221  Willing to relocate to: Delhi, Delhi - Noida, Uttar Pradesh - Lucknow, Uttar Pradesh  WORK EXPERIENCE  Consultant  SAP ABAP -  Noida, Uttar Pradesh -  November 2016 to Present  Credence Systems, Noida  Credence Systems is IT Infrastructure Management Company, offers end-to-end solutions. Combining deep domain expertise with new technologies and a cost effective on-site/ offshore model. Helping companies integrate key business processes, improving their operational efficiencies and extracting, better business value from their investment.  PROJECT UNDERTAKEN Client ECC Version Role and Responsibilities Welspun Group Plate & Coil Mills Division SAP ECC 6.0  Consultant  SAP ABAP -  January 2016 to Present  Reports: • Designed technical program specifications based on business requirements. • Generated basic lists and Interactive Reports for information in the MM/SD

##### Creating Spark DataFrame

In [24]:
data2= spark.createDataFrame([[data[1][0]]]).toDF('text')

In [25]:
data2.show(truncate=False)


In [26]:
document = DocumentAssembler().setInputCol('text').setOutputCol('document').setCleanupMode('shrink')

In [27]:
sentence = SentenceDetector().setInputCols('document').setOutputCol('sentence')

In [28]:
sentence.setExplodeSentences(True)

SentenceDetector_b049295c6a93

##### Tokenizing data

In [29]:
tokenizer = Tokenizer().setInputCols('sentence').setOutputCol('token')

In [30]:
tokenizer.setExceptions(['e-mail'])

Tokenizer_eedf2261a320

In [31]:
checker= NorvigSweetingModel.pretrained().setInputCols(['token']).setOutputCol('checked')

spellcheck_norvig download started this may take some time.
Approximate size to download 4.2 MB
[OK!]


##### Embedding tokens with a pretrained WordEmbeddingsModel

In [32]:
embeddings=WordEmbeddingsModel.pretrained().setInputCols(['sentence','checked']).setOutputCol('embeddings')

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


##### NER

In [33]:
ner = NerDLModel.pretrained().setInputCols(['sentence','checked','embeddings']).setOutputCol('ner')

ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]


In [34]:
converter = NerConverter().setInputCols(['sentence','checked','ner']).setOutputCol('chunk')

##### Importing Pipeline from pyspark.ml

In [35]:
from pyspark.ml import Pipeline

##### Creating Spark-NLP Pipeline

In [36]:
pipeline = Pipeline().setStages([document, sentence, tokenizer, checker, embeddings, ner, converter])
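The order of the stages matters because each annotator reads columns that an earlier stage must already have produced. The sketch below is a plain-Python check of that wiring, not part of Spark NLP; the column names are copied from the cells above, and the helper `check_wiring` is hypothetical.

```python
# (stage name, input columns, output column), as configured in the cells above.
stages = [
    ("document", ["text"], "document"),
    ("sentence", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("checker", ["token"], "checked"),
    ("embeddings", ["sentence", "checked"], "embeddings"),
    ("ner", ["sentence", "checked", "embeddings"], "ner"),
    ("converter", ["sentence", "checked", "ner"], "chunk"),
]


def check_wiring(stages, initial_columns):
    """Raise ValueError if a stage reads a column no earlier stage produced."""
    available = set(initial_columns)
    for name, inputs, output in stages:
        missing = [col for col in inputs if col not in available]
        if missing:
            raise ValueError(f"stage {name!r} is missing input columns {missing}")
        available.add(output)
    return available


print(sorted(check_wiring(stages, ["text"])))
```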

##### Model Building

In [37]:
model = pipeline.fit(data2)

In [38]:
result = model.transform(data2)

##### Outcome

In [39]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|             checked|          embeddings|                 ner|               chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Nazish Alam Consu...|[{document, 0, 32...|[{document, 0, 40...|[{token, 0, 5, Na...|[{token, 0, 5, Na...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 21, N...|
|Nazish Alam Consu...|[{document, 0, 32...|[{document, 406, ...|[{token, 406, 414...|[{token, 406, 414...|[{word_embeddings...|[{named_entity, 4...|                  []|
|Nazish Alam Consu...|[{document, 0, 32...|[{document, 506, ...|[{token, 506, 512...|[{token, 506, 512...|[{word_embeddings...|[{named_entity, 5...|  

In [40]:
result.select('sentence.result').show(truncate=False)

+------+
|result|
+------+

In [41]:
result.select('checked.result').show(truncate=False)

+------+
|result|

In [42]:
result.select('ner.result').show(truncate=False)

+------+
|result|
+------+
|[B-ORG, I-ORG, I-ORG, O, B-ORG, I-ORG, I-ORG, O, O, B-LOC, O, O, O, O, O, O, O, O, O, O, O, O, O, B-LOC, O, B-L

In [43]:
result.select('ner.begin', 'ner.end').show(truncate=False)

+-----+---+
|begin|end|

In [44]:
result.select('chunk.result','chunk.begin', 'chunk.end').show(truncate=False)

+------+-----+---+
|result|begin|end|
+------+-----+---+
|[Nazish Alam Consultant, SAP ABAP Ghaziabad, Uttar Pradesh, Delhi, Delhi, Pradesh, Lucknow, Pradesh, SAP ABAP, Uttar 
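The `chunk` column is derived from the token-level IOB tags in the `ner` column: a `B-` tag opens a chunk, following `I-` tags of the same type extend it, and `O` closes it. The sketch below illustrates that idea in plain Python. It is a conceptual illustration, not Spark NLP's `NerConverter` implementation, and the pairing of tokens with the first six tags is an assumption.

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, entity_type) pairs."""
    chunks, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        entity_type = tag[2:] if tag != "O" else None
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and entity_type != current_type
        )
        if tag == "O" or starts_new:
            # Close the chunk that was being built, if any.
            if current:
                chunks.append((" ".join(current), current_type))
            current, current_type = [], None
        if tag != "O":
            current.append(token)
            current_type = entity_type
    if current:
        chunks.append((" ".join(current), current_type))
    return chunks


# Assumed alignment of the first tokens with the first tags shown above.
tokens = ["Nazish", "Alam", "Consultant", "-", "SAP", "ABAP"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O", "B-ORG", "I-ORG"]
print(iob_to_chunks(tokens, tags))
```

This also shows why the pretrained general-purpose model merges the name and job title into a single ORG chunk: the converter only follows the tags it is given.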

##### Sample testing

In [45]:
Light=LightPipeline(model)

In [46]:
Light.annotate(data[2][0])

{'checked': ['Nitin',
  'Tr',
  'PeopleSoft',
  'Consultant',
  'Bangalore',
  'Urban',
  ',',
  'Karnataka',
  '-',
  'Email',
  'me',
  'on',
  'Indeed',
  ':',
  'indeed.com/r/Nitin-Tr/e7e3a2f5b4c1e24e',
  'An',
  'ecommerce',
  'website',
  'I',
  'built',
  'as',
  'my',
  'college',
  'project',
  '.',
  'The',
  'website',
  'contains',
  'all',
  'the',
  'basic',
  'elements',
  'of',
  'an',
  'ecommerce',
  'website',
  'which',
  'are',
  'The',
  'landing',
  'page',
  ',',
  'categorization',
  'of',
  'items',
  'based',
  'on',
  'filters',
  ',',
  'basic',
  'session',
  'level',
  'security',
  ',',
  'product',
  'page',
  ',',
  'part',
  ',',
  'share',
  'button',
  ',',
  'empty',
  'cart',
  'button',
  ',',
  'pagination',
  'etc',
  '.',
  'It',
  'consists',
  'of',
  'a',
  'separate',
  'seller',
  'accounts',
  'where',
  'sellers',
  'can',
  'register',
  'and',
  'later',
  'upload',
  'their',
  'products',
  'to',
  'be',
  'sold',
  ',',
  'which',
