<a href="https://colab.research.google.com/github/mk7890/Resume-Parsing-System/blob/main/ResumeParsingSystem_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing Dependencies

In [None]:
!pip install spacy transformers flair torch
!python -m spacy download en_core_web_sm

Collecting flair
  Downloading flair-0.15.1-py3-none-any.whl.metadata (12 kB)
Collecting boto3>=1.20.27 (from flair)
  Downloading boto3-1.36.20-py3-none-any.whl.metadata (6.7 kB)
Collecting conllu<5.0.0,>=4.0 (from flair)
  Downloading conllu-4.5.3-py2.py3-none-any.whl.metadata (19 kB)
Collecting ftfy>=6.1.0 (from flair)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting langdetect>=1.0.9 (from flair)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting mpld3>=0.3 (from flair)
  Downloading mpld3-0.5.10-py3-none-any.whl.metadata (5.1 kB)
Collecting pptree>=3.1 (from flair)
  Downloading pptree-3.1.tar.gz (3.0 kB)
  Preparing metadata (setup.py) ... done
Collecting pytorch-revgrad>=0.2.0 (from flair)
  Downloading pytorch_revgrad-0.2.0-py3-none-any.whl.metadata (1.7 kB)


# Named Entity Recognition using Hybrid Approach: spaCy and BERT

Steps in the hybrid approach:
- Use spaCy and regular expressions for well-defined fields (Name, Phone, Email, LinkedIn, Certifications).
- Use BERT and Flair for contextual entities (Job Role, Skills, Experience, Companies Worked For, Education).
- Merge the results into a single parsed-resume record.

In [None]:
import spacy
import re
import torch
import pandas as pd
import json
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
from flair.data import Sentence
from flair.models import SequenceTagger

# Load spaCy NER model
nlp_spacy = spacy.load("en_core_web_sm")

# Load BERT-based NER model (Hugging Face)
model_name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
bert_ner = pipeline("ner", model=model, tokenizer=tokenizer)

# Load Flair NER model
flair_tagger = SequenceTagger.load("flair/ner-english")

# Regular expressions for key entities
EMAIL_REGEX = r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+"
PHONE_REGEX = r"\+?\d{10,15}"
LINKEDIN_REGEX = r"(https?:\/\/)?(www\.)?linkedin\.com\/[a-zA-Z0-9\-_/]+"

# Extract structured entities using spaCy (name, organisations) and regex (contact details)
def extract_structured_entities(text):
    doc = nlp_spacy(text)
    name, certifications = None, []

    for ent in doc.ents:
        if ent.label_ == "PERSON" and name is None:  # Keep only the first PERSON entity
            name = ent.text.strip()
        elif ent.label_ == "ORG":
            # Rough proxy: en_core_web_sm has no certification label,
            # so this collects every ORG span it finds
            certifications.append(ent.text.strip())

    email = re.search(EMAIL_REGEX, text)
    phone = re.search(PHONE_REGEX, text)
    linkedin = re.search(LINKEDIN_REGEX, text)

    return {
        "Applicant Name": name,
        "Phone": phone.group(0) if phone else None,
        "Email": email.group(0) if email else None,
        "LinkedIn Profile": linkedin.group(0) if linkedin else None,
        # dict.fromkeys de-duplicates while keeping first-seen order (set() order is not stable)
        "Certifications": ", ".join(dict.fromkeys(certifications)) if certifications else None
    }

# Extract contextual entities using BERT
def extract_contextual_entities(text):
    job_role, skills, companies, experience = None, [], [], None

    # Run BERT NER; "simple" aggregation merges word pieces into whole entity spans
    bert_results = bert_ner(text, aggregation_strategy="simple")
    for entity in bert_results:
        entity_text = entity["word"]
        entity_type = entity["entity_group"]  # Label without the B-/I- prefix

        # dslim/bert-base-NER only predicts PER, ORG, LOC and MISC, so the JOB_TITLE,
        # SKILL and DATE branches stay empty unless a model with those labels is loaded.
        if entity_type == "JOB_TITLE" and job_role is None:
            job_role = entity_text
        elif entity_type == "ORG" and entity_text.lower() not in ["linkedin", "gmail"]:
            companies.append(entity_text)
        elif entity_type == "SKILL":
            skills.append(entity_text)
        elif entity_type == "DATE" and "year" in entity_text.lower():
            experience = entity_text

    return {
        "Job Role": job_role,
        # dict.fromkeys de-duplicates while keeping first-seen order
        "Skills": ", ".join(dict.fromkeys(skills)) if skills else None,
        "Companies Worked For": ", ".join(dict.fromkeys(companies)) if companies else None,
        "Years of Work Experience": experience
    }

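# Extract education details using Flair
def extract_flair_entities(text):
    sentence = Sentence(text)
    flair_tagger.predict(sentence)

    education, institutions = [], []

    # flair/ner-english predicts PER, LOC, ORG and MISC (see the tag dictionary it logs on load),
    # so "EDUCATION" never fires and MISC is only a rough stand-in for qualifications.
    for entity in sentence.get_spans("ner"):
        if entity.tag == "ORG":
            institutions.append(entity.text)
        elif entity.tag in ["MISC", "EDUCATION"]:
            education.append(entity.text)

    return {
        # dict.fromkeys de-duplicates while keeping first-seen order
        "Education Background": ", ".join(dict.fromkeys(education)) if education else None,
        "Education Institutions": ", ".join(dict.fromkeys(institutions)) if institutions else None
    }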

# Full resume parsing function
def parse_resume(text):
    structured_data = extract_structured_entities(text)
    contextual_data = extract_contextual_entities(text)
    flair_data = extract_flair_entities(text)

    # Merge results
    parsed_resume = {**structured_data, **contextual_data, **flair_data, "Referees": None}

    return parsed_resume

# Example usage
resume_text = """
MOSES MUGAMBI NJERU
Electrical, Electronics and Instrumentation Engineer
Phone +254718695260 LinkedIn linkedin.com/in/moses-mugambi-njeru
Email mugambimoses2@gmail.com Location Nairobi, Kenya
SUMMARY
Self-motivated Electrical, Electronics, and Instrumentation Engineer with 4+ years of experience designing, debugging, installing, and testing electrical and electronics systems, with a keen interest in industry 4.0. Possess excellent written and verbal communication skills, time management, and a wide range of technical skills. Enjoy being part of a productive team to attain set goals, and thrive in high-pressure and challenging environments. Adept at using CAD software for schematics/PCB design and systems testing utilizing advanced electrical tools such as digital multimeters, clamp meters, oscilloscopes, signal generators, process meters, megger insulation tester, and a HART communicator.
EXPERIENCE
Hilftech Solutions, Nairobi, Kenya
Electrical Installation, Consultation, Repair Engineer August 2022 – Present
• Troubleshooted a faulty solar power system, identified a failed solar charge controller and advised a client on a better MPPT (Maximum Power Point Tracking) charge controller instead of a PWM (Pulse Width Modulation) based charging system.
• Designed and built a 24V lithium battery array, and a battery management system (BMS) for a power backup workshop project.
• Installed four CCTV monitoring systems.
• Performed residential electrical installations.
• Offered consultation advice on 20hp motor and Star-Delta control circuitry replacement for a concrete mixture equipment.
• Conducted electrical and electronics systems hardware repairs and software installation and debugging of smart TVs, smartphones, PCs, Audio systems, Ovens, and Refrigerators.
Golden Africa Kenya Ltd, Mombasa Rd, Kenya
Electrical, Electronics & Instrumentation Engineer Intern, April 2022 – July 2022
• Participated in the installation, wiring, and testing of 9 induction motors and 3 servo motors successfully.
• Assisted in the interpretation of a wiring diagram of a control panel for an automatic blow molding and injection molding machine, helping solve proximity and temperature sensor faults.
• Assisted in the maintenance of a high voltage/power solar system (650V, 1.1 Megawatts).
• Successfully fault-corrected 3 Variable Frequency Drives for motor control systems.
• Supported in the troubleshooting of a faulty printed circuit board (PCB) for a packaging machine, successfully identifying and replacing a faulty solid-state relay.
• Helped install and successfully test a motor drive.
• Proficiently employed electrical tools such as digital multimeter, clamp meter, process meter, HART communicator, megger meter insulation tester, and infrared contactless thermometer to gather and modify relevant sensor/system data as necessary.
• Aided in the repair of faulty electronic equipment including high voltage insect repellants, welding machine, angle grinders, hand dryers, ventilation fans, incinerator, and product labelling machine.
• Facilitated the successful installation and calibration of temperature sensors, flow meters, level sensors, hydraulic and pneumatic sensors, and gauges.
• Provided Support in the programming of PID (Proportional Integral and Differential) controllers to regulate the operation of a steam power plant.
• Participated in the installation and programming of a PLC (Programmable Logic Controller) using ladder diagram logic to control an oil-water filtration process.
• Wrote documentation for process readings including; power plant kilowatt-hour daily tabulation, and periodic sensor values for flow, level, pressure, and temperature for relevant processes.
Hilftech Solutions, Nairobi, Kenya
Installation and Repair Electrical and Electronics Engineer, Feb 2020 – Mar 2022
• Carried out general electrical wiring for 5 residential houses.
• Successfully Installed a Solar Power System for a residential home.
• PCB design, prototyping, and fabrication of custom LED (Light Emitting Diode) drive and control boards for advertisement.
• Troubleshoot and successfully repaired faulty electronic equipment including; TVs, Sub Woofers, Public Address Systems, Ovens, Cookers, Solar Inverters, Microwave ovens, Power Supplies (Drives) for LEDs, and ATX Computer power supplies among others.
• Advised clients on the most suitable Electrical and Electronic equipment to purchase for residential uses, such as solar panels, solar batteries, solar charge controllers and inverters, and lighting systems.
Diesel Power Company Ltd, Eldoret, Kenya
Sales, and Customer Service Agent, Sept 2019 – Jan 2020
• Contributed to improving the sale of motor vehicle products such as power steering oil, engine lubricants, brake oil, and lead acid battery electrolytes by over 20%.
Schindler Ltd, Nairobi, Kenya
Electrical and Electronics Engineer Intern, Sept 2018 – Dec 2018
• Participated in wiring, installation, and testing of elevator components including motors, motor control drives (Variable Speed Drive), Load Cells, Proximity Sensors, Automatic Evacuation Systems, HMI (Human Machine Interface) control panels, backup inverters, control electronics cards, batteries, and switch mode power supplies.
• Participated in the commissioning of 4 elevators.
SKILLS
• Strong understanding of electricity fundamentals.
• Skilled at handling high power/Current/Voltage systems (500V – 800V).
• Strong understanding of properties of both high-power AC and DC currents
• Programming skills (PLC ladder logic, C++, Python).
• Schematics and Printed Circuit Board design using CAD software.
• BOM (Bill of Materials) generation and budgeting.
• Proficient in the use of advanced tools used within the Electrical field like multimeters, clamp meter, process meter, HART communicator, and megger insulation tester meter.
• AutoCAD 2D and 3D modelling.
• Detail-oriented.
• Strong organizational skills.
• Teamwork.
EDUCATION
B.Eng. Electrical and Electronics Engineering (Second Class Honors Upper Division)
The Technical University of Kenya, Nairobi, Kenya
2013 – 2018
The Kenya Certificate of Secondary Education, Grade A plain of 81 points
Moi High School Mbiruri, Embu, Kenya
2008 – 2011
REFEREES
Robert Kioko
Electrical Supervisor, Golden Africa Kenya Ltd
0722466299
Michael Kayeka
Instrumentation Engineer, Golden Africa Kenya Ltd
0708397319
Francis Mwaniki
Electrical Engineer, Hilftech Solutions
0713630601
Conrad Ambani
Electrical and Electronics Engineer (Field Supervisor), Schindler Ltd Kenya
0787989012
"""

parsed_resume = parse_resume(resume_text)

# Print formatted JSON output
print(json.dumps(parsed_resume, indent=4))

# Convert to DataFrame if multiple resumes
df = pd.DataFrame([parsed_resume])
print(df)


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


2025-02-13 23:45:12,225 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
{
    "Applicant Name": "Location Nairobi",
    "Phone": "+254718695260",
    "Email": "mugambimoses2@gmail.com",
    "LinkedIn Profile": "linkedin.com/in/moses-mugambi-njeru",
    "Certifications": "Electrical and Electronics, Power Supplies, SKILLS, Ovens, Public Address Systems, EDUCATION, Moi High School Mbiruri, \u2022 Schematics, Customer Service, control electronics cards, PID, Electrical, Electronics, Sub Woofers, Maximum Power Point Tracking, Star-Delta, PCB, ATX Computer, AC, Proximity Sensors, \u2022 Programming, MPPT, CAD, \u2022 Wrote, \u2022 Teamwork, Instrumentation Engineer, Audio, \u2022 Successfully Installed a Solar Power System, NJERU\nElectrical, Electronics and Instrumentation Engineer\nPhone, The Technical University, \u2022 Assisted, Hilftech Solutio

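The output above shows the first spaCy PERSON entity resolving to "Location Nairobi", and section headers such as SKILLS and EDUCATION leaking into the certifications field. One possible mitigation, sketched here as an assumption rather than as part of the notebook's pipeline, is to split the resume on its upper-case section headers before any NER runs and to take the first non-empty line as the candidate name. The helper names `split_sections` and `guess_name`, and the header list itself, are hypothetical.

```python
# Hypothetical header list, taken from the section titles of the sample resume.
SECTION_HEADERS = ("SUMMARY", "EXPERIENCE", "SKILLS", "EDUCATION", "REFEREES")


def split_sections(text, headers=SECTION_HEADERS):
    """Group resume lines under the most recent section header seen.

    Lines that appear before the first header are stored under "HEADER".
    """
    sections = {"HEADER": []}
    current = "HEADER"
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line in headers:
            current = line
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections


def guess_name(sections):
    """Return the first line of the header block as the candidate name."""
    header_lines = sections.get("HEADER", [])
    return header_lines[0] if header_lines else None


sample = """
MOSES MUGAMBI NJERU
Electrical, Electronics and Instrumentation Engineer
SKILLS
• Teamwork.
EDUCATION
The Technical University of Kenya, Nairobi, Kenya
"""

sections = split_sections(sample)
print(guess_name(sections))
print(sections["EDUCATION"])
```

Each section could then be passed to the extractor that suits it, for example only the EDUCATION block to the institution extractor. The first-line heuristic assumes the resume opens with the applicant's name, which will not hold for every layout.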
In [None]:
!pip install spacy flair transformers sentence-transformers scikit-learn joblib pandas
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
!pip install tqdm
!pip install pymupdf


Collecting pymupdf
  Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.3


In [None]:
import spacy
import pandas as pd
import joblib
import fitz  # PyMuPDF for PDF parsing
import re
from flair.models import SequenceTagger
from flair.data import Sentence
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [None]:
# Load NLP Models
nlp_spacy = spacy.load("en_core_web_sm")
tagger_flair = SequenceTagger.load("flair/ner-english-large")
ner_pipeline_bert = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

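# Regex patterns for structured extraction
EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
PHONE_PATTERN = r"\+?\d[\d -]{8,15}\d"
# Scheme is optional so that bare "linkedin.com/in/..." profiles also match;
# the hyphen sits last in the character class so it cannot be read as a range
LINKEDIN_PATTERN = r"(?:https?:\/\/)?(?:www\.)?linkedin\.com\/[a-zA-Z0-9_/-]+"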



2025-02-14 00:25:13,198 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
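As a quick sanity check, the contact patterns can be run against the contact lines of the sample resume without loading any model. This is a minimal sketch: the `first_match` helper is hypothetical, and the LinkedIn pattern shown is a variant that makes the `https://` scheme optional, since the sample resume lists the profile without one.

```python
import re

EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
PHONE_PATTERN = r"\+?\d[\d -]{8,15}\d"
# Variant (assumption): optional scheme, hyphen placed last in the character class
LINKEDIN_PATTERN = r"(?:https?://)?(?:www\.)?linkedin\.com/[a-zA-Z0-9_/-]+"


def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Contact lines copied from the sample resume
contact = (
    "Phone +254718695260 LinkedIn linkedin.com/in/moses-mugambi-njeru\n"
    "Email mugambimoses2@gmail.com Location Nairobi, Kenya"
)

print(first_match(EMAIL_PATTERN, contact))
print(first_match(PHONE_PATTERN, contact))
print(first_match(LINKEDIN_PATTERN, contact))
```

Returning the matched string rather than the `re.Match` object keeps the parsed record printable and serialisable to JSON or CSV.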


In [None]:
def extract_text_from_pdf(pdf_path):
    """Extracts text from a given PDF resume."""
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text("text") + "\n"
    return text

def extract_text_features(text):
    """Extracts structured features using regex & NLP models."""
    doc_spacy = nlp_spacy(text)
    sentence_flair = Sentence(text)
    tagger_flair.predict(sentence_flair)
    entities_flair = {entity.text: entity.tag for entity in sentence_flair.get_spans("ner")}
    # "simple" aggregation merges word pieces, so the keys are whole entity spans
    entities_bert = {
        ent["word"]: ent["entity_group"]
        for ent in ner_pipeline_bert(text, aggregation_strategy="simple")
    }

    extracted_data = {
        "Applicant Name": None,
        "Job Role": None,
        "Phone": None,
        "Email": None,
        "Companies Worked For": [],
        "Years of Work Experience": None,
        "Skills": [],
        "Referees": [],
        "LinkedIn Profile": None,
        "Certifications": [],
        "Education Background": None,
        "Education Institutions": [],
    }

    # Extract using spaCy
    for ent in doc_spacy.ents:
        if ent.label_ == "PERSON" and not extracted_data["Applicant Name"]:
            extracted_data["Applicant Name"] = ent.text
        elif ent.label_ == "ORG":
            extracted_data["Companies Worked For"].append(ent.text)
        elif ent.label_ == "GPE":
            # GPE marks countries and cities, so this is only a rough proxy for institutions
            extracted_data["Education Institutions"].append(ent.text)
        elif ent.label_ == "DATE" and "years" in ent.text:
            extracted_data["Years of Work Experience"] = ent.text

    # Extract job role using Flair/BERT.
    # Both loaded checkpoints predict only PER, LOC, ORG and MISC, so MISC acts as a rough
    # proxy here; WORK_OF_ART, JOB_TITLE and SKILL never fire with these models.
    for word, tag in entities_flair.items():
        if tag in ["MISC", "WORK_OF_ART"]:
            extracted_data["Job Role"] = word

    for word, tag in entities_bert.items():
        if "MISC" in tag or "JOB_TITLE" in tag:
            extracted_data["Job Role"] = word

    # Extract contact details; store the matched string, not the re.Match object
    email = re.search(EMAIL_PATTERN, text)
    phone = re.search(PHONE_PATTERN, text)
    linkedin = re.search(LINKEDIN_PATTERN, text)
    extracted_data["Email"] = email.group(0) if email else None
    extracted_data["Phone"] = phone.group(0) if phone else None
    extracted_data["LinkedIn Profile"] = linkedin.group(0) if linkedin else None

    # Extract skills (stays empty unless the Flair model emits a SKILL tag)
    extracted_data["Skills"] = [word for word, tag in entities_flair.items() if tag == "SKILL"]

    return extracted_data

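# 🔹 **Step 1: Process Resume Dataset**
def process_resumes(df, text_col="text"):
    # Fail early, listing the available columns, if the text column is missing
    if text_col not in df.columns:
        raise KeyError(f"Column {text_col!r} not found; available columns: {list(df.columns)}")

    cleaned_data = []
    for _, row in df.iterrows():
        parsed_data = extract_text_features(row[text_col])
        parsed_data["Raw Text"] = row[text_col]
        parsed_data["Job Role"] = row["Job Role"]  # Dataset label replaces the NER guess
        cleaned_data.append(parsed_data)

    return pd.DataFrame(cleaned_data)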

# 🔹 **Step 2: Train Resume Classification Model**
def train_resume_classifier(df):
    tfidf = TfidfVectorizer(max_features=5000)
    X = tfidf.fit_transform(df["Raw Text"]).toarray()
    y = df["Job Role"]

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X, y)

    # Save model & vectorizer
    joblib.dump(clf, "resume_classifier.pkl")
    joblib.dump(tfidf, "tfidf_vectorizer.pkl")
    print("✅ Resume classifier saved!")

# 🔹 **Step 3: Score Resume Against Job Description**
def score_resume_against_job_description(resume_text, job_description):
    tfidf = joblib.load("tfidf_vectorizer.pkl")
    vectorized = tfidf.transform([resume_text, job_description]).toarray()

    similarity_score = cosine_similarity([vectorized[0]], [vectorized[1]])[0][0]

    return round(similarity_score * 100, 2)

# 🔹 **Step 4: Process New PDF Resume**
def process_new_resume(pdf_path, job_description):
    resume_text = extract_text_from_pdf(pdf_path)
    extracted_data = extract_text_features(resume_text)

    # Load classification model
    clf = joblib.load("resume_classifier.pkl")
    tfidf = joblib.load("tfidf_vectorizer.pkl")

    # Predict job role
    X_test = tfidf.transform([resume_text]).toarray()
    predicted_role = clf.predict(X_test)[0]
    extracted_data["Job Role"] = predicted_role

    # Score against job description
    score = score_resume_against_job_description(resume_text, job_description)

    return extracted_data, score

# 🔹 **Step 5: Save Processed Data**
def save_parsed_data(df):
    df.to_csv("cleaned_parsed_resume_data2.csv", index=False)
    print("✅ Cleaned Resume Data Saved!")
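Step 3 scores a resume by the cosine similarity between its vector and the job description's vector. To make the arithmetic concrete, here is a sketch on raw term counts using only the standard library. It deliberately simplifies: there is no IDF weighting and the tokenizer is a plain whitespace split, so the numbers will differ from what `TfidfVectorizer` produces. The function name `cosine_score` is hypothetical.

```python
import math
from collections import Counter


def cosine_score(text_a, text_b):
    """Cosine similarity of two texts' term-count vectors, as a percentage."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product over the terms the two texts share
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared)

    # Euclidean norms of each count vector
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare

    return round(100 * dot / (norm_a * norm_b), 2)


print(cosine_score("electrical engineer", "electrical engineer"))
print(cosine_score("electrical engineer", "electrical technician"))
print(cosine_score("electrical engineer", "sales agent"))
```

Because the score depends only on shared vocabulary, a resume that describes the same work in different words scores low, which is worth keeping in mind when reading the percentage.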

In [None]:
# 🚀 **Run Full Pipeline**
if __name__ == "__main__":
    # Load structured resume dataset
    resume_df = pd.read_csv("/content/final_clean_resume.csv")

    # Extract features from structured resumes
    parsed_df = process_resumes(resume_df)

    # Train and save classifier
    train_resume_classifier(parsed_df)

    # Save parsed resume data
    save_parsed_data(parsed_df)

    # Process new PDF resume and classify job role
    sample_pdf = "/content/MOSES_MUGAMBI_Electrical_Electronics_&_Instrumentation_Engineer_CV.pdf"
    job_desc = '''About the job
    Description

    Zutari: Co-creating an engineered impact.

    Zutari is a well-established, management-owned engineering firm with almost 90 years' experience. As human-centred engineering consultants and advisors, we are trusted by our clients, business partners, communities and other stakeholders across Africa.

    We co-create engineering solutions that have a positive impact and improve people's lives. Zutari values inclusion and recognises the importance of a diverse, talented workforce, believing that people need other people to succeed.

    What kind of talent do we pursue?

    We employ people with the right attitude and a positive mindset, who are motivated by doing the right thing, getting things done and share a sense of urgency. People who have an impact in our teams and broader community. People who think differently and connect with those around them to co-create new opportunities and leave a meaningful legacy.

    Role Responsibilities

    Will support the Electrical Design team in accurate and quality driven delivery of various projects
    Communicates and coordinates with all Disciplines and Stakeholders on a daily basis, and when required the ability to collaborate with teams across other Zutari Global offices.
    Demonstrates an understanding of budgeting and supporting requirements for project / bids
    Excellent communication skills and the ability to liaise directly with Clients, Statutory Authorities, and other 3rd parties as required.
    Demonstrates ownership qualities, for the team quality, and the timely delivery of projects
    Demonstrates sound understanding of other related building design disciplines
    A good understanding of local, international codes and standards, including those applicable to South Africa, Middle East & East Africa.
    Representation at client and professional meetings and being able to present work and engage in technical conversations.
    Analysing project requirements in the context of various codes and standard and develop conceptual designs through to a set of construction information.
    Working with other teams to ensure compliant designs are delivered to agreed programmes.
    Manage and coordinate interdisciplinary interfaces


    Minimum Requirements

    Electrical Engineering Degree
    3-5 years of post-graduate experience
    Minimum of five (2) within the East Africa Countries
    Full understanding of standard industry software including but not limited to BIM 360.
    Experience in International Standards including but not limited to Eurocodes, US design codes, Kenya building regulation is a an added advantage.
    Proficiency in Design and BIM / Digital software including Revit and other electrical design software.
    Experience of working with International Consultants
    Excellent written and verbal communication skills.
    Computer literate and knowledgeable in the use of Ms Office Suite.
    Excellent people skills including interpersonal, communication and presentation skills.
    Be a team player, approachable and confident and willing to collaborate across multiple disciplines and international geographies.


    We believe that a diverse workforce is key to our business success. We seek the best people for our
    jobs based on their skills, qualifications, and experience. We embrace the principle of equal
    opportunity in employment and work towards eliminating all forms of unlawful discrimination in our
    employment practices.'''
    extracted_info, resume_score = process_new_resume(sample_pdf, job_desc)

    print(f"🔍 Extracted Info: {extracted_info}")
    print(f"📊 Resume Score: {resume_score}%")

KeyError: 'text'
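The `KeyError` indicates that the DataFrame loaded from `/content/final_clean_resume.csv` has no column named `text`. The notebook does not show the file's actual column names, so the candidates below are placeholders. A minimal sketch of a lookup that either finds a usable column or reports what is available (the helper name `find_text_column` is hypothetical):

```python
def find_text_column(columns, candidates=("text", "Text", "resume_text", "Resume")):
    """Return the first candidate name present in `columns`.

    The candidate names are illustrative guesses, not the dataset's real schema.
    """
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(
        f"None of {candidates} found; available columns: {list(columns)}"
    )


# With pandas this would be called as find_text_column(resume_df.columns)
print(find_text_column(["ID", "Resume", "Job Role"]))
```

Printing `resume_df.columns` right after `pd.read_csv` is the quickest way to confirm which name to use.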