<a href="https://colab.research.google.com/github/mk7890/Resume-Parsing-System/blob/main/ResumeParser_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Data Collection
Collect a variety of resumes in PDF format. You'll need a diverse dataset to ensure your model can generalize well.

2. Preprocessing
Convert PDF to text: You can use libraries like PyMuPDF, pdfminer, or pdfplumber (the one used in this notebook).

Clean the text: Remove unnecessary characters and normalize the text.

3. Feature Extraction
Tokenization: Split the text into individual words or tokens.

Named Entity Recognition (NER): Use NER to identify and classify entities in the text. Libraries like spaCy are excellent for this task.

Regular Expressions: For identifying specific patterns like phone numbers and emails.

4. Building the Model
Use a pre-trained language model like BERT as-is, or fine-tune it for your specific use case.

Train the model on annotated resumes where entities like name, job role, etc., are labeled.

5. Model Evaluation
Use metrics like precision, recall, and F1-score to evaluate your model's performance.

6. Saving and Deployment
Save the trained model using a library like joblib or pickle.

Deploy the model using Streamlit for an interactive web application.
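
The evaluation step (5) can be sketched in a few lines. This is a minimal illustration that assumes entities are compared as exact `(start, end, label)` spans; the function name `entity_prf` and the sample annotations are hypothetical and not part of the notebook.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall, and F1 over exact (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical annotations for one resume line: (start, end, label)
gold = [(0, 13, "PER"), (29, 35, "ORG"), (39, 46, "LOC")]
pred = [(0, 13, "PER"), (29, 35, "LOC"), (39, 46, "LOC")]  # second span has the wrong label

precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact-span matching is strict: a prediction with the right text but the wrong label counts as both a false positive and a false negative.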

# Loading Libraries

In [None]:
import pandas as pd
import numpy as np

PDF to Text Conversion

In [None]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.0/20.0 MB 50.3 MB/s eta 0:00:00
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.3


In [None]:
# pickle ships with the Python standard library; the pickle5 backport targets
# Python 3.5-3.7 and fails to build on current runtimes, so it is not installed here
!pip install spacy transformers pdfplumber joblib

In [None]:
!pip install pdfplumber

Collecting pdfplumber
  Using cached pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Using cached pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Using cached pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Using cached pdfplumber-0.11.5-py3-none-any.whl (59 kB)
Using cached pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
Using cached pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB)
Installing collected packages: pypdfium2, pdfminer.six, pdfplumber
Successfully installed pdfminer.six-20231228 pdfplumber-0.11.5 pypdfium2-4.30.1


In [None]:
import pdfplumber

def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() returns None for pages without a text layer
            text += (page.extract_text() or "") + "\n"
    return text.strip()


Data Collection & Preprocessing

In [None]:
# Load Skills and Job Role Data

def load_keywords(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        # Skip blank lines: an empty keyword would match every text
        return [line.strip().lower() for line in file if line.strip()]

skills_list = load_keywords("/content/unique_skills.txt")
job_roles_list = load_keywords("/content/unique_job_roles.txt")

# NER (Named Entity Recognition)

Fine-tune spaCy's NER model to extract required entities.

Train a Custom NER Model

Prepare Training Data

Format: ("Sentence text", {"entities": [(start, end, "ENTITY_TYPE")]})
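
The tuple format above relies on character offsets, which are easy to get wrong by hand. The sketch below derives them from substrings instead. The helper name `make_example` and the sample sentence are hypothetical; the labels reuse category names from this notebook.

```python
def make_example(text, spans):
    """Build a (text, {"entities": [...]}) tuple from (substring, label) pairs."""
    entities = []
    for phrase, label in spans:
        start = text.find(phrase)  # first occurrence only
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})

example = make_example(
    "Jane Doe worked as a Data Analyst at Acme Ltd.",
    [("Jane Doe", "APPLICANT_NAME"), ("Data Analyst", "JOB_ROLE"), ("Acme Ltd", "COMPANY")],
)
print(example)
```

Each `(start, end)` pair is a half-open slice, so `text[start:end]` returns the annotated phrase.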

Train the Model

# Use Pretrained BERT for Named Entity Recognition

Use Hugging Face's transformers pipeline for NER.

In [None]:
from transformers import pipeline

# Load the pretrained BERT NER model
bert_ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# List of keywords that indicate a company or organization
company_keywords = [
    "Solutions", "Institute", "Inc", "Ltd", "Limited", "Company", "Organization",
    "Group", "Corporation", "Technologies", "Systems", "Consulting", "University",
    "Enterprises", "Foundation", "Associates", "Partners", "Industries"
]

def extract_entities_bert(text):
    entities = bert_ner(text)
    extracted_info = {
        "APPLICANT_NAME": [],
        "COMPANY": [],
        "JOB_ROLE": [],
        "SKILL": [],
        "EDUCATION": [],
        "CERTIFICATION": [],
        "YEARS_EXPERIENCE": [],
        "LOCATION": [],
        "LINKEDIN": []
    }

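    # Map entity labels to desired categories.
    # Note: dslim/bert-base-NER is trained on CoNLL-2003 and only emits PER, ORG, LOC, and MISC,
    # so the JOB, SKILL, EDU, CERT, EXP, and URL entries below never match; most of those
    # fields are filled later by regex and keyword matching.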
    entity_mapping = {
        "PER": "APPLICANT_NAME",
        "ORG": "COMPANY",
        "JOB": "JOB_ROLE",
        "SKILL": "SKILL",
        "EDU": "EDUCATION",
        "CERT": "CERTIFICATION",
        "EXP": "YEARS_EXPERIENCE",
        "LOC": "LOCATION",
        "URL": "LINKEDIN",
    }

    for entity in entities:
        label = entity["entity_group"]
        word = entity["word"]
        mapped_label = entity_mapping.get(label)

        if mapped_label:
            if mapped_label == "COMPANY":
                # Check if word contains any company-related keyword
                if any(keyword.lower() in word.lower() for keyword in company_keywords):
                    extracted_info[mapped_label].append(word)
            else:
                extracted_info[mapped_label].append(word)

    # Convert lists to strings or keep them as lists if multiple entities exist
    for key in extracted_info:
        if len(extracted_info[key]) == 1:
            extracted_info[key] = extracted_info[key][0]
        elif len(extracted_info[key]) == 0:
            extracted_info[key] = None

    return extracted_info

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Extract Additional Features

Since the pretrained BERT model does not cover every entity type, regex handles phone numbers, emails, and LinkedIn profiles.

In [None]:
import re

def extract_additional_info(text, extracted_info):
    # Extract Phone Number
    phone_match = re.search(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)
    if phone_match:
        extracted_info["PHONE"] = phone_match.group()

    # Extract Email
    email_match = re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
    if email_match:
        extracted_info["EMAIL"] = email_match.group()

    # Extract LinkedIn Profile
    linkedin_match = re.search(r'https?://(www\.)?linkedin\.com/in/[A-Za-z0-9_-]+', text)
    if linkedin_match:
        extracted_info["LINKEDIN"] = linkedin_match.group()

    return extracted_info
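
The phone pattern above assumes a 10-digit, US-style number and the LinkedIn pattern requires an `http(s)://` prefix. The resume text printed later in this notebook has an international number and a LinkedIn URL without a scheme. A more tolerant variant might look like the sketch below; `PHONE_RE`, `LINKEDIN_RE`, and `extract_contacts` are illustrative names, and the phone pattern is deliberately permissive, so it can also match other long digit runs such as date ranges.

```python
import re

# Illustrative patterns (assumptions, not the notebook's originals)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # optional "+", at least 9 digits/separators
LINKEDIN_RE = re.compile(
    r"(?:https?://)?(?:www\.)?linkedin\.com/in/[A-Za-z0-9_-]+", re.IGNORECASE
)

def extract_contacts(text):
    """Return the first phone-like and LinkedIn-like match, or None for each."""
    phone = PHONE_RE.search(text)
    linkedin = LINKEDIN_RE.search(text)
    return {
        "PHONE": phone.group() if phone else None,
        "LINKEDIN": linkedin.group() if linkedin else None,
    }

sample = (
    "Phone: +254718695260\n"
    "LinkedIn: linkedin.com/in/moses-mugambi-njeru | GitHub: github.com/mk7890"
)
print(extract_contacts(sample))
```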


Load Skills & Job Roles for Matching

Use predefined job roles and skills for entity matching.

In [None]:
def load_keywords(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        # Skip blank lines: an empty keyword would match every text
        return [line.strip().lower() for line in file if line.strip()]

skills_list = load_keywords("/content/unique_skills.txt")
job_roles_list = load_keywords("/content/unique_job_roles.txt")

def match_keywords(text, keyword_list):
    return [word for word in keyword_list if word in text.lower()]

def extract_skills_job_roles(text, extracted_info):
    extracted_info["SKILLS"] = match_keywords(text, skills_list)
    extracted_info["JOB_ROLE"] = match_keywords(text, job_roles_list)
    return extracted_info
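
Because `match_keywords` tests plain substrings, a short keyword such as "pro" or "tri" matches inside longer words ("proficient", "metrics"), which adds noise to the skills list. One possible refinement, sketched here under the assumption that keywords are already lowercased (as `load_keywords` does), is to require non-alphanumeric characters on both sides. The name `match_keywords_whole` is hypothetical.

```python
import re

def match_keywords_whole(text, keyword_list):
    """Return keywords that occur in text as whole words or phrases."""
    lowered = text.lower()
    found = []
    for keyword in keyword_list:
        # Lookarounds instead of \b so keywords ending in symbols ("c++") still match
        pattern = r"(?<![a-z0-9])" + re.escape(keyword) + r"(?![a-z0-9])"
        if keyword and re.search(pattern, lowered):
            found.append(keyword)
    return found

keywords = ["python", "sql", "pro", "tri", "c++", "data analysis"]
text = "Proficient in Python, SQL and C++; strong in data analysis and metrics."
print(match_keywords_whole(text, keywords))
```

With substring matching the same input would also return "pro" and "tri".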


In [None]:
from datetime import date

def extract_years_experience(text):
    """
    Extracts years of experience from resume text using regex patterns.
    Returns the largest value found, or "Not Found" if nothing matches.
    """
    extracted_years = []

    # Explicit durations: "5 years of experience", "10+ yrs"
    for years in re.findall(r"(\d{1,2})\s*\+?\s*(?:years?|yrs?)\b", text, re.IGNORECASE):
        extracted_years.append(int(years))

    # Year ranges: "2015-2020", "2015 to 2020". Requiring four-digit years keeps
    # phone numbers and other digit pairs from being read as ranges.
    range_pattern = r"\b((?:19|20)\d{2})\s*(?:-|to)\s*((?:19|20)\d{2})\b"
    for start, end in re.findall(range_pattern, text, re.IGNORECASE):
        if int(end) >= int(start):
            extracted_years.append(int(end) - int(start))

    # Open-ended: "since 2015" counts up to the current year
    for start in re.findall(r"\bsince\s+((?:19|20)\d{2})\b", text, re.IGNORECASE):
        extracted_years.append(max(date.today().year - int(start), 0))

    # Get the max experience found
    return max(extracted_years, default="Not Found")



# Complete Resume Parsing Pipeline

Combine BERT NER + Regex + Keyword Matching:

In [None]:
import re
import json

def parse_resume(pdf_path):
    # Step 1: Extract text from PDF
    resume_text = extract_text_from_pdf(pdf_path)

    # Step 2: Extract entities using BERT
    extracted_info = extract_entities_bert(resume_text)

    # Step 3: Extract additional details (phone, email, LinkedIn)
    extracted_info = extract_additional_info(resume_text, extracted_info)

    # Step 4: Match skills and job roles
    extracted_info = extract_skills_job_roles(resume_text, extracted_info)

    # Step 5: Extract Companies Worked For
    extracted_info["COMPANIES_WORKED_FOR"] = extract_companies(resume_text)

    # Step 6: Extract Education Background
    extracted_info["EDUCATION"], extracted_info["EDUCATION_INSTITUTIONS"] = extract_education(resume_text)

    # ✅ Extract Years of Experience
    extracted_info["YEARS_EXPERIENCE"] = extract_years_experience(resume_text)

    # Ensure all required fields exist
    required_keys = [
        "APPLICANT_NAME", "JOB_ROLE", "PHONE", "EMAIL", "COMPANIES_WORKED_FOR",
        "YEARS_EXPERIENCE", "SKILLS", "REFEREES", "LINKEDIN", "CERTIFICATIONS",
        "EDUCATION", "EDUCATION_INSTITUTIONS"
    ]
    extracted_info = {key: extracted_info.get(key, "Not Found") for key in required_keys}

    # Clean Skills List
    extracted_info["SKILLS"] = clean_skills_list(extracted_info["SKILLS"])

    return extracted_info

# 🔹 Extract companies based on patterns & BERT
def extract_companies(text):
    company_patterns = r"\b(?:Solutions|Institute|Company|Inc|Ltd|Limited|Corp|Technologies|Consulting|Industries|Systems|Enterprises)\b"

    # ✅ Extract ORG entities from BERT (Fix: Use extract_entities_bert)
    entities = extract_entities_bert(text)
    company_names = entities.get("COMPANY") or []  # The value is None when BERT found no ORG entity

    # ✅ Extract based on patterns
    pattern_matches = re.findall(r"([A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s" + company_patterns + ")", text)

    # ✅ Combine & remove duplicates
    companies = list(set((company_names if isinstance(company_names, list) else [company_names]) + pattern_matches))
    return companies if companies else "Not Found"


# 🔹 Extract education details based on degrees & institutions
def extract_education(text):
    degree_keywords = r"\b(?:Diploma|Certificate|Bachelor(?:'s)?|BSc|MSc|PhD|Master(?:'s)?|Doctorate|Associate|Engineering|MBA|BS|MS|BA|MA)\b"

    # ✅ Extract ORG entities from BERT (universities are often tagged as ORG)
    entities = extract_entities_bert(text)
    institutions = entities.get("COMPANY", [])  # Universities often get tagged as ORG

    # ✅ Extract degrees from text
    degrees = re.findall(degree_keywords, text, re.IGNORECASE)

    return (", ".join(set(degrees)) if degrees else "Not Found", institutions if institutions else "Not Found")

# 🔹 Clean Skills List
def clean_skills_list(skills_list):
    """Remove noise from extracted skills."""
    stopwords = {"the", "and", "in", "of", "to", "for", "on", "by", "with", "as", "at", "or", "an", "a", "is", "it", "be"}
    return [skill for skill in skills_list if skill.lower() not in stopwords and len(skill) > 2]

# Test on a sample resume
parsed_data = parse_resume("/content/Moses Mugambi Data Analyst CV.pdf")

# Print each extracted entity on a new line
print("\nExtracted Resume Information:")
for key, value in parsed_data.items():
    print(f"{key}: {value if value != 'Not Found' else 'N/A'}")



Extracted Resume Information:
APPLICANT_NAME: Moses Mu
JOB_ROLE: ['data scientist']
PHONE: 2547186952
EMAIL: mugambimoses2@gmail.com
COMPANIES_WORKED_FOR: ['Technical University of Kenya']
YEARS_EXPERIENCE: N/A
SKILLS: ['excel', 'tri', 'data analysis', 'science', 'out', 'sql', 'mac', 'pro', 'machine learning', 'python', 'work', 'act', 'point', 'tech', 'web', 'certificate', 'analysis', 'school', 'project', 'con', 'data visualization', 'grade', 'engineering', 'cal', 'based', 'professional', 'end', 'base', 'data', 'data s', 'pre', 'you', 'mary', 'adapt', 'control', 'high school', 'visual', 'time', 'view', 'chi', 'roll', 'visualization', 'world', 'one', 'per', 'ana', 'drive', 'mma', 'lea', 'experience', 'series analysis', 'google', 'line', 'mail', 'prof', 'scientific', 'series', 'table', 'team', 'engineer', 'model', 'ken', 'rev', 'rom', 'fun', 'apps', 'learning', 'data science', 'quality', 'review', 'las', 'ada', 'act']
REFEREES: N/A
LINKEDIN: None
CERTIFICATIONS: N/A
EDUCATION: Engineeri

# Save the Model for Deployment

Save the BERT pipeline in joblib and pickle for future use.

In [None]:
import joblib
import pickle

# Save BERT NER pipeline using joblib
joblib.dump(bert_ner, "bert_resume_parser.joblib")

# Save using pickle
with open("bert_resume_parser.pkl", "wb") as f:
    pickle.dump(bert_ner, f)


# Load and Use the Saved Model

Reload the model when deploying:

In [None]:
# Load model from joblib
bert_ner = joblib.load("bert_resume_parser.joblib")

# Load model from pickle
with open("bert_resume_parser.pkl", "rb") as f:
    bert_ner = pickle.load(f)

# Test with a new resume
new_parsed_data = parse_resume("/content/Moses Mugambi Data Analyst CV.pdf")
# Print each extracted entity on a new line
print("\nExtracted Resume Information:")
for key, value in new_parsed_data.items():
    print(f"{key}: {value if value != 'Not Found' else 'N/A'}")



Extracted Resume Information:
APPLICANT_NAME: Moses Mu
JOB_ROLE: ['data scientist']
PHONE: 2547186952
EMAIL: mugambimoses2@gmail.com
COMPANIES_WORKED_FOR: ['Technical University of Kenya']
YEARS_EXPERIENCE: N/A
SKILLS: ['excel', 'tri', 'data analysis', 'science', 'out', 'sql', 'mac', 'pro', 'machine learning', 'python', 'work', 'act', 'point', 'tech', 'web', 'certificate', 'analysis', 'school', 'project', 'con', 'data visualization', 'grade', 'engineering', 'cal', 'based', 'professional', 'end', 'base', 'data', 'data s', 'pre', 'you', 'mary', 'adapt', 'control', 'high school', 'visual', 'time', 'view', 'chi', 'roll', 'visualization', 'world', 'one', 'per', 'ana', 'drive', 'mma', 'lea', 'experience', 'series analysis', 'google', 'line', 'mail', 'prof', 'scientific', 'series', 'table', 'team', 'engineer', 'model', 'ken', 'rev', 'rom', 'fun', 'apps', 'learning', 'data science', 'quality', 'review', 'las', 'ada', 'act']
REFEREES: N/A
LINKEDIN: None
CERTIFICATIONS: N/A
EDUCATION: Engineeri

# Implement CV Rating Based on Job Description
**Steps:**

1. Extract key requirements from the job description:
   - Skills (Python, SQL, Machine Learning, etc.)
   - Experience level (years of experience)
   - Education requirements
   - Certifications (AWS, PMP, etc.)
2. Compare resume vs. job description:
   - Match extracted skills, experience, and education
   - Assign weights to each category
   - Score the CV based on how well it matches the job

Extract Key Information from Job Description
First, create a function to extract keywords from a job description using NER, Regex, and NLP.

In [None]:
import re
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # Load Spacy NLP model

def extract_job_requirements(job_text):
    """
    Extracts key requirements (skills, experience, education) from job description.
    """
    doc = nlp(job_text)

    # Extract education (Bachelor, Master, PhD), lowercased for consistent comparison
    education_levels = ["bachelor", "master", "phd", "degree", "diploma"]
    extracted_education = [token.text.lower() for token in doc if token.text.lower() in education_levels]

    # Extract required skills by matching against the skills list loaded earlier;
    # a bare word regex would label every word in the posting as a skill
    extracted_skills = match_keywords(job_text, skills_list)

    # Extract years of experience
    experience = re.findall(r"(\d+)\s*(?:\+?\s*years?|yrs|years of experience)", job_text)

    job_requirements = {
        "EDUCATION_REQUIRED": list(set(extracted_education)),
        "SKILLS_REQUIRED": list(set(extracted_skills)),
        "EXPERIENCE_REQUIRED": max(map(int, experience), default=0)
    }

    return job_requirements




Score the Resume Against the Job Description
Now, compare the extracted resume details vs. job description requirements.

In [None]:
def rate_cv(resume_data, job_requirements):
    """
    Scores a resume based on how well it matches the job description.
    """
    score = 0
    total_weight = 0

    # ✅ Match Skills (case-insensitive; missing fields arrive as "Not Found" or None)
    skills = resume_data.get("SKILLS")
    resume_skills = {s.lower() for s in skills} if isinstance(skills, list) else set()
    job_skills = {s.lower() for s in job_requirements.get("SKILLS_REQUIRED", [])}

    skill_match = len(resume_skills & job_skills) / max(1, len(job_skills))  # fraction matched
    score += skill_match * 40  # Skills have 40% weight
    total_weight += 40

    # ✅ Match Education Level
    # parse_resume returns EDUCATION as a comma-separated string, so split it into
    # keywords first; calling set() on the string would compare single characters.
    education = resume_data.get("EDUCATION")
    if not isinstance(education, str) or education == "Not Found":
        education = ""
    resume_edu = {e.strip().lower().replace("'s", "") for e in education.split(",") if e.strip()}
    job_edu = {e.lower() for e in job_requirements.get("EDUCATION_REQUIRED", [])}

    education_match = 1 if resume_edu & job_edu else 0  # Full match if any degree matches
    score += education_match * 20  # Education has 20% weight
    total_weight += 20

    # ✅ Match Experience
    resume_exp = resume_data.get("YEARS_EXPERIENCE", 0)
    if not isinstance(resume_exp, (int, float)):  # e.g. "Not Found"
        resume_exp = 0
    job_exp = job_requirements.get("EXPERIENCE_REQUIRED", 0)

    experience_match = min(resume_exp / max(1, job_exp), 1)  # Cap at 100%
    score += experience_match * 30  # Experience has 30% weight
    total_weight += 30

    # ✅ Bonus for Certifications (only when the resume lists any)
    certifications = resume_data.get("CERTIFICATIONS")
    if certifications and certifications != "Not Found":
        score += 10
    total_weight += 10

    # Normalize Score
    final_score = (score / total_weight) * 100  # Convert to percentage

    return round(final_score, 2)


Modify parse_resume to Include CV Rating
Now, integrate job analysis + CV rating into your main function:

In [None]:
def parse_resume_and_rate(pdf_path, job_description):
    """
    Parses a resume, extracts details, and rates it against a job description.
    """
    # parse_resume expects the PDF path and extracts the text itself
    resume_data = parse_resume(pdf_path)  # Extract resume details

    # Extract Job Requirements
    job_requirements = extract_job_requirements(job_description)

    # Score Resume
    cv_score = rate_cv(resume_data, job_requirements)

    # Add score to extracted resume data
    resume_data["CV_SCORE"] = cv_score
    return resume_data


In [None]:
import pdfplumber

sample_resume_path = "/content/Moses Mugambi Data Analyst CV.pdf"

try:
    with pdfplumber.open(sample_resume_path) as pdf:
        print("✅ PDF file opened successfully!")
except FileNotFoundError:
    print("❌ FileNotFoundError: The file path is incorrect.")
except Exception as e:
    print(f"❌ Another error occurred: {e}")



✅ PDF file opened successfully!


In [None]:
# Sample Job Description
job_description = """
We are looking for a Data Scientist with 5+ years of experience in Machine Learning, Python, and SQL.
The ideal candidate should have a Master's degree in Computer Science or a related field.
Familiarity with cloud platforms (AWS, GCP) is a plus.
"""

sample_resume_path = "/content/Moses Mugambi Data Analyst CV.pdf"  # Provide the actual file path
print(type(sample_resume_path))  # Should be <class 'str'>

def debug_parse_resume(path):
    print(f"📂 Trying to open file at: {repr(path)}")
    with pdfplumber.open(path) as pdf:
        return "✅ Opened successfully!"

debug_parse_resume(sample_resume_path)



<class 'str'>
📂 Trying to open file at: '/content/Moses Mugambi Data Analyst CV.pdf'


'✅ Opened successfully!'

In [None]:
import pdfplumber
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extract_resume_text(pdf_path):
    """Extracts text from the resume PDF."""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            # Extract each page once, then drop pages that returned no text
            page_texts = [page.extract_text() for page in pdf.pages]
            text = "\n".join(t for t in page_texts if t)
        return text if text else "No text found."
    except Exception as e:
        return f"Error reading file: {e}"

def summarize_resume(resume_text):
    """Extracts key sections: skills, experience, education from resume text."""

    summary = {}

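    # Extract Experience (Look for 'Experience', 'Work History', 'Projects')
    # Note: these patterns are unanchored and capture only the rest of the matching line,
    # so 'experience' can also match inside a word such as 'Experienced'.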
    experience_match = re.search(r"(experience|work history|projects):?\s*(.+)", resume_text, re.IGNORECASE)
    summary["Experience"] = experience_match.group(2) if experience_match else "Not Found"

    # Extract Skills (Look for 'Skills', 'Technical Skills', etc.)
    skills_match = re.search(r"(skills|technical skills):?\s*(.+)", resume_text, re.IGNORECASE)
    summary["Skills"] = skills_match.group(2) if skills_match else "Not Found"

    # Extract Education
    education_match = re.search(r"(education|academic background):?\s*(.+)", resume_text, re.IGNORECASE)
    summary["Education"] = education_match.group(2) if education_match else "Not Found"

    # Create final description
    resume_description = (
        f"Experience: {summary['Experience']}\n"
        f"Skills: {summary['Skills']}\n"
        f"Education: {summary['Education']}\n"
    )

    return resume_description

def compare_resume_with_job(resume_text, job_description):
    """Compares resume description with job description using TF-IDF similarity."""

    documents = [resume_text, job_description]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)

    # Compute cosine similarity
    similarity_score = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]

    return round(similarity_score * 100, 2)  # Convert to percentage

# === 🚀 Test the Code on a Sample Resume & Job Description ===
resume_pdf_path = "/content/Moses Mugambi Data Analyst CV.pdf"  # 🔹 Replace with actual file path
job_description = """
We are looking for a Data Scientist with expertise in Python, SQL, and machine learning.
The ideal candidate should have experience in data visualization, NLP, and time series analysis.
"""

# 1️⃣ Extract Resume Text
resume_text = extract_resume_text(resume_pdf_path)
print("\n📄 Extracted Resume Text:\n", resume_text[:500], "...")  # Print first 500 chars

# 2️⃣ Summarize Resume
resume_summary = summarize_resume(resume_text)
print("\n📌 Resume Summary:\n", resume_summary)

# 3️⃣ Compare Resume with Job Description
match_score = compare_resume_with_job(resume_summary, job_description)
print("\n✅ CV Match Score:", match_score, "%")



📄 Extracted Resume Text:
 Moses Mugambi
Data Scientist
Email: mugambimoses2@gmail.com | Phone: +254718695260
LinkedIn: linkedin.com/in/moses-mugambi-njeru | GitHub: github.com/mk7890
Professional Summary
Motivated Data Scientist with expertise in Python, data analysis, visualizations, and machine
learning. Proficient in SQL, Tableau, Excel, and web scraping. Experienced in predictive
modelling, clustering, and NLP. Passionate about deriving insights from data to solve real-world
problems.
Work Experience
• Resume Parser. ...

📌 Resume Summary:
 Experience: d in predictive
Skills: • Hard Skills: Python (Pandas, NumPy, Scikit-Learn), SQL, Tableau, Machine Learning,
Education: Data Science Full - Time


✅ CV Match Score: 22.89 %
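
The summary output above shows the limits of unanchored patterns: "Experience" matched inside "Experienced", and only the remainder of that line was captured. A possible alternative, sketched below, treats a line as a section heading only when the whole line equals a known heading, then collects the lines that follow. The heading list, the function name `split_sections`, and the sample text are assumptions made for illustration.

```python
SECTION_HEADINGS = ["professional summary", "work experience", "skills", "education"]  # assumed headings

def split_sections(resume_text):
    """Group non-empty lines under the most recent line that is exactly a known heading."""
    sections, current = {}, None
    for line in resume_text.splitlines():
        heading = line.strip().rstrip(":").lower()
        if heading in SECTION_HEADINGS:
            current = heading
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in sections.items()}

# Hypothetical resume excerpt
sample = """Professional Summary
Experienced in predictive modelling and NLP.
Work Experience
Resume Parser project
Skills
Python, SQL, Tableau
"""
sections = split_sections(sample)
print(sections["work experience"])
```

Real resumes vary in heading wording, so the heading list would need to be extended for each dataset.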


# Saving the model

Saving the Model using Pickle (.pkl)

In [None]:
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample job description for training the model
job_description = """
We are looking for a Data Scientist with expertise in Python, SQL, and machine learning.
The ideal candidate should have experience in data visualization, NLP, and time series analysis.
"""

# Train the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit([job_description])  # Fit on sample job description

# Save the vectorizer model using Pickle
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

print("✅ Model saved as tfidf_vectorizer.pkl")

✅ Model saved as tfidf_vectorizer.pkl


Loading and Using the Model

In [None]:
# Load the trained model
with open("tfidf_vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)

# Test with new resume text
resume_text = """
Data Scientist skilled in Python, SQL, and NLP. Experienced in machine learning and time series analysis.
"""
job_description_vector = loaded_vectorizer.transform([job_description])
resume_vector = loaded_vectorizer.transform([resume_text])

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_score = cosine_similarity(resume_vector, job_description_vector)[0][0]

print("✅ CV Match Score:", round(similarity_score * 100, 2), "%")


✅ CV Match Score: 76.8 %
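
One thing to keep in mind when reading this score: the vectorizer was fitted only on the job description, so its vocabulary contains only job-description words, and `transform` ignores any resume word outside that vocabulary. That tends to raise the similarity. The sketch below illustrates the effect with raw term counts rather than TF-IDF; the helper names and the two sample strings are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(text_a, text_b, vocabulary=None):
    """Cosine similarity of raw term counts, optionally restricted to a vocabulary."""
    a, b = Counter(tokens(text_a)), Counter(tokens(text_b))
    if vocabulary is not None:
        a = Counter({t: c for t, c in a.items() if t in vocabulary})
        b = Counter({t: c for t, c in b.items() if t in vocabulary})
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

job = "python sql machine learning"
resume = "python sql cooking gardening painting"

full = cosine(resume, job)                                    # every word counts
job_only = cosine(resume, job, vocabulary=set(tokens(job)))   # unrelated resume words dropped
print(round(full, 3), round(job_only, 3))
```

Fitting the vectorizer on a larger corpus of resumes and job descriptions would keep unrelated resume content in the comparison.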


Saving the Model using Joblib (.joblib)

In [None]:
import joblib

# Save the model with Joblib
joblib.dump(vectorizer, "tfidf_vectorizer.joblib")

print("✅ Model saved as tfidf_vectorizer.joblib")


✅ Model saved as tfidf_vectorizer.joblib


## Saving a Hugging Face Transformers Model
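For BERT-based NER, your model likely consists of:

- A Transformer model (AutoModelForTokenClassification or pipeline("ner")).
- A Tokenizer (AutoTokenizer).
- Any additional preprocessing logic.

1. Save using joblib
joblib pickles the whole pipeline object. Its efficiency gains apply mainly to large NumPy arrays, so for a PyTorch-backed pipeline expect little difference from plain pickle. Files saved this way are tied to the installed transformers and torch versions; the pipeline's own save_pretrained method is the more portable option.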

In [None]:
import joblib
from transformers import pipeline

# Initialize the Hugging Face pipeline (if not already done)
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Save the model using joblib
joblib.dump(ner_pipeline, "bert_ner_pipeline.joblib")

print("✅ BERT NER model saved successfully using joblib!")


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


✅ BERT NER model saved successfully using joblib!


2. Save using pickle (Alternative)
pickle works as well. joblib builds on pickle, so for a PyTorch-backed pipeline the two usually behave similarly.

In [None]:
import pickle

# Save the pipeline with pickle
with open("bert_ner_pipeline.pkl", "wb") as f:
    pickle.dump(ner_pipeline, f)

print("✅ BERT NER model saved successfully using pickle!")


✅ BERT NER model saved successfully using pickle!


✅ Loading the Model

To load the model and use it for inference:

In [None]:
# Load from joblib
ner_pipeline = joblib.load("bert_ner_pipeline.joblib")

# Test on a sample text
text = "John Doe is a data scientist at Google, working on NLP."
result = ner_pipeline(text)

# Print each entity on a separate line
print("NER Output:")
for entity in result:
    print(f"Entity: {entity['word']}, Label: {entity.get('entity_group', entity.get('entity'))}, Score: {entity['score']:.4f}")


NER Output:
Entity: John, Label: I-PER, Score: 0.9996
Entity: Do, Label: I-PER, Score: 0.9993
Entity: ##e, Label: I-PER, Score: 0.9965
Entity: Google, Label: I-ORG, Score: 0.9990
Entity: NL, Label: I-MISC, Score: 0.6309
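
The output above is token-level: "Doe" is split into the WordPiece tokens "Do" and "##e" because this pipeline was created without `aggregation_strategy`. Passing `aggregation_strategy="simple"`, as the earlier `bert_ner` cell does, lets the pipeline group tokens itself. For already collected token-level results, a manual merge could look like the sketch below. The function name `merge_wordpieces` and the choice to keep the lowest score of the merged pieces are assumptions.

```python
def merge_wordpieces(entities):
    """Merge '##' continuation pieces into the preceding token with the same label."""
    merged = []
    for ent in entities:
        word, label = ent["word"], ent["entity"]
        if word.startswith("##") and merged and merged[-1]["entity"] == label:
            merged[-1]["word"] += word[2:]
            # Assumption: report the least confident piece as the merged score
            merged[-1]["score"] = min(merged[-1]["score"], ent["score"])
        else:
            merged.append({"word": word, "entity": label, "score": ent["score"]})
    return merged

# Token-level results copied from the output above
raw = [
    {"word": "John", "entity": "I-PER", "score": 0.9996},
    {"word": "Do", "entity": "I-PER", "score": 0.9993},
    {"word": "##e", "entity": "I-PER", "score": 0.9965},
    {"word": "Google", "entity": "I-ORG", "score": 0.9990},
    {"word": "NL", "entity": "I-MISC", "score": 0.6309},
]
for ent in merge_wordpieces(raw):
    print(ent["word"], ent["entity"], ent["score"])
```

This only rejoins subword pieces; "John" and "Doe" remain separate entries, whereas the pipeline's own aggregation also groups adjacent words of the same entity.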


Save the Model to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import shutil

# Source path (where the model is saved in Colab)
source_path = "/content/bert_ner_pipeline.joblib"

# Destination path (inside Google Drive)
destination_path = "/content/drive/MyDrive/bert_ner_pipeline_copy.joblib"

# Copy the file instead of moving
shutil.copy(source_path, destination_path)

print(f"Model copied to {destination_path}")


Model copied to /content/drive/MyDrive/bert_ner_pipeline_copy.joblib


In [None]:
# Source path (where the model is saved in Colab)
source_path = "/content/bert_resume_parser.pkl"

# Destination path (inside Google Drive)
destination_path = "/content/drive/MyDrive/bert_ner_pipeline_copy.pkl"

# Copy the file instead of moving
shutil.copy(source_path, destination_path)

print(f"Model copied to {destination_path}")


Model copied to /content/drive/MyDrive/bert_ner_pipeline_copy.pkl


Deploy the BERT-based NER model as a web app using Streamlit. The app will:

✅ Allow users to upload a resume (PDF file)
✅ Take a job description as input
✅ Extract entities from the resume (NER output)
✅ Compare it with the job description & provide a CV score

📌 Deployment Approach
Framework: Streamlit (Fast and simple UI)
Backend: Uses Hugging Face Transformers for NER
Storage: No need for a database, process files in memory
Deployment Options: Streamlit Cloud, Hugging Face Spaces, or Render

🚀 Steps to Deploy
- Prepare the Model
  - Load the BERT-based NER model (joblib or pickle)
  - Use pdfplumber to extract text from the PDF resume
- Build the Streamlit App
  - Upload PDF
  - Enter Job Description
  - Extract Named Entities from the resume
  - Compute a CV Score based on entity matching
- Deploy on Streamlit Cloud or Hugging Face Spaces

# 📌 Streamlit App Code (deployable)

In [None]:
import streamlit as st
import joblib
import pdfplumber
from transformers import pipeline
import spacy
import re

# Load the NER model (pre-trained BERT) and spaCy once; Streamlit reruns the script on every interaction
@st.cache_resource
def load_models():
    ner_model = joblib.load("/content/bert_ner_pipeline.joblib")  # Change path if needed
    spacy_model = spacy.load("en_core_web_sm")
    return ner_model, spacy_model

model, nlp = load_models()

# Function to extract text from PDF
def extract_text_from_pdf(pdf_file):
    text = ""
    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            # extract_text() returns None for pages without a text layer
            text += (page.extract_text() or "") + "\n"
    return text.strip()

# Function to extract entities
def extract_named_entities(text):
    result = model(text)
    return result  # List of entities

# Function to score resume based on job description match
def compute_cv_score(entities, job_desc):
    # Skip stopwords, punctuation, and whitespace so they do not dilute the score
    job_tokens = set([
        token.lemma_ for token in nlp(job_desc.lower())
        if not (token.is_stop or token.is_punct or token.is_space)
    ])
    entity_words = set([re.sub(r"[^a-zA-Z0-9]", "", ent["word"].lower()) for ent in entities])

    common_words = job_tokens.intersection(entity_words)
    score = len(common_words) / len(job_tokens) * 100 if job_tokens else 0
    return round(score, 2)

# Streamlit UI
st.title("📄 AI-Powered Resume Parser & CV Scorer")
st.write("Upload your resume and enter the job description to analyze your fit.")

# Upload Resume
uploaded_file = st.file_uploader("Upload Resume (PDF)", type=["pdf"])

# Enter Job Description
job_desc = st.text_area("Enter Job Description")

# Process when button is clicked
if st.button("Analyze Resume"):
    if uploaded_file and job_desc:
        resume_text = extract_text_from_pdf(uploaded_file)
        st.subheader("Extracted Resume Text")
        st.write(resume_text[:1000] + "...")  # Show only first 1000 characters

        # Extract Named Entities
        entities = extract_named_entities(resume_text)
        st.subheader("Extracted Named Entities")
        for entity in entities:
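            # Aggregated pipelines report 'entity_group'; token-level pipelines report 'entity'
            st.write(f"**Entity:** {entity['word']} | **Label:** {entity.get('entity_group', entity.get('entity'))} | **Score:** {entity['score']:.4f}")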

        # Compute CV Score
        cv_score = compute_cv_score(entities, job_desc)
        st.subheader(f"🔍 CV Match Score: {cv_score}%")
        st.progress(cv_score / 100)

    else:
        st.warning("Please upload a resume and enter a job description!")


How to Deploy

1️⃣ Run Locally

Save the script as app.py

Install dependencies:

In [None]:
!pip install streamlit transformers pdfplumber spacy joblib
!python -m spacy download en_core_web_sm

Start the app from a terminal (bash):

In [None]:
streamlit run app.py
