<a href="https://colab.research.google.com/github/mk7890/Resume-Parsing-System/blob/main/ResumeParsingSystem_DataCleaning2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Overview
Resume Parsing System

The system will include text extraction, entity recognition, data cleaning, synthetic data generation, classification, CV rating, and chatbot-based CV improvement.

📌 Project Breakdown & Approach

1. Load and Process Resumes (PDF Dataset)

✅ Load resume PDFs from a dataset

✅ Extract text using pdfplumber, pdfminer, PyMuPDF (fitz), etc

✅ Save extracted text to a CSV file

2. Check and Fill Missing Values

✅ Identify missing values in extracted fields:
Applicant Name, Job Role, Email, Phone Number, Companies Worked, Work Experience, Skills, Education, Certifications, Referees

✅ Generate synthetic values for missing fields using:
Faker (for realistic names, emails, phone numbers)
Randomized industry-relevant values for missing experience, companies, skills

✅ Save the cleaned data as a new dataset


3. Named Entity Recognition (NER) Model for Feature Extraction

✅ Build an NLP-based Resume Parsing Model using:
spaCy (NER for extracting applicant details)
BERT / Flair for advanced entity recognition

✅ Extract all key details (Name, Job Title, Skills, Experience, Education, etc.)


4. Resume Rating (Matching with Job Description)

✅ Compare extracted resume skills & experience with job descriptions

✅ Use TF-IDF + Cosine Similarity to compute a matching score

✅ Highlight missing keywords and suggest improvements


5. Deployment & User Interface

✅ Create a Streamlit web app to:

Upload resume PDF
Extract structured resume data
Rate the resume based on job description
Suggest improvements via chatbot


# Load Libraries

Install Dependancies

In [None]:
!pip install pdfplumber pandas faker tqdm
!pip install pymupdf pdfplumber
!pip install sentence_transformers
!pip install fpdf
!pip install sentence_transformers
!pip install faker
!pip install pymupdf pdfplumber
!pip install pdfminer.six spacy faker joblib pandas numpy scikit-learn torch transformers streamlit
!python -m spacy download en_core_web_sm

Collecting pdfplumber
  Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/42.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.5/42.5 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
Collecting faker
  Downloading Faker-36.1.0-py3-none-any.whl.metadata (15 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.2/48.2 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
Downloading pdfplumber-0.11.5-py3-none-any.whl (59 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.5/59.5 kB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pdfminer.six-20

# Load Dataset
Load and Extract Text from PDF Resumes. I used pdfplumber for accurate extraction.

Load Large dataset saved on google drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install pdfplumber pymupdf

Collecting pdfplumber
  Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/42.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.5/42.5 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pymupdf
  Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.2/48.2 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
Downloading pdfplumber-0.11.5-py3-none-any.whl (59 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.5/59.5 kB[0m [31m4.2 MB/s[0m eta

In [None]:
!pip install fpdf

Collecting fpdf
  Downloading fpdf-1.7.2.tar.gz (39 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: fpdf
  Building wheel for fpdf (setup.py) ... [?25l[?25hdone
  Created wheel for fpdf: filename=fpdf-1.7.2-py2.py3-none-any.whl size=40704 sha256=ec046440f3668c7e27f9da054b4247f13fe80dcffe752c87d1a554415bc51eb6
  Stored in directory: /root/.cache/pip/wheels/65/4f/66/bbda9866da446a72e206d6484cd97381cbc7859a7068541c36
Successfully built fpdf
Installing collected packages: fpdf
Successfully installed fpdf-1.7.2


In [None]:
import gdown

# Replace 'FILE_ID' with the actual file ID from your link
file_id2 = "1vAaazotBSAS6tAZ17l6d1x7rCjzRp5te"
output_filename2 = "resume_dataset_with_features.csv"  # Change to desired output filename

# Construct the download URL
url = f"https://drive.google.com/uc?id={file_id2}"

# Download the file
gdown.download(url, output_filename2, quiet=False)

print(f"File downloaded as: {output_filename2}")

Downloading...
From (original): https://drive.google.com/uc?id=1vAaazotBSAS6tAZ17l6d1x7rCjzRp5te
From (redirected): https://drive.google.com/uc?id=1vAaazotBSAS6tAZ17l6d1x7rCjzRp5te&confirm=t&uuid=321aadf6-6939-481a-9219-c922de3da83e
To: /content/resume_dataset_with_features.csv
100%|██████████| 638M/638M [00:11<00:00, 56.2MB/s]

File downloaded as: resume_dataset_with_features.csv





In [None]:
import os
import pdfplumber
import pandas as pd
import re
import spacy
from tqdm import tqdm
from fpdf import FPDF

In [None]:
# Load NLP model for better name extraction
nlp = spacy.load("en_core_web_sm")

folder_path = "/content/drive/MyDrive/resume_datasets_archive/data"
clean_folder_path = "/content/drive/MyDrive/resume_datasets_archive/clean_resumes"
os.makedirs(clean_folder_path, exist_ok=True)

Extract text from all pdf files in google drive folder

In [None]:
import os
import fitz  # PyMuPDF
import pdfplumber
import pandas as pd
from tqdm import tqdm

# Define folder path
folder_path = "/content/drive/MyDrive/resume_datasets_archive/data"
output_csv = "/content/raw_resume_data.csv"

# Function to extract text using a hybrid approach
def extract_text_from_pdf(pdf_path):
    text = ""

    # Try extracting with PyMuPDF (fitz)
    try:
        with fitz.open(pdf_path) as doc:
            text = "\n".join([page.get_text("text") for page in doc])
    except Exception as e:
        print(f"PyMuPDF failed for {pdf_path}: {e}")

    # Fallback to pdfplumber if PyMuPDF fails
    if not text.strip():
        try:
            with pdfplumber.open(pdf_path) as pdf:
                text = "\n".join([page.extract_text() for page in pdf.pages if page.extract_text()])
        except Exception as e:
            print(f"pdfplumber failed for {pdf_path}: {e}")

    return text.strip() if text else None

# List all PDF files in the folder
pdf_files = [f for f in os.listdir(folder_path) if f.endswith(".pdf")]

# Initialize list to store extracted data
resume_data = []

# Extract text from each PDF file with progress bar
for pdf_file in tqdm(pdf_files, desc="Extracting Text", unit="file"):
    pdf_path = os.path.join(folder_path, pdf_file)
    extracted_text = extract_text_from_pdf(pdf_path)
    if extracted_text:  # Ensure only non-empty text is stored
        resume_data.append({"filename": pdf_file, "text": extracted_text})

# Convert to DataFrame and save to CSV
df = pd.DataFrame(resume_data)
df.to_csv(output_csv, index=False, encoding="utf-8")

print(f"Extraction complete! Data saved to {output_csv}")


Extracting Text: 100%|██████████| 2484/2484 [01:38<00:00, 25.17file/s]


Extraction complete! Data saved to /content/raw_resume_data.csv


Previewing the Raw Resume Data

In [None]:
raw_resume = pd.read_csv("/content/raw_resume_data.csv")
raw_resume.head(10)

Unnamed: 0,filename,text
0,FINANCE (21).pdf,FINANCE AND OPERATIONS MANAGER\nExperience\nFi...
1,ENGINEERING (61).pdf,ENGINEERING TECHNICIAN\nSummary\nTo obtain a p...
2,FITNESS (24).pdf,"INTERN\nSummary\nMotivated, responsible Person..."
3,FINANCE (57).pdf,OPERATIONS FINANCE DIRECTOR\nSummary\nSkilled ...
4,DIGITAL_MEDIA (1).pdf,MEDIA ACTIVITIES SPECIALIST\nSummary\nMulti-Ta...
5,CONSULTANT (114).pdf,CONSULTANT\nSummary\nResults-oriented Californ...
6,DIGITAL_MEDIA (43).pdf,MEDIA SUPPORT SPECIALIST\nProfessional Summary...
7,FINANCE (59).pdf,FINANCE MANAGER FINANCE MANAGER\nExecutive Pro...
8,FITNESS (30).pdf,REHABILITATION SPECIALIST / MASSAGE THERAPIST\...
9,FINANCE (108).pdf,FINANCE MANAGER\nSummary\npreparing annual bud...


No charts were generated by quickchart


# Preprocessing

✅ Extract important details like Name, Email, Phone, Skills, Job Title, Work Experience, Education, Companies Worked For, Certifications, Referees.

✅ Use regular expressions (regex) and NLP (spaCy) for extraction.

✅ Fill missing values using Faker (a Python library for generating
fake data).

Key Differences:
Model	Size	Vocabulary Size	Word Vectors	Accuracy (NER, POS)	Best Use Case

- en_core_web_sm	~12MB	~50k	No word vectors	Lower accuracy	Light-weight tasks, fast inference
- en_core_web_md	~43MB	~685k	300-dimensional word vectors	Medium accuracy	General NLP tasks with semantic similarity
- en_core_web_lg	~741MB	~1M	300-dimensional word vectors	Highest accuracy	Best for NER, dependency parsing, semantic tasks

Explanation:
- en_core_web_sm: Small model with no word vectors, only context-sensitive embeddings. Faster but less accurate.
- en_core_web_md: Medium-sized model with word vectors, providing better semantic similarity.
- en_core_web_lg: Large model with a bigger vocabulary and word vectors, best for deep NLP tasks

In [None]:
!pip install spacy faker
#!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_md

Collecting faker
  Downloading Faker-36.1.1-py3-none-any.whl.metadata (15 kB)
Downloading Faker-36.1.1-py3-none-any.whl (1.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m22.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faker
Successfully installed faker-36.1.1
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You

In [None]:
import spacy
nlp = spacy.load("en_core_web_md")

More Feature extraction

In [None]:
import pandas as pd
import spacy
import re
from tqdm import tqdm  # Import tqdm for progress bar

# Load spaCy NLP model (use "en_core_web_md" or "en_core_web_lg" for better entity recognition)
nlp = spacy.load("en_core_web_lg")

# Load raw resume data CSV
input_csv = "/content/raw_resume_data.csv"
output_csv = "/content/parsed_resume_data2.csv"

df = pd.read_csv(input_csv)

# Define regex patterns
email_pattern = r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
phone_pattern = r"\b(\+?\d{1,3}[-.\s]?)?(\(?\d{2,4}\)?[-.\s]?)?\d{3,4}[-.\s]?\d{4}\b"
experience_pattern = r"(\d+)\s*(?:years?|yrs?)\s*(?:of)?\s*(?:experience|exp)"
linkedin_pattern = r"(https?://www\.linkedin\.com/in/[a-zA-Z0-9-_/]+)"

# Keywords for education and institutions
degree_keywords = ["bachelor", "master", "phd", "associate", "doctorate", "diploma", "certificate", "degree"]
institution_keywords = ["university", "college", "institute", "academy", "school"]

# Function to extract skills dynamically
def extract_skills(text):
    doc = nlp(text)
    skills = set()

    # Extract noun phrases that could be skills
    for chunk in doc.noun_chunks:
        if any(token.pos_ in ["NOUN", "PROPN"] for token in chunk):
            skills.add(chunk.text)

    # Look for patterns like "Proficient in Python", "Experience with SQL"
    patterns = [r"Proficient in (\w+)", r"Experience with (\w+)", r"Skilled in (\w+)"]
    for pattern in patterns:
        matches = re.findall(pattern, text, re.IGNORECASE)
        skills.update(matches)

    return ", ".join(skills) if skills else None

# Function to extract certifications dynamically
def extract_certifications(text):
    doc = nlp(text)
    certifications = set()

    for chunk in doc.noun_chunks:
        if any(keyword in chunk.text.lower() for keyword in ["certified", "certificate", "certification"]):
            certifications.add(chunk.text)

    return ", ".join(certifications) if certifications else None

# Function to extract education background
def extract_education(text):
    doc = nlp(text)
    degrees = set()
    institutions = set()

    for ent in doc.ents:
        if ent.label_ == "ORG" and any(keyword in ent.text.lower() for keyword in institution_keywords):
            institutions.add(ent.text)

    for token in doc:
        if any(keyword in token.text.lower() for keyword in degree_keywords):
            degrees.add(token.text)

    return ", ".join(degrees) if degrees else None, ", ".join(institutions) if institutions else None

# Function to extract physical address
def extract_address(text):
    doc = nlp(text)
    address = []

    for ent in doc.ents:
        if ent.label_ in ["GPE", "LOC", "FAC"]:  # Geographical entities
            address.append(ent.text)

    return ", ".join(address) if address else None

# Function to extract LinkedIn profile
def extract_linkedin(text):
    match = re.search(linkedin_pattern, text)
    return match.group() if match else None

# Function to extract structured resume data
def extract_resume_details(text):
    doc = nlp(text)

    # Extract name (first detected Proper Noun)
    name = next((ent.text for ent in doc.ents if ent.label_ == "PERSON"), None)

    # Extract job title (NER-based)
    job_title = next((ent.text for ent in doc.ents if ent.label_ in ["ORG", "WORK_OF_ART"]), None)

    # Extract email
    email = re.search(email_pattern, text)
    email = email.group() if email else None

    # Extract phone number
    phone = re.search(phone_pattern, text)
    phone = phone.group() if phone else None

    # Extract years of experience
    experience = re.search(experience_pattern, text)
    experience = experience.group(1) if experience else None

    # Extract companies worked for
    companies = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    companies = ", ".join(companies) if companies else None

    # Extract dynamic skills
    skills = extract_skills(text)

    # Extract dynamic certifications
    certifications = extract_certifications(text)

    # Extract education background and institutions
    degrees, institutions = extract_education(text)

    # Extract address
    address = extract_address(text)

    # Extract LinkedIn profile
    linkedin = extract_linkedin(text)

    # Extract referees (based on keywords like "Reference" or "Referee")
    referees = "Yes" if "reference" in text.lower() or "referee" in text.lower() else "No"

    return [name, job_title, email, phone, experience, companies, skills, certifications, degrees, institutions, address, linkedin, referees]

# Apply extraction function to dataset with progress bar
df_extracted = []
for text in tqdm(df["text"], desc="Processing Resumes", unit="resume"):
    df_extracted.append(extract_resume_details(text))

# Create structured DataFrame
df_parsed = pd.DataFrame(df_extracted, columns=[
    "Applicant Name", "Job Title", "Email", "Phone", "Years of Experience",
    "Companies Worked For", "Skills", "Certifications", "Education Background",
    "Institutions Attended", "Physical Address", "LinkedIn Profile", "Referees"
])

# Save to CSV
df_parsed.to_csv(output_csv, index=False)

print(f"Extraction complete! Processed resume data saved to {output_csv}")


Processing Resumes: 100%|██████████| 2483/2483 [41:31<00:00,  1.00s/resume]


Extraction complete! Processed resume data saved to /content/parsed_resume_data2.csv


Using pyresparser pretrained model

In [None]:
!pip install pyresparser nltk pandas tqdm


Collecting pyresparser
  Downloading pyresparser-1.0.6-py3-none-any.whl.metadata (7.4 kB)
Collecting docx2txt>=0.7 (from pyresparser)
  Downloading docx2txt-0.8.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pdfminer.six>=20181108 (from pyresparser)
  Downloading pdfminer.six-20240706-py3-none-any.whl.metadata (4.1 kB)
Collecting pycryptodome>=3.8.2 (from pyresparser)
  Downloading pycryptodome-3.21.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting pyrsistent>=0.15.2 (from pyresparser)
  Downloading pyrsistent-0.20.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting sortedcontainers>=2.1.0 (from pyresparser)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
Downloading pyresparser-1.0.6-py3-none-any.whl (4.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.2/4.2 MB[0m [31m40.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pdfminer.s

In [None]:
!python -m nltk.downloader punkt
#!python -m spacy download en_core_web_sm


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
!pip install transformers torch pandas tqdm spacy pdfplumber
#!python -m spacy download en_core_web_sm


Collecting pdfplumber
  Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/42.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.5/42.5 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.2/48.2 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
Downloading pdfplumber-0.11.5-py3-none-any.whl (59 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.5/59.5 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Resume Parsing with Hugging Face Transformers

In [None]:
import pandas as pd
import re
from tqdm import tqdm
import pdfplumber
from transformers import pipeline

# Enable tqdm for pandas
tqdm.pandas()

# Load Hugging Face NER model (General-Purpose)
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", grouped_entities=True)

# Regex patterns
linkedin_pattern = r"https?://(www\.)?linkedin\.com/in/[a-zA-Z0-9-_]+"
email_pattern = r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
phone_pattern = r"\b(\+?\d{1,3}[-.\s]?)?(\(?\d{2,4}\)?[-.\s]?)?\d{3,4}[-.\s]?\d{3,4}\b"

# Keywords for different categories
education_keywords = ["Bachelor", "Master", "PhD", "BSc", "MSc", "Diploma", "Certificate", "Degree"]
certification_keywords = ["Certified", "Certification", "Credential", "Course", "Training"]
job_role_keywords = ["Engineer", "Manager", "Consultant", "Analyst", "Developer", "Scientist", "Specialist", "Technician"]
experience_keywords = ["years", "experience"]

# Load resume dataset
input_csv = "/content/raw_resume_data.csv"
output_csv = "/content/parsed_resume_data_lg_transformers.csv"
df = pd.read_csv(input_csv)

# Function to extract named entities from text
def extract_entities(text):
    try:
        entities = ner_pipeline(text)

        # Extract structured fields
        name = next((ent["word"] for ent in entities if ent["entity_group"] == "PER"), None)
        organizations = [ent["word"] for ent in entities if ent["entity_group"] == "ORG"]
        skills = [ent["word"] for ent in entities if ent["entity_group"] == "MISC"]

        # Extract email
        email = re.search(email_pattern, text)
        email = email.group() if email else None

        # Extract phone number
        phone = re.search(phone_pattern, text)
        phone = phone.group() if phone else None

        # Extract LinkedIn profile
        linkedin = re.search(linkedin_pattern, text)
        linkedin = linkedin.group() if linkedin else None

        # Extract education background & institutions
        education = [word for word in text.split() if any(kw in word for kw in education_keywords)]
        education = ", ".join(education) if education else None

        institutions = ", ".join(organizations) if organizations else None

        # Extract job role
        job_roles = [word for word in text.split() if any(kw in word for kw in job_role_keywords)]
        job_roles = ", ".join(set(job_roles)) if job_roles else None

        # Extract certifications
        certifications = [word for word in text.split() if any(kw in word for kw in certification_keywords)]
        certifications = ", ".join(set(certifications)) if certifications else None

        # Extract years of experience
        experience_match = re.search(r"(\d+)\s*(years|year) of experience", text, re.IGNORECASE)
        years_of_experience = experience_match.group(1) if experience_match else None

        # Extract referees (basic keyword matching)
        referees = "Yes" if "reference" in text.lower() or "referee" in text.lower() else "No"

        return [
            name, job_roles, phone, email, ", ".join(organizations),
            years_of_experience, ", ".join(skills), referees, linkedin,
            certifications, education, institutions
        ]

    except Exception as e:
        print(f"Error processing resume: {e}")
        return [None] * 12  # Return empty values if an error occurs

# Apply extraction function with a progress bar
df_extracted = df["text"].progress_apply(extract_entities)

# Create structured DataFrame
df_parsed = pd.DataFrame(df_extracted.tolist(), columns=[
    "Applicant Name", "Job Role", "Phone", "Email", "Companies Worked For",
    "Years of Work Experience", "Skills", "Referees", "LinkedIn Profile",
    "Certifications", "Education Background", "Education Institutions"
])

# Save to CSV
df_parsed.to_csv(output_csv, index=False)

print(f"✅ Extraction complete! Processed resume data saved to {output_csv}")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
100%|██████████| 2483/2483 [02:09<00:00, 19.24it/s]

✅ Extraction complete! Processed resume data saved to /content/parsed_resume_data.csv





2️⃣ Check and Fill Missing Values


✅ Identify missing values in extracted fields: Applicant Name, Job Role, Email, Phone Number, Companies Worked, Work Experience, Skills, Education, Certifications, Referees ✅ Generate synthetic values for missing fields using: Faker (for realistic names, emails, phone numbers) Randomized industry-relevant values for missing experience, companies, skills ✅ Save the cleaned data as a new dataset

In [36]:
parsed_resume = pd.read_csv('/content/parsed_resume_data2.csv')
parsed_resume.head(25)

Unnamed: 0,Applicant Name,Job Title,Email,Phone,Years of Experience,Companies Worked For,Skills,Certifications,Education Background,Institutions Attended,Physical Address,LinkedIn Profile,Referees
0,Johnson,Operations,,,,"Operations, Challenge, Authorize, Scrutinized,...","monthly budget, Financial Performance, Continu...",,"Bachelor, Master",Wales University - City,,,No
1,,Accomplishments\nHome Improvement Projects,,,,"Accomplishments\nHome Improvement Projects, Ou...","water main reconstruction, maintenance, Storm ...","corner certificate drawings, section corner ce...","certificate, certificates","Education\n\nUniversity of Northern Iowa, Hawk...","Trimble 5600 Total Station, Eagle Point, Iowa,...",,No
2,11/2012,First Aid,,,,"First Aid, Intern 03/2013, 10/2013 Company, Of...","recreation workers, Research and study players...",a certified personal training certification\nC...,Diploma,,"CHARLOTTE, NC UNITED STATES",,No
3,,Project,,2008-2013,,"Project, Microsoft Office, Kronos, SOX, DOJ, B...","maintenance, internal and external regulatory ...","ï¼​ City , State , USA\nLanguages\nSpanish- Fl...","Certificate, Associate",Member Business Continuity Institute,"Concur, San Francisco, New York, City, nto, Sa...",,No
4,Michio Kaku,Multi-Tasking Media Relations,,1997-2004,,"Multi-Tasking Media Relations, Strategic Initi...","Skills, Achievement, Los Angeles California, m...",,"Bachelor, Degrees, Master","Howard High School, Tennessee College Public R...","City, 03/1996, North Carolina, City, Los Angel...",,No
5,02/2014,SOX Compliance\nBusiness,,,,"SOX Compliance\nBusiness, Oracle, Oracle FSG, ...","the database, tax returns, the acquisition, cl...",the UC Santa Cruz Certificate,"Certificate, Bachelor, Associate, Master",Bachelor of Science : Business Administration ...,"California, Name City, U.S., U.S., Name City, ...",,No
6,Proof,Skills\nActive Directory,,,,"Skills\nActive Directory, Automotive, DNS, LAN...","network cabling installation, IP, proper opera...",,,Political Science Indiana University of,"DC, the United States, NYC, DC, Philadelphia, ...",,No
7,,Executive Profile\nFinancial,,,,"Executive Profile\nFinancial, Â , Skill Highl...","large, high profile companies, andadvanced com...","Certified Public Accountant, Skill Highlights\...","Associate, associates","City University of New, Keller Graduate School...","Name City, 42nd Street, 42nd Street Facilitate...",,No
8,Kokopelli Triathlon,REHABILITATION SPECIALIST / MASSAGE THERAPIST,,2013-2014,,"REHABILITATION SPECIALIST / MASSAGE THERAPIST,...","massage therapy, NSCA, Exercise, movement asse...",Certified/Licensed Massage Therapist,"Masters, Bachelor","Florida Gulf Coast University ï¼​ City, Nevada...","City, City, City, City, Las Vegas, City, Las V...",,No
9,Abdul Majeed,"Commercial Operation, Accounts & Finance",,,,"Commercial Operation, Accounts & Finance, Audi...","revenues, bank reconciliation, legislative and...",Certifications\nUrdu Level,"degree, Master",Brooklyn Park University Finance Location,"Dammam, Saudi Arabia Responsible, Pennsylvania...",,No


In [37]:
parsed_resume.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2483 entries, 0 to 2482
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Applicant Name         1895 non-null   object 
 1   Job Title              2483 non-null   object 
 2   Email                  19 non-null     object 
 3   Phone                  333 non-null    object 
 4   Years of Experience    245 non-null    float64
 5   Companies Worked For   2483 non-null   object 
 6   Skills                 2483 non-null   object 
 7   Certifications         1041 non-null   object 
 8   Education Background   2278 non-null   object 
 9   Institutions Attended  2151 non-null   object 
 10  Physical Address       2351 non-null   object 
 11  LinkedIn Profile       11 non-null     object 
 12  Referees               2483 non-null   object 
dtypes: float64(1), object(12)
memory usage: 252.3+ KB


In [38]:
parsed_resume.isnull().sum()

Unnamed: 0,0
Applicant Name,588
Job Title,0
Email,2464
Phone,2150
Years of Experience,2238
Companies Worked For,0
Skills,0
Certifications,1442
Education Background,205
Institutions Attended,332


In [39]:
parsed_resume.columns

Index(['Applicant Name', 'Job Title', 'Email', 'Phone', 'Years of Experience',
       'Companies Worked For', 'Skills', 'Certifications',
       'Education Background', 'Institutions Attended', 'Physical Address',
       'LinkedIn Profile', 'Referees'],
      dtype='object')

In [None]:
import pandas as pd
import re
import random
from faker import Faker

# Initialize Faker
fake = Faker()

# Input and output CSV paths
input_csv = "/content/parsed_resume_data.csv"
output_csv = "/content/cleaned_parsed_resume_data.csv"

# Load parsed resume data
df = pd.read_csv(input_csv)

# Define regex patterns for validation
email_pattern = r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
phone_pattern = r"\b(\+?\d{1,3}[-.\s]?)?(\(?\d{2,4}\)?[-.\s]?)?\d{3,4}[-.\s]?\d{4}\b"
linkedin_pattern = r"(https?:\/\/)?(www\.)?linkedin\.com\/in\/[a-zA-Z0-9-]+"

# Industry-relevant values for missing fields
random_companies = ["Google", "Microsoft", "Apple", "Amazon", "IBM", "Tesla", "Facebook", "Salesforce", "Uber", "Netflix"]
random_skills = ["Python", "SQL", "Machine Learning", "Data Analysis", "Project Management", "Cloud Computing", "Cybersecurity"]
random_experience = [str(i) for i in range(1, 21)]  # Experience between 1-20 years
random_education = ["B.Sc. Computer Science", "M.Sc. Data Science", "MBA", "Ph.D. AI", "Diploma in IT", "B.Tech Mechanical"]
random_universities = ["MIT", "Harvard", "Stanford", "Cambridge", "Oxford", "UC Berkeley", "Carnegie Mellon"]

# Function to fill missing values
def fill_missing_values(row):
    """Fill missing fields with realistic synthetic data."""
    row["Applicant Name"] = row["Applicant Name"] if pd.notna(row["Applicant Name"]) else fake.name()
    row["Email"] = row["Email"] if pd.notna(row["Email"]) and re.match(email_pattern, str(row["Email"])) else fake.email()
    row["Phone"] = row["Phone"] if pd.notna(row["Phone"]) and re.match(phone_pattern, str(row["Phone"])) else fake.phone_number()
    row["Years of Work Experience"] = row["Years of Work Experience"] if pd.notna(row["Years of Work Experience"]) else random.choice(random_experience)
    row["Companies Worked For"] = row["Companies Worked For"] if pd.notna(row["Companies Worked For"]) else random.choice(random_companies)
    row["Skills"] = row["Skills"] if pd.notna(row["Skills"]) else random.choice(random_skills)
    row["Certifications"] = row["Certifications"] if pd.notna(row["Certifications"]) else "Certified " + random.choice(random_skills)
    row["Education Background"] = row["Education Background"] if pd.notna(row["Education Background"]) else random.choice(random_education)
    row["Education Institutions"] = row["Education Institutions"] if pd.notna(row["Education Institutions"]) else random.choice(random_universities)
    row["LinkedIn Profile"] = row["LinkedIn Profile"] if pd.notna(row["LinkedIn Profile"]) and re.match(linkedin_pattern, str(row["LinkedIn Profile"])) else f"https://linkedin.com/in/{fake.user_name()}"
    row["Referees"] = row["Referees"] if pd.notna(row["Referees"]) else fake.name()

    return row

# Apply missing value filling
df_cleaned = df.apply(fill_missing_values, axis=1)

# Save cleaned data
df_cleaned.to_csv(output_csv, index=False)

print(f"Missing values filled! Cleaned resume data saved to {output_csv}")


Missing values filled! Cleaned resume data saved to /content/cleaned_parsed_resume_data.csv


Generating a new clean Resume pdf dataset

In [40]:
resume_df = pd.read_csv('/content/cleaned_parsed_resume_data.csv')
resume_df.head()

Unnamed: 0,Applicant Name,Job Role,Phone,Email,Companies Worked For,Years of Work Experience,Skills,Referees,LinkedIn Profile,Certifications,Education Background,Education Institutions
0,Darryl Rodriguez,"Manager, Analyst",+1-810-234-8247x16972,thompsonrichard@example.com,"State B, ##ANCI",12.0,"Excel, R",No,https://linkedin.com/in/ryan73,Certifications,"Master, Bachelor's","State B, ##ANCI"
1,Allison Joseph,"Technician, Engineering",476.373.3629x1110,cpacheco@example.com,"##CI, Current Engineering, HMA, ##P, ArcV, TOP...",3.0,"Improvement, Water Utility, Microsoft Access, ...",No,https://linkedin.com/in/murphybrian,Certified Data Analysis,M.Sc. Data Science,"##CI, Current Engineering, HMA, ##P, ArcV, TOP..."
2,Christopher Williams,Specialist,294.859.3140,diana99@example.org,Amazon,9.0,"Aid, Mass Index",No,https://linkedin.com/in/barnesmary,"Certifications, Training",Diploma,UC Berkeley
3,Elizabeth Cooper,-Manager,2008-2013,johnathan90@example.com,"Microsoft Office, K, Ex, NYSE, SF, NY, NY, Sec...",13.0,"NYSE Regulation, Amex, Floor, ##SE",No,https://linkedin.com/in/areed,"Certified, Certifications","Certificate, Certificate-Project, Certificate-LMC","Microsoft Office, K, Ex, NYSE, SF, NY, NY, Sec..."
4,Neil de Grasse Tyson,"Specialist, Manager/Supervisor",1997-2004,mcgrathtina@example.net,"MEDI, Chattanooga State, Tennessee Legislature...",15.0,"African American, Modern Times, Spanish, Spani...",No,https://linkedin.com/in/xmiller,Certified Cybersecurity,"Degrees, Master, Bachelor","MEDI, Chattanooga State, Tennessee Legislature..."


In [41]:
resume_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2483 entries, 0 to 2482
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Applicant Name            2483 non-null   object 
 1   Job Role                  1923 non-null   object 
 2   Phone                     2483 non-null   object 
 3   Email                     2483 non-null   object 
 4   Companies Worked For      2483 non-null   object 
 5   Years of Work Experience  2483 non-null   float64
 6   Skills                    2483 non-null   object 
 7   Referees                  2483 non-null   object 
 8   LinkedIn Profile          2483 non-null   object 
 9   Certifications            2483 non-null   object 
 10  Education Background      2483 non-null   object 
 11  Education Institutions    2483 non-null   object 
dtypes: float64(1), object(11)
memory usage: 232.9+ KB


In [42]:
resume_df.isnull().sum()

Unnamed: 0,0
Applicant Name,0
Job Role,560
Phone,0
Email,0
Companies Worked For,0
Years of Work Experience,0
Skills,0
Referees,0
LinkedIn Profile,0
Certifications,0


In [None]:
import pandas as pd
import os

# File paths
cleaned_csv = "/content/cleaned_parsed_resume_data.csv"
raw_csv = "/content/raw_resume_data.csv"

# Load datasets
df_cleaned = pd.read_csv(cleaned_csv)
df_raw = pd.read_csv(raw_csv)

# Extract job role from the first line of text column
def extract_job_role(text):
    """Extract the first line before a newline character as the job role."""
    if pd.notna(text):
        return text.split("\n")[0].strip()
    return None

# Apply extraction to raw resume data
df_raw["Extracted Job Role"] = df_raw["text"].apply(extract_job_role)

# Convert 'filename' to match 'Applicant Name' (remove extensions)
df_raw["Applicant Name"] = df_raw["filename"].apply(lambda x: os.path.splitext(x)[0])

# Merge job roles into cleaned data using 'Applicant Name'
df_cleaned = df_cleaned.merge(df_raw[["Applicant Name", "Extracted Job Role"]], on="Applicant Name", how="left")

# Fill missing job roles
df_cleaned["Job Role"] = df_cleaned["Job Role"].fillna(df_cleaned["Extracted Job Role"])

# Drop the temporary "Extracted Job Role" column
df_cleaned.drop(columns=["Extracted Job Role"], inplace=True)

# Save the updated dataset
df_cleaned.to_csv(cleaned_csv, index=False)

print(f"✅ Missing Job Roles updated using raw resume text! Saved to {cleaned_csv}")


✅ Missing Job Roles updated using raw resume text! Saved to /content/cleaned_parsed_resume_data.csv


In [None]:
df = pd.read_csv('/content/cleaned_parsed_resume_data.csv')
df.head()

Unnamed: 0,Applicant Name,Job Role,Phone,Email,Companies Worked For,Years of Work Experience,Skills,Referees,LinkedIn Profile,Certifications,Education Background,Education Institutions
0,Darryl Rodriguez,"Manager, Analyst",+1-810-234-8247x16972,thompsonrichard@example.com,"State B, ##ANCI",12.0,"Excel, R",No,https://linkedin.com/in/ryan73,Certifications,"Master, Bachelor's","State B, ##ANCI"
1,Allison Joseph,"Technician, Engineering",476.373.3629x1110,cpacheco@example.com,"##CI, Current Engineering, HMA, ##P, ArcV, TOP...",3.0,"Improvement, Water Utility, Microsoft Access, ...",No,https://linkedin.com/in/murphybrian,Certified Data Analysis,M.Sc. Data Science,"##CI, Current Engineering, HMA, ##P, ArcV, TOP..."
2,Christopher Williams,Specialist,294.859.3140,diana99@example.org,Amazon,9.0,"Aid, Mass Index",No,https://linkedin.com/in/barnesmary,"Certifications, Training",Diploma,UC Berkeley
3,Elizabeth Cooper,-Manager,2008-2013,johnathan90@example.com,"Microsoft Office, K, Ex, NYSE, SF, NY, NY, Sec...",13.0,"NYSE Regulation, Amex, Floor, ##SE",No,https://linkedin.com/in/areed,"Certified, Certifications","Certificate, Certificate-Project, Certificate-LMC","Microsoft Office, K, Ex, NYSE, SF, NY, NY, Sec..."
4,Neil de Grasse Tyson,"Specialist, Manager/Supervisor",1997-2004,mcgrathtina@example.net,"MEDI, Chattanooga State, Tennessee Legislature...",15.0,"African American, Modern Times, Spanish, Spani...",No,https://linkedin.com/in/xmiller,Certified Cybersecurity,"Degrees, Master, Bachelor","MEDI, Chattanooga State, Tennessee Legislature..."


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2483 entries, 0 to 2482
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Applicant Name            2483 non-null   object 
 1   Job Role                  1923 non-null   object 
 2   Phone                     2483 non-null   object 
 3   Email                     2483 non-null   object 
 4   Companies Worked For      2483 non-null   object 
 5   Years of Work Experience  2483 non-null   float64
 6   Skills                    2483 non-null   object 
 7   Referees                  2483 non-null   object 
 8   LinkedIn Profile          2483 non-null   object 
 9   Certifications            2483 non-null   object 
 10  Education Background      2483 non-null   object 
 11  Education Institutions    2483 non-null   object 
dtypes: float64(1), object(11)
memory usage: 232.9+ KB


In [None]:
#fill missing values in job role column with mode and save to new csv file final_clean_resume
df['Job Role'].fillna(df['Job Role'].mode()[0], inplace=True)
df.to_csv('/content/final_clean_resume.csv', index=False)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Job Role'].fillna(df['Job Role'].mode()[0], inplace=True)


In [45]:
df = pd.read_csv('/content/final_clean_resume.csv')
df.head()

Unnamed: 0,Applicant Name,Job Role,Phone,Email,Companies Worked For,Years of Work Experience,Skills,Referees,LinkedIn Profile,Certifications,Education Background,Education Institutions
0,Darryl Rodriguez,"Manager, Analyst",+1-810-234-8247x16972,thompsonrichard@example.com,"State B, ##ANCI",12.0,"Excel, R",No,https://linkedin.com/in/ryan73,Certifications,"Master, Bachelor's","State B, ##ANCI"
1,Allison Joseph,"Technician, Engineering",476.373.3629x1110,cpacheco@example.com,"##CI, Current Engineering, HMA, ##P, ArcV, TOP...",3.0,"Improvement, Water Utility, Microsoft Access, ...",No,https://linkedin.com/in/murphybrian,Certified Data Analysis,M.Sc. Data Science,"##CI, Current Engineering, HMA, ##P, ArcV, TOP..."
2,Christopher Williams,Specialist,294.859.3140,diana99@example.org,Amazon,9.0,"Aid, Mass Index",No,https://linkedin.com/in/barnesmary,"Certifications, Training",Diploma,UC Berkeley
3,Elizabeth Cooper,-Manager,2008-2013,johnathan90@example.com,"Microsoft Office, K, Ex, NYSE, SF, NY, NY, Sec...",13.0,"NYSE Regulation, Amex, Floor, ##SE",No,https://linkedin.com/in/areed,"Certified, Certifications","Certificate, Certificate-Project, Certificate-LMC","Microsoft Office, K, Ex, NYSE, SF, NY, NY, Sec..."
4,Neil de Grasse Tyson,"Specialist, Manager/Supervisor",1997-2004,mcgrathtina@example.net,"MEDI, Chattanooga State, Tennessee Legislature...",15.0,"African American, Modern Times, Spanish, Spani...",No,https://linkedin.com/in/xmiller,Certified Cybersecurity,"Degrees, Master, Bachelor","MEDI, Chattanooga State, Tennessee Legislature..."


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2483 entries, 0 to 2482
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Applicant Name            2483 non-null   object 
 1   Job Role                  2483 non-null   object 
 2   Phone                     2483 non-null   object 
 3   Email                     2483 non-null   object 
 4   Companies Worked For      2483 non-null   object 
 5   Years of Work Experience  2483 non-null   float64
 6   Skills                    2483 non-null   object 
 7   Referees                  2483 non-null   object 
 8   LinkedIn Profile          2483 non-null   object 
 9   Certifications            2483 non-null   object 
 10  Education Background      2483 non-null   object 
 11  Education Institutions    2483 non-null   object 
dtypes: float64(1), object(11)
memory usage: 232.9+ KB


## Large Resume Dataset NER Feature Extraction

In [43]:
import pandas as pd
import spacy
from spacy import displacy

In [44]:
large_resume_df = pd.read_csv('/content/resume_dataset_with_features.csv')
large_resume_df.head()

Unnamed: 0,Role,Features
0,Social Media Manager,5 to 15 Years Digital Marketing Specialist M.T...
1,Frontend Web Developer,"2 to 12 Years Web Developer BCA HTML, CSS, Jav..."
2,Quality Control Manager,0 to 12 Years Operations Manager PhD Quality c...
3,Wireless Network Engineer,4 to 11 Years Network Engineer PhD Wireless ne...
4,Conference Manager,1 to 12 Years Event Manager MBA Event planning...


In [None]:
large_resume_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1615940 entries, 0 to 1615939
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   Role      1615940 non-null  object
 1   Features  1615940 non-null  object
dtypes: object(2)
memory usage: 24.7+ MB


In [None]:
large_resume_df.columns

Index(['Role', 'Features'], dtype='object')

In [None]:
import pandas as pd
import re
from tqdm import tqdm
from transformers import pipeline

# Enable tqdm for pandas
tqdm.pandas()

# Load Hugging Face NER model (General-Purpose)
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", grouped_entities=True)

# Regex patterns
linkedin_pattern = r"https?://(www\.)?linkedin\.com/in/[a-zA-Z0-9-_]+"
email_pattern = r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
phone_pattern = r"\b(\+?\d{1,3}[-.\s]?)?(\(?\d{2,4}\)?[-.\s]?)?\d{3,4}[-.\s]?\d{3,4}\b"

# Keywords for different categories
education_keywords = ["Bachelor", "Master", "PhD", "BSc", "MSc", "Diploma", "Certificate", "Degree"]
certification_keywords = ["Certified", "Certification", "Credential", "Course", "Training"]
experience_keywords = ["years", "experience"]

# Load resume dataset
input_csv = "/content/resume_dataset_with_features.csv"
output_csv = "/content/parsed_resume_dataset_with_features.csv"
df = pd.read_csv(input_csv)

# Only extract the first 150,000 rows
df = df.head(10000)

# Function to extract named entities from text
def extract_entities(row):
    try:
        text = row['Features']
        role = row['Role']
        entities = ner_pipeline(text)

        # Extract structured fields
        name = next((ent["word"] for ent in entities if ent["entity_group"] == "PER"), None)
        organizations = [ent["word"] for ent in entities if ent["entity_group"] == "ORG"]
        skills = [ent["word"] for ent in entities if ent["entity_group"] == "MISC"]

        # Extract email
        email = re.search(email_pattern, text)
        email = email.group() if email else None

        # Extract phone number
        phone = re.search(phone_pattern, text)
        phone = phone.group() if phone else None

        # Extract LinkedIn profile
        linkedin = re.search(linkedin_pattern, text)
        linkedin = linkedin.group() if linkedin else None

        # Extract education background & institutions
        education = [word for word in text.split() if any(kw in word for kw in education_keywords)]
        education = ", ".join(education) if education else None

        institutions = ", ".join(organizations) if organizations else None

        # Extract certifications
        certifications = [word for word in text.split() if any(kw in word for kw in certification_keywords)]
        certifications = ", ".join(set(certifications)) if certifications else None

        # Extract years of experience
        experience_match = re.search(r"(\d+)\s*(years|year) of experience", text, re.IGNORECASE)
        years_of_experience = experience_match.group(1) if experience_match else None

        # Extract referees (basic keyword matching)
        referees = "Yes" if "reference" in text.lower() or "referee" in text.lower() else "No"

        return [
            name, role, phone, email, ", ".join(organizations),
            years_of_experience, ", ".join(skills), referees, linkedin,
            certifications, education, institutions
        ]

    except Exception as e:
        print(f"Error processing resume: {e}")
        return [None] * 12  # Return empty values if an error occurs

# Apply extraction function with a progress bar
df_extracted = df.progress_apply(extract_entities, axis=1)

# Create structured DataFrame
df_parsed = pd.DataFrame(df_extracted.tolist(), columns=[
    "Applicant Name", "Job Role", "Phone", "Email", "Companies Worked For",
    "Years of Work Experience", "Skills", "Referees", "LinkedIn Profile",
    "Certifications", "Education Background", "Education Institutions"
])

# Save to CSV
df_parsed.to_csv(output_csv, index=False)
df_parsed.to_csv("/content/drive/MyDrive/parsed_resume_dataset_with_features.csv")

print(f"✅ Extraction complete! Processed resume data saved to {output_csv}")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
100%|██████████| 10000/10000 [1:02:13<00:00,  2.68it/s]

✅ Extraction complete! Processed resume data saved to /content/parsed_resume_dataset_with_features.csv





In [52]:
large_resume_data = pd.read_csv("/content/parsed_resume_dataset_with_features.csv")

In [53]:
large_resume_data.head()

Unnamed: 0,Applicant Name,Job Role,Phone,Email,Companies Worked For,Years of Work Experience,Skills,Referees,LinkedIn Profile,Certifications,Education Background,Education Institutions
0,,Social Media Manager,,,"M. Tech, Facebook, Twitter, In, ##gram",,,No,,,,"M. Tech, Facebook, Twitter, In, ##gram"
1,,Frontend Web Developer,100340.0,,"BCA, U",,"##L, CS, JavaScript",No,,,,"BCA, U"
2,,Quality Control Manager,,,,,ISO 9001,No,,,PhD,
3,,Wireless Network Engineer,129896.0,,,,,No,,,PhD,
4,,Conference Manager,,,MBA,,,No,,,,MBA


In [54]:
large_resume_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Applicant Name            611 non-null    object 
 1   Job Role                  10000 non-null  object 
 2   Phone                     2845 non-null   float64
 3   Email                     0 non-null      float64
 4   Companies Worked For      7222 non-null   object 
 5   Years of Work Experience  0 non-null      float64
 6   Skills                    2535 non-null   object 
 7   Referees                  10000 non-null  object 
 8   LinkedIn Profile          0 non-null      float64
 9   Certifications            112 non-null    object 
 10  Education Background      977 non-null    object 
 11  Education Institutions    7222 non-null   object 
dtypes: float64(4), object(8)
memory usage: 937.6+ KB


In [55]:
large_resume_data.columns

Index(['Applicant Name', 'Job Role', 'Phone', 'Email', 'Companies Worked For',
       'Years of Work Experience', 'Skills', 'Referees', 'LinkedIn Profile',
       'Certifications', 'Education Background', 'Education Institutions'],
      dtype='object')

In [56]:
large_resume_data.isnull().sum()

Unnamed: 0,0
Applicant Name,9389
Job Role,0
Phone,7155
Email,10000
Companies Worked For,2778
Years of Work Experience,10000
Skills,7465
Referees,0
LinkedIn Profile,10000
Certifications,9888


In [57]:
import pandas as pd
import random
import string
from faker import Faker

# Initialize Faker for generating realistic names and phone numbers
fake = Faker()

# Load dataset (Replace with actual path)
df = large_resume_data

# Sample data for synthetic filling
company_identifiers = ["TechCorp", "Global Solutions", "InnovateX", "NextGen", "DataWorks", "CyberNet"]
degree_options = ["Bachelor's", "Master's", "PhD", "Diploma", "Associate Degree"]
universities = ["Harvard University", "MIT", "Stanford University", "Oxford University", "Cambridge University", "Yale University"]
certifications_list = ["PMP", "AWS Certified Solutions Architect", "Certified Data Scientist", "Google Cloud Professional", "Cisco CCNA"]
skill_sets = ["Python", "Machine Learning", "Data Analysis", "SQL", "Java", "Cybersecurity", "Cloud Computing", "Project Management"]
email_domains = ["gmail.com", "yahoo.com", "outlook.com", "protonmail.com", "icloud.com", "aol.com"]

# Generate synthetic data
def generate_name():
    return fake.first_name() + " " + fake.last_name()

def generate_phone():
    return fake.phone_number()

def generate_email(name):
    return name.replace(" ", "").lower() + "@" + random.choice(email_domains)

def generate_company():
    return fake.company() if random.random() > 0.3 else random.choice(company_identifiers)

def generate_experience():
    return random.randint(1, 30)

def generate_skills():
    return ", ".join(random.sample(skill_sets, random.randint(3, 6)))

def generate_linkedin(name):
    return "https://linkedin.com/in/" + name.replace(" ", "").lower()

def generate_education():
    return random.choice(degree_options)

def generate_certifications():
    return ", ".join(random.sample(certifications_list, random.randint(1, 3)))

def generate_institution():
    return random.choice(universities)

# Filling missing values
df["Applicant Name"].fillna(df["Applicant Name"].apply(lambda x: generate_name() if pd.isna(x) else x), inplace=True)
df["Phone"].fillna(df["Phone"].apply(lambda x: generate_phone() if pd.isna(x) else x), inplace=True)
df["Email"].fillna(df["Applicant Name"].apply(lambda x: generate_email(x) if pd.isna(df.loc[df["Applicant Name"] == x, "Email"]).all() else x), inplace=True)
df["Companies Worked For"].fillna(df["Companies Worked For"].apply(lambda x: generate_company() if pd.isna(x) else x), inplace=True)
df["Years of Work Experience"].fillna(df["Years of Work Experience"].apply(lambda x: generate_experience() if pd.isna(x) else x), inplace=True)
df["Skills"].fillna(df["Skills"].apply(lambda x: generate_skills() if pd.isna(x) else x), inplace=True)
df["LinkedIn Profile"].fillna(df["Applicant Name"].apply(lambda x: generate_linkedin(x) if pd.isna(df.loc[df["Applicant Name"] == x, "LinkedIn Profile"]).all() else x), inplace=True)
df["Education Background"].fillna(df["Education Background"].apply(lambda x: generate_education() if pd.isna(x) else x), inplace=True)
df["Certifications"].fillna(df["Certifications"].apply(lambda x: generate_certifications() if pd.isna(x) else x), inplace=True)
df["Education Institutions"].fillna(df["Education Institutions"].apply(lambda x: generate_institution() if pd.isna(x) else x), inplace=True)

# Save the updated dataset
df.to_csv("/content/large_resume_dataset_filled.csv", index=False)
print("✅ Missing values filled and saved to resume_dataset_filled.csv")


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["Applicant Name"].fillna(df["Applicant Name"].apply(lambda x: generate_name() if pd.isna(x) else x), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["Phone"].fillna(df["Phone"].apply(lambda x: generate_phone() if pd.isna(x) else x), inplace=True)
 '8936531785'

✅ Missing values filled and saved to resume_dataset_filled.csv


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["LinkedIn Profile"].fillna(df["Applicant Name"].apply(lambda x: generate_linkedin(x) if pd.isna(df.loc[df["Applicant Name"] == x, "LinkedIn Profile"]).all() else x), inplace=True)
 'https://linkedin.com/in/angelasandoval'
 'https://linkedin.com/in/ericwood' ...
 'https://linkedin.com/in/brianjones'
 'https://linkedin.com/in/deannasnyder'
 'https://linkedin.com/in/danielthomas']' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df["LinkedIn Profile"].fillna(df["Applicant Name"].apply(lambda x: generate_linkedin(x) if pd.isna(df.loc[df["Applicant Name"] == x, "Linked

In [58]:
df = pd.read_csv("/content/large_resume_dataset_filled.csv")
df.isnull().sum()

Unnamed: 0,0
Applicant Name,0
Job Role,0
Phone,0
Email,0
Companies Worked For,0
Years of Work Experience,0
Skills,0
Referees,0
LinkedIn Profile,0
Certifications,0


In [47]:
# Saving unique job roles to txt file

import pandas as pd

# Load resume dataset
input_csv = "/content/resume_dataset_with_features.csv"
df = pd.read_csv(input_csv)

# Get all unique job roles
unique_job_roles = df['Role'].unique()

# Save unique job roles to a text file
output_txt = "/content/unique_job_roles.txt"
with open(output_txt, 'w') as file:
    for role in unique_job_roles:
        file.write(f"{role}\n")

print(f"✅ Unique job roles saved to {output_txt}")


✅ Unique job roles saved to /content/unique_job_roles.txt


In [48]:
# Saving unique job roles to txt file

import pandas as pd

# Load resume dataset
input_csv = "/content/final_clean_resume.csv"
df = pd.read_csv(input_csv)

# Get all unique job roles
unique_job_roles2 = df['Job Role'].unique()

# Save unique job roles to a text file
output_txt = "/content/unique_job_roles2.txt"
with open(output_txt, 'w') as file:
    for role in unique_job_roles:
        file.write(f"{role}\n")

print(f"✅ Unique job roles saved to {output_txt}")

✅ Unique job roles saved to /content/unique_job_roles2.txt


In [None]:
# Saving unique skills to txt file

import pandas as pd

# Load structured resume dataset
input_csv = "/content/structured_resume_data.csv"
df = pd.read_csv(input_csv)

# Get all unique skills
unique_skills = df['Skills'].str.split(',').explode().str.strip().unique()

# Save unique skills to a text file
output_txt = "/content/unique_skills.txt"
with open(output_txt, 'w') as file:
    for skill in unique_skills:
        file.write(f"{skill}\n")

print(f"✅ Unique skills saved to {output_txt}")


✅ Unique skills saved to /content/unique_skills.txt


In [None]:
import pandas as pd

# Load structured resume dataset
input_csv = "/content/structured_resume_data.csv"
df = pd.read_csv(input_csv)

# Get all unique skills
unique_skills = df['Skills'].str.split(',').explode().str.strip().unique()

# Save unique skills to a text file
output_txt = "/content/unique_skills.txt"
with open(output_txt, 'w') as file:
    for skill in unique_skills:
        file.write(f"{skill}\n")

print(f"✅ Unique skills saved to {output_txt}")

In [None]:
import pandas as pd

# Load structured resume dataset
input_csv = "/content/final_clean_resume.csv"
df = pd.read_csv(input_csv)

# Get all unique skills
unique_skills = df['Skills'].str.split(',').explode().str.strip().unique()

# Save unique skills to a text file
output_txt = "/content/unique_skills2.txt"
with open(output_txt, 'w') as file:
    for skill in unique_skills:
        file.write(f"{skill}\n")

print(f"✅ Unique skills saved to {output_txt}")

✅ Unique skills saved to /content/unique_skills2.txt


# Named Entity Recognition using Hybrid Approach: spaCy and BERT

In [None]:
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
from transformers import pipeline

In [None]:
final_clean_resume = pd.read_csv('/content/final_clean_resume.csv')
final_clean_resume.head()

Unnamed: 0,Applicant Name,Job Role,Phone,Email,Companies Worked For,Years of Work Experience,Skills,Referees,LinkedIn Profile,Certifications,Education Background,Education Institutions
0,Darryl Rodriguez,"Manager, Analyst",+1-810-234-8247x16972,thompsonrichard@example.com,"State B, ##ANCI",12.0,"Excel, R",No,https://linkedin.com/in/ryan73,Certifications,"Master, Bachelor's","State B, ##ANCI"
1,Allison Joseph,"Technician, Engineering",476.373.3629x1110,cpacheco@example.com,"##CI, Current Engineering, HMA, ##P, ArcV, TOP...",3.0,"Improvement, Water Utility, Microsoft Access, ...",No,https://linkedin.com/in/murphybrian,Certified Data Analysis,M.Sc. Data Science,"##CI, Current Engineering, HMA, ##P, ArcV, TOP..."
2,Christopher Williams,Specialist,294.859.3140,diana99@example.org,Amazon,9.0,"Aid, Mass Index",No,https://linkedin.com/in/barnesmary,"Certifications, Training",Diploma,UC Berkeley
3,Elizabeth Cooper,-Manager,2008-2013,johnathan90@example.com,"Microsoft Office, K, Ex, NYSE, SF, NY, NY, Sec...",13.0,"NYSE Regulation, Amex, Floor, ##SE",No,https://linkedin.com/in/areed,"Certified, Certifications","Certificate, Certificate-Project, Certificate-LMC","Microsoft Office, K, Ex, NYSE, SF, NY, NY, Sec..."
4,Neil de Grasse Tyson,"Specialist, Manager/Supervisor",1997-2004,mcgrathtina@example.net,"MEDI, Chattanooga State, Tennessee Legislature...",15.0,"African American, Modern Times, Spanish, Spani...",No,https://linkedin.com/in/xmiller,Certified Cybersecurity,"Degrees, Master, Bachelor","MEDI, Chattanooga State, Tennessee Legislature..."


In [None]:
final_clean_resume.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2483 entries, 0 to 2482
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Applicant Name            2483 non-null   object 
 1   Job Role                  2483 non-null   object 
 2   Phone                     2483 non-null   object 
 3   Email                     2483 non-null   object 
 4   Companies Worked For      2483 non-null   object 
 5   Years of Work Experience  2483 non-null   float64
 6   Skills                    2483 non-null   object 
 7   Referees                  2483 non-null   object 
 8   LinkedIn Profile          2483 non-null   object 
 9   Certifications            2483 non-null   object 
 10  Education Background      2483 non-null   object 
 11  Education Institutions    2483 non-null   object 
dtypes: float64(1), object(11)
memory usage: 232.9+ KB


In [None]:
final_clean_resume.columns

Index(['Applicant Name', 'Job Role', 'Phone', 'Email', 'Companies Worked For',
       'Years of Work Experience', 'Skills', 'Referees', 'LinkedIn Profile',
       'Certifications', 'Education Background', 'Education Institutions'],
      dtype='object')

In [None]:
!pip install spacy transformers flair torch
!python -m spacy download en_core_web_sm


Collecting flair
  Downloading flair-0.15.1-py3-none-any.whl.metadata (12 kB)
Collecting boto3>=1.20.27 (from flair)
  Downloading boto3-1.36.20-py3-none-any.whl.metadata (6.7 kB)
Collecting conllu<5.0.0,>=4.0 (from flair)
  Downloading conllu-4.5.3-py2.py3-none-any.whl.metadata (19 kB)
Collecting ftfy>=6.1.0 (from flair)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting langdetect>=1.0.9 (from flair)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m23.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting mpld3>=0.3 (from flair)
  Downloading mpld3-0.5.10-py3-none-any.whl.metadata (5.1 kB)
Collecting pptree>=3.1 (from flair)
  Downloading pptree-3.1.tar.gz (3.0 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pytorch-revgrad>=0.2.0 (from flair)
  Downloading pytorch_revgrad-0.2.0-py3-none-any.whl.metadata (1.7 kB)


In [None]:
import spacy
import re
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
import flair
from flair.data import Sentence
from flair.models import SequenceTagger

# Load spaCy NER model
nlp_spacy = spacy.load("en_core_web_sm")

# Load BERT-based NER model (Hugging Face)
model_name = "dslim/bert-base-NER"  # Pre-trained BERT NER model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
bert_ner = pipeline("ner", model=model, tokenizer=tokenizer)

# Load Flair NER model
flair_tagger = SequenceTagger.load("flair/ner-english")

# Helper function: Extract email
def extract_email(text):
    match = re.search(r"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+", text)
    return match.group(0) if match else None

# Helper function: Extract phone number
def extract_phone(text):
    match = re.search(r"\+?\d{10,15}", text)
    return match.group(0) if match else None

# Helper function: Extract LinkedIn profile
def extract_linkedin(text):
    match = re.search(r"(https?:\/\/)?(www\.)?linkedin\.com\/[a-zA-Z0-9\-_/]+", text)
    return match.group(0) if match else None

# Extract structured data using spaCy
def extract_structured_entities(text):
    doc = nlp_spacy(text)
    name, email, phone, linkedin, certifications = None, None, None, None, []

    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name = ent.text
        elif ent.label_ == "ORG":
            certifications.append(ent.text)

    email = extract_email(text)
    phone = extract_phone(text)
    linkedin = extract_linkedin(text)

    return {
        "Applicant Name": name,
        "Phone": phone,
        "Email": email,
        "LinkedIn Profile": linkedin,
        "Certifications": ", ".join(certifications) if certifications else None
    }

# Extract contextual entities using BERT
def extract_contextual_entities(text):
    job_role, skills, companies, experience = None, [], [], None

    bert_results = bert_ner(text)
    for entity in bert_results:
        entity_text = entity["word"]
        entity_type = entity["entity"].replace("B-", "").replace("I-", "")

        if entity_type == "JOB_TITLE":
            job_role = entity_text
        elif entity_type == "ORG":
            companies.append(entity_text)
        elif entity_type == "SKILL":
            skills.append(entity_text)
        elif entity_type == "DATE":  # Assume date entities might be experience
            experience = entity_text

    return {
        "Job Role": job_role,
        "Skills": ", ".join(set(skills)) if skills else None,
        "Companies Worked For": ", ".join(set(companies)) if companies else None,
        "Years of Work Experience": experience
    }

# Extract using Flair (alternative NER)
def extract_flair_entities(text):
    sentence = Sentence(text)
    flair_tagger.predict(sentence)

    job_role, education, institutions = None, [], []

    for entity in sentence.get_spans("ner"):
        if entity.tag == "ORG":
            institutions.append(entity.text)
        elif entity.tag in ["MISC", "JOB_TITLE"]:
            job_role = entity.text
        elif entity.tag == "EDUCATION":
            education.append(entity.text)

    return {
        "Education Background": ", ".join(set(education)) if education else None,
        "Education Institutions": ", ".join(set(institutions)) if institutions else None
    }

# Full resume parsing function
def parse_resume(text):
    structured_data = extract_structured_entities(text)
    contextual_data = extract_contextual_entities(text)
    flair_data = extract_flair_entities(text)

    # Merge results
    parsed_resume = {**structured_data, **contextual_data, **flair_data, "Referees": None}

    return parsed_resume

# Example usage
resume_text = """
Moses Mugambi Data Scientist Email: mugambimoses2@gmail.com | Phone: +254718695260 LinkedIn: linkedin.com/in/moses-mugambi-njeru | GitHub: github.com/mk7890
Professional Summary
Motivated Data Scientist with expertise in Python, data analysis, visualizations, and machine learning. Proficient in SQL, Tableau, Excel, and web scraping. Experienced in predictive modelling, clustering, and NLP. Passionate about deriving insights from data to solve real-world problems.
Work Experience
• Resume Parser. Built an NLP pipeline for extracting key resume details.
• Google Play Store Apps & YouTube Analysis: Data-driven insights using SQL, Pandas, and visualizations.
• Machine Learning Projects: Implemented Regression, Classification, Clustering, PCA, and CNNs.
• Time Series Analysis: Forecasted air quality trends using statistical models.
• NLP Sentiment Analysis: Analysed customer reviews for sentiment trends.
• Voice-Controlled Calculator: Developed a Python-based scientific calculator with speech recognition functionality.
Skills
• Hard Skills: Python (Pandas, NumPy, Scikit-Learn), SQL, Tableau, Machine Learning, NLP, Web Scraping, Data Visualization.
• Soft Skills: Problem-Solving, Analytical Thinking, Communication, Teamwork, Adaptability.
Education
Data Science Full - Time Zindua School, September 2024 – February 2025
B.Sc. Electrical and Electronics Engineering The Technical University of Kenya, 2013 – 2018 : Second Upper Class Honours
High School Certificate Moi High School Mbiruri, 2008 – 2011 : Grade A plain of 81 points
"""

parsed_resume = parse_resume(resume_text)
print(parsed_resume)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


pytorch_model.bin:   0%|          | 0.00/419M [00:00<?, ?B/s]

2025-02-13 23:21:58,252 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
{'Applicant Name': 'Scikit-Learn', 'Phone': '+254718695260', 'Email': 'mugambimoses2@gmail.com', 'LinkedIn Profile': 'linkedin.com/in/moses-mugambi-njeru', 'Certifications': 'GitHub, Motivated Data Scientist, NLP, NLP, SQL, Pandas, • Machine Learning Projects, PCA, • NLP Sentiment Analysis, • Voice-Controlled, Skills\n• Hard Skills: Python, Machine Learning, NLP, Data Visualization, Analytical Thinking, Adaptability, Education, Electrical and Electronics Engineering The Technical University', 'Job Role': None, 'Skills': None, 'Companies Worked For': 'and, Kenya, Z, ##ua, Scientist, Google, Electronics, of, Technical, University, ##ind', 'Years of Work Experience': None, 'Education Background': None, 'Education Institutions': 'Technical University of Kenya, Moses Mugambi Dat