Resume Screening Using NLP (w/ HF Dataset)

In [146]:
# load the dataset

from datasets import load_dataset

dataset = load_dataset('cnamuangtoun/resume-job-description-fit', split='train')

In [147]:
print(dataset)

Dataset({
    features: ['resume_text', 'job_description_text', 'label'],
    num_rows: 6241
})


In [148]:
# convert to pandas

df = dataset.to_pandas()

In [149]:
df.head(10)

Unnamed: 0,resume_text,job_description_text,label
0,SummaryHighly motivated Sales Associate with e...,Net2Source Inc. is an award-winning total work...,No Fit
1,Professional SummaryCurrently working with Cat...,At Salas OBrien we tell our clients that were ...,No Fit
2,SummaryI started my construction career in Jun...,Schweitzer Engineering Laboratories (SEL) Infr...,No Fit
3,SummaryCertified Electrical Foremanwith thirte...,"Mizick Miller & Company, Inc. is looking for a...",No Fit
4,SummaryWith extensive experience in business/r...,Life at Capgemini\nCapgemini supports all aspe...,No Fit
5,"SummarySolution-oriented, results-driven strat...",\n\nResponsibilitiesLead and provide day-to-da...,No Fit
6,SummaryA position in a company that will utili...,"Senior Salesforce Software Engineer, Salesforc...",No Fit
7,SummaryTo participate as a team member in a dy...,***W2 ONLY** \nSoftware Engineer (12-month con...,No Fit
8,SummaryMore than ten years of progressive expe...,Calling all innovators find your future at Fi...,No Fit
9,SummaryCERTIFIED SOFTWARE DEVELOPMENT PROFESSI...,Who We AreWere reinventing the egg protein bus...,No Fit


In [150]:
# keep only the resume text; drop empty and duplicate rows

import pandas as pd

df = df.drop(columns=['job_description_text', 'label'])
df.replace('', pd.NA, inplace=True)
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

In [151]:
# add a sequential id column

df.insert(0, 'id', range(0, len(df)))

In [152]:
df

Unnamed: 0,id,resume_text
0,0,SummaryHighly motivated Sales Associate with e...
1,1,Professional SummaryCurrently working with Cat...
2,2,SummaryI started my construction career in Jun...
3,3,SummaryCertified Electrical Foremanwith thirte...
4,4,SummaryWith extensive experience in business/r...
...,...,...
4981,637,Professional SummaryCurrent Accountant with th...
5283,638,Career OverviewHighly skilled SOFTWARE QUALITY...
5318,639,SummaryAmbitious electrical engineer capable o...
5861,640,Career OverviewClients/Employers: Thompson Reu...


In [154]:
# sample job description to screen the resumes against

job_description = '''
Full job description
DescriptionAre you passionate about telecommunications and eager to gain hands-on experience in a dynamic and innovative environment? This is your opportunity to work on cutting-edge projects, collaborate with industry experts, and make a real impact in a variety of fields.

What We’re Looking For:

Fresh graduates having academic background in Networking or Telecommunication
Basic knowledge of LAN, WAN, DHCP, DNS, IP Subnetting, etc.
Excellent English communication skills are a must
Must be comfortable working on a rotational shift schedule (evenings, nights, weekends).
Location:
E-11/2, Islamabad (Onsite)
Timings:
2:00 PM to 8:00 PM (Rotational)

Why MCG Technologies?

Innovative Culture: Be part of a forward-thinking company that values creativity and innovation.
Professional Development: Gain practical experience that will prepare you for a successful career in technology.
Supportive Environment: Work in a collaborative and supportive setting that encourages growth and development.
Competitive Stipend: Receive a competitive stipend for your contributions and hard work.
Job Type: Internship
Contract length: 6 weeks

Application Question(s):

Are you comfortable working on-site ( E11/2) from 2pm-8pm? Apply only if yes.
Work Location: In person
'''

In [155]:
# preprocessing resumes and job description

import re
import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load('en_core_web_lg')

def preprocessing_function(text):
    if not isinstance(text, str):
        return ''

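    # Section headers to strip. In this dataset they are often fused to the
    # following word (e.g. 'SummaryHighly'), so the pattern below has no word
    # boundaries; as a side effect it also removes these strings inside longer
    # words (e.g. 'Network' -> 'Net').
    headers = [
        'professional summary', 'career overview', 'work history',
        'technical skills', 'core competencies', 'work experience',
        'professional', 'summary', 'career', 'overview', 'work', 'history',
        'technical', 'skills', 'core', 'competencies', 'experience',
        'about', 'profile', 'introduction', 'background', 'objective',
        'education', 'qualifications', 'interests',
        'projects', 'certifications', 'responsibilities', 'job'
    ]

    # replace punctuation, symbols and line breaks with spaces
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)

    # remove headers (case-insensitive) along with any trailing whitespace
    pattern = r'(?i)(?:' + '|'.join(re.escape(h).replace(r'\ ', r'\s*') for h in headers) + r')\s*'
    text = re.sub(pattern, '', text)

    # drop digits, then collapse runs of whitespace
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()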

    doc = nlp(text)

    tokens = [token.lemma_.lower() for token in doc if token.is_alpha]

    return ' '.join(tokens)


model = SentenceTransformer("all-MiniLM-L6-v2")

df['preprocessed_resume_text'] = df['resume_text'].apply(preprocessing_function)
job_description = preprocessing_function(job_description)

resume_vec = model.encode(
    df['preprocessed_resume_text'].tolist(),
    batch_size=16,
    show_progress_bar=True,
    convert_to_tensor=True,  # same type and device as jd_vec below
)

jd_vec = model.encode(job_description, convert_to_tensor=True)

Batches:   0%|          | 0/41 [00:00<?, ?it/s]

In [156]:
# cosine similarity between each resume embedding and the job description embedding

similarity_scores = util.cos_sim(resume_vec, jd_vec)

# move to CPU before converting, in case the model ran on a GPU
df['similarity_score'] = similarity_scores.cpu().numpy().flatten()
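
For reference, the score computed above is the cosine of the angle between a resume embedding and the job description embedding. The sketch below reproduces that formula in pure Python on toy 2-D vectors; the helper name `cosine_similarity` and the vectors are illustrative assumptions, not part of the notebook, and real embeddings from `all-MiniLM-L6-v2` have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # convention chosen for this sketch: a zero vector scores 0.0
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# toy "resume" vectors compared against one toy "job description" vector
toy_resumes = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
toy_jd = [1.0, 0.0]

scores = [cosine_similarity(r, toy_jd) for r in toy_resumes]
print(scores)
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and vector length does not matter, which is why resumes of very different lengths can still be ranked on one scale.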

In [157]:
# top resumes based on cosine similarity

print(df.sort_values(by='similarity_score', ascending=False).head(5))

       id                                        resume_text  \
67     62  ProfileTo obtain the position of a Telecommuni...   
568   357  SummaryI am a computer engineer with over 12 y...   
1279  561  Professional SummaryWith the attitude of learn...   
850   461  SummaryI am seeking a position that utilizes m...   
553   352  Career OverviewSeeking a Network Support posit...   

                               preprocessed_resume_text  similarity_score  
67    to obtain the position of a telecommunications...          0.618741  
568   i be a computer engineer with over year of ded...          0.608466  
1279  with the attitude of learn i be look for an in...          0.603528  
850   i be seek a position that utilize my strong so...          0.586564  
553   seek a netsupport position in a progressive fi...          0.569200  
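
The token `netsupport` in the last row comes from the header-stripping regex, which removes `work` (and the space after it) from inside `Network Support`. The sketch below reproduces the effect on a three-entry subset of the header list and compares it with a variant anchored by a leading `\b`. The function `build_pattern` and its `anchored` flag are hypothetical names introduced for illustration; the notebook itself uses the unanchored form.

```python
import re

# Small subset of the notebook's header list, for illustration only.
headers = ['work history', 'work', 'summary']

def build_pattern(headers, anchored):
    """Build the header-stripping regex; `anchored` adds a leading word boundary."""
    alternatives = '|'.join(re.escape(h).replace(r'\ ', r'\s*') for h in headers)
    prefix = r'\b' if anchored else ''
    return r'(?i)' + prefix + r'(?:' + alternatives + r')\s*'

text = 'SummarySeeking a Network Support position'

# unanchored: strips the fused header, but also 'work ' inside 'Network Support'
unanchored_result = re.sub(build_pattern(headers, anchored=False), '', text)

# anchored: a header must start at a word boundary, so 'Network' survives
anchored_result = re.sub(build_pattern(headers, anchored=True), '', text)

print(unanchored_result)
print(anchored_result)
```

A leading boundary still handles headers fused to the next word, such as `SummarySeeking`. It is not a complete fix, though: a word that begins with a header, such as `working`, would still be trimmed to `ing`.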


Bonus: Resume Screening Application (w/o HF Dataset)

In [5]:
# load models

import spacy
from sentence_transformers import SentenceTransformer, util  # util.cos_sim is used for scoring below

nlp = spacy.load('en_core_web_lg')
model = SentenceTransformer('all-MiniLM-L6-v2')

In [21]:
# text extraction and keyword helpers

import fitz  # PyMuPDF
import docx  # python-docx
import io

def extract_text(file):
    # accept either a file path or an uploaded file object exposing .name;
    # resolve the name first so the error message below can always use it
    filename = file if isinstance(file, str) else getattr(file, 'name', '')
    try:
        if isinstance(file, str):
            with open(file, "rb") as f:
                file_bytes = f.read()
        else:
            file_bytes = file.read()

        file_ext = filename.split(".")[-1].lower()

        if file_ext == 'txt':
            return file_bytes.decode('utf-8')
        elif file_ext == 'pdf':
            doc = fitz.open(stream=io.BytesIO(file_bytes), filetype='pdf')
            return ' '.join([page.get_text() for page in doc])
        elif file_ext == 'docx':
            doc = docx.Document(io.BytesIO(file_bytes))
            return '\n'.join([para.text for para in doc.paragraphs])
        # unsupported extension
        return ''
    except Exception as e:
        # callers detect failures by checking for the 'Error' prefix
        return f"Error reading file '{filename}': {e}"

def extract_keywords(text):
    if nlp is None:
        print("Warning: spaCy model not loaded for keyword extraction.")
        return set()
    doc = nlp(text)
    return {token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop and not token.is_punct}

def extract_keywords_for_model(text):
    # sort so the model input is deterministic (set order can vary between sessions)
    return ' '.join(sorted(extract_keywords(text)))

def get_common_words(processed_resume_str, processed_jd_str):
    resume_words = set(processed_resume_str.split())
    jd_words = set(processed_jd_str.split())
    common = sorted(list(resume_words.intersection(jd_words)))
    return ', '.join(common)
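
As a quick illustration of the common-keyword step, the sketch below applies the same set intersection to two short, already-preprocessed strings. The function name `common_words` and the sample strings are made up for this example; in the notebook the inputs come from `extract_keywords_for_model`.

```python
def common_words(resume_str, jd_str):
    """Words present in both whitespace-separated strings, sorted alphabetically."""
    shared = set(resume_str.split()) & set(jd_str.split())
    return ', '.join(sorted(shared))

# hypothetical preprocessed inputs
resume = 'network engineer dns dhcp python'
jd = 'graduate network dns dhcp subnetting'

overlap = common_words(resume, jd)
print(overlap)  # dhcp, dns, network
```

The overlap is an exact-match list of lemmas, so it explains part of a score but will not show semantically related terms that the embedding model may still treat as similar.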

In [64]:
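# scoring: single-resume and batch modes

def single_resume_mode(resume_file, job_description):
    resume_text = extract_text(resume_file)
    # extract_text reports failures as a string starting with 'Error'
    if resume_text.startswith('Error'):
        return f'❌ {resume_text}', '', '', '', ''
    jd_text = job_description

    processed_resume = extract_keywords_for_model(resume_text)
    processed_jd = extract_keywords_for_model(jd_text)

    resume_vec = model.encode(processed_resume, convert_to_tensor=True)
    jd_vec = model.encode(processed_jd, convert_to_tensor=True)

    # cosine similarity as a percentage, then rescaled upward:
    # raw_score + 40 * (1 - raw_score / 100) = 40 + 0.6 * raw_score
    raw_score = float(util.cos_sim(resume_vec, jd_vec).item()) * 100
    bias = 40 * (1 - raw_score / 100)
    similarity_score = min(raw_score + bias, 100)
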
    common_keywords = get_common_words(processed_resume, processed_jd)

    result = f'### **Match Score: {similarity_score:.5f}%**\n'

    if similarity_score >= 80:
        result += '✅ **Strong Match:** Excellent alignment with the job description.'
    elif similarity_score >= 60:
        result += '⚠️ **Moderate Match:** Shows potential with some key overlaps.'
    else:
        result += '❌ **Weak Match:** Lacks significant alignment with the role.'

    return result, common_keywords, resume_text, processed_resume, processed_jd

def batch_resume_mode(resume_files, job_description_text):
    jd_text = job_description_text
    processed_jd = extract_keywords_for_model(jd_text)
    jd_vec = model.encode(processed_jd, convert_to_tensor=True)

    results = []
    for file in resume_files:
        resume_text = extract_text(file)
        if resume_text.startswith('Error'):
            score = 0
        else:
            processed_resume = extract_keywords_for_model(resume_text)
            resume_vec = model.encode(processed_resume, convert_to_tensor=True)

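            # same rescaling as in single_resume_mode: 40 + 0.6 * raw_score
            raw_score = float(util.cos_sim(resume_vec, jd_vec).item()) * 100
            bias = 40 * (1 - raw_score / 100)
            score = min(raw_score + bias, 100)

        results.append({
            # uploaded files expose .name; plain path strings are used as-is
            'File Name': getattr(file, 'name', file),
            'Fit (%)': round(score, 5)
        })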

    results_df = pd.DataFrame(results).sort_values(by='Fit (%)', ascending=False).reset_index(drop=True)
    return results_df
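
The `bias` term used in both functions above simplifies to an affine map, `40 + 0.6 * raw_score`, so the displayed fit is always higher than the raw cosine percentage unless the raw score is 100. The sketch below isolates that arithmetic; the function name `rescale` is introduced here for illustration and does not exist in the notebook.

```python
def rescale(raw_score):
    """Apply the notebook's bias formula to a raw cosine score given in percent."""
    bias = 40 * (1 - raw_score / 100)
    return min(raw_score + bias, 100)

# a raw score of 0 is shown as 40, 50 as 70, and 100 stays at 100
for raw in (0, 50, 100):
    print(raw, rescale(raw))
```

Read against the thresholds in `single_resume_mode`, a displayed score of 80 ("Strong Match") corresponds to a raw cosine of about 66.7%, and a displayed 60 ("Moderate Match") to about 33.3%.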

In [62]:
# overall matching function

import pandas as pd
import traceback

def match_resumes(mode, resume_files, job_description_text):
    try:
        if not job_description_text.strip():
            return '❌ Error: Job Description cannot be empty.', '', pd.DataFrame(), '', '', ''
        if not resume_files:
            return '❌ Error: Please upload at least one resume.', '', pd.DataFrame(), '', '', ''

        if mode == 'Single Resume':
            if len(resume_files) > 1:
                return '⚠️ Please upload only one resume for Single mode.', '', pd.DataFrame(), '', '', ''
            score_text, common_keywords, resume_text, processed_resume, processed_jd = single_resume_mode(resume_files[0], job_description_text)
            return score_text, common_keywords, pd.DataFrame(), resume_text, processed_resume, processed_jd

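        elif mode == 'Multiple Resumes':
            result_table = batch_resume_mode(resume_files, job_description_text)
            return '✅ Batch processing complete. See ranked results below.', '', result_table, '', '', ''

        # any other mode would otherwise return None instead of the six values the UI expects
        return f'❌ Error: Unknown mode: {mode}', '', pd.DataFrame(), '', '', ''
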
    except Exception as e:
        traceback.print_exc()
        return f'❌ An unexpected error occurred: {str(e)}', '', pd.DataFrame(), '', '', ''

In [65]:
# front-end ui

import gradio as gr

theme = gr.themes.Monochrome(
    neutral_hue="slate",
    font=["IBM Plex Sans", "Roboto", "sans-serif"]
).set(
    background_fill_primary="*neutral_950",
    background_fill_secondary="*neutral_900",
    border_color_accent="*red_700",
    block_background_fill="*neutral_800",
    button_primary_background_fill="*red_600",
    button_primary_background_fill_hover="*red_700",
    button_secondary_background_fill="*neutral_700",
    button_secondary_background_fill_hover="*neutral_600",
    input_background_fill="*neutral_800",
    panel_background_fill="*neutral_950",
)

with gr.Blocks(theme=theme, title="🧠 Resume Screening using NLP") as demo:
    gr.Markdown("# 🧠 Resume Screening Tool")
    gr.Markdown("""
    <p style="color: #A3A3A3;">Upload one or more resumes and paste a job description to get an intelligent semantic match score powered by NLP.</p>
    <p style="color: #A3A3A3;"><strong>Supports:</strong> <span style="color: #EF4444;">.pdf, .docx, .txt</span> | <strong>Modes:</strong> Single or Batch</p>
    """)

    gr.Markdown("---")

    with gr.Row():
        with gr.Column(scale=1, min_width=300):
            gr.Markdown("## 📁 Input Section")

            mode_dropdown = gr.Dropdown(
                ["Single Resume", "Multiple Resumes"],
                label="🎯 Mode",
                value="Single Resume"
            )
            resume_input = gr.File(
                label="📄 Upload Resume(s)",
                file_types=[".pdf", ".docx", ".txt"],
                file_count="multiple"
            )
            jd_input = gr.Textbox(
                label="📝 Paste Job Description",
                placeholder="Paste the full job description here...",
                lines=12
            )

            with gr.Row():
                clear_button = gr.ClearButton(value="🧹 Clear All", variant="secondary")
                match_button = gr.Button("🚀 Screen Resume(s)", variant="primary")

        with gr.Column(scale=2, min_width=500):
            gr.Markdown("## 📊 Results")
            with gr.Group(visible=True) as single_mode_outputs:
                gr.Markdown("### Match Results (Single Resume)")
                output_score = gr.Markdown()
                output_common_entities = gr.Textbox(
                    label="🔍 Common Keywords Found",
                    interactive=False,
                    #lines automatic
                )
                with gr.Accordion("📂 Resume & JD Details", open=False):
                    output_resume_text = gr.Textbox(label="📄 Extracted Resume Text", interactive=False, lines=10, max_lines=10)
                    output_preprocessed_resume = gr.Textbox(label="🔧 Preprocessed Resume Text", interactive=False, lines=10, max_lines=10)
                    output_preprocessed_jd = gr.Textbox(label="🔧 Preprocessed JD Text", interactive=False, lines=10, max_lines=10)

            with gr.Group(visible=False) as multi_mode_output:
                gr.Markdown("### Ranked Resumes (Batch Mode)")
                output_table = gr.Dataframe(label="🧮 Resume Match Scores", wrap=True, interactive=False)

    def update_visibility(mode):
        return {
            single_mode_outputs: gr.update(visible=(mode == "Single Resume")),
            multi_mode_output: gr.update(visible=(mode == "Multiple Resumes")),
        }

    mode_dropdown.change(
        fn=update_visibility,
        inputs=[mode_dropdown],
        outputs=[single_mode_outputs, multi_mode_output]
    )

    match_button.click(
        fn=match_resumes,
        inputs=[mode_dropdown, resume_input, jd_input],
        outputs=[
            output_score,
            output_common_entities,
            output_table,
            output_resume_text,
            output_preprocessed_resume,
            output_preprocessed_jd
        ]
    )

    clear_button.add([
        resume_input, jd_input, output_score, output_common_entities,
        output_table, output_resume_text, output_preprocessed_resume, output_preprocessed_jd
    ])

demo.launch(debug=True)



