<a href="https://colab.research.google.com/github/nitishkpandey/EDA-Python-ML/blob/main/NLP_for_Intelligent_Talent_Acquisition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NLP Framework for Intelligent Talent Acquisition**

Author: Nitish Kumar Pandey

### **1. Introduction of the Business Context**

#### **Client Overview**

The client is a large recruitment platform (similar to Indeed, LinkedIn, StepStone, etc.) that receives thousands of resumes daily. They want to automate the candidate screening process to improve efficiency and reduce manual workload for recruiters.

#### **Business Problem**

Recruiters struggle to manually analyze resumes, compare them across roles, and shortlist suitable candidates. This leads to:

*   Slow hiring cycles
*   Inconsistent evaluation
*   High recruiter workload
*   Missed potential candidates

The client wants an **NLP-driven system** that:

*   Understands resume content
*   Extracts key information (skills, experience, education)
*   Compares it with job descriptions
*   Predicts how well a candidate matches a job
*   Supports recruiters with AI-driven shortlisting

### **2. Business Benefits**

#### **For the Business / Recruitment Platform**

*   Automated candidate shortlisting
*   Consistent and data-driven evaluation
*   Identification of top candidates faster
*   Improved applicant experience (reduced waiting time)
*   Increased recruiter productivity
*   Better job-candidate matching → higher hiring success

#### **For Recruiters / Clients**

*   Easy comparison of candidates
*   Transparent similarity/match scoring
*   Identifies missing or incorrect skills
*   Helps focus on high-value interviews instead of resume reading

#### **For Job Seekers**

*   Better visibility if their skills match roles
*   More accurate recommendations

### **3. Dataset Information & Collection**

#### **Dataset Source**

The dataset is taken from: https://www.kaggle.com/datasets/saugataroyarghya/resume-dataset

#### **Dataset Description**

The dataset contains:

#### **Resume information:**

*   Skills
*   Experience
*   Projects
*   Career objective
*   Certifications

#### **Job description information:**

*   Job position
*   Responsibilities
*   Required skills
*   Experience requirements

#### **Labels:**

*   matched_score: numeric rating of how well a candidate matches a job posting

#### **Why this Dataset?**

It supports both:

*   **NER task**: extract structured entities from resumes
*   **Matching task**: predict similarity score between candidate & job

### **4. Formulating as an NLP Task**

This project consists of three NLP components:

#### **i. Resume Parsing (NER Task)**

Extract entities like:
*   Skills
*   Experience years
*   Degree / Education
*   Tools / Technologies
*   Companies

#### **ii. Text Representation & Vectorization**

Convert resume & job text into numerical representations using:
*   TF-IDF
*   Sentence embeddings (optional)
*   Skill overlap features
*   Similarity measures

#### **iii. Candidate–Job Matching (Regression/Classification)**

Using constructed features to predict:
*   Match score (regression)

    or

*   High/Medium/Low suitability (classification)
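
The classification variant can be derived from the numeric score by binning. The sketch below is only an illustration: the cut-offs (0.5 and 0.75) and the variable name `suitability` are assumptions, not values taken from the dataset or the client.

```python
import pandas as pd

# A few example scores in the same 0-1 range as matched_score.
scores = pd.Series([0.85, 0.75, 0.416667, 0.0], name="matched_score")

# Hypothetical cut-offs: [0, 0.5] Low, (0.5, 0.75] Medium, (0.75, 1] High.
suitability = pd.cut(
    scores,
    bins=[0.0, 0.5, 0.75, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,  # keep a score of exactly 0 in the Low bin
)

print(suitability.tolist())
```

In practice the thresholds would be agreed with recruiters or chosen from the score distribution (for example, quantiles).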

### **5. Importing Required Libraries**

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings("ignore")

### **6. Fetching & Viewing the Data**

In [6]:
df = pd.read_csv("resume_data.csv")

In [7]:
print(df.shape)
df.head()

(9544, 35)


Unnamed: 0,address,career_objective,skills,educational_institution_name,degree_names,passing_years,educational_results,result_types,major_field_of_studies,professional_company_names,...,online_links,issue_dates,expiry_dates,﻿job_position_name,educationaL_requirements,experiencere_requirement,age_requirement,responsibilities.1,skills_required,matched_score
0,,Big data analytics working and database wareho...,"['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapr...",['The Amity School of Engineering & Technology...,['B.Tech'],['2019'],['N/A'],[None],['Electronics'],['Coca-COla'],...,,,,Senior Software Engineer,B.Sc in Computer Science & Engineering from a ...,At least 1 year,,Technical Support\nTroubleshooting\nCollaborat...,,0.85
1,,Fresher looking to join as a data analyst and ...,"['Data Analysis', 'Data Analytics', 'Business ...","['Delhi University - Hansraj College', 'Delhi ...","['B.Sc (Maths)', 'M.Sc (Science) (Statistics)']","['2015', '2018']","['N/A', 'N/A']","['N/A', 'N/A']","['Mathematics', 'Statistics']",['BIB Consultancy'],...,,,,Machine Learning (ML) Engineer,M.Sc in Computer Science & Engineering or in a...,At least 5 year(s),,Machine Learning Leadership\nCross-Functional ...,,0.75
2,,,"['Software Development', 'Machine Learning', '...","['Birla Institute of Technology (BIT), Ranchi']",['B.Tech'],['2018'],['N/A'],['N/A'],['Electronics/Telecommunication'],['Axis Bank Limited'],...,,,,"Executive/ Senior Executive- Trade Marketing, ...",Master of Business Administration (MBA),At least 3 years,,"Trade Marketing Executive\nBrand Visibility, S...",Brand Promotion\nCampaign Management\nField Su...,0.416667
3,,To obtain a position in a fast-paced business ...,"['accounts payables', 'accounts receivables', ...","['Martinez Adult Education, Business Training ...",['Computer Applications Specialist Certificate...,['2008'],[None],[None],['Computer Applications'],"['Company Name ï¼ City , State', 'Company Name...",...,,,,Business Development Executive,Bachelor/Honors,1 to 3 years,Age 22 to 30 years,Apparel Sourcing\nQuality Garment Sourcing\nRe...,Fast typing skill\nIELTSInternet browsing & on...,0.76
4,,Professional accountant with an outstanding wo...,"['Analytical reasoning', 'Compliance testing k...",['Kent State University'],['Bachelor of Business Administration'],[None],['3.84'],[None],['Accounting'],"['Company Name', 'Company Name', 'Company Name...",...,[None],[None],"['February 15, 2021']",Senior iOS Engineer,Bachelor of Science (BSc) in Computer Science,At least 4 years,,iOS Lifecycle\nRequirement Analysis\nNative Fr...,iOS\niOS App Developer\niOS Application Develo...,0.65


### **7. Data Cleaning & Preprocessing**

#### **7.1 Remove Irrelevant or Redundant Columns**

*   Drop fields that do not contribute to NLP matching
*   Examples: address, IDs, license fields, unnecessary metadata
*   Provide a short justification for each dropped column

In [8]:
df.columns.tolist()

['address',
 'career_objective',
 'skills',
 'educational_institution_name',
 'degree_names',
 'passing_years',
 'educational_results',
 'result_types',
 'major_field_of_studies',
 'professional_company_names',
 'company_urls',
 'start_dates',
 'end_dates',
 'related_skils_in_job',
 'positions',
 'locations',
 'responsibilities',
 'extra_curricular_activity_types',
 'extra_curricular_organization_names',
 'extra_curricular_organization_links',
 'role_positions',
 'languages',
 'proficiency_levels',
 'certification_providers',
 'certification_skills',
 'online_links',
 'issue_dates',
 'expiry_dates',
 '\ufeffjob_position_name',
 'educationaL_requirements',
 'experiencere_requirement',
 'age_requirement',
 'responsibilities.1',
 'skills_required',
 'matched_score']

In [9]:
cols_to_drop = [
    'address',
    'company_urls',
    'online_links',
    'issue_dates',
    'expiry_dates'
]
cols_to_drop = [c for c in cols_to_drop if c in df.columns]

df.drop(columns=cols_to_drop, inplace=True)
df.shape

(9544, 30)

These columns either represent metadata (addresses, URLs) or information that is not directly useful for text-based similarity and matching, so they are removed to simplify the dataset.

#### **7.2 Fix Column Names**

Some column names contain artifacts: a `.1` suffix that pandas appends to a duplicated header (responsibilities.1) and a leading byte-order mark (BOM, `\ufeff`) in job_position_name.

In [10]:
rename_map = {}
for col in df.columns:
    if col.startswith('\ufeff'):
        rename_map[col] = col.replace('\ufeff', '')

rename_map['responsibilities.1'] = 'responsibilities'

df.rename(columns=rename_map, inplace=True)

df.columns.tolist()

['career_objective',
 'skills',
 'educational_institution_name',
 'degree_names',
 'passing_years',
 'educational_results',
 'result_types',
 'major_field_of_studies',
 'professional_company_names',
 'start_dates',
 'end_dates',
 'related_skils_in_job',
 'positions',
 'locations',
 'responsibilities',
 'extra_curricular_activity_types',
 'extra_curricular_organization_names',
 'extra_curricular_organization_links',
 'role_positions',
 'languages',
 'proficiency_levels',
 'certification_providers',
 'certification_skills',
 'job_position_name',
 'educationaL_requirements',
 'experiencere_requirement',
 'age_requirement',
 'responsibilities',
 'skills_required',
 'matched_score']

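I have removed the hidden BOM character and renamed responsibilities.1 to responsibilities. Note that, as the list above shows, this leaves two columns with the same name (the resume-side and the job-side responsibilities); giving the job-side column a distinct name would avoid ambiguity when it is selected later.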
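
Since the column list above contains `responsibilities` twice, one alternative is to give the job-side column its own name. The sketch below uses a toy frame with made-up values; `job_responsibilities` is a hypothetical name, and any later cell that selects the job-side column would need to use it as well.

```python
import pandas as pd

# Toy frame with the same naming problems as the real CSV (values are made up).
toy = pd.DataFrame({
    "responsibilities": ["built ETL pipelines"],
    "\ufeffjob_position_name": ["Data Engineer"],
    "responsibilities.1": ["maintain the data warehouse"],
})

# Strip the BOM from every column name.
toy = toy.rename(columns=lambda c: c.replace("\ufeff", ""))

# Give the job-side column a distinct (hypothetical) name instead of reusing
# "responsibilities", so both columns stay individually addressable.
toy = toy.rename(columns={"responsibilities.1": "job_responsibilities"})

print(toy.columns.tolist())
print("unique column names:", toy.columns.is_unique)
```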

#### **7.3 Handle Missing Values**

Breaking into 3 sub-steps:

##### **7.3.1 Fill missing textual fields**

We fill NaN in important text columns with empty strings so text processing functions don’t break.

In [11]:
resume_text_cols = [
    'career_objective',
    'skills',
    'experience',
    'projects',
    'certification_skills'
]

job_text_cols = [
    'job_position_name',
    'responsibilities',
    'skills_required',
    'educationaL_requirements',
    'experiencere_requirement'
]

resume_text_cols = [c for c in resume_text_cols if c in df.columns]
job_text_cols = [c for c in job_text_cols if c in df.columns]

for col in resume_text_cols + job_text_cols:
    df[col] = df[col].fillna("")

Missing text is treated as “no information” rather than dropping rows. Filling with empty strings ensures downstream NLP operations (like concatenation, TF-IDF) work correctly.

##### **7.3.2 Handle Missing matched_score (Target)**

In [12]:
before_rows = len(df)
df = df.dropna(subset=['matched_score'])
after_rows = len(df)

print(f"Rows before dropping missing scores: {before_rows}")
print(f"Rows after: {after_rows} | Dropped: {before_rows - after_rows}")

Rows before dropping missing scores: 9544
Rows after: 9544 | Dropped: 0


Rows without matched_score cannot be used for supervised learning, so they are removed to keep the training data consistent.

##### **7.3.3 Ensure matched_score is Numeric**

In [13]:
df['matched_score'] = pd.to_numeric(df['matched_score'], errors='coerce')
df['matched_score'].describe()

Unnamed: 0,matched_score
count,9544.0
mean,0.660831
std,0.16704
min,0.0
25%,0.583333
50%,0.683333
75%,0.793333
max,0.97


We ensure the target variable is numeric so that regression models can be applied without type issues.
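
`errors='coerce'` turns any non-numeric entry into NaN, so coercion can create missing targets after the `dropna` in 7.3.2 has already run. Here the count stays at 9544, so nothing was affected, but a more defensive ordering is to coerce first and drop afterwards. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up scores: one entry cannot be parsed as a number, one is missing.
scores = pd.DataFrame({"matched_score": ["0.85", "0.75", "n/a", None]})

# Coerce first: unparseable strings become NaN alongside the missing value.
scores["matched_score"] = pd.to_numeric(scores["matched_score"], errors="coerce")

# Then drop, so rows that only became NaN through coercion are removed too.
scores = scores.dropna(subset=["matched_score"]).reset_index(drop=True)

print(scores["matched_score"].tolist())
```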

#### **7.4 Select Relevant Columns for NLP Pipeline**

In [14]:
resume_cols = [
    'career_objective',
    'skills',
    'experience',
    'projects',
    'certification_skills'
]
resume_cols = [c for c in resume_cols if c in df.columns]

job_cols = [
    'job_position_name',
    'responsibilities',
    'skills_required',
    'educationaL_requirements',
    'experiencere_requirement'
]
job_cols = [c for c in job_cols if c in df.columns]

target_col = 'matched_score'

print("Resume columns:", resume_cols)
print("Job columns:", job_cols)
print("Target column:", target_col)

Resume columns: ['career_objective', 'skills', 'certification_skills']
Job columns: ['job_position_name', 'responsibilities', 'skills_required', 'educationaL_requirements', 'experiencere_requirement']
Target column: matched_score


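I have explicitly defined which columns form the resume text, which form the job description, and which column is the target label. Because `experience` and `projects` do not exist in this dataset, the filter drops them and the resume text is built from the three remaining columns shown above. This makes the pipeline structure clear and reproducible.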

#### **7.5 Create Combined Text Fields (resume_text, job_text)**

In [15]:
df['resume_text'] = df[resume_cols].agg(' '.join, axis=1)

df['job_text'] = df[job_cols].agg(' '.join, axis=1)

df[['resume_text', 'job_text']].head()

Unnamed: 0,resume_text,job_text
0,Big data analytics working and database wareho...,Senior Software Engineer Technical Support\nTr...
1,Fresher looking to join as a data analyst and ...,Machine Learning (ML) Engineer Machine Learnin...
2,"['Software Development', 'Machine Learning', ...","Executive/ Senior Executive- Trade Marketing, ..."
3,To obtain a position in a fast-paced business ...,Business Development Executive Apparel Sourcin...
4,Professional accountant with an outstanding wo...,Senior iOS Engineer iOS Lifecycle\nRequirement...


Recruiters read resumes and job descriptions as full documents, not individual fields. Combining columns into resume_text and job_text reflects this and prepares the data for unified NLP processing (TF-IDF, similarity, NER).

#### **7.6 Define Text Cleaning Function**

In [16]:
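def clean_text(text: str) -> str:
    """Lower-case the text and strip URLs and unusual characters."""
    text = text.lower()
    # Remove URLs (http\S+ also covers https links).
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)
    # Keep letters, digits, whitespace and . , - + / (so tokens like c++ survive).
    text = re.sub(r"[^a-z0-9\s\.,\-+/]", " ", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text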

Cleaning removes noise such as URLs and strange characters, and standardizes casing. This helps the vectorizer focus on meaningful tokens and improves model stability.
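
To see what the function keeps and drops, the sketch below applies the same logic to a made-up string (the contact details are invented for illustration).

```python
import re

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s\.,\-+/]", " ", text)        # drop other symbols
    text = re.sub(r"\s+", " ", text).strip()               # collapse whitespace
    return text

sample = "Contact: John.Doe@Mail.com | www.site.io | C++/Python (5+ yrs)"
cleaned = clean_text(sample)
print(cleaned)  # contact john.doe mail.com c++/python 5+ yrs
```

Note that the e-mail address is split at the `@`, while `c++/python` survives because `+` and `/` are in the allowed character set.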

#### **7.7 Apply Text Cleaning to Combined Fields**

In [17]:
df['resume_text_clean'] = df['resume_text'].apply(clean_text)
df['job_text_clean'] = df['job_text'].apply(clean_text)

df[['resume_text_clean', 'job_text_clean']].head()

Unnamed: 0,resume_text_clean,job_text_clean
0,big data analytics working and database wareho...,senior software engineer technical support tro...
1,fresher looking to join as a data analyst and ...,machine learning ml engineer machine learning ...
2,"software development , machine learning , deep...","executive/ senior executive- trade marketing, ..."
3,to obtain a position in a fast-paced business ...,business development executive apparel sourcin...
4,professional accountant with an outstanding wo...,senior ios engineer ios lifecycle requirement ...


I kept both raw and cleaned versions. The cleaned text is used for TF-IDF and similarity computations, while the original can be used later for interpretation or display.

#### **7.8 Validate Cleaned Data**

In [18]:
print(df['resume_text_clean'].isna().sum(), "NaNs in resume_text_clean")
print(df['job_text_clean'].isna().sum(), "NaNs in job_text_clean")

idx_example = 0
print("Original resume text:\n", df.loc[idx_example, 'resume_text'][:500])
print("\nCleaned resume text:\n", df.loc[idx_example, 'resume_text_clean'][:500])

0 NaNs in resume_text_clean
0 NaNs in job_text_clean
Original resume text:
 Big data analytics working and database warehouse manager with robust experience in handling all kinds of data. I have also used multiple cloud infrastructure services and am well acquainted with them. Currently in search of role that offers more of development. ['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapreduce', 'Spark', 'Java', 'Machine Learning', 'Cloud', 'Hdfs', 'YARN', 'Core Java', 'Data Science', 'C++', 'Data Structures', 'DBMS', 'RDBMS', 'Informatica', 'Talend', 'Amazon Redshift', 'Micr

Cleaned resume text:
 big data analytics working and database warehouse manager with robust experience in handling all kinds of data. i have also used multiple cloud infrastructure services and am well acquainted with them. currently in search of role that offers more of development. big data , hadoop , hive , python , mapreduce , spark , java , machine learning , cloud , hdfs , yarn , core java , data science , c++ , 

I have validated that no NaNs remain in the cleaned text and inspected a sample to confirm that the cleaning logic behaves as intended.

### **8. Resume Parsing using Named Entity Recognition (NER)**

In this step, I use spaCy’s NER model to convert unstructured resume text into structured features.  
For each resume, I count the number of organizations, locations, and date mentions and store them as `num_orgs`, `num_locations`, and `num_dates`. These features approximate aspects such as the richness of the work history and geographic/organizational diversity, and they are combined with text similarity features in the next step.

#### **8.1 Quick NER Check on a Single Resume**

In [19]:
nlp = spacy.load("en_core_web_sm")
example_text = df['resume_text_clean'].iloc[0]
doc = nlp(example_text)

[(ent.text, ent.label_) for ent in doc.ents][:20]

[('java', 'PERSON'),
 ('core java', 'PERSON'),
 ('c++', 'PERSON'),
 ('microsoft', 'ORG')]

#### **8.3 Define NER Feature Extraction Function**

In [20]:
def extract_ner_features(text: str) -> pd.Series:

    doc = nlp(text)

    orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    gpes = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]

    return pd.Series({
        "num_orgs": len(orgs),
        "num_locations": len(gpes),
        "num_dates": len(dates)
    })

For each resume, I count how many organizations, locations, and dates appear. These counts act as structured features that summarize how detailed/experienced a resume looks.
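
Calling `nlp(text)` once per row can be slow on thousands of resumes; spaCy's `nlp.pipe` processes texts in batches. The sketch below shows the same counting logic with `nlp.pipe`. To stay runnable without a model download it uses a blank English pipeline with two made-up rule patterns as a stand-in; in the notebook, `nlp` would be the `en_core_web_sm` pipeline loaded in 8.1. The function name `ner_counts` and the batch size are assumptions.

```python
from collections import Counter

import pandas as pd
import spacy

# Stand-in pipeline: blank English tokenizer plus rule-based entities
# (patterns are invented for illustration, not part of the notebook).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Microsoft"},
    {"label": "GPE", "pattern": "Seattle"},
])

def ner_counts(texts, batch_size=256):
    """Count ORG/GPE/DATE entities per text, streaming through nlp.pipe."""
    rows = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        labels = Counter(ent.label_ for ent in doc.ents)
        rows.append({
            "num_orgs": labels["ORG"],
            "num_locations": labels["GPE"],
            "num_dates": labels["DATE"],
        })
    return pd.DataFrame(rows)

demo = ner_counts(["Worked at Microsoft in Seattle", "No entities here"])
print(demo)
```

When applied to a DataFrame column, the result's index should be set to the source index before concatenation. It may also be worth trying NER on `resume_text` rather than the lower-cased `resume_text_clean`, since the sample in 8.1 labels `java` and `c++` as PERSON.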

#### **8.4 Apply NER to the Dataset**

In [21]:
ner_features = df['resume_text_clean'].apply(extract_ner_features)

ner_features.head()

Unnamed: 0,num_orgs,num_locations,num_dates
0,1,0,0
1,0,1,0
2,0,0,0
3,0,0,0
4,0,0,3


#### **8.5 Merge NER Features Back into Main DataFrame**

In [22]:
df = pd.concat([df, ner_features], axis=1)
df[['num_orgs', 'num_locations', 'num_dates']].describe()

Unnamed: 0,num_orgs,num_locations,num_dates
count,9544.0,9544.0,9544.0
mean,0.857921,0.167225,0.249371
std,1.565749,0.457923,0.70999
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.0,0.0,0.0
75%,1.0,0.0,0.0
max,11.0,3.0,6.0


The DataFrame now has three new numeric columns:

*   num_orgs → how many organizations mentioned
*   num_locations → how many locations mentioned
*   num_dates → how many date expressions mentioned

These NER-based signals will be used later in the feature engineering and modelling steps to improve candidate–job match prediction.

### **9. Feature Engineering for Candidate–Job Matching**

#### **9.1 TF-IDF Vectorization of Resume & Job Text**

In [23]:
tfidf = TfidfVectorizer(max_features=5000)
combined_corpus = pd.concat(
    [df['resume_text_clean'], df['job_text_clean']],
    axis=0
)
tfidf.fit(combined_corpus)
resume_tfidf = tfidf.transform(df['resume_text_clean'])
job_tfidf    = tfidf.transform(df['job_text_clean'])

resume_tfidf.shape, job_tfidf.shape

((9544, 3248), (9544, 3248))

#### **9.2 Cosine Similarity Between Resume & Job Text**

In [24]:
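# Only the paired (row i vs. row i) similarities are needed, so skip the full
# 9544 x 9544 pairwise matrix. TfidfVectorizer L2-normalises rows by default,
# so the row-wise dot product equals the cosine similarity.
tfidf_sim = np.asarray(resume_tfidf.multiply(job_tfidf).sum(axis=1)).ravel()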

df['tfidf_similarity'] = tfidf_sim
df['tfidf_similarity'].describe()

Unnamed: 0,tfidf_similarity
count,9544.0
mean,0.030502
std,0.037864
min,0.0
25%,0.00625
50%,0.018879
75%,0.040915
max,0.416624
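
Only the similarity between each resume and its own job posting is used here. Because `TfidfVectorizer` L2-normalises rows by default, that paired similarity is a row-wise dot product, which avoids building the full pairwise matrix. A small check on made-up texts (all strings and variable names below are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up paired texts: row i of each list belongs together.
resumes = ["python sql machine learning", "accounting audit excel"]
jobs = ["machine learning engineer python", "ios developer swift"]

vec = TfidfVectorizer()  # default norm="l2" gives unit-length rows
vec.fit(resumes + jobs)
R, J = vec.transform(resumes), vec.transform(jobs)

# Row-wise dot product: one similarity per (resume, job) pair.
paired = np.asarray(R.multiply(J).sum(axis=1)).ravel()

# Same numbers as the diagonal of the full pairwise matrix.
full_diag = cosine_similarity(R, J).diagonal()
print(np.allclose(paired, full_diag))
print(paired.round(3))
```

The second pair shares no vocabulary, so its similarity is exactly 0, which mirrors the many near-zero values in the summary above.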


#### **9.3 Skill Overlap Feature**

In [25]:
def extract_skill_set(text: str):
    if pd.isna(text):
        return set()
    tokens = [t.strip().lower() for t in re.split(r"[,\|;/]", str(text)) if t.strip()]
    return set(tokens)

def compute_skill_overlap(row):
    resume_skills = extract_skill_set(row.get('skills', ''))
    job_skills    = extract_skill_set(row.get('skills_required', ''))

    if not job_skills:
        return 0.0

    overlap = resume_skills & job_skills
    return len(overlap) / len(job_skills)

df['skill_overlap_ratio'] = df.apply(compute_skill_overlap, axis=1)
df['skill_overlap_ratio'].describe()

Unnamed: 0,skill_overlap_ratio
count,9544.0
mean,0.0
std,0.0
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,0.0


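Here I compute the proportion of required job skills that also appear in the candidate’s resume: 0 means none of the required skills are mentioned, and 1 means all of them are covered. The feature is easy to interpret, but the summary above shows it is 0 for every row, so it carries no signal in its current form. The most likely cause is a format mismatch: `skills` holds a stringified Python list (for example `['Big Data', 'Hadoop', ...]`), so its tokens keep brackets and quotes, while `skills_required` is newline-separated and the splitter does not split on newlines.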
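
Given that every row currently scores 0, one way to parse both columns into comparable sets is sketched below. It assumes `skills` is a stringified Python list and `skills_required` has one skill per line, as the preview in section 6 suggests; `parse_skills` and `overlap_ratio` are hypothetical helper names, and the example strings are made up.

```python
import ast
import re

def parse_skills(raw) -> set:
    """Return lower-cased skill names from either column format."""
    if not isinstance(raw, str) or not raw.strip():
        return set()
    text = raw.strip()
    if text.startswith("["):
        # Resume side: a stringified list such as "['Big Data', 'Hadoop']".
        try:
            items = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            items = re.split(r"[,\n]", text.strip("[]"))
    else:
        # Job side: one skill per line, occasionally comma/semicolon separated.
        items = re.split(r"[\n,;|]", text)
    cleaned = (str(s).strip(" '\"").lower() for s in items if s)
    return {s for s in cleaned if s}

def overlap_ratio(resume_raw, job_raw) -> float:
    """Share of required skills that also appear in the resume skills."""
    job = parse_skills(job_raw)
    if not job:
        return 0.0
    return len(parse_skills(resume_raw) & job) / len(job)

ratio = overlap_ratio(
    "['Python', 'SQL', 'Machine Learning']",
    "Python\nMachine Learning\nDocker\nKubernetes",
)
print(ratio)  # 0.5
```

Exact string matching is still strict (for example, `ml` would not match `machine learning`), so a synonym map or fuzzy matching could be a later refinement.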

#### **9.4 Assemble Final Feature Matrix**

Now, I will gather all the engineered features:

*   tfidf_similarity (from 9.2)
*   skill_overlap_ratio (from 9.3)
*   num_orgs, num_locations, num_dates (from NER)

In [26]:
feature_cols = [
    'tfidf_similarity',
    'skill_overlap_ratio',
    'num_orgs',
    'num_locations',
    'num_dates'
]

for col in ['num_orgs', 'num_locations', 'num_dates']:
    if col not in df.columns:
        df[col] = 0

X = df[feature_cols]
y = df['matched_score']

X.head(), y.head()

(   tfidf_similarity  skill_overlap_ratio  num_orgs  num_locations  num_dates
 0          0.008995                  0.0         1              0          0
 1          0.104530                  0.0         0              1          0
 2          0.000000                  0.0         0              0          0
 3          0.020637                  0.0         0              0          0
 4          0.009315                  0.0         0              0          3,
 0    0.850000
 1    0.750000
 2    0.416667
 3    0.760000
 4    0.650000
 Name: matched_score, dtype: float64)

I have consolidated all engineered features into a single feature matrix X, with matched_score as the target y. These features combine text similarity, skill matching, and structural NER-based signals for the predictive model.

### **10. Modelling Candidate–Job Match**

#### **10.1 Train–Test Split**

In [27]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

X_train.shape, X_test.shape

((7635, 5), (1909, 5))

I have split the data into training (80%) and test (20%) sets so that model performance can be evaluated on unseen examples.

#### **10.2 Baseline Model: RandomForest Regressor**

In [28]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

I have used a RandomForestRegressor as a baseline model to predict the numerical match score from the engineered features. Random forests can capture non-linear relationships and work well with mixed-scale features.
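
The metrics imported in section 5 can be used to score `y_pred` against `y_test`. The sketch below uses made-up values in place of the real predictions, so the printed numbers say nothing about the actual model; the mean-prediction baseline is added only as a reference point.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values standing in for y_test and y_pred from the cells above.
y_true = np.array([0.85, 0.75, 0.40, 0.65])
y_hat = np.array([0.80, 0.70, 0.50, 0.65])

mae = mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
r2 = r2_score(y_true, y_hat)

# Naive reference: always predict the mean of the observed scores.
baseline_mae = mean_absolute_error(y_true, np.full_like(y_true, y_true.mean()))

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f} baseline MAE={baseline_mae:.3f}")
```

With the real data, the baseline would normally use the training-set mean, and a model MAE close to the baseline would indicate that the features add little signal.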