Matching a CV to a job description in Python

<br>Extract Text: Use libraries like PyPDF2 or pdfplumber to extract text from a CV file (typically a PDF) and a job description (either a text input or a separate file).
<br>Preprocess Text: Apply NLP techniques (tokenization, stop word removal, etc.) to clean the extracted text.
<br>Vectorize Text: Convert the cleaned CV and JD text into a numerical matrix using a vectorizer, such as CountVectorizer or TfidfVectorizer from scikit-learn.
<br>Calculate Similarity: Use cosine_similarity to get a match score.
Total Score =
(Required Skills × 50%) +
(Preferred Skills × 25%) +
(Experience × 15%) +
(Keywords × 10%)



**Method 1:** Cosine_similarity

In [4]:
!pip install PyPDF2
!pip install sklearn

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Collecting sklearn
  Downloading sklearn-0.0.post12.tar.gz (2.6 kB)
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpython setup.py egg_info[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m See above for output.
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... [?25l[?25herror
[1;31merror[0m: [1mmetadata-generation-failed[0m

[31m×[0m Encountered error while generating package metadata.
[31m╰─>[0m See above for output.

[1;35mnote[0m: This is an issue with the package mentioned above, not pip.
[1;36mh

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import PyPDF2 # You may need to install this: pip install PyPDF2

# Assume 'cv_text' and 'jd_text' contain the preprocessed text from the files
# (Text extraction from PDF using PyPDF2 or pdfplumber is a prior step)

# For demonstration, assume sample strings:
cv_text = "Experienced data analyst skilled in SQL, Python, and machine learning. Developed predictive models and analyzed large datasets."
jd_text = "Seeking a data analyst with 2+ years experience. Must have strong skills in Python, data analysis, and SQL for reporting."

# Place the texts into a list for the vectorizer
match_test = [cv_text, jd_text]

# Initialize CountVectorizer and create the count matrix
cv = CountVectorizer()
count_matrix = cv.fit_transform(match_test)

# Calculate the cosine similarity
similarity_score = cosine_similarity(count_matrix)[0][1]

# Convert the score to a percentage
match_percentage = round(similarity_score * 100, 2)

print(f"Match Percentage is: {match_percentage}%")


Match Percentage is: 41.04%


Component	Weight	Rationale
<br>Required Skills	50%	Essential technical needs
<br>Preferred Skills	25%	Competitive differentiators
<br>Experience	15%	Professional depth
<br>Keywords	10%	Domain familiarity

**Method 2:** Cosine Similarity after removing stop words->Job Description (JD) with text **similarity** **\

##

In [6]:
pip install nltk scikit-learn pdfplumber python-docx


Collecting pdfplumber
  Downloading pdfplumber-0.11.9-py3-none-any.whl.metadata (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting python-docx
  Downloading python_docx-1.2.0-py3-none-any.whl.metadata (2.0 kB)
Collecting pdfminer.six==20251230 (from pdfplumber)
  Downloading pdfminer_six-20251230-py3-none-any.whl.metadata (4.3 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-5.3.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (67 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.8/67.8 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
Downloading pdfplumber-0.11.9-py3-none-any.whl (60 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.0/60.0 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pdfminer_six-20251230-py3-none-any.whl (6.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.

In [7]:
import pdfplumber

def extract_pdf_text(file_path):
    text = ""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            text += page.extract_text()
    return text

stop word removing

In [8]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z ]', '', text)
    words = text.split()
    words = [w for w in words if w not in stopwords.words('english')]
    return " ".join(words)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z ]', '', text)
    words = text.split()
    words = [w for w in words if w not in stopwords.words('english')]
    return " ".join(words)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Vectorizing**

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()


In [12]:
from sklearn.metrics.pairwise import cosine_similarity

job_description = clean_text("""
Looking for a React developer with experience in React,
REST APIs, and AWS. Basic knowledge of React.js, JavaScript,
HTML and CSSUnderstanding of component-based architecture
Familiar with RESTful APIs integration and asynchronous programming
""")

resume_1 = clean_text(extract_pdf_text("resume1.pdf"))
resume_2 = clean_text(extract_pdf_text("resume2.pdf"))
resume_3 = clean_text(extract_pdf_text("resume3.pdf"))

documents = [job_description, resume_1, resume_2, resume_3]

tfidf_matrix = vectorizer.fit_transform(documents)
similarity_scores = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])

print(similarity_scores)


[[0.05016986 0.02050535 0.05489717]]


In [13]:
scores = similarity_scores[0]

ranking = sorted(
    enumerate(scores),
    key=lambda x: x[1],
    reverse=True
)

for index, score in ranking:
    print(f"Resume {index+1} Score: {round(score*100, 2)}%")


Resume 3 Score: 5.49%
Resume 1 Score: 5.02%
Resume 2 Score: 2.05%


**Method 3:** BERT  Resume Matching

**BERT EMbedding**
| Feature            | TF-IDF | BERT |
| ------------------ | ------ | ---- |
| Understand meaning | ❌      | ✅    |
| Synonyms           | ❌      | ✅    |
| Context            | ❌      | ✅    |
| Accuracy           | Medium | High |


In [14]:
pip install sentence-transformers pdfplumber




In [15]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [16]:
import pdfplumber

def extract_pdf_text(path):
    text = ""
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text += page.extract_text()
    return text


In [17]:
job_description = """
Looking for a React developer with experience in React,
REST APIs, and AWS. Basic knowledge of React.js, JavaScript,
HTML and CSSUnderstanding of component-based architecture
Familiar with RESTful APIs integration and asynchronous programming
"""

resume_1 = extract_pdf_text("resume1.pdf")
resume_2 = extract_pdf_text("resume2.pdf")
resume_3 = extract_pdf_text("resume3.pdf")
documents = [job_description, resume_1, resume_2,resume_3]


In [19]:
embeddings = model.encode(documents)


In [20]:
from sklearn.metrics.pairwise import cosine_similarity

scores = cosine_similarity(
    [embeddings[0]],
    embeddings[1:]
)[0]


In [21]:
ranking = sorted(
    enumerate(scores),
    key=lambda x: x[1],
    reverse=True
)

for index, score in ranking:
    print(f"Resume {index+1} Match Score: {round(score*100, 2)}%")


Resume 1 Match Score: 52.119998931884766%
Resume 3 Match Score: 37.279998779296875%
Resume 2 Match Score: 34.040000915527344%


**Method 4: Skill Extraction + BERT Hybrid Resume Matching**

In [30]:
SKILLS = [
    "react", "api", "next.js", "reactjs",
    "aws", "next", "javascript", "css",
    "sql", "mongodb", "rest api"
]


**skill extraction ( rule based)**

In [28]:
def extract_skills(text):
    text = text.lower()
    found_skills = set()
    for skill in SKILLS:
        if skill in text:
            found_skills.add(skill)
    return found_skills


**Skill Match Score**

In [23]:
def skill_match_score(jd_skills, resume_skills):
    if not jd_skills:
        return 0
    matched = jd_skills.intersection(resume_skills)
    return len(matched) / len(jd_skills)


**BERT Semantic Similarity**

In [24]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def bert_similarity(jd_text, resume_text):
    embeddings = model.encode([jd_text, resume_text])
    return cosine_similarity(
        [embeddings[0]],
        [embeddings[1]]
    )[0][0]


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


**Hybrid Scroing**

In [25]:
def final_score(skill_score, bert_score,
                skill_weight=0.6, bert_weight=0.4):
    return (skill_weight * skill_score) + (bert_weight * bert_score)


**Example**

In [31]:
job_description = """
Looking for a Python developer with Django, AWS,
REST APIs, and cloud experience.
"""

resume_text = """
Experienced software engineer skilled in Python,
Flask, AWS, Docker, and RESTful APIs.
"""

jd_skills = extract_skills(job_description)
resume_skills = extract_skills(resume_text)

skill_score = skill_match_score(jd_skills, resume_skills)
bert_score = bert_similarity(job_description, resume_text)

score = final_score(skill_score, bert_score)

print("JD Skills:", jd_skills)
print("Resume Skills:", resume_skills)
print("Skill Score:", round(skill_score, 2))
print("BERT Score:", round(bert_score, 2))
print("Final Score:", round(score * 100, 2), "%")


JD Skills: {'rest api', 'api', 'aws'}
Resume Skills: {'api', 'aws'}
Skill Score: 0.67
BERT Score: 0.77
Final Score: 70.7 %
