# AI RESUME RANKER
How does it work?
* The user uploads a resume (PDF or DOCX) and a job description (plain text)
* Preprocess both texts (cleaning, tokenization, stopword removal)
* Convert both texts into vectors (first with basic TF-IDF, then with SBERT embeddings, possibly fine-tuned later)
* Compute a similarity score using cosine similarity
* Show the result:
  > * High Score (80-100%) -> Resume is a great match!
  > * Medium Score (50-79%) -> Resume needs improvement
  > * Low Score (<50%) -> Resume is not a good match.

In [51]:
import pdfplumber
import nltk
import re
import docx
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util

In [33]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/khushimadan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [9]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/khushimadan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
stop_words = set(stopwords.words('english'))

In [11]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

## Data Extraction and Preprocessing

#### Extracting text from the resume
> Supports both PDF and DOCX formats

In [12]:
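def extract_resume_text(file_path):
    """Return the plain text of a .pdf or .docx resume."""
    lower_path = file_path.lower()  # accept upper-case extensions such as ".PDF"
    if lower_path.endswith(".pdf"):
        with pdfplumber.open(file_path) as pdf:
            # "or ''" guards against pages with no extractable text
            text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    elif lower_path.endswith(".docx"):
        doc = docx.Document(file_path)
        # doc.paragraphs covers body paragraphs only; text inside tables is skipped
        text = "\n".join(para.text for para in doc.paragraphs)
    else:
        raise ValueError(f"Unsupported file format: {file_path}")
    return text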

#### Example Usage

In [13]:
resume_text = extract_resume_text("Khushi_Madan_Resume.pdf")
print(resume_text)

Khushi Madan
Delhi, India | P: +91 8375876890 | khushimadan11@gmail.com | LinkedIn | GitHub | LeetCode
EDUCATION
MANIPAL UNIVERSITY JAIPUR Jaipur, Rajasthan
Bachelor of Technology (Hons) Computer Science and Engineering with specialization in September 2022 - May 2026
Artificial Intelligence and Machine Learning
Cumulative GPA: 8.97/10.0; Dean’s List 2023-2024
Technical Lead at Google Developer Groups on Campus
VENKATESHWAR INTERNATIONAL SCHOOL Dwarka, Delhi
High School Diploma in Physics, Chemistry, Maths with Computer Science (CBSE) April 2008 - June 2022
10th - 92.6% and 12th - 86.6%
SKILLS
Technical Skills - C, C++, Python, Machine Learning, Data Analysis, Flutter, HTML, CSS, Linux, AutoCAD, SQL, JavaScript,
Bootstrap, Power BI, Data Science, Data Visualization, Artificial Intelligence, Figma, Excel, Scikit Learn, Flask and Firebase
Relevant Coursework - Operating Systems, Principles of Artificial Intelligence, Software Engineering and Project Management,
Computer Networks, Data St

In [14]:
doc_text = extract_resume_text("Khushi_Madan_Resume.docx")
print(doc_text)

Khushi Madan
Delhi, India | P: +91 8375876890 | khushimadan11@gmail.com | LinkedIn | GitHub | LeetCode
EDUCATION
MANIPAL UNIVERSITY JAIPUR	Jaipur, Rajasthan
Bachelor of Technology (Hons) Computer Science and Engineering with specialization in	September 2022 - May 2026 Artificial Intelligence and Machine Learning
Cumulative GPA: 8.97/10.0; Dean’s List 2023-2024
Technical Lead at Google Developer Groups on Campus
VENKATESHWAR INTERNATIONAL SCHOOL	Dwarka, Delhi
High School Diploma in Physics, Chemistry, Maths with Computer Science (CBSE)	April 2008 - June 2022 10th - 92.6% and 12th - 86.6%
SKILLS
Technical Skills - C, C++, Python, Machine Learning, Data Analysis, Flutter, HTML, CSS, Linux, AutoCAD, SQL, JavaScript, Bootstrap, Power BI, Data Science, Data Visualization, Artificial Intelligence, Figma, Excel, Scikit Learn, Flask and Firebase Relevant Coursework - Operating Systems, Principles of Artificial Intelligence, Software Engineering and Project Management, Computer Networks, Data St

#### Preprocessing Text
> * Cleaning Text
> * Tokenization
> * Stopword Removal

In [31]:
def preprocess_text(text):
    text = text.lower()
    # Delete everything except letters, digits and whitespace. Deletion fuses
    # tokens: "c++" becomes "c" and "8.97/10.0" becomes "897100".
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    tokens = word_tokenize(text)  # tokenization
    tokens = [word for word in tokens if word not in stop_words]  # stopword removal
    return ' '.join(tokens)
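
A possible variant of the cleaning step, shown only as a sketch: it replaces punctuation with a space instead of deleting it, so hyphenated words and e-mail parts are not fused, and it keeps `+` and `#` so that "c++" survives. The function name `preprocess_text_keep_boundaries` and the tiny `DEMO_STOP_WORDS` set are hypothetical, and a plain whitespace split stands in for `word_tokenize` so the example runs without NLTK data.

```python
import re

# Hypothetical, tiny stop-word list so the sketch runs without NLTK downloads.
DEMO_STOP_WORDS = {"a", "and", "in", "of", "the", "to", "with"}


def preprocess_text_keep_boundaries(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, turn punctuation into spaces, and drop stop words."""
    text = text.lower()
    # Replace (rather than delete) disallowed characters with a space,
    # keeping '+' and '#' so tokens such as "c++" stay recognisable.
    text = re.sub(r"[^a-z0-9+#\s]", " ", text)
    tokens = text.split()  # whitespace split stands in for word_tokenize
    return " ".join(token for token in tokens if token not in stop_words)


print(preprocess_text_keep_boundaries("Hands-on experience with C++ and Power BI"))
```

Whether this helps the final score depends on the job description going through the same cleaning, so both texts would need to use the same function.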

#### Example Usage

In [34]:
clean_resume_text = preprocess_text(resume_text)
print(clean_resume_text)

khushi madan delhi india p 91 8375876890 khushimadan11gmailcom linkedin github leetcode education manipal university jaipur jaipur rajasthan bachelor technology hons computer science engineering specialization september 2022 may 2026 artificial intelligence machine learning cumulative gpa 897100 deans list 20232024 technical lead google developer groups campus venkateshwar international school dwarka delhi high school diploma physics chemistry maths computer science cbse april 2008 june 2022 10th 926 12th 866 skills technical skills c c python machine learning data analysis flutter html css linux autocad sql javascript bootstrap power bi data science data visualization artificial intelligence figma excel scikit learn flask firebase relevant coursework operating systems principles artificial intelligence software engineering project management computer networks data structures algorithms object oriented programming database management systems work experience arcadis gurugram haryana dat

## Converting Resume and Job Description to Vectors!

> Converting text into numerical form using TF-IDF

In [40]:
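def rank_resumes(resumes, job_description):
    # Fit one vocabulary over the resumes and the job description together
    all_texts = resumes + [job_description]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_texts)

    job_vec = tfidf_matrix[-1]       # last row: job description vector
    resume_vecs = tfidf_matrix[:-1]  # remaining rows: resume vectors

    # One cosine similarity per resume, in input order (not sorted)
    similarity_scores = cosine_similarity(resume_vecs, job_vec)
    return similarity_scores.flatten()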

In [64]:
resumes = [clean_resume_text]
job_description = """Job Title: Data Analyst Intern  

Location: Remote / XYZCompany  

Duration: 2 months 

About the Role:  
We are looking for a Data Analyst Intern to support our project by analyzing, cleaning, and visualizing business and employee task data. This role involves working with large datasets, extracting insights, and creating reports and dashboards to drive decision-making.  

Key Responsibilities:  
- Collect, clean, and preprocess data from multiple sources, including Excel and databases.  
- Perform data analysis to identify trends, patterns, and insights.  
- Develop visualizations and interactive dashboards using Power BI and Python (Pandas, Matplotlib, Seaborn).  
- Work with stakeholders to understand data requirements and optimize reporting.  
- Assist in automating data extraction, transformation, and consolidation processes.  
- Ensure data accuracy and integrity across all reports and visualizations.  

Required Skills:  
- Strong analytical skills with experience in Excel, SQL, and Python (Pandas, Numpy, Matplotlib).  
- Hands-on experience with Power BI for data visualization.  
- Understanding of data structures, cleaning, and transformation techniques.  
- Ability to work independently and manage multiple tasks efficiently.  
- Strong problem-solving skills and attention to detail.  

Preferred Qualifications:  
- Experience with business analytics, reporting, and performance tracking.  
- Familiarity with data extraction from multiple Excel sheets and consolidation techniques.  
- Basic understanding of machine learning concepts is a plus.  

What You’ll Gain:  
- Hands-on experience working with real-world business and employee task data.  
- Exposure to industry-standard tools and technologies.  
- Opportunity to contribute to decision-making processes through data-driven insights.  
- Mentorship and guidance to enhance your analytical and technical skills.  

How to Apply:  
Send your resume and a brief cover letter to xyzcompany@gmail.com."""


In [65]:
job_description = preprocess_text(job_description)
job_description

'job title data analyst intern location remote xyzcompany duration 2 months role looking data analyst intern support project analyzing cleaning visualizing business employee task data role involves working large datasets extracting insights creating reports dashboards drive decisionmaking key responsibilities collect clean preprocess data multiple sources including excel databases perform data analysis identify trends patterns insights develop visualizations interactive dashboards using power bi python pandas matplotlib seaborn work stakeholders understand data requirements optimize reporting assist automating data extraction transformation consolidation processes ensure data accuracy integrity across reports visualizations required skills strong analytical skills experience excel sql python pandas numpy matplotlib handson experience power bi data visualization understanding data structures cleaning transformation techniques ability work independently manage multiple tasks efficiently 

In [66]:
scores = rank_resumes(resumes, job_description)
print(scores)

[0.24759701]
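
The score above can be mapped to the High/Medium/Low bands listed at the top of the notebook. A minimal sketch, assuming the cosine score lies in [0, 1] and is read as a percentage; `label_score` is a hypothetical helper that is not part of the notebook, and the band edges are taken as >= 80 and >= 50.

```python
def label_score(score):
    """Map a cosine similarity in [0, 1] to the notebook's result bands."""
    percent = score * 100
    if percent >= 80:
        return "High: Resume is a great match!"
    if percent >= 50:
        return "Medium: Resume needs improvement"
    return "Low: Resume is not a good match."


# TF-IDF score printed above
print(label_score(0.24759701))
```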


#### !!! The TF-IDF score is low because TF-IDF compares exact words and their frequencies, not meaning. Even when the skills match, it does not recognize synonyms or paraphrases.
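
To illustrate the point, here is a small sketch that computes cosine similarity on raw word counts in plain Python. This is a simplification (no IDF weighting, unlike `TfidfVectorizer`), and `cosine_on_counts` and the example phrases are made up for illustration. Phrases with the same meaning but no shared words score zero, while shared words are what drive the score up.

```python
import math
from collections import Counter


def cosine_on_counts(text_a, text_b):
    """Cosine similarity between two texts using raw word counts."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Only words present in both texts contribute to the dot product
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Similar meaning, no shared words: the score is zero
print(cosine_on_counts("built dashboards", "created visualizations"))
# Two of three words shared: the score is about 0.67
print(cosine_on_counts("power bi dashboards", "power bi reports"))
```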

## Improve matching with BERT
> Instead of TF-IDF, sentence embeddings from a pretrained SentenceTransformer model (`all-MiniLM-L6-v2`) can be used for better context understanding

In [70]:
def rank_resumes_with_bert(resumes, job_description):
    # Loads the pretrained model on every call; load it once outside if calling repeatedly
    model = SentenceTransformer('all-MiniLM-L6-v2')
    resume_embeddings = model.encode(resumes, convert_to_tensor=True)
    jd_embedding = model.encode(job_description, convert_to_tensor=True)

    # Cosine similarity between each resume embedding and the job description embedding
    similarity_scores = util.pytorch_cos_sim(resume_embeddings, jd_embedding)
    return similarity_scores.flatten().tolist()

In [71]:
scores_bert = rank_resumes_with_bert([clean_resume_text],job_description)
scores_bert

[0.6897550225257874]
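
Both ranking functions return scores in input order, so with more than one resume the list still has to be sorted. A minimal sketch of that last step; `top_resumes`, the file names and the scores below are hypothetical and only illustrate the sorting.

```python
def top_resumes(names, scores):
    """Pair each resume name with its score and sort from best to worst."""
    if len(names) != len(scores):
        raise ValueError("names and scores must have the same length")
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical file names and scores, for illustration only
ranking = top_resumes(["a.pdf", "b.pdf", "c.pdf"], [0.41, 0.69, 0.12])
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```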