#  **AI-Powered Resume Screener (RAG Pipeline)**

### **Project Goal**
Build an automated HR assistant that can search through a directory of PDF resumes. Using **Retrieval-Augmented Generation (RAG)**, the assistant identifies top candidates based on specific skills (e.g., "TensorFlow", "Project Management") and provides a summary of their qualifications.


### **Step 1:** Import libraries and extract documents

The documents (resumes) were picked at random and downloaded from a website.

In [1]:
import os
import numpy as np
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

import warnings
warnings.filterwarnings('ignore')



In [2]:
loader = PyPDFDirectoryLoader("resumes")
docs_before_split = loader.load()


In [3]:
print(docs_before_split[0])

page_content='AT
ALEXANDER TAYLOR
Seasoned Criminal Trial Lawyer | Legal Training Expert
 1  234  555 1234 Email linkedin.com Chicago, Illinois
SUMMARY
With over a decade of legal expertise and a track record of transformative client advocacy, I've advanced prosecutorial practices, revolutionizing case management and legal training deliverance.
EXPERIENCE
Senior Criminal Prosecutor
Cook County State’s Attorney Office
01/2015   Present  Chicago, IL
Led a team of 20 attorneys, achieving a record high 95% conviction rate on major felony cases over two years.Orchestrated a workflow overhaul for case management, resulting in a 30% improvement in case throughput.Pioneered a diversity-focused recruitment initiative that boosted retention by 25% within the legal department.Developed training modules on emerging legal standards, which became a blueprint for state-wide prosecutor offices.Spearheaded a partnership with local law enforcement agencies, enriching case quality and collaboration effec

In [4]:
import fitz  # PyMuPDF
import os
from langchain_core.documents import Document

def load_resumes(folder_path):
    documents = []

    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            file_path = os.path.join(folder_path, filename)
            text = ""

            doc = fitz.open(file_path)
            for page in doc:
                text += page.get_text()

            documents.append(
                Document(
                    page_content=text,
                    metadata={
                        "candidate": filename.replace(".pdf", ""),
                        "source": file_path
                    }
                )
            )

    return documents


In [5]:
docs_before_split = load_resumes("resumes")

In [6]:
print(docs_before_split[0])

page_content='AT
ALEXANDER TAYLOR
Seasoned Criminal Trial Lawyer | Legal Training Expert
+1-(234)-555-1234
Email
linkedin.com
Chicago, Illinois
SUMMARY
With over a decade of legal expertise and a track record of transformative 
client advocacy, I've advanced prosecutorial practices, revolutionizing case 
management and legal training deliverance.
EXPERIENCE
Senior Criminal Prosecutor
Cook County State’s Attorney Office
01/2015 - Present 
Chicago, IL
Led a team of 20 attorneys, achieving a record high 95% conviction rate 
on major felony cases over two years.
Orchestrated a workflow overhaul for case management, resulting in a 
30% improvement in case throughput.
Pioneered a diversity-focused recruitment initiative that boosted 
retention by 25% within the legal department.
Developed training modules on emerging legal standards, which became 
a blueprint for state-wide prosecutor offices.
Spearheaded a partnership with local law enforcement agencies, 
enriching case quality and collabor

### **Why the second method is better:**
**1. Cleaner Text Extraction**

Look closely at the contact information in the two outputs above:

- PyPDFDirectoryLoader: �1��234��555�1234 (Full of null characters and encoding errors).

- PyMuPDF (fitz): +1-(234)-555-1234 (Clean, readable text).

If the LLM tries to find a phone number or specific technical keywords in the first version, it will likely fail because of those broken characters (�).
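One way to catch this kind of damage before indexing is to measure how much of the extracted text consists of broken characters. The following is a minimal sketch; the helper name `garbled_ratio` and the choice of which characters count as "broken" (the Unicode replacement character and NUL) are assumptions for illustration, not part of either loader.

```python
def garbled_ratio(text: str) -> float:
    """Fraction of characters that are U+FFFD (replacement char) or NUL."""
    if not text:
        return 0.0
    bad = sum(ch in ("\ufffd", "\x00") for ch in text)
    return bad / len(text)


# The two phone-number renderings discussed above
clean = "+1-(234)-555-1234"
broken = "\ufffd1\ufffd\ufffd234\ufffd\ufffd555\ufffd1234"

print(garbled_ratio(clean))   # 0.0
print(garbled_ratio(broken))  # noticeably above zero
```

A resume whose ratio is clearly above zero could be re-extracted with another loader or flagged for manual review.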

**2. Meaningful Metadata**

In a Resume Screener, the most important piece of information is "Who does this experience belong to?"

- Option 1: Only gives you the file path.

- Option 2: Explicitly adds a "candidate": "AlexanderTaylorResume" key to the metadata. When the LLM retrieves a chunk later, it will be much easier for it to say, "I found a match in Alexander Taylor's resume" rather than just pointing to a file path.
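The `candidate` value is just the file name without its extension. If a human-readable name is preferred in the final report, it could be derived from that value. This is a rough sketch that assumes files are named in the `FirstLastResume.pdf` pattern seen in this dataset; `candidate_display_name` is a hypothetical helper.

```python
import re
from pathlib import Path


def candidate_display_name(filename: str) -> str:
    """Turn 'AlexanderTaylorResume.pdf' into 'Alexander Taylor'."""
    stem = Path(filename).stem
    # Drop a trailing "Resume" or "CV" marker (assumed naming convention)
    stem = re.sub(r"[_\-\s]*(resume|cv)$", "", stem, flags=re.IGNORECASE)
    # Insert a space wherever a lower-case letter is followed by a capital
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", stem)
    return spaced.replace("_", " ").replace("-", " ").strip()


print(candidate_display_name("AlexanderTaylorResume.pdf"))  # Alexander Taylor
```

Names that do not follow the assumed pattern would need their own handling.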

**3. Handling of Multi-column Layouts**

Resumes often use two-column layouts.

- Standard loaders (Option 1) often read across the columns horizontally, mixing the "Skills" from the right column into the "Experience" from the left column, creating a "text soup."

- PyMuPDF (Option 2) is generally more robust at extracting text in a logical flow that preserves the relationship between headers and descriptions.

### **Step 2:** Chunking the data

In [7]:
len(docs_before_split[0].page_content)

4222

In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
) 
 
docs_after_split = text_splitter.split_documents(docs_before_split)


In [9]:
len(docs_after_split[0].page_content)

769
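To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window chunker. It is not what `RecursiveCharacterTextSplitter` does internally: the real splitter cuts on separators such as paragraph breaks, line breaks, and spaces, which is why the first chunk above is 769 characters rather than exactly 800. The function name `sliding_chunks` is made up for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Dummy text with the same length as the first resume (4222 characters)
sample = "".join(chr(97 + i % 26) for i in range(4222))
pieces = sliding_chunks(sample)
print(len(pieces))                           # 6
print(pieces[0][-100:] == pieces[1][:100])   # True: neighbours share 100 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.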

### **Step 3:** Embeddings and Vector DB

In [10]:
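# all-MiniLM-L6-v2 is not a BGE model, so use the generic wrapper;
# the BGE wrapper prepends a BGE-specific instruction to every query
from langchain_community.embeddings import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}  # unit-length vectors
)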

In [11]:
vectorstore = FAISS.from_documents(
    documents=docs_after_split,
    embedding=huggingface_embeddings
)

In [12]:
query = "Find candidates with Sensor Integration, System Architecture, CAD Design, MATLAB, Data Analysis, Mechanical Engineering experience"

### **Search Strategy: Similarity vs. MMR**

**Two ways to pick the top chunks for an embedded query**

In a resume screening project, choosing between Similarity Search and MMR (Maximal Marginal Relevance) is the difference between finding the "four most similar spots" versus finding the "four best unique candidates."


These two confused me at first, so I looked them up. Here is a short summary in case they confuse you as well.

**1. Standard Similarity Search**
* **How it works:** Finds chunks that are mathematically closest to the query.
* **The Problem:** If one candidate has a very long section about "Python," the search might return 4 different paragraphs from that **same person**.
* **Best for:** Finding specific facts within a single document.

**2. MMR (Maximal Marginal Relevance)**
* **How it works:** It balances **relevance** (is it a match?) with **diversity** (is this info new?).
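* **The Solution:** Each new pick is penalized for being similar to chunks already selected. Once a "Python" chunk from Candidate A is chosen, the next pick tends to come from different content, often a *different* candidate's resume. Note that MMR compares chunks only; it has no notion of candidates.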
* **Best for:** Resume screening, as it ensures you see a diverse shortlist of different candidates rather than multiple pages of the same person.
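The selection rule behind MMR can be sketched in a few lines of NumPy. This is an illustrative re-implementation under the assumption that all vectors are L2-normalized (so a dot product equals cosine similarity); it is not the code FAISS or LangChain runs, and `mmr_select` is a name chosen for this example.

```python
import numpy as np


def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    relevance = doc_vecs @ query_vec          # similarity of each doc to the query
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        if selected:
            # Highest similarity of each candidate to anything already chosen
            redundancy = (doc_vecs[candidates] @ doc_vecs[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lambda_mult * relevance[candidates] - (1 - lambda_mult) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0, 0.0]
docs = [
    [0.8, 0.6, 0.0],  # most relevant chunk
    [0.6, 0.8, 0.0],  # relevant, but very similar to chunk 0
    [0.6, 0.0, 0.8],  # equally relevant, different content
]
print(mmr_select(query, docs, k=2))  # [0, 2]
```

Chunks 1 and 2 are equally close to the query, but chunk 1 nearly duplicates chunk 0, so MMR prefers chunk 2. Raising `lambda_mult` toward 1.0 weakens the diversity penalty and moves the behavior toward plain similarity search.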



In [13]:
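# similarity_search() does not run MMR even if search_type="mmr" is passed,
# so call the MMR method directly
relevant_docs = vectorstore.max_marginal_relevance_search(
    query,
    k=5,         # number of chunks to return
    fetch_k=10,  # size of the candidate pool that MMR re-ranks for diversity
)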

In [16]:
print(relevant_docs[1].page_content)

ANDREW GREEN
Sensor Hardware Engineer | Autonomous Systems | Integration Expert
+1-(234)-555-1234
Email
linkedin.com
Seattle, Washington
SUMMARY
With over a decade of experience in sensor systems and a deep-seated 
knowledge of autonomous vehicle technology, I bring a proven track record of 
innovation and team leadership in challenging engineering environments. My 
expertise spans across integrating and testing cutting-edge sensors, driving 
both product and technological advancements.
EXPERIENCE
Senior Sensor Hardware Engineer
Blue Origin
06/2019 - Present 
Kent, WA
Spearheaded a 12-member engineering team in the design and 
implementation of advanced sensor suite for aerospace applications, 
increasing detection accuracy by 30%.


### **Summary of the steps so far**

The current implementation successfully performs semantic retrieval over resume chunks; however, it lacks candidate-level aggregation and LLM-based reasoning, which are required to generate a final ranked list or summary of suitable candidates.

In [20]:
#  Create a retriever from the vectorstore for RAG pipeline 
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 8}  
)

In [24]:
retrieved_docs = retriever.invoke(query)


In [25]:
from collections import defaultdict

candidate_chunks = defaultdict(list)

for doc in retrieved_docs:
    candidate = doc.metadata.get("candidate", "unknown")
    candidate_chunks[candidate].append(doc.page_content)


In [26]:
candidate_scores = {
    candidate: len(chunks)
    for candidate, chunks in candidate_chunks.items()
}

ranked_candidates = sorted(
    candidate_scores.items(),
    key=lambda x: x[1],
    reverse=True
)

ranked_candidates


[('AndrewGreenResume', 5), ('AvaTaylorResume', 2), ('BrendaCalvinResume', 1)]
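Counting chunks favors candidates whose resumes happen to produce many retrieved chunks, regardless of how close each chunk is to the query. A possible refinement is to rank by the best similarity score per candidate. The sketch below works on plain `(candidate, distance)` pairs; in this notebook such pairs could be built from `vectorstore.similarity_search_with_score(query, k=8)`, which returns `(Document, score)` tuples where, for the default FAISS index, the score is a distance (lower is closer). The distances in the example are invented for illustration, not results from this dataset.

```python
from collections import defaultdict


def rank_by_best_distance(scored_hits):
    """Rank candidates by their closest chunk; ties go to the candidate with more hits.

    scored_hits: iterable of (candidate, distance) pairs, lower distance = closer.
    """
    best = {}
    counts = defaultdict(int)
    for candidate, distance in scored_hits:
        counts[candidate] += 1
        if candidate not in best or distance < best[candidate]:
            best[candidate] = distance
    return sorted(best, key=lambda c: (best[c], -counts[c]))


# Hypothetical distances, for illustration only
hits = [("A", 0.9), ("B", 0.4), ("A", 0.7), ("C", 0.4), ("C", 0.8)]
print(rank_by_best_distance(hits))  # ['C', 'B', 'A']
```

Candidate A has two hits but neither is close, so it ranks last, whereas a pure chunk count would place it level with C and ahead of B.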

In [27]:
query_terms = [t.lower() for t in query.replace("+", " ").split()]

def is_strong_match(text):
    text = text.lower()
    return sum(term in text for term in query_terms) >= 2

filtered_candidates = {}

for candidate, chunks in candidate_chunks.items():
    strong_chunks = [c for c in chunks if is_strong_match(c)]
    if strong_chunks:
        filtered_candidates[candidate] = strong_chunks
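The filter above splits the query on whitespace, so words such as "find" and "with" count as terms, and "integration," keeps its trailing comma. A phrase-based variant may be stricter. This sketch assumes the query has the shape used in this notebook ("Find candidates with A, B, C experience"); `extract_skill_phrases` and `matched_skills` are hypothetical helpers.

```python
import re


def extract_skill_phrases(query: str) -> list[str]:
    """Split a comma-separated requirement string into lower-case skill phrases."""
    # Drop the leading instruction and the trailing generic word (assumed query shape)
    body = re.sub(r"^\s*find candidates with\s+", "", query, flags=re.IGNORECASE)
    body = re.sub(r"\s+experience\s*$", "", body, flags=re.IGNORECASE)
    return [p.strip().lower() for p in body.split(",") if p.strip()]


def matched_skills(text: str, phrases: list[str]) -> list[str]:
    """Return the phrases found in text, ignoring case and line breaks."""
    normalised = " ".join(text.lower().split())
    return [p for p in phrases if p in normalised]


query = ("Find candidates with Sensor Integration, System Architecture, CAD Design, "
         "MATLAB, Data Analysis, Mechanical Engineering experience")
phrases = extract_skill_phrases(query)
print(phrases)
print(matched_skills("SKILLS\nSensor\nIntegration, MATLAB", phrases))
```

Whitespace is collapsed before matching because PDF extraction often breaks a phrase such as "Sensor Integration" across two lines.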


In [28]:
for candidate, chunks in filtered_candidates.items():
    print(f"\nCandidate: {candidate}")
    print("Relevant evidence:")
    print(chunks[0][:500], "...")



Candidate: AndrewGreenResume
Relevant evidence:
Awarded for creating the best technical 
documentation for a new sensor suite, which 
became the reference standard across multiple 
projects.
Led Sensor Suite Deployment
Successfully led the rapid deployment of an in-
house designed sensor suite, leading to a 
contract extension with a major aerospace 
client.
Cross-Functional Leadership
Championed a cross-functional initiative that 
improved inter-department communication, 
resulting in a 15% faster project delivery time.
SKILLS
Sensor Integra ...

Candidate: AvaTaylorResume
Relevant evidence:
AVA TAYLOR
Innovative Entry-Level Mechanical Engineer
+1-541-754-3010
Email
linkedin.com
New Brunswick, NJ
SUMMARY
Recent Mechanical Engineering graduate with a strong understanding of 
design, development, and implementation of custom mechanical components, 
sub-assemblies, and final assemblies. Eager to apply my skills to contribute to 
the Mechanical Engineer role at Intellectt.
EXPERIENCE
Mec

In [43]:
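import os
from langchain_huggingface import HuggingFaceEndpoint
from langchain_huggingface.chat_models import ChatHuggingFace

# Read the token from the environment; never hard-code a real token in a notebook.
# The variable name HUGGINGFACEHUB_API_TOKEN is a convention, set it before running.
access_token = os.environ["HUGGINGFACEHUB_API_TOKEN"]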

llm = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    task="conversational",
    huggingfacehub_api_token=access_token,
    temperature=0.1,
)
llm = ChatHuggingFace(llm=llm)


In [45]:
def build_candidate_context(chunks, max_chunks=3):
    return "\n\n".join(chunks[:max_chunks])


In [40]:
def build_prompt(query, context):
    return f"""
You are an AI HR assistant.

Task:
Determine whether the candidate meets the following requirement:

"{query}"

Resume excerpts:
{context}

Instructions:
- Answer YES or NO
- Briefly justify your decision using evidence from the text
"""


In [None]:
final_results = {}


for candidate, chunks in filtered_candidates.items():
    context = build_candidate_context(chunks)
    prompt = build_prompt(query, context)

    response = llm.invoke(prompt)

    final_results[candidate] = response
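Because the prompt asks for a YES or NO answer, each reply can be reduced to a verdict for sorting or filtering. The model does not always follow the format, so the sketch below falls back to `UNCLEAR` instead of guessing. `parse_verdict` is a hypothetical helper that takes the reply text (for a chat model, the `.content` of the returned message).

```python
import re


def parse_verdict(reply: str) -> str:
    """Return 'YES', 'NO', or 'UNCLEAR' for a free-text model reply."""
    text = reply.strip()
    # A reply that opens with the verdict, in any capitalisation
    start = re.match(r"(yes|no)\b", text, flags=re.IGNORECASE)
    if start:
        return start.group(1).upper()
    # Otherwise accept only an upper-case YES/NO, so ordinary words such as
    # "no clear mention" are not mistaken for a verdict
    upper = re.search(r"\b(YES|NO)\b", text)
    return upper.group(1) if upper else "UNCLEAR"


print(parse_verdict("Based on the information provided, I would answer YES."))  # YES
print(parse_verdict("there is no clear mention of MATLAB"))                     # UNCLEAR
```

Replies that come back as `UNCLEAR` are good candidates for a stricter prompt or a manual check.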



In [None]:
# final_results maps each candidate to the chat model's reply; the text is in .content
for resume_name, ai_message in final_results.items():
    print(f"--- {resume_name} ---")
    print(ai_message.content.strip()) 
    print("\n") 


--- AndrewGreenResume ---
Based on the information provided in the resume, I would answer YES. The candidate has experience and skills in Sensor Integration, System Architecture, CAD Design, MATLAB, Data Analysis, and Mechanical Engineering as required. They have also led engineering teams and contributed to the development of sensor suites, demonstrating their expertise in these areas. Additionally, they have completed relevant courses in CAD Design and Robotics, further solidifying their qualifications.


--- AvaTaylorResume ---
Based on the information provided in the resume, AVA TAYLOR does have experience in Mechanical Engineering and design using Pro E / Creo. She also mentions her strong analytical thinking and problem-solving abilities, which could be beneficial for Data Analysis. However, there is no clear mention of experience with Sensor Integration, System Architecture, CAD Design (beyond Pro E / Creo), MATLAB, or Mechanical Engineering at an advanced level. Therefore, base