# **AI-Powered Resume Screener (RAG Pipeline)**

### **Project Goal**
Build an automated HR assistant that can search through a directory of PDF resumes. Using **Retrieval-Augmented Generation (RAG)**, the assistant identifies top candidates based on specific skills (e.g., "TensorFlow", "Project Management") and provides a summary of their qualifications.


### **Step 1:** Import libraries and extract documents

The documents (resumes) are sample PDFs picked at random from a website.

In [1]:
import os
import numpy as np
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

import warnings
warnings.filterwarnings('ignore')



In [2]:
loader = PyPDFDirectoryLoader("resumes")
docs_before_split = loader.load()


In [4]:
print(docs_before_split[0])

page_content='AT
ALEXANDER TAYLOR
Seasoned Criminal Trial Lawyer | Legal Training Expert
 1  234  555 1234 Email linkedin.com Chicago, Illinois
SUMMARY
With over a decade of legal expertise and a track record of transformative client advocacy, I've advanced prosecutorial practices, revolutionizing case management and legal training deliverance.
EXPERIENCE
Senior Criminal Prosecutor
Cook County State’s Attorney Office
01/2015   Present  Chicago, IL
Led a team of 20 attorneys, achieving a record high 95% conviction rate on major felony cases over two years.Orchestrated a workflow overhaul for case management, resulting in a 30% improvement in case throughput.Pioneered a diversity-focused recruitment initiative that boosted retention by 25% within the legal department.Developed training modules on emerging legal standards, which became a blueprint for state-wide prosecutor offices.Spearheaded a partnership with local law enforcement agencies, enriching case quality and collaboration effec

In [5]:
import fitz  # PyMuPDF
import os
from langchain_core.documents import Document

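def load_resumes(folder_path):
    """Load every PDF in folder_path as one Document per resume."""
    documents = []

    for filename in os.listdir(folder_path):
        # Accept ".pdf" and ".PDF" alike; skip everything else
        if not filename.lower().endswith(".pdf"):
            continue

        file_path = os.path.join(folder_path, filename)

        # The context manager closes the PDF once its text has been read
        with fitz.open(file_path) as doc:
            text = "".join(page.get_text() for page in doc)

        documents.append(
            Document(
                page_content=text,
                metadata={
                    # File name without its extension
                    "candidate": os.path.splitext(filename)[0],
                    "source": file_path,
                },
            )
        )

    return documents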


In [6]:
docs_before_split = load_resumes("resumes")

In [9]:
print(docs_before_split[0])

page_content='AT
ALEXANDER TAYLOR
Seasoned Criminal Trial Lawyer | Legal Training Expert
+1-(234)-555-1234
Email
linkedin.com
Chicago, Illinois
SUMMARY
With over a decade of legal expertise and a track record of transformative 
client advocacy, I've advanced prosecutorial practices, revolutionizing case 
management and legal training deliverance.
EXPERIENCE
Senior Criminal Prosecutor
Cook County State’s Attorney Office
01/2015 - Present 
Chicago, IL
Led a team of 20 attorneys, achieving a record high 95% conviction rate 
on major felony cases over two years.
Orchestrated a workflow overhaul for case management, resulting in a 
30% improvement in case throughput.
Pioneered a diversity-focused recruitment initiative that boosted 
retention by 25% within the legal department.
Developed training modules on emerging legal standards, which became 
a blueprint for state-wide prosecutor offices.
Spearheaded a partnership with local law enforcement agencies, 
enriching case quality and collabor

### **Why the second method is better:**
**1. Cleaner Text Extraction**

Compare the contact information in the two outputs above:

- PyPDFDirectoryLoader: �1��234��555�1234 (the separators come out as null characters and encoding errors).

- PyMuPDF (fitz): +1-(234)-555-1234 (Clean, readable text).

If the retriever or the LLM later looks for a phone number or a specific technical keyword in the first version, the match will likely fail because of those broken characters (�).
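One way to make this comparison measurable is to count the characters that usually signal a failed extraction. The sketch below is an illustration only: `count_suspect_chars` is a hypothetical helper, not part of the notebook, and the two sample strings are copied from the phone numbers quoted above.

```python
import unicodedata


def count_suspect_chars(text: str) -> int:
    """Count characters that usually signal a failed PDF text extraction."""
    suspect = 0
    for ch in text:
        if ch == "\ufffd":
            # U+FFFD is the Unicode replacement character, rendered as "�"
            suspect += 1
        elif unicodedata.category(ch) == "Cc" and ch not in "\n\t\r":
            # Control characters such as NUL; ordinary whitespace is allowed
            suspect += 1
    return suspect


# Sample strings taken from the two outputs discussed above
garbled = "\ufffd1\ufffd\ufffd234\ufffd\ufffd555\ufffd1234"
clean = "+1-(234)-555-1234"

print(count_suspect_chars(garbled), count_suspect_chars(clean))
```

Running the same check over `page_content` for each loader's documents would give a rough per-resume quality score, under the assumption that replacement and control characters are the main symptom of a bad extraction.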

**2. Meaningful Metadata**

In a Resume Screener, the most important piece of information is "Who does this experience belong to?"

- Option 1: Only gives you the file path.

- Option 2: Explicitly adds a "candidate": "AlexanderTaylorResume" key to the metadata. When the LLM retrieves a chunk later, it will be much easier for it to say, "I found a match in Alexander Taylor's resume" rather than just pointing to a file path.
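Note that metadata only helps the LLM if it actually reaches the prompt. The following is a minimal sketch of that idea, assuming the candidate name is the file name without its extension; `build_metadata` and `format_chunk_for_prompt` are hypothetical helpers introduced here for illustration, and the file name is the one implied by the example above.

```python
from pathlib import Path


def build_metadata(file_path: str) -> dict:
    """Derive the candidate name from the file name (assumed convention)."""
    return {
        "candidate": Path(file_path).stem,  # file name without extension
        "source": file_path,
    }


def format_chunk_for_prompt(chunk_text: str, metadata: dict) -> str:
    """Prefix a retrieved chunk with its candidate so the LLM can attribute it."""
    return f"[Candidate: {metadata['candidate']}]\n{chunk_text}"


meta = build_metadata("resumes/AlexanderTaylorResume.pdf")
print(format_chunk_for_prompt("Led a team of 20 attorneys", meta))
```

With this kind of prefix, each retrieved chunk carries its owner into the context window, so the model does not have to infer the candidate from a file path.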

**3. Handling of Multi-column Layouts**

Resumes often use two-column layouts.

- Standard loaders (Option 1) often read across the columns horizontally, mixing the "Skills" from the right column into the "Experience" from the left column, creating a "text soup."

- PyMuPDF (Option 2) is generally more robust at extracting text in a logical flow that preserves the relationship between headers and descriptions.
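If a particular resume still comes out interleaved, the reading order can be enforced explicitly. The sketch below assumes a simple two-column page split at a fixed fraction of the page width, which will not hold for every resume. It operates on block tuples in the shape PyMuPDF returns from `page.get_text("blocks")`, that is `(x0, y0, x1, y1, text, block_no, block_type)`; `order_blocks_by_column` is a hypothetical helper, and the sample blocks are made up for the demonstration.

```python
def order_blocks_by_column(blocks, page_width, split_ratio=0.5):
    """Return block text read column by column: left column first, then right.

    blocks: tuples shaped like PyMuPDF's page.get_text("blocks") output,
            (x0, y0, x1, y1, text, block_no, block_type).
    split_ratio: assumed position of the column boundary (0.5 = page middle).
    """
    split_x = page_width * split_ratio
    left, right = [], []
    for block in blocks:
        x0, y0, x1, _y1, text = block[:5]
        # Assign each block to a column by its horizontal center
        column = left if (x0 + x1) / 2 < split_x else right
        column.append((y0, text.strip()))

    # Within each column, read top to bottom
    ordered = sorted(left, key=lambda b: b[0]) + sorted(right, key=lambda b: b[0])
    return "\n".join(text for _, text in ordered if text)


# Made-up blocks in storage order that alternates between the two columns
sample_blocks = [
    (300, 50, 500, 70, "SKILLS\n", 0, 0),
    (50, 50, 250, 70, "EXPERIENCE\n", 1, 0),
    (300, 80, 500, 100, "Python\n", 2, 0),
    (50, 80, 250, 100, "Senior Criminal Prosecutor\n", 3, 0),
]
print(order_blocks_by_column(sample_blocks, page_width=600))
```

In the notebook's loader this would replace the plain `page.get_text()` call with `order_blocks_by_column(page.get_text("blocks"), page.rect.width)`, at the cost of assuming every page uses the same column split.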