# Resume–JD Matching POC Demo
This notebook is a **demo + client-style walkthrough** of the resume matching POC.

It shows:
- Loading a job description (JD)
- Loading resumes (TXT/PDF), optional PII redaction, and debug text outputs
- Running scoring (hybrid BM25 + chunked semantic)
- Explainability: top terms + top sentences
- Optional cross-encoder reranking
- Synthetic evaluation (nDCG, Spearman)


## 0) Setup
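Run this notebook from `notebooks/`; all paths below are resolved relative to its parent directory (the repo root).

If you don't have the dependencies installed yet, uncomment and run the `pip install` line at the top of the next cell.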

In [9]:
# Optional: install dependencies (uncomment if needed)
# !pip install -r ../requirements.txt

from pathlib import Path
import pandas as pd

REPO_ROOT = Path("..").resolve()
DATA_DIR = REPO_ROOT / "data"
RESUMES_DIR = DATA_DIR / "resumes"
OUTPUT_DIR = REPO_ROOT / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("Repo root:", REPO_ROOT)
print("Resumes dir:", RESUMES_DIR)


Repo root: /Users/amanverma/Downloads/ema-resume-matching-poc 4
Resumes dir: /Users/amanverma/Downloads/ema-resume-matching-poc 4/data/resumes


## 1) Load the job description
We read the JD from `data/job_description.txt`.

In [10]:
jd_path = DATA_DIR / "job_description.txt"
jd_text = jd_path.read_text(encoding="utf-8", errors="ignore")
print("JD path:", jd_path)
print("\nJD preview (first 600 chars):\n")
print(jd_text[:600])


JD path: /Users/amanverma/Downloads/ema-resume-matching-poc 4/data/job_description.txt

JD preview (first 600 chars):

Role: AI Engineer – Agentic & Intelligent Systems
Shape the Way the World Understands Data

At Teradata, we’re not just managing data—we’re unleashing its full potential. Through ClearScape Analytics™ and our Enterprise Vector Store, we are building next-generation autonomous and agentic AI systems that observe, reason, adapt, and drive enterprise-scale decision-making.
As part of the AI Engineering team, you will design and deploy production-grade AI agents tightly integrated with business workflows—turning data into measurable outcomes.

What You’ll Do
AI Systems & Agent Engineering
Architec


## 2) Load resumes (TXT/PDF) + optional PII redaction
This uses the same loader as the CLI.

- Extracted text is saved to `outputs/extracted/`
- Redacted text is saved to `outputs/redacted/` (only if enabled)
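
For readers who want a feel for what a redaction pass does, here is a minimal sketch. It is not the repo's implementation: `redact_pii`, the two regular expressions, and the `[EMAIL]` / `[PHONE]` placeholders are all hypothetical, and the real loader may cover more PII types or use a different approach.

```python
import re

# Hypothetical patterns for illustration only; the repo's loader may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Crude phone pattern: a digit run of 9+ characters allowing spaces, dots,
# dashes and parentheses in between.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(redact_pii("Contact: jane.doe@example.com, +1 (415) 555-0100"))
```

A pattern this simple will also catch some non-PII digit runs (for example a hyphenated date range such as `2016 - 2019`), which is one reason to compare the files in `outputs/extracted/` and `outputs/redacted/` when redaction is enabled.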


In [11]:
import sys
sys.path.append(str(REPO_ROOT))  # so `import src...` works

from src.cli import load_resumes

resumes = load_resumes(
    folder=RESUMES_DIR,
    do_redact_pii=True,          # flip to False if you want raw text
    save_debug_text=True,
    outputs_dir=OUTPUT_DIR,
)

print("Loaded resumes:", len(resumes))
print("Example resume ids:", list(resumes.keys())[:5])


Loaded resumes: 10
Example resume ids: ['resume_00_AIML_Fullstack.pdf', 'resume_01_genai_fresher.txt', 'resume_02_cloudEngg.txt', 'resume_03_genai_lead.txt', 'resume_04_genai_support.txt']


## 3) Run matching (baseline: no cross-encoder)
Key output columns:
- `final`: final score in [0,1] (ranking signal)
- `semantic`: semantic similarity (scaled)
- `bm25`: lexical overlap (scaled)
- `top_terms`: important overlapping JD keywords
- `top_sentences`: top resume sentences most similar to the JD
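
To make the `bm25` column more concrete, the following is a small, self-contained sketch of Okapi BM25 over pre-tokenized documents. It only illustrates the standard formula; the function name `bm25_scores` and the defaults `k1=1.5`, `b=0.75` are assumptions, and the POC may use a library implementation and different tokenization before scaling the result.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Raw (unscaled) Okapi BM25 score of each tokenized doc for the query."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy data, not taken from the POC's resumes.
docs = [["python", "llm", "agents"], ["java", "spring"], ["python", "react"]]
print(bm25_scores(["llm", "agents"], docs))
```

Only the first toy document contains the query terms, so it is the only one with a non-zero score.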


In [12]:
from src.matching import ResumeMatcher, MatchingConfig

cfg = MatchingConfig(
    use_cross_encoder=False,
    chunk_size_tokens=320,
    chunk_overlap_tokens=80,
    semantic_top_k_avg=1,
    rerank_top_n=5,
)

matcher = ResumeMatcher(cfg)
rows = matcher.score(jd_text, resumes)
df = pd.DataFrame(rows)

cols = [
    "resume_id","final","semantic","bm25","missing_bucket_count",
    "jd_required_years","resume_years_est","seniority_penalty_factor",
    "top_terms"
]
cols = [c for c in cols if c in df.columns]
df[cols].head(10)


| | resume_id | final | semantic | bm25 | missing_bucket_count | jd_required_years | resume_years_est | seniority_penalty_factor | top_terms |
|---|---|---|---|---|---|---|---|---|---|
| 0 | resume_03_genai_lead.txt | 1.0 | 1.0 | 1.0 | 1 | 3.0 | 9.0 | 1.0 | architectures, architect, orchestration, scali... |
| 1 | resume_00_AIML_Fullstack.pdf | 0.696708 | 0.701746 | 0.781271 | 0 | 3.0 | 3.0 | 1.0 | within, mlops, llms, mlflow, memory, field, ml... |
| 2 | resume_01_genai_fresher.txt | 0.429211 | 0.88035 | 0.425809 | 1 | 3.0 | 1.0 | 0.65 | focus, tests, integration, quality, production... |
| 3 | resume_04_genai_support.txt | 0.392647 | 0.82889 | 0.144411 | 3 | 3.0 | 2.0 | 0.85 | is, solid, distributed, or, ability, detection... |
| 4 | resume_07_BA_DataScience.pdf | 0.321647 | 0.565526 | 0.326972 | 2 | 3.0 | | 0.92 | visualization, measurable, 3+, ensure, team, c... |
| 5 | resume_09_BA.pdf | 0.225405 | 0.38842 | 0.267036 | 3 | 3.0 | 3.0 | 1.0 | collaboration, planning, implementation, analy... |
| 6 | resume_08_DataEngg.pdf | 0.219687 | 0.335863 | 0.196421 | 1 | 3.0 | 4.0 | 1.0 | knowledge, unit, innovation, as, testing, fram... |
| 7 | resume_02_cloudEngg.txt | 0.188826 | 0.447244 | 0.0 | 3 | 3.0 | 5.0 | 1.0 | infrastructure, ability, applications., promet... |
| 8 | resume_06_JavaDev.txt | 0.016167 | 0.202394 | 0.034786 | 4 | 3.0 | 2.0 | 0.85 | understanding, related, practices, databases, ... |
| 9 | resume_05_frontend.txt | 0.0 | 0.0 | 0.373623 | 3 | 3.0 | 5.0 | 1.0 | work, ui, or, typescript, frontend, applicatio... |
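
The config values `chunk_size_tokens=320`, `chunk_overlap_tokens=80` and `semantic_top_k_avg=1` suggest that each resume is split into overlapping chunks and that the per-chunk similarities are then pooled. The sketch below shows one plausible reading of those parameters. It is an assumption, not the matcher's code: `chunk_tokens` and `top_k_mean` are hypothetical helpers, and the real tokenizer and pooling rule may differ.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 320,
                 overlap: int = 80) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def top_k_mean(similarities: List[float], k: int = 1) -> float:
    """Average the k highest chunk similarities (k=1 keeps only the best)."""
    if not similarities:
        return 0.0
    top = sorted(similarities, reverse=True)[:k]
    return sum(top) / len(top)


tokens = [str(i) for i in range(800)]   # stand-in for a tokenized resume
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
print(top_k_mean([0.2, 0.9, 0.5], k=1))
```

Under this reading, `semantic_top_k_avg=1` means a resume is judged by its single best-matching chunk, so one strong section can carry the semantic score even if the rest of the resume is unrelated.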


### 3a) Explainability examples
Show the top matched sentences for the top-ranked resumes.

In [13]:
def show_explanations(df: pd.DataFrame, k: int = 3):
    for i in range(min(k, len(df))):
        r = df.iloc[i]
        print("="*90)
        print(f"Rank #{i+1}: {r['resume_id']}")
        print(f"final={r['final']:.3f}  semantic={r['semantic']:.3f}  bm25={r['bm25']:.3f}")
        print("\nTop terms:", r.get("top_terms",""))
        print("\nTop sentences:")
        print(str(r.get("top_sentences",""))[:1200])

show_explanations(df, k=3)


Rank #1: resume_03_genai_lead.txt
final=1.000  semantic=1.000  bm25=1.000

Top terms: architectures, architect, orchestration, scaling, frameworks, vector, practices, design, databases, quality

Top sentences:
collaborated with platform and security teams to enable production-ready ai deployments. || senior ai engineer | 2016 – 2019 built ml-driven backend services and early nlp pipelines powering search and recommendation workflows. || architecture & technical skills genai systems: rag architectures, agentic workflows, prompt orchestration, evaluation frameworks llm stack: python, llm apis, embeddings, rerankers, vector databases platform & devops: kubernetes, docker, ci/cd, observability data & storage: vector dbs, sql/nosql, document pipelines design: high-level system design, scalability, reliability, cost optimization
Rank #2: resume_00_AIML_Fullstack.pdf
final=0.697  semantic=0.702  bm25=0.781

Top terms: within, mlops, llms, mlflow, memory, field, ml, ai/ml, typescript, implemen

## 4) Run matching with cross-encoder reranking (top-N)
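The cross-encoder scores each (JD, resume) pair jointly and reranks only the top-N resumes from the baseline stage. Resumes outside the top-N keep their baseline `final` score and have no `cross_encoder` value.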

In [14]:
cfg_ce = MatchingConfig(
    use_cross_encoder=True,
    rerank_top_n=5,  # rerank only top 5
    chunk_size_tokens=320,
    chunk_overlap_tokens=80,
    semantic_top_k_avg=1,
)

matcher_ce = ResumeMatcher(cfg_ce)
rows_ce = matcher_ce.score(jd_text, resumes)
df_ce = pd.DataFrame(rows_ce)

cols_ce = [
    "resume_id","final","semantic","bm25","cross_encoder","missing_bucket_count",
    "jd_required_years","resume_years_est","seniority_penalty_factor",
    "top_terms"
]
cols_ce = [c for c in cols_ce if c in df_ce.columns]
df_ce[cols_ce].head(10)


| | resume_id | final | semantic | bm25 | cross_encoder | missing_bucket_count | jd_required_years | resume_years_est | seniority_penalty_factor | top_terms |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | resume_03_genai_lead.txt | 1.0 | 1.0 | 1.0 | 1.0 | 1 | 3.0 | 9.0 | 1.0 | architectures, architect, orchestration, scali... |
| 1 | resume_00_AIML_Fullstack.pdf | 0.676624 | 0.701746 | 0.781271 | 0.665035 | 0 | 3.0 | 3.0 | 1.0 | within, mlops, llms, mlflow, memory, field, ml... |
| 2 | resume_01_genai_fresher.txt | 0.304783 | 0.88035 | 0.425809 | 0.108347 | 1 | 3.0 | 1.0 | 0.65 | focus, tests, integration, quality, production... |
| 3 | resume_04_genai_support.txt | 0.255872 | 0.82889 | 0.144411 | 0.038032 | 3 | 3.0 | 2.0 | 0.85 | is, solid, distributed, or, ability, detection... |
| 4 | resume_09_BA.pdf | 0.225405 | 0.38842 | 0.267036 | | 3 | 3.0 | 3.0 | 1.0 | collaboration, planning, implementation, analy... |
| 5 | resume_08_DataEngg.pdf | 0.219687 | 0.335863 | 0.196421 | | 1 | 3.0 | 4.0 | 1.0 | knowledge, unit, innovation, as, testing, fram... |
| 6 | resume_07_BA_DataScience.pdf | 0.193561 | 0.565526 | 0.326972 | 0.0 | 2 | 3.0 | | 0.92 | visualization, measurable, 3+, ensure, team, c... |
| 7 | resume_02_cloudEngg.txt | 0.188826 | 0.447244 | 0.0 | | 3 | 3.0 | 5.0 | 1.0 | infrastructure, ability, applications., promet... |
| 8 | resume_06_JavaDev.txt | 0.016167 | 0.202394 | 0.034786 | | 4 | 3.0 | 2.0 | 0.85 | understanding, related, practices, databases, ... |
| 9 | resume_05_frontend.txt | 0.0 | 0.0 | 0.373623 | | 3 | 3.0 | 5.0 | 1.0 | work, ui, or, typescript, frontend, applicatio... |
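
The top-N behaviour in this section can be sketched without loading a model. In the snippet below, `pair_score` stands in for the cross-encoder, and the linear blend with weight `alpha` is purely an assumption; the notebook does not show how `ResumeMatcher` combines the cross-encoder score with the baseline. The point is only the control flow: the first N candidates are rescored, the rest pass through unchanged, and the list is then re-sorted.

```python
from typing import Callable, Dict, List


def rerank_top_n(baseline: Dict[str, float],
                 pair_score: Callable[[str], float],
                 top_n: int = 5,
                 alpha: float = 0.5) -> List[dict]:
    """Rescore only the top_n baseline candidates, then re-sort everything.

    `alpha` is a hypothetical blend weight, not a value taken from the POC.
    """
    ranked = sorted(baseline.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    for rank, (resume_id, base) in enumerate(ranked):
        if rank < top_n:
            ce = pair_score(resume_id)          # stand-in for the cross-encoder
            final = (1 - alpha) * base + alpha * ce
        else:
            ce, final = None, base              # untouched by the reranker
        rows.append({"resume_id": resume_id, "final": final, "cross_encoder": ce})
    return sorted(rows, key=lambda r: r["final"], reverse=True)


# Toy scores, not taken from the POC.
baseline = {"a": 0.9, "b": 0.6, "c": 0.5, "d": 0.1}
stub_ce = {"a": 1.0, "b": 0.0}
for row in rerank_top_n(baseline, lambda rid: stub_ce[rid], top_n=2):
    print(row)
```

In the toy run, `b` is inside the top-N but receives a low cross-encoder score, so it falls below `c`, which was never rescored. The same pattern is visible in the notebook output, where `resume_07_BA_DataScience.pdf` drops below two resumes that have no `cross_encoder` value.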


## 5) Evaluate on synthetic labels
We evaluate ranking quality using nDCG@k and Spearman correlation between score and label.

Labels file: `data/labels.csv` with labels in {1.0, 0.5, 0.0}.
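
For reference, both metrics can be computed in a few lines of plain Python. This is a sketch under stated assumptions rather than the code in `src.evaluate`: it uses linear gains for nDCG and average ranks for ties in Spearman, and the evaluation module may make different choices (for example exponential gains). The labels and scores in the example are made up.

```python
import math
from typing import List


def ndcg_at_k(labels_in_ranked_order: List[float], k: int) -> float:
    """nDCG@k with linear gains; input is the labels sorted by model score."""
    def dcg(rels: List[float]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(labels_in_ranked_order, reverse=True))
    return dcg(labels_in_ranked_order) / ideal if ideal > 0 else 0.0


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


# Made-up example: labels listed in the order the model ranked the resumes.
labels = [1.0, 0.5, 1.0, 0.0]
scores = [0.9, 0.7, 0.4, 0.1]
print(ndcg_at_k(labels, k=3), spearman(scores, labels))
```

With only three label levels, many resumes tie on the label, which is why the tie handling in `average_ranks` matters for the Spearman value.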

In [15]:
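import os
import subprocess
import sys

# Disable Hugging Face tokenizers parallelism (avoids the fork warning);
# the child process inherits this environment variable.
os.environ["TOKENIZERS_PARALLELISM"] = "false"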

cmd = [
    sys.executable, "-m", "src.evaluate",
    "--jd", str(DATA_DIR / "job_description.txt"),
    "--resumes", str(RESUMES_DIR),
    "--labels", str(DATA_DIR / "labels.csv"),
    "--redact_pii",
    "--use_cross_encoder",
    "--rerank_top_n", "5",
]

print("Running:\n", " ".join(cmd))
p = subprocess.run(
    cmd,
    cwd=str(REPO_ROOT),          # run from the repo root so `-m src.evaluate` resolves
    text=True,
    capture_output=True,
)
print("\nSTDOUT:\n", p.stdout)
print("\nSTDERR:\n", p.stderr)
p.check_returncode()


Running:
 /Users/amanverma/Downloads/ema-resume-matching-poc 4/.venv/bin/python -m src.evaluate --jd /Users/amanverma/Downloads/ema-resume-matching-poc 4/data/job_description.txt --resumes /Users/amanverma/Downloads/ema-resume-matching-poc 4/data/resumes --labels /Users/amanverma/Downloads/ema-resume-matching-poc 4/data/labels.csv --redact_pii --use_cross_encoder --rerank_top_n 5

STDOUT:
                    resume_id  label    final  semantic     bm25  missing_bucket_count                                                                                                  top_terms
    resume_03_genai_lead.txt    1.0 1.000000  1.000000 1.000000                     1 architect, orchestration, architectures, scaling, design, agent, frameworks, databases, quality, practices
resume_00_AIML_Fullstack.pdf    1.0 0.676624  0.701746 0.781271                     0                                       mlflow, field, llms, within, ai/ml, memory, mlops, ml, tests, design
 resume_01_genai_fresher.txt

## The Truth About This POC (Limitations)

* **PDF Parsing:** Extraction is imperfect. Scanned or image-only PDFs may yield empty text because there is no OCR step.
* **Scoring:** Scores are batch-relative, not calibrated probabilities (a 0.8 in one batch is not comparable to a 0.8 in another).
* **Unsupervised:** The scoring logic does not learn from recruiter feedback.
* **Bias:** Fairness is not fully addressed; a production system needs robust PII handling and bias audits.
* **Latency:** The cross-encoder can improve ranking quality but adds latency and compute cost.
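
The batch-relative point is easiest to see with a toy example. The notebook outputs are consistent with per-batch min-max scaling (the best resume gets 1.0 and the weakest 0.0 on `semantic`), but the source does not confirm the exact method, so treat `min_max_scale` below as an assumption used only for illustration. The raw scores are made up.

```python
from typing import Dict


def min_max_scale(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale a batch of raw scores to [0, 1] relative to that batch only."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}  # degenerate batch: no spread to scale
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


# The same resume ("r2", raw score 20.0) appears in two different batches.
batch_a = {"r1": 10.0, "r2": 20.0, "r3": 30.0}
batch_b = {"r2": 20.0, "r4": 40.0, "r5": 100.0}
print(min_max_scale(batch_a)["r2"])
print(min_max_scale(batch_b)["r2"])
```

The resume's raw score is identical in both runs, yet its scaled score depends entirely on which other resumes happen to be in the batch. Scores are therefore useful for ordering candidates within one run, not for applying a fixed cutoff across runs.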


## Next Improvements

* **OCR Integration:** Layout-aware PDF parsing and OCR fallback for scanned resumes.
* **Entity Extraction:** Extract structured fields (job titles, tools, certifications, years of experience) to support stricter filtering.
* **Stronger Gating:** Configurable "mandatory buckets" per role to auto-reject unqualified candidates.
* **Learning-to-Rank (LTR):** Train a model using real recruiter outcomes (Screen / Interview / Hire) as labels.
* **Scalability:** Move from in-memory processing to Elasticsearch (BM25) + a vector database (ANN) + a cross-encoder reranker.