# Lab: AI-Powered Resume Matcher (ADVANCED SEMANTIC UPGRADE)

### **Objective**
This is the most advanced version of our recruitment AI lab. We use four distinct "Signals" from **IBM Watson NLU** (Natural Language Understanding) to evaluate a candidate along several dimensions, closer to how a human recruiter would.

### **The Four Semantic Signals**
1. **Keywords (Shallow)**: Technical jargon and specific tools (e.g., "Python", "React").
2. **Concepts (Deep)**: Abstract themes and expertise (e.g., "Software Development Life Cycle").
3. **Entities (Proper Nouns)**: Companies, job titles, and specific organizations (e.g., "IBM", "Staff Engineer").
4. **Categories (Context)**: The industry domain (e.g., "Technology > Software Engineering").

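### **Why does this matter?**
A keyword-only comparison may produce a low score (for example, around 10%) when the resume and the job posting use different wording. Your **Entities** and **Concepts** can still show that you have the right background, even if you didn't use the exact words the job poster chose.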

## Step 0: Setup and Authentication

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from dotenv import load_dotenv
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import Features, KeywordsOptions, ConceptsOptions, CategoriesOptions, EntitiesOptions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import fitz  # PyMuPDF

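# Load credentials from a local .env file (expects IAM_KEY and SERVICE_URL)
load_dotenv()
iam_key = os.getenv("IAM_KEY")
service_url = os.getenv("SERVICE_URL")
if not iam_key or not service_url:
    raise RuntimeError("Set IAM_KEY and SERVICE_URL in your .env file before running this lab.")

# Authenticate and create the NLU client
authenticator = IAMAuthenticator(iam_key)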
nlu = NaturalLanguageUnderstandingV1(version='2022-04-07', authenticator=authenticator)
nlu.set_service_url(service_url)

print("Advanced Intelligence Setup Complete!")

## Step 1: Quad-Feature Signal Extraction

We will now extract Keywords, Concepts, Entities, and Categories from both documents.

In [None]:
def analyze_document(text):
    """Send text to Watson NLU and return all four signals as a dict."""
    return nlu.analyze(
        text=text,
        features=Features(
            keywords=KeywordsOptions(limit=25),
            concepts=ConceptsOptions(limit=15),
            entities=EntitiesOptions(limit=25),
            categories=CategoriesOptions(limit=5)
        )
    ).get_result()

def extract_pdf_text(path):
    """Concatenate the plain text of every page in a PDF."""
    with fitz.open(path) as doc:
        return "".join(page.get_text() for page in doc)

# Process Documents
JOB_FILE = "docs/job-5653.txt"
RESUME_FILE = "docs/PortillaCV-08-2023.pdf"

with open(JOB_FILE, 'r') as f: job_text = f.read()
resume_text = extract_pdf_text(RESUME_FILE)

j_raw = analyze_document(job_text)
r_raw = analyze_document(resume_text)

print("Deep Extraction Complete!")
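
Before comparing anything, it can help to look at what came back. The sketch below uses a small hand-written dictionary in place of a real response, so it runs without credentials. It assumes the shape this lab relies on: each keyword, concept, and entity carries a `text` field, and each category carries a `label`. The helper name `signal_texts` and all sample values (including the `relevance` and `score` numbers) are hypothetical.

```python
# Hand-written stand-in for an NLU result; values are illustrative only.
sample_result = {
    "keywords": [{"text": "Python", "relevance": 0.91}],
    "concepts": [{"text": "Software development process", "relevance": 0.88}],
    "entities": [{"type": "Company", "text": "IBM", "relevance": 0.95}],
    "categories": [{"label": "/technology and computing/software", "score": 0.97}],
}

def signal_texts(result, feature):
    """Return the display strings for one feature, or [] if it is missing."""
    # Categories expose a 'label'; the other three features expose 'text'.
    key = "label" if feature == "categories" else "text"
    return [item[key] for item in result.get(feature, [])]

for feature in ("keywords", "concepts", "entities", "categories"):
    print(f"{feature}: {signal_texts(sample_result, feature)}")
```

Calling `signal_texts(j_raw, "keywords")` on the real result should give the same kind of list that Step 2 builds with list comprehensions.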

## Step 2: Comparing the Signals

We now calculate similarity across four dimensions, allowing us to see exactly where the candidate is a fit.

In [None]:
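def list_sim(l1, l2):
    """Return the TF-IDF cosine similarity of two phrase lists as a percentage (0-100)."""
    if not l1 or not l2: return 0.0
    v = TfidfVectorizer()
    # Each list is joined into one "document", so matching happens on individual words
    m = v.fit_transform([' '.join(l1), ' '.join(l2)])
    return round(cosine_similarity(m[0], m[1])[0][0] * 100, 2)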

# Extract lists
j_kw = [x['text'] for x in j_raw.get('keywords', [])]
r_kw = [x['text'] for x in r_raw.get('keywords', [])]
j_cp = [x['text'] for x in j_raw.get('concepts', [])]
r_cp = [x['text'] for x in r_raw.get('concepts', [])]
j_en = [x['text'] for x in j_raw.get('entities', [])]
r_en = [x['text'] for x in r_raw.get('entities', [])]

# Calculate Match Percentages
kw_score = list_sim(j_kw, r_kw)
cp_score = list_sim(j_cp, r_cp)
en_score = list_sim(j_en, r_en)

# Refined Category Match: labels are '/'-separated paths, so compare only the
# top-level segment (the industry) of every category in each document
j_cats = [c['label'].split('/')[1] for c in j_raw.get('categories', []) if '/' in c['label']]
r_cats = [c['label'].split('/')[1] for c in r_raw.get('categories', []) if '/' in c['label']]
ct_score = 100 if set(j_cats).intersection(set(r_cats)) else 0

# Quad-signal weighted score (the four weights sum to 1.0)
final_score = round((kw_score * 0.3) + (cp_score * 0.3) + (en_score * 0.2) + (ct_score * 0.2), 2)

print("--- DIMENSIONAL ANALYSIS ---")
print(f"Shared Keyword Score: {kw_score}%")
print(f"Semantic Concept Score: {cp_score}%")
print(f"Entity (Proper Name) Score: {en_score}%")
print(f"Industry Category Score: {ct_score}%")
print(f"\nFINAL INTEGRATED MATCH SCORE: {final_score}%")
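
To get a feel for how `list_sim` behaves, the sketch below repeats the same logic and runs it on three toy comparisons. The phrase lists are hypothetical and not taken from the lab's documents. Because each list is joined into a single string before vectorizing, phrases that share only one word (such as "machine learning" and "learning management") still earn partial credit.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def list_sim(l1, l2):
    """Same logic as the lab's list_sim, repeated so this cell runs on its own."""
    if not l1 or not l2:
        return 0.0
    m = TfidfVectorizer().fit_transform([' '.join(l1), ' '.join(l2)])
    return round(float(cosine_similarity(m[0], m[1])[0][0]) * 100, 2)

# Hypothetical phrase lists for illustration
job = ["machine learning", "data pipelines"]
same = ["machine learning", "data pipelines"]
partial = ["learning management", "data entry"]    # shares the words "learning" and "data"
unrelated = ["forklift operation", "warehouse safety"]

identical_score = list_sim(job, same)
partial_score = list_sim(job, partial)
unrelated_score = list_sim(job, unrelated)

print(f"identical lists: {identical_score}%")
print(f"partial overlap: {partial_score}%")
print(f"no shared words: {unrelated_score}%")
```

Identical lists score 100 and lists with no shared words score 0. The partial case lands in between, which is worth remembering when a score looks higher than the phrase lists suggest.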

## Step 3: Visualization & Educational Insight

This chart shows why a keyword-only "Basic Match" can be misleading. Even when the keyword score is low, high **Concepts** and **Industry Context** scores suggest the candidate may still be well qualified.

In [None]:
labels = ['Keywords', 'Concepts', 'Entities', 'Context', 'OVERALL']
vals = [kw_score, cp_score, en_score, ct_score, final_score]
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99', '#c1f1ec']

plt.figure(figsize=(10, 5))
plt.bar(labels, vals, color=colors)
plt.ylim(0, 100)
plt.ylabel('Match %')
plt.title(f'Advanced Quad-Signal AI Comparison: {final_score}%')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
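
The OVERALL bar is a weighted sum of the other four bars. The sketch below works through that arithmetic with hypothetical scores and category labels (none of these numbers come from the lab's documents). It assumes category labels are '/'-separated paths, as in Step 2, and the helper names are illustrative.

```python
# Weights from the lab's final score; they sum to 1.0
WEIGHTS = {"keywords": 0.3, "concepts": 0.3, "entities": 0.2, "categories": 0.2}

def top_level(labels):
    """Return the first path segment of each '/'-separated label."""
    return {label.split('/')[1] for label in labels if '/' in label}

def category_score(job_labels, resume_labels):
    """All-or-nothing: 100 if any top-level category is shared, else 0."""
    return 100 if top_level(job_labels) & top_level(resume_labels) else 0

def weighted_score(scores):
    """Combine the four signal scores (each 0-100) into one percentage."""
    return round(sum(scores[name] * w for name, w in WEIGHTS.items()), 2)

# Hypothetical labels: both documents fall under the same top-level industry
job_labels = ["/technology and computing/software"]
resume_labels = ["/technology and computing/programming languages",
                 "/education/graduate school"]

ct = category_score(job_labels, resume_labels)
example = {"keywords": 10.0, "concepts": 40.0, "entities": 25.0, "categories": ct}
overall = weighted_score(example)

# 10*0.3 + 40*0.3 + 25*0.2 + 100*0.2 = 3 + 12 + 5 + 20
print(f"Category score: {ct}%, overall: {overall}%")
```

Because the category score is either 0 or 100, it contributes either 0 or 20 points to the overall score. With these weights, a single shared industry label moves the result as much as a 67-point swing in the keyword score would.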

### **Conclusion for Students**
1. **Keywords**: The basics that an Applicant Tracking System (ATS) checks. Often low because of differing jargon.
2. **Concepts**: Shows you "know your stuff" regardless of the terms used.
3. **Entities**: Validates your career history (recognizable companies and specific roles).
4. **Category**: Validates that you are in the right professional industry.