# ResumeRecommendationReview

- Author: [Ilgyun Jeong](https://github.com/johnny9210)
- Design:
- Peer Review:
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)

## Overview

Modern recruitment processes require efficient matching between resumes and job postings to ensure the right candidates are connected with the right opportunities.

This system leverages advanced natural language processing techniques to analyze uploaded resumes, compare them with corporate job posting data, and recommend relevant companies based on the candidate's qualifications.

Additionally, the system evaluates and refines the uploaded resumes to align with the specific requirements and expectations of the recommended companies, providing tailored feedback for enhanced compatibility.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Set up ChromaDB](#set-up-chromadb)
- [Extract Text from PDF](#extract-text-from-pdf)
- [Generate and Store Embeddings](#generate-and-store-embeddings)
- [Recommend Relevant Companies](#recommend-relevant-companies)
- [Example Usage of Company Recommendation](#example-usage-of-company-recommendation)
- [Resume Evaluation](#resume-evaluation)


### References



---


## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**

- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.


In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "chromadb",
        "openai",
        "langchain_chroma",
        "langchain_openai",
        "PyMuPDF",
    ]
)

Current environment: mac-py3.10
Release type or date: stable
Installing packages: langsmith, langchain, chromadb, openai, langchain_chroma, langchain_openai, PyMuPDF...
Successfully installed: langsmith, langchain==0.3.13, chromadb, openai, langchain_chroma, langchain_openai, PyMuPDF


In [25]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "ResumeRecommendationReview",
        "UPSTAGE_API_KEY": "",
    }
)

Environment variables have been set successfully.


In [26]:
from dotenv import load_dotenv

load_dotenv()

True

## Set up ChromaDB

Initialize ChromaDB and create the necessary collections.
Set up separate collections for storing resume and company information.

In [27]:
# ChromaDB setup
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")


# Initialize the embedding model used by ChromaDB
embeddings = OpenAIEmbeddings()

# Create vector stores for resumes and company information
resume_vectorstore = Chroma(
    collection_name="resumes",
    embedding_function=embeddings,
    persist_directory="../data",
)

company_vectorstore = Chroma(
    collection_name="companies",
    embedding_function=embeddings,
    persist_directory="../data",
)

## Extract Text from PDF

Extract text from resume PDFs and split it into processable chunks.


In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
import fitz
from typing import List


def extract_text_from_pdf(pdf_path: str) -> List[str]:
    # Initialize text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    # Open PDF and extract text
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()

    # Split text into chunks
    chunks = text_splitter.split_text(text)
    return chunks

### Example Usage

In [29]:
# Usage example
pdf_path = "../data/joannadrummond-cv.pdf"
text_chunks = extract_text_from_pdf(pdf_path)
print(f"Number of extracted text chunks: {len(text_chunks)}")

Number of extracted text chunks: 12
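
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this hypothetical `sliding_window_chunks` helper does not do.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4))
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, which is the same idea the tutorial relies on to keep context across chunk boundaries.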


### Layout Analysis

Analyze the layout of a PDF using the Upstage API to identify the positions of tables and images.

In [32]:
import requests
from typing import Dict
import os

UPSTAGE_API_KEY = os.getenv("UPSTAGE_API_KEY")


def analyze_layout(pdf_path: str) -> Dict:
    with open(pdf_path, "rb") as f:
        files = {"document": f}
        headers = {"Authorization": f"Bearer {UPSTAGE_API_KEY}"}
        response = requests.post(
            "https://api.upstage.ai/v1/document-ai/layout-analysis",
            headers=headers,
            files=files,
        )
        return response.json()


# Execute layout analysis
layout_info = analyze_layout(pdf_path)
print("Detected elements:", [elem["category"] for elem in layout_info["elements"]])

Detected elements: ['heading1', 'paragraph', 'paragraph', 'heading1', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'heading1', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'header', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'heading1', 'paragraph', 'heading1', 'paragraph', 'paragraph', 'paragraph', 'header', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'heading1', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'header', 'paragraph', 'heading1', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'heading1', 'paragraph', 'heading1', 'paragraph', 'paragraph', 'paragraph', 'paragraph', 'header', 'paragraph']
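
The raw category list is hard to scan. As a small sketch, the hypothetical `summarize_categories` helper below counts elements per category and checks whether any tables or figures were detected. It assumes only the response shape used above (an `"elements"` list whose items carry a `"category"` key); the sample data is made up.

```python
from collections import Counter
from typing import Dict


def summarize_categories(layout_info: Dict) -> Dict[str, int]:
    """Count how many layout elements fall into each category."""
    elements = layout_info.get("elements", [])
    return dict(Counter(elem["category"] for elem in elements))


# Made-up response fragment with the same shape as the API result used above
sample_layout = {
    "elements": [
        {"category": "heading1"},
        {"category": "paragraph"},
        {"category": "paragraph"},
        {"category": "header"},
    ]
}

summary = summarize_categories(sample_layout)
has_visual_elements = any(cat in summary for cat in ("table", "figure"))
print(summary)
print("Contains tables or figures:", has_visual_elements)
```

For the sample resume above, the detected categories are only headings, paragraphs, and headers, so the table and image extraction step that follows has nothing to crop.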


### Table and Image Extraction

Extract tables and images from the PDF based on the layout analysis results.

In [33]:
import fitz
from PIL import Image
from typing import List, Dict


def extract_elements(pdf_path: str, layout_info: Dict) -> List[Dict]:
    doc = fitz.open(pdf_path)
    elements = []

    # Upstage page numbers are assumed to be 1-based, so start counting at 1
    for page_num, page in enumerate(doc, start=1):
        pix = page.get_pixmap()
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

        for element in layout_info["elements"]:
            if element["page"] == page_num and element["category"] in [
                "table",
                "figure",
            ]:
                bbox = element["bounding_box"]
                x1, y1 = bbox[0]["x"], bbox[0]["y"]
                x2, y2 = bbox[2]["x"], bbox[2]["y"]
                cropped = img.crop((x1, y1, x2, y2))
                elements.append({"type": element["category"], "image": cropped})

    return elements


# Run element extraction
elements = extract_elements(pdf_path, layout_info)
print(f"Number of extracted elements: {len(elements)}")

Number of extracted elements: 0
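
The cropping step depends on turning a four-corner bounding box into the `(left, top, right, bottom)` tuple that `Image.crop` expects. The sketch below isolates that conversion. The corner format (a list of `{"x", "y"}` dictionaries) follows the code above; the `scale_x` and `scale_y` parameters are hypothetical and only matter if the API's coordinates and the rendered pixmap use different resolutions, which this tutorial does not confirm either way.

```python
from typing import Dict, List, Tuple


def bbox_to_crop_box(
    bounding_box: List[Dict[str, float]],
    scale_x: float = 1.0,
    scale_y: float = 1.0,
) -> Tuple[int, int, int, int]:
    """Convert corner points into a (left, top, right, bottom) crop box."""
    xs = [point["x"] * scale_x for point in bounding_box]
    ys = [point["y"] * scale_y for point in bounding_box]
    # Using min/max makes the result independent of the corner ordering
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


# Made-up corners: top-left, top-right, bottom-right, bottom-left
corners = [
    {"x": 10, "y": 20},
    {"x": 110, "y": 20},
    {"x": 110, "y": 70},
    {"x": 10, "y": 70},
]
print(bbox_to_crop_box(corners))
print(bbox_to_crop_box(corners, scale_x=0.5, scale_y=0.5))
```

Taking the minimum and maximum over all corners avoids relying on the first and third points being the top-left and bottom-right ones.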


### Element Analysis Using GPT

Analyze the extracted tables and images using the GPT-4o API and generate detailed descriptions.

In [34]:
import openai
import io
import base64


def analyze_elements_with_gpt(elements: List[Dict]) -> List[str]:
    descriptions = []
    for element in elements:
        # Convert the image to PNG bytes for base64 encoding
        img_byte_arr = io.BytesIO()
        element["image"].save(img_byte_arr, format="PNG")
        img_byte_arr = img_byte_arr.getvalue()

        # Call the GPT-4o API (openai>=1.0 interface)
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"Describe the contents of this {element['type']} in detail. Extract the information that matters in the context of a resume.",
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{base64.b64encode(img_byte_arr).decode()}"
                            },
                        },
                    ],
                }
            ],
        )
        descriptions.append(response.choices[0].message.content)

    return descriptions


# Run the GPT analysis
element_descriptions = analyze_elements_with_gpt(elements)
print(
    "Sample generated description:",
    element_descriptions[0][:200] if element_descriptions else "No description",
)

Sample generated description: No description
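
The image is passed to the model as a base64 data URL. The sketch below shows that encoding step on its own, without any API call. `to_data_url` is a hypothetical helper name, and the sample bytes are just the 8-byte PNG file signature rather than a real image.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL suitable for an image_url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The PNG file signature, used here as stand-in image bytes
png_signature = b"\x89PNG\r\n\x1a\n"
url = to_data_url(png_signature)
print(url)

# Decoding the payload gives back the original bytes
payload = url.split(",", 1)[1]
print(base64.b64decode(payload) == png_signature)
```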


## Generate and Store Embeddings

Generate embeddings for the extracted text and descriptions.
These embeddings will be used later for company matching.

In [35]:
def create_embeddings(texts: List[str]) -> List[List[float]]:
    embeddings = OpenAIEmbeddings()
    return embeddings.embed_documents(texts)


# Combine all text content
all_content = text_chunks + element_descriptions

# Generate embeddings (stored under a new name so the `embeddings` model object is not overwritten)
resume_embeddings = create_embeddings(all_content)
print(f"Number of generated embeddings: {len(resume_embeddings)}")
print(resume_embeddings[0][:5])  # Preview the first five values of the first vector

Number of generated embeddings: 12
[-0.0073113758116960526, -0.02110404521226883, -0.02024075761437416, -0.023853281512856483, 0.00102432316634804]
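
Matching later relies on comparing embedding vectors. As a rough illustration of what such a comparison measures, the sketch below computes cosine similarity in plain Python. It uses tiny made-up vectors and is not how ChromaDB performs its search internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up 2-D vectors standing in for real embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Vectors pointing in similar directions score close to 1, which is the intuition behind using embeddings to match resumes with job postings.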

Store the processed resume data in ChromaDB.

In [36]:
# Store the resume data
resume_id = "resume_001"  # A unique ID should be generated in a real implementation
resume_vectorstore.add_texts(
    texts=all_content,
    metadatas=[{"source": "resume", "type": "text"} for _ in range(len(all_content))],
    ids=[f"{resume_id}_chunk_{i}" for i in range(len(all_content))],
)

['resume_001_chunk_0',
 'resume_001_chunk_1',
 'resume_001_chunk_2',
 'resume_001_chunk_3',
 'resume_001_chunk_4',
 'resume_001_chunk_5',
 'resume_001_chunk_6',
 'resume_001_chunk_7',
 'resume_001_chunk_8',
 'resume_001_chunk_9',
 'resume_001_chunk_10',
 'resume_001_chunk_11']
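
The hard-coded `resume_001` is a placeholder. One possible way to derive a stable, unique ID is to hash the PDF's bytes, so that uploading the same file twice yields the same ID. This is a sketch under that assumption; `make_resume_id` and `chunk_ids` are hypothetical helpers, not part of the tutorial.

```python
import hashlib
from typing import List


def make_resume_id(pdf_bytes: bytes, prefix: str = "resume") -> str:
    """Derive a deterministic ID from the file content (hypothetical scheme)."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()[:12]
    return f"{prefix}_{digest}"


def chunk_ids(resume_id: str, n_chunks: int) -> List[str]:
    """Build per-chunk IDs in the same format the tutorial uses."""
    return [f"{resume_id}_chunk_{i}" for i in range(n_chunks)]


# Stand-in bytes; in practice these would come from open(pdf_path, "rb").read()
resume_id_example = make_resume_id(b"example pdf content")
print(resume_id_example)
print(chunk_ids(resume_id_example, 3))
```

A content-based ID avoids storing duplicate chunks for the same file, whereas a timestamp-based ID creates a new set of chunks on every run.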

## Recommend Relevant Companies

Recommend suitable companies based on the stored resumes.

In [37]:
import pandas as pd
from typing import List
import math


def process_company_data_limited(csv_path: str, limit: int = 200):
    # Read the CSV file (first `limit` rows only)
    df = pd.read_csv(csv_path, nrows=limit)
    total_processed = 0

    print(f"Processing {limit} job postings...")

    for _, row in df.iterrows():
        # Format salary information
        salary_info = ""
        if pd.notna(row.get("min_salary")) and pd.notna(row.get("max_salary")):
            salary_info = f"Salary: {row['min_salary']} - {row['max_salary']} {row.get('currency', 'USD')} {row.get('pay_period', '')}"

        # Construct company description
        company_description = f"""
        Company: {row['company_name']}
        Job Title: {row['title']}
        Location: {row['location']}
        Work Type: {row.get('formatted_work_type', '')}
        Experience Level: {row.get('formatted_experience_level', '')}
        {salary_info}
        
        Job Description:
        {row['description']}
        
        Required Skills:
        {row.get('skills_desc', '')}
        """

        # Generate and store embeddings
        try:
            company_vectorstore.add_texts(
                texts=[company_description],
                metadatas=[
                    {
                        "job_id": str(row["job_id"]),
                        "company_name": row["company_name"],
                        "title": row["title"],
                        "location": row["location"],
                        "work_type": row.get("formatted_work_type", ""),
                        "salary_range": salary_info,
                        "remote_allowed": row.get("remote_allowed", False),
                        "type": "company",
                    }
                ],
                ids=[f"job_{row['job_id']}"],
            )
            total_processed += 1

            if total_processed % 50 == 0:
                print(f"Processed {total_processed} job postings...")

        except Exception as e:
            print(f"Error processing job {row['job_id']}: {str(e)}")
            continue

    print(f"\nCompleted! Total processed jobs: {total_processed}")


# Execution
csv_path = "../data/postings.csv"
process_company_data_limited(csv_path, limit=200)

Processing 200 job postings...
Processed 50 job postings...
Processed 100 job postings...
Processed 150 job postings...
Processed 200 job postings...

Completed! Total processed jobs: 200
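
Missing CSV cells are read by pandas as NaN, and formatting them directly into the description produces literal `nan` text (visible later in the sample output as `Company: nan`). The sketch below shows one way to guard against that. `clean` and `format_salary` are hypothetical helpers, and the sample values are made up.

```python
import math
from typing import Any


def clean(value: Any, default: str = "") -> str:
    """Return a string for value, replacing None and NaN with default."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    return str(value)


def format_salary(
    min_salary: Any, max_salary: Any, currency: Any = "USD", pay_period: Any = ""
) -> str:
    """Format a salary range, or return an empty string if either bound is missing."""
    low, high = clean(min_salary), clean(max_salary)
    if not low or not high:
        return ""
    return f"Salary: {low} - {high} {clean(currency, 'USD')} {clean(pay_period)}".strip()


print(repr(clean(float("nan"))))
print(format_salary(50000, 70000, "USD", "YEARLY"))
print(repr(format_salary(float("nan"), 70000)))
```

Applying a helper like `clean` to fields such as `company_name` before building the description and metadata would keep `nan` strings out of the stored documents.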


### `find_matching_companies`

The `find_matching_companies` function returns the `top_k` job postings most similar to a stored resume. It retrieves the resume chunks from ChromaDB, extracts up to 15 keywords with `TfidfVectorizer`, runs a similarity search over the company collection, and attaches human-readable matching reasons (title and keyword overlap, salary, location, and an overall score) to each result.


In [38]:
def find_matching_companies(resume_id: str, top_k: int = 3):
    # Get the stored chunks and keep only those that belong to this resume
    resume_data = resume_vectorstore.get(
        where={"source": "resume"}, include=["documents", "metadatas"]
    )
    resume_chunks = [
        doc
        for doc_id, doc in zip(resume_data["ids"], resume_data["documents"])
        if doc_id.startswith(f"{resume_id}_chunk_")
    ]

    if not resume_chunks:
        raise ValueError(f"Resume not found: {resume_id}")

    full_resume_text = " ".join(resume_chunks)

    # Extract key information from resume
    from sklearn.feature_extraction.text import TfidfVectorizer

    # 1. Extract key keywords
    vectorizer = TfidfVectorizer(max_features=15, stop_words="english")
    tfidf = vectorizer.fit_transform([full_resume_text])
    resume_keywords = vectorizer.get_feature_names_out()

    # 2. Company search
    search_results = company_vectorstore.similarity_search_with_score(
        query=full_resume_text, k=top_k
    )

    matches = []
    for i, (doc, score) in enumerate(search_results):
        matching_reasons = []

        # Basic information
        job_title = doc.metadata.get("title", "")
        company_name = doc.metadata.get("company_name", "")
        location = doc.metadata.get("location", "")
        description = doc.page_content

        # 1. Job relevance
        title_match = any(
            keyword.lower() in job_title.lower() for keyword in resume_keywords
        )
        if title_match:
            matching_reasons.append("Job title matches resume experience")

        # 2. Skills and keyword matching
        matched_keywords = [
            keyword
            for keyword in resume_keywords
            if keyword.lower() in description.lower()
        ]
        if matched_keywords:
            matching_reasons.append(
                f"Matching resume keywords: {', '.join(matched_keywords)}"
            )

        # 3. Salary information (stored under "salary_range" in the metadata)
        if doc.metadata.get("salary_range"):
            matching_reasons.append(doc.metadata["salary_range"])

        # 4. Location based
        if location:
            matching_reasons.append(f"Location: {location}")

        # 5. Overall job compatibility
        matching_reasons.append(
            f"Overall job compatibility: {(1-float(score))*100:.1f}%"
        )

        matches.append(
            {
                "rank": i + 1,
                "company_name": company_name,
                "title": job_title,
                "location": location,
                "match_score": 1 - float(score),
                "matching_reasons": matching_reasons,
                "job_description": description[:300]
                + "...",  # Include partial job description
            }
        )

    return matches
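
The function reports `1 - score` as a compatibility percentage, which treats the returned score as a distance between 0 and 1. Whether that holds depends on the distance metric of the collection. As a sketch under one specific assumption, namely that the store returns squared L2 distance over unit-length embeddings, the distance can be converted to cosine similarity with `1 - d / 2`, because `||a - b||^2 = 2 - 2 * cos(a, b)` for unit vectors. Both helper names below are hypothetical.

```python
def squared_l2_to_cosine(distance: float) -> float:
    """Convert squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0


def similarity_to_percent(similarity: float) -> float:
    """Clamp a similarity to [0, 1] and express it as a percentage."""
    return max(0.0, min(1.0, similarity)) * 100.0


# Check the identity on two made-up unit vectors
a = [1.0, 0.0]
b = [0.6, 0.8]
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
dot = sum(x * y for x, y in zip(a, b))  # cosine similarity, since both have length 1

print(round(squared_l2_to_cosine(squared_l2), 6), round(dot, 6))
print(similarity_to_percent(squared_l2_to_cosine(squared_l2)))
```

Clamping keeps the displayed percentage in a sensible range even if a distance larger than expected comes back from the search.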

## Example Usage of Company Recommendation

This section introduces our main orchestration function that combines all previous components into a streamlined workflow. The function `process_resume_and_find_matches` serves as the central coordinator for our recommendation system, handling everything from initial PDF processing to final company matching.

Key Features:

1. Automated Processing: Handles the entire pipeline from PDF input to company recommendations
2. Comprehensive Analysis: Combines text, layout, and visual element analysis
3. Intelligent Matching: Utilizes embeddings and similarity scoring for company recommendations
4. Detailed Output: Provides structured results including match scores and reasoning

The function takes a PDF resume path as input and returns detailed matching results with top company recommendations, making it easy to use the entire system with just a single function call.


In [39]:
import time


def process_resume_and_find_matches(pdf_path: str, top_k: int = 3):
    # 1. Extract text
    print("1. Extracting text...")
    text_chunks = extract_text_from_pdf(pdf_path)

    # 2. Analyze layout
    print("2. Analyzing layout...")
    layout_info = analyze_layout(pdf_path)

    # 3. Extract elements
    print("3. Extracting tables and images...")
    elements = extract_elements(pdf_path, layout_info)

    # 4. GPT analysis
    print("4. Analyzing elements...")
    element_descriptions = analyze_elements_with_gpt(elements)

    # 5. Create and store embeddings
    print("5. Creating and storing embeddings...")
    all_content = text_chunks + element_descriptions
    embeddings = create_embeddings(all_content)

    resume_id = f"resume_{int(time.time())}"
    resume_vectorstore.add_texts(
        texts=all_content,
        metadatas=[
            {"source": "resume", "type": "text"} for _ in range(len(all_content))
        ],
        ids=[f"{resume_id}_chunk_{i}" for i in range(len(all_content))],
    )

    # 6. Company matching
    print("6. Matching companies...")
    matches = find_matching_companies(resume_id, top_k)

    # 7. Output results
    print("\n=== Resume Processing Results ===")
    print(f"- Resume ID: {resume_id}")
    print(f"- Extracted text chunks: {len(text_chunks)}")
    print(f"- Extracted tables/images: {len(elements)}")

    # New output format
    print("\n=== Recommended Companies and Positions ===")
    for match in matches:
        print(f"\n#{match['rank']} {match['company_name']}")
        print(f"Position: {match['title']}")
        print(f"Location: {match['location']}")
        print(f"Match Score: {match['match_score']:.2f}")
        print("\nJob Description:")
        print(match["job_description"])
        print("\nRecommendation Reasons:")
        for reason in match["matching_reasons"]:
            print(f"- {reason}")
        print("\n" + "-" * 50)

    return {
        "resume_id": resume_id,
        "text_chunks": len(text_chunks),
        "elements": len(elements),
        "matches": matches,
    }


# Execute
result = process_resume_and_find_matches("../data/joannadrummond-cv.pdf")

1. Extracting text...
2. Analyzing layout...
3. Extracting tables and images...
4. Analyzing elements...
5. Creating and storing embeddings...
6. Matching companies...

=== Resume Processing Results ===
- Resume ID: resume_1736345486
- Extracted text chunks: 12
- Extracted tables/images: 0

=== Recommended Companies and Positions ===

#1 
Position: Software Engineer
Location: Los Angeles Metropolitan Area
Match Score: 0.53

Job Description:

        Company: nan
        Job Title: Software Engineer
        Location: Los Angeles Metropolitan Area
        Work Type: Full-time
        Experience Level: nan
        
        
        Job Description:
        Education Bachelor's degree in software, math, or science required Job Skills Analy...

Recommendation Reasons:
- Matching resume keywords: science
- Location: Los Angeles Metropolitan Area
- Overall job compatibility: 53.0%

--------------------------------------------------

#2 Kona Medical Consulting
Position: Board Certified Behavio

## Resume Evaluation

Evaluate the resume based on the recommended companies.
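
The tutorial does not yet include the evaluation code. As a minimal sketch of one possible starting point, the hypothetical helper below turns a resume text and one entry of the `matches` list returned by `find_matching_companies` into a review prompt that could be sent to a chat model. The function name, the prompt wording, and the sample data are assumptions for illustration and are not part of the original notebook.

```python
from typing import Dict


def build_evaluation_prompt(resume_text: str, match: Dict) -> str:
    """Assemble a review prompt for one recommended position (hypothetical helper)."""
    reasons = "\n".join(f"- {reason}" for reason in match.get("matching_reasons", []))
    return (
        "Review the resume below against the job posting and suggest "
        "concrete improvements.\n\n"
        f"Company: {match.get('company_name', '')}\n"
        f"Position: {match.get('title', '')}\n"
        f"Location: {match.get('location', '')}\n\n"
        f"Why this posting was recommended:\n{reasons}\n\n"
        f"Job description (excerpt):\n{match.get('job_description', '')}\n\n"
        f"Resume:\n{resume_text}\n"
    )


# Made-up sample with the same keys as the entries built in find_matching_companies
sample_match = {
    "rank": 1,
    "company_name": "Example Corp",
    "title": "Software Engineer",
    "location": "Los Angeles Metropolitan Area",
    "match_score": 0.53,
    "matching_reasons": ["Matching resume keywords: science"],
    "job_description": "Bachelor's degree in software, math, or science required...",
}

prompt = build_evaluation_prompt("Five years of Python development.", sample_match)
print(prompt)
```

The resulting string could then be passed as the user message of a chat completion call, once per recommended position.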

