# 📊 SEEK Case Study

We propose two solutions for the SEEK case study:
1. **LLM-Based Information Extraction from Job Ads** 
2. **Smart Job Search** 

This notebook focuses on **Solution 1 only**.  

---

## 📁 Notebook Outline (Solution 1 only)

0. **Import Libraries** – Load required packages  
1. **Load & Preprocess Data** – Read and clean job ads  
2. **EDA** – Explore key patterns in the data  
3. **LLM-Based Information Extraction from Job Ads** – Use LLMs to extract metadata from job ads
4. **Evaluation** – Assess extraction performance

---

## 🚀 Solution 2: Smart Job Search

A separate Streamlit app demonstrates **Solution 2: Smart Job Search**, using vector search and LLMs to enhance the job seeker experience:

- 🌐 [Live Demo](https://smart-job-search-6c2561d72e46.herokuapp.com/)  
- 💻 [Code](https://github.com/psunthorn13/seek-case-study)

For more details, see the project slide deck.

# 0. Import Libraries

Load required packages

In [None]:
# !pip install -r requirements.txt

In [8]:
import json
from openai import OpenAI
from pydantic import BaseModel
import re

import pandas as pd
import plotly.express as px

# 1. Load & Preprocess Data

In this section, we load the job ads data from a JSON file and preprocess the text content by removing HTML tags.

In [9]:
# Read the JSON Lines file (one JSON record per line) into a pandas DataFrame
with open('data/ads-50k.json', 'r', encoding='utf-8') as f:
    records = [json.loads(line) for line in f]

job_df = pd.DataFrame(records)

# Flatten the metadata column
metadata_df = pd.json_normalize(job_df['metadata']).add_prefix('metadata.')
job_df = pd.concat([job_df, metadata_df], axis=1)

In [10]:
# Function to remove HTML tags
def clean_html(text):
    if pd.isna(text):
        return text
    # Remove HTML tags
    clean_text = re.sub(r'<.*?>', ' ', str(text))
    # Replace multiple spaces with a single space
    clean_text = re.sub(r'\s+', ' ', clean_text)
    # Strip leading and trailing spaces
    clean_text = clean_text.strip()
    return clean_text

# Apply the cleaning function to the content column
job_df['cleaned_content'] = job_df['content'].apply(clean_html)
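
The sample rows printed in the EDA section still show entities such as `&middot;` and `&nbsp;` in `cleaned_content`, because `clean_html` strips tags but does not decode HTML entities. A minimal sketch of a variant that also decodes them is given here; the name `clean_html_text` is hypothetical and is not used elsewhere in this notebook.

```python
import html
import re


def clean_html_text(text):
    """Strip HTML tags, decode entities, and collapse whitespace (hypothetical variant of clean_html)."""
    # Leave missing values (None, NaN) and other non-string inputs untouched
    if not isinstance(text, str):
        return text
    # Remove tags first; re.DOTALL lets a tag span a line break
    no_tags = re.sub(r'<.*?>', ' ', text, flags=re.DOTALL)
    # Decode entities such as &nbsp; and &middot; into their characters
    decoded = html.unescape(no_tags)
    # \s also matches the non-breaking space produced by &nbsp;, so it is collapsed too
    return re.sub(r'\s+', ' ', decoded).strip()


print(clean_html_text('<HTML><p>&middot;&nbsp;&nbsp;Casual hours as required</p>'))
# · Casual hours as required
```

It could be applied in the same way as the original function, for example `job_df['content'].apply(clean_html_text)`. Note that re-running the notebook with this variant would change the cleaned text that is sent to the LLM.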
    

# 2. Exploratory Data Analysis (EDA)

In this section, we perform **Exploratory Data Analysis (EDA)** to better understand the dataset. We will:

- Examine **missing values** across key features  
- Visualise the **distribution of categorical variables**  
- Explore **word count patterns** in abstract and content fields  

---

### 📊 Dataset Overview

The dataset is provided in **JSON format** and contains the following fields:

- **Job ID** – Unique identifier  
- **Title** – Job title  
- **Abstract** – Short summary of the role  
- **Content** – Full job description  
- **Metadata** – Salary, bullet points, classification, location, etc.

---

### 🧩 Missing Values Analysis

| Feature            | Missing Percentage |
|--------------------|--------------------|
| Salary Text        | 67%                |
| Bullet Points      | 47%                |
| Area Name          | 34%                |
| Suburb Name        | 26%                |

✅ **No missing values** in: `Title`, `Abstract`, `Content`, `Classification`, `Sub-class`, `Location Name`, `Work Type`

---

### 📍 Key Findings

- **Job Type**: ~70% of positions are **Full-Time**  
- **Location**: ~47% of roles are in **Sydney (26%)** and **Melbourne (21%)**  
- **Text Length**:  
  - `Abstract` typically contains ~**20 words**  
  - `Content` typically contains ~**300 words**

---

### 📈 Top Job Classifications

| Classification                              | Share |
|---------------------------------------------|-------|
| Information & Communication Technology     | 11%  |
| Trades & Services                          | 10%  |
| Healthcare & Medical                       |  9%  |
| Manufacturing, Transport & Logistics      |  6%  |
| Accounting                                 |  5%  |

---


In [11]:
# Function to generate null percentage dataframe
def generate_null_percentage(df: pd.DataFrame) -> pd.DataFrame:
    # Calculate null count and percentage for each column
    null_counts = df.isnull().sum()
    total_rows = len(df)
    null_percentages = (null_counts / total_rows * 100).round(2)
    
    # Create result dataframe
    null_df = pd.DataFrame({
        'Column': null_counts.index,
        'Null Count': null_counts.values,
        'Null Percentage': null_percentages.values
    })
    
    # Sort by null percentage in descending order
    null_df = null_df.sort_values('Null Percentage', ascending=False)
    
    return null_df

# Function to plot distribution of a specified column
def plot_distribution(df: pd.DataFrame, 
                      column_name: str,
                      title: str | None = None,
                      top_n: int | None = 10) -> None:
    # Calculate and sort percentage distribution in descending order
    dist = df[column_name].value_counts(normalize=True).sort_values(ascending=False).head(top_n) * 100
    
    # Sort for better visualization
    dist = dist.sort_values(ascending=True) 
    
    # Set default title if none provided
    if title is None:
        title = f'Distribution of {column_name} (%)'
    
    # Create horizontal bar chart
    fig = px.bar(
        dist,
        x=dist.values,
        y=dist.index,
        orientation='h',
        labels={'x': 'Percentage', 'y': ''},  # Empty y label
        title=title,
        text=dist.values.round(1)  # Show values on bars
    )
    
    # Format the text to show percentage with 1 decimal place
    fig.update_traces(texttemplate='%{text:.1f}%', textposition='inside')
    
    # Hide the y-axis title
    fig.update_layout(yaxis_title=None)
    
    fig.show()

def plot_word_count_distribution(df: pd.DataFrame, 
                                column_name: str, 
                                title: str | None = None,
                                bins: int = 20) -> None:

    # Calculate word counts for the specified column
    # Filter out missing values first
    word_counts = df[column_name].dropna().apply(lambda x: len(str(x).split()))
    
    # Set default title if none provided
    if title is None:
        title = f'Distribution of Word Count in {column_name}'
    
    # Create histogram
    fig = px.histogram(
        word_counts, 
        x=word_counts,
        nbins=bins,
        labels={'x': 'Number of Words', 'y': 'Count'},
        title=title,
        text_auto=True  # Show count values on bars
    )
    
    # Add average line
    mean_value = word_counts.mean()
    fig.add_vline(x=mean_value, line_dash="dash", line_color="red",
                  annotation_text=f"Mean: {mean_value:.1f} words",
                  annotation_position="top right")
    
    # Format layout
    fig.update_layout(bargap=0.1)
    
    fig.show()

In [12]:
# Display the first few rows of the job dataframe
job_df.head()

Unnamed: 0,id,title,abstract,content,metadata,metadata.additionalSalaryText,metadata.standout.bullet1,metadata.standout.bullet2,metadata.standout.bullet3,metadata.classification.name,metadata.subClassification.name,metadata.location.name,metadata.workType.name,metadata.area.name,metadata.suburb.name,cleaned_content
0,38915469,Recruitment Consultant,We are looking for someone to focus purely on ...,<HTML><p>Are you looking to join a thriving bu...,{'standout': {'bullet1': 'Join a Sector that i...,commission,Join a Sector that is considered Recession Pro...,Excellent opportunity for Career Progression ...,Make a Diference whilst earning Money and havi...,Education & Training,Other,Sydney,Full Time,,,Are you looking to join a thriving business th...
1,38934839,Computers Salesperson - Coburg,Passionate about exceptional customer service?...,<HTML><p>&middot;&nbsp;&nbsp;Casual hours as r...,{'additionalSalaryText': 'Attractive Commissio...,Attractive Commission - Uncapped Earning Poten...,,,,Retail & Consumer Products,Retail Assistants,Melbourne,Casual/Vacation,Northern Suburbs,Coburg,&middot;&nbsp;&nbsp;Casual hours as required (...
2,38946054,Senior Developer | SA,Readifarians are known for discovering the lat...,<HTML><p>Readify helps organizations innovate ...,"{'standout': {'bullet1': 'Design, develop, tes...",,"Design, develop, test and deliver custom softw...",Keep your skills current with 20 x paid profes...,Flexible & inclusive work environment,Information & Communication Technology,Consultants,Adelaide,Full Time,,,Readify helps organizations innovate with tech...
3,38833950,Senior Commercial Property Manager | Leading T...,~ Rare opportunity for a Senior PM to step int...,<HTML><p><strong>WayPoint Recruitment&nbsp;</s...,{'additionalSalaryText': '$140k + Car Park - C...,$140k + Car Park - Call James Calleja 0430 058...,,,,Real Estate & Property,"Commercial Sales, Leasing & Property Mgmt",Melbourne,Full Time,CBD & Inner Suburbs,Melbourne,WayPoint Recruitment&nbsp; have partnered up w...
4,38856271,Technology Manager | Travel Industry,Rare opportunity for an experienced Technology...,<HTML>This is a key role within a market leadi...,{'standout': {'bullet1': 'Lead overarching str...,$110k - $120k p.a. + Numerous Perks!,Lead overarching strategy around Technology wi...,You will be responsible for all Technology and...,Competitive Salary package of $110K - $120K + ...,Information & Communication Technology,Management,Auckland,Full Time,,,This is a key role within a market leading Tra...


In [13]:
# Generate null percentage dataframe
null_df = generate_null_percentage(job_df)
null_df

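|    | Column                        | Null Count | Null Percentage |
|----|-------------------------------|------------|-----------------|
| 5  | metadata.additionalSalaryText | 33651      | 67.3            |
| 6  | metadata.standout.bullet1     | 23315      | 46.63           |
| 7  | metadata.standout.bullet2     | 23315      | 46.63           |
| 8  | metadata.standout.bullet3     | 23315      | 46.63           |
| 13 | metadata.area.name            | 17156      | 34.31           |
| 14 | metadata.suburb.name          | 12998      | 26.0            |
| 0  | id                            | 0          | 0.0             |
| 1  | title                         | 0          | 0.0             |
| 2  | abstract                      | 0          | 0.0             |
| 3  | content                       | 0          | 0.0             |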


In [14]:
# Plot distributions for key categorical columns
columns_to_plot_distribution = [
    'metadata.classification.name',
    'metadata.subClassification.name',
    'metadata.location.name',
    'metadata.area.name',
    'metadata.suburb.name',
    'metadata.workType.name'
]

# Plot distributions
for col in columns_to_plot_distribution:
    plot_distribution(job_df, col)


In [15]:
# Plot word count distributions for text columns
columns_to_plot_word_count = [
    'abstract',
    'cleaned_content'
]

# Plot word count distributions
for col in columns_to_plot_word_count:
    plot_word_count_distribution(job_df, col)

# 3. LLM-Based Information Extraction

In this section, we implement **Solution 1** by using an **LLM** to automatically extract key information from job advertisements.

We will focus on extracting the following components from each job ad:

- 🧠 **Soft Skills** – Interpersonal, communication, teamwork, leadership, problem-solving, and other non-technical skills.
- 🛠️ **Hard Skills** – Technical skills, tools, programming languages, certifications, domain knowledge, etc.
- 📋 **Responsibilities** – Core tasks and duties the candidate will be expected to perform.
- ✅ **Other Requirements** – Qualifications, education, years of experience, work eligibility, language requirements, etc.

---

### 🧰 Implementation

1. **Define the System Prompt**  
   Craft a detailed prompt that instructs the LLM to parse and categorise job ad text accurately. The prompt below is the result of a quick round of prompt engineering.

2. **Create a Function to Call the OpenAI API**  
   Use Python to send requests to the LLM with the system prompt and the job description as input.

3. **Apply to a Sample of Job Ads**  
   Test the extraction pipeline on a subset of the dataset and inspect the structured outputs.

---


In [50]:
SYSTEM_PROMPT = """
You are an expert job advertisement analyser. Your task is to read the provided job ad text and extract key information into a structured JSON format.

Follow these instructions carefully:

1. Be precise and concise — extract only what is explicitly or implicitly mentioned in the ad.
2. Organise the extracted information into the following categories:
   - soft_skills: Interpersonal, communication, teamwork, leadership, problem-solving, and other non-technical skills.
   - hard_skills: Technical skills, tools, programming languages, certifications, domain knowledge, etc.
   - responsibilities: Core tasks and duties the candidate will be expected to perform.
   - other_requirements: Qualifications, education, years of experience, work eligibility, language requirements, etc.
3. If a category is not mentioned, return it as an empty array [].
4. Always return valid JSON in this format:

{
  "soft_skills": [],
  "hard_skills": [],
  "responsibilities": [],
  "other_requirements": []
}

Example:

Input job ad:
"We’re seeking a Senior Data Analyst who excels in communication and stakeholder management. You’ll design dashboards, write complex SQL queries, and present insights to leadership. Must have 5+ years of experience in data analytics and a bachelor’s degree in statistics or related field."

Output:
{
  "soft_skills": ["communication", "stakeholder management", "presentation skills"],
  "hard_skills": ["dashboard design", "SQL", "data analytics"],
  "responsibilities": ["design dashboards", "write SQL queries", "present insights to leadership"],
  "other_requirements": ["5+ years experience", "bachelor’s degree in statistics or related field"]
}

"""

In [53]:
# Sample 50 job ads for evaluation
sample_job_df = job_df.sample(n=50, random_state=42, replace=False)
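# Keep only the columns needed downstream; .copy() avoids pandas' SettingWithCopyWarning on later assignments
sample_job_df = sample_job_df[['id', 'title', 'cleaned_content']].copy()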

In [54]:
# Initialise OpenAI client
client = OpenAI()

# Define the response model using Pydantic
class JobInfoResponse(BaseModel):
    soft_skills: list[str]
    hard_skills: list[str]
    responsibilities: list[str]
    other_requirements: list[str]

# Create a function to extract job info using OpenAI API    
def extract_job_info(job_ad: str) -> dict:
    try:
        response = client.responses.parse(
            model="gpt-5-mini",
            input=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {
                    "role": "user",
                    "content": job_ad,
                },
            ],
            text_format=JobInfoResponse,
        )
        
        result = response.output_parsed

        return {
            'soft_skills': result.soft_skills,
            'hard_skills': result.hard_skills,
            'responsibilities': result.responsibilities,
            'other_requirements': result.other_requirements
        }
    except Exception as e:
        print(f"Error processing job ad: {e}")
        return {
            'soft_skills': [],
            'hard_skills': [],
            'responsibilities': [],
            'other_requirements': []
        }

# Apply the function to create new columns
results = sample_job_df['cleaned_content'].apply(extract_job_info)

# Convert the results into separate columns
sample_job_df.loc[:,'soft_skills'] = results.apply(lambda x: x['soft_skills'])
sample_job_df.loc[:,'hard_skills'] = results.apply(lambda x: x['hard_skills'])
sample_job_df.loc[:,'responsibilities'] = results.apply(lambda x: x['responsibilities'])
sample_job_df.loc[:,'other_requirements'] = results.apply(lambda x: x['other_requirements'])

In [56]:
# Display the extracted information for the first few job ads
sample_job_df[['soft_skills','hard_skills','responsibilities','other_requirements']].head()

Unnamed: 0,soft_skills,hard_skills,responsibilities,other_requirements
33553,"[excellent communication, customer service, pr...","[Move CRM, LJ Hooker systems, CRM and back-end...",[support development and delivery of systems t...,[minimum 3 years' experience in a variety of R...
9427,"[good phone manner, attention to detail, stron...","[accurate keyboarding, practical knowledge of ...","[front desk reception, office and administrati...","[must hold valid certificate of registration, ..."
199,"[communication, attention to detail, customer ...","[Point of Sales (POS) system operation, MYOB R...",[identify customer needs and assist with resea...,"[valid Australian manual driver's licence, own..."
12447,[approachable (for internal and external custo...,[operation of Boeing 737 (fixed wing flight op...,[carry out commands or requests related to air...,[New Zealand and/or Australian Airline Transpo...
39489,"[resourcefulness, teamwork / ability to work i...","[operation, monitoring and maintenance of ship...",[repair and maintain a wide variety of systems...,"[minimum age 17 years, Australian citizen, Yea..."


In [60]:
# Define the path to save the evaluation dataset
EVALUATION_DATASET_PATH = 'data/eval_jobinfo_data.csv'

# Save the sample job DataFrame with extracted information to CSV
sample_job_df.to_csv(EVALUATION_DATASET_PATH, index=False)

In [127]:
sample_job_df

Unnamed: 0,id,title,cleaned_content,soft_skills,hard_skills,responsibilities,other_requirements
33553,38885743,"SYSTEMS TRAINING & SUPPORT SPECIALIST, LJ Hook...",Newly located in Sydney’s CBD A great opportun...,"[excellent communication, customer service, pr...","[Move CRM, LJ Hooker systems, CRM and back-end...",[support development and delivery of systems t...,[minimum 3 years' experience in a variety of R...
9427,38885221,Receptionist,This is an outstanding opportunity for someone...,"[good phone manner, attention to detail, stron...","[accurate keyboarding, practical knowledge of ...","[front desk reception, office and administrati...","[must hold valid certificate of registration, ..."
199,38918237,Sales Person / Forklift Driver / Warehouse,ABOUT THE COMPANY Simon’s Seconds and iPave is...,"[communication, attention to detail, customer ...","[Point of Sales (POS) system operation, MYOB R...",[identify customer needs and assist with resea...,"[valid Australian manual driver's licence, own..."
12447,38872849,B737 First Officers,Airwork Fixed Wing (AFW) Division is part of t...,[approachable (for internal and external custo...,[operation of Boeing 737 (fixed wing flight op...,[carry out commands or requests related to air...,[New Zealand and/or Australian Airline Transpo...
39489,38957774,Marine Technician Submariner,"Opportunity As a Marine Technician Submariner,...","[resourcefulness, teamwork / ability to work i...","[operation, monitoring and maintenance of ship...",[repair and maintain a wide variety of systems...,"[minimum age 17 years, Australian citizen, Yea..."
42724,38835993,QA Manager,Freedom Foods Group Limited is an ASX listed i...,"[collaboration with multiple stakeholders, inf...","[Good Manufacturing Practices (GMP), SQF (Safe...","[implement, maintain and ensure functioning of...",[tertiary qualification in Food Science or a r...
10822,38952335,Real Estate Sales,REAP Recruitment specialise in recruitment for...,[],"[REA licence, completion of required REA study...",[pursue Real Estate Sales roles with leading N...,[be licenced by the Real Estate Authority (REA...
49498,38900437,Maintenance Project Engineer - Transport / Kai...,Porirua's changing. Thanks to new housing deve...,"[adaptability / embrace change, ownership / ac...","[transport network maintenance, consent manage...","[maintain transport networks, pick up and deli...",[several years' experience in contract managem...
4144,38859133,Heavy Vehicle Mechanic - Narellan,Qualified heavy vehicle mechanic wanted to wor...,"[well organised, motivated, high energy, auton...","[heavy vehicle / truck mechanic, light and med...","[perform light and medium truck servicing, per...","[qualified heavy vehicle mechanic, experienced..."
36958,38946240,Contracts Administrator - Fit-out/Construction,One of Melbourne's most&nbsp; talked about and...,"[drive and determination, people skills, cultu...","[contract administration, project management, ...",[contract administration for commercial fit-ou...,"[based in Melbourne (implied), current or prio..."


# 4. Evaluation

We evaluate the extraction performance of our LLM (from Section 3) using a stronger model (GPT-5) as a judge. The judge compares extracted information against the original job ad and classifies each item as:

- ✅ Correct – accurately extracted
- ❌ Incorrect – wrong or hallucinated
- 🕳️ Missing – relevant but not extracted

From these results, we compute precision and recall for each category.
We performed manual spot checks to confirm the judge’s reliability. Using an LLM as a judge is a cost-effective, scalable evaluation method. For production, a hybrid approach combining LLM judgments with human annotations is recommended.
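
As a worked illustration of these metrics: precision is the share of extracted items the judge marks correct, `correct / (correct + incorrect)`, and recall is the share of relevant items that were extracted, `correct / (correct + missing)`. The sketch below applies this to the `soft_skills` counts from the summary table at the end of this section. The helper name `precision_recall` and its zero-denominator guard are assumptions of this sketch, not part of the notebook's own code.

```python
def precision_recall(correct: int, incorrect: int, missing: int) -> tuple[float, float]:
    """Precision and recall from the judge's counts; returns 0.0 when a denominator is zero."""
    extracted = correct + incorrect  # everything the extractor returned
    relevant = correct + missing     # everything the judge considers extractable
    precision = correct / extracted if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# soft_skills counts reported in the summary table: 355 correct, 40 incorrect, 24 missing
p, r = precision_recall(correct=355, incorrect=40, missing=24)
print(round(p, 3), round(r, 3))  # 0.899 0.937
```

Because the counts are summed over all sampled ads before dividing, these are micro-averaged figures: ads with many extracted items weigh more than ads with few.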

In [118]:
# Load the evaluation dataset
eval_df = pd.read_csv(EVALUATION_DATASET_PATH)
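
One thing to be aware of after this CSV round trip: columns that held Python lists when saved (`soft_skills`, `hard_skills`, and so on) come back from `pd.read_csv` as strings such as `"['communication', 'SQL']"`. That is harmless here, since the strings are only pasted into the judge prompt, but any code that needs real lists has to parse them first. A minimal sketch using the standard library follows; the helper name `parse_list_cell` is hypothetical.

```python
import ast


def parse_list_cell(cell):
    """Turn the string form of a list (as written by to_csv) back into a Python list."""
    if isinstance(cell, list):
        return cell  # already a list, e.g. before the CSV round trip
    if not isinstance(cell, str):
        return []    # NaN or other non-string values
    # literal_eval only evaluates Python literals, so it does not execute arbitrary code
    value = ast.literal_eval(cell)
    return list(value) if isinstance(value, (list, tuple)) else []


print(parse_list_cell("['communication', 'attention to detail']"))
# ['communication', 'attention to detail']
print(parse_list_cell("[]"))
# []
```

If needed, it could be applied per column, for example `eval_df['soft_skills'].apply(parse_list_cell)`.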

In [119]:
# Create a structured message for LLM evaluation
def create_eval_message(row):
    # After the CSV round trip these columns hold each list's string form, which is inserted into the prompt as-is
    soft_skills = row['soft_skills']
    hard_skills = row['hard_skills']
    responsibilities = row['responsibilities']
    other_requirements = row['other_requirements']

    message = f"""
Job Ad Title: {row['title']}

Original Job Ad Content:
{row['cleaned_content']}

Extracted Information:
- soft_skills: {soft_skills}
- hard_skills: {hard_skills}
- responsibilities: {responsibilities}
- other_requirements: {other_requirements}

"""
    return message

# Add the evaluation message column
eval_df['eval_text'] = eval_df.apply(create_eval_message, axis=1)

In [120]:
JUDGE_SYSTEM_PROMPT = """
You are a specialised evaluation system for job advertisement information extraction. Your task is to compare extracted job information against the original job advertisement and assess extraction quality.

Review the original job ad text and the extracted information in these categories:
- soft_skills: Interpersonal, communication, teamwork, and non-technical skills
- hard_skills: Technical skills, tools, programming languages, certifications
- responsibilities: Core tasks and duties for the role
- other_requirements: Qualifications, education, experience, eligibility requirements

For each category, determine:
- Count correct extractions (items that appear in the ad)
- Count incorrect extractions (items not mentioned in the ad)
- List important items from the job ad that were missed in the extraction

Return your evaluation in this JSON format:
{
  "soft_skills": {
    "correct_extractions": [list of correctly extracted items],
    "incorrect_extractions": [list of incorrectly extracted items],
    "missing_extractions": [list of items that should have been extracted]
  },
  "hard_skills": {
    "correct_extractions": [list of correctly extracted items],
    "incorrect_extractions": [list of incorrectly extracted items],
    "missing_extractions": [list of items that should have been extracted]
  },
  "responsibilities": {
    "correct_extractions": [list of correctly extracted items],
    "incorrect_extractions": [list of incorrectly extracted items],
    "missing_extractions": [list of items that should have been extracted]
  },
  "other_requirements": {
    "correct_extractions": [list of correctly extracted items],
    "incorrect_extractions": [list of incorrectly extracted items],
    "missing_extractions": [list of items that should have been extracted]
  }
}
"""

In [122]:
class CategoryEvaluation(BaseModel):
    correct_extractions: list[str]
    incorrect_extractions: list[str]
    missing_extractions: list[str]
        
class EvalResponse(BaseModel):
    soft_skills: CategoryEvaluation
    hard_skills: CategoryEvaluation
    responsibilities: CategoryEvaluation
    other_requirements: CategoryEvaluation

In [123]:
# Call the judge model and return its per-category evaluation
def llm_evaluate(eval_text: str) -> dict:
    try:
        response = client.responses.parse(
            model="gpt-5",
            input=[
                {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
                {
                    "role": "user",
                    "content": eval_text,
                },
            ],
            text_format=EvalResponse,
        )

        result = response.output_parsed

        return {
            'soft_skills': result.soft_skills,
            'hard_skills': result.hard_skills,
            'responsibilities': result.responsibilities,
            'other_requirements': result.other_requirements,
        }
    except Exception as e:
        print(f"Error evaluating job ad: {e}")
        # Fall back to empty CategoryEvaluation objects (not plain dicts) so that
        # downstream attribute access such as x[category].correct_extractions still works
        return {
            category: CategoryEvaluation(
                correct_extractions=[], incorrect_extractions=[], missing_extractions=[]
            )
            for category in ('soft_skills', 'hard_skills', 'responsibilities', 'other_requirements')
        }


In [124]:
# Apply the judge to every row of the evaluation dataset
eval_results = eval_df['eval_text'].apply(llm_evaluate)

In [125]:
# Extract all the fields into separate columns
for category in ['soft_skills', 'hard_skills', 'responsibilities', 'other_requirements']:
    # Correct extractions
    eval_df.loc[:,f'{category}_correct'] = eval_results.apply(
        lambda x: x[category].correct_extractions
    )
    
    # Incorrect extractions
    eval_df.loc[:,f'{category}_incorrect'] = eval_results.apply(
        lambda x: x[category].incorrect_extractions
    )

    # Missing extractions
    eval_df.loc[:,f'{category}_missing'] = eval_results.apply(
        lambda x: x[category].missing_extractions
    )


# Calculate counts for precision and recall analysis
for category in ['soft_skills', 'hard_skills', 'responsibilities', 'other_requirements']:
    # Count of correct extractions
    eval_df.loc[:,f'{category}_correct_count'] = eval_df[f'{category}_correct'].apply(len)

    # Count of incorrect extractions
    eval_df.loc[:,f'{category}_incorrect_count'] = eval_df[f'{category}_incorrect'].apply(len)

    # Count of missing extractions
    eval_df.loc[:,f'{category}_missing_count'] = eval_df[f'{category}_missing'].apply(len)

In [126]:
# Calculate precision and recall per category across the whole evaluation dataset
summary_metrics = []

for category in ['soft_skills', 'hard_skills', 'responsibilities', 'other_requirements']:
    # Sum up counts across all job ads
    total_correct = eval_df[f'{category}_correct_count'].sum()
    total_incorrect = eval_df[f'{category}_incorrect_count'].sum()
    total_missing = eval_df[f'{category}_missing_count'].sum()
    
    # Calculate metrics
    precision = total_correct / (total_correct + total_incorrect) 
    recall = total_correct / (total_correct + total_missing) 
    
    # Store results
    summary_metrics.append({
        'Category': category,
        'Precision': round(precision, 3),
        'Recall': round(recall, 3),
        'Correct Count': total_correct,
        'Incorrect Count': total_incorrect,
        'Missing Count': total_missing
    })

# Create summary dataframe
metrics_df = pd.DataFrame(summary_metrics)

# Display the metrics
metrics_df

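|   | Category           | Precision | Recall | Correct Count | Incorrect Count | Missing Count |
|---|--------------------|-----------|--------|---------------|-----------------|---------------|
| 0 | soft_skills        | 0.899     | 0.937  | 355           | 40              | 24            |
| 1 | hard_skills        | 0.954     | 0.958  | 435           | 21              | 19            |
| 2 | responsibilities   | 0.856     | 0.953  | 346           | 58              | 17            |
| 3 | other_requirements | 0.91      | 0.865  | 263           | 26              | 41            |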
