# Multi-Stage Job Advertisement Analysis: LLM Skill Extractor

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mansamoussa/llm-skill-extractor/blob/main/notebooks/04_skill_extraction_llm.ipynb)

---

### Objective
Extract professional skills from job advertisements using a Large Language Model (OpenAI GPT or Google Gemini) after identifying the relevant "Skills and Content" zones with our fine-tuned BERT model.

This notebook will:
1. Load the fine-tuned BERT model from Task 2 (zone identification)
2. Load job advertisement data (both annotated and scraped datasets)
3. Use BERT to identify "Fähigkeiten und Inhalte" (Skills and Content) zones
4. Extract the text sections that contain skills
5. Use an LLM API (OpenAI GPT or Google Gemini) to extract individual professional skills
6. Structure and save the extracted skills in JSON format
7. Evaluate the quality and coverage of skill extraction

### Why This Task is Important
The BERT model from Tasks 1-3 can identify WHERE skills are mentioned in a job ad, but it cannot extract the SPECIFIC skills themselves. This is a perfect task for a Large Language Model (LLM) because:
- LLMs understand context and can identify implicit skills
- They can handle multilingual text (German, French, English)
- They can normalize skill names (e.g., "JS" → "JavaScript")
- They can distinguish between skills and other job requirements

### Input Data
- `model/best_model.pt`: fine-tuned BERT model from Task 2
- `model/id2label.json` and `model/label2id.json`: label mappings
- `data/annotated.json`: annotated job advertisements (German/French)
- `data/scraped/remoteok_jobs.jsonl`: scraped job postings (English)

### Output
- `data/extracted_skills.json`: structured skill extraction results
- `data/skill_statistics.json`: statistics and analysis
- Evaluation metrics and visualizations


---
# 1. Setting up my Environment

**Objective:** Prepare the Google Colab environment by cloning the project repository and installing necessary dependencies.

**Why I need this:**
* **The Issue:** A fresh Colab session is empty and doesn't have my project files or the required libraries.
* **The Fix:** I clone my repository to get the code, and install the dependencies, including the `openai` and `google-generativeai` client libraries.

**What I did:**
1. **Clone Repository:** Downloaded project files from GitHub
2. **Install Dependencies:** Installed Python libraries including `openai`, `torch`, `transformers`

In [None]:
# --- SETUP STEPS ---
# This cell prepares the Colab environment by downloading the code and installing libraries.
import os

# 1. Clone the repository if it doesn't exist in the notebook
if not os.path.exists('llm-skill-extractor'):
    !git clone https://github.com/mansamoussa/llm-skill-extractor.git
else:
    print("Repository already cloned.")

# 2. Install core dependencies (including both OpenAI and Google Gemini)
!pip install -q torch transformers openai google-generativeai pandas tqdm scikit-learn matplotlib seaborn

print("✅ Environment setup complete!")

# 2. Loading my Data

**Objective:** Copy the private annotated dataset from Google Drive and verify that both data files (annotated and scraped) are in place.

**Why I need this:**
* **The Issue:** The annotated data is private and not in the GitHub repo.
* **The Fix:** I mount Google Drive and copy the necessary data files to the project folder.

**What I did:**
1. **Mount Drive:** Connected to Google Drive
2. **Copy Files:** Copied `annotated.json` to the data folder

In [None]:
# --- DATA LOADING STEP ---
# This cell brings the data from Google Drive into the project environment
from google.colab import drive
import shutil
import os

# 1. Mount Google Drive
drive.mount('/content/drive')

# 2. Define paths (adjust if your Drive structure is different)
source_path = '/content/drive/MyDrive/GEN03/annotated.json'
destination_folder = '/content/llm-skill-extractor/data/'

# 3. Copy the annotated data file
if os.path.exists(source_path):
    os.makedirs(destination_folder, exist_ok=True)
    shutil.copy(source_path, destination_folder)
    print(f"✅ Success! annotated.json copied to {destination_folder}")
else:
    print(f"⚠️ File not found at {source_path}")
    print("Please verify the path or upload 'annotated.json' manually to 'llm-skill-extractor/data/' folder.")

print("\n📊 Checking data files...")
data_files = ['annotated.json', 'scraped/remoteok_jobs.jsonl']
for file in data_files:
    full_path = os.path.join(destination_folder, file)
    if os.path.exists(full_path):
        size_mb = os.path.getsize(full_path) / (1024 * 1024)
        print(f"✅ Found: {file} ({size_mb:.2f} MB)")
    else:
        print(f"❌ Missing: {file}")

# 3. Loading the Trained BERT Model

**Objective:** Load the fine-tuned BERT model and label mappings from Task 2 (or from Google Drive backup).

**Why I need this:**
* **The Issue:** I need the trained model to identify which parts of the job ads contain skills.
* **The Fix:** Load the model weights, tokenizer, and label mappings from the previous training task.

**What I did:**
1. **Check Local Files:** Look for model files in the project folder
2. **Load from Drive:** If not found locally, copy from Google Drive backup
3. **Initialize Model:** Load BERT model with trained weights
4. **Load Mappings:** Load the label2id and id2label dictionaries
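
The local-then-Drive lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own code: `ensure_model_files` is a hypothetical helper, and checking all three files (rather than only `best_model.pt`) is an assumption about what a robust check might look like.

```python
import os
import shutil
import tempfile


def ensure_model_files(local_dir, backup_dir,
                       required=("best_model.pt", "id2label.json", "label2id.json")):
    """Copy the backup folder over if files are missing; return names still missing."""
    def missing_files():
        return [name for name in required
                if not os.path.exists(os.path.join(local_dir, name))]

    if missing_files() and os.path.isdir(backup_dir):
        # dirs_exist_ok=True lets the copy merge into an existing folder.
        shutil.copytree(backup_dir, local_dir, dirs_exist_ok=True)
    return missing_files()


# Demo with temporary folders standing in for the project and Drive directories.
with tempfile.TemporaryDirectory() as root:
    backup = os.path.join(root, "backup")
    local = os.path.join(root, "model")
    os.makedirs(backup)
    for name in ("best_model.pt", "id2label.json"):
        open(os.path.join(backup, name), "w").close()
    still_missing = ensure_model_files(local, backup)
    print(still_missing)
```

Returning the list of files that are still missing makes it possible to stop with a clear message before the model is constructed, instead of failing later on a missing JSON file.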

In [None]:
# --- MODEL LOADING STEP ---
# Load the trained BERT model for zone identification
import torch
import json
from transformers import BertTokenizerFast, BertForTokenClassification

# Define paths
PROJECT_ROOT = '/content/llm-skill-extractor'
MODEL_DIR = os.path.join(PROJECT_ROOT, 'model')
BEST_MODEL_PATH = os.path.join(MODEL_DIR, 'best_model.pt')
ID2LABEL_PATH = os.path.join(MODEL_DIR, 'id2label.json')
LABEL2ID_PATH = os.path.join(MODEL_DIR, 'label2id.json')

# Check if model files exist locally, if not try to load from Drive
drive_model_path = '/content/drive/MyDrive/GEN03/model'
if not os.path.exists(BEST_MODEL_PATH) and os.path.exists(drive_model_path):
    print("📥 Model not found locally, copying from Google Drive...")
    shutil.copytree(drive_model_path, MODEL_DIR, dirs_exist_ok=True)
    print("✅ Model files copied from Drive")

# Load label mappings (read as UTF-8 because the label names contain umlauts)
print("📖 Loading label mappings...")
with open(ID2LABEL_PATH, 'r', encoding='utf-8') as f:
    id2label = json.load(f)
    # JSON object keys are always strings, so convert them back to integers
    id2label = {int(k): v for k, v in id2label.items()}

with open(LABEL2ID_PATH, 'r', encoding='utf-8') as f:
    label2id = json.load(f)

print(f"✅ Found {len(label2id)} labels: {list(label2id.keys())}")

# Initialize tokenizer
print("🔤 Loading tokenizer...")
MODEL_NAME = 'bert-base-multilingual-cased'
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)

# Initialize and load the model
print("🤖 Loading BERT model...")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

num_labels = len(label2id)
model = BertForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

# Load trained weights
checkpoint = torch.load(BEST_MODEL_PATH, map_location=device)
model.load_state_dict(checkpoint)
model.to(device)
model.eval()

print("✅ BERT model loaded successfully!")
print(f"   Model has {num_labels} output labels")
print(f"   Target label for skill extraction: 'Fähigkeiten und Inhalte'")

# 4. Configuring LLM APIs (OpenAI & Gemini)

**Objective:** Set up both OpenAI and Google Gemini API connections for skill extraction.

**Why I need this:**
* **The Issue:** I need to authenticate with LLM providers to use their models for intelligent skill extraction.
* **The Fix:** Configure both API keys and allow choosing which model to use.

**Why Both APIs?**
* **Flexibility:** Choose the best model for your needs
* **Comparison:** Test which LLM performs better for skill extraction  
* **Cost:** Gemini offers a free tier, while OpenAI is billed per token but might be more accurate

**Security Note:** In production, never hardcode API keys. Use environment variables or secret management systems.

**What I did:**
1. **Set API Keys:** Configure both OpenAI and Gemini authentication
2. **Test Connections:** Verify both APIs are accessible
3. **Choose Default:** Select which LLM to use (can be changed later)
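
One way to follow the security note above is to read the keys from environment variables instead of writing them into the notebook. The sketch below is an assumption-laden illustration: `load_api_key` is a hypothetical helper, and treating values that still end in `_HERE` as missing is a convention chosen here to catch unreplaced placeholders.

```python
import os


def load_api_key(name, placeholder_suffix="_HERE"):
    """Return the key stored in environment variable `name`, or None.

    Hypothetical helper: empty values and values that still look like the
    "YOUR_..._HERE" placeholder are treated as missing.
    """
    value = os.environ.get(name, "").strip()
    if not value or value.endswith(placeholder_suffix):
        return None
    return value


# Decide which providers are usable before creating any client.
available = {
    provider: load_api_key(env_name) is not None
    for provider, env_name in [("openai", "OPENAI_API_KEY"),
                               ("gemini", "GEMINI_API_KEY")]
}
print(available)
```

A provider whose key is missing can then be skipped up front, rather than being discovered through a failed test request.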

In [None]:
# --- LLM SETUP (OPENAI & GEMINI) ---
# Configure both OpenAI and Google Gemini APIs for skill extraction
from openai import OpenAI
import google.generativeai as genai
import os

# Set your API keys  
# IMPORTANT: Replace with your own API keys before running
# Option 1: Hardcode them here (not recommended for public repos)
# Option 2: Use Google Colab secrets (recommended)
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY_HERE"  # Replace with your OpenAI API key
GEMINI_API_KEY = "YOUR_GEMINI_API_KEY_HERE"   # Replace with your Gemini API key

# Initialize OpenAI client
openai_client = OpenAI(api_key=OPENAI_API_KEY)

# Initialize Gemini
genai.configure(api_key=GEMINI_API_KEY)
gemini_model = genai.GenerativeModel('gemini-1.5-flash')

# Test OpenAI connection
print("🔗 Testing OpenAI API connection...")
try:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say 'OpenAI API test successful' in 5 words or less."}
        ],
        max_tokens=20
    )
    print(f"✅ OpenAI API Connection successful!")
    print(f"   Response: {response.choices[0].message.content}")
except Exception as e:
    print(f"❌ OpenAI API Connection failed: {e}")
    openai_client = None

# Test Gemini connection
print("\n🔗 Testing Gemini API connection...")
try:
    response = gemini_model.generate_content("Say 'Gemini API test successful' in 5 words or less.")
    print(f"✅ Gemini API Connection successful!")
    print(f"   Response: {response.text.strip()}")
except Exception as e:
    print(f"❌ Gemini API Connection failed: {e}")
    gemini_model = None

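# Choose which LLM to use (you can change this later)
# Options: "openai" or "gemini"
USE_LLM = "gemini"  # Default to Gemini (free tier)

# Fall back to the other provider only if that provider passed its connection test
if USE_LLM == "openai" and openai_client is None and gemini_model is not None:
    print("⚠️ OpenAI selected but connection failed. Falling back to Gemini.")
    USE_LLM = "gemini"
elif USE_LLM == "gemini" and gemini_model is None and openai_client is not None:
    print("⚠️ Gemini selected but connection failed. Falling back to OpenAI.")
    USE_LLM = "openai"

# Report the final choice after any fallback has been applied
if openai_client is None and gemini_model is None:
    print("❌ No LLM is available. Check both API keys before continuing.")
else:
    print(f"\n🎯 Selected LLM: {USE_LLM.upper()}")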

# 5. Helper Function: Extract Skills Section Using BERT

**Objective:** Create a function that uses the trained BERT model to identify and extract the "Fähigkeiten und Inhalte" (Skills and Content) sections from job advertisements.

**Why I need this:**
* **The Issue:** Job ads contain many sections (company info, benefits, application process, etc.). I only want the parts that list skills.
* **The Fix:** Use BERT to classify each token, then extract continuous spans labeled as "Fähigkeiten und Inhalte".

**How it works:**
1. Tokenize the job ad text
2. Run BERT to predict labels for each token
3. Find all tokens labeled as "Fähigkeiten und Inhalte"
4. Reconstruct the original text from those tokens
5. Return the extracted skill sections

**What I did:**
Created a reusable function that processes any job advertisement text and returns only the skills sections.
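
The span merging in steps 3 and 4 above can be illustrated without loading a model. This is a minimal sketch, assuming per-token labels and character offsets are already available; `merge_labeled_spans` is a hypothetical name, and the sample labels and offsets are made up for illustration.

```python
def merge_labeled_spans(text, labels, offsets, target_label):
    """Merge consecutive tokens carrying `target_label` into text sections."""
    sections = []
    span_start, span_end = None, None
    for label, (start, end) in zip(labels, offsets):
        if start == end:  # special tokens carry empty offsets
            continue
        if label == target_label:
            if span_start is None:
                span_start = start
            span_end = end
        elif span_start is not None:
            # A token with a different label closes the current section.
            sections.append(text[span_start:span_end].strip())
            span_start, span_end = None, None
    if span_start is not None:  # section that runs to the end of the text
        sections.append(text[span_start:span_end].strip())
    return [section for section in sections if section]


text = "About us. Python and SQL required. Apply now."
# Hypothetical word-level predictions with character offsets into `text`.
offsets = [(0, 0), (0, 5), (6, 9), (10, 16), (17, 20), (21, 24),
           (25, 34), (35, 40), (41, 45), (0, 0)]
labels = ["O", "O", "O", "SKILL", "SKILL", "SKILL", "SKILL", "O", "O", "O"]
print(merge_labeled_spans(text, labels, offsets, "SKILL"))
```

Slicing the original string from the first to the last offset of a run keeps the whitespace and punctuation between tokens, which joining token strings would lose.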

In [None]:
# --- BERT EXTRACTION FUNCTION ---
# Function to extract skills sections using the trained BERT model

def extract_skills_section_with_bert(text, model, tokenizer, device, target_label="Fähigkeiten und Inhalte"):
    """
    Extract sections labeled as target_label from the input text using BERT.
    
    Args:
        text (str): Input job advertisement text
        model: Trained BERT model for token classification
        tokenizer: BERT tokenizer
        device: torch device (cuda or cpu)
        target_label (str): The label to extract (default: "Fähigkeiten und Inhalte")
    
    Returns:
        list: List of extracted text sections labeled as target_label
    """
    if not text or len(text.strip()) == 0:
        return []
    
    # Tokenize with offset mapping to track character positions
    encoding = tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        max_length=512,
        padding=False,
        return_offsets_mapping=True
    )
    
    offset_mapping = encoding.pop('offset_mapping')[0]
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    
    # Get predictions
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs.logits, dim=-1)[0]
    
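    # Convert predictions to labels using the mapping stored in the model config
    # (avoids depending on the notebook-level id2label variable)
    predicted_labels = [model.config.id2label[pred.item()] for pred in predictions]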
    
    # Extract skill sections
    skill_sections = []
    current_section = []
    current_start = None
    
    for idx, (label, (start, end)) in enumerate(zip(predicted_labels, offset_mapping)):
        # Skip special tokens ([CLS], [SEP], [PAD])
        if start == end == 0:
            continue
            
        if label == target_label:
            if current_start is None:
                current_start = start.item()
            current_section.append((start.item(), end.item()))
        else:
            # End of a skill section
            if current_section:
                # Extract the text for this section
                section_start = current_section[0][0]
                section_end = current_section[-1][1]
                section_text = text[section_start:section_end].strip()
                if section_text:
                    skill_sections.append(section_text)
                current_section = []
                current_start = None
    
    # Don't forget the last section if it ends at the document end
    if current_section:
        section_start = current_section[0][0]
        section_end = current_section[-1][1]
        section_text = text[section_start:section_end].strip()
        if section_text:
            skill_sections.append(section_text)
    
    return skill_sections

print("✅ BERT extraction function defined!")
print("   Function: extract_skills_section_with_bert()")

# 6. Helper Function: Extract Skills Using LLM

**Objective:** Create a function that uses an LLM (OpenAI GPT or Google Gemini) to extract individual skills from the text sections identified by BERT.

**Why I need this:**
* **The Issue:** The BERT model tells me WHERE skills are mentioned, but not WHAT the specific skills are.
* **The Fix:** Use an LLM with a carefully crafted prompt to intelligently extract and normalize skill names.

**How it works:**
1. Take a text section containing skills
2. Send it to the selected LLM with a specialized prompt
3. The LLM identifies individual skills (technical skills, soft skills, languages, tools)
4. Return structured JSON with categorized skills

**Prompt Engineering:**
The prompt instructs the LLM to:
- Extract only professional skills (not job titles or company names)
- Normalize skill names (e.g., "JS" → "JavaScript")
- Categorize skills (technical, soft, languages, tools, certifications)
- Return results in a structured JSON format
- Handle multilingual text (German, French, English)

**What I did:**
Created a reusable function with a well-engineered prompt for consistent skill extraction.
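
An LLM may not follow the requested structure exactly, so it can help to sanitize the parsed reply before merging it. The following is a sketch under assumptions: `sanitize_skills` is a hypothetical helper that is not part of this notebook, and dropping non-string entries is one possible policy among several.

```python
import json

EXPECTED_KEYS = ["technical_skills", "soft_skills", "languages",
                 "tools", "certifications", "other_skills"]


def sanitize_skills(raw_json):
    """Parse an LLM reply and keep only lists of non-empty strings per category."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    clean = {}
    for key in EXPECTED_KEYS:
        value = data.get(key, [])
        if isinstance(value, str):  # a single skill returned as a bare string
            value = [value]
        if not isinstance(value, list):
            value = []
        clean[key] = [v.strip() for v in value if isinstance(v, str) and v.strip()]
    return clean


# Made-up reply showing three typical deviations from the requested format.
reply = '{"technical_skills": ["Python", " SQL "], "tools": "Git", "languages": [null, "German"]}'
print(sanitize_skills(reply))
```

With this kind of guard, a malformed reply yields empty categories instead of raising later when the lists are merged or deduplicated.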

In [None]:
# --- LLM EXTRACTION FUNCTIONS ---
# Functions to extract individual skills using OpenAI GPT or Google Gemini

import json
import re

def extract_skills_with_openai(text_section, client, model_name="gpt-4o-mini"):
    """Extract skills using OpenAI GPT."""
    if not text_section or len(text_section.strip()) == 0:
        return get_empty_skills_dict()
    
    system_prompt = get_skill_extraction_prompt()
    user_prompt = f"""Extract and categorize all professional skills from this job advertisement text:

TEXT:
{text_section}

Return only the JSON object with categorized skills."""

    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.3,
            max_tokens=1000,
            response_format={"type": "json_object"}
        )
        
        skills_data = json.loads(response.choices[0].message.content)
        return ensure_all_skill_keys(skills_data)
        
    except Exception as e:
        print(f"⚠️ OpenAI extraction error: {e}")
        return get_empty_skills_dict(error=str(e))

def extract_skills_with_gemini(text_section, model):
    """Extract skills using Google Gemini."""
    if not text_section or len(text_section.strip()) == 0:
        return get_empty_skills_dict()
    
    system_prompt = get_skill_extraction_prompt()
    full_prompt = f"""{system_prompt}

Extract and categorize all professional skills from this job advertisement text:

TEXT:
{text_section}

Return ONLY the JSON object with categorized skills, no other text."""

    try:
        response = model.generate_content(
            full_prompt,
            generation_config=genai.GenerationConfig(
                temperature=0.3,
                max_output_tokens=1000,
            )
        )
        
        # Extract JSON from response (Gemini might add markdown code blocks)
        response_text = response.text.strip()
        
        # Remove markdown code blocks if present
        json_match = re.search(r'```(?:json)?\s*({.*?})\s*```', response_text, re.DOTALL)
        if json_match:
            json_str = json_match.group(1)
        else:
            # Try to find JSON object directly
            json_match = re.search(r'{.*}', response_text, re.DOTALL)
            json_str = json_match.group(0) if json_match else response_text
        
        skills_data = json.loads(json_str)
        return ensure_all_skill_keys(skills_data)
        
    except Exception as e:
        print(f"⚠️ Gemini extraction error: {e}")
        return get_empty_skills_dict(error=str(e))

def extract_skills_with_llm(text_section, use_llm="gemini"):
    """
    Unified function to extract skills using either OpenAI or Gemini.
    
    Args:
        text_section (str): Text containing skill descriptions
        use_llm (str): Which LLM to use - "openai" or "gemini"
    
    Returns:
        dict: Structured skills data with categories
    """
    if use_llm == "openai" and openai_client:
        return extract_skills_with_openai(text_section, openai_client)
    elif use_llm == "gemini" and gemini_model:
        return extract_skills_with_gemini(text_section, gemini_model)
    else:
        print(f"⚠️ Selected LLM '{use_llm}' not available")
        return get_empty_skills_dict(error=f"LLM {use_llm} not available")

# Helper functions
def get_empty_skills_dict(error=None):
    """Return an empty skills dictionary."""
    result = {
        "technical_skills": [],
        "soft_skills": [],
        "languages": [],
        "tools": [],
        "certifications": [],
        "other_skills": []
    }
    if error:
        result["error"] = error
    return result

def ensure_all_skill_keys(skills_data):
    """Ensure all expected keys exist in the skills dictionary."""
    expected_keys = ["technical_skills", "soft_skills", "languages", "tools", "certifications", "other_skills"]
    for key in expected_keys:
        if key not in skills_data:
            skills_data[key] = []
    return skills_data

def get_skill_extraction_prompt():
    """Get the standard prompt for skill extraction."""
    return """You are an expert HR analyst specializing in extracting professional skills from job advertisements.
Your task is to extract and categorize skills mentioned in the provided text.

RULES:
1. Extract only genuine professional skills, tools, technologies, and qualifications
2. Do NOT extract:
   - Job titles or positions
   - Company names
   - General job responsibilities
   - Benefits or salary information
3. Normalize skill names (e.g., "JS" → "JavaScript", "ML" → "Machine Learning")
4. Handle multilingual text (German, French, English)
5. Categorize skills appropriately

Return ONLY a JSON object with this exact structure (no additional text):
{
  "technical_skills": ["skill1", "skill2"],
  "soft_skills": ["skill1", "skill2"],
  "languages": ["language1", "language2"],
  "tools": ["tool1", "tool2"],
  "certifications": ["cert1", "cert2"],
  "other_skills": ["skill1", "skill2"]
}"""

print("✅ LLM extraction functions defined!")
print(f"   Functions: extract_skills_with_llm() [supports both OpenAI and Gemini]")
print(f"   Currently using: {USE_LLM.upper()}")

# 7. Load and Prepare Job Advertisement Data

**Objective:** Load both the annotated dataset and scraped job postings, preparing them for skill extraction.

**Why I need this:**
* **The Issue:** I have two different data sources with different formats.
* **The Fix:** Load both datasets and standardize them into a common format.

**What I did:**
1. Load annotated.json (German/French job ads)
2. Load scraped remoteok_jobs.jsonl (English job ads)
3. Create a unified data structure
4. Display sample data for verification
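
The standardization in step 3 amounts to mapping each source's fields onto one schema. A minimal sketch, assuming the field names used in the cell that follows (`data.content_clean`, `description`, `title`, `company`); `to_unified_record` is a hypothetical helper and the sample records are made up.

```python
def to_unified_record(raw, source, index):
    """Map one raw record from either source onto the common job schema."""
    if source == "annotated":
        data = raw.get("data")
        text = data.get("content_clean", "") if isinstance(data, dict) else ""
        language, title, company = "de/fr", "", ""
    elif source == "scraped":
        text = raw.get("description") or ""
        language = "en"
        title, company = raw.get("title", ""), raw.get("company", "")
    else:
        raise ValueError(f"Unknown source: {source}")
    return {"id": f"{source}_{index}", "text": text, "source": source,
            "language": language, "title": title, "company": company}


# Made-up sample records, for illustration only.
annotated_raw = {"data": {"content_clean": "Wir suchen eine Person mit SQL-Kenntnissen."}}
scraped_raw = {"description": "We need a Python developer.",
               "title": "Developer", "company": "ExampleCo"}

jobs = [to_unified_record(annotated_raw, "annotated", 0),
        to_unified_record(scraped_raw, "scraped", 0)]
for job in jobs:
    print(job["id"], job["language"], job["text"][:30])
```

Keeping the mapping in one function means a third data source would only need one more branch, and every downstream step can rely on the same six keys.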

In [None]:
# --- DATA LOADING ---
# Load and prepare job advertisement data
import pandas as pd
import json

DATA_PATH = os.path.join(PROJECT_ROOT, 'data')

# 1. Load annotated data (German/French)
print("📖 Loading annotated job advertisements...")
annotated_path = os.path.join(DATA_PATH, 'annotated.json')
df_annotated = pd.read_json(annotated_path)

# Extract the text content from the data field
df_annotated['text'] = df_annotated['data'].apply(lambda x: x.get('content_clean', '') if isinstance(x, dict) else '')
df_annotated['source'] = 'annotated'
df_annotated['language'] = 'de/fr'  # German/French

print(f"✅ Loaded {len(df_annotated)} annotated job ads")

# 2. Load scraped data (English)
print("📖 Loading scraped job advertisements...")
scraped_path = os.path.join(DATA_PATH, 'scraped', 'remoteok_jobs.jsonl')

scraped_jobs = []
with open(scraped_path, 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            scraped_jobs.append(json.loads(line))

df_scraped = pd.DataFrame(scraped_jobs)
df_scraped['text'] = df_scraped['description']
df_scraped['source'] = 'scraped'
df_scraped['language'] = 'en'

print(f"✅ Loaded {len(df_scraped)} scraped job ads")

# 3. Create unified dataset
print("\n🔄 Creating unified dataset...")
jobs_data = []

# Add annotated jobs
for idx, row in df_annotated.iterrows():
    jobs_data.append({
        'id': f"annotated_{idx}",
        'text': row['text'],
        'source': row['source'],
        'language': row['language'],
        'title': '',  # Not available in annotated data
        'company': ''  # Not available in annotated data
    })

# Add scraped jobs
for idx, row in df_scraped.iterrows():
    jobs_data.append({
        'id': f"scraped_{idx}",
        'text': row['text'],
        'source': row['source'],
        'language': row['language'],
        'title': row.get('title', ''),
        'company': row.get('company', '')
    })

print(f"✅ Total dataset: {len(jobs_data)} job advertisements")
print(f"   - Annotated (de/fr): {len(df_annotated)}")
print(f"   - Scraped (en): {len(df_scraped)}")

# Display sample
print("\n📊 Sample job advertisement:")
sample = jobs_data[0]
print(f"ID: {sample['id']}")
print(f"Source: {sample['source']}")
print(f"Language: {sample['language']}")
print(f"Text preview: {sample['text'][:200]}...")

# 8. Test Extraction Pipeline on Sample

**Objective:** Test the complete extraction pipeline (BERT + LLM) on a sample job advertisement before processing the full dataset.

**Why I need this:**
* **The Issue:** Processing all job ads with the API costs money and time. I need to verify the pipeline works correctly first.
* **The Fix:** Run the pipeline on a few sample ads and inspect the results.

**What I did:**
1. Select a sample job advertisement
2. Extract skills sections using BERT
3. Extract individual skills using LLM
4. Display and verify the results
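
When results from several sections are merged, exact-match deduplication treats "python" and "Python" as different skills and does not keep a stable order. A possible alternative is sketched below under assumptions: `merge_skill_dicts` is hypothetical, all skills are assumed to be strings, and case-insensitive matching is a design choice rather than something this notebook does.

```python
def merge_skill_dicts(section_results):
    """Merge per-section skill dicts, dropping case-insensitive duplicates.

    The first spelling seen is kept and the original order is preserved.
    """
    merged, seen = {}, {}
    for result in section_results:
        for category, skills in result.items():
            if not isinstance(skills, list):  # skip fields such as "error"
                continue
            bucket = merged.setdefault(category, [])
            seen_keys = seen.setdefault(category, set())
            for skill in skills:
                key = skill.casefold()
                if key not in seen_keys:
                    seen_keys.add(key)
                    bucket.append(skill)
    return merged


# Made-up results from two sections of the same job ad.
sections = [
    {"technical_skills": ["Python", "SQL"], "tools": ["Git"]},
    {"technical_skills": ["python", "Docker"], "tools": [], "error": "timeout"},
]
print(merge_skill_dicts(sections))
```

Preserving order also makes repeated runs easier to compare, because the saved JSON no longer depends on set iteration order.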

In [None]:
# --- PIPELINE TEST ---
# Test the extraction pipeline on sample data

print("🧪 Testing extraction pipeline on sample job advertisement...\n")

# Select a sample from scraped data (usually has clearer skill descriptions)
test_sample = [job for job in jobs_data if job['source'] == 'scraped'][0]

print(f"📄 Sample Job: {test_sample['title']}")
print(f"   Company: {test_sample['company']}")
print(f"   Source: {test_sample['source']}")
print(f"   Text length: {len(test_sample['text'])} characters")
print("\n" + "="*80)

# Step 1: Extract skills sections using BERT
print("\n🤖 Step 1: Extracting skills sections with BERT...")
skill_sections = extract_skills_section_with_bert(
    test_sample['text'],
    model,
    tokenizer,
    device
)

print(f"✅ Found {len(skill_sections)} skills section(s)")
for i, section in enumerate(skill_sections, 1):
    print(f"\n   Section {i} ({len(section)} chars):")
    print(f"   {section[:200]}..." if len(section) > 200 else f"   {section}")

# Step 2: Extract individual skills using LLM
print("\n🧠 Step 2: Extracting individual skills with LLM...")
all_extracted_skills = {
    "technical_skills": [],
    "soft_skills": [],
    "languages": [],
    "tools": [],
    "certifications": [],
    "other_skills": []
}

for i, section in enumerate(skill_sections, 1):
    print(f"\n   Processing section {i}...")
    skills = extract_skills_with_llm(section, use_llm=USE_LLM)
    
    # Merge results
    for key in all_extracted_skills.keys():
        if key in skills:
            all_extracted_skills[key].extend(skills[key])

# Remove duplicates
for key in all_extracted_skills.keys():
    all_extracted_skills[key] = list(set(all_extracted_skills[key]))

# Display results
print("\n" + "="*80)
print("📊 EXTRACTION RESULTS:")
print("="*80)

for category, skills in all_extracted_skills.items():
    if skills:
        print(f"\n{category.upper().replace('_', ' ')}:")
        for skill in sorted(skills):
            print(f"  • {skill}")

total_skills = sum(len(v) for v in all_extracted_skills.values())
print(f"\n✅ Total unique skills extracted: {total_skills}")

# 9. Process Full Dataset

**Objective:** Process all job advertisements through the extraction pipeline and save results.

**Why I need this:**
* **The Issue:** I need to extract skills from all job ads, not just samples.
* **The Fix:** Loop through all job ads, extract skills, save progress regularly.

**What I did:**
1. Process each job advertisement through the pipeline
2. Save progress every 10 jobs (in case of errors)
3. Display progress and statistics
4. Handle errors gracefully
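
Periodic saving becomes more useful when an interrupted run can resume from the last checkpoint instead of starting over. The sketch below shows one way to do that with a single checkpoint file; it is an illustration under assumptions, not the approach taken in the cell that follows. `process_with_checkpoint` and `fake_extract` are hypothetical names, and `fake_extract` merely stands in for the BERT + LLM pipeline.

```python
import json
import os
import tempfile


def process_with_checkpoint(jobs, extract_fn, checkpoint_path, save_every=10):
    """Process jobs, skipping ids already stored in the checkpoint file."""
    results = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            results = json.load(f)
    done_ids = {r["job_id"] for r in results}

    def save():
        # Write to a temporary file first so an interruption cannot
        # leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
        os.replace(tmp_path, checkpoint_path)

    for job in jobs:
        if job["id"] in done_ids:
            continue
        results.append({"job_id": job["id"],
                        "extracted_skills": extract_fn(job["text"])})
        if len(results) % save_every == 0:
            save()
    save()
    return results


def fake_extract(text):
    """Stand-in for the real pipeline; returns the words as 'skills'."""
    return {"technical_skills": text.split()}


demo_jobs = [{"id": f"job_{i}", "text": f"skill{i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.json")
    first = process_with_checkpoint(demo_jobs[:3], fake_extract, path, save_every=2)
    second = process_with_checkpoint(demo_jobs, fake_extract, path, save_every=2)
    print(len(first), len(second))
```

In the demo, the second call reuses the three stored results and only processes the two new jobs, which is what saves API calls after a crash or a Colab disconnect.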

**Note:** This cell may take 10-30 minutes depending on dataset size and API speed.

In [None]:
# --- FULL DATASET PROCESSING ---
# Process all job advertisements and extract skills

from tqdm import tqdm
import time

print("🚀 Starting full dataset processing...")
print(f"   Total jobs to process: {len(jobs_data)}")
print(f"   Estimated time: ~{len(jobs_data) * 3 / 60:.1f} minutes\n")

# Initialize results storage
extraction_results = []
errors = []

# Process each job
for i, job in enumerate(tqdm(jobs_data, desc="Extracting skills")):
    try:
        # Extract skills sections with BERT
        skill_sections = extract_skills_section_with_bert(
            job['text'],
            model,
            tokenizer,
            device
        )
        
        # Extract individual skills with LLM
        all_skills = {
            "technical_skills": [],
            "soft_skills": [],
            "languages": [],
            "tools": [],
            "certifications": [],
            "other_skills": []
        }
        
        for section in skill_sections:
            if section:  # Only process non-empty sections
                skills = extract_skills_with_llm(section, use_llm=USE_LLM)
                
                # Merge results
                for key in all_skills.keys():
                    if key in skills:
                        all_skills[key].extend(skills[key])
        
        # Remove duplicates
        for key in all_skills.keys():
            all_skills[key] = list(set(all_skills[key]))
        
        # Store result
        result = {
            'job_id': job['id'],
            'source': job['source'],
            'language': job['language'],
            'title': job.get('title', ''),
            'company': job.get('company', ''),
            'num_skill_sections': len(skill_sections),
            'extracted_skills': all_skills,
            'total_skills_count': sum(len(v) for v in all_skills.values())
        }
        extraction_results.append(result)
        
        # Rate limiting: small delay to avoid hitting API limits
        time.sleep(0.5)
        
        # Save intermediate results every 10 jobs
        if (i + 1) % 10 == 0:
            intermediate_path = os.path.join(DATA_PATH, f'extracted_skills_checkpoint_{i+1}.json')
            with open(intermediate_path, 'w', encoding='utf-8') as f:
                json.dump(extraction_results, f, indent=2, ensure_ascii=False)
        
    except Exception as e:
        error_info = {
            'job_id': job['id'],
            'error': str(e),
            'index': i
        }
        errors.append(error_info)
        print(f"\n⚠️ Error processing job {job['id']}: {e}")
        continue

print("\n✅ Processing complete!")
print(f"   Successfully processed: {len(extraction_results)} jobs")
print(f"   Errors encountered: {len(errors)} jobs")

if errors:
    print("\n⚠️ Jobs with errors:")
    for err in errors[:5]:  # Show first 5 errors
        print(f"   - {err['job_id']}: {err['error'][:100]}")

# 10. Save Extraction Results

**Objective:** Save the extracted skills data to JSON files for further analysis.

**Why I need this:**
* **The Issue:** The extraction results are in memory and will be lost when the session ends.
* **The Fix:** Save results to structured JSON files.

**What I did:**
1. Save complete extraction results
2. Save error log (if any)
3. Create a summary statistics file

In [None]:
# --- SAVE RESULTS ---
# Save extraction results to JSON files

import json

# 1. Save main extraction results
output_path = os.path.join(DATA_PATH, 'extracted_skills.json')
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(extraction_results, f, indent=2, ensure_ascii=False)
print(f"💾 Saved extraction results to: {output_path}")

# 2. Save errors (if any)
if errors:
    errors_path = os.path.join(DATA_PATH, 'extraction_errors.json')
    with open(errors_path, 'w', encoding='utf-8') as f:
        json.dump(errors, f, indent=2, ensure_ascii=False)
    print(f"💾 Saved error log to: {errors_path}")

# 3. Create and save statistics
stats = {
    'total_jobs_processed': len(extraction_results),
    'total_errors': len(errors),
    'jobs_by_source': {},
    'jobs_by_language': {},
    'total_skills_by_category': {
        'technical_skills': 0,
        'soft_skills': 0,
        'languages': 0,
        'tools': 0,
        'certifications': 0,
        'other_skills': 0
    },
    'average_skills_per_job': 0,
    'jobs_with_no_skills': 0
}

# Calculate statistics
for result in extraction_results:
    # By source
    source = result['source']
    stats['jobs_by_source'][source] = stats['jobs_by_source'].get(source, 0) + 1
    
    # By language
    lang = result['language']
    stats['jobs_by_language'][lang] = stats['jobs_by_language'].get(lang, 0) + 1
    
    # Skills counts
    for category, skills in result['extracted_skills'].items():
        stats['total_skills_by_category'][category] += len(skills)
    
    # Jobs with no skills
    if result['total_skills_count'] == 0:
        stats['jobs_with_no_skills'] += 1

# Average skills per job
total_skills = sum(result['total_skills_count'] for result in extraction_results)
stats['average_skills_per_job'] = total_skills / len(extraction_results) if extraction_results else 0

# Save statistics
stats_path = os.path.join(DATA_PATH, 'skill_statistics.json')
with open(stats_path, 'w', encoding='utf-8') as f:
    json.dump(stats, f, indent=2, ensure_ascii=False)
print(f"💾 Saved statistics to: {stats_path}")

print("\n✅ All results saved successfully!")

# 11. Analysis and Visualization

**Objective:** Analyze and visualize the extracted skills data to understand patterns and quality.

**Why I need this:**
* **The Issue:** Raw extraction results are hard to interpret.
* **The Fix:** Create visualizations and summary statistics.

**What I did:**
1. Display key statistics
2. Show most common skills by category
3. Create visualizations
4. Analyze skill distribution across sources
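
One thing to keep in mind when ranking the most common skills: a plain `Counter` treats "Python", "python " and "PYTHON" as three different skills. The sketch below shows one possible way to merge such variants before counting. The function name `count_skills` is hypothetical, and it assumes each result has the `extracted_skills` dictionary used elsewhere in this notebook.

```python
from collections import Counter


def count_skills(results, category):
    """Count skills in one category, ignoring case and surrounding whitespace."""
    counts = Counter()
    display = {}  # first spelling seen for each normalized key
    for result in results:
        for skill in result.get("extracted_skills", {}).get(category, []):
            key = skill.strip().casefold()
            if not key:
                continue  # skip empty strings
            display.setdefault(key, skill.strip())
            counts[key] += 1
    # Most frequent first, reported under the first spelling encountered
    return [(display[key], n) for key, n in counts.most_common()]


demo_results = [
    {"extracted_skills": {"technical_skills": ["Python", "SQL"]}},
    {"extracted_skills": {"technical_skills": ["python ", "Docker"]}},
    {"extracted_skills": {"technical_skills": ["PYTHON"]}},
]
for skill, count in count_skills(demo_results, "technical_skills"):
    print(f"{count:3d}x  {skill}")
```

This only handles case and whitespace; mapping abbreviations such as "JS" to "JavaScript" would still depend on the LLM prompt or a separate lookup table.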

In [None]:
# --- ANALYSIS AND VISUALIZATION ---
# Analyze and visualize the extracted skills

import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

sns.set_style("whitegrid")

print("📊 EXTRACTION STATISTICS")
print("="*80)
print(f"\nDataset Overview:")
print(f"  • Total jobs processed: {stats['total_jobs_processed']}")
print(f"  • Jobs with errors: {stats['total_errors']}")
print(f"  • Jobs with no skills found: {stats['jobs_with_no_skills']}")
print(f"  • Average skills per job: {stats['average_skills_per_job']:.2f}")

print(f"\nJobs by Source:")
for source, count in stats['jobs_by_source'].items():
    print(f"  • {source}: {count}")

print(f"\nJobs by Language:")
for lang, count in stats['jobs_by_language'].items():
    print(f"  • {lang}: {count}")

print(f"\nTotal Skills by Category:")
for category, count in stats['total_skills_by_category'].items():
    print(f"  • {category.replace('_', ' ').title()}: {count}")

# Collect all skills by category for frequency analysis
all_skills_by_category = {
    'technical_skills': [],
    'soft_skills': [],
    'languages': [],
    'tools': [],
    'certifications': [],
    'other_skills': []
}

for result in extraction_results:
    for category, skills in result['extracted_skills'].items():
        # setdefault() tolerates categories outside the six expected ones
        all_skills_by_category.setdefault(category, []).extend(skills)

# Show top skills in each category
print("\n" + "="*80)
print("🏆 TOP SKILLS BY CATEGORY")
print("="*80)

for category, skills in all_skills_by_category.items():
    if skills:
        skill_counts = Counter(skills)
        top_10 = skill_counts.most_common(10)
        print(f"\n{category.replace('_', ' ').upper()}:")
        for skill, count in top_10:
            print(f"  {count:3d}x  {skill}")

In [None]:
# Create visualizations

# Create a 2x2 grid that holds the four summary plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Distribution of total skills per job
skills_counts = [result['total_skills_count'] for result in extraction_results]
axes[0, 0].hist(skills_counts, bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Distribution of Total Skills per Job', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Number of Skills')
axes[0, 0].set_ylabel('Number of Jobs')
axes[0, 0].axvline(stats['average_skills_per_job'], color='red', linestyle='--', label=f'Average: {stats["average_skills_per_job"]:.1f}')
axes[0, 0].legend()

# Plot 2: Skills by category
categories = list(stats['total_skills_by_category'].keys())
counts = list(stats['total_skills_by_category'].values())
axes[0, 1].barh(categories, counts, color='skyblue', edgecolor='black')
axes[0, 1].set_title('Total Skills by Category', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Number of Skills')
axes[0, 1].set_ylabel('Category')

# Plot 3: Jobs by source
sources = list(stats['jobs_by_source'].keys())
source_counts = list(stats['jobs_by_source'].values())
axes[1, 0].pie(source_counts, labels=sources, autopct='%1.1f%%', startangle=90)
axes[1, 0].set_title('Jobs by Source', fontsize=12, fontweight='bold')

# Plot 4: Jobs by language
languages = list(stats['jobs_by_language'].keys())
lang_counts = list(stats['jobs_by_language'].values())
axes[1, 1].pie(lang_counts, labels=languages, autopct='%1.1f%%', startangle=90)
axes[1, 1].set_title('Jobs by Language', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig(os.path.join(DATA_PATH, 'skill_extraction_analysis.png'), dpi=300, bbox_inches='tight')
plt.show()

print("\n✅ Visualizations created!")
print(f"   Saved to: {os.path.join(DATA_PATH, 'skill_extraction_analysis.png')}")

# 12. Qualitative Evaluation

**Objective:** Manually inspect a sample of extraction results to assess quality.

**Why I need this:**
* **The Issue:** Quantitative metrics don't tell the full story. I need to see if the extracted skills are actually correct and useful.
* **The Fix:** Display sample results for manual inspection.

**What I did:**
Display random samples from the extraction results with full details for quality assessment.
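
Purely random samples change on every run and can all come from the same source. A possible refinement, sketched here under the assumption that each result carries a `source` field as in this notebook, is to draw a fixed number of samples per source with a seeded generator so the same jobs can be re-inspected later. The name `sample_per_source` and its defaults are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_source(results, k=3, seed=42):
    """Pick up to k results from each source using a fixed seed."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    by_source = defaultdict(list)
    for result in results:
        by_source[result.get("source", "unknown")].append(result)
    picked = {}
    for source in sorted(by_source):  # sorted so the draw order is stable
        group = by_source[source]
        picked[source] = rng.sample(group, min(k, len(group)))
    return picked


demo_results = [
    {"job_id": i, "source": "annotated" if i % 2 else "scraped"}
    for i in range(10)
]
for source, jobs in sample_per_source(demo_results, k=2).items():
    print(source, [job["job_id"] for job in jobs])
```

Calling the function twice with the same seed returns the same jobs, which makes it easier to compare notes between review sessions.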

In [None]:
# --- QUALITATIVE EVALUATION ---
# Display sample extraction results for manual quality assessment

import random

print("🔍 QUALITATIVE EVALUATION - Sample Extraction Results")
print("="*80)

# Select up to 3 random samples (fewer if there are fewer results)
samples = random.sample(extraction_results, min(3, len(extraction_results)))

for i, sample in enumerate(samples, 1):
    print(f"\n{'='*80}")
    print(f"SAMPLE {i}")
    print(f"{'='*80}")
    print(f"Job ID: {sample['job_id']}")
    print(f"Source: {sample['source']}")
    print(f"Language: {sample['language']}")
    if sample.get('title'):
        print(f"Title: {sample['title']}")
    if sample.get('company'):
        print(f"Company: {sample['company']}")
    print(f"Number of skill sections found by BERT: {sample['num_skill_sections']}")
    print(f"Total skills extracted: {sample['total_skills_count']}")
    
    print(f"\n📋 Extracted Skills:")
    for category, skills in sample['extracted_skills'].items():
        if skills:
            print(f"\n  {category.replace('_', ' ').upper()}:")
            for skill in sorted(skills):
                print(f"    • {skill}")
    
    if sample['total_skills_count'] == 0:
        print("\n  ⚠️ No skills extracted for this job")

print("\n" + "="*80)
print("\n💡 Quality Assessment Questions:")
print("  1. Are the extracted skills actually mentioned in the job ads?")
print("  2. Are job titles incorrectly classified as skills?")
print("  3. Are skills properly categorized?")
print("  4. Are there obvious skills that were missed?")
print("  5. Are the skill names properly normalized?")

# 13. Save Results to Google Drive

**Objective:** Backup all extraction results and analyses to Google Drive for permanent storage.

**Why I need this:**
* **The Issue:** Google Colab sessions are temporary, so every file on the runtime is lost when the session ends.
* **The Fix:** Copy results to Google Drive for permanent storage and access.

**What I did:**
1. Create a backup folder in Google Drive
2. Copy all result files (JSON, images, logs)
3. Verify successful backup
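
Checking that the source file exists does not prove the copy in Drive is complete. A minimal sketch of a stricter check, assuming that comparing SHA-256 digests of the source and destination files is acceptable for files of this size. The helper names `sha256_of` and `backup_is_intact` are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(source_path, dest_path):
    """True only if both files exist and have identical contents."""
    if not (os.path.isfile(source_path) and os.path.isfile(dest_path)):
        return False
    return sha256_of(source_path) == sha256_of(dest_path)
```

In the backup loop, `backup_is_intact(source_path, dest_path)` could be called right after `shutil.copy` and the file counted as backed up only when it returns `True`.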

In [None]:
# --- SAVE TO GOOGLE DRIVE ---
# Backup all results to Google Drive

import shutil
import os

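# Define Google Drive save path
drive_save_path = '/content/drive/MyDrive/GEN03/skill_extraction_results'

# Mount Drive if needed; otherwise os.makedirs would silently create a local folder
# on the Colab VM and the "backup" would be lost with the session
if not os.path.isdir('/content/drive/MyDrive'):
    from google.colab import drive
    drive.mount('/content/drive')

# Create directory if it doesn't exist
os.makedirs(drive_save_path, exist_ok=True)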

print(f"💾 Backing up results to Google Drive...")
print(f"   Destination: {drive_save_path}\n")

# List of files to backup
files_to_backup = [
    ('extracted_skills.json', 'Main extraction results'),
    ('skill_statistics.json', 'Summary statistics'),
    ('extraction_errors.json', 'Error log'),
    ('skill_extraction_analysis.png', 'Analysis visualization')
]

# Copy files
backed_up = 0
for filename, description in files_to_backup:
    source_path = os.path.join(DATA_PATH, filename)
    dest_path = os.path.join(drive_save_path, filename)
    
    if os.path.exists(source_path):
        shutil.copy(source_path, dest_path)
        file_size = os.path.getsize(source_path) / 1024  # Size in KB
        print(f"✅ {filename}")
        print(f"   ({description}, {file_size:.1f} KB)")
        backed_up += 1
    else:
        print(f"⚠️  {filename} not found (skipped)")

print(f"\n🎉 Backup complete!")
print(f"   {backed_up}/{len(files_to_backup)} files backed up successfully")
print(f"   Location: {drive_save_path}")
print("\n✅ You can now safely close this notebook. All data is saved!")

---
# 14. Summary and Next Steps

**🎉 Task 5 Complete!**

### What We Accomplished:
1. ✅ Loaded the fine-tuned BERT model for zone identification
2. ✅ Implemented two-stage extraction pipeline (BERT + LLM)
3. ✅ Processed both annotated and scraped job advertisements
4. ✅ Extracted and categorized professional skills
5. ✅ Generated comprehensive statistics and visualizations
6. ✅ Saved all results to permanent storage

### Key Outputs:
- **extracted_skills.json** - Complete extraction results for all jobs
- **skill_statistics.json** - Summary statistics and metrics
- **skill_extraction_analysis.png** - Visual analysis of results
- **extraction_errors.json** - Log of any errors encountered

### Methodology:
1. **Zone Identification (BERT)**: Used the fine-tuned multilingual BERT model to identify text sections labeled "Fähigkeiten und Inhalte" (Skills and Content)
2. **Skill Extraction (LLM)**: Used an LLM (OpenAI GPT-4o-mini or Google Gemini 1.5 Flash) with carefully crafted prompts to extract individual skills from the identified sections
3. **Categorization**: Organized skills into 6 categories: technical skills, soft skills, languages, tools, certifications, and other skills
4. **Dual LLM Support**: Implemented support for both the OpenAI and Gemini APIs, so the two models can be compared and the better one chosen
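
The six-category structure can be sketched as a small normalization step applied to whatever dictionary the LLM returns. This is an illustration under assumptions, not the code used above: the name `normalize_categories` is hypothetical, and routing unexpected keys into `other_skills` is one possible policy among several.

```python
CATEGORIES = (
    "technical_skills",
    "soft_skills",
    "languages",
    "tools",
    "certifications",
    "other_skills",
)


def normalize_categories(raw):
    """Return a dict with exactly the six category keys, each a de-duplicated list of strings."""
    cleaned = {category: [] for category in CATEGORIES}
    for key, values in (raw or {}).items():
        # Assumption: skills under an unknown key are kept, but filed under other_skills
        target = key if key in cleaned else "other_skills"
        if isinstance(values, str):
            values = [values]  # a single string instead of a list
        elif not isinstance(values, (list, tuple)):
            values = []  # None or any other unexpected type
        for value in values:
            text = str(value).strip()
            if text and text not in cleaned[target]:
                cleaned[target].append(text)
    return cleaned


raw_llm_output = {
    "technical_skills": ["Python", " Python ", "SQL"],
    "frameworks": ["Django"],  # not one of the six categories
    "languages": "German",     # string instead of list
    "tools": None,
}
skills = normalize_categories(raw_llm_output)
total_skills_count = sum(len(v) for v in skills.values())
```

With every result guaranteed to have the same six keys, the per-category statistics and plots can index the categories directly.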

