# LinkedIn Profile Classification - Rule-Based Baseline

## Capstone Project: Predicting Career Domain and Seniority from LinkedIn Profiles

**Author:** Veedan  
**Institution:** Julius-Maximilians-Universität Würzburg  
**Date:** January 2026

---

### Project Overview

This notebook implements the **Rule-Based Matching baseline** approach for predicting:
1. **Department** (Career Domain) - 11 categories
2. **Seniority Level** - 6 categories, one of which ("Professional") is missing from the training labels

### Critical Data Discovery

**Label Mismatch Issue:** The test data contains a "Professional" seniority level (35% of samples) that does NOT exist in the training label file. It appears to denote mid-level professionals whose titles carry no explicit seniority indicator.

### Approach

Multi-strategy matching:
1. **Exact Match** - Direct lookup from label CSV files
2. **Keyword Matching** - Weighted multilingual keywords
3. **Fuzzy Matching** - Sequence similarity
4. **Intelligent Fallback** - "Professional" for ambiguous cases
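
The cascade above can be sketched in a few lines. This is a minimal illustration, not the classifiers defined later in the notebook: the lookup table, the keyword lists, and the 0.8 similarity cutoff are made-up values, and `cascade_predict` is a hypothetical name.

```python
from difflib import SequenceMatcher

# Toy resources (hypothetical; the notebook builds these from the label CSVs)
EXACT = {"sales manager": "Sales", "software engineer": "Information Technology"}
KEYWORDS = {"Sales": ["sales", "vertrieb"], "Marketing": ["marketing", "werbung"]}

def cascade_predict(title, fallback="Other", fuzzy_cutoff=0.8):
    """Return (label, strategy) using exact -> keyword -> fuzzy -> fallback."""
    text = " ".join(title.lower().split())
    # 1. Exact lookup
    if text in EXACT:
        return EXACT[text], "exact"
    # 2. Keyword match: first label whose keyword occurs in the title
    for label, kws in KEYWORDS.items():
        if any(kw in text for kw in kws):
            return label, "keyword"
    # 3. Fuzzy match against the known titles
    best = max(EXACT, key=lambda known: SequenceMatcher(None, text, known).ratio())
    if SequenceMatcher(None, text, best).ratio() >= fuzzy_cutoff:
        return EXACT[best], "fuzzy"
    # 4. Fallback for everything else
    return fallback, "fallback"

for t in ["Sales Manager", "Leiter Vertrieb", "Softwar Engineer", "Pilot"]:
    print(t, "->", cascade_predict(t))
```

Returning the strategy alongside the label makes it easy to count how often each stage fires, which helps when deciding where tuning effort is best spent.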

### Table of Contents
1. Setup and Data Loading
2. Exploratory Data Analysis
3. Data Preprocessing
4. Rule-Based Classifiers
5. Model Evaluation
6. Error Analysis
7. Results Summary

---
## 1. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score, precision_recall_fscore_support
import re
from difflib import SequenceMatcher
import warnings
from functools import lru_cache
import time

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print('Libraries loaded successfully!')

In [None]:
# Load data
df_seniority = pd.read_csv('seniority-v2.csv')
df_department = pd.read_csv('department-v2.csv')

with open('testdata.txt', 'r', encoding='utf-8') as f:
    test_cvs = json.load(f)

print('=' * 60)
print('DATA LOADING SUMMARY')
print('=' * 60)
print(f'Seniority labels (training): {len(df_seniority):,} entries')
print(f'Department labels (training): {len(df_department):,} entries')
print(f'Test CVs: {len(test_cvs)} profiles')
print(f'\nDepartment categories: {sorted(df_department["label"].unique())}')
print(f'\nSeniority categories (training): {sorted(df_seniority["label"].unique())}')

In [None]:
def extract_active_jobs(cvs):
    """Extract only ACTIVE (current) job positions from CVs."""
    active_jobs = []
    for cv in cvs:
        for job in cv:
            if job.get('status') == 'ACTIVE':
                active_jobs.append(job)
    return active_jobs

test_active_jobs = extract_active_jobs(test_cvs)
df_test = pd.DataFrame(test_active_jobs)

print(f'Active jobs extracted: {len(df_test)}')
print(f'\nSeniority in TEST: {sorted(df_test["seniority"].unique())}')
print(f'Seniority in TRAIN: {sorted(df_seniority["label"].unique())}')
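# Compute the mismatch instead of hard-coding it
missing = sorted(set(df_test['seniority'].dropna()) - set(df_seniority['label']))
print(f'\n⚠️ Seniority labels in TEST but NOT in TRAINING: {missing}')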
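
The train/test label comparison can be turned into a small reusable check that also reports how much of the test set is affected. The label lists below are toy values for illustration; in this notebook they would come from `df_seniority["label"]` and `df_test["seniority"]`. `label_coverage` is a hypothetical helper name.

```python
from collections import Counter

def label_coverage(train_labels, test_labels):
    """Map each test label that never occurs in training to its share of the test set."""
    train_set = set(train_labels)
    counts = Counter(test_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items() if label not in train_set}

# Toy labels (not the real data)
train = ["Junior", "Senior", "Lead", "Management", "Director"]
test = ["Professional", "Senior", "Professional", "Junior"]
print(label_coverage(train, test))  # {'Professional': 0.5}
```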

---
## 2. Exploratory Data Analysis

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Training distributions
dept_train = df_department['label'].value_counts()
axes[0, 0].barh(dept_train.index, dept_train.values, color='steelblue', alpha=0.7)
axes[0, 0].set_title('Department (Training)', fontweight='bold')
axes[0, 0].invert_yaxis()

sen_train = df_seniority['label'].value_counts()
axes[0, 1].barh(sen_train.index, sen_train.values, color='darkorange', alpha=0.7)
axes[0, 1].set_title('Seniority (Training)', fontweight='bold')
axes[0, 1].invert_yaxis()

# Test distributions
dept_test = df_test['department'].value_counts()
axes[1, 0].barh(dept_test.index, dept_test.values, color='steelblue')
axes[1, 0].set_title('Department (Test)', fontweight='bold')
axes[1, 0].invert_yaxis()
for i, v in enumerate(dept_test.values):
    axes[1, 0].text(v + 2, i, f'{v} ({v/len(df_test)*100:.1f}%)', va='center', fontsize=9)

sen_test = df_test['seniority'].value_counts()
colors = ['red' if x == 'Professional' else 'darkorange' for x in sen_test.index]
axes[1, 1].barh(sen_test.index, sen_test.values, color=colors)
axes[1, 1].set_title('Seniority (Test) - Red=Not in Training!', fontweight='bold')
axes[1, 1].invert_yaxis()
for i, v in enumerate(sen_test.values):
    axes[1, 1].text(v + 2, i, f'{v} ({v/len(df_test)*100:.1f}%)', va='center', fontsize=9)

plt.tight_layout()
plt.show()

In [None]:
# Job title analysis
df_test['title_length'] = df_test['position'].apply(lambda x: len(str(x)) if pd.notna(x) else 0)
df_test['title_words'] = df_test['position'].apply(lambda x: len(str(x).split()) if pd.notna(x) else 0)

print('Job Title Statistics:')
print('=' * 40)
print(f'  Average length: {df_test["title_length"].mean():.1f} chars')
print(f'  Average words: {df_test["title_words"].mean():.1f}')

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].hist(df_test['title_length'], bins=30, color='teal', edgecolor='white')
axes[0].set_title('Title Length Distribution', fontweight='bold')
axes[0].set_xlabel('Characters')
axes[1].hist(df_test['title_words'], bins=15, color='purple', edgecolor='white')
axes[1].set_title('Title Word Count Distribution', fontweight='bold')
axes[1].set_xlabel('Words')
plt.tight_layout()
plt.show()

In [None]:
# Sample titles by seniority
print('Sample Job Titles by Seniority:')
print('=' * 70)
for sen in ['Director', 'Management', 'Lead', 'Senior', 'Professional', 'Junior']:
    samples = df_test[df_test['seniority'] == sen]['position'].head(4).tolist()
    count = len(df_test[df_test['seniority'] == sen])
    print(f'\n{sen} (n={count}):')
    for s in samples:
        print(f'  • {str(s)[:60]}')

In [None]:
# Language detection (coarse heuristic: marker words plus accented characters)
def detect_language(text):
    if pd.isna(text): return 'Unknown'
    text = str(text).lower()
    german = ['geschäftsführer', 'leiter', 'mitarbeiter', 'ä', 'ö', 'ü', 'ß']
    french = ['directeur', 'responsable', 'chef de', 'é', 'è', 'ê']
    g_score = sum(1 for p in german if p in text)
    f_score = sum(1 for p in french if p in text)
    if g_score > f_score and g_score > 0: return 'German'
    if f_score > g_score and f_score > 0: return 'French'
    return 'English/Other'

df_test['language'] = df_test['position'].apply(detect_language)
print('Language Distribution:')
print(df_test['language'].value_counts())

### EDA Key Insights

1. **Critical Label Mismatch**: "Professional" makes up 35% of the test data but does NOT appear in training!
2. **Class Imbalance**: "Other" department (55%), "Professional" seniority (35%)
3. **Multilingual**: ~25% German, ~5% French titles (according to the keyword heuristic above)
4. **Short titles**: 3-4 words on average
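
Because of this imbalance, raw accuracy should be read against a majority-class baseline: a classifier that always answers with the most frequent label already scores that label's share of the data. The sketch below computes that baseline on toy labels (the list is illustrative, not the project data, and `majority_baseline` is a hypothetical helper).

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Toy distribution: 11 of 20 samples are "Other"
toy = ["Other"] * 11 + ["Sales"] * 5 + ["Marketing"] * 4
print(majority_baseline(toy))  # ('Other', 0.55)
```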

---
## 3. Data Preprocessing

In [None]:
@lru_cache(maxsize=20000)
def preprocess_text(text):
    """Lowercase, keep only letters (incl. German/French accents), collapse whitespace."""
    if text is None or pd.isna(text):
        return ''
    text = str(text).lower().strip()
    # Digits and punctuation become spaces, so 'IT-Leiter' -> 'it leiter'
    text = re.sub(r'[^a-zäöüßàâçéèêëîïôûùÿñæœ\s]', ' ', text)
    return ' '.join(text.split())

def fast_similarity(s1, s2, threshold=0.75):
    """SequenceMatcher ratio with a cheap length-ratio pre-filter.

    Pairs whose lengths differ too much to plausibly reach `threshold`
    are skipped (score 0.0) without running the full comparison.
    """
    if not s1 or not s2: return 0.0
    len_ratio = min(len(s1), len(s2)) / max(len(s1), len(s2))
    if len_ratio < threshold - 0.2: return 0.0
    return SequenceMatcher(None, s1, s2).ratio()

# Test
for t in ['Senior Engineer', 'Geschäftsführer', 'Chef de projet']:
    print(f'{t} -> {preprocess_text(t)}')
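
The length pre-filter in `fast_similarity` is safe because `SequenceMatcher.ratio()` equals 2·M/T, where M is the number of matched characters and T the combined length of both strings. M cannot exceed the length of the shorter string, so the ratio is bounded by 2r/(1+r) with r = min_len/max_len. A pair can only reach 0.75 if r ≥ 0.6, so cutting at r < 0.55 never discards a pair that would have passed the default threshold. The check below verifies the bound on a few titles; `ratio_upper_bound` is a name made up for this sketch, and `difflib` exposes the same bound as `real_quick_ratio()`.

```python
from difflib import SequenceMatcher

def ratio_upper_bound(s1, s2):
    """Best possible SequenceMatcher ratio given only the two lengths."""
    short, long_ = sorted((len(s1), len(s2)))
    return 2 * short / (short + long_) if long_ else 0.0

pairs = [("senior engineer", "senior software engineer"),
         ("ceo", "chief executive officer"),
         ("teamleiter", "teamleiterin")]
for a, b in pairs:
    actual = SequenceMatcher(None, a, b).ratio()
    bound = ratio_upper_bound(a, b)
    assert actual <= bound + 1e-12  # the bound always holds
    print(f"{a!r} vs {b!r}: ratio={actual:.3f}, bound={bound:.3f}")
```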

---
## 4. Rule-Based Classifiers

In [None]:
class DepartmentClassifier:
    def __init__(self, label_df):
        self.valid_depts = set(label_df['label'].unique())
        self.exact_match = {preprocess_text(r['text']): r['label'] 
                           for _, r in label_df.iterrows() if preprocess_text(r['text'])}
        self.keywords = self._build_keywords()
        self.examples = defaultdict(list)
        for _, r in label_df.iterrows():
            c = preprocess_text(r['text'])
            if c: self.examples[r['label']].append(c)
        print(f'  Exact matches: {len(self.exact_match):,}')
    
    def _build_keywords(self):
        return {
            'Sales': [('sales director', 5), ('sales manager', 5), ('vertriebsleiter', 5),
                     ('sales', 4), ('vertrieb', 4), ('verkauf', 4), ('account manager', 4),
                     ('account executive', 5), ('key account', 4), ('vente', 4), ('commercial', 3)],
            'Marketing': [('marketing manager', 5), ('marketing director', 5), ('marketingleiter', 5),
                         ('digital marketing', 5), ('marketing', 4), ('brand', 3), ('pr ', 4),
                         ('social media', 4), ('seo', 4), ('kommunikation', 3), ('werbung', 4),
                         ('chargé de communication', 5), ('chargée de marketing', 5)],
            'Information Technology': [('software engineer', 5), ('software developer', 5), 
                         ('softwareentwickler', 5), ('data scientist', 5), ('data engineer', 5),
                         ('it manager', 5), ('cto', 5), ('cio', 5), ('solutions architect', 5),
                         ('developer', 4), ('entwickler', 4), ('devops', 5), ('cloud', 4),
                         ('network', 3), ('database', 4), ('sap', 4), ('it ', 4), ('informatik', 4),
                         ('systemadministrator', 5), ('architect', 3), ('technical', 3)],
            'Human Resources': [('human resources', 5), ('hr manager', 5), ('personalleiter', 5),
                         ('hr business partner', 5), ('hrbp', 5), ('recruiter', 5), ('recruiting', 4),
                         ('talent acquisition', 5), ('hr ', 4), ('personal', 3), ('drh', 5)],
            'Project Management': [('project manager', 5), ('projektleiter', 5), ('projektmanager', 5),
                         ('program manager', 5), ('chef de projet', 5), ('scrum master', 5),
                         ('pmo', 5), ('agile coach', 5), ('product owner', 4), ('projektmanagement', 4)],
            'Business Development': [('business development', 5), ('bizdev', 5), ('bd manager', 5),
                         ('partnerships', 4), ('strategic partnerships', 5), ('geschäftsentwicklung', 5)],
            'Customer Support': [('customer service', 5), ('customer support', 5), ('kundenservice', 5),
                         ('customer success', 5), ('helpdesk', 5), ('service client', 5),
                         ('technical support', 5), ('customer care', 4)],
            'Administrative': [('office manager', 5), ('executive assistant', 5), ('secretary', 4),
                         ('administrative', 4), ('verwaltung', 4), ('sachbearbeiter', 3),
                         ('buchhalter', 4), ('buchhalterin', 4), ('assistenz', 3)],
            'Consulting': [('management consultant', 5), ('consultant', 4), ('consulting', 4),
                         ('berater', 4), ('beratung', 4), ('unternehmensberater', 5), ('advisory', 4)],
            'Purchasing': [('purchasing', 5), ('procurement', 5), ('einkauf', 5), ('einkäufer', 5),
                         ('buyer', 4), ('sourcing', 4), ('supply chain', 4), ('achat', 4)],
        }
    
    def predict(self, text):
        if pd.isna(text): return 'Other'
        cleaned = preprocess_text(text)
        if not cleaned: return 'Other'
        
        # Exact match
        if cleaned in self.exact_match:
            return self.exact_match[cleaned]
        
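        # Keyword matching (substring-based, so German compounds still match)
        padded = f' {cleaned} '
        scores = defaultdict(float)
        for dept, kws in self.keywords.items():
            if dept not in self.valid_depts: continue
            for kw, weight in kws:
                # Keywords of up to 3 letters must match as whole words;
                # otherwise 'cto' fires inside 'director' and 'it ' inside 'audit '.
                needle = f' {kw.strip()} ' if len(kw.strip()) <= 3 else kw
                if needle in padded:
                    scores[dept] += weight + len(kw.split()) * 0.3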
        
        if scores:
            best = max(scores, key=scores.get)
            if scores[best] >= 4.0:
                return best
        
        # Fuzzy match
        candidates = list(scores.keys()) if scores else list(self.valid_depts)
        best_sim, best_dept = 0, None
        for dept in candidates:
            for ex in self.examples.get(dept, [])[:40]:
                sim = fast_similarity(cleaned, ex)
                if sim > best_sim:
                    best_sim, best_dept = sim, dept
                    if sim > 0.92: return best_dept
        
        if best_sim > 0.78: return best_dept
        if scores: return max(scores, key=scores.get)
        return 'Other'
    
    def predict_batch(self, texts):
        return [self.predict(t) for t in texts]

print('Initializing Department Classifier...')
dept_clf = DepartmentClassifier(df_department)
print('✓ Ready!')
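
Substring matching is convenient for German compounds (`'leiter'` inside `'vertriebsleiter'`), but short acronyms can fire inside unrelated words: `'cto'` occurs in `'director'` and `'it '` in `'audit manager'`. Requiring whole-word matches for short keywords is one way to guard against this. The sketch below isolates the idea; the helper name `keyword_hit` and the 3-character cutoff are assumptions of this sketch.

```python
def keyword_hit(keyword, text, max_acronym_len=3):
    """Substring match, except that short keywords must match a whole word."""
    core = keyword.strip()
    padded = f" {text} "  # padding lets whole-word checks work at both ends
    if len(core) <= max_acronym_len:
        return f" {core} " in padded
    return keyword in padded

print("cto" in "finance director")               # True  (false positive)
print(keyword_hit("cto", "finance director"))    # False
print(keyword_hit("cto", "cto"))                 # True
print(keyword_hit("it ", "head of it"))          # True
print(keyword_hit("leiter", "vertriebsleiter"))  # True  (compound still matches)
```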

In [None]:
class SeniorityClassifier:
    """Handles 'Professional' label that exists in test but not training."""
    
    HIERARCHY = ['Director', 'Management', 'Lead', 'Senior', 'Professional', 'Junior']
    
    def __init__(self, label_df):
        self.train_levels = set(label_df['label'].unique())
        self.valid_levels = self.train_levels | {'Professional'}
        self.exact_match = {preprocess_text(r['text']): r['label'] 
                           for _, r in label_df.iterrows() if preprocess_text(r['text'])}
        self.patterns = self._build_patterns()
        self.examples = defaultdict(list)
        for _, r in label_df.iterrows():
            c = preprocess_text(r['text'])
            if c: self.examples[r['label']].append(c)
        print(f'  Exact matches: {len(self.exact_match):,}')
        print(f'  Valid levels: {sorted(self.valid_levels)}')
    
    def _build_patterns(self):
        return {
            'Director': [
                ('ceo', 6), ('chief executive', 6), ('geschäftsführer', 6), ('geschäftsführerin', 6),
                ('cto', 6), ('cfo', 6), ('coo', 6), ('cmo', 6), ('cio', 6), ('chief ', 6),
                ('president', 5), ('vice president', 5), ('vp ', 5), ('svp', 6), ('evp', 6),
                ('managing director', 6), ('general manager', 5),
                ('director', 5), ('directeur', 5), ('direktor', 5),
                ('vorstand', 6), ('prokurist', 5), ('prokuristin', 5), ('inhaber', 5),
                ('founder', 5), ('owner', 5), ('partner', 4), ('shareholder', 4), ('member of', 4),
            ],
            'Management': [
                ('head of', 5), ('leiter', 5), ('leiterin', 5), ('leitung', 4),
                ('abteilungsleiter', 5), ('bereichsleiter', 5),
                ('manager', 4), ('managerin', 4),
                ('group manager', 5), ('department head', 5),
                ('responsable', 4), ('supervisor', 4),
            ],
            'Lead': [
                ('team lead', 5), ('teamlead', 5), ('tech lead', 5), ('technical lead', 5),
                ('lead developer', 5), ('lead engineer', 5), ('teamleiter', 5),
                ('lead ', 4), ('group leader', 5), ('team leader', 5),
                ('chef de projet', 4), ('coordinator', 3),
                ('scrum master', 4), ('product owner', 4),
            ],
            'Senior': [
                ('senior', 5), ('sr ', 5),  # 'sr.' is covered by 'sr ': preprocess_text strips the dot
                ('principal', 5), ('staff ', 4),
                ('expert', 4), ('experte', 4),
                ('spezialist', 3), ('specialist', 3), ('architect', 3),
            ],
            'Junior': [
                ('junior', 5), ('jr ', 5),  # 'jr.' is covered by 'jr ': preprocess_text strips the dot
                ('trainee', 5), ('praktikant', 5), ('praktikantin', 5),
                ('intern', 5), ('stagiaire', 5), ('werkstudent', 5),
                ('graduate', 4), ('entry level', 5),
                ('apprentice', 5), ('azubi', 5),
                ('associate', 3), ('assistant', 3), ('analyst', 2),
            ],
            'Professional': [
                ('consultant', 2), ('engineer', 2), ('developer', 2),
                ('administrator', 2), ('generalist', 3),
                ('betriebswirt', 3), ('fachkraft', 3),
            ],
        }
    
    def predict(self, text):
        if pd.isna(text): return 'Professional'
        cleaned = preprocess_text(text)
        if not cleaned: return 'Professional'
        
        # Exact match
        if cleaned in self.exact_match:
            return self.exact_match[cleaned]
        
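        # Pattern matching
        padded = f' {cleaned} '
        scores = {}
        for level in self.HIERARCHY:
            if level not in self.patterns: continue
            # Patterns of up to 3 letters must match as whole words;
            # otherwise 'coo' fires inside 'coordinator' and 'cto' inside 'director'.
            score = sum(w for p, w in self.patterns[level]
                        if (f' {p.strip()} ' if len(p.strip()) <= 3 else p) in padded)
            if score > 0: scores[level] = score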
        
        if scores:
            max_score = max(scores.values())
            for level in self.HIERARCHY:
                if level in scores and scores[level] >= max_score - 1:
                    return level
        
        # Fuzzy match
        best_sim, best_level = 0, None
        for level in self.train_levels:
            for ex in self.examples.get(level, [])[:25]:
                sim = fast_similarity(cleaned, ex)
                if sim > best_sim:
                    best_sim, best_level = sim, level
                    if sim > 0.90: return best_level
        
        if best_sim > 0.75: return best_level
        return 'Professional'
    
    def predict_batch(self, texts):
        return [self.predict(t) for t in texts]

print('\nInitializing Seniority Classifier...')
sen_clf = SeniorityClassifier(df_seniority)
print('✓ Ready!')
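
When several seniority levels score, `predict` walks `HIERARCHY` from the top and returns the first level whose score is within 1 point of the best score. The standalone sketch below replays that rule on hand-written scores; the `resolve` helper and the score dictionaries are illustrative only.

```python
HIERARCHY = ["Director", "Management", "Lead", "Senior", "Professional", "Junior"]

def resolve(scores, tolerance=1):
    """Return the highest-ranked level whose score is within `tolerance` of the max."""
    if not scores:
        return None
    best = max(scores.values())
    for level in HIERARCHY:
        if level in scores and scores[level] >= best - tolerance:
            return level

# 'senior manager': Senior=5 ('senior'), Management=4 ('manager')
print(resolve({"Senior": 5, "Management": 4}))    # Management
# 'senior consultant': Senior=5, Professional=2 -> gap too large
print(resolve({"Senior": 5, "Professional": 2}))  # Senior
```

The tolerance lets a slightly weaker but higher-ranked signal win, so "Senior Manager" resolves to Management rather than Senior.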

In [None]:
# Sanity check
tests = ['CEO', 'Geschäftsführer', 'Head of Marketing', 'Team Lead', 'Senior Engineer',
         'Solutions Architect', 'IT-Systemadministrator', 'Junior Analyst', 'Trainee']

print('Sanity Check:')
print('=' * 75)
print(f'{"Title":<35} {"Department":<22} {"Seniority":<15}')
print('-' * 75)
for t in tests:
    print(f'{t:<35} {dept_clf.predict(t):<22} {sen_clf.predict(t):<15}')

---
## 5. Model Evaluation

In [None]:
print('Generating predictions...')
start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
df_test['pred_dept'] = dept_clf.predict_batch(df_test['position'].tolist())
df_test['pred_sen'] = sen_clf.predict_batch(df_test['position'].tolist())
elapsed = time.perf_counter() - start
print(f'✓ Done in {elapsed:.2f}s ({len(df_test)/max(elapsed, 1e-9):.0f}/sec)')

In [None]:
dept_acc = accuracy_score(df_test['department'], df_test['pred_dept'])
sen_acc = accuracy_score(df_test['seniority'], df_test['pred_sen'])
dept_f1_m = f1_score(df_test['department'], df_test['pred_dept'], average='macro', zero_division=0)
sen_f1_m = f1_score(df_test['seniority'], df_test['pred_sen'], average='macro', zero_division=0)
dept_f1_w = f1_score(df_test['department'], df_test['pred_dept'], average='weighted', zero_division=0)
sen_f1_w = f1_score(df_test['seniority'], df_test['pred_sen'], average='weighted', zero_division=0)

print('\n' + '=' * 65)
print('           RULE-BASED BASELINE RESULTS')
print('=' * 65)
print(f'\n{"Metric":<30} {"Department":>15} {"Seniority":>15}')
print('-' * 65)
print(f'{"Accuracy":<30} {dept_acc*100:>14.2f}% {sen_acc*100:>14.2f}%')
print(f'{"F1 (Macro)":<30} {dept_f1_m:>15.3f} {sen_f1_m:>15.3f}')
print(f'{"F1 (Weighted)":<30} {dept_f1_w:>15.3f} {sen_f1_w:>15.3f}')
print('=' * 65)
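
Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its support, so on imbalanced data the two can diverge sharply. The pure-Python sketch below recomputes both on toy labels to show the effect. It mirrors what `f1_score(..., average='macro')` and `average='weighted'` compute but is not a replacement for scikit-learn, and the labels are made up.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label seen in y_true or y_pred."""
    labels = set(y_true) | set(y_pred)
    out = {}
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        out[lab] = (2 * tp / denom if denom else 0.0, tp + fn)
    return out

y_true = ["Other"] * 8 + ["Sales"] * 2
y_pred = ["Other"] * 10  # always predict the majority class
scores = per_class_f1(y_true, y_pred)
macro = sum(f for f, _ in scores.values()) / len(scores)
weighted = sum(f * n for f, n in scores.values()) / len(y_true)
print(f"macro={macro:.3f} weighted={weighted:.3f}")  # macro=0.444 weighted=0.711
```

A majority-only predictor looks respectable under weighted F1 but poor under macro F1, which is why both are reported above.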

In [None]:
print('\nDEPARTMENT CLASSIFICATION REPORT')
print('=' * 70)
print(classification_report(df_test['department'], df_test['pred_dept'], zero_division=0))

In [None]:
print('\nSENIORITY CLASSIFICATION REPORT')
print('=' * 70)
print(classification_report(df_test['seniority'], df_test['pred_sen'], zero_division=0))

In [None]:
# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

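# Include predicted labels too, so predictions outside the test label set are not silently dropped
dept_labels = sorted(set(df_test['department']) | set(df_test['pred_dept']))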
cm_dept = confusion_matrix(df_test['department'], df_test['pred_dept'], labels=dept_labels)
sns.heatmap(cm_dept, annot=True, fmt='d', cmap='Blues',
            xticklabels=dept_labels, yticklabels=dept_labels, ax=axes[0])
axes[0].set_title(f'Department (Acc: {dept_acc*100:.1f}%)', fontweight='bold')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].tick_params(axis='x', rotation=45)

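# Include predicted labels too, so predictions outside the test label set are not silently dropped
sen_labels = sorted(set(df_test['seniority']) | set(df_test['pred_sen']))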
cm_sen = confusion_matrix(df_test['seniority'], df_test['pred_sen'], labels=sen_labels)
sns.heatmap(cm_sen, annot=True, fmt='d', cmap='Oranges',
            xticklabels=sen_labels, yticklabels=sen_labels, ax=axes[1])
axes[1].set_title(f'Seniority (Acc: {sen_acc*100:.1f}%)', fontweight='bold')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Per-class F1
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

d_p, d_r, d_f1, d_s = precision_recall_fscore_support(df_test['department'], df_test['pred_dept'], 
                                                       labels=dept_labels, zero_division=0)
colors = ['green' if f > 0.5 else 'orange' if f > 0.3 else 'red' for f in d_f1]
axes[0].barh(dept_labels, d_f1, color=colors, alpha=0.7)
axes[0].set_title('Department F1 by Class', fontweight='bold')
axes[0].set_xlim(0, 1)
axes[0].axvline(0.5, color='gray', linestyle='--', alpha=0.5)
for i, (f, s) in enumerate(zip(d_f1, d_s)):
    axes[0].text(f + 0.02, i, f'{f:.2f} (n={s})', va='center', fontsize=9)
axes[0].invert_yaxis()

s_p, s_r, s_f1, s_s = precision_recall_fscore_support(df_test['seniority'], df_test['pred_sen'], 
                                                       labels=sen_labels, zero_division=0)
colors = ['green' if f > 0.5 else 'orange' if f > 0.3 else 'red' for f in s_f1]
axes[1].barh(sen_labels, s_f1, color=colors, alpha=0.7)
axes[1].set_title('Seniority F1 by Class', fontweight='bold')
axes[1].set_xlim(0, 1)
axes[1].axvline(0.5, color='gray', linestyle='--', alpha=0.5)
for i, (f, s) in enumerate(zip(s_f1, s_s)):
    axes[1].text(f + 0.02, i, f'{f:.2f} (n={s})', va='center', fontsize=9)
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

---
## 6. Error Analysis

In [None]:
df_test['dept_ok'] = df_test['department'] == df_test['pred_dept']
df_test['sen_ok'] = df_test['seniority'] == df_test['pred_sen']

print('Error Summary:')
print('=' * 55)
print(f'Department errors: {(~df_test["dept_ok"]).sum()} ({(~df_test["dept_ok"]).mean()*100:.1f}%)')
print(f'Seniority errors: {(~df_test["sen_ok"]).sum()} ({(~df_test["sen_ok"]).mean()*100:.1f}%)')
print(f'Both correct: {(df_test["dept_ok"] & df_test["sen_ok"]).sum()}')

In [None]:
# Top error patterns
dept_err = df_test[~df_test['dept_ok']]
patterns = dept_err.groupby(['department', 'pred_dept']).size().reset_index(name='n')
patterns = patterns.sort_values('n', ascending=False).head(10)

print('\nTop Department Error Patterns:')
print('=' * 60)
print(f'{"Actual":<22} {"Predicted":<22} {"Count":>10}')
print('-' * 60)
for _, r in patterns.iterrows():
    print(f'{r["department"]:<22} {r["pred_dept"]:<22} {r["n"]:>10}')

In [None]:
sen_err = df_test[~df_test['sen_ok']]
patterns = sen_err.groupby(['seniority', 'pred_sen']).size().reset_index(name='n')
patterns = patterns.sort_values('n', ascending=False).head(10)

print('\nTop Seniority Error Patterns:')
print('=' * 55)
print(f'{"Actual":<18} {"Predicted":<18} {"Count":>10}')
print('-' * 55)
for _, r in patterns.iterrows():
    print(f'{r["seniority"]:<18} {r["pred_sen"]:<18} {r["n"]:>10}')

In [None]:
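# Sample errors for the top seniority confusions
# (`patterns` and `sen_err` come from the previous cell, so run that one first)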
print('\nSample Misclassified Titles:')
print('=' * 80)
for _, p in patterns.head(3).iterrows():
    samples = sen_err[(sen_err['seniority']==p['seniority']) & 
                      (sen_err['pred_sen']==p['pred_sen'])]['position'].head(3)
    print(f'\n{p["seniority"]} -> {p["pred_sen"]} (n={p["n"]}):')
    for s in samples:
        print(f'  • {str(s)[:65]}')

In [None]:
# Accuracy by language
lang_acc = df_test.groupby('language').agg({'dept_ok': 'mean', 'sen_ok': 'mean', 'position': 'count'})
lang_acc.columns = ['Dept Acc', 'Sen Acc', 'Count']

print('\nAccuracy by Language:')
print('=' * 55)
print(f'{"Language":<18} {"Dept Acc":>12} {"Sen Acc":>12} {"Count":>10}')
print('-' * 55)
for idx, r in lang_acc.iterrows():
    print(f'{idx:<18} {r["Dept Acc"]*100:>11.1f}% {r["Sen Acc"]*100:>11.1f}% {int(r["Count"]):>10}')

---
## 7. Results Summary

In [None]:
# Save results
cols = ['position', 'organization', 'department', 'pred_dept', 'seniority', 'pred_sen', 'dept_ok', 'sen_ok']
df_test[cols].to_csv('predictions_rule_based.csv', index=False)
print('✓ Saved predictions to predictions_rule_based.csv')

In [None]:
print('''
╔══════════════════════════════════════════════════════════════════╗
║              RULE-BASED BASELINE - FINAL SUMMARY                 ║
╚══════════════════════════════════════════════════════════════════╝
''')
print(f'''
PERFORMANCE
{'─' * 50}
Department:  Accuracy={dept_acc*100:.1f}%, F1(macro)={dept_f1_m:.3f}
Seniority:   Accuracy={sen_acc*100:.1f}%, F1(macro)={sen_f1_m:.3f}

METHODOLOGY
{'─' * 50}
• Multi-strategy: Exact → Keywords → Fuzzy → Fallback
• Multilingual keywords (EN, DE, FR)
• Handles "Professional" (test-only label)
• Hierarchical seniority resolution

KEY FINDINGS
{'─' * 50}
• "Professional" label (35% of test) not in training
• Department "Other" dominates (55%)
• German titles: ~25% of data

NEXT STEPS
{'─' * 50}
1. TF-IDF + Logistic Regression
2. Embedding-based (Sentence-BERT)
3. Fine-tuned transformer
4. Include organization name as feature
''')

---
## Appendix: GenAI Usage

**AI Tools Used:** Claude (Anthropic)

**Uses:**
- Keyword list generation (multilingual)
- Code structure and evaluation pipeline
- Documentation and markdown

**Human Contributions:**
- Problem definition
- Data exploration
- Discovery of "Professional" label issue
- Threshold tuning
- Result interpretation