# Baseline Model - Rule-based Matching

In [8]:
#from google.colab import drive
# drive.mount("/content/drive")

In [9]:
import pandas as pd

In [10]:
import sys
from pathlib import Path

# ensure repo root
%cd /content/drive/MyDrive/Final-Project-snapAddy/src

sys.path.append(str(Path.cwd()))

[Errno 2] No such file or directory: '/content/drive/MyDrive/Final-Project-snapAddy/src'
/Users/luisadosch/Documents/Final-Project-snapAddy/src




In [11]:
#DATA_DIR = Path("/content/sample_data")
#department_path = DATA_DIR / "department-v2.csv"
#seniority_path = DATA_DIR / "seniority-v2.csv"

In [12]:
department_df = pd.read_csv("../data/raw/department-v2.csv")
seniority_df = pd.read_csv("../data/raw/seniority-v2.csv")
jobs_annotated_df = pd.read_csv("../data/processed/jobs_annotated.csv")

In [13]:
from utils.data_preparation_pipeline import data_preparation_pipeline
from utils.eval_utils import evaluate_predictions
from utils.results_utils import add_result, save_results

As the baseline, we implement a rule-based classifier that assigns department and seniority labels by matching job titles against two predefined label lists:

- department_df
- seniority_df

**Rule logic**

Substring matching against predefined label lists is used for the rule-based baseline. Only minimal syntactic normalization is applied, while semantic transformations are deliberately avoided to preserve the baseline’s role as a simple, non-learning reference model.

In [14]:
department_df.head()

Unnamed: 0,text,label
0,Adjoint directeur communication,Marketing
1,Advisor Strategy and Projects,Project Management
2,Beratung & Projekte,Project Management
3,Beratung & Projektmanagement,Project Management
4,Beratung und Projektmanagement kommunale Partner,Project Management


In [15]:
seniority_df.head()

Unnamed: 0,text,label
0,Analyst,Junior
1,Analyste financier,Junior
2,Anwendungstechnischer Mitarbeiter,Junior
3,Application Engineer,Senior
4,Applications Engineer,Senior


In [16]:
jobs_annotated_df.head()

Unnamed: 0,row_id,cv_id,job_index,organization,position,startDate,endDate,status,department,seniority
0,0,0,0,Depot4Design GmbH,Prokurist,2019-08,,ACTIVE,Other,Management
1,1,0,1,Depot4Design GmbH,CFO,2019-07,,ACTIVE,Other,Management
2,2,0,2,Depot4Design GmbH,Betriebswirtin,2019-07,,ACTIVE,Other,Professional
3,3,0,3,Depot4Design GmbH,Prokuristin,2019-07,,ACTIVE,Other,Management
4,4,0,4,Depot4Design GmbH,CFO,2019-07,,ACTIVE,Other,Management


## 1. Data Preparation

### 1.1 Normalization function

For rule-based matching, job titles and label-list entries are normalized before simple substring matching is applied:

- lowercasing
- stripping leading and trailing whitespace and collapsing runs of internal whitespace into a single space

This avoids accidental mismatches caused by differences in letter case or spacing (spaces, tabs, line breaks).

The normalization is meant to remove formatting noise only, without altering meaning or generalizing semantics. No further text processing is applied, which keeps the baseline fully interpretable. The same normalization is applied to the CV titles and to the label-list titles.

In [17]:
def normalize_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower().strip()
    text = " ".join(text.split())
    return text
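
As a quick check of what the normalization does and does not change, the following sketch repeats `normalize_text` from the cell above and applies it to a few made-up inputs. The example strings are illustrative and not taken from the dataset.

```python
def normalize_text(text):
    # Same logic as the notebook cell above, repeated so the example runs on its own.
    if not isinstance(text, str):
        return ""
    text = text.lower().strip()
    return " ".join(text.split())


# Hypothetical inputs chosen to show casing, mixed whitespace, and a missing value.
examples = ["  Senior\tData   Engineer\n", "Kaufmännischer  Leiter", None]
normalized = [normalize_text(t) for t in examples]
print(normalized)
```

Casing and whitespace are unified, umlauts and punctuation are left untouched, and non-string values such as missing titles become an empty string.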

The function is applied directly to the label lists and will be built into the rule-based prediction functions for the dataset.

In [18]:
department_df["text_norm"] = department_df["text"].apply(normalize_text)
seniority_df["text_norm"] = seniority_df["text"].apply(normalize_text)

### 1.2 Prediction Functions

The prediction functions implement the rule-based classifier: a job title receives the label of the first entry in the department or seniority label list whose normalized text occurs as a substring of the normalized title. Titles with no department match are assigned the label ‘Other’. Seniority has no such catch-all label, so unmatched titles explicitly fall back to "Professional". Since this is the most frequent seniority class, using it as the fallback avoids artificially inflating rare classes.

In [19]:
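def predict_department_rule_based(title, department_df):
    """Return the label of the first department entry contained in the title."""
    title_norm = normalize_text(title)

    # Rows are checked in list order, so the first substring match wins.
    for _, row in department_df.iterrows():
        if row["text_norm"] in title_norm:
            return row["label"]

    # Default fallback if nothing matches
    return "Other"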

In [20]:
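def predict_seniority_rule_based(title, seniority_df):
    """Return the label of the first seniority entry contained in the title."""
    title_norm = normalize_text(title)

    # Rows are checked in list order, so the first substring match wins.
    for _, row in seniority_df.iterrows():
        if row["text_norm"] in title_norm:
            return row["label"]

    # Default fallback if nothing matches
    return "Professional"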
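
To make the behaviour of plain substring matching concrete, here is a minimal sketch on a toy label list. The two entries and the helper names `predict_substring` and `predict_whole_word` are hypothetical and not taken from `department-v2.csv` or from the notebook. The whole-word variant is shown only as a point of comparison, not as the method used in this baseline.

```python
import re


def normalize_text(text):
    # Same normalization as in the notebook, repeated for a self-contained example.
    if not isinstance(text, str):
        return ""
    return " ".join(text.lower().split())


# Hypothetical label list as (pattern, label) pairs, already normalized.
toy_labels = [("it", "Information Technology"), ("vertrieb", "Sales")]


def predict_substring(title, labels, default="Other"):
    # Mirrors the baseline rule: the first pattern found anywhere in the title wins.
    title_norm = normalize_text(title)
    for pattern, label in labels:
        if pattern in title_norm:
            return label
    return default


def predict_whole_word(title, labels, default="Other"):
    # Comparison variant: the pattern must not touch other word characters.
    title_norm = normalize_text(title)
    for pattern, label in labels:
        if re.search(rf"(?<!\w){re.escape(pattern)}(?!\w)", title_norm):
            return label
    return default


title = "Regionaler Vertriebsleiter"
substring_pred = predict_substring(title, toy_labels)
whole_word_pred = predict_whole_word(title, toy_labels)
print(substring_pred, "|", whole_word_pred)
```

With these toy entries, the substring rule labels the sales title as Information Technology because "it" occurs inside "leiter", while the whole-word variant finds no match at all and falls back to "Other". Short patterns and list order can therefore dominate the predictions of a substring baseline.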

### 1.3 Rule-based Matching
The rule-based baseline involves no training and is therefore evaluated on the full dataset. In contrast, all learning-based models will be evaluated out-of-sample using a fixed train–test split.

**Only Current-Job**

In [21]:
X_base, y_dep_base, y_sen_base, meta_base = data_preparation_pipeline(
    jobs_annotated_df,
    extension=False,
    annotated=True
)

In [22]:
X_base.head()

0         Solutions Architect
1     Medizintechnik Beratung
2    APL-ansvarig, samordning
3       Kaufmännischer Leiter
4              Lab-Supervisor
Name: text, dtype: object

In [23]:
y_sen_base.head()

0    Professional
1    Professional
2            Lead
3            Lead
4            Lead
Name: seniority, dtype: object

In [24]:
y_dep_base.head()

0    Information Technology
1                Consulting
2            Administrative
3                     Sales
4                     Other
Name: department, dtype: object

**With Extension**

In [25]:
X_ext, y_dep_ext, y_sen_ext, meta_ext = data_preparation_pipeline(
    jobs_annotated_df,
    extension=True,
    annotated=True
)

#### Department Matching

**Only Current Job**

In [26]:
y_dep_pred_base = X_base.apply(
    lambda title: predict_department_rule_based(title, department_df)
)

In [27]:
metrics_dep_base = evaluate_predictions(y_dep_base, y_dep_pred_base)
metrics_dep_base

{'accuracy': 0.6026315789473684, 'macro_f1': 0.44918940911600075}
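
The implementation of `evaluate_predictions` lives in `utils.eval_utils` and is not shown in this notebook. Assuming it reports plain accuracy and the unweighted (macro) average of per-class F1 scores, the two numbers could be computed roughly as in the sketch below. The function names and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    # Share of positions where prediction and ground truth agree.
    pairs = list(zip(y_true, y_pred))
    return sum(t == p for t, p in pairs) / len(pairs)


def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 over all labels seen in truth or prediction.
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: a rare class that is never predicted contributes an F1 of 0.
y_true = ["Professional", "Professional", "Lead", "Management"]
y_pred = ["Professional", "Lead", "Lead", "Lead"]
toy_metrics = {"accuracy": accuracy(y_true, y_pred), "macro_f1": macro_f1(y_true, y_pred)}
print(toy_metrics)
```

Because every class counts equally in the macro average, classes that the rules never hit pull macro F1 well below accuracy, which is the pattern visible in the results here.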

**With Extension**

In [28]:
y_dep_pred_ext = X_ext.apply(
    lambda title: predict_department_rule_based(title, department_df)
)

In [29]:
metrics_dep_ext = evaluate_predictions(y_dep_ext, y_dep_pred_ext)
metrics_dep_ext

{'accuracy': 0.4368421052631579, 'macro_f1': 0.30458751010937224}

#### Seniority Matching

**Only Current Job**

In [30]:
y_sen_pred_base = X_base.apply(
    lambda title: predict_seniority_rule_based(title, seniority_df)
)

In [31]:
metrics_sen_base = evaluate_predictions(y_sen_base, y_sen_pred_base)
metrics_sen_base

{'accuracy': 0.5368421052631579, 'macro_f1': 0.42616200731397247}

**With Extension**

In [34]:
y_sen_pred_ext = X_ext.apply(
    lambda title: predict_seniority_rule_based(title, seniority_df)
)

In [None]:
metrics_sen_ext = evaluate_predictions(y_sen_ext, y_sen_pred_ext)
metrics_sen_ext

{'accuracy': 0.13421052631578947, 'macro_f1': 0.02639751552795031}

In [35]:
def build_error_dfs(X, y_true, y_pred, meta=None, target_name="target", extension=False):
    df = pd.DataFrame(
        {
            "title": X.values,
            "true": pd.Series(y_true).values,
            "pred": pd.Series(y_pred).values,
        }
    )
    df["correct"] = df["true"].eq(df["pred"])
    df["extension"] = extension
    df["target"] = target_name

    if meta is not None:
        meta_df = meta.reset_index(drop=True).copy()
        df = pd.concat([meta_df, df.reset_index(drop=True)], axis=1)

    # per-class accuracy (where it's good/bad)
    by_true = (
        df.groupby(["target", "extension", "true"], dropna=False)["correct"]
        .agg(n="size", accuracy="mean")
        .reset_index()
        .sort_values(["target", "extension", "accuracy", "n"], ascending=[True, True, True, False])
    )

    # confusion pairs (where it's bad)
    confusion = (
        df.loc[~df["correct"]]
        .groupby(["target", "extension", "true", "pred"], dropna=False)
        .size()
        .reset_index(name="count")
        .sort_values(["target", "extension", "count"], ascending=[True, True, False])
    )

    # examples to inspect
    best_examples = df.loc[df["correct"]].head(50).copy()
    worst_examples = df.loc[~df["correct"]].head(50).copy()

    return df, by_true, confusion, best_examples, worst_examples


# Department
dep_base_df, dep_base_by_true, dep_base_conf, dep_base_best, dep_base_worst = build_error_dfs(
    X_base, y_dep_base, y_dep_pred_base, meta=meta_base, target_name="department", extension=False
)
dep_ext_df, dep_ext_by_true, dep_ext_conf, dep_ext_best, dep_ext_worst = build_error_dfs(
    X_ext, y_dep_ext, y_dep_pred_ext, meta=meta_ext, target_name="department", extension=True
)

# Seniority
sen_base_df, sen_base_by_true, sen_base_conf, sen_base_best, sen_base_worst = build_error_dfs(
    X_base, y_sen_base, y_sen_pred_base, meta=meta_base, target_name="seniority", extension=False
)
sen_ext_df, sen_ext_by_true, sen_ext_conf, sen_ext_best, sen_ext_worst = build_error_dfs(
    X_ext, y_sen_ext, y_sen_pred_ext, meta=meta_ext, target_name="seniority", extension=True
)

# One combined df if you want everything in one place
all_preds_df = pd.concat([dep_base_df, dep_ext_df, sen_base_df, sen_ext_df], ignore_index=True)

# Summaries (good vs bad areas)
all_by_true_df = pd.concat(
    [dep_base_by_true, dep_ext_by_true, sen_base_by_true, sen_ext_by_true], ignore_index=True
)
all_confusions_df = pd.concat(
    [dep_base_conf, dep_ext_conf, sen_base_conf, sen_ext_conf], ignore_index=True
)

display(all_by_true_df.head(50))
display(all_confusions_df.head(50))
display(all_preds_df.sample(50, random_state=0))

Unnamed: 0,target,extension,true,n,accuracy
0,department,False,Human Resources,13,0.0
1,department,False,Customer Support,4,0.0
2,department,False,Administrative,7,0.142857
3,department,False,Information Technology,50,0.34
4,department,False,Project Management,26,0.384615
5,department,False,Marketing,15,0.4
6,department,False,Business Development,12,0.416667
7,department,False,Purchasing,10,0.5
8,department,False,Sales,31,0.548387
9,department,False,Consulting,27,0.666667


Unnamed: 0,target,extension,true,pred,count
0,department,False,Other,Information Technology,34
1,department,False,Information Technology,Other,30
2,department,False,Project Management,Other,10
3,department,False,Business Development,Other,7
4,department,False,Consulting,Other,7
5,department,False,Human Resources,Other,7
6,department,False,Sales,Other,7
7,department,False,Human Resources,Information Technology,6
8,department,False,Project Management,Information Technology,6
9,department,False,Marketing,Information Technology,5


Unnamed: 0,cv_id,title,true,pred,correct,extension,target
1226,142,Managing Dircector,Management,Management,True,True,seniority
376,603,Owner - CSO,Sales,Other,False,False,department
9,14,energy and sustainability consultant bei,Consulting,Consulting,True,False,department
308,493,Director Of Food And Beverage,Other,Other,True,False,department
299,476,"Abteilungsleiter Vertrieb, Dipl. Bankbetriebsw...",Sales,Information Technology,False,False,department
483,166,Physiotherapeutin,Other,Other,True,True,department
317,510,Regionaler Vertriebsleiter,Sales,Information Technology,False,False,department
721,550,DevSecOps CoE Lead | Solution Architect | Tool...,Information Technology,Information Technology,True,True,department
670,459,Special Education Advocate/Education Consultan...,Consulting,Consulting,True,True,department
644,415,Senior Manager B2B Contract & Implementation M...,Project Management,Sales,False,True,department


For the rule-based baseline, incorporating historical job titles decreases both accuracy and macro F1-score: department accuracy falls from 0.60 to 0.44 (macro F1 from 0.45 to 0.30), and seniority accuracy falls from 0.54 to 0.13 (macro F1 from 0.43 to 0.03). This suggests that simple keyword matching is sensitive to the additional, potentially conflicting signals introduced by past roles. The result highlights the limitations of non-learning methods in exploiting richer contextual information and motivates the use of learning-based models for handling extended inputs.
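
One way such a conflict can arise is sketched below. It assumes, purely for illustration, that the extension appends past job titles to the current title as one string; the actual format produced by `data_preparation_pipeline` is not shown here. The label list and the helper `first_match` are hypothetical as well.

```python
def first_match(text, label_list, default="Other"):
    # First-match-wins substring rule, as in the baseline prediction functions.
    text_norm = " ".join(text.lower().split())
    for pattern, label in label_list:
        if pattern in text_norm:
            return label
    return default


# Hypothetical label list; the order matters because the first hit is returned.
toy_department_list = [("developer", "Information Technology"), ("sales", "Sales")]

current_only = "Head of Sales"
with_history = "Head of Sales | Junior Developer"  # assumed concatenation format

pred_current = first_match(current_only, toy_department_list)
pred_history = first_match(with_history, toy_department_list)
print(pred_current, "->", pred_history)
```

In this toy case the current title alone is labelled Sales, but once the past role is appended, the earlier list entry "developer" matches first and the prediction flips to Information Technology. The rule has no notion of which part of the text describes the current job.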

In [None]:
results = []

add_result(
    results,
    model_name="Rule-based",
    target="department",
    extension=False,
    metrics=metrics_dep_base
)

add_result(
    results,
    model_name="Rule-based",
    target="department",
    extension=True,
    metrics=metrics_dep_ext
)

add_result(
    results,
    model_name="Rule-based",
    target="seniority",
    extension=False,
    metrics=metrics_sen_base
)

add_result(
    results,
    model_name="Rule-based",
    target="seniority",
    extension=True,
    metrics=metrics_sen_ext
)

save_results(results)

In [None]:
results_df_baseline = pd.DataFrame(results)
results_df_baseline

Unnamed: 0,model,target,extension,accuracy,macro_f1
0,Rule-based,department,False,0.602632,0.449189
1,Rule-based,department,True,0.436842,0.304588
2,Rule-based,seniority,False,0.536842,0.426162
3,Rule-based,seniority,True,0.134211,0.026398


In [None]:
import os
print(os.getcwd())

/content/drive/MyDrive/Final-Project-snapAddy/src


In [None]:
!ls ..

archive  data  data_prep_eda.ipynb  results_comparison.ipynb  src


In [None]:
!ls ../data

all_results.csv     jobs_not_annotated.csv	     seniority-v2.csv
department-v2.csv   linkedin-cvs-annotated.json
jobs_annotated.csv  linkedin-cvs-not-annotated.json
