# Baseline Model - Rule-based Matching

In [11]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
import pandas as pd

In [13]:
import sys
from pathlib import Path

# ensure repo root
%cd /content/drive/MyDrive/Final-Project-snapAddy/src

sys.path.append(str(Path.cwd()))

/content/drive/MyDrive/Final-Project-snapAddy/src


In [17]:
DATA_DIR = Path("/content/sample_data")
department_path = DATA_DIR / "department-v2.csv"
seniority_path = DATA_DIR / "seniority-v2.csv"

In [18]:
department_df = pd.read_csv(department_path)
seniority_df = pd.read_csv(seniority_path)
jobs_annotated_df = pd.read_csv("../data/jobs_annotated.csv")

In [37]:
from utils.data_preparation_pipeline import data_preparation_pipeline
from utils.eval_utils import evaluate_predictions

As the baseline, we implement a rule-based classifier that assigns department and seniority labels by matching job titles against two predefined label lists:

- `department_df`
- `seniority_df`

**Rule logic**

Substring matching against predefined label lists is used for the rule-based baseline. Only minimal syntactic normalization is applied, while semantic transformations are deliberately avoided to preserve the baseline’s role as a simple, non-learning reference model.

In [19]:
department_df.head()

Unnamed: 0,text,label
0,Adjoint directeur communication,Marketing
1,Advisor Strategy and Projects,Project Management
2,Beratung & Projekte,Project Management
3,Beratung & Projektmanagement,Project Management
4,Beratung und Projektmanagement kommunale Partner,Project Management


In [20]:
seniority_df.head()

Unnamed: 0,text,label
0,Analyst,Junior
1,Analyste financier,Junior
2,Anwendungstechnischer Mitarbeiter,Junior
3,Application Engineer,Senior
4,Applications Engineer,Senior


In [21]:
jobs_annotated_df.head()

Unnamed: 0,cv_id,job_index,organization,position,startDate,endDate,status,department,seniority
0,0,0,Depot4Design GmbH,Prokurist,2019-08,,ACTIVE,Other,Management
1,0,1,Depot4Design GmbH,CFO,2019-07,,ACTIVE,Other,Management
2,0,2,Depot4Design GmbH,Betriebswirtin,2019-07,,ACTIVE,Other,Professional
3,0,3,Depot4Design GmbH,Prokuristin,2019-07,,ACTIVE,Other,Management
4,0,4,Depot4Design GmbH,CFO,2019-07,,ACTIVE,Other,Management


## 1. Data Preparation

### 1.1 Normalization function

For rule-based matching, job titles and label-list entries are normalized before simple substring matching is applied:

- lowercasing
- stripping leading and trailing whitespace and collapsing runs of internal whitespace into a single space

This avoids accidental mismatches caused by letter case or spacing (spaces, tabs, line breaks).

We want to make sure the normalization does not alter meaning or generalize semantics, but simply removes formatting noise. Therefore, no further text processing is applied in order to keep the baseline fully interpretable. We apply the same normalization to CV titles and the label-list titles.

In [22]:
def normalize_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower().strip()
    text = " ".join(text.split())
    return text
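
As a quick check of what this normalization changes and what it leaves alone, the sketch below repeats the function and runs it on a few made-up titles. The inputs are illustrative and not taken from the dataset.

```python
def normalize_text(text):
    # Same logic as the notebook cell above: non-strings become "".
    if not isinstance(text, str):
        return ""
    text = text.lower().strip()
    text = " ".join(text.split())  # collapses spaces, tabs and line breaks
    return text


# Hypothetical inputs covering mixed whitespace, a missing value and an umlaut.
examples = ["  Senior   Data\tScientist\n", "CFO", None, "Kaufmännischer Leiter"]
normalized = [normalize_text(t) for t in examples]
print(normalized)
```

Only case and whitespace change; characters such as umlauts and the words themselves are left as they are.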

The function is applied directly to the label lists and is called inside the rule-based prediction functions for the job titles in the dataset.

In [23]:
department_df["text_norm"] = department_df["text"].apply(normalize_text)
seniority_df["text_norm"] = seniority_df["text"].apply(normalize_text)

### 1.2 Prediction Functions

The prediction functions match each normalized job title against the department and seniority label lists using substring rules and return the label of the first matching entry. Titles without a department match are assigned the label "Other". The seniority labels have no such "Other" class, so unmatched titles default to "Professional". As the most frequent seniority class, it serves as the fallback label to avoid artificially inflating rare classes.

In [24]:
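def predict_department_rule_based(title, department_df):
    """Return the label of the first department entry contained in the title."""
    title_norm = normalize_text(title)

    # Rows are checked in list order; the first substring match wins.
    for _, row in department_df.iterrows():
        if row["text_norm"] in title_norm:
            return row["label"]

    # Default fallback if nothing matches
    return "Other"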

In [25]:
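def predict_seniority_rule_based(title, seniority_df):
    """Return the label of the first seniority entry contained in the title."""
    title_norm = normalize_text(title)

    # Rows are checked in list order; the first substring match wins.
    for _, row in seniority_df.iterrows():
        if row["text_norm"] in title_norm:
            return row["label"]

    # Default fallback if nothing matches
    return "Professional"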
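
Because both functions return on the first substring hit, the order of the label list matters and short entries can match inside longer words. The sketch below illustrates this with a plain list of `(text_norm, label)` pairs standing in for the DataFrame rows. The helper `predict_rule_based` and the toy entries are hypothetical and not taken from `seniority-v2.csv`.

```python
def normalize_text(text):
    if not isinstance(text, str):
        return ""
    return " ".join(text.lower().strip().split())


def predict_rule_based(title, label_rows, default):
    """Return the label of the first entry whose text occurs in the title."""
    title_norm = normalize_text(title)
    for text_norm, label in label_rows:
        if text_norm in title_norm:
            return label
    return default


# Hypothetical, already-normalized label list (order matters).
toy_seniority = [
    ("engineer", "Professional"),
    ("senior engineer", "Senior"),
    ("intern", "Junior"),
]

results = {
    title: predict_rule_based(title, toy_seniority, default="Professional")
    for title in ["Senior Engineer", "Internal Auditor", "Head of Sales"]
}
print(results)
```

In this toy setup "Senior Engineer" receives the label of the shorter entry that appears earlier in the list, "Internal Auditor" matches "intern" inside "internal", and "Head of Sales" falls through to the default. These are expected properties of plain substring matching and part of what the baseline is meant to expose.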

### 1.3 Rule-based Matching
The rule-based matching baseline does not involve training and is therefore evaluated on the full dataset. In contrast, all learning-based models will be evaluated out-of-sample using a fixed train–test split.

**Only Current Job**

In [31]:
X_base, y_dep_base, y_sen_base, meta_base = data_preparation_pipeline(
    jobs_annotated_df,
    extension=False,
    annotated=True
)

In [30]:
X_base.head()

Unnamed: 0,text
0,Solutions Architect
1,Medizintechnik Beratung
2,"APL-ansvarig, samordning"
3,Kaufmännischer Leiter
4,Lab-Supervisor


In [29]:
y_sen_base.head()

Unnamed: 0,seniority
0,Professional
1,Professional
2,Lead
3,Lead
4,Lead


In [28]:
y_dep_base.head()

Unnamed: 0,department
0,Information Technology
1,Consulting
2,Administrative
3,Sales
4,Other


**With Extension**

In [32]:
X_ext, y_dep_ext, y_sen_ext, meta_ext = data_preparation_pipeline(
    jobs_annotated_df,
    extension=True,
    annotated=True
)

#### Department Matching

**Only Current Job**

In [33]:
y_dep_pred_base = X_base.apply(
    lambda title: predict_department_rule_based(title, department_df)
)

In [36]:
metrics_dep_base = evaluate_predictions(y_dep_base, y_dep_pred_base)
metrics_dep_base

{'accuracy': 0.6026315789473684, 'macro_f1': 0.44918940911600075}
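
`evaluate_predictions` is imported from `utils.eval_utils` and its body is not shown in this notebook. To make the two reported numbers concrete, here is a minimal sketch under the assumption that accuracy is the share of exact label matches and macro F1 is the unweighted mean of the per-class F1 scores over all classes seen in the true or predicted labels. The function name `accuracy_and_macro_f1` and the toy labels are hypothetical.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Compute accuracy and unweighted per-class mean F1 (assumed definitions)."""
    y_true, y_pred = list(y_true), list(y_pred)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


# Toy labels: the rare class "Marketing" is never predicted.
toy_true = ["Sales", "Sales", "Other", "Marketing"]
toy_pred = ["Sales", "Other", "Other", "Other"]
toy_metrics = accuracy_and_macro_f1(toy_true, toy_pred)
print(toy_metrics)
```

Under these definitions a class that is never predicted contributes an F1 of zero, which pulls macro F1 below accuracy. That is one plausible reason the macro F1 values reported here sit below the corresponding accuracies.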

**With Extension**

In [38]:
y_dep_pred_ext = X_ext.apply(
    lambda title: predict_department_rule_based(title, department_df)
)

In [39]:
metrics_dep_ext = evaluate_predictions(y_dep_ext, y_dep_pred_ext)
metrics_dep_ext

{'accuracy': 0.4368421052631579, 'macro_f1': 0.30458751010937224}

#### Seniority Matching

**Only Current Job**

In [40]:
y_sen_pred_base = X_base.apply(
    lambda title: predict_seniority_rule_based(title, seniority_df)
)

In [41]:
metrics_sen_base = evaluate_predictions(y_sen_base, y_sen_pred_base)
metrics_sen_base

{'accuracy': 0.5368421052631579, 'macro_f1': 0.42616200731397247}

**With Extension**

In [42]:
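# Match against the seniority label list. The output recorded below came from
# a run that passed department_df here and needs to be regenerated.
y_sen_pred_ext = X_ext.apply(
    lambda title: predict_seniority_rule_based(title, seniority_df)
)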

In [43]:
metrics_sen_ext = evaluate_predictions(y_sen_ext, y_sen_pred_ext)
metrics_sen_ext

{'accuracy': 0.13421052631578947, 'macro_f1': 0.02639751552795031}
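
A macro F1 this close to zero can arise when predictions fall outside the target label set, for example when the wrong label list is passed to the prediction function. One possible sanity check is sketched below. The helper `labels_outside_target` and the toy values are hypothetical.

```python
def labels_outside_target(y_true, y_pred):
    """Return predicted labels that never occur among the true labels."""
    return sorted(set(y_pred) - set(y_true))


# Toy example: department labels leaking into seniority predictions.
toy_true = ["Professional", "Lead", "Senior", "Management"]
toy_pred = ["Professional", "Marketing", "Project Management", "Professional"]
unexpected = labels_outside_target(toy_true, toy_pred)
print(unexpected)
```

A non-empty result means some predictions can never be scored as correct, so the metrics should not be interpreted until the cause is found.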

For department, incorporating historical job titles lowers accuracy from 0.60 to 0.44 and macro F1 from 0.45 to 0.30. This suggests that simple keyword matching is sensitive to additional, potentially conflicting signals introduced by past roles. The seniority run with extension (accuracy 0.13, macro F1 0.03) was produced by a call that passed `department_df` rather than `seniority_df` as the label list, so it should be re-run before any conclusion is drawn from it. Overall, adding historical context alone does not help a simple rule-based system and can even degrade performance, which highlights the limitations of non-learning methods in exploiting richer contextual information and motivates the use of learning-based models for handling extended inputs.