# Baseline Model - Rule-based Matching

In [11]:
import pandas as pd

In [1]:
from google.colab import auth
auth.authenticate_user()

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
%cd /content/drive/MyDrive/Final-Project-snapAddy

/content/drive/MyDrive/Final-Project-snapAddy


In [4]:
import sys
from pathlib import Path
sys.path.append(str(Path.cwd() / "src"))

In [5]:
!ls

archive  data  data_prep_eda.ipynb  results_comparison.ipynb  src


In [6]:
from utils.data_preparation_pipeline import data_preparation_pipeline
from utils.eval_utils import evaluate_predictions

In [6]:
%cd /content/drive/MyDrive/Final-Project-snapAddy/data

/content/drive/MyDrive/Final-Project-snapAddy/data


In [9]:
DATA_DIR = Path("/content/sample_data")
department_path = DATA_DIR / "department-v2.csv"
seniority_path = DATA_DIR / "seniority-v2.csv"

In [12]:
department_df = pd.read_csv(department_path)
seniority_df = pd.read_csv(seniority_path)
jobs_annotated_df = pd.read_csv("jobs_annotated.csv")

As the baseline, we implement a rule-based classifier that assigns department and seniority labels by matching job titles against two predefined label lists:

- department_df
- seniority_df

**Rule logic**

Substring matching against predefined label lists is used for the rule-based baseline. Only minimal syntactic normalization is applied, while semantic transformations are deliberately avoided to preserve the baseline’s role as a simple, non-learning reference model.

In [13]:
department_df.head()

Unnamed: 0,text,label
0,Adjoint directeur communication,Marketing
1,Advisor Strategy and Projects,Project Management
2,Beratung & Projekte,Project Management
3,Beratung & Projektmanagement,Project Management
4,Beratung und Projektmanagement kommunale Partner,Project Management


In [14]:
seniority_df.head()

Unnamed: 0,text,label
0,Analyst,Junior
1,Analyste financier,Junior
2,Anwendungstechnischer Mitarbeiter,Junior
3,Application Engineer,Senior
4,Applications Engineer,Senior


In [15]:
jobs_annotated_df.head()

Unnamed: 0,cv_id,job_index,organization,position,startDate,endDate,status,department,seniority
0,0,0,Depot4Design GmbH,Prokurist,2019-08,,ACTIVE,Other,Management
1,0,1,Depot4Design GmbH,CFO,2019-07,,ACTIVE,Other,Management
2,0,2,Depot4Design GmbH,Betriebswirtin,2019-07,,ACTIVE,Other,Professional
3,0,3,Depot4Design GmbH,Prokuristin,2019-07,,ACTIVE,Other,Management
4,0,4,Depot4Design GmbH,CFO,2019-07,,ACTIVE,Other,Management


## 1. Data Preparation
### 1.1 Normalization function
For rule-based matching, job titles and label-list entries are normalized before simple substring matching is applied. The normalization consists of:

- lowercasing
- stripping leading and trailing whitespace and collapsing multiple internal spaces

This avoids accidental mismatches caused by differences in letter case or spacing (spaces, tabs, line breaks).

The normalization is meant to remove formatting noise only; it must not alter meaning or generalize semantics. No further text processing is applied, which keeps the baseline fully interpretable. The same normalization is applied to the CV job titles and to the label-list entries.

In [16]:
def normalize_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower().strip()
    text = " ".join(text.split())
    return text
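
As a quick sanity check, the sketch below repeats `normalize_text` so that it runs on its own and applies it to a few inputs. The example strings are made up for illustration and are not taken from the dataset. Only casing and whitespace change; the wording, including umlauts and punctuation, is left intact, and non-string values become an empty string.

```python
def normalize_text(text):
    # Same logic as the cell above, repeated so this snippet is self-contained
    if not isinstance(text, str):
        return ""
    text = text.lower().strip()
    return " ".join(text.split())

# Illustrative inputs (made up for this example, not from the dataset)
examples = ["  Senior   Project\tManager ", "Geschäftsführer\n", None, "Beratung & Projekte"]
normalized = [normalize_text(t) for t in examples]
print(normalized)
# ['senior project manager', 'geschäftsführer', '', 'beratung & projekte']
```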

The function is applied directly to the label lists here and is built into the rule-based prediction functions that are applied to the dataset.

In [17]:
department_df["text_norm"] = department_df["text"].apply(normalize_text)
seniority_df["text_norm"] = seniority_df["text"].apply(normalize_text)

### 1.2 Prediction Functions

The prediction functions implement the rule-based classifier: a job title receives the label of the first entry in the department or seniority label list whose normalized text occurs as a substring of the normalized title. Titles that match no department entry are assigned the label "Other". The seniority labels contain no such catch-all class, so unmatched titles default to "Professional". Since this is the most frequent seniority class, using it as the fallback avoids artificially inflating rare classes.
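
The choice of fallback can be checked by counting label frequencies. The sketch below does this on a small toy list; the values are hypothetical and only stand in for the real seniority column. It also shows why the majority class is a natural default: predicting it for every title yields an accuracy equal to that class's share.

```python
from collections import Counter

# Toy labels standing in for jobs_annotated_df["seniority"] (hypothetical values)
toy_seniority = ["Professional", "Management", "Professional", "Junior", "Professional", "Senior"]

counts = Counter(toy_seniority)
fallback_label, fallback_count = counts.most_common(1)[0]

# Share of the majority class = accuracy of always predicting the fallback
share = fallback_count / len(toy_seniority)
print(fallback_label, share)

# In the notebook, the equivalent check on the real data would be:
# jobs_annotated_df["seniority"].value_counts(normalize=True)
```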

In [18]:
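def predict_department_rule_based(title, department_df):
    """Return the label of the first department entry found in the title."""
    title_norm = normalize_text(title)

    for _, row in department_df.iterrows():
        # Skip empty entries: "" is a substring of every string
        if row["text_norm"] and row["text_norm"] in title_norm:
            return row["label"]

    # Default fallback if nothing matches
    return "Other"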

In [19]:
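def predict_seniority_rule_based(title, seniority_df):
    """Return the label of the first seniority entry found in the title."""
    title_norm = normalize_text(title)

    for _, row in seniority_df.iterrows():
        # Skip empty entries: "" is a substring of every string
        if row["text_norm"] and row["text_norm"] in title_norm:
            return row["label"]

    # Default fallback if nothing matches
    return "Professional"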
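
Because both functions return on the first hit, the result depends on the order of the label list, and short entries can match inside longer words. The sketch below illustrates both effects on plain Python tuples. The helper `first_match` and the three rules are hypothetical and made up for this example; the length-based ordering at the end is one possible variant, not something the baseline above does.

```python
def normalize_text(text):
    # Same normalization as in section 1.1, repeated for self-containment
    if not isinstance(text, str):
        return ""
    return " ".join(text.lower().split())

def first_match(title, rules, default):
    """Return the label of the first rule whose text occurs in the title."""
    title_norm = normalize_text(title)
    for text, label in rules:
        text_norm = normalize_text(text)
        if text_norm and text_norm in title_norm:
            return label
    return default

# Hypothetical rules, made up for illustration
rules = [("Engineer", "Professional"), ("Senior Engineer", "Senior"), ("Intern", "Junior")]

# Order dependence: "engineer" is checked first and wins over "senior engineer"
in_list_order = first_match("Senior Engineer", rules, "Professional")

# Substring pitfall: "intern" also occurs inside "international"
inside_word = first_match("International Sales Lead", rules, "Professional")

# Variant (an assumption, not the notebook's behavior): try longer entries first
rules_by_length = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
longest_first = first_match("Senior Engineer", rules_by_length, "Professional")

print(in_list_order, inside_word, longest_first)
# Professional Junior Senior
```

Keeping the plain first-match rule is consistent with the goal of a simple reference model; the effects shown here are worth keeping in mind when interpreting its errors.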

### 1.3 Rule-based Matching

The rule-based baseline involves no training and is therefore evaluated on the full dataset. In contrast, all learning-based models are evaluated out-of-sample using a fixed train-test split.

In [30]:
%cd /content/drive/MyDrive/Final-Project-snapAddy/src/utils

/content/drive/MyDrive/Final-Project-snapAddy/src/utils


In [26]:
from utils.data_preparation_pipeline import data_preparation_pipeline

In [32]:
import os
print("CWD:", os.getcwd())
print("Files here:", os.listdir("."))

CWD: /content/drive/MyDrive/Final-Project-snapAddy/src/utils
Files here: ['split_utils.py', 'data_preparation_pipeline.py', 'runmodel_utils.py', 'eval_utils.py', 'data_utils.py', '__pycache__']


In [34]:
X, y_dep, y_sen, meta = data_preparation_pipeline(
    jobs_annotated_df,
    extension=False,
    annotated=True
)

NameError: name 'pd' is not defined

----

In [None]:
X, y_dep, y_sen, meta = data_preparation_pipeline(
    jobs_annotated_df,
    extension=False,
    annotated=True
)

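# Assumes X is a pandas Series of job-title strings, as X.apply(...) implies
X_norm = X.apply(normalize_text)

# Predict both targets; the prediction functions normalize each title internally
y_dep_pred = X.apply(lambda title: predict_department_rule_based(title, department_df))
y_sen_pred = X.apply(lambda title: predict_seniority_rule_based(title, seniority_df))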

In [None]:
from utils.eval_utils import evaluate_predictions

dep_metrics = evaluate_predictions(y_dep, y_dep_pred)
sen_metrics = evaluate_predictions(y_sen, y_sen_pred)

dep_metrics, sen_metrics
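
The internals of `evaluate_predictions` are not shown in this notebook. Judging by the keys used when the results are stored, it presumably reports accuracy and macro-averaged F1. The sketch below is a plain-Python illustration of these two metrics under that assumption, not the project's implementation; the function names and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    # Fraction of titles whose predicted label equals the annotated label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 over all labels seen in either list
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels (hypothetical): one "Marketing" title falls back to "Other"
y_true = ["Other", "Marketing", "Other", "Project Management"]
y_pred = ["Other", "Other", "Other", "Project Management"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(round(acc, 3), round(f1, 3))
```

Macro-F1 weights every class equally, so a fallback label that absorbs many unmatched titles lowers it more visibly than it lowers accuracy.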

In [None]:
# Store the department metrics computed by evaluate_predictions above
result = {
    "model": "Rule-based",
    "extension": False,
    "target": "department",
    "accuracy": dep_metrics["accuracy"],
    "macro_f1": dep_metrics["macro_f1"]
}

In [None]:
import pandas as pd
from pathlib import Path

results_path = Path("../results")
results_path.mkdir(exist_ok=True)

pd.DataFrame([result]).to_csv(
    results_path / "baseline_rule_based.csv",
    index=False
)