<a href="https://colab.research.google.com/github/luisadosch/Final-Project-snapAddy/blob/main/model6_Bag_of_Words_TF%E2%80%93IDF_%2B_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. GitHub Access Details

In [174]:
# GitHub access details
import pandas as pd

GH_USER = "luisadosch"
GH_REPO = "Final-Project-snapAddy"
BRANCH = "main"

def get_github_url(relative_path):
    return f"https://raw.githubusercontent.com/{GH_USER}/{GH_REPO}/{BRANCH}/{relative_path}"


jobs_annotated_active_df = pd.read_csv(get_github_url("data/processed/jobs_annotated_active.csv"))

department_df = pd.read_csv(get_github_url("data/raw/department-v2.csv"))

seniority_df = pd.read_csv(get_github_url("data/raw/seniority-v2.csv"))

# 2. Seniority Model

In [175]:
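# Prepare the seniority data
sdf = seniority_df.copy()

# Drop incomplete rows before casting: astype(str) can turn a missing value
# into the literal string "nan", which dropna would no longer catch.
sdf = sdf.dropna(subset=["text", "label"])

sdf["text"] = sdf["text"].astype(str).str.lower()
sdf["label"] = sdf["label"].astype(str)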

Prepare the seniority dataset for modeling. Lowercasing ensures uniform text representation. Dropping missing values prevents errors in the model.

In [176]:
#Seniority Train/Test Split
from sklearn.model_selection import train_test_split

sx = sdf["text"]
sy = sdf["label"]

sx_train, sx_test, sy_train, sy_test = train_test_split(
    sx,
    sy,
    test_size=0.2,
    random_state=42,
    stratify=sy
)

# Print dataset sizes
print("Seniority dataset sizes:")
print("Total:", len(sx))
print("Train:", len(sx_train))
print("Test:", len(sx_test))

Seniority dataset sizes:
Total: 9428
Train: 7542
Test: 1886


Split the data into training and test sets at a ratio of 80:20. stratify=sy keeps the class proportions the same in both sets, so rare classes appear in the test set as well. Of the 9428 titles, 7542 go to the training set and 1886 to the test set.
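To make the effect of stratification concrete, here is a minimal hand-rolled sketch. It is not scikit-learn's implementation; the function name `stratified_split` and the toy label counts are assumptions made for illustration. It splits each class separately so that every class contributes the same share to the test set.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with the same class proportions in both parts."""
    rng = random.Random(seed)
    # Group the row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    # Shuffle and cut each class on its own, so rare classes are not lost by chance.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical, deliberately imbalanced label list.
labels = ["Senior"] * 80 + ["Lead"] * 15 + ["Junior"] * 5
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With a purely random split, a class with only five examples could end up entirely in the training set; splitting per class rules that out.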

In [177]:
#Seniority TF–IDF + Logistic Regression Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

smodel = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=3,
        max_df=0.9
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])

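The pipeline converts job titles into TF–IDF features and applies logistic regression. Unigrams and bigrams are used; min_df=3 drops n-grams that occur in fewer than three titles, and max_df=0.9 drops n-grams that occur in more than 90% of the titles. class_weight="balanced" weights each class inversely to its frequency, so rare seniority levels are not drowned out by frequent ones.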
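The following sketch illustrates what "unigrams + bigrams" and a minimum document frequency mean for short job titles. It is a simplified pure-Python imitation, not the TfidfVectorizer code itself; the helper `word_ngrams` and the four example titles are hypothetical.

```python
import re
from collections import Counter

def word_ngrams(title, ngram_range=(1, 2)):
    """Lowercase a title and return its word n-grams, shortest first."""
    # Tokens of two or more word characters, similar to TfidfVectorizer's default pattern.
    tokens = re.findall(r"\b\w\w+\b", title.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Hypothetical job titles, chosen only for illustration.
titles = ["Head of Sales", "Head of Marketing", "Sales Manager", "Head of IT"]

# Document frequency: in how many titles does each n-gram occur?
doc_freq = Counter()
for title in titles:
    doc_freq.update(set(word_ngrams(title)))

# Keep n-grams that occur in at least min_df titles (the notebook uses min_df=3).
min_df = 3
vocabulary = sorted(g for g, df in doc_freq.items() if df >= min_df)

print(word_ngrams("Head of Sales"))
print(vocabulary)
```

In this toy corpus only "head", "of", and "head of" survive the frequency filter, which shows how min_df removes rare wording before the classifier ever sees it.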

In [178]:
# Train the seniority model
smodel.fit(sx_train, sy_train)

# Predict on the test data
sy_pred = smodel.predict(sx_test)

# Print the accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(sy_test, sy_pred))

Accuracy: 0.9703075291622482


Train the seniority classifier and generate predictions on the test set. The model achieves a high accuracy of 0.97 on the test set.

In [179]:
#Seniority Evaluation
from sklearn.metrics import accuracy_score, f1_score, classification_report

sy_pred = smodel.predict(sx_test)

saccuracy = accuracy_score(sy_test, sy_pred)
smacro_f1 = f1_score(sy_test, sy_pred, average="macro")

print("Accuracy:", round(saccuracy, 3))
print("Macro F1:", round(smacro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(sy_test, sy_pred))

Accuracy: 0.97
Macro F1: 0.956

Classification Report:

              precision    recall  f1-score   support

    Director       0.99      0.98      0.98       197
      Junior       0.85      1.00      0.92        82
        Lead       0.97      0.98      0.98       709
  Management       0.92      0.93      0.92       151
      Senior       0.99      0.97      0.98       747

    accuracy                           0.97      1886
   macro avg       0.94      0.97      0.96      1886
weighted avg       0.97      0.97      0.97      1886



Evaluate using accuracy and macro F1. Macro F1 averages the per-class F1 scores with equal weight, so weak results on a rare class are not hidden by the frequent classes. The classification report shows precision, recall, and F1 per seniority level. The evaluation yields an accuracy of 0.97 and a macro F1 score of 0.956; the lowest value in the report is the Junior precision of 0.85.
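As a cross-check of how macro F1 is formed, the sketch below recomputes it from the rounded precision and recall values in the classification report above. Because the inputs are rounded to two decimals, the result only approximates the reported 0.956. The helper `f1` is defined here for illustration.

```python
# Per-class (precision, recall), copied from the classification report above.
report = {
    "Director":   (0.99, 0.98),
    "Junior":     (0.85, 1.00),
    "Lead":       (0.97, 0.98),
    "Management": (0.92, 0.93),
    "Senior":     (0.99, 0.97),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Macro F1: unweighted mean of the per-class F1 scores.
per_class_f1 = {label: f1(p, r) for label, (p, r) in report.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(round(macro_f1, 2))
print(min(per_class_f1, key=per_class_f1.get))
```

Every class counts once in the average, regardless of its support, which is why the 82 Junior examples influence macro F1 as much as the 747 Senior examples.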

In [180]:
# Seniority Evaluation on Annotated ACTIVE Jobs

# Prepare evaluation data
s_eval_df = jobs_annotated_active_df.dropna(subset=["position", "seniority"]).copy()
s_eval_text = s_eval_df["position"].astype(str).str.lower()
s_eval_labels = s_eval_df["seniority"].astype(str)

# Predict seniority
s_eval_pred = smodel.predict(s_eval_text)

# Evaluation metrics
from sklearn.metrics import accuracy_score, f1_score, classification_report

s_eval_accuracy = accuracy_score(s_eval_labels, s_eval_pred)
s_eval_macro_f1 = f1_score(s_eval_labels, s_eval_pred, average="macro")

print("Seniority Evaluation on ACTIVE Jobs")
print("Accuracy:", round(s_eval_accuracy, 3))
print("Macro F1:", round(s_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(s_eval_labels, s_eval_pred))

Seniority Evaluation on ACTIVE Jobs
Accuracy: 0.437
Macro F1: 0.409

Classification Report:

              precision    recall  f1-score   support

    Director       0.58      0.88      0.70        34
      Junior       0.18      0.33      0.24        12
        Lead       0.33      0.71      0.45       125
  Management       0.90      0.59      0.72       192
Professional       0.00      0.00      0.00       216
      Senior       0.23      0.80      0.36        44

    accuracy                           0.44       623
   macro avg       0.37      0.55      0.41       623
weighted avg       0.40      0.44      0.38       623





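Evaluates the seniority model on annotated ACTIVE LinkedIn job entries, using the job position title as input. The predictions are compared against manually labeled seniority levels, which gives a realistic picture of how well the model generalizes from the label-based training data to real-world CV data. The accuracy is 0.437 and the macro F1 score is 0.409. A large part of the gap comes from the Professional class: it accounts for 216 of the 623 entries but does not occur in the training labels, so the model can never predict it and scores 0.00 on it.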
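The class mismatch can be quantified with a few lines of arithmetic. The sketch below uses the supports from the ACTIVE-jobs report above and the five classes that appear in the earlier test-set report; the variable names are chosen for this example. It computes the best accuracy a model could reach if it can only predict classes it was trained on.

```python
# Class supports on the ACTIVE jobs, copied from the classification report above.
active_support = {
    "Director": 34, "Junior": 12, "Lead": 125,
    "Management": 192, "Professional": 216, "Senior": 44,
}
# Classes present in the label-based training data (see the earlier test-set report).
train_classes = {"Director", "Junior", "Lead", "Management", "Senior"}

# Labels that occur in the evaluation data but were never seen during training.
unseen = set(active_support) - train_classes
total = sum(active_support.values())
unreachable = sum(active_support[c] for c in unseen)

# Best possible accuracy if every title from a known class were predicted correctly.
accuracy_ceiling = (total - unreachable) / total
print(sorted(unseen), round(accuracy_ceiling, 3))
```

Even a perfect classifier for the five known classes would therefore stay below roughly two thirds accuracy on this evaluation set.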

In [181]:
#Seniority Top Features
import numpy as np

feature_names = smodel.named_steps["tfidf"].get_feature_names_out()
coefs = smodel.named_steps["clf"].coef_

for i, label in enumerate(smodel.named_steps["clf"].classes_):
    top = np.argsort(coefs[i])[-10:]
    print(f"\nTop words for {label}:")
    print(feature_names[top])


Top words for Director:
['marketing director' 'managing directors' 'managing' 'director marketing'
 'vertriebsdirektor' 'director sales' 'directors' 'sales director'
 'abteilungsdirektor' 'director']

Top words for Junior:
['marketing' 'assistent' 'associate' 'assistentin' 'mitarbeiter'
 'mitarbeiterin' 'referent' 'referentin' 'analyst' 'junior']

Top words for Lead:
['abteilungsleiter' 'projektleiter' 'geschäftsleitung' 'teamleiter'
 'leiterin' 'head of' 'head' 'vertriebsleiter' 'leitung' 'leiter']

Top words for Management:
['cio' 'vice president' 'vice' 'founder' 'chief' 'owner' 'ceo'
 'geschäftsführung' 'vp' 'geschäftsführer']

Top words for Senior:
['marketing manager' 'engineer' 'executive' 'assistant' 'managerin'
 'responsable' 'consultant' 'management' 'senior' 'manager']


Shows the ten n-grams with the largest positive coefficients for each seniority class. The lists are in ascending order, so the strongest feature comes last. This helps interpret the model’s decisions.
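The ranking logic can be shown on a toy example. The feature names and coefficient values below are made up for illustration; in the notebook they come from the fitted vectorizer and from clf.coef_. The sketch also looks at the most negative coefficients, which indicate evidence against a class.

```python
# Hypothetical coefficients for a single class; not taken from the fitted model.
features = ["junior", "manager", "head of", "analyst", "director"]
coefficients = [2.4, -1.1, -0.7, 1.9, -1.5]

def top_features(names, weights, n=2):
    """Return the n features with the largest weights, strongest first."""
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:n]]

def bottom_features(names, weights, n=2):
    """Return the n features with the most negative weights, most negative first."""
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1])
    return [name for name, _ in ranked[:n]]

print(top_features(features, coefficients))     # strongest evidence for the class
print(bottom_features(features, coefficients))  # strongest evidence against it
```

Unlike the notebook cell, this helper returns the strongest feature first; either order works as long as it is read consistently.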

# 3. Department Model

In [182]:
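# Prepare the department data
ddf = department_df.copy()

# Drop incomplete rows before casting: astype(str) can turn a missing value
# into the literal string "nan", which dropna would no longer catch.
ddf = ddf.dropna(subset=["text", "label"])

ddf["text"] = ddf["text"].astype(str).str.lower()
ddf["label"] = ddf["label"].astype(str)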

Prepare department dataset. Lowercasing ensures consistent text representation. Drop missing values.

In [183]:
# Department Train/Test Split
from sklearn.model_selection import train_test_split

dx = ddf["text"]
dy = ddf["label"]

dx_train, dx_test, dy_train, dy_test = train_test_split(
    dx,
    dy,
    test_size=0.2,
    random_state=42,
    stratify=dy
)

# Print dataset sizes
print("Department dataset sizes:")
print("Total:", len(dx))
print("Train:", len(dx_train))
print("Test:", len(dx_test))

Department dataset sizes:
Total: 10145
Train: 8116
Test: 2029


Split into train/test sets with proportional label distribution. The total is 10145, while the train set is 8116 and the test set is 2029.

In [184]:
# Department TF–IDF + Logistic Regression Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

dmodel = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=3,
        max_df=0.9
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])

Same as seniority pipeline but for department prediction.
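For the department labels the class imbalance is much stronger than for seniority, so the effect of class_weight="balanced" is worth spelling out. scikit-learn documents the balanced weight of a class as n_samples / (n_classes * class_count). The sketch below applies that formula to the test-set supports from the department classification report in this section; since the split is stratified, the training set has the same proportions, but the absolute numbers here are only a stand-in for the training counts.

```python
# Test-set supports per department, copied from the classification report in this section.
support = {
    "Administrative": 17, "Business Development": 124, "Consulting": 33,
    "Customer Support": 7, "Human Resources": 6, "Information Technology": 261,
    "Marketing": 859, "Other": 8, "Project Management": 40,
    "Purchasing": 8, "Sales": 666,
}

n_samples = sum(support.values())
n_classes = len(support)

# Balanced weighting: rare classes get large weights, frequent classes small ones.
weights = {label: n_samples / (n_classes * count) for label, count in support.items()}

for label, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:24s} {weight:6.2f}")
```

A misclassified Human Resources title is thus penalized far more heavily than a misclassified Marketing title, which fits the high recall and lower precision seen for the small classes.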

In [185]:
# Train the department model
dmodel.fit(dx_train, dy_train)

# Predict on the test data
dy_pred = dmodel.predict(dx_test)

# Print the accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(dy_test, dy_pred))

Accuracy: 0.9344504682109414


Train the department classifier and predict on the test set. The model achieves a high accuracy of 0.93 on the test set.

In [186]:
# Department Evaluation
from sklearn.metrics import accuracy_score, f1_score, classification_report

dy_pred = dmodel.predict(dx_test)

daccuracy = accuracy_score(dy_test, dy_pred)
dmacro_f1 = f1_score(dy_test, dy_pred, average="macro")

print("Accuracy:", round(daccuracy, 3))
print("Macro F1:", round(dmacro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(dy_test, dy_pred))

Accuracy: 0.934
Macro F1: 0.86

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.62      0.94      0.74        17
  Business Development       0.83      0.99      0.90       124
            Consulting       0.82      0.97      0.89        33
      Customer Support       0.88      1.00      0.93         7
       Human Resources       0.75      1.00      0.86         6
Information Technology       0.92      0.95      0.94       261
             Marketing       0.99      0.92      0.96       859
                 Other       0.50      1.00      0.67         8
    Project Management       0.57      0.88      0.69        40
            Purchasing       0.89      1.00      0.94         8
                 Sales       0.96      0.93      0.94       666

              accuracy                           0.93      2029
             macro avg       0.79      0.96      0.86      2029
          weighted avg       0.95      0.93   

Evaluate the department model using accuracy and macro F1; the latter weights every class equally and is therefore sensitive to weak results on rare departments. The classification report shows precision, recall, and F1 per department. The evaluation yields an accuracy of 0.934 and a macro F1 score of 0.86. Recall is high for every class (0.88 or above), but precision is low for small classes such as Other (0.50), Project Management (0.57), and Administrative (0.62), which pulls the macro F1 below the accuracy.

In [187]:
# Department Evaluation on Annotated ACTIVE Jobs

# Prepare evaluation data
d_eval_df = jobs_annotated_active_df.dropna(subset=["position", "department"]).copy()
d_eval_text = d_eval_df["position"].astype(str).str.lower()
d_eval_labels = d_eval_df["department"].astype(str)

# Predict department
d_eval_pred = dmodel.predict(d_eval_text)

# Evaluation metrics
d_eval_accuracy = accuracy_score(d_eval_labels, d_eval_pred)
d_eval_macro_f1 = f1_score(d_eval_labels, d_eval_pred, average="macro")

print("Department Evaluation on ACTIVE Jobs")
print("Accuracy:", round(d_eval_accuracy, 3))
print("Macro F1:", round(d_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(d_eval_labels, d_eval_pred))

Department Evaluation on ACTIVE Jobs
Accuracy: 0.223
Macro F1: 0.338

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.17      0.07      0.10        14
  Business Development       0.38      0.30      0.33        20
            Consulting       0.86      0.46      0.60        39
      Customer Support       1.00      0.17      0.29         6
       Human Resources       0.73      0.50      0.59        16
Information Technology       0.31      0.44      0.36        62
             Marketing       0.17      0.41      0.24        22
                 Other       0.00      0.00      0.00       344
    Project Management       0.27      0.56      0.37        39
            Purchasing       0.80      0.53      0.64        15
                 Sales       0.12      0.85      0.21        46

              accuracy                           0.22       623
             macro avg       0.44      0.39      0.34       623
        

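Evaluates the department model on annotated ACTIVE LinkedIn job entries based on the job position title. Evaluating on manually labeled CV data makes it possible to assess the model’s robustness and applicability in realistic LinkedIn profile scenarios. The accuracy is 0.223 and the macro F1 score is 0.338. The Other class makes up 344 of the 623 entries, and the model recognizes none of them (recall 0.00), which alone caps the accuracy at about 0.45.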
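To see how the model fares on the titles that do belong to a concrete department, the accuracy can be recomputed without the Other class. The sketch below is a back-of-the-envelope calculation from the rounded figures in the report above, so the result is approximate; the variable names are chosen for this example.

```python
# Figures copied from the ACTIVE-jobs department report above.
n_total = 623
n_other = 344          # support of the "Other" class
recall_other = 0.00    # share of "Other" titles that were recognized (rounded)
accuracy = 0.223

# Number of correct predictions overall (rounded, since the accuracy is rounded).
n_correct = round(accuracy * n_total)
# Correct predictions inside "Other" (zero here, because the recall is 0.00).
n_correct_other = round(recall_other * n_other)

# Accuracy restricted to titles whose true department is not "Other".
accuracy_without_other = (n_correct - n_correct_other) / (n_total - n_other)
print(n_correct, round(accuracy_without_other, 3))
```

On the entries with a concrete department, roughly every second title is classified correctly, which is still far below the 0.93 measured on the label data.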

In [188]:
# Department Top Features
import numpy as np

feature_names = dmodel.named_steps["tfidf"].get_feature_names_out()
coefs = dmodel.named_steps["clf"].coef_

for i, label in enumerate(dmodel.named_steps["clf"].classes_):
    top = np.argsort(coefs[i])[-10:]
    print(f"\nTop words for {label}:")
    print(feature_names[top])


Top words for Administrative:
['assistentin des' 'geschäftsführung' 'gf' 'assistent der'
 'geschäftsleitung' 'der' 'sekretärin' 'assistent' 'assistenz'
 'assistentin']

Top words for Business Development:
['business intelligence' 'crm' 'digital business' 'of business'
 'business process' 'ebusiness' 'it business' 'development'
 'business development' 'business']

Top words for Consulting:
['senior berater' 'sap' 'coach' 'senior consultant' 'von' 'senior'
 'recruitment' 'beraterin' 'berater' 'consultant']

Top words for Customer Support:
['service and' 'it systems' 'customer' 'technical' 'it support' 'it'
 'customer support' 'technical support' 'supporter' 'support']

Top words for Human Resources:
['qualitätsmanagement' 'director digital' 'project director' 'gl'
 'manager hr' 'of human' 'resources' 'human resources' 'human' 'hr']

Top words for Information Technology:
['digitalization' 'administrator' 'entwickler' 'digitale' 'administration'
 'digitalisierung' 'sap' 'digital' 'it' 'cr

Shows the top words for each department class to interpret the model.

In [189]:
# Compare accuracy and macro F1
comparison_metrics = pd.DataFrame({
    "Target": ["Seniority", "Department"],
    "Accuracy": [saccuracy, daccuracy],
    "Macro F1": [smacro_f1, dmacro_f1]
})

print("Model comparison:\n")
print(comparison_metrics)

# Optional: top 5 features per label for both models
def print_top_features(model, n=5):
    feature_names = model.named_steps["tfidf"].get_feature_names_out()
    coefs = model.named_steps["clf"].coef_
    for i, label in enumerate(model.named_steps["clf"].classes_):
        top = np.argsort(coefs[i])[-n:]
        print(f"\nTop {n} words for {label}:")
        print(feature_names[top])

print("\n--- Seniority Top Features ---")
print_top_features(smodel)

print("\n--- Department Top Features ---")
print_top_features(dmodel)

Model comparison:

       Target  Accuracy  Macro F1
0   Seniority  0.970308  0.956030
1  Department  0.934450  0.860444

--- Seniority Top Features ---

Top 5 words for Director:
['director sales' 'directors' 'sales director' 'abteilungsdirektor'
 'director']

Top 5 words for Junior:
['mitarbeiterin' 'referent' 'referentin' 'analyst' 'junior']

Top 5 words for Lead:
['head of' 'head' 'vertriebsleiter' 'leitung' 'leiter']

Top 5 words for Management:
['owner' 'ceo' 'geschäftsführung' 'vp' 'geschäftsführer']

Top 5 words for Senior:
['responsable' 'consultant' 'management' 'senior' 'manager']

--- Department Top Features ---

Top 5 words for Administrative:
['der' 'sekretärin' 'assistent' 'assistenz' 'assistentin']

Top 5 words for Business Development:
['ebusiness' 'it business' 'development' 'business development' 'business']

Top 5 words for Consulting:
['senior' 'recruitment' 'beraterin' 'berater' 'consultant']

Top 5 words for Customer Support:
['it' 'customer support' 'technical su

In [190]:
# Comparison: Training Evaluation vs ACTIVE Job Evaluation

comparison_metrics = pd.DataFrame({
    "Target": [
        "Seniority (Label Data)",
        "Department (Label Data)",
        "Seniority (ACTIVE Jobs)",
        "Department (ACTIVE Jobs)"
    ],
    "Accuracy": [
        saccuracy,
        daccuracy,
        s_eval_accuracy,
        d_eval_accuracy
    ],
    "Macro F1": [
        smacro_f1,
        dmacro_f1,
        s_eval_macro_f1,
        d_eval_macro_f1
    ]
})

print("Model Comparison:\n")
print(comparison_metrics)


#Top Features per Label

import numpy as np

def print_top_features(model, n=5):
    feature_names = model.named_steps["tfidf"].get_feature_names_out()
    coefs = model.named_steps["clf"].coef_
    for i, label in enumerate(model.named_steps["clf"].classes_):
        top = np.argsort(coefs[i])[-n:]
        print(f"\nTop {n} words for {label}:")
        print(feature_names[top])

print("\n--- Seniority Top Features ---")
print_top_features(smodel)

print("\n--- Department Top Features ---")
print_top_features(dmodel)

Model Comparison:

                     Target  Accuracy  Macro F1
0    Seniority (Label Data)  0.970308  0.956030
1   Department (Label Data)  0.934450  0.860444
2   Seniority (ACTIVE Jobs)  0.436597  0.409319
3  Department (ACTIVE Jobs)  0.223114  0.338219

--- Seniority Top Features ---

Top 5 words for Director:
['director sales' 'directors' 'sales director' 'abteilungsdirektor'
 'director']

Top 5 words for Junior:
['mitarbeiterin' 'referent' 'referentin' 'analyst' 'junior']

Top 5 words for Lead:
['head of' 'head' 'vertriebsleiter' 'leitung' 'leiter']

Top 5 words for Management:
['owner' 'ceo' 'geschäftsführung' 'vp' 'geschäftsführer']

Top 5 words for Senior:
['responsable' 'consultant' 'management' 'senior' 'manager']

--- Department Top Features ---

Top 5 words for Administrative:
['der' 'sekretärin' 'assistent' 'assistenz' 'assistentin']

Top 5 words for Business Development:
['ebusiness' 'it business' 'development' 'business development' 'business']

Top 5 words for Consul

The comparison table summarizes model performance across two evaluation settings with concrete quantitative results. On the label-based datasets, the seniority classifier achieves a very high accuracy of 0.97 with a macro F1 score of 0.96, while the department classifier reaches an accuracy of 0.93 and a macro F1 score of 0.86. These results indicate that both TF–IDF + logistic regression models perform extremely well when trained and evaluated on curated label data.

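When evaluated on annotated ACTIVE job entries from real CV data, performance drops substantially. Seniority prediction achieves an accuracy of 0.44 and a macro F1 score of 0.41, while department prediction performs considerably worse with an accuracy of 0.22 and a macro F1 score of 0.34. This sharp decline points to a strong domain shift between the clean label data and real-world job titles, which tend to be shorter, noisier, and more ambiguous, and often lack explicit domain or seniority cues. The label distribution shifts as well: Professional (216 of 623 entries) does not occur in the seniority training labels, and Other (344 of 623 entries) is one of the rarest classes in the department label data. Both classes score 0.00 on the ACTIVE jobs.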

The observed performance gap confirms that while simple bag-of-words baselines are effective on controlled datasets, they struggle to generalize to realistic CV data.
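One simple way to probe such a domain shift for a bag-of-words model is to measure how many evaluation tokens never occur in the training titles. The sketch below is an assumption about how this could be checked, not something the notebook does; the function `oov_rate` and the example titles are hypothetical.

```python
import re

def tokens(title):
    """Lowercased word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", title.lower())

def oov_rate(train_titles, eval_titles):
    """Share of evaluation tokens that never occur in the training titles."""
    known = {tok for title in train_titles for tok in tokens(title)}
    eval_tokens = [tok for title in eval_titles for tok in tokens(title)]
    if not eval_tokens:
        return 0.0
    unknown = sum(tok not in known for tok in eval_tokens)
    return unknown / len(eval_tokens)

# Hypothetical titles for illustration only.
train_titles = ["Head of Sales", "Senior Consultant", "Marketing Manager"]
eval_titles = ["Werkstudent Sales", "Freelance Consultant", "Doktorand"]

rate = oov_rate(train_titles, eval_titles)
print(round(rate, 2))
```

Applied to the real data, this could be called with sdf["text"] and the ACTIVE position titles; a high rate would suggest that many evaluation titles carry no usable TF–IDF signal at all.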

# 4. Seniority Model with Synthetic Data

In [191]:
ORD_MAP = {
    "Junior": 1.0,
    "Professional": 2.0,
    "Senior": 3.0,
    "Lead": 4.0,
    "Management": 5.0,
    "Director": 6.0,
}
INV_ORD = {v: k for k, v in ORD_MAP.items()}

In [192]:
def add_synthetic(train_df: pd.DataFrame, synthetic_csv_relpath: str) -> pd.DataFrame:
    """Append synthetic job titles (numeric seniority mapped to labels) to train_df."""
    syn = pd.read_csv(get_github_url(synthetic_csv_relpath))
    syn = syn[["position", "seniority"]].copy()

    # Map the numeric seniority codes (1.0 to 6.0) back to their label names.
    syn["label"] = syn["seniority"].map(INV_ORD)
    syn = syn.rename(columns={"position": "text"})
    # Rows with a missing title or an unmapped seniority code are dropped.
    syn = syn.dropna(subset=["text", "label"])

    return pd.concat([train_df[["text", "label"]], syn[["text", "label"]]], ignore_index=True)

In [193]:
strain_df_aug = add_synthetic(sdf, "data/results/gemini_synthetic.csv")
strain_df_aug

Unnamed: 0,text,label
0,analyst,Junior
1,analyste financier,Junior
2,anwendungstechnischer mitarbeiter,Junior
3,application engineer,Senior
4,applications engineer,Senior
...,...,...
11309,Juristischer Berater,Professional
11310,"Leitung Personal, Finanzen, Einkauf, IT | Folk...",Management
11311,Verwaltungsleitung Landesspracheninstitut in d...,Management
11312,"Leitung Gebäudemanagement, Einkauf und Control...",Management


In [194]:
# Seniority train/test split with synthetic data
from sklearn.model_selection import train_test_split

ssx = strain_df_aug["text"]
ssy = strain_df_aug["label"]

ssx_train, ssx_test, ssy_train, ssy_test = train_test_split(
    ssx,
    ssy,
    test_size=0.2,
    random_state=42,
    stratify=ssy
)

# Print dataset sizes
print("Seniority dataset sizes:")
print("Total:", len(ssx))
print("Train:", len(ssx_train))
print("Test:", len(ssx_test))

Seniority dataset sizes:
Total: 11314
Train: 9051
Test: 2263


In [195]:
smodel_syn = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),
        min_df=3,
        max_df=0.9
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])

smodel_syn.fit(ssx_train, ssy_train)

In [196]:
# Seniority model with synthetic data (already fitted in the previous cell)

# Predict on the test data
ssy_pred = smodel_syn.predict(ssx_test)

# Print the accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(ssy_test, ssy_pred))

Accuracy: 0.8709677419354839


In [197]:
# Seniority evaluation with synthetic data
from sklearn.metrics import accuracy_score, f1_score, classification_report

ssy_pred = smodel_syn.predict(ssx_test)

ssaccuracy = accuracy_score(ssy_test, ssy_pred)
ssmacro_f1 = f1_score(ssy_test, ssy_pred, average="macro")

print("Accuracy:", round(ssaccuracy, 3))
print("Macro F1:", round(ssmacro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(ssy_test, ssy_pred))

Accuracy: 0.871
Macro F1: 0.811

Classification Report:

              precision    recall  f1-score   support

    Director       0.98      0.89      0.93       242
      Junior       0.81      0.75      0.78       165
        Lead       0.95      0.92      0.93       739
  Management       0.77      0.79      0.78       257
Professional       0.41      0.84      0.55        83
      Senior       0.92      0.88      0.90       777

    accuracy                           0.87      2263
   macro avg       0.81      0.84      0.81      2263
weighted avg       0.89      0.87      0.88      2263



In [198]:
# Seniority evaluation on annotated ACTIVE jobs with synthetic data

# Prepare evaluation data
ss_eval_df = jobs_annotated_active_df.dropna(subset=["position", "seniority"]).copy()
ss_eval_text = ss_eval_df["position"].astype(str).str.lower()
ss_eval_labels = ss_eval_df["seniority"].astype(str)

# Predict seniority
ss_eval_pred = smodel_syn.predict(ss_eval_text)

# Evaluation metrics
from sklearn.metrics import accuracy_score, f1_score, classification_report

ss_eval_accuracy = accuracy_score(ss_eval_labels, ss_eval_pred)
ss_eval_macro_f1 = f1_score(ss_eval_labels, ss_eval_pred, average="macro")

print("Seniority Evaluation on ACTIVE Jobs")
print("Accuracy:", round(ss_eval_accuracy, 3))
print("Macro F1:", round(ss_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(ss_eval_labels, ss_eval_pred))

Seniority Evaluation on ACTIVE Jobs
Accuracy: 0.645
Macro F1: 0.571

Classification Report:

              precision    recall  f1-score   support

    Director       0.50      0.85      0.63        34
      Junior       0.17      0.58      0.26        12
        Lead       0.86      0.50      0.63       125
  Management       0.85      0.72      0.78       192
Professional       0.69      0.61      0.64       216
      Senior       0.35      0.77      0.48        44

    accuracy                           0.65       623
   macro avg       0.57      0.67      0.57       623
weighted avg       0.73      0.65      0.66       623



In [199]:
ssynthetic_results = pd.DataFrame({
    "Dataset": [
        "Label Data (with Synthetic)",
        "ACTIVE Jobs (with Synthetic)"
    ],
    "Accuracy": [
        ssaccuracy,
        ss_eval_accuracy
    ],
    "Macro F1": [
        ssmacro_f1,
        ss_eval_macro_f1
    ]
})

print("Seniority Model – Performance with Synthetic Data\n")
print(ssynthetic_results)


Seniority Model – Performance with Synthetic Data

                        Dataset  Accuracy  Macro F1
0   Label Data (with Synthetic)  0.870968  0.811299
1  ACTIVE Jobs (with Synthetic)  0.645265  0.571373


In [200]:
# Comparison: Baseline vs Synthetic – Training & ACTIVE Job Evaluation

comparison_metrics_seniority = pd.DataFrame({
    "Target": [
        "Seniority (Label Data – no Synthetic)",
        "Seniority (Label Data – with Synthetic)",
        "Seniority (ACTIVE Jobs – no Synthetic)",
        "Seniority (ACTIVE Jobs – with Synthetic)",
    ],
    "Accuracy": [
        saccuracy,        # Seniority baseline – label data
        ssaccuracy,       # Seniority synthetic – label data
        s_eval_accuracy, # Seniority baseline – ACTIVE jobs
        ss_eval_accuracy,# Seniority synthetic – ACTIVE jobs
    ],
    "Macro F1": [
        smacro_f1,
        ssmacro_f1,
        s_eval_macro_f1,
        ss_eval_macro_f1,
    ]
})

print("Model Comparison:\n")
print(comparison_metrics_seniority)


Model Comparison:

                                     Target  Accuracy  Macro F1
0     Seniority (Label Data – no Synthetic)  0.970308  0.956030
1   Seniority (Label Data – with Synthetic)  0.870968  0.811299
2    Seniority (ACTIVE Jobs – no Synthetic)  0.436597  0.409319
3  Seniority (ACTIVE Jobs – with Synthetic)  0.645265  0.571373
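
The effect of the synthetic data is easiest to read as differences. The sketch below recomputes them from the comparison table above; the dictionary layout and the helper `delta` are chosen for this example. Note that the two "Label Data" rows are measured on different test splits, because the augmented split also contains synthetic titles and the Professional class, so that difference is not a like-for-like comparison.

```python
# Metrics copied from the seniority comparison table above.
results = {
    ("Label Data", "no Synthetic"):    {"accuracy": 0.970308, "macro_f1": 0.956030},
    ("Label Data", "with Synthetic"):  {"accuracy": 0.870968, "macro_f1": 0.811299},
    ("ACTIVE Jobs", "no Synthetic"):   {"accuracy": 0.436597, "macro_f1": 0.409319},
    ("ACTIVE Jobs", "with Synthetic"): {"accuracy": 0.645265, "macro_f1": 0.571373},
}

def delta(dataset, metric):
    """Change in a metric when synthetic data is added (positive = improvement)."""
    with_syn = results[(dataset, "with Synthetic")][metric]
    without_syn = results[(dataset, "no Synthetic")][metric]
    return with_syn - without_syn

for dataset in ("Label Data", "ACTIVE Jobs"):
    for metric in ("accuracy", "macro_f1"):
        print(f"{dataset:12s} {metric:9s} {delta(dataset, metric):+.3f}")
```

On the ACTIVE jobs, both metrics rise by roughly 0.2 and 0.16 respectively, while the scores on the respective label-data test splits fall.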


# 5. Department Model with Synthetic Data

In [201]:
def add_synthetic_department(train_df: pd.DataFrame, synthetic_csv_relpath: str) -> pd.DataFrame:
    syn = pd.read_csv(get_github_url(synthetic_csv_relpath))

    # expect columns: position, department
    syn = syn[["position", "department"]].copy()
    syn = syn.rename(columns={"position": "text", "department": "label"})
    syn = syn.dropna(subset=["text", "label"])

    out = pd.concat([train_df[["text", "label"]], syn[["text", "label"]]], ignore_index=True)
    return out

In [202]:
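# The base frame must be ddf; the outputs recorded below come from a run with sdf,
# hence the seniority labels (Director, Junior, Lead, ...) among the department classes.
dtrain_df_aug = add_synthetic_department(ddf, "data/results/gemini_synthetic.csv")
dtrain_df_aug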

Unnamed: 0,text,label
0,analyst,Junior
1,analyste financier,Junior
2,anwendungstechnischer mitarbeiter,Junior
3,application engineer,Senior
4,applications engineer,Senior
...,...,...
11309,Juristischer Berater,Consulting
11310,"Leitung Personal, Finanzen, Einkauf, IT | Folk...",Human Resources
11311,Verwaltungsleitung Landesspracheninstitut in d...,Administrative
11312,"Leitung Gebäudemanagement, Einkauf und Control...",Purchasing


In [203]:
# Department train/test split with synthetic data
from sklearn.model_selection import train_test_split

dsx = dtrain_df_aug["text"]
dsy = dtrain_df_aug["label"]

dsx_train, dsx_test, dsy_train, dsy_test = train_test_split(
    dsx,
    dsy,
    test_size=0.2,
    random_state=42,
    stratify=dsy
)

# Print dataset sizes
print("Department dataset sizes:")
print("Total:", len(dsx))
print("Train:", len(dsx_train))
print("Test:", len(dsx_test))


Department dataset sizes:
Total: 11314
Train: 9051
Test: 2263


In [204]:
# Department TF–IDF + Logistic Regression Pipeline
dmodel_syn = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),
        min_df=3,
        max_df=0.9
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])

# Train the Department model on synthetic training data
dmodel_syn.fit(dsx_train, dsy_train)


In [205]:
# Department model with synthetic data (already fitted in the previous cell)

# Predict on the test data
dsy_pred = dmodel_syn.predict(dsx_test)

# Print the accuracy
from sklearn.metrics import accuracy_score
print("Department Accuracy:", accuracy_score(dsy_test, dsy_pred))

Department Accuracy: 0.7277949624392399


In [206]:
# Department evaluation with synthetic data
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Predict on the test data
dsy_pred = dmodel_syn.predict(dsx_test)

# Evaluation metrics
dsaccuracy = accuracy_score(dsy_test, dsy_pred)
dsmacro_f1 = f1_score(dsy_test, dsy_pred, average="macro")

print("Department Accuracy:", round(dsaccuracy, 3))
print("Department Macro F1:", round(dsmacro_f1, 3))
print("\nClassification Report (Department):\n")
print(classification_report(dsy_test, dsy_pred))

Department Accuracy: 0.728
Department Macro F1: 0.533

Classification Report (Department):

                        precision    recall  f1-score   support

        Administrative       0.27      0.55      0.36        20
  Business Development       0.19      0.54      0.28        28
            Consulting       0.29      0.68      0.41        25
      Customer Support       0.33      0.88      0.48         8
              Director       0.88      0.88      0.88       197
       Human Resources       0.32      0.62      0.43        16
Information Technology       0.38      0.74      0.51        34
                Junior       0.84      0.93      0.88        82
                  Lead       0.94      0.85      0.89       709
            Management       0.76      0.76      0.76       151
             Marketing       0.12      0.50      0.19        20
                 Other       0.55      0.56      0.56       156
    Project Management       0.26      0.62      0.37        24
           

In [207]:
# Department evaluation on annotated ACTIVE jobs with synthetic data

# Prepare evaluation data
ds_eval_df = jobs_annotated_active_df.dropna(subset=["position", "department"]).copy()
ds_eval_text = ds_eval_df["position"].astype(str).str.lower()
ds_eval_labels = ds_eval_df["department"].astype(str)

# Predict department
ds_eval_pred = dmodel_syn.predict(ds_eval_text)

# Evaluation metrics

ds_eval_accuracy = accuracy_score(ds_eval_labels, ds_eval_pred)
ds_eval_macro_f1 = f1_score(ds_eval_labels, ds_eval_pred, average="macro")

print("Department Evaluation on ACTIVE Jobs")
print("Accuracy:", round(ds_eval_accuracy, 3))
print("Macro F1:", round(ds_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(ds_eval_labels, ds_eval_pred))

Department Evaluation on ACTIVE Jobs
Accuracy: 0.43
Macro F1: 0.326

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.24      0.29      0.26        14
  Business Development       0.23      0.65      0.34        20
            Consulting       0.63      0.49      0.55        39
      Customer Support       0.50      0.83      0.62         6
              Director       0.00      0.00      0.00         0
       Human Resources       0.48      0.62      0.54        16
Information Technology       0.73      0.39      0.51        62
                Junior       0.00      0.00      0.00         0
                  Lead       0.00      0.00      0.00         0
            Management       0.00      0.00      0.00         0
             Marketing       0.44      0.18      0.26        22
                 Other       0.76      0.43      0.55       344
    Project Management       0.82      0.46      0.59        39
          



In [208]:
dsynthetic_results = pd.DataFrame({
    "Dataset": [
        "Label Data (with Synthetic)",
        "ACTIVE Jobs (with Synthetic)"
    ],
    "Accuracy": [
        dsaccuracy,
        ds_eval_accuracy
    ],
    "Macro F1": [
        dsmacro_f1,
        ds_eval_macro_f1
    ]
})

print("Department Model – Performance with Synthetic Data\n")
print(dsynthetic_results)


Department Model – Performance with Synthetic Data

                        Dataset  Accuracy  Macro F1
0   Label Data (with Synthetic)  0.727795  0.533366
1  ACTIVE Jobs (with Synthetic)  0.430177  0.326088


In [212]:
# Comparison: Baseline vs Synthetic – Training & ACTIVE Job Evaluation for Department

comparison_metrics_department = pd.DataFrame({
    "Target": [
        "Department (Label Data – no Synthetic)",
        "Department (Label Data – with Synthetic)",
        "Department (ACTIVE Jobs – no Synthetic)",
        "Department (ACTIVE Jobs – with Synthetic)",
    ],
    "Accuracy": [
        daccuracy,        # Department baseline – label data
        dsaccuracy,       # Department synthetic – label data
        d_eval_accuracy, # Department baseline – ACTIVE jobs
        ds_eval_accuracy, # Department synthetic – ACTIVE jobs
    ],
    "Macro F1": [
        dmacro_f1,
        dsmacro_f1,
        d_eval_macro_f1,
        ds_eval_macro_f1,
    ]
})

print("Department Model Comparison:\n")
print(comparison_metrics_department)


Department Model Comparison:

                                      Target  Accuracy  Macro F1
0     Department (Label Data – no Synthetic)  0.934450  0.860444
1   Department (Label Data – with Synthetic)  0.727795  0.533366
2    Department (ACTIVE Jobs – no Synthetic)  0.223114  0.338219
3  Department (ACTIVE Jobs – with Synthetic)  0.430177  0.326088


In [210]:
# Synthetic data with oversampling