<a href="https://colab.research.google.com/github/luisadosch/Final-Project-snapAddy/blob/main/model2_embedding_based.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. GitHub Access Details

In [46]:
# GitHub repository details for loading the data
import pandas as pd

GH_USER = "luisadosch"
GH_REPO = "Final-Project-snapAddy"
BRANCH = "main"

def get_github_url(relative_path):
    return f"https://raw.githubusercontent.com/{GH_USER}/{GH_REPO}/{BRANCH}/{relative_path}"


jobs_annotated_active_df = pd.read_csv(get_github_url("data/processed/jobs_annotated_active.csv"))

department_df = pd.read_csv(get_github_url("data/raw/department-v2.csv"))

seniority_df = pd.read_csv(get_github_url("data/raw/seniority-v2.csv"))

# 2. Seniority Model

In [47]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
from collections import defaultdict
import numpy as np

# --- Labels and example texts from seniority_df ---
slabel_names = seniority_df["label"].astype(str).tolist()
slabel_texts = seniority_df["text"].astype(str).tolist()

# --- Evaluation labels (ACTIVE jobs) ---
strue_seniority = jobs_annotated_active_df["seniority"].astype(str).tolist()

# --- Load the embedding model ---
sembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Embeddings for the labeled example texts ---
X = sembed_model.encode(slabel_texts, convert_to_tensor=True)

# --- Centroid per seniority label ---
by_label = defaultdict(list)
for emb, lab in zip(X, slabel_names):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = sembed_model.encode(
    jobs_annotated_active_df["position"].astype(str).tolist(),
    convert_to_tensor=True
)

# --- Predictions ---
spred_seniority = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    spred_seniority.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
s_eval_accuracy = accuracy_score(strue_seniority, spred_seniority)
s_eval_macro_f1 = f1_score(strue_seniority, spred_seniority, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs")
print("Accuracy:", round(s_eval_accuracy, 3))
print("Macro F1:", round(s_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(strue_seniority, spred_seniority))

Embedding-based Seniority Prediction on ACTIVE Jobs
Accuracy: 0.409
Macro F1: 0.35

Classification Report:

              precision    recall  f1-score   support

    Director       0.41      0.82      0.55        34
      Junior       0.07      0.33      0.12        12
        Lead       0.39      0.43      0.41       125
  Management       0.73      0.70      0.71       192
Professional       0.00      0.00      0.00       216
      Senior       0.19      0.77      0.31        44

    accuracy                           0.41       623
   macro avg       0.30      0.51      0.35       623
weighted avg       0.34      0.41      0.36       623





This block classifies job seniority with an embedding-based nearest-centroid approach, referred to here as zero-shot classification because no model is trained. Each labeled example text in `seniority_df` is encoded into a dense vector with the pre-trained `all-MiniLM-L6-v2` model, and the vectors belonging to each seniority label are averaged into one centroid. Each ACTIVE job title is encoded with the same model, cosine similarity is computed between the title embedding and every label centroid, and the label with the highest similarity is assigned to the job. The labeled examples are used only to build the centroids, which keeps the approach simple and interpretable.
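
The centroid-and-cosine step described above can be sketched in a few vectorized lines. This is a minimal illustration rather than the notebook's exact code: the function name `nearest_centroid_predict` and the 2-D toy vectors are hypothetical stand-ins for real sentence embeddings, and labels are sorted alphabetically here instead of kept in first-seen order.

```python
import numpy as np

def nearest_centroid_predict(train_embs, train_labels, query_embs):
    """Assign each query the label whose centroid has the highest cosine similarity."""
    train_embs = np.asarray(train_embs, dtype=float)
    query_embs = np.asarray(query_embs, dtype=float)
    label_arr = np.asarray(train_labels)
    labels = sorted(set(train_labels))

    # One centroid per label: the mean of all example embeddings with that label
    centroids = np.vstack([train_embs[label_arr == lab].mean(axis=0) for lab in labels])

    # Cosine similarity equals the dot product of L2-normalised vectors
    # (assumes no embedding or centroid is the zero vector)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (n_queries, n_labels)

    return [labels[i] for i in sims.argmax(axis=1)]

# Toy vectors standing in for sentence embeddings (hypothetical values)
train_embs = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
train_labels = ["Junior", "Junior", "Director", "Director"]
preds = nearest_centroid_predict(train_embs, train_labels, [[0.8, 0.2], [0.2, 0.9]])
print(preds)
```

Computing the full similarity matrix in one product avoids the per-title loop, which matters little for 623 titles but scales better for larger evaluation sets.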

Applying this embedding-based approach to predict seniority on ACTIVE jobs results in an accuracy of 0.409 and a macro F1 score of 0.35. The detailed classification report is as follows:

| Label            | Precision | Recall | F1-score | Support |
| ---------------- | --------- | ------ | -------- | ------- |
| Director         | 0.41      | 0.82   | 0.55     | 34      |
| Junior           | 0.07      | 0.33   | 0.12     | 12      |
| Lead             | 0.39      | 0.43   | 0.41     | 125     |
| Management       | 0.73      | 0.70   | 0.71     | 192     |
| Professional     | 0.00      | 0.00   | 0.00     | 216     |
| Senior           | 0.19      | 0.77   | 0.31     | 44      |
| **Accuracy**     |           |        | **0.41** | **623** |
| **Macro avg**    | 0.30      | 0.51   | 0.35     | 623     |
| **Weighted avg** | 0.34      | 0.41   | 0.36     | 623     |



These results show that Director and Senior reach high recall (0.82 and 0.77) but low precision (0.41 and 0.19), so many titles from other classes are assigned to them. Management is the only class with balanced scores (F1 of 0.71). Professional, the largest class with 216 of 623 titles, has a precision and recall of 0.00: not a single Professional title is classified correctly. Overall performance is therefore limited, which highlights the challenges of zero-shot classification in a domain with class imbalance and short, noisy job titles.
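
A row of zeros in the report can mean that a label is never predicted at all, in which case scikit-learn treats precision as undefined and, by default, reports it as 0. A minimal sketch of how one might check this, using toy labels rather than the notebook's data; the helper name `never_predicted` is hypothetical:

```python
from collections import Counter

def never_predicted(y_true, y_pred):
    """Return the labels that occur in y_true but never appear in y_pred."""
    predicted = Counter(y_pred)
    return sorted(label for label in set(y_true) if predicted[label] == 0)

# Toy labels (hypothetical, not the notebook's data)
y_true = ["Professional", "Senior", "Professional", "Lead"]
y_pred = ["Senior", "Senior", "Lead", "Lead"]
missing = never_predicted(y_true, y_pred)
print(missing)
```

In the notebook, the corresponding call would be `never_predicted(strue_seniority, spred_seniority)`.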

# 3. Department Model

In [48]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
from collections import defaultdict
import numpy as np

# --- Labels and example texts from department_df ---
dlabel_names = department_df["label"].astype(str).tolist()
dlabel_texts = department_df["text"].astype(str).tolist()

# --- Evaluation labels (ACTIVE jobs) ---
dtrue_department = jobs_annotated_active_df["department"].astype(str).tolist()

# --- Load the embedding model ---
dembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Embeddings for the labeled example texts ---
X = dembed_model.encode(dlabel_texts, convert_to_tensor=True)

# --- Centroid per department label ---
by_label = defaultdict(list)
for emb, lab in zip(X, dlabel_names):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels])

# --- Embeddings for ACTIVE jobs ---
E = dembed_model.encode(
    jobs_annotated_active_df["position"].astype(str).tolist(),
    convert_to_tensor=True)

# --- Predictions ---
dpred_department = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    dpred_department.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
d_eval_accuracy = accuracy_score(dtrue_department, dpred_department)
d_eval_macro_f1 = f1_score(dtrue_department, dpred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs")
print("Accuracy:", round(d_eval_accuracy, 3))
print("Macro F1:", round(d_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(dtrue_department, dpred_department))

Embedding-based Department Prediction on ACTIVE Jobs
Accuracy: 0.315
Macro F1: 0.315

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.03      0.21      0.05        14
  Business Development       0.17      0.35      0.23        20
            Consulting       0.28      0.67      0.40        39
      Customer Support       0.29      0.33      0.31         6
       Human Resources       0.31      0.62      0.42        16
Information Technology       0.35      0.31      0.32        62
             Marketing       0.39      0.50      0.44        22
                 Other       0.74      0.22      0.33       344
    Project Management       0.57      0.59      0.58        39
            Purchasing       0.02      0.07      0.03        15
                 Sales       0.29      0.43      0.35        46

              accuracy                           0.31       623
             macro avg       0.31      0.39      0.31       623
          weighted avg       0.55      0.31      0.34       623

This block applies the same embedding-based nearest-centroid method (referred to here as zero-shot, since no model is trained) to predict the department of each job. The labeled example texts in `department_df` are encoded with the pre-trained `all-MiniLM-L6-v2` model and averaged into one centroid per department label. Each ACTIVE job title is encoded with the same model, cosine similarity is computed between the title embedding and every label centroid, and the label with the highest similarity is assigned to the job. The prediction therefore rests purely on semantic similarity between job titles and label centroids.

Applying this embedding-based approach to predict departments on ACTIVE jobs results in an accuracy of 0.315 and a macro F1 score of 0.315. The detailed classification report is:

| Label                    | Precision | Recall | F1-score | Support |
|--------------------------|-----------|--------|----------|---------|
| Administrative           | 0.03      | 0.21   | 0.05     | 14      |
| Business Development     | 0.17      | 0.35   | 0.23     | 20      |
| Consulting               | 0.28      | 0.67   | 0.40     | 39      |
| Customer Support         | 0.29      | 0.33   | 0.31     | 6       |
| Human Resources          | 0.31      | 0.62   | 0.42     | 16      |
| Information Technology   | 0.35      | 0.31   | 0.32     | 62      |
| Marketing                | 0.39      | 0.50   | 0.44     | 22      |
| Other                    | 0.74      | 0.22   | 0.33     | 344     |
| Project Management       | 0.57      | 0.59   | 0.58     | 39      |
| Purchasing               | 0.02      | 0.07   | 0.03     | 15      |
| Sales                    | 0.29      | 0.43   | 0.35     | 46      |
| **Accuracy**             |           |        | **0.31** | **623** |
| **Macro avg**            | **0.31**  | **0.39** | **0.31** | **623** |
| **Weighted avg**         | **0.55**  | **0.31** | **0.34** | **623** |

These results indicate that some departments, such as Consulting, Human Resources, and Project Management, are predicted with moderate recall (0.67, 0.62, and 0.59), but overall performance is low. The dominant Other class (344 of 623 titles) reaches a recall of only 0.22, and small classes such as Administrative and Purchasing have a precision below 0.05. This demonstrates the limitations of zero-shot embedding-based classification in domains with unbalanced class distributions and short, noisy job titles.
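
To put the accuracy of 0.315 in context, it helps to compare it with a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline from the support counts in the report above; it illustrates the arithmetic and is not part of the original notebook.

```python
# Support counts copied from the department classification report
support = {
    "Administrative": 14, "Business Development": 20, "Consulting": 39,
    "Customer Support": 6, "Human Resources": 16, "Information Technology": 62,
    "Marketing": 22, "Other": 344, "Project Management": 39,
    "Purchasing": 15, "Sales": 46,
}

total = sum(support.values())
majority_label = max(support, key=support.get)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = support[majority_label] / total
print(majority_label, total, round(baseline_accuracy, 3))
```

Always predicting Other would score about 0.552 on accuracy, well above 0.315, although its macro F1 would be far lower because ten of the eleven classes would never be predicted.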

# 4. Seniority Model with Synthetic Data

In [49]:
ORD_MAP = {
    "Junior": 1.0,
    "Professional": 2.0,
    "Senior": 3.0,
    "Lead": 4.0,
    "Management": 5.0,
    "Director": 6.0,
}
INV_ORD = {v: k for k, v in ORD_MAP.items()}

In [50]:
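def add_synthetic(train_df: pd.DataFrame, synthetic_csv_relpath: str) -> pd.DataFrame:
    """Append synthetic (position, seniority) rows to the labeled training data."""
    syn = pd.read_csv(get_github_url(synthetic_csv_relpath))
    syn = syn[["position", "seniority"]].copy()

    # The synthetic file is expected to store seniority as ordinal values (see ORD_MAP);
    # map them back to label names. Unmapped values become NaN and are dropped below.
    id2label = {v: k for k, v in ORD_MAP.items()}
    syn["label"] = syn["seniority"].map(id2label)
    syn = syn.rename(columns={"position": "text"})
    syn = syn.dropna(subset=["text", "label"])

    # Stack original and synthetic rows under a fresh index
    out = pd.concat([train_df[["text", "label"]], syn[["text", "label"]]], ignore_index=True)
    return out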

In [51]:
sdf_aug = add_synthetic(seniority_df, "data/results/gemini_synthetic.csv")
sdf_aug

Unnamed: 0,text,label
0,Analyst,Junior
1,Analyste financier,Junior
2,Anwendungstechnischer Mitarbeiter,Junior
3,Application Engineer,Senior
4,Applications Engineer,Senior
...,...,...
11309,Juristischer Berater,Professional
11310,"Leitung Personal, Finanzen, Einkauf, IT | Folk...",Management
11311,Verwaltungsleitung Landesspracheninstitut in d...,Management
11312,"Leitung Gebäudemanagement, Einkauf und Control...",Management


In [52]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
from collections import defaultdict
import numpy as np

# --- Labels and example texts from sdf_aug (including synthetic data) ---
slabel_names = sdf_aug["label"].astype(str).tolist()
slabel_texts = sdf_aug["text"].astype(str).tolist()

# --- Evaluation labels (ACTIVE jobs) ---
sstrue_seniority = jobs_annotated_active_df["seniority"].astype(str).tolist()

# --- Load the embedding model ---
sembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Embeddings for the labeled example texts ---
X = sembed_model.encode(slabel_texts, convert_to_tensor=True)

# --- Centroid per seniority label ---
by_label = defaultdict(list)
for emb, lab in zip(X, slabel_names):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = sembed_model.encode(
    jobs_annotated_active_df["position"].astype(str).tolist(),
    convert_to_tensor=True
)

# --- Predictions ---
sspred_seniority = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    sspred_seniority.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
ss_eval_accuracy = accuracy_score(sstrue_seniority, sspred_seniority)
ss_eval_macro_f1 = f1_score(sstrue_seniority, sspred_seniority, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs")
print("Accuracy:", round(ss_eval_accuracy, 3))
print("Macro F1:", round(ss_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(sstrue_seniority, sspred_seniority))

Embedding-based Seniority Prediction on ACTIVE Jobs
Accuracy: 0.478
Macro F1: 0.409

Classification Report:

              precision    recall  f1-score   support

    Director       0.44      0.82      0.58        34
      Junior       0.03      0.17      0.05        12
        Lead       0.48      0.35      0.41       125
  Management       0.83      0.62      0.71       192
Professional       0.57      0.40      0.47       216
      Senior       0.17      0.41      0.24        44

    accuracy                           0.48       623
   macro avg       0.42      0.46      0.41       623
weighted avg       0.59      0.48      0.51       623



# 5. Department Model with Synthetic Data

In [53]:
def add_synthetic_department(train_df: pd.DataFrame, synthetic_csv_relpath: str) -> pd.DataFrame:
    syn = pd.read_csv(get_github_url(synthetic_csv_relpath))

    # expect columns: position, department
    syn = syn[["position", "department"]].copy()
    syn = syn.rename(columns={"position": "text", "department": "label"})
    syn = syn.dropna(subset=["text", "label"])

    out = pd.concat([train_df[["text", "label"]], syn[["text", "label"]]], ignore_index=True)
    return out

In [54]:
ddf_aug = add_synthetic_department(department_df, "data/results/gemini_synthetic.csv")
ddf_aug

Unnamed: 0,text,label
0,Adjoint directeur communication,Marketing
1,Advisor Strategy and Projects,Project Management
2,Beratung & Projekte,Project Management
3,Beratung & Projektmanagement,Project Management
4,Beratung und Projektmanagement kommunale Partner,Project Management
...,...,...
12026,Juristischer Berater,Consulting
12027,"Leitung Personal, Finanzen, Einkauf, IT | Folk...",Human Resources
12028,Verwaltungsleitung Landesspracheninstitut in d...,Administrative
12029,"Leitung Gebäudemanagement, Einkauf und Control...",Purchasing


In [55]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
from collections import defaultdict
import numpy as np

# --- Labels and example texts from ddf_aug (including synthetic data) ---
dlabel_names = ddf_aug["label"].astype(str).tolist()
dlabel_texts = ddf_aug["text"].astype(str).tolist()

# --- Evaluation labels (ACTIVE jobs) ---
sdtrue_department = jobs_annotated_active_df["department"].astype(str).tolist()

# --- Load the embedding model ---
dembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Embeddings for the labeled example texts ---
X = dembed_model.encode(dlabel_texts, convert_to_tensor=True)

# --- Centroid per department label ---
by_label = defaultdict(list)
for emb, lab in zip(X, dlabel_names):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels])

# --- Embeddings for ACTIVE jobs ---
E = dembed_model.encode(
    jobs_annotated_active_df["position"].astype(str).tolist(),
    convert_to_tensor=True)

# --- Predictions ---
sdpred_department = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    sdpred_department.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
sd_eval_accuracy = accuracy_score(sdtrue_department, sdpred_department)
sd_eval_macro_f1 = f1_score(sdtrue_department, sdpred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs")
print("Accuracy:", round(sd_eval_accuracy, 3))
print("Macro F1:", round(sd_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(sdtrue_department, sdpred_department))

Embedding-based Department Prediction on ACTIVE Jobs
Accuracy: 0.496
Macro F1: 0.44

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.03      0.21      0.05        14
  Business Development       0.16      0.35      0.22        20
            Consulting       0.43      0.51      0.47        39
      Customer Support       0.55      1.00      0.71         6
       Human Resources       0.56      0.62      0.59        16
Information Technology       0.54      0.32      0.40        62
             Marketing       0.48      0.45      0.47        22
                 Other       0.78      0.53      0.64       344
    Project Management       0.71      0.62      0.66        39
            Purchasing       0.07      0.20      0.10        15
                 Sales       0.63      0.48      0.54        46

              accuracy                           0.50       623
             macro avg       0.45      0.48      0.44    

# 6. Seniority Model with Synthetic Data and Oversampling

In [56]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
from collections import defaultdict
from imblearn.over_sampling import RandomOverSampler
import numpy as np

# --- Labels and example texts from sdf_aug (including synthetic data) ---
slabel_names = sdf_aug["label"].astype(str).tolist()
slabel_texts = sdf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label counts ---
ros = RandomOverSampler(random_state=42)
slabel_texts_res, slabel_names_res = ros.fit_resample(
    np.array(slabel_texts).reshape(-1,1),  # 2-D column shape required by RandomOverSampler
    slabel_names
)
slabel_texts_res = slabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
sostrue_seniority = jobs_annotated_active_df["seniority"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
sembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Embeddings for the labeled example texts (after oversampling) ---
X = sembed_model.encode(slabel_texts_res, convert_to_tensor=True)

# --- Compute the centroid per seniority label ---
by_label = defaultdict(list)
for emb, lab in zip(X, slabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = sembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Predictions ---
sospred_seniority = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    sospred_seniority.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
sos_eval_accuracy = accuracy_score(sostrue_seniority, sospred_seniority)
sos_eval_macro_f1 = f1_score(sostrue_seniority, sospred_seniority, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs (with oversampling)")
print("Accuracy:", round(sos_eval_accuracy, 3))
print("Macro F1:", round(sos_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(sostrue_seniority, sospred_seniority))

Embedding-based Seniority Prediction on ACTIVE Jobs (with oversampling)
Accuracy: 0.474
Macro F1: 0.405

Classification Report:

              precision    recall  f1-score   support

    Director       0.44      0.82      0.57        34
      Junior       0.03      0.17      0.05        12
        Lead       0.47      0.35      0.40       125
  Management       0.83      0.61      0.70       192
Professional       0.56      0.40      0.47       216
      Senior       0.16      0.39      0.23        44

    accuracy                           0.47       623
   macro avg       0.42      0.46      0.40       623
weighted avg       0.58      0.47      0.51       623



# 7. Department Model with Synthetic Data and Oversampling

In [57]:
# --- Labels and example texts from ddf_aug (including synthetic data) ---
dlabel_names = ddf_aug["label"].astype(str).tolist()
dlabel_texts = ddf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label counts ---
ros = RandomOverSampler(random_state=42)
dlabel_texts_res, dlabel_names_res = ros.fit_resample(
    np.array(dlabel_texts).reshape(-1,1),  # 2-D column shape required by RandomOverSampler
    dlabel_names
)
dlabel_texts_res = dlabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
sodtrue_department = jobs_annotated_active_df["department"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
dembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Embeddings for the labeled example texts (after oversampling) ---
X = dembed_model.encode(dlabel_texts_res, convert_to_tensor=True)

# --- Compute the centroid per department label ---
by_label = defaultdict(list)
for emb, lab in zip(X, dlabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = dembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Predictions ---
sodpred_department = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    sodpred_department.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
sod_eval_accuracy = accuracy_score(sodtrue_department, sodpred_department)
sod_eval_macro_f1 = f1_score(sodtrue_department, sodpred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs (with oversampling)")
print("Accuracy:", round(sod_eval_accuracy, 3))
print("Macro F1:", round(sod_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(sodtrue_department, sodpred_department))

Embedding-based Department Prediction on ACTIVE Jobs (with oversampling)
Accuracy: 0.502
Macro F1: 0.446

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.05      0.36      0.09        14
  Business Development       0.17      0.35      0.23        20
            Consulting       0.42      0.51      0.46        39
      Customer Support       0.55      1.00      0.71         6
       Human Resources       0.53      0.62      0.57        16
Information Technology       0.59      0.35      0.44        62
             Marketing       0.48      0.45      0.47        22
                 Other       0.79      0.53      0.64       344
    Project Management       0.71      0.64      0.68        39
            Purchasing       0.07      0.20      0.10        15
                 Sales       0.66      0.46      0.54        46

              accuracy                           0.50       623
             macro avg       0.45    

# 8. Seniority Model with Synthetic Data, Oversampling, and the Model "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"

In [58]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
from collections import defaultdict
from imblearn.over_sampling import RandomOverSampler
import numpy as np

# --- Labels and example texts from sdf_aug (including synthetic data) ---
slabel_names = sdf_aug["label"].astype(str).tolist()
slabel_texts = sdf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label counts ---
ros = RandomOverSampler(random_state=42)
slabel_texts_res, slabel_names_res = ros.fit_resample(
    np.array(slabel_texts).reshape(-1,1),  # 2-D column shape required by RandomOverSampler
    slabel_names
)
slabel_texts_res = slabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
m1sostrue_seniority = jobs_annotated_active_df["seniority"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
sembed_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# --- Embeddings for the labeled example texts (after oversampling) ---
X = sembed_model.encode(slabel_texts_res, convert_to_tensor=True)

# --- Compute the centroid per seniority label ---
by_label = defaultdict(list)
for emb, lab in zip(X, slabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = sembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Predictions ---
m1sospred_seniority = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    m1sospred_seniority.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
m1sos_eval_accuracy = accuracy_score(m1sostrue_seniority, m1sospred_seniority)
m1sos_eval_macro_f1 = f1_score(m1sostrue_seniority, m1sospred_seniority, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs (with model 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2')")
print("Accuracy:", round(m1sos_eval_accuracy, 3))
print("Macro F1:", round(m1sos_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(m1sostrue_seniority, m1sospred_seniority))

Embedding-based Seniority Prediction on ACTIVE Jobs (with model 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
Accuracy: 0.528
Macro F1: 0.453

Classification Report:

              precision    recall  f1-score   support

    Director       0.55      0.85      0.67        34
      Junior       0.12      0.75      0.20        12
        Lead       0.57      0.25      0.35       125
  Management       0.79      0.78      0.78       192
Professional       0.57      0.45      0.50       216
      Senior       0.16      0.30      0.21        44

    accuracy                           0.53       623
   macro avg       0.46      0.56      0.45       623
weighted avg       0.60      0.53      0.54       623



# 9. Department Model with Synthetic Data, Oversampling, and the Model "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"

In [59]:
# --- Labels and example texts from ddf_aug (including synthetic data) ---
dlabel_names = ddf_aug["label"].astype(str).tolist()
dlabel_texts = ddf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label counts ---
ros = RandomOverSampler(random_state=42)
dlabel_texts_res, dlabel_names_res = ros.fit_resample(
    np.array(dlabel_texts).reshape(-1,1),  # 2-D column shape required by RandomOverSampler
    dlabel_names
)
dlabel_texts_res = dlabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
m1sodtrue_department = jobs_annotated_active_df["department"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
dembed_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# --- Embeddings for the labeled example texts (after oversampling) ---
X = dembed_model.encode(dlabel_texts_res, convert_to_tensor=True)

# --- Compute the centroid per department label ---
by_label = defaultdict(list)
for emb, lab in zip(X, dlabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = dembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Predictions ---
m1sodpred_department = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    m1sodpred_department.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
m1sod_eval_accuracy = accuracy_score(m1sodtrue_department, m1sodpred_department)
m1sod_eval_macro_f1 = f1_score(m1sodtrue_department, m1sodpred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs (with model 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2')")
print("Accuracy:", round(m1sod_eval_accuracy, 3))
print("Macro F1:", round(m1sod_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(m1sodtrue_department, m1sodpred_department))

Embedding-based Department Prediction on ACTIVE Jobs (with model 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
Accuracy: 0.623
Macro F1: 0.512

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.21      0.57      0.30        14
  Business Development       0.22      0.55      0.31        20
            Consulting       0.65      0.51      0.57        39
      Customer Support       0.43      1.00      0.60         6
       Human Resources       0.50      0.69      0.58        16
Information Technology       0.56      0.35      0.44        62
             Marketing       0.67      0.36      0.47        22
                 Other       0.85      0.70      0.76       344
    Project Management       0.42      0.69      0.52        39
            Purchasing       0.38      0.73      0.50        15
                 Sales       0.65      0.52      0.58        46

              accuracy                       

# 10. Seniority Model with Synthetic Data, Oversampling, and the Model "sentence-transformers/distiluse-base-multilingual-cased-v2"

In [60]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
from collections import defaultdict
from imblearn.over_sampling import RandomOverSampler
import numpy as np

# --- Labels and example texts from sdf_aug (including synthetic data) ---
slabel_names = sdf_aug["label"].astype(str).tolist()
slabel_texts = sdf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label counts ---
ros = RandomOverSampler(random_state=42)
slabel_texts_res, slabel_names_res = ros.fit_resample(
    np.array(slabel_texts).reshape(-1,1),  # 2-D column shape required by RandomOverSampler
    slabel_names
)
slabel_texts_res = slabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
m2sostrue_seniority = jobs_annotated_active_df["seniority"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
sembed_model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")

# --- Embeddings for the labeled example texts (after oversampling) ---
X = sembed_model.encode(slabel_texts_res, convert_to_tensor=True)

# --- Compute the centroid per seniority label ---
by_label = defaultdict(list)
for emb, lab in zip(X, slabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings für ACTIVE Jobs ---
E = sembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Vorhersagen ---
m2sospred_seniority = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    m2sospred_seniority.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
m2sos_eval_accuracy = accuracy_score(m2sostrue_seniority, m2sospred_seniority)
m2sos_eval_macro_f1 = f1_score(m2sostrue_seniority, m2sospred_seniority, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs (mit Modell 'sentence-transformers/distiluse-base-multilingual-cased-v2')")
print("Accuracy:", round(m2sos_eval_accuracy, 3))
print("Macro F1:", round(m2sos_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(m2sostrue_seniority, m2sospred_seniority))

Embedding-based Seniority Prediction on ACTIVE Jobs (mit Modell 'sentence-transformers/distiluse-base-multilingual-cased-v2')
Accuracy: 0.502
Macro F1: 0.398

Classification Report:

              precision    recall  f1-score   support

    Director       0.43      0.74      0.54        34
      Junior       0.06      0.25      0.10        12
        Lead       0.78      0.26      0.39       125
  Management       0.64      0.68      0.66       192
Professional       0.59      0.53      0.56       216
      Senior       0.12      0.20      0.15        44

    accuracy                           0.50       623
   macro avg       0.44      0.44      0.40       623
weighted avg       0.59      0.50      0.51       623
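
The gap between accuracy (0.502) and macro F1 (0.398) follows from how the two are averaged: macro F1 is the unweighted mean of the per-class F1 scores, so the weak Junior and Senior classes count as much as the much larger Professional class. A minimal sketch that recomputes the macro and weighted averages from the values in the report above. Because those inputs are already rounded to two decimals, the results only approximate the printed figures.

```python
# Per-class (F1, support), copied from the classification report above.
# The F1 values are rounded to two decimals, so the averages are approximate.
report = {
    "Director": (0.54, 34),
    "Junior": (0.10, 12),
    "Lead": (0.39, 125),
    "Management": (0.66, 192),
    "Professional": (0.56, 216),
    "Senior": (0.15, 44),
}

f1_scores = [f1 for f1, _ in report.values()]
supports = [n for _, n in report.values()]

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * n for f1, n in report.values()) / sum(supports)

print(f"macro F1    ~ {macro_f1:.2f}")
print(f"weighted F1 ~ {weighted_f1:.2f}")
```

Improving the two smallest classes would therefore move macro F1 far more than it would move accuracy.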



# 11. Department model with synthetic data, oversampling, and the model "sentence-transformers/distiluse-base-multilingual-cased-v2"

In [61]:
# --- Labels & descriptions from ddf_aug (including synthetic data) ---
dlabel_names = ddf_aug["label"].astype(str).tolist()
dlabel_texts = ddf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label distribution ---
ros = RandomOverSampler(random_state=42)
dlabel_texts_res, dlabel_names_res = ros.fit_resample(
    np.array(dlabel_texts).reshape(-1,1),  # reshaped to 2D for RandomOverSampler
    dlabel_names
)
dlabel_texts_res = dlabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
m2sodtrue_department = jobs_annotated_active_df["department"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
dembed_model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")

# --- Embeddings for the label descriptions (after oversampling) ---
X = dembed_model.encode(dlabel_texts_res, convert_to_tensor=True)

# --- Compute one centroid per department label ---
by_label = defaultdict(list)
for emb, lab in zip(X, dlabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = dembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Predictions ---
m2sodpred_department = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    m2sodpred_department.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
m2sod_eval_accuracy = accuracy_score(m2sodtrue_department, m2sodpred_department)
m2sod_eval_macro_f1 = f1_score(m2sodtrue_department, m2sodpred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs (mit Modell 'sentence-transformers/distiluse-base-multilingual-cased-v2')")
print("Accuracy:", round(m2sod_eval_accuracy, 3))
print("Macro F1:", round(m2sod_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(m2sodtrue_department, m2sodpred_department))

Embedding-based Department Prediction on ACTIVE Jobs (mit Modell 'sentence-transformers/distiluse-base-multilingual-cased-v2')
Accuracy: 0.661
Macro F1: 0.546

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.21      0.21      0.21        14
  Business Development       0.20      0.40      0.27        20
            Consulting       0.64      0.72      0.67        39
      Customer Support       0.57      0.67      0.62         6
       Human Resources       0.41      0.69      0.51        16
Information Technology       0.62      0.61      0.62        62
             Marketing       0.50      0.36      0.42        22
                 Other       0.77      0.73      0.75       344
    Project Management       0.64      0.77      0.70        39
            Purchasing       0.67      0.67      0.67        15
                 Sales       0.75      0.46      0.57        46

              accuracy                        
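
The prediction loop in the cell above calls `cosine_similarity` once per job title. The same nearest-centroid assignment can be computed in a single matrix product after L2-normalising both sides. A minimal sketch using NumPy only; `predict_nearest_centroid` and the toy vectors are illustrative and not part of the notebook.

```python
import numpy as np

def predict_nearest_centroid(embeddings, centroids, labels):
    """Return, for each embedding row, the label of the most cosine-similar centroid."""
    emb = np.asarray(embeddings, dtype=float)
    cen = np.asarray(centroids, dtype=float)
    # L2-normalise rows so that the dot product equals cosine similarity.
    # Assumes no all-zero rows; those would cause a division by zero.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cen = cen / np.linalg.norm(cen, axis=1, keepdims=True)
    sims = emb @ cen.T  # shape: (n_samples, n_labels)
    return [labels[i] for i in sims.argmax(axis=1)]

# Toy 2-D vectors standing in for real sentence embeddings.
toy_centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_labels = ["Sales", "Marketing"]
toy_embeddings = np.array([[0.9, 0.1], [0.2, 0.8], [3.0, 1.0]])

toy_predictions = predict_nearest_centroid(toy_embeddings, toy_centroids, toy_labels)
print(toy_predictions)
```

In the notebook this would presumably be called as `predict_nearest_centroid(E.cpu().numpy(), proto_embs, proto_labels)`.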

# 12. Seniority model with synthetic data, oversampling, and the model "intfloat/multilingual-e5-base"

In [62]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
from collections import defaultdict
from imblearn.over_sampling import RandomOverSampler
import numpy as np

# --- Labels & descriptions from sdf_aug (including synthetic data) ---
slabel_names = sdf_aug["label"].astype(str).tolist()
slabel_texts = sdf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label distribution ---
ros = RandomOverSampler(random_state=42)
slabel_texts_res, slabel_names_res = ros.fit_resample(
    np.array(slabel_texts).reshape(-1,1),  # reshaped to 2D for RandomOverSampler
    slabel_names
)
slabel_texts_res = slabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
m3sostrue_seniority = jobs_annotated_active_df["seniority"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
sembed_model = SentenceTransformer("intfloat/multilingual-e5-base")

# --- Embeddings for the label descriptions (after oversampling) ---
X = sembed_model.encode(slabel_texts_res, convert_to_tensor=True)

# --- Compute one centroid per seniority label ---
by_label = defaultdict(list)
for emb, lab in zip(X, slabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = sembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Predictions ---
m3sospred_seniority = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    m3sospred_seniority.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
m3sos_eval_accuracy = accuracy_score(m3sostrue_seniority, m3sospred_seniority)
m3sos_eval_macro_f1 = f1_score(m3sostrue_seniority, m3sospred_seniority, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs (mit Modell 'intfloat/multilingual-e5-base')")
print("Accuracy:", round(m3sos_eval_accuracy, 3))
print("Macro F1:", round(m3sos_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(m3sostrue_seniority, m3sospred_seniority))

Embedding-based Seniority Prediction on ACTIVE Jobs (mit Modell 'intfloat/multilingual-e5-base')
Accuracy: 0.576
Macro F1: 0.483

Classification Report:

              precision    recall  f1-score   support

    Director       0.52      0.82      0.64        34
      Junior       0.12      0.58      0.20        12
        Lead       0.78      0.32      0.45       125
  Management       0.75      0.77      0.76       192
Professional       0.68      0.56      0.61       216
      Senior       0.18      0.34      0.24        44

    accuracy                           0.58       623
   macro avg       0.50      0.57      0.48       623
weighted avg       0.67      0.58      0.59       623
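
One detail that may matter for this model: the E5 model card states that each input text should start with a `query: ` or `passage: ` prefix, whereas the cell above encodes the raw strings. Whether adding the prefix would change these scores has not been tested here. A small sketch of the preprocessing step; `add_e5_prefix` is a hypothetical helper, not part of sentence-transformers.

```python
def add_e5_prefix(texts, prefix="query: "):
    """Prepend the E5 input prefix to every text that does not already have it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]

# Illustrative job titles, not taken from the dataset.
sample_titles = ["Senior Software Engineer", "query: Head of Sales"]
prefixed_titles = add_e5_prefix(sample_titles)
print(prefixed_titles)
```

Both the label texts and the evaluation titles would need to go through the helper before `encode`, so that centroids and job embeddings are built from inputs in the same form.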



# 13. Department model with synthetic data, oversampling, and the model "intfloat/multilingual-e5-base"

In [63]:
# --- Labels & descriptions from ddf_aug (including synthetic data) ---
dlabel_names = ddf_aug["label"].astype(str).tolist()
dlabel_texts = ddf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label distribution ---
ros = RandomOverSampler(random_state=42)
dlabel_texts_res, dlabel_names_res = ros.fit_resample(
    np.array(dlabel_texts).reshape(-1,1),  # reshaped to 2D for RandomOverSampler
    dlabel_names
)
dlabel_texts_res = dlabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
m3sodtrue_department = jobs_annotated_active_df["department"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
dembed_model = SentenceTransformer("intfloat/multilingual-e5-base")

# --- Embeddings for the label descriptions (after oversampling) ---
X = dembed_model.encode(dlabel_texts_res, convert_to_tensor=True)

# --- Compute one centroid per department label ---
by_label = defaultdict(list)
for emb, lab in zip(X, dlabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = dembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Predictions ---
m3sodpred_department = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    m3sodpred_department.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
m3sod_eval_accuracy = accuracy_score(m3sodtrue_department, m3sodpred_department)
m3sod_eval_macro_f1 = f1_score(m3sodtrue_department, m3sodpred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs (mit Modell 'intfloat/multilingual-e5-base')")
print("Accuracy:", round(m3sod_eval_accuracy, 3))
print("Macro F1:", round(m3sod_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(m3sodtrue_department, m3sodpred_department))

Embedding-based Department Prediction on ACTIVE Jobs (mit Modell 'intfloat/multilingual-e5-base')
Accuracy: 0.698
Macro F1: 0.601

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.13      0.43      0.20        14
  Business Development       0.25      0.45      0.32        20
            Consulting       0.69      0.62      0.65        39
      Customer Support       0.86      1.00      0.92         6
       Human Resources       0.62      0.62      0.62        16
Information Technology       0.64      0.52      0.57        62
             Marketing       0.64      0.41      0.50        22
                 Other       0.84      0.79      0.82       344
    Project Management       0.73      0.77      0.75        39
            Purchasing       0.44      0.80      0.57        15
                 Sales       0.93      0.54      0.68        46

              accuracy                           0.70       623
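
A note on what oversampling does in this pipeline: every label is reduced to a single centroid, so duplicating texts cannot give a minority label more prototypes. It only re-weights the duplicated texts inside that label's mean. A toy sketch with made-up 2-D vectors (not real embeddings) illustrates the effect.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Hypothetical embeddings of a minority label with two distinct texts.
minority = [[1.0, 0.0], [0.0, 1.0]]

# Duplicating every text equally often leaves the centroid unchanged.
even_duplicates = minority * 3

# Random oversampling may instead draw one text more often than the other,
# which pulls the centroid toward the over-drawn text.
uneven_duplicates = [[1.0, 0.0]] * 3 + [[0.0, 1.0]]

print(centroid(minority))
print(centroid(even_duplicates))
print(centroid(uneven_duplicates))
```

This may explain why the comparison table at the end of the notebook shows only marginal differences between the "with Synthetic" and "Synthetic + Oversampling" rows.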
           

# 14. Seniority model with synthetic data, oversampling, and the model "BAAI/bge-m3"

In [64]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
from collections import defaultdict
from imblearn.over_sampling import RandomOverSampler
import numpy as np

# --- Labels & descriptions from sdf_aug (including synthetic data) ---
slabel_names = sdf_aug["label"].astype(str).tolist()
slabel_texts = sdf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label distribution ---
ros = RandomOverSampler(random_state=42)
slabel_texts_res, slabel_names_res = ros.fit_resample(
    np.array(slabel_texts).reshape(-1,1),  # reshaped to 2D for RandomOverSampler
    slabel_names
)
slabel_texts_res = slabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
m4sostrue_seniority = jobs_annotated_active_df["seniority"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
sembed_model = SentenceTransformer("BAAI/bge-m3")

# --- Embeddings for the label descriptions (after oversampling) ---
X = sembed_model.encode(slabel_texts_res, convert_to_tensor=True)

# --- Compute one centroid per seniority label ---
by_label = defaultdict(list)
for emb, lab in zip(X, slabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = sembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Predictions ---
m4sospred_seniority = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    m4sospred_seniority.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
m4sos_eval_accuracy = accuracy_score(m4sostrue_seniority, m4sospred_seniority)
m4sos_eval_macro_f1 = f1_score(m4sostrue_seniority, m4sospred_seniority, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs (mit Modell 'BAAI/bge-m3')")
print("Accuracy:", round(m4sos_eval_accuracy, 3))
print("Macro F1:", round(m4sos_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(m4sostrue_seniority, m4sospred_seniority))

Embedding-based Seniority Prediction on ACTIVE Jobs (mit Modell 'BAAI/bge-m3')
Accuracy: 0.6
Macro F1: 0.527

Classification Report:

              precision    recall  f1-score   support

    Director       0.62      0.94      0.74        34
      Junior       0.13      0.75      0.23        12
        Lead       0.93      0.43      0.59       125
  Management       0.83      0.81      0.82       192
Professional       0.61      0.50      0.55       216
      Senior       0.17      0.32      0.23        44

    accuracy                           0.60       623
   macro avg       0.55      0.63      0.53       623
weighted avg       0.70      0.60      0.62       623
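
Sections 10 to 15 repeat the same steps and differ only in the model name and the target column. One way to cut the duplication is a helper that takes the encoder as an argument. The sketch below is an assumption about how that could look, not the notebook's code: `nearest_centroid_accuracy` and the toy `toy_encode` function are hypothetical, and in the notebook the `encode` argument would be something like `SentenceTransformer(name).encode`.

```python
import numpy as np

def nearest_centroid_accuracy(encode, train_texts, train_labels, eval_texts, eval_labels):
    """Build one centroid per label, predict by cosine similarity, return (accuracy, predictions)."""
    train_emb = np.asarray(encode(train_texts), dtype=float)
    eval_emb = np.asarray(encode(eval_texts), dtype=float)

    # Keep first-seen label order, as the dict-based grouping in the notebook does.
    labels = list(dict.fromkeys(train_labels))
    label_arr = np.asarray(train_labels)
    centroids = np.vstack([train_emb[label_arr == lab].mean(axis=0) for lab in labels])

    # Cosine similarity via normalised dot products (assumes no all-zero rows).
    eval_emb = eval_emb / np.linalg.norm(eval_emb, axis=1, keepdims=True)
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    predictions = [labels[i] for i in (eval_emb @ centroids.T).argmax(axis=1)]

    correct = sum(p == t for p, t in zip(predictions, eval_labels))
    return correct / len(eval_labels), predictions

def toy_encode(texts):
    """Stand-in encoder: counts of the letters 'a' and 'b' in each text."""
    return [[t.count("a"), t.count("b")] for t in texts]

toy_accuracy, toy_preds = nearest_centroid_accuracy(
    toy_encode,
    train_texts=["aaa", "aab", "bbb", "abb"],
    train_labels=["A", "A", "B", "B"],
    eval_texts=["aaaa", "bb", "abb"],
    eval_labels=["A", "B", "A"],
)
print(toy_accuracy, toy_preds)
```

Passing the encoder in keeps the helper independent of any particular model, so each model in the comparison becomes a single call rather than a copied cell.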



# 15. Department model with synthetic data, oversampling, and the model "BAAI/bge-m3"

In [65]:
# --- Labels & descriptions from ddf_aug (including synthetic data) ---
dlabel_names = ddf_aug["label"].astype(str).tolist()
dlabel_texts = ddf_aug["text"].astype(str).tolist()

# --- Oversampling: balance the label distribution ---
ros = RandomOverSampler(random_state=42)
dlabel_texts_res, dlabel_names_res = ros.fit_resample(
    np.array(dlabel_texts).reshape(-1,1),  # reshaped to 2D for RandomOverSampler
    dlabel_names
)
dlabel_texts_res = dlabel_texts_res.flatten()

# --- Evaluation labels (ACTIVE jobs) ---
m4sodtrue_department = jobs_annotated_active_df["department"].astype(str).tolist()
eval_texts = jobs_annotated_active_df["position"].astype(str).tolist()

# --- Load the embedding model ---
dembed_model = SentenceTransformer("BAAI/bge-m3")

# --- Embeddings for the label descriptions (after oversampling) ---
X = dembed_model.encode(dlabel_texts_res, convert_to_tensor=True)

# --- Compute one centroid per department label ---
by_label = defaultdict(list)
for emb, lab in zip(X, dlabel_names_res):
    by_label[lab].append(emb.cpu().numpy())

proto_labels = list(by_label.keys())
proto_embs = np.vstack([
    np.mean(by_label[label], axis=0)
    for label in proto_labels
])

# --- Embeddings for ACTIVE jobs ---
E = dembed_model.encode(eval_texts, convert_to_tensor=True)

# --- Predictions ---
m4sodpred_department = []
for e in E:
    sims = cosine_similarity(e.cpu().numpy().reshape(1, -1), proto_embs)[0]
    m4sodpred_department.append(proto_labels[int(np.argmax(sims))])

# --- Evaluation ---
m4sod_eval_accuracy = accuracy_score(m4sodtrue_department, m4sodpred_department)
m4sod_eval_macro_f1 = f1_score(m4sodtrue_department, m4sodpred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs (mit Modell 'BAAI/bge-m3')")
print("Accuracy:", round(m4sod_eval_accuracy, 3))
print("Macro F1:", round(m4sod_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(m4sodtrue_department, m4sodpred_department))

Embedding-based Department Prediction on ACTIVE Jobs (mit Modell 'BAAI/bge-m3')
Accuracy: 0.695
Macro F1: 0.545

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.17      0.29      0.22        14
  Business Development       0.30      0.45      0.36        20
            Consulting       0.65      0.67      0.66        39
      Customer Support       0.33      1.00      0.50         6
       Human Resources       0.44      0.69      0.54        16
Information Technology       0.65      0.42      0.51        62
             Marketing       0.50      0.36      0.42        22
                 Other       0.83      0.81      0.82       344
    Project Management       0.64      0.69      0.67        39
            Purchasing       0.52      0.80      0.63        15
                 Sales       0.89      0.54      0.68        46

              accuracy                           0.70       623
             macro avg       
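
In the comparison table printed by the next cell, pandas shortens long `Target` strings (for example `Synthetic + Oversampl...`), so several rows become indistinguishable. A minimal sketch of one way to print the full labels, using a small stand-in frame with two of the notebook's labels and their reported scores rounded to three decimals; `demo_df` is illustrative only.

```python
import pandas as pd

# Stand-in for full_comparison: two labels and scores taken from sections 10 and 12.
demo_df = pd.DataFrame({
    "Target": [
        "Seniority (ACTIVE Jobs – Synthetic + Oversampling mit Modell 2)",
        "Seniority (ACTIVE Jobs – Synthetic + Oversampling mit Modell 3)",
    ],
    "Accuracy": [0.502, 0.576],
    "Macro F1": [0.398, 0.483],
})

# Lift the column-width limit only inside this block; the default is restored afterwards.
with pd.option_context("display.max_colwidth", None):
    full_text = demo_df.to_string(index=False)

print(full_text)
```

The same `option_context` block could wrap `print(full_comparison)` so that the model number stays visible in every row.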

In [67]:
import pandas as pd

# --- Full comparison: Seniority & Department (Baseline vs Synthetic vs Synthetic + Oversampling) ---
full_comparison = pd.DataFrame({
    "Target": [
        "Seniority (ACTIVE Jobs – Baseline)",
        "Seniority (ACTIVE Jobs – with Synthetic)",
        "Seniority (ACTIVE Jobs – Synthetic + Oversampling)",
        "Seniority (ACTIVE Jobs – Synthetic + Oversampling mit Modell 1)",
        "Seniority (ACTIVE Jobs – Synthetic + Oversampling mit Modell 2)",
        "Seniority (ACTIVE Jobs – Synthetic + Oversampling mit Modell 3)",
        "Seniority (ACTIVE Jobs – Syn. + Overs. mit Modell 4)",
        "Department (ACTIVE Jobs – Baseline)",
        "Department (ACTIVE Jobs – with Synthetic)",
        "Department (ACTIVE Jobs – Synthetic + Oversampling)",
        "Department (ACTIVE Jobs – Synthetic + Oversampling mit Modell 1)",
        "Department (ACTIVE Jobs – Synthetic + Oversampling mit Modell 2)",
        "Department (ACTIVE Jobs – Synthetic + Oversampling mit Modell 3)",
        "Department (ACTIVE Jobs – Syn. + Overs. mit Modell 4)",
    ],
    "Accuracy": [
        s_eval_accuracy,      # Seniority baseline
        ss_eval_accuracy,     # Seniority + synthetic
        sos_eval_accuracy,    # Seniority + synthetic + oversampling
        m1sos_eval_accuracy,  # Seniority + synthetic + oversampling with model 1
        m2sos_eval_accuracy,  # Seniority + synthetic + oversampling with model 2
        m3sos_eval_accuracy,  # Seniority + synthetic + oversampling with model 3
        m4sos_eval_accuracy,  # Seniority + synthetic + oversampling with model 4
        d_eval_accuracy,      # Department baseline
        sd_eval_accuracy,     # Department + synthetic
        sod_eval_accuracy,    # Department + synthetic + oversampling
        m1sod_eval_accuracy,  # Department + synthetic + oversampling with model 1
        m2sod_eval_accuracy,  # Department + synthetic + oversampling with model 2
        m3sod_eval_accuracy,  # Department + synthetic + oversampling with model 3
        m4sod_eval_accuracy,  # Department + synthetic + oversampling with model 4
    ],
    "Macro F1": [
        s_eval_macro_f1,      # Seniority baseline
        ss_eval_macro_f1,     # Seniority + synthetic
        sos_eval_macro_f1,    # Seniority + synthetic + oversampling
        m1sos_eval_macro_f1,  # Seniority + synthetic + oversampling with model 1
        m2sos_eval_macro_f1,  # Seniority + synthetic + oversampling with model 2
        m3sos_eval_macro_f1,  # Seniority + synthetic + oversampling with model 3
        m4sos_eval_macro_f1,  # Seniority + synthetic + oversampling with model 4
        d_eval_macro_f1,      # Department baseline
        sd_eval_macro_f1,     # Department + synthetic
        sod_eval_macro_f1,    # Department + synthetic + oversampling
        m1sod_eval_macro_f1,  # Department + synthetic + oversampling with model 1
        m2sod_eval_macro_f1,  # Department + synthetic + oversampling with model 2
        m3sod_eval_macro_f1,  # Department + synthetic + oversampling with model 3
        m4sod_eval_macro_f1,  # Department + synthetic + oversampling with model 4
    ]
})

print("\nFull Model Comparison: Seniority & Department (Baseline vs mit Synthetic vs mit Oversampling vs mit Modell 1 vs mit Modell 2 vs mit Modell 3 vs mit Modell 4)\n")
print(full_comparison)



Full Model Comparison: Seniority & Department (Baseline vs mit Synthetic vs mit Oversampling vs mit Modell 1 vs mit Modell 2 vs mit Modell 3 vs mit Modell 4)

                                               Target  Accuracy  Macro F1
0                  Seniority (ACTIVE Jobs – Baseline)  0.409310  0.350376
1            Seniority (ACTIVE Jobs – with Synthetic)  0.478331  0.409459
2   Seniority (ACTIVE Jobs – Synthetic + Oversampl...  0.473515  0.404537
3   Seniority (ACTIVE Jobs – Synthetic + Oversampl...  0.528090  0.452838
4   Seniority (ACTIVE Jobs – Synthetic + Oversampl...  0.502408  0.398157
5   Seniority (ACTIVE Jobs – Synthetic + Oversampl...  0.576244  0.483127
6   Seniority (ACTIVE Jobs – Syn. + Overs. mit Mod...  0.600321  0.527059
7                 Department (ACTIVE Jobs – Baseline)  0.314607  0.314955
8           Department (ACTIVE Jobs – with Synthetic)  0.495987  0.440235
9   Department (ACTIVE Jobs – Synthetic + Oversamp...  0.502408  0.446475
10  Department (ACTIVE Job