
# 1. GitHub Access Details

In [21]:
# GitHub access details
import pandas as pd

GH_USER = "luisadosch"
GH_REPO = "Final-Project-snapAddy"
BRANCH = "main"

def get_github_url(relative_path):
    return f"https://raw.githubusercontent.com/{GH_USER}/{GH_REPO}/{BRANCH}/{relative_path}"


jobs_annotated_active_df = pd.read_csv(get_github_url("data/processed/jobs_annotated_active.csv"))

department_df = pd.read_csv(get_github_url("data/raw/department-v2.csv"))

seniority_df = pd.read_csv(get_github_url("data/raw/seniority-v2.csv"))

# 2. Seniority Model

In [22]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
import numpy as np

# MODEL KNOWLEDGE (ONLY seniority_df)
stexts = seniority_df["text"].astype(str).tolist()
slabels = seniority_df["label"].astype(str).tolist()

# EVALUATION DATA (ONLY ACTIVE jobs)
seval_texts = jobs_annotated_active_df["position"].astype(str).tolist()
strue_seniority = jobs_annotated_active_df["seniority"].astype(str).tolist()

# Embedding model
sembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# Embeddings
smodel_embeddings = sembed_model.encode(stexts,convert_to_tensor=True)

seval_embeddings = sembed_model.encode(seval_texts,convert_to_tensor=True)

# Prediction: Job → Seniority text
spred_seniority = []

for emb in seval_embeddings:
    ssims = cosine_similarity(emb.reshape(1, -1),smodel_embeddings)[0]
    best_idx = np.argmax(ssims)
    spred_seniority.append(slabels[best_idx])

# Evaluation
s_eval_accuracy = accuracy_score(strue_seniority, spred_seniority)
s_eval_macro_f1 = f1_score(strue_seniority, spred_seniority, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs")
print("Accuracy:", round(s_eval_accuracy, 3))
print("Macro F1:", round(s_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(strue_seniority, spred_seniority))


Embedding-based Seniority Prediction on ACTIVE Jobs
Accuracy: 0.43
Macro F1: 0.392

Classification Report:

              precision    recall  f1-score   support

    Director       0.42      0.82      0.55        34
      Junior       0.19      0.42      0.26        12
        Lead       0.45      0.57      0.50       125
  Management       0.87      0.65      0.75       192
Professional       0.00      0.00      0.00       216
      Senior       0.17      0.89      0.28        44

    accuracy                           0.43       623
   macro avg       0.35      0.56      0.39       623
weighted avg       0.40      0.43      0.39       623





This block classifies job seniority by nearest-neighbour search in embedding space. Each ACTIVE job title is converted into a dense vector with the pre-trained all-MiniLM-L6-v2 model, and so is every reference title in the text column of seniority_df. Cosine similarity is then computed between each job title and all reference titles, and the job receives the label of the most similar reference title. No model is trained or fine-tuned, which keeps the approach simple and interpretable: every prediction can be traced back to a single reference title.
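The similarity lookup described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors and the names `toy_reference_vecs`, `toy_reference_labels`, and `predict` are hypothetical stand-ins, whereas the notebook works with 384-dimensional all-MiniLM-L6-v2 embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict(query_vec, reference_vecs, reference_labels):
    """Return the label whose reference vector is most similar to query_vec."""
    sims = [cosine(query_vec, ref) for ref in reference_vecs]
    best_idx = max(range(len(sims)), key=sims.__getitem__)
    return reference_labels[best_idx]


# Hypothetical toy "embeddings", one per reference entry.
toy_reference_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_reference_labels = ["Junior", "Senior", "Director"]

# A query that points mostly along the second axis.
toy_prediction = predict([0.1, 0.9, 0.2], toy_reference_vecs, toy_reference_labels)
print(toy_prediction)
```

The query is assigned whichever label belongs to the closest reference vector, so the quality of the prediction depends entirely on how well the reference entries cover the titles being classified.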

Applying this embedding-based approach to predict seniority on ACTIVE jobs results in an accuracy of 0.43 and a macro F1 score of 0.392. The detailed classification report is as follows:

| Label            | Precision | Recall | F1-score | Support |
| ---------------- | --------- | ------ | -------- | ------- |
| Director         | 0.42      | 0.82   | 0.55     | 34      |
| Junior           | 0.19      | 0.42   | 0.26     | 12      |
| Lead             | 0.45      | 0.57   | 0.50     | 125     |
| Management       | 0.87      | 0.65   | 0.75     | 192     |
| Professional     | 0.00      | 0.00   | 0.00     | 216     |
| Senior           | 0.17      | 0.89   | 0.28     | 44      |
| **Accuracy**     |           |        | 0.43     | 623     |
| **Macro avg**    | 0.35      | 0.56   | 0.39     | 623     |
| **Weighted avg** | 0.40      | 0.43   | 0.39     | 623     |


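These results show that some seniority levels are recovered with high recall (Senior 0.89, Director 0.82), but often at low precision (Senior 0.17), and overall performance is limited. Professional, the largest class with 216 of 623 titles, is never predicted correctly (precision and recall of 0.00), which alone caps the achievable accuracy. Management is the only class that combines high precision (0.87) with reasonable recall (0.65). The approach captures coarse trends but struggles with frequent, ambiguous classes, highlighting the difficulty of training-free classification on short, noisy job titles with imbalanced classes.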
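To make the effect of the Professional class concrete, the overall accuracy can be approximately reconstructed from the per-class recall and support values in the report, since accuracy is the support-weighted mean of recall. The sketch below copies the rounded values from the table, so the result is only accurate to about two decimals; the variable names are illustrative.

```python
# Per-class (recall, support), copied from the seniority report above.
seniority_report = {
    "Director": (0.82, 34),
    "Junior": (0.42, 12),
    "Lead": (0.57, 125),
    "Management": (0.65, 192),
    "Professional": (0.00, 216),
    "Senior": (0.89, 44),
}

total = sum(support for _, support in seniority_report.values())

# Accuracy equals the support-weighted mean of the per-class recall.
accuracy = sum(recall * support for recall, support in seniority_report.values()) / total

# Best accuracy reachable while every Professional title is misclassified.
ceiling = (total - seniority_report["Professional"][1]) / total

print(round(accuracy, 2), round(ceiling, 3))
```

With these inputs the reconstructed accuracy is 0.43, matching the report, and the ceiling is about 0.653: even if every other class were predicted perfectly, roughly a third of the titles would still be wrong.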

# 3. Department Model

In [23]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
import numpy as np

# --- Labels and texts from department_df ---
dlabels = department_df["label"].astype(str).tolist()
dtexts = department_df["text"].astype(str).tolist()  # example titles (not embedded below)

# --- Evaluation: ACTIVE jobs ---
dtrue_department = jobs_annotated_active_df["department"].astype(str).tolist()
deval_texts = jobs_annotated_active_df["position"].astype(str).tolist()  # job titles to classify

# --- Load embedding model ---
dembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Embeddings for the label strings (one per row of department_df) ---
dlabel_embeddings = dembed_model.encode(dlabels, convert_to_tensor=True)

# --- Embeddings for ACTIVE jobs ---
deval_embeddings = dembed_model.encode(deval_texts, convert_to_tensor=True)

# --- Zero-shot predictions on ACTIVE jobs ---
dpred_department = []
for emb in deval_embeddings:
    dsims = cosine_similarity(emb.reshape(1, -1), dlabel_embeddings)[0]
    dpred_department.append(dlabels[np.argmax(dsims)])

# --- Evaluation ---
d_eval_accuracy = accuracy_score(dtrue_department, dpred_department)
d_eval_macro_f1 = f1_score(dtrue_department, dpred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs")
print("Accuracy:", round(d_eval_accuracy, 3))
print("Macro F1:", round(d_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(dtrue_department, dpred_department))

Embedding-based Department Prediction on ACTIVE Jobs
Accuracy: 0.212
Macro F1: 0.306

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.03      0.57      0.06        14
  Business Development       0.17      0.35      0.23        20
            Consulting       0.29      0.54      0.38        39
      Customer Support       0.14      0.67      0.24         6
       Human Resources       0.35      0.75      0.48        16
Information Technology       0.44      0.24      0.31        62
             Marketing       0.39      0.41      0.40        22
                 Other       0.75      0.03      0.05       344
    Project Management       0.34      0.59      0.43        39
            Purchasing       0.30      0.53      0.38        15
                 Sales       0.50      0.35      0.41        46

              accuracy                           0.21       623
             macro avg       0.34      0.46      0.31       623
          weighted avg       0.57      0.21      0.19       623

This block applies embedding-based zero-shot classification to predict the department of each job. Each job title and each department label name (for example "Human Resources") is converted into a dense vector using the same pre-trained embedding model. Cosine similarity is then computed between the job title embeddings and the label embeddings, and each job is assigned the label with the highest similarity. The example titles in department_df are not embedded here; the prediction rests purely on the semantic similarity between a title and a label name, and no training is required.
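Because the code embeds the label column row by row, the same label string is encoded many times. The sketch below illustrates, under stated assumptions, that encoding each distinct label once yields the same prediction. `toy_encode` is a hypothetical letter-count stand-in for a real embedding model, and the label list is made up for the example.

```python
import math
from collections import Counter


def toy_encode(text):
    """Hypothetical stand-in for an embedding model: a 26-dim letter-count vector."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [counts.get(chr(ord("a") + i), 0) for i in range(26)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_label(title, labels):
    """Return the label whose encoding is most similar to the title's encoding."""
    title_vec = toy_encode(title)
    sims = [cosine(title_vec, toy_encode(label)) for label in labels]
    return labels[max(range(len(sims)), key=sims.__getitem__)]


# One label per row, as in a label column with repeated values.
row_labels = ["Sales", "Marketing", "Sales", "Purchasing", "Marketing", "Sales"]
# dict.fromkeys drops duplicates while keeping first-seen order.
unique_labels = list(dict.fromkeys(row_labels))

title = "Sales Manager"
pred_rows = nearest_label(title, row_labels)
pred_unique = nearest_label(title, unique_labels)
print(pred_rows == pred_unique, len(row_labels), len(unique_labels))
```

Duplicate label strings produce identical vectors, so the argmax always lands on the same label either way; deduplicating only reduces the number of vectors that need to be encoded and compared.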

Applying this embedding-based approach to predict departments on ACTIVE jobs results in an accuracy of 0.212 and a macro F1 score of 0.306. The detailed classification report is:

| Label                  | Precision | Recall | F1-score | Support |
| ---------------------- | --------- | ------ | -------- | ------- |
| Administrative         | 0.03      | 0.57   | 0.06     | 14      |
| Business Development   | 0.17      | 0.35   | 0.23     | 20      |
| Consulting             | 0.29      | 0.54   | 0.38     | 39      |
| Customer Support       | 0.14      | 0.67   | 0.24     | 6       |
| Human Resources        | 0.35      | 0.75   | 0.48     | 16      |
| Information Technology | 0.44      | 0.24   | 0.31     | 62      |
| Marketing              | 0.39      | 0.41   | 0.40     | 22      |
| Other                  | 0.75      | 0.03   | 0.05     | 344     |
| Project Management     | 0.34      | 0.59   | 0.43     | 39      |
| Purchasing             | 0.30      | 0.53   | 0.38     | 15      |
| Sales                  | 0.50      | 0.35   | 0.41     | 46      |
| **Accuracy**           |           |        | 0.21     | 623     |
| **Macro avg**          | 0.34      | 0.46   | 0.31     | 623     |
| **Weighted avg**       | 0.57      | 0.21   | 0.19     | 623     |

These results show moderate to good recall for some departments, such as Human Resources (0.75) and Project Management (0.59), but low overall performance. The main cause is Other: it covers 344 of the 623 titles, yet its recall is only 0.03, meaning that about 97% of Other titles are more similar to a specific department name than to the label "Other". This demonstrates the limitations of zero-shot embedding-based classification with unbalanced class distributions and short, noisy job titles.
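For context, the class distribution in the report allows a quick comparison with a constant predictor that always outputs the most frequent class. This baseline is not part of the notebook; it is a small worked calculation from the support column, added here only to put the reported scores in perspective.

```python
# Support per department, copied from the report above.
support = {
    "Administrative": 14, "Business Development": 20, "Consulting": 39,
    "Customer Support": 6, "Human Resources": 16, "Information Technology": 62,
    "Marketing": 22, "Other": 344, "Project Management": 39,
    "Purchasing": 15, "Sales": 46,
}

total = sum(support.values())
majority = max(support, key=support.get)

# A constant predictor that always outputs the majority class:
# its accuracy equals the share of that class in the data.
baseline_accuracy = support[majority] / total

# For the majority class, precision equals that share and recall is 1.0.
precision, recall = baseline_accuracy, 1.0
majority_f1 = 2 * precision * recall / (precision + recall)

# Every other class is never predicted, so its F1 is 0.
baseline_macro_f1 = majority_f1 / len(support)

print(majority, round(baseline_accuracy, 3), round(baseline_macro_f1, 3))
```

Always predicting Other would reach an accuracy of about 0.552 but a macro F1 of only about 0.065. The embedding model shows the opposite pattern (accuracy 0.212, macro F1 0.306): it recognizes the small departments to some degree while missing almost all of the dominant class.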

In [24]:
# --- Compare embedding-based results for ACTIVE jobs ---
comparison_metrics_active = pd.DataFrame({
    "Target": ["Seniority (ACTIVE Jobs)", "Department (ACTIVE Jobs)"],
    "Accuracy": [s_eval_accuracy, d_eval_accuracy],
    "Macro F1": [s_eval_macro_f1, d_eval_macro_f1]
})

print("Embedding-based Model Results (ACTIVE Jobs):\n")
print(comparison_metrics_active)

Embedding-based Model Results (ACTIVE Jobs):

                     Target  Accuracy  Macro F1
0   Seniority (ACTIVE Jobs)  0.430177  0.392017
1  Department (ACTIVE Jobs)  0.211878  0.305883


This block summarizes the evaluation metrics for the two embedding-based models applied to ACTIVE job titles. Accuracy is the fraction of correct predictions. Macro F1 is the unweighted mean of the per-class F1-scores, so small classes count as much as large ones and a dominant class cannot mask poor performance elsewhere. Reporting both metrics allows a direct comparison between the seniority and department predictions on the same dataset.
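The difference between macro and support-weighted averaging can be checked by hand with the per-class F1-scores from the department report. The sketch below uses the rounded values from that table, so it reproduces the reported averages only to two decimals; the variable names are illustrative.

```python
# Per-class (F1-score, support), copied from the department report.
department_f1 = {
    "Administrative": (0.06, 14),
    "Business Development": (0.23, 20),
    "Consulting": (0.38, 39),
    "Customer Support": (0.24, 6),
    "Human Resources": (0.48, 16),
    "Information Technology": (0.31, 62),
    "Marketing": (0.40, 22),
    "Other": (0.05, 344),
    "Project Management": (0.43, 39),
    "Purchasing": (0.38, 15),
    "Sales": (0.41, 46),
}

n_classes = len(department_f1)
n_samples = sum(support for _, support in department_f1.values())

# Macro average: every class contributes equally.
macro_f1 = sum(f1 for f1, _ in department_f1.values()) / n_classes

# Weighted average: every class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in department_f1.values()) / n_samples

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average comes out at 0.31 and the weighted average at 0.19. The gap is driven by Other, whose low F1-score counts once among eleven classes in the macro average but carries more than half of the weight in the weighted average.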

The results of the embedding-based models on ACTIVE jobs are as follows:

| Target                   | Accuracy | Macro F1 |
| ------------------------ | -------- | -------- |
| Seniority (ACTIVE Jobs)  | 0.430    | 0.392    |
| Department (ACTIVE Jobs) | 0.212    | 0.306    |

These results indicate that the embedding-based approach struggles with both targets on ACTIVE jobs. Seniority reaches about twice the accuracy of department (0.430 versus 0.212) and a higher macro F1 (0.392 versus 0.306), but both remain limited, reflecting the challenges of zero-shot classification on short, noisy job titles with class imbalance.

# 4. Seniority Model with Synthetic Data

In [25]:
ORD_MAP = {
    "Junior": 1.0,
    "Professional": 2.0,
    "Senior": 3.0,
    "Lead": 4.0,
    "Management": 5.0,
    "Director": 6.0,
}
INV_ORD = {v: k for k, v in ORD_MAP.items()}

In [26]:
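def add_synthetic(train_df: pd.DataFrame, synthetic_csv_relpath: str) -> pd.DataFrame:
    """Append synthetic (position, seniority) rows to train_df as (text, label) rows."""
    syn = pd.read_csv(get_github_url(synthetic_csv_relpath))
    syn = syn[["position", "seniority"]].copy()

    # Map the seniority values through the inverse of ORD_MAP; unmapped values become NaN.
    id2label = {v: k for k, v in ORD_MAP.items()}
    syn["label"] = syn["seniority"].map(id2label)
    syn = syn.rename(columns={"position": "text"})
    # Drop rows without a title or without a mappable seniority value.
    syn = syn.dropna(subset=["text", "label"])

    out = pd.concat([train_df[["text", "label"]], syn[["text", "label"]]], ignore_index=True)
    return out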
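The mapping step inside add_synthetic can be illustrated without pandas. The sketch assumes, as the code above implies, that the synthetic file stores seniority as the numeric codes of ORD_MAP; the rows in `toy_rows` are invented for the example and do not come from the real file.

```python
ORD_MAP = {
    "Junior": 1.0,
    "Professional": 2.0,
    "Senior": 3.0,
    "Lead": 4.0,
    "Management": 5.0,
    "Director": 6.0,
}
INV_ORD = {v: k for k, v in ORD_MAP.items()}

# Hypothetical synthetic rows: (position, numeric seniority code).
toy_rows = [
    ("Werkstudent", 1.0),
    ("Head of Sales", 5.0),
    ("Unknown level", 9.0),  # code without an entry in ORD_MAP
]

# Keep only rows whose code can be mapped back to a label name,
# mirroring the map-then-dropna logic of add_synthetic.
labelled = [(text, INV_ORD[code]) for text, code in toy_rows if code in INV_ORD]
print(labelled)
```

Rows with a code outside the mapping are silently discarded, so it is worth comparing the row count before and after the mapping when the synthetic file changes.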

In [27]:
sdf_aug = add_synthetic(seniority_df, "data/results/gemini_synthetic.csv")
sdf_aug

|       | text | label |
| ----- | ---- | ----- |
| 0     | Analyst | Junior |
| 1     | Analyste financier | Junior |
| 2     | Anwendungstechnischer Mitarbeiter | Junior |
| 3     | Application Engineer | Senior |
| 4     | Applications Engineer | Senior |
| ...   | ... | ... |
| 11309 | Juristischer Berater | Professional |
| 11310 | Leitung Personal, Finanzen, Einkauf, IT \| Folk... | Management |
| 11311 | Verwaltungsleitung Landesspracheninstitut in d... | Management |
| 11312 | Leitung Gebäudemanagement, Einkauf und Control... | Management |



In [28]:
# --- Seniority labels and texts (with synthetic data) ---
sslabels = sdf_aug["label"].astype(str).tolist()
sstexts = sdf_aug["text"].astype(str).tolist()

# --- Evaluation: ACTIVE jobs ---
sseval_texts = jobs_annotated_active_df["position"].astype(str).tolist()
sstrue_labels = jobs_annotated_active_df["seniority"].astype(str).tolist()

# --- Load embedding model ---
ssembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Embeddings for the label strings ---
# Note: this encodes sslabels, not sstexts, so the example titles
# (original and synthetic) do not enter the similarity search.
sslabel_embeddings = ssembed_model.encode(sslabels, convert_to_tensor=True)

# --- Embeddings for ACTIVE jobs ---
sseval_embeddings = ssembed_model.encode(sseval_texts, convert_to_tensor=True)

# --- Zero-shot predictions on ACTIVE jobs ---
sspred_labels = []
for emb in sseval_embeddings:
    sssims = cosine_similarity(emb.reshape(1, -1), sslabel_embeddings)[0]
    sspred_labels.append(sslabels[np.argmax(sssims)])

# --- Evaluation ---
ssaccuracy = accuracy_score(sstrue_labels, sspred_labels)
ssmacro_f1 = f1_score(sstrue_labels, sspred_labels, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs (with Synthetic Data)")
print("Accuracy:", round(ssaccuracy, 3))
print("Macro F1:", round(ssmacro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(sstrue_labels, sspred_labels))

Embedding-based Seniority Prediction on ACTIVE Jobs (with Synthetic Data)
Accuracy: 0.369
Macro F1: 0.346

Classification Report:

              precision    recall  f1-score   support

    Director       0.30      0.94      0.46        34
      Junior       0.09      0.25      0.13        12
        Lead       0.41      0.10      0.17       125
  Management       0.35      0.58      0.43       192
Professional       0.54      0.19      0.28       216
      Senior       0.56      0.68      0.61        44

    accuracy                           0.37       623
   macro avg       0.37      0.46      0.35       623
weighted avg       0.43      0.37      0.33       623



# 5. Department Model with Synthetic Data

In [29]:
def add_synthetic_department(train_df: pd.DataFrame, synthetic_csv_relpath: str) -> pd.DataFrame:
    syn = pd.read_csv(get_github_url(synthetic_csv_relpath))

    # expect columns: position, department
    syn = syn[["position", "department"]].copy()
    syn = syn.rename(columns={"position": "text", "department": "label"})
    syn = syn.dropna(subset=["text", "label"])

    out = pd.concat([train_df[["text", "label"]], syn[["text", "label"]]], ignore_index=True)
    return out

In [30]:
ddf_aug = add_synthetic_department(department_df, "data/results/gemini_synthetic.csv")
ddf_aug

|       | text | label |
| ----- | ---- | ----- |
| 0     | Adjoint directeur communication | Marketing |
| 1     | Advisor Strategy and Projects | Project Management |
| 2     | Beratung & Projekte | Project Management |
| 3     | Beratung & Projektmanagement | Project Management |
| 4     | Beratung und Projektmanagement kommunale Partner | Project Management |
| ...   | ... | ... |
| 12026 | Juristischer Berater | Consulting |
| 12027 | Leitung Personal, Finanzen, Einkauf, IT \| Folk... | Human Resources |
| 12028 | Verwaltungsleitung Landesspracheninstitut in d... | Administrative |
| 12029 | Leitung Gebäudemanagement, Einkauf und Control... | Purchasing |


In [31]:
# --- Labels and texts from ddf_aug (department_df plus synthetic data) ---
sdlabels = ddf_aug["label"].astype(str).tolist()
sdtexts = ddf_aug["text"].astype(str).tolist()  # example titles (not embedded below)

# --- Evaluation: ACTIVE jobs ---
sdtrue_department = jobs_annotated_active_df["department"].astype(str).tolist()
sdeval_texts = jobs_annotated_active_df["position"].astype(str).tolist()  # job titles to classify

# --- Load embedding model ---
sdembed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Embeddings for the label strings ---
# Note: this encodes sdlabels, not sdtexts, so the synthetic titles
# do not enter the similarity search.
sdlabel_embeddings = sdembed_model.encode(sdlabels, convert_to_tensor=True)

# --- Embeddings for ACTIVE jobs ---
sdeval_embeddings = sdembed_model.encode(sdeval_texts, convert_to_tensor=True)

# --- Zero-shot predictions on ACTIVE jobs ---
sdpred_department = []
for emb in sdeval_embeddings:
    sdsims = cosine_similarity(emb.reshape(1, -1), sdlabel_embeddings)[0]
    sdpred_department.append(sdlabels[np.argmax(sdsims)])

# --- Evaluation ---
sd_eval_accuracy = accuracy_score(sdtrue_department, sdpred_department)
sd_eval_macro_f1 = f1_score(sdtrue_department, sdpred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs")
print("Accuracy:", round(sd_eval_accuracy, 3))
print("Macro F1:", round(sd_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(sdtrue_department, sdpred_department))

Embedding-based Department Prediction on ACTIVE Jobs
Accuracy: 0.212
Macro F1: 0.306

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.03      0.57      0.06        14
  Business Development       0.17      0.35      0.23        20
            Consulting       0.29      0.54      0.38        39
      Customer Support       0.14      0.67      0.24         6
       Human Resources       0.35      0.75      0.48        16
Information Technology       0.44      0.24      0.31        62
             Marketing       0.39      0.41      0.40        22
                 Other       0.75      0.03      0.05       344
    Project Management       0.34      0.59      0.43        39
            Purchasing       0.30      0.53      0.38        15
                 Sales       0.50      0.35      0.41        46

              accuracy                           0.21       623
             macro avg       0.34      0.46      0.31       623
          weighted avg       0.57      0.21      0.19       623

# 6. Seniority Model with Synthetic Data and Oversampling