<a href="https://colab.research.google.com/github/luisadosch/Final-Project-snapAddy/blob/main/model2_embedding_based.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. GitHub Access Details

In [1]:
# GitHub access details
import pandas as pd

GH_USER = "luisadosch"
GH_REPO = "Final-Project-snapAddy"
BRANCH = "main"

def get_github_url(relative_path):
    return f"https://raw.githubusercontent.com/{GH_USER}/{GH_REPO}/{BRANCH}/{relative_path}"


jobs_annotated_active_df = pd.read_csv(get_github_url("data/processed/jobs_annotated_active.csv"))

department_df = pd.read_csv(get_github_url("data/raw/department-v2.csv"))

seniority_df = pd.read_csv(get_github_url("data/raw/seniority-v2.csv"))

# 2. Seniority Model

In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, f1_score, classification_report
import numpy as np

# --- Prepare texts and labels ---
texts = jobs_annotated_active_df["position"].astype(str).tolist()
true_seniority = jobs_annotated_active_df["seniority"].astype(str).tolist()

# --- Label list ---
seniority_labels = seniority_df["label"].astype(str).tolist()

# --- Load embedding model ---
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

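# --- Generate embeddings ---
# NumPy arrays (the default output) can be passed directly to scikit-learn,
# also when the model runs on a GPU.
text_embeddings = embed_model.encode(texts, convert_to_numpy=True)
label_embeddings = embed_model.encode(seniority_labels, convert_to_numpy=True)

# --- Predictions ---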
pred_seniority = []
for emb in text_embeddings:
    sims = cosine_similarity(emb.reshape(1, -1), label_embeddings)[0]
    pred_seniority.append(seniority_labels[np.argmax(sims)])

# --- Evaluation ---
s_eval_accuracy = accuracy_score(true_seniority, pred_seniority)
s_eval_macro_f1 = f1_score(true_seniority, pred_seniority, average="macro")

print("Embedding-based Seniority Prediction on ACTIVE Jobs")
print("Accuracy:", round(s_eval_accuracy, 3))
print("Macro F1:", round(s_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(true_seniority, pred_seniority))



Embedding-based Seniority Prediction on ACTIVE Jobs
Accuracy: 0.308
Macro F1: 0.276

Classification Report:

              precision    recall  f1-score   support

    Director       0.25      0.97      0.40        34
      Junior       0.07      0.25      0.11        12
        Lead       0.38      0.11      0.17       125
  Management       0.32      0.58      0.41       192
Professional       0.00      0.00      0.00       216
      Senior       0.48      0.68      0.56        44

    accuracy                           0.31       623
   macro avg       0.25      0.43      0.28       623
weighted avg       0.22      0.31      0.23       623





This block performs zero-shot classification of job seniority using embeddings. Each job title is encoded as a dense vector with the pre-trained `all-MiniLM-L6-v2` sentence-embedding model, and each seniority label is encoded into the same embedding space. Cosine similarity is computed between every job-title embedding and every label embedding, and each job is assigned the label with the highest similarity. The method requires no training on the annotated data, which makes it a simple, interpretable baseline.
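The nearest-label rule described above can be sketched without any model download. The snippet below is only an illustration: the three-dimensional vectors are made-up stand-ins for real embeddings (which have 384 dimensions for `all-MiniLM-L6-v2`), and the helper names `cosine` and `nearest_label` are hypothetical, not part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_label(title_vec, label_vecs):
    """Return the label whose vector is most similar to title_vec."""
    return max(label_vecs, key=lambda label: cosine(title_vec, label_vecs[label]))


# Hypothetical toy "embeddings"; real ones come from embed_model.encode(...).
toy_label_vecs = {
    "Junior": [1.0, 0.1, 0.0],
    "Senior": [0.1, 1.0, 0.0],
    "Director": [0.0, 0.2, 1.0],
}
toy_title_vec = [0.2, 0.9, 0.1]  # stands in for an encoded job title

predicted = nearest_label(toy_title_vec, toy_label_vecs)
print(predicted)
```

Because the decision depends only on which label vector lies closest in direction, a label whose wording is semantically broad can attract many unrelated titles.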

Applying this embedding-based approach to predict seniority on ACTIVE jobs results in an accuracy of 0.308 and a macro F1 score of 0.276. The detailed classification report is as follows:

| Label            | Precision | Recall | F1-score | Support |
| ---------------- | --------- | ------ | -------- | ------- |
| Director         | 0.25      | 0.97   | 0.40     | 34      |
| Junior           | 0.07      | 0.25   | 0.11     | 12      |
| Lead             | 0.38      | 0.11   | 0.17     | 125     |
| Management       | 0.32      | 0.58   | 0.41     | 192     |
| Professional     | 0.00      | 0.00   | 0.00     | 216     |
| Senior           | 0.48      | 0.68   | 0.56     | 44      |
| **Accuracy**     |           |        | 0.31     | 623     |
| **Macro avg**    | 0.25      | 0.43   | 0.28     | 623     |
| **Weighted avg** | 0.22      | 0.31   | 0.23     | 623     |



These results show that overall performance is limited. Director (recall 0.97) and Senior (recall 0.68) are recovered well, but Director's precision of 0.25 indicates that many titles from other classes are also mapped to it. Professional, the most frequent class with 216 of 623 titles, has a precision and recall of 0.00, so none of its titles are classified correctly. This highlights the challenges of zero-shot classification in a domain with class imbalance and short, noisy job titles.
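To make the two failure patterns concrete, here is a small sketch with made-up labels (not the project data). It computes per-class precision, recall, and F1 by hand; the function name `precision_recall_f1` and the toy lists are assumptions for illustration. Undefined ratios are set to 0, which matches what scikit-learn reports by default.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1; undefined ratios are returned as 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: "Professional" is never predicted, "Director" is over-predicted.
toy_true = ["Professional", "Professional", "Professional", "Director"]
toy_pred = ["Director", "Management", "Director", "Director"]

professional_scores = precision_recall_f1(toy_true, toy_pred, "Professional")
director_scores = precision_recall_f1(toy_true, toy_pred, "Director")
print("Professional:", professional_scores)
print("Director:", director_scores)
```

In this toy case the over-predicted class keeps perfect recall while its precision drops, and the class that is never hit scores zero on every metric.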

In [3]:
# --- Prepare labels ---
true_department = jobs_annotated_active_df["department"].astype(str).tolist()
department_labels = department_df["label"].astype(str).tolist()

# --- Label embeddings (NumPy arrays work directly with scikit-learn) ---
label_embeddings_dep = embed_model.encode(department_labels, convert_to_numpy=True)

# --- Predictions ---
pred_department = []
for emb in text_embeddings:  # text_embeddings from the seniority cell can be reused
    sims = cosine_similarity(emb.reshape(1, -1), label_embeddings_dep)[0]
    pred_department.append(department_labels[np.argmax(sims)])

# --- Evaluation ---
d_eval_accuracy = accuracy_score(true_department, pred_department)
d_eval_macro_f1 = f1_score(true_department, pred_department, average="macro")

print("Embedding-based Department Prediction on ACTIVE Jobs")
print("Accuracy:", round(d_eval_accuracy, 3))
print("Macro F1:", round(d_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(true_department, pred_department))

Embedding-based Department Prediction on ACTIVE Jobs
Accuracy: 0.212
Macro F1: 0.306

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.03      0.57      0.06        14
  Business Development       0.17      0.35      0.23        20
            Consulting       0.29      0.54      0.38        39
      Customer Support       0.14      0.67      0.24         6
       Human Resources       0.35      0.75      0.48        16
Information Technology       0.44      0.24      0.31        62
             Marketing       0.39      0.41      0.40        22
                 Other       0.75      0.03      0.05       344
    Project Management       0.34      0.59      0.43        39
            Purchasing       0.30      0.53      0.38        15
                 Sales       0.50      0.35      0.41        46

              accuracy                           0.21       623
             macro avg       0.34      0.46      0.31       623
          weighted avg       0.57      0.21      0.19       623

This block applies the same embedding-based zero-shot method to predict the department of each job. The job-title embeddings from the seniority step are reused, and each department label is encoded with the same pre-trained embedding model. Cosine similarity is computed between the job-title embeddings and the label embeddings, and each job is assigned the label with the highest similarity. The method requires no training and relies purely on semantic similarity, which keeps it simple and interpretable.

Applying this embedding-based approach to predict departments on ACTIVE jobs results in an accuracy of 0.212 and a macro F1 score of 0.306. The detailed classification report is:

| Label                  | Precision | Recall | F1-score | Support |
| ---------------------- | --------- | ------ | -------- | ------- |
| Administrative         | 0.03      | 0.57   | 0.06     | 14      |
| Business Development   | 0.17      | 0.35   | 0.23     | 20      |
| Consulting             | 0.29      | 0.54   | 0.38     | 39      |
| Customer Support       | 0.14      | 0.67   | 0.24     | 6       |
| Human Resources        | 0.35      | 0.75   | 0.48     | 16      |
| Information Technology | 0.44      | 0.24   | 0.31     | 62      |
| Marketing              | 0.39      | 0.41   | 0.40     | 22      |
| Other                  | 0.75      | 0.03   | 0.05     | 344     |
| Project Management     | 0.34      | 0.59   | 0.43     | 39      |
| Purchasing             | 0.30      | 0.53   | 0.38     | 15      |
| Sales                  | 0.50      | 0.35   | 0.41     | 46      |
| **Accuracy**           |           |        | 0.21     | 623     |
| **Macro avg**          | 0.34      | 0.46   | 0.31     | 623     |
| **Weighted avg**       | 0.57      | 0.21   | 0.19     | 623     |

These results show that some departments, such as Human Resources (recall 0.75) and Project Management (recall 0.59), are recovered reasonably well, but overall performance is low. The dominant label Other (344 of 623 titles) has a recall of only 0.03, which pulls accuracy down to 0.212 and the weighted F1 down to 0.19, while the macro F1 of 0.306 gives every department equal weight. This demonstrates the limitations of zero-shot embedding-based classification in domains with imbalanced class distributions and short, noisy job titles.
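The gap between the macro and weighted averages can be reproduced from the per-class values in the table above. The sketch below uses the rounded F1-scores and supports as reported, so the results only agree with the report to two decimals; the variable names are chosen for illustration.

```python
# (F1-score, support) per department, copied from the classification report.
department_report = {
    "Administrative": (0.06, 14),
    "Business Development": (0.23, 20),
    "Consulting": (0.38, 39),
    "Customer Support": (0.24, 6),
    "Human Resources": (0.48, 16),
    "Information Technology": (0.31, 62),
    "Marketing": (0.40, 22),
    "Other": (0.05, 344),
    "Project Management": (0.43, 39),
    "Purchasing": (0.38, 15),
    "Sales": (0.41, 46),
}

total_support = sum(support for _, support in department_report.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in department_report.values()) / len(department_report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in department_report.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.2f}")
print(f"weighted F1:   {weighted_f1:.2f}")
```

Since Other contributes more than half of the support with an F1-score of 0.05, it dominates the weighted average but is only one of eleven terms in the macro average.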

In [4]:
# --- Compare embedding-based results for ACTIVE jobs ---
comparison_metrics_active = pd.DataFrame({
    "Target": ["Seniority (ACTIVE Jobs)", "Department (ACTIVE Jobs)"],
    "Accuracy": [s_eval_accuracy, d_eval_accuracy],
    "Macro F1": [s_eval_macro_f1, d_eval_macro_f1]
})

print("Embedding-based Model Results (ACTIVE Jobs):\n")
print(comparison_metrics_active)

Embedding-based Model Results (ACTIVE Jobs):

                     Target  Accuracy  Macro F1
0   Seniority (ACTIVE Jobs)  0.308186  0.276056
1  Department (ACTIVE Jobs)  0.211878  0.305883


This block summarizes the evaluation metrics for the embedding-based zero-shot models applied to ACTIVE job titles. Accuracy is the fraction of correct predictions, while macro F1 is the unweighted mean of the per-label F1-scores, so rare labels count as much as frequent ones. Presenting both metrics allows a direct comparison between the Seniority and Department predictions on the same dataset.
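As a rough illustration of how the two metrics can diverge, the sketch below evaluates a hypothetical predictor that always outputs the majority class on made-up, imbalanced labels. The helper `f1_for_label` and the toy lists are assumptions for this example and do not come from the project data.

```python
def f1_for_label(y_true, y_pred, label):
    """F1-score of one label; returns 0.0 when the label is never hit."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Imbalanced toy labels and a predictor that always answers "Other".
toy_true = ["Other"] * 8 + ["Sales", "Marketing"]
toy_pred = ["Other"] * 10

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
labels = sorted(set(toy_true))
macro_f1 = sum(f1_for_label(toy_true, toy_pred, label) for label in labels) / len(labels)

print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro_f1:.3f}")
```

Here accuracy looks acceptable because the majority class dominates, while macro F1 exposes that two of the three labels are never predicted correctly.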

The results of the embedding-based models on ACTIVE jobs are as follows:

| Target                   | Accuracy | Macro F1 |
| ------------------------ | -------- | -------- |
| Seniority (ACTIVE Jobs)  | 0.308    | 0.276    |
| Department (ACTIVE Jobs) | 0.212    | 0.306    |

These results indicate that the embedding-based zero-shot approach struggles with both targets on ACTIVE jobs. Seniority reaches the higher accuracy (0.308 vs. 0.212), whereas Department reaches the slightly higher macro F1 (0.306 vs. 0.276). Overall performance remains limited, reflecting the challenges of zero-shot classification in domains with noisy, short job titles and class imbalance.

In [None]:
"""
results = []

def add_result(results_list, model_name, target, metrics):
    results_list.append({
        "Model": model_name,
        "Target": target,
        "Accuracy": metrics["Accuracy"],
        "Macro F1": metrics["Macro F1"]
    })

# Add embedding-based results for ACTIVE jobs
add_result(
    results,
    model_name="Embedding-based",
    target="Seniority",
    metrics={"Accuracy": s_eval_accuracy, "Macro F1": s_eval_macro_f1}
)

add_result(
    results,
    model_name="Embedding-based",
    target="Department",
    metrics={"Accuracy": d_eval_accuracy, "Macro F1": d_eval_macro_f1}
)

save_results(results)

results_df_tfidf = pd.DataFrame(results)
results_df_tfidf"""
