In [1]:
import pandas as pd
from sklearn.metrics import f1_score, classification_report

In [2]:
test_data = pd.read_csv("../../data/processed/jobs_annotated_active.csv")
prediction_df = pd.read_csv("../../data/results/gemini_results.csv")

# Results: Prompt Engineering

In this notebook, we evaluate our prompt-engineering approach using `gemini-2.0-flash` on the test set. We report accuracy and F1-score, and include classification reports for both the seniority and department prediction tasks to better understand performance across classes.

For the analysis, we used a system prompt that clearly defined the task, the expected output format, and the decision rules. We also included a few examples in the user prompt to guide the model (few-shot learning). The main goal was to reduce inconsistent predictions and push the model toward more stable, interpretable outputs.
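
As a rough illustration of the setup described above, the following sketch assembles a system prompt and a few-shot user prompt as plain strings. The prompt wording, the answer format, and the helper name `build_user_prompt` are assumptions made for illustration and are not the prompt that produced the reported results. The two example titles and their labels are taken from rows shown later in this notebook.

```python
# Hypothetical few-shot prompt assembly. The wording and the answer format are
# illustrative assumptions, not the prompt used for the reported results.
SYSTEM_PROMPT = (
    "You classify job titles. "
    "Return exactly one seniority level (1-6) and one department label. "
    "Answer in the form: seniority=<int>; department=<label>"
)

# (title, seniority level, department) pairs taken from rows shown in this notebook.
FEW_SHOT_EXAMPLES = [
    ("CFO", 5, "Other"),
    ("Betriebswirtin", 2, "Other"),
]


def build_user_prompt(position: str) -> str:
    """Prepend the few-shot examples to the title that should be classified."""
    lines = []
    for title, seniority, department in FEW_SHOT_EXAMPLES:
        lines.append(f"Title: {title}")
        lines.append(f"seniority={seniority}; department={department}")
    lines.append(f"Title: {position}")
    return "\n".join(lines)


print(SYSTEM_PROMPT)
print(build_user_prompt("Prokurist"))
```

Keeping the examples in a list makes it easy to swap them between prompt iterations without touching the system prompt.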

## 1. Results of predicting the seniority labels



In [5]:
# Rename the prediction columns so they do not clash with the ground-truth columns,
# then left-merge on row_id so that every row of test_data is kept.
prediction_df = prediction_df.rename(columns={"seniority": "predicted_seniority", "department": "predicted_department"})

results_df = pd.merge(test_data, prediction_df[["row_id", "predicted_seniority", "predicted_department"]], on="row_id", how="left")

# drop columns job_index, startDate, endDate, status
results_df = results_df.drop(columns=["job_index", "startDate", "endDate", "status"])
results_df.head()

Unnamed: 0,row_id,cv_id,organization,position,department,seniority,predicted_seniority,predicted_department
0,0,0,Depot4Design GmbH,Prokurist,Other,Management,5.0,Other
1,1,0,Depot4Design GmbH,CFO,Other,Management,5.0,Other
2,2,0,Depot4Design GmbH,Betriebswirtin,Other,Professional,2.0,Other
3,3,0,Depot4Design GmbH,Prokuristin,Other,Management,5.0,Other
4,4,0,Depot4Design GmbH,CFO,Other,Management,5.0,Other


In [None]:
# encode seniority levels to numeric values

seniority_map = {
    "Junior": 1.0,
    "Professional": 2.0,
    "Senior": 3.0,
    "Lead": 4.0,
    "Management": 5.0,
    "Director": 6.0
}

results_df["seniority"] = results_df["seniority"].map(seniority_map)

results_df.head()

Unnamed: 0,row_id,cv_id,organization,position,department,seniority,predicted_seniority,predicted_department
0,0,0,Depot4Design GmbH,Prokurist,Other,5.0,5.0,Other
1,1,0,Depot4Design GmbH,CFO,Other,5.0,5.0,Other
2,2,0,Depot4Design GmbH,Betriebswirtin,Other,2.0,2.0,Other
3,3,0,Depot4Design GmbH,Prokuristin,Other,5.0,5.0,Other
4,4,0,Depot4Design GmbH,CFO,Other,5.0,5.0,Other


In [7]:
# percentage of rows where seniority == predicted_seniority
# (rows with a missing value on either side count as incorrect)
correct_seniority = results_df[results_df["seniority"] == results_df["predicted_seniority"]]
accuracy_seniority = len(correct_seniority) / len(results_df) * 100
print(f"Seniority Prediction Accuracy: {accuracy_seniority:.2f}%")

Seniority Prediction Accuracy: 58.43%


In [14]:
inv_seniority_map = {
    1: "Junior",
    2: "Professional",
    3: "Senior",
    4: "Lead",
    5: "Management",
    6: "Director",
}

sen = results_df[["seniority", "predicted_seniority"]].dropna()

y_true_s = sen["seniority"].astype(int).map(inv_seniority_map)
y_pred_s = sen["predicted_seniority"].astype(int).map(inv_seniority_map)

labels = ["Junior","Professional","Senior","Lead","Management","Director"]

print("Seniority F1 macro:", f1_score(y_true_s, y_pred_s, average="macro"))
print("Seniority F1 weighted:", f1_score(y_true_s, y_pred_s, average="weighted"))
print("\nSeniority report:\n", classification_report(y_true_s, y_pred_s, labels=labels, zero_division=0))


Seniority F1 macro: 0.54027464919846
Seniority F1 weighted: 0.6012028957385558

Seniority report:
               precision    recall  f1-score   support

      Junior       0.14      0.83      0.24        12
Professional       0.82      0.47      0.60       216
      Senior       0.62      0.70      0.66        44
        Lead       0.88      0.41      0.56       125
  Management       0.61      0.71      0.66       191
    Director       0.36      1.00      0.53        34

    accuracy                           0.59       622
   macro avg       0.57      0.69      0.54       622
weighted avg       0.71      0.59      0.60       622



### Seniority prediction: results and interpretation

The prompt-engineered setup achieves about 59% accuracy (58.43% over all rows of `results_df`, 0.59 over the 622 rows that remain after dropping missing values), with a macro F1 of 0.54 and a weighted F1 of 0.60. The gap between weighted and macro F1 indicates that the model performs noticeably better on the more frequent seniority classes, while performance on smaller or harder classes is less reliable.
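
To make the difference between the two averages concrete, the sketch below recomputes both from the per-class F1 scores and supports printed in the seniority report above. Because the report rounds to two decimals, the results only approximate the exact values printed by `f1_score`.

```python
# Per-class (F1, support) copied from the seniority report above.
report = {
    "Junior": (0.24, 12),
    "Professional": (0.60, 216),
    "Senior": (0.66, 44),
    "Lead": (0.56, 125),
    "Management": (0.66, 191),
    "Director": (0.53, 34),
}

total = sum(support for _, support in report.values())

# Macro F1: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted F1: every class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print(f"rows: {total}")
print(f"macro F1 (from rounded report):    {macro_f1:.2f}")
print(f"weighted F1 (from rounded report): {weighted_f1:.2f}")
```

Junior (12 rows, F1 0.24) pulls the macro average down as much as any large class would, but it barely moves the weighted average.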

Looking at the per-class metrics, the model shows uneven behavior across the ordinal seniority scale:

* Junior has very low precision (0.14) but high recall (0.83). This means the model labels many roles as Junior that are not actually junior (strong overprediction), even though it catches most true junior cases.

* Professional and Lead show the opposite pattern: high precision (0.82 / 0.88) but low recall (0.47 / 0.41). So when the model predicts these levels it is usually correct, but it assigns them too rarely, pushing many true cases into neighboring classes.

* Senior and Management are comparatively stable, with F1 ≈ 0.66 for both. These categories seem to have clearer title cues in the dataset. Management in particular, as the second-largest class (191 of 622 rows), contributes strongly to the overall weighted score.

* Director reaches perfect recall (1.00) but low precision (0.36), meaning the model catches all true director roles but also incorrectly labels many non-director roles as director. This suggests the model overreacts to strong leadership keywords (e.g., “Head”, “Chief”, “Managing”) and tends to assign the top class too aggressively.
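
The over- and underprediction described in the bullets above can be estimated from the report alone. Recall times support gives the number of true positives, and dividing that by precision gives the number of times a class was predicted. The sketch below applies this to the reported values; the helper name `estimated_predictions` is introduced here for illustration, and the results are approximate because the report rounds to two decimals.

```python
# Per-class (precision, recall, support) copied from the seniority report above.
metrics = {
    "Junior": (0.14, 0.83, 12),
    "Professional": (0.82, 0.47, 216),
    "Senior": (0.62, 0.70, 44),
    "Lead": (0.88, 0.41, 125),
    "Management": (0.61, 0.71, 191),
    "Director": (0.36, 1.00, 34),
}


def estimated_predictions(precision: float, recall: float, support: int) -> float:
    """Estimate how often a class was predicted: TP / precision, with TP = recall * support."""
    true_positives = recall * support
    return true_positives / precision


for label, (precision, recall, support) in metrics.items():
    predicted = estimated_predictions(precision, recall, support)
    ratio = predicted / support
    print(f"{label:<12} true: {support:>3}  predicted: ~{predicted:5.0f}  ratio: {ratio:.1f}x")
```

A ratio well above 1 marks a class the model assigns too often (Junior, Director); a ratio well below 1 marks a class it assigns too rarely (Professional, Lead).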

Overall, the results suggest that prompt engineering improves consistency but does not fully solve the main challenge: separating neighboring levels and avoiding extreme-level overprediction (Junior/Director). Since seniority is ordinal, many errors are likely “one-step shifts” (e.g., Professional ↔ Senior or Lead ↔ Management), which still reduce accuracy/F1 even though the predictions may be close in rank.
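
Whether errors really are mostly one-step shifts can be checked by looking at the rank distance between true and predicted level. The following is a minimal sketch on hypothetical toy labels using the 1 to 6 encoding from above; in the notebook, the same function could be applied to the `seniority` and `predicted_seniority` columns after dropping missing values.

```python
from collections import Counter


def rank_distance_counts(y_true, y_pred):
    """Count predictions by absolute rank distance |true - predicted|."""
    return Counter(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical toy labels on the 1-6 seniority scale (not taken from the dataset).
y_true = [2, 2, 3, 4, 5, 5, 1, 6]
y_pred = [2, 3, 3, 5, 5, 6, 1, 6]

counts = rank_distance_counts(y_true, y_pred)
n = len(y_true)

exact = counts[0] / n                                # strict accuracy
within_one = (counts[0] + counts[1]) / n             # tolerates one-step shifts
mae = sum(dist * c for dist, c in counts.items()) / n  # mean absolute rank error

print(f"exact: {exact:.3f}, within one step: {within_one:.3f}, MAE: {mae:.3f}")
```

A large gap between exact accuracy and within-one accuracy would support the interpretation that most mistakes land on a neighboring level.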

#### Future Work: Improving seniority prediction further
* We used a free trial account in Google AI Studio, which imposed token limits. This constrained how detailed the system prompt could be and, more importantly, how many iterations we could run to refine it. With a larger token budget, we would have tested and tuned multiple prompt variants more systematically.

A key idea we could not fully implement due to token constraints was adding consistency rules based on prior career history. Future work could therefore focus on rules such as:
* Someone’s inferred seniority should not decrease as their career progresses. If a candidate has already held a senior role, later roles should not be classified as less senior unless there is explicit evidence.

Encoding this kind of rule would likely reduce unreasonable “seniority drops” and improve overall accuracy, but it requires additional iterative testing and prompt refinement to implement effectively.
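
Independently of the prompt, such a rule can also be checked after the fact. The sketch below flags positions in one career where the predicted level falls below the highest level seen so far. It assumes the levels are already in chronological order, which the notebook does not guarantee, and both the function name and the example career are hypothetical.

```python
def flag_seniority_drops(levels):
    """Return indices where the level falls below the running maximum.

    `levels` is assumed to be one person's predicted seniority levels
    in chronological order (oldest job first).
    """
    flagged = []
    running_max = None
    for i, level in enumerate(levels):
        if running_max is not None and level < running_max:
            flagged.append(i)    # drop relative to an earlier role
        else:
            running_max = level  # new (or equal) highest level so far
    return flagged


# Hypothetical career on the 1-6 scale: two entries fall below an earlier level.
career = [2, 3, 5, 3, 5, 4]
print(flag_seniority_drops(career))
```

Flagged rows would be candidates for manual review or a second model call rather than automatic correction, since genuine step-downs do occur.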

## 2. Results of predicting the department labels

For the department prediction task, we used the same prompt-engineering setup with `gemini-2.0-flash` and applied it to the job entries in the test set. The model was instructed to assign exactly one department label from our predefined label set, based mainly on the job title (and any contextual cues available in the entry). We then merged the predictions with the ground-truth annotations and evaluated performance using accuracy, F1-scores, and a classification report to see how well the model performs across departments.

In [10]:
# now do the same for department
correct_department = results_df[results_df["department"] == results_df["predicted_department"]]
accuracy_department = len(correct_department) / len(results_df) * 100
print(f"Department Prediction Accuracy: {accuracy_department:.2f}%")

Department Prediction Accuracy: 79.61%


In [12]:
# --- Department F1 (multi-class) ---
dep = results_df[["department", "predicted_department"]].dropna()
y_true_d = dep["department"].astype(str)
y_pred_d = dep["predicted_department"].astype(str)

print("Department F1 macro:", f1_score(y_true_d, y_pred_d, average="macro"))
print("Department F1 weighted:", f1_score(y_true_d, y_pred_d, average="weighted"))
print("\nDepartment report:\n", classification_report(y_true_d, y_pred_d))

Department F1 macro: 0.7340526439428288
Department F1 weighted: 0.8035016136170083

Department report:
                         precision    recall  f1-score   support

        Administrative       0.42      0.79      0.55        14
  Business Development       0.33      0.40      0.36        20
            Consulting       0.68      0.67      0.68        39
      Customer Support       1.00      0.83      0.91         6
       Human Resources       0.67      1.00      0.80        16
Information Technology       0.94      0.73      0.82        62
             Marketing       0.63      0.77      0.69        22
                 Other       0.88      0.83      0.86       343
    Project Management       0.65      0.79      0.71        39
            Purchasing       0.86      0.80      0.83        15
                 Sales       0.89      0.85      0.87        46

              accuracy                           0.80       622
             macro avg       0.72      0.77      0.73       622

### Department prediction: results and interpretation

* The department prediction task reaches 79.61% accuracy, with a macro F1 of 0.73 and a weighted F1 of 0.80, clearly outperforming the baseline (60% accuracy). Compared to seniority, department labels are predicted more reliably, which suggests that the model can pick up stronger surface cues for departments.

* Performance is strongest for high-signal categories with distinctive titles. Sales (F1 0.87), Purchasing (0.83), Information Technology (0.82), Human Resources (0.80), and Customer Support (0.91) are classified well, likely because roles in these areas contain clear keywords (e.g., “Sales”, “Recruiter/HR”, “IT”, “Support”, “Buyer”).

* The main weaknesses are concentrated in Business Development (F1 0.36) and Administrative (F1 0.55). These categories are harder to separate from more generic roles and are often semantically close to other departments (e.g., Business Development vs Sales vs Consulting). In addition, the dataset is imbalanced: “Other” dominates the test set and achieves a high F1 (0.86), which boosts the weighted F1. 

* Overall, prompt engineering yields a strong and meaningful improvement for department classification, with remaining errors mostly driven by ambiguous titles and overlap between neighboring business functions.
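
The influence of “Other” on the weighted score can be quantified from the department report above. The sketch below computes its share of the test set and recomputes the support-weighted F1 with and without it. All inputs are the rounded values from the report, so the results are approximations rather than exact reproductions of the printed scores.

```python
# Per-class (F1, support) copied from the department report above.
report = {
    "Administrative": (0.55, 14),
    "Business Development": (0.36, 20),
    "Consulting": (0.68, 39),
    "Customer Support": (0.91, 6),
    "Human Resources": (0.80, 16),
    "Information Technology": (0.82, 62),
    "Marketing": (0.69, 22),
    "Other": (0.86, 343),
    "Project Management": (0.71, 39),
    "Purchasing": (0.83, 15),
    "Sales": (0.89 - 0.02, 46),  # 0.87 in the report
}
report["Sales"] = (0.87, 46)  # keep the reported value explicit


def weighted_f1(rows):
    """Support-weighted mean of per-class F1 scores."""
    total = sum(support for _, support in rows)
    return sum(f1 * support for f1, support in rows) / total


total = sum(support for _, support in report.values())
other_share = report["Other"][1] / total

weighted_all = weighted_f1(report.values())
weighted_without_other = weighted_f1(
    [row for label, row in report.items() if label != "Other"]
)

print(f"share of 'Other' in the test set: {other_share:.1%}")
print(f"weighted F1, all classes:         {weighted_all:.3f}")
print(f"weighted F1, without 'Other':     {weighted_without_other:.3f}")
```

The version without “Other” lands close to the macro F1, which is consistent with the dominant class accounting for most of the gap between the two averages.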

#### Future Work
Future work would focus on running more prompt iterations specifically targeting these confusing class boundaries. For example, we would add clearer decision rules and keyword heuristics to separate Business Development from Sales and Consulting, and to better distinguish Administrative from Other by emphasizing typical tasks (coordination, office management, clerical work) rather than broad role wording. With a larger iteration budget, we could systematically test prompt variants and use the per-class report to refine the prompt until the weakest categories improve.
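
One way to organize such iterations is to score every prompt variant by its weakest class instead of its overall accuracy. The following is a minimal sketch under that assumption: the variant names and all labels are hypothetical toy data, and the per-class F1 is computed in plain Python so the example is self-contained. In the notebook, `f1_score(y_true, y_pred, average=None)` from scikit-learn returns the same per-class scores.

```python
def per_class_f1(y_true, y_pred):
    """Compute the F1 score of every label that occurs in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical ground truth and predictions from two prompt variants.
y_true = ["Sales", "Sales", "Business Development",
          "Business Development", "Consulting", "Consulting"]
variants = {
    "variant_a": ["Sales", "Sales", "Sales",
                  "Sales", "Consulting", "Consulting"],
    "variant_b": ["Sales", "Sales", "Business Development",
                  "Sales", "Consulting", "Business Development"],
}

# Rank variants by their worst per-class F1 (higher is better).
worst_f1 = {
    name: min(per_class_f1(y_true, preds).values())
    for name, preds in variants.items()
}
best = max(worst_f1, key=worst_f1.get)

for name, score in worst_f1.items():
    print(f"{name}: weakest-class F1 = {score:.2f}")
print("selected:", best)
```

Selecting on the weakest class keeps a variant from winning only because it does well on “Other” while Business Development or Administrative stay poor.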