
# 1. GitHub Access Data

In [24]:
# GitHub access details used to build the raw file URLs
import pandas as pd

GH_USER = "luisadosch"
GH_REPO = "Final-Project-snapAddy"
BRANCH = "main"

def get_github_url(relative_path):
    return f"https://raw.githubusercontent.com/{GH_USER}/{GH_REPO}/{BRANCH}/{relative_path}"


jobs_annotated_active_df = pd.read_csv(get_github_url("data/processed/jobs_annotated_active.csv"))

department_df = pd.read_csv(get_github_url("data/raw/department-v2.csv"))

seniority_df = pd.read_csv(get_github_url("data/raw/seniority-v2.csv"))

# 2. Seniority Model

In [25]:
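# Prepare seniority data
sdf = seniority_df.copy()

# Drop missing rows first: astype(str) can turn NaN into the string "nan",
# after which dropna() would no longer remove those rows
sdf = sdf.dropna(subset=["text", "label"])

sdf["text"] = sdf["text"].astype(str).str.lower()
sdf["label"] = sdf["label"].astype(str)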

Prepare the seniority dataset for modeling. Lowercasing gives every job title a uniform text representation, and rows with a missing text or label are dropped so they cannot cause errors during training.

In [26]:
#Seniority Train/Test Split
from sklearn.model_selection import train_test_split

sx = sdf["text"]
sy = sdf["label"]

sx_train, sx_test, sy_train, sy_test = train_test_split(
    sx,
    sy,
    test_size=0.2,
    random_state=42,
    stratify=sy
)

# Print dataset sizes
print("Seniority dataset sizes:")
print("Total:", len(sx))
print("Train:", len(sx_train))
print("Test:", len(sx_test))

Seniority dataset sizes:
Total: 9428
Train: 7542
Test: 1886


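Split the data into training (80%) and test (20%) sets. Stratifying on the label keeps each seniority class at roughly the same proportion in both sets, so rare classes are represented in each split. Of the 9428 rows, 7542 go to the training set and 1886 to the test set.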
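As a rough illustration of what stratification does, the following sketch splits a toy label list class by class in plain Python. It is a simplified stand-in, not scikit-learn's implementation; the `stratified_split` helper and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling the test share within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)  # test share taken per class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Toy, imbalanced labels (hypothetical, not the project data)
labels = ["Senior"] * 80 + ["Junior"] * 20
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(dict(test_counts))
```

The rare class keeps its 20% share in the test set, which a purely random split does not guarantee for small classes.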

In [27]:
#Seniority TF–IDF + Logistic Regression Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

smodel = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=3,
        max_df=0.9
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])

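The pipeline converts job titles into TF–IDF features over unigrams and bigrams and applies logistic regression. min_df=3 drops terms that occur in fewer than three titles, and max_df=0.9 drops terms that occur in more than 90% of them. class_weight="balanced" weights each class inversely to its frequency, so rare seniority levels are not dominated by the frequent ones.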
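To make the vectorizer settings more concrete, here is a minimal sketch of unigram and bigram extraction with document-frequency filtering in plain Python. It only approximates the vectorizer's default tokenization (tokens of two or more word characters) and does not compute TF–IDF weights; the `word_ngrams` helper and the example titles are assumptions for illustration.

```python
import re
from collections import Counter

def word_ngrams(text, ngram_range=(1, 2)):
    """Lowercase, tokenize, and return all word n-grams in the given range."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Hypothetical job titles, not taken from the project data
titles = ["Head of Sales", "Head of Marketing", "Sales Manager", "Head of Sales DACH"]

# Document frequency: in how many titles does each n-gram occur?
doc_freq = Counter()
for title in titles:
    doc_freq.update(set(word_ngrams(title)))

min_df, max_df = 3, 0.9
vocab = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df / len(titles) <= max_df
)
print(vocab)
```

With these toy titles only the terms that appear in at least three titles survive, which shows how min_df removes rare, title-specific n-grams before the classifier sees them.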

In [28]:
# Train the seniority model
smodel.fit(sx_train, sy_train)

# Predict on the test data
sy_pred = smodel.predict(sx_test)

# Print accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(sy_test, sy_pred))

Accuracy: 0.9703075291622482


Train the seniority classifier and generate predictions on the test set. The model achieves a high accuracy of 0.97 on the test set.

In [29]:
#Seniority Evaluation
from sklearn.metrics import accuracy_score, f1_score, classification_report

sy_pred = smodel.predict(sx_test)

saccuracy = accuracy_score(sy_test, sy_pred)
smacro_f1 = f1_score(sy_test, sy_pred, average="macro")

print("Accuracy:", round(saccuracy, 3))
print("Macro F1:", round(smacro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(sy_test, sy_pred))

Accuracy: 0.97
Macro F1: 0.956

Classification Report:

              precision    recall  f1-score   support

    Director       0.99      0.98      0.98       197
      Junior       0.85      1.00      0.92        82
        Lead       0.97      0.98      0.98       709
  Management       0.92      0.93      0.92       151
      Senior       0.99      0.97      0.98       747

    accuracy                           0.97      1886
   macro avg       0.94      0.97      0.96      1886
weighted avg       0.97      0.97      0.97      1886



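Evaluate using accuracy and macro F1, which averages the per-class F1 scores so that every class counts equally regardless of its size. The classification report shows precision, recall, and F1 per seniority level. The evaluation yields an accuracy of 0.97 and a macro F1 score of 0.956. Performance is strong across all seniority classes; the weakest values are the precision for Junior (0.85) and the F1 score of 0.92 for Junior and Management.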
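The difference between the macro and the weighted average can be checked by hand. The sketch below recomputes both from the per-class F1 scores and supports printed in the report above; because the inputs are already rounded to two decimals, the results are approximate.

```python
# Per-class (F1, support), copied from the seniority classification report
report = {
    "Director": (0.98, 197),
    "Junior": (0.92, 82),
    "Lead": (0.98, 709),
    "Management": (0.92, 151),
    "Senior": (0.98, 747),
}

total = sum(support for _, support in report.values())

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print("Macro F1:", round(macro_f1, 3))
print("Weighted F1:", round(weighted_f1, 3))
```

The macro value is lower than the weighted one because the two smaller classes with F1 0.92 carry the same weight as the large classes.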

In [30]:
# Seniority Evaluation on Annotated ACTIVE Jobs

# Prepare evaluation data
s_eval_df = jobs_annotated_active_df.dropna(subset=["position", "seniority"]).copy()
s_eval_text = s_eval_df["position"].astype(str).str.lower()
s_eval_labels = s_eval_df["seniority"].astype(str)

# Predict seniority
s_eval_pred = smodel.predict(s_eval_text)

# Evaluation metrics
from sklearn.metrics import accuracy_score, f1_score, classification_report

s_eval_accuracy = accuracy_score(s_eval_labels, s_eval_pred)
s_eval_macro_f1 = f1_score(s_eval_labels, s_eval_pred, average="macro")

print("Seniority Evaluation on ACTIVE Jobs")
print("Accuracy:", round(s_eval_accuracy, 3))
print("Macro F1:", round(s_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(s_eval_labels, s_eval_pred))

Seniority Evaluation on ACTIVE Jobs
Accuracy: 0.437
Macro F1: 0.409

Classification Report:

              precision    recall  f1-score   support

    Director       0.58      0.88      0.70        34
      Junior       0.18      0.33      0.24        12
        Lead       0.33      0.71      0.45       125
  Management       0.90      0.59      0.72       192
Professional       0.00      0.00      0.00       216
      Senior       0.23      0.80      0.36        44

    accuracy                           0.44       623
   macro avg       0.37      0.55      0.41       623
weighted avg       0.40      0.44      0.38       623





Evaluates the seniority model on annotated ACTIVE LinkedIn job entries using the job position title as input. The predictions are compared against manually labeled seniority levels, which gives a realistic assessment of how well the model generalizes from the label-based training data to real-world CV data. The accuracy is 0.437 and the macro F1 score is 0.409. The annotated data contains a Professional class (216 of 623 entries) that does not occur among the training labels, so the model cannot predict it and scores 0.00 on it.
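A quick label-coverage check makes the effect of the unseen class explicit. The sketch below uses the class names the model was trained on and the supports printed in the ACTIVE report; the variable names are chosen for this example only.

```python
# Classes the seniority model was trained on (from the label data report)
train_classes = {"Director", "Junior", "Lead", "Management", "Senior"}

# Supports per class in the annotated ACTIVE jobs (from the report above)
active_support = {
    "Director": 34,
    "Junior": 12,
    "Lead": 125,
    "Management": 192,
    "Professional": 216,
    "Senior": 44,
}

# Labels that appear in the evaluation data but never in training
unseen = set(active_support) - train_classes

n_total = sum(active_support.values())
n_unseen = sum(active_support[label] for label in unseen)

# Best accuracy any model restricted to the training classes could reach
accuracy_ceiling = (n_total - n_unseen) / n_total

print("Unseen labels:", unseen)
print("Accuracy ceiling:", round(accuracy_ceiling, 3))
```

Even a model that classified every other entry correctly would stay at roughly 0.65 accuracy on this data, so part of the gap to the test-set result comes from the label sets rather than from the features.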

In [31]:
#Seniority Top Features
import numpy as np

feature_names = smodel.named_steps["tfidf"].get_feature_names_out()
coefs = smodel.named_steps["clf"].coef_

for i, label in enumerate(smodel.named_steps["clf"].classes_):
    top = np.argsort(coefs[i])[-10:]
    print(f"\nTop words for {label}:")
    print(feature_names[top])


Top words for Director:
['marketing director' 'managing directors' 'managing' 'director marketing'
 'vertriebsdirektor' 'director sales' 'directors' 'sales director'
 'abteilungsdirektor' 'director']

Top words for Junior:
['marketing' 'assistent' 'associate' 'assistentin' 'mitarbeiter'
 'mitarbeiterin' 'referent' 'referentin' 'analyst' 'junior']

Top words for Lead:
['abteilungsleiter' 'projektleiter' 'geschäftsleitung' 'teamleiter'
 'leiterin' 'head of' 'head' 'vertriebsleiter' 'leitung' 'leiter']

Top words for Management:
['cio' 'vice president' 'vice' 'founder' 'chief' 'owner' 'ceo'
 'geschäftsführung' 'vp' 'geschäftsführer']

Top words for Senior:
['marketing manager' 'engineer' 'executive' 'assistant' 'managerin'
 'responsable' 'consultant' 'management' 'senior' 'manager']


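Shows the ten features (words and bigrams) with the largest positive coefficients for each seniority class, listed in ascending order so the strongest feature comes last. This helps interpret the model’s decisions.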
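The ordering of the printed lists follows from how the top features are selected: the indices are sorted by ascending coefficient and the last entries are kept. A plain-Python sketch of the same selection, with hypothetical feature names and coefficients, looks like this:

```python
def top_k_indices(weights, k):
    """Indices of the k largest weights, in ascending order of weight."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    return order[-k:]

# Hypothetical features and coefficients for one class (illustration only)
features = ["junior", "manager", "director", "sales director", "leiter"]
director_coefs = [-1.2, -0.4, 3.1, 1.7, -0.9]

top = [features[i] for i in top_k_indices(director_coefs, 2)]
print(top)
```

As in the output above, the feature with the largest coefficient appears at the end of the list.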

# 3. Department Model

In [32]:
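# Prepare department data
ddf = department_df.copy()

# Drop missing rows first: astype(str) can turn NaN into the string "nan",
# after which dropna() would no longer remove those rows
ddf = ddf.dropna(subset=["text", "label"])

ddf["text"] = ddf["text"].astype(str).str.lower()
ddf["label"] = ddf["label"].astype(str)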

Prepare the department dataset in the same way: lowercase the text for a consistent representation and drop rows with a missing text or label.

In [33]:
# Department Train/Test Split
from sklearn.model_selection import train_test_split

dx = ddf["text"]
dy = ddf["label"]

dx_train, dx_test, dy_train, dy_test = train_test_split(
    dx,
    dy,
    test_size=0.2,
    random_state=42,
    stratify=dy
)

# Print dataset sizes
print("Department dataset sizes:")
print("Total:", len(dx))
print("Train:", len(dx_train))
print("Test:", len(dx_test))

Department dataset sizes:
Total: 10145
Train: 8116
Test: 2029


Split into train/test sets with proportional label distribution. The total is 10145, while the train set is 8116 and the test set is 2029.

In [34]:
# Department TF–IDF + Logistic Regression Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

dmodel = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams
        min_df=3,
        max_df=0.9
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])

Same as seniority pipeline but for department prediction.
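For the department labels the class imbalance is much stronger than for seniority, so class_weight="balanced" has a larger effect. The sketch below applies the documented "balanced" heuristic, n_samples / (n_classes * class_count), to the test-set supports from the department classification report. Using the test supports as a stand-in for the training counts is an assumption; since the split is stratified, the proportions should be close.

```python
# Test-set support per department (from the classification report),
# used here as a stand-in for the training class counts
support = {
    "Administrative": 17,
    "Business Development": 124,
    "Consulting": 33,
    "Customer Support": 7,
    "Human Resources": 6,
    "Information Technology": 261,
    "Marketing": 859,
    "Other": 8,
    "Project Management": 40,
    "Purchasing": 8,
    "Sales": 666,
}

n_samples = sum(support.values())
n_classes = len(support)

# "balanced" heuristic: weight grows as the class gets rarer
balanced_weight = {
    label: n_samples / (n_classes * count) for label, count in support.items()
}

for label, weight in sorted(balanced_weight.items(), key=lambda kv: -kv[1]):
    print(f"{label:<24}{weight:6.2f}")
```

Under these assumptions the rarest classes receive weights around 30 while Marketing receives a weight well below 1, which may help explain why small classes show very high recall but lower precision in the report.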

In [35]:
# Train the department model
dmodel.fit(dx_train, dy_train)

# Predict on the test data
dy_pred = dmodel.predict(dx_test)

# Print accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(dy_test, dy_pred))

Accuracy: 0.9344504682109414


Train the department classifier and predict on the test set. The model achieves a high accuracy of 0.93 on the test set.

In [36]:
# Department Evaluation
from sklearn.metrics import accuracy_score, f1_score, classification_report

dy_pred = dmodel.predict(dx_test)

daccuracy = accuracy_score(dy_test, dy_pred)
dmacro_f1 = f1_score(dy_test, dy_pred, average="macro")

print("Accuracy:", round(daccuracy, 3))
print("Macro F1:", round(dmacro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(dy_test, dy_pred))

Accuracy: 0.934
Macro F1: 0.86

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.62      0.94      0.74        17
  Business Development       0.83      0.99      0.90       124
            Consulting       0.82      0.97      0.89        33
      Customer Support       0.88      1.00      0.93         7
       Human Resources       0.75      1.00      0.86         6
Information Technology       0.92      0.95      0.94       261
             Marketing       0.99      0.92      0.96       859
                 Other       0.50      1.00      0.67         8
    Project Management       0.57      0.88      0.69        40
            Purchasing       0.89      1.00      0.94         8
                 Sales       0.96      0.93      0.94       666

              accuracy                           0.93      2029
             macro avg       0.79      0.96      0.86      2029
          weighted avg       0.95      0.93   

Evaluate using accuracy and macro F1, which averages the per-class F1 scores so that every class counts equally regardless of its size. The classification report shows precision, recall, and F1 per department. The evaluation yields an accuracy of 0.934 and a macro F1 score of 0.86. Recall is high for every class (0.88 or above), while precision is noticeably lower for small classes such as Other (0.50), Project Management (0.57), and Administrative (0.62).
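The per-class F1 scores in the report are the harmonic mean of precision and recall. The following sketch recomputes two of them from the rounded precision and recall values printed above, so small deviations from the report are possible for other classes.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) copied from the department classification report
other = f1_score_from(0.50, 1.00)
project_management = f1_score_from(0.57, 0.88)

print("Other:", round(other, 2))
print("Project Management:", round(project_management, 2))
```

A perfect recall does not compensate for a low precision: for Other, every true instance is found, but only half of the predictions are correct, which leaves the F1 score at about two thirds.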

In [37]:
# Department Evaluation on Annotated ACTIVE Jobs

# Prepare evaluation data
d_eval_df = jobs_annotated_active_df.dropna(subset=["position", "department"]).copy()
d_eval_text = d_eval_df["position"].astype(str).str.lower()
d_eval_labels = d_eval_df["department"].astype(str)

# Predict department
d_eval_pred = dmodel.predict(d_eval_text)

# Evaluation metrics
d_eval_accuracy = accuracy_score(d_eval_labels, d_eval_pred)
d_eval_macro_f1 = f1_score(d_eval_labels, d_eval_pred, average="macro")

print("Department Evaluation on ACTIVE Jobs")
print("Accuracy:", round(d_eval_accuracy, 3))
print("Macro F1:", round(d_eval_macro_f1, 3))
print("\nClassification Report:\n")
print(classification_report(d_eval_labels, d_eval_pred))

Department Evaluation on ACTIVE Jobs
Accuracy: 0.223
Macro F1: 0.338

Classification Report:

                        precision    recall  f1-score   support

        Administrative       0.17      0.07      0.10        14
  Business Development       0.38      0.30      0.33        20
            Consulting       0.86      0.46      0.60        39
      Customer Support       1.00      0.17      0.29         6
       Human Resources       0.73      0.50      0.59        16
Information Technology       0.31      0.44      0.36        62
             Marketing       0.17      0.41      0.24        22
                 Other       0.00      0.00      0.00       344
    Project Management       0.27      0.56      0.37        39
            Purchasing       0.80      0.53      0.64        15
                 Sales       0.12      0.85      0.21        46

              accuracy                           0.22       623
             macro avg       0.44      0.39      0.34       623
        

Evaluates the department model on annotated ACTIVE LinkedIn job entries based on the job position title. Evaluating on manually labeled CV data shows how robust and applicable the model is in realistic LinkedIn profile scenarios. The accuracy is 0.223 and the macro F1 score is 0.338. The largest class, Other (344 of 623 entries), has a recall of 0.00, which pulls the accuracy below the macro F1 score.
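The link between the per-class recalls and the overall accuracy can be reconstructed from the report: recall times support gives the number of correctly classified entries per class, and their sum divided by the total is the accuracy. The recalls are rounded to two decimals, so the per-class counts below are approximate reconstructions rather than values taken from the notebook.

```python
# class: (recall, support), copied from the ACTIVE department report
report = {
    "Administrative": (0.07, 14),
    "Business Development": (0.30, 20),
    "Consulting": (0.46, 39),
    "Customer Support": (0.17, 6),
    "Human Resources": (0.50, 16),
    "Information Technology": (0.44, 62),
    "Marketing": (0.41, 22),
    "Other": (0.00, 344),
    "Project Management": (0.56, 39),
    "Purchasing": (0.53, 15),
    "Sales": (0.85, 46),
}

n_total = sum(support for _, support in report.values())

# Approximate number of correct predictions per class, summed over classes
correct = sum(round(recall * support) for recall, support in report.values())

accuracy = correct / n_total
print("Correct:", correct, "of", n_total)
print("Accuracy:", round(accuracy, 3))
```

Because Other contributes no correct predictions despite holding more than half of the entries, the accuracy stays low even though several smaller classes are recognized reasonably well.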

In [38]:
# Department Top Features
import numpy as np

feature_names = dmodel.named_steps["tfidf"].get_feature_names_out()
coefs = dmodel.named_steps["clf"].coef_

for i, label in enumerate(dmodel.named_steps["clf"].classes_):
    top = np.argsort(coefs[i])[-10:]
    print(f"\nTop words for {label}:")
    print(feature_names[top])


Top words for Administrative:
['assistentin des' 'geschäftsführung' 'gf' 'assistent der'
 'geschäftsleitung' 'der' 'sekretärin' 'assistent' 'assistenz'
 'assistentin']

Top words for Business Development:
['business intelligence' 'crm' 'digital business' 'of business'
 'business process' 'ebusiness' 'it business' 'development'
 'business development' 'business']

Top words for Consulting:
['senior berater' 'sap' 'coach' 'senior consultant' 'von' 'senior'
 'recruitment' 'beraterin' 'berater' 'consultant']

Top words for Customer Support:
['service and' 'it systems' 'customer' 'technical' 'it support' 'it'
 'customer support' 'technical support' 'supporter' 'support']

Top words for Human Resources:
['qualitätsmanagement' 'director digital' 'project director' 'gl'
 'manager hr' 'of human' 'resources' 'human resources' 'human' 'hr']

Top words for Information Technology:
['digitalization' 'administrator' 'entwickler' 'digitale' 'administration'
 'digitalisierung' 'sap' 'digital' 'it' 'cr

Shows the top words for each department class to interpret the model.

In [39]:
# Compare accuracy and macro F1
comparison_metrics = pd.DataFrame({
    "Target": ["Seniority", "Department"],
    "Accuracy": [saccuracy, daccuracy],
    "Macro F1": [smacro_f1, dmacro_f1]
})

print("Model comparison:\n")
print(comparison_metrics)

# Optional: top 5 features per label for both models
def print_top_features(model, n=5):
    feature_names = model.named_steps["tfidf"].get_feature_names_out()
    coefs = model.named_steps["clf"].coef_
    for i, label in enumerate(model.named_steps["clf"].classes_):
        top = np.argsort(coefs[i])[-n:]
        print(f"\nTop {n} words for {label}:")
        print(feature_names[top])

print("\n--- Seniority Top Features ---")
print_top_features(smodel)

print("\n--- Department Top Features ---")
print_top_features(dmodel)

Model comparison:

       Target  Accuracy  Macro F1
0   Seniority  0.970308  0.956030
1  Department  0.934450  0.860444

--- Seniority Top Features ---

Top 5 words for Director:
['director sales' 'directors' 'sales director' 'abteilungsdirektor'
 'director']

Top 5 words for Junior:
['mitarbeiterin' 'referent' 'referentin' 'analyst' 'junior']

Top 5 words for Lead:
['head of' 'head' 'vertriebsleiter' 'leitung' 'leiter']

Top 5 words for Management:
['owner' 'ceo' 'geschäftsführung' 'vp' 'geschäftsführer']

Top 5 words for Senior:
['responsable' 'consultant' 'management' 'senior' 'manager']

--- Department Top Features ---

Top 5 words for Administrative:
['der' 'sekretärin' 'assistent' 'assistenz' 'assistentin']

Top 5 words for Business Development:
['ebusiness' 'it business' 'development' 'business development' 'business']

Top 5 words for Consulting:
['senior' 'recruitment' 'beraterin' 'berater' 'consultant']

Top 5 words for Customer Support:
['it' 'customer support' 'technical su

In [40]:
# Comparison: Training Evaluation vs ACTIVE Job Evaluation

comparison_metrics = pd.DataFrame({
    "Target": [
        "Seniority (Label Data)",
        "Department (Label Data)",
        "Seniority (ACTIVE Jobs)",
        "Department (ACTIVE Jobs)"
    ],
    "Accuracy": [
        saccuracy,
        daccuracy,
        s_eval_accuracy,
        d_eval_accuracy
    ],
    "Macro F1": [
        smacro_f1,
        dmacro_f1,
        s_eval_macro_f1,
        d_eval_macro_f1
    ]
})

print("Model Comparison:\n")
print(comparison_metrics)


#Top Features per Label

import numpy as np

def print_top_features(model, n=5):
    feature_names = model.named_steps["tfidf"].get_feature_names_out()
    coefs = model.named_steps["clf"].coef_
    for i, label in enumerate(model.named_steps["clf"].classes_):
        top = np.argsort(coefs[i])[-n:]
        print(f"\nTop {n} words for {label}:")
        print(feature_names[top])

print("\n--- Seniority Top Features ---")
print_top_features(smodel)

print("\n--- Department Top Features ---")
print_top_features(dmodel)

Model Comparison:

                     Target  Accuracy  Macro F1
0    Seniority (Label Data)  0.970308  0.956030
1   Department (Label Data)  0.934450  0.860444
2   Seniority (ACTIVE Jobs)  0.436597  0.409319
3  Department (ACTIVE Jobs)  0.223114  0.338219

--- Seniority Top Features ---

Top 5 words for Director:
['director sales' 'directors' 'sales director' 'abteilungsdirektor'
 'director']

Top 5 words for Junior:
['mitarbeiterin' 'referent' 'referentin' 'analyst' 'junior']

Top 5 words for Lead:
['head of' 'head' 'vertriebsleiter' 'leitung' 'leiter']

Top 5 words for Management:
['owner' 'ceo' 'geschäftsführung' 'vp' 'geschäftsführer']

Top 5 words for Senior:
['responsable' 'consultant' 'management' 'senior' 'manager']

--- Department Top Features ---

Top 5 words for Administrative:
['der' 'sekretärin' 'assistent' 'assistenz' 'assistentin']

Top 5 words for Business Development:
['ebusiness' 'it business' 'development' 'business development' 'business']

Top 5 words for Consul

The comparison table summarizes model performance across the two evaluation settings. On the held-out test split of the label datasets, the seniority classifier achieves an accuracy of 0.97 with a macro F1 score of 0.96, while the department classifier reaches an accuracy of 0.93 and a macro F1 score of 0.86. Both TF–IDF + logistic regression models therefore perform very well when trained and evaluated on curated label data.

When evaluated on annotated ACTIVE job entries from real CV data, performance drops substantially. Seniority prediction achieves an accuracy of 0.44 and a macro F1 score of 0.41, while department prediction reaches an accuracy of only 0.22 with a macro F1 score of 0.34. Part of this decline comes from the label distribution: Professional (216 of 623 entries) does not occur among the seniority training labels and is never predicted, and no Other entry (344 of 623) is classified correctly by the department model. The remainder points to a domain shift between clean label data and real-world job titles, which are likely noisier and more ambiguous and often lack explicit department or seniority cues.

The observed performance gap confirms that while simple bag-of-words baselines are effective on controlled datasets, they struggle to generalize to realistic CV data.
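One way to put the ACTIVE results into perspective is a majority-class baseline, that is, the accuracy of always predicting the most frequent label. The sketch below computes it from the supports in the two ACTIVE classification reports; the `majority_baseline` helper is introduced only for this example.

```python
def majority_baseline(support):
    """Accuracy of always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())

# Supports from the ACTIVE classification reports
seniority_support = {
    "Director": 34, "Junior": 12, "Lead": 125,
    "Management": 192, "Professional": 216, "Senior": 44,
}
department_support = {
    "Administrative": 14, "Business Development": 20, "Consulting": 39,
    "Customer Support": 6, "Human Resources": 16, "Information Technology": 62,
    "Marketing": 22, "Other": 344, "Project Management": 39,
    "Purchasing": 15, "Sales": 46,
}

# Model accuracies reported on the ACTIVE jobs
model_accuracy = {"Seniority": 0.437, "Department": 0.223}

baselines = {
    "Seniority": majority_baseline(seniority_support),
    "Department": majority_baseline(department_support),
}

for target, baseline in baselines.items():
    print(f"{target}: model {model_accuracy[target]:.3f} vs. baseline {baseline:.3f}")
```

On this data the seniority model is above its baseline, whereas the department model falls below the accuracy of always predicting Other, which underlines how much the Other class drives the department result.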

In [41]:
"""
results = []

# ACTIVE jobs – Seniority
add_result(
    results,
    model_name="bag-of-words/TF-IDF + Logistic Regression",
    target="Seniority (ACTIVE Jobs)",
    metrics={
        "accuracy": s_eval_accuracy,
        "macro_f1": s_eval_macro_f1
    }
)

# ACTIVE jobs – Department
add_result(
    results,
    model_name="bag-of-words/TF-IDF + Logistic Regression",
    target="Department (ACTIVE Jobs)",
    metrics={
        "accuracy": d_eval_accuracy,
        "macro_f1": d_eval_macro_f1
    }
)

save_results(results)

results_df_tfidf = pd.DataFrame(results)
results_df_tfidf"""
