## Let's first look at the data

In [None]:
import pandas as pd
import numpy as np
df = pd.read_excel('/content/drive/MyDrive//JobLevelData.xlsx')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Title     2240 non-null   object
 1   Column 1  2230 non-null   object
 2   Column 2  133 non-null    object
 3   Column 3  12 non-null     object
 4   Column 4  11 non-null     object
dtypes: object(5)
memory usage: 87.6+ KB


In [None]:
df[df['Column 1'].isna()]

Unnamed: 0,Title,Column 1,Column 2,Column 3,Column 4
29,CINO,,,,
341,Ticari Subesi Muduru,,,,
511,"Governor, Chair of Board Risk and Audit Commit...",,,,
764,"Former Director, Compensation and Benefits",,,,
829,Release of Information Tech II,,,,
1257,"Shareholder, Chair of Tax Section",,,,
1406,"Global People Systems, Processes and Informati...",,,,
1713,Supplier Quality Engineer,,,,
1785,RC Environmental and Cyber Specialized Subscri...,,,,
2182,Senior Independedirector and Chair of the Cust...,,,,


In [None]:
df.iloc[28]

Unnamed: 0,28
Title,CIO/ Business Development Director
Column 1,Director
Column 2,Chief Officer
Column 3,
Column 4,
Combined_Labels,"Director, Chief Officer, , ,"


In [None]:

df.loc[:,'Combined_Labels'] = (
        df['Column 1'].fillna('') + ', ' +
        df['Column 2'].fillna('') + ', ' +
        df['Column 3'].fillna('') + ', ' +
        df['Column 4'].fillna('') + ', '
)



In [None]:
df[['Title','Combined_Labels']].to_csv('/content/drive/MyDrive/Colab Notebooks/DF_2.csv', index=False)

In [None]:
import pandas as pd
import numpy as np
DF = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DF_2.csv')

In [None]:
DF

Unnamed: 0,Title,Combined_Labels
0,Vice President / Director of Systems Engineering,"Vice President, , , ,"
1,Systems Engineer; Systems Architect,"Manager, Individual Contributor/Staff, , ,"
2,"Executive Director, Global IT Infrastructure /...","Director, Chief Officer, , ,"
3,CTO/Executive Director of Technology Services,"Director, Chief Officer, , ,"
4,"Vice President, CIO","Vice President, , , ,"
...,...,...
2235,Net Software Architect and Team Project Lead,"Manager, , , ,"
2236,Solutions Architect & Technical Lead,"Manager, Individual Contributor/Staff, , ,"
2237,"Manager, Salesforcecom Administration and Rele...","Manager, , , ,"
2238,Innovation Automation Architect,"Manager, , , ,"


We are dealing with a multi-label classification problem, where each job title can carry several labels at the same time.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
labels_encoded = mlb.fit_transform(DF['Combined_Labels'].str.split(', '))

labels_encoded

array([[1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 1, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       [1, 1, 0, ..., 0, 0, 0]])

In [None]:
labels_encoded.shape


(2240, 7)
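
One thing worth checking at this point: `Combined_Labels` is built with a separator after every column, including empty ones, so splitting on `', '` yields empty strings. `MultiLabelBinarizer` then treats `''` as a class of its own, which is consistent with the all-ones first column in the array printed above. The sketch below reproduces this on three toy strings and shows one possible way to filter the empty pieces. The helper `split_labels` is hypothetical (not part of the notebook) and assumes no real label contains a comma.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy strings shaped like Combined_Labels (separators kept for empty columns).
combined = [
    "Vice President, , , , ",
    "Manager, Individual Contributor/Staff, , , ",
    "Director, Chief Officer, , , ",
]

# Naive split: the empty pieces survive, so "" becomes a class.
naive = [s.split(", ") for s in combined]
naive_classes = list(MultiLabelBinarizer().fit(naive).classes_)

def split_labels(s):
    """Hypothetical helper: split on commas and drop empty pieces."""
    parts = (part.strip() for part in s.split(","))
    return [part for part in parts if part]

# Filtered split: only real labels remain.
clean = [split_labels(s) for s in combined]
mlb_clean = MultiLabelBinarizer()
encoded = mlb_clean.fit_transform(clean)
clean_classes = list(mlb_clean.classes_)

print(naive_classes)
print(clean_classes)
print(encoded.shape)
```

With the filtered lists, every column of the encoded matrix corresponds to a real job level. An always-present class would otherwise also enter the micro-averaged metrics computed later, where it is trivially easy to predict.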

The state-of-the-art method for text related problem s is using transformers, as embeddings given from transformers gives us good relative representation of the data. After trainings with different hyperparameters and different versions of bert (including roberta), we see that "bert-base-uncased" gives the best results.

In [None]:
from transformers import AutoTokenizer
# Loading the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Extracting job titles from a DataFrame (DF).
job_titles = DF['Title'].tolist()

# Tokenizing the job titles
encoded_inputs = tokenizer(job_titles, padding=True, truncation=True, return_tensors="pt")


# Accessing the encoded inputs.
input_ids = encoded_inputs["input_ids"]
attention_mask = encoded_inputs["attention_mask"]

# Let's check the encoding of the first job title.
print("Tokenized title:", tokenizer.decode(input_ids[0], skip_special_tokens=True))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Tokenized title: vice president / director of systems engineering


In [None]:
from sklearn.model_selection import train_test_split
from transformers import AutoModel
import torch
# Loading the pre-trained BERT model.
model = AutoModel.from_pretrained("bert-base-uncased")

# Disabling gradient calculations to save memory and improve performance during inference.
with torch.no_grad():
    # Passing the input IDs and attention masks through the BERT model to obtain embeddings.
    # 'last_hidden_state' contains the output embeddings for all tokens in the input sequences.
    embeddings = model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
# Splitting the data into training, validation, and testing sets.
X_train, X_temp, y_train, y_temp = train_test_split(embeddings, labels_encoded, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
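
As a quick sanity check on the two-stage split, the expected set sizes can be worked out by hand. This is a minimal sketch that assumes the 2240 rows reported by `df.info()` and assumes that `train_test_split` rounds the test share up; the proportions divide evenly here, so the rounding rule does not change the result.

```python
import math

n_samples = 2240  # rows reported by df.info()

# First split: 30% held out as the temporary set, the rest is training data.
n_temp = math.ceil(0.3 * n_samples)
n_train = n_samples - n_temp

# Second split: the temporary set is halved into validation and test.
n_test = math.ceil(0.5 * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)
```

This gives a 70/15/15 split, so each reported test metric is based on a few hundred titles.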

In [None]:
import torch
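# AdamW is imported from torch.optim; the copy that shipped with transformers is deprecated.
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification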
import numpy as np

# Define the model
model_name = "bert-base-uncased"
# Number of unique labels
num_labels = labels_encoded.shape[1]
# Load the pre-trained BERT model for sequence classification, setting the number of output labels.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to("cuda")

# Defining optimizer and loss function
# AdamW is chosen as the optimizer, which is effective for training transformer models.
optimizer = AdamW(model.parameters(), lr=2e-5)
# Binary Cross-Entropy loss with logits is used, suitable for multi-label classification tasks.
loss_fn = torch.nn.BCEWithLogitsLoss()

# Convert labels to PyTorch tensors and move to GPU for computation
# This ensures that the labels are in the right format for the loss function and model.
y_train = torch.tensor(y_train).float().to("cuda")
y_val = torch.tensor(y_val).float().to("cuda")
y_test = torch.tensor(y_test).float().to("cuda")

# Training loop
num_epochs = 10
batch_size = 16

for epoch in range(num_epochs):
    model.train() # Set the model to training mode
    for i in range(0, len(X_train), batch_size):
        # Select a batch of inputs and labels
        batch_inputs = X_train[i:i+batch_size].to("cuda")
        batch_labels = y_train[i:i+batch_size].to("cuda")

        # Zero the gradients to prepare for the new batch
        optimizer.zero_grad()
        # Forward pass: compute the model's outputs
        outputs = model(inputs_embeds=batch_inputs).logits
        # Compute the loss between the predicted outputs and true labels
        loss = loss_fn(outputs, batch_labels)
        # Backward pass: compute gradients
        loss.backward()
        # Update model parameters
        optimizer.step()

    # Validation loop
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        # Compute the model's outputs on the validation set
        val_outputs = model(inputs_embeds=X_val.to("cuda")).logits
        # Compute the validation loss
        val_loss = loss_fn(val_outputs, y_val)
        # Note: `loss` is the loss of the last training batch only, not an epoch average
        print(f"Epoch {epoch+1} - Training Loss: {loss.item():.4f} - Validation Loss: {val_loss.item():.4f}")

    # Save the model and tokenizer after each epoch
    model.save_pretrained(f'/content/drive/MyDrive/bert_base_model_{epoch}')
    tokenizer.save_pretrained(f'/content/drive/MyDrive/bert_base_model_{epoch}')

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  y_train = torch.tensor(y_train).float().to("cuda")
  y_val = torch.tensor(y_val).float().to("cuda")
  y_test = torch.tensor(y_test).float().to("cuda")


Epoch 1 - Training Loss: 0.2417 - Validation Loss: 0.2181
Epoch 2 - Training Loss: 0.2031 - Validation Loss: 0.1794
Epoch 3 - Training Loss: 0.1351 - Validation Loss: 0.1558
Epoch 4 - Training Loss: 0.1233 - Validation Loss: 0.1390
Epoch 5 - Training Loss: 0.1152 - Validation Loss: 0.1555
Epoch 6 - Training Loss: 0.0946 - Validation Loss: 0.1235
Epoch 7 - Training Loss: 0.0450 - Validation Loss: 0.1221
Epoch 8 - Training Loss: 0.0656 - Validation Loss: 0.1232
Epoch 9 - Training Loss: 0.0655 - Validation Loss: 0.1182
Epoch 10 - Training Loss: 0.0520 - Validation Loss: 0.1339
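
Since a checkpoint is saved after every epoch, one simple way to choose which one to evaluate is to take the epoch with the lowest validation loss. The sketch below does this for the values copied from the log above. It assumes the log and the saved checkpoints come from the same run, and it relies on the training loop naming directories with the 0-based loop variable `epoch`.

```python
# Validation losses copied from the training log (epochs 1 to 10).
val_losses = [0.2181, 0.1794, 0.1558, 0.1390, 0.1555,
              0.1235, 0.1221, 0.1232, 0.1182, 0.1339]

# Index of the smallest validation loss (0-based, like the loop variable).
best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
best_epoch = best_index + 1      # epoch number as printed in the log
checkpoint_suffix = best_index   # suffix used in the save_pretrained path

print(f"Best epoch: {best_epoch}, validation loss: {val_losses[best_index]:.4f}")
print(f"Checkpoint suffix: {checkpoint_suffix}")
```

Validation loss is only one possible selection criterion; a validation F1-score at the chosen decision threshold would be a reasonable alternative.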


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
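# Load the saved model and tokenizer.
# Note: the training loop above saves checkpoints as 'bert_base_model_{epoch}';
# make sure this path points to the checkpoint you actually want to evaluate.
model_path = '/content/drive/MyDrive/my_bert_model_5'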
model = AutoModelForSequenceClassification.from_pretrained(model_path).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)


# Evaluation on test set
model.eval()
with torch.no_grad():
    test_outputs = model(inputs_embeds=X_test.to("cuda")).logits
    # Applying sigmoid to get probabilities
    predicted_probs = torch.sigmoid(test_outputs)
    predicted_labels = (predicted_probs > 0.4).int().cpu().numpy()

    # Calculating evaluation metrics
    accuracy = accuracy_score(y_test.cpu().numpy(), predicted_labels)
    precision = precision_score(y_test.cpu().numpy(), predicted_labels, average='micro')
    recall = recall_score(y_test.cpu().numpy(), predicted_labels, average='micro')
    f1 = f1_score(y_test.cpu().numpy(), predicted_labels, average='micro')

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-score: {f1:.4f}")

print("Evaluation completed.")


Accuracy: 0.8512
Precision: 0.9570
Recall: 0.9215
F1-score: 0.9390
Evaluation completed.
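
The decision rule used in the evaluation cell (sigmoid followed by a 0.4 threshold) can be illustrated without a model. The sketch below uses made-up logits and also shows that thresholding a probability at 0.4 is the same as thresholding the raw logit at `log(0.4 / 0.6)`, roughly -0.405. Lowering the threshold below 0.5 makes the model more willing to emit a label.

```python
import math

def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

threshold = 0.4
# Equivalent cutoff on the logit scale (inverse of the sigmoid).
logit_cutoff = math.log(threshold / (1.0 - threshold))

# Hypothetical logits for two titles and three labels.
logits = [[2.1, -0.2, -3.0],
          [-0.5, 0.3, -0.41]]

by_prob = [[int(sigmoid(z) > threshold) for z in row] for row in logits]
by_logit = [[int(z > logit_cutoff) for z in row] for row in logits]

print(by_prob)
print(by_logit)
```

Each label is decided independently, which is what allows a title to receive zero, one, or several labels.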


Accuracy: 0.8512
With an accuracy of about 85.12%, the model predicts the complete label set correctly for roughly 85% of the titles in the test set. For multi-label targets, scikit-learn's `accuracy_score` is subset accuracy: a prediction counts as correct only if every label matches. This is a strict metric, and on its own it can be misleading, especially when the label distribution is uneven, so it is worth looking at the other metrics for a fuller picture of how the model performs.
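
To make the strictness of this metric concrete, here is a small sketch on made-up labels (not the notebook's data). It computes the exact-match ratio, which is what `accuracy_score` reports for multi-label indicator arrays, next to the share of individual label decisions that are correct.

```python
import numpy as np

# Made-up ground truth and predictions: 4 titles, 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0],   # one label missed on this row
                   [0, 0, 1]])

# Subset accuracy: a row counts only if all of its labels match.
subset_accuracy = (y_true == y_pred).all(axis=1).mean()

# Per-label accuracy: share of individual 0/1 decisions that match.
per_label_accuracy = (y_true == y_pred).mean()

print(f"Subset accuracy:    {subset_accuracy:.4f}")
print(f"Per-label accuracy: {per_label_accuracy:.4f}")
```

A single missed label costs a whole row under subset accuracy, which is why this figure sits well below the micro-averaged scores.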

Precision: 0.9570
A micro-averaged precision of about 95.70% means that, pooled over all labels, about 95.7% of the labels the model predicts are correct. This is particularly important where false positives are costly, such as assigning a job title to a level it does not belong to. High precision indicates that the model is good at avoiding false positives.

Recall: 0.9215
The micro-averaged recall of about 92.15% means that the model recovers about 92.15% of all true label assignments in the test set, pooled over all labels. This metric is critical in scenarios where failing to catch a relevant label (a false negative) is costly, such as in medical diagnoses or safety alerts. Although this recall is strong, it should be read together with precision: a model with high recall but low precision is too lenient in its predictions.

F1-score: 0.9390
The F1-score of about 93.90% is the harmonic mean of precision and recall, which combines both into a single metric. It shows that the model does well on both fronts in the multi-label setting we are working with. A high F1-score means the model is not just precise; it also keeps a good balance between precision and recall, which is essential for applications where both matter.
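
As a worked check, the sketch below first shows how micro-averaging pools counts over labels, using made-up per-label counts, and then recomputes F1 from the reported precision and recall. Because the reported inputs are already rounded to four decimals, the recomputed value can differ from the reported 0.9390 in the last digit.

```python
# Made-up per-label counts (true positives, false positives, false negatives).
tp = [50, 30, 10]
fp = [2, 1, 1]
fn = [4, 3, 1]

# Micro-averaging pools the counts over all labels before dividing.
micro_precision = sum(tp) / (sum(tp) + sum(fp))
micro_recall = sum(tp) / (sum(tp) + sum(fn))

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

toy_f1 = f1_from(micro_precision, micro_recall)

# Recompute F1 from the precision and recall reported above.
reported_f1 = f1_from(0.9570, 0.9215)

print(f"Toy micro precision/recall/F1: {micro_precision:.4f} {micro_recall:.4f} {toy_f1:.4f}")
print(f"F1 from reported precision and recall: {reported_f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, F1 only stays high when precision and recall are both high.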
