## Description
In this notebook, a German BERT model (`google-bert/bert-base-german-cased`) is used to test whether, and how, it can be fine-tuned with our own keywords.

#### Result
No difference was found between the model before and after training, so BERT does not seem to be what we are looking for. Better libraries or models may exist for German word embeddings (e.g. word2vec). Our knowledge of fine-tuning LLMs is also limited. Training our own LLM might be a better choice, but it is not suitable for our work. We may try a different approach using BERT or other libraries.

The code was created with the assistance of ChatGPT-4.

In [4]:
# from transformers import AutoTokenizer, AutoModelForMaskedLM

# tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-german-cased")
# model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-german-cased")

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-german-cased", num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-german-cased", num_labels=2)

# Example text
text = "Verkehrsdaten in Berlin"

# Tokenize text
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Make prediction (not meaningful yet: the classification layer is untrained)
with torch.no_grad():
    outputs = model(**inputs)

# Results (Logits)
logits = outputs.logits
print(f"Logits: {logits}")

# Prediction (index of the highest logit)
prediction = torch.argmax(logits, dim=-1)
print(f"Prediction: {prediction.item()}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Logits: tensor([[-0.2487,  0.1334]])
Prediction: 1
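
To read the logits above as class probabilities, a softmax can be applied. The following is a minimal standard-library sketch, not part of the original notebook; the `softmax` helper is written here for illustration, and the logit values are copied from the output above.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtract the maximum first for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the cell output above.
logits = [-0.2487, 0.1334]
probs = softmax(logits)
print([round(p, 3) for p in probs])
```

The two probabilities come out at roughly 0.41 and 0.59, which is close to a coin flip. That is what one would expect from a classification head with randomly initialized weights, as the warning above indicates.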


In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-german-cased")
model = AutoModel.from_pretrained("google-bert/bert-base-german-cased")

In [None]:
# Example mobility terms, plus a few unrelated words for comparison
mobilitaetsbegriffe = [
    "Verkehr", "Transport", "Fahrzeug", "Straße", "Fahrplan", "Auto", "Bahn", "Mobilität",
    "Fahrrad", "Bus", "E-Mobilität", "ÖPNV", "Flughafen", "Lkw", "Velo", "Fabian", "XYZ", "Essen", "Mami"
]


In [None]:
def get_embedding(text):
    # Tokenize and convert to tensors
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    
    # Forward pass through the model
    with torch.no_grad():
        outputs = model(**inputs)
    
    # We take the [CLS] token as the representation of the text
    # outputs.last_hidden_state is a tensor of shape (batch_size, sequence_length, hidden_size)
    # We take the first token ([CLS] token) and extract the embedding
    embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
    
    return embedding

# Example text
text = "Reiseverhalten"

# Calculate the embedding of the example text
text_embedding = get_embedding(text)

# Calculate embeddings of the mobility terms
mobilitaets_embeddings = [get_embedding(term) for term in mobilitaetsbegriffe]


In [None]:
# Calculate cosine similarity
similarities = cosine_similarity([text_embedding], mobilitaets_embeddings)

# Find most similar terms
similarity_scores = similarities[0]  # Since there's only one comparison text
similarity_dict = dict(zip(mobilitaetsbegriffe, similarity_scores))

# Sort by highest similarity
sorted_similarity = sorted(similarity_dict.items(), key=lambda item: item[1], reverse=True)

# Output mobility terms by similarity
for term, score in sorted_similarity:
    print(f"{term}: {score:.4f}")

Mobilität: 0.8375
E-Mobilität: 0.8211
Verkehr: 0.7885
Auto: 0.7408
Fahrplan: 0.7294
Velo: 0.7192
Fahrrad: 0.7171
Transport: 0.7162
Lkw: 0.7160
Essen: 0.7158
Fahrzeug: 0.6914
ÖPNV: 0.6898
Bahn: 0.6739
XYZ: 0.6719
Bus: 0.6578
Flughafen: 0.6200
Straße: 0.6139
Fabian: 0.6051
Mami: 0.5899
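
The scores above come from scikit-learn's `cosine_similarity`, which divides the dot product of two vectors by the product of their norms. Below is a minimal NumPy sketch of that formula on made-up toy vectors; the helper name `cosine_sim` and the vectors are illustrative and not part of the notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 2.0, 0.0])
same_direction = np.array([2.0, 4.0, 0.0])   # scaled copy of query
orthogonal = np.array([0.0, 0.0, 3.0])       # shares no component with query

sim_same = cosine_sim(query, same_direction)
sim_orth = cosine_sim(query, orthogonal)
print(sim_same, sim_orth)
```

The measure depends only on direction, not on vector length, so the scaled copy scores close to 1 and the orthogonal vector scores 0. If the BERT embeddings all point in broadly similar directions, scores would cluster in a narrow band, which may be one reason an unrelated word such as "XYZ" (0.6719) lands next to "Bahn" (0.6739) in the list above.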
