This notebook explains in detail what we expect the project code / pipeline to look like. Headings mark the different parts of the code: most are taken from the code you already provided, and we added headings for the parts we additionally expect.

Where a cell just says "Code", it is a placeholder for your existing code. Where something is missing or unclear, we have written explanatory text instead.

Please also read the "Useful information" part at the end of the notebook.

# Import libraries

Code

# Read Dataset

In [None]:
Code

# Data Preprocessing

## Train Test Split

Code

## Tokenization

Code

## Encode Datasets

Code

## Process datasets

Code

# Transformer models

## Fine-tuning

### Code for
**Create model without custom head, freeze the body**

Please use this: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification

It should look something like this:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                      num_labels=len(label_dict),
                                                      output_attentions=True,
                                                      output_hidden_states=True)

Here, the model head is instantiated automatically.
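The snippet above loads the model but does not freeze anything yet. A minimal sketch of the freezing step follows, assuming the classification head's parameters are registered under the name `classifier` (this is the case for the BERT- and RoBERTa-style sequence-classification classes in transformers, and for the `CustomModel` shown later in this notebook). `freeze_body` and `count_trainable` are hypothetical helper names, not part of the delivered code.

```python
def freeze_body(model, head_prefix="classifier"):
    """Freeze every parameter except those of the classification head.

    Works on any torch.nn.Module; only `named_parameters()` is used.
    Assumption: head parameter names start with `head_prefix`.
    """
    for name, param in model.named_parameters():
        # Head parameters stay trainable, everything else is frozen.
        param.requires_grad = name.startswith(head_prefix)
    return model


def count_trainable(model):
    """Return (trainable, total) parameter counts as a quick sanity check."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```

After `freeze_body(model)`, `count_trainable(model)` should report only the classifier weights as trainable. Passing only the trainable parameters to the optimizer (`filter(lambda p: p.requires_grad, model.parameters())`) avoids keeping optimizer state for frozen weights.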

### Code for
**Create model with custom head, freeze the body**

You already created this model, for example in the first project. You can re-use it and extend it to freeze the body, or use the model you already created as CustomModel2.




In [2]:
# class CustomModel(nn.Module):
#     def __init__(self, checkpoint, num_labels): 
#         super(CustomModel,self).__init__() 
#         self.num_labels = num_labels 

#         # Load Model with given checkpoint and extract its body
#         config = AutoConfig.from_pretrained(checkpoint, output_hidden_states=True, output_attentions=True)
#         self.model = AutoModel.from_pretrained(checkpoint, config=config)
#         self.dropout = nn.Dropout(0.1) 
#         self.classifier = nn.Linear(768,num_labels) # 768 = hidden size of the base models


#     def forward(self, input_ids=None, attention_mask=None,labels=None):
#         #Extract outputs from the body
#         outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)

#         #Add custom layers
#         sequence_output = self.dropout(outputs[0]) #outputs[0]=last hidden state

#         logits = self.classifier(sequence_output[:,0,:].view(-1,768)) # classify from the first token ([CLS] / <s>)

#         loss = None
#         if labels is not None:
#           # set class weights here  
#     #           device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#     #           class_weights = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]).to(device) # set weights here
#           loss_fct = nn.CrossEntropyLoss()
#           loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

#         return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states,attentions=outputs.attentions)


**Important**

Please note that in CustomModel1 and CustomModel2 the line "pretrain_model = BertForPreTraining.from_pretrained(checkpoint, config=config)" from the CustomModel class is not correct. Use AutoModel for the body of a model with custom head, or AutoModelForSequenceClassification when using the automatic head from Hugging Face.

https://stackoverflow.com/questions/66596142/bertmodel-or-bertforpretraining

## Model Trainer

In [None]:
Code

### Evaluation and plotting

In [None]:
Code

### Training

In [None]:
Code

### Configurations and Settings

This is the code that starts training for the model without custom head and the model with custom head. Below is what you have provided so far. It may be sufficient, or it may need adjusting.

One request: please add comments to the relevant lines explaining how the code knows that it must run the model without custom head first and then the model with custom head. Please highlight this in the code.

The data must be split into train and validation.

In [None]:
trained_models_file = 'trained_models.txt' #WHAT DOES THIS DO?
best_params_dict_path = 'best_params.json' #WHAT DOES THIS DO?

df = pickle.load(open('dataset/230130_SmallOberkategorie.pickle', 'rb'))
label_names = df['labels'].unique()
# label_names

X = df[['text']]
y = df['labels']
X_train, X_val, y_train, y_val = split_df(X, y)

custom_head = False
freeze_layers = True
optimizers = [Adam]
learning_rates = [2e-6]
epochs = [2]
batch_sizes = [16]

configs = {'custom_head': custom_head, 'freeze_layers': freeze_layers,
           'optimizers': optimizers, 'epochs': epochs, 'batch_sizes': batch_sizes,
           'learning_rates': learning_rates, 'val_steps': 100}
dataset = {'X_train': X_train, 'y_train': y_train, 'X_val': X_val, 'y_val': y_val}

models = [
    'bert-base-german-cased',
    'dbmdz/bert-base-german-uncased',
    'deepset/gbert-base',
    'xlm-roberta-base',
    'uklfr/gottbert-base'
]

# training without custom head
training(models, configs, dataset)

# training with custom head
configs['custom_head'] = True
training(models, configs, dataset)

**Expected results:**

For each of the 5 models, two sets of weights should be saved:
 * 5 models without custom head
 * 5 models with custom head
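One way to make both the run order and the ten saved artefacts explicit is to build a run plan up front. The sketch below is only an illustration: `build_run_plan` and the directory layout `checkpoints/<stage>/<head variant>/<model>` are assumptions, not names from the delivered code.

```python
from pathlib import Path


def build_run_plan(models, stage="fine_tuning", root="checkpoints"):
    """List every (model, custom_head, save_dir) run in execution order.

    All models are trained without the custom head first, then with it,
    mirroring the two `training(...)` calls in the cell above.
    """
    plan = []
    for custom_head in (False, True):
        variant = "custom_head" if custom_head else "auto_head"
        for name in models:
            # '/' in hub IDs would create nested folders, so flatten it.
            save_dir = Path(root) / stage / variant / name.replace("/", "__")
            plan.append({"model": name, "custom_head": custom_head, "save_dir": save_dir})
    return plan


models = ["bert-base-german-cased", "deepset/gbert-base"]
for run in build_run_plan(models):
    print(run["custom_head"], run["save_dir"])
```

With the five models from the configuration cell, this yields ten entries. A separate `stage` value for a later step keeps its weights apart from the fine-tuning weights.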
 
 
**Expected Output**

The expected output (metrics and graphs) is the same as it is now. No changes are needed.

## Hyperparameter Tuning

This code does essentially the same as the fine-tuning step, EXCEPT that it starts from the saved fine-tuned models. Instead of creating the models again, load the saved models (one after another, not all at once) and train them with *different hyperparameter settings*. The best weights for each model should be saved without overwriting the weights saved during fine-tuning.

The data must also be loaded here: a different dataset than during fine-tuning, split into train and test. We want to see the same outputs as after fine-tuning (metrics, graphs etc.). The code you already have can probably be re-used with minor changes.
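A possible way to organise the search is to expand the list-valued settings into individual trials and keep the best trial per model. This is a sketch under assumptions: `expand_grid`, `select_best`, the key names, and the metric name `val_f1` are hypothetical and may not match the delivered code.

```python
from itertools import product


def expand_grid(grid):
    """Turn {'learning_rate': [a, b], 'epochs': [c]} into one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def select_best(results, metric="val_f1"):
    """Pick the result with the highest `metric` (assumed: higher is better)."""
    return max(results, key=lambda r: r[metric])


grid = {"learning_rate": [2e-5, 5e-5], "batch_size": [16, 32], "epochs": [3]}
trials = expand_grid(grid)
print(len(trials))  # 2 * 2 * 1 = 4 combinations
```

Each trial would be run for one loaded model at a time, and only the weights of the trial returned by `select_best` need to be kept.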
    

## Testing

Code for testing the saved models (10 in total) from the hyperparameter-tuning step. Each model should be tested with the hyperparameter configuration that gave its best result during hyperparameter tuning.

The rest is as already communicated (mismatched data, attention weights etc.).
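To test each model with its own best configuration, the tuning step has to store that configuration somewhere retrievable. The structure of `best_params.json` in the delivered code is not known to us, so the following is only a sketch of one possible layout (one entry per model and head variant). `run_key`, `save_best_params` and `load_best_params` are hypothetical names.

```python
import json
from pathlib import Path


def run_key(model_name, custom_head):
    """Build a unique key per model and head variant."""
    return f"{model_name}|{'custom_head' if custom_head else 'auto_head'}"


def save_best_params(path, model_name, custom_head, params):
    """Add or update one entry, keeping the entries of all other runs."""
    path = Path(path)
    store = json.loads(path.read_text()) if path.exists() else {}
    store[run_key(model_name, custom_head)] = params
    path.write_text(json.dumps(store, indent=2))


def load_best_params(path, model_name, custom_head):
    """Return the stored hyperparameters for one model and head variant."""
    store = json.loads(Path(path).read_text())
    return store[run_key(model_name, custom_head)]
```

The testing code can then call `load_best_params('best_params.json', name, custom_head)` for each of the ten saved models before running the evaluation.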

# Useful information

**What is Model Fine-Tuning?**
BERT (Bidirectional Encoder Representations from Transformers) is a big neural network architecture, with a huge number of parameters, that can range from 100 million to over 300 million. So, training a BERT model from scratch on a small dataset would result in overfitting.

So, it is better to use a pre-trained BERT model that was trained on a huge dataset, as a starting point. We can then further train the model on our relatively smaller dataset and this process is known as model fine-tuning.

**Different Fine-Tuning Techniques**
* 1) Train the entire architecture – We can further train the entire pre-trained model on our dataset and feed the output to a softmax layer. In this case, the error is back-propagated through the entire architecture and the pre-trained weights of the model are updated based on the new dataset.
* 2) Train some layers while freezing others – Another way to use a pre-trained model is to train it partially. We can keep the weights of the initial layers of the model frozen while we retrain only the higher layers. We can experiment with how many layers to freeze and how many to train.
* 3) Freeze the entire architecture – We can even freeze all the layers of the model and attach a few neural network layers of our own and train this new model. Note that the weights of only the attached layers will be updated during model training.
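Technique 2 from the list above can be sketched as follows, assuming parameter names follow the BERT/RoBERTa pattern used by transformers (`<prefix>.embeddings...` and `<prefix>.encoder.layer.<i>...`). `freeze_lower_layers` is a hypothetical helper, not part of the delivered code.

```python
import re

_LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def freeze_lower_layers(model, num_frozen_layers):
    """Freeze the embeddings and the first `num_frozen_layers` encoder layers.

    With `num_frozen_layers == 0` every parameter stays trainable.
    Works on any torch.nn.Module; only `named_parameters()` is used.
    """
    for name, param in model.named_parameters():
        match = _LAYER_RE.search(name)
        if match:
            # Encoder layer i is trainable only if it lies above the frozen block.
            param.requires_grad = int(match.group(1)) >= num_frozen_layers
        elif ".embeddings." in name or name.startswith("embeddings."):
            # Embeddings are frozen as soon as at least one layer is frozen.
            param.requires_grad = num_frozen_layers == 0
        else:
            # Pooler and classification head stay trainable.
            param.requires_grad = True
    return model
```

For a 12-layer base model, `freeze_lower_layers(model, 12)` corresponds to technique 3 (entire body frozen, apart from the pooler), and `freeze_lower_layers(model, 0)` to technique 1.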

https://www.analyticsvidhya.com/blog/2020/07/transfer-learning-for-nlp-fine-tuning-bert-for-text-classification/


**OUR GOAL**

We basically want to freeze the backbone (body) and train only the last layers (classifier). Then, in the hyperparameter-tuning step, we want to unfreeze the model and train all of it while also searching for the best hyperparameter settings.

We want to see whether, for our task, this approach performs better than direct training without freezing the body layers.

So it would be helpful if, in a separate notebook, the same pipeline were delivered WITHOUT the initial fine-tuning step, starting directly with the hyperparameter-tuning / training part for CustomModel1 and CustomModel2 (model without custom head, model with custom head).

