## Knowledge Distillation

Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. Auto-regressive sequence models, such as language models (LMs), have shown impressive capabilities in numerous tasks, where the key to this success is often scaling the amount of training data as well as the number of model parameters. However, scaling parameter count comes at a cost, and the deployment of such models is limited by either their inference cost or memory footprint. Thus, a crucial goal for practical use of large capable models is to compress them by reducing their parameter count, while retaining as much as possible of their performance.

One of the prevalent techniques for compressing models is knowledge distillation (Hinton et al., 2015). Distillation is the process of training one model (the student) to replicate the knowledge of another model (the teacher) on a specific set of tasks. Typically, the student has fewer parameters than the teacher, so distillation can improve the student's task-specific performance while keeping its inference cost and memory footprint lower than the teacher's.

In the formulation of Hinton et al. (2015), the large model could be an ensemble of separately trained models or a single large model trained with a very strong regularizer such as dropout. Strong dropout helps ensure that the model learns robust and generalizable features rather than overfitting the training data. Once the large model has been trained, a different kind of training, called "distillation", is used to transfer the knowledge from the large model to a small model that is more suitable for deployment.

However, current KD methods[1] for auto-regressive sequence models suffer from a distribution mismatch between the output sequences seen during training and those generated by the student during inference.

> Why this train-inference mismatch occurs:

- Auto-regressive nature of language models: These models generate text one token at a time, using previously generated tokens as context for the next prediction.

- Training process: During training, the student model is typically exposed to complete, correct sequences from the training data or teacher-generated sequences. The model learns to predict the next token given the correct previous tokens.

- Inference (generation) process: At inference time, the model generates text from scratch or continues from a given prompt. As it generates, it uses its own previous outputs as context for subsequent tokens.

- The mismatch: During training, the model always sees "correct" or "expert-generated" contexts. During inference, the model sees its own generated context, which may contain errors or be less optimal than the training contexts. As generation progresses, these small differences can compound, leading to increasingly divergent contexts.

- Consequences: The partial sequences encountered during inference can be quite different from those seen in training. The model may not have been trained to handle or recover from its own errors or less-than-optimal generations.
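The compounding effect described above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each generated token independently leaves the training distribution with a fixed probability `eps`. Real models do not behave this simply, and both `eps` and the function name are hypothetical.

```python
def prob_context_still_on_distribution(eps: float, num_tokens: int) -> float:
    """Probability that none of the first `num_tokens` generated tokens was an error.

    Assumes an independent per-token error probability `eps`
    (an illustrative simplification, not a property of real LMs).
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be in [0, 1]")
    return (1.0 - eps) ** num_tokens


# Even a small per-token error rate erodes the chance of a "clean" context
# as the generated sequence grows longer.
for t in (10, 100, 500):
    p = prob_context_still_on_distribution(eps=0.01, num_tokens=t)
    print(f"after {t:>3} tokens: {p:.3f}")
```

Under this toy assumption the probability of still being on familiar ground decays geometrically with sequence length, which is why long generations are the most exposed to the mismatch.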


To address this issue, the authors of the On-Policy Distillation paper (https://arxiv.org/pdf/2306.13649) introduce Generalized Knowledge Distillation (GKD).

> Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher. 

> GKD offers flexibility to employ alternative loss functions between the student and teacher, which may be useful when the student lacks the expressivity to mimic the teacher's distribution. 


![Alt text](gkd_algo.png)

Teacher model: `Qwen2-7B-Instruct` 

Student model: `Qwen2-0.5B-Instruct`


**High level tasks:**

1. SFT student model on Teacher completions dataset. 

2. Use the SFT model to generate the output sequences on the fly with temperature of 1 to encourage diversity in generated sequences.

3. Obtain token-level feedback from the teacher's logits and use the `GKDTrainer` from Kashif's branch in TRL, where we choose the divergence to optimize between the teacher and student distributions.
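Step 2 relies on sampling with a temperature of 1. As a minimal, dependency-free sketch of what temperature does to a next-token distribution (the function names and the toy logits are illustrative assumptions, not part of TRL or Transformers):

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up next-token logits
print(softmax_with_temperature(toy_logits, temperature=1.0))
print(softmax_with_temperature(toy_logits, temperature=0.5))
print(sample_token(toy_logits, temperature=1.0, rng=random.Random(0)))
```

At temperature 1 the model's own distribution is sampled unchanged, so lower-probability tokens still appear, which is what gives the student-generated sequences their diversity.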

[1] Current KD methods for auto-regressive sequence models require either generating a fixed set of output sequences from the teacher model (sequence-level KD) or a fixed dataset of sequences that the teacher can label by assigning token-level probabilities (supervised KD).

References: 

- https://arxiv.org/abs/1503.02531 
- https://arxiv.org/pdf/2306.13649
- https://arxiv.org/abs/2106.05237
- https://pytorch.org/tutorials/beginner/knowledge_distillation_tutorial.html 


The plan is to start with something a bit smaller, to validate that it works before scaling up:

- Distill Qwen2-7B-Instruct to Qwen2-0.5B-Instruct
- Use LMSYS prompts as the source of generating student / teacher completions
- `GKDTrainer` branch from Kashif R.: https://github.com/huggingface/trl/pull/1814
- Dataset: https://huggingface.co/datasets/andito/chatbot_arena_completions 
- PR: https://github.com/huggingface/llm-swarm/pull/31/commits/f50230ca5a0cc880e6aab88127bb2dedae0368c7 
- PPO Trainer: https://huggingface.co/docs/trl/ppo_trainer 
- GKD Trainer example script: https://github.com/kashif/trl/blob/gkd-trainer/examples/scripts/gkd.py

In [1]:
files = !ls

if 'requirements.txt' in files:
    !pip install -r requirements.txt

Collecting pandas
  Downloading pandas-2.2.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Collecting pyarrow
  Downloading pyarrow-16.1.0-cp39-cp39-manylinux_2_28_x86_64.whl (40.8 MB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.23.5-py3-none-any.whl (402 kB)
Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting transformers
  Downloading transformers-4.42.4-py3-none-any.whl (9.3 MB)
Collecting torch
  Downloading torch-2.3.1-cp39-cp39-manylinux1_x86_64.whl (779.1 MB)
Collecting torchvision
  Downloading torchvision-0.18.1-cp39-cp39-manylinux1_x86_64.whl (7.0 MB)
Collecting trl
  Downloading trl-0.9.6-py3-none-any.whl (245 kB)
Collecting accelerate
  Downloading accelerate-0.32.1-py3-none-any.whl (314 kB)
Collecting tensorboard
  Downloading tensorboard-2.17.0-py3-none-any.whl (5.5 MB)
Collecting tzdata>=2022.7
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Collecting numpy>=1.22.4
  Downloading numpy-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.3 MB)

In [4]:
# !pip install pandas
# !pip install pyarrow
# !pip install huggingface_hub
# !pip install datasets
# !pip install transformers
# !pip install torch
# !pip install torchvision
# !pip install trl
# !pip install accelerate -U
#!pip install tensorboard

In [9]:
import os
os.getcwd()

'/data/kd_exps/trl/on_policy'

In [3]:
import pandas as pd
import random
from datetime import datetime
from datasets import load_dataset
from datasets import Dataset
from datasets import DatasetDict
import accelerate
import gc

from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM
from huggingface_hub import HfFolder, Repository, create_repo, login

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [4]:
# Check if CUDA is available
cuda_available = torch.cuda.is_available()
print(f"CUDA Available: {cuda_available}")

# Get the number of available GPUs
num_gpus = torch.cuda.device_count()
print(f"Number of available GPUs: {num_gpus}")

CUDA Available: True
Number of available GPUs: 1


In [6]:
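def print_gpu_memory():
    """
    Print the total memory of GPU 0 and the memory currently
    allocated / reserved by PyTorch on the current device.
    """
    if torch.cuda.is_available():
        # Note: total_memory is the device's total capacity, not the memory that is currently free
        print(f"Current GPU memory available: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        # Memory occupied by live tensors
        print(f"Current GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
        # Memory held by PyTorch's caching allocator (allocated + cached blocks)
        print(f"Current GPU memory cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    else:
        print("CUDA is not available")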

In [7]:
print_gpu_memory()

Current GPU memory available: 42.30 GB
Current GPU memory allocated: 0.00 GB
Current GPU memory cached: 0.00 GB


### 1. Read Qwen2-7B-Instruct completions dataset

In [8]:
df = load_dataset("andito/chatbot_arena_completions")

In [9]:
df

DatasetDict({
    train: Dataset({
        features: ['question_id', 'messages'],
        num_rows: 32980
    })
})

In [10]:
df['train'][:1]

{'question_id': ['58210e39b3fd4441a2bd4a518bb44c2d'],
 'messages': [[{'content': 'What is the difference between OpenCL and CUDA?',
    'role': 'user'},
   {'content': "CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language) are two popular GPU computing platforms developed by different companies.\n\nCUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It was specifically developed for NVIDIA GPU hardware and is based on NVIDIA's architecture, making it highly optimized for specific NVIDIA GPUs. CUDA allows developers to directly access the processing power of the GPU and develop parallel-accelerated applications that can run at essentially the same speed as NVIDIA processors. It has a proprietary nature with exclusive use of NVIDIA GPUs and adherence to NVIDIA's hardware and software standards.\n\nOpenCL, on the other hand, is a platform-independent (not tied to a specific hardware vendor) open standard for 

In [11]:
# Split test / eval set
test_size = 1000 / len(df['train'])
test_size

0.030321406913280776

In [12]:
# Split the dataset
split_dataset = df['train'].train_test_split(test_size=test_size, seed=42)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['question_id', 'messages'],
        num_rows: 31980
    })
    test: Dataset({
        features: ['question_id', 'messages'],
        num_rows: 1000
    })
})

In [13]:
# Create a new DatasetDict with both splits
df_v1 = DatasetDict({
                     'train': split_dataset['train'],
                     'test': split_dataset['test']
                   })

df_v1

DatasetDict({
    train: Dataset({
        features: ['question_id', 'messages'],
        num_rows: 31980
    })
    test: Dataset({
        features: ['question_id', 'messages'],
        num_rows: 1000
    })
})

#### 1.1 SFT student model on Teacher completions dataset

In [14]:
# Load model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [15]:
# save to disk
#tokenizer.save_pretrained(os.getcwd() + "/model")
#model.save_pretrained(os.getcwd() + "/model")

In [16]:
# Load model from disk
tokenizer = AutoTokenizer.from_pretrained(os.getcwd() + "/model")
model = AutoModelForCausalLM.from_pretrained(os.getcwd() + "/model")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [17]:
# special tokens
tokenizer.special_tokens_map

{'eos_token': '<|im_end|>',
 'pad_token': '<|endoftext|>',
 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}

In [18]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): Linear(in_featur

Let's add special tokens and format the conversation dataset

In [19]:
# To Pandas
pd_dict = {}

for split, data in df_v1.items():
    
    pd_dict[split] = df_v1[split].to_pandas()

In [20]:
f"Train DF shape: {pd_dict['train'].shape}"

'Train DF shape: (31980, 2)'

In [21]:
f"Test DF shape: {pd_dict['test'].shape}"

'Test DF shape: (1000, 2)'

In [22]:
# Function to format the conversation dataset
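def format_conversation(sample):
    """Format the first user/assistant exchange of a conversation as one training string."""
    text = ""
    for message in sample:  # Assuming a single conversation per example
        if message['role'] == 'user':
            # Overwrites any earlier text, so only the latest user turn before the reply is kept
            text = f"<|im_start|>### User: {message['content']}<|im_end|>"
        elif message['role'] == 'assistant':
            text += f"<|im_start|>### Assistant: {message['content']}"
            # Stop at the first assistant reply and close it with the EOS token
            return text + tokenizer.eos_token
    # No assistant message found: there is no completion to train on
    return None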

In [23]:
# Apply format function to the dataset
pd_dict['train']['texts'] = pd_dict['train'].apply(lambda x: format_conversation(x['messages']), axis=1)
pd_dict['test']['texts'] = pd_dict['test'].apply(lambda x: format_conversation(x['messages']), axis=1)

In [24]:
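# Max sequence length, measured in characters (len(text)).
# max_seq_length in SFTConfig counts tokens, so this value is only a loose proxy for the token limit.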
max_length_train = max(len(text) for text in pd_dict['train']['texts'])
max_length_test = max(len(text) for text in pd_dict['test']['texts'])

max_length = max(max_length_train, max_length_test) 

f"Max Length: {max_length}"

'Max Length: 11796'
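Note that `len(text)` counts characters, whereas `max_seq_length` in the SFT configuration is measured in tokens. A minimal sketch of measuring the limit in tokens instead is shown below. The helper name `max_token_length` is hypothetical, and a whitespace split stands in for a real tokenizer so that the example runs without downloading a model.

```python
def max_token_length(texts, tokenize):
    """Return the largest number of tokens produced by `tokenize` over `texts`."""
    return max(len(tokenize(text)) for text in texts)


# Stand-in tokenizer for illustration only: splits on whitespace.
# With a Hugging Face tokenizer one could pass
# `lambda t: tokenizer(t)["input_ids"]` instead.
toy_texts = ["### User: hi", "### Assistant: hello there, how are you?"]
print(max(len(t) for t in toy_texts))          # character-based length
print(max_token_length(toy_texts, str.split))  # token-based length
```

A token-based limit is usually much smaller than the character-based one, which matters because sequence length drives memory use during training.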

In [25]:
# To HF Dataset
hf_dict = {}

for split, df in pd_dict.items():
            
    hf_df = Dataset.from_pandas(df)
        
    hf_dict[split] = hf_df
    
hf_dataset = DatasetDict(hf_dict)

#### 1.2 SFTTrainer

**Loss function: standard cross-entropy loss**[1]

This is the primary loss function used for language model training. It measures the difference between the predicted probability distribution of tokens and the actual distribution (Completions from Teacher model). 

> The loss is calculated token-by-token across the entire sequence.

> It's then averaged over all non-ignored tokens in the batch.

> The model predicts the next token given all the previous tokens, thus it is called an autoregressive model.


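Loss for a single token prediction: `L = -Σ(y_i * log(p_i))`

Where:

- `y_i` is the true probability of class i (usually 1 for the correct class, 0 for others)
- `p_i` is the predicted probability of class i

With one-hot targets, only the correct class `c` contributes to the sum, so the loss reduces to `L = -log(p_c)`.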


[1] https://discuss.huggingface.co/t/fine-tune-with-sfttrainer/67311 
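The token-by-token loss and the averaging over non-ignored tokens can be sketched in plain Python. This is an illustration of the formula above, not the trainer's actual implementation. It assumes one-hot targets and uses `-100` as the ignored label, which is PyTorch's default `ignore_index` for cross-entropy. The function name and toy numbers are made up.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss


def masked_cross_entropy(token_probs, labels):
    """Mean of -log p(correct token) over positions whose label is not IGNORE_INDEX.

    token_probs: one list of class probabilities per position.
    labels: the correct class index per position, or IGNORE_INDEX.
    """
    losses = [
        -math.log(probs[label])
        for probs, label in zip(token_probs, labels)
        if label != IGNORE_INDEX
    ]
    if not losses:
        raise ValueError("all positions are ignored")
    return sum(losses) / len(losses)


# Toy example: 3 positions, vocabulary of 3 tokens.
# The first position plays the role of a prompt token and is ignored.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
labels = [IGNORE_INDEX, 1, 2]
print(masked_cross_entropy(probs, labels))
```

Masking the prompt positions this way means that only the assistant's completion contributes to the loss, which is the purpose of a completion-only collator.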


In [26]:
print_gpu_memory()

Current GPU memory available: 42.30 GB
Current GPU memory allocated: 0.00 GB
Current GPU memory cached: 0.00 GB


In [27]:
!nvidia-smi

Fri Jul 12 01:57:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1D.0 Off |                    0 |
| N/A   38C    P0              57W / 400W |      7MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [29]:
# Trainer

instruction_template = "### User:"
response_template = "### Assistant:"
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)

sft_config = SFTConfig(
                        max_seq_length=max_length,
                        per_device_train_batch_size=5,
                        per_device_eval_batch_size=5,
                        gradient_accumulation_steps=2,
                        learning_rate=0.0001,
                        weight_decay=0.01, # weight decay (decoupled in AdamW, similar in spirit to L2 regularization)
                        lr_scheduler_type="cosine",
                        warmup_ratio=0.1, # warm up for first 10% of steps
                        num_train_epochs=3,
                        logging_dir="./logs",
                        report_to=["tensorboard"],
                        logging_steps=500, 
                        eval_strategy="steps",
                        eval_steps=500,  # Evaluate steps
                        save_strategy="steps",
                        save_steps=5000, # save checkpoint
                        output_dir = "./results",
                        dataset_text_field="texts",
                        #fp16=True,  # Enable mixed precision training
                        max_grad_norm=1.0,
                        gradient_checkpointing=True,  # Enable gradient checkpointing
                        optim="adamw_torch_fused", 
                        load_best_model_at_end=True,
                        metric_for_best_model="eval_loss"
                      )

trainer = SFTTrainer(
                        model,
                        train_dataset=hf_dataset['train'],
                        eval_dataset=hf_dataset['test'],
                        args=sft_config,
                        data_collator=collator,
                        tokenizer=tokenizer
                    )

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Map:   0%|          | 0/31980 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
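To get a feel for what `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` imply for this run, the sketch below computes the number of optimizer steps and the learning rate at a few points. It follows the common linear-warmup-then-cosine-decay shape and assumes a single GPU, as reported earlier in the notebook. It is an approximation written for illustration, not the library's code, and the helper names are hypothetical.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Number of optimizer updates over the whole run (approximate for uneven splits)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


def lr_at_step(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup to `peak_lr`, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values taken from the configuration above: 31980 training rows,
# batch size 5, gradient accumulation 2, 3 epochs, learning rate 1e-4.
total = total_optimizer_steps(31980, per_device_batch=5, grad_accum=2, epochs=3)
print(total)
for step in (0, 500, 5000, total):
    print(step, lr_at_step(step, total, peak_lr=1e-4, warmup_ratio=0.1))
```

The learning rate is still ramping up at the first evaluation (step 500) and only reaches its peak once the warmup phase ends, which is worth keeping in mind when reading the early loss values.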

In [30]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


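| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 2.1378 | 2.364868 |
| 1000 | 2.5029 | 2.74663  |
| 1500 | 2.7192 | 2.816531 |
| 2000 | 2.7407 | 2.780465 |
| 2500 | 2.6879 | 2.758202 |
| 3000 | 2.6691 | 2.735217 |
| 3500 | 2.3101 | 2.73681  |
| 4000 | 2.0729 | 2.696918 |
| 4500 | 2.062  | 2.663558 |
| 5000 | 2.0431 | 2.602335 |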


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)



In [31]:
# Evaluate the model on the test set after training
test_results = trainer.evaluate()
print("Final test results:", test_results)

Final test results: {'eval_loss': 2.602334976196289, 'eval_runtime': 49.8678, 'eval_samples_per_second': 20.053, 'eval_steps_per_second': 4.011, 'epoch': 3.0}


In [32]:
# Save the best model
trainer.save_model("./fine_tuned_model")

In [35]:
# Push to HF Hub / Repo

# Load the SFT model and tokenizer from disk
model = AutoModelForCausalLM.from_pretrained("./fine_tuned_model")
tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_model")

# Set the organization and repository name
org_name = "Distillation-Hugs"  # Replace with your organization name
repo_name = "kd_exps"  # Replace with your desired model name

# Create a new repository
#repo_url = create_repo(repo_id="Distillation-Hugs/kd_exps", repo_type="model", private=False)

# Clone the empty repository
#repo = Repository(local_dir="./hf_model_repo", clone_from=repo_url)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [36]:
#model.push_to_hub(repo_id=f"{org_name}/{repo_name}")

README.md:   0%|          | 0.00/5.09k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Distillation-Hugs/kd_exps/commit/932dd2ebadace38124e0b520a45c0eecf47f1adb', commit_message='Upload Qwen2ForCausalLM', commit_description='', oid='932dd2ebadace38124e0b520a45c0eecf47f1adb', pr_url=None, pr_revision=None, pr_num=None)

In [37]:
#tokenizer.push_to_hub(repo_id=f"{org_name}/{repo_name}")

CommitInfo(commit_url='https://huggingface.co/Distillation-Hugs/kd_exps/commit/e179a88345ef2ff04dfe6303afb131ebddbc2d51', commit_message='Upload tokenizer', commit_description='', oid='e179a88345ef2ff04dfe6303afb131ebddbc2d51', pr_url=None, pr_revision=None, pr_num=None)

### GKD 

- GKD trains the student on its self-generated, on-policy sequences instead of a fixed set of output sequences, using the teacher's probabilities as expert labels on these sequences.

**Key distinctions between On-Policy KD and GKD**

> Data source:

- On-policy KD uses only student-generated data.
- GKD uses a mixture of fixed dataset and student-generated data.

> Flexibility:

- On-policy KD is more focused on addressing the train-inference mismatch.
    - Because the student is trained on contexts it generated itself, the contexts it sees during training match those it encounters at inference, including its own errors.
- GKD is a more general framework that can encompass both supervised and on-policy approaches.

> Hyperparameters:

- GKD introduces λ to control the balance between fixed and student-generated data.
- GKD allows for experimentation with different divergence measures.

> Generalizability:

- On-policy KD can be seen as a specific instance of GKD (when λ = 1 and using forward KL divergence).
- GKD can represent a spectrum of approaches, from fully supervised (λ = 0) to fully on-policy (λ = 1).
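The role of λ can be sketched as a per-example coin flip between the fixed dataset and a fresh student sample. The code below is a simplified illustration of that mixing, not the `GKDTrainer` implementation. `choose_training_sequence` and `fake_student_generate` are hypothetical names, and the student's generation is replaced by a stub.

```python
import random


def choose_training_sequence(prompt, fixed_output, student_generate, lam, rng=random):
    """Pick the output sequence the student is trained on for one example.

    With probability `lam` use a fresh student sample (on-policy);
    otherwise use the fixed dataset output (supervised).
    Returns (sequence, source_label).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if rng.random() < lam:
        return student_generate(prompt), "student"
    return fixed_output, "fixed"


# Hypothetical stand-in for sampling from the student model.
def fake_student_generate(prompt):
    return prompt + " <student continuation>"


rng = random.Random(0)
sources = [
    choose_training_sequence("Q:", "teacher answer", fake_student_generate, lam=0.5, rng=rng)[1]
    for _ in range(1000)
]
print(sources.count("student") / len(sources))
```

Setting `lam=0` always returns the fixed output and `lam=1` always returns a student sample, matching the two ends of the spectrum described above. In either case the teacher's token-level probabilities on the chosen sequence provide the training signal.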

**Problem setup:**

- We are given two auto-regressive sequence models of different capacity, where `pS` and `pT` refer to the student and teacher respectively.

- We assume that the student has learnable parameters `θ` and that `pSθ` (the student model) is differentiable w.r.t. `θ`. This is important because the divergence between teacher and student is minimized with gradient-based optimization over `θ`.

- We have generated a dataset by sampling sequences from the teacher. 

- Divergence `D` is defined as the discrepancy between token-level distributions of `pT` and `pS`



In [28]:
import sys

# Get the directory of the current script
#current_dir = os.path.dirname(os.path.abspath(__file__))

# Add the directory containing your .py file to sys.path
#sys.path.append(os.path.join(os.getcwd(), '/trl/trainer/'))

for path in sys.path:
    print(path)

/home/user/miniconda/lib/python39.zip
/home/user/miniconda/lib/python3.9
/home/user/miniconda/lib/python3.9/lib-dynload

/home/user/miniconda/lib/python3.9/site-packages
/trl_local/trl/trainer/
/trl/trainer/
/data/kd_exps
/data/kd_exps


In [29]:
os.getcwd()

'/data/kd_exps/trl/on_policy'

In [30]:
sys.path.append(os.path.dirname(os.path.dirname(os.getcwd())))

for path in sys.path:
    print(path)

/home/user/miniconda/lib/python39.zip
/home/user/miniconda/lib/python3.9
/home/user/miniconda/lib/python3.9/lib-dynload

/home/user/miniconda/lib/python3.9/site-packages
/trl_local/trl/trainer/
/trl/trainer/
/data/kd_exps
/data/kd_exps
/data/kd_exps


In [33]:
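# Note: these modules ship with the GKD branch (https://github.com/huggingface/trl/pull/1814);
# the trl 0.9.6 wheel installed above does not include them, hence the error below.
from trl.trainer.gkd_config import GKDConfig
from trl.trainer.gkd_trainer import GKDTrainer
from trl.trainer.model_config import ModelConfig
from trl.trainer.callbacks import RichProgressCallback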

ModuleNotFoundError: No module named 'trl.trainer.gkd_config'

In [27]:
from trl.trainer import (
    GKDConfig,
    GKDTrainer,
    ModelConfig,
    RichProgressCallback
)

ImportError: cannot import name 'GKDConfig' from 'trl.trainer' (/home/user/miniconda/lib/python3.9/site-packages/trl/trainer/__init__.py)