### Extending bidirectional attention for LLMs via ULLME. 

In [1]:
from ullme.models import ULLME

model = ULLME(
            model_name_or_path="microsoft/phi-1_5",
            model_backbone_type="phi",
            )
model.cuda()
print("Model Architecture: ")
print(model)

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.


Tokenizer does not have a pad token. We will use the bos token as pad token.
Model Architecture: 
ULLME(
  (model): BidirectionalPhiForCausalLM(
    (model): BidirectionalPhi(
      (embed_tokens): Embedding(51200, 2048)
      (embed_dropout): Dropout(p=0.0, inplace=False)
      (layers): ModuleList(
        (0-23): 24 x PhiDecoderLayer(
          (self_attn): PhiFlashAttention2(
            (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (dense): Linear(in_features=2048, out_features=2048, bias=True)
            (rotary_emb): PhiRotaryEmbedding()
          )
          (mlp): PhiMLP(
            (activation_fn): NewGELUActivation()
            (fc1): Linear(in_features=2048, out_features=8192, bias=True)
            (fc2): Linear(in_features=8192, out_features=2048, bias=True)
          )
          (input_
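
The `Bidirectional*` class names in the printed architecture indicate that the decoder's causal attention restriction is lifted. As a minimal, library-independent sketch (the helper names below are hypothetical and this is not ULLME's implementation), the difference between a causal mask and a bidirectional mask can be shown as:

```python
def causal_mask(seq_len):
    """Lower-triangular mask: token i may attend only to tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """Full mask: every token may attend to every other token."""
    return [[1] * seq_len for _ in range(seq_len)]


def apply_padding(mask, attention_mask):
    """Zero out the columns of padded positions (attention_mask[j] == 0)."""
    return [[m * a for m, a in zip(row, attention_mask)] for row in mask]


if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
    for row in apply_padding(bidirectional_mask(4), [1, 1, 1, 0]):
        print(row)
```

With a bidirectional mask, the representation of an early token can depend on later tokens, which is usually what is wanted for sentence embeddings; padding positions still have to be masked out.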

We also support LoRA patching for parameter-efficient fine-tuning.

In [None]:
from ullme.models import ULLME

lora_model = ULLME(
            model_name_or_path="microsoft/phi-1_5",
            model_backbone_type="phi",
            lora_name="ullme-phi",
            loar_r=16,
            lora_alpha=32,
            )
lora_model.cuda()
print("Model Architecture: ")
print(lora_model)

Compute sequence representations with bidirectionally extended LLMs

In [7]:
import time
input_sentence = "This a example sentence. " * 64 * 2
model_inputs = model.tokenizer(
                            [input_sentence] * 2,
                            return_tensors='pt'
                            )
t0 = time.time()
model_output = model(
                    input_ids=model_inputs['input_ids'].cuda(),
                    attention_mask=model_inputs['attention_mask'].cuda(),
                    is_generate=False
                    )
print("Time taken: ", time.time() - t0)
reps = model_output['reps']
print("Reps Shape: ", reps.shape)
print("Reps: ", reps)

Time taken:  0.070068359375
Reps Shape:  torch.Size([2, 2048])
Reps:  tensor([[-0.0540, -0.5312, -0.7500,  ..., -0.3770, -1.8828, -0.9961],
        [-0.0540, -0.5312, -0.7500,  ..., -0.3770, -1.8828, -0.9961]],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>)
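
Both inputs are the same sentence, so the two rows of `reps` are identical. Representations like these are typically compared with cosine similarity. The sketch below is only an illustration: it uses short plain-Python lists as stand-ins for rows of `reps`, and `cosine_similarity` is a hypothetical helper, not part of ULLME.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for 2048-dimensional rows of `reps`.
rep_a = [-0.0540, -0.5312, -0.7500]
rep_b = [-0.0540, -0.5312, -0.7500]  # same sentence, same representation
rep_c = [0.7500, -0.1000, 0.2000]    # made-up vector for a different sentence

if __name__ == "__main__":
    print("same sentence:     ", cosine_similarity(rep_a, rep_b))
    print("different sentence:", cosine_similarity(rep_a, rep_c))
```

On the real tensors, the same comparison could be done with `torch.nn.functional.cosine_similarity(reps[0:1].float(), reps[1:2].float())`, assuming `reps` has the shape shown above.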


### Evaluating on MTEB datasets via ULLME.

ULLME supports most LLMs available on the Hugging Face Hub. As an example, we use the top-1 model on the MTEB leaderboard (dunzhang/stella_en_1.5B_v5).

In [None]:
from ullme.models import WrappedULLME
from ullme.eval import eval_mteb_dataset


model = WrappedULLME(model_name_or_path='dunzhang/stella_en_1.5B_v5')
print("Model Architecture: ")
print(model)

After loading the model, you need to select specific datasets and language subsets for evaluation. 

In [None]:
eval_result = eval_mteb_dataset(
                                model=model,
                                dataset_name='ArguAna',
                                langs=['eng'],
                                )
print("Eval Result: ", eval_result)
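
ArguAna is a retrieval task, and MTEB reports nDCG@10 as the main score for retrieval datasets. To make that number easier to interpret, here is a minimal sketch of nDCG@k for a single query. It assumes the input is the list of relevance labels of all candidate documents in ranked order; the function names are hypothetical, and this is not the code path that `eval_mteb_dataset` runs.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    print(ndcg_at_k([1, 0, 0]))  # relevant document ranked first
    print(ndcg_at_k([0, 1, 0]))  # relevant document ranked second
```

The reported dataset score is then the average of this value over all queries.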

### Fine-tune LLMs with ULLME

We support various training strategies, including contrastive loss, SFT, DPO, and GRL. The following snippet illustrates how to use ULLME to fine-tune an LLM for dense retrieval.

``` python
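from ullme.trainer import GradCacheTrainer

# `model` and `train_dataloader` are assumed to be defined beforehand.
trainer = GradCacheTrainer(
    con_loss_type='NTXentLoss',  # contrastive loss
    gen_loss_type='sigmoid',     # generation loss; 'sft' is the other option
    use_kl_loss=True,
)
trainer.fit_epoch(
    model=model,
    train_loader=train_dataloader,
)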
```
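
For intuition about the contrastive term, the sketch below computes an InfoNCE-style in-batch contrastive loss, the family that NT-Xent belongs to. It is a simplified illustration under stated assumptions: plain Python lists stand in for query and passage embeddings, `passages[i]` is the positive for `queries[i]`, all other passages in the batch act as negatives, and the temperature value is arbitrary. The function names are hypothetical, and this is not claimed to match ULLME's `NTXentLoss` exactly.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_contrastive_loss(queries, passages, temperature=0.05):
    """Mean cross-entropy of picking passages[i] for queries[i] among the batch."""
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all passages in the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned pairs:   ", in_batch_contrastive_loss(queries, queries))
    print("mismatched pairs:", in_batch_contrastive_loss(queries, queries[::-1]))
```

The loss is close to zero when each query is most similar to its own passage and grows when a negative in the batch scores higher than the positive.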

In addition, ULLME supports GradCache, cross-device contrastive loss, multi-GPU training, and other features that further improve the training process. Please refer to the documentation and the file ```ullme/train.py``` for further information.