First experiment details #7

@omriel1

Description

MNTP training

Data

We've used 30K pieces of text (1.5 GB) from the Hebrew Wikipedia, as found in https://huggingface.co/datasets/HeNLP/HeDC4

Training

We've used an A100 with 40 GB of memory; the run took a bit more than 5 hours.

Configurations

We've used the following configuration:

{
    "model_name_or_path": "dicta-il/dictalm2.0-instruct",
    "dataset_name": "HeNLP/HeDC4",
    "dataset_number_of_rows": 30000,
    "streaming": true,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "do_train": true,
    "do_eval": true,
    "max_seq_length": 512,
    "mask_token_type": "blank",
    "data_collator_type": "all_mask",
    "mlm_probability": 0.8,
    "overwrite_output_dir": true,
    "output_dir": "output/mntp/dictalm2.0-instruct",
    "evaluation_strategy": "steps",
    "eval_steps": 100,
    "save_steps": 200,
    "stop_after_n_steps": 1000,
    "lora_r": 16,
    "gradient_checkpointing": true,
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2"
}
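With "mask_token_type": "blank" and "data_collator_type": "all_mask", 80% of the tokens in each sequence are replaced by a blank mask token and become prediction targets. A minimal sketch of that masking step (the function name and the use of Python's random module are our own illustration; the real collator lives in LLM2Vec's run_mntp.py):

```python
import random

def mask_for_mntp(input_ids, mask_token_id, mlm_probability=0.8, seed=0):
    """Mask each token with probability mlm_probability.
    Labels keep the original id at masked positions and -100
    (the ignore index) elsewhere; inside the model the labels are
    shifted by one so each masked token is predicted from the
    position *before* it (the "next token" part of MNTP)."""
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for tok in input_ids:
        if rng.random() < mlm_probability:
            masked_ids.append(mask_token_id)
            labels.append(tok)       # predict the original token
        else:
            masked_ids.append(tok)
            labels.append(-100)      # ignored by the loss
    return masked_ids, labels
```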

Model

The fine-tuned model (i.e. the LoRA weights) can be found here: https://drive.google.com/drive/folders/1Fhdon36tHimOM6DIKBqM48h-wE-wmtb8

Code changes

We've modified the data loading section in the run_mntp.py script to load the data from our chosen dataset and filter out invalid rows.
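The filtering itself can be as simple as dropping rows whose text field is missing or empty. A hypothetical sketch (the column name "text" and the exact validity criteria are assumptions, not the code actually used):

```python
def is_valid_row(example, text_column="text"):
    """Keep only rows whose text field is a non-empty string."""
    text = example.get(text_column)
    return isinstance(text, str) and len(text.strip()) > 0

# With a streaming Hugging Face dataset this would be applied as:
#   dataset = dataset.filter(is_valid_row)
```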

SimCSE training

Data

We've used 200K pieces of text from the same source (Hebrew Wikipedia, as found in https://huggingface.co/datasets/HeNLP/HeDC4). However, since these texts are relatively long compared to those used in the paper, we applied semantics-aware chunking (splitting by paragraphs, then sentences, then new lines, with a maximum chunk size of 250 characters), which we estimate yielded around 2M pieces of text.
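A minimal sketch of such chunking, recursively splitting on paragraph breaks, then sentence boundaries, then newlines until every piece fits in 250 characters (our own simplified illustration, which drops the separators themselves, not the exact code used):

```python
def chunk_text(text, max_len=250, separators=("\n\n", ". ", "\n")):
    """Split text on the coarsest separator first, recursing with
    finer separators only for pieces still over max_len."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        chunks.extend(chunk_text(part, max_len, rest))
    return chunks
```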

Training

We've used an A100 with 80 GB of memory; the run took ~4 hours.

Configurations

We've used the following configuration:

{
    "model_name_or_path": "dicta-il/dictalm2.0-instruct",
    "peft_model_name_or_path": "./output/mntp/dictalm2.0-instruct",
    "simcse_dropout": 0.3,
    "bidirectional": true,
    "pooling_mode": "mean",
    "dataset_name": "HeNLP/HeDC4",
    "dataset_start_index": 30000,
    "dataset_limit": 200000,
    "learning_rate": 3e-5,
    "loss_scale": 20,
    "per_device_train_batch_size": 128,
    "gradient_accumulation_steps": 1,
    "do_train": true,
    "disable_tqdm": false,
    "max_seq_length": 128,
    "overwrite_output_dir": true,
    "output_dir": "output/mntp-simcse/dictalm2.0-instruct",
    "logging_steps": 50,
    "save_steps": 200,
    "save_only_model": true,
    "stop_after_n_steps": 1000,
    "lora_r": 16,
    "gradient_checkpointing": true,
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2",
    "seed": 42
}
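For reference, the training signal under this configuration: each text is encoded twice with dropout ("simcse_dropout": 0.3), the two embeddings of the same text form a positive pair against in-batch negatives, and pairs are scored by cosine similarity scaled by "loss_scale": 20. A pure-Python sketch of that InfoNCE objective (our own illustration of the standard SimCSE loss, not the repo's exact code):

```python
import math

def simcse_loss(emb_a, emb_b, scale=20.0):
    """InfoNCE over two views of the same batch of embeddings:
    row i of emb_a should match row i of emb_b against all rows."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    total = 0.0
    for i, ai in enumerate(a):
        # Scaled cosine similarities against every second-view row.
        logits = [scale * sum(x * y for x, y in zip(ai, bj)) for bj in b]
        m = max(logits)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)  # cross-entropy, target = i
    return total / len(a)
```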

Model

The fine-tuned model (i.e. the LoRA weights) can be found here:
https://drive.google.com/drive/folders/1Ae9bg7cxzoa6Z5VUULVfW5SDPhbnDvFK

Code changes

We've modified the data loading section in the run_simcse.py script to support our new dataset and apply the preprocessing described above.
