MNTP training
Data
We've used 30K pieces of text (1.5 GB) from the Hebrew Wikipedia, as found in https://huggingface.co/datasets/HeNLP/HeDC4
Training
We've used an A100 with 40 GB of memory; the run took a bit more than 5 hours
Configurations
We've used the following configuration:
{
"model_name_or_path": "dicta-il/dictalm2.0-instruct",
"dataset_name": "HeNLP/HeDC4",
"dataset_number_of_rows": 30000,
"streaming": "True",
"per_device_train_batch_size": 32,
"per_device_eval_batch_size": 32,
"gradient_accumulation_steps": 1,
"do_train": true,
"do_eval": true,
"max_seq_length": 512,
"mask_token_type": "blank",
"data_collator_type": "all_mask",
"mlm_probability": 0.8,
"overwrite_output_dir": true,
"output_dir": "output/mntp/dictalm2.0-instruct",
"evaluation_strategy": "steps",
"eval_steps": 100,
"save_steps": 200,
"stop_after_n_steps": 1000,
"lora_r": 16,
"gradient_checkpointing": true,
"torch_dtype": "bfloat16",
"attn_implementation": "flash_attention_2"
}
Model
The finetuned model (i.e. the LoRA weights) can be found here: https://drive.google.com/drive/folders/1Fhdon36tHimOM6DIKBqM48h-wE-wmtb8
Code changes
We've modified the data loading section in the run_mntp.py script to load the data from our chosen dataset and filter out invalid rows
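The change can be sketched roughly as follows. The function names, the `text` column, and the validity criterion (a non-empty string of some minimum length) are assumptions for illustration, not the script's actual code:

```python
def is_valid_row(row, min_chars=32):
    # Assumed validity criterion: the `text` field exists and is a
    # non-empty string with at least `min_chars` characters.
    text = row.get("text")
    return isinstance(text, str) and len(text.strip()) >= min_chars

def load_hedc4_for_mntp(num_rows=30000):
    # Lazy import so the predicate above stays dependency-free.
    from datasets import load_dataset
    # Stream the dataset so the full corpus is never downloaded;
    # keep only the first `num_rows` rows that pass the filter.
    stream = load_dataset("HeNLP/HeDC4", split="train", streaming=True)
    return stream.filter(is_valid_row).take(num_rows)
```

Streaming matters here because HeDC4 is far larger than the 1.5 GB slice actually used for training.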
SimCSE training
Data
We've used 200K pieces of text from the same source (Hebrew Wikipedia, as found in https://huggingface.co/datasets/HeNLP/HeDC4). However, since these pieces of text are relatively long compared to those used in the paper, we applied semantic-aware chunking (trying to split by paragraphs, then sentences, then new lines, with a maximum chunk size of 250 characters), which we estimate resulted in around 2M pieces of text.
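The chunking described above can be sketched as a simple greedy recursive splitter. This is a simplified stand-in, not the exact implementation: the separator order and the boundary handling (separators are dropped at split points) are assumptions:

```python
def chunk_text(text, max_len=250):
    """Greedily split `text` into chunks of at most `max_len` characters,
    preferring paragraph breaks, then line breaks, then sentence ends.
    (Separators at split points are dropped in this sketch.)"""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []

    # Pick the coarsest separator that actually occurs in the text.
    for sep in ("\n\n", "\n", ". "):
        if sep in text:
            parts = [p.strip() for p in text.split(sep) if p.strip()]
            break
    else:
        # No natural boundary left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    chunks, current = [], ""
    for part in parts:
        candidate = f"{current} {part}".strip()
        if len(candidate) <= max_len:
            current = candidate      # part still fits in the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) > max_len:
            # Part alone is still too long: recurse with finer separators.
            chunks.extend(chunk_text(part, max_len))
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks
```

Each recursion level only sees strictly finer separators, so the splitter always terminates, and every emitted chunk is at most `max_len` characters.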
Training
We've used an A100 with 80 GB of memory; the run took ~4 hours
Configurations
We've used the following configuration:
{
"model_name_or_path": "dicta-il/dictalm2.0-instruct",
"peft_model_name_or_path": "./output/mntp/dictalm2.0-instruct",
"simcse_dropout": 0.3,
"bidirectional": true,
"pooling_mode": "mean",
"dataset_name": "HeNLP/HeDC4",
"dataset_start_index": 30000,
"dataset_limit": 200000,
"learning_rate": 3e-5,
"loss_scale": 20,
"per_device_train_batch_size": 128,
"gradient_accumulation_steps": 1,
"do_train": true,
"disable_tqdm": false,
"max_seq_length": 128,
"overwrite_output_dir": true,
"output_dir": "output/mntp-simcse/dictalm2.0-instruct",
"logging_steps": 50,
"save_steps": 200,
"save_only_model": true,
"stop_after_n_steps": 1000,
"lora_r": 16,
"gradient_checkpointing": true,
"torch_dtype": "bfloat16",
"attn_implementation": "flash_attention_2",
"seed": 42
}
Model
The finetuned model (i.e. the LoRA weights) can be found here:
https://drive.google.com/drive/folders/1Ae9bg7cxzoa6Z5VUULVfW5SDPhbnDvFK
Code changes
We've modified the data loading section in the run_simcse.py script to support our new dataset and apply the preprocessing described above.
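A rough sketch of what that modification could look like, matching the configuration's dataset_start_index/dataset_limit values. `build_simcse_corpus`, `simple_chunk`, and the `text` column are hypothetical names, and `simple_chunk` is only a crude stand-in for the semantic-aware chunker described in the Data section:

```python
from itertools import islice

def build_simcse_corpus(rows, start=30000, limit=200000, max_len=250):
    """Select `limit` rows beginning at `start` (skipping the rows already
    used for MNTP) and flatten each document into short chunks."""
    texts = []
    for row in islice(rows, start, start + limit):
        texts.extend(simple_chunk(row["text"], max_len))
    return texts

def simple_chunk(text, max_len):
    # Crude stand-in for the semantic-aware chunker: split on blank
    # lines, then hard-cut any piece that is still too long.
    pieces = [p.strip() for p in text.split("\n\n") if p.strip()]
    out = []
    for p in pieces:
        out.extend(p[i:i + max_len] for i in range(0, len(p), max_len))
    return out
```

Starting at row 30000 keeps the SimCSE corpus disjoint from the 30K rows used for MNTP, so the two stages never train on the same text.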