MNTP training
Data
We've used 30K pieces of text (1.5 GB) from the Hebrew Wikipedia, as found in https://huggingface.co/datasets/HeNLP/HeDC4
Training
We've used an A100 with 40 GB of memory; the run took a bit more than 5 hours
Configurations
We've used the following configuration:
{
"model_name_or_path": "dicta-il/dictalm2.0-instruct",
"dataset_name": "HeNLP/HeDC4",
"dataset_number_of_rows": 30000,
"streaming": "True",
"per_device_train_batch_size": 32,
"per_device_eval_batch_size": 32,
"gradient_accumulation_steps": 1,
"do_train": true,
"do_eval": true,
"max_seq_length": 512,
"mask_token_type": "blank",
"data_collator_type": "all_mask",
"mlm_probability": 0.8,
"overwrite_output_dir": true,
"output_dir": "output/mntp/dictalm2.0-instruct",
"evaluation_strategy": "steps",
"eval_steps": 100,
"save_steps": 200,
"stop_after_n_steps": 1000,
"lora_r": 16,
"gradient_checkpointing": true,
"torch_dtype": "bfloat16",
"attn_implementation": "flash_attention_2"
}
Model
The finetuned model (i.e. the LoRA weights) can be found here: https://drive.google.com/drive/folders/1Fhdon36tHimOM6DIKBqM48h-wE-wmtb8
Code changes
We've modified the data loading section in the run_mntp.py script to load the data from our chosen dataset and filter out invalid rows
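The change can be sketched roughly as follows. The function names, the `text` column, and the validity criterion (a non-empty string of some minimum length) are assumptions for illustration, not the script's actual code:

```python
def is_valid_row(row, min_chars=32):
    # Assumed validity criterion: the `text` field exists and is a
    # non-empty string with at least `min_chars` characters.
    text = row.get("text")
    return isinstance(text, str) and len(text.strip()) >= min_chars

def load_hedc4_for_mntp(num_rows=30000):
    # Lazy import so the predicate above stays dependency-free.
    from datasets import load_dataset
    # Stream the dataset so the full corpus is never downloaded;
    # keep only the first `num_rows` rows that pass the filter.
    stream = load_dataset("HeNLP/HeDC4", split="train", streaming=True)
    return stream.filter(is_valid_row).take(num_rows)
```

Streaming matters here because HeDC4 is far larger than the 1.5 GB slice actually used for training.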
SimCSE training
Data
We've used 200K pieces of text from the same source (Hebrew Wikipedia, as found in https://huggingface.co/datasets/HeNLP/HeDC4). However, since these pieces of text are relatively long compared to those used in the paper, we applied semantic-aware chunking (trying to split by paragraphs, then sentences, then new lines, with a maximum chunk size of 250 characters), which we estimate resulted in around 2M pieces of text.
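The chunking described above can be sketched as a simple greedy recursive splitter. This is a simplified stand-in, not the exact implementation: the separator order and the boundary handling (separators are dropped at split points) are assumptions:

```python
def chunk_text(text, max_len=250):
    """Greedily split `text` into chunks of at most `max_len` characters,
    preferring paragraph breaks, then line breaks, then sentence ends.
    (Separators at split points are dropped in this sketch.)"""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []

    # Pick the coarsest separator that actually occurs in the text.
    for sep in ("\n\n", "\n", ". "):
        if sep in text:
            parts = [p.strip() for p in text.split(sep) if p.strip()]
            break
    else:
        # No natural boundary left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    chunks, current = [], ""
    for part in parts:
        candidate = f"{current} {part}".strip()
        if len(candidate) <= max_len:
            current = candidate      # part still fits in the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) > max_len:
            # Part alone is still too long: recurse with finer separators.
            chunks.extend(chunk_text(part, max_len))
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks
```

Each recursion level only sees strictly finer separators, so the splitter always terminates, and every emitted chunk is at most `max_len` characters.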
Training
We've used an A100 with 80 GB of memory; the run took ~4 hours
Configurations
We've used the following configuration:
{
"model_name_or_path": "dicta-il/dictalm2.0-instruct",
"peft_model_name_or_path": "./output/mntp/dictalm2.0-instruct",
"simcse_dropout": 0.3,
"bidirectional": true,
"pooling_mode": "mean",
"dataset_name": "HeNLP/HeDC4",
"dataset_start_index": 30000,
"dataset_limit": 200000,
"learning_rate": 3e-5,
"loss_scale": 20,
"per_device_train_batch_size": 128,
"gradient_accumulation_steps": 1,
"do_train": true,
"disable_tqdm": false,
"max_seq_length": 128,
"overwrite_output_dir": true,
"output_dir": "output/mntp-simcse/dictalm2.0-instruct",
"logging_steps": 50,
"save_steps": 200,
"save_only_model": true,
"stop_after_n_steps": 1000,
"lora_r": 16,
"gradient_checkpointing": true,
"torch_dtype": "bfloat16",
"attn_implementation": "flash_attention_2",
"seed": 42
}
Model
The finetuned model (i.e. the LoRA weights) can be found here:
https://drive.google.com/drive/folders/1Ae9bg7cxzoa6Z5VUULVfW5SDPhbnDvFK
Code changes
We've modified the data loading section in the run_simcse.py script to support our new dataset and apply the preprocessing described above.
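A rough sketch of what that modification could look like, matching the configuration's dataset_start_index/dataset_limit values. `build_simcse_corpus`, `simple_chunk`, and the `text` column are hypothetical names, and `simple_chunk` is only a crude stand-in for the semantic-aware chunker described in the Data section:

```python
from itertools import islice

def build_simcse_corpus(rows, start=30000, limit=200000, max_len=250):
    """Select `limit` rows beginning at `start` (skipping the rows already
    used for MNTP) and flatten each document into short chunks."""
    texts = []
    for row in islice(rows, start, start + limit):
        texts.extend(simple_chunk(row["text"], max_len))
    return texts

def simple_chunk(text, max_len):
    # Crude stand-in for the semantic-aware chunker: split on blank
    # lines, then hard-cut any piece that is still too long.
    pieces = [p.strip() for p in text.split("\n\n") if p.strip()]
    out = []
    for p in pieces:
        out.extend(p[i:i + max_len] for i in range(0, len(p), max_len))
    return out
```

Starting at row 30000 keeps the SimCSE corpus disjoint from the 30K rows used for MNTP, so the two stages never train on the same text.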