
Sentence Transformer training support #27

Open
sciencecw opened this issue Jan 23, 2024 · 10 comments
Labels: bug, documentation

@sciencecw
I ran the following training script:

python run.py \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --max_seq_length 128 \
    --model_name_or_path t5-small \
    --dataset_name msmarco \
    --embedder_model_name sentence-transformers/all-MiniLM-L6-v2 \
    --num_repeat_tokens 16 \
    --embedder_no_grad True \
    --num_train_epochs 1 \
    --max_eval_samples 500 \
    --eval_steps 20000 \
    --warmup_steps 10000 \
    --use_frozen_embeddings_as_input True \
    --experiment inversion \
    --lr_scheduler_type constant_with_warmup \
    --learning_rate 0.001 \
    --output_dir ./saves/gtr-1
And I got the following traceback:

File "/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3470, in _map_single
batch = apply_function_on_filtered_inputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3349, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/sciencecw/Repos/references/vec2text/vec2text/tokenize_data.py", line 130, in embed_dataset_batch
batch["frozen_embeddings"] = model.call_embedding_model(**emb_input_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: InversionModel.call_embedding_model() got an unexpected keyword argument 'token_type_ids'
[2024-01-23 14:19:47,988] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-01-23 14:19:47,988] torch._dynamo.utils: [INFO] Function Runtimes (s)
[2024-01-23 14:19:47,988] torch._dynamo.utils: [INFO] ---------- --------------
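For what it's worth, the trace points at a likely cause: all-MiniLM-L6-v2 uses a BERT-style tokenizer, which emits token_type_ids, and InversionModel.call_embedding_model() doesn't accept that keyword. A minimal sketch of a local workaround (embed_batch_safe is a hypothetical helper, not part of the repo):

import torch

def embed_batch_safe(model, embedder_tokenizer, texts):
    """Hypothetical workaround: tokenize for the embedder, then drop
    tokenizer outputs (like BERT's token_type_ids) that
    InversionModel.call_embedding_model() does not accept."""
    emb_input_ids = embedder_tokenizer(
        texts, padding=True, truncation=True, return_tensors="pt"
    )
    emb_input_ids.pop("token_type_ids", None)  # the rejected keyword
    with torch.no_grad():
        return model.call_embedding_model(**emb_input_ids)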

jxmorris12 added the bug label Jan 23, 2024
@jxmorris12 (Owner)

Thanks; I'll look into it. In the meantime, I've trained a decent inversion model using almost exactly these settings, which is available here: https://huggingface.co/jxm/sentence-transformers_all-MiniLM-L6-v2__msmarco__128

@jxmorris12 (Owner)

I ran this command and it worked fine for me.

@jxmorris12 (Owner)

I'm on an Apple M1.

@sciencecw (Author)

Also M1 Mac...

@sciencecw (Author)

Silly question: how do you use the trained inverter with your repo?
https://huggingface.co/jxm/sentence-transformers_all-MiniLM-L6-v2__msmarco__128

@sciencecw (Author) commented Jan 24, 2024

I commented out a few lines so that I could use load_corrector with other models:

import vec2text
corrector = vec2text.load_corrector("jxm/sentence-transformers_all-MiniLM-L6-v2__msmarco__128")
vec2text.invert_strings(
    [
        "Jack Morris is a PhD student at Cornell Tech in New York City",
        "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity"
    ],
    corrector=corrector,
)

The model returns:

["Yes, I'm playing the Weightless Option with the help of the Tiny Cells. I'm playing the Weightless Option with the help of the Tiny Cells. I'm growing older.",
 'I have tried the following: 1. I have tried the following: 2. I have tried the following: 3. I have tried the following: 4. I have tried the following: 5. I have tried the following: 6. I have tried the following: 7. I have tried the following: 8. I have tried the following: 9. I have tried the following: 10. I have tried the following: 11. I have tried the following: 12. I have tried the following: 13. I have tried the following: 14. I have tried the following: 15. I have tried the following: 16. I have tried the following: 17. I have tried the following: 18. I']

And when I set num_steps, it raises an AssertionError on this line:
assert embedding.shape == (batch_size, self.embedder_dim)
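One plausible cause, not confirmed here: a dimension mismatch between embedders. all-MiniLM-L6-v2 produces 384-dimensional embeddings, while the GTR models behind the pretrained vec2text correctors use 768, so a corrector expecting one size trips the shape assert on the other. A quick check with sentence-transformers:

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 embeds into 384 dimensions; gtr-t5-base into 768.
minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(minilm.get_sentence_embedding_dimension())  # 384

gtr = SentenceTransformer("sentence-transformers/gtr-t5-base")
print(gtr.get_sentence_embedding_dimension())  # 768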

@jxmorris12 (Owner) commented Jan 24, 2024

I'll get back to you. You can't use it that way, though: I haven't trained the expensive corrector model, only the zero-step inversion model for this embedder, so the corrector API won't work properly.
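In other words, the zero-step model maps an embedding straight to text in a single generate pass, with no iterative correction. A hypothetical sketch of driving that path directly; the from_pretrained and generate calls below are assumptions about InversionModel, not a documented vec2text API:

import torch
from sentence_transformers import SentenceTransformer
from vec2text.models import InversionModel

# Hypothetical sketch: embed text, then generate directly from the
# zero-step inversion model, skipping the corrector entirely. The
# loading and generate signatures are assumptions; check the repo for
# the supported path.
model = InversionModel.from_pretrained(
    "jxm/sentence-transformers_all-MiniLM-L6-v2__msmarco__128"
)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = torch.tensor(
    embedder.encode(["Jack Morris is a PhD student at Cornell Tech"])
)
output_ids = model.generate(
    inputs={"frozen_embeddings": embeddings},
    generation_kwargs={"max_length": 128, "num_beams": 4},
)
print(model.tokenizer.batch_decode(output_ids, skip_special_tokens=True))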

jxmorris12 added the documentation label Jan 24, 2024
@christophschuhmann
> I'll get back to you. You can't use it that way, though: I haven't trained the expensive corrector model, only the zero-step inversion model for this embedder, so the corrector API won't work properly.

Can you please add documentation on how to use it? :)
