KeyError Length during training following workshop MLOps #12
Hello @MrRobotV8, can you please provide more context regarding your error? That message alone is not enough to reproduce it. Have you prepared the dataset correctly and uploaded it to S3?
Hi @philschmid, this is the processing step, which seems to run properly.
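Roughly, it is defined like this (a simplified sketch, not a verbatim copy; the role, instance type, and step name below are placeholders):

```python
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

role = sagemaker.get_execution_role()

# SKLearn processor that runs the preprocessing script from the workshop repo
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",    # the version I mention below
    role=role,
    instance_type="ml.m5.xlarge",  # placeholder instance type
    instance_count=1,
)

# processing step that writes the tokenized train/test splits to S3
step_process = ProcessingStep(
    name="ProcessDataset",         # placeholder step name
    processor=sklearn_processor,
    code="preprocessing.py",
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
)
```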
Then the training step of my pipeline failed with the KeyError 'Length' from the issue title.
The train.py file, like all the other files (evaluate.py, deploy_handler.py, etc.), is copied and pasted from the repo. At the end of the processing step, the data are uploaded to S3 at the correct path I defined. I see three files for train and the same three (with different sizes) for test: dataset_info.json, dataset.arrow, and state.json. Could the SKLearn framework version (0.23-1) be too outdated?
Which versions have you updated to?
Package versions:
transformers_version = "4.17.0"
model_id = "distilbert-base-uncased"
sagemaker: Using cached sagemaker-2.119.0-py2.py3-none-any.whl
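Those versions end up in the HuggingFace estimator roughly like this (a sketch; the entry point, source dir, and instance type are placeholders, and `role` is the execution role as above):

```python
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",         # placeholder
    role=role,
    instance_type="ml.p3.2xlarge",  # placeholder GPU instance
    instance_count=1,
    transformers_version="4.17.0",
    pytorch_version="1.10.2",       # matches the training image quoted below
    py_version="py38",
    hyperparameters={
        "model_id": "distilbert-base-uncased",
        "epochs": 1,
    },
)
```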
And datasets? Which version are you using?
I didn't explicitly define it. In the preprocessing file in the repo, we are doing:
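Paraphrasing the top of that script from memory: it installs the libraries inside the SKLearn container at runtime, without pinning a datasets version:

```python
import subprocess
import sys

# install transformers and datasets in the SKLearn container at runtime;
# note that neither library version is pinned, so the processing job
# silently picks up the latest datasets release
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "transformers", "datasets",
])
```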
Now I have edited it to:
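That is, pinning both libraries to the versions that ship in the training container. The exact pins below are my assumption of what matches the transformers 4.17.0 DLC:

```python
import subprocess
import sys

# same runtime-install pattern as above, but with pinned versions so the
# processing job writes a dataset format the training container can read back
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "transformers==4.17.0",
    "datasets==1.18.4",  # assumption: the datasets release bundled with this DLC
])
```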
I think I will be able to share the output in about 10 minutes. The training image is: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
I still don't have an output because the training is in progress... Hopefully it was just the datasets version. I will let you know the result once it finishes. In the meantime, I can also share the definition of my pipeline; I don't want to have missed something.
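Roughly, it looks like this (simplified; it reuses the step objects sketched above, and the pipeline name is a placeholder):

```python
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# the training step consumes the S3 outputs of the processing step
step_train = TrainingStep(
    name="TrainModel",
    estimator=huggingface_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri
        ),
        "test": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri
        ),
    },
)

pipeline = Pipeline(
    name="HuggingFaceMLOpsDemo",  # placeholder name
    steps=[step_process, step_train],
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()
```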
It works! Thank you @philschmid!
Not sure if this is related to this issue too, but we're getting similar problems on some of our datasets in our SageMaker Pipelines, using various versions of `datasets`. The weird thing for us is that it only seems to happen on some of our HF datasets, but not others. I haven't done a deep dive into the differences in these files yet, but that's my next step. Thought I'd post here just in case, though!
AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 64 --fp16 True --learning_rate 3e-5 --model_id distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
  File "train.py", line 46, in <module>
    train_dataset = load_from_disk(args.training_dir)
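For anyone else debugging this: since the on-disk format written by `save_to_disk` has changed across `datasets` releases, a cheap first check is to log the library version on both the save and load sides (a sketch; the path is the default SageMaker training channel mount):

```python
import datasets
from datasets import load_from_disk

# log the runtime library version in both the processing and training scripts;
# if they differ, load_from_disk can fail on the serialized state/info files
print("datasets version:", datasets.__version__)

train_dataset = load_from_disk("/opt/ml/input/data/train")  # same call that raises above
print(train_dataset)
```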