KeyError Length during training following workshop MLOps #12

Closed
MrRobotV8 opened this issue Dec 4, 2022 · 9 comments

Comments

@MrRobotV8

```
AlgorithmError: ExecuteUserScriptError: Command "/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 64 --fp16 True --learning_rate 3e-5 --model_id distilbert-base-uncased --train_batch_size 32" Traceback (most recent call last): File "train.py", line 46, in train_dataset = load_from_disk(args.training_dir)
```

@philschmid
Owner

Hello @MrRobotV8,

can you please provide more context about your error? That message alone is not enough to reproduce the issue. Have you prepared the dataset correctly and uploaded it to S3?

@MrRobotV8
Author

MrRobotV8 commented Dec 5, 2022

Hi @philschmid ,
I am following workshop 3 of the MLOps series; I watched the videos and read the blog post on AWS. I am using an ml.t3.medium instance for my notebook with the conda_pytorch38 kernel.
As you may have seen in the forum post, I also tried updating the transformers and PyTorch versions and changing the dataset.

This is the processing step, which seems to run properly:

```python
processing_output_destination = f"s3://{bucket}/{s3_prefix}/data"

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type="ml.c5.2xlarge",
    instance_count=1,
    base_job_name=base_job_prefix + "/preprocessing",
    sagemaker_session=sagemaker_session,
    role=role,
)

step_process = ProcessingStep(
    name="ProcessDataForTraining",
    cache_config=cache_config,
    processor=sklearn_processor,
    job_arguments=["--transformers_version",transformers_version,
                   "--pytorch_version",pytorch_version,
                   "--model_id",model_id_,
                   "--dataset_name",dataset_name_],
    outputs=[
        ProcessingOutput(
            output_name="train",
            destination=f"{processing_output_destination}/train",
            source="/opt/ml/processing/train",
        ),
        ProcessingOutput(
            output_name="test",
            destination=f"{processing_output_destination}/test",
            source="/opt/ml/processing/test",
        ),
        ProcessingOutput(
            output_name="validation",
            destination=f"{processing_output_destination}/test",
            source="/opt/ml/processing/validation",
        ),
    ],
    code="./scripts/preprocessing.py",
)
```
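
For context, ./scripts/preprocessing.py roughly does the following; this is a sketch from memory rather than a verbatim copy, so the exact tokenization code and column names may differ:

```python
# Rough sketch of the preprocessing logic; names are assumptions, not verbatim.
# In the real script, `datasets` and `transformers` are first installed into
# the sklearn container with the install() helper mentioned further below.
from datasets import load_dataset
from transformers import AutoTokenizer

def preprocess(model_id="distilbert-base-uncased", dataset_name="emotion"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    raw = load_dataset(dataset_name)  # splits: train / validation / test

    def tokenize(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True)

    tokenized = raw.map(tokenize, batched=True)

    # The processing container writes to these local paths; the ProcessingOutput
    # definitions above upload them to S3 at the end of the job.
    tokenized["train"].save_to_disk("/opt/ml/processing/train")
    tokenized["test"].save_to_disk("/opt/ml/processing/test")
    tokenized["validation"].save_to_disk("/opt/ml/processing/validation")
```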

Then the training step of my pipeline failed with the following error:

```
AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "KeyError: 'length'" Command "/opt/conda/bin/python3.8 train.py --epochs 3 --eval_batch_size 64 --fp16 True --learning_rate 3e-05 --model_id distilbert-base-uncased --train_batch_size 32", exit code: 1
```

```python
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    base_job_name=base_job_prefix + "/training",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version=transformers_version,
    pytorch_version=pytorch_version,
    py_version=py_version,
    hyperparameters={
         'epochs':epochs,  
         'eval_batch_size': eval_batch_size,   
         'train_batch_size': train_batch_size,              
         'learning_rate': learning_rate,               
         'model_id': model_id,
         'fp16': fp16
    },
    sagemaker_session=sagemaker_session,
)

step_train = TrainingStep(
    name="TrainHuggingFaceModel",
    estimator=huggingface_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri
        ),
        "test": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri
        ),
    },
    cache_config=cache_config,
)
```
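
For reference, the part of train.py the traceback points at looks roughly like this (sketched, not verbatim; the --test_dir argument is my assumption based on the channel names):

```python
# Rough sketch of the relevant lines of train.py, reconstructed from the traceback.
import argparse
import os

from datasets import load_from_disk

parser = argparse.ArgumentParser()
# SageMaker exposes the "train" and "test" channels as local directories
# through the SM_CHANNEL_TRAIN / SM_CHANNEL_TEST environment variables.
parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
args, _ = parser.parse_known_args()

# This is the call that fails with KeyError: 'length' in my run.
train_dataset = load_from_disk(args.training_dir)
test_dataset = load_from_disk(args.test_dir)
```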

The train.py file, as well as all the other scripts (evaluate.py, deploy_handler.py, etc.), is copied and pasted from the repo.

At the end of the processing step, the data is uploaded to S3 at the paths defined above. I see three files for train and the same three (with different sizes) for test: dataset_info.json, the dataset Arrow file, and the state file.

Could the SKLearn framework version (0.23-1) be too outdated?

@philschmid
Owner

> As you may have seen in the forum post, I also tried updating the transformers and PyTorch versions and changing the dataset.

To which versions have you updated?

@MrRobotV8
Author

Package versions:

```python
transformers_version = "4.17.0"
pytorch_version = "1.10.2"
py_version = "py38"

model_id_ = "distilbert-base-uncased"
dataset_name_ = "emotion"
```

Using cached sagemaker-2.119.0-py2.py3-none-any.whl

@philschmid
Owner

And datasets? Are you using 1.18.x, since that's the latest version installed in the container: https://github.com/aws/deep-learning-containers/blob/master/huggingface/pytorch/buildspec.yml#LL42C42-L42C48

@MrRobotV8
Author

MrRobotV8 commented Dec 5, 2022

I didn't explicitly define it. In the preprocessing file from the repo, we are doing:

install("datasets[s3]")

Now I edited with:

install("datasets[s3]==1.18.4")

I should be able to share the output in about 10 minutes.

Training Image is: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04

@MrRobotV8
Author

I don't have an output yet because the training is still running... I hope it was just the datasets version. I will let you know the result once it's finished.

In the meantime, I can also share the definition of my pipeline, in case I have missed something.

```python
{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [{'Name': 'ModelId',
   'Type': 'String',
   'DefaultValue': 'distilbert-base-uncased'},
  {'Name': 'DatasetName', 'Type': 'String', 'DefaultValue': 'emotion'},
  {'Name': 'ProcessingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.c5.2xlarge'},
  {'Name': 'ProcessingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},
  {'Name': 'ProcessingScript',
   'Type': 'String',
   'DefaultValue': './scripts/preprocessing.py'},
  {'Name': 'TrainingEntryPoint', 'Type': 'String', 'DefaultValue': 'train.py'},
  {'Name': 'TrainingSourceDir', 'Type': 'String', 'DefaultValue': './scripts'},
  {'Name': 'TrainingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.p3.2xlarge'},
  {'Name': 'TrainingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},
  {'Name': 'EvaluationScript',
   'Type': 'String',
   'DefaultValue': './scripts/evaluate.py'},
  {'Name': 'ThresholdAccuracy', 'Type': 'Float', 'DefaultValue': 0.8},
  {'Name': 'Epochs', 'Type': 'String', 'DefaultValue': '1'},
  {'Name': 'EvalBatchSize', 'Type': 'String', 'DefaultValue': '32'},
  {'Name': 'TrainBatchSize', 'Type': 'String', 'DefaultValue': '16'},
  {'Name': 'LearningRate', 'Type': 'String', 'DefaultValue': '3e-5'},
  {'Name': 'Fp16', 'Type': 'String', 'DefaultValue': 'True'}],
 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
 'Steps': [{'Name': 'ProcessDataForTraining',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': 'ml.c5.2xlarge',
      'InstanceCount': 1,
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3',
     'ContainerArguments': ['--transformers_version',
      '4.17.0',
      '--pytorch_version',
      '1.10.2',
      '--model_id',
      'distilbert-base-uncased',
      '--dataset_name',
      'emotion'],
     'ContainerEntrypoint': ['python3',
      '/opt/ml/processing/input/code/preprocessing.py']},
    'RoleArn': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684',
    'ProcessingInputs': [{'InputName': 'code',
      'AppManaged': False,
      'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/ProcessDataForTraining-86437d43df6eeb597c9c5a3520836925/input/code/preprocessing.py',
       'LocalPath': '/opt/ml/processing/input/code',
       'S3DataType': 'S3Prefix',
       'S3InputMode': 'File',
       'S3DataDistributionType': 'FullyReplicated',
       'S3CompressionType': 'None'}}],
    'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'train',
       'AppManaged': False,
       'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/data/train',
        'LocalPath': '/opt/ml/processing/train',
        'S3UploadMode': 'EndOfJob'}},
      {'OutputName': 'test',
       'AppManaged': False,
       'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/data/test',
        'LocalPath': '/opt/ml/processing/test',
        'S3UploadMode': 'EndOfJob'}},
      {'OutputName': 'validation',
       'AppManaged': False,
       'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/data/test',
        'LocalPath': '/opt/ml/processing/validation',
        'S3UploadMode': 'EndOfJob'}}]}},
   'CacheConfig': {'Enabled': False, 'ExpireAfter': '30d'}},
  {'Name': 'TrainHuggingFaceModel',
   'Type': 'Training',
   'Arguments': {'AlgorithmSpecification': {'TrainingInputMode': 'File',
     'TrainingImage': '763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
     'EnableSageMakerMetricsTimeSeries': True},
    'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-183512891321/'},
    'StoppingCondition': {'MaxRuntimeInSeconds': 86400},
    'ResourceConfig': {'VolumeSizeInGB': 30,
     'InstanceCount': 1,
     'InstanceType': 'ml.p3.2xlarge'},
    'RoleArn': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684',
    'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
        'S3Uri': {'Get': "Steps.ProcessDataForTraining.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri"},
        'S3DataDistributionType': 'FullyReplicated'}},
      'ChannelName': 'train'},
     {'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
        'S3Uri': {'Get': "Steps.ProcessDataForTraining.ProcessingOutputConfig.Outputs['test'].S3Output.S3Uri"},
        'S3DataDistributionType': 'FullyReplicated'}},
      'ChannelName': 'test'}],
    'HyperParameters': {'epochs': {'Get': 'Parameters.Epochs'},
     'eval_batch_size': {'Get': 'Parameters.EvalBatchSize'},
     'train_batch_size': {'Get': 'Parameters.TrainBatchSize'},
     'learning_rate': {'Get': 'Parameters.LearningRate'},
     'model_id': {'Get': 'Parameters.ModelId'},
     'fp16': {'Get': 'Parameters.Fp16'},
     'sagemaker_submit_directory': '"s3://sagemaker-eu-west-1-183512891321/TrainHuggingFaceModel-0a8b7473ba341a719507d482a6891cd9/source/sourcedir.tar.gz"',
     'sagemaker_program': '"train.py"',
     'sagemaker_container_log_level': '20',
     'sagemaker_region': '"eu-west-1"'},
    'DebugHookConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-183512891321/',
     'CollectionConfigurations': []},
    'ProfilerRuleConfigurations': [{'RuleConfigurationName': 'ProfilerReport-1670233247',
      'RuleEvaluatorImage': '929884845733.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest',
      'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}}],
    'ProfilerConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-183512891321/'}},
   'CacheConfig': {'Enabled': False, 'ExpireAfter': '30d'}},
  {'Name': 'HuggingfaceEvalLoss',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': {'Get': 'Parameters.ProcessingInstanceType'},
      'InstanceCount': {'Get': 'Parameters.ProcessingInstanceCount'},
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3',
     'ContainerEntrypoint': ['python3',
      '/opt/ml/processing/input/code/evaluate.py']},
    'RoleArn': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684',
    'ProcessingInputs': [{'InputName': 'input-1',
      'AppManaged': False,
      'S3Input': {'S3Uri': {'Get': 'Steps.TrainHuggingFaceModel.ModelArtifacts.S3ModelArtifacts'},
       'LocalPath': '/opt/ml/processing/model',
       'S3DataType': 'S3Prefix',
       'S3InputMode': 'File',
       'S3DataDistributionType': 'FullyReplicated',
       'S3CompressionType': 'None'}},
     {'InputName': 'code',
      'AppManaged': False,
      'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/HuggingfaceEvalLoss-c59f512db1d458dadc4e83437b76244e/input/code/evaluate.py',
       'LocalPath': '/opt/ml/processing/input/code',
       'S3DataType': 'S3Prefix',
       'S3InputMode': 'File',
       'S3DataDistributionType': 'FullyReplicated',
       'S3CompressionType': 'None'}}],
    'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'evaluation',
       'AppManaged': False,
       'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/evaluation_report',
        'LocalPath': '/opt/ml/processing/evaluation',
        'S3UploadMode': 'EndOfJob'}}]}},
   'CacheConfig': {'Enabled': False, 'ExpireAfter': '30d'},
   'PropertyFiles': [{'PropertyFileName': 'HuggingFaceEvaluationReport',
     'OutputName': 'evaluation',
     'FilePath': 'evaluation.json'}]},
  {'Name': 'CheckHuggingfaceEvalAccuracy',
   'Type': 'Condition',
   'Arguments': {'Conditions': [{'Type': 'GreaterThanOrEqualTo',
      'LeftValue': {'Std:JsonGet': {'PropertyFile': {'Get': 'Steps.HuggingfaceEvalLoss.PropertyFiles.HuggingFaceEvaluationReport'},
        'Path': 'eval_accuracy'}},
      'RightValue': {'Get': 'Parameters.ThresholdAccuracy'}}],
    'IfSteps': [{'Name': 'HuggingFaceRegisterModel-RegisterModel',
      'Type': 'RegisterModel',
      'Arguments': {'ModelPackageGroupName': 'HuggingFaceModelPackageGroup',
       'InferenceSpecification': {'Containers': [{'Image': '763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
          'Environment': {'SAGEMAKER_PROGRAM': '',
           'SAGEMAKER_SUBMIT_DIRECTORY': '',
           'SAGEMAKER_CONTAINER_LOG_LEVEL': '20',
           'SAGEMAKER_REGION': 'eu-west-1'},
          'ModelDataUrl': {'Get': 'Steps.TrainHuggingFaceModel.ModelArtifacts.S3ModelArtifacts'}}],
        'SupportedContentTypes': ['application/json'],
        'SupportedResponseMIMETypes': ['application/json'],
        'SupportedRealtimeInferenceInstanceTypes': ['ml.g4dn.xlarge',
         'ml.m5.xlarge'],
        'SupportedTransformInstanceTypes': ['ml.g4dn.xlarge', 'ml.m5.xlarge']},
       'ModelApprovalStatus': 'Approved'}},
     {'Name': 'HuggingFaceModelDeployment',
      'Type': 'Lambda',
      'Arguments': {'model_name': 'distilbert-base-uncased-emotion12-05-09-39-43',
       'endpoint_config_name': 'distilbert-base-uncased-emotion12-05-09-39-43',
       'endpoint_name': 'distilbert-base-uncased-emotion',
       'endpoint_instance_type': 'ml.g4dn.xlarge',
       'model_package_arn': {'Get': 'Steps.HuggingFaceRegisterModel-RegisterModel.ModelPackageArn'},
       'role': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684'},
      'FunctionArn': 'arn:aws:lambda:eu-west-1:183512891321:function:sagemaker-pipelines-model-deployment-12-05-09-39-43',
      'OutputParameters': [{'OutputName': 'statusCode',
        'OutputType': 'String'},
       {'OutputName': 'body', 'OutputType': 'String'},
       {'OutputName': 'other_key', 'OutputType': 'String'}]}],
    'ElseSteps': []}}]}
```

@MrRobotV8
Author

It works! Thank you @philschmid!

@cgpeltier

Not sure if this is related to this issue too, but we're getting similar problems on some of our datasets in our SageMaker Pipelines, using various versions of datasets (1.18.4, 2.5.2, 2.7.1, etc.).

The weird thing for us is that it only seems to be happening on some of our HF datasets, but not others. I haven't done a deep dive into the differences in these files yet, but that's my next step. Thought I'd post here in case, though!
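
To start, I'll probably just log the datasets version on both the save and load sides to rule out the same mismatch, something like:

```python
import datasets

# Print this in both the processing script (which calls save_to_disk) and the
# training script (which calls load_from_disk); differing versions would point
# to the same mismatch discussed above.
print(f"datasets version: {datasets.__version__}")
```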
