KeyError Length during training following workshop MLOps #12

Closed
MrRobotV8 opened this issue Dec 4, 2022 · 9 comments

Comments

@MrRobotV8

```
AlgorithmError: ExecuteUserScriptError: Command "/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 64 --fp16 True --learning_rate 3e-5 --model_id distilbert-base-uncased --train_batch_size 32" Traceback (most recent call last): File "train.py", line 46, in train_dataset = load_from_disk(args.training_dir)
```

@philschmid
Owner

Hello @MrRobotV8,

can you please provide more context about your error? That message alone is not enough to reproduce the issue. Have you prepared the dataset correctly and uploaded it to S3?

@MrRobotV8
Author

MrRobotV8 commented Dec 5, 2022

Hi @philschmid ,
I am following workshop 3 of the MLOps series; I watched the videos and read the blog post on AWS. I am using an ml.t3.medium instance for my notebook with the conda_pytorch38 kernel.
As you may have seen in the forum post, I also tried updating the transformers and PyTorch versions and changing the dataset.

This is the processing step, which seems to run properly:

```python
processing_output_destination = f"s3://{bucket}/{s3_prefix}/data"

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type="ml.c5.2xlarge",
    instance_count=1,
    base_job_name=base_job_prefix + "/preprocessing",
    sagemaker_session=sagemaker_session,
    role=role,
)

step_process = ProcessingStep(
    name="ProcessDataForTraining",
    cache_config=cache_config,
    processor=sklearn_processor,
    job_arguments=["--transformers_version",transformers_version,
                   "--pytorch_version",pytorch_version,
                   "--model_id",model_id_,
                   "--dataset_name",dataset_name_],
    outputs=[
        ProcessingOutput(
            output_name="train",
            destination=f"{processing_output_destination}/train",
            source="/opt/ml/processing/train",
        ),
        ProcessingOutput(
            output_name="test",
            destination=f"{processing_output_destination}/test",
            source="/opt/ml/processing/test",
        ),
        ProcessingOutput(
            output_name="validation",
            destination=f"{processing_output_destination}/test",
            source="/opt/ml/processing/validation",
        ),
    ],
    code="./scripts/preprocessing.py",
)
```
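
For context, ./scripts/preprocessing.py roughly does the following; this is a sketch from memory rather than a verbatim copy, so the exact tokenization code and column names may differ:

```python
# Rough sketch of the preprocessing logic; names are assumptions, not verbatim.
# In the real script, `datasets` and `transformers` are first installed into
# the sklearn container with the install() helper mentioned further below.
from datasets import load_dataset
from transformers import AutoTokenizer

def preprocess(model_id="distilbert-base-uncased", dataset_name="emotion"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    raw = load_dataset(dataset_name)  # splits: train / validation / test

    def tokenize(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True)

    tokenized = raw.map(tokenize, batched=True)

    # The processing container writes to these local paths; the ProcessingOutput
    # definitions above upload them to S3 at the end of the job.
    tokenized["train"].save_to_disk("/opt/ml/processing/train")
    tokenized["test"].save_to_disk("/opt/ml/processing/test")
    tokenized["validation"].save_to_disk("/opt/ml/processing/validation")
```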

Then the training step of my pipeline failed with the following error:

```
AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "KeyError: 'length'" Command "/opt/conda/bin/python3.8 train.py --epochs 3 --eval_batch_size 64 --fp16 True --learning_rate 3e-05 --model_id distilbert-base-uncased --train_batch_size 32", exit code: 1
```

```python
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    base_job_name=base_job_prefix + "/training",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version=transformers_version,
    pytorch_version=pytorch_version,
    py_version=py_version,
    hyperparameters={
         'epochs':epochs,  
         'eval_batch_size': eval_batch_size,   
         'train_batch_size': train_batch_size,              
         'learning_rate': learning_rate,               
         'model_id': model_id,
         'fp16': fp16
    },
    sagemaker_session=sagemaker_session,
)

step_train = TrainingStep(
    name="TrainHuggingFaceModel",
    estimator=huggingface_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri
        ),
        "test": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri
        ),
    },
    cache_config=cache_config,
)
```
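
For reference, the part of train.py the traceback points at looks roughly like this (sketched, not verbatim; the --test_dir argument is my assumption based on the channel names):

```python
# Rough sketch of the relevant lines of train.py, reconstructed from the traceback.
import argparse
import os

from datasets import load_from_disk

parser = argparse.ArgumentParser()
# SageMaker exposes the "train" and "test" channels as local directories
# through the SM_CHANNEL_TRAIN / SM_CHANNEL_TEST environment variables.
parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
args, _ = parser.parse_known_args()

# This is the call that fails with KeyError: 'length' in my run.
train_dataset = load_from_disk(args.training_dir)
test_dataset = load_from_disk(args.test_dir)
```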

The train.py file, as well as all the other scripts (evaluate.py, deploy_handler.py, etc.), is copied and pasted from the repo.

At the end of the processing step, the data is uploaded to S3 at the paths defined above. I see three files for train and the same three (with different sizes) for test: dataset_info.json, the dataset Arrow file, and the state file.

Could the SKLearn framework version (0.23-1) be too outdated?

@philschmid
Owner

> As you may have seen in the forum post, I also tried updating the transformers and PyTorch versions and changing the dataset.

To which versions have you updated?

@MrRobotV8
Author

Package versions:

```python
transformers_version = "4.17.0"
pytorch_version = "1.10.2"
py_version = "py38"

model_id_ = "distilbert-base-uncased"
dataset_name_ = "emotion"
```

Using cached sagemaker-2.119.0-py2.py3-none-any.whl

@philschmid
Owner

And datasets? Are you using 1.18.x, since that's the latest version installed in the container: https://github.com/aws/deep-learning-containers/blob/master/huggingface/pytorch/buildspec.yml#LL42C42-L42C48

@MrRobotV8
Author

MrRobotV8 commented Dec 5, 2022

I didn't explicitly define it. In the preprocessing file from the repo, we are doing:

install("datasets[s3]")

Now I edited with:

install("datasets[s3]==1.18.4")

I should be able to share the output in about 10 minutes.

Training Image is: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04

@MrRobotV8
Author

I don't have an output yet because the training is still running... I hope it was just the datasets version. I will let you know the result once it's finished.

In the meantime, I can also share the definition of my pipeline, in case I have missed something.

```python
{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [{'Name': 'ModelId',
   'Type': 'String',
   'DefaultValue': 'distilbert-base-uncased'},
  {'Name': 'DatasetName', 'Type': 'String', 'DefaultValue': 'emotion'},
  {'Name': 'ProcessingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.c5.2xlarge'},
  {'Name': 'ProcessingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},
  {'Name': 'ProcessingScript',
   'Type': 'String',
   'DefaultValue': './scripts/preprocessing.py'},
  {'Name': 'TrainingEntryPoint', 'Type': 'String', 'DefaultValue': 'train.py'},
  {'Name': 'TrainingSourceDir', 'Type': 'String', 'DefaultValue': './scripts'},
  {'Name': 'TrainingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.p3.2xlarge'},
  {'Name': 'TrainingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},
  {'Name': 'EvaluationScript',
   'Type': 'String',
   'DefaultValue': './scripts/evaluate.py'},
  {'Name': 'ThresholdAccuracy', 'Type': 'Float', 'DefaultValue': 0.8},
  {'Name': 'Epochs', 'Type': 'String', 'DefaultValue': '1'},
  {'Name': 'EvalBatchSize', 'Type': 'String', 'DefaultValue': '32'},
  {'Name': 'TrainBatchSize', 'Type': 'String', 'DefaultValue': '16'},
  {'Name': 'LearningRate', 'Type': 'String', 'DefaultValue': '3e-5'},
  {'Name': 'Fp16', 'Type': 'String', 'DefaultValue': 'True'}],
 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
 'Steps': [{'Name': 'ProcessDataForTraining',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': 'ml.c5.2xlarge',
      'InstanceCount': 1,
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3',
     'ContainerArguments': ['--transformers_version',
      '4.17.0',
      '--pytorch_version',
      '1.10.2',
      '--model_id',
      'distilbert-base-uncased',
      '--dataset_name',
      'emotion'],
     'ContainerEntrypoint': ['python3',
      '/opt/ml/processing/input/code/preprocessing.py']},
    'RoleArn': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684',
    'ProcessingInputs': [{'InputName': 'code',
      'AppManaged': False,
      'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/ProcessDataForTraining-86437d43df6eeb597c9c5a3520836925/input/code/preprocessing.py',
       'LocalPath': '/opt/ml/processing/input/code',
       'S3DataType': 'S3Prefix',
       'S3InputMode': 'File',
       'S3DataDistributionType': 'FullyReplicated',
       'S3CompressionType': 'None'}}],
    'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'train',
       'AppManaged': False,
       'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/data/train',
        'LocalPath': '/opt/ml/processing/train',
        'S3UploadMode': 'EndOfJob'}},
      {'OutputName': 'test',
       'AppManaged': False,
       'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/data/test',
        'LocalPath': '/opt/ml/processing/test',
        'S3UploadMode': 'EndOfJob'}},
      {'OutputName': 'validation',
       'AppManaged': False,
       'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/data/test',
        'LocalPath': '/opt/ml/processing/validation',
        'S3UploadMode': 'EndOfJob'}}]}},
   'CacheConfig': {'Enabled': False, 'ExpireAfter': '30d'}},
  {'Name': 'TrainHuggingFaceModel',
   'Type': 'Training',
   'Arguments': {'AlgorithmSpecification': {'TrainingInputMode': 'File',
     'TrainingImage': '763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
     'EnableSageMakerMetricsTimeSeries': True},
    'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-183512891321/'},
    'StoppingCondition': {'MaxRuntimeInSeconds': 86400},
    'ResourceConfig': {'VolumeSizeInGB': 30,
     'InstanceCount': 1,
     'InstanceType': 'ml.p3.2xlarge'},
    'RoleArn': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684',
    'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
        'S3Uri': {'Get': "Steps.ProcessDataForTraining.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri"},
        'S3DataDistributionType': 'FullyReplicated'}},
      'ChannelName': 'train'},
     {'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
        'S3Uri': {'Get': "Steps.ProcessDataForTraining.ProcessingOutputConfig.Outputs['test'].S3Output.S3Uri"},
        'S3DataDistributionType': 'FullyReplicated'}},
      'ChannelName': 'test'}],
    'HyperParameters': {'epochs': {'Get': 'Parameters.Epochs'},
     'eval_batch_size': {'Get': 'Parameters.EvalBatchSize'},
     'train_batch_size': {'Get': 'Parameters.TrainBatchSize'},
     'learning_rate': {'Get': 'Parameters.LearningRate'},
     'model_id': {'Get': 'Parameters.ModelId'},
     'fp16': {'Get': 'Parameters.Fp16'},
     'sagemaker_submit_directory': '"s3://sagemaker-eu-west-1-183512891321/TrainHuggingFaceModel-0a8b7473ba341a719507d482a6891cd9/source/sourcedir.tar.gz"',
     'sagemaker_program': '"train.py"',
     'sagemaker_container_log_level': '20',
     'sagemaker_region': '"eu-west-1"'},
    'DebugHookConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-183512891321/',
     'CollectionConfigurations': []},
    'ProfilerRuleConfigurations': [{'RuleConfigurationName': 'ProfilerReport-1670233247',
      'RuleEvaluatorImage': '929884845733.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest',
      'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}}],
    'ProfilerConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-183512891321/'}},
   'CacheConfig': {'Enabled': False, 'ExpireAfter': '30d'}},
  {'Name': 'HuggingfaceEvalLoss',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': {'Get': 'Parameters.ProcessingInstanceType'},
      'InstanceCount': {'Get': 'Parameters.ProcessingInstanceCount'},
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3',
     'ContainerEntrypoint': ['python3',
      '/opt/ml/processing/input/code/evaluate.py']},
    'RoleArn': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684',
    'ProcessingInputs': [{'InputName': 'input-1',
      'AppManaged': False,
      'S3Input': {'S3Uri': {'Get': 'Steps.TrainHuggingFaceModel.ModelArtifacts.S3ModelArtifacts'},
       'LocalPath': '/opt/ml/processing/model',
       'S3DataType': 'S3Prefix',
       'S3InputMode': 'File',
       'S3DataDistributionType': 'FullyReplicated',
       'S3CompressionType': 'None'}},
     {'InputName': 'code',
      'AppManaged': False,
      'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/HuggingfaceEvalLoss-c59f512db1d458dadc4e83437b76244e/input/code/evaluate.py',
       'LocalPath': '/opt/ml/processing/input/code',
       'S3DataType': 'S3Prefix',
       'S3InputMode': 'File',
       'S3DataDistributionType': 'FullyReplicated',
       'S3CompressionType': 'None'}}],
    'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'evaluation',
       'AppManaged': False,
       'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/evaluation_report',
        'LocalPath': '/opt/ml/processing/evaluation',
        'S3UploadMode': 'EndOfJob'}}]}},
   'CacheConfig': {'Enabled': False, 'ExpireAfter': '30d'},
   'PropertyFiles': [{'PropertyFileName': 'HuggingFaceEvaluationReport',
     'OutputName': 'evaluation',
     'FilePath': 'evaluation.json'}]},
  {'Name': 'CheckHuggingfaceEvalAccuracy',
   'Type': 'Condition',
   'Arguments': {'Conditions': [{'Type': 'GreaterThanOrEqualTo',
      'LeftValue': {'Std:JsonGet': {'PropertyFile': {'Get': 'Steps.HuggingfaceEvalLoss.PropertyFiles.HuggingFaceEvaluationReport'},
        'Path': 'eval_accuracy'}},
      'RightValue': {'Get': 'Parameters.ThresholdAccuracy'}}],
    'IfSteps': [{'Name': 'HuggingFaceRegisterModel-RegisterModel',
      'Type': 'RegisterModel',
      'Arguments': {'ModelPackageGroupName': 'HuggingFaceModelPackageGroup',
       'InferenceSpecification': {'Containers': [{'Image': '763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
          'Environment': {'SAGEMAKER_PROGRAM': '',
           'SAGEMAKER_SUBMIT_DIRECTORY': '',
           'SAGEMAKER_CONTAINER_LOG_LEVEL': '20',
           'SAGEMAKER_REGION': 'eu-west-1'},
          'ModelDataUrl': {'Get': 'Steps.TrainHuggingFaceModel.ModelArtifacts.S3ModelArtifacts'}}],
        'SupportedContentTypes': ['application/json'],
        'SupportedResponseMIMETypes': ['application/json'],
        'SupportedRealtimeInferenceInstanceTypes': ['ml.g4dn.xlarge',
         'ml.m5.xlarge'],
        'SupportedTransformInstanceTypes': ['ml.g4dn.xlarge', 'ml.m5.xlarge']},
       'ModelApprovalStatus': 'Approved'}},
     {'Name': 'HuggingFaceModelDeployment',
      'Type': 'Lambda',
      'Arguments': {'model_name': 'distilbert-base-uncased-emotion12-05-09-39-43',
       'endpoint_config_name': 'distilbert-base-uncased-emotion12-05-09-39-43',
       'endpoint_name': 'distilbert-base-uncased-emotion',
       'endpoint_instance_type': 'ml.g4dn.xlarge',
       'model_package_arn': {'Get': 'Steps.HuggingFaceRegisterModel-RegisterModel.ModelPackageArn'},
       'role': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684'},
      'FunctionArn': 'arn:aws:lambda:eu-west-1:183512891321:function:sagemaker-pipelines-model-deployment-12-05-09-39-43',
      'OutputParameters': [{'OutputName': 'statusCode',
        'OutputType': 'String'},
       {'OutputName': 'body', 'OutputType': 'String'},
       {'OutputName': 'other_key', 'OutputType': 'String'}]}],
    'ElseSteps': []}}]}
```

@MrRobotV8
Author

It works! Thank you @philschmid!

@cgpeltier

Not sure if this is related to this issue too, but we're getting similar problems on some of our datasets in our SageMaker Pipelines, using various versions of datasets (1.18.4, 2.5.2, 2.7.1, etc.).

The weird thing for us is that it only seems to be happening on some of our HF datasets, but not others. I haven't done a deep dive into the differences in these files yet, but that's my next step. Thought I'd post here in case, though!
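
To start, I'll probably just log the datasets version on both the save and load sides to rule out the same mismatch, something like:

```python
import datasets

# Print this in both the processing script (which calls save_to_disk) and the
# training script (which calls load_from_disk); differing versions would point
# to the same mismatch discussed above.
print(f"datasets version: {datasets.__version__}")
```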
