GitHub - mleonowicz/dlp-model

Info

This repository contains code for a simple, end-to-end ML workflow for traffic sign classification for GTSRB dataset. The workflow is implemented with Amazon SageMaker Pipelines. The pipeline is defined with SageMaker SDK.

The repository is based on a template described in SageMaker MLOps Project Walkthrough. Additional references follow:

The pipeline:

Splits the dataset and converts it to RecordIO format in (pipelines/road_sign/pipeline.py#L112). It is assumed, that the dataset is uploaded to S3 beforehand.
Trains a classification model with transfer learning based on an already trained model (pipelines/road_sign/pipeline.py#L151).
Evaluates the model on the test dataset (pipelines/road_sign/pipeline.py#L199).
Registers the model to a model registry (pipelines/road_sign/pipeline.py#L251) when mean square error from the evaluation is lower than a specified threshold (pipelines/road_sign/pipeline.py#L273).

Model deployment code will be provided in another dlp-deploy repository.

Team members:

Code layout

|-- codebuild-buildspec.yml
|-- CONTRIBUTING.md
|-- pipelines
|   |-- road_sign
|   |   |-- evaluate.py
|   |   |-- __init__.py
|   |   |-- pipeline.py
|   |   `-- preprocess.py
|   |-- create_pipeline_definition.py
|   |-- __init__.py
|   |-- run_pipeline.py
|   |-- _utils.py
|   `-- __version__.py
|-- README.md
|-- sagemaker-pipelines-project.ipynb
|-- setup.cfg
|-- setup.py
|-- tests
|   `-- test_pipelines.py
`-- tox.ini

Code description

A description of some of the artifacts is provided below:

Your code builds execution instructions. This file contains the instructions needed to kick off the execution of the SageMaker Pipeline in the CI/CD system (via CodePipeline). You will see that this file has the fields defined for naming the Pipeline, ModelPackageGroup etc. You can customize them as required.

|-- codebuild-buildspec.yml

The pipeline artifacts, which include a pipeline module defining the required create_pipeline method that returns an instance of a SageMaker pipeline, a preprocessing script that is used in feature engineering, and a model evaluation script to measure the Mean Squared Error of the model that's trained by the pipeline. This is the core business logic:

|-- pipelines
|   |-- road_sign
|   |   |-- evaluate.py
|   |   |-- __init__.py
|   |   |-- pipeline.py
|   |   `-- preprocess.py

Utility modules for getting pipeline definition JSONs and running pipelines (you do not typically need to modify these):

|-- pipelines
|   |-- create_pipeline_definition.py
|   |-- __init__.py
|   |-- run_pipeline.py
|   |-- _utils.py
|   `-- __version__.py

Python package artifacts:

|-- setup.cfg
|-- setup.py

The testing module contains a stub code only:

|-- tests
|   `-- test_pipelines.py

The tox testing framework configuration:

`-- tox.ini

Running

Manual deployment from a local host is possible with

pip install . && python pipelines/run_pipeline.py --module-name road_sign.pipeline --role-arn <provide role here> --kwargs '{"region": "<provide region here>"}'

After each new commit to main branch, an Amazon CodeBuild build is run. An Amazon CloudFormation stack is deployed and the pipeline run is triggered.

GTSRB dataset

The German Traffic Signs Dataset is a collection of images of traffic signs commonly seen on German roads. The dataset consists of 50,000 images in total, with each image belonging to one of 43 different classes of traffic signs. The dataset was created for the purpose of developing and testing computer vision algorithms that can automatically recognize and classify traffic signs.

Here are some additional properties of the dataset:

The images in the dataset are normalized real-life images, meaning that they are not artificially generated but instead are real images that have been standardized in some way (e.g., cropped, resized, etc.) to make them more consistent across the dataset.
The images are uniformly distributed across the 43 different classes, meaning that there are roughly the same number of images for each class. This helps to ensure that the dataset is balanced and that the model is trained on a representative sample of each class of traffic sign.

A few samples images from the dataset:

Transfer learning

Metrics available in the Amazon SageMaker training job like validation, train accuracy or test accuracy are shown in the following picture:

Confusion Matrix

Experiments

A machine learning pipeline was run using an instance of ml.p2.xlarge, which is a type of Amazon Web Services (AWS) instance that is optimized for machine learning workloads.
The entire pipeline took 2 hours and 56 minutes to run from start to finish.
Of that time, 2 hours and 54 minutes were spent on training the model.
The cost of running this pipeline was $5.
It's worth noting that the cost may vary depending on the specific AWS region where the instance was launched and the duration of the instance usage.
The pipeline produced a model with an accuracy of 94.676%.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Info

Code layout

Code description

Running

GTSRB dataset

Transfer learning

Confusion Matrix

Experiments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.github/workflows		.github/workflows
img		img
pipelines		pipelines
tests		tests
.coveragerc		.coveragerc
.gitignore		.gitignore
.pydocstylerc		.pydocstylerc
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
codebuild-buildspec.yml		codebuild-buildspec.yml
sagemaker-pipelines-project.ipynb		sagemaker-pipelines-project.ipynb
setup.cfg		setup.cfg
setup.py		setup.py
tox.ini		tox.ini

Folders and files

Latest commit

History

Repository files navigation

Info

Code layout

Code description

Running

GTSRB dataset

Transfer learning

Confusion Matrix

Experiments

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages