Skip to content

Sample machine learning model using sklearn and dataset versioning using DVC

License

Notifications You must be signed in to change notification settings

josecelano/data-version-control

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

54 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build the model

Resize image using conda

Resize image using GitHub Action

Data Version Control Tutorial

Example repository for the Data Version Control With Python and DVC article on Real Python.

To use this repo as part of the tutorial, you first need to get your own copy. Click the Fork button in the top-right corner of the screen, and select your private account in the window that pops up. GitHub will create a forked copy of the repository under your account.

Clone the forked repository to your computer with the git clone command

git clone git@github.com:YourUsername/data-version-control.git

Make sure to replace YourUsername in the above command with your actual GitHub username.

Happy coding!

Custom features for this fork

This is a fork from a tutorial example repo. I'm adding some new features like an script to classify images using the generated model. For more details read the original tutorial.

Install

git clone git@github.com:josecelano/data-version-control.git
cd data-version-control
conda create --name dvc python=3.8.2 -y
conda config --add channels conda-forge
conda install dvc scikit-learn scikit-image pandas numpy

Alternatively you can create the conda environment with:

conda env create --file environment.yml

Run

conda activate dvc

Generate csv files:

python3 src/prepare.py

Resize images to 100x100 and convert them to PNG format:

python3 src/prepare_images.py

The model can be trained with raw images or pre-processed images and use them. For the time being both options are hardcoded in train.py file. The default option is from pre-resized images.

Train the model:

python3 src/train.py

Evaluate the model with the test set:

python3 src/evaluate.py && cat metrics/accuracy.json 

User the model to classify the image:

python3 src/predict.py

Sample output for predict.py script:

(dvc) josecelano@josecelano:~/Documents/github/josecelano/data-version-control$ python src/predict.py -i /home/josecelano/Documents/github/josecelano/data-version-control/data/raw/train/n03888257/n03888257_24024.JPEG
Predicting for image: " /home/josecelano/Documents/github/josecelano/data-version-control/data/raw/train/n03888257/n03888257_24024.JPEG "
['parachute']

This could be a golden test if you change something:

python src/train.py && python src/evaluate.py && cat metrics/accuracy.json

Accuracy should be almost the same. The trainning process is not deterministic.

Run workflow locally

We are using act to run GitHub Actions locally.

act usage:

act -h

Run workflow locally:

act -j build --secret-file .env
act -j show_changed_images --secret-file .env
...

With the j you can run only a single job.

Don't forget to add your Azure Blog Storage credentials to pull images from remote DVC storage. Otherwise, you will get this error:

| ERROR: failed to pull data from the cloud - Authentication to Azure Blob Storage requires either account_name or connection_string.
| Learn more about configuration settings at <https://man.dvc.org/remote/modify>
[Build the model/build]   ❌  Failure - Pull dataset from remote

Run workflow on GitHub

You need to add the secrets in .env.ci file:

AZURE_STORAGE_ACCOUNT='YOUR_STORAGE_ACCOUNT_NAME'
AZURE_STORAGE_KEY='YOUR_STORAGE_KEY'

New content

Links

Troubleshooting

TODO

  • Cache for DVC cache? We have to pull the whole dataset on every pipeline.
  • Use GitHub cache for DVC local cache .dvc\cache?
  • It seems conda cache it's not working. See issue #1.
  • Add tests to GitHub Action skimage-resizer.

About

Sample machine learning model using sklearn and dataset versioning using DVC

Topics

Resources

License

Stars

Watchers

Forks

Languages

  • Python 98.1%
  • Dockerfile 1.9%