# DVC Get Started
This is an auto-generated repository (please don't create issues here; use the example-get-started-dev repository instead).
The project is a simplified version of the tutorial. It explores the natural
language processing (NLP) problem of predicting tags for a given StackOverflow
question. For example, we want a classifier that can predict that a post is
about the Python language and tag it accordingly.
First, you need to download the project:

```
$ git clone https://github.com/iterative/example-get-started
```
Second, let's install the requirements. But before we do that, we strongly
recommend creating a virtual environment with `virtualenv` or a similar tool:

```
$ cd example-get-started
$ virtualenv -p python3 .env
$ source .env/bin/activate
```
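If `virtualenv` isn't available, Python 3's built-in `venv` module can be used
for this step instead (a sketch of the alternative, not part of the guide
itself; requires Python 3.3+):

```shell
# Create and activate a virtual environment with the stdlib venv module
# (an alternative to the virtualenv tool used above).
python3 -m venv .env
. .env/bin/activate
```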
Now, we can install the requirements for the project:

```
$ pip install -r requirements.txt
```
## Running in Your Environment
This project comes with a predefined remote DVC storage that contains all the
input, intermediate, and final results that were produced:

```
$ dvc remote list
storage	https://remote.dvc.org/get-started
```
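For reference, a DVC remote is just a section in the repository's
`.dvc/config` file; the entry behind the listing above looks roughly like this
(a sketch of DVC's INI-style config format):

```ini
['remote "storage"']
url = https://remote.dvc.org/get-started
```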
Run `dvc pull` to download the data:

```
$ dvc pull -r storage
```

Run `dvc repro` to reproduce the pipeline:

```
$ dvc repro evaluate.dvc
```
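Each `*.dvc` file describes one pipeline stage: its command, its dependencies,
and its outputs. A simplified sketch of what a stage file such as
`prepare.dvc` might contain (the real file also records `md5` checksums that
DVC uses to detect changes; the exact paths here are assumptions):

```yaml
cmd: python src/prepare.py data/data.xml
deps:
- path: src/prepare.py
- path: data/data.xml
outs:
- path: data/prepared
```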
If you'd like to test commands like `dvc push` that require write access to
the remote storage, the easiest way is to set up a local remote on your file
system:

```
$ dvc remote add local /tmp/dvc-storage
```

You should then be able to run:

```
$ dvc push -r local
```
With the help of Git tags, this project reflects the sequence of actions that
are run in the DVC Get Started guide. Feel free to check out one of the tags
and play with the DVC commands, having the playground ready.
- `0-empty` - empty Git repository.
- `1-initialize` - DVC has been initialized. The `.dvc` directory with the
  cache has been created.
- `2-remote` - remote HTTP storage initialized. It is a shared, read-only
  storage that contains all the data artifacts produced during the next steps.
- `3-add-file` - input data file `data.xml` downloaded and put under DVC
  control with `dvc add`. The first `.dvc` file has been created.
- `4-source` - source code downloaded and put under Git control.
- `5-preparation` - first DVC stage created using `dvc run`. It transforms XML
  data into TSV.
- `6-featurization` - feature extraction step added. It also includes the
  split step for simplicity. It takes data in TSV format and produces two
  `.pkl` files that contain serialized feature matrices.
- `7-train` - model training stage added. It produces the `model.pkl` file,
  the actual result that can then be deployed somewhere to classify questions.
- `8-evaluate` - evaluate stage added; we run it on a test dataset to see the
  AUC value for the model. The result is dumped into a DVC metric file so that
  we can compare it with other experiments later.
- `9-bigrams` - bigrams experiment; the code has been modified to extract more
  features. We run `dvc repro` for the first time to illustrate how DVC can
  reuse cached files and detect changes along the computational graph.
There are two additional tags:

- `baseline-experiment` - the first end-to-end result that we have a
  performance metric for.
- `bigrams-experiment` - the second version of the experiment.

Both these tags can be used to illustrate the `-T` option that some DVC
commands accept to operate across tags.
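For example, the playground loop of checking out a tag and syncing the
workspace can be sketched as a small shell helper (`replay_tag` is a
hypothetical name of ours; `git checkout`, `dvc checkout`, and
`dvc metrics show -T` are the underlying commands):

```shell
# Hypothetical helper: move the repo to one step of the guide.
replay_tag() {
  git checkout "$1"  # switch code and *.dvc files to the tagged step
  dvc checkout       # sync DVC-controlled data files to match them
}
# Usage (inside the clone):
#   replay_tag baseline-experiment
#   dvc metrics show -T    # -T reports the metric for every tag
```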
The project files, DVC files, and data files change as you apply the stages
one by one, but right after you `git clone` the project and run `dvc pull` to
download the files that are under DVC control, the structure of the project
should look like this:
```
.
├── auc.metric             <-- DVC metric file to compare baseline and bigrams
├── data                   <-- directory with input and intermediate data
│   ├── features           <-- extracted feature matrices
│   │   ├── test.pkl
│   │   └── train.pkl
│   ├── prepared           <-- pre-processed dataset, split and TSV formatted
│   │   ├── test.tsv
│   │   └── train.tsv
│   ├── data.xml           <-- initial XML StackOverflow dataset
│   └── data.xml.dvc
├── evaluate.dvc           <-- DVC files in the project root describe the pipeline
├── featurize.dvc
├── model.pkl
├── prepare.dvc
├── requirements.txt       <-- Python dependencies you need to run the project
├── src                    <-- sources to run the pipeline
│   ├── evaluate.py
│   ├── featurization.py
│   ├── prepare.py
│   └── train.py
└── train.dvc
```