In [None]:
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
cm.update(
    "rise",
    {
        "theme": "none",
        "transition": "none",
        "start_slideshow_at": "selected",
        "scroll": False
     }
)


<h1 align='center'>Data and Model Version Control: Applications in ML Drug Discovery pipelines
</h1>


<h3 align='center'>Estefania Barreto-Ojeda, PhD</h3>
<h4 align='center'>Computational Scientist <br> Cyclica Inc.</h4> 



<h4 align='right'><br><br>@ojeda-e<br>https://github.com/ojeda-e/dvc-dd-pipelines</h4> 


<h3 align='left'>Overview</h3>

**Part I: Biological data**
- What makes biological data different? 
- Overview of ML drug discovery pipelines.
- Data feedback loops: optimizing ML models.
- Challenges.

**Part II: Implementing data version control in drug discovery ML pipelines**
- Introduction to DVC.
- Implementing DVC for biological data.
- Tracking data and models in drug discovery.
- Highlights.


<h1 align='center'>Part I: Biological Data</h1>

<br>
<h2 align='center'>Data is crucial for Machine Learning.</h2>

<br><br>
<center>
We need a lot of data.
</center>

<br>
<center>
We <b>have</b> a lot of data.
</center>
<br>
<br>

<center><h3 style="color:darkred;">Data != Information</h3></center>

 <h1 align='center'>Biological data is complex</h1>
 


<img src="img/ComplexData.png"  class='center'>

<br>
<br>
<br>


<center><h3 style="color:darkred;">Data != Information</h3></center>

<!-- Heterogenous -->


<p align="left"><img src="img/heterogeneous.png"/></p>


* Diverse formats.
* Derived from specific assays.
    
<br>
<img src="img/omics.png" align='right' width=800>
<br>

<!-- <p align="right"><img src="img/omics.png"/></p> -->

<!-- High dimensional -->

<p align="left"><img src="img/high-dim.png"/></p>
<br>

- Low number of samples (observations).
- High number of variables (features).

<br>
<img src="img/highdim_nature.png" align='right' width=700>
<br>

- Example:
    - 10,000 samples.
    - Each sample with 100 tumors.
    - 10,000-D space. 

<center>
<h6 style="color:lightgrey;">R. Clarke et al. Nat. Rev. Cancer (2008)</h6>
</center>



<!-- Conditional -->
<p align="left"><img src="img/conditional.png"/></p>
<br>

<br>
<img src="img/practical-rec.jpg" align='right' width=600>


- Protocols are not always reproducible.
- Elusive ground truths.


- Example:
    - No standards or tools available:
        - aggregating data.
        - curating data.
    - Standardized methods available lack functionality.
    - Complex problems, complex experiments.

<center>
<h6 style="color:lightgrey;">V.L. Porubsky, et al. Cancer Cell (2020)</h6>
</center>

The reproducibility literature largely treats an experiment as repeated as long as the same experimental method is followed. In practice, following the same method is not enough.

Popular deep learning (and other machine learning) methods are often used to tackle classification tasks and thus require ground-truth labels for training. 

<!-- Research biased -->

<br>
<p align="left"><img src="img/biased.png"/></p>
<br>

<img src="img/covid_nature.png" align='right' width=500>

- Bias towards specific outputs.
- More data on “hot” topics.



- Example:
        
    Articles on COVID-19:

    - **2020:** +192 k
    - **2021:** +298 k
    
<center>
<h6 style="color:lightgrey;">(H. Else, Nature 2020)</h6>
</center>

<br>
<h2 align='center'>Biological data is complex</h2>
<br>
<br>
<div class="col-md-8" markdown="1">
<br>
<br>
</div>

<div class="col-md-8" markdown="1">

- **Dissimilar**

    →  Diverse formats and content.

<br>

- **Imbalanced**

    →  More data for some features than for others.
</div>


- **Redundant**

    →  Duplicated values.
    
<br>

- **Sparse**

    →  Missing annotations.
    
<br>
<br>
<center><h3 style="color:darkred;">Data != Information</h3></center>

<br>
<h2 align='center'>Biological data is complex</h2>
<br>


- No direct use or implementation.

- Requires curation!


<br>
<img src="img/BearTheData.png" align='center' width=490>
<br>

<center>
    <h3 style="color:darkcyan;">Data == Information
    </h3>
</center>

<br>
<h2 align='left'>ML Workflow</h2>
<br>
<img src="img/step1.png" align='center'>
<br>


<br>
<h2 align='left'>ML Workflow</h2>

<img src="img/step2.png" align='center'>
<br>


<br>
<h2 align='left'>ML Workflow Drug Discovery Pipelines
</h2>
<br>
<img src="img/step2-DD.png" align='center'>
<br>


<br>
<h2 align='left'>ML Workflow Drug Discovery Pipelines
</h2>
<br>
<img src="img/step3.png" align='center'>
<br>


<br>
<h2 align='left'>ML Workflow Drug Discovery Pipelines
</h2>
<br>
<img src="img/step4.png" align='center'>
<br>


<br>
<h2 align='left'>ML Workflow Drug Discovery Pipelines
</h2>

<br>

* Keep model updated.
* Integrate newly generated data.

<br><br>

<center>
<b>Flexible ML models</b>
</center>

<br>
<img src="img/experiments.png" align='center'>

<br>
<h2 align='left'>Drug Discovery Pipelines: challenges
</h2>


* An ML experiment is not fully defined by the code or the dataset alone.
* It changes with the dataset + data processing + code.
* An ML project is the set of:
    * All the possible models.
    * Every version of the initial dataset and the transformed dataset.
    * Associated metrics.


<img src="img/experiments-hl.png" align='center'>

<br>
<h2 align='left'>Drug Discovery Pipelines: challenges
</h2>
<br>

(1) Raw data, curated data.

(2) ML models.

(3) Metrics.

To improve prediction: More ML models and more metrics.
<h3 align='center'>How do we track all these changes?
</h3>
<br>
<img src="img/experiments-hl.png" align='center'>

<br>
<h2 align='left'>Drug Discovery Pipelines: challenges
</h2>
<br>

(1) Raw data, curated data.

(2) ML models.

(3) Metrics.

To improve prediction: More ML models and more metrics.
<h3 align='center'>How do we track all these changes?
</h3>
<br>
<img src="img/experiments-DVC.png" align='center'>

<h1 align='center'>Part II: Implementing data version control in drug discovery ML pipelines
</h1>

<h2 align='left'>What is DVC?</h2>

<u>D</u>ata <u>V</u>ersion <u>C</u>ontrol
<div class="col-md-8" markdown="1">

<img src="img/DVClogo.svg" align='left' width=150>
</div>
<br>
<div class="col-md-8" markdown="1">

* Open source.
* Compatible with git providers.
* Compatible with cloud storage providers.
* Language and framework agnostic.

**What can it do?**
* Track data and metrics.
* Version control ML projects.
* Manage ML experiments.
<br>
</div>

<br>

**What is it not?**

Disney Vacation Club

<img src="img/not-dvc.jpeg" align='left' width=200>



**Installing DVC**

<br>

```bash

pip install dvc


```

- Choose your cloud provider:
    `[s3]`, `[gdrive]`, `[gs]`, `[azure]`, etc. 

- Use `[all]` to include them all.

<br>

For this talk, the cloud provider is Google Cloud Storage: `dvc[gs]`


```bash

pip install "dvc[gs]"

# or, with conda:

conda install -c conda-forge dvc-gs


```


**Initialize DVC**

DVC works best with Git repos.

```bash

dvc init


```

In [None]:
! mkdir /tmp/dvc_demo
! cd /tmp/dvc_demo && git init && dvc init

In [None]:
! dvc init

At DVC initialization, a new `.dvc/` directory is created for configuration:

In [None]:
! ls -a /tmp/dvc_demo | grep "^\."

In [None]:
! echo '====='
! cat /tmp/dvc_demo/.dvc/config

**Set remote**

```bash

dvc remote add <remote_name> <url>


```

In [None]:
! dvc remote add demo gs://pydata-dvc-demo-bucket/
! cat ./.dvc/config

In [None]:
! dvc remote list

In [None]:
! dvc remote default

In [None]:
! dvc remote default demo
! dvc remote default

In [None]:
! cat ./.dvc/config   # Show the current DVC configuration.
! dvc push

To start tracking a file or directory, use `dvc add`:

```bash

dvc add <file>.<something>


```

In [None]:
! mkdir -p data/test && echo "This is a test file." > ./data/test/simple_echo.csv
! dvc add ./data/test/simple_echo.csv

1. What does `dvc add` do / generate? 

1.1. `dvc add` --> `*.dvc` (Metadata)

In [None]:
! ls ./data/test/

In [None]:
! cat ./data/test/simple_echo.csv.dvc

1.2. `dvc add` --> `.dvc/cache` (Cache MD5 hash).

What just happened?

```

.dvc/cache

|
└── XY

    |
    └── abcdefghijk1234567890....f


```

In [None]:
! cat ./.dvc/cache/2d/282102fa671256327d4767ec23bc6b
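
The `XY/abcdef...` layout above can be sketched in a few lines of Python. This is a simplified model, not DVC's own code: it assumes the cache key is the MD5 of the raw file bytes, split into a two-character directory and a 30-character file name, as in the path shown in this talk. The `cache_path` helper is hypothetical, and newer DVC releases may organize the cache directory differently.

```python
import hashlib
from pathlib import Path


def cache_path(content: bytes, cache_dir: str = ".dvc/cache") -> Path:
    """Return where content would live in a content-addressed cache.

    Hypothetical helper for illustration only: the first two hex characters
    of the MD5 digest name the directory, the remaining 30 name the file.
    """
    digest = hashlib.md5(content).hexdigest()
    return Path(cache_dir) / digest[:2] / digest[2:]


example = cache_path(b"hello")
print(example.as_posix())
```

Identical content always maps to the same path, so a file that appears twice is stored once.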

1.3. `dvc add` --> `.gitignore` (ignore data files).

For files/directories to be excluded.

In [None]:
 ! cat ./data/test/.gitignore

In [None]:
! rm -rf data/test/
! ls data

1. `dvc add`:

- generate dvc file
- add item to cache
- add files to gitignore


2. Track `.dvc` files with git!

```bash

git add data/test/simple_echo.csv.dvc data/test/.gitignore

git commit -m "Add test data."


```

<h2 align='left'>Implementing DVC in Drug Discovery Pipelines
</h2>
<br>

1. Set up:
    
    - Install
    - Initialize
    - Set remote storage
    - Test adding files


2. Data versioning and model versioning.
3. Highlights

<h2 align='left'>Data versioning and model versioning.</h2>
<br>

**Drug Discovery Pipelines: challenges**

(1) Raw data, curated data.

(2) ML models.

(3) Metrics.


<br>
<img src="img/experiments-DVC.png" align='center' width=1100>


<h2 align='left'>Data versioning and model versioning.</h2>
<br>

Let's assume:
- I have some initial data --> `data/initial_data.csv`.
- I built an ML pipeline with:

1. Featurization
2. Processing
3. Training
4. Evaluation

In [None]:
! tree src

I want to record and reproduce all the steps to transform data to models.

<h2 align='left'>Data versioning and model versioning.</h2>
<br>

**< 1-min EDA**


In [None]:
import pandas as pd
data = pd.read_csv('./data/initial_data.csv')
data.head()

In [None]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.molSize = 400,400

mol = Chem.MolFromSmiles(data['smiles'][1])
mol

In [None]:
import pubchempy as pcp

id_ = pcp.get_compounds(data['smiles'][1], 'smiles')[0]
id_.synonyms[0]

<h2 align='left'>Data versioning and model versioning.</h2>
<br>

**First: Version initial data!**

In [None]:
! dvc add data/initial_data.csv

<h2 align='left'>Data versioning and model versioning.</h2>
<br>

**Second: build a pipeline**

1. Featurization
2. Processing
3. Training --> Metrics

**1. Featurization.**

Calculate chemical descriptors, which are then used as features for prediction.

Standard:

```python

from rdkit import Chem
from rdkit.Chem import Descriptors as Descript
from molvs import Standardizer as Std

# `smiles` holds the SMILES strings read from the input data.
for key, smile in enumerate(smiles):
    mol = Chem.MolFromSmiles(smile)
    mol = Std().standardize(mol)
    descriptor_list = []
    for element in dir(Descript):
        ...


```

Better:    

```

python <python_file> <input_file> <output_file>


```

```

python src/featurize.py data/initial_data.csv data/featurized/featurized_data.csv


```

Or better+ if we use YAML files to parse parameters.

<img src="img/DVClogo.svg" align='right' width=100>

- <h4 style="color:darkred;">Create pipeline by steps + parse params from YAML.</h4> 
- <h4 style="color:darkred;">Run step(s) in pipeline. </h4>

```

dvc stage add 


dvc repro


```

<h2 align='left'>Data versioning and model versioning.</h2>

If I already have:
- initial data versioned (`./data/initial_data.csv.dvc`) 
- modular code to featurize (`src/featurize.py`)
- parameters to parse via YAML (`params.yaml`)
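
As a rough sketch of the third point, a script can read its own section of `params.yaml` with PyYAML. The keys below (`featurize.max_number`, `process.threshold`, `process.filter_by`) are the parameter names used in this talk; the value of `max_number` and the `load_stage_params` helper are assumptions made for illustration, and the YAML is inlined as a string so the example is self-contained.

```python
import yaml  # PyYAML

# Stand-in for the contents of params.yaml (max_number value is hypothetical).
PARAMS_YAML = """
featurize:
  max_number: 200
process:
  threshold: 0
  filter_by: 0.95
"""


def load_stage_params(text: str, stage: str) -> dict:
    """Parse YAML text and return the parameters of one pipeline stage."""
    params = yaml.safe_load(text)
    return params[stage]


process_params = load_stage_params(PARAMS_YAML, "process")
print(process_params["filter_by"])
```

In a real script the text would come from reading `params.yaml` from disk instead of a string.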

**1. Featurization.**

```bash

# -n: name of the stage
# -d: dependencies
# last line: the command to run

dvc stage add -n featurize \
              -d src/featurize.py -d data/initial_data.csv \
              python src/featurize.py data/initial_data.csv data/featurized/featurized_data.csv


```

In [None]:
! dvc stage add -n featurize \
                -p featurize.max_number\
                -d src/featurize.py -d data/initial_data.csv \
                python src/featurize.py data/initial_data.csv data/featurized/featurized_data.csv
                

In [None]:
! head dvc.yaml

<center><h4 style="color:darkred;">  Create pipeline by steps + parse params from YAML.</h4> </center>

Next?

Run pipeline:

In [None]:
! dvc repro #takes some time

Now that we have the output from the featurization step, track it!

In [None]:
! dvc add data/featurized/featurized_data.csv

<center><h4 style="color:darkred;"> ✅ Run step(s) in pipeline. </h4></center>

<h2 align='left'>Data versioning and model versioning.</h2>

That's it?

No.

DVC represents a pipeline internally as a graph: 
 - nodes are stages.   
 - edges are directed dependencies.
 
```

dvc dag


```
**D**irected **A**cyclic **G**raph (DAG)
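
To make the graph idea concrete, here is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). It is not how DVC is implemented; it only shows that once stages and their dependencies form a DAG, a valid run order follows from a topological sort. The stage names are the ones used in this talk.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
pipeline = {
    "featurize": set(),
    "process": {"featurize"},
    "train": {"process"},
}

# A topological sort yields an order in which every stage runs
# only after all of its dependencies have run.
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)
```

A cycle in the graph would raise `graphlib.CycleError`, which is why the pipeline has to be acyclic.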

In [None]:
! dvc dag

<h2 align='left'>Data versioning and model versioning.</h2>
<br>

✅ Featurization

<br>


**2. Processing**

- Remove features that do not provide information.
- Remove features that are highly correlated.

Standard:

```python

import pandas as pd

correlation_matrix = df.corr()
newColumns = [df.columns[0]]
...

descriptors_df = remove_zeros(featurized_data, threshold=0)
filtered_data = filter_correlation(descriptors_df, filter_by=0.95)
...



```

Better:    
```


python src/process.py data/featurized/featurized_data.csv data/processed/processed_data.csv


```

- parse parameters:
    - `threshold`
    - `filter_by`
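
The two filtering steps can be sketched with pandas as below. These are hypothetical stand-ins, not the talk's `remove_zeros` and `filter_correlation`: the sketch assumes "no information" means a variance at or below `threshold`, and that one feature of each pair with absolute correlation above `filter_by` is dropped. The tiny descriptor table is made up for the example.

```python
import numpy as np
import pandas as pd


def drop_constant_features(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Keep only columns whose variance is above the threshold."""
    variances = df.var(numeric_only=True)
    return df[variances[variances > threshold].index]


def drop_correlated_features(df: pd.DataFrame, filter_by: float = 0.95) -> pd.DataFrame:
    """Drop the later column of every pair with |correlation| > filter_by."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > filter_by).any()]
    return df.drop(columns=to_drop)


# Made-up descriptors: one duplicated column and one constant column.
descriptors = pd.DataFrame({
    "mol_weight": [180.2, 151.2, 206.3, 194.2],
    "mol_weight_copy": [180.2, 151.2, 206.3, 194.2],
    "logp": [1.2, 0.5, 3.5, -0.1],
    "always_zero": [0.0, 0.0, 0.0, 0.0],
})

filtered = drop_correlated_features(drop_constant_features(descriptors), filter_by=0.95)
print(list(filtered.columns))
```

Dropping constant columns first avoids undefined correlations for features with zero variance.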

<h2 align='left'>Data versioning and model versioning.</h2>

In [None]:
! dvc stage add -n process \
                -p process.threshold,process.filter_by \
                -d src/process.py -d data/featurized/featurized_data.csv \
                python src/process.py data/featurized/featurized_data.csv data/processed/processed_data.csv

In [None]:
! tail -8 dvc.yaml

<h2 align='left'>Data versioning and model versioning.</h2>

In [None]:
! dvc repro

In [None]:
! dvc add data/processed/processed_data.csv

- What is that `dvc.lock` file?

It records the state of the pipeline: the hashes of each stage's dependencies and outputs, plus the parameter values used.
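
One way to picture what a lock file is for: compare the recorded hashes with the current ones, and rerun a stage only when something changed. The sketch below is a simplified model under that assumption, not DVC's implementation; the `fingerprint` and `needs_rerun` helpers and the in-memory "files" are hypothetical.

```python
import hashlib


def fingerprint(deps: dict[str, bytes]) -> dict[str, str]:
    """Map each dependency path to the MD5 of its content."""
    return {path: hashlib.md5(content).hexdigest() for path, content in deps.items()}


def needs_rerun(locked: dict[str, str], deps: dict[str, bytes]) -> bool:
    """A stage is stale when any dependency hash differs from the recorded one."""
    return fingerprint(deps) != locked


# In-memory stand-ins for the dependencies of the `process` stage.
deps = {
    "src/process.py": b"print('v1')",
    "data/featurized/featurized_data.csv": b"a,b\n1,2\n",
}
locked = fingerprint(deps)          # state recorded after a successful run
unchanged = needs_rerun(locked, deps)

deps["data/featurized/featurized_data.csv"] = b"a,b\n1,3\n"  # the data changes
changed = needs_rerun(locked, deps)
print(unchanged, changed)
```

This is the intuition behind `dvc repro` skipping stages whose inputs have not changed.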

<h2 align='left'>Data versioning and model versioning.</h2>


✅ Featurization

✅ Processing

**3. Training**

```python

from src.get_prediction import get_pipeline, predict

pipe = get_pipeline(filtered_data, labels)
y_predict = predict(pipe, X_train, y_train, method="predict")
...



```

In [None]:
! dvc stage add -n train --force \
                -p train.split,train.seed,train.n_est,featurize.max_number \
                -d src/train.py -d data/processed/processed_data.csv \
                python src/train.py data/processed/processed_data.csv models/model.pkl metrics/score

In [None]:
! tail dvc.yaml

In [None]:
! dvc repro

In [None]:
! dvc add models/model.pkl
! dvc add metrics/score.json

In [None]:
! dvc dag 

✅ Featurization  ✅ Processing  ✅ Training ✅ Metrics

In [None]:
! dvc exp run

In [None]:
! dvc exp run --force

<h2 align='left'>Isn't this cool?</h2>

Yes, it is!

<img src="img/noDVC_Dalle.png" align='center' width=900>
<br>

"data scientist struggling, not working with data version control, 1960s style, black and white"
<br>    
<img src="img/DVC_Dalle.png" align='center' width=900>
<br>

"data scientist succeeding, happy,  working with data version control, futuristic style, cyberpunk"



<img src="img/DVClogo.svg" align='left' width=100>
<br>
<h2 align='left'>Highlights (as a user)</h2>

- DVC == happier life.
- Tracks:
    - Data.
    - Models.
- Build and run ML pipelines (by steps).
- Cool DAGs.
- Manage experiments.
- Reproduce experiments.
- Compare experiments.

<img src="img/DVClogo.svg" align='left' width=100>
<br>
<h2 align='left'>Highlights (as a user, that I didn't mention)</h2>

- DVC --> Content addressable.
- Add custom metadata to `.dvc` file.
- Shared cache among projects.
- Python API.
- Amazing docs.
- Can be used with CI workflows.
- VS Code extension.
- Registry:
    - Data.
    - Models.

<img src="img/DVClogo.svg" align='left' width=100>
<br>
<br>
<h2 align='left'> In conclusion:</h2>

- DVC is a valuable tool in ML workflows.
    - Give it a try.
- We can't run away from complexity in biological data.



<h1>Acknowledgments</h1>

<img src="img/cyclica_logo.png" align='left'>
<br><br><br>

<img src="img/cyclicans.png" align='right'>

<br><br><br>

- Naheed Kurji, CEO.
- Stephen MacKinnon, CSO. 
- Ali Madani, Director of .
- Steve Constable, Director of Technology.
- Maria Elena Garcia, ML Team Lead.
- Federico Comitani, Comp. Sci.
- Daniella Lato, Comp. Sci.



<h4>www.cyclicarx.com</h4>


<h2 align='center'>Thanks!<br></h2>

<h2 align='center'>Questions?<br></h2>

