# DataJoint tutorial

We run a lot of experiments, and keeping track of them is hard (at least for me).
![](figures/folders.png)

## What do we want from an experiment tracking tool?

### Storage and querying

- Make sure all the results are accessible and nothing is lost.
- Maybe raw data storage should be supported as well.
- Easy querying. For example, something like this is inconvenient:
```python
for param1 in list_of_param1_values:
    for param2 in list_of_param2_values:
        for param3 in list_of_param3_values:
            results_file_path = get_file_path(root_folder, param1, param2, param3)
            results = load_results(results_file_path)
            ...
```
- Conditions such as $0.5 \leq \text{param1} \leq 1.5$ and $\text{param2} \geq 10 \cdot \text{param1}$ have to be hand-coded as `if...else` statements inside the loops.
- A table might be a better option.

![](figures/table.png)

- But a table has to be kept up to date manually and stored somewhere. And how do we store different types of data? As links to files?
- What about pipelines, where the results of one experiment are inputs to the next one?

### There is a good solution to the storage and querying problem

Databases! For example, relational databases, which are queried with SQL.

![](figures/database_cartoon.png)

![](figures/params_database.jpg)

Querying is much easier than manually looping over the file system:
```sql
SELECT res1
FROM results
    JOIN param1 ON (param1.id = results.id_p1)
    JOIN param2 ON (param2.id = results.id_p2)
WHERE
    param1.value >= 0.5 AND 
    param1.value <= 1.5 AND 
    param2.value >= param1.value * 10
```
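To try the query above without setting up a database server, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout follows the query; the parameter values and the `res1` numbers are made up for illustration.

```python
import sqlite3

# In-memory database; all rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE param1 (id INTEGER PRIMARY KEY, value REAL);
    CREATE TABLE param2 (id INTEGER PRIMARY KEY, value REAL);
    CREATE TABLE results (
        id_p1 INTEGER REFERENCES param1(id),
        id_p2 INTEGER REFERENCES param2(id),
        res1  REAL
    );
""")
conn.executemany("INSERT INTO param1 VALUES (?, ?)", [(1, 0.1), (2, 1.0), (3, 2.0)])
conn.executemany("INSERT INTO param2 VALUES (?, ?)", [(1, 5.0), (2, 20.0)])
# One (fake) result per combination of param1 and param2
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [(p1, p2, 100 * p1 + p2) for p1 in (1, 2, 3) for p2 in (1, 2)],
)

query = """
    SELECT res1
    FROM results
        JOIN param1 ON (param1.id = results.id_p1)
        JOIN param2 ON (param2.id = results.id_p2)
    WHERE
        param1.value >= 0.5 AND
        param1.value <= 1.5 AND
        param2.value >= param1.value * 10
"""
selected = [row[0] for row in conn.execute(query)]
print(selected)
```

Only the combination with `param1.value = 1.0` and `param2.value = 20.0` satisfies all three conditions, so a single result is returned.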

- DataJoint is an easy-to-use Python wrapper on top of an SQL database.

### Computation

- DataJoint is not just an SQL database wrapper. It also allows us to define a computational pipeline (a bit like a Makefile).
- The database tables can be populated automatically based on the data stored in the tables they depend on.
- The auto-population jobs are tracked in a shared table, which enables parallel computation (see later).
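The idea behind auto-population can be sketched in a few lines of plain Python. This is only a toy model and not how DataJoint is implemented: the dictionaries stand in for tables, the values are made up, and `make`, `populate` and `max_calls` merely mirror the names used later in this tutorial.

```python
from itertools import product

# Toy stand-ins for two upstream parameter tables (values are made up).
learning_rates = {1: 1e-3, 2: 1e-2}   # lr_config_id -> lr
num_epochs = {1: 1, 2: 10}            # epochs_config_id -> epochs

# Toy stand-in for a computed table: primary key -> computed result.
computed = {}


def make(key):
    """Compute the entry for one key; a real pipeline would train a model here."""
    lr_id, epochs_id = key
    computed[key] = learning_rates[lr_id] * num_epochs[epochs_id]


def populate(max_calls=None):
    """Call make() for every upstream key combination that has no result yet."""
    pending = [k for k in product(learning_rates, num_epochs) if k not in computed]
    if max_calls is not None:
        pending = pending[:max_calls]
    for key in pending:
        make(key)
    return len(pending)


first = populate(max_calls=1)   # computes one missing entry
second = populate()             # computes the remaining three
third = populate()              # nothing left to do
print(first, second, third)
```

Calling `populate` repeatedly is safe because keys that already have a result are skipped, which is what makes the Makefile comparison apt.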

## An example

Let's train a neural network on MNIST digits and use DataJoint to keep track of the results.

In [None]:
import torch.optim as optim
from datajoint_tutorial.torch_network import Net, get_dataloaders, train, test

In [None]:
net = Net(num_features_1=32, num_features_2=64, dropout_prob=0.25)

In [None]:
train_loader, test_loader = get_dataloaders(batch_size=4)

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Grab a single batch to look at
imgs, labels = next(iter(train_loader))
    
plt.figure(figsize=(10, 3))

for i in range(4):
    plt.subplot(1, 4, i + 1)
    plt.imshow(imgs[i,0], cmap="gray")
    plt.axis("off")

In [None]:
net_output = net(imgs)
net_output.shape

### Setting up DataJoint schema

In [None]:
import datajoint as dj

In [None]:
# Database connection

dj.config['database.host'] = 'localhost'
dj.config['database.user'] = 'root'
dj.config['database.password'] = 'password'

In [None]:
# schema.drop()

In [None]:
# Schema is a collection of tables

schema = dj.schema('tutorial', locals())

Our model contains the following parameters:

- Number of features in the first and second layers
- Dropout probability
- Optimizer learning rate
- Number of optimization epochs

In [None]:
@schema
class NumFeatures(dj.Manual):
    definition = """
    features_config_id  : tinyint # so-called primary key, must be unique
    ---
    num_features_1      : int
    num_features_2      : int
    """

In [None]:
NumFeatures()

In [None]:
NumFeatures().insert1([1, 32, 32])

In [None]:
NumFeatures()

In [None]:
NumFeatures().insert([
    [2, 16, 32],
    [3, 32, 64],
    [4, 64, 64]
])

In [None]:
NumFeatures()

In [None]:
# Restrict the table with an SQL-style condition string
some_features = NumFeatures() & 'num_features_1 > 20'

In [None]:
some_features

In [None]:
some_features = NumFeatures() & 'num_features_1 > num_features_2'

In [None]:
some_features

In [None]:
some_features = NumFeatures() & 'num_features_1 = num_features_2'

In [None]:
some_features

In [None]:
some_features.fetch()

In [None]:
some_features.fetch(as_dict=True)

In [None]:
some_features.fetch(format="frame")

In [None]:
# Restrict with a dict; keys that are not attributes of the table (here `some_key`) are ignored
some_features = NumFeatures() & dict(features_config_id=1, some_key="some_value")

In [None]:
some_features

In [None]:
some_features.fetch1()

In [None]:
@schema
class DropoutProb(dj.Lookup):
    definition = """
    dropout_config_id  : tinyint # so-called primary key, must be unique
    ---
    dropout_prob       : float
    """
        
    contents = [[1, 0.25], [2, 0.5]]

In [None]:
DropoutProb()

In [None]:
@schema
class LearningRate(dj.Lookup):
    definition = """
    lr_config_id  : tinyint # so-called primary key, must be unique
    ---
    lr            : float
    """
        
    contents = [[1, 1e-3], [2, 1e-2]]

In [None]:
LearningRate()

In [None]:
@schema
class NumEpochs(dj.Lookup):
    definition = """
    epochs_config_id  : tinyint # so-called primary key, must be unique
    ---
    epochs            : int
    """
        
    contents = [[1, 1], [2, 10], [3, 50]]

In [None]:
NumEpochs()

In [None]:
NumEpochs() & dict(epochs_config_id=2)

In [None]:
# A list of conditions is combined with a logical OR
NumEpochs() & [dict(epochs_config_id=2), dict(epochs_config_id=3)]

In [None]:
# Condition strings and dicts can be mixed in the same list
NumEpochs() & ['epochs_config_id=2', dict(epochs_config_id=3)]

In [None]:
(NumEpochs() & [dict(epochs_config_id=2), dict(epochs_config_id=3)]).delete()

In [None]:
NumEpochs()

In [None]:
dj.Diagram(schema)

In [None]:
@schema
class Train(dj.Computed):
    definition = """
    -> NumFeatures
    -> DropoutProb
    -> LearningRate
    -> NumEpochs
    ---
    train_loss      : float
    model_weights   : longblob
    """
        
    def make(self, key):
        pass  # placeholder; the real computation is defined further down

In [None]:
Train()

In [None]:
dj.Diagram(schema)

In [None]:
# `*` joins two tables on their shared attributes
NumFeatures() * Train()

In [None]:
# Drop the placeholder table so that it can be redefined with a real `make`
Train().drop()

In [None]:
NumFeatures()

In [None]:
@schema
class Train(dj.Computed):
    definition = """
    -> NumFeatures
    -> DropoutProb
    -> LearningRate
    -> NumEpochs
    ---
    train_loss      : float
    """
    
    class Weights(dj.Part):
        definition = """  # weights of the trained model
        -> Train
        layer    : varchar(64)   # layer name
        ---
        weights  : longblob      # numpy array of model weights
        """
        
    def make(self, key):
        train_loader, test_loader = get_dataloaders(batch_size=64)
        
        num_features_1, num_features_2 = (NumFeatures() & key).fetch1("num_features_1", "num_features_2")
        dropout_prob = (DropoutProb() & key).fetch1("dropout_prob")
        lr = (LearningRate() & key).fetch1("lr")
        num_epochs = (NumEpochs() & key).fetch1("epochs")
        
        model = Net(num_features_1=num_features_1, num_features_2=num_features_2, dropout_prob=dropout_prob)
        optimizer = optim.Adam(model.parameters(), lr=lr)
        
        for epoch in range(1, num_epochs + 1):
            loss = train(model, train_loader, optimizer, epoch)
            
        # Insert the master row first, then one part row per layer
        self.insert1(dict(key, train_loss=float(loss.detach().numpy())))
        
        for layer_name, layer_weights in model.state_dict().items():
            self.Weights.insert1(
                dict(key, layer=layer_name, weights=layer_weights.cpu().numpy())
            )

In [None]:
Train().progress(display=True)

In [None]:
# Compute a single missing entry
Train().populate(max_calls=1)

In [None]:
Train().populate(max_calls=1)

In [None]:
Train()

In [None]:
Train().progress(display=True)

In [None]:
Train().Weights()

In [None]:
Train().Weights() & dict(layer="conv2.bias")

In [None]:
(Train().Weights() & dict(layer="conv2.bias")).fetch1("weights")

In [None]:
Train().Weights().fetch(as_dict=True)

### Parallel jobs

In [None]:
schema.jobs

In [None]:
print(schema.jobs.fetch("error_stack", limit=1)[0])

In [None]:
schema.jobs.fetch("key")
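As far as I understand, `schema.jobs` is only filled when `populate` is called with `reserve_jobs=True`. Each worker then reserves a key in the jobs table before computing it, so several processes can populate the same table without duplicating work. The sketch below imitates that idea with Python's built-in `sqlite3`; the table layout and the names `reserve`, `worker_a` and `worker_b` are hypothetical and this is not DataJoint's actual implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key guarantees that each job key can be reserved only once.
conn.execute("CREATE TABLE jobs (job_key TEXT PRIMARY KEY, worker TEXT, status TEXT)")


def reserve(job_key, worker):
    """Try to claim a job; return False if another worker already holds it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO jobs VALUES (?, ?, 'reserved')", (job_key, worker))
        return True
    except sqlite3.IntegrityError:
        return False


pending = ["config_1", "config_2", "config_3"]
# Two simulated workers visit the same keys in different orders, taking turns.
order = {"worker_a": pending, "worker_b": list(reversed(pending))}
done = {"worker_a": [], "worker_b": []}

for step in range(len(pending)):
    for worker in ("worker_a", "worker_b"):
        job_key = order[worker][step]
        if reserve(job_key, worker):
            done[worker].append(job_key)  # a real worker would run make() here

print(done)
```

Every key ends up being processed by exactly one worker, because the second attempt to reserve the same key is rejected by the database.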

In [None]:
Train()

In [None]:
# Primary key of the run with the lowest training loss
best_loss_key = Train().fetch("KEY", order_by="train_loss", limit=1)

In [None]:
best_loss_key

In [None]:
Train() & best_loss_key

In [None]:
Train() * NumFeatures() * DropoutProb() & best_loss_key

In [None]:
Train().Weights() & best_loss_key

### Results consistency

In [None]:
DropoutProb()

In [None]:
Train()

In [None]:
DropoutProb() & "dropout_prob = 0.5"

In [None]:
(DropoutProb() & "dropout_prob = 0.5").delete()

In [None]:
Train()
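Deleting a parameter entry also removes the `Train` results that were computed from it, so the database never holds results whose inputs no longer exist. The same effect can be sketched with a plain foreign key in `sqlite3`. This is only an analogy under my own assumptions: the lower-case table names and the loss values are made up, and DataJoint handles the cascade itself rather than through this exact SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE dropout_prob (
        dropout_config_id INTEGER PRIMARY KEY,
        dropout_prob      REAL
    );
    CREATE TABLE train (
        dropout_config_id INTEGER PRIMARY KEY
            REFERENCES dropout_prob(dropout_config_id) ON DELETE CASCADE,
        train_loss        REAL
    );
""")
conn.executemany("INSERT INTO dropout_prob VALUES (?, ?)", [(1, 0.25), (2, 0.5)])
conn.executemany("INSERT INTO train VALUES (?, ?)", [(1, 0.31), (2, 0.42)])  # made-up losses

# Removing a parameter row also removes the result that depends on it.
conn.execute("DELETE FROM dropout_prob WHERE dropout_prob = 0.5")
remaining = conn.execute("SELECT dropout_config_id FROM train").fetchall()
print(remaining)
```

After the delete, only the result for `dropout_config_id = 1` is left in the toy `train` table.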