# Classify embeddings

This notebook can be used to recreate Fig. 4 from the manuscript. We compute and classify embeddings for several vector field datasets using a pretrained model. 

Our coding framework is based on a Click interface, and this notebook makes use of it by running the basic steps of the pipeline through shell commands.

**NOTE**: This notebook assumes you have a pretrained model. You can train a model using the notebook `train_and_eval`.

In [1]:
import os
import subprocess
from src.utils import get_command_defaults, ensure_dir, write_yaml, update_yaml
from src.train import load_model, train_model, run_epoch
from src.data import load_dataset, SystemFamily

## Generate data
First, we generate the relevant evaluation data. Three datasets are immediately available here: (1) linear (e.g. fixed point classification), (2) conservative vs non-conservative and (3) incompressible vs compressible. The setting active below combines incompressible, conservative and polynomial systems into a single three-class problem. Comment the `eval_data_names` definitions in or out to switch between experiments.

We begin by setting the data generation parameters.

In [2]:
## Generate evaluation data

data_dir = '/home/mgricci/data/phase2vec' # Alter to change where all of the phase2vec data will be saved. 

# Switch between evaluation sets here
# eval_data_names  = ['linear']
# eval_data_names  = ['conservative', 'polynomial']
# eval_data_names  = ['incompressible', 'polynomial']
eval_data_names  = ['incompressible', 'conservative','polynomial']

data_set_name = 'physics_3class'

# Get class names
eval_system_classes = []
for system in [SystemFamily(data_name=name) for name in eval_data_names]:
    eval_system_classes += [system.data_name + ' ' + str(i) for i in range(system.num_classes)]

num_eval_classes = len(eval_system_classes)

# Edit the number of total samples from each data set here.
# By default, each set is further divided into a training and a validation set at a 75/25 split. This can be altered below.
num_eval_samples = 1000 # total number of train/val samples
device            = 'cpu' # set to 'cuda' if a GPU is available

eval_samplers    = ['uniform'] * len(eval_data_names)
eval_props       = [str(1. / len(eval_data_names))] * len(eval_data_names)
eval_system_names   = '-s ' +  ' -s '.join(eval_data_names)
eval_samplers     = '-sp ' +  ' -sp '.join(eval_samplers)
eval_props = '-c ' +  ' -c '.join(eval_props)
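
The three string manipulations above all follow the same pattern: a Click option that may be passed several times is repeated once per value. A minimal sketch of that pattern is shown here; the helper name `repeat_flag` is hypothetical and not part of `phase2vec`, while the flags `-s`, `-sp` and `-c` are the ones used in this notebook.

```python
def repeat_flag(flag, values):
    """Return 'flag v1 flag v2 ...' for a CLI option that may be given several times."""
    return ' '.join(f'{flag} {value}' for value in values)


names = ['incompressible', 'conservative', 'polynomial']

# One system flag, one sampler flag and one proportion flag per dataset.
system_args = repeat_flag('-s', names)
sampler_args = repeat_flag('-sp', ['uniform'] * len(names))
prop_args = repeat_flag('-c', [str(1. / len(names))] * len(names))

print(system_args)
print(sampler_args)
print(prop_args)
```

Keeping the list of names separate from the formatted strings also avoids reusing one variable (such as `eval_samplers`) for both a list and a string.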

Next, we run the shell command that generates the data. It creates a directory named after `data_set_name` (here `physics_3class`) inside your `data_dir`, which holds the train and test sets loaded later in this notebook.

In order to alter the validation proportion, $p$, add the flag `--val-size <p>` where $p\in (0,1)$. 
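
As an illustration of where the `--val-size` flag would go, the sketch below assembles the command as an argument list instead of a single shell string. The helper `build_generate_command` and the path `/tmp/phase2vec` are hypothetical; the option names are the ones that appear in this notebook, and the sampler and proportion flags are omitted for brevity.

```python
import shlex


def build_generate_command(data_dir, data_set_name, num_samples, system_names, val_size=None):
    """Assemble a generate-dataset call as an argument list (hypothetical helper)."""
    cmd = ['phase2vec', 'generate-dataset',
           '--data-dir', data_dir,
           '--data-set-name', data_set_name,
           '--num-samples', str(num_samples)]
    if val_size is not None:
        # The validation proportion p must satisfy 0 < p < 1.
        if not 0 < val_size < 1:
            raise ValueError('val_size must lie strictly between 0 and 1')
        cmd += ['--val-size', str(val_size)]
    for name in system_names:
        cmd += ['-s', name]
    return cmd


cmd = build_generate_command('/tmp/phase2vec', 'physics_3class', 1000,
                             ['incompressible', 'conservative', 'polynomial'],
                             val_size=0.25)
print(shlex.join(cmd))
```

A list like this can be passed to `subprocess.call(cmd)` without `shell=True`, which sidesteps quoting problems if a path contains spaces.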

In [3]:
subprocess.call(f'phase2vec generate-dataset --data-dir {data_dir} --data-set-name {data_set_name} --num-samples {num_eval_samples} {eval_samplers} {eval_props} {eval_system_names}', shell=True)

Generating incompressible data.
Generating conservative data.
Generating polynomial data.


0

## Load `phase2vec` encoder

We load the embedding CNN. By default, the net is saved in the folder `basic_train`, which is the default directory given in the `train_and_eval` notebook. 

* **model_type** (str): which of the pre-built architectures from `_models.py` to load. You can make your own by combining modules from `_modules.py`.
* **latent_dim** (int): embedding dimension
* Continue...

In [10]:
## Set net parameters
from src.cli._cli import generate_net_config

net_info = get_command_defaults(generate_net_config)
model_type = net_info['net_class']
model_save_dir  = os.path.join('/home/mgricci/phase2vec/', 'basic_train')

# These parameters are not considered architectural parameters for the net, so we delete them before they're passed to the net builder. 
del net_info['net_class']
del net_info['output_file']
del net_info['pretrained_path']
del net_info['ae']


net = load_model(model_type, pretrained_path=os.path.join(model_save_dir, 'model.pt'), device=device, **net_info).to(device)
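
The `del` statements above remove bookkeeping entries from the configuration before it reaches the net builder. One possible alternative, sketched here, separates the two groups without mutating the original dictionary. The helper `split_net_config` is hypothetical and the example values are placeholders; only the key names are taken from this notebook.

```python
NON_ARCHITECTURAL_KEYS = ('net_class', 'output_file', 'pretrained_path', 'ae')


def split_net_config(config):
    """Separate bookkeeping entries from the keyword arguments meant for the net builder."""
    bookkeeping = {k: config[k] for k in NON_ARCHITECTURAL_KEYS if k in config}
    architecture = {k: v for k, v in config.items() if k not in NON_ARCHITECTURAL_KEYS}
    return bookkeeping, architecture


# Placeholder values for illustration only; the real defaults come from get_command_defaults.
example_config = {'net_class': 'ExampleNet', 'output_file': 'net.yaml',
                  'pretrained_path': None, 'ae': False, 'latent_dim': 100}

bookkeeping, architecture = split_net_config(example_config)
print(bookkeeping['net_class'])
print(architecture)
```

Because a missing key is simply skipped, this variant does not raise a `KeyError` if one of the bookkeeping entries is absent.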

## Load data and compute embeddings


In [5]:
# Where is evaluation data stored? 
eval_data_path = os.path.join(data_dir, data_set_name)

# Load evaluation data. 
X_train, X_test, y_train, y_test, p_train, p_test = load_dataset(eval_data_path)

Now, we compute and save the embeddings.

In [6]:
import numpy as np

results_dir = f'/home/mgricci/results/phase2vec/{eval_data_names[0]}'
ensure_dir(results_dir)

for name, data, labels, pars in zip(['train', 'test'], [X_train, X_test], [y_train, y_test], [p_train, p_test]):

    losses, embeddings = run_epoch(data, labels, pars,
                               net, 0, None,
                               train=False,
                               device=device,
                               return_embeddings=True)


    np.save(os.path.join(results_dir,f'embeddings_{name}.npy'), embeddings.detach().cpu().numpy())
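
Before classifying, it can be useful to confirm that the saved files load back with the expected shapes. The sketch below demonstrates the `np.save` / `np.load` round trip on random stand-in arrays written to a temporary directory. The array shapes (750 and 250 samples, embedding dimension 100) are assumptions for illustration, not values read from the model.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Stand-in arrays of shape (num_samples, latent_dim); real embeddings come from run_epoch.
fake_embeddings = {'train': rng.normal(size=(750, 100)),
                   'test': rng.normal(size=(250, 100))}

with tempfile.TemporaryDirectory() as tmp_results_dir:
    # Save with the same file naming scheme as the cell above.
    for name, emb in fake_embeddings.items():
        np.save(os.path.join(tmp_results_dir, f'embeddings_{name}.npy'), emb)
    # Load everything back while the temporary directory still exists.
    reloaded = {name: np.load(os.path.join(tmp_results_dir, f'embeddings_{name}.npy'))
                for name in fake_embeddings}

for name, emb in reloaded.items():
    print(name, emb.shape, np.array_equal(emb, fake_embeddings[name]))
```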


In [7]:
clf_command = f'phase2vec classify {eval_data_path} --feature-name embeddings --classifier logistic_regressor --results-dir {results_dir}'
subprocess.call(clf_command,shell=True)

0: {'precision': 0.9382716049382716, 'recall': 0.9156626506024096, 'f1-score': 0.9268292682926829, 'support': 83}
1: {'precision': 0.898876404494382, 'recall': 0.9523809523809523, 'f1-score': 0.9248554913294796, 'support': 84}
2: {'precision': 1.0, 'recall': 0.963855421686747, 'f1-score': 0.9815950920245399, 'support': 83}
accuracy: 0.944
macro avg: {'precision': 0.9457160031442179, 'recall': 0.943966341556703, 'f1-score': 0.9444266172155675, 'support': 250}
weighted avg: {'precision': 0.9455286447496185, 'recall': 0.944, 'f1-score': 0.9443483327120231, 'support': 250}


0
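
The report printed above lists precision, recall, F1-score and support for each class, followed by the overall accuracy and the macro and weighted averages. How `phase2vec classify` computes these internally is not shown in this notebook; the sketch below uses the standard definitions on a small toy example, with a hypothetical helper `classification_summary`.

```python
from collections import Counter


def classification_summary(y_true, y_pred):
    """Per-class precision/recall/F1/support plus accuracy, macro and weighted averages."""
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)  # how often each class was predicted
    support = Counter(y_true)    # how often each class truly occurs
    total = len(y_true)

    report = {}
    for label in labels:
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / support[label] if support[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {'precision': precision, 'recall': recall,
                         'f1-score': f1, 'support': support[label]}

    report['accuracy'] = sum(true_pos.values()) / total
    metrics = ('precision', 'recall', 'f1-score')
    # Macro average: plain mean over classes. Weighted average: mean weighted by support.
    report['macro avg'] = {m: sum(report[c][m] for c in labels) / len(labels) for m in metrics}
    report['weighted avg'] = {m: sum(report[c][m] * support[c] for c in labels) / total
                              for m in metrics}
    return report


# Toy labels: one sample of class 0 is misclassified as class 1.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]

report = classification_summary(y_true, y_pred)
for key, value in report.items():
    print(f'{key}: {value}')
```

With these definitions the weighted-average recall always equals the accuracy, which matches the value 0.944 appearing in both places in the report above.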