# Classify embeddings

This notebook can be used to recreate Fig. 4 from the manuscript. We compute and classify embeddings for several vector field datasets using a pretrained model. 

Our coding framework is built on a Click command-line interface, so this notebook runs the basic steps of the pipeline as shell commands.

**NOTE**: This notebook assumes you have a pretrained model. You can train a model using the notebook `train_and_eval`.

In [1]:
import os
import torch
import subprocess
from src.utils import get_command_defaults, ensure_dir, write_yaml, update_yaml
from src.train import load_model, train_model, run_epoch
from src.data import load_dataset, SystemFamily

## Load data
First, we prepare the relevant evaluation data. Three datasets are immediately available here: (1) linear (e.g., fixed-point classification), (2) conservative vs. non-conservative, and (3) incompressible vs. compressible. You can comment out the `eval_data_names` definitions to switch between experiments.

We begin by setting the data generation parameters.

In [4]:
data_dir = '/home/mgricci/data/phase2vec'  # root directory for all datasets
data_set = 'koppen'
eval_data_path = os.path.join(data_dir, data_set)
time_avg = True
coarse_labeling = True

device = 'cpu'

Next, we call the actual shell commands for generating the data. These commands will make two directories, called `polynomial` and `classical`, corresponding to train and test sets, inside your `data_dir`. 

To change the validation proportion $p$, add the flag `--val-size <p>`, where $p \in (0,1)$.
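
To make the role of $p$ concrete, here is a minimal sketch of a random train/validation split with validation proportion `p`. The helper `split_by_proportion` is hypothetical and written for illustration only; it is not the code behind `--val-size`, whose exact behavior (shuffling, rounding) may differ.

```python
import numpy as np

def split_by_proportion(n_samples, p, seed=0):
    """Return (train_idx, val_idx) with roughly a fraction p of samples held out."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)      # shuffled sample indices
    n_val = int(round(p * n_samples))      # number of validation samples
    return perm[n_val:], perm[:n_val]

train_idx, val_idx = split_by_proportion(1000, p=0.2)
print(len(train_idx), len(val_idx))
```

The two index arrays are disjoint by construction, since they are slices of a single permutation.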

In [3]:
# Download data here? 

In [8]:
# Create one subdirectory per month (0-11); exist_ok avoids an error on re-runs.
for i in range(12):
    os.makedirs(os.path.join(eval_data_path, str(i)), exist_ok=True)

## Load `phase2vec` encoder

We load the embedding CNN. The net is expected in the folder `basic_train`, which is the default directory given in the `train_and_eval` notebook.

* **model_type** (str): which of the pre-built architectures from `_models.py` to load. You can build your own by combining modules from `_modules.py`.
* **latent_dim** (int): embedding dimension
* Continue...

In [3]:
## Set net parameters
from src.cli._cli import generate_net_config

net_info = get_command_defaults(generate_net_config)
model_type = net_info['net_class']
model_save_dir = os.path.join('/home/mgricci/phase2vec/', 'basic_train')

# These parameters are not considered architectural parameters for the net, so we delete them before they're passed to the net builder. 
del net_info['net_class']
del net_info['output_file']

net = load_model(model_type, pretrained_path=os.path.join(model_save_dir, 'model.pt'), device=device, **net_info).to(device)
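
The pattern used above, removing non-architectural keys before forwarding the rest as keyword arguments, can also be written without mutating the dictionary. The sketch below uses made-up stand-in values (the real keys and defaults come from `get_command_defaults(generate_net_config)`), and `split_config` is a hypothetical helper, not part of the `phase2vec` code base.

```python
def split_config(config, non_arch_keys):
    """Split a config dict into (non-architectural, architectural) parts without mutating it."""
    meta = {k: v for k, v in config.items() if k in non_arch_keys}
    arch = {k: v for k, v in config.items() if k not in non_arch_keys}
    return meta, arch

# Made-up stand-in values for illustration; the real defaults come from the CLI.
example_info = {"net_class": "ExampleNet", "output_file": "net.yaml", "latent_dim": 100}
meta, arch_kwargs = split_config(example_info, {"net_class", "output_file"})
print(meta["net_class"], arch_kwargs)
```

Here `arch_kwargs` plays the role of `net_info` after the two `del` statements, while the original dictionary stays intact for later inspection.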

## Load data and compute embeddings


In [6]:
import numpy as np

# Load the evaluation data stored at `eval_data_path`.
X_train, X_test, y_train, y_test, p_train, p_test = load_dataset(eval_data_path)

# for i, (name, data, labels, pars) in enumerate(zip(['train', 'test'], [X_train, X_test],[y_train, y_test],[p_train, p_test])):
#     for month in range(12):
#         save_path = os.path.join(eval_data_path, str(month))
#         monthly_data = data[:,month]
#         monthly_pars = pars[:,month]
#         np.save(os.path.join(save_path, f'X_{name}.npy'), monthly_data)
#         np.save(os.path.join(save_path, f'y_{name}.npy'), labels)
#         np.save(os.path.join(save_path, f'p_{name}.npy'), monthly_pars)

In [7]:
y_train

array([30., 27., 29., ...,  4., 29., 26.], dtype=float32)
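
The labels are stored as `float32` values even though they act as class identifiers. A quick way to inspect the class balance is to cast them to integers and count occurrences. The array below is a small made-up stand-in for `y_train`, not real data.

```python
import numpy as np

# Made-up stand-in for y_train: float32 class identifiers.
y_example = np.array([30., 27., 29., 4., 29., 26., 29.], dtype=np.float32)

# np.unique returns the sorted distinct labels and how often each occurs.
classes, counts = np.unique(y_example.astype(int), return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} samples")
```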

Now, we compute and save the embeddings.

In [3]:
import numpy as np
import gc

results_dir = '/home/mgricci/results/phase2vec/koppen'
ensure_dir(results_dir)


# for i, (name, data, labels, pars) in enumerate(zip(['train', 'test'], [X_train, X_test],[y_train, y_test],[p_train, p_test])):
#     embeddings = []
#     for month in range(3):
#         print(month)
#         monthly_data = data[:,month]
#         monthly_pars = pars[:,month]

#         losses, monthly_embeddings = run_epoch(monthly_data, labels, monthly_pars,
#                                    net, 0, None,
#                                    train=False,
#                                    device=device,
#                                    batch_size=64,
#                                    return_embeddings=True)
#         embeddings.append(monthly_embeddings)
#         gc.collect()
#     embeddings = torch.stack(embeddings)
#     np.save(os.path.join(results_dir,f'embeddings_{name}.npy'), embeddings.mean(0).detach().cpu().numpy())


'/home/mgricci/results/phase2vec/koppen'
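
The commented-out loop above stacks one embedding matrix per month and averages over the month axis, yielding one time-averaged vector per sample. The shape bookkeeping can be checked on random data. This sketch uses NumPy in place of `torch` and made-up sizes; it does not call `run_epoch`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_months, n_samples, latent_dim = 3, 5, 4   # made-up sizes for illustration

# One (n_samples, latent_dim) matrix per month, standing in for the encoder output.
monthly = [rng.normal(size=(n_samples, latent_dim)) for _ in range(n_months)]

stacked = np.stack(monthly)            # shape: (n_months, n_samples, latent_dim)
time_avg_emb = stacked.mean(axis=0)    # shape: (n_samples, latent_dim)
print(stacked.shape, time_avg_emb.shape)
```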

In [5]:
# Classify the saved embeddings through the CLI; a non-zero return value signals failure.
clf_command = f'phase2vec classify {eval_data_path} --feature-name embeddings --classifier logistic_regressor --results-dir {results_dir}'
subprocess.call(clf_command, shell=True)

Traceback (most recent call last):
  File "/home/mgricci/anaconda3/envs/phase2vec/bin/phase2vec", line 33, in <module>
    sys.exit(load_entry_point('phase2vec', 'console_scripts', 'phase2vec')())
  File "/home/mgricci/anaconda3/envs/phase2vec/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/mgricci/anaconda3/envs/phase2vec/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/mgricci/anaconda3/envs/phase2vec/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mgricci/anaconda3/envs/phase2vec/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mgricci/anaconda3/envs/phase2vec/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/mgricci/projects

1
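
The return value `1` above means the command exited with an error, and the traceback shown is cut off. One way to make such failures easier to diagnose is to capture the child process's output with `subprocess.run`. In this sketch a short Python one-liner stands in for the real `phase2vec` call so that it can run anywhere; `run_and_report` is a hypothetical helper.

```python
import subprocess
import sys

def run_and_report(args):
    """Run a command and return (return code, captured stderr text)."""
    proc = subprocess.run(args, capture_output=True, text=True)
    if proc.returncode != 0:
        print(f"command failed with code {proc.returncode}:\n{proc.stderr}")
    return proc.returncode, proc.stderr

# Stand-in command that writes to stderr and exits with status 1.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(1)"]
code, err = run_and_report(failing)
```

Passing the arguments as a list avoids `shell=True` and the quoting problems that arise when paths contain spaces.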

In [22]:
labels.shape

(2000,)

In [15]:
gc.collect()  # the `gc` module has no `gc()` function; `collect()` runs a garbage collection