## 1. Pre-training the Auto-Encoder

In this notebook, we pre-train the auto-encoder for reconstruction. In their paper, Carr et al. define the motion encoder `E_M` and the privacy encoder `E_P`, which share the architecture shown below.

![Encoder Architecture](figures/encoder.png)

Here, the input size 75 is the number of frames `T` in each sequence of 3D joint coordinates in the dataset. `E_M` outputs a motion embedding and `E_P` outputs a privacy embedding. The acronyms in the figure correspond to PyTorch modules:

- C2D: Convolution 2D
- LR: Leaky ReLU
- Up: Upsample
- MP: Max Pooling
- RP2D: Reflection Pad 2D
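
To make the effect of each layer type concrete, the following minimal sketch applies each one to a dummy tensor and records the resulting shape. The `(batch, channels, frames, joints)` layout, the channel counts, and the `Upsample` scale factor are assumptions chosen for illustration; the sketch does not reproduce the layer order from the figure.

```python
import torch
import torch.nn as nn

# Hypothetical input: 1 sequence, 3 channels, 75 frames, 25 joints (layout is an assumption).
x = torch.randn(1, 3, 75, 25)

layers = {
    "C2D": nn.Conv2d(3, 12, kernel_size=3),        # no padding: height and width shrink by 2
    "LR": nn.LeakyReLU(negative_slope=0.01),       # element-wise: shape unchanged
    "Up": nn.Upsample(scale_factor=2),             # height and width double
    "MP": nn.MaxPool2d(kernel_size=3, stride=1),   # stride 1: height and width shrink by 2
    "RP2D": nn.ReflectionPad2d(1),                 # one pixel per side: height and width grow by 2
}

# Apply every layer to the same input and keep the output shapes.
shapes = {name: tuple(layer(x).shape) for name, layer in layers.items()}
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

With stride 1 and a 3x3 window, a reflection pad of 1 exactly offsets the shrinkage of one convolution or one pooling step, which is presumably why the three appear together in the encoder.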

This stage includes 5 epochs of paired pre-training, followed by 20 epochs of unpaired pre-training.

### 1.1 Paired Pre-training

To separate the embeddings, we first pre-train the auto-encoder in the paired setting for 5 epochs.

#### 1.1.1 Paired Data Loading

In the paired setting, the model is trained with "carefully matched sets of skeleton motions". We load the preprocessed paired samples of the NTU RGB+D dataset, where two distinct actors perform the same two actions under an identical camera view.

In [1]:
from sitc.data.ntu60 import NTU60

num_epochs = 5
ntu60_dataset = NTU60(num_frames=75)

print(f"==========\nNTU 60 Dataset\n==========")
print(f"Number of samples: {len(ntu60_dataset)}")
print(f"Number of frames: {ntu60_dataset.max_frames}")
print(f"Number of joints: {ntu60_dataset.num_joints}")

Jupyter environment detected. Enabling Open3D WebVisualizer.
[Open3D INFO] WebRTC GUI backend enabled.
[Open3D INFO] WebRTCWindowSystem: HTTP handshake server disabled.
NTU 60 Dataset
Number of samples: 30757
Number of frames: 75
Number of joints: 25


In [2]:
sample = ntu60_dataset[0]
print(f"Sample ID: {sample['name']}")
print(f"Camera ID: {sample['camera']}")
print(f"Action ID: {sample['action']}")
print(f"Person ID: {sample['person']}")

Sample ID: S001C003P004R002A038
Camera ID: 3
Action ID: 38
Person ID: 4
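
The sample ID follows the NTU RGB+D file naming convention (setup, camera, performer, replication, action), so the fields printed above can also be recovered from the name alone. The sketch below parses a name and checks whether two samples could form a pair. The helpers `parse_ntu_name` and `is_paired` are hypothetical and not part of `sitc`, and the pairing rule (same action and camera, different performer) is one reading of the description above, not necessarily the authors' exact criterion.

```python
import re

# SsssCcccPpppRrrrAaaa: setup, camera, performer, replication, action.
NAME_PATTERN = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")


def parse_ntu_name(name):
    """Split an NTU RGB+D sample name into its integer fields (hypothetical helper)."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Unexpected sample name: {name!r}")
    setup, camera, person, replication, action = map(int, match.groups())
    return {
        "setup": setup,
        "camera": camera,
        "person": person,
        "replication": replication,
        "action": action,
    }


def is_paired(name_a, name_b):
    """Assumed pairing rule: same action and camera view, different performers."""
    a, b = parse_ntu_name(name_a), parse_ntu_name(name_b)
    return (
        a["action"] == b["action"]
        and a["camera"] == b["camera"]
        and a["person"] != b["person"]
    )


fields = parse_ntu_name("S001C003P004R002A038")
print(fields)
```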


In [3]:
sample['keypoints'].shape

(75, 25, 3)

In [4]:
import torch

train_dataloader = torch.utils.data.DataLoader(ntu60_dataset, batch_size=128, shuffle=True, num_workers=0, pin_memory=True)
print(f"Number of batches: {len(train_dataloader)}")

Number of batches: 241
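
As a quick check of this figure, the batch count follows from the dataset size and batch size reported above. Since `DataLoader` keeps the last partial batch by default (`drop_last=False`), the count is rounded up:

```python
import math

num_samples = 30757  # dataset size printed above
batch_size = 128

# The final, smaller batch is kept, so round up.
num_batches = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)  # 241 37
```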


#### 1.1.2 Instantiating the Auto-encoder

In [11]:
from sitc.models.autoencoder import AE

ae = AE(in_features=3, out_features=256)
ae.cuda()

AE(
  (motion_encoder): EM(
    (conv1): Conv2d(3, 12, kernel_size=(3, 3), stride=(1, 1))
    (conv2): Conv2d(12, 24, kernel_size=(3, 3), stride=(1, 1))
    (conv3): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1))
    (conv4): Conv2d(32, 256, kernel_size=(3, 3), stride=(1, 1))
    (lr): LeakyReLU(negative_slope=0.01)
    (mp): MaxPool2d(kernel_size=3, stride=1, padding=0, dilation=1, ceil_mode=False)
    (rp): ReflectionPad2d((1, 1, 1, 1))
  )
  (privacy_encoder): PM(
    (conv1): Conv2d(3, 12, kernel_size=(3, 3), stride=(1, 1))
    (conv2): Conv2d(12, 24, kernel_size=(3, 3), stride=(1, 1))
    (conv3): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1))
    (conv4): Conv2d(32, 256, kernel_size=(3, 3), stride=(1, 1))
    (lr): LeakyReLU(negative_slope=0.01)
    (mp): MaxPool2d(kernel_size=3, stride=1, padding=0, dilation=1, ceil_mode=False)
    (rp): ReflectionPad2d((1, 1, 1, 1))
  )
  (decoder): Decoder(
    (convt1): ConvTranspose2d(512, 256, kernel_size=(3, 3), stride=(1, 1), paddin

#### 1.1.3 Defining hyperparameters

**Defining the reconstruction loss**

The loss is defined as

$$L_{rec} = \mathbb{E}_{\mathbf{s}\sim\mathcal{S}}\Bigl[||\mathit{D}(\mathit{E_M}(\mathbf{s}), \mathit{E_P}(\mathbf{s})) - \mathbf{s} ||^2\Bigr]$$

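This is the MSE loss up to a constant factor: `nn.MSELoss` with its default `reduction='mean'` averages the squared error over all elements instead of summing it.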
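
The relationship between the squared norm and PyTorch's MSE can be checked numerically. The sketch below uses random tensors in place of real skeletons and reconstructions; the tensor shape and noise level are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for a batch of skeletons s and reconstructions D(E_M(s), E_P(s)).
s = torch.randn(4, 75, 25, 3)
s_hat = s + 0.1 * torch.randn_like(s)

# nn.MSELoss with the default reduction='mean'.
mse = nn.MSELoss()(s_hat, s)

# Squared L2 norm of the residual, divided by the number of elements.
squared_norm = (s_hat - s).pow(2).sum()
manual = squared_norm / s.numel()

print(torch.allclose(mse, manual))
```

Because the two differ only by the constant `s.numel()`, the choice affects the effective scale of $\alpha_{rec}$ but not the minimiser.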

In [6]:
import torch.nn as nn

loss_rec = nn.MSELoss()

**Defining the smooth loss**

The smooth loss is defined as

$$L_{smooth} = \mathbb{E}_{\mathbf{s}\sim\mathcal{S}}\Biggl[\frac{\sqrt{\sum_i^J\Bigl|\sum_j^T(\hat{s}_{i,j}-\hat{s}_{i,j+1})^2 - \sum_j^T(\mathbf{s}_{i,j}-\mathbf{s}_{i,j+1})^2\Bigr|}}{J\times T}\Biggr]$$

In [7]:
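import torch

class SmoothLoss(nn.Module):
    """Smoothness loss comparing frame-to-frame joint displacements."""

    def __init__(self):
        super().__init__()

    def forward(self, inputs, targets):
        # Expected shape: (batch, T, J, 3), matching the dataloader output.
        T, J = inputs.shape[1], inputs.shape[2]

        # Squared displacement between consecutive frames: (batch, T - 1, J).
        # Assumption: the squared difference is the squared Euclidean distance over x, y, z.
        input_shifts = (inputs[:, :-1] - inputs[:, 1:]).pow(2).sum(dim=-1)
        target_shifts = (targets[:, :-1] - targets[:, 1:]).pow(2).sum(dim=-1)

        # Sum over time for each joint, then take the absolute difference: (batch, J).
        per_joint = (input_shifts.sum(dim=1) - target_shifts.sum(dim=1)).abs()

        # Square root of the sum over joints, normalised by J * T.
        # Assumption: a small epsilon keeps the gradient of sqrt finite at zero.
        loss = torch.sqrt(per_joint.sum(dim=1) + 1e-8) / (J * T)

        # Average over the batch (the expectation in the formula).
        return loss.mean()

loss_smooth = SmoothLoss()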

**Defining the total loss**

The total loss for the autoencoder pre-training is

$$L_{ae} = \alpha_{rec}L_{rec} + \alpha_{smooth}L_{smooth}$$

where $\alpha_{rec}$, $\alpha_{smooth}$ are hyperparameters.

In [8]:
class AELoss(nn.Module):
    """Weighted sum of the reconstruction and smoothness losses."""

    def __init__(self, alpha_rec=2, alpha_smooth=3):
        super().__init__()
        self.alpha_rec = alpha_rec
        self.alpha_smooth = alpha_smooth
        # Build the component losses once instead of on every forward pass.
        self.loss_rec = nn.MSELoss()
        self.loss_smooth = SmoothLoss()

    def forward(self, inputs, targets):
        rec = self.loss_rec(inputs, targets)
        smooth = self.loss_smooth(inputs, targets)
        return self.alpha_rec * rec + self.alpha_smooth * smooth

**Defining the optimiser**

In [9]:
import torch 

optimiser = torch.optim.Adam(ae.parameters(), lr=0.001)

loss_fn = AELoss()
loss_fn.to('cuda')

ae.train()

AE(
  (motion_encoder): EM(
    (conv1): Conv2d(75, 12, kernel_size=(3, 3), stride=(1, 1))
    (conv2): Conv2d(12, 24, kernel_size=(3, 3), stride=(1, 1))
    (conv3): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1))
    (conv4): Conv2d(32, 256, kernel_size=(3, 3), stride=(1, 1))
    (lr): LeakyReLU(negative_slope=0.01)
    (mp): MaxPool2d(kernel_size=3, stride=1, padding=0, dilation=1, ceil_mode=False)
    (rp): ReflectionPad2d((1, 1, 1, 1))
  )
  (privacy_encoder): PM(
    (conv1): Conv2d(75, 12, kernel_size=(3, 3), stride=(1, 1))
    (conv2): Conv2d(12, 24, kernel_size=(3, 3), stride=(1, 1))
    (conv3): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1))
    (conv4): Conv2d(32, 256, kernel_size=(3, 3), stride=(1, 1))
    (lr): LeakyReLU(negative_slope=0.01)
    (mp): MaxPool2d(kernel_size=3, stride=1, padding=0, dilation=1, ceil_mode=False)
    (rp): ReflectionPad2d((1, 1, 1, 1))
  )
  (decoder): Decoder(
    (convt1): ConvTranspose2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padd

#### 1.1.4 Training

In [None]:
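for epoch in range(num_epochs):
    for i, batch in enumerate(train_dataloader):
        # Debugging aid: only run the first batch of each epoch.
        if i > 0:
            break

        batch = NTU60.move_batch_to_device(batch, 'cuda')

        optimiser.zero_grad()

        # Keypoints have shape (batch, frames, joints, 3).
        inputs = batch['keypoints'].float()

        outputs = ae(inputs)

        # The loss and parameter update are disabled while the forward pass is debugged.
        #loss = loss_fn(outputs, inputs)
        #loss.backward()
        #optimiser.step()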




torch.Size([128, 75, 25, 3])


OutOfMemoryError: CUDA out of memory. Tried to allocate 3.71 GiB. GPU 0 has a total capacity of 9.77 GiB of which 496.75 MiB is free. Process 27918 has 249.38 MiB memory in use. Including non-PyTorch memory, this process has 8.69 GiB memory in use. Of the allocated memory 8.27 GiB is allocated by PyTorch, and 177.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
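
One quick sanity check when reading this error is to estimate how much memory the input batch itself needs. The sketch below assumes `float32` storage (4 bytes per element) and uses a hypothetical helper, `tensor_mib`, that is not part of PyTorch or `sitc`.

```python
import math


def tensor_mib(shape, bytes_per_element=4):
    """Approximate size in MiB of a dense tensor (float32 by default)."""
    return math.prod(shape) * bytes_per_element / 2**20


# Shape printed by the training cell above.
input_mib = tensor_mib((128, 75, 25, 3))
failed_allocation_mib = 3.71 * 1024  # 3.71 GiB, taken from the error message

print(f"Input batch: {input_mib:.2f} MiB")
print(f"Failed allocation: {failed_allocation_mib:.0f} MiB")
```

The input batch is only a few MiB, so the 3.71 GiB request most likely comes from an intermediate tensor inside the model, not from the data. Lowering `batch_size` is a common first step while tracking down which layer is responsible.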