# Playing with the Echonet video module
Stough

Working with the [r2plus1d_18](https://pytorch.org/docs/stable/torchvision/models.html#resnet-2-1-d) module (paper [here](https://arxiv.org/abs/1711.11248)) used in [Echonet](https://github.com/echonet/dynamic). I still don't see how the frame-by-frame segmentation model (a fairly straightforward fully convolutional [deeplabv3_resnet50](https://pytorch.org/docs/stable/torchvision/models.html#deeplabv3)) is combined with this r2plus1d_18 to get ejection fraction.

This notebook is for understanding what happens in [video.py](../echonet/utils/video.py).

In [3]:
%matplotlib widget
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision
import echonet
from argparse import Namespace

In [6]:
args = Namespace(modelname="r2plus1d_18",
        tasks="EF",
        frames=32,
        period=2,
        pretrained=True)

### Look at the structure of the model

It's quite confusing; I need to read the [paper](https://arxiv.org/pdf/1711.11248.pdf). The model looks different from what's in the paper, with a number of extra layers. In the end, the spatial and temporal convolutions are separated: Conv3d with kernel_size (1,3,3) for spatial and (3,1,1) for temporal.
- The 0 layer (stem) is (1,7,7) and (3,1,1)
- Layer 1 contains two bigblocks (BasicBlock), each of which has two of these (2+1)D pieces.
- Layers 2,3,4 look like layer 1, but include downsampling at the end of the first bigblock.
- I'm confused about the downsamples, as the number of input channels seems to be half what I expect.
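
The intermediate channel counts (such as the 144 between the spatial and temporal convolutions in layer1) are not arbitrary. As far as I can tell from the torchvision source, `BasicBlock` picks the middle width so that the factorized (2+1)D pair has roughly as many weights as one full 3x3x3 convolution, which is the parameter-matching rule from the paper. A minimal sketch of that arithmetic; `midplanes` here is my own helper name, not a torchvision function:

```python
def midplanes(inplanes: int, planes: int, t: int = 3, d: int = 3) -> int:
    """Middle channel count for a (2+1)D block with a t x d x d receptive field.

    Chosen so that a (1,d,d) conv followed by a (t,1,1) conv has about as many
    weights as one full t x d x d 3D conv from inplanes to planes.
    """
    full_3d = inplanes * planes * t * d * d          # weights in the full 3D conv
    per_mid_channel = inplanes * d * d + t * planes  # weights added per middle channel
    return full_3d // per_mid_channel


print(midplanes(64, 64))    # 64 -> 64 channels, as in layer1
print(midplanes(64, 128))   # 64 -> 128 channels
print(midplanes(128, 128))  # 128 -> 128 channels
```

For 64 -> 64 channels this gives 144, matching the `Conv3d(64, 144, ...)` / `Conv3d(144, 64, ...)` pair printed for layer1.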

Did some digging and found the [r2plus1d_18 code](https://github.com/pytorch/vision/blob/master/torchvision/models/video/resnet.py). Whew. The residual part is that each BasicBlock adds its input to its output (addition, not concatenation).

On the downsampling: if a BasicBlock downsamples, the downsample is applied to the *residual* (the block's input) before it is added to the output. So it's at [this line](https://github.com/pytorch/vision/blob/master/torchvision/models/video/resnet.py#L109) where the downsample reduces the spatial and temporal dimensions while increasing the number of channels to match the BasicBlock output. The downsample is constructed [here](https://github.com/pytorch/vision/blob/master/torchvision/models/video/resnet.py#L247) for the first BasicBlock of a layer (layers 2,3,4). In the layers that do downsample, look at the first [Conv2Plus1D](https://github.com/pytorch/vision/blob/master/torchvision/models/video/resnet.py#L47): `stride=(1, 2, 2)` downsamples in space and `stride=(2, 1, 1)` downsamples in time. That's how the t, h, w dimensions line up.
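
To check that the two branches really end up the same size, the standard convolution output-length formula can be applied along each axis. This is only a sketch of the shape arithmetic, using the kernel, stride and padding values quoted above; `conv_out`, `main_path` and `shortcut_path` are names I made up for illustration:

```python
def conv_out(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one axis (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def main_path(t: int, h: int, w: int) -> tuple:
    """Strided Conv2Plus1D: (1,3,3) conv, stride (1,2,2), then (3,1,1) conv, stride (2,1,1)."""
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)  # spatial conv halves h and w
    t = conv_out(t, 3, 2, 1)                            # temporal conv halves t
    return t, h, w


def shortcut_path(t: int, h: int, w: int) -> tuple:
    """Downsample branch: 1x1x1 conv with stride (2,2,2) and no padding."""
    return tuple(conv_out(n, 1, 2, 0) for n in (t, h, w))


# Feature map entering layer2 for a 32 x 112 x 112 clip: the stem's stride
# (1,2,2) has already halved h and w, and layer1 keeps the size.
thw = (32, conv_out(112, 7, 2, 3), conv_out(112, 7, 2, 3))
print(thw, main_path(*thw), shortcut_path(*thw))
```

Both branches give the same (t, h, w), which is what allows the elementwise addition.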

In [7]:
# Set up model. video.py ~83
model = torchvision.models.video.__dict__[args.modelname](pretrained=args.pretrained)

Downloading: "https://download.pytorch.org/models/r2plus1d_18-91a641e6.pth" to /home/stough/.cache/torch/checkpoints/r2plus1d_18-91a641e6.pth



In [8]:
model

VideoResNet(
  (stem): R2Plus1dStem(
    (0): Conv3d(3, 45, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3), bias=False)
    (1): BatchNorm3d(45, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): Conv3d(45, 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
    (4): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
  )
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Sequential(
        (0): Conv2Plus1D(
          (0): Conv3d(64, 144, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (1): BatchNorm3d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Conv3d(144, 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        )
        (1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=Tru

In [16]:
# Here the 64 input features to the downsample Conv3d are the inputs to layer2's first BasicBlock. 
# This downsampling happens so that the input can be added to the 128 features of this block's
# output.
ds = model.layer2[0].downsample
ds

Sequential(
  (0): Conv3d(64, 128, kernel_size=(1, 1, 1), stride=(2, 2, 2), bias=False)
  (1): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
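
The same point can be checked with real tensors. The sketch below builds only the strided (2+1)D piece of the main path and a shortcut with the same configuration as `ds`, then adds them. The middle width of 230 is an assumption (it follows the parameter-matching rule for 64 -> 128 channels; the printed model above is cut off before layer2), and the second, unstrided convolution of the real BasicBlock is left out:

```python
import torch
from torch import nn

# Strided (2+1)D piece of the main path. 230 is an assumed middle width.
main = nn.Sequential(
    nn.Conv3d(64, 230, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False),
    nn.BatchNorm3d(230),
    nn.ReLU(inplace=True),
    nn.Conv3d(230, 128, kernel_size=(3, 1, 1), stride=(2, 1, 1), padding=(1, 0, 0), bias=False),
    nn.BatchNorm3d(128),
)
# Same configuration as the `ds` module printed above.
shortcut = nn.Sequential(
    nn.Conv3d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm3d(128),
)

x = torch.randn(1, 64, 8, 16, 16)  # small stand-in for (N, C, T, H, W) features
with torch.no_grad():
    out = torch.relu(main(x) + shortcut(x))  # addition needs identical shapes
print(tuple(x.shape), "->", tuple(out.shape))
```

If the shortcut were left as the identity, the addition would fail, since the input has 64 channels and twice the extent in t, h and w.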

In [17]:
# Number of trainable parameters?
# https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model
print('Trainable params: {}'.format(sum(p.numel() for p in model.parameters() if p.requires_grad)))

Trainable params: 31505325



### Now, let's build up a batch to understand how those are working.

In [19]:
# video.py ~97
mean, std = echonet.utils.get_mean_and_std(echonet.datasets.Echo(split="train"))
kwargs = {"target_type": args.tasks,
          "mean": mean,
          "std": std,
          "length": args.frames,
          "period": args.period,
          }

100%|██████████| 16/16 [00:01<00:00, 10.94it/s]
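
With `length=32` and `period=2`, each clip should contain 32 frames taken from every second frame of the source video. A small sketch of that indexing, assuming the Echo dataset subsamples clips this way; the random choice of start frame and the handling of short videos are left out, and `clip_indices` is a hypothetical helper:

```python
def clip_indices(start: int, length: int = 32, period: int = 2) -> list:
    """Frame indices for one clip: `length` frames, keeping every `period`-th frame."""
    return [start + period * i for i in range(length)]


idx = clip_indices(start=0)
span = idx[-1] - idx[0] + 1  # number of source frames covered by the clip
print(len(idx), idx[:4], idx[-1], span)
```

So a 32-frame clip covers 63 consecutive source frames, roughly twice the temporal extent of the tensor the model sees.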


In [20]:
# Set up datasets and dataloaders, video.py ~107
train_dataset = echonet.datasets.Echo(split="train", **kwargs, pad=12)

train_dataloader = torch.utils.data.DataLoader(train_dataset, 
                                               batch_size=20, 
                                               num_workers=4, 
                                               shuffle=True, 
                                               pin_memory=False, 
                                               drop_last=True)

In [21]:
# video.py ~272
# Grab a single batch.
X, outcome = next(iter(train_dataloader))

In [22]:
who

Namespace	 X	 args	 ds	 echonet	 kwargs	 mean	 model	 np	 
outcome	 plt	 std	 tasks	 torch	 torchvision	 train_dataloader	 train_dataset	 


In [23]:
X.shape

torch.Size([20, 3, 32, 112, 112])

In [24]:
outcome

tensor([50.3779, 57.2229, 60.9947, 35.1332, 58.0099, 64.8835, 67.2624, 65.2090,
        56.7912, 58.8789, 65.5986, 60.2931, 60.9901, 57.0501, 67.1416, 55.9150,
        67.6700, 57.1032, 70.2874, 58.1282])


### Visualize one of the batch_size videos.

In [32]:
%%capture
vid = echonet.utils.makeVideo(np.transpose(X[1], (1,2,3,0)))
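
The transpose in the cell above moves the channel axis to the end: each `X[i]` is laid out as (C, T, H, W), and the permutation `(1,2,3,0)` produces (T, H, W, C), which I assume is the frame-by-frame layout `makeVideo` expects. A quick check on a dummy array of the same shape:

```python
import numpy as np

clip = np.zeros((3, 32, 112, 112), dtype=np.float32)  # (C, T, H, W), like X[i]
frames_last = np.transpose(clip, (1, 2, 3, 0))         # (T, H, W, C)
print(clip.shape, "->", frames_last.shape)

# Element check: channel c of frame t at pixel (y, x) lands at [t, y, x, c].
clip[2, 5, 10, 20] = 1.0
moved = np.transpose(clip, (1, 2, 3, 0))[5, 10, 20, 2]
print(moved)
```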

In [33]:
vid

In [34]:
X[15].max()

tensor(4.5219)

In [41]:
# Min-max rescale the standardized clip to [0, 1] so it displays properly.
xn = X[15].numpy()
xn = xn - xn.min()
xn = xn/xn.max()
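
The min-max rescale above stretches each clip to its own range, so brightness is not comparable between clips. An alternative is to undo the standardization with the same per-channel mean and std that were passed to the dataset. A sketch with made-up statistics, assuming the dataset applies `(x - mean) / std` per channel; the real values come from `get_mean_and_std`:

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(3, 4, 8, 8)).astype(np.float32)  # (C, T, H, W) pixels

# Hypothetical per-channel statistics, shaped to broadcast over (T, H, W).
mean = np.array([30.0, 30.0, 30.0], dtype=np.float32).reshape(3, 1, 1, 1)
std = np.array([50.0, 50.0, 50.0], dtype=np.float32).reshape(3, 1, 1, 1)

standardized = (raw - mean) / std              # what the model sees; can exceed 1
restored = standardized * std + mean           # undo it to recover pixel values
display = np.clip(restored / 255.0, 0.0, 1.0)  # [0, 1] range for plotting

print(standardized.max(), display.min(), display.max())
```

This would also explain why `X[15].max()` is well above 1: the batch holds standardized values, not raw intensities.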


In [42]:
%%capture
vidN = echonet.utils.makeVideo(np.transpose(xn, (1,2,3,0)))

In [43]:
vidN