In the paired $(\textbf{x}_1,\textbf{x}_2)$ setting:

 * Can we consider learning the transformation between pairs a generalization of this idea?
 * Can we get rid of explicit pairs somehow and train directly on the single-example case?
 * Transformation complexity acting as a regularizer
 * What transformation? How do we parameterize it, and where do we enforce it in the network?
 * Connection to latent semantic vector math?

If each latent unit is responsible for only a single factor of variation (FoV), then only a limited number of ‘connections’ / ‘activations’ should fire when transforming each pair. If the total number of latents is $d$ and $k$ factors are shared, then (ideally) we only need to activate $d - k$ units to transform $\textbf{x}_1$ into $\textbf{x}_2$. This information can be used as a regularizer. I need to think about whether it makes sense to use the formalism of block-diagonal matrices for this. If not, what other form of regularization can be used? (“Learning disent structure of dyn envs” and “Cfactuals uncover Modular structure of DNNs” papers could be relevant here?)
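A minimal sketch of how the $d - k$ count could become a regularizer, assuming we simply penalize how many latent coordinates move between the two encodings (called `z1` and `z2` here). The L1 norm is used as a differentiable stand-in for the count; the function names and the threshold `tol` are illustrative assumptions, not part of any existing method.

```python
def changed_units(z1, z2, tol=1e-3):
    """Count latent coordinates that differ by more than tol (hypothetical threshold)."""
    return sum(1 for a, b in zip(z1, z2) if abs(a - b) > tol)


def l1_change_penalty(z1, z2):
    """L1 norm of z2 - z1: a differentiable surrogate for the count above."""
    return sum(abs(a - b) for a, b in zip(z1, z2))


# d = 5 latents, k = 3 shared factors, so ideally d - k = 2 units change.
z1 = [0.2, -1.0, 0.5, 0.0, 1.5]
z2 = [0.2, -1.0, 0.5, 0.8, -0.5]
print(changed_units(z1, z2))      # how many units had to move
print(l1_change_penalty(z1, z2))  # how much they moved in total
```

The count matches the "activate only $d - k$ units" intuition directly, while the L1 version also charges for the size of each change.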


**What is meant by shared factors?**

For example, consider the case where the factors vary smoothly, as in MNIST, rather than taking discrete values as in synthetic datasets. How would we *find* two images that have the same `thickness` or `rotation` value? These are vague / approximate notions, so we probably need a notion of “factors are approximately shared” or “close enough in latent space”.
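A minimal sketch of what “approximately shared” could mean, assuming we score each latent coordinate with a Gaussian similarity instead of testing for exact equality. The function name and the bandwidth `sigma` are hypothetical; choosing `sigma` is exactly the open question of how close is “close enough”.

```python
import math


def soft_shared(z1, z2, sigma=0.25):
    """Per-coordinate sharing score in (0, 1]: 1 when equal, near 0 when far apart."""
    return [math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for a, b in zip(z1, z2)]


# Coordinates 0 and 2 are close, coordinate 1 is far apart.
z_a = [0.50, 1.20, -0.30]
z_b = [0.55, -0.40, -0.28]
scores = soft_shared(z_a, z_b)
print([round(s, 3) for s in scores])
```

A soft score like this could weight a per-coordinate penalty, so that nearly shared factors are treated as mostly shared rather than as different.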

**Nature of factor might affect structure of latent encoding?**

The nature of a latent factor determines how its values affect the reconstruction (and hence the loss), so different factors may need to be encoded differently in latent space. For example, `posX = 0` and `posX = 39` are at opposite ends of the image, so the loss will be max (\*). For orientation, however, there exist several ‘aliased’ values for which the reconstruction loss is very close to zero even though the latent values are far apart, e.g. rotation = 0 deg looks very close to rotation = 350 deg, even though the values sit at the extremes of the range. For a square, `rotation` $= 0, 90, 180, 270, 360$ all result in the “same” image because of symmetry. Cartesian coordinates aren't the most natural system for some latents, i.e. orthonormal vectors aren't the most natural basis. (How do we combine spaces represented in different coordinate systems?)

How will / should such numerically different but semantically close values be encoded in the latent space?

(\*) It actually won't be max. Once we have zero overlap with the true image, all loss values are the same at least for `pos` variables. So in a way, how far we are from true position doesn't matter and isn't used as a signal.
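Both observations can be checked on toy data. The sketch below assumes a 1-D binary "image" containing a bar (a stand-in for `posX`) and a unit-circle code for angles; `bar_image`, `angle_code`, and the `symmetry` parameter are hypothetical names introduced for illustration. The circular code is only one possible answer to the encoding question, not a settled choice.

```python
import math


def bar_image(pos, width=4, size=40):
    """1-D binary 'image' with a bar of the given width starting at pos."""
    return [1.0 if pos <= i < pos + width else 0.0 for i in range(size)]


def sq_error(a, b):
    """Sum of squared pixel differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Position: the error grows while the bars overlap, then stays flat at 2 * width.
target = bar_image(0)
errors = [sq_error(bar_image(p), target) for p in range(0, 37)]


def angle_code(deg, symmetry=1):
    """Map an angle to the unit circle; symmetry=4 folds a square's 90-degree aliases together."""
    t = math.radians(deg * symmetry)
    return (math.cos(t), math.sin(t))


def code_dist(a, b):
    """Euclidean distance between two 2-D codes."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Orientation: 0 and 350 degrees are numerically far but close on the circle.
d_near = code_dist(angle_code(0), angle_code(350))
# For a square, 0 and 90 degrees map to (almost) the same code when symmetry=4.
d_square = code_dist(angle_code(0, symmetry=4), angle_code(90, symmetry=4))
print(errors[:6], round(d_near, 3), d_square < 1e-9)
```

Once the bars stop overlapping the error carries no information about distance, which is the plateau described in (\*); the circular code removes the artificial gap between the two ends of the angle range.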

## Training (what we want?)

<img src="vec_image.jpg" alt="drawing" width="300" height="300"/>

1. Take an example $\textbf{x}_1$ and encode it to get $\textbf{z}_1$.
2. Pick another example $\textbf{x}_2$ and encode it to get $\textbf{z}_2$.
3. Have a function $f_\psi$ which takes $\textbf{z}_1$ as input and "transforms" it into $\textbf{z}_2$, i.e. $\hat{\textbf{z}}_2 = f_\psi(\textbf{z}_1)$.
4. This function should have limited modeling capacity.
5. $\hat{\textbf{z}}_2$ should be able to reconstruct $\textbf{x}_2$; the function $f_\psi$ then represents a transformation in latent space.
6. Let the total number of latent factors be $L$. Then $f_\psi$ will act only on those dimensions (corresponding to FoVs) whose values change between $\textbf{x}_1$ and $\textbf{x}_2$. The strength of the action should depend on the difference in FoV values. This means the transformation can't be fixed; it has to depend on the pair $(\textbf{x}_1,\textbf{x}_2)$. Naturally we'd expect a closer pair to require 'less' transformation. Would we have $L$ such transformations?
7. In the early stages of training, the encodings would be crap, so learning transformations that early probably isn't a good idea?
8. Would it be an invertible one-to-one transformation? (Normalizing flows helpful?)
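The steps above could be sketched as a gated interpolation in latent space, under the assumption that $f_\psi$ sees both encodings and owns one gate per latent dimension. The gate vector (standing in for $\psi$), the names `f_psi` and `gate_cost`, and the choice of a linear move are all illustrative assumptions; capacity is limited by penalizing how many gates are open.

```python
def f_psi(z1, z2, gates):
    """Move z1 toward z2 only along gated dimensions.

    gates[i] in [0, 1]: 0 keeps z1[i], 1 replaces it with z2[i].
    The size of the move scales with the difference z2[i] - z1[i].
    """
    return [a + g * (b - a) for a, b, g in zip(z1, z2, gates)]


def gate_cost(gates):
    """Capacity penalty: total gate opening (an L1-style proxy for 'few units')."""
    return sum(abs(g) for g in gates)


z1 = [0.0, 1.0, -2.0, 0.5]
z2 = [0.0, 1.0, 3.0, 0.5]     # only dimension 2 differs
gates = [0.0, 0.0, 1.0, 0.0]  # hypothetical learned gates
z2_hat = f_psi(z1, z2, gates)
print(z2_hat, gate_cost(gates))
```

With this form, a closer pair automatically needs a smaller move, and the gate penalty plays the role of the limited modeling capacity in step 4.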

It seems that in any case we'll need to do 2 passes through the decoder. I'm not sure if backprop after 2 passes makes sense in this case.

### Sketch objective?

\begin{aligned}
\max_{\phi,\theta, \psi} \mathbb{E}_{(\textbf{x}_1,\textbf{x}_2)}  & \bigg\{
 \mathbb{E}_{q_\phi(\textbf{z}|\textbf{x}_1)} \log(p_\theta(\textbf{x}_1|{\textbf{z}}))  \\
&+ \mathbb{E}_{q_\phi(\textbf{z}|\textbf{x}_2)} \log(p_\theta(\textbf{x}_2|{\textbf{z}})) \\
&- \beta  D_{KL}( q_\phi({\textbf{z}}|\textbf{x}_1)||p({\textbf{z}})) \\
&- \beta  D_{KL}( q_\phi({\textbf{z}}|\textbf{x}_2)||p({\textbf{z}})) \\
&- \gamma \, \text{Recon}\big(\text{Dec}(f_\psi(\textbf{z}_1,\textbf{z}_2)) - \textbf{x}_2\big)
\bigg\}
\end{aligned}
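To make the terms concrete, here is a minimal numeric sketch of the objective for a single pair. It assumes a diagonal Gaussian posterior with a standard normal prior (so the KL term has a closed form), a Gaussian likelihood with unit variance (so the log-likelihood is a negative half squared error up to a constant), and it treats the transformed-reconstruction term as an error that is subtracted because the objective is maximized. The decoder outputs are passed in as plain lists; all function and argument names are hypothetical.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def sq_err(a, b):
    """Sum of squared differences between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pair_objective(x1, x2, rec1, rec2, rec_transformed,
                   mu1, logvar1, mu2, logvar2, beta=1.0, gamma=1.0):
    """Objective for one pair (to be maximized).

    rec1, rec2:       decoder outputs for z ~ q(z|x1) and z ~ q(z|x2)
    rec_transformed:  decoder output for f_psi(z1, z2), compared against x2
    """
    log_lik = -0.5 * (sq_err(rec1, x1) + sq_err(rec2, x2))
    kl = gaussian_kl(mu1, logvar1) + gaussian_kl(mu2, logvar2)
    transform_err = sq_err(rec_transformed, x2)
    return log_lik - beta * kl - gamma * transform_err


# Toy values: perfect reconstructions, posterior of x2 shifted away from the prior.
x1, x2 = [0.0, 1.0], [1.0, 1.0]
value = pair_objective(x1, x2, x1, x2, [1.0, 0.0],
                       mu1=[0.0], logvar1=[0.0], mu2=[1.0], logvar2=[0.0],
                       beta=4.0, gamma=2.0)
print(value)
```

In a real model the three reconstructions would come from the shared decoder, which is where the question about two decoder passes and backprop comes in.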



## A Gaussian Process diversion

1. One of the key themes of the above idea is to view the dataset as a whole, that is, to view each data point $\textbf{x}$ as being related to every other data point $\hat{\textbf{x}}$ via some transformation.

2. GPs very naturally encode this idea in their covariance function $k(\textbf{x},\hat{\textbf{x}})$.

3. Since we have multiple axis-parallel transformations, we would need a combination / library of transformations to change one data point into another. This suggests somehow learning these via kernels and then learning to combine them.

4. We already model the latent $\textbf{z}$ as a Gaussian. Do we now have to model the dynamics in Z-space as composed of GPs?
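One hedged reading of point 3 is an additive kernel: one 1-D RBF kernel per latent axis, combined with non-negative weights (a non-negative weighted sum of valid kernels is again a valid kernel). The sketch below assumes that form; `additive_kernel`, the weights, and the lengthscales are illustrative, and nothing here fits a GP or learns the weights.

```python
import math


def rbf_1d(a, b, lengthscale=1.0):
    """Squared-exponential kernel on a single latent coordinate."""
    return math.exp(-((a - b) ** 2) / (2 * lengthscale ** 2))


def additive_kernel(z, z_hat, weights, lengthscales):
    """Weighted sum of one RBF kernel per latent axis."""
    return sum(w * rbf_1d(a, b, l)
               for a, b, w, l in zip(z, z_hat, weights, lengthscales))


def gram(points, weights, lengthscales):
    """Covariance matrix over a set of latent points."""
    return [[additive_kernel(p, q, weights, lengthscales) for q in points]
            for p in points]


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
weights = [1.0, 0.5]       # hypothetical per-axis weights
lengthscales = [1.0, 1.0]  # hypothetical per-axis lengthscales
K = gram(points, weights, lengthscales)
for row in K:
    print([round(v, 3) for v in row])
```

Each axis gets its own kernel, so a change along one latent dimension only lowers that dimension's contribution to the covariance; learning the weights would correspond to learning how to combine the "library".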

### 3d Shapes data

https://github.com/deepmind/3d-shapes

We have the factors: floor color, wall color, object color, scale, shape, rotation.

We can consider 3 objects in the scene: (1) Object, (2) Floor, (3) Wall.

Rotation affects the appearance of all the objects.

ToDo: Check Google's post on explainable visual classifier

### Disentanglement principle re-thinking

- We can't assume to know the number of concepts
- We may have some idea about the nature of some concepts, but not all
- We might need more than one-dim to represent some concepts
- What would be an unsupervised way to check for disentanglement when we don't have labels, as in MNIST?
- Some concepts might apply to the whole image or a bigger region of the image and some might be local, e.g. background color vs eye color
- Different concepts may be learned at different layer depths (i.e. simple vs complex)
- Concepts may be learned at different iterations
- Correlated latents?

Kate and Jonas' idea seems to require FC nets so that they can enforce the block-diagonal structure.
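For reference, a block-diagonal constraint on a fully connected layer can be written as an elementwise mask on the weight matrix. The sketch below only illustrates that generic construction and is not a reproduction of the idea mentioned above; `block_diag_mask`, `masked_linear`, and the block sizes are assumptions.

```python
def block_diag_mask(block_sizes):
    """Square 0/1 mask with ones inside each diagonal block."""
    n = sum(block_sizes)
    mask = [[0] * n for _ in range(n)]
    start = 0
    for size in block_sizes:
        for i in range(start, start + size):
            for j in range(start, start + size):
                mask[i][j] = 1
        start += size
    return mask


def masked_linear(weights, mask, x):
    """y = (W * mask) x: units in one block cannot read from another block."""
    return [sum(w * m * v for w, m, v in zip(w_row, m_row, x))
            for w_row, m_row in zip(weights, mask)]


mask = block_diag_mask([2, 1])      # two groups: units {0, 1} and unit {2}
weights = [[1, 1, 1] for _ in range(3)]
y = masked_linear(weights, mask, [1, 2, 3])
print(mask)
print(y)
```

The mask works because every output unit of an FC layer has its own weight to every input unit; a convolution shares one kernel across positions, which is why the same trick does not carry over directly.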

Can we do it in a CNN somehow?

In CNNs, a filter runs over the whole image to produce an activation map, so it collects information globally.

Building blocks of a CNN and how they might affect information flow:
1. Kernels and biases
2. Activation functions
3. Pooling
4. Later FC layers

When we use a CNN in a reconstruction configuration, does it "store" the state (i.e. value) and position of each pixel somewhere in the network so that it can be reconstructed? That feels strange.
During reconstruction, what operations in a CNN allow 1-d numbers to be transformed into an image? We only have a few basic ops like Deconv, ReLU, pool, addition, scaling.
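One concrete answer to the second question is the transposed convolution ("deconv"): each input value stamps a scaled copy of the kernel onto a longer output, so a short code is spread out into spatial positions. Below is a minimal 1-D sketch with no padding; the kernel values are arbitrary placeholders rather than learned weights.

```python
def conv_transpose_1d(x, kernel, stride=2):
    """1-D transposed convolution without padding.

    Output length is (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for j, k in enumerate(kernel):
            # Input i contributes a scaled kernel starting at position i * stride.
            out[i * stride + j] += v * k
    return out


def relu(xs):
    """Elementwise ReLU."""
    return [max(0.0, v) for v in xs]


code = [1.0, -2.0]          # two latent numbers
kernel = [1.0, 0.5, 0.25]   # placeholder kernel
upsampled = conv_transpose_1d(code, kernel)
print(upsampled)
print(relu(upsampled))
```

In this view the position of a pixel is not stored as a value: it comes from where in the output each kernel copy is placed, while the latent numbers only control how strongly each copy is written.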
