In paired $(\textbf{x}_1,\textbf{x}_2)$ setting:

 * Can we consider learning the transformation b/w pairs a generalization of this idea?
 * Can we get rid of explicit pairs somehow and train directly on the single-example case?
 * Transformation complexity acting as a regularizer
 * What transformation? How to parameterize it and where to enforce it in the network?
 * Connection to latent semantic vector math?

If each latent unit is responsible for only a single FoV, then only a limited number of ‘connections’ / ‘activations’ should fire when transforming one element of a pair into the other. If the total number of latents is $d$ and $k$ factors are shared, then (ideally) only $d - k$ units need to change to transform $x_1$ into $x_2$. This information can be used as a regularizer. I need to think about whether it makes sense to use the formalism of block-diagonal matrices for this. If not, what other form of regularization can be used? (“Learning disent structure of dyn envs” and “Cfactuals uncover Modular structure of DNNs” papers could be relevant here?)
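A minimal sketch of how the "only $d - k$ units should change" idea could become a penalty. Everything here is an assumption for illustration: the names `changed_units` and `sparse_change_penalty`, the threshold `tol`, and the choice of an L1 norm as a differentiable stand-in for the count.

```python
import numpy as np

def changed_units(z1, z2, tol=1e-2):
    """Count latent units whose value differs by more than tol across the pair."""
    return int(np.sum(np.abs(z1 - z2) > tol))

def sparse_change_penalty(z1, z2):
    """L1 surrogate for the count above (hypothetical regularizer)."""
    return float(np.sum(np.abs(z1 - z2)))

d, k = 6, 4                                  # d latents, k shared factors
z1 = np.array([0.3, -1.2, 0.8, 0.0, 1.5, -0.4])
z2 = z1.copy()
z2[[1, 4]] += np.array([0.9, -0.7])          # only d - k = 2 units differ
n_changed = changed_units(z1, z2)
penalty = sparse_change_penalty(z1, z2)
```

The L1 penalty does not know $k$; a variant that needs $k$ would compare `n_changed` against `d - k` directly, at the cost of not being differentiable.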


**What is meant by shared factors?**

For example, take the case where the domain of a factor varies smoothly (as opposed to the discrete values in synthetic datasets), e.g. in MNIST. How would we *find* two images which have the same `thickness` or `rotation` value? These are vague / approximate notions, so we probably need a notion of “factors are approximately shared” or “close enough in latent space”.

**Nature of factor might affect structure of latent encoding?**

The nature of a latent factor determines how its values affect the reconstruction (and hence the loss), so the way it should be encoded in latent space can differ from factor to factor. For example, `posX = 0` and `posX = 39` are at opposite ends of the image so the loss will be max (\*), but in the case of orientation there exist several ‘aliased’ values for which the reconstruction loss is very close to zero even if the latent values themselves are far apart, e.g. rotation = 0 deg looks very close to rotation = 350 deg, even though the values are at the extremes of the range. In the case of a square, `rotation` $= 0, 90, 180, 270, 360$ all result in the “same” image because of symmetry. Cartesian coords aren't the most natural system for some latents, i.e. orthonormal vecs aren't the most natural basis (how to combine spaces represented in different coordinates?)

How will / should such numerically different but semantically close values be encoded in the latent space?
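One possible answer for periodic factors, sketched under assumptions: encode the angle as a point on the unit circle, so numerically distant but visually close rotations end up near each other. The helper names and the `symmetry` parameter (which folds an $n$-fold symmetric shape onto itself) are hypothetical and not taken from any of the papers above.

```python
import numpy as np

def embed_angle(theta_deg, symmetry=1):
    """Map an angle to the unit circle; symmetry=n identifies angles 360/n apart."""
    t = np.deg2rad(symmetry * theta_deg)
    return np.array([np.cos(t), np.sin(t)])

def angle_dist(a, b, symmetry=1):
    """Euclidean distance between the two circle embeddings."""
    return float(np.linalg.norm(embed_angle(a, symmetry) - embed_angle(b, symmetry)))

d_near = angle_dist(0, 350)               # small: 0 and 350 deg look alike
d_far = angle_dist(0, 180)                # largest possible distance on the circle
d_square = angle_dist(0, 90, symmetry=4)  # ~0: a square is unchanged by 90 deg
```

This uses two latent dimensions for one factor, which fits the point that some concepts may need more than one dimension.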

(\*) It actually won't be max. Once we have zero overlap with the true image, all loss values are the same at least for `pos` variables. So in a way, how far we are from true position doesn't matter and isn't used as a signal.
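The saturation described in (\*) can be seen on a toy 1-D "image". This is only an illustration under assumed settings (a 40-pixel row, a 4-pixel-wide object, squared error); `render` and `sq_err` are made-up helpers.

```python
import numpy as np

def render(pos, width=40, size=4):
    """Binary 1-D image with an object of `size` pixels starting at `pos`."""
    img = np.zeros(width)
    img[pos:pos + size] = 1.0
    return img

def sq_err(p, q):
    """Summed squared pixel error between objects at positions p and q."""
    return float(np.sum((render(p) - render(q)) ** 2))

target = 0
losses = {p: sq_err(target, p) for p in (0, 2, 4, 10, 36)}
```

Once the shift reaches the object size the two objects no longer overlap, and `losses` stays at the same value for every larger shift, so the pixel loss carries no information about how far off the position is.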

## Training (what we want?)

<img src="vec_image.jpg" alt="drawing" width="300" height="300"/>

1. Take an example $\textbf{x}_1$ and encode it to get $\textbf{z}_1$. Pick another example $\textbf{x}_2$ and encode it to get $\textbf{z}_2$
2. Have a function $f_\psi$ which takes as input $\textbf{z}_1$ and "transforms" it into $\textbf{z}_2$ i.e. $\hat{\textbf{z}}_2 = f_\psi(\textbf{z}_1)$
3. This function should have limited modeling capacity
4. $\hat{\textbf{z}}_2$ should be able to reconstruct $\textbf{x}_2$; the function $f_\psi$ then represents a transformation in latent space
5. Let the total number of latent factors be $L$. Then $f_\psi$ will act on only those dimensions (corresponding to FoVs) which change value b/w $\textbf{x}_1$ and $\textbf{x}_2$. The strength of the action should depend on the difference in FoV values. It means that the transformation can't be fixed, it has to depend on the pair $(\textbf{x}_1,\textbf{x}_2)$. Naturally we'd expect a closer pair to require 'less' transformation. Would we have $L$ such transformations?
6. In early stages of training, encodings would be crap. So learning transformations at early stages probably isn't a good thing?
7. Would it be an invertible one-to-one transformation? (Normalizing flows helpful?)
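A toy stand-in for a pair-dependent, limited-capacity $f_\psi$, just to make the steps above concrete. It is an assumption, not a proposal from any paper: the "capacity" is a hard cap `max_active` on how many latent dimensions may move, and nothing is learned.

```python
import numpy as np

def f_psi(z1, z2, max_active=2):
    """Move z1 towards z2, but only along the max_active most-changed dimensions."""
    delta = z2 - z1
    idx = np.argsort(-np.abs(delta))[:max_active]   # dims with the largest change
    mask = np.zeros_like(z1)
    mask[idx] = 1.0
    return z1 + mask * delta

z1 = np.array([0.0, 1.0, 2.0, 3.0])
z2 = np.array([0.0, 1.5, 2.0, 2.0])    # differs from z1 in two dimensions
z2_hat = f_psi(z1, z2)                 # recovers z2 exactly

z3 = np.array([1.0, 1.5, 2.0, 2.0])    # differs from z1 in three dimensions
z3_hat = f_psi(z1, z3)                 # only the two largest changes are applied
```

A learned version would replace the hard top-k mask with a parameterized gate, which is where the question of invertibility in the last step comes in.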

It seems that in any case we'll need to do 2 passes thru decoder. I'm not sure if backprop after 2 passes makes sense in this case. 

### Sketch objective?

$$
\begin{aligned}
\max_{\phi,\theta, \psi} \mathbb{E}_{(\textbf{x}_1,\textbf{x}_2)}  & \bigg\{
 \mathbb{E}_{q_\phi(\textbf{z}|\textbf{x}_1)} \log p_\theta(\textbf{x}_1|\textbf{z})  \\
&+ \mathbb{E}_{q_\phi(\textbf{z}|\textbf{x}_2)} \log p_\theta(\textbf{x}_2|\textbf{z}) \\
&- \beta  D_{KL}( q_\phi(\textbf{z}|\textbf{x}_1)||p(\textbf{z})) \\
&- \beta  D_{KL}( q_\phi(\textbf{z}|\textbf{x}_2)||p(\textbf{z})) \\
&- \gamma \, \text{Recon}\big(\text{Dec}(f_\psi(\textbf{z}_1,\textbf{z}_2)),\, \textbf{x}_2\big)
\bigg\}
\end{aligned}
$$

Here $\text{Recon}(\cdot,\cdot)$ is a reconstruction error between the decoded transformed latent and $\textbf{x}_2$, so it enters the maximization with a minus sign.
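A rough numerical sketch of this objective for a single pair, under loud assumptions: a diagonal-Gaussian posterior, a unit-variance Gaussian decoder (so the log-likelihood is a scaled squared error up to a constant), posterior means used in place of samples, and the last term treated as a squared error that is subtracted. `enc`, `dec` and `f_psi` are placeholder callables, not real models.

```python
import numpy as np

def kl_to_std_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * float(np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar))

def sq_error(x, x_hat):
    return float(np.sum((x - x_hat) ** 2))

def pair_objective(x1, x2, enc, dec, f_psi, beta=1.0, gamma=1.0):
    (mu1, lv1), (mu2, lv2) = enc(x1), enc(x2)
    z1, z2 = mu1, mu2                           # means instead of samples (assumption)
    recon = -0.5 * sq_error(x1, dec(z1)) - 0.5 * sq_error(x2, dec(z2))
    kl = kl_to_std_normal(mu1, lv1) + kl_to_std_normal(mu2, lv2)
    transfer = sq_error(x2, dec(f_psi(z1, z2)))  # error of decoding the transformed latent
    return recon - beta * kl - gamma * transfer

# Toy placeholders: identity encoder mean with unit variance, identity decoder.
enc = lambda x: (x, np.zeros_like(x))
dec = lambda z: z
x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
obj_exact = pair_objective(x1, x2, enc, dec, lambda a, b: b)     # f_psi hits z2
obj_identity = pair_objective(x1, x2, enc, dec, lambda a, b: a)  # f_psi does nothing
```

With these placeholders the only difference between the two values is the transfer term, so a do-nothing $f_\psi$ scores strictly worse.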



## A Gaussian Process diversion

1. One of the key themes of the above idea is to view the dataset as a whole. That is, to view each data point $\textbf{x}$ as being related to every other data point $\hat{\textbf{x}}$ via some transformation.

2. GPs very naturally encode this idea in their covariance function $k(\textbf{x},\hat{\textbf{x}})$.

3. Since we have multiple axis-parallel transformations, we would need a combination / library of transformations to change one data point into another. This gives an idea of somehow learning these via kernels and then learning to combine them.

4. We already model the latent $\textbf{z}$ as a Gaussian. We now have to model the dynamics in Z-space as composed of GPs?
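One way to read "a library of axis-parallel transformations combined via kernels" is an additive kernel with one 1-D RBF kernel per latent dimension. This is a sketch under that reading only; the weights and lengthscales are made-up values that would have to be learned.

```python
import numpy as np

def rbf_1d(a, b, lengthscale=1.0):
    """RBF kernel matrix between two 1-D arrays of latent values."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def additive_kernel(Z1, Z2, weights, lengthscales):
    """Weighted sum of per-dimension RBF kernels (one kernel per latent axis)."""
    K = np.zeros((Z1.shape[0], Z2.shape[0]))
    for j, (w, ell) in enumerate(zip(weights, lengthscales)):
        K += w * rbf_1d(Z1[:, j], Z2[:, j], ell)
    return K

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))            # 5 data points, 3 latent dimensions
K = additive_kernel(Z, Z, weights=[1.0, 0.5, 0.2], lengthscales=[1.0, 1.0, 2.0])
```

Each entry `K[i, j]` relates data point `i` to data point `j`, and each term in the sum only looks at one latent axis, which is the "axis-parallel" part.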

### 3d Shapes data

https://github.com/deepmind/3d-shapes

We have factors floor color, wall color, object color, scale, shape, rotation

We can consider 3 objects in the scene (1) Object (2) Floor (3) Wall

Rotation affects appearance of all the objects

ToDo: Check Google's post on explainable visual classifier

### Disentanglement principle re-thinking
 
- We can't assume to know the number of concepts
- We may have some idea about the nature of some concepts, but not all
- We might need more than one-dim to represent some concepts
- What would be an unsupervised way to check for disentanglement? When we don't have labels like in MNIST?
- Some concepts might apply to whole image or a bigger region of image and some might be local e.g. background color vs eye-color
- Different concepts may be learned at different layer depths (i.e. simple vs complex) or may have hierarchical structure (i.e. learn edges and then learn about a square as a combination of edges in a certain way)
- Concepts may be learned at different iterations (there was an OpenAI post on this where some classes were separated earlier and two classes, 6 and 9, were separated much later in training)
- Correlated latents?

Kate and Jonas' idea seems to require FC nets so that they can enforce the block-diagonal structure.

Can we do it in CNN somehow?

In CNNs a filter runs through whole image to make an activation map. It collects information globally

Building blocks of CNN and how they might affect information flow
1. Kernel size, weights, and biases.  
2. Activation functions
3. Pooling
4. Later FC layers

When we use CNN in reconstruction configuration, does it "store" the state (i.e. value) and position of each pixel somewhere in the network so that it can be reconstructed? That feels strange. 
During reconstruction, what operations in a CNN allow 1-d numbers to be transformed into an image? We only have a few basic ops like Deconv, ReLU, pool, addition, scaling.
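A hand-rolled 1-D transposed convolution ("deconv") shows the basic mechanism: each input number stamps a scaled copy of the kernel into the output, at a location set by the number's index. This is a bare NumPy sketch with made-up values, not a framework API.

```python
import numpy as np

def conv_transpose_1d(x, kernel, stride=2):
    """Each x[i] adds x[i] * kernel to the output, starting at offset i * stride."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out

code = np.array([1.0, -2.0])              # two "latent" numbers
kernel = np.array([1.0, 0.5, 0.25])       # learned weights in a real decoder
signal = conv_transpose_1d(code, kernel)  # a longer signal built from two numbers
```

So nothing has to "store" every pixel: where something appears comes from which activation is non-zero, and what it looks like comes from the kernel weights.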


### Complete elimination of pairs vs. Weak Sharedness

The goal is to completely eliminate the need for pairs, but in that case we lose a lot of useful relationships and also have bad starting points.

Should first try to incorporate / formalize the idea of "weak sharedness" or "approximate sharedness". It keeps the formalism a lot more similar and the proof provides a jumping off point.

## All is not well


The unsupervised Manifold / Graph approach based on Group Theory definition kinda seems a cleaner direction. But the paper is too technical for me right now to really understand it.

I feel no energy to think or work on stuff right now. My brain is all foggy and I feel sleepy. I don't feel 'sad' in the usual sense but I feel a kind of weight on myself.

I can't focus. I show up here every day and do nothing productive. The day just melts away into nothingness. The problem that I have decided on feels not very meaningful in the end and I can't prove that it is supposed to work. Approx sharedness idea feels too trivial and at the same time too overwhelming. I haven't even thought about correlated variables case in a long time. I really really want to have seminar soon but it seems that I can't do anything right now. I don't know why I feel this way.

I feel an information overload because so many people have tried so many things. It feels that there has been so much work done in disent direction and so much already explored that I can't do any reasonably meaningful work in the framework of thesis here.

Just understanding the results / proofs is challenging enough. I do have a niche problem that can probably be solved, but there doesn't seem to be a neat theoretical formulation for it yet and no guarantees that it would work.

I am no longer worried / care about the correlated latents case because I don't have faith in even the factorized latents case. It seems that the easier / low-hanging ideas have been explored already. I don't understand VAEs fully either. There's a lot of work exploring that dimension as well.


## Approx Shared ... a bit more formal


$p(x_1,x_2 | S = S_1)$ -- some fixed set of shared factors.

$p(x_1,x_2 | S = S_1)$ and $p(x_1,x_2 | S = S_2)$ -- a mixture e.g. $S_1$ represents shared scales and $S_2$ represents shared posX

* The values of properties come from a distribution
* Approx sharedness? Scale 1 is closer to scale 2 than scale 1 is to scale 5
* If $x_1$ has scale 1 and $x_2$ has scale 2 we can say that $x_1$ and $x_2$ approximately share scales.
* We can't assume anything about the marginal probability of any particular scale just based on closeness on the measuring scale, e.g. scale 1 may be way more frequent than scale 2, and scale 5 may be as frequent as scale 1.
* But in data space we will have a unique minimizer $i \in \{\text{scales}\} \setminus \{\text{actual scale}\}$ that is closest --> Not true! 2 is as close to 3 as 4 is

Instead of equality in latents we can have a bound

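Let $z_s$ be the latent corresponding to scale and let the true value of $z_s$ for scale 1 be $z_{s1}$. If we perturb it, it will affect the recon slightly as well and we will get a loss. Writing $\hat{x}_1$ for the recon from $z_{s1}$ and $\hat{x}_1^{\delta}$ for the recon from the perturbed latent (and assuming $z_{s1}$ gives the best recon):

$ z_s = z_{s1} + \delta_{z_s} \rightarrow ||x_1 - \hat{x}_1^{\delta}|| \geq || x_1 - \hat{x}_1|| $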

We can test this idea by conditionally sampling from dataset and calculating recon / diff between close values.
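A toy version of that test, with a synthetic renderer standing in for conditional sampling from a real dataset (so everything below is an assumption: the `render_square` helper, the canvas size, the list of scales, and squared pixel error as the "diff").

```python
import numpy as np

def render_square(scale, canvas=32):
    """Centered filled square whose half side length is `scale` pixels."""
    img = np.zeros((canvas, canvas))
    c = canvas // 2
    img[c - scale : c + scale, c - scale : c + scale] = 1.0
    return img

scales = [2, 3, 4, 5, 6]
# D[a, b] = squared pixel difference between squares of scale a and scale b,
# with all other factors (position, shape) held fixed.
D = np.array([[np.sum((render_square(a) - render_square(b)) ** 2) for b in scales]
              for a in scales])
```

In this toy, neighbouring scales are closer in pixel space than distant ones, but the spacing is uneven: scale 3 is closer to scale 2 than to scale 4, even though both are one step away in factor space.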

We probably can't have a good upper bound for this perturbation... 


We also have an implicit <u>assumption about the scale</u> of latent representations. That is, we assume that they're not too spread apart, e.g. if we sweep z from -1.96 to +1.96 we will probably see all the possible states encoded by it. OTOH if we have only a few possible values (as in scales) and sweep from -1.96 to -1.00 we might not actually change anything perceptually.

I think that prior $\mathcal{N}(0, I)$ already regularizes / enforces a scale.
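For reference, the amount of $\mathcal{N}(0, 1)$ prior mass covered by a sweep can be computed directly from the standard normal CDF. The helper names below are made up; only `math.erf` from the standard library is used.

```python
import math

def std_normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prior_mass(lo, hi):
    """Probability that a N(0, 1) latent falls inside [lo, hi]."""
    return std_normal_cdf(hi) - std_normal_cdf(lo)

full_sweep = prior_mass(-1.96, 1.96)    # the usual traversal range, about 95%
partial_sweep = prior_mass(-1.96, -1.0) # the shorter sweep, roughly 13%
```

This says nothing about how many perceptually distinct states live in a range; it only shows how much of the prior a sweep visits if the latent really follows the prior.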

There doesn't seem to be a dedicated method for correlated latents and no clear argument on what should happen to those in the latent layer i.e. if we should preserve the correlation or disentangle it as well. (See: Correlated VAEs and Property Controlled VAEs PC-VAEs)



There are 2 main papers that I know which deal with correlated latent disentanglement.

**On Disentangled Representations Learned from Correlated Data**, which uses the method from **Weakly-Supervised Disentanglement Without Compromises**, but I'm not entirely clear on WHY it works.

There are some other VAE related papers which extend it to a hierarchy of latent variables that can be used to further define these

I read about PixelVAE which uses PixelCNN and also about models like Property Controlled VAE and Correlated VAE.

Thinking about the correlated latents case, we can have some latents which we want to not be correlated and some that we're fine with even if they are correlated, so there would be 2 groups. The idea is that some correlations might be helpful and some might be sensitive. Though, we might or might not know beforehand to which group an attribute belongs.

**When might we want to resolve correlations and when not?**


A summary from the corr data paper to think / ponder upon

**Goal of paper is: Empirically assess to what extent the additional inductive biases of SoTA disent methods still suffice to learn disent reprs when the tr data exhibits correlations**

**Impossibility result for unsup disent non-paired case states: If the gt model has independent FoVs then there may be (infinitely) many generative models which achieve the optimal likelihood i.e. they're all indistinguishable from the "true" model**

**If FoVs are correlated and we use a factorized prior p(z) then methods which optimize some kind of ELBO might have a bias against disentanglement. If gt factors p(c) are correlated a disent representation will never achieve the optimal likelihood and therefore entangled reprs are preferred**


Three reasons given:

1. Independently control relevant semantic aspects of data during inference or generation
2. Ability to sample OOD examples e.g. in which foot size changes indept of body height
3. Fairness wrt sensitive attributes

... using indept prior is not a good idea.. so what prior do we use then?

... converting a joint distribution in which rnd vars are correlated to a joint distribution in which vars are indept...converting a gaussian with non-identity covar mat to one with identity covar mat.. normalizing flows provide a way to do this ?
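For the Gaussian case this conversion is plain whitening, which can be seen as the simplest (affine) flow. A small NumPy sketch with an arbitrary example covariance:

```python
import numpy as np

Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])            # correlated covariance (example values)
L = np.linalg.cholesky(Sigma)             # Sigma = L @ L.T

rng = np.random.default_rng(0)
z_corr = rng.normal(size=(10000, 2)) @ L.T      # samples with covariance ~ Sigma
z_white = np.linalg.solve(L, z_corr.T).T        # apply L^{-1}: covariance ~ identity

# Exact check without sampling: L^{-1} Sigma L^{-T} should be the identity.
cov_white_exact = np.linalg.solve(L, np.linalg.solve(L, Sigma).T)
```

For non-Gaussian correlated factors a linear map only removes second-order correlation, not dependence in general, which is where a non-linear flow would be needed.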

# Correlated VAE

Extend standard VAE by encouraging latent reprs. to take the correlation structure into account.

Apply a correlated prior on the latent variables following the structure known a priori.

Two Variants
- CVAE<sub>ind</sub> : Applies corr prior (eq.4 in paper) but still uses fully factorized singleton variational family
- CVAE<sub>corr</sub> : Applies corr prior (eq.4 in paper) and also incorporates pairwise latent variational densities along with singleton (eq.5 in paper)

Fit VAEs with a set of latent vars having an acyclic correlation graph (ACG).

In the case of ACG, prior and posterior approx of the latent var densities can be expressed exactly (in closed form).

$ \sum_{i=1}^n KL( q_\lambda(z_i|x_i) || p_0(z_i) )$ - As seen here, the KL term in the ELBO is a sum over per-data-point KL-divergence terms, which means we do not consider any correlation between the latent representations of different data points: for all $i \neq j$, $z_i$ is not correlated with $z_j$.

This happens because the approximation family $ q_\lambda(z | x) $ factorizes over data points, i.e. $\prod_i q_\lambda(z_i | x_i)$, and the prior factorizes as well, i.e. $ p_0(z) = \prod_i p_0(z_i)$



If we already know that there is corr. b/w datapoints we can incorporate this knowledge into the generative process of VAEs (means Encoder part? how? Per the summary above, CVAE does it through the prior on the latents.)

Assume that correlation structure is given by an undirected (acyclic) graph $G = (V,E)$. There's an edge $(v_i,v_j) \in E$ if datapoints $x_i$ and $x_j$ are correlated. (Extension to general graphs in section 3 of paper)

Instead of being a distribution over $\mathbb{R}^d$ the prior is now a distribution over $\mathbb{R}^d \times \mathbb{R}^d \ldots \times \mathbb{R}^d$

If $G$ is an undirected acyclic graph, such a joint dist exists and can be expressed with only the singleton and pair-wise marginal dists, without having an intractable normalization constant.


$$ \begin{equation*} p_0^{corr}(Z) = \prod_{i=1}^n p_0(z_i) \prod_{(v_i,v_j) \in E} \frac{p_0(z_i, z_j)}{p_0(z_i)p_0(z_j)} \end{equation*} $$
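A numerical sanity check of this factorization on a small tree, under simplifying assumptions that are mine and not the paper's setup: scalar latents per node, standard normal singleton marginals, and a bivariate normal with correlation `rho` on every edge.

```python
import numpy as np

def log_normal(z):
    """log N(z; 0, 1), elementwise."""
    return -0.5 * (np.log(2.0 * np.pi) + z ** 2)

def log_bivariate_normal(zi, zj, rho):
    """log density of a standard bivariate normal with correlation rho."""
    quad = (zi ** 2 - 2.0 * rho * zi * zj + zj ** 2) / (1.0 - rho ** 2)
    return -np.log(2.0 * np.pi) - 0.5 * np.log(1.0 - rho ** 2) - 0.5 * quad

def log_corr_prior(z, edges, rho):
    """Sum of singleton log-marginals plus one log-ratio term per edge."""
    total = np.sum(log_normal(z))
    for i, j in edges:
        total += log_bivariate_normal(z[i], z[j], rho) - log_normal(z[i]) - log_normal(z[j])
    return float(total)

z = np.array([0.3, -0.5, 1.2])
chain = [(0, 1), (1, 2)]      # acyclic graph: z0 - z1 - z2
rho = 0.6
lp = log_corr_prior(z, chain, rho)
```

For this chain the product of singleton and pairwise terms is itself a properly normalized Gaussian (covariance entries `rho ** abs(i - j)`), which is the "no intractable normalization constant" property in a concrete case. With no edges it reduces to the usual factorized prior.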

Correlated variational family used in CVAE<sub>corr</sub>


$$ \begin{equation*} q_\lambda(Z|X) = \prod_{i=1}^n q_\lambda(z_i|x_i) \prod_{(v_i,v_j) \in E} \frac{q_\lambda(z_i, z_j| x_i, x_j)}{ q_\lambda(z_i|x_i) \, q_\lambda(z_j|x_j)} \end{equation*} $$






In a VAE we learn a different distribution for every individual $x_i$. It is an amortized scheme, i.e. we use a powerful function (a NN) to predict the parameters of these individual distributions. That NN itself has parameters.
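A bare-bones picture of amortization, with an untrained one-hidden-layer network written in NumPy. The layer sizes, the `tanh` nonlinearity and the parameter names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, z_dim = 8, 16, 2

# Shared encoder parameters: the same weights are used for every data point.
params = {
    "W1": rng.normal(scale=0.5, size=(x_dim, h_dim)), "b1": np.zeros(h_dim),
    "W_mu": rng.normal(scale=0.5, size=(h_dim, z_dim)), "b_mu": np.zeros(z_dim),
    "W_lv": rng.normal(scale=0.5, size=(h_dim, z_dim)), "b_lv": np.zeros(z_dim),
}

def encode(x, p):
    """Map each row of x to the mean and log-variance of its own q(z | x)."""
    h = np.tanh(x @ p["W1"] + p["b1"])
    return h @ p["W_mu"] + p["b_mu"], h @ p["W_lv"] + p["b_lv"]

X = rng.normal(size=(5, x_dim))     # five data points
mu, logvar = encode(X, params)      # five different Gaussians, one parameter set
```

Each $x_i$ gets its own `(mu[i], logvar[i])`, but what is optimized during training is only `params`.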