# Models

## Theory

This section covers the theory behind the different generative models that were implemented. <br />
Below you can find a brief glossary of variables and terminology used in the following paragraphs.
- L: Number of positions in the sequence
- D: Vocabulary size, in this case usually the 20 amino acids


### Autoregressive Models

Autoregressive language models perform token-wise generation, where the likelihood of the next token is conditioned on the preceding tokens. The key assumption of this modeling paradigm is that the joint likelihood of a sequence factorizes into a product of conditionals:
$$ p(x) = \prod_{i=1}^L p(x_i|x_{<i})$$
This way, one can also easily perform conditional generation by prepending a prompt to the sequence, which the model conditions on when generating the tokens of the sequence. The likelihood then becomes $p(x|c) = \prod_{i=1}^L p(x_i|x_{<i}, c)$, with $c$ denoting the prompt.
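
As a minimal sketch of this factorization, the snippet below sums the conditional log-likelihoods position by position, optionally after a prompt. The three-letter vocabulary and the `toy_conditional` function are hypothetical stand-ins for a trained network and are not part of the implemented models.

```python
import math

# Toy vocabulary for illustration; the real vocabulary would be the amino acids.
VOCAB = ["A", "C", "G"]


def toy_conditional(prefix):
    """Hypothetical stand-in for p(x_i | x_{<i}): favours repeating the last token."""
    if not prefix:
        return {tok: 1.0 / len(VOCAB) for tok in VOCAB}
    last = prefix[-1]
    return {tok: (0.5 if tok == last else 0.25) for tok in VOCAB}


def sequence_log_likelihood(seq, prompt=()):
    """Sum log p(x_i | x_{<i}, c) over all positions of seq.

    The prompt c only serves as context; its own tokens are not scored.
    """
    context = list(prompt)
    total = 0.0
    for tok in seq:
        total += math.log(toy_conditional(context)[tok])
        context.append(tok)
    return total


unconditional = sequence_log_likelihood("AAC")
conditional = sequence_log_likelihood("AAC", prompt=("C",))
```

Working in log space turns the product into a sum, which avoids numerical underflow for long sequences.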
Within the autoregressive framework, the conditional likelihood at position $i$ is usually parametrized by a transformer neural network, i.e. the network that is being trained outputs (unnormalized) logits for each token in the vocabulary. Since this network takes the full fixed-length input at once, the correct conditional dependencies are enforced by an attention mask that hides all tokens succeeding the position of interest, so that no information leaks from later positions.
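
The masking idea can be sketched in a few lines of plain Python. The function names `causal_mask` and `masked_softmax` are illustrative, and the sketch assumes the common convention that a position may attend to itself and to all earlier positions.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(row_scores, row_mask):
    """Softmax over one row of attention scores; hidden positions get weight 0."""
    visible = [s for s, m in zip(row_scores, row_mask) if m]
    top = max(visible)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(row_scores, row_mask)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
# Position 1 sees positions 0 and 1 but not position 2.
weights = masked_softmax([0.0, 0.0, 0.0], mask[1])
```

In practice, deep learning frameworks apply the same lower-triangular mask to the whole attention matrix at once rather than row by row.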
Autoregressive models are trained by minimizing the negative log-likelihood of a batch of $n$ sequences:
$$ \mathcal{L}_{NLL}(\text{batch}) = - \sum_{j=1}^n \sum_{i=1}^L \log p_\theta(x^{(j)}_i|x^{(j)}_{<i}) $$
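
A minimal sketch of the negative log-likelihood objective, assuming the network has already produced one logit vector per position (each computed from the preceding tokens only). The nested-list layout and the function names are illustrative; actual training code would typically rely on a framework's cross-entropy loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax-normalized logits."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(s - top) for s in logits))
    return [s - log_norm for s in logits]


def nll_loss(batch_logits, batch_targets):
    """Negative log-likelihood summed over sequences j and positions i.

    batch_logits[j][i] holds the D logits for position i of sequence j,
    batch_targets[j][i] holds the index of the observed token x_i^(j).
    """
    total = 0.0
    for seq_logits, seq_targets in zip(batch_logits, batch_targets):
        for logits, target in zip(seq_logits, seq_targets):
            total -= log_softmax(logits)[target]
    return total


# Two sequences of length 3 over a vocabulary of size 4, with uniform logits.
uniform_logits = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
targets = [[0, 1, 2], [3, 3, 0]]
loss = nll_loss(uniform_logits, targets)
```

With uniform logits every token has probability $1/D$, so the loss reduces to $n \cdot L \cdot \log D$, which is a convenient sanity check for an untrained model.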
Generation with autoregressive models is performed sequentially in a left-to-right manner, since all preceding tokens have to be sampled first in order to determine the conditional likelihood of the next.
A variety of techniques exist to steer the sampling process and trade off sample diversity against coherence, such as
- Top-k sampling: Limit the sampling distribution to the k most likely tokens
- Top-p sampling: Limit the sampling distribution to the smallest set of most likely tokens whose cumulative likelihood reaches $p$
- Temperature: Temper the softmax used to normalize the conditional logits $p(x_i = k|x_{<i}) = \frac{\exp(s_\theta(x_i = k|x_{<i})/T)}{\sum_a^D\exp( s_\theta(x_i = a|x_{<i})/T)}$. Higher temperature flattens the conditional distribution, thus introducing more randomness
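- Repetition penalty: Down-weight the logits of tokens that have already been generated, discouraging the model from repeating itself
- Beam search: Instead of sampling a single token per step, keep the $b$ highest-scoring partial sequences and extend each of them, returning the most likely completed sequence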

These parameters allow one to trade off more coherent but repetitive samples against more creative but less coherent ones.
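
The temperature, top-k and top-p techniques can be sketched for a single sampling step as follows. This is an illustration using only the standard library, not the sampling code of any particular model; the function names are hypothetical, and top-p is assumed to follow the nucleus-sampling convention of keeping the smallest set of tokens whose cumulative probability reaches $p$.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [s / temperature for s in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _keep_and_renormalize(probs, keep):
    """Zero out all tokens outside `keep` and renormalize the rest."""
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


def filter_top_k(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _keep_and_renormalize(probs, set(order[:k]))


def filter_top_p(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _keep_and_renormalize(probs, keep)


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token index after applying the optional filters."""
    probs = softmax(logits, temperature)
    if top_k is not None:
        probs = filter_top_k(probs, top_k)
    if top_p is not None:
        probs = filter_top_p(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible. With `top_k=1` the step becomes greedy decoding and always returns the most likely token.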


#### ZymCTRL

One particular autoregressive architecture is the CTRL model ([Keskar et al., 2019](https://doi.org/10.48550/arXiv.1909.05858)). It takes advantage of the simple extension of unconditional autoregressive models to conditional generation by introducing a control tag, which encodes some property of the data. Different concepts can be encoded in these tags, for instance the text style or a specific task. <br /><br />
The ZymCTRL adaptation of the CTRL model ([Munsamy et al., 2024](https://doi.org/10.1101/2024.05.03.592223)) uses Enzyme Commission (EC) numbers as control tags. The EC number encodes the function of an enzyme in a hierarchical fashion and consists of four numbers separated by periods, e.g. `1.1.1.1`. ZymCTRL leverages the GPT-2 transformer architecture ([Radford et al., 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)). Training is conducted on data from the BRENDA enzyme database ()
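
To illustrate the hierarchical structure of EC numbers, the sketch below expands a fully specified EC number into its four nested levels (class, subclass, sub-subclass, serial number). The helper `ec_hierarchy` is hypothetical and not part of ZymCTRL; it deliberately rejects partial or preliminary EC numbers and says nothing about how ZymCTRL tokenizes the tag.

```python
def ec_hierarchy(ec_number):
    """Return the nested prefixes of a fully specified EC number.

    Each prefix names a broader functional category than the next one.
    """
    fields = ec_number.split(".")
    if len(fields) != 4 or not all(f.isdigit() for f in fields):
        raise ValueError(f"expected four dot-separated integers, got {ec_number!r}")
    return [".".join(fields[: depth + 1]) for depth in range(4)]


# EC 1.1.1.1 is alcohol dehydrogenase, a member of class 1 (oxidoreductases).
levels = ec_hierarchy("1.1.1.1")
```

Enzymes that share a longer prefix are functionally more similar, which is what makes the EC number a natural control tag for conditioning on function.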

### Diffusion and Flow models

## Training

## Sample Generation