# Linear Additive Phenotype Simulation Example

Many phenotype simulation tools and GWAS models assume a linear additive genetic architecture. This notebook demonstrates how to simulate a phenotype with a linear additive genetic architecture using CITRUS.

This notebook uses the Python API to configure and run the simulation. The same simulation could also be run from the command line using the `citrus simulate` command with a config JSON file matching the config dict in this example. See the [Command Line Interface](../cli.md) documentation for more information.

## Linear Additive Genetic Architecture

A linear additive genetic architecture is one where the phenotype is the sum of the effects of each variant. The phenotype is modeled as:

$$
\begin{align}
y = \sum_{i=1}^{n} \beta_i x_i + \epsilon
\end{align}
$$

where $y$ is the phenotype, $n$ is the number of variants, $\beta_i$ is the effect size of variant $i$, $x_i$ is the genotype of variant $i$, and $\epsilon$ is the error term.
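
As a minimal sketch of the equation above, the phenotype for several samples can be computed with a matrix-vector product. This uses plain NumPy rather than CITRUS, and the sample count, effect sizes, and noise scale are arbitrary illustrative values, not taken from this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_samples, n_variants = 5, 3

# Genotype matrix: x[j, i] is the genotype (0 or 1) of variant i in sample j.
x = rng.integers(0, 2, size=(n_samples, n_variants))

# Effect sizes beta_i, one per variant (arbitrary illustrative values).
beta = np.array([0.5, -0.2, 0.1])

# Error term epsilon, one draw per sample (arbitrary illustrative scale).
epsilon = rng.normal(loc=0.0, scale=0.1, size=n_samples)

# y_j = sum_i beta_i * x[j, i] + epsilon_j
y = x @ beta + epsilon
```

Each entry of `y` is the sum of the effects of the variants a sample carries, plus that sample's error term.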






# Designing the Simulation

The three main components of this architecture that require decisions are the input variants ($x_i$), the effect sizes ($\beta_i$), and the error term ($\epsilon$).

## Input Variants

These will be the SNPs that actually impact the phenotype. In the equation above, $x_i$ will be 0 if the SNP is the reference allele and 1 if it is any alternate allele. When actually running the simulation, we will get these values from a VCF file from 1000 Genomes.
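
The 0/1 encoding described above can be sketched as follows. The helper names are hypothetical, and this is not how CITRUS reads VCF files internally; it only illustrates mapping the allele indices in a VCF `GT` field to 0 (reference) or 1 (any alternate). Missing calls (`.`) are not handled here.

```python
from typing import Tuple


def encode_allele(allele_index: int) -> int:
    """Return 0 for the reference allele (index 0) and 1 for any alternate allele."""
    return 0 if allele_index == 0 else 1


def encode_genotype(gt: str) -> Tuple[int, ...]:
    """Encode a VCF GT string such as '0|2' into one 0/1 value per haplotype."""
    # Treat phased ('|') and unphased ('/') separators the same way.
    alleles = gt.replace("/", "|").split("|")
    return tuple(encode_allele(int(a)) for a in alleles)


encoded = [encode_genotype(gt) for gt in ["0|0", "0|1", "2|1", "1/0"]]
```

Note that the second alternate allele (index 2) is encoded as 1, the same as the first alternate allele.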

For this example I found input SNPs on chromosome 19 using the [dbSNP database on the NIH National Library of Medicine website](https://www.ncbi.nlm.nih.gov/snp). Below is a table of the rsIDs, loci (GRCh37), and minor allele frequencies (MAF) of the input SNPs.

| rsID      | Locus     | MAF   |
|-----------|-----------|-------|
|rs2147799496|19:54785430|0.01|
|rs2145404651|19:52901080|0.01|
|rs2145329733|19:16023111|0.02|
|rs4029|19:55555845|0.08|
|rs2147796048|19:54784716|0.22|
|rs2147925780|19:11397030|0.41|
|rs2145330557|19:17186520|0.47|

## Effect Sizes

The effect sizes will be the $\beta_i$ values in the equation above.

Rarer variants often have larger effect sizes. To model that in this simulation, input variants are split into two groups: rare (1-2% MAF) and common. Both groups will then draw betas from mean 0 normal distributions, but the rare variants will draw from a distribution with a greater variance. The rare group will have a standard deviation of 0.5 and the common group will have a standard deviation of 0.1.
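
The grouping and sampling described above can be sketched in NumPy. In the actual simulation the betas are drawn by CITRUS operator nodes; this stand-in only illustrates the two distributions. The threshold constant and variable names are assumptions for illustration, and the MAF values come from the table of input SNPs.

```python
import numpy as np

# MAF values from the table of input SNPs.
maf = {
    "rs2147799496": 0.01,
    "rs2145404651": 0.01,
    "rs2145329733": 0.02,
    "rs4029": 0.08,
    "rs2147796048": 0.22,
    "rs2147925780": 0.41,
    "rs2145330557": 0.47,
}

RARE_MAF_THRESHOLD = 0.02  # assumed cutoff: the rare group is 1-2% MAF

rare = [rsid for rsid, freq in maf.items() if freq <= RARE_MAF_THRESHOLD]
common = [rsid for rsid, freq in maf.items() if freq > RARE_MAF_THRESHOLD]

rng = np.random.default_rng(seed=0)

# Both groups draw from mean-0 normal distributions; rare variants get the wider one.
rare_betas = rng.normal(loc=0.0, scale=0.5, size=len(rare))
common_betas = rng.normal(loc=0.0, scale=0.1, size=len(common))
```

With these inputs, three SNPs fall in the rare group and four in the common group, matching the grouping used in the input configuration.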

## Noise

There are two ways to model the noise term. The first is to draw a noise value from a normal distribution and add it to the phenotype. The challenge with this approach is choosing a standard deviation for the noise distribution that gives the desired ratio of genetic signal to noise.
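
To make the difficulty concrete, here is a short worked relation. It assumes the genetic component $g = \sum_{i=1}^{n} \beta_i x_i$ and the noise are independent, and it introduces two symbols that do not appear elsewhere in this notebook: $h^2$ for the target fraction of phenotypic variance explained by $g$, and $\sigma_\epsilon^2$ for the noise variance.

$$
\begin{align}
h^2 = \frac{\mathrm{Var}(g)}{\mathrm{Var}(g) + \sigma_\epsilon^2}
\quad\Longrightarrow\quad
\sigma_\epsilon^2 = \mathrm{Var}(g)\,\frac{1 - h^2}{h^2}
\end{align}
$$

The required noise variance depends on $\mathrm{Var}(g)$, which is only known once the effect sizes and genotypes are fixed.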

The second approach is to use the [Heritability](../operator_nodes/noise.md#class-heritability) operator node. This node allows you to set a heritability value and then returns a weighted average of the input signal and sampled noise that will give you the desired heritability. This approach is used in this example.
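
The idea of mixing signal and noise to hit a target heritability can be sketched as below. This is not the implementation of the Heritability node; it is one common construction, assumed here for illustration, in which the signal is standardized and combined with standard normal noise using weights $\sqrt{h^2}$ and $\sqrt{1 - h^2}$. The function name and the example signal are hypothetical.

```python
import numpy as np


def add_noise_for_heritability(signal, heritability, rng):
    """Combine a standardized signal with standard normal noise.

    Returns the noisy phenotype and the scaled genetic part.
    """
    signal = np.asarray(signal, dtype=float)
    # Standardize the genetic signal to mean 0 and variance 1.
    standardized = (signal - signal.mean()) / signal.std()
    noise = rng.normal(loc=0.0, scale=1.0, size=standardized.shape)
    genetic_part = np.sqrt(heritability) * standardized
    noise_part = np.sqrt(1.0 - heritability) * noise
    return genetic_part + noise_part, genetic_part


rng = np.random.default_rng(seed=0)
signal = rng.normal(loc=0.0, scale=0.3, size=100_000)  # stand-in genetic signal
phenotype, genetic_part = add_noise_for_heritability(signal, heritability=0.4, rng=rng)

# Fraction of phenotypic variance attributable to the genetic part.
realized_h2 = np.var(genetic_part) / np.var(phenotype)
```

With a large sample, `realized_h2` should land close to the requested value; with few samples it can deviate noticeably because the sampled noise variance and its chance correlation with the signal fluctuate.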

# Configuring the Simulation

To run a simulation, we have to define it as a configuration dict (or as a JSON file when using the command line interface). The simulation configuration has two keys: 'input' and 'simulation_steps'. The 'input' key defines the input variants and the 'simulation_steps' key defines the steps of the simulation.

## Input Configuration

In this simulation there is a single input source file which generates the values for all the input nodes. The value for the 'input' key is a list of dicts defining input sources. Here, it will be a list with a single dict defining that source file and input nodes.

The input nodes are all of the [SNP type](../input_nodes.md#single-nucleotide-polymorphisms-snps-snp). For more information on [input nodes](../input_nodes.md) and [input sources](../input_sources.md), see the linked documentation. 

```python
config_dict = {
    "input": [
        {
            "file": "1000_genomes_data/ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz",
            "file_format": "vcf",
            "reference_genome": "GRCh37",
            "force_bgz": True,
            "input_nodes": [
                {
                    "alias": "rare_variants",
                    "type": "SNP",
                    "chr": "19",
                    "pos": [54785430, 52901080, 16023111]
                },
                {
                    "alias": "common_variants",
                    "type": "SNP",
                    "chr": "19",
                    "pos": [55555845, 54784716, 11397030, 17186520]
                }
            ]
        }
    ]
}
```
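
Since the command line interface takes the configuration as a JSON file, a dict like this can be serialized with the standard `json` module. The sketch below uses a trimmed stand-in dict, a hypothetical VCF file name, and a hypothetical output file name written to a temporary directory; it does not call CITRUS.

```python
import json
import os
import tempfile

# Trimmed stand-in for config_dict; the VCF file name is hypothetical.
example_config = {
    "input": [
        {
            "file": "example.vcf.gz",
            "file_format": "vcf",
            "reference_genome": "GRCh37",
            "force_bgz": True,
            "input_nodes": [
                {"alias": "rare_variants", "type": "SNP", "chr": "19", "pos": [54785430]}
            ],
        }
    ]
}

# Hypothetical output location for the JSON config.
config_path = os.path.join(tempfile.mkdtemp(), "linear_additive_config.json")

with open(config_path, "w") as f:
    json.dump(example_config, f, indent=2)

# Read the file back to confirm the round trip preserves the structure.
with open(config_path) as f:
    json_text = f.read()
loaded_config = json.loads(json_text)
```

Python's `True` is written as JSON `true`, so a config JSON written by hand should use the lowercase form.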

## Simulation Steps Configuration

This simulation requires the following operator nodes:

- 2 [RandomConstant](../operator_nodes/constant_func.md#class-randomconstant) nodes to draw the effect sizes for the rare and common variants.
- 2 [Product](../operator_nodes/math_func.md#class-product) nodes to multiply the effect sizes by the input genotypes to get the effect values.
- 1 [Concat](../operator_nodes/util_func.md#class-concat) node to combine the rare and common variant effect sizes.