# Evaluation of RV-QNPL Through Simulation Studies 

## Aim

The aim of this notebook is to demonstrate the workflow used to 1) simulate genotype data for families conditional on quantitative trait values using RarePedSim and 2) analyze the simulated data using RV-QNPL to evaluate type I error and power.

## Method & Workflow overview

1. [`RarePedSim`](http://www.bioinformatics.org/simped/rare/) is used to simulate genotypes for family data under a linear mean-shift model, with the output written as a VCF file. (The parameters for RarePedSim are stored in a configuration file generated by the workflow shown below.)

2. The "collapse" command from `rvnpl` is used to generate the collapsed haplotype pattern (CHP) regional markers for the pedigrees.

3. The "qnpl" command from `rvnpl` is used to analyze the quantitative trait and CHP markers.

The default workflow can be executed with the following command:

```
sos run analysis/QNPL_Simulation.ipynb simulate
```

Alternatively, the simulation can be performed so that only a proportion of functional rare variants, e.g., 50%, contribute to disease etiology:
```
sos run analysis/QNPL_Simulation.ipynb simulate --name quant_Prop50 --proportion 0.5
```

Or, to simulate genotypes under the null (no linkage), the mean shift is set to 0:
```
sos run analysis/QNPL_Simulation.ipynb simulate --name quant_null --mean_shift 0.0
```

## Input Data
1. The PED file contains the pedigree structures as well as the pre-specified quantitative trait values.
2. The SFS file contains the variant information for each simulated gene (7-column format: gene, chromosome, position, ref, alt, MAF, function score)*.

*For the SFS files used here, the variant information was downloaded from the gnomAD website. The non-Finnish European allele frequency is used for the MAF column. The value for the "function score" column is based on the functional annotations from gnomAD.
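
As a rough illustration of the SFS layout described above, the following Python sketch parses one record into named fields. The whitespace delimiter, the field names, and the example record are assumptions made for illustration only; the RarePedSim documentation remains the authoritative description of the format.

```python
from typing import NamedTuple


class SfsRecord(NamedTuple):
    """One variant row of an SFS file (field names are illustrative)."""
    gene: str
    chromosome: str
    position: int
    ref: str
    alt: str
    maf: float
    function_score: float


def parse_sfs_line(line: str) -> SfsRecord:
    """Split a whitespace-delimited record into typed fields (delimiter is assumed)."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    gene, chrom, pos, ref, alt, maf, score = fields
    return SfsRecord(gene, chrom, int(pos), ref, alt, float(maf), float(score))


# Made-up record for illustration; not taken from gnomAD.
example = parse_sfs_line("GENE1 1 12345 A G 0.0004 0.01")
print(example.gene, example.position, example.maf)
```

A check like the field count above can catch a malformed SFS file before it is passed to the simulation step.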

### Global Parameter Setting

In [1]:
[global]
# Disease model scenario
parameter: name = 'quant_Prop100'
# proportion of functional variants that contribute to the disease
parameter: proportion = 'None'
# Mean shift value of detrimental rare variants
parameter: mean_shift = 1.0
# model: LOGIT for qualitative traits or LNR for quantitative traits
parameter: model = 'LNR'
# Path to the ped file (6-column PED format); quantitative trait values are standardized
parameter: ped_file = path('data/100extend_quant.ped')
# Path to list of genes
parameter: gene_list = path('data/genes.txt')
# the output directory for VCF file
parameter: out_dir = path('output')

# gene names
# skip blank lines in the gene list so that no empty '.sfs' path is produced
genes = paths([f'{gene_list:d}/{x.strip()}.sfs' for x in open(gene_list).readlines() if x.strip()])

### Configuration file for disease model
First, we generate the configuration file for the disease model, which is used by the simulation software `RarePedSim`.

#### Specify various simulation scenarios through the configuration file

The configuration file serves as the primary input for `RarePedSim` to generate simulated genotypes for families with given phenotype values. For quantitative traits (QT), a linear mean-shift ("LNR") model is used to model the effect of causal rare variants on the distribution of QT values. The input `ped` file contains a column of pre-specified QT values. These QT values were sampled from an $N(2,1)$ distribution for specified family members and from $N(0,1)$ for the others, then standardized (mean = 0, standard deviation = 1) and saved as the last column of the PED file. Given these QT values and the MAF data from gnomAD, we can generate genotype data under different scenarios by setting different parameter values in the configuration file. For example:

1. When genotypes are simulated under the null that rare variants are not associated with QT values, we set "meanshift_rare_detrimental=0.0". Genotypes are then generated from the given MAF assuming Hardy-Weinberg equilibrium, regardless of QT values.
2. When simulating under the alternative that rare variants cause a mean shift in QT values, we use "meanshift_rare_detrimental=1.0" to assume that each rare variant increases the QT by one standard deviation. That is, QT values are shifted from an $N(0,\sigma^2)$ distribution to $N(\sigma M, \sigma^2)$, where $M$ is the total count of detrimental rare variants in the gene. The distribution of $M$ is then estimated from the observed QT and used to adjust the baseline MAF so that genotypes are simulated with QT associations. For additional details please refer to Section 3.3 of [RarePedSim Documentation](https://github.com/statgenetics/rarepedsim/blob/master/doc/doc_RarePedSim.pdf).
3. When simulating with different proportions of rare variants being causal, we can specify a value for the parameter "proportion_causal" (e.g., proportion_causal=0.5).

Please refer to the Appendix of [RarePedSim Documentation](https://github.com/statgenetics/rarepedsim/blob/master/doc/doc_RarePedSim.pdf) for more details on other parameters used in the configuration file.
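
To make the QT sampling and the mean-shift model above more concrete, here is a minimal Python sketch. It is not RarePedSim's implementation. It assumes the population standard deviation is used for standardization, reads "the distribution of $M$ is estimated from the observed QT" as a simple Bayes update with likelihood $N(\text{mean\_shift}\cdot\sigma M, \sigma^2)$, and uses a made-up prior over $M \in \{0, 1, 2\}$; the function names are hypothetical.

```python
import math
import random
import statistics


def simulate_standardized_qt(n_shifted, n_other, seed=0):
    """Draw QT from N(2,1) for selected members and N(0,1) for the rest, then z-score."""
    rng = random.Random(seed)
    raw = [rng.gauss(2.0, 1.0) for _ in range(n_shifted)]
    raw += [rng.gauss(0.0, 1.0) for _ in range(n_other)]
    mean = statistics.fmean(raw)
    sd = statistics.pstdev(raw)  # assumption: population SD
    return [(x - mean) / sd for x in raw]


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_variant_count(y, prior, mean_shift=1.0, sigma=1.0):
    """Return P(M = m | y) for m = 0..len(prior)-1 via Bayes' rule.

    Likelihood (assumed reading of the text): y ~ N(mean_shift * sigma * m, sigma^2).
    """
    weights = [p * normal_pdf(y, mean_shift * sigma * m, sigma)
               for m, p in enumerate(prior)]
    total = sum(weights)
    return [w / total for w in weights]


qt = simulate_standardized_qt(n_shifted=20, n_other=80)
prior = [0.98, 0.019, 0.001]  # made-up prior over M, for illustration only

# With mean_shift = 0 the QT carries no information, so the posterior equals the prior.
print(posterior_variant_count(2.5, prior, mean_shift=0.0))
# With mean_shift = 1 a high QT value moves probability mass toward M >= 1.
print(posterior_variant_count(2.5, prior, mean_shift=1.0))
```

The null case shows why setting the mean shift to 0 yields genotypes that do not depend on QT values: the likelihood is identical for every value of $M$, so only the baseline frequencies matter.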

In [None]:
[make_config: provides = f'{out_dir}/{name}.conf']
# conf file contains the simulation specifications (either Mendelian or Complex, details in RarePedSim doc)
output: f'{out_dir}/{name}.conf'
report: expand=True, output=_output
    trait_type=Complex
    [model]
    model={model}
    [quality control]
    def_rare=0.01
    rare_only=True
    def_neutral=(-1E-5, 1E-5)
    def_protective=(-1, -1E-5)
    [phenotype parameters]
    baseline_effect=0.01
    moi=MAV
    proportion_causal={proportion}
    [LOGIT model]
    OR_rare_detrimental=None
    OR_rare_protective=None
    ORmax_rare_detrimental=None
    ORmin_rare_protective=None
    OR_common_detrimental=None
    OR_common_protective=None
    [LNR model]
    meanshift_rare_detrimental={mean_shift}
    meanshift_rare_protective=None
    meanshiftmax_rare_detrimental=None
    meanshiftmax_rare_protective=None
    meanshift_common_detrimental=None
    meanshift_common_protective=None
    [genotyping artifact]
    missing_low_maf=None
    missing_sites=None
    missing_calls=None
    error_calls=None
    [other]
    max_vars=2
    ascertainment_qualitative=(0,0)
    ascertainment_quantitative=((0,~),(0,~))
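
The `report` action above fills the `{model}`, `{proportion}` and `{mean_shift}` placeholders from the workflow parameters. The Python sketch below mimics that substitution with `str.format` on a trimmed copy of the template, to show which lines change between scenarios. The helper `render_config` is hypothetical and is not part of the workflow or of SoS.

```python
# Trimmed copy of the template above; only the three interpolated keys are kept.
TEMPLATE = """\
[model]
model={model}
[phenotype parameters]
proportion_causal={proportion}
[LNR model]
meanshift_rare_detrimental={mean_shift}
"""


def render_config(model="LNR", proportion="None", mean_shift=1.0):
    """Fill the placeholders using the same defaults as the [global] section."""
    return TEMPLATE.format(model=model, proportion=proportion, mean_shift=mean_shift)


default_conf = render_config()               # default scenario (quant_Prop100)
half_conf = render_config(proportion=0.5)    # as with --proportion 0.5
null_conf = render_config(mean_shift=0.0)    # as with --mean_shift 0.0

print(null_conf)
```

With the defaults, the rendered file contains `proportion_causal=None` and `meanshift_rare_detrimental=1.0`; the two command-line variants shown earlier change exactly one of these lines each.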

### Generate genotypes for given families
Here, we use RarePedSim to generate genotypes for families with given quantitative traits based on the user-specified disease model in the configuration file.

The output is a VCF file, which is compressed with `bgzip` and indexed with `tabix` before the next step.

In [None]:
[simulate_1 (rarepedsim)]
depends: f'{out_dir}/{name}.conf'
input: for_each = 'genes'
output: f'{out_dir}/{_genes:bn}.vcf.gz'
bash: container = 'statisticalgenetics/rarepedsim', expand = '${ }'
    rm -rf ${_output:nn} ${_output} ${_output}.tbi && mkdir -p ${_output:nn}
    rarepedsim generate -s ${_genes:a} -c ${out_dir}/${name}.conf -p ${ped_file:a} --num_genes 1 --num_reps 1 -o ${_output:nn} --vcf -b -1 \
    && mv ${_output:nn}/${_genes:bn}/rep1.vcf ${_output:n} && rm -rf ${_output:nn}
    bgzip ${_output:n} && tabix -p vcf ${_output}
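
The shell commands above rely on SoS path format specifiers such as `:n`, `:nn` and `:bn`. The following Python sketch emulates them with `os.path` to show which paths are involved for one gene. The gene name `GENE1` and the helper functions are hypothetical; the emulation only covers the specifiers used in this step.

```python
import os.path


def strip_ext(path, times=1):
    """Remove `times` trailing extensions, like SoS `:n` / `:nn`."""
    for _ in range(times):
        path = os.path.splitext(path)[0]
    return path


def basename_no_ext(path, times=1):
    """Basename with extensions removed, like SoS `:bn` / `:bnn`."""
    return strip_ext(os.path.basename(path), times)


gene_sfs = "data/GENE1.sfs"        # hypothetical entry of `genes` (_genes)
vcf_gz = "output/GENE1.vcf.gz"     # corresponding `_output`

print(basename_no_ext(gene_sfs))   # GENE1            (${_genes:bn})
print(strip_ext(vcf_gz))           # output/GENE1.vcf (${_output:n})
print(strip_ext(vcf_gz, 2))        # output/GENE1     (${_output:nn})
```

Read this way, the step writes the simulation into a temporary directory `output/GENE1`, moves `output/GENE1/GENE1/rep1.vcf` to `output/GENE1.vcf`, removes the temporary directory, and then compresses and indexes the VCF.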

### Generate CHP regional marker
The next step is to use RV-NPL to generate CHP regional markers for the genotypes in families.

In [None]:
[simulate_2 (CHP)]
output: f'{_input:nn}/MERLIN/{_input:bnn}.CHP.ped'
bash: container = 'statisticalgenetics/rvnpl', environment={'HOME': '/seqlink'}, expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    rvnpl collapse --fam ${ped_file} --vcf ${_input} --output ${_input:nn} --freq EVSMAF -c 0.01 --rvhaplo \
    && mv ${_output:nn}.chr*.ped ${_output}

### Perform RV-QNPL analysis
Finally, we use RV-QNPL to analyze the CHP genotypes and assess the significance of rare-variant allele sharing in the families.

In [2]:
[simulate_3 (rvnpl)]
output: f'{_input:dd}/pvalue.txt'
bash: container = 'statisticalgenetics/rvnpl', environment={'HOME': '/seqlink'}, expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    rvnpl qnpl --path ${_input:dd} --output ${_output:d} --exact --rvibd --n_jobs 8 -c 0.001 --lower_cut 1E-8 --rep 2000000

### Results

<Add Result> 

The p-values for the two QNPL scores are written to the file `pvalue.txt` in the corresponding gene folder under the output directory.
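
Once p-values have been collected across replicates, type I error and power can both be estimated as a rejection rate. The Python sketch below shows that calculation only. It makes no assumption about the layout of `pvalue.txt`, the function name is hypothetical, and the p-values are made up for illustration rather than taken from any run.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates whose p-value is at or below alpha.

    Under the null scenario this estimates the type I error;
    under an alternative scenario it estimates power.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    return sum(p <= alpha for p in p_values) / len(p_values)


# Made-up p-values, for illustration only.
illustrative_p = [0.01, 0.20, 0.04, 0.50]
print(rejection_rate(illustrative_p, alpha=0.05))
```

The same function applies to both scenarios; only the simulation setting that produced the p-values (for example `quant_null` versus `quant_Prop100`) determines whether the result is read as type I error or as power.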