# RV-NPL analysis for simulated genes

### Aim
In this notebook I show a workflow that simulates family data with RarePedSim, generating genotypes conditional on affection status, and then analyzes the simulated families with RV-NPL.

### Method and workflow overview

Here I analyze one gene (MYO7A):

1) Simulate family data with RarePedSim, writing the output as a VCF file (the parameters are annotated below).
2) Use the "collapse" command from rvnpl to generate CHP regional markers for the genotypes in the pedigrees.
3) Use the "npl" command from rvnpl to compute RV-NPL scores on the genotypes of the CHP markers.

Run the workflow with the following command:
```
sos run NPL_Simulation.ipynb simulation
```

### Input Data
1) A PED file describing the families
2) An SFS file containing the variant information for the simulated gene
3) A RarePedSim configuration file
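
Before running the simulation, it can help to sanity-check the PED input. The following is a minimal Python sketch, assuming the conventional 6-column layout (family ID, individual ID, father ID, mother ID, sex, affection status) with affection status coded as 2 = affected; the coding in a given file may differ. The function names `parse_ped_line` and `summarize_ped` are hypothetical helpers, not part of RarePedSim or rvnpl.

```python
def parse_ped_line(line):
    """Split one whitespace-delimited PED line into its six named fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}: {line!r}")
    keys = ("family", "individual", "father", "mother", "sex", "status")
    return dict(zip(keys, fields))


def summarize_ped(lines):
    """Count members and affected individuals per family.

    Assumption: affection status '2' means affected (common PED coding).
    """
    families = {}
    for raw in lines:
        if not raw.strip() or raw.startswith("#"):
            continue  # skip blank lines and comments
        record = parse_ped_line(raw)
        counts = families.setdefault(record["family"], {"members": 0, "affected": 0})
        counts["members"] += 1
        if record["status"] == "2":
            counts["affected"] += 1
    return families


if __name__ == "__main__":
    # Illustrative records only; not taken from the notebook's data.
    example = [
        "F1 1 0 0 1 1",
        "F1 2 0 0 2 2",
        "F1 3 1 2 1 2",
    ]
    print(summarize_ped(example))
```

To apply it to a real file, pass an open file handle, for example `summarize_ped(open(ped_file))`.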

### Global Parameter Setting

In [1]:
[global]
# Path to the ped file (6-column PED format)
parameter: ped_file = './data/100extend01.ped'
# gene name and chromosome
parameter: gene = 'MYO7A'
parameter: chromosome = 'chr11'
# Variant information for genes to analyze (sfs format)
parameter: sfs_file = './data/MYO7A.sfs'
# The conf file contains the simulation specification (either Mendelian or Complex; see the RarePedSim documentation for details)
# For this notebook, we use the same conf file throughout, with a multiplicative model
parameter: conf_file = './data/ComplexPhenotype.conf'
# the output directory for VCF file
parameter: out_dir = './data/simulation'
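
As a rough illustration of how these parameters determine where files are written, the Python sketch below mirrors the f-string patterns used in the `output:` declarations of the three workflow steps. The helper name `expected_outputs` is hypothetical; it is not part of SoS or of the notebook.

```python
from pathlib import PurePosixPath


def expected_outputs(out_dir, gene, chromosome):
    """Return the output paths the workflow steps declare, keyed by step."""
    base = PurePosixPath(out_dir) / gene
    return {
        # simulation_1: compressed VCF produced by RarePedSim + bgzip
        "simulation_1": base / "rep1.vcf.gz",
        # simulation_2: CHP marker pedigree file written by 'rvnpl collapse'
        "simulation_2": base / "rep1" / "MERLIN" / f"rep1.{chromosome}.ped",
        # simulation_3: p-values written by 'rvnpl npl'
        "simulation_3": base / "rep1" / "pvalue.txt",
    }


if __name__ == "__main__":
    for step, path in expected_outputs("./data/simulation", "MYO7A", "chr11").items():
        print(step, path)
```

Changing `gene` or `out_dir` therefore moves all three outputs together, which keeps replicates for different genes in separate directories.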


### Generate genotypes for given families
Here, we use RarePedSim to generate genotypes for families with given affection status, based on the user-specified disease model in the configuration file.

The output is a VCF file, which must be compressed with bgzip and indexed with tabix before the next step.

In [None]:
[simulation_1]
depends: executable('rarepedsim')
output: f'{out_dir}/{gene}/rep1.vcf.gz'
bash: expand = '${ }'
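    # Simulate one replicate for one gene and write the genotypes as VCF
    rarepedsim generate -s ${sfs_file} -c ${conf_file} -p ${ped_file} --num_genes 1 --num_reps 1 -o ${out_dir} --vcf --debug -b -1
    # Compress the uncompressed VCF (the output path without its .gz extension)
    bgzip ${_output:n}
    # Index the compressed VCF
    tabix -p vcf ${_output}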
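
If the next step fails to read the VCF, one thing worth checking is whether the file is really bgzip-compressed and indexed. The Python sketch below tests for the BGZF header (a gzip header with the extra-field flag set and a `BC` subfield identifier) and for a `.tbi` index next to the file, which is the default index name tabix writes. The function names `is_bgzf` and `vcf_ready` are hypothetical helpers, and the check inspects only the first block header.

```python
import os


def is_bgzf(path):
    """Return True if the file starts with a BGZF block header."""
    with open(path, "rb") as handle:
        header = handle.read(14)
    return (
        len(header) == 14
        and header[:4] == b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA flag
        and header[12:14] == b"BC"             # BGZF extra-subfield identifier
    )


def vcf_ready(vcf_gz):
    """Return True if the VCF is BGZF-compressed and has a .tbi index beside it."""
    return is_bgzf(vcf_gz) and os.path.exists(vcf_gz + ".tbi")


if __name__ == "__main__":
    # Path follows the notebook's default parameters.
    path = "./data/simulation/MYO7A/rep1.vcf.gz"
    if os.path.exists(path):
        print("ready for rvnpl collapse:", vcf_ready(path))
```

A file compressed with plain gzip fails this check, which matters because tabix indexing requires bgzip compression.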

### Generate CHP regional marker
The next step is to use RV-NPL to generate CHP regional markers for the genotypes in families.

In [None]:
[simulation_2]
depends: executable('rvnpl')
output: f'{out_dir}/{gene}/rep1/MERLIN/rep1.{chromosome}.ped'
bash: expand = '${ }'
    rvnpl collapse --fam ${ped_file} --vcf ${out_dir}/${gene}/rep1.vcf.gz --output ${out_dir}/${gene}/rep1 --freq EVSMAF -c 0.01 --rvhaplo

### Perform RV-NPL analysis
Finally, we use RV-NPL to analyze the CHP genotypes and assess the significance of rare-variant allele sharing in the families.

In [None]:
[simulation_3]
depends: executable('rvnpl')
output: f'{out_dir}/{gene}/rep1/pvalue.txt'
bash: expand = '${ }'
    rvnpl npl --path ${out_dir}/${gene}/rep1 --output ${out_dir}/${gene}/rep1 --exact --info_only --perfect --sall --rvibd --n_jobs 8 -c 0.001 --rep 2000000 -v 0

### Results

<Add Result> 

The p-values for the two NPL scores are written to the file pvalue.txt in the output directory of the last step ({out_dir}/{gene}/rep1).