Introduction

This tool simulates genome-to-genome (G2G) analyses.

G2G, or genome-to-genome, analysis is a joint analysis of host and pathogen genomes that studies, side by side, the correlation between host and pathogen systematic variation.

Load tool

source("G2G_simulator.R")

OPTION 1: Simulate a full G2G study

As in the simplified G2G (OPTION 2), a study design must be defined in terms of host populations (populations P1, P2, ...) and pathogen strain distribution (strains A, B, ...).

A. Define G2G data structure

Description:

G2G_conf defines the G2G data structure through a composition of SNP, AA and association function calls.

Usage:

G2G_conf(SNP, AA, association, ...)

G2G_conf = G2G_conf(
      association(
        AA(
          size=1,
          stratified = c("A","B"),
          fst_strat = 0.2,
          biased = c("P1","P2"),
          fst_bias = 0.01,
          beta = 0.3,
          bio_tag = "Asso_Stratified_Biased_AA_PG2"),
        SNP(
          size=1,
          stratified = c("P2","P1"),
          biased = c("B","A"),
          fst_strat = 0.2,
          fst_bias = 0.016,
          bio_tag = "Stratified_biased_SNP"),
        replicate = 100),
      association(
        AA(
          size=1,
          beta = 0.3,
          bio_tag = "Asso_Unstratified_AA"),
        SNP(
          size=1,
          bio_tag = "Unstratified_SNP"),
        replicate = 100),
      AA(
        size=100,
        stratified = c("A","B"), 
        biased = c("P1","P2"),
        fst_strat = 0.2,
        fst_bias = 0.005, 
        bio_tag = "Stratified_biased_AA"),
      SNP(
        size=100,
        stratified = c("P1","P2"),
        biased = c("A","B"),
        fst_strat = 0.2,
        fst_bias = 0.016, 
        bio_tag = "Stratified_biased_SNP"),
      AA(
        size=100,
        stratified = c("A","B"), 
        fst_strat = 0.2,
        bio_tag = "Stratified_AA"),
      SNP(
        size=10000,
        stratified = c("P1","P2"),
        fst_strat = 0.2,
        bio_tag = "Stratified_SNP"),
      SNP(
        size = 40000,
        bio_tag = "Unstratified_SNP")))

Arguments:

SNP, fun: SNP function call
- description: defines one or more SNPs. SNPs correspond to the variations on the host side
- Usage: SNP(size, stratified = NA, fst_strat = NA, biased = NA, fst_bias = NA, bio_tag = NA)
- Arguments:
  - size, int: the number of SNPs
  - stratified, vector of strings: host populations; their order gives the direction of the stratification, from higher MAF to lower MAF
  - fst_strat, numeric: the fixation coefficient (Fst) that sets the magnitude of the stratification defined by stratified
  - biased, vector of strings: adds a bias such that the pathogen strains are associated with the host stratification (regardless of the defined populations). The order gives the direction, from higher MAF to lower MAF
  - fst_bias, numeric: the fixation coefficient (Fst) that sets the magnitude of the bias defined by biased
AA, fun: AA function call
- description: defines one or more AAs. AAs (amino acids) correspond to the variations on the pathogen side
- Usage: AA(size, stratified = NA, fst_strat = NA, biased = NA, fst_bias = NA, beta = NA, bio_tag = NA)
- Arguments:
  - size, int: the number of pathogen variants
  - stratified, vector of strings: pathogen strains; their order gives the direction of the stratification, from higher MAF to lower MAF
  - fst_strat, numeric: the fixation coefficient (Fst) that sets the magnitude of the stratification defined by stratified
  - biased, vector of strings: adds a bias such that the host populations are associated with the pathogen stratification (regardless of the defined pathogen strains). The order gives the direction, from higher MAF to lower MAF
  - fst_bias, numeric: the fixation coefficient (Fst) that sets the magnitude of the bias defined by biased
  - beta, numeric: for an association (and therefore inside an association() function call, see below), the log odds ratio
association, fun: association function call
- description: defines an association between one or more SNPs and one or more AAs
- Usage: association(SNP, AA, replicate)
- Arguments:
  - SNP, fun: the outcome of a SNP function call; the number of SNPs defines how many are associated with the AA function call
  - AA, fun: the outcome of an AA function call; the number of AAs defines how many are associated with the SNP function call
  - replicate, int: the number of times such an association is added
..., fun: other AA, SNP or association function calls
bio_tag, string: a tag that will be added to the generated dataset.
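
To make the beta argument more concrete, the sketch below treats beta as the log odds ratio per copy of the minor allele in a logistic model. This is an assumption about how to read the parameter, not the simulator's internal code; the sample size, MAF and intercept are arbitrary illustrative values.

```r
# Illustrative only: beta read as a per-allele log odds ratio.
set.seed(1)
n    <- 100000   # hypothetical number of samples
maf  <- 0.3      # hypothetical minor allele frequency of the host SNP
beta <- 0.3      # log odds ratio, as in the AA(beta = 0.3) example

# Host genotype coded as the number of minor alleles (0, 1 or 2)
snp <- rbinom(n, size = 2, prob = maf)

# Pathogen amino acid present (1) or absent (0), drawn from a logistic model
aa <- rbinom(n, size = 1, prob = plogis(-1 + beta * snp))

# A logistic regression of the AA on the SNP estimates beta
fit        <- glm(aa ~ snp, family = binomial)
beta_hat   <- unname(coef(fit)["snp"])
odds_ratio <- exp(beta)   # odds ratio implied by beta
```

With a large sample, beta_hat lands close to the value used to simulate the data, and exp(beta) converts the log odds ratio back to an odds ratio.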

B. Define the host populations and pathogen strains distributions

Description:

get_study_design defines the host populations and pathogen strains distributions

Usage:

get_study_design(structure)

study_design =  get_study_design(list(
  `P1` = c(`A` = 250, `B` = 250), 
  `P2` = c(`A` = 250, `B`  = 250)))

Arguments:

structure, list of named vectors of named ints: defines the study design with the host populations P1 and P2 and, for each, the number of samples carrying pathogen strains A and B

e.g.: here each host population has the same number of samples (500), of which 250 carry strain A and 250 carry strain B
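
Since the structure argument is an ordinary R named list of named vectors, the design can be inspected with base R before it is passed to get_study_design. A small sketch using the values above (the variable names are illustrative and not part of the tool):

```r
design_list <- list(
  P1 = c(A = 250, B = 250),
  P2 = c(A = 250, B = 250))

# Bind the vectors into a matrix: rows are strains, columns are populations
design <- sapply(design_list, identity)

per_population <- colSums(design)   # samples per host population
per_strain     <- rowSums(design)   # samples per pathogen strain
total          <- sum(design)       # total number of samples
```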

C. Generate G2G data

Description:

get_G2G_data generates the G2G data

Usage:

get_G2G_data(study_design, G2G_conf)

G2G_data = get_G2G_data(
  study_design,
  G2G_conf)

Arguments:

study_design, fun: get_study_design function call

G2G_conf, fun: G2G_conf function call

D. Analyse the G2G data

Description:

analyse_G2G runs the G2G analysis

Usage:

analyse_G2G(G2G_data, correction, nb_cpu = 40)

analyse_G2G(G2G_data,
  get_correction(WO_correction = T, W_host_PC = T, W_pathogen_group = T, W_pathogen_groups_host_PC = T), 
  nb_cpu = 40)

Arguments:

G2G_data, fun: get_G2G_data function call
correction, fun: get_correction function call
- description: defines the series of corrections to assess
- Usage: get_correction(WO_correction = F, W_host_PC = F, W_pathogen_group = F, W_pathogen_groups_host_PC = F)
- Arguments:
  - WO_correction, bool: without correction
  - W_pathogen_group, bool: correct with the pathogen strains
  - W_host_PC, bool: correct with the first 5 PCs of the SNP data (imputed human groups)
  - W_pathogen_groups_host_PC, bool: correct with both the first 5 PCs of the host data and the pathogen strains
nb_cpu, int: number of CPUs to use
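
The W_host_PC correction relies on the first 5 principal components of the host SNP data. A minimal base R sketch of that idea follows. It is not the simulator's implementation: the genotype matrix, allele frequencies and sample sizes are made up for illustration, and using the PCs as covariates in a logistic regression is an assumption.

```r
set.seed(42)
n_per_pop <- 200   # hypothetical samples per host population
n_snp     <- 500   # hypothetical number of SNPs

# Two host populations with different allele frequencies at every SNP
freq_p1 <- runif(n_snp, 0.1, 0.5)
freq_p2 <- runif(n_snp, 0.1, 0.5)
geno <- rbind(
  matrix(rbinom(n_per_pop * n_snp, 2, rep(freq_p1, each = n_per_pop)), n_per_pop),
  matrix(rbinom(n_per_pop * n_snp, 2, rep(freq_p2, each = n_per_pop)), n_per_pop))
population <- rep(c("P1", "P2"), each = n_per_pop)

# First 5 principal components of the host genotypes
pcs <- prcomp(geno, center = TRUE)$x[, 1:5]

# An amino acid variant that depends on the population only
aa <- rbinom(2 * n_per_pop, 1, ifelse(population == "P1", 0.7, 0.3))

# Test the most differentiated SNP without and with the PCs as covariates
snp   <- geno[, which.max(abs(freq_p1 - freq_p2))]
p_raw <- summary(glm(aa ~ snp, family = binomial))$coefficients["snp", 4]
p_adj <- summary(glm(aa ~ snp + pcs, family = binomial))$coefficients["snp", 4]
```

In this setup the first PC separates the two populations, so including the PCs absorbs the population effect that would otherwise be attributed to the SNP.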

See here for the results visualization

OPTION 2: Simulate a single genome-to-genome (G2G) association case

A. Define host's populations and pathogen's strains distribution

study_design = get_study_design(structure = list(
  `P1` = c(`A` = 1500, `B` = 1000), 
  `P2` = c(`A` = 1000, `B`  = 1500)))

Description:

get_study_design defines the host and pathogen structure

Usage:

get_study_design(structure)

Arguments:

structure, list of named vectors of named ints: defines the study design with the host populations P1 and P2 and, for each, the number of samples carrying strains A and B

e.g.: here each host population has the same number of samples (2500), but in P1 1500 samples have strain A and 1000 have strain B, whereas in P2 1000 samples have strain A and 1500 have strain B.
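
Because the strain counts differ between P1 and P2 in this design, host population and pathogen strain are not independent, which is what allows population structure to confound a G2G test. A quick check with base R's chisq.test (the variable names are illustrative):

```r
design <- rbind(
  P1 = c(A = 1500, B = 1000),
  P2 = c(A = 1000, B = 1500))

# Test of independence between host population and pathogen strain
unbalanced_p <- chisq.test(design)$p.value

# A balanced design (250 samples in every cell) shows no such dependence
balanced <- rbind(
  P1 = c(A = 250, B = 250),
  P2 = c(A = 250, B = 250))
balanced_p <- chisq.test(balanced)$p.value
```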

B. Define the correlation structure

G2G_setup = get_G2G_setup(rep = 1000, 
  s_stratified = c("P1","P2"), 
  s_biased = c("A","B"),
  a_stratified = c("A","B"))

Description:

get_G2G_setup specifies the direction of the stratification and of the bias

Usage:

get_G2G_setup(rep, s_stratified = NA, s_biased = NA, a_stratified = NA, a_biased = NA)

Arguments:

rep, int: the number of repetitions to run in order to draw the p-value distribution.

s_stratified, vector of strings: host populations; their order gives the direction of the stratification, from higher MAF to lower MAF

e.g.: here the minor allele frequency (MAF) will be higher in population P1 than in population P2.

s_biased, vector of strings: adds a bias such that the pathogen strains are associated with the host stratification (regardless of the defined host populations). The order gives the direction, from higher MAF to lower MAF

e.g.: here the MAF will be higher for hosts carrying strain A than for hosts carrying strain B. In conclusion, the MAF decreases from a maximum for P1 with strain A (P1.A) to P1.B, P2.A and finally P2.B

Similarly for the variants on the pathogen side:

a_stratified, vector of strings: pathogen strains; their order gives the direction of the stratification, from higher MAF to lower MAF

a_biased, vector of strings: adds a bias such that the host populations are associated with the pathogen stratification (regardless of the defined pathogen strains). The order gives the direction, from higher MAF to lower MAF
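
To make the combined ordering concrete, the sketch below assigns a made-up MAF to each group as a baseline plus a stratification shift and a smaller bias shift. The numbers are purely illustrative and are not derived from the simulator's Fst-based model; they only reproduce the ordering described for s_stratified = c("P1","P2") and s_biased = c("A","B").

```r
groups <- expand.grid(strain = c("A", "B"), population = c("P1", "P2"),
                      stringsAsFactors = FALSE)

base_maf    <- 0.20   # hypothetical baseline MAF
strat_shift <- 0.10   # hypothetical extra MAF in the first population (P1)
bias_shift  <- 0.04   # hypothetical extra MAF with the first strain (A)

groups$maf <- base_maf +
  strat_shift * (groups$population == "P1") +
  bias_shift  * (groups$strain == "A")

# Sort the four groups from higher MAF to lower MAF
sorted_groups <- groups[order(groups$maf, decreasing = TRUE), ]
ordering <- paste(sorted_groups$population, sorted_groups$strain, sep = ".")
```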

C. Run simplified G2G

test_G2G_setup(study_design, G2G_setup, 
  fst_host_strat = 0.2, 
  fst_host_bias = 0.2, 
  fst_pathogen_strat = 0.2,
  tag = 'demo')

Description:

test_G2G_setup runs the simplified G2G

Usage:

test_G2G_setup(study_design, G2G_setup, fst_host_strat = NA, fst_host_bias = NA, fst_pathogen_strat = NA, fst_pathogen_bias=NA, tag = 'unnamed')

Arguments:

study_design, fun: get_study_design function call

G2G_setup, fun: get_G2G_setup function call

fst_host_strat, numeric: the fixation coefficient (Fst) that sets the magnitude of the stratification defined by s_stratified

fst_host_bias, numeric: the fixation coefficient (Fst) that sets the magnitude of the bias defined by s_biased

fst_pathogen_strat, numeric: the fixation coefficient (Fst) that sets the magnitude of the stratification defined by a_stratified

fst_pathogen_bias, numeric: the fixation coefficient (Fst) that sets the magnitude of the bias defined by a_biased

tag, string: name of the folder in which the results are saved

The results are automatically plotted in the tag folder.
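
The fst_* arguments control how far allele frequencies drift apart between groups. One common way to turn an Fst value into group-specific allele frequencies is the Balding-Nichols model, sketched below. Whether the simulator uses this exact model is not stated here, so treat it as an assumption; the ancestral frequency and the number of SNPs are arbitrary.

```r
set.seed(7)
fst       <- 0.2     # same magnitude as fst_host_strat in the call above
ancestral <- 0.3     # hypothetical ancestral allele frequency
n_snp     <- 10000   # hypothetical number of SNPs

# Balding-Nichols: each group's frequency is Beta distributed around the
# ancestral frequency p, with variance fst * p * (1 - p)
shape1  <- ancestral * (1 - fst) / fst
shape2  <- (1 - ancestral) * (1 - fst) / fst
freq_p1 <- rbeta(n_snp, shape1, shape2)
freq_p2 <- rbeta(n_snp, shape1, shape2)

# Empirical check of the variance implied by fst
expected_var <- fst * ancestral * (1 - ancestral)
observed_var <- var(freq_p1)
```

A larger Fst gives a wider Beta distribution, so the two groups end up with more different allele frequencies at a given SNP.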

OPTION 3: Simulate Case-Control GWAS

Define populations

my_population = generate_population_for_GWAS(list(
  `P1` = c(`case` = 200, `control` = 400), 
  `P2` = c(`case` = 400, `control`  = 200)))

Here we want two sub-populations P1 and P2.

From P1, 200 individuals are in case group and 400 in control group.
From P2, 400 individuals are in case group and 200 in control group.

Define genotyping data & run analysis

GWAS_result = GWAS_scenario(populations = my_population, 
 neutral = 100000, 
 neutral_S_rate = 0.05, 
 causal_NS = seq(1,2, by = 0.05), 
 causal_S = seq(1,2, by = 0.05), 
 fst_strat = 0.2)

Here we want 100,042 SNPs, neutral or causal, stratified or not:

The number of neutral SNPs is 100,000
Of these, 5% will be stratified
21 non-stratified causal SNPs will be added, with R coefficients from 1 to 2 in steps of 0.05
21 stratified causal SNPs will be added, with R coefficients from 1 to 2 in steps of 0.05
The fixation coefficient that sets the stratification strength is 0.2
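
The SNP counts follow directly from the arguments, so they can be checked in base R before running the scenario. The sketch assumes one causal SNP per element of causal_NS and causal_S, which is how the description reads; note that seq(1, 2, by = 0.05) includes both endpoints.

```r
neutral        <- 100000
neutral_S_rate <- 0.05
causal_NS      <- seq(1, 2, by = 0.05)   # non-stratified causal SNPs
causal_S       <- seq(1, 2, by = 0.05)   # stratified causal SNPs

n_neutral_stratified <- neutral * neutral_S_rate
n_causal             <- length(causal_NS) + length(causal_S)
n_total              <- neutral + n_causal
```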

Plot results

Plot the results under 3 different conditions:

Without correction
With human groups
With the first 5 PCs
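
Why the correction matters can be seen on a single SNP that has no effect on the phenotype. In the sketch below (base R only, not the simulator's code), the SNP is more frequent in P2, and P2 also contains more cases, as in the populations defined above. The allele frequencies are made up for illustration.

```r
set.seed(3)
population <- rep(c("P1", "P2"), each = 600)
status <- c(rep(c(1, 0), times = c(200, 400)),   # P1: 200 cases, 400 controls
            rep(c(1, 0), times = c(400, 200)))   # P2: 400 cases, 200 controls

# Null SNP: its allele frequency depends on the population only
maf <- ifelse(population == "P1", 0.05, 0.5)     # hypothetical frequencies
snp <- rbinom(length(population), 2, maf)

# p-value of the SNP term in a logistic regression
p_value <- function(fit) summary(fit)$coefficients["snp", "Pr(>|z|)"]

p_uncorrected <- p_value(glm(status ~ snp, family = binomial))
p_with_groups <- p_value(glm(status ~ snp + population, family = binomial))
```

Without correction the SNP looks associated with case status purely through population membership; adding the population as a covariate removes that spurious signal.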

On Manhattan plots

plot_GWAS_manhattan(GWAS_result)

On QQ plots

plot_GWAS_QQ(GWAS_result)
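
A QQ plot compares the observed p-values with those expected under the null hypothesis. As a sketch of what such a plot encodes, independent of plot_GWAS_QQ (whose output format is not described here), the following base R code builds the two axes and the genomic inflation factor from a vector of p-values. The simulated uniform p-values are a placeholder for real results.

```r
set.seed(11)
p_values <- runif(100000)   # placeholder: p-values under the null

n        <- length(p_values)
observed <- -log10(sort(p_values))
expected <- -log10(ppoints(n))   # expected quantiles of uniform p-values

# Genomic inflation factor: median observed chi-square over its null median
chisq  <- qchisq(p_values, df = 1, lower.tail = FALSE)
lambda <- median(chisq) / qchisq(0.5, df = 1)
```

Calling plot(expected, observed) followed by abline(0, 1) then draws the QQ plot. A lambda close to 1 indicates no inflation, while values well above 1 suggest uncorrected stratification.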
