This simulator can be used for generating admixed genomes. As input the method takes phased data from two populations and generates admixed individuals for a given time of admixture and proportion of ancestry from each population. Method assumes instantaneous admixture but to generate data for multiple pulses or continuous admixture, one can run the same code in a loop. Details of the method can be found in Moorjani et al. 2011.
./simulation.py -p <parfilename>
This program requires two sets of phased individuals in the EIGENSTRAT format (See https://reich.hms.harvard.edu/software/InputFileFormats). The input phased geno, snp, and ind files must be consistent (same number of SNPs in .snp and .geno file, and there should be two rows for each individual which is correctly indentified in the ind file). For an example, see
The program also requires a parameter file. See format below.
Parameter file arguments:
ancestorAgeno: input1.phgeno #file for the first ancestral population ancestorAsnp: input1.phsnp #file for the first ancestral population ancestorAind: input1.phind #file for the first ancestral population ancestorBgeno: input2.phgeno #file for the second ancestral population ancestorBsnp: input2.phsnp #file for the second ancestral population ancestorBind: input2.phind #file for the second ancestral population outputfilename: output #The filename that outputs will be written to. lambda: 10 #The time of admixture in generations. Any number greater than 0. theta: 0.2 #The admixture ratio - theta should be a float between 0 and 1, and indicates the proportion of ancestorA in the produced individuals. n: 100 #The number of individuals. Thus, if the output is haploid, the simulator will create n haploid individuals, otherwise the simulator will create n diploid individuals. haploidoutput: False #If 'False', output diploid genotypes, otherwise output phased individuals. trackancestry: False #If 'False', do not track ancestry, otherwise output information to <output>.ancestry. Each line of the output file will have the ancestry at a SNP - individuals separated by hyphens ('-').
By default, the simulator outputs diploid data (
<output>.geno, <output>.ind, <output>.snp), but for phased admixed chromosomes (
<output>.phgeno, <output>.phind, <output>.phsnp), use
haploidoutput:True. Additionally, you can output the true ancestry at each position (
trackancestry:True. The output files will be in EIGENSTRAT format.
The simulated individuals are labeled:
NA<number>, and have gender as
U (unknown), and the population name of the outputted individuals is set as "Simulation". This nomenclature can easily be changed by updating the
The number of simulated individuals will be lower than the number of individuals in the smaller of the two parental pools, as we sample ancestral haplotypes without replacement. Note also that we guarantee that every crossover event includes a derangement - two individuals cannot copy from the sample haplotye.
Look in the directory
example/ for full details. The command will be:
./simulation.py -p parfiles/simulation.par
Simulating missing data
This utility allows one to set a number of genotypes (based on the user defined proportion) as missing data. This is useful for testing the impact of missing data on the inference (particularly helpful in ancient DNA studies). This is implemented by checking the genotype at every SNP for each individual, and choosing to replace the position with 9 with given probability based on the input missing fraction (-r). The analysis can be performed in two modes:
Diploid mode:output is left as diploid for non-missing data;
Pseudo-haploid mode:output is haploid or missing. This aims to mimic ancient DNA where its hard to call heterozygous sites with limited coverage. Hence typically one randomly samples a read mapping to that site and uses a pseudo-haploid call for the inference. Similarly, in our simulation we randomly sample one allele at each heterozygous site.
./missingdata.py -r <threshold> -f <input> -o <output> -a -s <seed>
Inputs must be in EIGENSTRAT format, PACKEDANCESTRYMAP is not currently supported (See https://reich.hms.harvard.edu/software/InputFileFormats).
-r <float> : Designates the ratio of data to remove. With probability r, each character in the .geno file will be converted to a '9' -f <filename> : Name of the geno file to read and convert. -o <outfilename> XOR -i : The name outfilename designates the file to which to write the new output. -i indicates that the modification is meant to be done in-place - identical to having the output filename be the same as the input file name. You may only use one of these two options.
-s <seed> : Random seed for generation. -a : Indicates ancient data conversion. For characters that aren't chosen to be removed by the main part of the utility, if the character is a '1', indicating a heterozygous site, the utility will replace the character with either a '0' or a '2', with even probability of the outcomes.
The output file as indicated by the arguments will contain the characters from the input file, with each character being replaced with a '9' with probability r, and with '1's being replaced by a '0' or '2' with probability 0.5 each.
In the directory example/, run:
./missingdata.py -r 0.2 -f family.geno -o family_missingdata.geno -a -s 42
This will use data in family.geno (output from simulator/realdata) and output a new pseudo-haploid geno file with missing-data set based on the proportion -r.
Requirements: This utility only uses standard python modules. Developed in Python 3.5.2.
Naive PCA-based estimator for admixture proportions
This utility takes the eigenstrat output file from smartpca (https://github.com/DReichLab/EIG/tree/master/EIGENSTRAT) and performs a naive analysis to infer the admixture proportion in simulated individuals. Assuming a two-way admixture, we can perform a PCA between the admixed individuals and the two ancestral populations and then use the projection on PC 1 and 2 to infer the admixture proportion. Assume $m_1, m_2, m_3$ is the mean position on PC1 for ancestral pop1, ancestral pop2 and the admixed pop respectively. We estimate $theta$ = abs(m_1-m_3)/abs(m_1-m_2). We recognize that this approximation does not hold under complex mixture or when true ancestrals are not available. Though this serves as a handy tool to infer admixture proportions in simulations where the truth is known. We estimate standard errors by using a simple empirical variance calculation.
./pc_analysis.py -f <input evec file> -p <population list>
The input must be the output style of smartpca (*.evec). The first column will be the individual IDs, the second will be the projection of the individual onto the first PC, the third will be the projection of the individual onto the second PC, and the final will be the population title. The first line will contain the eigenvalues (for display purposes only).
-f <filename> : This designates the input file to the program, which is itself the output from a smartpca run on the individuals. -p <poplist> : This indicates the populations to read out. The format for this input is "<Admixed population>;<Reference population 1>,<Reference population 2>".
This utility will output to the standard output - with information on the population means and standard errors on the first two principal components, as well as the estimation of admixture ratio based on positions in the first two principal components.
./pc_analysis.py -f "family.evec" -p "Simulation;French,Yoruba"
If you want to store the outputs:
./pc_analysis.py -f "family.evec" -p "Simulation;French,Yoruba" > pc_analysis.log
-o <outfilename> : This will make the program try to use matplotlib to plot the individuals on the first two PCs. This is currently experimental, as whether or not this will work depends on the backend that your system's matplotlib uses.
The package requires Python 3.5.2 and standard packages. For plotting, matplotlib is also required.