nIndividuals=40
nReplicates=100
nSNPs=2000
rho=300
length=100000
window_width=0.01
window_step=0.005
# simulated 100 replicates of a sweep
msms $nIndividuals $nReplicates -s $nSNPs -r $rho $length -SAA 200 -SaA 100 -SF 1e-2 -N 100000 -Sp 0.5 >output.ms
# calculated statistics along chromosomes
python3 ../scripts/msmscalc_onePop.py infile=output.ms nIndiv=$nIndividuals nCombParam=1 regionSize=$length width=$window_width step=$window_step nRep=$nReplicates root=outputStats
Here, a sweep was simulated 100 times in the middle of a chromosome of 100-kb, sequenced in 40 sampled gametes.

With two simulated datasets using the Guillaume Lan-Fong's SLiM pipeline producing the following files:
- 101_neutral_positions.txt
- 101_neutral_sumStats.txt
- 100_neutral_positions.txt
- 100_neutral_sumStats.txt
- 101_neutral.ms
- 101_neutral_recap_mut.trees
- 101_neutral_recap.trees
- 100_neutral.ms
- 100_neutral_recap_mut.trees
- 100_neutral_recap.trees
- 101_neutral_parameters.txt
- 101_neutral.trees
- 100_neutral_parameters.txt
- 100_neutral.trees
- 101_sweep_positions.txt
- 101_sweep_sumStats.txt
- 100_sweep_positions.txt
- 100_sweep_sumStats.txt
- 101_sweep.ms
- 101_sweep_recap_mut.trees
- 101_sweep_recap.trees
- 100_sweep.ms
- 100_sweep_recap_mut.trees
- 100_sweep_recap.trees
- 100_sweep_parameters.txt
- 100_sweep.trees
- 101_sweep_parameters.txt
- 101_sweep.trees
In the example subdirectory:
cd example
for iteration in 100 101; do
for model in neutral sweep; do
echo ${iteration}_${model}
python3 ../scripts/msmscalc_onePop.py infile=${iteration}_${model}.ms nIndiv=40 nCombParam=1 regionSize=100000 width=0.01 step=0.005 nRep=1 root=${iteration}_${model}
done
done
nIndiv: number of sampled gametes within the simulated population.
nCombParam: number of different combination of parameters with a single ms output file.
nRep: number of replicates for each of the different combination of parameters found within a single ms output file.
regionSize: size in bp of the simulated chromosome.
width: size of the sliding window in bp divided by regionSize. i.e, if width=0.01 for regionSize=100000, then the actual window's width is 1kb.
step: step of the sliding window. Same unit as for width. i.e, if step=0.005, then the window slides with a step of 500bp.
root: root name used to write the different output files.
In the example subdirectory:
# go through all simulated datasets a first time, in order to get the range of variation for each stat : getAllData=1
python3 ../scripts/sim2box_single_YOLOv5.py dpi=300 datapath=$PWD simulation=100 object=posSelection theta=0 phasing=1 plotStats=0 getAllData=1 modelSim=both target_res=512 &
# can then be run independently for all simulated datasets: getAllData=0
for iteration in $(ls *.ms | cut -d "." -f1 | sed "s/_neutral//g" | sed "s/_sweep//g" | sort | uniq); do
python3 ../scripts/sim2box_single_YOLOv5.py dpi=300 datapath=$PWD simulation=${iteration} object=posSelection theta=0 phasing=1 plotStats=0 getAllData=0 modelSim=both target_res=512 &
done
phasing: to specify whether the data where phased or not. For phased data (phasing=1): statistics relative to LD are computed (nHaplotypes, H1, H2, H12, H2 over H1, D, r2).
plotStats: for only plotting jpg files used for machine learning (and not one additional plot per individual statistics) set plotStats to 0. If you want multiple jpg files: plotStats=1.
datapath: datapath of a directory with all neutral and sweep pairs of simulations.
dpi: resolution of the jpg files.
theta: specify the way we define the bounding box. If theta=1, then the bounding box is delimited by 4.N.mu. If theta=0, then the bounding box is delimited by the average pi calculated from the neutral simulation.
modelSim: specify whether the simulated models are sweep (only simulations with selective sweep), neutral (only simulations without sweeps) or both (with both sweep.ms and neutral.ms files in the directory)
Get a list of range_stats.txt files located in different subdirectories, from the main directory:
find . -name "range_stats.txt" > list_ranges.txt
Then execute global_ranges.py:
python3 global_ranges.py list_ranges.txt
You can replace the range_stats.txt files located in different subdirectories by the 'consensus' file produced by global_ranges.py.
- 100_neutral_rawData.txt
- 100_neutral_rawData.jpg
- 100_sweep_rawData.txt
- 100_sweep_rawData.jpg
- 101_neutral_rawData.txt
- 101_neutral_rawData.jpg
- 101_sweep_rawData.txt
- 101_sweep_rawData.jpg
- 100_neutral_globalPic.txt
- 100_neutral_globalPic.jpg
- 100_sweep_globalPic.txt
- 100_sweep_globalPic.jpg
- 101_neutral_globalPic.txt
- 101_neutral_globalPic.jpg
- 101_sweep_globalPic.txt
- 101_sweep_globalPic.jpg
jpg files correspond to simulated genomes.
txt files are associated to jpg files and correspond to coordinates of the box for YOLOv5 : object x_center y_center width height.
summary statistics along windows.
Eeach column is a window, and each row is a statistics with, from the bottom to the top:
- pi (avg)
- pi (std)
- theta (avg)
- Tajima's D
- Achaz's Y
- Pearson's r for pi
- Pearson's r pval for pi
- Number of haplotypes
- Haplotype's homozygosity (H1)
- Haplotype's homozygosity without the most common haplotype (H2)
- Haplotype's homozygosity considering the two most common haplotype as the same one (H12)
- H2 over H1
- Linkage disequilibrium measured by D
- LD measured by r2
binpath=/home/croux/Programmes/deepDILS/scripts/dev
# rootname of the output file
outfile=exp_123
# number of sequenced gametes
n=20
# chromosome length
L=100000
# number of polymorphic sites
S=2000
# STEP BY STEP
## simulations
python3 ${binpath}/submit_msms.py -n ${n} -S ${S} -r 1e-8 -L ${L} -Sp 20000 -s 0.001 -fA 0.01 -Nc 100000 -Na 10000 -m 0 -Ts 100000 -Td 25000 -Tsel 20000 -output ${outfile}.msms
## get trajectory of the advantageous allele
python3 ${binpath}/get_msms_trajectory.py --infile ${outfile}.msms --outfile ${outfile}.trajectory
## compute summary statistics
python3 ${binpath}/msmscalc_onePop_sort_haplo.py --infile ${outfile}.msms --nIndiv ${n} --nCombParam 1 --regionSize ${L} --width 0.1 --step 0.05 --nRep 1 --root CST_0
# THE WHOLE PIPELINE IN ONCE
python3 ${binpath}/launch_pipeline_msms.py --outfile ${outfile} --model EXP -n ${n} -S ${S} -L ${L} -r 2e-6 -fA 0.01
