## Network simulation details

This notebook contains the code for simulating gene trees from a number of networks using the software `ms`. For each network, we generated 30 replicates for each of 30, 100, 300, 1000, and 3000 gene trees for use with the software `TICR` and `MSCquartets`. For the software `HyDe`, we generated 30 replicates for each of 30, 100, 300, 1000, 3000, and 10000 gene trees. HyDe also generates site pattern frequencies, which enable calculation of Patterson's D-Statistic, D3, and Dp.

### Networks following Kong and Kubatko 2021 

To replicate and verify the results of HyDe and the D-Statistic, and to conform to networks that previous papers have used to compare other reticulation detection methods (ADMIXTURE, STRUCTURE), we use four smaller networks from "Comparative Performance of Popular Methods for Hybrid Detection using Genomic Data" by Kong and Kubatko. Our method of simulating gene trees with `ms` differs from theirs: they sample from networks that were decomposed into trees, whereas we sample directly from the network, using the split (`-es`) and rejoin (`-ej`) options to depict hybridization events in `ms`. We also simulate only one individual per taxon.
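
The split-and-rejoin encoding of a single hybridization can be sketched as follows. This is a minimal illustration, not the code used to build the commands in this notebook. It assumes that Newick branch lengths are in coalescent units, while `ms` measures time in units of 4N generations, so node heights are halved; this matches network 1a below, where the hybrid node at height 0.75 becomes `-es 0.375`. The function name `hybrid_flags` and its argument order are hypothetical.

```bash
# Hypothetical helper: print the ms flags that encode one hybridization
# as a split (-es) followed by a rejoin (-ej).
#   $1 label of the hybrid population
#   $2 height of the hybrid node, in coalescent units
#   $3 height at which the new population rejoins its parent, in coalescent units
#   $4 gamma, the probability that a lineage moves to the new population
#   $5 label ms assigns to the new population (current number of populations + 1)
#   $6 population that the new population is merged into
hybrid_flags() {
    awk -v pop="$1" -v h_split="$2" -v h_join="$3" -v gamma="$4" \
        -v newpop="$5" -v parent="$6" 'BEGIN {
        # Assumption: coalescent units are halved to get ms time units.
        # "-es t i p" keeps a lineage in population i with probability p,
        # so p = 1 - gamma.
        printf "-es %g %d %g -ej %g %d %d\n",
               h_split / 2, pop, 1 - gamma, h_join / 2, newpop, parent
    }'
}

# Network 1a: taxon 2 is the hybrid (node height 0.75); the new population 5
# rejoins population 1 at height 1.5.
hybrid_flags 2 0.75 1.5 0.5 5 1
```

The remaining `-ej` flags of a full command describe ordinary speciation events and are not produced by this helper.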

The networks that we simulate have the following Newick formats, and correspond to Fig 1a, 1b, 1e, and 1f of the original Kong and Kubatko paper. [Kong, Kubatko 2021](https://academic.oup.com/sysbio/article/70/5/891/6066192)

- 1a depicts a single hybrid speciation event with 4 total taxa. This has extended newick format and accompanying `ms` command:

    - (4:8.0,((1:1.5,#H1:0.75::0.5):1.5,(3:1.5,(2:0.75)#H1:0.75::0.5):1.5):5.0);
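    - This network is tested with varying mixing parameters gamma (0, 0.1, 0.2, 0.3, 0.4, 0.5), giving the general network shown next. The `ms` commands that follow are listed from gamma = 0.5 down to gamma = 0, because the `-es` probability is the chance that a lineage stays in population 2, that is, 1-gamma.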
    - (4:8.0,((1:1.5,#H1:0.75::gamma):1.5,(3:1.5,(2:0.75)#H1:0.75::1-gamma):1.5):5.0);

    - ms 4 ${gt} -T -I 4 1 1 1 1 -es 0.375 2 0.5 -ej 0.75 5 1 -ej 0.75 3 2 -ej 1.5 2 1 -ej 4.0 4 1

    - ms 4 ${gt} -T -I 4 1 1 1 1 -es 0.375 2 0.6 -ej 0.75 5 1 -ej 0.75 3 2 -ej 1.5 2 1 -ej 4.0 4 1

    - ms 4 ${gt} -T -I 4 1 1 1 1 -es 0.375 2 0.7 -ej 0.75 5 1 -ej 0.75 3 2 -ej 1.5 2 1 -ej 4.0 4 1

    - ms 4 ${gt} -T -I 4 1 1 1 1 -es 0.375 2 0.8 -ej 0.75 5 1 -ej 0.75 3 2 -ej 1.5 2 1 -ej 4.0 4 1

    - ms 4 ${gt} -T -I 4 1 1 1 1 -es 0.375 2 0.9 -ej 0.75 5 1 -ej 0.75 3 2 -ej 1.5 2 1 -ej 4.0 4 1
    
    - ms 4 ${gt} -T -I 4 1 1 1 1 -es 0.375 2 1.0 -ej 0.75 5 1 -ej 0.75 3 2 -ej 1.5 2 1 -ej 4.0 4 1
       

- 1b depicts an introgression event with 4 total taxa. This has extended newick format and accompanying `ms` command: 

    - (4:8.0,((3:1.5,#H1:0::0.5):1.5,(1:1.5,(2:1.5)#H1:0::0.5):1.5):5.0);
    
    - ms 4 ${gt} -T -I 4 1 1 1 1 -es 0.375 2 0.5 -ej 0.375 5 3 -ej 0.75 1 2 -ej 1.5 3 2 -ej 4.0 4 2

- 1e depicts 3 hybridization events with 8 total taxa. This has extended newick format and accompanying `ms` command:

    - (8:11.0,((((1:1.5,#H1:0.75::0.5):1.5,(3:1.5,(2:0.75)#H1:0.75::0.5):1.5):1.5,(4:3.75)#H3:0.75::0.5):1.5,(((5:1.5,#H2:0.75::0.5):1.5,(7:1.5,(6:0.75)#H2:0.75::0.5):1.5):1.5,#H3:0.75::0.5):1.5):5.0);

    - ms 8 ${gt} -T -I 8 1 1 1 1 1 1 1 1 -es 0.375 2 0.5 -ej 0.75 9 3 -ej 0.75 2 1 -es 0.375 6 0.5 -ej 0.75 6 5 -ej 0.75 10 7 -es 1.875 4 0.5 -ej 2.25 4 1 -ej 2.25 11 5 -ej 1.5 7 5 -ej 1.5 3 1 -ej 3.0 5 1 -ej 5.5 8 1
        

- 1f depicts 2 overlapping hybridization events with 5 total taxa. This has extended newick format and accompanying `ms` command: 

    - (5:9.5,(((1:1.5,#H1:0.75::0.5):1.5,((3:1.5,(2:0.75)#H1:0.75::0.5):0.75)#H2:0.75::0.5):1.5,(4:3.0,#H2:0.75::0.5):1.5):5.0);

    - ms 5 ${gt} -T -I 5 1 1 1 1 1 -es 0.375 2 0.5 -ej 0.75 6 3 -ej 0.75 2 1 -es 1.125 3 0.5 -ej 1.5 3 1 -ej 1.5 7 4 -ej 2.25 4 1 -ej 4.75 5 1
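
The six commands listed for network 1a differ only in the `-es` probability. A minimal sketch of generating them in a loop is given below, assuming that this probability equals 1-gamma (the chance that a lineage stays in population 2). The function name `ms_1a_command` and the value `gt=100` are illustrative and do not come from the original notebook. The commands are only printed, so `ms` does not need to be installed.

```bash
# Illustrative number of gene trees (an assumption for this sketch).
gt=100

# Print the ms command for network 1a at a given mixing parameter gamma.
ms_1a_command() {
    local gamma=$1 ngt=$2 p
    # -es takes the probability of staying in population 2, i.e. 1 - gamma.
    p=$(awk -v g="$gamma" 'BEGIN { printf "%.1f", 1 - g }')
    echo "ms 4 ${ngt} -T -I 4 1 1 1 1 -es 0.375 2 ${p} -ej 0.75 5 1 -ej 0.75 3 2 -ej 1.5 2 1 -ej 4.0 4 1"
}

# Same order as the listing for network 1a: gamma = 0.5 first, gamma = 0 last.
for gamma in 0.5 0.4 0.3 0.2 0.1 0; do
    ms_1a_command "$gamma" "$gt"
done
```

To run a simulation rather than print it, the output of `ms_1a_command` could be passed to `eval`, or the `echo` replaced with the command itself.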
    

### Networks following Solis-Lemus and Ane (2016)

From [PhyloNetworks](https://academic.oup.com/mbe/article/34/12/3292/4103410), we replicated a network with 10 taxa and 2 hybridizations (n10h2) and a network with 15 taxa and 3 hybridizations (n15h3).


- n10h2, which contains two (2) reticulation events, has the extended newick format:

    - (10:9.6,(#H1:2.9::0.3,(1:7.2,(2:6.0,(((9:0.4)#H2:5.0::0.8,(3:4.4,(4:3.5,((5:0.2,6:0.2):2.1,(7:1.4,(8:0.4,#H2:0.0::0.2):1.0):0.9):1.2):0.9):1.0):0.1)#H1:0.5::0.7):1.2):1.2):1.2);

    - this is generated with the `ms` command:

    - ms 10 ${gt} -T -I 10 1 1 1 1 1 1 1 1 1 1 -ej 0.1 6 5 -es 0.2 9 0.8 -ej 0.2 11 8 -ej 0.7 8 7 -ej 1.15 7 5 -ej 1.75 5 4 -ej 2.2 4 3 -ej 2.7 9 3 -es 2.75 3 0.7 -ej 3.0 3 2 -ej 3.6 2 1 -ej 4.2 12 1 -ej 4.8 10 1



- n15h3, which contains three (3) reticulation events, has the extended newick format:

    - (15:11.0,(1:10.0,((14:8.0,(((7:2.8,((10:0.6)#H3:1.0::0.8,(9:0.4,8:0.4):1.2):1.2):0.8,((11:1.6,#H3:1.0::0.2):1.2,(13:0.4,12:0.4):2.4):0.8):3.4, #H1:0.4::0.3):1.0):1.2, ((((2:0.4,3:0.4):1.4)#H2:3.8::0.8,(((4:2.8,#H2:1.0::0.2):0.8,5:3.6):1.2,6:4.8):0.8 ):1.0)#H1:2.6::0.7):0.8):1.0);

    - this is generated with the `ms` command:

    - ms 15 ${gt} -T -I 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -ej 0.2 9 8 -ej 0.2 3 2 -ej 0.2 13 12 -es 0.3 10 0.8 -ej 0.8 16 11 -ej 0.8 10 8 -es 0.9 2 0.8 -ej 1.4 17 4 -ej 1.8 5 4 -ej 2.4 6 4 -ej 2.8 4 2 -ej 1.4 12 11 -ej 1.4 8 7 -ej 1.8 11 7 -es 3.3 2 0.7 -ej 3.5 18 7 -ej 4.0 14 7 -ej 4.6 7 2 -ej 5.0 2 1 -ej 5.5 15 1

### Additional Networks

From [PhyloNetworks](https://academic.oup.com/mbe/article/34/12/3292/4103410), networks n10h2 and n15h3 were decomposed into networks that each contain only one hybridization. These are named according to the depth of the hybridization (how recent it is).

- n10h1_shallow has the extended newick and accompanying ms format:

    - (10:9.6,(1:7.2,(2:6.0,((9:0.4)#H2:5.0::0.8,(3:4.4,(4:3.5,((5:0.2,6:0.2):2.1,(7:1.4,(8:0.4,#H2:0.0::0.2):1.0):0.9):1.2):0.9):1.0):0.6):1.2):2.4);

    - ms 10 ${gt} -T -I 10 1 1 1 1 1 1 1 1 1 1 -ej 0.1 6 5 -es 0.2 9 0.8 -ej 0.2 11 8 -ej 0.7 8 7 -ej 1.15 7 5 -ej 1.75 5 4 -ej 2.2 4 3 -ej 2.7 9 3 -ej 3.0 3 2 -ej 3.6 2 1 -ej 4.8 10 1

- n10h1_deep has the extended newick and accompanying ms format:

    - (10:9.6,(#H1:2.9::0.3,(1:7.2,(2:6.0,((9:5.4,(3:4.4,(4:3.5,((5:0.2,6:0.2):2.1,(7:1.4,8:1.4):0.9):1.2):0.9):1.0):0.1)#H1:0.5::0.7):1.2):1.2):1.2);

    - ms 10 ${gt} -T -I 10 1 1 1 1 1 1 1 1 1 1 -ej 0.1 6 5 -ej 0.7 8 7 -ej 1.15 7 5 -ej 1.75 5 4 -ej 2.2 4 3 -ej 2.7 9 3 -es 2.75 3 0.7 -ej 3.0 3 2 -ej 3.6 2 1 -ej 4.2 11 1 -ej 4.8 10 1

- n15h1_shallow has the extended newick and accompanying ms format:

    - (15:11.0,(1:10.0,((14:8.0,((7:2.8,((10:0.6)#H3:1.0::0.8,(9:0.4,8:0.4):1.2):1.2):0.8,((11:1.6,#H3:1.0::0.2):1.2,(13:0.4,12:0.4):2.4):0.8):4.4):1.2,((2:0.4,3:0.4):5.2,((4:3.6,5:3.6):1.2,6:4.8):0.8):3.6):0.8):1.0);

    - ms 15 ${gt} -T -I 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -ej 0.2 9 8 -ej 0.2 3 2 -ej 0.2 13 12 -es 0.3 10 0.8 -ej 1.8 5 4 -ej 2.4 6 4 -ej 2.8 4 2 -ej 0.8 10 8 -ej 0.8 16 11 -ej 1.4 12 11 -ej 1.4 8 7 -ej 1.8 11 7 -ej 4.0 14 7 -ej 4.6 7 2 -ej 5.0 2 1 -ej 5.5 15 1

- n15h1_intermediate has the extended newick and accompanying ms format:

    - (15:11.0,(1:10.0,((14:8.0,((7:2.8,(10:1.6,(9:0.4,8:0.4):1.2):1.2):0.8,(11:2.8,(13:0.4,12:0.4):2.4):0.8):4.4):1.2,(((2:0.4,3:0.4):1.4)#H2:3.8::0.8,(((4:2.8,#H2:1.0::0.2):0.8,5:3.6):1.2,6:4.8):0.8):3.6):0.8):1.0);

    - ms 15 ${gt} -T -I 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -ej 0.2 9 8 -ej 0.2 3 2 -ej 0.2 13 12 -ej 0.8 10 8 -es 0.9 2 0.8 -ej 1.4 16 4 -ej 1.4 12 11 -ej 1.4 8 7 -ej 1.8 11 7 -ej 1.8 5 4 -ej 2.4 6 4 -ej 2.8 4 2 -ej 4.0 14 7 -ej 4.6 7 2 -ej 5.0 2 1 -ej 5.5 15 1

- n15h1_deep has the extended newick and accompanying ms format:

    - (15:11.0,(1:10.0,((14:8.0,(((7:2.8,(10:1.6,(9:0.4,8:0.4):1.2):1.2):0.8,(11:2.8,(13:0.4,12:0.4):2.4):0.8):3.4, #H1:0.4::0.3):1.0):1.2,(((2:0.4,3:0.4):5.2,((4:3.6,5:3.6):1.2,6:4.8):0.8):1.0)#H1:2.6::0.7):0.8):1.0);

    - ms 15 ${gt} -T -I 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -ej 0.2 9 8 -ej 0.2 3 2 -ej 0.2 13 12 -ej 1.8 5 4 -ej 2.4 6 4 -ej 2.8 4 2 -ej 0.8 10 8 -ej 1.4 12 11 -ej 1.4 8 7 -ej 1.8 11 7 -es 3.3 2 0.7 -ej 3.5 16 7 -ej 4.0 14 7 -ej 4.6 7 2 -ej 5.0 2 1 -ej 5.5 15 1

The largest simulated network that we tested was n25h5.

- n25h5, which contains five (5) reticulation events, has the extended newick format:

    - (25:15.0,((23:13.0,(((5:10.0,((((7:1.0,3:1.0):6.0,(((10:4.0,(16:3.0,((6:1.0,12:1.0):1.0,#H26:1.0::0.369):1.0):1.0):1.0,(8:2.0,(11:1.0)#H26:1.0::0.631):3.0):1.0,1:6.0):1.0):1.0)#H30:1.0::0.656,((((22:3.0,(20:2.0,#H28:1.0::0.449):1.0):1.0,18:4.0):1.0,13:5.0):1.0,(((9:2.0,(15:1.0)#H32:1.0::0.605):1.0,((21:1.0,17:1.0):1.0,#H32:1.0::0.395):1.0):1.0,(2:1.0)#H28:3.0::0.551):2.0):3.0):1.0):1.0,(((19:1.0,14:1.0):1.0,4:2.0):7.0,#H30:1.0::0.344):2.0):1.0)#H34:1.0::0.976):1.0,(24:13.0,#H34:1.0::0.024):1.0):1.0);

    - ms 25 ${gt} -T -I 25 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -ej 0.5 12 6 -ej 0.5 7 3 -ej 0.5 21 17 -ej 0.5 19 14 -es 0.5 2 0.551 -ej 1.0 26 20 -es 0.5 11 0.631 -ej 1.0 27 6 -es 0.5 15 0.605 -ej 1.0 28 17 -ej 1.0 14 4 -ej 1.0 15 9 -ej 1.0 11 8 -ej 1.5 17 9 -ej 1.5 16 6 -ej 1.5 22 20 -ej 2.0 9 2 -ej 2.0 10 6 -ej 2.0 20 18 -ej 2.5 18 13 -ej 2.5 8 6 -ej 3.0 13 2 -ej 3.0 6 1 -ej 3.5 3 1 -es 4.0 1 0.656 -ej 4.5 2 1 -ej 4.5 29 4 -ej 5.0 5 1 -ej 5.5 4 1 -es 6.0 1 0.976 -ej 6.5 30 24 -ej 6.5 23 1 -ej 7.0 24 1 -ej 7.5 25 1 
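
Long `ms` commands such as these are easy to mistype, so a quick structural check can help before simulating. The sketch below is a hypothetical heuristic and was not part of the original pipeline. It assumes the pattern used throughout this notebook: no migration, each `-es` adds one population, and every population except the final one is emptied by exactly one `-ej`. The function name `check_ms_command` is made up for illustration, and more general `ms` models may legitimately fail this check.

```bash
# Hypothetical sanity check: read one ms command on stdin and print "ok",
# or describe what looks wrong.
check_ms_command() {
    # Put one token per line so awk can step through the flags.
    tr -s ' \t\n' '\n' | awk '
        $0 == "-I"  { getline npop; npop += 0; next }            # initial populations
        $0 == "-es" { getline t; getline i; getline p; npop++; next }  # split adds one
        $0 == "-ej" { getline t; getline i; getline j; src[i]++; next } # i merges into j
        END {
            for (k = 1; k <= npop; k++) {
                if (src[k] > 1) {
                    printf "population %d is the source of %d joins\n", k, src[k]
                    bad = 1
                }
                if (src[k] == 0) unmerged++
            }
            if (unmerged != 1) {
                printf "%d populations are never merged (expected 1)\n", unmerged
                bad = 1
            }
            if (!bad) print "ok"
        }'
}

# Network 1a passes: populations 5, 3, 2 and 4 are each merged once, leaving 1.
echo 'ms 4 100 -T -I 4 1 1 1 1 -es 0.375 2 0.5 -ej 0.75 5 1 -ej 0.75 3 2 -ej 1.5 2 1 -ej 4.0 4 1' | check_ms_command
```

A command that passes this check can still carry wrong times or probabilities, so it complements rather than replaces a comparison against the Newick string.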

In [None]:
# The following is an example of running TICR, HyDe, and MSCquartets on the network n15h3
# while timing each method. This was repeated for all networks.

map=/maps_hyde/n15h3map.txt
outgroup=15 # for all simulated networks, the outgroup is the highest-numbered taxon, so it equals the number of taxa
true_newick_format=n15h3.net # file containing extended newick format of the true network

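gene_trees=(30 100 300 1000 3000 10000)
# "${gene_trees[@]}" expands to every element; a bare ${gene_trees} yields only the first one in bash
for gt in "${gene_trees[@]}"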
do
    for i in {1..30}
    do

    ## the following file name should be changed as needed. 
    ## describes the trial number (i) and number of gene trees (gt)
    gt_filename=n15orange.net-gt${gt}-${i}.tre

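    ## the ms command below must be changed to match the network being run;
    ## as written it is the n15h1_shallow command listed above, so substitute the n15h3 command to match the map and true-network files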
    ms 15 ${gt} -T -I 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -ej 0.2 9 8 -ej 0.2 3 2 \
    -ej 0.2 13 12 -es 0.3 10 0.8 -ej 1.8 5 4 -ej 2.4 6 4 -ej 2.8 4 2 -ej 0.8 10 8 \
    -ej 0.8 16 11 -ej 1.4 12 11 -ej 1.4 8 7 -ej 1.8 11 7 -ej 4.0 14 7 -ej 4.6 7 2 -ej 5.0 2 1 \
    -ej 5.5 15 1 | grep -v // > ${gt_filename}

    ## run TICR using gene tree output and compare to major tree
    timefile=${gt_filename}_TICR_time.txt
    TICROut=${gt_filename}_ticr.csv
    { time julia /phylo-microbes/scripts/CHTCFunctions/individual_methods/chtc_TICR.jl ${gt_filename} ${true_newick_format} ${TICROut} ; } 2> ${timefile}
    
    # run MSCquartets using gene tree output
    timefile=${gt_filename}_MSC_time.txt
    MSCOut=${gt_filename}_MSC.csv
    { time Rscript /phylo-microbes/scripts/CHTCFunctions/individual_methods/chtc_MSCquartets.R ${gt_filename} ; } 2> ${timefile}

    # 100bp length sequence simulation followed by concatenation of sequences:
    /Seq-Gen-1.3.4/source/seq-gen -mHKY -s0.036 -f0.300414,0.191363,0.196748,0.311475 -n1 -l100 <  ${gt_filename} > alignment.sg.phy
    goalign concat -i alignment.sg.phy -p | goalign reformat fasta -p > alignment.fasta 

    # extract SNPs and convert format from fasta to phylip and vcf
    snp-sites alignment.fasta > alignment.snp.fasta
    snp-sites -p alignment.fasta > alignment.fasta.phy.snp.txt
    snp-sites -v alignment.fasta > alignment.fasta.snp.vcf

    # extract number of individuals and sequence length from the phylip alignment
    numindiv=$(head -n 1 alignment.fasta.phy.snp.txt | awk '{print $1}')
    seqlength=$(head -n 1 alignment.fasta.phy.snp.txt | awk '{print $2}')
    
    # create the alignment-specific map file from the key-value dictionary in ${map}
    # keep only the FASTA header lines (matching '>' is safer than excluding lines that contain 'A')
    grep '^>' alignment.snp.fasta > map.temp1
    sed 's/>//g' map.temp1 > map.temp2
    awk 'BEGIN{OFS = "\t"} NR==FNR{dict[$1]=$2} NR>FNR{print $1, dict[$1]}' ${map} map.temp2 > map.txt

    timefile=${gt_filename}_HyDe_time.txt
    { time run_hyde.py -i alignment.fasta.phy.snp.txt -m map.txt -o ${outgroup} -n $numindiv -t $numindiv -s $seqlength --prefix ${gt_filename} ; } 2> ${timefile}
    done
done

# Post-processing of HyDe output files to determine D3, Patterson's D-Statistic, and Dp is found in phylo-microbes/scripts/SummaryTables/analyzeMethodOutputFile.jl
# Post-processing of all method output files to determine whether hybrids are detected (e.g. true and false positives), based on their newick files, is also found in phylo-microbes/scripts/SummaryTables/analyzeMethodOutputFile.jl
# Post-processing of timing output files is outlined in phylo-microbes/SummaryTables/time_summary.jl