# Simulating genomic data in a 'plant breeding' context

In [1]:
using Pkg
Pkg.rm("XSim")
Pkg.add(PackageSpec(name="XSim", rev="master"))

[32m[1m  Updating[22m[39m `/opt/julia/environments/v1.0/Project.toml`
 [90m [3d41126b][39m[91m - XSim v0.3.0+ #master (https://github.com/reworkhow/XSim.jl.git)[39m
[32m[1m  Updating[22m[39m `/opt/julia/environments/v1.0/Manifest.toml`
 [90m [3d41126b][39m[91m - XSim v0.3.0+ #master (https://github.com/reworkhow/XSim.jl.git)[39m
[32m[1m  Updating[22m[39m registry at `/opt/julia/registries/General`
[32m[1m  Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[32m[1m Installed[22m[39m PDMats ── v0.9.6
[32m[1m Installed[22m[39m Parsers ─ v0.2.15
[32m[1m Installed[22m[39m Compose ─ v0.7.2
[32m[1m  Updating[22m[39m `/opt/julia/environments/v1.0/Project.toml`
 [90m [3d41126b][39m[92m + XSim v0.3.0+ #master (https://github.com/reworkhow/XSim.jl.git)[39m
[32m[1m  Updating[22m[39m `/opt/julia/environments/v1.0/Manifest.toml`
 [90m [a81c6b42][39m[93m ↑ Compose v0.7.1 ⇒ v0.7.2[39m
 [90m [90014a1f][39m[93m ↑ PDMats v0.9.

In [2]:
using XSim, DelimitedFiles, Distributions

┌ Info: Recompiling stale cache file /opt/julia/compiled/v1.0/XSim/fVVb1.ji for XSim [3d41126b-a46a-5bdb-b7a1-7ea6cc35a8ef]
└ @ Base loading.jl:1184


### Defining the genome and reading a real marker catalogue

In [3]:
mutRate    = 0.0
numChr     = 2
nLoci      = 6096
chrLength  = [1.62;1.41]
numLoci    = [3339;2757]
mutationRate = 0.0
myData = readdlm("markerCatalogue4JuliaChrom1-2", ' ', Any, '\n', header=false)
mp1 = Float64.(myData[1,1:numLoci[1]])
mp2 = Float64.(myData[2,1:numLoci[2]])
mapPos = [mp1, mp2]

genefreq1   = fill(0.5,numLoci[1])
genefreq2 = fill(0.5,numLoci[2])
geneFreq = [genefreq1, genefreq2]

idx = rand(numLoci[1]).>0.995  # you want 0.5% to be QTL, i.e about 17 QTL
qtlIndex1 = collect(1:numLoci[1])[idx]
idx = rand(numLoci[2]).>0.995  # you want 0.5%% to be QTL, i.e about 14 QTL
qtlIndex2 = collect(1:numLoci[2])[idx]
qtlIndex = [qtlIndex1, qtlIndex2]
line = 0
numQTL = 0

for i in qtlIndex
    line +=1
    println("Number of QTL on chromosome $line: ", length(i))
    numQTL += 1
end

qtlEffect1 = randn(length(qtlIndex1))/sqrt(0.5*numQTL)
qtlEffect2 = randn(length(qtlIndex2))/sqrt(0.5*numQTL)
qtlEffect = [qtlEffect1, qtlEffect2];

Number of QTL on chromosome 1: 12
Number of QTL on chromosome 2: 12


### Building the genome
Passing  arrays for chrLength, numLoci and arrays of arrays for mapPos, geneFreq, qtlIndex, qtlEffect
i.e. using the most flexible version of build_genome

In [4]:
build_genome(numChr,chrLength,numLoci,geneFreq, mapPos, qtlIndex, qtlEffect, mutationRate)

### Sampling founder individuals from a file with 1326 genotyped individuals
Note that the genotype data for two chromosomes on the file is phased. You may have to use FImpute or some similar software for phasing. You may read less individuals form the data file than it contains, but not more.

In [5]:
popSizeFounder = 1326
basePop = sampleFounders(popSizeFounder, "reformattedMarkerDataChrom1-2");

Sampling 1326 animals into base population.


Splitting the founder individuals into tow cohorts of equal size

In [6]:
basePop1 = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0))
basePop1.animalCohort = basePop.animalCohort[1:663];

In [7]:
basePop2 = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0))
basePop2.animalCohort = basePop.animalCohort[664:1326];

Although not used here, animals and cohorts belong to breeds, and the breed coposition can be set for cohorts and is inferered thereafter from the mating strategy

In [8]:
XSim.setBreedComp(basePop,[1.0])
XSim.setBreedComp(basePop1,[1.0])
XSim.setBreedComp(basePop1,[1.0]);

In [None]:
function sampleOneDHOffspringFrom(parent)
    offspring = XSim.Animal(parent.myID,0)
    XSim.sampleOnePosOri(offspring.genomePat,parent)
    offspring.genomeMat = deepcopy(offspring.genomePat)
    return offspring
end


function sampleDHOffspringFrom(parents::XSim.Cohort, numDHOffs::Int64)
    println("Sampling a single offspring from $numDHOffs parents selected at random from a cohort of size ",size(parents.animalCohort,1))
    offspring=XSim.Cohort(Array{XSim.Animal}(undef,0),Array{Int64}(undef,0,0))
    resize!(offspring.animalCohort,numDHOffs)
    for i in 1:numDHOffs
        parent = XSim.getRandomInd(parents)
        offspring.animalCohort[i] = sampleOneDHOffspringFrom(parent)
    end
    return offspring
end




Removing output files from a previous run, then creating 5000 DH lines from each of the two base populations. The method sampleDHOffspringFrom(parents, numOff) will produce numOff DH Offspring in total by producing a single DH offspring from a randomly sampled parent from the cohort 'parents' with replacement. This process is repeated 'numOff' times. 

In [9]:
outputFileName = "JuliaPlants"
run(`\rm -f $outputFileName.ped`)
run(`\rm -f $outputFileName.phe`)
run(`\rm -f $outputFileName.brc`)
run(`\rm -f $outputFileName.gen`)

numOffspring = 5000
DHBaseLines1 = sampleDHOffspringFrom(basePop1, numOffspring)
DHBaseLines2 = sampleDHOffspringFrom(basePop2, numOffspring);

Sampling a single offspring from 5000 parents selected at random from a cohort of size 663
Sampling a single offspring from 5000 parents selected at random from a cohort of size 663


Let's look at the mean and variance of the genotypic values of our 2*1326 DH-lines in DHBaseline1 and DHBaseline2.  

In [10]:
size(DHBaseLines1.animalCohort,1)

5000

In [11]:
var(getOurGenVals(DHBaseLines1))

8.629846314235138

In [12]:
var(getOurGenVals(DHBaseLines2))

7.807674937771216

In [21]:
mean(getOurGenVals(DHBaseLines1))

1.8838836644619128

In [22]:
mean(getOurGenVals(DHBaseLines2))

1.8636068244147042

Let us specify  the residual variance to simulate phenotypic values, note that getOurPhenVals() is actually setting all individuals phenotypic values in the cohort that is passed as an argument.

In [15]:
resVar = 5
XSim.setResidualVariance(resVar)

5

In [16]:
var(getOurPhenVals(DHBaseLines1,resVar))

13.54821697503983

In [17]:
var(getOurPhenVals(DHBaseLines2,resVar))

12.830965444623798

In [18]:
outputFileName

"JuliaPlants"

Now lets output the data to the files. Note that outputPedigree() is not only outputting the pedigree, but also the phenotypes and genotypes, look at the files that got generated!

In [19]:
outputPedigree(DHBaseLines1,outputFileName);
outputPedigree(DHBaseLines2,outputFileName);

Sampling each 2500 male and female parents from DHBaseline1 at random and producing one offspring each. Sampling of parents is done with replacement. Note that sampleRan() is sampling 50% female and 50% male offspring, we are concatenating them into a single cohort. Note further, that we are using the same population to sample male and female parents from, so sex is ignored and we might even end up cloning an individual. Furthermore, after this step of random mating, our DH-Lines are not double haploids anymore. With the parameter numGen that is passed as an argument to sampleRan(), we are determining the number of generations of random mating. The variable 'gen' is used to store the actual numer of generation the simualtion currently is at.

In [20]:
DHLine1_H1 = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
DHLine1_H1m = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
DHLine1_H1f = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0))
numOffspring = 2500
numGen = 1
gen = 1
output = true
DHLine1_H1m, DHLine1_H1f, gen = sampleRan(numOffspring, numGen, DHBaseLines1, DHBaseLines1, fileName=outputFileName)
DHLine1_H1 = concatCohorts(DHLine1_H1m, DHLine1_H1f);

Generation     2: sampling  1250 males and  1250 females


Now we are repeating the same random mating for DHBaseline2

In [23]:
DHLine2_H1 = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
DHLine2_H1m = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
DHLine2_H1f = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0))
numOffspring = 2500
numGen = 1
gen = 1
output = true
DHLine2_H1m, DHLine2_H1f, gen = sampleRan(numOffspring, numGen, DHBaseLines2, DHBaseLines2, fileName=outputFileName)
DHLine2_H1 = concatCohorts(DHLine2_H1m, DHLine2_H1f);

Generation     2: sampling  1250 males and  1250 females


We know now how to generate double haploid lines and how to do random mating. Next let's do phenotypic selection. As before we are declaring the male and female cohorts to hold the offspring that are going to be generated. We will be generation 1000 offspring based on 100 phenotypically best male and female parents. Again we will be selecting the phenotypically best 100 parents from the same population that was generated in the previous step (DHLine1_H1).

In [24]:
DHLine1_H2 = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
DHLine1_H2m = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
DHLine1_H2f = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0))
numMaleParents = 100
numFemaleParents = 100
numOffspring = 1000
numGen = 1
DHLine1_H2m, DHLine1_H2f, gen = sampleSel(numOffspring, numMaleParents, numFemaleParents, numGen, 
                                    DHLine1_H1, DHLine1_H1, XSim.common.varRes, gen=gen,
                                    fileName=outputFileName, direction=1);
DHLine1_H2 = concatCohorts(DHLine1_H2m, DHLine1_H2f);

Generation     3: sampling   500 males and   500 females


As before, we are applying mass selection also to the other population in the same way as was done above for DHLine_H1 now for DHLine_H2

In [25]:
DHLine2_H2 = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
DHLine2_H2m = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
DHLine2_H2f = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0))
numMaleParents = 100
numFemaleParents = 100
numOffspring = 1000
numGen = 1
DHLine2_H2m, DHLine2_H2f, gen = sampleSel(numOffspring, numMaleParents, numFemaleParents, numGen, 
                                    DHLine2_H1, DHLine2_H1, XSim.common.varRes, gen=gen,
                                    fileName=outputFileName, direction=1);
DHLine2_H2 = concatCohorts(DHLine2_H2m, DHLine2_H2f);

Generation     4: sampling   500 males and   500 females


After random mating and phentoypic selection, let us apply selection based on BLUP, i.e. breeding values. For this purpose, we need to read the pedigree and phenotpyic files that were produced sampleRan() and sampleSel() as well as outputPedigree() above. First, let's read the phenotypic data into a dataframe.

In [26]:
using DataFrames
# BLUP selection
phenofile = outputFileName*".phe"
colNames = [:Animal;:y;:bv]
#dfPhen = CSV.read(phenofile,delim = ' ',header=false,names=colNames)
dfPhen = readtable(phenofile, separator=' ', header=false)
names!(dfPhen, colNames)
dfPhen[:1] = string.(dfPhen[:1])
dfPhen

│   caller = ip:0x0
└ @ Core :-1


Unnamed: 0_level_0,Animal,y,bv
Unnamed: 0_level_1,String,Float64⍰,Float64⍰
1,1327,-5.078,-0.646
2,1328,3.83,2.719
3,1329,2.954,0.196
4,1330,10.999,8.527
5,1331,-1.557,0.254
6,1332,6.178,4.857
7,1333,-2.658,-2.483
8,1334,-4.636,-2.34
9,1335,-5.957,-4.267
10,1336,3.452,4.832


Now, we read the pedigree in to a dataframe, reformat it, write is out again such that it can finally be read by the get_pedigree() method. This should be fixed in future releases....

In [27]:
pedfile   = outputFileName*".ped"
dfPed = readtable(pedfile,separator = ' ', header = false)

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Int64⍰,Int64⍰,Int64⍰
1,1327,410,0
2,1328,82,0
3,1329,181,0
4,1330,284,0
5,1331,82,0
6,1332,628,0
7,1333,530,0
8,1334,521,0
9,1335,291,0
10,1336,89,0


In [28]:
using CSV
pedfile = outputFileName*".pedReformatted"
CSV.write(pedfile, dfPed, delim=';',header=["ind","sire" , "dam"])

"JuliaPlants.pedReformatted"

In [29]:
ped = XSim.get_pedigree(pedfile,separator=';');

[32mcoding pedigree... 100%|████████████████████████████████| Time: 0:00:01[39m
[32mcalculating inbreeding...   0%|                         |  ETA: 0:46:19[39m

Finished!


[32mcalculating inbreeding...  75%|███████████████████      |  ETA: 0:00:00[39m[32mcalculating inbreeding... 100%|█████████████████████████| Time: 0:00:00[39m


After having read phenotypes and genotypes, we can now set the genetic variance that shall be used in BLUP, define the model for analysis, setup the mixed model equations and solve them. 

In [30]:
varGen = 5
mme = XSim.build_model("y = intercept + Animal",resVar)
XSim.set_random(mme,"Animal",ped,varGen)
out = XSim.solve(mme,dfPhen,solver="GaussSeidel",printout_frequency=40)

40 0.022237818708866375
80 2.1108926421665888e-5


18330×2 Array{Any,2}:
 "1:intercept : intercept"   1.84572  
 "1:Animal : 188"           -0.790875 
 "1:Animal : 5734"           2.73906  
 "1:Animal : 1211"          -1.29698  
 "1:Animal : 10467"          0.0873198
 "1:Animal : 1160"           2.69014  
 "1:Animal : 8169"           1.00273  
 "1:Animal : 599"           -3.11752  
 "1:Animal : 228"            2.68596  
 "1:Animal : 726"           -6.50645  
 "1:Animal : 10249"         -2.71558  
 "1:Animal : 690"           -4.24571  
 "1:Animal : 8975"          -3.30308  
 ⋮                                    
 "1:Animal : 16870"          3.04122  
 "1:Animal : 6297"          -0.108775 
 "1:Animal : 16118"         -1.64379  
 "1:Animal : 15750"          1.76839  
 "1:Animal : 16099"          0.102341 
 "1:Animal : 15435"          0.440306 
 "1:Animal : 13621"         -1.02823  
 "1:Animal : 14760"          0.11767  
 "1:Animal : 2551"          -0.0618637
 "1:Animal : 7319"           0.430224 
 "1:Animal : 12795"         -1.93809  
 "1

Now all our indiviudals have estimated genotypic values, with the method putEBV() we are transferring them to the indiviudals that we would like to select to become the parents of the next generation. In our case, these would be the cohorts DHLine1_H2 and DHLine2_H2

In [31]:
# transfer BLUP-EBV to animals
XSim.putEBV(DHLine1_H2,ped,mme,out)
XSim.putEBV(DHLine2_H2,ped,mme,out)

In [32]:
DHLine1_H2_sorted = XSim.Cohort(Array{XSim.Animal}(undef,0),Array{Int64}(undef,0,0))
DHLine2_H2_sorted  = XSim.Cohort(Array{XSim.Animal}(undef,0),Array{Int64}(undef,0,0))
DHLine1_H3  = XSim.Cohort(Array{XSim.Animal}(undef,0),Array{Int64}(undef,0,0))
DHLine2_H3  = XSim.Cohort(Array{XSim.Animal}(undef,0),Array{Int64}(undef,0,0))
direction = 1.0
numMaleParents = 50
numFemaleParents = 50
numOffspring = 500
numGen = 1

1

After having declared empty cohorts for the offspring of the next generation, we are creating a y vector with the animals EBVs in the direction we want to select on (e.g. if low values arefavourable, direction should be -1.0). Then EBVs are sorted and the best number of parents are selected to be passed as parents to sampleChildren()

In [33]:
println("Generation $gen : sampling children by BLUP")
y = direction*[animal.ebv for animal in DHLine1_H2.animalCohort]
DHLine1_H2_sorted.animalCohort = DHLine1_H2.animalCohort[sortperm(y)][(end-numMaleParents+1):end]
y = direction*[animal.ebv for animal in DHLine2_H2.animalCohort]
DHLine2_H2_sorted.animalCohort = DHLine2_H2.animalCohort[sortperm(y)][(end-numFemaleParents+1):end]
DHLine1_H3 = XSim.sampleChildren(DHLine1_H2_sorted,DHLine1_H2_sorted,numOffspring)
DHLine2_H3 = XSim.sampleChildren(DHLine2_H2_sorted,DHLine2_H2_sorted,numOffspring)
outputPedigree(DHLine1_H3,outputFileName)
outputPedigree(DHLine2_H3,outputFileName)
gen+=1

Generation 4 : sampling children by BLUP


5

Finally, we are producing each 250 double haploid lines again, from the offspring of the BLUP selected parents 

In [34]:
println("Producing DHLines again....")
DHLine1_4  = XSim.Cohort(Array{XSim.Animal}(undef,0),Array{Int64}(undef,0,0))
DHLine2_4  = XSim.Cohort(Array{XSim.Animal}(undef,0),Array{Int64}(undef,0,0))

numOffspring = 250
DHLine1_4 = sampleDHOffspringFrom(DHLine1_H3, numOffspring)
DHLine2_4 = sampleDHOffspringFrom(DHLine2_H3, numOffspring);
     


Producing DHLines again....
Sampling a single offspring from 250 parents selected at random from a cohort of size 500
Sampling a single offspring from 250 parents selected at random from a cohort of size 500


In [35]:
outputPedigree(DHLine1_4,outputFileName);
outputPedigree(DHLine2_4,outputFileName);

The final step is crossing the two DHLines again to produce hybrids....

In [36]:
println("Producing hybrids now....")
H1 = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
H1m = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0)) 
H1f = XSim.Cohort(Array{XSim.Animal,1}(undef,0),Array{Int64,2}(undef,0,0))
numOffspring = 250
numGen = 1
gen = 5
output = true
H1m, H1f, gen = sampleRan(numOffspring, numGen, DHLine1_4, DHLine2_4, fileName=outputFileName)
H1 = concatCohorts(DHLine2_H1m, DHLine2_H1f);


Producing hybrids now....
Generation     2: sampling   125 males and   125 females


Of course the 'breeding program' outlined here is not reflecting a real situation but rahter shows you the tools XSim has to do simulations. You should be able to modify the above code to better reflect a real situation. Also you may modify XSim's methods to better reflect the situation you need to cope with. Note that all data simulated is written to the files with the name <outputFileName> that you declared above. so You should be able to look at selection response, genetic trend across generation etc.