# Estimating selection in maize using NGS data


For this exercise, we will be working on a dataset that consists of maize genomes from modern and ancient samples. In the first exercise, we will look for selection in 15 domesticated maize (*Zea mays* subsp. *mays*) genomes by estimating Tajima's D. In the second exercise, we will look for selection in a specific population of maize from Eastern North America using the Population Branch Statistic (PBS).

### Set up the environment

In [None]:
COURSE_PATH=/course/popgen25
DATA_PATH=$COURSE_PATH/selection
SOFTWARE_PATH=$COURSE_PATH/software

echo --programs that are installed:--
type angsd
type realSFS
type thetaStat

#make folder 
FOLDER=~/SelectionExercise/
echo -e "\n--creating folder-- "
echo $FOLDER
mkdir -p $FOLDER

# enter folder
cd $FOLDER

#make sym link for data and current folder
ln -sfn $FOLDER ~/current_folder
ln -sfn $DATA_PATH ~/data_folder

### Estimating Tajima's D in domesticated maize
For this exercise we will use 15 domesticated maize genomes from different varieties to look for signatures of selection that the domestication process might have left. For this, we have 15 BAM files with sequencing data for maize chromosome 4:

In [None]:
#Go to the working directory where we have the data:

# make links to files and add them to the folder
cp -r -sf  ${DATA_PATH}/bamfiles .
cp -sf ${DATA_PATH}/*fa.gz .
cp -sf ${DATA_PATH}/*fai .

#You can check which files we have in this directory by typing 'ls': 
#list of BAM files with domesticated maize data: 
echo -e "\n-- files in folder "
ls

echo -e "\n-- files in bamfiles folder "
ls bamfiles

Now, let's create a list with the BAM files that we will be using:

In [None]:
#All the modern maize BAM files start with 'RIM'
ls bamfiles/RIM*.bam > DomesticatedMaize.txt

echo -e "\n-- first lines of file "
head DomesticatedMaize.txt
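
Before moving on, it can be worth confirming that the list contains the 15 genomes we expect and that every path in it points to an existing file. The sketch below is one way to do that. `check_bam_list` is a hypothetical helper written for this exercise (it is not part of ANGSD), and it is demonstrated on a throwaway list so that it runs anywhere:

```shell
# check_bam_list is a hypothetical helper (not an ANGSD tool).
# Usage: check_bam_list LIST EXPECTED_COUNT
# Succeeds only if LIST has EXPECTED_COUNT non-empty lines
# and every listed file exists.
check_bam_list() {
    list=$1
    expected=$2
    missing=0
    n=$(awk 'NF { c++ } END { print c + 0 }' "$list")   # count non-empty lines
    while IFS= read -r f; do
        [ -z "$f" ] && continue
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done < "$list"
    echo "files listed: $n (expected $expected), missing: $missing"
    [ "$n" -eq "$expected" ] && [ "$missing" -eq 0 ]
}

# Demonstration on a temporary toy list with two empty stand-in files
demo_dir=$(mktemp -d)
touch "$demo_dir/RIM01.bam" "$demo_dir/RIM02.bam"
ls "$demo_dir"/RIM*.bam > "$demo_dir/toy_list.txt"
check_bam_list "$demo_dir/toy_list.txt" 2
```

In the exercise folder the equivalent call would be `check_bam_list DomesticatedMaize.txt 15`.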

#### 1. Estimate Genotype-likelihoods (GL)
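Now we can use ANGSD to compute the genotype likelihoods (`-GL 2`) and, from them, the per-site sample allele frequency (SAF) likelihoods (`-doSaf 1`) that the following steps build on: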


In [None]:
# do not run
# the command takes too long (minutes)
# you can run it after class if you want
#angsd -bam DomesticatedMaize.txt \
#    -ref B73v3_25.fa.gz \
#    -anc TDD39103.fa.gz \
#    -out maize_chr4 \
#    -doSaf 1 \
#    -C 50 -baq 1 \
#    -GL 2 \
#    -P 5 \
#    -minMapQ 30 \
#    -minQ 20 \
#    -minInd 10 \
#    -setMinDepth 3 \
#    -doCounts 1 \
#    -r 4:1-242029974

Check which parameters we are using and make sure they make sense to you.

This step takes too long to run, so we copy the results from the **res** directory instead (ANGSD outputs three files: maize_chr4.saf.gz, maize_chr4.saf.pos.gz and maize_chr4.saf.idx):

In [None]:
# copy results of the command and print the log    
cp ~/data_folder/res/maize_chr4* .
cat maize_chr4.log

#### 2. Estimate the SFS using realSFS
We will use realSFS to obtain an estimate of the SFS for domesticated maize. Note that we give realSFS the file that ends in *.idx*, which is the index for the SAF likelihoods that we generated before:

In [None]:
realSFS maize_chr4.saf.idx > maize_chr4.sfs
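
A quick sanity check on the output: an unfolded SFS for *n* diploid individuals has 2*n*+1 entries (0 to 2*n* derived alleles), so with 15 genomes we expect 31 numbers in the file. The sketch below counts the entries and prints the share of singletons among the variable categories. The helper name `sfs_summary` and the toy SFS are made up for illustration, and the sketch assumes the SFS is stored as whitespace-separated numbers on one line:

```shell
# sfs_summary is a hypothetical helper, not part of realSFS.
# Reads a one-line SFS and prints the number of entries and the
# share of singletons among the variable categories (1 .. 2n-1).
sfs_summary() {
    awk '{
        variable = 0
        for (i = 2; i < NF; i++) variable += $i   # skip first and last (fixed) entries
        printf "entries=%d singleton_share=%.3f\n", NF, $2 / variable
    }' "$1"
}

# Toy SFS for 2 diploid individuals (2*2+1 = 5 entries); values are invented
toy_sfs=$(mktemp)
echo "900 50 30 20 10" > "$toy_sfs"
sfs_summary "$toy_sfs"
```

For the exercise data, `sfs_summary maize_chr4.sfs` should report 31 entries if the SFS was estimated from all 15 genomes.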


Now we will use R to plot the SFS, just to make sure it looks reasonable.

In [None]:
#read the file with the SFS:
sfsmaize<-scan('~/current_folder/maize_chr4.sfs')

#exclude the first and last entries (the fixed categories) and normalise
sfsmaize<-sfsmaize[-c(1,length(sfsmaize))]
sfsmaize<-sfsmaize/sum(sfsmaize)

#create a PDF with the plot (you can also run the barplot line only)
pdf('~/current_folder/chr4_sfs.pdf')
barplot(sfsmaize, names=1:length(sfsmaize), main='SFS chr4 maize')
dev.off()

barplot(sfsmaize, names=1:length(sfsmaize), main='SFS chr4 maize')




To look at the PDF with the plot, go to the main browser tab of the Jupyter notebook and open our working directory.

**Question**:
 - What factors do you think could affect our SFS? 
 - What do you think we should consider when deciding which individuals/samples to include in our population?

#### 3. Calculate thetas per site
We will now use **realSFS saf2theta** to get the two diversity estimates that we need for Tajima's D: Watterson's theta and the nucleotide diversity pi.

In [None]:
#this might take a few minutes, so wait until it is done running:
realSFS saf2theta maize_chr4.saf.idx  \
    -sfs maize_chr4.sfs \
    -outname maize_chr4
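
For reference, these are the standard definitions behind the next step, written in terms of the SFS. Here $n$ is the number of sampled chromosomes (30 for 15 diploid genomes) and $\xi_i$ is the number of sites where the derived allele is carried by $i$ chromosomes. This notation is introduced here for illustration and does not appear in the ANGSD output; ANGSD also works with posterior expectations of these quantities rather than with counts of called SNPs.

$$
\hat\theta_W = \frac{S}{a_1}, \qquad S = \sum_{i=1}^{n-1} \xi_i, \qquad a_1 = \sum_{i=1}^{n-1} \frac{1}{i}
$$

$$
\hat\theta_\pi = \binom{n}{2}^{-1} \sum_{i=1}^{n-1} i\,(n-i)\,\xi_i
$$

$$
D = \frac{\hat\theta_\pi - \hat\theta_W}{\sqrt{\widehat{\mathrm{Var}}\left(\hat\theta_\pi - \hat\theta_W\right)}}
$$

Both estimators have the same expectation under the standard neutral model, so $D$ is expected to be close to zero there. An excess of low-frequency alleles raises $\hat\theta_W$ relative to $\hat\theta_\pi$ and makes $D$ negative; an excess of intermediate-frequency alleles makes it positive.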


#### 4. Estimate Tajima's D
Now we will use **thetaStat** to estimate Tajima's D in 5 kb windows, sliding in 1 kb steps, along chr4:

In [None]:
thetaStat do_stat \
    maize_chr4.thetas.idx \
    -win 5000 \
    -step 1000 \
    -outnames maize_chr4.thetas.5kWind


Once it is done running, let's look at the results:

In [None]:
head -n 5 maize_chr4.thetas.5kWind.pestPG


Can you guess what each column is?
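
If the header is hard to read in one long line, one option is to print each field on its own line with its column number. The sketch below assumes a tab-separated header; `number_columns` is a hypothetical helper, and the toy table has an invented three-column header that stands in for the real, longer one:

```shell
# number_columns is a hypothetical helper: prints each tab-separated
# header field of a file on its own line, prefixed by its column number.
number_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ print NR ": " $0 }'
}

# Toy file with an invented three-column header (the real header differs)
toy_table=$(mktemp)
printf 'Chr\tWinCenter\tTajima\n4\t2500\t-0.12\n' > "$toy_table"
number_columns "$toy_table"
```

On the exercise output the call would be `number_columns maize_chr4.thetas.5kWind.pestPG`.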

#### 5. Plot the results in R

In [None]:
#read the table with the results from thetaStat:
d<-read.table('~/current_folder/maize_chr4.thetas.5kWind.pestPG', as.is=T, h=T, comment.char='')

#exclude results that are NaN due to missing data:
d<-d[!is.na(d$fuf),]

#exclude windows with fewer than 100 sites with data:
d<-d[d$nSites>=100,]

#find the threshold for the 0.1% of windows with the most negative Tajima's D
#(ceiling() keeps the index a whole number of at least 1)
perc01<-sort(d$Tajima)[ceiling(length(d$Tajima)*0.001)]

#plot the Tajima's D vs the position along chr4:
pdf('~/current_folder/TajimasD_5Kwin.pdf', useDingbats=F, width=7.5, height=5)

plot(as.numeric(d$WinCenter), d$Tajima, col='grey80', ylab='Tajima D',  xlab='Chr4', pch=16)

#draw a line to mark the location of bt2 gene
abline(v=c(58979526, 58985686), col='purple')

#draw a line to mark the 0.1% most negative D
abline(h=perc01, col='red', lty=2)
dev.off()

plot(as.numeric(d$WinCenter), d$Tajima, col='grey80', ylab='Tajima D',  xlab='Chr4', pch=16)

#draw a line to mark the location of bt2 gene
abline(v=c(58979526, 58985686), col='purple')

#draw a line to mark the 0.1% most negative D
abline(h=perc01, col='red', lty=2)

**Question**: 
 - Why do you think we are looking for 1) negative Tajima's D values and 2) the 0.1% most negative values?

### Population Branch Statistic (PBS)
Now we will look for selection specifically in a population of ancient maize from Eastern North America (ENA). We would like to know whether there has been any selection in the evolutionary lineage that gave rise to maize in ENA. We know that maize dispersed north from the domestication center, reaching the US Southwest (US SW) ~4000 years ago, and then moved into ENA between 2000 and 1000 years ago (figure below, left side).

So, for this exercise we will use genomic data from three populations:
- 9 genomes from the Ozarks rockshelters in ENA
- 10 genomes from the Tularosa cave in the US SW
- 16 genomes from maize's wild ancestor *Zea mays* subsp. *parviglumis*

Given how these populations are related, we want to set up a PBS test like the one in the figure below (left side):


<img src="data_folder/Figures/Maize_migNorth.png" width="600">
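
For reference, PBS is computed from the three pairwise $F_{ST}$ values. Each $F_{ST}$ is first transformed into a branch-length-like divergence $T$, and the three $T$ values are then combined to isolate the branch leading to the focal population (here the Ozarks). The subscripts Oz, SW and parv are shorthand introduced here for the three populations:

$$
T_{ij} = -\log\left(1 - F_{ST}^{\,ij}\right)
$$

$$
\mathrm{PBS}_{\mathrm{Oz}} = \frac{T_{\mathrm{Oz,SW}} + T_{\mathrm{Oz,parv}} - T_{\mathrm{SW,parv}}}{2}
$$

A large value indicates a long Ozarks-specific branch, that is, allele frequency change that is not shared with the other two populations.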


#### 0. Prepare input files


Start by creating a list of BAM files for each population:


In [None]:
#Ozarks maize:
ls bamfiles/Ozark*.bam >ozarks.txt
 
#US SW maize:
ls bamfiles/Tularosa*.bam >SW750.txt

#wild maize:
ls bamfiles/TIL*.bam >parviglumis.txt



#### 1. Estimate GL for each population
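Now let's estimate the GL and SAF likelihoods for each population independently using ANGSD (we restrict the analysis to chr9 so that it is faster to run). Running all three commands takes around 15 minutes, so they are commented out below and we copy the precomputed results instead: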


In [None]:
# running all 3 takes around 15 min, so let's just copy the results
# US Southwest:
#angsd -bam SW750.txt -ref B73v3_25.fa.gz -anc TDD39103.fa.gz -out sw750_ds1 -doSaf 1 -C 50 -baq 1 -GL 2 -minMapQ 30 -minQ 20 -minInd 6 -setMinDepth 3 -doCounts 1  -r 9:1-157021084 &

# Ozarks (Eastern North America):
#angsd -bam ozarks.txt -ref B73v3_25.fa.gz -anc TDD39103.fa.gz -out ozark_ds1 -doSaf 1 -C 50 -baq 1 -GL 2 -minMapQ 30 -minQ 20 -minInd 6 -setMinDepth 3 -doCounts 1  -r 9:1-157021084 &

# Wild maize (parviglumis):
#angsd -bam parviglumis.txt -ref B73v3_25.fa.gz -anc TDD39103.fa.gz -out parviglumis_ds1 -doSaf 1 -C 50 -baq 1 -GL 2 -minMapQ 30 -minQ 20 -minInd 6 -setMinDepth 3 -doCounts 1  -r 9:1-157021084



# If you started the commands above, restart the kernel; then copy the results to your folder instead:
cp ~/data_folder/res/*ds1* .
ls *ds1*



Check the parameters again: do they make sense to you?

#### 2. Estimate 2D-SFS
We now need to estimate the two-dimensional SFS (2D-SFS) for every pair of populations (again, run one at a time):


In [None]:
# parviglumis X US SW:
realSFS parviglumis_ds1.saf.idx sw750_ds1.saf.idx > parviglumis_sw750.sfs

# parviglumis X Ozarks:
realSFS parviglumis_ds1.saf.idx ozark_ds1.saf.idx > parviglumis_ozark.sfs

# Ozarks X US SW:
realSFS ozark_ds1.saf.idx sw750_ds1.saf.idx > ozark_sw750.sfs
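
As with the one-dimensional case, the size of each output can be checked against the sample sizes: an unfolded 2D-SFS for populations with *n1* and *n2* diploid individuals has (2*n1*+1) × (2*n2*+1) cells. The sketch below only does that arithmetic; `expected_2dsfs_entries` is a hypothetical helper, not a realSFS feature:

```shell
# expected_2dsfs_entries is a hypothetical helper: number of cells in the
# unfolded 2D-SFS for two populations with n1 and n2 diploid individuals.
expected_2dsfs_entries() {
    echo $(( (2 * $1 + 1) * (2 * $2 + 1) ))
}

# Sample sizes from the exercise: parviglumis 16, US SW 10, Ozarks 9
echo "parviglumis x US SW: $(expected_2dsfs_entries 16 10)"
echo "parviglumis x Ozarks: $(expected_2dsfs_entries 16 9)"
echo "Ozarks x US SW: $(expected_2dsfs_entries 9 10)"
```

Assuming each `.sfs` file holds whitespace-separated numbers, these counts can be compared with, for example, `wc -w < parviglumis_sw750.sfs`.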



#### 3. Estimate the three-pops FST 
Now we will build the FST index for the three populations from our SAF files, using the 2D-SFS as priors (the order of the populations here determines the order of the results in the output file):

In [None]:
realSFS fst index parviglumis_ds1.saf.idx ozark_ds1.saf.idx sw750_ds1.saf.idx -fstout parviglumis_ozark_sw750 -whichFst 1 -sfs parviglumis_ozark.sfs -sfs parviglumis_sw750.sfs -sfs ozark_sw750.sfs



#### 4. Estimate PBS in 5kb windows
We estimate the FST and PBS along chromosome 9 in windows of 5 kb, sliding in 1 kb steps:

In [None]:
realSFS fst stats2 parviglumis_ozark_sw750.fst.idx -win 5000 -step 1000  > parviglumis_ozark_sw750_chr9_5Kwin
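
To make the arithmetic behind a PBS value concrete, the sketch below computes it from three pairwise FST values using the textbook transformation T = -log(1 - FST). `pbs_from_fst` is a hypothetical helper and the FST values are invented; this is not a description of how realSFS computes or weights its windowed estimates:

```shell
# pbs_from_fst is a hypothetical helper, not a realSFS subcommand.
# Arguments: FST(focal,A) FST(focal,B) FST(A,B)
# Prints the PBS of the focal population.
pbs_from_fst() {
    awk -v fa="$1" -v fb="$2" -v ab="$3" 'BEGIN {
        t_fa = -log(1 - fa)      # divergence focal vs A
        t_fb = -log(1 - fb)      # divergence focal vs B
        t_ab = -log(1 - ab)      # divergence A vs B
        printf "%.4f\n", (t_fa + t_fb - t_ab) / 2
    }'
}

# Invented FST values: the focal population is more diverged from B
pbs_from_fst 0.5 0.75 0.5
```

With the population order used above (parviglumis, Ozarks, US SW), the Ozarks would be the focal population and the other two would play the roles of A and B.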



Once it is done running, we can take a look at the results:


In [None]:
head -n 5 parviglumis_ozark_sw750_chr9_5Kwin



#### 5. Plot the results using R


In [None]:
#read the results:
newtab<-read.table('~/current_folder/parviglumis_ozark_sw750_chr9_5Kwin', as.is=T, h=T)

#exclude windows with fewer than 10 sites:
newtab<-newtab[newtab$Nsites>=10,]

#Get the threshold for the 99.5 percentile (notice how we are plotting PBS1, which corresponds to the PBS of the Ozarks):
q995<-quantile(newtab$PBS1[!is.na(newtab$PBS1)], probs=0.995)
#This is the genome-wide threshold for the 99.9 percentile:
q999GW<-0.721893

#make the plot
pdf('~/current_folder/pbs_ozarks_5Kwin.pdf', useDingbats=F, width=7.5, height=5)

plot(as.numeric(newtab$midPos), newtab$PBS1, col='grey80', ylab='PBS Ozarks', ylim=c(0, max(c(newtab$PBS0, newtab$PBS1, newtab$PBS2), na.rm=TRUE)+0.1), yaxs='i', xaxs='i', xlab='Chr9', pch=16)

#mark the location of the waxy gene
abline(v=newtab$midPos[newtab$midPos>=23267684 & newtab$midPos<=23271612][1], col='black')

#mark the 99.5 percentile for chr9 (uncomment the second line to add the 99.9 percentile for the whole genome):
abline(h=q995, col='red', lty=2)
#abline(h=q999GW, col='mediumpurple3', lty=2)

dev.off()


plot(as.numeric(newtab$midPos), newtab$PBS1, col='grey80', ylab='PBS Ozarks', ylim=c(0, max(c(newtab$PBS0, newtab$PBS1, newtab$PBS2), na.rm=TRUE)+0.1), yaxs='i', xaxs='i', xlab='Chr9', pch=16)

#mark the location of the waxy gene
abline(v=newtab$midPos[newtab$midPos>=23267684 & newtab$midPos<=23271612][1], col='black')

#mark the 99.5 percentile for chr9 (uncomment the second line to add the 99.9 percentile for the whole genome):
abline(h=q995, col='red', lty=2)
#abline(h=q999GW, col='mediumpurple3', lty=2)



Let's look at the results together.