Estimating the bottleneck size using MLHapRec

This repository contains a series of codes to infer the bottleneck size from short-read data using the haplotype reconstruction package MLHapRec.

This project is developed to analyse the transmission bottleneck size of influenza viruses using short-read data from two timepoints, i.e. one before the transmission (in the donor) and one after transmission (in the recipient).

Requirements

This project requires Multi_locus_trajectories.out, Loci*.dat, and Hap_data*.dat files with an inferred noise parameter C from SAMFIRE. Note that these files may be readily obtained using default commands in SAMFIRE. The files should be available for all the eight (flu type A and B) or seven (C and D) segments of the genome. Create a folder with the name of the segments, i.e. HA, MP, NA, NP, NS, PA, PB1, PB2, and put the corresponding files (mentioned above) into each folder. If some of the segments do not contain a variant, i.e. Multi_locus_trajectories.out is empty, please create the corresponding folder and leave it empty.

Usage

To explain the various features of this project, we consider the following example -- All the files created in this example can be found here.

Suppose we have a dataset with one transmission event (we will later discuss how to deal with larger datasets). We store all the required files in a local directory path/to/directory/Transmission1.

The first step is to construct the underlying haplotypes, q*, in each segment by typing the following in the command line:

t=0
mkdir Transmission1_qStar #making a directory to deposit all the reconstructed haplotypes (make sure to check your local directory and where you create the files)
cd Transmission1_qStar
for i in HA MP NA NP NS PA PB1 PB2; do #these are the names of the folders for the 8 segments of influenza A and B and are sorted alphabetically
    mkdir test_$t #reconstructed haplotypes for the ith segment (following the alphabetical order above) will be saved in a folder named test_i 
	cd test_$t
	echo $i #this shows the segment for which the haplotype reconstruction is currently taking place (note that the progress here could be slow depending on how large is the Multi_locus_trajectories.out
	/path/to/directory/Codes/MLHapRec/./run_MLHapRec /path/to/directory/Transmission1/$i/Multi_locus_trajectories.out 660 1 10
    cd ..
	t=`expr $t + 1`
done

This creates a folder named Transmission1_qStar (make sure the folder is in your local directory where other folders, including Transmission1, are located) which contains 8 subfolders named test_0, test_1, ..., and test_7 with the corresponding reconstructed haplotype sets saved in the outcome_1.txt for each segment HA, MP, NA, NP, NS, PA, PB1, and PB2, respectively. Please check here for the explanation of ./run_MLHapRec input format.

Note that this step could take several minutes (hours) depending on how large your Multi_locus_trajectories.out files are.

The second step is to create simulated short-read data, x*, by typing:

C=660 #change this to your inferred noise parameter 
for s2 in `seq 1 100`; do
## Update path to .run file
/path/to/directory/Codes/Codes_for_xStar/SimRealObsCount.run -if /path/to/directory/Transmission1 -of /path/to/directory/Transmission1_xStar -simOnly true -importHapsMahan /path/to/directory/Transmission1_qStar -format Mahan -rmMonoSim false -C $C -ss $s2 -filterData False 
done

This creates a folder named Transmission1_xStar which contains 100 replica of the Multi_locus_trajectories.out for each segment of Transmission1 assuming a noise parameter C=660 and a fixed random \ generator -ss $s2 for each replicate sample -- You can change the total number of replica by changing the for loop for s2 in seq 1 NUMBER; do and the noise parameter C=VALUE. A detailed explanation of what each flag does can be found here.

The third step is to infer the frequency of the reconstructed haplotypes for each replicate sample x*. We call this the q** set. Note that in this step, no haplotype reconstruction takes place, but we rather (re)infer the frequency of our initially reconstructed haplotype set Transmission1_qStar by typing the following commands:

C=660 #change this to your inferred noise parameter 
mkdir Transmission1_qStarStar  #making a directory to deposit all the re-inferred haplotype frequencies q**
for s in `seq 1 100`; do #this varies from 1 to the total number of replicate samples which, in this case, we set to be equal to 100
   echo $s
   cd /path/to/directory/Transmission1_qStarStar
   mkdir Seed_$s #making a directory to deposit the re-inferred frequencies of set $s
   cd ..
   for t in `seq 0 7`; do
   Codes/./run_qstarstar /path/to/directory/Transmission1_xStar/Seed_$s/SimulatedData_Mahan_Gene_$t.dat /path/to/directory/Transmission1_qStar/test_$t/outcome_1.txt /path/to/directory/Transmission1_qStarStar/Seed_$s qStarStar_$t.txt $C
   done
done

This creates a directory Transmission1_qStarStar with all the inferred haplotype frequencies q** (taken from x*) stored in folders Seed_$s where $s goes from 1 to 100 to cover all the 100 replicate samples.

We then concatenate the inferred frequencies of each segment across the 100 replicates and store them in a new folder called Transmission1_bottleneck with 8 subfolders corresponding to each gene segment, segment_$t, and concatenated file names test_$t.txt (where $t varies from 0 to 7) by typing the following in the command line:

mkdir Transmission1_bottleneck
for t in `seq 0 7`; do
   cd Transmission1_bottleneck
   mkdir segment_$t
   cd ..
   cat /path/to/directory/Transmission1_qStarStar/*/qStarStar_$t.txt > /path/to/directory/Transmission1_bottleneck/segment_$t/test_$t.txt
done

The fourth step is to calculate the bottleneck size for each segment by typing:

for t in `seq 0 7`; do
   cd /path/to/directory/Transmission1_bottleneck/segment_$t/
   /path/to/directory/Codes/./run_bottleneck /path/to/directory/Transmission1_qStar/test_$t/outcome_1.txt /path/to/directory/Transmission1_bottleneck/segment_$t/test_$t.txt
   cd ..
done

This calculates the likelihood distribution with bottleneck sizes ranging from 1 to 1000 and stores it in Transmission1_bottleneck/segment_$t/likelihood_distribution.txt and their corresponding maximum likelihood value in maximum_likelihood.txt in the same directory.

Finally, to calculate the find the likelihood distriubtion and maximum likelihood value for Transmission1, type:

cd /path/to/directory/Transmission1_bottleneck
cat */likelihood_distribution.txt > concatenated_likelihoods.txt
/path/to/directory/Codes/./run_avg_bottleneck concatenated_likelihoods.txt Transmission1_overall_likelihood.txt Transmission1_maximum_likelihood.txt

where we first concatenated all the likelihood_distribution.txt files into concatenated_likelihoods.txt and then calculated the likelihood distribution for Transmission1 in Transmission1_overall_likelihood.txt and its maxmimum likelihood value in Transmission1_maximum_likelihood.txt.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
Codes		Codes
Example		Example
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
overview.png		overview.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Codes

Codes

Example

Example

.gitignore

.gitignore

LICENSE.md

LICENSE.md

README.md

README.md

overview.png

overview.png

Repository files navigation

Estimating the bottleneck size using MLHapRec

Requirements

Usage

About

Releases

Packages

Languages

License

mg878/Bottleneck_Size_Estimation

Folders and files

Latest commit

History

Repository files navigation

Estimating the bottleneck size using MLHapRec

Requirements

Usage

About

Resources

License

Stars

Watchers

Forks

Languages