lh64/MultihitSimulation

MultihitSimulation

Code for performing multihit mutation simulations and generating plots based on bacterial genome data from Merrikh and Merrikh (https://lab.vanderbilt.edu/merrikh-lab/).

Raw and processed data files for 5 and 50 bacterial genomes are included in the RAW_DATA and SIM_INPUT directories, respectively. Simulations are performed in multihit_sim.py, which reads data from SIM_INPUT on gene orientation (co-directional (CD) or head-on (HO)), length (number of amino acids), and number of nonsynonymous mutations for all genes of length > 200.

For each gene, N random integers in the range [1, f*L] are generated, where N is the number of nonsynonymous mutations, f is a factor greater than 0 and at most 1, and L is the number of amino acids in the protein sequence. The list of random integers is then searched for duplicates, which represent "multihit" or "hotspot" mutations occurring by chance. For each set of duplicates, M random integers in the range [1, 20] are then generated, where M is the number of multihit mutations and 20 is the number of naturally occurring amino acids. This list is also searched for duplicates, which correspond to "parallel" mutations, i.e., identical mutations occurring at the same site. This procedure is repeated 10,000 times for each of f = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1.
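The per-gene procedure described above can be sketched in Python as follows. This is a minimal illustration, not the code in multihit_sim.py: the function name `simulate_gene`, the truncation of f*L to an integer, and the counting convention (every mutation at a repeated site counts as a multihit mutation, and every residue drawn more than once at such a site counts as a parallel mutation) are all assumptions made for the example.

```python
import random
from collections import Counter

N_AMINO_ACIDS = 20  # number of naturally occurring amino acids


def simulate_gene(n_mut, length, f, rng):
    """Run one simulation for one gene (hypothetical helper).

    Returns (multihit_mutations, parallel_mutations).
    """
    # Assumption: the upper bound f*L is truncated to an integer, minimum 1.
    n_sites = max(1, int(f * length))

    # Draw N random sites in [1, f*L]; randint is inclusive on both ends.
    sites = [rng.randint(1, n_sites) for _ in range(n_mut)]

    multihit = 0
    parallel = 0
    for m in Counter(sites).values():
        if m < 2:
            continue  # a site hit only once is not a multihit site
        # Assumption: all M mutations at a repeated site are counted.
        multihit += m
        # Draw M random amino acids in [1, 20] and look for duplicates.
        residues = [rng.randint(1, N_AMINO_ACIDS) for _ in range(m)]
        parallel += sum(c for c in Counter(residues).values() if c > 1)
    return multihit, parallel
```

Repeating this call 10,000 times per gene and summing the results separately over CD and HO genes would give one distribution per value of f, which is the shape of output the description above implies. Passing a seeded `random.Random` instance as `rng` keeps a run reproducible.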

For each value of f, the simulation results are written to the SIM_OUTPUT directory. The output includes total numbers of multihit mutations, numbers of genes with multihit mutations, and total numbers of parallel mutations, calculated for each simulation and separated by gene orientation (CD and HO). File names are given as [hsmut_sites OR hsmut_genes OR pamut]_[5 OR 50]strains_[f*100].csv.

Plots of simulation results are generated by make_plots.py, which reads the simulation output files from SIM_OUTPUT and writes the plots to the PLOTS directory. Plots include histograms of numbers of multihit mutations, numbers of genes with multihit mutations, and numbers of parallel mutations, for both CD and HO genes. For comparison, the observed numbers of mutations (taken from the processed data file in SIM_INPUT) are included as annotations, along with a two-sided p-value calculated as 2*min{sum(sim_val < obs_val), sum(sim_val > obs_val)}/n_sims, where n_sims is 10,000. Histogram plots are also generated for the quantities RN and RG, which are defined as the HO-to-CD ratios of the numbers of multihit (or parallel) mutations and of the numbers of genes with multihit mutations, respectively. The example plots show total numbers of multihit mutations for 50 bacterial strains (f=1; left) and the ratio RN of HO to CD multihit mutations for 5 and 50 bacterial strains (f=1; right).
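The two-sided p-value formula quoted above translates directly into a few lines of Python. The sketch below is illustrative only; the function name `two_sided_p_value` and the use of a plain list of simulated values are assumptions, and make_plots.py may compute the same quantity differently.

```python
def two_sided_p_value(sim_vals, obs_val):
    """Return 2*min{sum(sim < obs), sum(sim > obs)}/n_sims (hypothetical helper)."""
    n_sims = len(sim_vals)
    # Count simulations strictly below and strictly above the observed value.
    below = sum(v < obs_val for v in sim_vals)
    above = sum(v > obs_val for v in sim_vals)
    return 2 * min(below, above) / n_sims


# Example: 10 simulated values, observed value 9.
# 8 values fall below 9 and 1 value falls above it, so p = 2*1/10.
p = two_sided_p_value([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 9)
```

Because the comparisons are strict, simulated values equal to the observed value count toward neither tail. An observed value outside the simulated range gives a p-value of 0, which with n_sims = 10,000 is better read as p < 2/10,000 than as an exact zero.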
