Data and code for paper arXiv:1801.10086: How adaptive immunity constrains composition and fate of large bacterial populations.
We analyzed data collected by Paez-Espino et al. (2013). Original data from their study is publicly available in the NCBI Sequence Read Archive under the accession SRA062737. It includes four data files (SRR630110, SRR630111, SRR630412, and SRR630413) which we used for our analysis. We extracted the data corresponding to the MOI2 deep sequencing experiment and separated it into time points by checking each read for matches to the primers identified in the supplementary information of Paez-Espino et al.
The data files deposited here are a result of our further processing and sorting.
-
CRISPR1_MOI2_deep_spacer_types.txt contains every unique spacer sequence we detected and its assigned type. Spacers were grouped such that sequences within an edit distance of 2 were assigned to the same type.
- column 1: spacer sequence
- column 2: spacer type label
-
CRISPR1_MOI2_deep_timepoint_1.txt contains spacers arranged by their position in each read. Each row represents a read, and each column contains a spacer sequence if one was present or zero if no spacer is present in that position. Positions are ordered such that position 0 is next to the wildtype spacers and position 5 is the furthest from wildtype (newest) spacer. The remainder of the files in the
data
folder are the same structure as this file but for the rest of the time points in the study (2 through 14).
NOTE: time point 8 was nearly completely missing in the original data, which was confirmed through communication with the authors. We excluded time point 8 from our analysis.
We also analyzed data from Parikka et al. (2017) to compare virus-to-prokaryote ratios (VPR) from a variety of environments with our model. Their data is deposited at 10.15454/1.4539792655245962E12.
The following code or something similar will load the data into a three-dimensional numpy array with time on the first axis and n_ij
on the other two axes. n_ij
= the number of spacers of type i
in locus position j
, summed over all reads for that time point. (Since the read counts are high, summing over reads allows the entire data to be loaded into memory. This code can be tweaked if you do want the individual read data.)
import numpy as np
import pandas as pd
filepath = "CRISPR_immunity/data"
spacer_types = pd.read_csv("%s/CRISPR1_MOI2_deep_spacer_types.txt" %filepath, index_col=0)
total_spacer_types = np.max(spacer_types['type_label'])
timepoints = [1,2,3,4,5,6,7,9,10,11,12,13,14] # skip time point 8 - no data at this time point
total_spacer_types = np.max(spacer_types['type_label'])
max_j = 6 # never more than 6 spacers added, so 6 is the highest position index in the locus
# store loaded data
spacer_array = np.zeros((len(timepoints),total_spacer_types+1,max_j), dtype = int)
data_list = []
# loop through files
for t, timepoint in enumerate(timepoints):
data = pd.read_csv("%s/CRISPR1_MOI2_deep_timepoint_%s.txt" %(filepath, timepoint), index_col = 0) # load data as pandas DataFrame
data_list.append(data)
# create array - this part takes a while
for j, colname in enumerate(data.columns): # get spacers from each column in data table
for spacer in data[colname]:
if spacer != '0' and spacer != 0: # 0 if no spacer at that position
spacer_type = int(spacer_types.loc[spacer_types["spacer_sequence"] == spacer]['type_label']) # get spacer type label
spacer_array[t, spacer_type, j] += 1 # add data to array
Simulation code written by Dominique Soutiere.
-
Install the
Eigen
package. -
To compile the code:
g++ -std=c++0x -I ~ -O3 tauLeapingWithLoss3C0.cpp -o tauLeaping
After the -I
flag, put the path to the Eigen
folder downloaded in step 1. If it's in the home directory, use -I ~
, otherwise use -I path/to/Eigen
.
-
Make a
./data/
folder in the same folder as the compiled C++ code to save the output data to. -
To run the code:
./tauLeaping m xi eta R pv0 exp c0Exp repeat initial_x initial_y initial_z initial_nu
Input parameters:
m
total number of possible spacer typesxi
=1-e
, wheree
is the spacer effectiveness ranging from 0 to 1eta
spacer acquisition probability (<= 1) if phage is unsuccessful in infectionR
= r/(gC0), spacer loss ratepv0
= pV, probability of phage success against bacteria without spacersexp
maximum iterations for simulation = 10^expc0Exp
C0, nutrient concentration = 10^c0Exprepeat
replicate indicator, used only in creating output filenameintial_x
initial normalized bacterial population size (0 <= x <= 1), x = nB/C0initial_y
initial normalized phage population size (y >= 0), y = nV/C0initial_z
initial normalized nutrient concentration (0 <= z <= 1), z = C/C0initial_nu
initial fraction of bacteria with spacers (0 <= nu <= 1), nu = nBs/nB
Output data structure (by column, indexing starting at 0):
- Column 0: tau leaping iteration number
- Column 1: nB0, number of bacteria without a spacer
- Columns 2 to m+1: nBi, number of bacteria with spacer type i
- Column m+2: nV, number of phages
- Column m+3: C, nutrient concentration
- Column m+4: t, time in minutes; time in generations = t * g * C0
- Column m+5: method, unused (always 0)
Example usage:
./tauLeaping 500 0.5 0.0001 0.04 0.02 5 7 0 0.3 3 0.7 0