Skip to content

mbonsma/CRISPR-immunity

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 

Repository files navigation

CRISPR immunity

Data and code for paper arXiv:1801.10086: How adaptive immunity constrains composition and fate of large bacterial populations.

Data

We analyzed data collected by Paez-Espino et al. (2013). Original data from their study is publicly available in the NCBI Sequence Read Archive under the accession SRA062737. It includes four data files (SRR630110, SRR630111, SRR630412, and SRR630413) which we used for our analysis. We extracted the data corresponding to the MOI2 deep sequencing experiment and separated it into time points by checking each read for matches to the primers identified in the supplementary information of Paez-Espino et al.

The data files deposited here are a result of our further processing and sorting.

  • CRISPR1_MOI2_deep_spacer_types.txt contains every unique spacer sequence we detected and its assigned type. Spacers were grouped such that sequences within an edit distance of 2 were assigned to the same type.

    • column 1: spacer sequence
    • column 2: spacer type label
  • CRISPR1_MOI2_deep_timepoint_1.txt contains spacers arranged by their position in each read. Each row represents a read, and each column contains a spacer sequence if one was present or zero if no spacer is present in that position. Positions are ordered such that position 0 is next to the wildtype spacers and position 5 is the furthest from wildtype (newest) spacer. The remainder of the files in the data folder are the same structure as this file but for the rest of the time points in the study (2 through 14).

NOTE: time point 8 was nearly completely missing in the original data, which was confirmed through communication with the authors. We excluded time point 8 from our analysis.

We also analyzed data from Parikka et al. (2017) to compare virus-to-prokaryote ratios (VPR) from a variety of environments with our model. Their data is deposited at 10.15454/1.4539792655245962E12.

Loading the data in Python

The following code or something similar will load the data into a three-dimensional numpy array with time on the first axis and n_ij on the other two axes. n_ij = the number of spacers of type i in locus position j, summed over all reads for that time point. (Since the read counts are high, summing over reads allows the entire data to be loaded into memory. This code can be tweaked if you do want the individual read data.)

import numpy as np
import pandas as pd

filepath = "CRISPR_immunity/data"
spacer_types = pd.read_csv("%s/CRISPR1_MOI2_deep_spacer_types.txt" %filepath, index_col=0)
total_spacer_types = np.max(spacer_types['type_label'])
timepoints = [1,2,3,4,5,6,7,9,10,11,12,13,14] # skip time point 8 - no data at this time point
total_spacer_types = np.max(spacer_types['type_label'])
max_j = 6 # never more than 6 spacers added, so 6 is the highest position index in the locus

# store loaded data
spacer_array = np.zeros((len(timepoints),total_spacer_types+1,max_j), dtype = int)
data_list = []

# loop through files
for t, timepoint in enumerate(timepoints):
    data = pd.read_csv("%s/CRISPR1_MOI2_deep_timepoint_%s.txt" %(filepath, timepoint), index_col = 0) # load data as pandas DataFrame
    data_list.append(data)

    # create array - this part takes a while
    for j, colname in enumerate(data.columns): # get spacers from each column in data table
        for spacer in data[colname]:
            if spacer != '0' and spacer != 0: # 0 if no spacer at that position
                spacer_type = int(spacer_types.loc[spacer_types["spacer_sequence"] == spacer]['type_label']) # get spacer type label
                spacer_array[t, spacer_type, j] += 1 # add data to array

Compiling and running simulation code

Simulation code written by Dominique Soutiere.

  1. Install the Eigen package.

  2. To compile the code:

g++ -std=c++0x -I ~ -O3 tauLeapingWithLoss3C0.cpp -o tauLeaping

After the -I flag, put the path to the Eigen folder downloaded in step 1. If it's in the home directory, use -I ~, otherwise use -I path/to/Eigen.

  1. Make a ./data/ folder in the same folder as the compiled C++ code to save the output data to.

  2. To run the code:

./tauLeaping m xi eta R pv0 exp c0Exp repeat initial_x initial_y initial_z initial_nu

Input parameters:

  • m total number of possible spacer types
  • xi = 1-e, where e is the spacer effectiveness ranging from 0 to 1
  • eta spacer acquisition probability (<= 1) if phage is unsuccessful in infection
  • R = r/(gC0), spacer loss rate
  • pv0 = pV, probability of phage success against bacteria without spacers
  • exp maximum iterations for simulation = 10^exp
  • c0Exp C0, nutrient concentration = 10^c0Exp
  • repeat replicate indicator, used only in creating output filename
  • intial_x initial normalized bacterial population size (0 <= x <= 1), x = nB/C0
  • initial_y initial normalized phage population size (y >= 0), y = nV/C0
  • initial_z initial normalized nutrient concentration (0 <= z <= 1), z = C/C0
  • initial_nu initial fraction of bacteria with spacers (0 <= nu <= 1), nu = nBs/nB

Output data structure (by column, indexing starting at 0):

  • Column 0: tau leaping iteration number
  • Column 1: nB0, number of bacteria without a spacer
  • Columns 2 to m+1: nBi, number of bacteria with spacer type i
  • Column m+2: nV, number of phages
  • Column m+3: C, nutrient concentration
  • Column m+4: t, time in minutes; time in generations = t * g * C0
  • Column m+5: method, unused (always 0)

Example usage:

./tauLeaping 500 0.5 0.0001 0.04 0.02 5 7 0 0.3 3 0.7 0

About

Data and code for paper arXiv:1801.10086

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages