# Optimizing `find_spacers`

In testing, finding all the spacers in the human genome with v2.2 looked like it would take ~5 days, versus ~1 hour for MerryCRISPR v1.  That is obviously unacceptable.  For off-target scoring, the fix was to vectorize or use `pandas.DataFrame.apply()`; here I want to see whether that, along with a few other approaches, fixes this issue too.

In [1]:
import gc
import pyfaidx
import pandas as pd
import numpy as np
from Bio import SeqIO
from multiprocessing import cpu_count, Manager, Pool

import progressbar
from Bio.Alphabet import IUPAC, single_letter_alphabet
from Bio.Restriction import RestrictionBatch
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqUtils import GC
from csv import writer
from functools import partial
from operator import attrgetter
from regex import compile
from subprocess import check_output

from azimuth import model_comparison

In [2]:
input_sequences='/Users/milessmith/workspace/mc_human_files/human_exons.fa'
output_library='/Users/milessmith/workspace/workspace/mc_human_files/human_spacers.csv'
restriction_sites=["EcoRI","BamHI"]
largeindex=False

nuclease='SpCas9'
return_limit=9
reject=False
paired=False

off_target_count_threshold=500
off_target_score_threshold=50

on_target_score_threshold=0
number_mismatches_to_consider=3
nuclease="SpCas9"
spacers_per_feature=9
rule_set="Azimuth"
number_upstream_spacers=0
number_downstream_spacers=0
cores=6

## Convert pyfaidx.Fasta to pandas.DataFrame

First problem: `pyfaidx.Fasta` is almost a pseudo-dataframe or numpy recarray.  Working with it would be more convenient if we converted it to a bona fide dataframe.

In [3]:
itemlist = pyfaidx.Fasta(input_sequences)
nucleases = pd.read_csv('/Users/milessmith/workspace/merrycrispr/merrycrispr/data/nuclease_list.csv', 
                        dtype={'nuclease': str,
                               'pam': str,
                               'spacer_regex': str,
                               'start': np.int8,
                               'end': np.int8},
                       skip_blank_lines = True)
nuclease_info = nucleases[nucleases['nuclease'] == nuclease]

In [4]:
itemlist[0][:].seq

'GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA'

In [5]:
itemlist[0][:].reverse.complement.seq

'TGGCCTAGGTTGTGAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAATCTTTCCCCTATGCCTGGTTGTGGTATTAAGTTACATGCAGACAACAGGGGCCAGAAGATGAACAATGGCCCATCCCACTCTAGGCATGGCTCCTCTCCACAGGAAAACTCCACTCCAGTGCTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCTGGAAGTCAGACACCTGCAGATGAAGACCACAGCATCAAGACCCTGTGACCTCTCAAAGGCCCGGTGGAAAGGACACGGGAAGTCTGGGCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAAGGCTGACGGCAAGTTAAC'

In [6]:
itemlist[0].name.split("_")

['DDX11L1', 'exon', '+', '11869', '12227', '-43556487203695757']

In [7]:
np.append(np.append(np.array([itemlist[0].name.split("_")]), itemlist[0][:].seq), itemlist[0][:].reverse.complement.seq)

array(['DDX11L1', 'exon', '+', '11869', '12227', '-43556487203695757',
       'GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA',
       'TGGCCTAGGTTGTGAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAATCTTTCCCCTATGCCTGGTTGTGGTATTAAGTTACATGCAGACAACAGGGGCCAGAAGATGAACAATGGCCCATCCCACTCTAGGCATGGCTCCTCTCCACAGGAAAACTCCACTCCAGTGCTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCTGGAAGTCAGACACCTGCAGATGAAGACCACAGCATCAAGACCCTGTGACCTCTCAAAGGCCCGGTGGAAAGGACACGGGAAGTCTGGGCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAAGGCTGACGGCAAGTTAAC'],
      dtype='<U359')

In [8]:
fasta_df = pd.DataFrame([itemlist[_].name.split("_") for _ in itemlist.keys()],
                        columns=['gene_name','feature_id','strand','start','stop', "seq_hash"])

In [9]:
fasta_df['sequence'] = pd.Series([itemlist[_][:].seq for _ in itemlist.keys()])

In [10]:
fasta_df['reverse_complement'] = pd.Series([itemlist[_][:].reverse.complement.seq for _ in itemlist.keys()])

In [11]:
fasta_df.head()

Unnamed: 0,gene_name,feature_id,strand,start,stop,seq_hash,sequence,reverse_complement
0,DDX11L1,exon,+,11869,12227,-43556487203695757,GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCAT...,TGGCCTAGGTTGTGAGAGAAGTTGATGCTCTCACTCATCTTTCCTC...
1,DDX11L1,exon,+,12613,12721,5288470158420300119,GTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCA...,CTCTGCAACACTGGGGACACTCACAAGAGTGTGATCCAAGTCGGCC...
2,DDX11L1,exon,+,13221,14409,-3494540769226855306,GCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCT...,CAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAG...
3,DDX11L1,exon,+,12010,12057,-7939539020843674572,GTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAG,CTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCTGGAAGTCAGACAC
4,DDX11L1,exon,+,12179,12227,2724784066093171870,TTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA,TGGCCTAGGTTGTGAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAA


## Memory usage

One issue I noticed is that memory usage is pretty high.  Some initial attempts to solve the computation time issue swamped the memory on my MacBook for a while before Python crashed.  We can save some memory by having the genomic dataframe use more appropriate dtypes than the default:

In [12]:
smaller = fasta_df.iloc[1:100000,]

In [13]:
smaller.head()

Unnamed: 0,gene_name,feature_id,strand,start,stop,seq_hash,sequence,reverse_complement
1,DDX11L1,exon,+,12613,12721,5288470158420300119,GTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCA...,CTCTGCAACACTGGGGACACTCACAAGAGTGTGATCCAAGTCGGCC...
2,DDX11L1,exon,+,13221,14409,-3494540769226855306,GCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCT...,CAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAG...
3,DDX11L1,exon,+,12010,12057,-7939539020843674572,GTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAG,CTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCTGGAAGTCAGACAC
4,DDX11L1,exon,+,12179,12227,2724784066093171870,TTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA,TGGCCTAGGTTGTGAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAA
5,DDX11L1,exon,+,12613,12697,-8686332244534104149,GTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCA...,AAGAGTGTGATCCAAGTCGGCCGTCGTCTTCTGCAGCTCTGGAGAC...


In [14]:
smaller.dtypes

gene_name             object
feature_id            object
strand                object
start                 object
stop                  object
seq_hash              object
sequence              object
reverse_complement    object
dtype: object

In [15]:
smaller.memory_usage(deep=True) / 1024 ** 2

Index                  0.000080
gene_name              6.027116
feature_id             5.817355
strand                 5.921026
start                  6.232922
stop                   6.232924
seq_hash               7.283731
sequence              37.806230
reverse_complement    37.806245
dtype: float64

In [17]:
def memory_usage(df):
    return(round(df.memory_usage(deep=True).sum() / 1024 ** 2, 2))

In [18]:
memory_usage(smaller)

113.13

In [19]:
np.max(fasta_df['start'])

'99999383'

In [20]:
np.max(fasta_df['stop'])

'99999492'

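We need to use a 32-bit int for position.  Note that `start` and `stop` are still strings at this point, so `np.max` returns the lexicographically largest value rather than the numerically largest one.  Human chromosome coordinates stay below 2.5 × 10^8, though, which is well within the `uint32` range and far beyond `uint16`.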
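Because the coordinates are stored as strings, it is worth checking the numeric maximum before picking a dtype. The following is a minimal sketch on made-up values (the `starts` series and its contents are hypothetical, not taken from the exon file) that contrasts the two kinds of maximum and compares the result against the integer limits reported by `np.iinfo`:

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates stored as strings, as they are after name.split("_")
starts = pd.Series(["99999383", "248937043", "11869"])

lexicographic_max = starts.max()            # compares character by character
numeric_max = pd.to_numeric(starts).max()   # compares as integers

# Largest value each candidate dtype can hold
uint16_limit = np.iinfo(np.uint16).max
uint32_limit = np.iinfo(np.uint32).max

fits_uint16 = bool(numeric_max <= uint16_limit)
fits_uint32 = bool(numeric_max <= uint32_limit)
```

With these toy values the string maximum is the shorter number that starts with `9`, while the numeric maximum is the nine-digit one; only `uint32` is wide enough for it.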

In [21]:
smaller['feature_id'] = smaller['feature_id'].astype('category')
smaller['gene_name'] = smaller['gene_name'].astype('category')
smaller['strand'] = smaller['strand'].astype('category')
smaller['start'] = smaller['start'].astype(np.uint32)
smaller['stop'] = smaller['stop'].astype(np.uint32)
# caution: the hashes are 64-bit values, so int32 cannot hold them; np.int64 would preserve them
smaller['seq_hash'] = smaller['seq_hash'].astype(np.int32)
smaller.dtypes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

gene_name             category
feature_id            category
strand                category
start                   uint32
stop                    uint32
seq_hash                 int32
sequence                object
reverse_complement      object
dtype: object

In [22]:
smaller.memory_usage(deep=True) / 1024 ** 2

Index                  0.000080
gene_name              0.984830
feature_id             0.095501
strand                 0.095561
start                  0.381466
stop                   0.381466
seq_hash               0.381466
sequence              37.806230
reverse_complement    37.806245
dtype: float64

In [23]:
memory_usage(smaller)

77.93

In [26]:
round((113.13 - 77.93) / 113.13 * 100, 2)

31.11

In [27]:
def fasta_to_df(fasta: pyfaidx.Fasta) -> pd.DataFrame:
    df = pd.DataFrame(
        [fasta[_].name.split("_") for _ in fasta.keys()],
        columns=["gene_name", "feature_id", "strand", "start", "stop", "seq_hash"],
    )

    df = df.astype({
        'feature_id':'category',
        'gene_name':'category',
        'strand':'category',
        'start':np.uint32,
        'stop':np.uint32,
        'seq_hash': np.int32
    }, copy = False)

    df["sequence"] = pd.Series([fasta[_][:].seq for _ in fasta.keys()])
    df["reverse_complement"] = pd.Series(
        [fasta[_][:].reverse.complement.seq for _ in fasta.keys()]
    )
    return df

In [28]:
new_fasta_df = fasta_to_df(itemlist)

In [29]:
new_fasta_df.head()

Unnamed: 0,gene_name,feature_id,strand,start,stop,seq_hash,sequence,reverse_complement
0,DDX11L1,exon,+,11869,12227,210719603,GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCAT...,TGGCCTAGGTTGTGAGAGAAGTTGATGCTCTCACTCATCTTTCCTC...
1,DDX11L1,exon,+,12613,12721,-303738537,GTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCA...,CTCTGCAACACTGGGGACACTCACAAGAGTGTGATCCAAGTCGGCC...
2,DDX11L1,exon,+,13221,14409,1830612086,GCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCT...,CAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAG...
3,DDX11L1,exon,+,12010,12057,1984074804,GTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAG,CTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCTGGAAGTCAGACAC
4,DDX11L1,exon,+,12179,12227,-316619618,TTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA,TGGCCTAGGTTGTGAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAA


In [30]:
new_fasta_df.dtypes

gene_name             category
feature_id            category
strand                category
start                   uint32
stop                    uint32
seq_hash                 int32
sequence                object
reverse_complement      object
dtype: object
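The `seq_hash` values shown by `new_fasta_df.head()` differ from those in `fasta_df.head()`, which suggests the `int32` cast did not preserve the original 64-bit hashes. The sketch below illustrates the effect on a single value with plain NumPy. It assumes the hash is already an `int64` array, whereas the notebook casts from strings, so the exact behaviour there may depend on the pandas version:

```python
import numpy as np

# seq_hash of the first exon, as shown in fasta_df.head()
original = np.array([-43556487203695757], dtype=np.int64)

# astype() uses unsafe casting by default: the high bits are silently dropped
truncated = original.astype(np.int32)
round_trip = truncated.astype(np.int64)

survives_int32 = bool(round_trip[0] == original[0])
survives_int64 = bool(original.astype(np.int64)[0] == original[0])
```

If the hash is meant to identify a sequence, keeping it as `np.int64` costs 4 extra bytes per row and avoids the added collision risk of a 32-bit value.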

## Extract spacers

The code to find spacers needs to be revised so that it works over pandas data structures (dataframes and series) and uses pandas methods where possible.

In [31]:
from tqdm.autonotebook import tqdm



In [167]:
spacer_regex = compile(nuclease_info['spacer_regex'].item())
polyT_regex = compile("([AGCT](?!T{4})){30}")
BsmBI_fwd = "GAGACG"
BsmBI_rev = "CGTCTC"
spacer_start = int(nuclease_info['start'].item())
spacer_end = int(nuclease_info['end'].item()) +1

In [147]:
nuclease_info['spacer_regex'].item()

'(?i)[ACGT]{25}[G]{2}[ACGT]{3}'

In [82]:
spacer_end

24

In [81]:
spacer_start

5

In [33]:
tqdm.pandas(desc="finding forward spacers", unit="sequences")
fasta_df["forward_spacers"] = fasta_df["sequence"].progress_apply(spacer_regex.findall)

HBox(children=(IntProgress(value=0, description='finding forward spacers', max=745513, style=ProgressStyle(des…




In [34]:
tqdm.pandas(desc="finding reverse spacers", unit="sequences")
fasta_df["reverse_spacers"] = fasta_df["reverse_complement"].progress_apply(spacer_regex.findall)

HBox(children=(IntProgress(value=0, description='finding reverse spacers', max=745513, style=ProgressStyle(des…




In [35]:
fasta_df.head()

Unnamed: 0,gene_name,feature_id,strand,start,stop,seq_hash,sequence,reverse_complement,forward_spacers,reverse_spacers
0,DDX11L1,exon,+,11869,12227,-43556487203695757,GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCAT...,TGGCCTAGGTTGTGAGAGAAGTTGATGCTCTCACTCATCTTTCCTC...,"[CCAGACTTCCCGTGTCCTTTCCACCGGGCC, AGAGGTCACAGGG...","[TTCCTCCAATCTTTCCCCTATGCCTGGTTG, GGTATTAAGTTAC..."
1,DDX11L1,exon,+,12613,12721,5288470158420300119,GTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCA...,CTCTGCAACACTGGGGACACTCACAAGAGTGTGATCCAAGTCGGCC...,"[CCAGGCATGCCCTTCCCCAGCATCAGGTCT, AGCTGCAGAAGAC...","[CACTCACAAGAGTGTGATCCAAGTCGGCCG, TCTGCAGCTCTGG..."
2,DDX11L1,exon,+,13221,14409,-3494540769226855306,GCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCT...,CAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAG...,"[GGGATTCTGCCAGCATAGTGCTCCTGGACC, AGTGATACACCCG...","[AAACCAACAGTGTGCTTTTAATAAAGGATC, TAGCTGTGCAGGA..."
3,DDX11L1,exon,+,12010,12057,-7939539020843674572,GTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAG,CTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCTGGAAGTCAGACAC,[TTCCAGCAACTGCTGGCCTGTGCCAGGGTG],[ACCCTGGCACAGGCCAGCAGTTGCTGGAAG]
4,DDX11L1,exon,+,12179,12227,2724784066093171870,TTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA,TGGCCTAGGTTGTGAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAA,[GAGCATCAACTTCTCTCACAACCTAGGCCA],[]


The raw sequences are no longer needed, so drop them to free memory.

In [36]:
fasta_df = fasta_df.drop(columns=['sequence','reverse_complement'])

In [38]:
smaller = fasta_df.iloc[:100000, ]

In [39]:
smaller.head()

Unnamed: 0,gene_name,feature_id,strand,start,stop,seq_hash,forward_spacers,reverse_spacers
0,DDX11L1,exon,+,11869,12227,-43556487203695757,"[CCAGACTTCCCGTGTCCTTTCCACCGGGCC, AGAGGTCACAGGG...","[TTCCTCCAATCTTTCCCCTATGCCTGGTTG, GGTATTAAGTTAC..."
1,DDX11L1,exon,+,12613,12721,5288470158420300119,"[CCAGGCATGCCCTTCCCCAGCATCAGGTCT, AGCTGCAGAAGAC...","[CACTCACAAGAGTGTGATCCAAGTCGGCCG, TCTGCAGCTCTGG..."
2,DDX11L1,exon,+,13221,14409,-3494540769226855306,"[GGGATTCTGCCAGCATAGTGCTCCTGGACC, AGTGATACACCCG...","[AAACCAACAGTGTGCTTTTAATAAAGGATC, TAGCTGTGCAGGA..."
3,DDX11L1,exon,+,12010,12057,-7939539020843674572,[TTCCAGCAACTGCTGGCCTGTGCCAGGGTG],[ACCCTGGCACAGGCCAGCAGTTGCTGGAAG]
4,DDX11L1,exon,+,12179,12227,2724784066093171870,[GAGCATCAACTTCTCTCACAACCTAGGCCA],[]


Need to separate the lists of spacers while maintaining the information in the other columns.  This, unfortunately, results in a dataframe that is way too large to fit into memory and results in segfaults.  Two options, as I see it, are to work in chunks and eliminate what we can and/or to move to a module like dask that can accommodate extremely large dataframes.

In [40]:
tqdm.pandas(desc="separating forward spacer lists", unit="sequences")
sfs = smaller.progress_apply(lambda x: pd.Series(x['forward_spacers'], 
                                                       dtype='str'),
                                                 axis=1)

HBox(children=(IntProgress(value=0, description='separating forward spacer lists', max=100000, style=ProgressS…




In [41]:
from copy import deepcopy

In [42]:
sfs = sfs.stack().reset_index(level=1).drop(columns=['level_1'])
sfs.shape[0]

637712

In [46]:
print(f"There are {sfs.shape[0]/smaller.shape[0]:.2} spacers for each original forward sequence")

There are 6.4 spacers for each original forward sequence


In [47]:
sfs = sfs.rename(columns={0:'spacer'})
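The row-wise `apply` plus `stack` used above builds a wide intermediate frame with one column per spacer position. A possible alternative, sketched here on toy data, is `Series.explode`, which goes straight to one row per spacer. This assumes pandas 0.25 or later, and the `toy` frame and its values are invented for illustration:

```python
import pandas as pd

# Toy stand-in for `smaller`: one row per exon, one list of spacers per row
toy = pd.DataFrame({
    "gene_name": ["GENE_A", "GENE_A", "GENE_B"],
    "forward_spacers": [["AAAC", "CCGT"], [], ["GGTA"]],
})

# explode() emits one row per list element and repeats the original index label
exploded = (
    toy["forward_spacers"]
    .explode()
    .dropna()            # empty lists become a single NaN row; drop those
    .rename("spacer")
    .to_frame()
)
```

The index of `exploded` still carries the original row labels, so the per-exon columns can be joined back on the index only when they are needed.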

### Pruning spacers
The first thing we can do is apply our set of filters:
* eliminate spacers with a polyT sequence
* eliminate spacers with a BsmBI recognition sequence
* eliminate spacers with a GC content <20% or >80%

Optionally, we could (probably should) also score spacers here and apply the cutoff.

In [48]:
sfs_bak = deepcopy(sfs)

In [56]:
sfs_bak.shape[0]

637712

In [185]:
sfs = deepcopy(sfs_bak)
sfs2 = deepcopy(sfs_bak)

The most straightforward way:

In [117]:
# eliminate those with a polyT
sfs = sfs[sfs['spacer'].str.match('([AGCT](?!T{4})){30}')]
# eliminate those with a BsmBI:
sfs = sfs[~sfs.apply(lambda x: BsmBI_fwd in x['spacer'][spacer_start:spacer_end], axis=1)]
sfs = sfs[~sfs.apply(lambda x: BsmBI_rev in x['spacer'][spacer_start:spacer_end], axis=1)]

However, that is three separate steps *and* lambda functions impose a speed penalty.  I don't see why we can't move the search for BsmBI into the regex search.

In [None]:
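# Scratch notes (not runnable): candidate patterns for folding the polyT and BsmBI checks into one regex
# [ACGT]{spacer_start-1}([ACGT](?!T{{4,}}|GAGACG|CGTCTC)){}[ACGT]{}
# [ACGT]{4}([ACGT](?!T{4,}|GAGACG|CGTCTC)){20}[ACGT]{6}
# ([ACGT](?!T{4,}|GAGACG|CGTCTC)){30}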

In [177]:
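# sfs keeps the original row labels, so ['spacer'][0] returns every spacer from row 0 (a Series);
# the slice below therefore selects rows, not characters.  Use .iloc[0] to get a single string.
len(sfs['spacer'][0][spacer_start:spacer_end])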

2

In [179]:
sfs['spacer'].iloc[0][spacer_start:spacer_end]

'CTTCCCGTGTCCTTTCCACC'

In [186]:
sfs2 = sfs2[sfs2['spacer'].str[spacer_start:spacer_end].str.match(f'^((?!T{{4,}}|{BsmBI_fwd}|{BsmBI_rev}).)*$')]

In [119]:
sfs.shape[0]

564454

In [187]:
sfs2.shape[0]

575578

In [188]:
sfs2[sfs2.apply(lambda x: "TTTT" in x['spacer'][spacer_start:spacer_end], axis=1)]

Unnamed: 0,spacer


In [183]:
"ATAGTTTTTATTGTGACCTTCCTAGGGTAA"[spacer_start:spacer_end]

'TTTTATTGTGACCTTCCTAG'

So that is a ~12-fold speedup.  I think the difference in sizes comes from the first method matching a polyT anywhere in the 30-nt sequence, whereas the second method only looks inside the `spacer_start:spacer_end` window and so ignores the flanking "panhandles" that are kept for scoring.
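To make the single-pass filter concrete, here is a small sketch on three hand-picked 30-mers. It applies the same negative-lookahead pattern to the protospacer window and compares it with a plainer formulation, `~str.contains(...)`, which should select the same rows. The window bounds of 5 and 25 are an assumption inferred from the 20-nt slice shown above, and the third sequence is invented:

```python
import pandas as pd

BsmBI_fwd = "GAGACG"
BsmBI_rev = "CGTCTC"
spacer_start, spacer_end = 5, 25   # assumed bounds of the 20-nt protospacer window

toy = pd.Series([
    "CCAGACTTCCCGTGTCCTTTCCACCGGGCC",   # clean
    "ATAGTTTTTATTGTGACCTTCCTAGGGTAA",   # TTTT inside the window
    "CCAGACTTGAGACGCCTTTCCACCGGGCCA",   # invented: BsmBI site inside the window
])

window = toy.str[spacer_start:spacer_end]
forbidden = f"T{{4,}}|{BsmBI_fwd}|{BsmBI_rev}"

# Negative lookahead: no position in the window may start a forbidden motif
lookahead_keep = window.str.match(f"^((?!{forbidden}).)*$")

# Equivalent and arguably easier to read: keep windows that contain no motif
contains_keep = ~window.str.contains(forbidden)
```

Both masks keep only the first sequence here; which form is faster on the full table would need to be timed.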

In [192]:
sfs_new_bak = deepcopy(sfs2)

In [193]:
sfs = deepcopy(sfs_new_bak)
sfs2 = deepcopy(sfs_new_bak)
sfs3 = deepcopy(sfs_new_bak)
sfs4 = deepcopy(sfs_new_bak)

In [194]:
sfs = sfs[sfs.apply(lambda x: 20 < GC(Seq(x['spacer'][spacer_start:spacer_end], IUPAC.unambiguous_dna)) < 80, axis=1)]
# sfs = sfs[sfs.apply(lambda x: GC(Seq(x['spacer'][spacer_start:spacer_end], IUPAC.unambiguous_dna)) > 20, axis=1)]

Again, a slow lambda function that works.

In [195]:
sfs2 = sfs2[sfs2['spacer'].apply(Seq).apply(GC).between(20,80)]

Chaining conversion to a `Bio.Seq.Seq` object and `Bio.SeqUtils.GC` is twice as fast, but we could combine those two `apply` calls into one:

In [196]:
def seq_gc(seq):
    return GC(Seq(seq))

In [197]:
sfs3 = sfs3[sfs3['spacer'].apply(seq_gc).between(20,80)]

Marginal improvement, but we really don't need to create a `Seq` object just to calculate GC content.

In [199]:
import re

In [200]:
def newGC(seq: str) -> float:
    # re.findall is marginally faster than regex.findall
    gc = len(re.findall(string=seq, pattern="[GgCc]"))
    try:
        return gc * 100.0 / len(seq)
    except ZeroDivisionError:
        return 0.0

In [201]:
sfs4 = sfs4[sfs4['spacer'].apply(newGC).between(20,80)]

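That cuts the scan time roughly 10-fold and lets us drop `Bio.SeqUtils.GC`.  Note that `sfs2` through `sfs4` compute GC over the full 30-nt sequence and use the inclusive `between(20, 80)`, whereas the first version used only the `spacer_start:spacer_end` window with strict inequalities, so the filtered row counts are not directly comparable.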
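The GC calculation can probably be pushed one step further by avoiding `apply` altogether and using pandas' vectorized string methods. This is a sketch on invented 4-nt strings rather than a benchmarked replacement:

```python
import pandas as pd

spacers = pd.Series(["GGCC", "ATAT", "ACGT", "AAAG"])

# Count G/C characters and divide by length, both as vectorized string operations
gc_percent = spacers.str.count("[GCgc]") * 100 / spacers.str.len()

# between() is inclusive on both ends by default
keep = gc_percent.between(20, 80)
```

To restrict the calculation to the protospacer, the same expression can be applied to `spacers.str[spacer_start:spacer_end]` instead of the full string.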

In [202]:
(1 - (sfs.shape[0]/sfs_bak.shape[0]))*100

15.523935569661539

In [203]:
sfs = deepcopy(sfs4)

In [204]:
sfs.head()

Unnamed: 0,spacer
0,CCAGACTTCCCGTGTCCTTTCCACCGGGCC
0,AGAGGTCACAGGGTCTTGATGCTGTGGTCT
0,AGGTGTCTGACTTCCAGCAACTGCTGGCCT
0,CAGGGTGCAAGCTGAGCACTGGAGTGGAGT
0,TGTGGAGAGGAGCCATGCCTAGAGTGGGAT


Eliminating ~15% is a start, but probably not going to solve all my memory problems.

How many of those are unique?

In [205]:
len(sfs['spacer'])

567056

In [206]:
len(sfs['spacer'].unique())

331751

Oh, that's a lot.  And that is the number of distinct values, not the number of spacers that occur exactly once.

In [207]:
sfs_counts = sfs['spacer'].value_counts().reset_index().rename(columns={'index':'spacer','spacer':'count'})

In [208]:
sfs = sfs.reset_index().merge(sfs_counts, on='spacer', left_index=True)
sfs = sfs[sfs['count'] == 1]
sfs.shape[0]

212837
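The count-and-merge step above can likely be replaced by a boolean mask. The sketch below, on invented values, uses `Series.duplicated(keep=False)` to flag every member of a repeated group and also shows how that differs from the number of distinct values:

```python
import pandas as pd

toy = pd.DataFrame({"spacer": ["AAA", "CCC", "AAA", "GGG", "CCC", "TTT"]})

# keep=False marks every member of a duplicated group, not just the later repeats
singletons = toy[~toy["spacer"].duplicated(keep=False)]

n_distinct = toy["spacer"].nunique()   # distinct values, repeated or not
n_singletons = len(singletons)         # values that occur exactly once
```

This avoids building the counts table and the merge, and it does not depend on how `value_counts().reset_index()` names its columns, which has changed between pandas versions.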

In [209]:
(1 - (sfs.shape[0]/sfs_bak.shape[0]))*100

66.62490277742931

## memory_usage(sfs_bak)

In [None]:
memory_usage(sfs)

Okay, wow... that is what I was looking for.

In [None]:
smaller.set_index('seq_hash').groupby('gene_name').head()

In [None]:
chunked = np.array_split(smaller, 6)

In [None]:
chunked = np.array_split(fasta_df, 16)

Well, that cuts it in half.  The two lists of spacers are still the largest contributors to memory use.

In [None]:
import dask
import dask.dataframe as dd
import dask.bag as db

In [None]:
# example of using a disk-backed dask dataframe, which is probably where we need to go.
# store = pd.HDFStore('./data/clickstream_store.h5')
# top_links_dask = top_links_grouped_dask.sum().nlargest(20, 'n')
# store.put('top_links_dask',
#            top_links_dask.compute(),
#            format='table',
#            data_columns=True)

In [None]:
?dd.from_pandas

In [None]:
smaller.head()

In [None]:
smaller_dd = dd.from_pandas(smaller, npartitions=16)

In [None]:
smaller_dd.head()

In [None]:
smaller_dd.dtypes

In [None]:
store = pd.HDFStore("smaller_dd_store.h5")

In [None]:
# tqdm.pandas(desc="separating forward spacer lists", unit="sequences")
forwards = smaller_dd.apply(lambda x: pd.Series(x['forward_spacers']),
                            axis=1,
                            meta={"0": 'object'})

In [None]:
forwards

In [None]:
?smaller_dd.compute_meta()

In [None]:
forwards.head()

In [None]:
# HDFStore.put needs a key as its first argument; "forwards" is an arbitrary name
store.put("forwards", forwards.compute())

In [None]:
tqdm.pandas(desc="separating forward spacer lists", unit="sequences")
sfs = smaller.progress_apply(lambda x: pd.Series(x['forward_spacers'], 
                                                       dtype='str'),
                                                 axis=1)
sfs = sfs.stack().reset_index(level=1).drop(columns=['level_1'])

In [None]:
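chunksize = 10000
# divmod gives the number of full chunks and the rows left over for the last, partial chunk
chunks, remainder = divmod(fasta_df.shape[0], chunksize)

# sanity check: this should equal fasta_df.shape[0]
print(f"{(chunks*chunksize)+remainder}")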
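One way to act on the chunk arithmetic is a small generator that yields row slices, so each chunk can be exploded and filtered before the next one is loaded. This is a sketch only: `iter_chunks` is a hypothetical helper, and the 25-row frame stands in for `fasta_df`:

```python
import pandas as pd

def iter_chunks(df, chunksize):
    """Yield consecutive row slices of at most `chunksize` rows."""
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]

# Toy frame standing in for fasta_df
toy = pd.DataFrame({"spacer": [f"S{i}" for i in range(25)]})

n_full, remainder = divmod(len(toy), 10)
sizes = [len(chunk) for chunk in iter_chunks(toy, 10)]
```

The last chunk simply comes out shorter, so the remainder never has to be handled as a special case.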

In [None]:
fasta_df.shape[0]

In [None]:
forwards.head()

In [None]:
tqdm.pandas(desc="separating reverse lists", unit="sequences")
reverse = fasta_df['reverse_spacers'].progress_apply(pd.Series)

In [None]:
rsb = RestrictionBatch(restriction_sites)

In [None]:
spacer_df = pd.DataFrame(columns=['gene_name','feature_id','start','stop','strand','spacer'])

In [None]:
spacer_df

In [None]:
itemlist

In [None]:
itemlist.keys()

In [None]:
for item in itemlist.keys():
    # have to use the alternative Regex module instead of Re so that findall can detect overlapping
    # sequences
    spacers = (spacer_regex.findall(itemlist[item][:].seq, overlapped=True) +
                   spacer_regex.findall(itemlist[item][:].reverse.complement.seq, overlapped=True))

    info = dict(zip(['gene_name', 'feature_id', 'strand', 'start', 'end', "misc"], item.split("_")))

    for ps in spacers:
        # Note that ps[spacer_start:spacer_end] is the actual protospacer.  I need the rest of the sequence for scoring
        ps_seq = Seq(ps[spacer_start:spacer_end], IUPAC.unambiguous_dna)
        ps_full_seq = Seq(ps, IUPAC.unambiguous_dna)

        # Get rid of anything with T(4+) as those act as RNAPIII terminators
        if "TTTT" in ps:
            # TODO Should this also eliminate anything with G(4)?
            pass
        # Get rid of anything that has the verboten restriction sites
        elif bool([y for y in rsb.search(ps_full_seq).values() if y != []]):
            pass
        # BsmBI/Esp3I is used in most of the new CRISPR vectors, especially for library construction.
        # Biopython misses potential restriction sites as it tries to match GAGACGN(5), whereas we need to find
        # matches of just the GAGACG core.  The next four lines take care of that.
        elif 'GAGACG' in ps[spacer_start:spacer_end]:
            pass
        elif 'CGTCTC' in ps[spacer_start:spacer_end]:
            pass
        # Eliminate potentials with a GC content <20 or >80%
        elif GC(ps_seq) <= 20 or GC(ps_seq) >= 80:
            pass
        else:
            ps_start = itemlist[item][:].seq.find(ps) + int(info['start'])
            spacer_data = {'gene_name': [info['gene_name']], 
                           'feature_id': [info['feature_id']], 
                           'start': [ps_start], 
                           'stop': [ps_start+len(ps)], 
                           'strand': [info['strand']], 
                           'spacer': [ps]}
            _ = pd.DataFrame.from_dict(spacer_data)
            # TODO change the spacer here to include 'NGG' so that it is taken into account by Bowtie?
            spacer_df = pd.concat([spacer_df,_])
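A likely contributor to the slow run time of this loop is the `pd.concat` call inside it, which copies the whole accumulated frame on every accepted spacer. A common alternative is to collect plain dicts in a list and build the dataframe once at the end. The sketch below uses invented hits in place of the filtered spacers:

```python
import pandas as pd

# Hypothetical hits that passed the filters: (gene, feature, strand, start, spacer)
hits = [
    ("GENE_A", "exon", "+", 100, "ACGTACGTAC"),
    ("GENE_A", "exon", "+", 140, "TTGACCGTAA"),
    ("GENE_B", "exon", "-", 900, "GGCATCGATC"),
]

rows = []
for gene_name, feature_id, strand, start, spacer in hits:
    # Appending to a list is cheap; nothing is copied until the end
    rows.append({
        "gene_name": gene_name,
        "feature_id": feature_id,
        "start": start,
        "stop": start + len(spacer),
        "strand": strand,
        "spacer": spacer,
    })

# Build the dataframe in a single call after the loop
spacer_df = pd.DataFrame(
    rows,
    columns=["gene_name", "feature_id", "start", "stop", "strand", "spacer"],
)
```

Separately, `Bio.Alphabet` was removed in Biopython 1.78, so on a current install the `IUPAC` arguments in the loop above would need to be dropped.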

In [None]:
spacer_df.head()

In [None]:
spacer_df['spacer'].values

In [None]:
predicted_scores = model_comparison.predict(spacer_df['spacer'].values)

In [None]:
spacer_df['score'] = predicted_scores

In [None]:
spacer_df.head()

In [None]:
spacer_df[spacer_df['score'] > 0.75].head()

In [None]:
%store spacer_df

In [None]:
%store nuclease_info