In this notebook I create filters (in the form of boolean NumPy arrays) that mark sequences that are also contained in the Merck&Co dataset (duplicates) or that contain noncanonical amino acids. In each filter, `True` marks a sequence to keep and `False` marks one to remove. These filters can later be applied to *BacDive+* to remove the corresponding sequences.

## Set up notebook and environment: ###

### Connect to google drive: ####

In [0]:
import os
import numpy as np
import pandas as pd

Using TensorFlow backend.


In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Specify location of all relevant data: (YOU HAVE TO INSERT YOUR OWN FILE LOCATIONS HERE) ####

In [0]:
print("Check correctness of locations: ")
data_folder_location = "gdrive/My Drive/iGEM/Databases/BacDive/Data/"
sequence_location_m_and_c_T1626 = "gdrive/My Drive/iGEM/Databases/Merck&Co/Data/T1626/sequence_sampler_T1626/T1626_sequences_only.csv"
sequence_location_m_and_c_T251 = "gdrive/My Drive/iGEM/Databases/Merck&Co/Data/T251/sequence_sampler_T251/T251_sequences_only.csv"
sequence_location_m_and_c_T96 = "gdrive/My Drive/iGEM/Databases/Merck&Co/Data/T96/sequence_sampler_T96/T96_sequences_only.csv"
X_seq_raw_location = data_folder_location+"bac_dive_ext_sequences_raw.csv"
X_location = data_folder_location+"X.npy"
print(os.path.isfile(sequence_location_m_and_c_T1626))
print(os.path.isfile(sequence_location_m_and_c_T251))
print(os.path.isfile(sequence_location_m_and_c_T96))
print(os.path.isdir(data_folder_location))
print(os.path.isfile(X_seq_raw_location))
print(os.path.isfile(X_location))

Check correctness of locations: 
True
True
True
True
True
True
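
A column of `True`/`False` values is hard to match back to the individual paths. As a minimal sketch (the helper `find_missing` and the example paths are hypothetical, not part of the original notebook), the check could instead report which locations are missing:

```python
import os

def find_missing(paths):
    """Return the entries of `paths` that exist neither as a file nor as a directory."""
    return [p for p in paths if not os.path.exists(p)]

# Hypothetical example paths; in the notebook these would be the locations defined above.
example_paths = [
    os.getcwd(),                                               # exists
    os.path.join(os.getcwd(), "this_file_should_not_exist.csv"),  # assumed missing
]

missing = find_missing(example_paths)
print("Missing locations:", missing)
```

An empty list then means every location was found.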


In [0]:
X_seq_raw = pd.read_csv(X_seq_raw_location)

In [0]:
X = np.load(X_location)

## Create combined filter for AA and Merck&Co

### Create Merck&Co unique sequence set

In [0]:
T1626_seq_table = pd.read_csv(sequence_location_m_and_c_T1626, index_col="Unnamed: 0")
T251_seq_table = pd.read_csv(sequence_location_m_and_c_T251, index_col="Unnamed: 0")
T96_seq_table = pd.read_csv(sequence_location_m_and_c_T96, index_col="Unnamed: 0")

In [0]:
T1626_seq_table.head() 

Unnamed: 0,Mutation,base_sequence,mutated_sequence,mutation_offsets
0,1AKY@A@I213F,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,0
1,1AKY@A@N169D,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,0
2,1AKY@A@Q48E,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,0
3,1AKY@A@T110H,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,0
4,1AKY@A@T77H,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,SSESIRMVLIGPPGAGKGTQAPNLQERFHAAHLATGDMLRSQIAKG...,0


In [0]:
T1626_wt_unique = T1626_seq_table.base_sequence.unique()
T1626_mut_unique = T1626_seq_table.mutated_sequence.unique()
T251_wt_unique = T251_seq_table.base_sequence.unique()
T251_mut_unique = T251_seq_table.mutated_sequence.unique()
T96_wt_unique = T96_seq_table.base_sequence.unique()
T96_mut_unique = T96_seq_table.mutated_sequence.unique()

In [0]:
all_unique_seq_m_and_c = list(set(np.concatenate((T1626_wt_unique, T1626_mut_unique, T251_wt_unique, T251_mut_unique, T96_wt_unique, T96_mut_unique), axis=0)))

### Create filter for duplicates in BacDive

In [0]:
filter_duplicates = list(map(lambda x: not(x in all_unique_seq_m_and_c), X_seq_raw.protein_seq.values))
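
Because `all_unique_seq_m_and_c` is a list, every `x in ...` test above scans the whole list, which can be slow for millions of sequences. The sketch below shows two alternatives that should give the same mask on toy data: membership in a `set`, and `pandas.Series.isin`. The toy sequences and the names `bacdive_seqs`, `merck_seqs`, `keep_set` and `keep_isin` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up data) for X_seq_raw.protein_seq and the Merck&Co sequences.
bacdive_seqs = pd.Series(["MKV", "AAA", "MSE", "AAA"])
merck_seqs = ["AAA", "GGG"]

# Alternative 1: a set gives average constant-time lookups instead of a list scan.
merck_set = set(merck_seqs)
keep_set = np.array([seq not in merck_set for seq in bacdive_seqs.values])

# Alternative 2: let pandas run the membership test in a single call.
keep_isin = ~bacdive_seqs.isin(merck_set).values

print(keep_set)   # True = keep, False = duplicate of a Merck&Co sequence
print(keep_isin)
```

Both masks follow the same convention as `filter_duplicates`: `True` means the sequence is not in the Merck&Co set.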

In [0]:
np.logical_not(np.array(filter_duplicates)).sum() #number of duplicates



```
88
```



In [0]:
np.array(filter_duplicates).sum() #number of unaffected sequences 



```
7708683
```



In [0]:
np.save(data_folder_location+"merck_and_co_duplicates_filter.npy", np.array(filter_duplicates))

### Create filter for noncanonical AAs

In [0]:
filter_noncanonical_AAs = list(map(lambda x: not(21 in x), X))
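
The row-by-row loop above can also be written as a single vectorized NumPy expression. This is a sketch under two assumptions that the notebook does not state explicitly: that `X` is a 2D integer array with one row per sequence, and that the code `21` stands for a noncanonical residue. The toy matrix `X_toy` is made up.

```python
import numpy as np

NONCANONICAL_CODE = 21  # value taken from the cell above; its meaning is an assumption

# Toy encoded matrix (made-up data): one row per sequence, one integer per residue.
X_toy = np.array([
    [1, 5, 20, 0],
    [3, 21, 7, 0],
    [21, 21, 2, 4],
])

# Loop version, equivalent to the cell above.
keep_loop = np.array([NONCANONICAL_CODE not in row for row in X_toy])

# Vectorized version: compare every entry, then check each row for any match.
keep_vec = ~(X_toy == NONCANONICAL_CODE).any(axis=1)

print(keep_loop)
print(keep_vec)
```

The vectorized form builds a temporary boolean array with the same shape as the input, so for a very large `X` it trades memory for speed.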


In [0]:
np.logical_not(np.array(filter_noncanonical_AAs)).sum() #number of proteins containing noncanonical AAs 



```
7400
```



In [0]:
np.save(data_folder_location+"noncanonical_AAs_filter.npy", np.array(filter_noncanonical_AAs))

### Create combined filter

In [0]:
filter_dup = np.load(data_folder_location+"merck_and_co_duplicates_filter.npy")

In [0]:
filter_aa = np.load(data_folder_location+"noncanonical_AAs_filter.npy")

In [0]:
filter_concat = filter_dup & filter_aa

In [0]:
filter_concat.sum() #remaining after filtering



```
7701283
```



In [0]:
np.save(data_folder_location+"merck_and_co_dup_aa_concat_filter.npy", filter_concat)
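
Applying a saved filter later comes down to boolean indexing. The following sketch uses made-up toy arrays in place of `X`, `X_seq_raw` and the combined filter, and assumes that the filter has one entry per row in the same order as the data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up data) for X, X_seq_raw and the saved combined filter.
X_toy = np.arange(12).reshape(4, 3)
seq_toy = pd.DataFrame({"protein_seq": ["MKV", "AAA", "MSE", "GGG"]})
filter_toy = np.array([True, False, True, True])

# The mask needs exactly one entry per row of the data it is applied to.
assert filter_toy.shape[0] == X_toy.shape[0] == len(seq_toy)

# Boolean indexing keeps only the rows where the mask is True.
X_filtered = X_toy[filter_toy]
seq_filtered = seq_toy[filter_toy].reset_index(drop=True)

print(X_filtered.shape)
print(seq_filtered)
```

Resetting the index keeps the filtered table aligned with the row positions of the filtered array.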

## Results

* 88 sequences are duplicates of sequences in Merck&Co
* 7400 additional sequences contain noncanonical AAs
* 7701283 sequences remain in the final *BacDive+* dataset. This number will shrink further when a specific label like *AOGTR* is used, because it is not possible to calculate it for every entry
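
The counts reported in the cells above can be cross-checked with a little arithmetic. The variable names below are chosen here for illustration; the numbers are copied from the outputs in this notebook.

```python
# Counts copied from the cell outputs above.
n_kept_after_duplicates = 7708683  # True entries in the duplicates filter
n_duplicates = 88                  # False entries in the duplicates filter
n_noncanonical = 7400              # False entries in the noncanonical-AA filter
n_final = 7701283                  # True entries in the combined filter

# Total number of sequences before any filtering.
n_total = n_kept_after_duplicates + n_duplicates

# Number of sequences removed by the combined filter.
n_removed = n_total - n_final

print(n_total, n_removed)
assert n_kept_after_duplicates - n_noncanonical == n_final
assert n_removed == n_duplicates + n_noncanonical
```

Since the removed count equals the sum of the two individual counts, none of the 88 duplicates also contains a noncanonical AA.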
