# Nullomer and Prime sequence analysis 

## Introduction

An interesting application of current technologies would be the use of machine learning and deep learning algorithms to analyse genomic or metagenomic data. Sequencing is becoming steadily cheaper, and since it has been shown that mutations can be generated during genetic engineering, the scientific community could start preparing for large-scale genomic analyses rather than analysing just a few genes and phenotypes. Assemblies could be simplified, together with the detection of **SNPs** and **INDELs**.

Additionally, algorithms could be designed to generate sequences that do not exist, going beyond what is available in nature. For instance, the terms **minimal absent words (MAWs)**, nullomers and primes all describe sequences that do not occur in the entire genome or proteome of an organism \citep{hampikian2007absent, koulouras2020significant}. Primes are the shortest sequences that are not found in any known species, whereas nullomers are the shortest sequences absent from a given species, with **MAWs** covering both definitions. A related question is what the shortest, or even the longest, sequences present in all organisms are (popularly known as the sequence of god). Absent sequences still have theoretical significance: they may not exist in nature because they are lethal to the organism, or simply because they were not favoured by natural selection. A world of new opportunities lies in sequences that are beyond nature, and the scope of Synthetic Biology should be broadened in that direction to help solve problems currently considered unsolvable.


## Data gathering

We first load the *Escherichia coli* K12 genome into a `DataFrame`, one FASTA line per row, before building the list of all possible candidate nullomer sequences

In [3]:
import pandas as pd
import numpy as np

In [19]:
df= pd.read_csv('data/NC_000913_K12.fasta', sep='\n', header=None, names=['sequence_k12'])

df.head()

Unnamed: 0,sequence_k12
0,AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATT...
1,GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTA...
2,AGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCAC...
3,AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCA...
4,TAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCA...
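
Some pandas versions may reject a newline character as `sep`, and FASTA files normally begin with a `>` header line. As a minimal sketch, the sequence lines could instead be read with plain Python before building the `DataFrame`. The helper name `fasta_lines_to_frame` and the toy file are assumptions made for illustration; they are not part of the original analysis.

```python
import os
import tempfile

import pandas as pd


def fasta_lines_to_frame(path, column):
    """Return one sequence line per row, skipping '>' headers and blank lines."""
    with open(path) as handle:
        lines = [line.strip() for line in handle]
    lines = [line for line in lines if line and not line.startswith(">")]
    return pd.DataFrame({column: lines})


# Tiny made-up FASTA file, used only to demonstrate the helper.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">toy_record\nAGCTTTTC\nATTCTGAC\n")
    toy_path = tmp.name

toy_df = fasta_lines_to_frame(toy_path, "sequence_toy")
os.remove(toy_path)
print(toy_df)
```

With the real data, the call would presumably be `fasta_lines_to_frame('data/NC_000913_K12.fasta', 'sequence_k12')`.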


Now we can create a second dataset with all possible nullomers using [`itertools`](https://docs.python.org/3/library/itertools.html)

In [20]:
import itertools as it

We create a Python `list` with the four nucleotides (analogues could also be used)

In [32]:
nucleotides = ['A','T','G','C']

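We join all possible orderings of these 4 nucleotides in a new `list` using [`permutations`](https://docs.python.org/3/library/itertools.html#itertools.permutations). Note that `permutations` only returns the 24 sequences in which each nucleotide appears exactly once, not all 256 possible 4-mers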

In [29]:
#Printing list  
mix = [''.join(i) for i in list(it.permutations(nucleotides))]
print(mix)

['ATGC', 'ATCG', 'AGTC', 'AGCT', 'ACTG', 'ACGT', 'TAGC', 'TACG', 'TGAC', 'TGCA', 'TCAG', 'TCGA', 'GATC', 'GACT', 'GTAC', 'GTCA', 'GCAT', 'GCTA', 'CATG', 'CAGT', 'CTAG', 'CTGA', 'CGAT', 'CGTA']
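
The difference between orderings without repetition and the full set of 4-mers can be checked directly. The sketch below compares `permutations` with `product` for the same alphabet; the variable names `perms` and `all_4mers` are only illustrative.

```python
import itertools as it

nucleotides = ['A', 'T', 'G', 'C']

# Orderings in which every nucleotide is used exactly once
perms = {''.join(p) for p in it.permutations(nucleotides)}

# Every sequence of length 4, repeats allowed
all_4mers = {''.join(p) for p in it.product(nucleotides, repeat=4)}

print(len(perms), len(all_4mers))          # 24 256
print('AAAA' in perms, 'AAAA' in all_4mers)  # False True
```

A search restricted to `perms` therefore cannot detect an absent 4-mer that contains a repeated nucleotide.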


Candidates of any length, with repeated nucleotides allowed, can be generated with [`product`](https://docs.python.org/3/library/itertools.html#itertools.product) by setting `repeat` to the desired length

In [66]:
quintuplet = [''.join(i) for i in list(it.product(nucleotides, repeat=5))]

print(quintuplet)

['AAAAA', 'AAAAT', 'AAAAG', 'AAAAC', 'AAATA', 'AAATT', 'AAATG', 'AAATC', 'AAAGA', 'AAAGT', 'AAAGG', 'AAAGC', 'AAACA', 'AAACT', 'AAACG', 'AAACC', 'AATAA', 'AATAT', 'AATAG', 'AATAC', 'AATTA', 'AATTT', 'AATTG', 'AATTC', 'AATGA', 'AATGT', 'AATGG', 'AATGC', 'AATCA', 'AATCT', 'AATCG', 'AATCC', 'AAGAA', 'AAGAT', 'AAGAG', 'AAGAC', 'AAGTA', 'AAGTT', 'AAGTG', 'AAGTC', 'AAGGA', 'AAGGT', 'AAGGG', 'AAGGC', 'AAGCA', 'AAGCT', 'AAGCG', 'AAGCC', 'AACAA', 'AACAT', 'AACAG', 'AACAC', 'AACTA', 'AACTT', 'AACTG', 'AACTC', 'AACGA', 'AACGT', 'AACGG', 'AACGC', 'AACCA', 'AACCT', 'AACCG', 'AACCC', 'ATAAA', 'ATAAT', 'ATAAG', 'ATAAC', 'ATATA', 'ATATT', 'ATATG', 'ATATC', 'ATAGA', 'ATAGT', 'ATAGG', 'ATAGC', 'ATACA', 'ATACT', 'ATACG', 'ATACC', 'ATTAA', 'ATTAT', 'ATTAG', 'ATTAC', 'ATTTA', 'ATTTT', 'ATTTG', 'ATTTC', 'ATTGA', 'ATTGT', 'ATTGG', 'ATTGC', 'ATTCA', 'ATTCT', 'ATTCG', 'ATTCC', 'ATGAA', 'ATGAT', 'ATGAG', 'ATGAC', 'ATGTA', 'ATGTT', 'ATGTG', 'ATGTC', 'ATGGA', 'ATGGT', 'ATGGG', 'ATGGC', 'ATGCA', 'ATGCT', 'ATGCG', 

In [49]:
hexaplet = [''.join(i) for i in list(it.product(nucleotides, repeat=6))]

print(hexaplet)

['AAAAAA', 'AAAAAT', 'AAAAAG', 'AAAAAC', 'AAAATA', 'AAAATT', 'AAAATG', 'AAAATC', 'AAAAGA', 'AAAAGT', 'AAAAGG', 'AAAAGC', 'AAAACA', 'AAAACT', 'AAAACG', 'AAAACC', 'AAATAA', 'AAATAT', 'AAATAG', 'AAATAC', 'AAATTA', 'AAATTT', 'AAATTG', 'AAATTC', 'AAATGA', 'AAATGT', 'AAATGG', 'AAATGC', 'AAATCA', 'AAATCT', 'AAATCG', 'AAATCC', 'AAAGAA', 'AAAGAT', 'AAAGAG', 'AAAGAC', 'AAAGTA', 'AAAGTT', 'AAAGTG', 'AAAGTC', 'AAAGGA', 'AAAGGT', 'AAAGGG', 'AAAGGC', 'AAAGCA', 'AAAGCT', 'AAAGCG', 'AAAGCC', 'AAACAA', 'AAACAT', 'AAACAG', 'AAACAC', 'AAACTA', 'AAACTT', 'AAACTG', 'AAACTC', 'AAACGA', 'AAACGT', 'AAACGG', 'AAACGC', 'AAACCA', 'AAACCT', 'AAACCG', 'AAACCC', 'AATAAA', 'AATAAT', 'AATAAG', 'AATAAC', 'AATATA', 'AATATT', 'AATATG', 'AATATC', 'AATAGA', 'AATAGT', 'AATAGG', 'AATAGC', 'AATACA', 'AATACT', 'AATACG', 'AATACC', 'AATTAA', 'AATTAT', 'AATTAG', 'AATTAC', 'AATTTA', 'AATTTT', 'AATTTG', 'AATTTC', 'AATTGA', 'AATTGT', 'AATTGG', 'AATTGC', 'AATTCA', 'AATTCT', 'AATTCG', 'AATTCC', 'AATGAA', 'AATGAT', 'AATGAG', 'AATGAC',
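
Since there are 4^k sequences of length k, the lists grow quickly. As a hedged sketch, a generator can yield candidates lazily instead of holding them all in memory; the function name `kmers` is an assumption introduced here.

```python
import itertools as it


def kmers(k, alphabet="ATGC"):
    """Lazily yield every sequence of length k over the alphabet."""
    for combo in it.product(alphabet, repeat=k):
        yield "".join(combo)


# Number of candidates per length: 4 ** k
sizes = {k: 4 ** k for k in (5, 6, 9, 12)}
print(sizes)

# Only the first few candidates are materialised here
first_three = list(it.islice(kmers(6), 3))
print(first_three)
```

The iteration order matches the lists printed above because the alphabet is given in the same order (`A`, `T`, `G`, `C`).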

Testing the 4-mer permutations against the *Escherichia coli* K12 genome. The new column is `True` when a row contains at least one of the 24 sequences

In [39]:
df['nullomer_4b'] = df['sequence_k12'].str.contains('|'.join(mix))

df.head(10)  

Unnamed: 0,sequence_k12,nullomer_4b
0,AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATT...,True
1,GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTA...,True
2,AGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCAC...,True
3,AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCA...,True
4,TAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCA...,True
5,ATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCT...,True
6,GCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATA...,True
7,GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTG...,True
8,AACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAG...,True
9,ATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACA...,True


Another approach would be to create a `DataFrame` of candidate nullomers and then search for each of them in the genome

In [45]:
df_4bp = pd.DataFrame({
    'nullomer':mix,
    'type': 'quadraplet'
    
}) 

In [46]:
df_4bp

Unnamed: 0,nullomer,type
0,ATGC,quadraplet
1,ATCG,quadraplet
2,AGTC,quadraplet
3,AGCT,quadraplet
4,ACTG,quadraplet
5,ACGT,quadraplet
6,TAGC,quadraplet
7,TACG,quadraplet
8,TGAC,quadraplet
9,TGCA,quadraplet


In [67]:
df_5bp = pd.DataFrame({
    'nullomer':quintuplet,
    'type': 'quintuplet'
    
}) 

In [51]:
df_6bp = pd.DataFrame({
    'nullomer':hexaplet,
    'type': 'hexaplet'
    
}) 

We `concat` all 3 DataFrames

In [121]:
df_nullomer = pd.concat([df_4bp, df_5bp, df_6bp])

We add a column with the nullomer length

In [122]:
df_nullomer['length'] = df_nullomer['nullomer'].str.len()

We check how many candidates of each type there are

In [123]:
df_nullomer.type.value_counts()

hexaplet      4096
quintuplet    1024
quadraplet      24
Name: type, dtype: int64

Now we need to import the FASTA file as a single string

In [54]:
with open('data/NC_000913_K12.fasta', 'r') as file:
    data = file.read().replace('\n', '')

In [55]:
data

'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGTTACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGAT
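
Joining the lines matters because a sequence that spans a line break in the FASTA file is invisible to a row-by-row search. A toy illustration, using made-up lines rather than the real genome:

```python
# Two made-up FASTA lines; the k-mer 'ACGT' straddles the line break
lines = ["AAAC", "GTTT"]
kmer = "ACGT"

# Row-by-row search, as with one DataFrame row per line
row_hit = any(kmer in line for line in lines)

# Search in the concatenated sequence
joined_hit = kmer in "".join(lines)

print(row_hit, joined_hit)  # False True
```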

In [73]:
import re

In [124]:
df_nullomer['in_k12_genome'] = df_nullomer['nullomer'].apply(lambda x:  bool(re.search(x, data)))

In [125]:
df_nullomer['in_k12_genome'].sample(10)

1787    True
761     True
2979    True
788     True
826     True
1972    True
4049    True
3100    True
506     True
3799    True
Name: in_k12_genome, dtype: bool
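
Because the candidates contain only plain letters, the regular-expression search behaves like a simple substring test, which Python offers through the `in` operator. The sketch below compares both on a short made-up sequence; whether `in` is faster on the real genome is an expectation, not something measured here.

```python
import re

genome = "AGCTTTTCATTCTGACTGCA"  # toy sequence for illustration
candidates = ["TTTT", "GGGG", "CTGAC"]

by_regex = [bool(re.search(c, genome)) for c in candidates]
by_substring = [c in genome for c in candidates]

print(by_regex)      # [True, False, True]
print(by_substring)  # [True, False, True]
```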

In [126]:
df_nullomer[df_nullomer['in_k12_genome'] == False]

Unnamed: 0,nullomer,type,length,in_k12_genome


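Since no candidate is absent from the genome (the filter returns an empty `DataFrame`), we add a control sequence to the `DataFrame`. It contains the character `X`, which cannot occur in the genome, so it lets us confirm that the search can return `False`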

In [130]:
control_sequences = pd.DataFrame({
    'nullomer': 'ATGXX',
    'type': 'control',
    'length':len('ATGXX'),
    'in_k12_genome':'No'
    
}, index=[0])

control_sequences

Unnamed: 0,nullomer,type,length,in_k12_genome
0,ATGXX,control,5,No


In [132]:
df_nullomer = pd.concat([df_nullomer, control_sequences])

We run the search again to check that the control is identified as absent

In [133]:
df_nullomer['in_k12_genome'] = df_nullomer['nullomer'].apply(lambda x:  bool(re.search(x, data)))

In [134]:
df_nullomer[df_nullomer['in_k12_genome'] == False]

Unnamed: 0,nullomer,type,length,in_k12_genome
0,ATGXX,control,5,False


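We can conclude that there are no nullomers of up to 6 bp in the genome of *Escherichia coli* K12. Although only 24 of the 256 possible 4-mers were tested, every 5-mer occurs in the genome, so every shorter sequence must occur as well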
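
The same absence test can also be done in a single pass over the sequence, by collecting every observed k-mer with a sliding window and keeping the candidates that were never seen. This is an alternative sketch, not the method used in this notebook; the function name `absent_kmers` and the toy sequence are assumptions.

```python
import itertools as it


def absent_kmers(sequence, k, alphabet="ATGC"):
    """Return all k-mers over the alphabet that never occur in the sequence."""
    # Sliding window: every substring of length k that is actually present
    observed = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    candidates = ("".join(p) for p in it.product(alphabet, repeat=k))
    return [c for c in candidates if c not in observed]


toy = "ATGCATGCCGTA"
missing_1 = absent_kmers(toy, 1)
missing_2 = absent_kmers(toy, 2)
print(missing_1)
print(missing_2)
```

This avoids one full scan of the genome per candidate, at the cost of holding the set of observed k-mers in memory.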

We now generate longer candidates (7 to 9 bp) in a new `DataFrame`

In [142]:
df_nullomer_2 = pd.DataFrame({ 
    'nullomer':[''.join(i) for j in range(7,10) for i in list(it.product(nucleotides, repeat=j))]
})

In [146]:
#using IUPAC multiplier

nullomer_type_dict = {4:'tetraplet',
                      5:'pentaplet',
                      6:'hexaplet',
                      7:'heptaplet',
                      8:'octaplet',
                      9:'nonaplet',
                      10:'decaplet',
                      11:'undecaplet',
                      12:'dodecaplet'
                     }

df_nullomer_2['length'] = df_nullomer_2['nullomer'].str.len()


In [147]:
df_nullomer_2.sample(9)

Unnamed: 0,nullomer,length
91462,AAGTTTATG,9
254567,GGGAGTGTC,9
185931,TGTTGTAGC,9
195747,TGCCAGGAC,9
51968,GAGCAAAA,8
145436,ACCGAATCA,9
39162,TTGACCGG,8
215542,GAAGTCCTG,9
322871,CGGCTACTC,9


To optimize the process, it is split by nullomer length: a separate `DataFrame` is created for each length and exported as a CSV file

In [173]:
df = pd.DataFrame({ 
        'nullomer':[''.join(i) for i in list(it.product(nucleotides, repeat=12))]
    })

df['length'] = df['nullomer'].str.len()

df['type'] = 'dodecacaplet'
    
df.to_csv('data/nullomer_12bp.csv',index=False)


In [174]:
df.info

<bound method DataFrame.info of               nullomer  length          type
0         AAAAAAAAAAAA      12  dodecacaplet
1         AAAAAAAAAAAT      12  dodecacaplet
2         AAAAAAAAAAAG      12  dodecacaplet
3         AAAAAAAAAAAC      12  dodecacaplet
4         AAAAAAAAAATA      12  dodecacaplet
...                ...     ...           ...
16777211  CCCCCCCCCCGC      12  dodecacaplet
16777212  CCCCCCCCCCCA      12  dodecacaplet
16777213  CCCCCCCCCCCT      12  dodecacaplet
16777214  CCCCCCCCCCCG      12  dodecacaplet
16777215  CCCCCCCCCCCC      12  dodecacaplet

[16777216 rows x 3 columns]>
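
Only the 12 bp export is shown, while files such as `data/nullomer_7bp.csv` are read later. One way those per-length files could be produced is a small loop that looks up the type name in a dictionary. This is a sketch under that assumption; the helper `export_candidates` is hypothetical, and it writes to a temporary directory so that no existing file is overwritten.

```python
import itertools as it
import os
import tempfile

import pandas as pd

nucleotides = ['A', 'T', 'G', 'C']
nullomer_type_dict = {7: 'heptaplet', 8: 'octaplet', 9: 'nonaplet'}


def export_candidates(k, out_dir):
    """Write all candidates of length k to '<out_dir>/nullomer_<k>bp.csv'."""
    df = pd.DataFrame({
        'nullomer': [''.join(p) for p in it.product(nucleotides, repeat=k)]
    })
    df['length'] = k
    df['type'] = nullomer_type_dict[k]
    path = os.path.join(out_dir, f'nullomer_{k}bp.csv')
    df.to_csv(path, index=False)
    return path


out_dir = tempfile.mkdtemp()  # stand-in for the 'data/' folder
paths = [export_candidates(k, out_dir) for k in (7, 8)]

check = pd.read_csv(paths[0])
print(check.shape)
```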

We can keep looking for nullomers in the newly created DataFrames

**For 7bp**

In [175]:
df_7bp = pd.read_csv('data/nullomer_7bp.csv')

In [176]:
df_7bp['in_k12_genome'] = df_7bp['nullomer'].apply(lambda x:  bool(re.search(x, data)))

In [178]:
df_7bp[df_7bp['in_k12_genome'] == False]

Unnamed: 0,nullomer,length,type,in_k12_genome
12106,GCCTAGG,7,heptaplet,False


The new `in_k12_genome` column is saved to the previously generated CSV file

In [185]:
df_7bp.to_csv('data/nullomer_7bp.csv',index=False)

We found one heptaplet, `GCCTAGG`, that is absent from the genome of *Escherichia coli* K12

**For 8bp**

In [179]:
df_8bp = pd.read_csv('data/nullomer_8bp.csv')

In [180]:
df_8bp['in_k12_genome'] = df_8bp['nullomer'].apply(lambda x:  bool(re.search(x, data)))

In [181]:
df_8bp[df_8bp['in_k12_genome'] == False]

Unnamed: 0,nullomer,length,type,in_k12_genome
4050,AACCCTAG,8,octaplet,False
6610,ATGTCTAG,8,octaplet,False
7465,ATCTAGGT,8,octaplet,False
8008,ATCCTAGA,8,octaplet,False
8011,ATCCTAGC,8,octaplet,False
...,...,...,...,...
64803,CCCTAGAC,8,octaplet,False
64804,CCCTAGTA,8,octaplet,False
64978,CCCTCTAG,8,octaplet,False
65353,CCCCTAGT,8,octaplet,False


There are 176 octaplet sequences absent from the genome of *Escherichia coli* K12

In [186]:
df_8bp.to_csv('data/nullomer_8bp.csv',index=False)

**For 9bp**

In [182]:
df_9bp = pd.read_csv('data/nullomer_9bp.csv')

In [183]:
df_9bp['in_k12_genome'] = df_9bp['nullomer'].apply(lambda x:  bool(re.search(x, data)))

In [184]:
df_9bp['in_k12_genome'].value_counts()

True     256527
False      5617
Name: in_k12_genome, dtype: int64

In [187]:
df_9bp.to_csv('data/nullomer_9bp.csv',index=False)

There are 5617 nonaplets absent from the genome of strain K12
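
To put these counts in context, the number of possible sequences of length k is 4^k, so the absent fraction for each length follows directly from the counts reported in this section. A small worked sketch:

```python
# Absent counts reported above for E. coli K12
absent_counts = {7: 1, 8: 176, 9: 5617}

fractions = {}
for k, n_absent in absent_counts.items():
    total = 4 ** k  # all possible sequences of length k
    fractions[k] = n_absent / total
    print(f"{k} bp: {n_absent} of {total} absent ({fractions[k]:.4%})")
```

The 9 bp totals are consistent with the `value_counts` output: 256527 present plus 5617 absent equals 4^9 = 262144.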

## Analysis on *C. metallidurans* CH34

The replicons can be loaded as a `DataFrame`

In [None]:
# CH34 chromosome
ch34_chr = pd.read_csv('data/NC_007973.1_CH34_1.fasta', sep='\n', header=None, names=['sequence_ch34_chr'])
# CH34 Chromid
ch34_cmd = pd.read_csv('data/NC_007974.2_CH34_2.fasta', sep='\n', header=None, names=['sequence_ch34_cmd'])
# pMOL30
ch34_p30 = pd.read_csv('data/NC_007971.2_30.fasta', sep='\n', header=None, names=['sequence_ch34_p30'])
# pMOL28
ch34_p28 = pd.read_csv('data/NC_007972.2_28.fasta', sep='\n', header=None, names=['sequence_ch34_p28'])
# pTP6
msr33_pt6 = pd.read_csv('data/NC_007680.1_P6.fasta', sep='\n', header=None, names=['sequence_msr33_pt6'])

or as strings, one per replicon

In [4]:
ch34_genome = {'sequence_ch34_chr':'data/NC_007973.1_CH34_1.fasta' ,
               'sequence_ch34_cmd':'data/NC_007974.2_CH34_2.fasta' ,
               'sequence_ch34_p30':'data/NC_007971.2_30.fasta' ,
               'sequence_ch34_p28':'data/NC_007972.2_28.fasta' ,
               'sequence_msr33_pt6':'data/NC_007680.1_P6.fasta' 
              }
sequence_ch34_chr = ''
sequence_ch34_cmd = ''
sequence_ch34_p30 = ''
sequence_ch34_p28 = ''
sequence_msr33_pt6 = ''

import codecs

for key, value in ch34_genome.items():
    with codecs.open(value, 'r', encoding='utf-8', errors='ignore') as file:
        if key == 'sequence_ch34_chr':
            sequence_ch34_chr = file.read().replace('\n', '')
        elif key == 'sequence_ch34_cmd':
            sequence_ch34_cmd = file.read().replace('\n', '')
        elif key == 'sequence_ch34_p30':
            sequence_ch34_p30 = file.read().replace('\n', '')
        elif key == 'sequence_ch34_p28':
            sequence_ch34_p28 = file.read().replace('\n', '')
        elif key == 'sequence_msr33_pt6':
            sequence_msr33_pt6 = file.read().replace('\n', '')
        else: break
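
The chain of `if`/`elif` branches could be replaced by a dictionary that maps each replicon name to its sequence. The following is a sketch of that idea, not the code that was run; `read_sequence` and `load_replicons` are hypothetical helpers, skipping `>` header lines is an added assumption, and toy files stand in for the real FASTA files.

```python
import os
import tempfile


def read_sequence(path):
    """Read a FASTA-like file into one string, skipping '>' header lines."""
    with open(path, encoding='utf-8', errors='ignore') as handle:
        return ''.join(line.strip() for line in handle
                       if not line.startswith('>'))


def load_replicons(paths):
    """Map each replicon name to its concatenated sequence."""
    return {name: read_sequence(path) for name, path in paths.items()}


# Toy files for illustration only
tmp_dir = tempfile.mkdtemp()
toy_content = {'toy_chr': '>chr\nATGC\nGGCC\n', 'toy_plasmid': 'TTAA\n'}
toy_paths = {}
for name, content in toy_content.items():
    toy_paths[name] = os.path.join(tmp_dir, name + '.fasta')
    with open(toy_paths[name], 'w') as handle:
        handle.write(content)

replicons = load_replicons(toy_paths)
print(replicons)
```

With the real data, the `ch34_genome` dictionary defined above could be passed to `load_replicons` unchanged.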

In [19]:
print(sequence_msr33_pt6)


GTCGACCGCGCCGGCGGCATCGCCAGGTTCGGCGATGCTGGCCAGCTCCCGGCATCGTCGCTTGTGACGATGCTGCAGGATGATTCCGCCAGGCCCTCGGCGGGCCTCCCTCGCCAGAAATGGCGATGCTCGAGAAAAAAGCCCGGCGACCGCCGGGCTTCGTTTTTATGCGGGCCTGGCTGACAGGCCCAAGGACTGCGATCTCCCTGGCCGCTCCTGGGCCTGGGATTGCTTGCGCTCCTGCTCCTGCTTGCGCATCAGCAGCTCATGCCGGCGGGCGGCCTCGCGCATGGCATCCCAATCGGAAGCCAGCTCGGGGTTCTCGGCGCGCATCTTGCGCGTCGCCAGCTCTTCGATCTTCGGCGAGTGCAGGCCCATGCCTTCCTTGATCTCGCGGACGGCCTCCAGGCGCGCGTGAAGCGTCTGCAAGCGGGCCTGCTGCTGCGCCTGCTGGTTCTGCCAGGCTCGCTTAGAGCCGGGCAAGGAAAGCCGGCCAGGCGCGGACGCCTGGGTCTGCTGCAAGCGGGCCTGCTGCCGGTCGATCAGGTTTTCTAGCCTGTCCTCGATGTGCTCCACCTGGTCGTGCTTGGCCTGCACGTAGAGCGCCAGGGTTTCGGGGTAGGACTGCTCGACCGGGGCCGCCTCTAGGGTGGCCTGCTGCTCGATCTCGGCCGCCTGGGCGGCCTCCAGCAAGTCGGCTCCGGTGTCCTCGCGCGAAGCGCCGAAGGCGGCACCGCCGAGGGTGGCCGGCGCGCTGACGCGGGCCGGCTTGGTGGCGCGGCTCTCGGTGCCGCTGCCGGGGATGGTGAGTCGTTTCAAGGGCGTGCCTCCTTTTTAGCCGCTAAAGCTACATGGAGCGCCCCCGCGTCAGTTTCGGGGCCTCTGCGGCGACTGCCGCCTTGCCCTGGGCGTTGTAGCTGATGCTCTTTGCGCTCCCAATTTCAGGTACTTTATCGAAATCTGACCGAGCGTGCATGACAAAGTTCTTGCCGATCTGCTG

Now we can import the previously created analysis for strain K12 and create the new variables for strain CH34

In [21]:
import pandas as pd

In [22]:
df_7bp = pd.read_csv('data/nullomer_7bp.csv')

In [23]:
df_7bp.head()

Unnamed: 0,nullomer,length,type,in_k12_genome
0,AAAAAAA,7,heptaplet,True
1,AAAAAAT,7,heptaplet,True
2,AAAAAAG,7,heptaplet,True
3,AAAAAAC,7,heptaplet,True
4,AAAAATA,7,heptaplet,True


Now we search for each candidate in the CH34 replicons

In [25]:
import re

**for the chromosome**

In [31]:
df_7bp['in_CH34_chromosome'] = df_7bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_chr)))

**for the chromid**

In [30]:
df_7bp['in_CH34_chromid'] = df_7bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_cmd)))

**for pMOL28**

In [29]:
df_7bp['in_CH34_pMOL28'] = df_7bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_p28)))

**for pMOL30**

In [28]:
df_7bp['in_CH34_pMOL30'] = df_7bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_p30)))

**for pTP6**

In [27]:
df_7bp['in_MSR33_pTP6'] = df_7bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_msr33_pt6)))
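
The five near-identical cells can be expressed as one loop over a mapping from column name to sequence. A minimal sketch with toy sequences; the names `targets` and `df_toy` are assumptions, and the real column names and sequence variables would take their place.

```python
import pandas as pd

# Toy stand-ins for the replicon strings loaded earlier
targets = {
    'in_toy_chr': 'ATGCGGCC',
    'in_toy_plasmid': 'TTAAGC',
}

df_toy = pd.DataFrame({'nullomer': ['GC', 'AA', 'CG']})

# One presence/absence column per target sequence
for column, sequence in targets.items():
    df_toy[column] = df_toy['nullomer'].apply(lambda kmer: kmer in sequence)

print(df_toy)
```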

We check the results

In [32]:
df_7bp.sample(10)

Unnamed: 0,nullomer,length,type,in_k12_genome,in_MSR33_pTP6,in_CH34_pMOL30,in_CH34_pMOL28,in_CH34_chromid,in_CH34_chromosome
1431,ATTGTTC,7,heptaplet,True,False,True,True,True,True
1473,ATTCAAT,7,heptaplet,True,False,True,True,True,True
13445,CTAGATT,7,heptaplet,True,False,True,True,True,True
731,AAGCTGC,7,heptaplet,True,True,True,True,True,True
16281,CCCGTGT,7,heptaplet,True,True,True,True,True,True
6600,TGTCAGA,7,heptaplet,True,False,True,True,True,True
5684,TTGACTA,7,heptaplet,True,True,True,True,True,True
1187,ATAGGAC,7,heptaplet,True,False,True,True,True,True
2828,AGCAACA,7,heptaplet,True,False,True,True,True,True
16043,CCGGGGC,7,heptaplet,True,True,True,True,True,True


In [35]:
df_7bp.drop(['nullomer', 'length','type'], axis=1).value_counts()

in_k12_genome  in_MSR33_pTP6  in_CH34_pMOL30  in_CH34_pMOL28  in_CH34_chromid  in_CH34_chromosome
True           True           True            True            True             True                  11576
               False          True            True            True             True                   3836
                                              False           True             True                    377
                              False           True            True             True                    198
               True           True            False           True             True                    158
               False          False           False           True             True                    116
               True           False           True            True             True                     80
                                              False           True             True                     35
               False          True            

The following code can be used to cross-check absence across multiple genomes; the commented-out conditions can be re-enabled to include more replicons

In [42]:
df_7bp[
    (df_7bp['in_k12_genome'] == False) &
   # (df_7bp['in_CH34_chromosome'] == False) &
   # (df_7bp['in_CH34_chromid'] == False) &
   # (df_7bp['in_CH34_pMOL28'] == False) &
   # (df_7bp['in_CH34_pMOL30'] == False) &
    (df_7bp['in_MSR33_pTP6'] == False)
           ]

Unnamed: 0,nullomer,length,type,in_k12_genome,in_MSR33_pTP6,in_CH34_pMOL30,in_CH34_pMOL28,in_CH34_chromid,in_CH34_chromosome
12106,GCCTAGG,7,heptaplet,False,False,True,True,True,True
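
When every presence column should be `False`, the chained conditions can be written more compactly by selecting the columns by prefix. A sketch on a toy `DataFrame`, assuming all presence columns start with `in_` as they do in this notebook:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'nullomer': ['AAA', 'CCC', 'GGG'],
    'in_genome_a': [True, False, False],
    'in_genome_b': [True, False, True],
})

# All boolean presence/absence columns
presence_cols = [c for c in df_toy.columns if c.startswith('in_')]

# Keep rows where the sequence is found in none of the genomes
absent_everywhere = df_toy[~df_toy[presence_cols].any(axis=1)]
print(absent_everywhere)
```

Restricting `presence_cols` to a subset reproduces a partial filter like the one above.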


Finally, we export the updated file

In [None]:
df_7bp.to_csv('data/nullomer_7bp.csv',index=False)

### For 8 bp

In [1]:
import pandas as pd
import re

In [2]:
df_8bp = pd.read_csv('data/nullomer_8bp.csv')

In [5]:
df_8bp['in_CH34_chromosome'] = df_8bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_chr)))

In [7]:
df_8bp['in_CH34_chromid'] = df_8bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_cmd)))

In [8]:
df_8bp['in_CH34_pMOL28'] = df_8bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_p28)))

In [9]:
df_8bp['in_CH34_pMOL30'] = df_8bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_p30)))

In [10]:
df_8bp['in_MSR33_pTP6'] = df_8bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_msr33_pt6)))

In [11]:
df_8bp.sample(10)

Unnamed: 0,nullomer,length,type,in_k12_genome,in_CH34_chromosome,in_CH34_chromid,in_CH34_pMOL28,in_CH34_pMOL30,in_MSR33_pTP6
19233,TAGCAGAT,8,octaplet,True,True,True,True,False,False
4809,ATAGCAGT,8,octaplet,True,True,True,True,True,False
2919,AAGCTGTC,8,octaplet,True,True,True,True,True,True
29314,TCAGGAAG,8,octaplet,True,True,True,False,True,True
52955,CACGCTGC,8,octaplet,True,True,True,True,True,True
15825,ACCTCTAT,8,octaplet,True,True,True,True,True,False
19656,TACACAGA,8,octaplet,True,True,True,False,True,False
62470,CCTAAATG,8,octaplet,True,True,True,False,False,False
41159,GGAACATC,8,octaplet,True,True,True,True,True,True
52444,CACACTCA,8,octaplet,True,True,True,True,False,False


In [16]:
df_8bp[
    (df_8bp['in_k12_genome'] == False) &
    (df_8bp['in_CH34_chromosome'] == False) &
    (df_8bp['in_CH34_chromid'] == False) &
    (df_8bp['in_CH34_pMOL30'] == False) &
    (df_8bp['in_CH34_pMOL28'] == False) &
    (df_8bp['in_MSR33_pTP6'] == False)
           ]

Unnamed: 0,nullomer,length,type,in_k12_genome,in_CH34_chromosome,in_CH34_chromid,in_CH34_pMOL28,in_CH34_pMOL30,in_MSR33_pTP6
23843,TTCTAGAC,8,octaplet,False,False,False,False,False,False


There is one octaplet, `TTCTAGAC`, that is absent from both genomes

In [17]:
df_8bp.to_csv('data/nullomer_8bp.csv',index=False)

### For 9 bp

In [18]:
df_9bp = pd.read_csv('data/nullomer_9bp.csv')

In [19]:
df_9bp['in_CH34_chromosome'] = df_9bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_chr)))
df_9bp['in_CH34_chromid'] = df_9bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_cmd)))
df_9bp['in_CH34_pMOL30'] = df_9bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_p30)))
df_9bp['in_CH34_pMOL28'] = df_9bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_ch34_p28)))
df_9bp['in_MSR33_pTP6'] = df_9bp['nullomer'].apply(lambda x:  bool(re.search(x, sequence_msr33_pt6)))

In [20]:
df_9bp.sample(10)

Unnamed: 0,nullomer,length,type,in_k12_genome,in_CH34_chromosome,in_CH34_chromid,in_CH34_pMOL30,in_CH34_pMOL28,in_MSR33_pTP6
104380,TGTTCGCCA,9,nonaplet,True,True,True,True,True,True
7356,AATCAGCCA,9,nonaplet,True,True,True,True,False,False
125138,TCGGACTAG,9,nonaplet,False,True,True,True,True,False
170861,GGTGCTGCT,9,nonaplet,True,True,True,True,True,True
204753,CATCCCTAT,9,nonaplet,True,True,False,True,False,False
73370,TATCGGTGG,9,nonaplet,True,True,True,True,False,False
114038,TGCCTTCTG,9,nonaplet,True,True,True,True,True,False
99397,TGATATATT,9,nonaplet,True,True,False,False,False,False
7705,AATCGATGT,9,nonaplet,True,True,True,False,True,False
75585,TAGTCTAAT,9,nonaplet,False,False,True,False,False,False


In [21]:
df_9bp[
    (df_9bp['in_k12_genome'] == False) &
    (df_9bp['in_CH34_chromosome'] == False) &
    (df_9bp['in_CH34_chromid'] == False) &
    (df_9bp['in_CH34_pMOL30'] == False) &
    (df_9bp['in_CH34_pMOL28'] == False) &
    (df_9bp['in_MSR33_pTP6'] == False)
           ]

Unnamed: 0,nullomer,length,type,in_k12_genome,in_CH34_chromosome,in_CH34_chromid,in_CH34_pMOL30,in_CH34_pMOL28,in_MSR33_pTP6
2770,AAAGGCTAG,9,nonaplet,False,False,False,False,False,False
3363,AAACTAGAC,9,nonaplet,False,False,False,False,False,False
3370,AAACTAGGG,9,nonaplet,False,False,False,False,False,False
4936,AATACTAGA,9,nonaplet,False,False,False,False,False,False
5261,AATTAGACT,9,nonaplet,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
261414,CCCCTAGTG,9,nonaplet,False,False,False,False,False,False
261450,CCCCTTAGG,9,nonaplet,False,False,False,False,False,False
261960,CCCCCTAGA,9,nonaplet,False,False,False,False,False,False
261961,CCCCCTAGT,9,nonaplet,False,False,False,False,False,False


There are 870 nonaplets absent from both genomes

In [22]:
df_9bp.to_csv('data/nullomer_9bp.csv',index=False)

## Conclusion

**Nonaplets** seem to be good candidates for looking into shared nullomers among different species

We can also export the results directly to text files

In [None]:
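import itertools as it

# Assumption: `trinucleotides` and `my_dna` are not defined in this notebook;
# here they are taken to be all 3-mers and the K12 genome string `data`.
trinucleotides = it.product('ATGC', repeat=3)
my_dna = data

out_dir = "/home/millacurafa/Documents/Python for biologists/Nullomers/"
# Open each output file once instead of reopening it on every iteration
with open(out_dir + "nullomer_out.txt", "a") as null_out, \
        open(out_dir + "nullomer_4bp.txt", "w") as null_4:
    for t in trinucleotides:
        search = "".join(t)
        if search not in my_dna:
            null_out.write(search + "\n")  # absent from the sequence
        else:
            null_4.write(search + "\n")    # present in the sequence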


## References

1. Hampikian, G., & Andersen, T. (2007). Absent sequences: nullomers and primes. In Biocomputing 2007 (pp. 355-366).

2. Vergni, D., & Santoni, D. (2016). Nullomers and high order nullomers in genomic sequences. PLoS one, 11(12), e0164540.

3. Koulouras, G., & Frith, M. C. (2021). Significant non-existence of sequences in genomes and proteomes. Nucleic Acids Research, 49(6), 3139-3155.