# Setup

In [1]:
!pip show fastai

Name: fastai
Version: 2.2.7
Summary: fastai simplifies training fast and accurate neural nets using modern best practices
Home-page: https://github.com/fastai/fastai/tree/master/
Author: Jeremy Howard, Sylvain Gugger, and contributors
Author-email: info@fast.ai
License: Apache Software License 2.0
Location: /opt/conda/lib/python3.7/site-packages
Requires: requests, pillow, scikit-learn, fastprogress, fastcore, matplotlib, pandas, torchvision, spacy, torch, scipy, pyyaml, pip, packaging
Required-by: 


In [2]:
!pip show biopython

Name: biopython
Version: 1.78
Summary: Freely available tools for computational molecular biology.
Home-page: https://biopython.org/
Author: The Biopython Contributors
Author-email: biopython@biopython.org
License: UNKNOWN
Location: /opt/conda/lib/python3.7/site-packages
Requires: numpy
Required-by: bio


In [3]:
from Bio import SeqIO
from fastai.text.all import *

# Data preparation

In [4]:
!wget -nc http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.abinitio.fa.gz
!yes n | gunzip Homo_sapiens.GRCh38.cdna.abinitio.fa.gz

File ‘Homo_sapiens.GRCh38.cdna.abinitio.fa.gz’ already there; not retrieving.

gzip: Homo_sapiens.GRCh38.cdna.abinitio.fa already exists;	not overwritten
yes: standard output: Broken pipe


# Token preparation

Parse the sequences from the FASTA file into a list. Also convert everything to lowercase in case upper and lowercase letters are mixed; we don't want our model to learn that distinction.

In [5]:
with open("Homo_sapiens.GRCh38.cdna.abinitio.fa", "rt") as handle:
  txts = L(str(record.seq).lower() for record in SeqIO.parse(handle, "fasta"))

We have 51 756 sequences, together 64 739 432 characters.

In [6]:
print(len(txts))
print(len(''.join(txts)))

51756
64739432


We will set aside the first sequence for later testing and, for quicker work, use just the next 10 000 sequences for training.

In [7]:
txt = txts[0]
txts = txts[1:10001]

Now we have 10 000 sequences and 12 593 978 characters.

In [8]:
print(len(txts))
print(len(''.join(txts)))

10000
12593978


We will create a subword tokenizer and let it build a vocabulary of tokens from our input data. The key parameter here is VOCAB_SIZE. In this example we use VOCAB_SIZE = 66, which is 64 + 2: we want 64 base tokens plus 2 tokens representing special characters (the unknown character and the start-of-sequence character). A size of 64 could in theory lead to codons, since there are 4^3 = 64 possible codons and we are working with coding DNA.
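To make the choice of 64 concrete, here is a minimal sketch that enumerates every possible codon over the four bases. The names `BASES` and `codons` are illustrative only and are not used elsewhere in this notebook.

```python
from itertools import product

BASES = "acgt"  # lowercase, matching the lowercased sequences

# Every length-3 string over four bases: 4**3 = 64 possible codons.
codons = ["".join(p) for p in product(BASES, repeat=3)]

print(len(codons))   # 64
print(codons[:4])    # ['aaa', 'aac', 'aag', 'aat']
```

Note that the subword tokenizer is free to pick tokens of any length, so a vocabulary of this size does not guarantee that the learned tokens are exactly these 64 codons.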



In [9]:
SPECIAL_TOKENS = 2
VOCAB_SIZE = 64 + SPECIAL_TOKENS
tokenizer = SubwordTokenizer(vocab_sz=VOCAB_SIZE, special_toks=[], cache_dir='tmp/vocab64', lang='dna')
tokenizer.setup(txts)

{'sp_model': Path('tmp/vocab64/spm.model')}

Just to verify that we get reasonable tokens, split the test sequence into tokens.

In [10]:
toks = first(tokenizer([txt]))
print(coll_repr(toks, 30))

(#403) ['▁','atg','gaa','ag','agga','aaga','ag','aa','aaga','att','tcc','aa','taa','gtt','aca','aca','aa','ctt','t','tca','cca','ttc','taa','ag','aa','ccc','act','ttc','ctt','atc'...]


And print the first 100 characters of our test sequence to compare them with the tokens.

In [11]:
txt[:100]

'atggaaagaggaaagaagaaaagaatttccaataagttacaacaaacttttcaccattctaaagaacccactttccttatcaaccaagctgggcttctct'

Add Tokenizer on top of SubwordTokenizer. I am not sure why this is needed, but I wasn't able to run it without this step.

I set rules=[] so that no default rules are applied, especially no encoding of repeating characters.

In the future, a custom tokenizer with just one special token for the start of a sequence, and one for the unknown base N, would be nice.

In [12]:
tkn = Tokenizer(tokenizer, rules=[], sep='')
print(coll_repr(tkn(txt), 30))

(#403) ['▁','atg','gaa','ag','agga','aaga','ag','aa','aaga','att','tcc','aa','taa','gtt','aca','aca','aa','ctt','t','tca','cca','ttc','taa','ag','aa','ccc','act','ttc','ctt','atc'...]


Now we will tokenize all 10 000 sequences using our predefined vocabulary.

In [13]:
toks_all = txts.map(tkn)

# Tokens analysis

Put tokens from all of our training sequences into one big list.

In [14]:
from operator import add

tokens = reduce(add, toks_all)
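As a side note, `reduce(add, ...)` builds a new list at every step, which can get slow for many sequences. A possible single-pass alternative is sketched below on a tiny made-up input; `toks_demo` and `tokens_demo` are hypothetical stand-ins, not variables from this notebook.

```python
from itertools import chain

# Hypothetical stand-in for toks_all: one token list per sequence.
toks_demo = [['▁', 'atg', 'gaa'], ['▁', 'ag', 't']]

# Flatten all per-sequence token lists into one list in a single pass.
tokens_demo = list(chain.from_iterable(toks_demo))

print(tokens_demo)
```

The result is a plain Python list rather than a fastcore `L`, which is sufficient for `len` and `collections.Counter`.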

Our sequences were split into 4 506 720 tokens.

In [15]:
len(tokens)

4506720
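A quick sanity check on these numbers is the average token length, sketched below with the counts copied from the outputs above. It assumes that every sequence contributes exactly one leading '▁' marker, as seen in the test sequence; that assumption is not verified here.

```python
n_chars = 12_593_978   # characters in the 10 000 training sequences
n_tokens = 4_506_720   # total number of tokens
n_seqs = 10_000        # assumed: one '▁' marker token per sequence

# Drop the marker tokens, since '▁' is not part of the original text.
mean_len = n_chars / (n_tokens - n_seqs)

print(f"{mean_len:.2f} characters per token")  # about 2.80
```

If the sequences were split purely into codons, this value would be 3, so an average of about 2.8 suggests a mix of shorter and longer tokens.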

Print the 10 most common tokens.

In [16]:
import collections

elements_count = collections.Counter(tokens)
print(elements_count.most_common(10))

[('ag', 333706), ('t', 234379), ('aa', 224096), ('atg', 149345), ('gcc', 137982), ('gtg', 132431), ('aca', 120716), ('cca', 117287), ('ccc', 111571), ('ctg', 105940)]


TODO: interpretation: the most frequent tokens are two stop codons and then the start codon.

TODO: look at the vocabulary and map it to codons.

TODO: use a larger vocabulary size and see which pairs of codons appear.

TODO: look at how the sequences are split into tokens. The assumption was that they would be split into triplets, and that a four-base token would be followed by a two-base token, but so far it does not look that way. If all codons are in the vocabulary, the reading frame can shift and the sequence can still be tokenized.

TODO: use a smaller vocabulary size to force the tokenizer to create only the most frequent codons, not all of them. Then analyze again how the sequences are split.
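A possible starting point for the token-layout analysis is sketched below. It uses only the first 30 tokens of the test sequence, copied from the output above, and assumes the reading frame starts at the first base of the sequence. The helper `frame_report` is hypothetical and not part of fastai or Biopython.

```python
from collections import Counter
from itertools import product

# First 30 tokens of the test sequence, copied from the output above.
toks = ['▁','atg','gaa','ag','agga','aaga','ag','aa','aaga','att',
        'tcc','aa','taa','gtt','aca','aca','aa','ctt','t','tca',
        'cca','ttc','taa','ag','aa','ccc','act','ttc','ctt','atc']

CODONS = {"".join(p) for p in product("acgt", repeat=3)}

def frame_report(tokens, marker="▁"):
    """Return token-length counts and the share of tokens that are in-frame codons.

    Assumes the reading frame starts at base 0 of the sequence.
    """
    bases = [t for t in tokens if t != marker]
    lengths = Counter(len(t) for t in bases)
    pos, in_frame = 0, 0
    for t in bases:
        # A token is "in frame" if it is a codon starting at a multiple of 3.
        if pos % 3 == 0 and t in CODONS:
            in_frame += 1
        pos += len(t)
    return lengths, in_frame / len(bases)

lengths, share = frame_report(toks)
print(sorted(lengths.items()))
print(f"{share:.2f} of tokens are in-frame codons")
```

Running the same helper over every tokenized sequence would show whether the pattern seen in this short excerpt holds for the whole training set.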