<a href="https://colab.research.google.com/github/katarinagresova/genomic_ML_playground/blob/main/Genomic_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [None]:
!pip install biopython
!pip install fastai --upgrade

On March 15th I was not able to run `SubwordTokenizer` without installing this manually first.

In [19]:
!pip install sentencepiece!=0.1.90,!=0.1.91

Collecting sentencepiece!=0.1.90,!=0.1.91
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.95


In [2]:
!pip show fastai

Name: fastai
Version: 2.2.7
Summary: fastai simplifies training fast and accurate neural nets using modern best practices
Home-page: https://github.com/fastai/fastai/tree/master/
Author: Jeremy Howard, Sylvain Gugger, and contributors
Author-email: info@fast.ai
License: Apache Software License 2.0
Location: /usr/local/lib/python3.7/dist-packages
Requires: pandas, scikit-learn, torchvision, torch, packaging, spacy, pip, fastcore, requests, pyyaml, scipy, fastprogress, matplotlib, pillow
Required-by: 


In [7]:
from Bio import SeqIO
from fastai.text.all import *

# Data preparation

Get the data that we will use for our language model. For now, we will use the human ab initio cDNA set, since it is small (about 20 MB compressed).

In [3]:
!wget http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.abinitio.fa.gz
!gunzip Homo_sapiens.GRCh38.cdna.abinitio.fa.gz

--2021-03-15 06:23:18--  http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.abinitio.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.139
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20601482 (20M) [application/octet-stream]
Saving to: ‘Homo_sapiens.GRCh38.cdna.abinitio.fa.gz’


2021-03-15 06:23:50 (633 KB/s) - ‘Homo_sapiens.GRCh38.cdna.abinitio.fa.gz’ saved [20601482/20601482]



Parse the sequences from the FASTA file into a list. Also convert everything to lowercase in case the file mixes upper and lower case. We don't want our model to learn that distinction.

In [44]:
with open("Homo_sapiens.GRCh38.cdna.abinitio.fa", "rt") as handle:
  txts = L(str(record.seq).lower() for record in SeqIO.parse(handle, "fasta"))
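To make the parsing step less of a black box, here is a minimal sketch of what reading FASTA records and lowercasing them involves. It uses only the standard library; `read_fasta` and the two demo records are hypothetical and are not part of Biopython or of the Ensembl file.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines of one record may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record input with mixed case, for illustration only.
demo = io.StringIO(">seq1 demo\nATGGa\naagN\n>seq2\nccGT\n")
seqs = [seq.lower() for _, seq in read_fasta(demo)]
print(seqs)
```

For real files `SeqIO.parse` remains the better choice, since it also handles the many format variants this sketch ignores.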

We have 51,756 sequences, totalling 64,739,432 characters.

In [45]:
print(len(txts))
print(len(''.join(txts)))

51756
64739432
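Joining all sequences builds one very large temporary string just to count characters. A small sketch that sums the lengths instead, and also reports a few basic statistics, could look like this. `corpus_stats` and the demo list are hypothetical names used for illustration.

```python
def corpus_stats(seqs):
    """Summarise sequence lengths without concatenating the sequences."""
    lengths = [len(s) for s in seqs]
    return {
        "n": len(lengths),
        "total": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }

# Hypothetical stand-in for txts.
demo = ["atgc", "atgcatgc", "at"]
stats = corpus_stats(demo)
print(stats)
```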


Let's look at the first sequence.

In [46]:
txts[0]

'atggaaagaggaaagaagaaaagaatttccaataagttacaacaaacttttcaccattctaaagaacccactttccttatcaaccaagctgggcttctctctagtgactcctattctagcctttccccagaaacagagagtgttaatcctggtgaaaatataaagacagacactcagaaaaagagacctgggactgtgatactatcaaaactgtcaagtagaagaattatatcggaaagccagcttagcccccctgtgatcccggcccgcaggcctggattccgggtatgctatatctgtggccgagaatttgggtcccagtcaattgccattcatgaaccccagtgcttgcagaagtggcatattgaaaacagcaagttgcccaagcatttgaggaggccagaaccctccaaaccacagtctctcagcagcagtgggtcctacagtcttcaggcaactaacgaggctgcatttcagagtgcccaggctcagctgctgccctgtgaatcctgtggccgcacattcttgccagatcatcttcttgttcatcacagaagctgcaagccaaagggtgagggtcccagagcaccacactcaaacagttctgatcatcttactggcctcaagaaagcttgtagtggaaccccagcccgaccaaggactgttatctgctacatatgtggtaaggaatttggcaccctgtcccttcctattcatgagcccaaatgcctggaaaagtggaaaatggaaaatgaccggctccctgtggagctccaccagccactcccacagaagcctcagccccttccgaatgcacagtccagccaagcgggaccaaatcaagctcagcttgtgttctgcccacattgtagccgaatctttacctcagaccgcctcctggtacaccagagaagttgtaaaactcatccttatgggccaaaatatcagaatttgaatttagggagtaaaggaggcctaaaagagtacactaattccaag

Set aside the sequence at index 1001 for later testing. It lies outside the first 1000 sequences, which are the ones used for training.

In [47]:
txt = txts[1001]

For even quicker work, let's use just the first 1000 sequences for training.

In [48]:
txts = txts[:1000]
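The order of the two cells above matters: the test sequence has to be taken before the list is truncated, and its index has to lie outside the training slice. A small sketch that makes both conditions explicit is shown here. `split_holdout` is a hypothetical helper, not a fastai function.

```python
def split_holdout(seqs, n_train, holdout_idx):
    """Return (training sequences, one held-out sequence)."""
    # The held-out index must not fall inside the training slice.
    if not n_train <= holdout_idx < len(seqs):
        raise IndexError("holdout_idx must be outside the training slice and inside the list")
    return seqs[:n_train], seqs[holdout_idx]

# Hypothetical stand-in for txts.
demo = [f"seq{i}" for i in range(10)]
train, held_out = split_holdout(demo, n_train=5, holdout_idx=6)
print(len(train), held_out)
```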

# Language model in fastai

## Tokenization

In [49]:
VOCAB_SIZE = 1000
tokenizer = SubwordTokenizer(vocab_sz=VOCAB_SIZE)
tokenizer.setup(txts)

{'sp_model': Path('tmp/spm.model')}
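To get some intuition for how a subword vocabulary can be learned from raw DNA, here is a toy byte-pair-encoding (BPE) sketch. This is an assumption-laden illustration: as far as I know, SentencePiece trains a unigram model by default, which is a different algorithm, so this is not what `SubwordTokenizer` runs internally. All function names here are hypothetical.

```python
from collections import Counter

def merge_pair(toks, a, b):
    """Replace every adjacent (a, b) pair in toks with the merged symbol a+b."""
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

def learn_bpe(seqs, vocab_size):
    """Greedily merge the most frequent adjacent pair until vocab_size is reached."""
    corpus = [list(s) for s in seqs]  # start from single characters
    vocab = {ch for toks in corpus for ch in toks}
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:  # nothing left to merge
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        corpus = [merge_pair(toks, a, b) for toks in corpus]
    return merges, vocab

def encode(seq, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    toks = list(seq)
    for a, b in merges:
        toks = merge_pair(toks, a, b)
    return toks

# Hypothetical tiny corpus; the real notebook trains on 1000 sequences.
merges, vocab = learn_bpe(["atgatgatg", "atgcccatg"], vocab_size=8)
toks = encode("atgatgatc", merges)
print(toks)
```

Whatever the algorithm, the useful property is the same: concatenating the tokens gives back the original sequence, so no information is lost.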

In [51]:
txt[:100]

'atgcatcagcagccttcggcaaaggggaaacaccgtgcagcagggctgacttggcaaagaggcacccccagaggaatgcgagggcttagagctgcaggca'

In [21]:
toks = first(tokenizer([txt]))
print(coll_repr(toks, 30))

(#140) ['▁atg','catca','gcag','ccttc','ggcaaa','gg','ggaaa','cacc','gtgcag','cagg','gctg','act','tggc','aaaga','ggcacc','cccag','aggaa','tgc','gaggg','cttag','agctg','caggca','atgcag','ttgct','acac','ggta','tctcag','tgt','caaga','atca'...]


Wrap the `SubwordTokenizer` in a `Tokenizer`. I'm not sure why this is needed, but I wasn't able to run it without this step.

I set `rules=[]` so that no default rules are applied, especially no encoding of repeated characters.

In the future, a custom tokenizer with just a special token for the start of a sequence, and one for the unknown base `N`, would be nice.

In [57]:
tkn = Tokenizer(tokenizer, rules=[], sep='')
print(coll_repr(tkn(txt), 31))

(#140) ['▁atg','catca','gcag','ccttc','ggcaaa','gg','ggaaa','cacc','gtgcag','cagg','gctg','act','tggc','aaaga','ggcacc','cccag','aggaa','tgc','gaggg','cttag','agctg','caggca','atgcag','ttgct','acac','ggta','tctcag','tgt','caaga','atca','acag'...]
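To see why disabling the default rules matters for DNA, here is a simplified stand-in for a repeated-character rule. It is a sketch written for this notebook, not fastai's own code: it replaces any run of three or more identical characters with an `xxrep` marker, the count, and the character.

```python
import re

# A non-space character followed by two or more copies of itself.
_REP = re.compile(r"(\S)(\1{2,})")

def replace_rep_sketch(text):
    """Rewrite runs of 3+ identical characters as ' xxrep <count> <char> '."""
    def _sub(match):
        char, rest = match.groups()
        return f" xxrep {len(rest) + 1} {char} "
    return _REP.sub(_sub, text)

print(replace_rep_sketch("atggaaagag"))
```

Homopolymer runs such as `aaa` are ordinary content in a genomic sequence, so a rule like this would break the sequence apart before the subword tokenizer ever sees it.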


In [58]:
toks_all = txts.map(tkn)
toks_all[0]

['▁atgg',
 'aaaga',
 'ggaa',
 'agaaga',
 'aaagaa',
 'tttcca',
 'ataa',
 'gttac',
 'aaca',
 'aacttt',
 'tcacca',
 'ttct',
 'aaagaa',
 'ccca',
 'ctttcc',
 'ttat',
 'caa',
 'ccaag',
 'ctggg',
 'cttct',
 'ctct',
 'agtga',
 'ctccta',
 'ttcta',
 'gcct',
 'ttccc',
 'cagaaa',
 'cagag',
 'agtg',
 'ttaa',
 'tcctgg',
 'tgaaa',
 'atat',
 'aaaga',
 'cagaca',
 'ctc',
 'agaaa',
 'aagaga',
 'cctggg',
 'ac',
 'tgtga',
 'tactat',
 'caaa',
 'actg',
 'tcaagt',
 'agaaga',
 'att',
 'atat',
 'cgg',
 'aaagcca',
 'gctt',
 'agcc',
 'cccc',
 'tgtga',
 'tcccg',
 'gccc',
 'gcagg',
 'cctgg',
 'att',
 'ccggg',
 'tatg',
 'ctat',
 'atct',
 'gtggcc',
 'gaga',
 'attt',
 'gggt',
 'cccag',
 'tcaa',
 'ttgcc',
 'att',
 'catgaac',
 'cccag',
 'tgct',
 'tg',
 'cagaag',
 'tggcat',
 'attga',
 'aaa',
 'cagcaa',
 'gttg',
 'cccaag',
 'cattt',
 'gaggag',
 'gccag',
 'aacc',
 'ctcc',
 'aaac',
 'cacag',
 'tctct',
 'cagcagc',
 'agtgg',
 'gtcc',
 'tacag',
 'tctt',
 'cagg',
 'caac',
 'taa',
 'cgaggc',
 'tgc',
 'attt',
 'cagag',
 'tgccca',
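The custom tokenizer idea mentioned earlier in this section could start from something like the sketch below. It assumes fixed-length, non-overlapping k-mers instead of learned subwords, and the names `BOS`, `UNK` and `kmer_tokenize` are hypothetical. Nothing here is wired into fastai yet.

```python
BOS = "<s>"    # hypothetical start-of-sequence token
UNK = "<unk>"  # hypothetical token for k-mers containing an unknown base

def kmer_tokenize(seq, k=3):
    """Split a sequence into non-overlapping k-mers, prefixed with BOS."""
    seq = seq.lower()
    tokens = [BOS]
    for i in range(0, len(seq), k):
        kmer = seq[i:i + k]  # the last k-mer may be shorter than k
        # Any character outside a/c/g/t (for example n) marks the k-mer as unknown.
        tokens.append(kmer if set(kmer) <= set("acgt") else UNK)
    return tokens

print(kmer_tokenize("ATGNNACGT"))
print(kmer_tokenize("atgca"))
```

With only one start token and one unknown token, the vocabulary would stay free of the natural-language special tokens that fastai adds by default.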

## Numericalization

Let's translate tokens into numbers.

The vocabulary below contains fastai's special `xx` tokens. That is OK for now, but it would be nice to get rid of them in the future.

In [59]:
num = Numericalize()
num.setup(toks_all)
coll_repr(num.vocab,20)

"(#1000) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','tga','taa','tag','gca','cacc','ga','tcctca','ctga','tcag','atga','gtcc'...]"

This is the numerical representation of the first 20 tokens of our test sequence:

In [63]:
nums = num(toks)[:20]; nums

TensorText([162, 181, 170,  38, 689, 243,  53,  13, 878,  27, 208, 160, 149,  73,
        373,  21, 677, 147, 226, 843])
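The token-to-id mapping shown above can be sketched in plain Python as follows. This is a simplified illustration under my own assumptions (a frequency-ordered vocabulary with a few special tokens in front), not fastai's `Numericalize` implementation. `build_vocab`, `numericalize` and `decode` are hypothetical names.

```python
from collections import Counter

# A subset of the special token names that appear in the vocabulary above.
SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos"]

def build_vocab(token_lists, max_vocab=1000):
    """Special tokens first, then the most frequent tokens in descending order."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, _ in counts.most_common(max_vocab) if tok not in SPECIALS]
    return SPECIALS + frequent

def numericalize(tokens, vocab):
    """Map tokens to ids; tokens missing from the vocabulary map to xxunk (id 0)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

def decode(ids, vocab):
    """Map ids back to tokens."""
    return [vocab[i] for i in ids]

# Hypothetical tiny corpus of already tokenized sequences.
vocab = build_vocab([["atg", "gca", "tga"], ["atg", "tga", "atg"]])
ids = numericalize(["atg", "ccc", "tga"], vocab)
print(vocab)
print(ids, decode(ids, vocab))
```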

Let's convert it back to letters.

In [64]:
''.join(num.vocab[o] for o in nums)

'▁atgcatcagcagccttcggcaaaggggaaacaccgtgcagcagggctgacttggcaaagaggcacccccagaggaatgcgagggcttag'

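And compare with the original sequence. Apart from the leading `▁` that SentencePiece adds, the decoded text matches the start of the original.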

In [65]:
txt[:100]

'atgcatcagcagccttcggcaaaggggaaacaccgtgcagcagggctgacttggcaaagaggcacccccagaggaatgcgagggcttagagctgcaggca'
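Comparing two long strings by eye is error-prone, so the round trip can also be checked programmatically. The sketch below strips the `▁` marker and tests whether the decoded text is a prefix of the original. `matches_prefix` is a hypothetical helper, and the demo strings are built from the first three tokens and the first 20 characters shown above.

```python
SP_MARKER = "\u2581"  # the '▁' character SentencePiece puts at the start of a word

def matches_prefix(decoded, original):
    """True if decoded, with the marker removed, is a prefix of original."""
    plain = decoded.replace(SP_MARKER, "")
    return original.startswith(plain)

decoded = SP_MARKER + "atg" + "catca" + "gcag"  # first three tokens joined
original = "atgcatcagcagccttcggc"               # first 20 characters of the test sequence
print(matches_prefix(decoded, original))
```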