<a href="https://colab.research.google.com/github/katarinagresova/GLP/blob/main/examples/Genomic_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

## Needed for Google Colab

In [None]:
!pip install biopython
!pip install fastai --upgrade
!pip install "sentencepiece!=0.1.90,!=0.1.91"

## And this everywhere

In [None]:
!pip show fastai

Name: fastai
Version: 2.2.7
Summary: fastai simplifies training fast and accurate neural nets using modern best practices
Home-page: https://github.com/fastai/fastai/tree/master/
Author: Jeremy Howard, Sylvain Gugger, and contributors
Author-email: info@fast.ai
License: Apache Software License 2.0
Location: /usr/local/lib/python3.7/dist-packages
Requires: torch, fastprogress, packaging, scipy, fastcore, pyyaml, requests, torchvision, matplotlib, pandas, scikit-learn, spacy, pip, pillow
Required-by: 


In [None]:
!pip show biopython

Name: biopython
Version: 1.78
Summary: Freely available tools for computational molecular biology.
Home-page: https://biopython.org/
Author: The Biopython Contributors
Author-email: biopython@biopython.org
License: UNKNOWN
Location: /usr/local/lib/python3.7/dist-packages
Requires: numpy
Required-by: 


In [None]:
from Bio import SeqIO
from fastai.text.all import *

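Check that a GPU is available.

In [None]:
import torch
# Query the device name only when CUDA is available; get_device_name(0) raises otherwise.
torch.cuda.is_available(), torch.cuda.device_count(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else None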

(True, 1, 'Tesla T4')

Download the file with utilities for data preparation.

In [None]:
!wget https://raw.githubusercontent.com/katarinagresova/GLP/main/src/glp/utils.py

--2021-03-18 06:19:43--  https://raw.githubusercontent.com/katarinagresova/GLP/main/src/glp/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1440 (1.4K) [text/plain]
Saving to: ‘utils.py’


2021-03-18 06:19:43 (38.2 MB/s) - ‘utils.py’ saved [1440/1440]



# Data preparation

Get the data that we will use for our language model. For now, we will use human ab initio cDNA, since it is small enough.

In [None]:
!wget http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.abinitio.fa.gz
!gunzip Homo_sapiens.GRCh38.cdna.abinitio.fa.gz

--2021-03-18 06:20:06--  http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.abinitio.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.139
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20601482 (20M) [application/octet-stream]
Saving to: ‘Homo_sapiens.GRCh38.cdna.abinitio.fa.gz’


2021-03-18 06:20:49 (473 KB/s) - ‘Homo_sapiens.GRCh38.cdna.abinitio.fa.gz’ saved [20601482/20601482]



Let's create the folder structure expected for binary classification:
 - root/train/0
 - root/train/1
 - root/valid/0
 - root/valid/1

Then parse our FASTA file so that each sequence is stored in its own txt file.

In [None]:
import utils

ROOT_DIR = 'data/cdna/'
utils.prepare_folder_structure(ROOT_DIR)
utils.split_fasta_to_txts('Homo_sapiens.GRCh38.cdna.abinitio.fa', ROOT_DIR, '1')
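
The contents of `utils.py` are not shown here, so the following is only a rough sketch of what such preparation could look like, not the actual implementation. The helper names (`make_class_folders`, `read_fasta`, `fasta_to_txt_files`), the random train/valid split and the `valid_fraction` parameter are all assumptions made for illustration.

```python
import random
from pathlib import Path

def make_class_folders(root, splits=("train", "valid"), labels=("0", "1")):
    """Create root/<split>/<label> directories (hypothetical helper)."""
    for split in splits:
        for label in labels:
            Path(root, split, label).mkdir(parents=True, exist_ok=True)

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file using plain text parsing."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def fasta_to_txt_files(fasta_path, root, label, valid_fraction=0.2, seed=42):
    """Write each sequence to its own .txt file under train/ or valid/.

    The random split and its fraction are assumptions, not the notebook's method.
    """
    rng = random.Random(seed)
    counts = {"train": 0, "valid": 0}
    for i, (_, seq) in enumerate(read_fasta(fasta_path)):
        split = "valid" if rng.random() < valid_fraction else "train"
        Path(root, split, label, f"{i}.txt").write_text(seq)
        counts[split] += 1
    return counts
```

With these helpers, `make_class_folders('data/cdna/')` followed by `fasta_to_txt_files('Homo_sapiens.GRCh38.cdna.abinitio.fa', 'data/cdna/', '1')` would produce a layout similar to the one listed above.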

# Data loaders

In [None]:
BATCH_SIZE = 2048
SEQ_LEN = 50
VOCAB_SIZE = 10000
dls_lm = TextDataLoaders.from_folder(
    Path(ROOT_DIR), 
    bs=BATCH_SIZE, 
    seed=42, 
    is_lm=True, 
    tok_tfm=Tokenizer(SubwordTokenizer(vocab_sz=VOCAB_SIZE), rules=[], sep=''), 
    seq_len=SEQ_LEN
)

In [None]:
dls_lm.show_batch()

Unnamed: 0,text,text_
0,▁ATGTCCAACAACATGGCCAAGATTGCCGAGGCCCGCAAGACGGTGGAACAGCTGAAGCTGGAGGTGAACATCGACCGCATGAAGGTGTCGCAGGCAGCAGCGGAACTCCTGGCTTTCT,ATGTCCAACAACATGGCCAAGATTGCCGAGGCCCGCAAGACGGTGGAACAGCTGAAGCTGGAGGTGAACATCGACCGCATGAAGGTGTCGCAGGCAGCAGCGGAACTCCTGGCTTTCTGC
1,GCAAGTGGCGCGGCAGGCCAAGGCCTTCCTGTCGCTGGGGAAGATGGCCGAGGTGCAGGTGAGCCGGCGCCGGGCCGGCGGCGCGCAGTCCTGGCTGTGG,GTGGCGCGGCAGGCCAAGGCCTTCCTGTCGCTGGGGAAGATGGCCGAGGTGCAGGTGAGCCGGCGCCGGGCCGGCGGCGCGCAGTCCTGGCTGTGGTTC
2,CCTGTCATTCTTTTCCAAAAAATGGGAGTAGGTAAACTTGAGATGTATGTGCTTAATCCAGTCAAGAGCAGCAAGGAAATGCAGTATTTTATGCAGCAGTGGACTGGTACCAACA,TGTCATTCTTTTCCAAAAAATGGGAGTAGGTAAACTTGAGATGTATGTGCTTAATCCAGTCAAGAGCAGCAAGGAAATGCAGTATTTTATGCAGCAGTGGACTGGTACCAACAA
3,AGGGGCTGCTGCTGCTGCTGGGAATCTTCCTTGCTTATGAGACCAAGAGTGTGTCCACTGAGAAGATCAATGATCACCGGGCTGTGGGCATGGCTATCTACAATGTGGCAGTCCTGTGCCTC,GGGCTGCTGCTGCTGCTGGGAATCTTCCTTGCTTATGAGACCAAGAGTGTGTCCACTGAGAAGATCAATGATCACCGGGCTGTGGGCATGGCTATCTACAATGTGGCAGTCCTGTGCCTCATCAC
4,TTCACAGTCATCACGAACATCATCACCGCCACCTTAACCATCATTGCCAACATCACTACCATCACTACCACCACCACTGTTACTACTATCTGA▁ATGGTTCATGATGCTGTA,ACAGTCATCACGAACATCATCACCGCCACCTTAACCATCATTGCCAACATCACTACCATCACTACCACCACCACTGTTACTACTATCTGA▁ATGGTTCATGATGCTGTACCA
5,TGAAGTGGTCCTCAGATTTCAGACGGTTCAGGTTCCTGGTGGAACCGAAGACAGCAAAGATAAGGTGCTGGTGATCAGCCTCTACTTCCTCAGGTATATCCAG,GAAGTGGTCCTCAGATTTCAGACGGTTCAGGTTCCTGGTGGAACCGAAGACAGCAAAGATAAGGTGCTGGTGATCAGCCTCTACTTCCTCAGGTATATCCAGGAAA
6,GATTGATTGCCTGCTTGCCCAAAAGGTTCGCCCCAGGAGGTGGAAACTTCAAGTGCTGGAAATGCGGGATGTTGATGAGAATTTTTGGACCATATGGTCTGGAGCCAGGCTCCTGTCCTGC,TTGATTGCCTGCTTGCCCAAAAGGTTCGCCCCAGGAGGTGGAAACTTCAAGTGCTGGAAATGCGGGATGTTGATGAGAATTTTTGGACCATATGGTCTGGAGCCAGGCTCCTGTCCTGCTCCC
7,GAGAGCAGCTGGATATCCTGAGTGTTGGAATCCTAGTGAAAGAAAGATGGAAAGTGTTGAGAAAGATTGGGGGTGGGGGCTTTGGAGAAATTTACGATGCCTTGGACATGCTCACCAGGGAAAATGTT,AGAGCAGCTGGATATCCTGAGTGTTGGAATCCTAGTGAAAGAAAGATGGAAAGTGTTGAGAAAGATTGGGGGTGGGGGCTTTGGAGAAATTTACGATGCCTTGGACATGCTCACCAGGGAAAATGTTGC
8,GTGGCAAAGGCCAAAGGCCCCAAGCTGTTGGCACCGGAAACGTCGAGGTGGAGGACGCCATGCTGGACACCTACGACCTGGTATATGAGCAGGCGATGAAAGGT,TGGCAAAGGCCAAAGGCCCCAAGCTGTTGGCACCGGAAACGTCGAGGTGGAGGACGCCATGCTGGACACCTACGACCTGGTATATGAGCAGGCGATGAAAGGTAC


TODO: there are some spaces in the sequences, do something about it? Or better, why is there a space at the beginning of a sequence?
TODO: why are the tokens not separated?

# Language model

I am using an existing model (AWD_LSTM) for now, because when I tried to create my own model and run it in Colab on a GPU, I got a CUDA/CPU device mismatch.

In [None]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, pretrained=False, 
    metrics=[accuracy, Perplexity()])

# Training

In [None]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.514877,3.496706,0.076161,33.00655,30:55


We get an accuracy of 7.6% when predicting the next token in a sequence out of a vocabulary of 10000 tokens.

TODO: compare with random selection or selection based on most frequent token.
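
A minimal sketch of the two baselines mentioned in the TODO, assuming `tokens` is a flat list of validation tokens (the toy list below is made up and is not the notebook's data). Guessing uniformly at random over a vocabulary of size V is correct with probability 1/V, and always predicting the most frequent token is correct as often as that token occurs. The last line also checks that the logged perplexity is the exponential of the validation loss.

```python
import math
from collections import Counter

def baseline_accuracies(tokens, vocab_size):
    """Return (uniform-guess accuracy, most-frequent-token accuracy)."""
    counts = Counter(tokens)
    uniform = 1.0 / vocab_size
    majority = counts.most_common(1)[0][1] / len(tokens)
    return uniform, majority

# Toy token stream, for illustration only (not the notebook's data).
toy_tokens = ["tga", "ag", "tga", "atg", "tga", "ag"]
uniform, majority = baseline_accuracies(toy_tokens, vocab_size=10000)
print(f"uniform: {uniform:.4%}, majority: {majority:.2%}")

# Perplexity is exp(cross-entropy loss); compare with the logged value of about 33.0.
print(math.exp(3.496706))
```

For a 10000-token vocabulary the uniform baseline is 0.01%, so 7.6% is well above random guessing. The majority baseline has to be computed on the real validation tokens before drawing conclusions.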

# Exploring tokens

Parse the sequences from the FASTA file into a list. Also convert everything to lowercase in case upper and lowercase are mixed. We don't want our model to learn that.

In [None]:
with open("Homo_sapiens.GRCh38.cdna.abinitio.fa", "rt") as handle:
  txts = L(str(record.seq).lower() for record in SeqIO.parse(handle, "fasta"))

We have 51 756 sequences, with 64 739 432 characters in total.

In [None]:
print(len(txts))
print(len(''.join(txts)))

51756
64739432


Let's look at the first sequence.

In [None]:
txts[0]

'atggaaagaggaaagaagaaaagaatttccaataagttacaacaaacttttcaccattctaaagaacccactttccttatcaaccaagctgggcttctctctagtgactcctattctagcctttccccagaaacagagagtgttaatcctggtgaaaatataaagacagacactcagaaaaagagacctgggactgtgatactatcaaaactgtcaagtagaagaattatatcggaaagccagcttagcccccctgtgatcccggcccgcaggcctggattccgggtatgctatatctgtggccgagaatttgggtcccagtcaattgccattcatgaaccccagtgcttgcagaagtggcatattgaaaacagcaagttgcccaagcatttgaggaggccagaaccctccaaaccacagtctctcagcagcagtgggtcctacagtcttcaggcaactaacgaggctgcatttcagagtgcccaggctcagctgctgccctgtgaatcctgtggccgcacattcttgccagatcatcttcttgttcatcacagaagctgcaagccaaagggtgagggtcccagagcaccacactcaaacagttctgatcatcttactggcctcaagaaagcttgtagtggaaccccagcccgaccaaggactgttatctgctacatatgtggtaaggaatttggcaccctgtcccttcctattcatgagcccaaatgcctggaaaagtggaaaatggaaaatgaccggctccctgtggagctccaccagccactcccacagaagcctcagccccttccgaatgcacagtccagccaagcgggaccaaatcaagctcagcttgtgttctgcccacattgtagccgaatctttacctcagaccgcctcctggtacaccagagaagttgtaaaactcatccttatgggccaaaatatcagaatttgaatttagggagtaaaggaggcctaaaagagtacactaattccaag

Set the first sequence aside for later testing.

In [None]:
txt = txts[0]

To work even faster, let's use just 10000 sequences for training.

In [None]:
txts = txts[1:10001]

## Tokenization

Create a sub-word tokenizer and let it build a vocabulary of tokens from our input data.

In [None]:
VOCAB_SIZE = 10000
tokenizer = SubwordTokenizer(vocab_sz=VOCAB_SIZE)
tokenizer.setup(txts)

{'sp_model': Path('tmp/spm.model')}

Just to verify that we get reasonable tokens, split the test sequence into tokens.

In [None]:
toks = first(tokenizer([txt]))
print(coll_repr(toks, 30))

(#176) ['▁atgg','aaagagga','aagaagaaa','agaattt','ccaat','aagtt','acaacaaa','cttttc','acca','ttctaaa','gaacccac','tttcctt','atcaac','caagctg','ggcttc','tctct','agtga','ctccta','ttctag','cctttccc','cagaaa','cagagag','tgttaa','tcctgg','tgaaaat','ataaaga','cagaca','ctc','agaaaaaga','gacctggg'...]


And print the first 100 characters of our test sequence to compare them with the tokens.

In [None]:
txt[:100]

'atggaaagaggaaagaagaaaagaatttccaataagttacaacaaacttttcaccattctaaagaacccactttccttatcaaccaagctgggcttctct'
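
The visual comparison can also be done programmatically. The sketch below joins the tokens back together and drops the `▁` marker, which SentencePiece uses to mark a word boundary. The helper name `detokenize` is an assumption, and the sample tokens are the first five from the output above.

```python
def detokenize(tokens, marker="\u2581"):
    """Join sub-word tokens and drop the SentencePiece word-boundary marker."""
    return "".join(tokens).replace(marker, "")

# First tokens of the test sequence, copied from the tokenizer output above.
sample_toks = ["▁atgg", "aaagagga", "aagaagaaa", "agaattt", "ccaat"]
prefix = "atggaaagaggaaagaagaaaagaatttccaat"

# The joined tokens should reproduce the start of the sequence exactly.
assert detokenize(sample_toks) == prefix
```

In the notebook itself, `detokenize(toks) == txt` should hold if the tokenizer neither drops nor alters any characters.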

Wrap the `SubwordTokenizer` in a `Tokenizer`. I am not sure why this is needed, but I wasn't able to run it without this step.

I set `rules=[]` so that no default rules are applied, especially no encoding of repeated characters.

In the future, a custom tokenizer with just a special token for the start of a sequence, and one for an unknown base (N), might be nice.

In [None]:
tkn = Tokenizer(tokenizer, rules=[], sep='')
print(coll_repr(tkn(txt), 31))

(#176) ['▁atgg','aaagagga','aagaagaaa','agaattt','ccaat','aagtt','acaacaaa','cttttc','acca','ttctaaa','gaacccac','tttcctt','atcaac','caagctg','ggcttc','tctct','agtga','ctccta','ttctag','cctttccc','cagaaa','cagagag','tgttaa','tcctgg','tgaaaat','ataaaga','cagaca','ctc','agaaaaaga','gacctggg','actgtga'...]
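
As a rough illustration of the custom tokenizer idea, here is a plain-Python sketch that is not wired into fastai. It swaps the sub-word vocabulary for fixed, non-overlapping k-mers, which is a different technique from the one used in this notebook. The `<bos>` token, the function names and the default `k=3` are all assumptions.

```python
import re

BOS = "<bos>"  # hypothetical start-of-sequence token

def normalize_bases(seq):
    """Lowercase the sequence and map anything other than a/c/g/t to 'n'."""
    return re.sub(r"[^acgt]", "n", seq.lower())

def kmer_tokenize(seq, k=3):
    """Split a sequence into non-overlapping k-mers, prefixed with BOS."""
    seq = normalize_bases(seq)
    return [BOS] + [seq[i:i + k] for i in range(0, len(seq), k)]

# 'R' is an ambiguity code, so it is mapped to the unknown base 'n'.
print(kmer_tokenize("ATGGAARGA"))  # ['<bos>', 'atg', 'gaa', 'nga']
```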


In [None]:
toks_all = txts.map(tkn)

## Frequency analysis of tokens

Put tokens from all of our training sequences into one big list.

In [None]:
from operator import add

tokens = reduce(add, toks_all)

Our sequences were split into 1 980 816 tokens.

In [None]:
len(tokens)

1980816
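
Concatenating with `reduce(add, ...)` builds a new list at every step, which can get slow for many sequences. A possible alternative, sketched here on toy data, is `itertools.chain.from_iterable`. Note that it returns a plain Python list rather than a fastai `L`. The helper name `flatten` is an assumption.

```python
from functools import reduce
from itertools import chain
from operator import add

def flatten(token_lists):
    """Concatenate an iterable of token lists into one flat Python list."""
    return list(chain.from_iterable(token_lists))

# Toy data, for illustration only.
toy = [["atg", "gaa"], ["tga"], ["ctga", "ag"]]

# Both approaches give the same flat list; chain does it in a single pass.
assert flatten(toy) == reduce(add, toy)
print(len(flatten(toy)))  # 5
```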

Print the 10 most common tokens.

In [None]:
import collections

elements_count = collections.Counter(tokens)
print(elements_count.most_common(10))

[('tga', 7306), ('ctga', 3823), ('tag', 3709), ('ag', 3678), ('taa', 3328), ('ttga', 2974), ('atga', 2939), ('agtga', 2924), ('▁atg', 2798), ('tgtga', 2792)]


The most common token is 'tga', which is a stop codon. The start codon (atg) also appears in the top 10, in two tokens:
 - 'atga'
 - '▁atg'
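
The same observation can be checked in code. The sketch below uses the top-10 counts and the token total printed above, and looks for tokens that end with a stop codon or contain the start codon. A token ending in 'tga' is not necessarily an in-frame stop codon, so this is only a rough indication.

```python
# Top-10 counts copied from the output above.
top10 = [('tga', 7306), ('ctga', 3823), ('tag', 3709), ('ag', 3678), ('taa', 3328),
         ('ttga', 2974), ('atga', 2939), ('agtga', 2924), ('▁atg', 2798), ('tgtga', 2792)]
TOTAL_TOKENS = 1980816  # token count reported earlier in this section
STOP_CODONS = ("tga", "tag", "taa")

# Tokens whose last three characters form a stop codon.
ends_with_stop = [tok for tok, _ in top10 if tok.endswith(STOP_CODONS)]
# Tokens that contain the start codon anywhere.
contains_start = [tok for tok, _ in top10 if "atg" in tok]
# Share of all tokens taken by the single most common token.
share_tga = top10[0][1] / TOTAL_TOKENS

print(len(ends_with_stop), ends_with_stop)
print(contains_start)
print(f"{share_tga:.2%}")
```

Eight of the ten most common tokens end with a stop codon, and even the most frequent one, 'tga', accounts for only about 0.37% of all tokens.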

TODO: remove spaces and try again