# LncRNA classification with an RNA language model
This example notebook shows the basic functionality of the `lncrnapy`
package and how to use it to train an RNA language model for lncRNA 
classification. 

## Data
Let us start by loading some sequence data. The `Data` object accepts either a 
single FASTA file (for unlabelled data) or a list of two FASTA files (for pcRNA
and lncRNA, respectively).

In [1]:
from lncrnapy.data import Data

data_dir = '/data/s2592800/data' # Change this for your setup

pretrain_data = Data([f'{data_dir}/sequences/pretrain_human_pcrna.fasta',
                      f'{data_dir}/sequences/pretrain_human_ncrna.fasta'])

print(pretrain_data)

Importing data...
Imported 297724 protein-coding and 238470 non-coding RNA transcripts with 0 feature(s).
                                                       id  \
0       ENST00000676272.1|ENSG00000087053.20|OTTHUMG00...   
1       ENST00000676132.1|ENSG00000133424.22|OTTHUMG00...   
2       ENST00000504953.5|ENSG00000196104.11|OTTHUMG00...   
3       ENST00000677479.1|ENSG00000168610.17|OTTHUMG00...   
4       ENST00000513066.3|ENSG00000113407.14|OTTHUMG00...   
...                                                   ...   
536189                                        NR_026711.1   
536190                                        NR_189643.1   
536191                                        NR_026710.1   
536192                                        NR_027231.1   
536193                                        NR_130733.1   

                                                 sequence  label  
0       AGCCTACAGGCGCGGTGCACTCTGGGGGAACATGGCCGCTTCCGGT...  pcRNA  
1       AAGGATCCTCATGGCAGCA
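
To make the two-file convention concrete, here is a minimal sketch of how two FASTA sources could be combined into one labelled table. It is not the `lncrnapy` implementation: `read_fasta` and `load_labelled` are hypothetical helpers, the toy sequences are made up, and the `'ncRNA'` label string is an assumption (only `'pcRNA'` is visible in the output above).

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # Sequences may span several lines
    if header is not None:
        yield header, ''.join(chunks)

def load_labelled(pcrna_handle, ncrna_handle):
    """Combine two FASTA sources into one list of labelled records."""
    records = []
    for label, handle in (('pcRNA', pcrna_handle), ('ncRNA', ncrna_handle)):
        for seq_id, sequence in read_fasta(handle):
            records.append({'id': seq_id, 'sequence': sequence, 'label': label})
    return records

# Toy in-memory "files" (hypothetical sequences, for illustration only)
pcrna = io.StringIO('>tx1\nAUGGCC\nUAA\n>tx2\nAUGAAA\n')
ncrna = io.StringIO('>nc1\nGGGCCC\n')
records = load_labelled(pcrna, ncrna)
for record in records:
    print(record)
```

The label comes from which file a sequence was read from, which is why the order of the two files matters.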

## Encoding
Next, we encode the data into a numeric Tensor format that is compatible with 
the neural network that we wish to use. For Motif Encoding, we must encode the 
data into a four-dimensional representation, i.e. a tensor with four rows (one 
per nucleotide type) and one column per sequence position.

In [2]:
pretrain_data.set_tensor_features('4D-DNA')

To illustrate what we just did, let us sample a sequence by indexing.

In [3]:
sequence, label = pretrain_data[0] # Sample the first sequence
print("Sequence:")
print(sequence)
print("Label (0=ncRNA, 1=pcRNA):")
print(label)

Sequence:
tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')
Label (0=ncRNA, 1=pcRNA):
tensor([1.], device='cuda:0')
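
The four rows printed above can be reproduced with a plain one-hot encoding. The sketch below is not the `lncrnapy` code: `one_hot_4d` is a hypothetical helper, and the row order A, C, G, T/U and the zero padding are assumptions. They are consistent with the first columns of the tensor above for a sequence starting with `AGC`, but they are not confirmed by the package.

```python
def one_hot_4d(sequence, length):
    """Encode a nucleotide string as 4 rows (A, C, G, T/U) x `length` columns."""
    rows = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'U': 3}  # Assumed row order
    encoded = [[0.0] * length for _ in range(4)]
    for position, base in enumerate(sequence.upper()[:length]):
        row = rows.get(base)
        if row is not None:  # Unknown bases such as N stay all-zero
            encoded[row][position] = 1.0
    return encoded

# Positions beyond the end of the sequence remain zero (padding)
encoded = one_hot_4d('AGCC', length=6)
for row in encoded:
    print(row)
```

Each column holds at most a single 1, which marks the nucleotide at that position. All-zero columns correspond to padding or unknown bases.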


## Neural architecture
Now we can define a model. The most important component of our model is its 
base architecture. `lncrnapy.modules` contains several architecture 
implementations, including CNNs (e.g. `ResNet`) and language models (e.g. 
`BERT`). Motif Encoding requires a special variant of BERT, which is 
implemented in the `MotifBERT` class.

In [4]:
from lncrnapy.modules import MotifBERT
base_arch = MotifBERT(n_motifs=4096, motif_size=9) 

`lncrnapy.modules` also contains several wrapper classes that encapsulate a
base architecture and add the required layers to perform tasks like 
classification, regression, and masked language modeling. For example, this is
how we turn our model into a classifier: 

In [None]:
from lncrnapy.modules import Classifier
from lncrnapy import utils

model = Classifier(base_arch)
model = model.to(utils.DEVICE) # Send the model to the GPU.
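
The wrapper idea can be illustrated without any deep learning library. In the sketch below, `ToyBaseArch` and `ToyClassifier` are hypothetical stand-ins, not `lncrnapy` classes: the base maps a sequence to a small embedding, and the wrapper adds a linear layer and a sigmoid on top. The nucleotide-frequency embedding is an assumption chosen for brevity and is unrelated to what `MotifBERT` computes.

```python
import math
import random

class ToyBaseArch:
    """Hypothetical base architecture: maps a sequence to an embedding."""
    d_model = 4

    def __call__(self, sequence):
        # Crude embedding: relative frequency of each nucleotide
        n = max(len(sequence), 1)
        return [sequence.count(base) / n for base in 'ACGU']

class ToyClassifier:
    """Wraps a base architecture and adds a linear layer plus sigmoid."""

    def __init__(self, base_arch, seed=0):
        rng = random.Random(seed)
        self.base_arch = base_arch  # Kept accessible so it can be reused
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(base_arch.d_model)]
        self.bias = 0.0

    def predict(self, sequence):
        embedding = self.base_arch(sequence)
        logit = sum(w * x for w, x in zip(self.weights, embedding)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))  # Sigmoid

base = ToyBaseArch()
classifier = ToyClassifier(base)
probability = classifier.predict('AUGGCCUAA')
print(probability)
```

Because the wrapper only holds a reference to the base, the same base object can later be placed inside a different wrapper without copying its weights.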

Now let's make a prediction on the validation dataset. 

In [6]:
# Load and encode the data
valid_data = Data([f'{data_dir}/sequences/valid_gencode_pcrna.fasta',
                   f'{data_dir}/sequences/valid_gencode_ncrna.fasta'])
valid_data.set_tensor_features('4D-DNA')

# Make a prediction
prediction = model.predict(valid_data)
print(prediction)

Importing data...
Imported 5583 protein-coding and 2998 non-coding RNA transcripts with 0 feature(s).


tensor([[0.4755],
        [0.4519],
        [0.4753],
        ...,
        [0.4651],
        [0.4455],
        [0.4553]])
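
To turn such scores into class labels, one common approach is to threshold them. The sketch below assumes that each score is the predicted probability of the pcRNA class (label 1) and that 0.5 is the decision threshold. Neither is confirmed by the output above. `to_labels` and `accuracy` are hypothetical helpers, and the `actual` labels are made up for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Map scores to class names, assuming the score is P(pcRNA)."""
    return ['pcRNA' if p >= threshold else 'ncRNA' for p in probabilities]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# The six scores visible in the output above
probabilities = [0.4755, 0.4519, 0.4753, 0.4651, 0.4455, 0.4553]
predicted = to_labels(probabilities)

# Hypothetical ground truth, for illustration only
actual = ['pcRNA', 'pcRNA', 'pcRNA', 'ncRNA', 'ncRNA', 'ncRNA']
print(predicted)
print(accuracy(predicted, actual))
```

Under these assumptions every score falls just below the threshold, so all six sequences receive the same label. That is typical of an untrained model, whose outputs cluster near 0.5.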


## Pre-training
The predictions made above are essentially random, as we have not trained our 
model yet. Language models are often pre-trained before being fine-tuned to 
perform a specific task. We shall do the same. 

First, we must wrap the base architecture in the appropriate wrapper class:

In [7]:
from lncrnapy.modules import MaskedMotifModel

model = MaskedMotifModel(base_arch).to(utils.DEVICE)

Now we can train it using the `train_masked_motif_modeling` function.

In [None]:
from lncrnapy.train import train_masked_motif_modeling

model, history = train_masked_motif_modeling(
    model, pretrain_data, valid_data, epochs=500
)

print(history) # Contains the performance at every epoch
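
The general idea behind masked pre-training is to hide part of the input and train the model to reconstruct it. The sketch below shows that idea in the style of BERT's masked language modeling. It is an analogy, not the procedure used by `train_masked_motif_modeling`: `mask_positions`, the `[MASK]` token, the 3-mer tokens, and the mask rate are all assumptions.

```python
import random

def mask_positions(tokens, mask_rate=0.15, mask_token='[MASK]', seed=0):
    """Hide a random subset of tokens and return the reconstruction targets."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_rate))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_masked))
    corrupted = list(tokens)
    targets = {}
    for i in masked_idx:
        targets[i] = tokens[i]    # What the model should predict
        corrupted[i] = mask_token # What the model actually sees
    return corrupted, targets

tokens = ['AUG', 'GCC', 'UAA', 'GGA', 'CCU', 'UUG', 'ACA', 'GAU', 'CGC', 'AAU']
corrupted, targets = mask_positions(tokens, mask_rate=0.2)
print(corrupted)
print(targets)
```

The loss is computed only at the masked positions, so no class labels are needed. This is why pre-training can use large sets of sequences regardless of whether they are annotated.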

## Fine-tuning
After pre-training, we can extract the base architecture and wrap it inside a
`Classifier` object again. We can then fine-tune our model using the 
`train_classifier` function.

In [None]:
from lncrnapy.train import train_classifier

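finetune_data = Data([f'{data_dir}/sequences/finetune_gencode_pcrna.fasta',
                      f'{data_dir}/sequences/finetune_gencode_ncrna.fasta'])
finetune_data.set_tensor_features('4D-DNA') # Encode like the other datasets

# Reuse the pre-trained base architecture inside a new classifier
model = Classifier(model.base_arch).to(utils.DEVICE)
model, history = train_classifier(model, finetune_data, valid_data, epochs=100)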

print(history)