In [1]:
%load_ext autoreload
%autoreload 2

# Introduction

The goal of this notebook is to create the artificial genes dataset. The notebook doubles as documentation, and its results are fully reproducible (the random seed is set before any random numbers are generated).

The artificial genes dataset is saved to a file named `artificial_genes.csv` in the `data` folder. The file has three columns: `miRNA`, `gene`, and `label`. The `miRNA` column holds the sequence of the miRNA whose positive and negative target sites make up the corresponding artificial gene. The `gene` column holds the sequence of the artificial gene. The `label` column holds a string of `0`/`1` characters, one per position in the gene: a position is labeled 1 if it lies within a positive target site of the miRNA, and 0 otherwise.
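
As a minimal sketch of how a file in this format could be read back, the example below parses a tiny in-memory CSV with the same three columns. The sequences and the variable names (`csv_text`, `mask`, `positive_region`) are made up for illustration and do not come from the real dataset. The `label` column is read explicitly as a string, on the assumption that pandas might otherwise try to interpret a run of digits as a number.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for artificial_genes.csv (hypothetical sequences, same column layout).
csv_text = "miRNA,gene,label\nACGT,AACCGGTT,00001111\n"

# Read `label` as text so the per-position digits (including leading zeros) are kept.
df = pd.read_csv(io.StringIO(csv_text), dtype={"label": str})
row = df.iloc[0]

# One integer per gene position: 1 inside a positive target site, 0 elsewhere.
mask = np.array([int(c) for c in row["label"]], dtype=np.int8)
assert len(mask) == len(row["gene"])

# Extract the nucleotides that fall inside positive target sites.
positive_region = "".join(base for base, m in zip(row["gene"], mask) if m == 1)
```

Keeping the mask as a per-position array makes it straightforward to compare against per-nucleotide model attributions later on.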

# Setup

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import random

from deepexperiment.utils import one_hot_encoding, one_hot_encoding_batch, ResBlock, get_indices
from deepexperiment.interpret import DeepShap
from deepexperiment.alignment import Attrament
from deepexperiment.visualization import plot_alignment, plotbar_miRNA_importance

2022-11-09 11:50:31.937598: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
random.seed(42)
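
The sketch below illustrates what seeding buys: after `random.seed` is called with the same value, Python's `random` module produces the same sequence of draws. The helper `draw` is hypothetical and only serves the illustration. Note that this holds when the cells run top to bottom from a fresh kernel; re-running a sampling cell without re-seeding continues the sequence and yields different values. NumPy and TensorFlow keep their own generators (`np.random.seed`, `tf.random.set_seed`), which matter only if they are used for sampling.

```python
import random


def draw(n, seed):
    """Seed Python's global generator, then draw n integers in [0, 9]."""
    random.seed(seed)
    return [random.randint(0, 9) for _ in range(n)]


first = draw(5, seed=42)
second = draw(5, seed=42)

# Drawing again without re-seeding continues the sequence instead of restarting it.
continued = [random.randint(0, 9) for _ in range(5)]
random.seed(42)
restarted = [random.randint(0, 9) for _ in range(5)]
```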

# Loading model and data

The model is `miRBind` [[1]](https://github.com/ML-Bioinfo-CEITEC/miRBind), trained on the `Helwak et al., 2013` [[2]](https://doi.org/10.1016/j.cell.2013.03.043) dataset. The data used here is the `miRBind` [[1]](https://github.com/ML-Bioinfo-CEITEC/miRBind) test set, which is constructed from the same `Helwak et al., 2013` [[2]](https://doi.org/10.1016/j.cell.2013.03.043) dataset.

In [4]:
model = keras.models.load_model("../models/miRBind.h5")



In [5]:
samples = pd.read_csv('../data/test_set_1_1_CLASH2013_paper.tsv', sep='\t')
pos_samples = samples[samples['label'] == 1].reset_index(drop=True)
neg_samples = samples[samples['label'] == 0].reset_index(drop=True)

# Creating artificial genes

In [21]:
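def stitch_gene(pos_samples, neg_samples, miRNA, pos_count):
    """Concatenate random fragments for `miRNA` until `pos_count` positive ones are included.

    Returns the stitched gene and a mask string of equal length,
    where '1' marks positions that come from a positive fragment.
    """
    pos = pos_samples[pos_samples['miRNA'] == miRNA]
    neg = neg_samples[neg_samples['miRNA'] == miRNA]

    count = 0
    gene = ""
    gene_mask = ""
    while count < pos_count:
        # Each fragment is negative with probability 0.7 and positive otherwise.
        if random.random() < 0.7:
            samples = neg
            mask = "0"
        else:
            samples = pos
            mask = "1"
            count += 1

        # Pick one fragment at random and label every one of its positions.
        index = random.randint(0, len(samples['gene']) - 1)
        fragment = samples['gene'].iloc[index]
        gene += fragment
        gene_mask += mask * len(fragment)

    return gene, gene_mask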
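
To make the sampling behaviour of `stitch_gene` easier to reason about, here is a small self-contained simulation of only its fragment-selection loop (no sequences involved). The function `simulate_fragment_labels` is a hypothetical helper written for this illustration. Because the loop stops as soon as the requested number of positive fragments is reached, the last fragment is always positive, and the number of negative fragments before that point is expected to be about `pos_count * 0.7 / 0.3`.

```python
import random


def simulate_fragment_labels(pos_count, neg_prob=0.7, rng=None):
    """Mimic the loop in stitch_gene: draw fragment labels until pos_count positives."""
    if rng is None:
        rng = random.Random()
    labels = []
    count = 0
    while count < pos_count:
        if rng.random() < neg_prob:
            labels.append("0")
        else:
            labels.append("1")
            count += 1
    return labels


rng = random.Random(0)  # local generator so the global seed is left untouched
runs = [simulate_fragment_labels(3, rng=rng) for _ in range(2000)]

# Average number of negative fragments per artificial gene with 3 positives.
mean_negatives = sum(r.count("0") for r in runs) / len(runs)
expected_negatives = 3 * 0.7 / 0.3
```

With these settings an artificial gene contains roughly seven negative fragments for every three positive ones on average, although individual genes vary considerably.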

In [25]:
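dataset = []
# miRNAs that have at least one negative sample; computed once instead of per iteration.
neg_miRNAs = set(neg_samples['miRNA'])
for miRNA in pos_samples['miRNA'].value_counts().index:
    if miRNA in neg_miRNAs:
        # Build one artificial gene per requested number of positive target sites.
        for pos_count in [2, 3, 4, 5, 6]:
            gene, gene_mask = stitch_gene(pos_samples, neg_samples, miRNA, pos_count)
            dataset.append([miRNA, gene, gene_mask])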

In [26]:
df = pd.DataFrame(dataset, columns=['miRNA', 'gene', 'label'])
df

Unnamed: 0,miRNA,gene,label
0,TCCGAGCCTGGGTCTCCCTC,GTAAAGTGACTGAGCTGGAAGACAAGTTTGATTTACTAGTTGATGC...,0000000000000000000000000000000000000000000000...
1,TCCGAGCCTGGGTCTCCCTC,CATCGACAGCACACCGTACCGACAGTGGTACGAGTCCCACTATGCG...,0000000000000000000000000000000000000000000000...
2,TCCGAGCCTGGGTCTCCCTC,CAGGAGAGCACCCCTCCACCCCATTTGCTCGCAGTATCCTAGAATC...,0000000000000000000000000000000000000000000000...
3,TCCGAGCCTGGGTCTCCCTC,CAGGAGAGCACCCCTCCACCCCATTTGCTCGCAGTATCCTAGAATC...,0000000000000000000000000000000000000000000000...
4,TCCGAGCCTGGGTCTCCCTC,AGGGGACCCAAGTAACAGGGAGGAAAGCAGATGTTATTAAGGCAGC...,1111111111111111111111111111111111111111111111...
...,...,...,...
1040,CCAATATTACTGTGCTGCTT,TTGAAGAGTTGGAATTCTCGGCATTTAAATGATGCCTGAAGTTTGT...,0000000000000000000000000000000000000000000000...
1041,CCAATATTACTGTGCTGCTT,GTCGTCATGGGAGACCCTGTGCTCCTCCGCTCTGTGAGCTCGGACA...,0000000000000000000000000000000000000000000000...
1042,CCAATATTACTGTGCTGCTT,GTCGTCATGGGAGACCCTGTGCTCCTCCGCTCTGTGAGCTCGGACA...,0000000000000000000000000000000000000000000000...
1043,CCAATATTACTGTGCTGCTT,GGAGATCCTGGTGGGCGATGTGGGCCAGACTGTCGACGACCCCTAC...,0000000000000000000000000000000000000000000000...


In [None]:
df.to_csv('../data/artificial_genes.csv', index=False)