# Task 1: Develop a machine learning method to identify RNA modifications from direct RNA-Seq data

Write a computational method that predicts m6A RNA modification from direct RNA-Seq data. The method should be able to train a new model and make predictions on unseen test data. Specifically, your method should fulfil the following requirements:

Your method should contain two scripts, one for model training, and one for making predictions. The prediction script will be evaluated by other students.
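
One possible way to lay out the two scripts is to give both the same style of command-line interface. The sketch below is only an illustration: the function `build_parser`, every flag name, and the file names `model.pkl` and `preds.csv` are hypothetical and are not specified by the assignment.

```python
import argparse


def build_parser(mode):
    """Build a command-line parser for the 'train' or the 'predict' script.

    All flag names here are hypothetical; the assignment does not define them.
    """
    parser = argparse.ArgumentParser(description=f"m6A {mode} script")
    parser.add_argument("--data", required=True,
                        help="gzipped JSON file of direct RNA-Seq reads")
    parser.add_argument("--model", required=True,
                        help="path to save (train) or load (predict) the model")
    if mode == "train":
        # Labels are only needed when fitting a model.
        parser.add_argument("--labels", required=True,
                            help="CSV file of m6A labels")
    else:
        # Predictions are only written by the prediction script.
        parser.add_argument("--output", required=True,
                            help="CSV file to write predictions to")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
train_args = build_parser("train").parse_args([
    "--data", "data/dataset0.json.gz",
    "--labels", "data/data.info.labelled",
    "--model", "model.pkl",
])
predict_args = build_parser("predict").parse_args([
    "--data", "data/dataset0.json.gz",
    "--model", "model.pkl",
    "--output", "preds.csv",
])
print(train_args.labels, predict_args.output)
```

Keeping the prediction script free of any label argument makes it harder to leak labels into evaluation by accident.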

In [3]:
import json
import os
import sys
import gzip

import pandas as pd

#### 1. Reading the files

In [4]:
os.listdir("data/")

['Student_evaluation_guideline.html',
 'dataset0.json.gz',
 'data.info.labelled',
 'handout_TeamProject_RNAModifications.html']

In [5]:
M6A_FILE_PATH = "data/data.info.labelled"
DIRECT_RNA_SEQ_DATA_FILE_PATH = "data/dataset0.json.gz"

In [6]:
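# Read m6A labels
def read_m6A_labels(m6a_file_path):
    """Load the labelled m6A sites into a DataFrame."""
    m6a_df = pd.read_csv(m6a_file_path, sep=",")
    # Rename the columns by position; this assumes the file has exactly these four columns in this order.
    m6a_df.columns = ["gene_id", "transcript_id", "transcript_position", "label"]
    return m6a_df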

In [7]:
# df = read_m6A_labels(M6A_FILE_PATH)
# df.head()
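
Because `read_m6A_labels` assigns column names by position, it can help to confirm the file layout before relying on it. The following is a minimal sketch on made-up rows (the gene and transcript IDs are placeholders, not values from `data.info.labelled`), and it assumes the label file carries a header row with these four names.

```python
import io

import pandas as pd

# Made-up rows in the four-column layout that read_m6A_labels expects.
example_csv = io.StringIO(
    "gene_id,transcript_id,transcript_position,label\n"
    "GENE_A,TX_1,244,0\n"
    "GENE_A,TX_1,261,1\n"
    "GENE_B,TX_2,1030,0\n"
)

labels = pd.read_csv(example_csv, sep=",")

# Fail early if the header does not match the expected layout.
expected_columns = ["gene_id", "transcript_id", "transcript_position", "label"]
assert list(labels.columns) == expected_columns, "unexpected label file layout"

# Count how many sites fall into each class to see the class balance.
label_counts = labels["label"].value_counts().to_dict()
print(label_counts)
```

The same two checks can be run on the real DataFrame returned by `read_m6A_labels(M6A_FILE_PATH)`.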

In [8]:
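def read_direct_rna_seq_data(data_path):
    """Flatten the gzipped JSON-lines file into one DataFrame row per read."""
    data = []
    with gzip.open(data_path, 'rt') as f:
        for line in f:
            # Each line is a JSON object: transcript -> position -> sequence -> list of reads.
            line_data = json.loads(line)
            for transcript_id, position_data in line_data.items():
                for transcript_position, combined_nucleotides_data in position_data.items():
                    # Note: read_idx counts the sequence keys at this position, not the reads,
                    # so it is 0 whenever a position has a single sequence key. The sequence
                    # itself (nucleotide) is not stored.
                    for read_idx, (nucleotide, reads) in enumerate(combined_nucleotides_data.items()):
                        for read in reads:
                            data.append({
                                'transcript_id': transcript_id,
                                'position': int(transcript_position),
                                'read_id': read_idx,
                                'read': read
                            })

    df = pd.DataFrame(data)
    return df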
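
In the reader above, `read_id` indexes the sequence key rather than the individual read, and the sequence key is dropped. If a per-read index or the sequence is needed later, a variant along these lines could be used. This is a sketch, not the notebook's method: the function name `read_reads_with_kmer`, the `sequence` column, and the example record are all made up for illustration, and the nested layout is assumed to be the one the reader above iterates over.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def read_reads_with_kmer(data_path):
    """Variant reader: keeps the sequence key and numbers reads within each position."""
    rows = []
    with gzip.open(data_path, "rt") as f:
        for line in f:
            for transcript_id, position_data in json.loads(line).items():
                for position, sequence_data in position_data.items():
                    for sequence, reads in sequence_data.items():
                        # Enumerate the reads themselves so read_id is unique per read.
                        for read_idx, read in enumerate(reads):
                            rows.append({
                                "transcript_id": transcript_id,
                                "position": int(position),
                                "sequence": sequence,
                                "read_id": read_idx,
                                "read": read,
                            })
    return pd.DataFrame(rows)


# Made-up record in the nested layout assumed above (values are placeholders).
record = {"TX_1": {"244": {"AAGACCA": [[0.1, 2.0, 125.0], [0.2, 3.0, 120.0]]}}}
tmp_path = os.path.join(tempfile.mkdtemp(), "example.json.gz")
with gzip.open(tmp_path, "wt") as f:
    f.write(json.dumps(record) + "\n")

example_df = read_reads_with_kmer(tmp_path)
print(example_df[["transcript_id", "position", "sequence", "read_id"]])
```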


In [9]:
rna_seq_data = read_direct_rna_seq_data(DIRECT_RNA_SEQ_DATA_FILE_PATH)

In [10]:
# rna_seq_data.head()
rna_seq_data.head(100)

Unnamed: 0,transcript_id,position,read_id,read
0,ENST00000000233,244,0,"[0.00299, 2.06, 125.0, 0.0177, 10.4, 122.0, 0...."
1,ENST00000000233,244,0,"[0.00631, 2.53, 125.0, 0.00844, 4.67, 126.0, 0..."
2,ENST00000000233,244,0,"[0.00465, 3.92, 109.0, 0.0136, 12.0, 124.0, 0...."
3,ENST00000000233,244,0,"[0.00398, 2.06, 125.0, 0.0083, 5.01, 130.0, 0...."
4,ENST00000000233,244,0,"[0.00664, 2.92, 120.0, 0.00266, 3.94, 129.0, 0..."
...,...,...,...,...
95,ENST00000000233,244,0,"[0.0059, 5.57, 126.0, 0.012, 11.2, 127.0, 0.00..."
96,ENST00000000233,244,0,"[0.012, 3.73, 124.0, 0.0252, 14.4, 123.0, 0.00..."
97,ENST00000000233,244,0,"[0.0135, 4.09, 126.0, 0.0054, 5.71, 127.0, 0.0..."
98,ENST00000000233,244,0,"[0.0083, 4.17, 121.0, 0.00973, 5.68, 124.0, 0...."
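
The `read` column holds a list of numbers per row, which most models cannot consume directly. One way to prepare it, shown here only as a sketch on made-up values, is to expand each list into numeric columns and then average over all reads at the same transcript position. The column prefix `feature_` and the choice of the mean as the summary are assumptions for illustration, not steps taken from this notebook.

```python
import pandas as pd

# Made-up frame in the same shape as rna_seq_data (three values per read for brevity).
example = pd.DataFrame({
    "transcript_id": ["TX_1", "TX_1", "TX_1"],
    "position": [244, 244, 261],
    "read_id": [0, 0, 0],
    "read": [[0.5, 2.0, 125.0], [1.5, 4.0, 123.0], [1.0, 3.0, 110.0]],
})

# One numeric column per entry of the read list: feature_0, feature_1, ...
features = pd.DataFrame(example["read"].tolist(), index=example.index).add_prefix("feature_")
wide = pd.concat([example[["transcript_id", "position"]], features], axis=1)

# Mean of each feature over all reads at one transcript position.
per_position = wide.groupby(["transcript_id", "position"], as_index=False).mean()
print(per_position)
```

A table with one row per transcript position has the same granularity as the label file, which has one label per `transcript_id` and `transcript_position`.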


#### 2. Train Test Split