## Description

This CCLE dataset was generated with standard bulk RNA-seq. The counts were then normalized with RPKM (Reads Per Kilobase of transcript per Million mapped reads) so that expression values are comparable across samples and across genes.

RPKM normalization corrects for two technical biases introduced during sequencing:

1. **Sequencing depth**: samples are not all sequenced to the same depth, so some yield more total reads than others. This difference is technical rather than biological, so it must be corrected.  
2. **Gene length**: at the same expression level, a longer gene produces more fragments, and therefore more reads, than a shorter gene. Dividing by gene length removes this bias so that counts for long and short genes can be compared.

The normalized RPKM value for gene \(i\) in sample \(j\) is:

\[
\mathrm{RPKM}_{i,j} \;=\; \frac{x_{i,j}}{l_i \,\cdot\, \sum_{k} x_{k,j}} \;\times\; 10^6
\]

where \(x_{i,j}\) is the raw read count, \(l_i\) is the length of gene \(i\) in kilobases (kb), and \(\sum_{k} x_{k,j}\) is the total number of reads in sample \(j\), summed over all genes \(k\).
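
As a rough illustration of the normalization described above, the sketch below computes RPKM for a tiny made-up count matrix. The counts, the gene lengths, and the helper name `rpkm` are assumptions for the example and are not taken from the dataset. The second sample is the first one sequenced twice as deep, so both columns should come out identical after normalization.

```python
import numpy as np

# Toy raw counts: rows are genes, columns are samples (values are made up).
# Sample 2 is sample 1 sequenced at twice the depth.
counts = np.array([
    [100.0, 200.0],
    [300.0, 600.0],
    [600.0, 1200.0],
])

# Hypothetical gene lengths in kilobases, one per row of `counts`
lengths_kb = np.array([1.0, 2.0, 4.0])

def rpkm(counts, lengths_kb):
    """Divide each count by gene length (kb) and by the sample's total reads in millions."""
    per_million = counts.sum(axis=0) / 1e6        # total reads per sample, in millions
    return counts / lengths_kb[:, None] / per_million[None, :]

values = rpkm(counts, lengths_kb)
print(values)
```

Genes 2 and 3 also end up with the same value within each sample: gene 3 has twice the reads of gene 2 but is twice as long.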


In [1]:
import pandas as pd
import numpy as np

In [2]:
# ---------mRNA---------
# Description:
# Expression Data with gene as index and cells/samples as columns
data_mrna_seq_rpkm = pd.read_csv('tcga_data/data_mrna_seq_fpkm.txt',
                    sep = '\t',
                    comment = '#')

data_mrna_seq_rpkm.set_index('Hugo_Symbol',inplace=True)

# Collapse duplicated gene symbols by averaging their rows
data_mrna_seq_rpkm = data_mrna_seq_rpkm.groupby(data_mrna_seq_rpkm.index).mean()

In [3]:
data_mrna_seq_rpkm.head()

Hugo_Symbol,SP89389,SP21193,SP13206,SP103623,SP32742,SP111095,SP8394,SP87446,SP36586,SP123902,...,SP15656,SP123888,SP59420,SP116679,SP1377,SP16269,SP122676,SP88776,SP64546,SP21057
CRIPTOP1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1BG,0.409799,0.022891,0.082842,0.075852,0.149032,0.023175,0.119738,0.112135,0.059707,0.021957,...,0.587783,0.029553,0.229903,0.315114,0.034723,0.656322,0.476527,0.07975,0.091196,0.01303
A1BG-AS1,1.567399,0.196993,0.376265,0.271986,1.035252,0.132957,1.357942,0.753626,0.102765,0.110224,...,3.424922,0.378225,1.741374,1.946055,0.946255,3.781337,1.074688,0.554893,0.433442,0.074755
A1CF,0.043229,0.885682,0.0,0.715912,0.010971,2.296044,0.027879,0.0,1.656342,0.016029,...,0.049806,0.004742,0.008621,0.013397,0.0,0.012708,0.0,0.000708,0.001492,3.038471
A2M,30.816782,57.959083,37.798822,224.339366,23.396122,14.215019,53.002411,47.629206,51.087998,39.520998,...,0.023052,33.442775,53.716981,80.680141,23.4914,0.048657,121.20109,41.204146,33.427796,28.183744


In [4]:
#--------Mut query----------
# Description:
# TP53 status of each sample (mutated or wild type, WT)
mutations = pd.read_csv('tcga_data/mutations.txt',
                    sep = '\t',
                    comment = '#')
mutations.set_index('SAMPLE_ID',inplace=True)

In [5]:
mutations.head()

SAMPLE_ID,STUDY_ID,TP53
SP107436,pancan_pcawg_2020,WT
SP107435,pancan_pcawg_2020,WT
SP107407,pancan_pcawg_2020,WT
SP107406,pancan_pcawg_2020,WT
SP107405,pancan_pcawg_2020,WT


## Why predict on just the five Variant_Type classes

We focus on SNP, DNP, ONP, INS, and DEL (ONP does not occur among the TP53 mutations in this dataset) because:

1. Biological clarity: these labels describe the fundamental mutation mechanism (single-base or multi-base substitutions, insertions, and deletions), so the model has well-defined classes to learn.  
2. Class sizes: Variant_Type has few classes, so each one has more examples than the detailed Variant_Classification labels, which are split more finely and would leave some classes too small to learn. The classes are still imbalanced, with SNP by far the most frequent.  
3. Reduced complexity: Variant_Classification depends on gene structure and reading frame (for example, a SNP can be silent or missense depending on the codon), which our model cannot infer from expression values without extra annotation.  
4. Modular workflow: once the model tags a variant as INS or DEL, we can apply separate rules or a second model to predict functional impact, keeping each step simpler and more reliable.  


In [6]:
# ------Mut ALL-----------
# Mutation records for all genes; only mutated samples appear in this file
data_mutations = pd.read_csv('tcga_data/data_mutations.txt',
                    sep = '\t',
                    comment = '#')

# Keep only the TP53 mutation records
data_mutations = data_mutations[data_mutations['Hugo_Symbol'] == 'TP53']

In [7]:
unique_labels = data_mutations['Variant_Type'].unique()
print(unique_labels)

['SNP' 'DEL' 'INS' 'DNP']


In [8]:
unique_labels = data_mutations['Variant_Classification'].unique()
print(unique_labels)

['Nonsense_Mutation' 'Missense_Mutation' 'In_Frame_Del' 'Frame_Shift_Del'
 'Splice_Site' 'Frame_Shift_Ins' 'Splice_Region' 'In_Frame_Ins' 'Silent']


In [9]:
# Keep only the sample barcode and the mutation type
data_mutations = data_mutations[['Tumor_Sample_Barcode', 'Variant_Type']]
data_mutations.set_index('Tumor_Sample_Barcode', inplace=True)
# A sample can appear more than once. Count the distinct Variant_Type values
# per sample: a sample with more than one has no single label and should be removed.
# Note: variant_check is only computed here; no rows are filtered in this cell.
variant_check = data_mutations.groupby(data_mutations.index)["Variant_Type"].nunique()
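
The per-sample count stored in `variant_check` is never used to drop anything in this notebook. One possible way to turn it into a filter is sketched below on a small made-up table; the barcodes `S1` to `S3` and the names `toy`, `n_types`, `ambiguous`, and `clean` are hypothetical. Keeping the first row of a repeated sample is also an assumption, not the author's stated rule.

```python
import pandas as pd

# Toy stand-in for data_mutations (barcodes and labels are made up)
toy = pd.DataFrame(
    {"Variant_Type": ["SNP", "SNP", "DEL", "SNP", "INS"]},
    index=pd.Index(["S1", "S1", "S2", "S3", "S3"], name="Tumor_Sample_Barcode"),
)

# Number of distinct Variant_Type values per sample
n_types = toy.groupby(level=0)["Variant_Type"].nunique()

# Samples carrying more than one Variant_Type have no single label
ambiguous = n_types[n_types > 1].index

# Drop ambiguous samples, then keep one row per remaining sample
clean = toy[~toy.index.isin(ambiguous)]
clean = clean[~clean.index.duplicated(keep="first")]
print(clean)
```

Applying such a filter to the real table would change the number of rows used downstream, so the counts and shapes reported in the later cells would no longer match.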

In [10]:
data_mutations.head()

Tumor_Sample_Barcode,Variant_Type
SP101724,SNP
SP22031,SNP
SP59388,SNP
SP94588,DEL
SP7692,SNP


In [11]:
# Count how many mutations are in each Variant_Type
counts = data_mutations['Variant_Type'].value_counts()

# Print them one per line
for variant_type, n in counts.items():
    print(f"{variant_type}: {n}")


SNP: 772
DEL: 115
INS: 43
DNP: 11


In [12]:
data_t = data_mrna_seq_rpkm.T

data_t.head()

Hugo_Symbol,CRIPTOP1,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2ML1-AS1,A2ML1-AS2,A2MP1,...,snoZ178,snoZ185,snoZ247,snoZ278,snoZ40,snoZ5,snoZ6,snosnR60_Z15,snosnR66,yR211F11.2
SP89389,0.0,0.409799,1.567399,0.043229,30.816782,0.293354,0.312641,0.0,0.146655,0.024311,...,0.0,0.0,0.0,0.0,0.0,12.87572,0.085771,0.0,0.0,0.0
SP21193,0.0,0.022891,0.196993,0.885682,57.959083,0.125789,0.006278,0.0,0.059391,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SP13206,0.0,0.082842,0.376265,0.0,37.798822,0.056904,1.130378,0.0,0.035823,0.026723,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SP103623,0.0,0.075852,0.271986,0.715912,224.339366,1.493611,0.114422,0.0,0.0,0.179434,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SP32742,0.0,0.149032,1.035252,0.010971,23.396122,0.235518,29.877306,0.051798,0.030327,0.045247,...,0.0,0.0,0.0,0.0,0.0,0.0,0.156084,0.0,0.0,0.020183


In [13]:
code = {"SNP": 1, "DNP": 2, "DEL": 3, "INS": 4}

In [14]:
# Build target vector y
y = []

# prepare lists to collect rows and their sample name
X_rows = []
sample_names = []
c = 0

# iterate over each mutation record
for bc, mut in data_mutations.iterrows():
    # check if this barcode is in data_t’s index
    if bc in data_t.index:
        # grab the full row from data_t and store it
        X_rows.append(data_t.loc[bc].values)
        y.append(code[mut['Variant_Type']])
        sample_names.append(bc)
    else:
        c += 1

print(f"Number of samples discarded: {c}")

# build a new DataFrame X from the collected rows
X = pd.DataFrame(
    X_rows,
    index=sample_names,
    columns=data_t.columns
)

y = np.array(y)

Number of samples discarded: 510
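
The row-by-row loop above can also be written with index operations. The sketch below shows the idea on made-up stand-ins for `data_t` and `data_mutations`; the names `expr`, `muts`, `matched`, `X_toy`, and `y_toy` and all values are hypothetical. Like the loop, it keeps one feature row per matching mutation record, so a sample listed twice would contribute two rows.

```python
import numpy as np
import pandas as pd

code = {"SNP": 1, "DNP": 2, "DEL": 3, "INS": 4}

# Toy stand-ins for data_t and data_mutations (all values are made up)
expr = pd.DataFrame(
    [[0.1, 2.0], [0.0, 5.5], [1.2, 0.3]],
    index=["S1", "S2", "S3"],
    columns=["GENE_A", "GENE_B"],
)
muts = pd.DataFrame(
    {"Variant_Type": ["SNP", "DEL", "INS"]},
    index=["S1", "S3", "S9"],   # S9 has no expression profile
)

# Keep only mutation records whose sample has expression data
matched = muts[muts.index.isin(expr.index)]
n_discarded = len(muts) - len(matched)

# One feature row per matched record, in the same order as the labels
X_toy = expr.loc[matched.index]
y_toy = matched["Variant_Type"].map(code).to_numpy()

print(f"Number of records discarded: {n_discarded}")
```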


In [15]:
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

X shape: (431, 55851)
y shape: (431,)


**Step 2: Train - Test split (80% - 20%)**

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
# 80% train, 20% test, stratify to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Train shape:\n\tX_train: {X_train.shape}\n\ty_train: {y_train.shape}")
print(f"Test shape:\n\tX_test: {X_test.shape}\n\ty_test: {y_test.shape}")

Train shape:
	X_train: (344, 55851)
	y_train: (344,)
Test shape:
	X_test: (87, 55851)
	y_test: (87,)
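
To see what `stratify` does, one can compare the class fractions in the two splits. The sketch below uses made-up labels with an 80/20 class mix and a hypothetical helper `class_fractions`; it is not run on the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with an imbalanced 80/20 class mix (made up for illustration)
y_toy = np.array([1] * 80 + [3] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

def class_fractions(labels):
    """Return {class: fraction of samples} for a label array."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

print("train:", class_fractions(y_tr))
print("test: ", class_fractions(y_te))
```

The same helper could be called on `y_train` and `y_test` to confirm that rare classes such as DNP are represented in both splits.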


**Step 3: Model selection and Training**

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

In [19]:
# Instantiate the model
rf = RandomForestClassifier(
    n_estimators=500,      # you can tune this
    max_depth=None,        # full depth; you can limit for speed/regularization
    random_state=42,
    n_jobs=-1              # use all cores
)

In [20]:
# Train
rf.fit(X_train, y_train)

In [21]:
# Predict
y_pred = rf.predict(X_test)

# Accuracy: fraction of test samples whose predicted label matches the true label
accuracy = np.mean(y_test == y_pred)
print(f'Accuracy: {accuracy*100:.2f}%')

Accuracy: 82.76%
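
Accuracy alone can be misleading when one class dominates. In the counts printed earlier, SNP accounts for 772 of the 941 TP53 mutation records (about 82%), so a model that always predicts SNP would score close to the figure above on data with that class mix. The class mix of the 431 matched samples is not shown, so this is only an indication. The sketch below uses made-up labels and scikit-learn's `DummyClassifier` to show how a majority-class baseline, balanced accuracy, and a confusion matrix expose this; the variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Toy labels using the notebook's encoding (1 = SNP, 3 = DEL); values are made up
y_true_toy = np.array([1] * 8 + [3] * 2)
X_dummy = np.zeros((len(y_true_toy), 1))   # the baseline ignores the features

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true_toy)
y_base = baseline.predict(X_dummy)

acc = accuracy_score(y_true_toy, y_base)               # share of correct predictions
bal_acc = balanced_accuracy_score(y_true_toy, y_base)  # mean of per-class recall
cm = confusion_matrix(y_true_toy, y_base, labels=[1, 3])

print(f"Baseline accuracy: {acc:.2f}")
print(f"Baseline balanced accuracy: {bal_acc:.2f}")
print(cm)
```

On the real split, the same metrics and the already imported `classification_report(y_test, y_pred)` would show whether the random forest predicts anything other than the majority class.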
